From a7d700d06e9a1ab9d23c2551883d570c0c5f0f19 Mon Sep 17 00:00:00 2001
From: Jared Vititoe <jjvititoe1@gmail.com>
Date: Tue, 10 Mar 2026 12:30:03 -0400
Subject: [PATCH] =?UTF-8?q?docs:=20update=20README=20for=20Phase=206=20?=
 =?UTF-8?q?=E2=80=94=20monitoring,=20observability,=20alert=20rules?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Add Prometheus and Grafana to infrastructure table
- Update port map: Hookshot metrics on 9004, node_exporter on 9100, LiveKit metrics on 6789
- Add PostgreSQL LXC port map
- Update monitoring checklist — all Prometheus/Grafana items now complete
- Mark Hookshot metrics audit item as resolved
- Add outdated Storj node item to admin checklist
- Add full Monitoring & Observability section:
  - Prometheus scrape jobs table (synapse, matrix-admin, livekit, hookshot, matrix-node, postgres, postgres-node, postgres-exporter-2)
  - Grafana dashboard section listing all 21 panel groups
  - Alert rules tables (Matrix + Infrastructure folders, Prometheus rules)
  - /sync long-poll false positive note
  - Known alert watch items

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 README.md | 124 +++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 118 insertions(+), 6 deletions(-)
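
Reviewer note (below the `---` separator, so not part of the commit message): the scrape-jobs table this patch adds to the README could correspond to a `prometheus.yml` fragment along these lines. This is a minimal sketch, not the file deployed on LXC 118. Job names and targets are copied from the table, the `/_synapse/metrics` path is the one Synapse documents for its metrics listener, and every other job is assumed to use Prometheus's default `/metrics` path. Scrape intervals, labels, and relabeling are not described in the README and are left out.

```yaml
# Hypothetical prometheus.yml fragment. Job names and targets come from the
# "Prometheus Scrape Jobs" table in this patch; everything else is assumed.
scrape_configs:
  - job_name: synapse
    metrics_path: /_synapse/metrics   # Synapse's documented metrics path
    static_configs:
      - targets: ["10.10.10.29:9000"]

  - job_name: matrix-admin            # DAU/MAU and totals exporter
    static_configs:
      - targets: ["10.10.10.29:9101"]

  - job_name: livekit                 # matches `prometheus_port: 6789`
    static_configs:
      - targets: ["10.10.10.29:6789"]

  - job_name: hookshot                # dedicated metrics port
    static_configs:
      - targets: ["10.10.10.29:9004"]

  - job_name: matrix-node             # node_exporter on the Matrix LXC
    static_configs:
      - targets: ["10.10.10.29:9100"]

  - job_name: postgres                # postgres_exporter
    static_configs:
      - targets: ["10.10.10.44:9187"]

  - job_name: postgres-node           # node_exporter on the PostgreSQL LXC
    static_configs:
      - targets: ["10.10.10.44:9100"]

  - job_name: postgres-exporter-2     # secondary postgres exporter
    static_configs:
      - targets: ["10.10.10.160:9711"]
```

Keeping one job per service means the `up{job="..."}` expressions in the alert tables map directly onto these job names.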

diff --git a/README.md b/README.md
index a15ee06..fd2bf3d 100644
--- a/README.md
+++ b/README.md
@@ -4,7 +4,7 @@ Matrix bot and server infrastructure for the Lotus Guild homeserver (`matrix.lot
 
 **Repo**: https://code.lotusguild.org/LotusGuild/matrixBot
 
-## Status: Phase 5 — Optimization, Voice Quality & Custom Client
+## Status: Phase 6 — Monitoring, Observability & Hardening
 
 ---
 
@@ -36,6 +36,8 @@ Matrix bot and server infrastructure for the Lotus Guild homeserver (`matrix.lot
 | Authelia | 10.10.10.36 | 167 | — | — | — | SSO/OIDC provider |
 | LLDAP | 10.10.10.39 | 147 | — | — | — | LDAP user directory |
 | Uptime Kuma | 10.10.10.25 | 101 | — | — | — | Uptime monitoring (micro1 node) |
+| Prometheus | 10.10.10.48 | 118 | — | — | — | Prometheus — scrapes all Matrix services |
+| Grafana | 10.10.10.49 | 107 | — | — | — | Grafana 12.4.0 — dashboard.lotusguild.org |
 
 > **Note:** PostgreSQL container IP is `10.10.10.44`, not `.2` — update any stale references.
 
@@ -79,10 +81,14 @@ Matrix bot and server infrastructure for the Lotus Guild homeserver (`matrix.lot
 | Port | Service | Bind |
 |------|---------|------|
 | 8008 | Synapse HTTP | 0.0.0.0 + ::1 |
-| 9000 | Synapse metrics (Prometheus) | 0.0.0.0 |
-| 9001 | Hookshot widgets + metrics | 127.0.0.1 |
-| 9002 | Hookshot bridge | 127.0.0.1 |
+| 9000 | Synapse metrics (Prometheus) | 127.0.0.1 + 10.10.10.29 |
+| 9001 | Hookshot widgets | 0.0.0.0 |
+| 9002 | Hookshot bridge (appservice) | 127.0.0.1 |
 | 9003 | Hookshot webhooks | 0.0.0.0 |
+| 9004 | Hookshot metrics (Prometheus) | 0.0.0.0 |
+| 9100 | node_exporter (Prometheus) | 0.0.0.0 |
+| 9101 | matrix-admin exporter | 0.0.0.0 |
+| 6789 | LiveKit metrics (Prometheus) | 0.0.0.0 |
 | 7880 | LiveKit HTTP | 0.0.0.0 |
 | 7881 | LiveKit RTC TCP | 0.0.0.0 |
 | 8070 | lk-jwt-service | 0.0.0.0 |
@@ -90,6 +96,13 @@ Matrix bot and server infrastructure for the Lotus Guild homeserver (`matrix.lot
 | 3478 | coturn STUN/TURN | 0.0.0.0 |
 | 5349 | coturn TURNS/TLS | 0.0.0.0 |
 
+**Internal port map (LXC 109 — PostgreSQL):**
+| Port | Service | Bind |
+|------|---------|------|
+| 5432 | PostgreSQL | 0.0.0.0 (hba-restricted to 10.10.10.29) |
+| 9100 | node_exporter (Prometheus) | 0.0.0.0 |
+| 9187 | postgres_exporter | 0.0.0.0 |
+
 ---
 
 ## Rooms (all v12)
@@ -462,13 +475,20 @@ cp -r dist/* /var/www/html/
 - [x] Synapse metrics endpoint (port 9000, Prometheus-compatible)
 - [x] Uptime Kuma monitors added: Synapse HTTP, LiveKit TCP, PostgreSQL TCP, Cinny Web, coturn TCP 3478, lk-jwt-service, Hookshot
 - [ ] Uptime Kuma: coturn UDP STUN monitoring (requires push/heartbeat — no native UDP type in Kuma)
-- [ ] Grafana dashboard for Synapse Prometheus metrics — Grafana at 10.10.10.49 (LXC 107), Prometheus scraping 10.10.10.29:9000 confirmed. Import dashboard ID `18618` from grafana.com
+- [x] Grafana dashboard — custom Synapse dashboard at `dashboard.lotusguild.org/d/matrix-synapse-dashboard/matrix-synapse` (140+ panels, see Monitoring section below)
+- [x] Prometheus scraping all Matrix services: Synapse, Hookshot, LiveKit, matrix-node, postgres-node, matrix-admin, postgres, postgres-exporter-2
+- [x] node_exporter installed on LXC 151 (Matrix) and LXC 109 (PostgreSQL)
+- [x] LiveKit Prometheus metrics enabled (`prometheus_port: 6789`)
+- [x] Hookshot metrics enabled (`metrics: { enabled: true }`) on dedicated port 9004
+- [x] Grafana alert rules — 14 Matrix/infra alerts active (see Alert Rules section below)
+- [x] Duplicate Grafana "Infrastructure" folder merged and deleted
 
 ### Admin
 - [x] Synapse admin API dashboard (synapse-admin at http://10.10.10.29:8080)
 - [x] Power levels per room
 - [ ] Draupnir moderation bot (new LXC or alongside existing bot)
 - [ ] Cinny custom branding (Lotus Guild theme — colors, title, favicon, PWA name)
+- [ ] **Storj node update** — `storj_uptodate=0` on LXC 138 (10.10.10.133), risk of disqualification
 
 ---
 
@@ -487,7 +507,7 @@ Comprehensive audit of the current infrastructure against official documentation
 | No fail2ban on `/_matrix/client/.*/login` | MEDIUM | ✅ Fixed — fail2ban installed, matrix-synapse jail active (5 retries / 24h ban) |
 | No media purge cron (retention policy set but never triggers) | MEDIUM | ✅ N/A — `media_retention` block already in homeserver.yaml; Synapse runs the purge internally on schedule |
 | PostgreSQL autovacuum not tuned per-table | LOW | ✅ Fixed — all 5 high-churn tables tuned, `autovacuum_max_workers` → 5 |
-| Hookshot metrics scrape unconfirmed | LOW | ⚠️ Port 9001 responds but `/metrics` returns 404 — hookshot bug or path mismatch; low impact |
+| Hookshot metrics scrape unconfirmed | LOW | ✅ Fixed — `metrics: { enabled: true }` added to config, metrics split to dedicated port 9004, Prometheus scraping confirmed |
 | LiveKit VP9/AV1 codec support | LOW | ✅ Applied — `video_codecs: [VP8, H264, VP9, AV1]` added to livekit config |
 | Federation allow/deny list not configured | LOW | Pending — Mjolnir/Draupnir on roadmap |
 | Sygnal push notifications not deployed | INFO | Deferred |
@@ -793,6 +813,98 @@ Add `federation_sender` next (off-loads outgoing federation from main process).
 
 ---
 
+---
+
+## Monitoring & Observability (March 2026)
+
+### Prometheus Scrape Jobs
+
+All Matrix-related services are scraped by Prometheus at `10.10.10.48` (LXC 118):
+
+| Job | Target | Metrics |
+|-----|--------|---------|
+| `synapse` | `10.10.10.29:9000` | Full Synapse internals (events, federation, caches, DB, HTTP) |
+| `matrix-admin` | `10.10.10.29:9101` | DAU, MAU, room/user/media totals |
+| `livekit` | `10.10.10.29:6789` | Rooms, participants, packets, forward latency, quality |
+| `hookshot` | `10.10.10.29:9004` | Connections by service, API calls/failures, Node.js runtime |
+| `matrix-node` | `10.10.10.29:9100` | CPU, RAM, network, disk space, load avg (Matrix LXC host) |
+| `postgres` | `10.10.10.44:9187` | pg_stat_database, connections, WAL, block I/O |
+| `postgres-node` | `10.10.10.44:9100` | CPU, RAM, network, disk space, load avg (PostgreSQL LXC host) |
+| `postgres-exporter-2` | `10.10.10.160:9711` | Secondary postgres exporter |
+
+> **Disk I/O note:** All servers use Ceph-backed storage, so per-device disk I/O metrics are not meaningful. Use the Network I/O panels to see actual storage traffic.
+
+### Grafana Dashboard
+
+**URL:** `https://dashboard.lotusguild.org/d/matrix-synapse-dashboard/matrix-synapse`
+
+140+ panels across 21 sections:
+
+| Section | Key panels |
+|---------|-----------|
+| Synapse Overview | Up status, users, rooms, DAU/MAU, media, federation peers |
+| Synapse Process Health | CPU, memory, FDs, thread pool, GC, Twisted reactor |
+| HTTP API Requests | Rate, response codes, p99/p50 latency, in-flight, DB txn time |
+| Federation | Outgoing/incoming PDUs, queue depth, staging, known servers |
+| Events & Rooms | Event persistence, notifier, sync responses |
+| Presence & Push | Presence updates, pushers, state transitions |
+| Rate Limiting | Rejections, sleeps, queue wait time p99 |
+| Users & Registration | Login rate, registration rate, growth over time |
+| Synapse Database Performance | Txn rate/duration, schedule latency, query latency |
+| Synapse Caches | Hit rate (top 5), sizes, evictions, response cache |
+| Event Processing & Lag | Lag by processor, stream positions, event fetch ongoing |
+| State Resolution | Forward extremities, state resolution CPU, state groups |
+| App Services (Hookshot) | Events sent, transactions sent vs failed |
+| HTTP Push | Push processed vs failed, badge updates |
+| Sliding Sync & Slow Endpoints | Sliding sync p99, slowest endpoints, rate limit wait |
+| Background Processes | In-flight by name, start rate, CPU, scheduler tasks |
+| PostgreSQL Database | Size, connections, transactions, block I/O, WAL, locks |
+| LiveKit SFU | Rooms, participants, network, packets out/dropped, forward latency |
+| Hookshot | Matrix API calls/failures, active connections, Node.js event loop lag |
+| Matrix LXC Host | CPU, RAM, network (incl. Ceph), load average, disk space |
+| PostgreSQL LXC Host | CPU, RAM, network (incl. Ceph), load average, disk space |
+
+### Alert Rules
+
+All alerts are Grafana-native (Alerting → Alert Rules). Current active rules:
+
+**Matrix folder (`matrix-folder`):**
+| Alert | Fires when | Severity |
+|-------|-----------|----------|
+| Synapse Down | `up{job="synapse"}` < 1 for 2m | critical |
+| PostgreSQL Down | `pg_up` < 1 for 2m | critical |
+| LiveKit Down | `up{job="livekit"}` < 1 for 2m | critical |
+| Hookshot Down | `up{job="hookshot"}` < 1 for 2m | critical |
+| PG Connection Saturation | connections > 80% of max for 5m | warning |
+| Federation Queue Backing Up | pending PDUs > 100 for 10m | warning |
+| Synapse High Memory | RSS > 2000 MB for 10m | warning |
+| Synapse High Response Time | p99 latency (excl. /sync) > 10s for 5m | warning |
+| Synapse Event Processing Lag | any processor > 30s behind for 5m | warning |
+| Synapse DB Query Latency High | p99 query time > 1s for 5m | warning |
+
+**Infrastructure folder (`infra-folder`):**
+| Alert | Fires when | Severity |
+|-------|-----------|----------|
+| Service Exporter Down | any `up == 0` for 3m | critical |
+| Node High CPU Usage | CPU > 90% for 10m | warning |
+| Node High Memory Usage | RAM > 90% for 10m | warning |
+| Node Disk Space Low | available < 15% (excl. tmpfs/overlay) for 10m | warning |
+
+**Prometheus rules (`/etc/prometheus/prometheus_rules.yml`):**
+| Alert | Fires when |
+|-------|-----------|
+| InstanceDown | any `up == 0` for 1m |
+| DiskSpaceFree10Percent | available < 10% (excl. tmpfs/overlay) for 5m |
+
+> **`/sync` long-poll note:** The Matrix `/sync` endpoint is a long-poll: clients hold it open for up to 30s. It is excluded from the High Response Time alert to prevent false positives; without the exclusion, p99 reads ~10s even when the server is healthy.
+
+### Known Alert False Positives / Watch Items
+
+- **Synapse Event Processing Lag** — can fire transiently after a Synapse restart while processors catch up on the backlog, and normally self-resolves in 10–20 minutes. If the lag keeps growing for more than 10 minutes without plateauing, restart Synapse.
+- **Node Disk Space Low** — excludes `tmpfs`, `overlay`, `squashfs`, `devtmpfs`, and `/boot`/`/run` mounts. If new filesystem types appear, add them to the `fstype!~` filter in the rule.
+
+---
+
 ## Bot Checklist
 
 ### Core