docs: update README for Phase 6 — monitoring, observability, alert rules

- Add Prometheus and Grafana to infrastructure table
- Update port map: Hookshot metrics on 9004, node_exporter on 9100, LiveKit metrics on 6789
- Add PostgreSQL LXC port map
- Update monitoring checklist: all Prometheus/Grafana items now complete
- Mark Hookshot metrics audit item as resolved
- Add Storj node outdated to admin checklist
- Add full Monitoring & Observability section:
  - Prometheus scrape jobs table (synapse, livekit, hookshot, matrix-node, postgres, postgres-node)
  - Grafana dashboard section listing all 21 panel groups
  - Alert rules tables (Matrix + Infrastructure folders, Prometheus rules)
  - /sync long-poll false positive note
  - Known alert watch items

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 12:30:03 -04:00
parent 2b998b9ba6
commit a7d700d06e
1 changed files with 118 additions and 6 deletions
@@ -4,7 +4,7 @@ Matrix bot and server infrastructure for the Lotus Guild homeserver (`matrix.lot
 **Repo**: https://code.lotusguild.org/LotusGuild/matrixBot
-## Status: Phase 5 — Optimization, Voice Quality & Custom Client
+## Status: Phase 6 — Monitoring, Observability & Hardening
 ---
@@ -36,6 +36,8 @@ Matrix bot and server infrastructure for the Lotus Guild homeserver (`matrix.lot
 | Authelia | 10.10.10.36 | 167 | — | — | — | SSO/OIDC provider |
 | LLDAP | 10.10.10.39 | 147 | — | — | — | LDAP user directory |
 | Uptime Kuma | 10.10.10.25 | 101 | — | — | — | Uptime monitoring (micro1 node) |
 | Prometheus | 10.10.10.48 | 118 | — | — | — | Prometheus — scrapes all Matrix services |
 | Grafana | 10.10.10.49 | 107 | — | — | — | Grafana 12.4.0 — dashboard.lotusguild.org |
 > **Note:** PostgreSQL container IP is `10.10.10.44`, not `.2` — update any stale references.
@@ -79,10 +81,14 @@ Matrix bot and server infrastructure for the Lotus Guild homeserver (`matrix.lot
 | Port | Service | Bind |
 |------|---------|------|
 | 8008 | Synapse HTTP | 0.0.0.0 + ::1 |
-| 9000 | Synapse metrics (Prometheus) | 0.0.0.0 |
+| 9000 | Synapse metrics (Prometheus) | 127.0.0.1 + 10.10.10.29 |
-| 9001 | Hookshot widgets + metrics | 127.0.0.1 |
+| 9001 | Hookshot widgets | 0.0.0.0 |
-| 9002 | Hookshot bridge | 127.0.0.1 |
+| 9002 | Hookshot bridge (appservice) | 127.0.0.1 |
 | 9003 | Hookshot webhooks | 0.0.0.0 |
 | 9004 | Hookshot metrics (Prometheus) | 0.0.0.0 |
 | 9100 | node_exporter (Prometheus) | 0.0.0.0 |
 | 9101 | matrix-admin exporter | 0.0.0.0 |
 | 6789 | LiveKit metrics (Prometheus) | 0.0.0.0 |
 | 7880 | LiveKit HTTP | 0.0.0.0 |
 | 7881 | LiveKit RTC TCP | 0.0.0.0 |
 | 8070 | lk-jwt-service | 0.0.0.0 |
@@ -90,6 +96,13 @@ Matrix bot and server infrastructure for the Lotus Guild homeserver (`matrix.lot
 | 3478 | coturn STUN/TURN | 0.0.0.0 |
 | 5349 | coturn TURNS/TLS | 0.0.0.0 |
 **Internal port map (LXC 109 — PostgreSQL):**
 | Port | Service | Bind |
 |------|---------|------|
 | 5432 | PostgreSQL | 0.0.0.0 (hba-restricted to 10.10.10.29) |
 | 9100 | node_exporter (Prometheus) | 0.0.0.0 |
 | 9187 | postgres_exporter | 0.0.0.0 |
 ---
 ## Rooms (all v12)
@@ -462,13 +475,20 @@ cp -r dist/* /var/www/html/
 - [x] Synapse metrics endpoint (port 9000, Prometheus-compatible)
 - [x] Uptime Kuma monitors added: Synapse HTTP, LiveKit TCP, PostgreSQL TCP, Cinny Web, coturn TCP 3478, lk-jwt-service, Hookshot
 - [ ] Uptime Kuma: coturn UDP STUN monitoring (requires push/heartbeat — no native UDP type in Kuma)
- [ ] Grafana dashboard for Synapse Prometheus metrics — Grafana at 10.10.10.49 (LXC 107), Prometheus scraping 10.10.10.29:9000 confirmed. Import dashboard ID `18618` from grafana.com
+- [x] Grafana dashboard — custom Synapse dashboard at `dashboard.lotusguild.org/d/matrix-synapse-dashboard/matrix-synapse` (140+ panels, see Monitoring section below)
 - [x] Prometheus scraping all Matrix services: Synapse, Hookshot, LiveKit, matrix-node, postgres-node, matrix-admin, postgres, postgres-exporter
 - [x] node_exporter installed on LXC 151 (Matrix) and LXC 109 (PostgreSQL)
 - [x] LiveKit Prometheus metrics enabled (`prometheus_port: 6789`)
 - [x] Hookshot metrics enabled (`metrics: { enabled: true }`) on dedicated port 9004
 - [x] Grafana alert rules: 14 Matrix/infra alerts active (10 Matrix, 4 Infrastructure; see Alert Rules section below)
 - [x] Duplicate Grafana "Infrastructure" folder merged and deleted
 ### Admin
 - [x] Synapse admin API dashboard (synapse-admin at http://10.10.10.29:8080)
 - [x] Power levels per room
 - [ ] Draupnir moderation bot (new LXC or alongside existing bot)
 - [ ] Cinny custom branding (Lotus Guild theme — colors, title, favicon, PWA name)
 - [ ] **Storj node update** — `storj_uptodate=0` on LXC 138 (10.10.10.133), risk of disqualification
 ---
@@ -487,7 +507,7 @@ Comprehensive audit of the current infrastructure against official documentation
 | No fail2ban on `/_matrix/client/.*/login` | MEDIUM | ✅ Fixed — fail2ban installed, matrix-synapse jail active (5 retries / 24h ban) |
 | No media purge cron (retention policy set but never triggers) | MEDIUM | ✅ N/A — `media_retention` block already in homeserver.yaml; Synapse runs the purge internally on schedule |
 | PostgreSQL autovacuum not tuned per-table | LOW | ✅ Fixed — all 5 high-churn tables tuned, `autovacuum_max_workers` → 5 |
-| Hookshot metrics scrape unconfirmed | LOW | ⚠️ Port 9001 responds but `/metrics` returns 404 — hookshot bug or path mismatch; low impact |
+| Hookshot metrics scrape unconfirmed | LOW | ✅ Fixed — `metrics: { enabled: true }` added to config, metrics split to dedicated port 9004, Prometheus scraping confirmed |
 | LiveKit VP9/AV1 codec support | LOW | ✅ Applied — `video_codecs: [VP8, H264, VP9, AV1]` added to livekit config |
 | Federation allow/deny list not configured | LOW | Pending — Mjolnir/Draupnir on roadmap |
 | Sygnal push notifications not deployed | INFO | Deferred |
@@ -793,6 +813,98 @@ Add `federation_sender` next (off-loads outgoing federation from main process).
 ---
 ---
 ## Monitoring & Observability (March 2026)
 ### Prometheus Scrape Jobs
 All Matrix-related services are scraped by Prometheus at `10.10.10.48` (LXC 118):
 | Job | Target | Metrics |
 |-----|--------|---------|
 | `synapse` | `10.10.10.29:9000` | Full Synapse internals (events, federation, caches, DB, HTTP) |
 | `matrix-admin` | `10.10.10.29:9101` | DAU, MAU, room/user/media totals |
 | `livekit` | `10.10.10.29:6789` | Rooms, participants, packets, forward latency, quality |
 | `hookshot` | `10.10.10.29:9004` | Connections by service, API calls/failures, Node.js runtime |
 | `matrix-node` | `10.10.10.29:9100` | CPU, RAM, network, disk space, load avg (Matrix LXC host) |
 | `postgres` | `10.10.10.44:9187` | pg_stat_database, connections, WAL, block I/O |
 | `postgres-node` | `10.10.10.44:9100` | CPU, RAM, network, disk space, load avg (PostgreSQL LXC host) |
 | `postgres-exporter-2` | `10.10.10.160:9711` | Secondary postgres exporter |
 > **Disk I/O note:** All servers use Ceph-backed storage. Per-device disk I/O metrics are meaningless; use Network I/O panels to see actual storage traffic.
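 For orientation, the scrape jobs table above could correspond to a `scrape_configs` block along the following lines. This is a sketch, not the deployed file: the job names and targets come from the table, while the `scrape_interval` value is an assumption. Synapse's metrics listener serves `/_synapse/metrics`; the other jobs are assumed to use the default `/metrics` path.

 ```yaml
 # Sketch of a Prometheus config matching the scrape jobs table.
 # scrape_interval is an assumed value, not taken from this README.
 global:
   scrape_interval: 15s

 scrape_configs:
   - job_name: synapse
     metrics_path: /_synapse/metrics   # Synapse does not serve plain /metrics
     static_configs:
       - targets: ["10.10.10.29:9000"]
   - job_name: matrix-admin
     static_configs:
       - targets: ["10.10.10.29:9101"]
   - job_name: livekit
     static_configs:
       - targets: ["10.10.10.29:6789"]
   - job_name: hookshot
     static_configs:
       - targets: ["10.10.10.29:9004"]
   - job_name: matrix-node
     static_configs:
       - targets: ["10.10.10.29:9100"]
   - job_name: postgres
     static_configs:
       - targets: ["10.10.10.44:9187"]
   - job_name: postgres-node
     static_configs:
       - targets: ["10.10.10.44:9100"]
   - job_name: postgres-exporter-2
     static_configs:
       - targets: ["10.10.10.160:9711"]
 ```

 Keeping one job per exporter means `up{job="..."}` can be used directly as the health signal for each service.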
 ### Grafana Dashboard
 **URL:** `https://dashboard.lotusguild.org/d/matrix-synapse-dashboard/matrix-synapse`
 140+ panels across 21 sections:
 | Section | Key panels |
 |---------|-----------|
 | Synapse Overview | Up status, users, rooms, DAU/MAU, media, federation peers |
 | Synapse Process Health | CPU, memory, FDs, thread pool, GC, Twisted reactor |
 | HTTP API Requests | Rate, response codes, p99/p50 latency, in-flight, DB txn time |
 | Federation | Outgoing/incoming PDUs, queue depth, staging, known servers |
 | Events & Rooms | Event persistence, notifier, sync responses |
 | Presence & Push | Presence updates, pushers, state transitions |
 | Rate Limiting | Rejections, sleeps, queue wait time p99 |
 | Users & Registration | Login rate, registration rate, growth over time |
 | Synapse Database Performance | Txn rate/duration, schedule latency, query latency |
 | Synapse Caches | Hit rate (top 5), sizes, evictions, response cache |
 | Event Processing & Lag | Lag by processor, stream positions, event fetch ongoing |
 | State Resolution | Forward extremities, state resolution CPU, state groups |
 | App Services (Hookshot) | Events sent, transactions sent vs failed |
 | HTTP Push | Push processed vs failed, badge updates |
 | Sliding Sync & Slow Endpoints | Sliding sync p99, slowest endpoints, rate limit wait |
 | Background Processes | In-flight by name, start rate, CPU, scheduler tasks |
 | PostgreSQL Database | Size, connections, transactions, block I/O, WAL, locks |
 | LiveKit SFU | Rooms, participants, network, packets out/dropped, forward latency |
 | Hookshot | Matrix API calls/failures, active connections, Node.js event loop lag |
 | Matrix LXC Host | CPU, RAM, network (incl. Ceph), load average, disk space |
 | PostgreSQL LXC Host | CPU, RAM, network (incl. Ceph), load average, disk space |
 ### Alert Rules
 All alerts are Grafana-native (Alerting → Alert Rules). Current active rules:
 **Matrix folder (`matrix-folder`):**
 | Alert | Fires when | Severity |
 |-------|-----------|----------|
 | Synapse Down | `up{job="synapse"}` < 1 for 2m | critical |
 | PostgreSQL Down | `pg_up` < 1 for 2m | critical |
 | LiveKit Down | `up{job="livekit"}` < 1 for 2m | critical |
 | Hookshot Down | `up{job="hookshot"}` < 1 for 2m | critical |
 | PG Connection Saturation | connections > 80% of max for 5m | warning |
 | Federation Queue Backing Up | pending PDUs > 100 for 10m | warning |
 | Synapse High Memory | RSS > 2000MB for 10m | warning |
 | Synapse High Response Time | p99 latency (excl. /sync) > 10s for 5m | warning |
 | Synapse Event Processing Lag | any processor > 30s behind for 5m | warning |
 | Synapse DB Query Latency High | p99 query time > 1s for 5m | warning |
 **Infrastructure folder (`infra-folder`):**
 | Alert | Fires when | Severity |
 |-------|-----------|----------|
 | Service Exporter Down | any `up == 0` for 3m | critical |
 | Node High CPU Usage | CPU > 90% for 10m | warning |
 | Node High Memory Usage | RAM > 90% for 10m | warning |
 | Node Disk Space Low | available < 15% (excl. tmpfs/overlay) for 10m | warning |
 **Prometheus rules (`/etc/prometheus/prometheus_rules.yml`):**
 | Alert | Fires when |
 |-------|-----------|
 | InstanceDown | any `up == 0` for 1m |
 | DiskSpaceFree10Percent | available < 10% (excl. tmpfs/overlay) for 5m |
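 As a rough illustration, the two Prometheus rules in the table above might be written as follows. The alert names, thresholds, and `for` durations are from the table; the group name and the exact expressions are assumptions. The disk expression uses the standard node_exporter filesystem metrics.

 ```yaml
 # Sketch of /etc/prometheus/prometheus_rules.yml.
 # Group name and exact expressions are assumptions; only the alert
 # names, thresholds, and durations come from the table.
 groups:
   - name: infrastructure
     rules:
       - alert: InstanceDown
         expr: up == 0
         for: 1m
       - alert: DiskSpaceFree10Percent
         # Percentage of space still available, ignoring pseudo-filesystems.
         expr: |
           node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
             / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} * 100 < 10
         for: 5m
 ```

 A file like this can be validated with `promtool check rules` before Prometheus is reloaded.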
 > **`/sync` long-poll note:** The Matrix `/sync` endpoint is a long-poll (clients hold it open ≤30s). It is excluded from the High Response Time alert to prevent false positives. Without exclusion, p99 reads ~10s even when the server is healthy.
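 The `/sync` exclusion could look roughly like the query below, written in Prometheus rule syntax for readability even though the actual alert is Grafana-native. The metric name `synapse_http_server_response_time_seconds_bucket` and the servlet label value `SyncRestServlet` are assumptions about what this Synapse version exposes, and the alert and group names are hypothetical; check all of them against the live `/_synapse/metrics` output before reuse.

 ```yaml
 # Sketch only: metric and label names are assumptions, and the
 # alert/group names are hypothetical.
 groups:
   - name: matrix
     rules:
       - alert: SynapseHighResponseTime
         # p99 request latency across all servlets except the /sync long-poll.
         expr: |
           histogram_quantile(
             0.99,
             sum by (le) (
               rate(synapse_http_server_response_time_seconds_bucket{job="synapse", servlet!="SyncRestServlet"}[5m])
             )
           ) > 10
         for: 5m
         labels:
           severity: warning
 ```

 Filtering on the servlet label before aggregating keeps the long-poll requests out of the histogram buckets entirely, so they cannot skew the quantile.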
 ### Known Alert False Positives / Watch Items
 - **Synapse Event Processing Lag**: can fire transiently after a Synapse restart while processors catch up on the backlog, and normally self-resolves within 10–20 minutes. If the lag keeps growing for more than 10 minutes without plateauing, restart Synapse.
 - **Node Disk Space Low**: the rule excludes `tmpfs`, `overlay`, `squashfs`, and `devtmpfs` filesystems as well as `/boot` and `/run` mounts. If new filesystem types appear, add them to the `fstype!~` filter in the rule.
 ---
 ## Bot Checklist
 ### Core