From a7d700d06e9a1ab9d23c2551883d570c0c5f0f19 Mon Sep 17 00:00:00 2001
From: Jared Vititoe
Date: Tue, 10 Mar 2026 12:30:03 -0400
Subject: [PATCH] =?UTF-8?q?docs:=20update=20README=20for=20Phase=206=20?=
 =?UTF-8?q?=E2=80=94=20monitoring,=20observability,=20alert=20rules?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Add Prometheus and Grafana to infrastructure table
- Update port map: Hookshot metrics on 9004, node_exporter on 9100, LiveKit metrics on 6789
- Add PostgreSQL LXC port map
- Update monitoring checklist — all Prometheus/Grafana items now complete
- Mark Hookshot metrics audit item as resolved
- Add Storj node outdated to admin checklist
- Add full Monitoring & Observability section:
  - Prometheus scrape jobs table (synapse, livekit, hookshot, matrix-node, postgres, postgres-node)
  - Grafana dashboard section listing all 21 panel groups
  - Alert rules tables (Matrix + Infrastructure folders, Prometheus rules)
  - /sync long-poll false positive note
  - Known alert watch items

Co-Authored-By: Claude Sonnet 4.6
---
 README.md | 124 +++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 118 insertions(+), 6 deletions(-)

diff --git a/README.md b/README.md
index a15ee06..fd2bf3d 100644
--- a/README.md
+++ b/README.md
@@ -4,7 +4,7 @@ Matrix bot and server infrastructure for the Lotus Guild homeserver (`matrix.lot
 
 **Repo**: https://code.lotusguild.org/LotusGuild/matrixBot
 
-## Status: Phase 5 — Optimization, Voice Quality & Custom Client
+## Status: Phase 6 — Monitoring, Observability & Hardening
 
 ---
 
@@ -36,6 +36,8 @@ Matrix bot and server infrastructure for the Lotus Guild homeserver (`matrix.lot
 | Authelia | 10.10.10.36 | 167 | — | — | — | SSO/OIDC provider |
 | LLDAP | 10.10.10.39 | 147 | — | — | — | LDAP user directory |
 | Uptime Kuma | 10.10.10.25 | 101 | — | — | — | Uptime monitoring (micro1 node) |
+| Prometheus | 10.10.10.48 | 118 | — | — | — | Prometheus — scrapes all Matrix services |
+| Grafana | 10.10.10.49 | 107 | — | — | — | Grafana 12.4.0 — dashboard.lotusguild.org |
 
 > **Note:** PostgreSQL container IP is `10.10.10.44`, not `.2` — update any stale references.
 
@@ -79,10 +81,14 @@ Matrix bot and server infrastructure for the Lotus Guild homeserver (`matrix.lot
 | Port | Service | Bind |
 |------|---------|------|
 | 8008 | Synapse HTTP | 0.0.0.0 + ::1 |
-| 9000 | Synapse metrics (Prometheus) | 0.0.0.0 |
-| 9001 | Hookshot widgets + metrics | 127.0.0.1 |
-| 9002 | Hookshot bridge | 127.0.0.1 |
+| 9000 | Synapse metrics (Prometheus) | 127.0.0.1 + 10.10.10.29 |
+| 9001 | Hookshot widgets | 0.0.0.0 |
+| 9002 | Hookshot bridge (appservice) | 127.0.0.1 |
 | 9003 | Hookshot webhooks | 0.0.0.0 |
+| 9004 | Hookshot metrics (Prometheus) | 0.0.0.0 |
+| 9100 | node_exporter (Prometheus) | 0.0.0.0 |
+| 9101 | matrix-admin exporter | 0.0.0.0 |
+| 6789 | LiveKit metrics (Prometheus) | 0.0.0.0 |
 | 7880 | LiveKit HTTP | 0.0.0.0 |
 | 7881 | LiveKit RTC TCP | 0.0.0.0 |
 | 8070 | lk-jwt-service | 0.0.0.0 |
@@ -90,6 +96,13 @@ Matrix bot and server infrastructure for the Lotus Guild homeserver (`matrix.lot
 | 3478 | coturn STUN/TURN | 0.0.0.0 |
 | 5349 | coturn TURNS/TLS | 0.0.0.0 |
 
+**Internal port map (LXC 109 — PostgreSQL):**
+| Port | Service | Bind |
+|------|---------|------|
+| 5432 | PostgreSQL | 0.0.0.0 (hba-restricted to 10.10.10.29) |
+| 9100 | node_exporter (Prometheus) | 0.0.0.0 |
+| 9187 | postgres_exporter | 0.0.0.0 |
+
 ---
 
 ## Rooms (all v12)
@@ -462,13 +475,20 @@ cp -r dist/* /var/www/html/
 - [x] Synapse metrics endpoint (port 9000, Prometheus-compatible)
 - [x] Uptime Kuma monitors added: Synapse HTTP, LiveKit TCP, PostgreSQL TCP, Cinny Web, coturn TCP 3478, lk-jwt-service, Hookshot
 - [ ] Uptime Kuma: coturn UDP STUN monitoring (requires push/heartbeat — no native UDP type in Kuma)
-- [ ] Grafana dashboard for Synapse Prometheus metrics — Grafana at 10.10.10.49 (LXC 107), Prometheus scraping 10.10.10.29:9000 confirmed. Import dashboard ID `18618` from grafana.com
+- [x] Grafana dashboard — custom Synapse dashboard at `dashboard.lotusguild.org/d/matrix-synapse-dashboard/matrix-synapse` (140+ panels, see Monitoring section below)
+- [x] Prometheus scraping all Matrix services: Synapse, Hookshot, LiveKit, matrix-node, postgres-node, matrix-admin, postgres, postgres-exporter
+- [x] node_exporter installed on LXC 151 (Matrix) and LXC 109 (PostgreSQL)
+- [x] LiveKit Prometheus metrics enabled (`prometheus_port: 6789`)
+- [x] Hookshot metrics enabled (`metrics: { enabled: true }`) on dedicated port 9004
+- [x] Grafana alert rules — 9 Matrix/infra alerts active (see Alert Rules section below)
+- [x] Duplicate Grafana "Infrastructure" folder merged and deleted
 
 ### Admin
 - [x] Synapse admin API dashboard (synapse-admin at http://10.10.10.29:8080)
 - [x] Power levels per room
 - [ ] Draupnir moderation bot (new LXC or alongside existing bot)
 - [ ] Cinny custom branding (Lotus Guild theme — colors, title, favicon, PWA name)
+- [ ] **Storj node update** — `storj_uptodate=0` on LXC 138 (10.10.10.133), risk of disqualification
 
 ---
 
@@ -487,7 +507,7 @@ Comprehensive audit of the current infrastructure against official documentation
 | No fail2ban on `/_matrix/client/.*/login` | MEDIUM | ✅ Fixed — fail2ban installed, matrix-synapse jail active (5 retries / 24h ban) |
 | No media purge cron (retention policy set but never triggers) | MEDIUM | ✅ N/A — `media_retention` block already in homeserver.yaml; Synapse runs the purge internally on schedule |
 | PostgreSQL autovacuum not tuned per-table | LOW | ✅ Fixed — all 5 high-churn tables tuned, `autovacuum_max_workers` → 5 |
-| Hookshot metrics scrape unconfirmed | LOW | ⚠️ Port 9001 responds but `/metrics` returns 404 — hookshot bug or path mismatch; low impact |
+| Hookshot metrics scrape unconfirmed | LOW | ✅ Fixed — `metrics: { enabled: true }` added to config, metrics split to dedicated port 9004, Prometheus scraping confirmed |
 | LiveKit VP9/AV1 codec support | LOW | ✅ Applied — `video_codecs: [VP8, H264, VP9, AV1]` added to livekit config |
 | Federation allow/deny list not configured | LOW | Pending — Mjolnir/Draupnir on roadmap |
 | Sygnal push notifications not deployed | INFO | Deferred |
@@ -793,6 +813,98 @@ Add `federation_sender` next (off-loads outgoing federation from main process).
 
 ---
 
+---
+
+## Monitoring & Observability (March 2026)
+
+### Prometheus Scrape Jobs
+
+All Matrix-related services scraped by Prometheus at `10.10.10.48` (LXC 118):
+
+| Job | Target | Metrics |
+|-----|--------|---------|
+| `synapse` | `10.10.10.29:9000` | Full Synapse internals (events, federation, caches, DB, HTTP) |
+| `matrix-admin` | `10.10.10.29:9101` | DAU, MAU, room/user/media totals |
+| `livekit` | `10.10.10.29:6789` | Rooms, participants, packets, forward latency, quality |
+| `hookshot` | `10.10.10.29:9004` | Connections by service, API calls/failures, Node.js runtime |
+| `matrix-node` | `10.10.10.29:9100` | CPU, RAM, network, disk space, load avg (Matrix LXC host) |
+| `postgres` | `10.10.10.44:9187` | pg_stat_database, connections, WAL, block I/O |
+| `postgres-node` | `10.10.10.44:9100` | CPU, RAM, network, disk space, load avg (PostgreSQL LXC host) |
+| `postgres-exporter-2` | `10.10.10.160:9711` | Secondary postgres exporter |
+
+> **Disk I/O note:** All servers use Ceph-backed storage. Per-device disk I/O metrics are meaningless; use Network I/O panels to see actual storage traffic.
+
+### Grafana Dashboard
+
+**URL:** `https://dashboard.lotusguild.org/d/matrix-synapse-dashboard/matrix-synapse`
+
+140+ panels across 21 sections:
+
+| Section | Key panels |
+|---------|-----------|
+| Synapse Overview | Up status, users, rooms, DAU/MAU, media, federation peers |
+| Synapse Process Health | CPU, memory, FDs, thread pool, GC, Twisted reactor |
+| HTTP API Requests | Rate, response codes, p99/p50 latency, in-flight, DB txn time |
+| Federation | Outgoing/incoming PDUs, queue depth, staging, known servers |
+| Events & Rooms | Event persistence, notifier, sync responses |
+| Presence & Push | Presence updates, pushers, state transitions |
+| Rate Limiting | Rejections, sleeps, queue wait time p99 |
+| Users & Registration | Login rate, registration rate, growth over time |
+| Synapse Database Performance | Txn rate/duration, schedule latency, query latency |
+| Synapse Caches | Hit rate (top 5), sizes, evictions, response cache |
+| Event Processing & Lag | Lag by processor, stream positions, event fetch ongoing |
+| State Resolution | Forward extremities, state resolution CPU, state groups |
+| App Services (Hookshot) | Events sent, transactions sent vs failed |
+| HTTP Push | Push processed vs failed, badge updates |
+| Sliding Sync & Slow Endpoints | Sliding sync p99, slowest endpoints, rate limit wait |
+| Background Processes | In-flight by name, start rate, CPU, scheduler tasks |
+| PostgreSQL Database | Size, connections, transactions, block I/O, WAL, locks |
+| LiveKit SFU | Rooms, participants, network, packets out/dropped, forward latency |
+| Hookshot | Matrix API calls/failures, active connections, Node.js event loop lag |
+| Matrix LXC Host | CPU, RAM, network (incl. Ceph), load average, disk space |
+| PostgreSQL LXC Host | CPU, RAM, network (incl. Ceph), load average, disk space |
+
+### Alert Rules
+
+All alerts are Grafana-native (Alerting → Alert Rules). Current active rules:
+
+**Matrix folder (`matrix-folder`):**
+| Alert | Fires when | Severity |
+|-------|-----------|----------|
+| Synapse Down | `up{job="synapse"}` < 1 for 2m | critical |
+| PostgreSQL Down | `pg_up` < 1 for 2m | critical |
+| LiveKit Down | `up{job="livekit"}` < 1 for 2m | critical |
+| Hookshot Down | `up{job="hookshot"}` < 1 for 2m | critical |
+| PG Connection Saturation | connections > 80% of max for 5m | warning |
+| Federation Queue Backing Up | pending PDUs > 100 for 10m | warning |
+| Synapse High Memory | RSS > 2000MB for 10m | warning |
+| Synapse High Response Time | p99 latency (excl. /sync) > 10s for 5m | warning |
+| Synapse Event Processing Lag | any processor > 30s behind for 5m | warning |
+| Synapse DB Query Latency High | p99 query time > 1s for 5m | warning |
+
+**Infrastructure folder (`infra-folder`):**
+| Alert | Fires when | Severity |
+|-------|-----------|----------|
+| Service Exporter Down | any `up == 0` for 3m | critical |
+| Node High CPU Usage | CPU > 90% for 10m | warning |
+| Node High Memory Usage | RAM > 90% for 10m | warning |
+| Node Disk Space Low | available < 15% (excl. tmpfs/overlay) for 10m | warning |
+
+**Prometheus rules (`/etc/prometheus/prometheus_rules.yml`):**
+| Alert | Fires when |
+|-------|-----------|
+| InstanceDown | any `up == 0` for 1m |
+| DiskSpaceFree10Percent | available < 10% (excl. tmpfs/overlay) for 5m |
+
+> **`/sync` long-poll note:** The Matrix `/sync` endpoint is a long-poll (clients hold it open ≤30s). It is excluded from the High Response Time alert to prevent false positives. Without exclusion, p99 reads ~10s even when the server is healthy.
+
+### Known Alert False Positives / Watch Items
+
+- **Synapse Event Processing Lag** — can fire transiently after Synapse restart while processors catch up on backlog. Self-resolves in 10–20 minutes. If it grows continuously (>10 min) and doesn't plateau, restart Synapse.
+- **Node Disk Space Low** — excludes `tmpfs`, `overlay`, `squashfs`, `devtmpfs`, and `/boot`/`/run` mounts. If new filesystem types appear, add them to the `fstype!~` filter in the rule.
+
+---
+
 ## Bot Checklist
 
 ### Core
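
For reference, the scrape jobs table added by this patch maps onto a Prometheus configuration roughly like the one below. This is a minimal sketch, not the deployed `prometheus.yml`. The job names and targets come from the table. The `scrape_interval` value is an assumption. The `/_synapse/metrics` path is Synapse's documented default rather than something stated in the patch, and the other jobs are assumed to use Prometheus's default `/metrics` path.

```yaml
# Sketch only: job names and targets are copied from the scrape jobs table.
# scrape_interval and metrics_path are assumptions, not the live configuration.
global:
  scrape_interval: 15s  # assumed value

scrape_configs:
  - job_name: synapse
    metrics_path: /_synapse/metrics  # Synapse's documented path (assumed unchanged)
    static_configs:
      - targets: ['10.10.10.29:9000']

  - job_name: matrix-admin
    static_configs:
      - targets: ['10.10.10.29:9101']

  - job_name: livekit
    static_configs:
      - targets: ['10.10.10.29:6789']

  - job_name: hookshot
    static_configs:
      - targets: ['10.10.10.29:9004']

  - job_name: matrix-node
    static_configs:
      - targets: ['10.10.10.29:9100']

  - job_name: postgres
    static_configs:
      - targets: ['10.10.10.44:9187']

  - job_name: postgres-node
    static_configs:
      - targets: ['10.10.10.44:9100']

  - job_name: postgres-exporter-2
    static_configs:
      - targets: ['10.10.10.160:9711']
```

A file like this can be validated with `promtool check config <path>` before Prometheus is reloaded. The alert rules that reference `up{job="..."}` depend on these job names matching exactly.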