docs: update README for Phase 6 — monitoring, observability, alert rules

- Add Prometheus and Grafana to infrastructure table
- Update port map: Hookshot metrics on 9004, node_exporter on 9100, LiveKit metrics on 6789
- Add PostgreSQL LXC port map
- Update monitoring checklist — all Prometheus/Grafana items now complete
- Mark Hookshot metrics audit item as resolved
- Add outdated Storj node item to admin checklist
- Add full Monitoring & Observability section:
  - Prometheus scrape jobs table (synapse, livekit, hookshot, matrix-node, postgres, postgres-node)
  - Grafana dashboard section listing all 21 panel groups
  - Alert rules tables (Matrix + Infrastructure folders, Prometheus rules)
  - /sync long-poll false positive note
  - Known alert watch items

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 12:30:03 -04:00
parent 2b998b9ba6
commit a7d700d06e

124
README.md

@@ -4,7 +4,7 @@ Matrix bot and server infrastructure for the Lotus Guild homeserver (`matrix.lot
**Repo**: https://code.lotusguild.org/LotusGuild/matrixBot **Repo**: https://code.lotusguild.org/LotusGuild/matrixBot
## Status: Phase 5 — Optimization, Voice Quality & Custom Client ## Status: Phase 6 — Monitoring, Observability & Hardening
--- ---
@@ -36,6 +36,8 @@ Matrix bot and server infrastructure for the Lotus Guild homeserver (`matrix.lot
| Authelia | 10.10.10.36 | 167 | — | — | — | SSO/OIDC provider | | Authelia | 10.10.10.36 | 167 | — | — | — | SSO/OIDC provider |
| LLDAP | 10.10.10.39 | 147 | — | — | — | LDAP user directory | | LLDAP | 10.10.10.39 | 147 | — | — | — | LDAP user directory |
| Uptime Kuma | 10.10.10.25 | 101 | — | — | — | Uptime monitoring (micro1 node) | | Uptime Kuma | 10.10.10.25 | 101 | — | — | — | Uptime monitoring (micro1 node) |
| Prometheus | 10.10.10.48 | 118 | — | — | — | Prometheus — scrapes all Matrix services |
| Grafana | 10.10.10.49 | 107 | — | — | — | Grafana 12.4.0 — dashboard.lotusguild.org |
> **Note:** PostgreSQL container IP is `10.10.10.44`, not `.2` — update any stale references. > **Note:** PostgreSQL container IP is `10.10.10.44`, not `.2` — update any stale references.
@@ -79,10 +81,14 @@ Matrix bot and server infrastructure for the Lotus Guild homeserver (`matrix.lot
| Port | Service | Bind | | Port | Service | Bind |
|------|---------|------| |------|---------|------|
| 8008 | Synapse HTTP | 0.0.0.0 + ::1 | | 8008 | Synapse HTTP | 0.0.0.0 + ::1 |
| 9000 | Synapse metrics (Prometheus) | 0.0.0.0 | | 9000 | Synapse metrics (Prometheus) | 127.0.0.1 + 10.10.10.29 |
| 9001 | Hookshot widgets + metrics | 127.0.0.1 | | 9001 | Hookshot widgets | 0.0.0.0 |
| 9002 | Hookshot bridge | 127.0.0.1 | | 9002 | Hookshot bridge (appservice) | 127.0.0.1 |
| 9003 | Hookshot webhooks | 0.0.0.0 | | 9003 | Hookshot webhooks | 0.0.0.0 |
| 9004 | Hookshot metrics (Prometheus) | 0.0.0.0 |
| 9100 | node_exporter (Prometheus) | 0.0.0.0 |
| 9101 | matrix-admin exporter | 0.0.0.0 |
| 6789 | LiveKit metrics (Prometheus) | 0.0.0.0 |
| 7880 | LiveKit HTTP | 0.0.0.0 | | 7880 | LiveKit HTTP | 0.0.0.0 |
| 7881 | LiveKit RTC TCP | 0.0.0.0 | | 7881 | LiveKit RTC TCP | 0.0.0.0 |
| 8070 | lk-jwt-service | 0.0.0.0 | | 8070 | lk-jwt-service | 0.0.0.0 |
@@ -90,6 +96,13 @@ Matrix bot and server infrastructure for the Lotus Guild homeserver (`matrix.lot
| 3478 | coturn STUN/TURN | 0.0.0.0 | | 3478 | coturn STUN/TURN | 0.0.0.0 |
| 5349 | coturn TURNS/TLS | 0.0.0.0 | | 5349 | coturn TURNS/TLS | 0.0.0.0 |
**Internal port map (LXC 109 — PostgreSQL):**
| Port | Service | Bind |
|------|---------|------|
| 5432 | PostgreSQL | 0.0.0.0 (hba-restricted to 10.10.10.29) |
| 9100 | node_exporter (Prometheus) | 0.0.0.0 |
| 9187 | postgres_exporter | 0.0.0.0 |
--- ---
## Rooms (all v12) ## Rooms (all v12)
@@ -462,13 +475,20 @@ cp -r dist/* /var/www/html/
- [x] Synapse metrics endpoint (port 9000, Prometheus-compatible) - [x] Synapse metrics endpoint (port 9000, Prometheus-compatible)
- [x] Uptime Kuma monitors added: Synapse HTTP, LiveKit TCP, PostgreSQL TCP, Cinny Web, coturn TCP 3478, lk-jwt-service, Hookshot - [x] Uptime Kuma monitors added: Synapse HTTP, LiveKit TCP, PostgreSQL TCP, Cinny Web, coturn TCP 3478, lk-jwt-service, Hookshot
- [ ] Uptime Kuma: coturn UDP STUN monitoring (requires push/heartbeat — no native UDP type in Kuma) - [ ] Uptime Kuma: coturn UDP STUN monitoring (requires push/heartbeat — no native UDP type in Kuma)
- [ ] Grafana dashboard for Synapse Prometheus metrics — Grafana at 10.10.10.49 (LXC 107), Prometheus scraping 10.10.10.29:9000 confirmed. Import dashboard ID `18618` from grafana.com - [x] Grafana dashboard — custom Synapse dashboard at `dashboard.lotusguild.org/d/matrix-synapse-dashboard/matrix-synapse` (140+ panels, see Monitoring section below)
- [x] Prometheus scraping all Matrix services: Synapse, Hookshot, LiveKit, matrix-node, postgres-node, matrix-admin, postgres, postgres-exporter
- [x] node_exporter installed on LXC 151 (Matrix) and LXC 109 (PostgreSQL)
- [x] LiveKit Prometheus metrics enabled (`prometheus_port: 6789`)
- [x] Hookshot metrics enabled (`metrics: { enabled: true }`) on dedicated port 9004
- [x] Grafana alert rules — 9 Matrix/infra alerts active (see Alert Rules section below)
- [x] Duplicate Grafana "Infrastructure" folder merged and deleted
### Admin ### Admin
- [x] Synapse admin API dashboard (synapse-admin at http://10.10.10.29:8080) - [x] Synapse admin API dashboard (synapse-admin at http://10.10.10.29:8080)
- [x] Power levels per room - [x] Power levels per room
- [ ] Draupnir moderation bot (new LXC or alongside existing bot) - [ ] Draupnir moderation bot (new LXC or alongside existing bot)
- [ ] Cinny custom branding (Lotus Guild theme — colors, title, favicon, PWA name) - [ ] Cinny custom branding (Lotus Guild theme — colors, title, favicon, PWA name)
- [ ] **Storj node update** — `storj_uptodate=0` on LXC 138 (10.10.10.133), risk of disqualification
--- ---
@@ -487,7 +507,7 @@ Comprehensive audit of the current infrastructure against official documentation
| No fail2ban on `/_matrix/client/.*/login` | MEDIUM | ✅ Fixed — fail2ban installed, matrix-synapse jail active (5 retries / 24h ban) | | No fail2ban on `/_matrix/client/.*/login` | MEDIUM | ✅ Fixed — fail2ban installed, matrix-synapse jail active (5 retries / 24h ban) |
| No media purge cron (retention policy set but never triggers) | MEDIUM | ✅ N/A — `media_retention` block already in homeserver.yaml; Synapse runs the purge internally on schedule | | No media purge cron (retention policy set but never triggers) | MEDIUM | ✅ N/A — `media_retention` block already in homeserver.yaml; Synapse runs the purge internally on schedule |
| PostgreSQL autovacuum not tuned per-table | LOW | ✅ Fixed — all 5 high-churn tables tuned, `autovacuum_max_workers` → 5 | | PostgreSQL autovacuum not tuned per-table | LOW | ✅ Fixed — all 5 high-churn tables tuned, `autovacuum_max_workers` → 5 |
| Hookshot metrics scrape unconfirmed | LOW | ⚠️ Port 9001 responds but `/metrics` returns 404 — hookshot bug or path mismatch; low impact | | Hookshot metrics scrape unconfirmed | LOW | ✅ Fixed — `metrics: { enabled: true }` added to config, metrics split to dedicated port 9004, Prometheus scraping confirmed |
| LiveKit VP9/AV1 codec support | LOW | ✅ Applied — `video_codecs: [VP8, H264, VP9, AV1]` added to livekit config | | LiveKit VP9/AV1 codec support | LOW | ✅ Applied — `video_codecs: [VP8, H264, VP9, AV1]` added to livekit config |
| Federation allow/deny list not configured | LOW | Pending — Mjolnir/Draupnir on roadmap | | Federation allow/deny list not configured | LOW | Pending — Mjolnir/Draupnir on roadmap |
| Sygnal push notifications not deployed | INFO | Deferred | | Sygnal push notifications not deployed | INFO | Deferred |
@@ -793,6 +813,98 @@ Add `federation_sender` next (off-loads outgoing federation from main process).
--- ---
---
## Monitoring & Observability (March 2026)
### Prometheus Scrape Jobs
All Matrix-related services are scraped by Prometheus at `10.10.10.48` (LXC 118):
| Job | Target | Metrics |
|-----|--------|---------|
| `synapse` | `10.10.10.29:9000` | Full Synapse internals (events, federation, caches, DB, HTTP) |
| `matrix-admin` | `10.10.10.29:9101` | DAU, MAU, room/user/media totals |
| `livekit` | `10.10.10.29:6789` | Rooms, participants, packets, forward latency, quality |
| `hookshot` | `10.10.10.29:9004` | Connections by service, API calls/failures, Node.js runtime |
| `matrix-node` | `10.10.10.29:9100` | CPU, RAM, network, disk space, load avg (Matrix LXC host) |
| `postgres` | `10.10.10.44:9187` | pg_stat_database, connections, WAL, block I/O |
| `postgres-node` | `10.10.10.44:9100` | CPU, RAM, network, disk space, load avg (PostgreSQL LXC host) |
| `postgres-exporter-2` | `10.10.10.160:9711` | Secondary postgres exporter |
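The job table above maps directly onto Prometheus `scrape_configs` entries. The following is a minimal sketch that renders such a block from the job names and targets listed in the table. The helper name `render_scrape_configs` is hypothetical. Scrape intervals, metrics paths, and any relabeling are not shown in this README, so they are left out here and Prometheus defaults are assumed.

```python
# Job names and targets are copied from the table above.
SCRAPE_JOBS = [
    ("synapse", "10.10.10.29:9000"),
    ("matrix-admin", "10.10.10.29:9101"),
    ("livekit", "10.10.10.29:6789"),
    ("hookshot", "10.10.10.29:9004"),
    ("matrix-node", "10.10.10.29:9100"),
    ("postgres", "10.10.10.44:9187"),
    ("postgres-node", "10.10.10.44:9100"),
    ("postgres-exporter-2", "10.10.10.160:9711"),
]


def render_scrape_configs(jobs):
    """Render a scrape_configs YAML block with one static target per job.

    Hypothetical helper: the real prometheus.yml may set extra keys
    (scrape_interval, metrics_path, relabel_configs) that are omitted here.
    """
    lines = ["scrape_configs:"]
    for job_name, target in jobs:
        lines.append(f"  - job_name: {job_name}")
        lines.append("    static_configs:")
        lines.append(f"      - targets: ['{target}']")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_scrape_configs(SCRAPE_JOBS))
```

Keeping the job list as data makes it easy to diff against this table when a new exporter is added.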
> **Disk I/O note:** All servers use Ceph-backed storage. Per-device disk I/O metrics are meaningless; use Network I/O panels to see actual storage traffic.
### Grafana Dashboard
**URL:** `https://dashboard.lotusguild.org/d/matrix-synapse-dashboard/matrix-synapse`
140+ panels across 21 sections:
| Section | Key panels |
|---------|-----------|
| Synapse Overview | Up status, users, rooms, DAU/MAU, media, federation peers |
| Synapse Process Health | CPU, memory, FDs, thread pool, GC, Twisted reactor |
| HTTP API Requests | Rate, response codes, p99/p50 latency, in-flight, DB txn time |
| Federation | Outgoing/incoming PDUs, queue depth, staging, known servers |
| Events & Rooms | Event persistence, notifier, sync responses |
| Presence & Push | Presence updates, pushers, state transitions |
| Rate Limiting | Rejections, sleeps, queue wait time p99 |
| Users & Registration | Login rate, registration rate, growth over time |
| Synapse Database Performance | Txn rate/duration, schedule latency, query latency |
| Synapse Caches | Hit rate (top 5), sizes, evictions, response cache |
| Event Processing & Lag | Lag by processor, stream positions, event fetch ongoing |
| State Resolution | Forward extremities, state resolution CPU, state groups |
| App Services (Hookshot) | Events sent, transactions sent vs failed |
| HTTP Push | Push processed vs failed, badge updates |
| Sliding Sync & Slow Endpoints | Sliding sync p99, slowest endpoints, rate limit wait |
| Background Processes | In-flight by name, start rate, CPU, scheduler tasks |
| PostgreSQL Database | Size, connections, transactions, block I/O, WAL, locks |
| LiveKit SFU | Rooms, participants, network, packets out/dropped, forward latency |
| Hookshot | Matrix API calls/failures, active connections, Node.js event loop lag |
| Matrix LXC Host | CPU, RAM, network (incl. Ceph), load average, disk space |
| PostgreSQL LXC Host | CPU, RAM, network (incl. Ceph), load average, disk space |
### Alert Rules
All alerts are Grafana-native (Alerting → Alert Rules). Currently active rules:
**Matrix folder (`matrix-folder`):**
| Alert | Fires when | Severity |
|-------|-----------|----------|
| Synapse Down | `up{job="synapse"}` < 1 for 2m | critical |
| PostgreSQL Down | `pg_up` < 1 for 2m | critical |
| LiveKit Down | `up{job="livekit"}` < 1 for 2m | critical |
| Hookshot Down | `up{job="hookshot"}` < 1 for 2m | critical |
| PG Connection Saturation | connections > 80% of max for 5m | warning |
| Federation Queue Backing Up | pending PDUs > 100 for 10m | warning |
| Synapse High Memory | RSS > 2000MB for 10m | warning |
| Synapse High Response Time | p99 latency (excl. /sync) > 10s for 5m | warning |
| Synapse Event Processing Lag | any processor > 30s behind for 5m | warning |
| Synapse DB Query Latency High | p99 query time > 1s for 5m | warning |
**Infrastructure folder (`infra-folder`):**
| Alert | Fires when | Severity |
|-------|-----------|----------|
| Service Exporter Down | any `up == 0` for 3m | critical |
| Node High CPU Usage | CPU > 90% for 10m | warning |
| Node High Memory Usage | RAM > 90% for 10m | warning |
| Node Disk Space Low | available < 15% (excl. tmpfs/overlay) for 10m | warning |
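The `up`-based conditions in these tables can also be checked by hand against the Prometheus HTTP API (`/api/v1/query?query=up`). Below is a small sketch that takes an already-parsed instant-query response and lists the targets reporting `up == 0`. The function name `down_targets` and the sample payload are illustrative assumptions, not output captured from this deployment.

```python
def down_targets(response):
    """Return sorted (job, instance) pairs whose `up` sample equals 0.

    `response` is the parsed JSON body of a Prometheus instant query for `up`.
    """
    if response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    down = []
    for series in response["data"]["result"]:
        _timestamp, value = series["value"]  # the sample value arrives as a string
        if float(value) == 0:
            labels = series["metric"]
            down.append((labels.get("job", "?"), labels.get("instance", "?")))
    return sorted(down)


# Made-up payload in the shape Prometheus returns for a vector result.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"job": "synapse", "instance": "10.10.10.29:9000"},
             "value": [1773160203.0, "1"]},
            {"metric": {"job": "hookshot", "instance": "10.10.10.29:9004"},
             "value": [1773160203.0, "0"]},
        ],
    },
}

if __name__ == "__main__":
    for job, instance in down_targets(SAMPLE_RESPONSE):
        print(f"DOWN: {job} ({instance})")
```

This only reports the current sample. The `for 2m` and `for 3m` hold periods in the tables are enforced by the alert rules themselves, not by a one-off query like this.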
**Prometheus rules (`/etc/prometheus/prometheus_rules.yml`):**
| Alert | Fires when |
|-------|-----------|
| InstanceDown | any `up == 0` for 1m |
| DiskSpaceFree10Percent | available < 10% (excl. tmpfs/overlay) for 5m |
> **`/sync` long-poll note:** The Matrix `/sync` endpoint is a long-poll (clients hold it open ≤30s). It is excluded from the High Response Time alert to prevent false positives. Without exclusion, p99 reads ~10s even when the server is healthy.
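The effect described in the note can be illustrated with synthetic numbers. This sketch computes a nearest-rank p99 over made-up request latencies, once with idle `/sync` long-polls included and once without. All sample values are invented for illustration; they are not measurements from this server and are not meant to reproduce the ~10s figure quoted above.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic latencies in seconds (made up, not measured).
api_requests = [0.05] * 900   # ordinary endpoints answering quickly
sync_requests = [25.0] * 100  # idle /sync long-polls held open near the 30 s limit

p99_all = percentile(api_requests + sync_requests, 99)
p99_without_sync = percentile(api_requests, 99)

if __name__ == "__main__":
    print(f"p99 including /sync: {p99_all:.2f}s")
    print(f"p99 excluding /sync: {p99_without_sync:.2f}s")
```

Because long-polls make up the slowest tail of the combined distribution, they dominate the 99th percentile even though every other endpoint is fast. That is why the alert filters them out before computing p99.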
### Known Alert False Positives / Watch Items
- **Synapse Event Processing Lag** — can fire transiently after a Synapse restart while processors catch up on backlog. Self-resolves in 10–20 minutes. If it grows continuously (>10 min) and doesn't plateau, restart Synapse.
- **Node Disk Space Low** — excludes `tmpfs`, `overlay`, `squashfs`, `devtmpfs`, and `/boot`/`/run` mounts. If new filesystem types appear, add them to the `fstype!~` filter in the rule.
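As a rough sketch of how the `fstype!~` filter might be extended, the snippet below builds a disk-space expression from a list of excluded filesystem types using the standard node_exporter metrics `node_filesystem_avail_bytes` and `node_filesystem_size_bytes`. The exact expression in the Grafana rule is not shown in this README, so the shape here is an assumption, and the `/boot` and `/run` mountpoint exclusions are omitted. `disk_space_low_expr` is a hypothetical helper name.

```python
# Filesystem types the rule excludes, as listed in the bullet above.
EXCLUDED_FSTYPES = ["tmpfs", "overlay", "squashfs", "devtmpfs"]


def disk_space_low_expr(excluded_fstypes, threshold_percent=15):
    """Build a PromQL expression for 'available space below threshold'.

    Assumed shape only; the deployed rule may differ (for example it also
    excludes some mountpoints, which this sketch leaves out).
    """
    fstype_filter = "|".join(excluded_fstypes)
    selector = f'{{fstype!~"{fstype_filter}"}}'
    return (
        f"node_filesystem_avail_bytes{selector} / "
        f"node_filesystem_size_bytes{selector} * 100 < {threshold_percent}"
    )


if __name__ == "__main__":
    # Example: excluding one more filesystem type ("nfs4" is just an illustration).
    print(disk_space_low_expr(EXCLUDED_FSTYPES + ["nfs4"]))
```

Adding a type to the list changes the selector on both sides of the division, which keeps the two metrics matched on the same set of filesystems.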
---
## Bot Checklist ## Bot Checklist
### Core ### Core