Enable Draupnir web server (abuse reporting) and add healthz config to repo

- draupnir/production.yaml: Add health.healthz (port 8081) and web.abuseReporting
  (port 8080) config — healthz was live on LXC but missing from repo; web server
  enables Matrix client Report button forwarding to management room (Synapse module
  install on LXC 151 still needed to complete the integration)
- README: Add Draupnir port map, abuse reporting setup docs, updated monitoring
  section (3 new Prometheus scrape jobs, Draupnir Down alert, Grafana panel count),
  add presence-disabled federation lag fix to performance checklist, document
  Draupnir healthz/audit DB paths

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-22 21:12:19 -04:00
parent c1e21004be
commit 3db163e43d
2 changed files with 70 additions and 11 deletions
@@ -98,9 +98,12 @@ matrix/
- Data/SQLite DBs: `/data/storage/`
- Service: `draupnir.service`
- Management room: `#management:matrix.lotusguild.org` (`!mEvR5fe3jMmzwd-FwNygD72OY_yu8H3UP_N-57oK7MI`)
-- Bot account: `@draupnir:matrix.lotusguild.org` (power level 100 in all protected rooms)
+- Bot account: `@draupnir:matrix.lotusguild.org` (power level 100 in all protected rooms and the Lotus Guild space)
- Subscribed ban lists: `#community-moderation-effort-bl:neko.dev`, `#matrix-org-coc-bl:matrix.org`
- Rebuild: `NODE_OPTIONS="--max-old-space-size=768" npx tsc --project tsconfig.json`
- Healthz endpoint: `http://10.10.10.24:8081/healthz` (200 = healthy, 418 = disconnected)
- Abuse reporting endpoint: `POST http://10.10.10.24:8080/_matrix/draupnir/1/report/{roomId}/{eventId}`
- Audit DBs: `/data/storage/user-restriction-audit-log.db`, `/data/storage/room-audit-log.db`
**Key paths on PostgreSQL LXC (109):**
- PostgreSQL config: `/etc/postgresql/17/main/postgresql.conf`
@@ -232,6 +235,15 @@ The token in `draupnir/production.yaml` in this repo is **intentionally redacted
| 3478 | coturn STUN/TURN | 0.0.0.0 |
| 5349 | coturn TURNS/TLS | 0.0.0.0 |
**Internal port map (LXC 110 — Draupnir):**
| Port | Service | Bind |
|------|---------|------|
| 8080 | Draupnir web (abuse reporting) | 0.0.0.0 |
| 8081 | Draupnir healthz | 0.0.0.0 |
| 9000 | webhook (auto-deploy) | 0.0.0.0 |
| 9100 | node_exporter | 0.0.0.0 |
| 9256 | process_exporter | 0.0.0.0 |
**Internal port map (LXC 109 — PostgreSQL):**
| Port | Service | Bind |
|------|---------|------|
@@ -255,10 +267,8 @@ The token in `draupnir/production.yaml` in this repo is **intentionally redacted
| Spam and Stuff | `!GttT4QYd1wlGlkHU3qTmq_P3gbyYKKeSSN6R7TPcJHg` | invite, **no E2EE** (hookshot) |
**Power level roles (Cinny tags):**
-- 100: Owner (jared)
-- 50: The Nerdy Council (enhuynh, lonely)
-- 48: Panel of Geeks
-- 35: Cool Kids
+- 100: Owner (jared, draupnir, lotusbot)
+- 50: The Nerdy Council / Panel of Geeks (enhuynh, lonely)
- 0: Member
---
@@ -305,7 +315,7 @@ bash /opt/matrix-config/hookshot/deploy.sh proxmox.js # deploy one
## Moderation (Draupnir v2.9.0)
-Draupnir runs on LXC 110, manages moderation across all 9 protected rooms via `#management:matrix.lotusguild.org`.
+Draupnir runs on LXC 110 and manages moderation across all protected rooms (including the Lotus Guild space) via `#management:matrix.lotusguild.org`.
**Subscribed ban lists:**
- `#community-moderation-effort-bl:neko.dev` — 12,599 banned users, 245 servers, 59 rooms
@@ -320,6 +330,28 @@ Draupnir runs on LXC 110, manages moderation across all 9 protected rooms via `#
!draupnir watch <alias> --no-confirm — subscribe to a ban list
```
### Abuse Reporting
When a Matrix client user clicks "Report" on a message, Synapse receives a `POST /_matrix/client/v3/rooms/{roomId}/report/{eventId}` request and stores the report internally. To forward these to the Draupnir management room, a Synapse Python module must be installed on LXC 151.
**Draupnir web server** is enabled (port 8080). The endpoint is:
```
POST http://10.10.10.24:8080/_matrix/draupnir/1/report/{roomId}/{eventId}
```
**To complete Synapse integration (one-time, on LXC 151):**
1. Install the module: `pip install matrix-synapse-draupnir-abuse-reports` (or equivalent — check Draupnir releases)
2. Add to `/etc/matrix-synapse/homeserver.yaml`:
```yaml
modules:
- module: "draupnir.abuse_reports.AbuseReportEndpoint"
config:
draupnir_endpoint: "http://10.10.10.24:8080"
```
3. `systemctl restart matrix-synapse`
> Until the Synapse module is installed, abuse reports are stored in Synapse's DB but do NOT appear in the management room. The Draupnir web server is running and ready to receive forwarded reports.
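Matrix room and event IDs contain `!`, `$`, and `:`, which must be percent-encoded when they appear in the report path. A small illustrative helper for building the URL (the IDs shown are placeholders, not real rooms):

```python
from urllib.parse import quote

DRAUPNIR_BASE = "http://10.10.10.24:8080"

def report_url(room_id: str, event_id: str) -> str:
    """Build the Draupnir abuse-report URL, percent-encoding the Matrix IDs."""
    return (f"{DRAUPNIR_BASE}/_matrix/draupnir/1/report/"
            f"{quote(room_id, safe='')}/{quote(event_id, safe='')}")

# Placeholder IDs for illustration only
print(report_url("!abc123:matrix.lotusguild.org", "$evt456"))
# → http://10.10.10.24:8080/_matrix/draupnir/1/report/%21abc123%3Amatrix.lotusguild.org/%24evt456
```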
---
## Cinny Dev Branch (chat.lotusguild.org)
@@ -393,6 +425,7 @@ Periodic `TLS/TCP socket error: Connection reset by peer` in coturn logs. Normal
- [x] LiveKit ICE port range expanded to 50000-51000
- [x] LiveKit TURN TTL reduced to 1h
- [x] LiveKit VP9/AV1 codecs enabled
- [x] Synapse presence disabled (`presence: enabled: false`) — eliminates federation lag spikes caused by presence EDU bursts to 50+ remote servers
- [ ] BBR congestion control — must be applied on Proxmox host
### Auth & SSO
@@ -430,14 +463,15 @@ Periodic `TLS/TCP socket error: Connection reset by peer` in coturn logs. Normal
- [x] Webhook HMAC-SHA256 validation on all auto-deploy endpoints
### Monitoring
-- [x] Grafana dashboard — `dashboard.lotusguild.org/d/matrix-synapse-dashboard` (140+ panels)
-- [x] Prometheus scraping all Matrix services (Synapse, Hookshot, LiveKit, node_exporter, postgres)
-- [x] 14 active alert rules across matrix-folder and infra-folder
+- [x] Grafana dashboard — `dashboard.lotusguild.org/d/matrix-synapse-dashboard` (140+ panels, Draupnir section added)
+- [x] Prometheus scraping all Matrix services (Synapse, Hookshot, LiveKit, node_exporter, postgres, Draupnir)
+- [x] 15 active alert rules across matrix-folder and infra-folder (includes Draupnir Down)
- [x] Uptime Kuma monitors: Synapse, LiveKit, PostgreSQL, Cinny, coturn, lk-jwt-service, Hookshot
- [x] Draupnir: node_exporter (9100), process_exporter (9256), healthz probe via blackbox (8081)
### Admin
- [x] Synapse admin API dashboard (synapse-admin at http://10.10.10.29:8080)
-- [x] Draupnir moderation bot — LXC 110, v2.9.0, 9 protected rooms, 2 ban lists
+- [x] Draupnir moderation bot — LXC 110, v2.9.0, all rooms + space, 2 ban lists
- [ ] Cinny custom branding
---
@@ -455,6 +489,9 @@ Periodic `TLS/TCP socket error: Connection reset by peer` in coturn logs. Normal
| `matrix-node` | `10.10.10.29:9100` | CPU, RAM, network, load average, disk |
| `postgres` | `10.10.10.44:9187` | pg_stat_database, connections, WAL, block I/O |
| `postgres-node` | `10.10.10.44:9100` | CPU, RAM, network, load average, disk |
| `draupnir-node` | `10.10.10.24:9100` | CPU, RAM, network, load average, disk |
| `draupnir-process` | `10.10.10.24:9256` | Process CPU/memory/threads/uptime (process_exporter) |
| `draupnir-healthz` | `10.10.10.24:8081/healthz` → `127.0.0.1:9115` | `probe_success` (1=healthy, 0=disconnected) via blackbox exporter |
> **Disk I/O:** All servers use Ceph-backed storage. Per-device disk I/O metrics are meaningless — use Network I/O panels to see actual storage traffic.
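The `draupnir-healthz` row routes through the blackbox exporter rather than scraping the target directly. A minimal sketch of what that scrape job could look like, assuming the conventional `http_2xx` module name and the standard blackbox relabeling pattern (the live `prometheus.yml` may differ):

```yaml
# Sketch only — job and module names are assumptions; expression of the
# "8081/healthz → 127.0.0.1:9115" row in the table above.
- job_name: "draupnir-healthz"
  metrics_path: /probe
  params:
    module: [http_2xx]            # blackbox module expecting a 2xx response
  static_configs:
    - targets: ["http://10.10.10.24:8081/healthz"]
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target   # pass the URL as ?target=
    - source_labels: [__param_target]
      target_label: instance         # keep the URL as the instance label
    - target_label: __address__
      replacement: 127.0.0.1:9115    # actually scrape the blackbox exporter
```

With this shape, `probe_success` is 1 when healthz returns 200 and 0 when it returns 418, matching the table.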
@@ -467,6 +504,7 @@ Periodic `TLS/TCP socket error: Connection reset by peer` in coturn logs. Normal
| PostgreSQL Down | `pg_up` < 1 for 2m | critical |
| LiveKit Down | `up{job="livekit"}` < 1 for 2m | critical |
| Hookshot Down | `up{job="hookshot"}` < 1 for 2m | critical |
| Draupnir Down | `up{job="draupnir-node"}` < 0.5 for 2m | critical |
| PG Connection Saturation | connections > 80% of max for 5m | warning |
| Federation Queue Backing Up | pending PDUs > 100 for 10m | warning |
| Synapse High Memory | RSS > 2000MB for 10m | warning |
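The Draupnir Down row corresponds to a Prometheus alerting rule along these lines; the expression and threshold come from the table, while the group name and annotation text are guesses at the repo's layout:

```yaml
# Sketch — group/file layout assumed; expr and for match the alert table.
groups:
  - name: matrix-folder
    rules:
      - alert: DraupnirDown
        expr: up{job="draupnir-node"} < 0.5
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Draupnir node_exporter target down (LXC 110)"
```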
@@ -484,7 +522,7 @@ Periodic `TLS/TCP socket error: Connection reset by peer` in coturn logs. Normal
> **`/sync` long-poll:** The Matrix `/sync` endpoint is a long-poll (clients hold it open ≤30s). It is excluded from the High Response Time alert to prevent false positives.
-> **Synapse Event Processing Lag** can fire transiently after a Synapse restart while processors drain their backlog. Self-resolves in 10–20 minutes.
+> **Synapse Event Processing Lag** can fire transiently after a Synapse restart while processors drain their backlog. Self-resolves in 10–20 minutes. Root cause of recurring lag spikes was Synapse presence EDU bursts — fixed by disabling presence in `homeserver.yaml` (`presence: enabled: false`).
---