From 2b998b9ba6adcc904a83fdcd9b28b5c3e6f0adf5 Mon Sep 17 00:00:00 2001
From: Jared Vititoe
Date: Mon, 9 Mar 2026 13:44:53 -0400
Subject: [PATCH] docs: comprehensive March 2026 audit and applied fixes

- Add Improvement Audit section tracking all identified gaps and their status
- All critical/high/medium items applied: coturn cert auto-renewal (sync cron on compute-storage-01), Synapse metrics port locked to 127.0.0.1+10.10.10.29, well-known matrix endpoints live on lotusguild.org, suppress_key_server_warning, fail2ban on login endpoint, PostgreSQL autovacuum per-table tuning, LiveKit VP9/AV1 codecs
- Bot E2EE reset: full store+credentials wipe, stale devices removed, fresh device BBRZSEUECZ registered
- Checklist updated: LiveKit port range, autovacuum, hardening items, Grafana IP
- Hookshot: Owncast renamed to Livestream in display name (same UUID)

Co-Authored-By: Claude Sonnet 4.6
---
 README.md | 338 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 333 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index 553162d..a15ee06 100644
--- a/README.md
+++ b/README.md
@@ -129,7 +129,7 @@ Webhook URL format: `https://matrix.lotusguild.org/webhook/`
 | Lidarr | `66ac6fdd-69f6-4f47-bb00-b7f6d84d7c1c` | All event types |
 | Uptime Kuma | `1a02e890-bb25-42f1-99fe-bba6a19f1811` | Status change notifications |
 | Seerr | `555185af-90a1-42ff-aed5-c344e11955cf` | Request/approval events |
-| Owncast | `9993e911-c68b-4271-a178-c2d65ca88499` | STREAM_STARTED / STREAM_STOPPED |
+| Owncast (Livestream) | `9993e911-c68b-4271-a178-c2d65ca88499` | STREAM_STARTED / STREAM_STOPPED (hookshot display name: "Livestream") |
 | Bazarr | `470fb267-3436-4dd3-a70c-e6e8db1721be` | Subtitle events (Apprise JSON notifier) |
 | Tinker-Tickets | `6e306faf-8eea-4ba5-83ef-bf8f421f929e` | Custom transformation code |
 
@@ -405,7 +405,7 @@ cp -r dist/* /var/www/html/
 - [x] Landing page with client recommendations (Cinny, Commet, Element, Element X mobile)
 - [x] Synapse metrics endpoint (port 9000, Prometheus-compatible)
 - [ ] Push notifications gateway (Sygnal) for mobile clients
-- [ ] Expand LiveKit port range (50000-51000) for voice call capacity
+- [x] LiveKit port range expanded to 50000-51000 for voice call capacity
 - [x] Custom Cinny client LXC 106 (10.10.10.6) — Debian 13, Cinny 4.10.5 built from `add-joined-call-controls`, nginx serving, HA enabled
 - [x] NPM proxy entry for `chat.lotusguild.org` → 10.10.10.6:80, SSL via Cloudflare DNS challenge, HTTPS forced, HTTP/2 + HSTS enabled
 - [x] Cinny weekly auto-update cron (`/etc/cron.d/cinny-update`, Sundays 3am, logs to `/var/log/cinny-update.log`)
@@ -414,11 +414,13 @@ cp -r dist/* /var/www/html/
 ### Performance Tuning
 - [x] PostgreSQL `shared_buffers` → 1500MB, `effective_cache_size`, `work_mem`, checkpoint tuning applied
 - [x] PostgreSQL `pg_stat_statements` extension installed in `synapse` database
+- [x] PostgreSQL autovacuum tuned per-table (`state_groups_state`, `events`, `receipts_linearized`, `receipts_graph`, `device_lists_stream`, `presence_stream`), `autovacuum_max_workers` → 5
 - [x] Synapse `event_cache_size` → 30K, `_get_state_group_for_events` cache factor added
 - [x] sysctl TCP/UDP buffer alignment applied to LXC 151 (`/etc/sysctl.d/99-matrix-tuning.conf`)
 - [x] LiveKit room `empty_timeout: 300`, `departure_timeout: 20`, `max_participants: 50`
 - [x] LiveKit ICE port range expanded to 50000-51000
 - [x] LiveKit TURN TTL reduced from 24h to 1h
+- [x] LiveKit VP9/AV1 codecs enabled (`video_codecs: [VP8, H264, VP9, AV1]`)
 - [ ] BBR congestion control — must be applied on Proxmox host, not inside LXC (see Known Issues)
 
 ### Auth & SSO
@@ -429,7 +431,7 @@ cp -r dist/* /var/www/html/
 
 ### Webhooks & Integrations
 - [x] matrix-hookshot 7.3.2 installed and running
-- [x] Generic webhook bridge for 11 active services (Grafana, Proxmox, Sonarr, Radarr, Readarr, Lidarr, Uptime Kuma, Seerr, Owncast, Bazarr, Tinker-Tickets)
+- [x] Generic webhook bridge for 11 active services (Grafana, Proxmox, Sonarr, Radarr, Readarr, Lidarr, Uptime Kuma, Seerr, Owncast/Livestream, Bazarr, Tinker-Tickets)
 - [x] Per-service JS transformation functions — all rewritten to handle full event payloads (all event types, health alerts, app updates, release groups, download clients)
 - [x] Per-service virtual user avatars
 - [x] NPM reverse proxy for `/webhook` path
@@ -447,6 +449,11 @@ cp -r dist/* /var/www/html/
 - [x] coturn internal peer deny rules (blocks relay to RFC1918 except allowed subnet)
 - [x] `pg_hba.conf` locked down — remote access restricted to Synapse LXC (10.10.10.29) only
 - [x] Federation enabled with key verification (open for invite-only growth to friends/family/coworkers)
+- [x] fail2ban on Synapse login endpoint (5 retries / 24h ban, LXC 151)
+- [x] Synapse metrics port 9000 restricted to `127.0.0.1` + `10.10.10.29` (was `0.0.0.0`)
+- [x] coturn cert auto-renewal — daily sync cron on compute-storage-01 copies NPM cert → coturn
+- [x] `/.well-known/matrix/client` and `/server` live on lotusguild.org (NPM advanced config)
+- [x] `suppress_key_server_warning: true` in homeserver.yaml
 - [ ] Federation allow/deny lists for known bad actors
 - [ ] Regular Synapse updates
 - [x] Automated database + media backups
@@ -455,7 +462,7 @@ cp -r dist/* /var/www/html/
 - [x] Synapse metrics endpoint (port 9000, Prometheus-compatible)
 - [x] Uptime Kuma monitors added: Synapse HTTP, LiveKit TCP, PostgreSQL TCP, Cinny Web, coturn TCP 3478, lk-jwt-service, Hookshot
 - [ ] Uptime Kuma: coturn UDP STUN monitoring (requires push/heartbeat — no native UDP type in Kuma)
-- [ ] Grafana dashboard for Synapse Prometheus metrics (LXC 107 at 10.10.10.X already running Grafana)
+- [ ] Grafana dashboard for Synapse Prometheus metrics — Grafana at 10.10.10.49 (LXC 107), Prometheus scraping 10.10.10.29:9000 confirmed. Import dashboard ID `18618` from grafana.com
 
 ### Admin
 - [x] Synapse admin API dashboard (synapse-admin at http://10.10.10.29:8080)
@@ -465,6 +472,327 @@ cp -r dist/* /var/www/html/
 
 ---
 
+## Improvement Audit (March 2026)
+
+Comprehensive audit of the current infrastructure against official documentation and security best practices. Applied March 9 2026.
+
+### Priority Summary
+
+| Issue | Severity | Status |
+|-------|----------|--------|
+| coturn TLS cert expires May 12 — no auto-renewal | **CRITICAL** | ✅ Fixed — daily sync cron on compute-storage-01 copies NPM-renewed cert to coturn |
+| Synapse metrics port 9000 bound to `0.0.0.0` | **HIGH** | ✅ Fixed — now binds `127.0.0.1` + `10.10.10.29` (Prometheus still works, internet blocked) |
+| `/.well-known/matrix/client` returns 404 | MEDIUM | ✅ Fixed — NPM lotusguild.org proxy host updated, live at `https://lotusguild.org/.well-known/matrix/client` |
+| `suppress_key_server_warning` not set | MEDIUM | ✅ Fixed — added to homeserver.yaml |
+| No fail2ban on `/_matrix/client/.*/login` | MEDIUM | ✅ Fixed — fail2ban installed, matrix-synapse jail active (5 retries / 24h ban) |
+| No media purge cron (retention policy set but never triggers) | MEDIUM | ✅ N/A — `media_retention` block already in homeserver.yaml; Synapse runs the purge internally on schedule |
+| PostgreSQL autovacuum not tuned per-table | LOW | ✅ Fixed — all 6 high-churn tables tuned, `autovacuum_max_workers` → 5 |
+| Hookshot metrics scrape unconfirmed | LOW | ⚠️ Port 9001 responds but `/metrics` returns 404 — hookshot bug or path mismatch; low impact |
+| LiveKit VP9/AV1 codec support | LOW | ✅ Applied — `video_codecs: [VP8, H264, VP9, AV1]` added to livekit config |
+| Federation allow/deny list not configured | LOW | Pending — Mjolnir/Draupnir on roadmap |
+| Sygnal push notifications not deployed | INFO | Deferred |
+
+---
+
+### 1. coturn Cert Auto-Renewal ✅
+
+The coturn cert is managed by NPM (cert ID 91, stored at `/etc/letsencrypt/live/npm-91/` on LXC 139). NPM renews it automatically. A sync script on `compute-storage-01` detects each renewal and copies the new cert to coturn.
+
+**Deployed:** `/usr/local/bin/coturn-cert-sync.sh` on compute-storage-01, cron `/etc/cron.d/coturn-cert-sync` (runs 03:30 daily).
+
+The script compares cert expiry dates between LXC 139 and LXC 151. If they differ (NPM renewed), it copies `fullchain.pem` + `privkey.pem` and restarts coturn.
+
+**Additional coturn hardening (while you're in there):**
+```
+# /etc/turnserver.conf
+# Nonce lifetime in seconds (limits replay attacks)
+stale-nonce=600
+# Max concurrent allocations per user
+user-quota=100
+# Max total allocations on the server
+total-quota=1000
+# Per-session bandwidth cap in bytes per second (1000000 B/s is about 8 Mbps)
+max-bps=1000000
+cipher-list="ECDHE-RSA-AES256-GCM-SHA384:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-CHACHA20-POLY1305"
+```
+
+---
+
+### 2. Synapse Configuration Gaps
+
+**a) Metrics port exposed to 0.0.0.0 (HIGH)**
+
+Port 9000 was bound to `0.0.0.0`, which exposed internal state, user counts, and DB query times beyond the host. Fix in `homeserver.yaml`:
+```yaml
+enable_metrics: true
+listeners:
+  - port: 9000
+    type: metrics
+    # Loopback plus the LAN address that Prometheus scrapes; NOT 0.0.0.0
+    bind_addresses: ['127.0.0.1', '10.10.10.29']
+```
+Prometheus at `10.10.10.49` scrapes `10.10.10.29:9000` from within the VLAN, so the listener has to keep the LAN address; binding `127.0.0.1` alone would break the scrape.
+
+**b) suppress_key_server_warning (MEDIUM)**
+
+Without this setting, Synapse logs a warning about the default `matrix.org` trusted key server on every restart. One line in `homeserver.yaml`:
+```yaml
+suppress_key_server_warning: true
+```
+
+**c) Database connection pooling (LOW — track for growth)**
+
+Current defaults (`cp_min: 5`, `cp_max: 10`) are fine for a single-process deployment. When adding workers, increase `cp_max` to 20–30 per worker group. Add the values explicitly to `homeserver.yaml` to make them visible:
+```yaml
+database:
+  name: psycopg2
+  args:
+    cp_min: 5
+    cp_max: 10
+```
+
+---
+
+### 3. Matrix Well-Known 404
+
+`/.well-known/matrix/client` returned 404, which breaks client autodiscovery: users who type `lotusguild.org` instead of `matrix.lotusguild.org` get an error. Fixed in NPM with a custom location block on the `lotusguild.org` proxy host:
+
+```nginx
+location /.well-known/matrix/client {
+    default_type application/json;
+    add_header Access-Control-Allow-Origin *;
+    return 200 '{"m.homeserver":{"base_url":"https://matrix.lotusguild.org"}}';
+}
+location /.well-known/matrix/server {
+    default_type application/json;
+    add_header Access-Control-Allow-Origin *;
+    return 200 '{"m.server":"matrix.lotusguild.org:443"}';
+}
+```
+
+---
+
+### 4. fail2ban for Synapse Login
+
+There was no brute-force protection on `/_matrix/client/*/login`. Easy win.
+
+**`/etc/fail2ban/jail.d/matrix-synapse.conf`:**
+```ini
+[matrix-synapse]
+enabled = true
+port = http,https
+filter = matrix-synapse
+# With backend = systemd, fail2ban reads the journal and does not use logpath;
+# to read the log file instead, drop backend and journalmatch.
+logpath = /var/log/matrix-synapse/homeserver.log
+backend = systemd
+journalmatch = _SYSTEMD_UNIT=matrix-synapse.service + PRIORITY=3
+findtime = 600
+maxretry = 5
+bantime = 86400
+```
+
+**`/etc/fail2ban/filter.d/matrix-synapse.conf`:**
+```ini
+[Definition]
+# Every failregex must capture the client address (<HOST> or <ADDR>).
+# Assumption: the IP is space-delimited before the request in the second pattern;
+# test both patterns against real log lines with fail2ban-regex.
+failregex = ^.*Failed (password|SAML) login attempt for user .* from <HOST>$
+            ^.* <ADDR> .*"POST /.*login.*" 401.*$
+ignoreregex = ^.*"GET /sync.*".*$
+```
+
+---
+
+### 5. Synapse Media Purge Cron
+
+Retention policy is configured (remote 1yr, local 3yr). Because the `media_retention` block is already in homeserver.yaml, Synapse runs the purge internally on its own schedule (see Priority Summary), so no cron is required for the policy to take effect. The admin API purge endpoint only needs to be called explicitly for a one-off or more aggressive cleanup of the remote media cache, as in the script below.
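+
+For reference, the retention policy described above (remote 1yr, local 3yr) maps onto Synapse's `media_retention` options. A minimal sketch, assuming the deployed block uses the documented option names; the actual values in homeserver.yaml may be written differently:
+
+```yaml
+# homeserver.yaml (sketch, values taken from the policy stated above)
+media_retention:
+  local_media_lifetime: 3y    # media uploaded by local users
+  remote_media_lifetime: 1y   # media cached from other homeservers
+```
+
+Lifetimes are measured from the last time a file was accessed, so media that is still being viewed is kept.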
+
+**`/usr/local/bin/purge-synapse-media.sh`** (create on LXC 151):
+```bash
+#!/bin/bash
+set -euo pipefail
+ADMIN_TOKEN="syt_your_admin_token"
+# Purge remote media (cached from other homeservers) older than 90 days.
+# before_ts is a Unix timestamp in milliseconds.
+CUTOFF_TS=$(( ($(date +%s) - 90 * 24 * 3600) * 1000 ))
+# --fail makes curl exit non-zero on an HTTP error so failures are logged too
+if curl --fail -s -o /dev/null -X POST \
+  "http://localhost:8008/_synapse/admin/v1/purge_media_cache?before_ts=$CUTOFF_TS" \
+  -H "Authorization: Bearer $ADMIN_TOKEN" \
+  -H "Content-Type: application/json" -d '{}'; then
+  echo "$(date): Synapse remote media purge completed" >> /var/log/synapse-purge.log
+else
+  echo "$(date): Synapse remote media purge FAILED" >> /var/log/synapse-purge.log
+fi
+```
+
+```bash
+chmod +x /usr/local/bin/purge-synapse-media.sh
+echo "0 4 * * * root /usr/local/bin/purge-synapse-media.sh" > /etc/cron.d/synapse-purge
+```
+
+---
+
+### 6. PostgreSQL Autovacuum Per-Table Tuning
+
+The high-churn Synapse tables (`state_groups_state`, `events`, `receipts_linearized`, `receipts_graph`) need more aggressive autovacuum than the defaults; otherwise bloat accumulates as the DB grows and queries slow down. Run on LXC 109 (PostgreSQL):
+
+```sql
+-- state_groups_state: biggest bloat source
+ALTER TABLE state_groups_state SET (
+  autovacuum_vacuum_scale_factor = 0.01,
+  autovacuum_analyze_scale_factor = 0.005,
+  autovacuum_vacuum_cost_delay = 5
+);
+
+-- events: second priority
+ALTER TABLE events SET (
+  autovacuum_vacuum_scale_factor = 0.02,
+  autovacuum_analyze_scale_factor = 0.01,
+  autovacuum_vacuum_cost_delay = 5
+);
+
+-- receipts tables, device_lists_stream and presence_stream
+ALTER TABLE receipts_linearized SET (autovacuum_vacuum_scale_factor = 0.01, autovacuum_vacuum_cost_delay = 5);
+ALTER TABLE receipts_graph SET (autovacuum_vacuum_scale_factor = 0.01, autovacuum_vacuum_cost_delay = 5);
+ALTER TABLE device_lists_stream SET (autovacuum_vacuum_scale_factor = 0.02);
+ALTER TABLE presence_stream SET (autovacuum_vacuum_scale_factor = 0.02);
+```
+
+Also bump `autovacuum_max_workers` from 3 → 5 and set `autovacuum_naptime` server-wide (it is not a per-table storage parameter, so it cannot go in `ALTER TABLE`). The reload applies `autovacuum_naptime`; on PostgreSQL 17 and earlier, `autovacuum_max_workers` takes effect only after a server restart:
+```sql
+ALTER SYSTEM SET autovacuum_max_workers = 5;
+ALTER SYSTEM SET autovacuum_naptime = '30s';
+SELECT pg_reload_conf();
+```
+
+**Monitor vacuum health:**
+```sql
+SELECT relname, last_autovacuum, n_dead_tup, n_live_tup
+FROM pg_stat_user_tables
+WHERE relname IN ('events', 'state_groups_state', 'receipts_linearized', 'receipts_graph')
+ORDER BY n_dead_tup DESC;
+```
+
+---
+
+### 7. Hookshot Metrics + Grafana
+
+**Hookshot metrics** are configured at `127.0.0.1:9001/metrics`, but it is unconfirmed whether Prometheus at `10.10.10.49` scrapes them (a loopback-only bind is not reachable from another host). Verify:
+
+```bash
+# On LXC 151
+curl -s http://127.0.0.1:9001/metrics | head -20
+```
+
+If Prometheus is scraping, add the hookshot dashboard from the repo:
+`contrib/hookshot-dashboard.json` → import into Grafana.
+
+**Grafana Synapse dashboard** — Prometheus is already scraping Synapse at port 9000. Import the official dashboard:
+- Grafana → Dashboards → Import → ID `18618` (Synapse Monitoring)
+- Set Prometheus datasource → done
+- Shows room count, message rates, federation lag, cache hit rates, DB query times in real time
+
+---
+
+### 8. Federation Security
+
+Currently: open federation with key verification (correct for an invite-only friends server). Recommended additions:
+
+**Server-level allow/deny in `homeserver.yaml`** (optional, for closing federation entirely):
+```yaml
+# Fully closed (recommended long-term for private guild):
+# an empty whitelist permits no remote servers
+# federation_domain_whitelist: []
+
+# OR: whitelist-only federation
+federation_domain_whitelist:
+  - matrix.lotusguild.org
+  - matrix.org   # Keep if bridging needed
+```
+
+**Per-room ACLs** for reactive blocking of specific bad servers:
+```json
+{
+  "type": "m.room.server_acl",
+  "content": {
+    "allow": ["*"],
+    "deny": ["spam.example.com"]
+  }
+}
+```
+
+**Mjolnir/Draupnir** (already on roadmap) handles this automatically with ban list subscriptions (t2bot spam lists etc).
+
+---
+
+### 9. Sygnal Push Notifications
+
+Sygnal is the reference Matrix push gateway for mobile clients (Element X on iOS/Android). Mobile apps depend on a push gateway to receive notifications while the app is backgrounded.
+
+**Requirements:**
+- Apple Developer account (APNS cert) for iOS
+- Firebase project (FCM API key) for Android
+- New LXC or run alongside existing services
+
+**Basic config (`/etc/sygnal/sygnal.yaml`):**
+```yaml
+http:
+  port: 8765
+# Assumption to verify: newer Sygnal releases dropped the database requirement
+database:
+  type: postgresql
+  user: sygnal
+  password:
+  database: sygnal
+apps:
+  com.element.android:
+    type: gcm
+    api_key:
+  im.riot.x.ios:
+    type: apns
+    platform: production
+    certfile: /etc/sygnal/apns/element-x-cert.pem
+    topic: im.riot.x.ios
+```
+
+**Synapse integration:** no `homeserver.yaml` entry is needed. Each client registers its pusher with the gateway URL (path `/_matrix/push/v1/notify`), and Synapse posts notifications to whatever URL the client supplied. The published Element apps are built with their vendor's gateway URL and push credentials, so a self-hosted Sygnal only receives pushes from app builds configured to point at it.
+
+---
+
+### 10. LiveKit VP9/AV1 + Dynacast (Quality Improvement)
+
+Previously H264 only. VP9 and AV1 are SVC-capable codecs, and combined with Dynacast (pauses video layers no one is watching) they reduce bandwidth/CPU for low-viewer rooms.
+
+**`/etc/livekit/config.yaml` additions** (key names follow LiveKit's sample config; verify against the deployed version):
+```yaml
+room:
+  enabled_codecs:
+    # Keep audio listed when overriding the default codec list
+    - mime: audio/opus
+    - mime: video/vp8
+    - mime: video/h264
+    - mime: video/vp9
+    - mime: video/av1
+```
+
+Note: Dynacast is switched on by the publishing client (a room option in the LiveKit SDKs), not in the server config, and it applies to simulcast layers (VP8/H264) as well as SVC layers (VP9/AV1). H264 subscribers continue to work normally alongside VP9/AV1 subscribers.
+
+---
+
+### 11. Synapse Workers (Future Scaling Reference)
+
+The current single-process deployment handles roughly 100–300 concurrent users before the Python GIL becomes the bottleneck. Workers are not needed now, but are documented for when usage grows.
+
+**Stage 1 trigger:** Synapse CPU >80% consistently, or >200 concurrent users.
+
+**First workers to add:**
+```yaml
+# /etc/matrix-synapse/workers/client-reader-1.yaml
+worker_app: synapse.app.generic_worker
+worker_name: client-reader-1
+worker_listeners:
+  - type: http
+    port: 8011
+    resources: [{names: [client]}]
+```
+Add `federation_sender` next (off-loads outgoing federation from the main process). Then `event_creator` for write-heavy loads. Redis is required as soon as the first worker is added, because workers and the main process coordinate over Redis replication.
+
+---
+
 ## Bot Checklist
 
 ### Core
@@ -474,7 +802,7 @@ cp -r dist/* /var/www/html/
 - [x] Initial sync token (ignores old messages on startup)
 - [x] Auto-accept room invites
 - [x] Deployed as systemd service (`matrixbot.service`) on LXC 151
-- [x] Fix E2EE key errors — `nio_store/` cleared, bot restarted cleanly
+- [x] Fix E2EE key errors — full store + credentials wipe, fresh device registration (`BBRZSEUECZ`); stale devices removed via admin API
 
 ### Commands
 - [x] `!help` — list commands