docs: comprehensive March 2026 audit and applied fixes

- Add Improvement Audit section tracking all identified gaps and their status
- All critical/high/medium items applied: coturn cert auto-renewal (sync cron
  on compute-storage-01), Synapse metrics port locked to 127.0.0.1+10.10.10.29,
  well-known matrix endpoints live on lotusguild.org, suppress_key_server_warning,
  fail2ban on login endpoint, PostgreSQL autovacuum per-table tuning, LiveKit
  VP9/AV1 codecs
- Bot E2EE reset: full store+credentials wipe, stale devices removed, fresh
  device BBRZSEUECZ registered
- Checklist updated: LiveKit port range, autovacuum, hardening items, Grafana IP
- Hookshot: Owncast renamed to Livestream in display name (same UUID)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
2026-03-09 13:44:53 -04:00
parent 507aa43dbd
commit 2b998b9ba6

README.md

@@ -129,7 +129,7 @@ Webhook URL format: `https://matrix.lotusguild.org/webhook/<uuid>`
| Lidarr | `66ac6fdd-69f6-4f47-bb00-b7f6d84d7c1c` | All event types |
| Uptime Kuma | `1a02e890-bb25-42f1-99fe-bba6a19f1811` | Status change notifications |
| Seerr | `555185af-90a1-42ff-aed5-c344e11955cf` | Request/approval events |
| Owncast | `9993e911-c68b-4271-a178-c2d65ca88499` | STREAM_STARTED / STREAM_STOPPED |
| Owncast (Livestream) | `9993e911-c68b-4271-a178-c2d65ca88499` | STREAM_STARTED / STREAM_STOPPED (hookshot display name: "Livestream") |
| Bazarr | `470fb267-3436-4dd3-a70c-e6e8db1721be` | Subtitle events (Apprise JSON notifier) |
| Tinker-Tickets | `6e306faf-8eea-4ba5-83ef-bf8f421f929e` | Custom transformation code |
@@ -405,7 +405,7 @@ cp -r dist/* /var/www/html/
- [x] Landing page with client recommendations (Cinny, Commet, Element, Element X mobile)
- [x] Synapse metrics endpoint (port 9000, Prometheus-compatible)
- [ ] Push notifications gateway (Sygnal) for mobile clients
- [ ] Expand LiveKit port range (50000-51000) for voice call capacity
- [x] LiveKit port range expanded to 50000-51000 for voice call capacity
- [x] Custom Cinny client LXC 106 (10.10.10.6) — Debian 13, Cinny 4.10.5 built from `add-joined-call-controls`, nginx serving, HA enabled
- [x] NPM proxy entry for `chat.lotusguild.org` → 10.10.10.6:80, SSL via Cloudflare DNS challenge, HTTPS forced, HTTP/2 + HSTS enabled
- [x] Cinny weekly auto-update cron (`/etc/cron.d/cinny-update`, Sundays 3am, logs to `/var/log/cinny-update.log`)
@@ -414,11 +414,13 @@ cp -r dist/* /var/www/html/
### Performance Tuning
- [x] PostgreSQL `shared_buffers` → 1500MB, `effective_cache_size`, `work_mem`, checkpoint tuning applied
- [x] PostgreSQL `pg_stat_statements` extension installed in `synapse` database
- [x] PostgreSQL autovacuum tuned per-table (`state_groups_state`, `events`, `receipts_linearized`, `receipts_graph`, `device_lists_stream`, `presence_stream`), `autovacuum_max_workers` → 5
- [x] Synapse `event_cache_size` → 30K, `_get_state_group_for_events` cache factor added
- [x] sysctl TCP/UDP buffer alignment applied to LXC 151 (`/etc/sysctl.d/99-matrix-tuning.conf`)
- [x] LiveKit room `empty_timeout: 300`, `departure_timeout: 20`, `max_participants: 50`
- [x] LiveKit ICE port range expanded to 50000-51000
- [x] LiveKit TURN TTL reduced from 24h to 1h
- [x] LiveKit VP9/AV1 codecs enabled (`video_codecs: [VP8, H264, VP9, AV1]`)
- [ ] BBR congestion control — must be applied on Proxmox host, not inside LXC (see Known Issues)
### Auth & SSO
@@ -429,7 +431,7 @@ cp -r dist/* /var/www/html/
### Webhooks & Integrations
- [x] matrix-hookshot 7.3.2 installed and running
- [x] Generic webhook bridge for 11 active services (Grafana, Proxmox, Sonarr, Radarr, Readarr, Lidarr, Uptime Kuma, Seerr, Owncast, Bazarr, Tinker-Tickets)
- [x] Generic webhook bridge for 11 active services (Grafana, Proxmox, Sonarr, Radarr, Readarr, Lidarr, Uptime Kuma, Seerr, Owncast/Livestream, Bazarr, Tinker-Tickets)
- [x] Per-service JS transformation functions — all rewritten to handle full event payloads (all event types, health alerts, app updates, release groups, download clients)
- [x] Per-service virtual user avatars
- [x] NPM reverse proxy for `/webhook` path
@@ -447,6 +449,11 @@ cp -r dist/* /var/www/html/
- [x] coturn internal peer deny rules (blocks relay to RFC1918 except allowed subnet)
- [x] `pg_hba.conf` locked down — remote access restricted to Synapse LXC (10.10.10.29) only
- [x] Federation enabled with key verification (open for invite-only growth to friends/family/coworkers)
- [x] fail2ban on Synapse login endpoint (5 retries / 24h ban, LXC 151)
- [x] Synapse metrics port 9000 restricted to `127.0.0.1` + `10.10.10.29` (was `0.0.0.0`)
- [x] coturn cert auto-renewal — daily sync cron on compute-storage-01 copies NPM cert → coturn
- [x] `/.well-known/matrix/client` and `/server` live on lotusguild.org (NPM advanced config)
- [x] `suppress_key_server_warning: true` in homeserver.yaml
- [ ] Federation allow/deny lists for known bad actors
- [ ] Regular Synapse updates
- [x] Automated database + media backups
@@ -455,7 +462,7 @@ cp -r dist/* /var/www/html/
- [x] Synapse metrics endpoint (port 9000, Prometheus-compatible)
- [x] Uptime Kuma monitors added: Synapse HTTP, LiveKit TCP, PostgreSQL TCP, Cinny Web, coturn TCP 3478, lk-jwt-service, Hookshot
- [ ] Uptime Kuma: coturn UDP STUN monitoring (requires push/heartbeat — no native UDP type in Kuma)
- [ ] Grafana dashboard for Synapse Prometheus metrics (LXC 107 at 10.10.10.X already running Grafana)
- [ ] Grafana dashboard for Synapse Prometheus metrics — Grafana at 10.10.10.49 (LXC 107), Prometheus scraping 10.10.10.29:9000 confirmed. Import dashboard ID `18618` from grafana.com
### Admin
- [x] Synapse admin API dashboard (synapse-admin at http://10.10.10.29:8080)
@@ -465,6 +472,327 @@ cp -r dist/* /var/www/html/
---
## Improvement Audit (March 2026)
Comprehensive audit of the current infrastructure against official documentation and security best practices. Applied March 9, 2026.
### Priority Summary
| Issue | Severity | Status |
|-------|----------|--------|
| coturn TLS cert expires May 12 — no auto-renewal | **CRITICAL** | ✅ Fixed — daily sync cron on compute-storage-01 copies NPM-renewed cert to coturn |
| Synapse metrics port 9000 bound to `0.0.0.0` | **HIGH** | ✅ Fixed — now binds `127.0.0.1` + `10.10.10.29` (Prometheus still works, internet blocked) |
| `/.well-known/matrix/client` returns 404 | MEDIUM | ✅ Fixed — NPM lotusguild.org proxy host updated, live at `https://lotusguild.org/.well-known/matrix/client` |
| `suppress_key_server_warning` not set | MEDIUM | ✅ Fixed — added to homeserver.yaml |
| No fail2ban on `/_matrix/client/.*/login` | MEDIUM | ✅ Fixed — fail2ban installed, matrix-synapse jail active (5 retries / 24h ban) |
| No media purge cron (retention policy set but never triggers) | MEDIUM | ✅ N/A — `media_retention` block already in homeserver.yaml; Synapse runs the purge internally on schedule |
| PostgreSQL autovacuum not tuned per-table | LOW | ✅ Fixed — all six high-churn tables tuned, `autovacuum_max_workers` → 5 |
| Hookshot metrics scrape unconfirmed | LOW | ⚠️ Port 9001 responds but `/metrics` returns 404 — hookshot bug or path mismatch; low impact |
| LiveKit VP9/AV1 codec support | LOW | ✅ Applied — `video_codecs: [VP8, H264, VP9, AV1]` added to livekit config |
| Federation allow/deny list not configured | LOW | Pending — Mjolnir/Draupnir on roadmap |
| Sygnal push notifications not deployed | INFO | Deferred |
---
### 1. coturn Cert Auto-Renewal ✅
The coturn cert is managed by NPM (cert ID 91, stored at `/etc/letsencrypt/live/npm-91/` on LXC 139). NPM renews it automatically. A sync script on `compute-storage-01` detects when NPM renews and copies it to coturn.
**Deployed:** `/usr/local/bin/coturn-cert-sync.sh` on compute-storage-01, cron `/etc/cron.d/coturn-cert-sync` (runs 03:30 daily).
Script compares cert expiry dates between LXC 139 and LXC 151. If they differ (NPM renewed), it copies `fullchain.pem` + `privkey.pem` and restarts coturn.
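The script body itself isn't shown in the diff. A minimal sketch of the compare-and-copy logic, assuming root SSH from compute-storage-01 to both containers — the container IPs and the coturn cert path below are placeholders, not values from the commit:

```shell
#!/bin/sh
# Hypothetical reconstruction of /usr/local/bin/coturn-cert-sync.sh.
# Assumptions: root SSH to both containers; placeholder IPs and cert paths.
set -eu

NPM_HOST=10.10.10.39        # assumed address of LXC 139 (NPM)
COTURN_HOST=10.10.10.51     # assumed address of LXC 151 (coturn)
NPM_CERT=/etc/letsencrypt/live/npm-91/fullchain.pem
COTURN_CERT=/etc/coturn/fullchain.pem   # assumed coturn-side path

# Pure helper: do two `openssl x509 -enddate` output lines differ?
certs_differ() { [ "$1" != "$2" ]; }

remote_expiry() {  # <host> <pem path> -> "notAfter=..." line
  ssh "root@$1" "openssl x509 -enddate -noout -in '$2'"
}

main() {
  npm_exp=$(remote_expiry "$NPM_HOST" "$NPM_CERT")
  coturn_exp=$(remote_expiry "$COTURN_HOST" "$COTURN_CERT")
  if certs_differ "$npm_exp" "$coturn_exp"; then
    # NPM renewed: stage the new pair locally, push it, restart coturn
    mkdir -p /tmp/coturn-cert
    scp "root@$NPM_HOST:/etc/letsencrypt/live/npm-91/fullchain.pem" \
        "root@$NPM_HOST:/etc/letsencrypt/live/npm-91/privkey.pem" \
        /tmp/coturn-cert/
    scp /tmp/coturn-cert/fullchain.pem /tmp/coturn-cert/privkey.pem \
        "root@$COTURN_HOST:/etc/coturn/"
    ssh "root@$COTURN_HOST" systemctl restart coturn
    echo "$(date): coturn cert refreshed" >> /var/log/coturn-cert-sync.log
  fi
}

# The 03:30 cron entry invokes this file directly:
case "${0##*/}" in coturn-cert-sync.sh) main "$@";; esac
```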
**Additional coturn hardening (while you're in there):**
```
# /etc/turnserver.conf
stale-nonce=600 # Nonce expires 600s (prevents replay attacks)
user-quota=100 # Max concurrent allocations per user
total-quota=1000 # Total allocations on server
max-bps=1000000 # 1 Mbps per TURN session
cipher-list="ECDHE-RSA-AES256-GCM-SHA384:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-CHACHA20-POLY1305"
```
---
### 2. Synapse Configuration Gaps
**a) Metrics port exposed to 0.0.0.0 (HIGH)**
Port 9000 currently binds `0.0.0.0` — exposes internal state, user counts, DB query times externally. Fix in `homeserver.yaml`:
```yaml
listeners:
  - port: 9000
    bind_addresses: ['127.0.0.1', '10.10.10.29']  # NOT 0.0.0.0
    type: metrics
    resources: []
```
Grafana at `10.10.10.49` scrapes port 9000 from within the VLAN so this is safe to lock down.
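After restarting Synapse, the new binding can be confirmed with `ss -tlnp 'sport = :9000'` on LXC 151. A small check over that output — the sample `ss` lines below are illustrative, not captured from the host:

```shell
#!/bin/sh
# Succeeds (returns 0) if any listener line binds the wildcard address on :9000.
wildcard_bind() {
  echo "$1" | grep -Eq '(0\.0\.0\.0|\[::\]):9000'
}

# Illustrative ss output after the fix (loopback + VLAN address only):
good='LISTEN 0 50 127.0.0.1:9000 0.0.0.0:*
LISTEN 0 50 10.10.10.29:9000 0.0.0.0:*'
# Illustrative ss output before the fix:
bad='LISTEN 0 50 0.0.0.0:9000 0.0.0.0:*'
```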
**b) suppress_key_server_warning (MEDIUM)**
Without it, the trusted-key-server warning fills Synapse logs with noise on every restart. One line in `homeserver.yaml`:
```yaml
suppress_key_server_warning: true
```
**c) Database connection pooling (LOW — track for growth)**
Current defaults (`cp_min: 5`, `cp_max: 10`) are fine for a single process. When adding workers, increase `cp_max` to 20-30 per worker group. Add the block explicitly to `homeserver.yaml` to make it visible:
```yaml
database:
  name: psycopg2
  args:
    cp_min: 5
    cp_max: 10
```
---
### 3. Matrix Well-Known 404
`/.well-known/matrix/client` returns 404. This breaks client autodiscovery — users who type `lotusguild.org` instead of `matrix.lotusguild.org` get an error. Fix in NPM with a custom location block on the `lotusguild.org` proxy host:
```nginx
location /.well-known/matrix/client {
    add_header Content-Type application/json;
    add_header Access-Control-Allow-Origin *;
    return 200 '{"m.homeserver":{"base_url":"https://matrix.lotusguild.org"}}';
}
location /.well-known/matrix/server {
    add_header Content-Type application/json;
    add_header Access-Control-Allow-Origin *;
    return 200 '{"m.server":"matrix.lotusguild.org:443"}';
}
```
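Once the proxy host reloads, `curl -s https://lotusguild.org/.well-known/matrix/client` should return the client JSON verbatim. A trivial check of that body, using the exact string the location block returns:

```shell
#!/bin/sh
# Succeeds if the well-known body advertises the expected homeserver base_url.
check_client_wellknown() {
  echo "$1" | grep -q '"base_url":"https://matrix.lotusguild.org"'
}

# Live check (requires network): curl -s https://lotusguild.org/.well-known/matrix/client
body='{"m.homeserver":{"base_url":"https://matrix.lotusguild.org"}}'
```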
---
### 4. fail2ban for Synapse Login
No brute-force protection on `/_matrix/client/*/login`. Easy win.
**`/etc/fail2ban/jail.d/matrix-synapse.conf`:**
```ini
[matrix-synapse]
enabled = true
port = http,https
filter = matrix-synapse
logpath = /var/log/matrix-synapse/homeserver.log
backend = systemd
journalmatch = _SYSTEMD_UNIT=matrix-synapse.service + PRIORITY=3
findtime = 600
maxretry = 5
bantime = 86400
```
**`/etc/fail2ban/filter.d/matrix-synapse.conf`:**
```ini
[Definition]
failregex = ^.*Failed (password|SAML) login attempt for user .* from <HOST>.*$
            ^.*"POST /.*login.*" 401.*$
ignoreregex = ^.*"GET /sync.*".*$
```
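fail2ban ships `fail2ban-regex <log> <filter>` for validating filters against real logs. The simpler of the two patterns can also be smoke-tested with `grep -E` — the sample lines below are illustrative, not real Synapse log excerpts:

```shell
#!/bin/sh
# Smoke-test the HTTP-401 failregex pattern before reloading fail2ban.
pattern='"POST /.*login.*" 401'
hits_failregex() { echo "$1" | grep -Eq "$pattern"; }

# Illustrative log lines: one failed login (401), one successful login (200).
fail_line='10.0.0.5 - "POST /_matrix/client/v3/login HTTP/1.1" 401 59'
ok_line='10.0.0.5 - "POST /_matrix/client/v3/login HTTP/1.1" 200 512'
```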
---
### 5. Synapse Media Purge Cron
Retention policy is configured (remote 1yr, local 3yr). The audit initially flagged that nothing triggers the purge, but this turned out to be a non-issue: with the `media_retention` block in homeserver.yaml, Synapse runs the purge internally on schedule (see the summary table). The admin-API script below is kept as an optional manual fallback.
**`/usr/local/bin/purge-synapse-media.sh`** (optional, LXC 151):
```bash
#!/bin/bash
ADMIN_TOKEN="syt_your_admin_token"
# Purge remote media (cached from other homeservers) older than 90 days
CUTOFF_TS=$(($(date +%s000) - 7776000000))
curl -X POST \
"http://localhost:8008/_synapse/admin/v1/purge_media_cache?before_ts=$CUTOFF_TS" \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-s -o /dev/null
echo "$(date): Synapse remote media purge completed" >> /var/log/synapse-purge.log
```
```bash
chmod +x /usr/local/bin/purge-synapse-media.sh
echo "0 4 * * * root /usr/local/bin/purge-synapse-media.sh" > /etc/cron.d/synapse-purge
```
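The `date +%s000` idiom in the script appends three literal zeros to the epoch-seconds value, producing the millisecond timestamp that `before_ts` expects. The 90-day constant checks out:

```shell
#!/bin/sh
# Days -> milliseconds, the unit the Synapse admin API's before_ts expects.
days_to_ms() { echo $(( $1 * 24 * 3600 * 1000 )); }

# The cutoff used by the purge script is equivalent to:
#   CUTOFF_TS=$(( $(date +%s000) - $(days_to_ms 90) ))
```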
---
### 6. PostgreSQL Autovacuum Per-Table Tuning
The high-churn Synapse tables (`state_groups_state`, `events`, `receipts_linearized`, `receipts_graph`) are not tuned for aggressive autovacuum. As the DB grows, bloat accumulates and queries slow down. Run on LXC 109 (PostgreSQL):
```sql
-- state_groups_state: biggest bloat source
-- (autovacuum_naptime is a server-wide setting only, not a per-table parameter)
ALTER TABLE state_groups_state SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_analyze_scale_factor = 0.005,
    autovacuum_vacuum_cost_delay = 5
);
-- events: second priority
ALTER TABLE events SET (
    autovacuum_vacuum_scale_factor = 0.02,
    autovacuum_analyze_scale_factor = 0.01,
    autovacuum_vacuum_cost_delay = 5
);
-- receipts and device_lists_stream
ALTER TABLE receipts_linearized SET (autovacuum_vacuum_scale_factor = 0.01, autovacuum_vacuum_cost_delay = 5);
ALTER TABLE receipts_graph SET (autovacuum_vacuum_scale_factor = 0.01, autovacuum_vacuum_cost_delay = 5);
ALTER TABLE device_lists_stream SET (autovacuum_vacuum_scale_factor = 0.02);
ALTER TABLE presence_stream SET (autovacuum_vacuum_scale_factor = 0.02);
```
Also bump `autovacuum_max_workers` from 3 → 5:
```sql
ALTER SYSTEM SET autovacuum_max_workers = 5;
-- autovacuum_max_workers takes effect only after a full restart, not a reload:
-- systemctl restart postgresql
```
**Monitor vacuum health:**
```sql
SELECT relname, last_autovacuum, n_dead_tup, n_live_tup
FROM pg_stat_user_tables
WHERE relname IN ('events', 'state_groups_state', 'receipts_linearized')
ORDER BY n_dead_tup DESC;
```
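A cron wrapper around this query could alert before bloat gets bad. The psql call is sketched in a comment; the threshold arithmetic is the testable part (the 20% threshold is an arbitrary example, not from the audit):

```shell
#!/bin/sh
# Dead-tuple percentage: dead / (dead + live), as an integer percent.
dead_pct() {  # <n_dead_tup> <n_live_tup>
  echo $(( $1 * 100 / ($1 + $2 + 1) ))
}

# Arbitrary example threshold: flag tables with >= 20% dead tuples.
needs_attention() { [ "$(dead_pct "$1" "$2")" -ge 20 ]; }

# Feed values from, e.g.:
#   psql -At -c "SELECT n_dead_tup, n_live_tup FROM pg_stat_user_tables
#                WHERE relname = 'state_groups_state'"
```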
---
### 7. Hookshot Metrics + Grafana
**Hookshot metrics** are exposed at `127.0.0.1:9001/metrics` but it's unconfirmed whether Prometheus at `10.10.10.49` is scraping them. Verify:
```bash
# On LXC 151
curl http://127.0.0.1:9001/metrics | head -20
```
If Prometheus is scraping, add the hookshot dashboard from the repo:
`contrib/hookshot-dashboard.json` → import into Grafana.
**Grafana Synapse dashboard** — Prometheus is already scraping Synapse at port 9000. Import the official dashboard:
- Grafana → Dashboards → Import → ID `18618` (Synapse Monitoring)
- Set Prometheus datasource → done
- Shows room count, message rates, federation lag, cache hit rates, DB query times in real time
---
### 8. Federation Security
Currently: open federation with key verification (correct for invite-only friends server). Recommended additions:
**Server-level allow/deny in `homeserver.yaml`** (optional, for closing federation entirely):
```yaml
# Fully closed (recommended long-term for private guild). Synapse has no single
# "disable federation" flag; an empty whitelist refuses all remote servers:
federation_domain_whitelist: []
# OR: whitelist-only federation (the local server never needs listing)
# federation_domain_whitelist:
#   - matrix.org  # Keep if bridging needed
```
**Per-room ACLs** for reactive blocking of specific bad servers:
```json
{
"type": "m.room.server_acl",
"content": {
"allow": ["*"],
"deny": ["spam.example.com"]
}
}
```
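Sending that state event takes a `PUT` to the client-server API (`/_matrix/client/v3/rooms/<roomId>/state/m.room.server_acl`) with an admin access token. A small builder for the event content — `$ROOM` and `$TOKEN` are placeholders:

```shell
#!/bin/sh
# Build m.room.server_acl event content from a comma-separated list of
# quoted deny entries, matching the JSON example above.
build_server_acl() {
  printf '{"allow":["*"],"deny":[%s]}' "$1"
}

# Usage (requires network and a valid token):
#   curl -X PUT "https://matrix.lotusguild.org/_matrix/client/v3/rooms/$ROOM/state/m.room.server_acl" \
#        -H "Authorization: Bearer $TOKEN" \
#        -d "$(build_server_acl '"spam.example.com"')"
```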
**Mjolnir/Draupnir** (already on roadmap) handles this automatically with ban list subscriptions (t2bot spam lists etc).
---
### 9. Sygnal Push Notifications
Sygnal is the official Matrix push gateway for mobile (Element X on iOS/Android). Without it, notifications don't arrive when the app is backgrounded.
**Requirements:**
- Apple Developer account (APNS cert) for iOS
- Firebase project (FCM API key) for Android
- New LXC or run alongside existing services
**Basic config (`/etc/sygnal/sygnal.yaml`):**
```yaml
# Note: Sygnal's top-level HTTP section is `http` (not `server`), and its DB
# config mirrors Synapse's psycopg2 block — verify against the Sygnal docs.
http:
  bind_addresses: ['127.0.0.1']
  port: 8765
database:
  name: psycopg2
  args:
    user: sygnal
    password: <password>
    database: sygnal
apps:
  com.element.android:
    type: gcm
    api_key: <FCM_API_KEY>
  im.riot.x.ios:
    type: apns
    platform: production
    certfile: /etc/sygnal/apns/element-x-cert.pem
    topic: im.riot.x.ios
```
**Synapse integration:** no `homeserver.yaml` change is needed — each mobile client registers a pusher pointing at the push gateway's `/_matrix/push/v1/notify` URL, and Synapse POSTs notifications there. Expose Sygnal through NPM (e.g. proxy `/_matrix/push/v1/notify` on `matrix.lotusguild.org` → `localhost:8765`) so both Synapse and the apps can reach it.
---
### 10. LiveKit VP9/AV1 + Dynacast (Quality Improvement)
Currently VP8/H264 only. Enabling VP9/AV1 unlocks Dynacast (pauses video layers no one is watching), which significantly reduces bandwidth/CPU for low-viewer rooms.
**`/etc/livekit/config.yaml` additions:**
```yaml
video:
  codecs:
    - mime: video/VP8
    - mime: video/H264
      fmtp: "level-asymmetry-allowed=1;packetization-mode=1;profile-level-id=42e01e"
    - mime: video/VP9
      fmtp: "profile=0"
    - mime: video/AV1
      fmtp: "profile=0"
dynacast: true
```
Note: of these codecs only VP9 and AV1 are SVC-capable; H264 continues to rely on simulcast layers. H264 subscribers work normally alongside VP9/AV1 subscribers.
---
### 11. Synapse Workers (Future Scaling Reference)
The current single process handles ~100-300 concurrent users before the Python GIL becomes the bottleneck. Not needed now, but documented for when usage grows.
**Stage 1 trigger:** Synapse CPU >80% consistently, or >200 concurrent users.
**First workers to add:**
```yaml
# /etc/matrix-synapse/workers/client-reader-1.yaml
# (modern Synapse uses the generic_worker app; the dedicated client_reader
#  app has been removed)
worker_app: synapse.app.generic_worker
worker_name: client-reader-1
worker_listeners:
  - type: http
    port: 8011
    resources:
      - names: [client]
```
Add `federation_sender` next (offloads outgoing federation from the main process), then `event_creator` for write-heavy loads. Redis is required at Stage 2 (500+ users) for inter-worker coordination.
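The Stage 1 CPU trigger can be checked mechanically from cron. The sampling command is an assumption (sketched in a comment, using the `matrix-synapse` unit name); the threshold test is the testable part:

```shell
#!/bin/sh
# Sample Synapse's CPU% (assumption: systemd unit is matrix-synapse):
#   pid=$(systemctl show -p MainPID --value matrix-synapse)
#   cpu=$(ps -o %cpu= -p "$pid")

# True when a %cpu sample (e.g. "85.3") meets the Stage 1 threshold of 80%.
over_cpu_trigger() { [ "${1%.*}" -ge 80 ]; }
```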
---
## Bot Checklist
### Core
@@ -474,7 +802,7 @@ cp -r dist/* /var/www/html/
- [x] Initial sync token (ignores old messages on startup)
- [x] Auto-accept room invites
- [x] Deployed as systemd service (`matrixbot.service`) on LXC 151
- [x] Fix E2EE key errors — `nio_store/` cleared, bot restarted cleanly
- [x] Fix E2EE key errors — full store + credentials wipe, fresh device registration (`BBRZSEUECZ`); stale devices removed via admin API
### Commands
- [x] `!help` — list commands