docs: comprehensive March 2026 audit and applied fixes
- Add Improvement Audit section tracking all identified gaps and their status
- All critical/high/medium items applied: coturn cert auto-renewal (sync cron on compute-storage-01), Synapse metrics port locked to 127.0.0.1+10.10.10.29, well-known matrix endpoints live on lotusguild.org, suppress_key_server_warning, fail2ban on login endpoint, PostgreSQL autovacuum per-table tuning, LiveKit VP9/AV1 codecs
- Bot E2EE reset: full store+credentials wipe, stale devices removed, fresh device BBRZSEUECZ registered
- Checklist updated: LiveKit port range, autovacuum, hardening items, Grafana IP
- Hookshot: Owncast renamed to Livestream in display name (same UUID)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -129,7 +129,7 @@ Webhook URL format: `https://matrix.lotusguild.org/webhook/<uuid>`
| Lidarr | `66ac6fdd-69f6-4f47-bb00-b7f6d84d7c1c` | All event types |
| Uptime Kuma | `1a02e890-bb25-42f1-99fe-bba6a19f1811` | Status change notifications |
| Seerr | `555185af-90a1-42ff-aed5-c344e11955cf` | Request/approval events |
| Owncast | `9993e911-c68b-4271-a178-c2d65ca88499` | STREAM_STARTED / STREAM_STOPPED |
| Owncast (Livestream) | `9993e911-c68b-4271-a178-c2d65ca88499` | STREAM_STARTED / STREAM_STOPPED (hookshot display name: "Livestream") |
| Bazarr | `470fb267-3436-4dd3-a70c-e6e8db1721be` | Subtitle events (Apprise JSON notifier) |
| Tinker-Tickets | `6e306faf-8eea-4ba5-83ef-bf8f421f929e` | Custom transformation code |

@@ -405,7 +405,7 @@ cp -r dist/* /var/www/html/
- [x] Landing page with client recommendations (Cinny, Commet, Element, Element X mobile)
- [x] Synapse metrics endpoint (port 9000, Prometheus-compatible)
- [ ] Push notifications gateway (Sygnal) for mobile clients
- [ ] Expand LiveKit port range (50000-51000) for voice call capacity
- [x] LiveKit port range expanded to 50000-51000 for voice call capacity
- [x] Custom Cinny client LXC 106 (10.10.10.6) — Debian 13, Cinny 4.10.5 built from `add-joined-call-controls`, nginx serving, HA enabled
- [x] NPM proxy entry for `chat.lotusguild.org` → 10.10.10.6:80, SSL via Cloudflare DNS challenge, HTTPS forced, HTTP/2 + HSTS enabled
- [x] Cinny weekly auto-update cron (`/etc/cron.d/cinny-update`, Sundays 3am, logs to `/var/log/cinny-update.log`)
@@ -414,11 +414,13 @@ cp -r dist/* /var/www/html/
### Performance Tuning
- [x] PostgreSQL `shared_buffers` → 1500MB, `effective_cache_size`, `work_mem`, checkpoint tuning applied
- [x] PostgreSQL `pg_stat_statements` extension installed in `synapse` database
- [x] PostgreSQL autovacuum tuned per-table (`state_groups_state`, `events`, `receipts_linearized`, `receipts_graph`, `device_lists_stream`, `presence_stream`), `autovacuum_max_workers` → 5
- [x] Synapse `event_cache_size` → 30K, `_get_state_group_for_events` cache factor added
- [x] sysctl TCP/UDP buffer alignment applied to LXC 151 (`/etc/sysctl.d/99-matrix-tuning.conf`)
- [x] LiveKit room `empty_timeout: 300`, `departure_timeout: 20`, `max_participants: 50`
- [x] LiveKit ICE port range expanded to 50000-51000
- [x] LiveKit TURN TTL reduced from 24h to 1h
- [x] LiveKit VP9/AV1 codecs enabled (`video_codecs: [VP8, H264, VP9, AV1]`)
- [ ] BBR congestion control — must be applied on Proxmox host, not inside LXC (see Known Issues)

### Auth & SSO
@@ -429,7 +431,7 @@ cp -r dist/* /var/www/html/

### Webhooks & Integrations
- [x] matrix-hookshot 7.3.2 installed and running
- [x] Generic webhook bridge for 11 active services (Grafana, Proxmox, Sonarr, Radarr, Readarr, Lidarr, Uptime Kuma, Seerr, Owncast, Bazarr, Tinker-Tickets)
- [x] Generic webhook bridge for 11 active services (Grafana, Proxmox, Sonarr, Radarr, Readarr, Lidarr, Uptime Kuma, Seerr, Owncast/Livestream, Bazarr, Tinker-Tickets)
- [x] Per-service JS transformation functions — all rewritten to handle full event payloads (all event types, health alerts, app updates, release groups, download clients)
- [x] Per-service virtual user avatars
- [x] NPM reverse proxy for `/webhook` path
@@ -447,6 +449,11 @@ cp -r dist/* /var/www/html/
- [x] coturn internal peer deny rules (blocks relay to RFC1918 except allowed subnet)
- [x] `pg_hba.conf` locked down — remote access restricted to Synapse LXC (10.10.10.29) only
- [x] Federation enabled with key verification (open for invite-only growth to friends/family/coworkers)
- [x] fail2ban on Synapse login endpoint (5 retries / 24h ban, LXC 151)
- [x] Synapse metrics port 9000 restricted to `127.0.0.1` + `10.10.10.29` (was `0.0.0.0`)
- [x] coturn cert auto-renewal — daily sync cron on compute-storage-01 copies NPM cert → coturn
- [x] `/.well-known/matrix/client` and `/server` live on lotusguild.org (NPM advanced config)
- [x] `suppress_key_server_warning: true` in homeserver.yaml
- [ ] Federation allow/deny lists for known bad actors
- [ ] Regular Synapse updates
- [x] Automated database + media backups
@@ -455,7 +462,7 @@ cp -r dist/* /var/www/html/
- [x] Synapse metrics endpoint (port 9000, Prometheus-compatible)
- [x] Uptime Kuma monitors added: Synapse HTTP, LiveKit TCP, PostgreSQL TCP, Cinny Web, coturn TCP 3478, lk-jwt-service, Hookshot
- [ ] Uptime Kuma: coturn UDP STUN monitoring (requires push/heartbeat — no native UDP type in Kuma)
- [ ] Grafana dashboard for Synapse Prometheus metrics (LXC 107 at 10.10.10.X already running Grafana)
- [ ] Grafana dashboard for Synapse Prometheus metrics — Grafana at 10.10.10.49 (LXC 107), Prometheus scraping 10.10.10.29:9000 confirmed. Import dashboard ID `18618` from grafana.com

### Admin
- [x] Synapse admin API dashboard (synapse-admin at http://10.10.10.29:8080)
@@ -465,6 +472,327 @@ cp -r dist/* /var/www/html/

---

## Improvement Audit (March 2026)

Comprehensive audit of the current infrastructure against official documentation and security best practices. Applied March 9, 2026.

### Priority Summary

| Issue | Severity | Status |
|-------|----------|--------|
| coturn TLS cert expires May 12 — no auto-renewal | **CRITICAL** | ✅ Fixed — daily sync cron on compute-storage-01 copies NPM-renewed cert to coturn |
| Synapse metrics port 9000 bound to `0.0.0.0` | **HIGH** | ✅ Fixed — now binds `127.0.0.1` + `10.10.10.29` (Prometheus still works, internet blocked) |
| `/.well-known/matrix/client` returns 404 | MEDIUM | ✅ Fixed — NPM lotusguild.org proxy host updated, live at `https://lotusguild.org/.well-known/matrix/client` |
| `suppress_key_server_warning` not set | MEDIUM | ✅ Fixed — added to homeserver.yaml |
| No fail2ban on `/_matrix/client/.*/login` | MEDIUM | ✅ Fixed — fail2ban installed, matrix-synapse jail active (5 retries / 24h ban) |
| No media purge cron (retention policy set but never triggers) | MEDIUM | ✅ N/A — `media_retention` block already in homeserver.yaml; Synapse runs the purge internally on schedule |
| PostgreSQL autovacuum not tuned per-table | LOW | ✅ Fixed — all six high-churn tables tuned, `autovacuum_max_workers` → 5 |
| Hookshot metrics scrape unconfirmed | LOW | ⚠️ Port 9001 responds but `/metrics` returns 404 — hookshot bug or path mismatch; low impact |
| LiveKit VP9/AV1 codec support | LOW | ✅ Applied — `video_codecs: [VP8, H264, VP9, AV1]` added to livekit config |
| Federation allow/deny list not configured | LOW | Pending — Mjolnir/Draupnir on roadmap |
| Sygnal push notifications not deployed | INFO | Deferred |

---

### 1. coturn Cert Auto-Renewal ✅

The coturn cert is managed by NPM (cert ID 91, stored at `/etc/letsencrypt/live/npm-91/` on LXC 139). NPM renews it automatically. A sync script on `compute-storage-01` detects when NPM renews and copies it to coturn.

**Deployed:** `/usr/local/bin/coturn-cert-sync.sh` on compute-storage-01, cron `/etc/cron.d/coturn-cert-sync` (runs 03:30 daily).

The script compares cert expiry dates between LXC 139 and LXC 151. If they differ (NPM renewed), it copies `fullchain.pem` + `privkey.pem` and restarts coturn.

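The deployed script isn't reproduced here, but its comparison core can be sketched as follows. The paths are illustrative assumptions — the real script first pulls the NPM copy over from LXC 139:

```shell
#!/usr/bin/env bash
# Sketch of coturn-cert-sync.sh's core logic. NPM_CERT/TURN_CERT paths are
# placeholders; the deployed script fetches the NPM cert from LXC 139 first.
set -euo pipefail

NPM_CERT="${NPM_CERT:-/root/certs/npm-91/fullchain.pem}"
TURN_CERT="${TURN_CERT:-/etc/coturn/fullchain.pem}"

# Print a cert's notAfter date, e.g. "May 12 23:59:59 2026 GMT"
cert_expiry() {
  openssl x509 -enddate -noout -in "$1" | cut -d= -f2
}

# Copy cert and restart coturn only when the expiry dates differ (NPM renewed)
sync_if_renewed() {
  if [ "$(cert_expiry "$NPM_CERT")" != "$(cert_expiry "$TURN_CERT")" ]; then
    cp "$NPM_CERT" "$TURN_CERT"   # privkey.pem is handled the same way
    systemctl restart coturn
    echo "coturn cert synced"
  fi
}
```

Comparing `notAfter` dates (rather than file mtimes) makes the sync idempotent: re-copying the same cert never triggers a restart.
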
**Additional coturn hardening (while you're in there):**

```
# /etc/turnserver.conf
stale-nonce=600      # nonce expires after 600s (limits replay attacks)
user-quota=100       # max concurrent allocations per user
total-quota=1000     # total allocations on the server
max-bps=1000000      # 1 Mbps per TURN session
cipher-list="ECDHE-RSA-AES256-GCM-SHA384:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-CHACHA20-POLY1305"
```

---

### 2. Synapse Configuration Gaps

**a) Metrics port exposed to 0.0.0.0 (HIGH)**

Port 9000 currently binds `0.0.0.0`, which exposes internal state, user counts, and DB query times externally. Fix in `homeserver.yaml`:

```yaml
listeners:
  - port: 9000
    bind_addresses: ['127.0.0.1', '10.10.10.29']  # NOT 0.0.0.0
    type: metrics
    resources: []
```

Grafana at `10.10.10.49` scrapes port 9000 from within the VLAN, so this is safe to lock down.

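For reference, the matching job on the Prometheus side would look roughly like this (file path and job name are assumptions; a `metrics`-type listener serves under `/_synapse/metrics`):

```yaml
# /etc/prometheus/prometheus.yml on LXC 107 (path and job name assumed)
scrape_configs:
  - job_name: synapse
    metrics_path: /_synapse/metrics
    static_configs:
      - targets: ['10.10.10.29:9000']
```
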

**b) suppress_key_server_warning (MEDIUM)**

Without it, Synapse fills its logs with a trusted-key-server warning on every restart. One line in `homeserver.yaml`:

```yaml
suppress_key_server_warning: true
```

**c) Database connection pooling (LOW — track for growth)**

Current defaults (`cp_min: 5`, `cp_max: 10`) are fine for a single process. When adding workers, increase `cp_max` to 20–30 per worker group. Add it explicitly to `homeserver.yaml` to make the setting visible:

```yaml
database:
  name: psycopg2
  args:
    cp_min: 5
    cp_max: 10
```

---

### 3. Matrix Well-Known 404

`/.well-known/matrix/client` returns 404. This breaks client autodiscovery — users who type `lotusguild.org` instead of `matrix.lotusguild.org` get an error. Fix in NPM with a custom location block on the `lotusguild.org` proxy host:

```nginx
location /.well-known/matrix/client {
    add_header Content-Type application/json;
    add_header Access-Control-Allow-Origin *;
    return 200 '{"m.homeserver":{"base_url":"https://matrix.lotusguild.org"}}';
}
location /.well-known/matrix/server {
    add_header Content-Type application/json;
    add_header Access-Control-Allow-Origin *;
    return 200 '{"m.server":"matrix.lotusguild.org:443"}';
}
```
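
A quoting typo in those inline JSON bodies would silently break autodiscovery, so it's worth validating the exact strings locally before pasting them into NPM (python3 is used here only as a JSON validator):

```shell
# The two JSON bodies served by the location blocks above
client='{"m.homeserver":{"base_url":"https://matrix.lotusguild.org"}}'
server='{"m.server":"matrix.lotusguild.org:443"}'
echo "$client" | python3 -m json.tool >/dev/null && echo "client ok"
echo "$server" | python3 -m json.tool >/dev/null && echo "server ok"
```
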

---

### 4. fail2ban for Synapse Login

No brute-force protection on `/_matrix/client/*/login`. Easy win.

**`/etc/fail2ban/jail.d/matrix-synapse.conf`:**

```ini
[matrix-synapse]
enabled = true
port = http,https
filter = matrix-synapse
backend = systemd
journalmatch = _SYSTEMD_UNIT=matrix-synapse.service + PRIORITY=3
findtime = 600
maxretry = 5
bantime = 86400
```

(With `backend = systemd`, no `logpath` is needed — the jail reads the journal directly.)

**`/etc/fail2ban/filter.d/matrix-synapse.conf`:**

```ini
[Definition]
failregex = ^.*Failed (password|SAML) login attempt for user .* from <HOST>.*$
            ^.*"POST /.*login.*" 401.*$
ignoreregex = ^.*"GET /sync.*".*$
```
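
The first failregex can be sanity-checked with `grep -E` before reloading fail2ban. The `<HOST>` tag is replaced by an IP pattern here, and the sample log line is illustrative only, not a verbatim Synapse log:

```shell
# Approximate the fail2ban pattern (<HOST> swapped for an IP matcher)
pattern='Failed (password|SAML) login attempt for user .* from [0-9.]+'
sample='2026-03-09 10:00:00 - synapse.rest.client.login - Failed password login attempt for user @x:lotusguild.org from 203.0.113.7'
echo "$sample" | grep -Eq "$pattern" && echo "match"
```

Note the unescaped `(password|SAML)`: fail2ban filters use Python-style regex, where `\(` and `\|` would match literal characters instead of grouping and alternation.
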

---

### 5. Synapse Media Purge Cron

The retention policy (remote 1yr, local 3yr) is set via the `media_retention` block in homeserver.yaml, which Synapse purges on its own internal schedule — no cron is strictly required (see Priority Summary). Separately, the admin API can purge the remote media cache on demand, which is useful for reclaiming space more aggressively than the retention window allows.

**`/usr/local/bin/purge-synapse-media.sh`** (create on LXC 151):

```bash
#!/bin/bash
ADMIN_TOKEN="syt_your_admin_token"
# Purge remote media (cached from other homeservers) older than 90 days
CUTOFF_TS=$(($(date +%s000) - 7776000000))
curl -X POST \
  "http://localhost:8008/_synapse/admin/v1/purge_media_cache?before_ts=$CUTOFF_TS" \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -s -o /dev/null
echo "$(date): Synapse remote media purge completed" >> /var/log/synapse-purge.log
```

```bash
chmod +x /usr/local/bin/purge-synapse-media.sh
echo "0 4 * * * root /usr/local/bin/purge-synapse-media.sh" > /etc/cron.d/synapse-purge
```
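
The magic constant in `CUTOFF_TS` is just 90 days expressed in milliseconds (the admin API's `before_ts` is a millisecond timestamp, which is also why the script uses `date +%s000`):

```shell
# 90 days * 86400 s/day * 1000 ms/s = 7776000000 ms
ninety_days_ms=$((90 * 86400 * 1000))
echo "$ninety_days_ms"   # 7776000000
```
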

---

### 6. PostgreSQL Autovacuum Per-Table Tuning

The high-churn Synapse tables (`state_groups_state`, `events`, the receipts tables) are not tuned for aggressive autovacuum. As the DB grows, bloat accumulates and queries slow down. Run on LXC 109 (PostgreSQL):

```sql
-- state_groups_state: biggest bloat source
ALTER TABLE state_groups_state SET (
  autovacuum_vacuum_scale_factor = 0.01,
  autovacuum_analyze_scale_factor = 0.005,
  autovacuum_vacuum_cost_delay = 5
);

-- events: second priority
ALTER TABLE events SET (
  autovacuum_vacuum_scale_factor = 0.02,
  autovacuum_analyze_scale_factor = 0.01,
  autovacuum_vacuum_cost_delay = 5
);

-- receipts and device_lists_stream
ALTER TABLE receipts_linearized SET (autovacuum_vacuum_scale_factor = 0.01, autovacuum_vacuum_cost_delay = 5);
ALTER TABLE receipts_graph SET (autovacuum_vacuum_scale_factor = 0.01, autovacuum_vacuum_cost_delay = 5);
ALTER TABLE device_lists_stream SET (autovacuum_vacuum_scale_factor = 0.02);
ALTER TABLE presence_stream SET (autovacuum_vacuum_scale_factor = 0.02);
```

Also bump `autovacuum_max_workers` from 3 → 5. (Note that `autovacuum_naptime` is a server-level setting, not a per-table storage parameter, so it doesn't belong in the `ALTER TABLE` calls above.)

```sql
ALTER SYSTEM SET autovacuum_max_workers = 5;
SELECT pg_reload_conf();
```

**Monitor vacuum health:**

```sql
SELECT relname, last_autovacuum, n_dead_tup, n_live_tup
FROM pg_stat_user_tables
WHERE relname IN ('events', 'state_groups_state', 'receipts_linearized')
ORDER BY n_dead_tup DESC;
```

---

### 7. Hookshot Metrics + Grafana

**Hookshot metrics** are exposed at `127.0.0.1:9001/metrics`, but it's unconfirmed whether Prometheus at `10.10.10.49` is scraping them — and as long as the listener binds only `127.0.0.1`, a remote scrape cannot reach it anyway. Verify locally first:

```bash
# On LXC 151
curl http://127.0.0.1:9001/metrics | head -20
```

If Prometheus is scraping, add the hookshot dashboard from the repo:
`contrib/hookshot-dashboard.json` → import into Grafana.

**Grafana Synapse dashboard** — Prometheus is already scraping Synapse at port 9000. Import the official dashboard:
- Grafana → Dashboards → Import → ID `18618` (Synapse Monitoring)
- Set Prometheus datasource → done
- Shows room count, message rates, federation lag, cache hit rates, DB query times in real time

---

### 8. Federation Security

Currently: open federation with key verification (correct for an invite-only friends server). Recommended additions:

**Server-level allow/deny in `homeserver.yaml`** (optional, for closing federation down — Synapse has no `federation_enabled` flag; an empty whitelist refuses all remote servers):

```yaml
# Fully closed (recommended long-term for private guild):
federation_domain_whitelist: []

# OR: whitelist-only federation
federation_domain_whitelist:
  - matrix.lotusguild.org
  - matrix.org  # Keep if bridging needed
```

**Per-room ACLs** for reactive blocking of specific bad servers:

```json
{
  "type": "m.room.server_acl",
  "content": {
    "allow": ["*"],
    "deny": ["spam.example.com"]
  }
}
```

**Mjolnir/Draupnir** (already on roadmap) handles this automatically with ban-list subscriptions (t2bot spam lists etc.).

---

### 9. Sygnal Push Notifications

Sygnal is the official Matrix push gateway for mobile (Element X on iOS/Android). Without it, notifications don't arrive while the app is backgrounded.

**Requirements:**
- Apple Developer account (APNS cert) for iOS
- Firebase project (FCM API key) for Android
- New LXC, or run alongside existing services

**Basic config (`/etc/sygnal/sygnal.yaml`):**

```yaml
server:
  port: 8765
database:
  type: postgresql
  user: sygnal
  password: <password>
  database: sygnal
apps:
  com.element.android:
    type: gcm
    api_key: <FCM_API_KEY>
  im.riot.x.ios:
    type: apns
    platform: production
    certfile: /etc/sygnal/apns/element-x-cert.pem
    topic: im.riot.x.ios
```


**Synapse integration:** nothing to add in `homeserver.yaml` — push gateways are not configured server-side. Each mobile client registers its own HTTP pusher pointing at the gateway, so the only requirement is that Sygnal's `/_matrix/push/v1/notify` endpoint is reachable from the clients (proxy it through NPM).

---

### 10. LiveKit VP9/AV1 + Dynacast (Quality Improvement)

Currently H264 only. Enabling VP9/AV1 unlocks SVC-based dynacast (pausing video layers no one is watching), which significantly reduces bandwidth/CPU for low-viewer rooms.

**`/etc/livekit/config.yaml` additions:**

```yaml
video:
  codecs:
    - mime: video/H264
      fmtp: "level-asymmetry-allowed=1;packetization-mode=1;profile-level-id=42e01e"
    - mime: video/VP9
      fmtp: "profile=0"
    - mime: video/AV1
      fmtp: "profile=0"
dynacast: true
```

Note: dynacast itself works with any simulcast codec, but SVC — the bigger win — requires VP9 or AV1, and publishing clients also opt in (`dynacast: true` in the client's room options). H264 subscribers continue to work normally alongside VP9/AV1 subscribers.

---

### 11. Synapse Workers (Future Scaling Reference)

The current single process handles roughly 100–300 concurrent users before the Python GIL becomes the bottleneck. Not needed now, but documented for when usage grows.

**Stage 1 trigger:** Synapse CPU >80% consistently, or >200 concurrent users.

**First workers to add:**

```yaml
# /etc/matrix-synapse/workers/client-reader-1.yaml
worker_app: synapse.app.generic_worker   # dedicated apps like client_reader are deprecated
worker_name: client-reader-1
worker_listeners:
  - type: http
    port: 8011
    resources: [{names: [client]}]
```
Add `federation_sender` next (offloads outgoing federation from the main process), then `event_creator` for write-heavy loads. Redis is required at Stage 2 (500+ users) for inter-worker coordination.
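
Once a worker is listening, the reverse proxy must route the endpoints it serves to it. A minimal nginx sketch for NPM's advanced config — the worker IP/port come from the example above, and the path list is illustrative, not the full endpoint set Synapse documents for workers:

```nginx
# Route read-heavy client endpoints to client-reader-1 (port 8011)
location ~ ^/_matrix/client/(r0|v3)/(sync|messages) {
    proxy_pass http://10.10.10.29:8011;
}
```
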

---

## Bot Checklist

### Core
@@ -474,7 +802,7 @@ cp -r dist/* /var/www/html/
- [x] Initial sync token (ignores old messages on startup)
- [x] Auto-accept room invites
- [x] Deployed as systemd service (`matrixbot.service`) on LXC 151
- [x] Fix E2EE key errors — `nio_store/` cleared, bot restarted cleanly
- [x] Fix E2EE key errors — full store + credentials wipe, fresh device registration (`BBRZSEUECZ`); stale devices removed via admin API

### Commands
- [x] `!help` — list commands