matrix/README.md
> **Last commit:** `0ba095ba03` — Jared Vititoe, 2026-03-10 — "docs: mark coturn hardening applied, update action items" (stale-nonce, user-quota, total-quota, cipher-list applied to `/etc/turnserver.conf`; BBR intentionally skipped on the HA multi-host setup; Storj update and Synapse lag resolved). Co-authored-by: Claude Sonnet 4.6.

# Lotus Matrix Bot & Server Roadmap
Matrix bot and server infrastructure for the Lotus Guild homeserver (`matrix.lotusguild.org`).
**Repo**: https://code.lotusguild.org/LotusGuild/matrixBot
## Status: Phase 6 — Monitoring, Observability & Hardening
---
## Priority Order
1. ~~PostgreSQL migration~~
2. ~~TURN server~~
3. ~~Room structure + space setup~~
4. ~~Matrix bot (core + commands)~~
5. ~~LiveKit / Element Call~~
6. ~~SSO / OIDC (Authelia)~~
7. ~~Webhook integrations (hookshot)~~
8. ~~Voice stability & quality tuning~~
9. ~~Custom Cinny client (chat.lotusguild.org)~~
10. Custom emoji packs (partially finished)
11. Cinny custom branding (Lotus Guild theme)
12. Draupnir moderation bot
13. Push notifications (Sygnal)
---
## Infrastructure
| Service | IP | LXC | RAM | vCPUs | Disk | Versions |
|---------|----|-----|-----|-------|------|----------|
| Synapse | 10.10.10.29 | 151 | 8GB | 4 (Ryzen 9 7900) | 50GB (21% used) | Synapse 1.148.0, LiveKit 1.9.11, hookshot 7.3.2, coturn latest |
| PostgreSQL 17 | 10.10.10.44 | 109 | 6GB | 3 (Ryzen 9 7900) | 30GB (5% used) | PostgreSQL 17.9 |
| Cinny Web | 10.10.10.6 | 106 | 256MB runtime | 1 | 8GB (27% used) | Debian 13, nginx, Node 24, Cinny 4.10.5 |
| NPM | 10.10.10.27 | 139 | — | — | — | Nginx Proxy Manager |
| Authelia | 10.10.10.36 | 167 | — | — | — | SSO/OIDC provider |
| LLDAP | 10.10.10.39 | 147 | — | — | — | LDAP user directory |
| Uptime Kuma | 10.10.10.25 | 101 | — | — | — | Uptime monitoring (micro1 node) |
| Prometheus | 10.10.10.48 | 118 | — | — | — | Prometheus — scrapes all Matrix services |
| Grafana | 10.10.10.49 | 107 | — | — | — | Grafana 12.4.0 — dashboard.lotusguild.org |
> **Note:** PostgreSQL container IP is `10.10.10.44`, not `.2` — update any stale references.
**Key paths on Synapse/matrix LXC (151):**
- Synapse config: `/etc/matrix-synapse/homeserver.yaml`
- Synapse conf.d: `/etc/matrix-synapse/conf.d/` (metrics.yaml, report_stats.yaml, server_name.yaml)
- coturn config: `/etc/turnserver.conf`
- LiveKit config: `/etc/livekit/config.yaml`
- LiveKit service: `livekit-server.service`
- lk-jwt-service: `lk-jwt-service.service` (binds `:8070`, serves JWT tokens for MatrixRTC)
- Hookshot: `/opt/hookshot/`, service: `matrix-hookshot.service`
- Hookshot config: `/opt/hookshot/config.yml`
- Hookshot registration: `/etc/matrix-synapse/hookshot-registration.yaml`
- Landing page: `/var/www/matrix-landing/index.html` (on NPM LXC 139)
- Bot: `/opt/matrixbot/`, service: `matrixbot.service`
**Key paths on PostgreSQL LXC (109):**
- PostgreSQL config: `/etc/postgresql/17/main/postgresql.conf`
- PostgreSQL conf.d: `/etc/postgresql/17/main/conf.d/`
- HBA config: `/etc/postgresql/17/main/pg_hba.conf`
- Data directory: `/var/lib/postgresql/17/main`
**Running services on LXC 151:**
| Service | PID status | Memory | Notes |
|---------|-----------|--------|-------|
| matrix-synapse | active, 2+ days | 231MB peak 312MB | No workers, single process |
| livekit-server | active, 2+ days | 22MB peak 58MB | v1.9.11, node IP = 162.192.14.139 |
| lk-jwt-service | active, 2+ days | 2.7MB | Binds :8070, LIVEKIT_URL=wss://matrix.lotusguild.org |
| matrix-hookshot | active, 2+ days | 76MB peak 172MB | Actively receiving webhooks |
| matrixbot | active, 2+ days | 26MB peak 59MB | Some E2EE key errors (see known issues) |
| coturn | active, 2+ days | 13MB | Periodic TCP reset errors (normal) |
**Currently open port forwarding (router → 10.10.10.29):**
- TCP+UDP 3478 (TURN/STUN signaling)
- TCP+UDP 5349 (TURNS/TLS)
- TCP 7881 (LiveKit ICE TCP fallback)
- TCP+UDP 49152-65535 (TURN relay range)
- LiveKit WebRTC media: 50000-51000 (subset of the relay range above — expanded from 50100-50500, see improvements)
**Internal port map (LXC 151):**
| Port | Service | Bind |
|------|---------|------|
| 8008 | Synapse HTTP | 0.0.0.0 + ::1 |
| 9000 | Synapse metrics (Prometheus) | 127.0.0.1 + 10.10.10.29 |
| 9001 | Hookshot widgets | 0.0.0.0 |
| 9002 | Hookshot bridge (appservice) | 127.0.0.1 |
| 9003 | Hookshot webhooks | 0.0.0.0 |
| 9004 | Hookshot metrics (Prometheus) | 0.0.0.0 |
| 9100 | node_exporter (Prometheus) | 0.0.0.0 |
| 9101 | matrix-admin exporter | 0.0.0.0 |
| 6789 | LiveKit metrics (Prometheus) | 0.0.0.0 |
| 7880 | LiveKit HTTP | 0.0.0.0 |
| 7881 | LiveKit RTC TCP | 0.0.0.0 |
| 8070 | lk-jwt-service | 0.0.0.0 |
| 8080 | synapse-admin (nginx) | 0.0.0.0 |
| 3478 | coturn STUN/TURN | 0.0.0.0 |
| 5349 | coturn TURNS/TLS | 0.0.0.0 |
**Internal port map (LXC 109 — PostgreSQL):**
| Port | Service | Bind |
|------|---------|------|
| 5432 | PostgreSQL | 0.0.0.0 (hba-restricted to 10.10.10.29) |
| 9100 | node_exporter (Prometheus) | 0.0.0.0 |
| 9187 | postgres_exporter | 0.0.0.0 |
---
## Rooms (all v12)
| Room | Room ID | Join Rule |
|------|---------|-----------|
| The Lotus Guild (Space) | `!-1ZBnAH-JiCOV8MGSKN77zDGTuI3pgSdy8Unu_DrDyc` | public |
| General | `!wfokQ1-pE896scu_AOcCBA2s3L4qFo-PTBAFTd0WMI0` | public |
| Commands | `!ou56mVZQ8ZB7AhDYPmBV5_BR28WMZ4x5zwZkPCqjq1s` | restricted (Space members) |
| Memes | `!GK6v5cLEEnowIooQJv5jECfISUjADjt8aKhWv9VbG5U` | restricted (Space members) |
| Management | `!mEvR5fe3jMmzwd-FwNygD72OY_yu8H3UP_N-57oK7MI` | invite |
| Cool Kids | `!R7DT3QZHG9P8QQvX6zsZYxjkKgmUucxDz_n31qNrC94` | invite |
| Spam and Stuff | `!GttT4QYd1wlGlkHU3qTmq_P3gbyYKKeSSN6R7TPcJHg` | invite, **no E2EE** (hookshot) |
**Power level roles (Cinny tags):**
- 100: Owner (jared)
- 50: The Nerdy Council (enhuynh, lonely)
- 48: Panel of Geeks
- 35: Cool Kids
- 0: Member
---
## Webhook Integrations (matrix-hookshot 7.3.2)
Generic webhooks bridged into **Spam and Stuff** via [matrix-hookshot](https://github.com/matrix-org/matrix-hookshot).
Each service gets its own virtual user (`@hookshot_<service>`) with a unique avatar.
Webhook URL format: `https://matrix.lotusguild.org/webhook/<uuid>`
| Service | Webhook UUID | Notes |
|---------|-------------|-------|
| Grafana | `df4a1302-2d62-4a01-b858-fb56f4d3781a` | Unified alerting contact point |
| Proxmox | `9b3eafe5-7689-4011-addd-c466e524661d` | Notification system (8.1+) |
| Sonarr | `aeffc311-0686-42cb-9eeb-6757140c072e` | All event types |
| Radarr | `34913454-c1ac-4cda-82ea-924d4a9e60eb` | All event types |
| Readarr | `e57ab4f3-56e6-4dc4-8b30-2f4fd4bbeb0b` | All event types |
| Lidarr | `66ac6fdd-69f6-4f47-bb00-b7f6d84d7c1c` | All event types |
| Uptime Kuma | `1a02e890-bb25-42f1-99fe-bba6a19f1811` | Status change notifications |
| Seerr | `555185af-90a1-42ff-aed5-c344e11955cf` | Request/approval events |
| Owncast (Livestream) | `9993e911-c68b-4271-a178-c2d65ca88499` | STREAM_STARTED / STREAM_STOPPED (hookshot display name: "Livestream") |
| Bazarr | `470fb267-3436-4dd3-a70c-e6e8db1721be` | Subtitle events (Apprise JSON notifier) |
| Tinker-Tickets | `6e306faf-8eea-4ba5-83ef-bf8f421f929e` | Custom transformation code |
**Hookshot notes:**
- Spam and Stuff is intentionally **unencrypted** — hookshot bridges cannot join E2EE rooms
- Webhook tokens stored in Synapse PostgreSQL `room_account_data` for `@hookshot`
- JS transformation functions use hookshot v2 API: set `result = { version: "v2", plain, html, msgtype }`
- The `result` variable must be assigned without `var`/`let`/`const` (needs implicit global scope in the QuickJS IIFE sandbox)
- NPM proxies `https://matrix.lotusguild.org/webhook/*` → `http://10.10.10.29:9003`
- Virtual user avatars: set via appservice token (`as_token` in hookshot-registration.yaml) impersonating each user
- Hookshot bridge port (9002) binds `127.0.0.1` only; webhook ingest (9003) binds `0.0.0.0` (NPM-proxied)
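To make the v2 contract above concrete, here is a minimal transformation body (the `title`/`message` payload fields are hypothetical — each real service needs its own mapping), exercised locally by emulating the sandbox with `new Function`:

```javascript
// Sketch of a hookshot v2 transformation body. Inside hookshot's QuickJS
// sandbox, `data` is the incoming webhook JSON and the bare assignment to
// `result` (no var/let/const) is how the value escapes the wrapping IIFE.
const body = `
  result = {
    version: "v2",
    plain: "**" + data.title + "**: " + data.message,
    html: "<b>" + data.title + "</b>: " + data.message,
    msgtype: "m.notice",
  };
`;

// Emulate the sandbox locally for a quick smoke test (non-strict Function
// bodies let the bare "result =" assignment resolve, as in the sandbox).
const run = new Function("data", body + "return result;");
console.log(run({ title: "Backup", message: "completed" }).plain);
```

Only the body string (without the `new Function` harness) is what would be pasted into hookshot's transformation field.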
---
## Known Issues
### coturn TLS Reset Errors
Periodic `TLS/TCP socket error: Connection reset by peer` in coturn logs from external IPs. This is normal — clients probe TURN and drop the connection once they establish a direct P2P path. Not an issue.
### BBR Congestion Control — Host-Level Only
`net.ipv4.tcp_congestion_control = bbr` and `net.core.default_qdisc = fq` cannot be set from inside an unprivileged LXC container — they affect the host kernel's network namespace. These must be applied on the Proxmox host itself to take effect for all containers. All other sysctl tuning (TCP/UDP buffers, fin_timeout) applied successfully inside LXC 151.
---
## Optimizations & Improvements
### 1. LiveKit / Voice Quality ✅ Applied
Noise suppression and volume normalization are **client-side only** (browser/Element X handles this via WebRTC's built-in audio processing). The server cannot enforce these. Applied server-side improvements:
- **ICE port range expanded:** 50100-50500 (400 ports) → **50000-51000 (1001 ports)** = ~500 concurrent WebRTC streams
- **TURN TTL reduced:** 86400s (24h) → **3600s (1h)** — stale allocations expire faster
- **Room defaults added:** `empty_timeout: 300`, `departure_timeout: 20`, `max_participants: 50`
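The port arithmetic behind those numbers, as a quick check (inclusive counts come out one higher than the round figures quoted; the ~500-stream estimate assumes roughly 2 ports per stream, e.g. RTP + RTCP):

```javascript
// Inclusive port-range sizes and the resulting concurrent-stream estimate.
const portCount = (lo, hi) => hi - lo + 1;

const oldRange = portCount(50100, 50500); // previous LiveKit ICE range
const newRange = portCount(50000, 51000); // expanded range
const streams = Math.floor(newRange / 2); // ~2 ports per WebRTC stream

console.log(oldRange, newRange, streams); // 401 1001 500
```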
**Client-side audio advice for users:**
- **Element Web/Desktop:** Settings → Voice & Video → enable "Noise Suppression" and "Echo Cancellation"
- **Element X (mobile):** automatic via WebRTC stack
- **Cinny (chat.lotusguild.org):** voice via embedded Element Call widget — browser WebRTC noise suppression is active automatically
### 2. PostgreSQL Tuning (LXC 109) ✅ Applied
`/etc/postgresql/17/main/conf.d/synapse_tuning.conf` written and active. `pg_stat_statements` extension created in the `synapse` database. Config applied:
```ini
# Memory — shared_buffers = 25% RAM, effective_cache_size = 75% RAM
shared_buffers = 1500MB
effective_cache_size = 4500MB
work_mem = 32MB # Per sort/hash operation (safe at low connection count)
maintenance_work_mem = 256MB # VACUUM, CREATE INDEX
wal_buffers = 64MB # WAL write buffer
# Checkpointing
checkpoint_completion_target = 0.9 # Spread checkpoint I/O (default 0.5 is aggressive)
max_wal_size = 2GB
# Storage (Ceph RBD block device = SSD-equivalent random I/O)
random_page_cost = 1.1 # Default 4.0 assumes spinning disk
effective_io_concurrency = 200 # For SSDs/Ceph
# Parallel queries (3 vCPUs)
max_worker_processes = 3
max_parallel_workers_per_gather = 1
max_parallel_workers = 2
# Monitoring
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.track = all
```
Restarted `postgresql@17-main`. Expected impact: Synapse query latency stays low even as the DB grows — the entire current 120MB database fits in shared_buffers.
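The sizing rule in the config comment above, worked through for this container (25% / 75% of the LXC's 6GB; the config rounds down slightly):

```javascript
// shared_buffers = 25% of RAM, effective_cache_size = 75% of RAM (6GB LXC).
const ramMB = 6 * 1024;
const sharedBuffersMB = ramMB * 0.25;  // 1536 — config rounds to 1500MB
const effectiveCacheMB = ramMB * 0.75; // 4608 — config rounds to 4500MB
console.log(sharedBuffersMB, effectiveCacheMB); // 1536 4608
```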
### 3. PostgreSQL Security — pg_hba.conf (LXC 109) ✅ Applied
Removed the two open rules (`0.0.0.0/24 md5` and `0.0.0.0/0 md5`). Remote access is now restricted to Synapse LXC only:
```
host synapse synapse_user 10.10.10.29/32 scram-sha-256
```
All other remote connections are rejected. Local Unix socket and loopback remain functional for admin access.
### 4. Synapse Cache Tuning (LXC 151) ✅ Applied
`event_cache_size` bumped 15K → 30K. `_get_state_group_for_events: 3.0` added to `per_cache_factors` (heavily hit during E2EE key sharing). Synapse restarted cleanly.
```yaml
event_cache_size: 30K
caches:
  global_factor: 2.0
  per_cache_factors:
    get_users_in_room: 3.0
    get_current_state_ids: 3.0
    _get_state_group_for_events: 3.0
```
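A sketch of how those numbers combine, assuming Synapse's usual rule that a `per_cache_factors` entry overrides `global_factor` for that cache:

```javascript
// Effective cache sizing from the YAML above.
const eventCacheBase = 30000; // event_cache_size: 30K
const globalFactor = 2.0;
const perCacheFactors = { _get_state_group_for_events: 3.0 };

// A named cache uses its per-cache factor if set, else the global factor.
const factorFor = (name) => perCacheFactors[name] ?? globalFactor;

console.log(eventCacheBase * globalFactor);            // 60000 entries
console.log(factorFor("_get_state_group_for_events")); // 3
console.log(factorFor("some_other_cache"));            // 2
```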
### 5. Network / sysctl Tuning (LXC 151) ✅ Applied
`/etc/sysctl.d/99-matrix-tuning.conf` written and active. TCP/UDP buffers aligned and fin_timeout reduced.
```ini
# Align TCP buffers with core maximums
net.ipv4.tcp_rmem = 4096 131072 26214400
net.ipv4.tcp_wmem = 4096 65536 26214400
# UDP buffer sizing for WebRTC media streams
net.core.rmem_max = 26214400
net.core.wmem_max = 26214400
net.ipv4.udp_rmem_min = 65536
net.ipv4.udp_wmem_min = 65536
# Reduce latency for short-lived TURN connections
net.ipv4.tcp_fin_timeout = 10
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_intvl = 30
```
> **BBR note:** `tcp_congestion_control = bbr` and `default_qdisc = fq` require host-level sysctl — cannot be set inside an unprivileged LXC. Apply on the Proxmox host to benefit all containers:
> ```bash
> echo "net.ipv4.tcp_congestion_control = bbr" >> /etc/sysctl.d/99-bbr.conf
> echo "net.core.default_qdisc = fq" >> /etc/sysctl.d/99-bbr.conf
> sysctl --system
> ```
### 6. Synapse Federation Hardening
The server is effectively a private server for friends. Restricting federation prevents abuse and reduces load. Add to `homeserver.yaml`:
```yaml
# Allow federation only with specific trusted servers
federation_domain_whitelist:
  - matrix.org            # Keep for bridging if needed
  - matrix.lotusguild.org
# OR to go fully closed (recommended for friends-only): set an empty
# whitelist — Synapse has no single `federation_enabled` switch.
# federation_domain_whitelist: []
```
### 7. Bot E2EE Key Fix (LXC 151) ✅ Applied
`nio_store/` cleared and bot restarted cleanly. Megolm session errors resolved.
---
## Custom Cinny Client (chat.lotusguild.org)
Cinny v4 is the preferred client — clean UI, Cinny-style rendering already used by the bot's Wordle tiles. We build from source to get voice support and full branding control.
### Why Cinny over Element Web
- Much cleaner aesthetics, already the de-facto client for guild members
- Element Web voice suppression (Krisp) is only on `app.element.io` — a custom build loses it
- Cinny `add-joined-call-controls` branch uses `@element-hq/element-call-embedded` which talks to the **existing** MatrixRTC → lk-jwt-service → LiveKit stack with zero new infrastructure
- Static build (nginx serving ~5MB of files) — nearly zero runtime resource cost
### Voice support status (as of March 2026)
The official `add-joined-call-controls` branch (maintained by `ajbura`, last commit March 8 2026) embeds Element Call as a widget via `@element-hq/element-call-embedded: 0.16.3`. This uses the same MatrixRTC protocol that lk-jwt-service already handles. Two direct LiveKit integration PRs (#2703, #2704) were proposed but closed without merge — so the embedded Element Call approach is the official path.
Since lk-jwt-service is already running on LXC 151 and configured for `wss://matrix.lotusguild.org`, voice calls will work out of the box once the Cinny build is deployed.
### LXC Setup
**Create the LXC** (run on the host):
```bash
# ProxmoxVE Debian 13 community script
bash -c "$(curl -fsSL https://raw.githubusercontent.com/community-scripts/ProxmoxVE/main/ct/debian.sh)"
```
Recommended settings: 2GB RAM, 1-2 vCPUs, 20GB disk, Debian 13, static IP on VLAN 10 (e.g. `10.10.10.XX`).
**Inside the new LXC:**
```bash
# Install nginx + git + nvm dependencies
apt update && apt install -y nginx git curl
# Install Node.js 24 via nvm
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.1/install.sh | bash
source ~/.bashrc
nvm install 24
nvm use 24
# Clone Cinny and switch to voice-support branch
git clone https://github.com/cinnyapp/cinny.git /opt/cinny
cd /opt/cinny
git checkout add-joined-call-controls
# Install dependencies and build
npm ci
NODE_OPTIONS=--max_old_space_size=4096 npm run build
# Output: /opt/cinny/dist/
# Deploy to nginx root
cp -r /opt/cinny/dist/* /var/www/html/
```
**Configure Cinny** — edit `/var/www/html/config.json`:
```json
{
  "defaultHomeserver": 0,
  "homeserverList": ["matrix.lotusguild.org"],
  "allowCustomHomeservers": false,
  "featuredCommunities": {
    "openAsDefault": false,
    "spaces": [],
    "rooms": [],
    "servers": []
  },
  "hashRouter": {
    "enabled": false,
    "basename": "/"
  }
}
```
**Nginx config** — `/etc/nginx/sites-available/cinny` (matches the official `docker-nginx.conf`):
```nginx
server {
    listen 80;
    listen [::]:80;
    server_name chat.lotusguild.org;
    root /var/www/html;
    index index.html;

    location / {
        rewrite ^/config.json$ /config.json break;
        rewrite ^/manifest.json$ /manifest.json break;
        rewrite ^/sw.js$ /sw.js break;
        rewrite ^/pdf.worker.min.js$ /pdf.worker.min.js break;
        rewrite ^/public/(.*)$ /public/$1 break;
        rewrite ^/assets/(.*)$ /assets/$1 break;
        rewrite ^(.+)$ /index.html break;
    }
}
```
```bash
ln -s /etc/nginx/sites-available/cinny /etc/nginx/sites-enabled/
nginx -t && systemctl reload nginx
```
Then in **NPM**: add a proxy host for `chat.lotusguild.org` → `http://10.10.10.XX:80` with SSL.
### Rebuilding after updates
```bash
cd /opt/cinny
git pull
npm ci
NODE_OPTIONS=--max_old_space_size=4096 npm run build
cp -r dist/* /var/www/html/
# Preserve your config.json — it gets overwritten by the copy above, so:
# Option: keep config.json outside dist and symlink/copy it in after each build
```
### Key paths (Cinny LXC 106 — 10.10.10.6)
- Source: `/opt/cinny/` (branch: `add-joined-call-controls`)
- Built files: `/var/www/html/`
- Cinny config: `/var/www/html/config.json`
- Config backup (survives rebuilds): `/opt/cinny-config.json`
- Nginx site config: `/etc/nginx/sites-available/cinny`
- Rebuild script: `/usr/local/bin/cinny-update`
---
## Server Checklist
### Quality of Life
- [x] Migrate from SQLite to PostgreSQL
- [x] TURN/STUN server (coturn) for reliable voice/video
- [x] URL previews
- [x] Upload size limit 200MB
- [x] Full-text message search (PostgreSQL backend)
- [x] Media retention policy (remote: 1yr, local: 3yr)
- [x] Sliding sync (native Synapse)
- [x] LiveKit for Element Call video rooms
- [x] Default room version v12, all rooms upgraded
- [x] Landing page with client recommendations (Cinny, Commet, Element, Element X mobile)
- [x] Synapse metrics endpoint (port 9000, Prometheus-compatible)
- [ ] Push notifications gateway (Sygnal) for mobile clients
- [x] LiveKit port range expanded to 50000-51000 for voice call capacity
- [x] Custom Cinny client LXC 106 (10.10.10.6) — Debian 13, Cinny 4.10.5 built from `add-joined-call-controls`, nginx serving, HA enabled
- [x] NPM proxy entry for `chat.lotusguild.org` → 10.10.10.6:80, SSL via Cloudflare DNS challenge, HTTPS forced, HTTP/2 + HSTS enabled
- [x] Cinny weekly auto-update cron (`/etc/cron.d/cinny-update`, Sundays 3am, logs to `/var/log/cinny-update.log`)
- [ ] Cinny custom branding — Lotus Guild theme (colors, title, favicon, PWA name)
### Performance Tuning
- [x] PostgreSQL `shared_buffers` → 1500MB, `effective_cache_size`, `work_mem`, checkpoint tuning applied
- [x] PostgreSQL `pg_stat_statements` extension installed in `synapse` database
- [x] PostgreSQL autovacuum tuned per-table (`state_groups_state`, `events`, `receipts_linearized`, `receipts_graph`, `device_lists_stream`, `presence_stream`), `autovacuum_max_workers` → 5
- [x] Synapse `event_cache_size` → 30K, `_get_state_group_for_events` cache factor added
- [x] sysctl TCP/UDP buffer alignment applied to LXC 151 (`/etc/sysctl.d/99-matrix-tuning.conf`)
- [x] LiveKit room `empty_timeout: 300`, `departure_timeout: 20`, `max_participants: 50`
- [x] LiveKit ICE port range expanded to 50000-51000
- [x] LiveKit TURN TTL reduced from 24h to 1h
- [x] LiveKit VP9/AV1 codecs enabled (`video_codecs: [VP8, H264, VP9, AV1]`)
- [ ] BBR congestion control — must be applied on Proxmox host, not inside LXC (see Known Issues)
### Auth & SSO
- [x] Token-based registration
- [x] SSO/OIDC via Authelia
- [x] `allow_existing_users: true` for linking accounts to SSO
- [x] Password auth alongside SSO
### Webhooks & Integrations
- [x] matrix-hookshot 7.3.2 installed and running
- [x] Generic webhook bridge for 11 active services (Grafana, Proxmox, Sonarr, Radarr, Readarr, Lidarr, Uptime Kuma, Seerr, Owncast/Livestream, Bazarr, Tinker-Tickets)
- [x] Per-service JS transformation functions — all rewritten to handle full event payloads (all event types, health alerts, app updates, release groups, download clients)
- [x] Per-service virtual user avatars
- [x] NPM reverse proxy for `/webhook` path
- [x] Tinker Tickets custom transformation code
### Room Structure
- [x] The Lotus Guild space
- [x] All core rooms with correct power levels and join rules
- [x] Spam and Stuff room for service notifications (hookshot)
- [x] Custom room avatars
### Hardening
- [x] Rate limiting
- [x] E2EE on all rooms (except Spam and Stuff — intentional for hookshot)
- [x] coturn internal peer deny rules (blocks relay to RFC1918 except allowed subnet)
- [x] `pg_hba.conf` locked down — remote access restricted to Synapse LXC (10.10.10.29) only
- [x] Federation enabled with key verification (open for invite-only growth to friends/family/coworkers)
- [x] fail2ban on Synapse login endpoint (5 retries / 24h ban, LXC 151)
- [x] Synapse metrics port 9000 restricted to `127.0.0.1` + `10.10.10.29` (was `0.0.0.0`)
- [x] coturn cert auto-renewal — daily sync cron on compute-storage-01 copies NPM cert → coturn
- [x] `/.well-known/matrix/client` and `/server` live on lotusguild.org (NPM advanced config)
- [x] `suppress_key_server_warning: true` in homeserver.yaml
- [ ] Federation allow/deny lists for known bad actors
- [ ] Regular Synapse updates
- [x] Automated database + media backups
### Monitoring
- [x] Synapse metrics endpoint (port 9000, Prometheus-compatible)
- [x] Uptime Kuma monitors added: Synapse HTTP, LiveKit TCP, PostgreSQL TCP, Cinny Web, coturn TCP 3478, lk-jwt-service, Hookshot
- [ ] Uptime Kuma: coturn UDP STUN monitoring (requires push/heartbeat — no native UDP type in Kuma)
- [x] Grafana dashboard — custom Synapse dashboard at `dashboard.lotusguild.org/d/matrix-synapse-dashboard/matrix-synapse` (140+ panels, see Monitoring section below)
- [x] Prometheus scraping all Matrix services: Synapse, Hookshot, LiveKit, matrix-node, postgres-node, matrix-admin, postgres, postgres-exporter
- [x] node_exporter installed on LXC 151 (Matrix) and LXC 109 (PostgreSQL)
- [x] LiveKit Prometheus metrics enabled (`prometheus_port: 6789`)
- [x] Hookshot metrics enabled (`metrics: { enabled: true }`) on dedicated port 9004
- [x] Grafana alert rules — 9 Matrix/infra alerts active (see Alert Rules section below)
- [x] Duplicate Grafana "Infrastructure" folder merged and deleted
### Admin
- [x] Synapse admin API dashboard (synapse-admin at http://10.10.10.29:8080)
- [x] Power levels per room
- [ ] Draupnir moderation bot (new LXC or alongside existing bot)
- [ ] Cinny custom branding (Lotus Guild theme — colors, title, favicon, PWA name)
- [ ] **Storj node update** — `storj_uptodate=0` on LXC 138 (10.10.10.133), risk of disqualification
---
## Improvement Audit (March 2026)
Comprehensive audit of the current infrastructure against official documentation and security best practices. Applied March 9 2026.
### Priority Summary
| Issue | Severity | Status |
|-------|----------|--------|
| coturn TLS cert expires May 12 — no auto-renewal | **CRITICAL** | ✅ Fixed — daily sync cron on compute-storage-01 copies NPM-renewed cert to coturn |
| Synapse metrics port 9000 bound to `0.0.0.0` | **HIGH** | ✅ Fixed — now binds `127.0.0.1` + `10.10.10.29` (Prometheus still works, internet blocked) |
| `/.well-known/matrix/client` returns 404 | MEDIUM | ✅ Fixed — NPM lotusguild.org proxy host updated, live at `https://lotusguild.org/.well-known/matrix/client` |
| `suppress_key_server_warning` not set | MEDIUM | ✅ Fixed — added to homeserver.yaml |
| No fail2ban on `/_matrix/client/.*/login` | MEDIUM | ✅ Fixed — fail2ban installed, matrix-synapse jail active (5 retries / 24h ban) |
| No media purge cron (retention policy set but never triggers) | MEDIUM | ✅ N/A — `media_retention` block already in homeserver.yaml; Synapse runs the purge internally on schedule |
| PostgreSQL autovacuum not tuned per-table | LOW | ✅ Fixed — all 5 high-churn tables tuned, `autovacuum_max_workers` → 5 |
| Hookshot metrics scrape unconfirmed | LOW | ✅ Fixed — `metrics: { enabled: true }` added to config, metrics split to dedicated port 9004, Prometheus scraping confirmed |
| LiveKit VP9/AV1 codec support | LOW | ✅ Applied — `video_codecs: [VP8, H264, VP9, AV1]` added to livekit config |
| Federation allow/deny list not configured | LOW | Pending — Mjolnir/Draupnir on roadmap |
| Sygnal push notifications not deployed | INFO | Deferred |
---
### 1. coturn Cert Auto-Renewal ✅
The coturn cert is managed by NPM (cert ID 91, stored at `/etc/letsencrypt/live/npm-91/` on LXC 139). NPM renews it automatically. A sync script on `compute-storage-01` detects when NPM renews and copies it to coturn.
**Deployed:** `/usr/local/bin/coturn-cert-sync.sh` on compute-storage-01, cron `/etc/cron.d/coturn-cert-sync` (runs 03:30 daily).
Script compares cert expiry dates between LXC 139 and LXC 151. If they differ (NPM renewed), it copies `fullchain.pem` + `privkey.pem` and restarts coturn.
**Additional coturn hardening — ✅ Applied March 2026:**
```
# /etc/turnserver.conf
stale-nonce=600 # Nonce expires 600s (prevents replay attacks)
user-quota=100 # Max concurrent relay allocations per user
total-quota=1000 # Total relay allocations server-wide
cipher-list=ECDHE-RSA-AES256-GCM-SHA384:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-CHACHA20-POLY1305
```
---
### 2. Synapse Configuration Gaps
**a) Metrics port exposed to 0.0.0.0 (HIGH)**
Port 9000 currently binds `0.0.0.0` — exposes internal state, user counts, DB query times externally. Fix in `homeserver.yaml`:
```yaml
metrics_flags:
  some_legacy_unrestricted_resources: false
listeners:
  - port: 9000
    bind_addresses: ['127.0.0.1'] # NOT 0.0.0.0
    type: metrics
    resources: []
```
Prometheus at `10.10.10.48` scrapes port 9000 from within the VLAN, so this is safe to lock down.
**b) suppress_key_server_warning (MEDIUM)**
Fills Synapse logs with noise on every restart. One line in `homeserver.yaml`:
```yaml
suppress_key_server_warning: true
```
**c) Database connection pooling (LOW — track for growth)**
Current defaults (`cp_min: 5`, `cp_max: 10`) are fine for single-process. When adding workers, increase `cp_max` to 20-30 per worker group. Add explicitly to `homeserver.yaml` to make it visible:
```yaml
database:
  name: psycopg2
  args:
    cp_min: 5
    cp_max: 10
```
---
### 3. Matrix Well-Known 404
`/.well-known/matrix/client` returns 404. This breaks client autodiscovery — users who type `lotusguild.org` instead of `matrix.lotusguild.org` get an error. Fix in NPM with a custom location block on the `lotusguild.org` proxy host:
```nginx
location /.well-known/matrix/client {
add_header Content-Type application/json;
add_header Access-Control-Allow-Origin *;
return 200 '{"m.homeserver":{"base_url":"https://matrix.lotusguild.org"}}';
}
location /.well-known/matrix/server {
add_header Content-Type application/json;
add_header Access-Control-Allow-Origin *;
return 200 '{"m.server":"matrix.lotusguild.org:443"}';
}
```
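A quick local check that the inlined `return 200` bodies are valid JSON with the expected keys (a stray quote in an nginx `return` directive is an easy way to silently break autodiscovery):

```javascript
// The exact strings from the two `return 200` directives above.
const client = JSON.parse('{"m.homeserver":{"base_url":"https://matrix.lotusguild.org"}}');
const server = JSON.parse('{"m.server":"matrix.lotusguild.org:443"}');

console.log(client["m.homeserver"].base_url); // https://matrix.lotusguild.org
console.log(server["m.server"]);              // matrix.lotusguild.org:443
```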
---
### 4. fail2ban for Synapse Login
No brute-force protection on `/_matrix/client/*/login`. Easy win.
**`/etc/fail2ban/jail.d/matrix-synapse.conf`:**
```ini
[matrix-synapse]
enabled = true
port = http,https
filter = matrix-synapse
logpath = /var/log/matrix-synapse/homeserver.log
backend = systemd
journalmatch = _SYSTEMD_UNIT=matrix-synapse.service + PRIORITY=3
findtime = 600
maxretry = 5
bantime = 86400
```
**`/etc/fail2ban/filter.d/matrix-synapse.conf`:**
```ini
[Definition]
failregex = ^.*Failed (password|SAML) login attempt for user .* from <HOST>.*$
            ^.*<HOST>.*"POST /.*login.*" 401.*$
ignoreregex = ^.*"GET /sync.*".*$
```
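fail2ban evaluates `failregex` with Python's `re` module, with `<HOST>` as its IP-capture token. As a rough local smoke test, the login-failure pattern can be exercised with `<HOST>` swapped for an explicit IPv4 group — the log line below is hypothetical, not a real Synapse log excerpt:

```javascript
// Login-failure pattern with <HOST> replaced by an IPv4-capturing group.
const failregex =
  /^.*Failed (password|SAML) login attempt for user .* from (\d{1,3}(?:\.\d{1,3}){3}).*$/;

const line =
  "synapse.handlers.auth: Failed password login attempt for user @bob:lotusguild.org from 203.0.113.7";

const m = line.match(failregex);
console.log(m ? m[2] : "no match"); // 203.0.113.7
```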
---
### 5. Synapse Media Purge Cron
Retention policy is configured (remote 1yr, local 3yr). Per the audit above, Synapse's `media_retention` block runs this purge internally on schedule, so no cron is strictly required — the admin-API script below is kept as an optional ad-hoc purge for cached remote media.
**`/usr/local/bin/purge-synapse-media.sh`** (create on LXC 151):
```bash
#!/bin/bash
ADMIN_TOKEN="syt_your_admin_token"
# Purge remote media (cached from other homeservers) older than 90 days
CUTOFF_TS=$(($(date +%s000) - 7776000000))
curl -X POST \
"http://localhost:8008/_synapse/admin/v1/purge_media_cache?before_ts=$CUTOFF_TS" \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-s -o /dev/null
echo "$(date): Synapse remote media purge completed" >> /var/log/synapse-purge.log
```
```bash
chmod +x /usr/local/bin/purge-synapse-media.sh
echo "0 4 * * * root /usr/local/bin/purge-synapse-media.sh" > /etc/cron.d/synapse-purge
```
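The cutoff arithmetic in the script, spelled out (`date +%s000` appends three zeros to epoch seconds, i.e. milliseconds; the constant is 90 days in ms):

```javascript
// 90 days expressed in milliseconds — the 7776000000 constant in the script.
const ninetyDaysMs = 90 * 86400 * 1000;
console.log(ninetyDaysMs); // 7776000000

// Equivalent of CUTOFF_TS=$(($(date +%s000) - 7776000000)):
const cutoffTs = Date.now() - ninetyDaysMs;
console.log(cutoffTs < Date.now()); // true
```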
---
### 6. PostgreSQL Autovacuum Per-Table Tuning
The high-churn Synapse tables (`state_groups_state`, `events`, `receipts`) are not tuned for aggressive autovacuum. As the DB grows, bloat accumulates and queries slow down. Run on LXC 109 (PostgreSQL):
```sql
-- state_groups_state: biggest bloat source
-- (autovacuum_naptime is server-wide only and cannot be set per-table)
ALTER TABLE state_groups_state SET (
  autovacuum_vacuum_scale_factor = 0.01,
  autovacuum_analyze_scale_factor = 0.005,
  autovacuum_vacuum_cost_delay = 5
);
-- events: second priority
ALTER TABLE events SET (
  autovacuum_vacuum_scale_factor = 0.02,
  autovacuum_analyze_scale_factor = 0.01,
  autovacuum_vacuum_cost_delay = 5
);
-- receipts and stream tables (Synapse uses receipts_linearized / receipts_graph, not a plain receipts table)
ALTER TABLE receipts_linearized SET (autovacuum_vacuum_scale_factor = 0.01, autovacuum_vacuum_cost_delay = 5);
ALTER TABLE receipts_graph SET (autovacuum_vacuum_scale_factor = 0.01, autovacuum_vacuum_cost_delay = 5);
ALTER TABLE device_lists_stream SET (autovacuum_vacuum_scale_factor = 0.02);
ALTER TABLE presence_stream SET (autovacuum_vacuum_scale_factor = 0.02);
```
Also bump `autovacuum_max_workers` from 3 → 5:
```sql
ALTER SYSTEM SET autovacuum_max_workers = 5;
SELECT pg_reload_conf();
```
**Monitor vacuum health:**
```sql
SELECT relname, last_autovacuum, n_dead_tup, n_live_tup
FROM pg_stat_user_tables
WHERE relname IN ('events', 'state_groups_state', 'receipts_linearized')
ORDER BY n_dead_tup DESC;
```
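Why the small scale factors matter — autovacuum fires when dead tuples exceed `autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor × reltuples` (assuming the default threshold of 50; the 10M row count below is hypothetical):

```javascript
// Dead-tuple count at which autovacuum triggers for a given table size.
const vacuumTrigger = (scaleFactor, rows, threshold = 50) =>
  threshold + scaleFactor * rows;

const rows = 10_000_000; // e.g. a grown state_groups_state

console.log(vacuumTrigger(0.2, rows));  // 2000050 — PostgreSQL default (0.2)
console.log(vacuumTrigger(0.01, rows)); // 100050  — tuned value above
```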
---
### 7. Hookshot Metrics + Grafana
**Hookshot metrics** are exposed at `127.0.0.1:9001/metrics`, but it's unconfirmed whether Prometheus at `10.10.10.48` is scraping them. Verify:
```bash
# On LXC 151
curl http://127.0.0.1:9001/metrics | head -20
```
If Prometheus is scraping, add the hookshot dashboard from the repo:
`contrib/hookshot-dashboard.json` → import into Grafana.
**Grafana Synapse dashboard** — Prometheus is already scraping Synapse at port 9000. Import the official dashboard:
- Grafana → Dashboards → Import → ID `18618` (Synapse Monitoring)
- Set Prometheus datasource → done
- Shows room count, message rates, federation lag, cache hit rates, DB query times in real time
---
### 8. Federation Security
Currently: open federation with key verification (correct for invite-only friends server). Recommended additions:
**Server-level allow/deny in `homeserver.yaml`** (optional, for closing federation entirely):
```yaml
# Fully closed (recommended long-term for private guild):
federation_enabled: false
# OR: whitelist-only federation
federation_domain_whitelist:
- matrix.lotusguild.org
- matrix.org # Keep if bridging needed
```
**Per-room ACLs** for reactive blocking of specific bad servers:
```json
{
"type": "m.room.server_acl",
"content": {
"allow": ["*"],
"deny": ["spam.example.com"]
}
}
```
**Mjolnir/Draupnir** (already on roadmap) handles this automatically with ban list subscriptions (t2bot spam lists etc).
---
### 9. Sygnal Push Notifications
Sygnal is the official Matrix push gateway for mobile (Element X on iOS/Android). Without it, notifications don't arrive when the app is backgrounded.
**Requirements:**
- Apple Developer account (APNS cert) for iOS
- Firebase project (FCM API key) for Android
- New LXC or run alongside existing services
**Basic config (`/etc/sygnal/sygnal.yaml`):**
```yaml
http:
  bind_addresses: ['0.0.0.0']
  port: 8765
database:
  name: psycopg2
  args:
    user: sygnal
    password: <password>
    dbname: sygnal
apps:
  com.element.android:
    type: gcm
    api_key: <FCM_API_KEY>
  im.riot.x.ios:
    type: apns
    platform: production
    certfile: /etc/sygnal/apns/element-x-cert.pem
    topic: im.riot.x.ios
```
**Synapse integration:** nothing is needed in `homeserver.yaml` — Matrix clients register their own pushers and supply the gateway URL themselves, so Sygnal just has to be reachable at the notify endpoint the apps use (`/_matrix/push/v1/notify`, e.g. proxied behind NPM).
---
### 10. LiveKit VP9/AV1 + Dynacast (Quality Improvement)
Currently H264 only. Enabling VP9/AV1 unlocks Dynacast (pauses video layers no one is watching) which significantly reduces bandwidth/CPU for low-viewer rooms.
**`/etc/livekit/config.yaml` additions:**
```yaml
video:
  codecs:
    - mime: video/H264
      fmtp: "level-asymmetry-allowed=1;packetization-mode=1;profile-level-id=42e01e"
    - mime: video/VP9
      fmtp: "profile=0"
    - mime: video/AV1
      fmtp: "profile=0"
  dynacast: true
```
Note: Dynacast only works with VP9 or AV1 (SVC-capable codecs). H264 subscribers continue to work normally alongside VP9/AV1 subscribers.
---
### 11. Synapse Workers (Future Scaling Reference)
A single Synapse process handles roughly 100–300 concurrent users before the Python GIL becomes the bottleneck. Workers are not needed now, but are documented here for when usage grows.
**Stage 1 trigger:** Synapse CPU >80% consistently, or >200 concurrent users.
**First workers to add:**
```yaml
# /etc/matrix-synapse/workers/client-reader-1.yaml
# Dedicated worker apps (synapse.app.client_reader etc.) are deprecated;
# all workers now run synapse.app.generic_worker.
worker_app: synapse.app.generic_worker
worker_name: client-reader-1
worker_listeners:
  - type: http
    port: 8011
    resources:
      - names: [client]
```
Add a federation sender next (a `generic_worker` listed under `federation_sender_instances` in `homeserver.yaml`) to off-load outgoing federation from the main process, then route event-sending endpoints to their own worker for write-heavy loads. Redis is required at Stage 2 (500+ users) for inter-worker coordination.
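Workers also need the main process to expose a replication listener and an `instance_map` entry in `homeserver.yaml`. A sketch following the Synapse worker docs (port 9093 is the conventional choice, an assumption here):

```yaml
# homeserver.yaml (main process) — sketch; port 9093 is an assumption
listeners:
  - port: 9093
    bind_addresses: ['127.0.0.1']
    type: http
    resources:
      - names: [replication]

instance_map:
  main:
    host: 127.0.0.1
    port: 9093
```

The reverse proxy then routes the client endpoints the worker serves to its port (8011 in the example above).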
---
## Monitoring & Observability (March 2026)
### Prometheus Scrape Jobs
All Matrix-related services scraped by Prometheus at `10.10.10.48` (LXC 118):
| Job | Target | Metrics |
|-----|--------|---------|
| `synapse` | `10.10.10.29:9000` | Full Synapse internals (events, federation, caches, DB, HTTP) |
| `matrix-admin` | `10.10.10.29:9101` | DAU, MAU, room/user/media totals |
| `livekit` | `10.10.10.29:6789` | Rooms, participants, packets, forward latency, quality |
| `hookshot` | `10.10.10.29:9004` | Connections by service, API calls/failures, Node.js runtime |
| `matrix-node` | `10.10.10.29:9100` | CPU, RAM, network, disk space, load avg (Matrix LXC host) |
| `postgres` | `10.10.10.44:9187` | pg_stat_database, connections, WAL, block I/O |
| `postgres-node` | `10.10.10.44:9100` | CPU, RAM, network, disk space, load avg (PostgreSQL LXC host) |
| `postgres-exporter-2` | `10.10.10.160:9711` | Secondary postgres exporter |
> **Disk I/O note:** All servers use Ceph-backed storage. Per-device disk I/O metrics are meaningless; use Network I/O panels to see actual storage traffic.
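As a sketch, the first few jobs in the table translate into `prometheus.yml` roughly as follows (the 15s interval is an assumption; Synapse exposes its metrics at `/_synapse/metrics`, not `/metrics` — verify against the live config):

```yaml
# /etc/prometheus/prometheus.yml — sketch of three of the jobs above
scrape_configs:
  - job_name: synapse
    metrics_path: /_synapse/metrics   # Synapse does not serve /metrics
    scrape_interval: 15s
    static_configs:
      - targets: ['10.10.10.29:9000']
  - job_name: livekit
    static_configs:
      - targets: ['10.10.10.29:6789']
  - job_name: postgres
    static_configs:
      - targets: ['10.10.10.44:9187']
```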
### Grafana Dashboard
**URL:** `https://dashboard.lotusguild.org/d/matrix-synapse-dashboard/matrix-synapse`
140+ panels across 21 sections:
| Section | Key panels |
|---------|-----------|
| Synapse Overview | Up status, users, rooms, DAU/MAU, media, federation peers |
| Synapse Process Health | CPU, memory, FDs, thread pool, GC, Twisted reactor |
| HTTP API Requests | Rate, response codes, p99/p50 latency, in-flight, DB txn time |
| Federation | Outgoing/incoming PDUs, queue depth, staging, known servers |
| Events & Rooms | Event persistence, notifier, sync responses |
| Presence & Push | Presence updates, pushers, state transitions |
| Rate Limiting | Rejections, sleeps, queue wait time p99 |
| Users & Registration | Login rate, registration rate, growth over time |
| Synapse Database Performance | Txn rate/duration, schedule latency, query latency |
| Synapse Caches | Hit rate (top 5), sizes, evictions, response cache |
| Event Processing & Lag | Lag by processor, stream positions, event fetch ongoing |
| State Resolution | Forward extremities, state resolution CPU, state groups |
| App Services (Hookshot) | Events sent, transactions sent vs failed |
| HTTP Push | Push processed vs failed, badge updates |
| Sliding Sync & Slow Endpoints | Sliding sync p99, slowest endpoints, rate limit wait |
| Background Processes | In-flight by name, start rate, CPU, scheduler tasks |
| PostgreSQL Database | Size, connections, transactions, block I/O, WAL, locks |
| LiveKit SFU | Rooms, participants, network, packets out/dropped, forward latency |
| Hookshot | Matrix API calls/failures, active connections, Node.js event loop lag |
| Matrix LXC Host | CPU, RAM, network (incl. Ceph), load average, disk space |
| PostgreSQL LXC Host | CPU, RAM, network (incl. Ceph), load average, disk space |
### Alert Rules
All alerts are Grafana-native (Alerting → Alert Rules). Current active rules:
**Matrix folder (`matrix-folder`):**
| Alert | Fires when | Severity |
|-------|-----------|----------|
| Synapse Down | `up{job="synapse"}` < 1 for 2m | critical |
| PostgreSQL Down | `pg_up` < 1 for 2m | critical |
| LiveKit Down | `up{job="livekit"}` < 1 for 2m | critical |
| Hookshot Down | `up{job="hookshot"}` < 1 for 2m | critical |
| PG Connection Saturation | connections > 80% of max for 5m | warning |
| Federation Queue Backing Up | pending PDUs > 100 for 10m | warning |
| Synapse High Memory | RSS > 2000MB for 10m | warning |
| Synapse High Response Time | p99 latency (excl. /sync) > 10s for 5m | warning |
| Synapse Event Processing Lag | any processor > 30s behind for 5m | warning |
| Synapse DB Query Latency High | p99 query time > 1s for 5m | warning |
**Infrastructure folder (`infra-folder`):**
| Alert | Fires when | Severity |
|-------|-----------|----------|
| Service Exporter Down | any `up == 0` for 3m | critical |
| Node High CPU Usage | CPU > 90% for 10m | warning |
| Node High Memory Usage | RAM > 90% for 10m | warning |
| Node Disk Space Low | available < 15% (excl. tmpfs/overlay) for 10m | warning |
**Prometheus rules (`/etc/prometheus/prometheus_rules.yml`):**
| Alert | Fires when |
|-------|-----------|
| InstanceDown | any `up == 0` for 1m |
| DiskSpaceFree10Percent | available < 10% (excl. tmpfs/overlay) for 5m |
> **`/sync` long-poll note:** The Matrix `/sync` endpoint is a long-poll (clients hold it open ≤30s). It is excluded from the High Response Time alert to prevent false positives. Without exclusion, p99 reads ~10s even when the server is healthy.
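Concretely, the exclusion is a label filter on Synapse's response-time histogram. Roughly this PromQL (metric and servlet label names come from Synapse's exported metrics; the production rule lives in Grafana and may differ in detail):

```promql
histogram_quantile(0.99,
  sum by (le) (
    rate(synapse_http_server_response_time_seconds_bucket{servlet!="SyncRestServlet"}[5m])
  )
) > 10
```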
### Known Alert False Positives / Watch Items
- **Synapse Event Processing Lag** — can fire transiently after Synapse restart while processors catch up on backlog. Self-resolves in 10–20 minutes. If it grows continuously (>10 min) and doesn't plateau, restart Synapse.
- **Node Disk Space Low** — excludes `tmpfs`, `overlay`, `squashfs`, `devtmpfs`, and `/boot`/`/run` mounts. If new filesystem types appear, add them to the `fstype!~` filter in the rule.
---
## Bot Checklist
### Core
- [x] matrix-nio async client with E2EE
- [x] Device trust (auto-trust all devices)
- [x] Graceful shutdown (SIGTERM/SIGINT)
- [x] Initial sync token (ignores old messages on startup)
- [x] Auto-accept room invites
- [x] Deployed as systemd service (`matrixbot.service`) on LXC 151
- [x] Fix E2EE key errors — full store + credentials wipe, fresh device registration (`BBRZSEUECZ`); stale devices removed via admin API
### Commands
- [x] `!help` — list commands
- [x] `!ping` — latency check
- [x] `!8ball <question>` — magic 8-ball
- [x] `!fortune` — fortune cookie
- [x] `!flip` — coin flip
- [x] `!roll <NdS>` — dice roller
- [x] `!random <min> <max>` — random number
- [x] `!rps <choice>` — rock paper scissors
- [x] `!poll <question>` — poll with reactions
- [x] `!trivia` — trivia game (reactions, 30s reveal)
- [x] `!champion [lane]` — random LoL champion
- [x] `!agent [role]` — random Valorant agent
- [x] `!wordle` — full Wordle game (daily, hard mode, stats, share)
- [x] `!minecraft <username>` — RCON whitelist add
- [x] `!ask <question>` — Ollama LLM (lotusllm, 2min cooldown)
- [x] `!health` — bot uptime + service status
### Welcome System
- [x] Watches Space joins and DMs new members automatically
- [x] React-to-join: react with ✅ in DM → bot invites to General, Commands, Memes
- [x] Welcome event ID persisted to `welcome_state.json`
### Wordle
- [x] Daily puzzles with two-pass letter evaluation
- [x] Hard mode with constraint validation
- [x] Stats persistence (`wordle_stats.json`)
- [x] Cinny-compatible rendering (inline `<span>` tiles)
- [x] DM-based gameplay, `!wordle share` posts result to public room
- [x] Virtual keyboard display
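The two-pass letter evaluation noted above works like this (an illustrative sketch, not the bot's exact `wordle.py` code): greens are marked first and unmatched answer letters counted, then yellows consume those counts so duplicate letters score correctly.

```python
from collections import Counter

def score_guess(guess: str, answer: str) -> list[str]:
    """Two-pass Wordle scoring: greens first, then yellows
    drawn from the pool of unmatched answer letters."""
    result = ["absent"] * len(guess)
    remaining = Counter()
    # Pass 1: exact matches; count answer letters not matched in place.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "correct"
        else:
            remaining[a] += 1
    # Pass 2: present-elsewhere, consuming each leftover letter once.
    for i, g in enumerate(guess):
        if result[i] == "absent" and remaining[g] > 0:
            result[i] = "present"
            remaining[g] -= 1
    return result
```

For example, `score_guess("speed", "abide")` marks only the first `e` as present, because the answer contains a single `e`.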
---
## Tech Stack
| Component | Technology | Version |
|-----------|-----------|---------|
| Bot language | Python 3 | 3.x |
| Bot library | matrix-nio (E2EE) | latest |
| Homeserver | Synapse | 1.148.0 |
| Database | PostgreSQL | 17.9 |
| TURN | coturn | latest |
| Video/voice calls | LiveKit SFU | 1.9.11 |
| LiveKit JWT | lk-jwt-service | latest |
| SSO | Authelia (OIDC) + LLDAP | — |
| Webhook bridge | matrix-hookshot | 7.3.2 |
| Reverse proxy | Nginx Proxy Manager | — |
| Web client | Cinny (custom build, `add-joined-call-controls` branch) | 4.10.5+ |
| Bot dependencies | matrix-nio[e2ee], aiohttp, python-dotenv, mcrcon | — |
## Bot Files
```
matrixBot/
├── bot.py # Entry point, client setup, event loop
├── callbacks.py # Message + reaction event handlers
├── commands.py # All command implementations
├── config.py # Environment config + validation
├── utils.py # send_text, send_html, send_reaction, get_or_create_dm
├── welcome.py # Welcome message + react-to-join logic
├── wordle.py # Full Wordle game engine
├── wordlist_answers.py # Wordle answer word list
├── wordlist_valid.py # Wordle valid guess word list
├── .env.example # Environment variable template
└── requirements.txt # Python dependencies
```