Files
matrix/README.md
Jared Vititoe 18c4ea14d4 docs: clean up README — remove stale audit sections, update versions, add Draupnir
- Remove all verbose Improvement Audit sections 1–11 (already applied)
- Remove stale running services table with old uptime/memory numbers
- Update Synapse version 1.148.0 → 1.149.0
- Add Draupnir moderation bot to infrastructure table, key paths, and new Moderation section
- Document active ban lists (community-moderation-effort-bl, matrix-org-coc-bl)
- Mark federation bad-actor blocking , Draupnir deployment 

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 19:43:27 -04:00

395 lines
17 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Lotus Matrix Bot & Server Roadmap
Matrix bot and server infrastructure for the Lotus Guild homeserver (`matrix.lotusguild.org`).
**Repo**: https://code.lotusguild.org/LotusGuild/matrixBot
## Status: Phase 7 — Moderation & Client Customisation
---
## Priority Order
1. ~~PostgreSQL migration~~
2. ~~TURN server~~
3. ~~Room structure + space setup~~
4. ~~Matrix bot (core + commands)~~
5. ~~LiveKit / Element Call~~
6. ~~SSO / OIDC (Authelia)~~
7. ~~Webhook integrations (hookshot)~~
8. ~~Voice stability & quality tuning~~
9. ~~Custom Cinny client (chat.lotusguild.org)~~
10. Custom emoji packs (partially finished)
11. Cinny custom branding (Lotus Guild theme)
12. ~~Draupnir moderation bot~~
13. Push notifications (Sygnal)
---
## Infrastructure
| Service | IP | LXC | RAM | vCPUs | Disk | Versions |
|---------|----|-----|-----|-------|------|----------|
| Synapse | 10.10.10.29 | 151 | 8GB | 4 (Ryzen 9 7900) | 50GB | Synapse 1.149.0, LiveKit 1.9.11, hookshot 7.3.2, coturn latest |
| PostgreSQL 17 | 10.10.10.44 | 109 | 6GB | 3 (Ryzen 9 7900) | 30GB | PostgreSQL 17.9 |
| Cinny Web | 10.10.10.6 | 106 | 256MB | 1 | 8GB | Debian 13, nginx, Node 24, Cinny 4.10.5 |
| Draupnir | 10.10.10.24 | 110 | 1GB | 2 (Ryzen 9 7900) | 10GB | Draupnir v2.9.0, Node.js v22 |
| Prometheus | 10.10.10.48 | 118 | — | — | — | Prometheus — scrapes all Matrix services |
| Grafana | 10.10.10.49 | 107 | — | — | — | Grafana 12.4.0 — dashboard.lotusguild.org |
| NPM | 10.10.10.27 | 139 | — | — | — | Nginx Proxy Manager |
| Authelia | 10.10.10.36 | 167 | — | — | — | SSO/OIDC provider |
| LLDAP | 10.10.10.39 | 147 | — | — | — | LDAP user directory |
| Uptime Kuma | 10.10.10.25 | 101 | — | — | — | Uptime monitoring (micro1 node) |
**Key paths on Synapse LXC (151):**
- Synapse config: `/etc/matrix-synapse/homeserver.yaml`
- Synapse conf.d: `/etc/matrix-synapse/conf.d/` (metrics.yaml, report_stats.yaml, server_name.yaml)
- coturn config: `/etc/turnserver.conf`
- LiveKit config: `/etc/livekit/config.yaml`
- LiveKit service: `livekit-server.service`
- lk-jwt-service: `lk-jwt-service.service` (binds `:8070`, serves JWT tokens for MatrixRTC)
- Hookshot: `/opt/hookshot/`, service: `matrix-hookshot.service`
- Hookshot config: `/opt/hookshot/config.yml`
- Hookshot registration: `/etc/matrix-synapse/hookshot-registration.yaml`
- Bot: `/opt/matrixbot/`, service: `matrixbot.service`
- Landing page: `/var/www/matrix-landing/index.html` (on NPM LXC 139)
**Key paths on Draupnir LXC (110):**
- Install path: `/opt/draupnir/`
- Config: `/opt/draupnir/config/production.yaml`
- Data/SQLite DBs: `/data/storage/`
- Service: `draupnir.service`
- Management room: `#management:matrix.lotusguild.org` (`!mEvR5fe3jMmzwd-FwNygD72OY_yu8H3UP_N-57oK7MI`)
- Bot account: `@draupnir:matrix.lotusguild.org` (power level 100 in all protected rooms)
- Subscribed ban lists: `#community-moderation-effort-bl:neko.dev`, `#matrix-org-coc-bl:matrix.org`
- Rebuild: `NODE_OPTIONS="--max-old-space-size=768" npx tsc --project tsconfig.json`
**Key paths on PostgreSQL LXC (109):**
- PostgreSQL config: `/etc/postgresql/17/main/postgresql.conf`
- Tuning conf.d: `/etc/postgresql/17/main/conf.d/synapse_tuning.conf`
- HBA config: `/etc/postgresql/17/main/pg_hba.conf`
- Data directory: `/var/lib/postgresql/17/main`
**Key paths on Cinny LXC (106):**
- Source: `/opt/cinny/` (branch: `add-joined-call-controls`)
- Built files: `/var/www/html/`
- Cinny config: `/var/www/html/config.json`
- Config backup (survives rebuilds): `/opt/cinny-config.json`
- Nginx site config: `/etc/nginx/sites-available/cinny`
- Rebuild script: `/usr/local/bin/cinny-update`
---
## Port Maps
**Router → 10.10.10.29 (forwarded):**
- TCP+UDP 3478 — TURN/STUN
- TCP+UDP 5349 — TURNS/TLS
- TCP 7881 — LiveKit ICE TCP fallback
- TCP+UDP 49152-65535 — TURN relay range
**Internal port map (LXC 151):**
| Port | Service | Bind |
|------|---------|------|
| 8008 | Synapse HTTP | 0.0.0.0 |
| 9000 | Synapse metrics | 127.0.0.1 + 10.10.10.29 |
| 9001 | Hookshot widgets | 0.0.0.0 |
| 9002 | Hookshot bridge (appservice) | 127.0.0.1 |
| 9003 | Hookshot webhooks | 0.0.0.0 |
| 9004 | Hookshot metrics | 0.0.0.0 |
| 9100 | node_exporter | 0.0.0.0 |
| 9101 | matrix-admin exporter | 0.0.0.0 |
| 6789 | LiveKit metrics | 0.0.0.0 |
| 7880 | LiveKit HTTP | 0.0.0.0 |
| 7881 | LiveKit RTC TCP | 0.0.0.0 |
| 8070 | lk-jwt-service | 0.0.0.0 |
| 8080 | synapse-admin (nginx) | 0.0.0.0 |
| 3478 | coturn STUN/TURN | 0.0.0.0 |
| 5349 | coturn TURNS/TLS | 0.0.0.0 |
**Internal port map (LXC 109 — PostgreSQL):**
| Port | Service | Bind |
|------|---------|------|
| 5432 | PostgreSQL | 0.0.0.0 (hba-restricted to 10.10.10.29) |
| 9100 | node_exporter | 0.0.0.0 |
| 9187 | postgres_exporter | 0.0.0.0 |
---
## Rooms (all v12)
| Room | Room ID | Join Rule |
|------|---------|-----------|
| The Lotus Guild (Space) | `!-1ZBnAH-JiCOV8MGSKN77zDGTuI3pgSdy8Unu_DrDyc` | public |
| General | `!wfokQ1-pE896scu_AOcCBA2s3L4qFo-PTBAFTd0WMI0` | public |
| Commands | `!ou56mVZQ8ZB7AhDYPmBV5_BR28WMZ4x5zwZkPCqjq1s` | restricted (Space members) |
| Memes | `!GK6v5cLEEnowIooQJv5jECfISUjADjt8aKhWv9VbG5U` | restricted (Space members) |
| Management | `!mEvR5fe3jMmzwd-FwNygD72OY_yu8H3UP_N-57oK7MI` | invite |
| Cool Kids | `!R7DT3QZHG9P8QQvX6zsZYxjkKgmUucxDz_n31qNrC94` | invite |
| Spam and Stuff | `!GttT4QYd1wlGlkHU3qTmq_P3gbyYKKeSSN6R7TPcJHg` | invite, **no E2EE** (hookshot) |
**Power level roles (Cinny tags):**
- 100: Owner (jared)
- 50: The Nerdy Council (enhuynh, lonely)
- 48: Panel of Geeks
- 35: Cool Kids
- 0: Member
---
## Webhook Integrations (matrix-hookshot 7.3.2)
Generic webhooks bridged into **Spam and Stuff**.
Each service gets its own virtual user (`@hookshot_<service>`) with a unique avatar.
Webhook URL format: `https://matrix.lotusguild.org/webhook/<uuid>`
| Service | Webhook UUID | Notes |
|---------|-------------|-------|
| Grafana | `df4a1302-2d62-4a01-b858-fb56f4d3781a` | Unified alerting contact point |
| Proxmox | `9b3eafe5-7689-4011-addd-c466e524661d` | Notification system (8.1+) |
| Sonarr | `aeffc311-0686-42cb-9eeb-6757140c072e` | All event types |
| Radarr | `34913454-c1ac-4cda-82ea-924d4a9e60eb` | All event types |
| Readarr | `e57ab4f3-56e6-4dc4-8b30-2f4fd4bbeb0b` | All event types |
| Lidarr | `66ac6fdd-69f6-4f47-bb00-b7f6d84d7c1c` | All event types |
| Uptime Kuma | `1a02e890-bb25-42f1-99fe-bba6a19f1811` | Status change notifications |
| Seerr | `555185af-90a1-42ff-aed5-c344e11955cf` | Request/approval events |
| Owncast (Livestream) | `9993e911-c68b-4271-a178-c2d65ca88499` | STREAM_STARTED / STREAM_STOPPED |
| Bazarr | `470fb267-3436-4dd3-a70c-e6e8db1721be` | Subtitle events (Apprise JSON notifier) |
| Tinker-Tickets | `6e306faf-8eea-4ba5-83ef-bf8f421f929e` | Custom transformation code |
**Hookshot notes:**
- Spam and Stuff is intentionally **unencrypted** — hookshot bridges cannot join E2EE rooms
- JS transformation functions use hookshot v2 API: `result = { version: "v2", plain, html, msgtype }`
- The `result` variable must be assigned without `var`/`let`/`const` (QuickJS IIFE sandbox)
- NPM proxies `https://matrix.lotusguild.org/webhook/*``http://10.10.10.29:9003`
---
## Moderation (Draupnir v2.9.0)
Draupnir runs on LXC 110, manages moderation across all 9 protected rooms via `#management:matrix.lotusguild.org`.
**Subscribed ban lists:**
- `#community-moderation-effort-bl:neko.dev` — 12,599 banned users, 245 servers, 59 rooms
- `#matrix-org-coc-bl:matrix.org` — 4,589 banned users, 220 servers, 2 rooms
**Common commands (send in management room):**
```
!draupnir status — current status + protected rooms
!draupnir ban @user:server * "reason" — ban from all protected rooms
!draupnir redact @user:server — redact their recent messages
!draupnir rooms add !roomid:server — add a room to protection
!draupnir watch <alias> --no-confirm — subscribe to a ban list
```
---
## Known Issues
### coturn TLS Reset Errors
Periodic `TLS/TCP socket error: Connection reset by peer` in coturn logs. Normal — clients probe TURN and drop once they establish a direct P2P path.
### BBR Congestion Control
`net.ipv4.tcp_congestion_control = bbr` must be set on the Proxmox host, not inside an unprivileged LXC. All other sysctl tuning (TCP/UDP buffers, fin_timeout) is applied inside LXC 151.
---
## Server Checklist
### Quality of Life
- [x] Migrate from SQLite to PostgreSQL
- [x] TURN/STUN server (coturn) for reliable voice/video
- [x] URL previews
- [x] Upload size limit 200MB
- [x] Full-text message search (PostgreSQL backend)
- [x] Media retention policy (remote: 1yr, local: 3yr)
- [x] Sliding sync (native Synapse)
- [x] LiveKit for Element Call video rooms
- [x] Default room version v12, all rooms upgraded
- [x] Landing page with client recommendations
- [x] Synapse metrics endpoint (port 9000, Prometheus-compatible)
- [x] Custom Cinny client LXC 106 — Cinny 4.10.5, `add-joined-call-controls` branch, weekly auto-update cron
- [ ] Push notifications gateway (Sygnal) — needs Apple/Google developer credentials
- [ ] Cinny custom branding — Lotus Guild theme (colours, title, favicon, PWA name)
### Performance Tuning
- [x] PostgreSQL `shared_buffers` → 1500MB, `effective_cache_size`, `work_mem`, checkpoint tuning
- [x] PostgreSQL `pg_stat_statements` extension installed
- [x] PostgreSQL autovacuum tuned per-table (5 high-churn tables), `autovacuum_max_workers` → 5
- [x] Synapse `event_cache_size` → 30K, per-cache factors tuned
- [x] sysctl TCP/UDP buffer alignment on LXC 151 (`/etc/sysctl.d/99-matrix-tuning.conf`)
- [x] LiveKit: `empty_timeout: 300`, `departure_timeout: 20`, `max_participants: 50`
- [x] LiveKit ICE port range expanded to 50000-51000
- [x] LiveKit TURN TTL reduced to 1h
- [x] LiveKit VP9/AV1 codecs enabled
- [ ] BBR congestion control — must be applied on Proxmox host
### Auth & SSO
- [x] Token-based registration
- [x] SSO/OIDC via Authelia
- [x] `allow_existing_users: true` for linking accounts to SSO
- [x] Password auth alongside SSO
### Webhooks & Integrations
- [x] matrix-hookshot 7.3.2 — 11 active webhook services
- [x] Per-service JS transformation functions
- [x] Per-service virtual user avatars
- [x] NPM reverse proxy for `/webhook` path
### Room Structure
- [x] The Lotus Guild space with all core rooms
- [x] Correct power levels and join rules per room
- [x] Custom room avatars
### Hardening
- [x] Rate limiting
- [x] E2EE on all rooms (except Spam and Stuff — intentional for hookshot)
- [x] coturn internal peer deny rules (blocks relay to RFC1918 except allowed subnet)
- [x] coturn hardening: `stale-nonce=600`, `user-quota=100`, `total-quota=1000`, strong cipher list
- [x] `pg_hba.conf` locked down — remote access restricted to Synapse LXC only
- [x] Federation open with key verification
- [x] fail2ban on Synapse login endpoint (5 retries / 24h ban)
- [x] Synapse metrics port 9000 restricted to `127.0.0.1` + `10.10.10.29`
- [x] coturn cert auto-renewal — daily sync cron on compute-storage-01
- [x] `/.well-known/matrix/client` and `/server` live on lotusguild.org
- [x] `suppress_key_server_warning: true`
- [x] Automated database + media backups
- [x] Federation bad-actor blocking via Draupnir ban lists (17,000+ entries)
### Monitoring
- [x] Grafana dashboard — `dashboard.lotusguild.org/d/matrix-synapse-dashboard` (140+ panels)
- [x] Prometheus scraping all Matrix services (Synapse, Hookshot, LiveKit, node_exporter, postgres)
- [x] 14 active alert rules across matrix-folder and infra-folder
- [x] Uptime Kuma monitors: Synapse, LiveKit, PostgreSQL, Cinny, coturn, lk-jwt-service, Hookshot
### Admin
- [x] Synapse admin API dashboard (synapse-admin at http://10.10.10.29:8080)
- [x] Draupnir moderation bot — LXC 110, v2.9.0, 9 protected rooms, 2 ban lists
- [ ] Cinny custom branding
- [ ] **Storj node update**`storj_uptodate=0` on LXC 138 (10.10.10.133), risk of disqualification
---
## Monitoring & Observability
### Prometheus Scrape Jobs
| Job | Target | Metrics |
|-----|--------|---------|
| `synapse` | `10.10.10.29:9000` | Full Synapse internals |
| `matrix-admin` | `10.10.10.29:9101` | DAU, MAU, room/user/media totals |
| `livekit` | `10.10.10.29:6789` | Rooms, participants, packets, latency |
| `hookshot` | `10.10.10.29:9004` | Connections, API calls/failures, Node.js runtime |
| `matrix-node` | `10.10.10.29:9100` | CPU, RAM, network, load average, disk |
| `postgres` | `10.10.10.44:9187` | pg_stat_database, connections, WAL, block I/O |
| `postgres-node` | `10.10.10.44:9100` | CPU, RAM, network, load average, disk |
> **Disk I/O:** All servers use Ceph-backed storage. Per-device disk I/O metrics are meaningless — use Network I/O panels to see actual storage traffic.
### Alert Rules
**Matrix folder:**
| Alert | Fires when | Severity |
|-------|-----------|----------|
| Synapse Down | `up{job="synapse"}` < 1 for 2m | critical |
| PostgreSQL Down | `pg_up` < 1 for 2m | critical |
| LiveKit Down | `up{job="livekit"}` < 1 for 2m | critical |
| Hookshot Down | `up{job="hookshot"}` < 1 for 2m | critical |
| PG Connection Saturation | connections > 80% of max for 5m | warning |
| Federation Queue Backing Up | pending PDUs > 100 for 10m | warning |
| Synapse High Memory | RSS > 2000MB for 10m | warning |
| Synapse High Response Time | p99 latency (excl. /sync) > 10s for 5m | warning |
| Synapse Event Processing Lag | any processor > 30s behind for 5m | warning |
| Synapse DB Query Latency High | p99 query time > 1s for 5m | warning |
**Infrastructure folder:**
| Alert | Fires when | Severity |
|-------|-----------|----------|
| Service Exporter Down | any `up == 0` for 3m | critical |
| Node High CPU Usage | CPU > 90% for 10m | warning |
| Node High Memory Usage | RAM > 90% for 10m | warning |
| Node Disk Space Low | available < 15% (excl. tmpfs/overlay) for 10m | warning |
> **`/sync` long-poll:** The Matrix `/sync` endpoint is a long-poll (clients hold it open ≤30s). It is excluded from the High Response Time alert to prevent false positives.
> **Synapse Event Processing Lag** can fire transiently after a Synapse restart while processors drain their backlog. Self-resolves in 1020 minutes.
---
## Bot Checklist
### Core
- [x] matrix-nio async client with E2EE
- [x] Device trust (auto-trust all devices)
- [x] Graceful shutdown (SIGTERM/SIGINT)
- [x] Initial sync token (ignores old messages on startup)
- [x] Auto-accept room invites
- [x] Deployed as systemd service (`matrixbot.service`) on LXC 151
### Commands
- [x] `!help` — list commands
- [x] `!ping` — latency check
- [x] `!8ball <question>` — magic 8-ball
- [x] `!fortune` — fortune cookie
- [x] `!flip` — coin flip
- [x] `!roll <NdS>` — dice roller
- [x] `!random <min> <max>` — random number
- [x] `!rps <choice>` — rock paper scissors
- [x] `!poll <question>` — poll with reactions
- [x] `!trivia` — trivia game (reactions, 30s reveal)
- [x] `!champion [lane]` — random LoL champion
- [x] `!agent [role]` — random Valorant agent
- [x] `!wordle` — full Wordle game (daily, hard mode, stats, share)
- [x] `!minecraft <username>` — RCON whitelist add
- [x] `!ask <question>` — Ollama LLM (lotusllm, 2min cooldown)
- [x] `!health` — bot uptime + service status
### Welcome System
- [x] Watches Space joins and DMs new members automatically
- [x] React-to-join: react with ✅ in DM → bot invites to General, Commands, Memes
- [x] Welcome event ID persisted to `welcome_state.json`
### Wordle
- [x] Daily puzzles with two-pass letter evaluation
- [x] Hard mode with constraint validation
- [x] Stats persistence (`wordle_stats.json`)
- [x] Cinny-compatible rendering (inline `<span>` tiles)
- [x] DM-based gameplay, `!wordle share` posts result to public room
- [x] Virtual keyboard display
---
## Tech Stack
| Component | Technology | Version |
|-----------|-----------|---------|
| Bot language | Python 3 | 3.x |
| Bot library | matrix-nio (E2EE) | latest |
| Homeserver | Synapse | 1.149.0 |
| Database | PostgreSQL | 17.9 |
| TURN | coturn | latest |
| Video/voice calls | LiveKit SFU | 1.9.11 |
| LiveKit JWT | lk-jwt-service | latest |
| Moderation | Draupnir | 2.9.0 |
| SSO | Authelia (OIDC) + LLDAP | — |
| Webhook bridge | matrix-hookshot | 7.3.2 |
| Reverse proxy | Nginx Proxy Manager | — |
| Web client | Cinny (`add-joined-call-controls` branch) | 4.10.5 |
| Bot dependencies | matrix-nio[e2ee], aiohttp, python-dotenv, mcrcon | — |
## Bot Files
```
matrixBot/
├── bot.py # Entry point, client setup, event loop
├── callbacks.py # Message + reaction event handlers
├── commands.py # All command implementations
├── config.py # Environment config + validation
├── utils.py # send_text, send_html, send_reaction, get_or_create_dm
├── welcome.py # Welcome message + react-to-join logic
├── wordle.py # Full Wordle game engine
├── wordlist_answers.py # Wordle answer word list
├── wordlist_valid.py # Wordle valid guess word list
├── .env.example # Environment variable template
└── requirements.txt # Python dependencies
```