matrix/README.md
Jared Vititoe ffd51e87bf docs: overhaul README with current infrastructure and auto-deploy docs
- Fix repo URL (matrixBot → matrix)
- Add repo structure tree
- Update Cinny: dev branch, nightly build, 2GB RAM, correct paths
- Add full Auto-Deployment section (per-LXC endpoints, what each deploys, installed components)
- Add Livekit Graceful Restart documentation
- Add Access Token Rotation procedure
- Update port map: add 9500 (webhook on LXC 151)
- Add Voice Room to rooms table
- Add Proxmox embed format note to hookshot section
- Add manual hookshot deploy instructions
- Add Cinny dev branch section with build notes
- Add HA migration livekit fix to Known Issues
- Update server checklist (auto-deploy, voice room visibility)
- Remove stale Python bot files section
- Update tech stack table

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-18 13:37:01 -04:00

# Lotus Matrix Infrastructure
Matrix server infrastructure for the Lotus Guild homeserver (`matrix.lotusguild.org`).

**Repo**: https://code.lotusguild.org/LotusGuild/matrix
## Status: Phase 7 — Moderation & Client Customisation
---
## Priority Order
1. ~~PostgreSQL migration~~
2. ~~TURN server~~
3. ~~Room structure + space setup~~
4. ~~Matrix bot (core + commands)~~
5. ~~LiveKit / Element Call~~
6. ~~SSO / OIDC (Authelia)~~
7. ~~Webhook integrations (hookshot)~~
8. ~~Voice stability & quality tuning~~
9. ~~Custom Cinny client (chat.lotusguild.org)~~
10. Custom emoji packs (partially finished)
11. Cinny custom branding (Lotus Guild theme)
12. ~~Draupnir moderation bot~~
13. Push notifications (Sygnal)
---
## Repo Structure
```
matrix/
├── hookshot/ # Hookshot JS transformation functions (one file per webhook)
│   ├── deploy.sh # Deploys all .js files to Matrix room state via API
│   ├── proxmox.js
│   ├── grafana.js
│   ├── uptime-kuma.js
│   └── ... # One .js per webhook service
├── cinny/
│   ├── config.json # Cinny homeserver config (deployed to /var/www/html/config.json)
│   └── dev-update.sh # Nightly build script for Cinny dev branch
├── landing/
│   └── index.html # matrix.lotusguild.org landing page
├── draupnir/
│   └── production.yaml # Draupnir config (access token is redacted — see rotation docs below)
├── deploy/ # Auto-deployment infrastructure
│   ├── lxc151-hookshot.sh # Deploy script for LXC 151 (matrix/hookshot/livekit)
│   ├── lxc106-cinny.sh # Deploy script for LXC 106 (cinny)
│   ├── lxc139-landing.sh # Deploy script for LXC 139 (landing page)
│   ├── lxc110-draupnir.sh # Deploy script for LXC 110 (draupnir)
│   ├── livekit-graceful-restart.sh # Waits for zero active calls before restarting livekit
│   ├── hooks-lxc151.json # webhook binary config for LXC 151
│   ├── hooks-lxc106.json # webhook binary config for LXC 106
│   ├── hooks-lxc139.json # webhook binary config for LXC 139
│   └── hooks-lxc110.json # webhook binary config for LXC 110
└── systemd/
    ├── livekit-server.service # LiveKit systemd unit (with HA migration fix)
    ├── livekit-graceful-restart.service # oneshot — checks pending restart flag
    ├── livekit-graceful-restart.timer # Runs every 5 min
    ├── draupnir.service
    └── cinny-dev-update.cron # Installed to /etc/cron.d/ on LXC 106
```
---
## Infrastructure
| Service | IP | LXC | RAM | vCPUs | Disk | Versions |
|---------|----|-----|-----|-------|------|----------|
| Synapse | 10.10.10.29 | 151 | 8GB | 4 (Ryzen 9 7900) | 50GB | Synapse 1.149.0, LiveKit 1.9.11, hookshot 7.3.2, coturn latest |
| PostgreSQL 17 | 10.10.10.44 | 109 | 6GB | 3 (Ryzen 9 7900) | 30GB | PostgreSQL 17.9 |
| Cinny Web | 10.10.10.6 | 106 | 2GB | 1 | 8GB | Debian 12, nginx, Node 24, Cinny `dev` branch (nightly build) |
| Draupnir | 10.10.10.24 | 110 | 1GB | 2 (Ryzen 9 7900) | 10GB | Draupnir v2.9.0, Node.js v22 |
| Prometheus | 10.10.10.48 | 118 | — | — | — | Prometheus — scrapes all Matrix services |
| Grafana | 10.10.10.49 | 107 | — | — | — | Grafana 12.4.0 — dashboard.lotusguild.org |
| NPM | 10.10.10.27 | 139 | — | — | — | Nginx Proxy Manager + matrix landing page |
| Authelia | 10.10.10.36 | 167 | — | — | — | SSO/OIDC provider |
| LLDAP | 10.10.10.39 | 147 | — | — | — | LDAP user directory |
| Uptime Kuma | 10.10.10.25 | 101 | — | — | — | Uptime monitoring (micro1 node) |

**Key paths on Synapse LXC (151):**
- Synapse config: `/etc/matrix-synapse/homeserver.yaml`
- Synapse conf.d: `/etc/matrix-synapse/conf.d/` (metrics.yaml, report_stats.yaml, server_name.yaml)
- coturn config: `/etc/turnserver.conf`
- LiveKit config: `/etc/livekit/config.yaml`
- LiveKit service: `livekit-server.service`
- lk-jwt-service: `lk-jwt-service.service` (binds `:8070`, serves JWT tokens for MatrixRTC)
- Hookshot: `/opt/hookshot/`, service: `matrix-hookshot.service`
- Hookshot config: `/opt/hookshot/config.yml`
- Hookshot registration: `/etc/matrix-synapse/hookshot-registration.yaml`
- Bot: `/opt/matrixbot/`, service: `matrixbot.service`
- Repo clone (auto-deploy): `/opt/matrix-config/`
- Deploy env: `/etc/matrix-deploy.env` (MATRIX_TOKEN, MATRIX_SERVER, MATRIX_ROOM)
- Deploy log: `/var/log/matrix-deploy.log`
**Key paths on Draupnir LXC (110):**
- Install path: `/opt/draupnir/`
- Config: `/opt/draupnir/config/production.yaml`
- Data/SQLite DBs: `/data/storage/`
- Service: `draupnir.service`
- Management room: `#management:matrix.lotusguild.org` (`!mEvR5fe3jMmzwd-FwNygD72OY_yu8H3UP_N-57oK7MI`)
- Bot account: `@draupnir:matrix.lotusguild.org` (power level 100 in all protected rooms)
- Subscribed ban lists: `#community-moderation-effort-bl:neko.dev`, `#matrix-org-coc-bl:matrix.org`
- Rebuild: `NODE_OPTIONS="--max-old-space-size=768" npx tsc --project tsconfig.json`
**Key paths on PostgreSQL LXC (109):**
- PostgreSQL config: `/etc/postgresql/17/main/postgresql.conf`
- Tuning conf.d: `/etc/postgresql/17/main/conf.d/synapse_tuning.conf`
- HBA config: `/etc/postgresql/17/main/pg_hba.conf`
- Data directory: `/var/lib/postgresql/17/main`
**Key paths on Cinny LXC (106):**
- Source: `/opt/cinny-dev/` (branch: `dev`, auto-updated nightly at 3am)
- Built files: `/var/www/html/`
- Cinny config: `/var/www/html/config.json`
- Config backup (survives rebuilds): `/opt/cinny-dev/.cinny-config.json`
- Dev update script: `/usr/local/bin/cinny-dev-update.sh`
- Cron: `/etc/cron.d/cinny-dev-update` (runs at 3:00am daily)
- Nginx site config: `/etc/nginx/sites-available/cinny`
---
## Auto-Deployment
Pushes to `main` on `LotusGuild/matrix` automatically deploy to the relevant LXC(s) via Gitea webhooks. All 4 LXCs are fully independent — each runs its own webhook listener and deploys only its own files. No cross-LXC SSH dependencies.
### How It Works
1. Push to `LotusGuild/matrix` on Gitea
2. Gitea fires webhooks to all 4 LXCs simultaneously (HMAC-SHA256 validated)
3. Each LXC runs `/usr/local/bin/matrix-deploy.sh` via the `webhook` binary
4. The script runs `git fetch` followed by `git reset --hard origin/main`, checks which files changed, and deploys only the relevant ones
5. Logs to `/var/log/matrix-deploy.log` on each LXC
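The fetch, diff, and dispatch flow in step 4 can be sketched roughly as below. This is not the repo's actual `matrix-deploy.sh`: the function names `changed_files` and `plan_lxc151` and the action labels are hypothetical, and the path rules are the LXC 151 ones from the endpoint table.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the fetch / diff / dispatch flow; not the real matrix-deploy.sh.

REPO_DIR="${REPO_DIR:-/opt/matrix-config}"

# Print the files that differ between the current checkout and origin/main,
# then move the checkout to origin/main.
changed_files() {
    git -C "$REPO_DIR" fetch --quiet origin main || return 1
    git -C "$REPO_DIR" diff --name-only HEAD origin/main
    git -C "$REPO_DIR" reset --quiet --hard origin/main
}

# Read changed paths on stdin and print one (illustrative) action label
# per deploy step that LXC 151 would need.
plan_lxc151() {
    local path hookshot=0 livekit=0
    while IFS= read -r path; do
        case "$path" in
            hookshot/*.js)                  hookshot=1 ;;
            systemd/livekit-server.service) livekit=1 ;;
        esac
    done
    [ "$hookshot" -eq 1 ] && echo "deploy-hookshot"
    [ "$livekit" -eq 1 ] && echo "flag-livekit-restart"
    return 0
}
```

A run would then look something like `changed_files | plan_lxc151 >> /var/log/matrix-deploy.log`. Computing the diff before the reset matters: once the checkout has moved to `origin/main`, there is nothing left to compare against.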
### Per-LXC Webhook Endpoints
| LXC | Service | IP | Port | Deploys When Changed |
|-----|---------|----|----|----------------------|
| 151 | matrix/hookshot | 10.10.10.29 | **9500** | `hookshot/*.js`, `systemd/livekit-server.service` |
| 106 | cinny | 10.10.10.6 | 9000 | `cinny/config.json`, `cinny/dev-update.sh` |
| 139 | landing/NPM | 10.10.10.27 | 9000 | `landing/index.html` |
| 110 | draupnir | 10.10.10.24 | 9000 | `draupnir/production.yaml` |
> LXC 151 uses port **9500** because ports 9000-9004 are occupied by Synapse and Hookshot.
### What Each Deploy Does
**LXC 151 — hookshot/livekit:**
- `hookshot/*.js` changed → runs `hookshot/deploy.sh` (pushes transform functions to Matrix room state via API, requires `MATRIX_TOKEN` in `/etc/matrix-deploy.env`)
- `systemd/livekit-server.service` changed → copies file, `daemon-reload`, sets `/run/livekit-restart-pending` flag (actual restart deferred — see Livekit Graceful Restart below)
**LXC 106 — cinny:**
- `cinny/config.json` → copies to `/var/www/html/config.json`
- `cinny/dev-update.sh` → copies to `/usr/local/bin/cinny-dev-update.sh`, `chmod +x`
**LXC 139 — landing page:**
- `landing/index.html` → copies to `/var/www/matrix-landing/index.html`, `nginx -s reload`
**LXC 110 — draupnir:**
- `draupnir/production.yaml` → extracts live `accessToken` from existing config, overwrites from repo, restores token via `sed`, restarts `draupnir.service`
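The extract, overwrite, and restore sequence for Draupnir might look like the sketch below. It assumes `accessToken:` is a top-level key on a single line in both files; `merge_config` is a hypothetical name, and the real `lxc110-draupnir.sh` may do this differently.

```shell
#!/usr/bin/env bash
# Sketch only: assumes a single-line, top-level `accessToken:` key in both files.

# merge_config LIVE REPO: print REPO's content with LIVE's accessToken line restored.
merge_config() {
    local live="$1" repo="$2" token_line
    # Grab the whole live line, e.g. `accessToken: "..."`.
    token_line=$(grep -m1 '^accessToken:' "$live") || return 1
    # Escape characters that are special in a sed replacement (\, & and the | delimiter).
    token_line=$(printf '%s' "$token_line" | sed 's/[\\&|]/\\&/g')
    sed "s|^accessToken:.*|${token_line}|" "$repo"
}
```

Writing the merged output to a temporary file and moving it into place, rather than editing `/opt/draupnir/config/production.yaml` directly, would keep the live token safe if the `grep` finds nothing.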
### Installed Components (per LXC)
- `webhook` binary (Debian package `webhook` v2.8.0) listening on respective port
- `/etc/webhook/hooks.json` — unique HMAC-SHA256 secret per LXC
- `/usr/local/bin/matrix-deploy.sh` — deploy script from this repo
- `/etc/systemd/system/webhook.service` — enabled and running
- `/opt/matrix-config/` — clone of this repo
- `/var/log/matrix-deploy.log` — deploy log
**LXC 151 additionally:**
- `/etc/matrix-deploy.env`: `MATRIX_TOKEN`, `MATRIX_SERVER`, `MATRIX_ROOM` (not in git)
- `/usr/local/bin/livekit-graceful-restart.sh`
- `/etc/systemd/system/livekit-graceful-restart.service` + `.timer`
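For debugging a rejected delivery, the HMAC-SHA256 check can be reproduced by hand. The `webhook` binary does this validation itself; the sketch below only illustrates the computation. It assumes `openssl` is installed and that Gitea sends the hex digest of the raw request body (in the `X-Gitea-Signature` header, as far as I know). Both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative only: the `webhook` binary performs this check itself.

# hmac_sha256_hex SECRET: read the payload on stdin, print the hex HMAC-SHA256.
hmac_sha256_hex() {
    # openssl prints "<label>= <hex>"; keep only the last field.
    openssl dgst -sha256 -hmac "$1" | awk '{print $NF}'
}

# signature_ok SECRET SIGNATURE: succeed if SIGNATURE matches the payload on stdin.
signature_ok() {
    [ "$(hmac_sha256_hex "$1")" = "$2" ]
}
```

Usage would be along the lines of `signature_ok "$SECRET" "$HEADER_VALUE" < body.json`. The string comparison here is not constant-time, so this is a debugging aid rather than something to put in front of untrusted traffic.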
### LiveKit Graceful Restart
Killing livekit-server while a call is active drops everyone. Instead:
1. Deploy to LXC 151 copies the new `livekit-server.service` and sets a `/run/livekit-restart-pending` flag
2. `livekit-graceful-restart.timer` runs every 5 minutes
3. The timer script counts established TCP connections on port 7881 (`ss -tn state established`)
4. If zero connections → restarts livekit-server and clears the flag
5. If connections exist → logs and exits, retries in 5 minutes
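The timer logic above could be sketched as follows. This is a hypothetical reconstruction, not the contents of `livekit-graceful-restart.sh`; the function names are invented, and the `ss` filter assumes connections are counted by local port.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the graceful-restart check; the real script may differ.

FLAG="${FLAG:-/run/livekit-restart-pending}"
PORT="${PORT:-7881}"

# Count established TCP connections whose local port is $PORT.
# -H drops the header line so every output line is one connection.
active_connections() {
    ss -H -tn state established "( sport = :$PORT )" | wc -l
}

# restart_decision FLAG_FILE CONNECTION_COUNT: print what this timer run should do.
restart_decision() {
    if [ ! -e "$1" ]; then
        echo "idle"      # no restart requested
    elif [ "$2" -eq 0 ]; then
        echo "restart"   # nobody connected, safe to restart
    else
        echo "defer"     # calls in progress, try again on the next timer run
    fi
}

main() {
    case "$(restart_decision "$FLAG" "$(active_connections)")" in
        restart) systemctl restart livekit-server && rm -f "$FLAG" ;;
        defer)   echo "livekit restart deferred: connections active on port $PORT" ;;
    esac
}

# The systemd oneshot unit would call `main`; it is left uninvoked here.
```

Keeping the decision in a pure function makes the three outcomes easy to check without touching the running service. The flag is only cleared after `systemctl restart` succeeds, so a failed restart is retried on the next run.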
---
## Access Token Rotation
The `MATRIX_TOKEN` in `/etc/matrix-deploy.env` on LXC 151 is an access token for Jared's user account. It is used to push hookshot transforms to Matrix room state, which requires power level ≥ 50 in Spam and Stuff.

The token in `draupnir/production.yaml` in this repo is **intentionally redacted** (`accessToken: REDACTED`). The deploy script on LXC 110 extracts the live token from the running config before overwriting from the repo, then restores it.

**To rotate the hookshot deploy token (LXC 151):**
1. Generate a new token via Synapse admin API or Cinny → Settings → Security → Manage Sessions
2. SSH to LXC 151 (via `ssh root@10.10.10.4` then `pct enter 151`): `nano /etc/matrix-deploy.env`
3. Replace `MATRIX_TOKEN=<old>` with new token
4. Test: `MATRIX_TOKEN=<new> MATRIX_SERVER=https://matrix.lotusguild.org bash /opt/matrix-config/hookshot/deploy.sh`
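If editing the file by hand feels error-prone, steps 2 and 3 could be scripted along these lines. `set_env_token` is a hypothetical helper, and `sed -i` is used in its GNU form (as shipped on Debian).

```shell
#!/usr/bin/env bash
# Sketch: swap MATRIX_TOKEN in an env file. The helper name is hypothetical.

# set_env_token FILE NEW_TOKEN: rewrite the MATRIX_TOKEN= line, leaving other lines alone.
set_env_token() {
    local file="$1" token
    # Escape characters that are special in a sed replacement.
    token=$(printf '%s' "$2" | sed 's/[\\&|]/\\&/g')
    sed -i "s|^MATRIX_TOKEN=.*|MATRIX_TOKEN=${token}|" "$file"
}

# Optional check that the homeserver accepts the token (standard client-server API):
#   curl -fsS -H "Authorization: Bearer $MATRIX_TOKEN" \
#        "$MATRIX_SERVER/_matrix/client/v3/account/whoami"
```

The `whoami` call should return the user ID the token belongs to, which is a quick way to confirm the token was issued for the intended account before running the deploy test.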
**To rotate the Draupnir token:**
1. Generate new token for `@draupnir:matrix.lotusguild.org`
2. On LXC 110: `nano /opt/draupnir/config/production.yaml` → update `accessToken`
3. `systemctl restart draupnir`
4. Do **not** commit the token to git — the repo version stays redacted
---
## Port Maps
**Router → 10.10.10.29 (forwarded):**
- TCP+UDP 3478 — TURN/STUN
- TCP+UDP 5349 — TURNS/TLS
- TCP 7881 — LiveKit ICE TCP fallback
- TCP+UDP 49152-65535 — TURN relay range
**Internal port map (LXC 151):**
| Port | Service | Bind |
|------|---------|------|
| 8008 | Synapse HTTP | 0.0.0.0 |
| 9000 | Synapse metrics | 127.0.0.1 + 10.10.10.29 |
| 9001 | Hookshot widgets | 0.0.0.0 |
| 9002 | Hookshot bridge (appservice) | 127.0.0.1 |
| 9003 | Hookshot webhooks | 0.0.0.0 |
| 9004 | Hookshot metrics | 0.0.0.0 |
| 9100 | node_exporter | 0.0.0.0 |
| 9101 | matrix-admin exporter | 0.0.0.0 |
| 9500 | webhook (auto-deploy) | 0.0.0.0 |
| 6789 | LiveKit metrics | 0.0.0.0 |
| 7880 | LiveKit HTTP | 0.0.0.0 |
| 7881 | LiveKit RTC TCP | 0.0.0.0 |
| 8070 | lk-jwt-service | 0.0.0.0 |
| 8080 | synapse-admin (nginx) | 0.0.0.0 |
| 3478 | coturn STUN/TURN | 0.0.0.0 |
| 5349 | coturn TURNS/TLS | 0.0.0.0 |

**Internal port map (LXC 109 — PostgreSQL):**
| Port | Service | Bind |
|------|---------|------|
| 5432 | PostgreSQL | 0.0.0.0 (hba-restricted to 10.10.10.29) |
| 9100 | node_exporter | 0.0.0.0 |
| 9187 | postgres_exporter | 0.0.0.0 |
---
## Rooms (all v12)
| Room | Room ID | Join Rule |
|------|---------|-----------|
| The Lotus Guild (Space) | `!-1ZBnAH-JiCOV8MGSKN77zDGTuI3pgSdy8Unu_DrDyc` | public |
| General | `!wfokQ1-pE896scu_AOcCBA2s3L4qFo-PTBAFTd0WMI0` | public |
| Commands | `!ou56mVZQ8ZB7AhDYPmBV5_BR28WMZ4x5zwZkPCqjq1s` | restricted (Space members) |
| Memes | `!GK6v5cLEEnowIooQJv5jECfISUjADjt8aKhWv9VbG5U` | restricted (Space members) |
| Voice Room | `!ARbRFSPNp2U0MslWTBGoTT3gbmJJ25dPRL6enQntvPo` | restricted (Space members) |
| Management | `!mEvR5fe3jMmzwd-FwNygD72OY_yu8H3UP_N-57oK7MI` | invite |
| Cool Kids | `!R7DT3QZHG9P8QQvX6zsZYxjkKgmUucxDz_n31qNrC94` | invite |
| Spam and Stuff | `!GttT4QYd1wlGlkHU3qTmq_P3gbyYKKeSSN6R7TPcJHg` | invite, **no E2EE** (hookshot) |

**Power level roles (Cinny tags):**
- 100: Owner (jared)
- 50: The Nerdy Council (enhuynh, lonely)
- 48: Panel of Geeks
- 35: Cool Kids
- 0: Member
---
## Webhook Integrations (matrix-hookshot 7.3.2)
Generic webhooks bridged into **Spam and Stuff**.

Each service gets its own virtual user (`@hookshot_<service>`) with a unique avatar.

Webhook URL format: `https://matrix.lotusguild.org/webhook/<uuid>`
| Service | Webhook UUID | Notes |
|---------|-------------|-------|
| Grafana | `df4a1302-2d62-4a01-b858-fb56f4d3781a` | Unified alerting contact point |
| Proxmox | `9b3eafe5-7689-4011-addd-c466e524661d` | Notification system (8.1+), Discord embed format |
| Sonarr | `aeffc311-0686-42cb-9eeb-6757140c072e` | All event types |
| Radarr | `34913454-c1ac-4cda-82ea-924d4a9e60eb` | All event types |
| Readarr | `e57ab4f3-56e6-4dc4-8b30-2f4fd4bbeb0b` | All event types |
| Lidarr | `66ac6fdd-69f6-4f47-bb00-b7f6d84d7c1c` | All event types |
| Uptime Kuma | `1a02e890-bb25-42f1-99fe-bba6a19f1811` | Status change notifications |
| Seerr | `555185af-90a1-42ff-aed5-c344e11955cf` | Request/approval events |
| Owncast (Livestream) | `9993e911-c68b-4271-a178-c2d65ca88499` | STREAM_STARTED / STREAM_STOPPED |
| Bazarr | `470fb267-3436-4dd3-a70c-e6e8db1721be` | Subtitle events (Apprise JSON notifier) |
| Tinker-Tickets | `6e306faf-8eea-4ba5-83ef-bf8f421f929e` | Custom transformation code |

**Hookshot notes:**
- Spam and Stuff is intentionally **unencrypted** — hookshot bridges cannot join E2EE rooms
- JS transformation functions use hookshot v2 API: `result = { version: "v2", plain, html, msgtype }`
- The `result` variable must be assigned without `var`/`let`/`const` (QuickJS IIFE sandbox)
- NPM proxies `https://matrix.lotusguild.org/webhook/*` → `http://10.10.10.29:9003`
- Proxmox sends Discord embed format: `data.embeds[0].{title,description,fields}` — NOT flat fields
- Transform functions are stored as Matrix room state (`uk.half-shot.matrix-hookshot.generic.hook`) and deployed via `hookshot/deploy.sh`
**Deploying hookshot transforms manually:**
```bash
# On LXC 151 or from any machine with access
# Replace <jared_token> with the real token; the quotes keep the shell from
# treating < and > as redirections if the placeholder is pasted as-is.
export MATRIX_TOKEN='<jared_token>'
export MATRIX_SERVER=https://matrix.lotusguild.org
export MATRIX_ROOM='!GttT4QYd1wlGlkHU3qTmq_P3gbyYKKeSSN6R7TPcJHg'
bash /opt/matrix-config/hookshot/deploy.sh # deploy all
bash /opt/matrix-config/hookshot/deploy.sh proxmox.js # deploy one
```
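The internals of `deploy.sh` are not shown in this README. As a rough idea of what "deployed to room state via API" involves, the sketch below builds the URL for a state event using the standard Matrix client-server state endpoint. The helper names are hypothetical, and using the webhook name as the state key is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; deploy.sh's real internals are not shown in this README.

# Percent-encode a string for use in a URL path segment (ASCII input assumed).
urlencode() {
    local s="$1" out="" c i
    for ((i = 0; i < ${#s}; i++)); do
        c="${s:i:1}"
        case "$c" in
            [a-zA-Z0-9.~_-]) out+="$c" ;;
            *) out+=$(printf '%%%02X' "'$c") ;;
        esac
    done
    printf '%s\n' "$out"
}

# state_url SERVER ROOM_ID STATE_KEY: URL of the hookshot generic-hook state event.
state_url() {
    printf '%s/_matrix/client/v3/rooms/%s/state/uk.half-shot.matrix-hookshot.generic.hook/%s\n' \
        "$1" "$(urlencode "$2")" "$(urlencode "$3")"
}
```

A deploy would then `PUT` the event content to that URL with an `Authorization: Bearer $MATRIX_TOKEN` header. The encoding step matters because room IDs start with `!`, which is not safe to leave raw in a URL path.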
---
## Moderation (Draupnir v2.9.0)
Draupnir runs on LXC 110, manages moderation across all 9 protected rooms via `#management:matrix.lotusguild.org`.
**Subscribed ban lists:**
- `#community-moderation-effort-bl:neko.dev` — 12,599 banned users, 245 servers, 59 rooms
- `#matrix-org-coc-bl:matrix.org` — 4,589 banned users, 220 servers, 2 rooms
**Common commands (send in management room):**
```
!draupnir status — current status + protected rooms
!draupnir ban @user:server * "reason" — ban from all protected rooms
!draupnir redact @user:server — redact their recent messages
!draupnir rooms add !roomid:server — add a room to protection
!draupnir watch <alias> --no-confirm — subscribe to a ban list
```
---
## Cinny Dev Branch (chat.lotusguild.org)
`chat.lotusguild.org` tracks the Cinny `dev` branch to test the latest beta features.
**Nightly build process (`cinny-dev-update.sh`):**
1. `git fetch origin dev` — checks for new commits; exits early if nothing changed
2. Builds in `/opt/cinny-dev/` using Node 24 with `NODE_OPTIONS=--max_old_space_size=896`
3. Validates `dist/index.html` exists before touching the live web root
4. Copies `dist/` to `/var/www/html/`, restores `config.json` from `/opt/cinny-dev/.cinny-config.json`
5. Runs at 3:00am daily via `/etc/cron.d/cinny-dev-update`
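Steps 3 and 4 are the part that protects the live site from a failed build. A minimal sketch of that guard is below; `publish_build` is a hypothetical name and the real `cinny-dev-update.sh` may be structured differently.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of steps 3-4; the real cinny-dev-update.sh may differ.

# publish_build DIST_DIR WEB_ROOT CONFIG_BACKUP
# Refuse to touch WEB_ROOT unless the build produced index.html, then copy the
# build and put the saved config.json back on top of it.
publish_build() {
    local dist="$1" web_root="$2" config_backup="$3"
    if [ ! -f "$dist/index.html" ]; then
        echo "build output missing: $dist/index.html" >&2
        return 1
    fi
    cp -a "$dist/." "$web_root/"
    cp "$config_backup" "$web_root/config.json"
}
```

On LXC 106 this would be called as `publish_build /opt/cinny-dev/dist /var/www/html /opt/cinny-dev/.cinny-config.json`. Note that this sketch only overlays files; it does not remove stale assets left in the web root by earlier builds.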
**Manual rebuild:**
```bash
# On LXC 106
/usr/local/bin/cinny-dev-update.sh
```
**Why 2GB RAM:** Vite's build process was OOM-killed at 1GB. The 896MB Node heap plus OS overhead needs at least 1.5GB; 2GB gives headroom.

---
## Known Issues
### LiveKit Port Conflict After HA Migration
LXC 151 can migrate between Proxmox nodes via HA. After migration, the old livekit-server process on the source node can leave a stale entry holding port 7881 on the destination. Fixed in `livekit-server.service` via:
```ini
ExecStartPre=-/bin/bash -c 'pkill -x livekit-server; sleep 1'
KillMode=control-group
```
### coturn TLS Reset Errors
Periodic `TLS/TCP socket error: Connection reset by peer` in coturn logs. Normal — clients probe TURN and drop once they establish a direct P2P path.
### BBR Congestion Control
`net.ipv4.tcp_congestion_control = bbr` must be set on the Proxmox host, not inside an unprivileged LXC. All other sysctl tuning (TCP/UDP buffers, fin_timeout) is applied inside LXC 151.

---
## Server Checklist
### Quality of Life
- [x] Migrate from SQLite to PostgreSQL
- [x] TURN/STUN server (coturn) for reliable voice/video
- [x] URL previews
- [x] Upload size limit 200MB
- [x] Full-text message search (PostgreSQL backend)
- [x] Media retention policy (remote: 1yr, local: 3yr)
- [x] Sliding sync (native Synapse)
- [x] LiveKit for Element Call video rooms
- [x] Default room version v12, all rooms upgraded
- [x] Landing page with client recommendations
- [x] Synapse metrics endpoint (port 9000, Prometheus-compatible)
- [x] Cinny `dev` branch — nightly auto-build, tracks latest beta features
- [x] Auto-deployment via Gitea webhooks (all 4 LXCs)
- [ ] Push notifications gateway (Sygnal) — needs Apple/Google developer credentials
- [ ] Cinny custom branding — Lotus Guild theme (colours, title, favicon, PWA name)
### Performance Tuning
- [x] PostgreSQL `shared_buffers` → 1500MB, `effective_cache_size`, `work_mem`, checkpoint tuning
- [x] PostgreSQL `pg_stat_statements` extension installed
- [x] PostgreSQL autovacuum tuned per-table (5 high-churn tables), `autovacuum_max_workers` → 5
- [x] Synapse `event_cache_size` → 30K, per-cache factors tuned
- [x] sysctl TCP/UDP buffer alignment on LXC 151 (`/etc/sysctl.d/99-matrix-tuning.conf`)
- [x] LiveKit: `empty_timeout: 300`, `departure_timeout: 20`, `max_participants: 50`
- [x] LiveKit ICE port range expanded to 50000-51000
- [x] LiveKit TURN TTL reduced to 1h
- [x] LiveKit VP9/AV1 codecs enabled
- [ ] BBR congestion control — must be applied on Proxmox host
### Auth & SSO
- [x] Token-based registration
- [x] SSO/OIDC via Authelia
- [x] `allow_existing_users: true` for linking accounts to SSO
- [x] Password auth alongside SSO
### Webhooks & Integrations
- [x] matrix-hookshot 7.3.2 — 11 active webhook services
- [x] Per-service JS transformation functions (stored in git, auto-deployed)
- [x] Per-service virtual user avatars
- [x] NPM reverse proxy for `/webhook` path
### Room Structure
- [x] The Lotus Guild space with all core rooms
- [x] Correct power levels and join rules per room
- [x] Custom room avatars
- [x] Voice room visible to space members (`suggested: true`)
### Hardening
- [x] Rate limiting
- [x] E2EE on all rooms (except Spam and Stuff — intentional for hookshot)
- [x] coturn internal peer deny rules (blocks relay to RFC1918 except allowed subnet)
- [x] coturn hardening: `stale-nonce=600`, `user-quota=100`, `total-quota=1000`, strong cipher list
- [x] `pg_hba.conf` locked down — remote access restricted to Synapse LXC only
- [x] Federation open with key verification
- [x] fail2ban on Synapse login endpoint (5 retries / 24h ban)
- [x] Synapse metrics port 9000 restricted to `127.0.0.1` + `10.10.10.29`
- [x] coturn cert auto-renewal — daily sync cron on compute-storage-01
- [x] `/.well-known/matrix/client` and `/server` live on lotusguild.org
- [x] `suppress_key_server_warning: true`
- [x] Automated database + media backups
- [x] Federation bad-actor blocking via Draupnir ban lists (17,000+ entries)
- [x] Webhook HMAC-SHA256 validation on all auto-deploy endpoints
### Monitoring
- [x] Grafana dashboard — `dashboard.lotusguild.org/d/matrix-synapse-dashboard` (140+ panels)
- [x] Prometheus scraping all Matrix services (Synapse, Hookshot, LiveKit, node_exporter, postgres)
- [x] 14 active alert rules across matrix-folder and infra-folder
- [x] Uptime Kuma monitors: Synapse, LiveKit, PostgreSQL, Cinny, coturn, lk-jwt-service, Hookshot
### Admin
- [x] Synapse admin API dashboard (synapse-admin at http://10.10.10.29:8080)
- [x] Draupnir moderation bot — LXC 110, v2.9.0, 9 protected rooms, 2 ban lists
- [ ] Cinny custom branding
---
## Monitoring & Observability
### Prometheus Scrape Jobs
| Job | Target | Metrics |
|-----|--------|---------|
| `synapse` | `10.10.10.29:9000` | Full Synapse internals |
| `matrix-admin` | `10.10.10.29:9101` | DAU, MAU, room/user/media totals |
| `livekit` | `10.10.10.29:6789` | Rooms, participants, packets, latency |
| `hookshot` | `10.10.10.29:9004` | Connections, API calls/failures, Node.js runtime |
| `matrix-node` | `10.10.10.29:9100` | CPU, RAM, network, load average, disk |
| `postgres` | `10.10.10.44:9187` | pg_stat_database, connections, WAL, block I/O |
| `postgres-node` | `10.10.10.44:9100` | CPU, RAM, network, load average, disk |
> **Disk I/O:** All servers use Ceph-backed storage. Per-device disk I/O metrics are meaningless — use Network I/O panels to see actual storage traffic.
### Alert Rules
**Matrix folder:**
| Alert | Fires when | Severity |
|-------|-----------|----------|
| Synapse Down | `up{job="synapse"}` < 1 for 2m | critical |
| PostgreSQL Down | `pg_up` < 1 for 2m | critical |
| LiveKit Down | `up{job="livekit"}` < 1 for 2m | critical |
| Hookshot Down | `up{job="hookshot"}` < 1 for 2m | critical |
| PG Connection Saturation | connections > 80% of max for 5m | warning |
| Federation Queue Backing Up | pending PDUs > 100 for 10m | warning |
| Synapse High Memory | RSS > 2000MB for 10m | warning |
| Synapse High Response Time | p99 latency (excl. /sync) > 10s for 5m | warning |
| Synapse Event Processing Lag | any processor > 30s behind for 5m | warning |
| Synapse DB Query Latency High | p99 query time > 1s for 5m | warning |

**Infrastructure folder:**
| Alert | Fires when | Severity |
|-------|-----------|----------|
| Service Exporter Down | any `up == 0` for 3m | critical |
| Node High CPU Usage | CPU > 90% for 10m | warning |
| Node High Memory Usage | RAM > 90% for 10m | warning |
| Node Disk Space Low | available < 15% (excl. tmpfs/overlay) for 10m | warning |
> **`/sync` long-poll:** The Matrix `/sync` endpoint is a long-poll (clients hold it open ≤30s). It is excluded from the High Response Time alert to prevent false positives.
> **Synapse Event Processing Lag** can fire transiently after a Synapse restart while processors drain their backlog. Self-resolves in 10-20 minutes.
---
## Tech Stack
| Component | Technology | Version |
|-----------|-----------|---------|
| Homeserver | Synapse | 1.149.0 |
| Database | PostgreSQL | 17.9 |
| TURN | coturn | latest |
| Video/voice calls | LiveKit SFU | 1.9.11 |
| LiveKit JWT | lk-jwt-service | latest |
| Moderation | Draupnir | 2.9.0 |
| SSO | Authelia (OIDC) + LLDAP | — |
| Webhook bridge | matrix-hookshot | 7.3.2 |
| Reverse proxy | Nginx Proxy Manager | — |
| Web client | Cinny (`dev` branch, nightly build) | dev |
| Auto-deploy | adnanh/webhook | 2.8.0 |
| Bot language | Python 3 | 3.x |
| Bot library | matrix-nio (E2EE) | latest |
| Bot dependencies | matrix-nio[e2ee], aiohttp, python-dotenv, mcrcon | — |