# Lotus Matrix Infrastructure

Matrix server infrastructure for the Lotus Guild homeserver (`matrix.lotusguild.org`).

**Repo**: https://code.lotusguild.org/LotusGuild/matrix

## Status: Phase 7 — Moderation & Client Customisation

---

## Priority Order

1. ~~PostgreSQL migration~~
2. ~~TURN server~~
3. ~~Room structure + space setup~~
4. ~~Matrix bot (core + commands)~~
5. ~~LiveKit / Element Call~~
6. ~~SSO / OIDC (Authelia)~~
7. ~~Webhook integrations (hookshot)~~
8. ~~Voice stability & quality tuning~~
9. ~~Custom Cinny client (chat.lotusguild.org)~~
10. Custom emoji packs (partially finished)
11. Cinny custom branding (Lotus Guild theme)
12. ~~Draupnir moderation bot~~
13. Push notifications (Sygnal)

---

## Repo Structure

```
matrix/
├── hookshot/                           # Hookshot JS transformation functions (one file per webhook)
│   ├── deploy.sh                       # Deploys all .js files to Matrix room state via API
│   ├── proxmox.js
│   ├── grafana.js
│   ├── uptime-kuma.js
│   └── ...                             # One .js per webhook service
├── cinny/
│   ├── config.json                     # Cinny homeserver config (deployed to /var/www/html/config.json)
│   └── dev-update.sh                   # Nightly build script for Cinny dev branch
├── landing/
│   └── index.html                      # matrix.lotusguild.org landing page
├── draupnir/
│   └── production.yaml                 # Draupnir config (access token is redacted — see rotation docs below)
├── deploy/                             # Auto-deployment infrastructure
│   ├── lxc151-hookshot.sh              # Deploy script for LXC 151 (matrix/hookshot/livekit)
│   ├── lxc106-cinny.sh                 # Deploy script for LXC 106 (cinny)
│   ├── lxc139-landing.sh               # Deploy script for LXC 139 (landing page)
│   ├── lxc110-draupnir.sh              # Deploy script for LXC 110 (draupnir)
│   ├── livekit-graceful-restart.sh     # Waits for zero active calls before restarting livekit
│   ├── hooks-lxc151.json               # webhook binary config for LXC 151
│   ├── hooks-lxc106.json               # webhook binary config for LXC 106
│   ├── hooks-lxc139.json               # webhook binary config for LXC 139
│   └── hooks-lxc110.json               # webhook binary config for LXC 110
└── systemd/
    ├── livekit-server.service              # LiveKit systemd unit (with HA migration fix)
    ├── livekit-graceful-restart.service    # oneshot — checks pending restart flag
    ├── livekit-graceful-restart.timer      # Runs every 5 min
    ├── draupnir.service
    └── cinny-dev-update.cron               # Installed to /etc/cron.d/ on LXC 106
```

---

## Infrastructure

| Service | IP | LXC | RAM | vCPUs | Disk | Versions |
|---------|----|-----|-----|-------|------|----------|
| Synapse | 10.10.10.29 | 151 | 8GB | 4 (Ryzen 9 7900) | 50GB | Synapse 1.149.0, LiveKit 1.9.11, hookshot 7.3.2, coturn latest |
| PostgreSQL 17 | 10.10.10.44 | 109 | 6GB | 3 (Ryzen 9 7900) | 30GB | PostgreSQL 17.9 |
| Cinny Web | 10.10.10.6 | 106 | 2GB | 1 | 8GB | Debian 12, nginx, Node 24, Cinny `dev` branch (nightly build) |
| Draupnir | 10.10.10.24 | 110 | 1GB | 2 (Ryzen 9 7900) | 10GB | Draupnir v2.9.0, Node.js v22 |
| Prometheus | 10.10.10.48 | 118 | — | — | — | Prometheus — scrapes all Matrix services |
| Grafana | 10.10.10.49 | 107 | — | — | — | Grafana 12.4.0 — dashboard.lotusguild.org |
| NPM | 10.10.10.27 | 139 | — | — | — | Nginx Proxy Manager + matrix landing page |
| Authelia | 10.10.10.36 | 167 | — | — | — | SSO/OIDC provider |
| LLDAP | 10.10.10.39 | 147 | — | — | — | LDAP user directory |
| Uptime Kuma | 10.10.10.25 | 101 | — | — | — | Uptime monitoring (micro1 node) |

**Key paths on Synapse LXC (151):**

- Synapse config: `/etc/matrix-synapse/homeserver.yaml`
- Synapse conf.d: `/etc/matrix-synapse/conf.d/` (metrics.yaml, report_stats.yaml, server_name.yaml)
- coturn config: `/etc/turnserver.conf`
- LiveKit config: `/etc/livekit/config.yaml`
- LiveKit service: `livekit-server.service`
- lk-jwt-service: `lk-jwt-service.service` (binds `:8070`, serves JWT tokens for MatrixRTC)
- Hookshot: `/opt/hookshot/`, service: `matrix-hookshot.service`
- Hookshot config: `/opt/hookshot/config.yml`
- Hookshot registration: `/etc/matrix-synapse/hookshot-registration.yaml`
- Bot: `/opt/matrixbot/`, service: `matrixbot.service`
- Repo clone (auto-deploy):
`/opt/matrix-config/`
- Deploy env: `/etc/matrix-deploy.env` (MATRIX_TOKEN, MATRIX_SERVER, MATRIX_ROOM)
- Deploy log: `/var/log/matrix-deploy.log`

**Key paths on Draupnir LXC (110):**

- Install path: `/opt/draupnir/`
- Config: `/opt/draupnir/config/production.yaml`
- Data/SQLite DBs: `/data/storage/`
- Service: `draupnir.service`
- Management room: `#management:matrix.lotusguild.org` (`!mEvR5fe3jMmzwd-FwNygD72OY_yu8H3UP_N-57oK7MI`)
- Bot account: `@draupnir:matrix.lotusguild.org` (power level 100 in all protected rooms)
- Subscribed ban lists: `#community-moderation-effort-bl:neko.dev`, `#matrix-org-coc-bl:matrix.org`
- Rebuild: `NODE_OPTIONS="--max-old-space-size=768" npx tsc --project tsconfig.json`

**Key paths on PostgreSQL LXC (109):**

- PostgreSQL config: `/etc/postgresql/17/main/postgresql.conf`
- Tuning conf.d: `/etc/postgresql/17/main/conf.d/synapse_tuning.conf`
- HBA config: `/etc/postgresql/17/main/pg_hba.conf`
- Data directory: `/var/lib/postgresql/17/main`

**Key paths on Cinny LXC (106):**

- Source: `/opt/cinny-dev/` (branch: `dev`, auto-updated nightly at 3am)
- Built files: `/var/www/html/`
- Cinny config: `/var/www/html/config.json`
- Config backup (survives rebuilds): `/opt/cinny-dev/.cinny-config.json`
- Dev update script: `/usr/local/bin/cinny-dev-update.sh`
- Cron: `/etc/cron.d/cinny-dev-update` (runs at 3:00am daily)
- Nginx site config: `/etc/nginx/sites-available/cinny`

---

## Auto-Deployment

Pushes to `main` on `LotusGuild/matrix` automatically deploy to the relevant LXC(s) via Gitea webhooks. All 4 LXCs are fully independent — each runs its own webhook listener and deploys only its own files. No cross-LXC SSH dependencies.

### How It Works

1. Push to `LotusGuild/matrix` on Gitea
2. Gitea fires webhooks to all 4 LXCs simultaneously (HMAC-SHA256 validated)
3. Each LXC runs `/usr/local/bin/matrix-deploy.sh` via the `webhook` binary
4. Script does `git fetch + reset --hard origin/main`, checks which files changed, deploys only relevant ones
5. Logs to `/var/log/matrix-deploy.log` on each LXC

### Per-LXC Webhook Endpoints

| LXC | Service | IP | Port | Deploys When Changed |
|-----|---------|----|------|----------------------|
| 151 | matrix/hookshot | 10.10.10.29 | **9500** | `hookshot/*.js`, `systemd/livekit-server.service` |
| 106 | cinny | 10.10.10.6 | 9000 | `cinny/config.json`, `cinny/dev-update.sh` |
| 139 | landing/NPM | 10.10.10.27 | 9000 | `landing/index.html` |
| 110 | draupnir | 10.10.10.24 | 9000 | `draupnir/production.yaml` |

> LXC 151 uses port **9500** because ports 9000–9004 are occupied by Synapse and Hookshot.

### What Each Deploy Does

**LXC 151 — hookshot/livekit:**

- `hookshot/*.js` changed → runs `hookshot/deploy.sh` (pushes transform functions to Matrix room state via API, requires `MATRIX_TOKEN` in `/etc/matrix-deploy.env`)
- `systemd/livekit-server.service` changed → copies file, `daemon-reload`, sets `/run/livekit-restart-pending` flag (actual restart deferred — see Livekit Graceful Restart below)

**LXC 106 — cinny:**

- `cinny/config.json` → copies to `/var/www/html/config.json`
- `cinny/dev-update.sh` → copies to `/usr/local/bin/cinny-dev-update.sh`, `chmod +x`

**LXC 139 — landing page:**

- `landing/index.html` → copies to `/var/www/matrix-landing/index.html`, `nginx -s reload`

**LXC 110 — draupnir:**

- `draupnir/production.yaml` → extracts live `accessToken` from existing config, overwrites from repo, restores token via `sed`, restarts `draupnir.service`

### Installed Components (per LXC)

- `webhook` binary (Debian package `webhook` v2.8.0) listening on respective port
- `/etc/webhook/hooks.json` — unique HMAC-SHA256 secret per LXC
- `/usr/local/bin/matrix-deploy.sh` — deploy script from this repo
- `/etc/systemd/system/webhook.service` — enabled and running
- `/opt/matrix-config/` — clone of this repo
- `/var/log/matrix-deploy.log` — deploy log

**LXC 151 additionally:**

- `/etc/matrix-deploy.env` — `MATRIX_TOKEN`, `MATRIX_SERVER`, `MATRIX_ROOM` (not in git)
- `/usr/local/bin/livekit-graceful-restart.sh`
- `/etc/systemd/system/livekit-graceful-restart.service` + `.timer`

### Livekit Graceful Restart

Killing livekit-server while a call is active drops everyone. Instead:

1. Deploy to LXC 151 copies the new `livekit-server.service` and sets a `/run/livekit-restart-pending` flag
2. `livekit-graceful-restart.timer` runs every 5 minutes
3. The timer script counts established TCP connections on port 7881 (`ss -tn state established`)
4. If zero connections → restarts livekit-server and clears the flag
5. If connections exist → logs and exits, retries in 5 minutes

---

## Access Token Rotation

The `MATRIX_TOKEN` in `/etc/matrix-deploy.env` on LXC 151 is a Jared user token used to push hookshot transforms to Matrix room state (requires power level ≥ 50 in Spam and Stuff).

The token in `draupnir/production.yaml` in this repo is **intentionally redacted** (`accessToken: REDACTED`). The deploy script on LXC 110 extracts the live token from the running config before overwriting from the repo, then restores it.

**To rotate the hookshot deploy token (LXC 151):**

1. Generate a new token via Synapse admin API or Cinny → Settings → Security → Manage Sessions
2. SSH to LXC 151 (via `ssh root@10.10.10.4` then `pct enter 151`): `nano /etc/matrix-deploy.env`
3. Replace `MATRIX_TOKEN=` with new token
4. Test: `MATRIX_TOKEN= MATRIX_SERVER=https://matrix.lotusguild.org bash /opt/matrix-config/hookshot/deploy.sh`

**To rotate the Draupnir token:**

1. Generate new token for `@draupnir:matrix.lotusguild.org`
2. On LXC 110: `nano /opt/draupnir/config/production.yaml` → update `accessToken`
3. `systemctl restart draupnir`
4. Do **not** commit the token to git — the repo version stays redacted

---

## Port Maps

**Router → 10.10.10.29 (forwarded):**

- TCP+UDP 3478 — TURN/STUN
- TCP+UDP 5349 — TURNS/TLS
- TCP 7881 — LiveKit ICE TCP fallback
- TCP+UDP 49152-65535 — TURN relay range

**Internal port map (LXC 151):**

| Port | Service | Bind |
|------|---------|------|
| 8008 | Synapse HTTP | 0.0.0.0 |
| 9000 | Synapse metrics | 127.0.0.1 + 10.10.10.29 |
| 9001 | Hookshot widgets | 0.0.0.0 |
| 9002 | Hookshot bridge (appservice) | 127.0.0.1 |
| 9003 | Hookshot webhooks | 0.0.0.0 |
| 9004 | Hookshot metrics | 0.0.0.0 |
| 9100 | node_exporter | 0.0.0.0 |
| 9101 | matrix-admin exporter | 0.0.0.0 |
| 9500 | webhook (auto-deploy) | 0.0.0.0 |
| 6789 | LiveKit metrics | 0.0.0.0 |
| 7880 | LiveKit HTTP | 0.0.0.0 |
| 7881 | LiveKit RTC TCP | 0.0.0.0 |
| 8070 | lk-jwt-service | 0.0.0.0 |
| 8080 | synapse-admin (nginx) | 0.0.0.0 |
| 3478 | coturn STUN/TURN | 0.0.0.0 |
| 5349 | coturn TURNS/TLS | 0.0.0.0 |

**Internal port map (LXC 109 — PostgreSQL):**

| Port | Service | Bind |
|------|---------|------|
| 5432 | PostgreSQL | 0.0.0.0 (hba-restricted to 10.10.10.29) |
| 9100 | node_exporter | 0.0.0.0 |
| 9187 | postgres_exporter | 0.0.0.0 |

---

## Rooms (all v12)

| Room | Room ID | Join Rule |
|------|---------|-----------|
| The Lotus Guild (Space) | `!-1ZBnAH-JiCOV8MGSKN77zDGTuI3pgSdy8Unu_DrDyc` | public |
| General | `!wfokQ1-pE896scu_AOcCBA2s3L4qFo-PTBAFTd0WMI0` | public |
| Commands | `!ou56mVZQ8ZB7AhDYPmBV5_BR28WMZ4x5zwZkPCqjq1s` | restricted (Space members) |
| Memes | `!GK6v5cLEEnowIooQJv5jECfISUjADjt8aKhWv9VbG5U` | restricted (Space members) |
| Voice Room | `!ARbRFSPNp2U0MslWTBGoTT3gbmJJ25dPRL6enQntvPo` | restricted (Space members) |
| Management | `!mEvR5fe3jMmzwd-FwNygD72OY_yu8H3UP_N-57oK7MI` | invite |
| Cool Kids | `!R7DT3QZHG9P8QQvX6zsZYxjkKgmUucxDz_n31qNrC94` | invite |
| Spam and Stuff | `!GttT4QYd1wlGlkHU3qTmq_P3gbyYKKeSSN6R7TPcJHg` | invite, **no E2EE** (hookshot) |

**Power level roles
(Cinny tags):**

- 100: Owner (jared)
- 50: The Nerdy Council (enhuynh, lonely)
- 48: Panel of Geeks
- 35: Cool Kids
- 0: Member

---

## Webhook Integrations (matrix-hookshot 7.3.2)

Generic webhooks bridged into **Spam and Stuff**. Each service gets its own virtual user (`@hookshot_`) with a unique avatar.

Webhook URL format: `https://matrix.lotusguild.org/webhook/`

| Service | Webhook UUID | Notes |
|---------|-------------|-------|
| Grafana | `df4a1302-2d62-4a01-b858-fb56f4d3781a` | Unified alerting contact point |
| Proxmox | `9b3eafe5-7689-4011-addd-c466e524661d` | Notification system (8.1+), Discord embed format |
| Sonarr | `aeffc311-0686-42cb-9eeb-6757140c072e` | All event types |
| Radarr | `34913454-c1ac-4cda-82ea-924d4a9e60eb` | All event types |
| Readarr | `e57ab4f3-56e6-4dc4-8b30-2f4fd4bbeb0b` | All event types |
| Lidarr | `66ac6fdd-69f6-4f47-bb00-b7f6d84d7c1c` | All event types |
| Uptime Kuma | `1a02e890-bb25-42f1-99fe-bba6a19f1811` | Status change notifications |
| Seerr | `555185af-90a1-42ff-aed5-c344e11955cf` | Request/approval events |
| Owncast (Livestream) | `9993e911-c68b-4271-a178-c2d65ca88499` | STREAM_STARTED / STREAM_STOPPED |
| Bazarr | `470fb267-3436-4dd3-a70c-e6e8db1721be` | Subtitle events (Apprise JSON notifier) |
| Tinker-Tickets | `6e306faf-8eea-4ba5-83ef-bf8f421f929e` | Custom transformation code |

**Hookshot notes:**

- Spam and Stuff is intentionally **unencrypted** — hookshot bridges cannot join E2EE rooms
- JS transformation functions use hookshot v2 API: `result = { version: "v2", plain, html, msgtype }`
- The `result` variable must be assigned without `var`/`let`/`const` (QuickJS IIFE sandbox)
- NPM proxies `https://matrix.lotusguild.org/webhook/*` → `http://10.10.10.29:9003`
- Proxmox sends Discord embed format: `data.embeds[0].{title,description,fields}` — NOT flat fields
- Transform functions are stored as Matrix room state (`uk.half-shot.matrix-hookshot.generic.hook`) and deployed via `hookshot/deploy.sh`

**Deploying
hookshot transforms manually:**

```bash
# On LXC 151 or from any machine with access
export MATRIX_TOKEN=
export MATRIX_SERVER=https://matrix.lotusguild.org
export MATRIX_ROOM='!GttT4QYd1wlGlkHU3qTmq_P3gbyYKKeSSN6R7TPcJHg'

bash /opt/matrix-config/hookshot/deploy.sh              # deploy all
bash /opt/matrix-config/hookshot/deploy.sh proxmox.js   # deploy one
```

---

## Moderation (Draupnir v2.9.0)

Draupnir runs on LXC 110, manages moderation across all 9 protected rooms via `#management:matrix.lotusguild.org`.

**Subscribed ban lists:**

- `#community-moderation-effort-bl:neko.dev` — 12,599 banned users, 245 servers, 59 rooms
- `#matrix-org-coc-bl:matrix.org` — 4,589 banned users, 220 servers, 2 rooms

**Common commands (send in management room):**

```
!draupnir status                          — current status + protected rooms
!draupnir ban @user:server * "reason"     — ban from all protected rooms
!draupnir redact @user:server             — redact their recent messages
!draupnir rooms add !roomid:server        — add a room to protection
!draupnir watch --no-confirm              — subscribe to a ban list
```

---

## Cinny Dev Branch (chat.lotusguild.org)

`chat.lotusguild.org` tracks the Cinny `dev` branch to test the latest beta features.

**Nightly build process (`cinny-dev-update.sh`):**

1. `git fetch origin dev` — checks for new commits; exits early if nothing changed
2. Builds in `/opt/cinny-dev/` using Node 24 with `NODE_OPTIONS=--max_old_space_size=896`
3. Validates `dist/index.html` exists before touching the live web root
4. Copies `dist/` to `/var/www/html/`, restores `config.json` from `/opt/cinny-dev/.cinny-config.json`
5. Runs at 3:00am daily via `/etc/cron.d/cinny-dev-update`

**Manual rebuild:**

```bash
# On LXC 106
/usr/local/bin/cinny-dev-update.sh
```

**Why 2GB RAM:** Vite's build process OOM-killed at 1GB. 896MB Node heap + OS overhead requires at least 1.5GB; 2GB gives headroom.

---

## Known Issues

### LiveKit Port Conflict After HA Migration

LXC 151 can migrate between Proxmox nodes via HA.
After migration, the old livekit-server process on the source node can leave a stale entry holding port 7881 on the destination. Fixed in `livekit-server.service` via:

```ini
ExecStartPre=-/bin/bash -c 'pkill -x livekit-server; sleep 1'
KillMode=control-group
```

### coturn TLS Reset Errors

Periodic `TLS/TCP socket error: Connection reset by peer` in coturn logs. Normal — clients probe TURN and drop once they establish a direct P2P path.

### BBR Congestion Control

`net.ipv4.tcp_congestion_control = bbr` must be set on the Proxmox host, not inside an unprivileged LXC. All other sysctl tuning (TCP/UDP buffers, fin_timeout) is applied inside LXC 151.

---

## Server Checklist

### Quality of Life

- [x] Migrate from SQLite to PostgreSQL
- [x] TURN/STUN server (coturn) for reliable voice/video
- [x] URL previews
- [x] Upload size limit 200MB
- [x] Full-text message search (PostgreSQL backend)
- [x] Media retention policy (remote: 1yr, local: 3yr)
- [x] Sliding sync (native Synapse)
- [x] LiveKit for Element Call video rooms
- [x] Default room version v12, all rooms upgraded
- [x] Landing page with client recommendations
- [x] Synapse metrics endpoint (port 9000, Prometheus-compatible)
- [x] Cinny `dev` branch — nightly auto-build, tracks latest beta features
- [x] Auto-deployment via Gitea webhooks (all 4 LXCs)
- [ ] Push notifications gateway (Sygnal) — needs Apple/Google developer credentials
- [ ] Cinny custom branding — Lotus Guild theme (colours, title, favicon, PWA name)

### Performance Tuning

- [x] PostgreSQL `shared_buffers` → 1500MB, `effective_cache_size`, `work_mem`, checkpoint tuning
- [x] PostgreSQL `pg_stat_statements` extension installed
- [x] PostgreSQL autovacuum tuned per-table (5 high-churn tables), `autovacuum_max_workers` → 5
- [x] Synapse `event_cache_size` → 30K, per-cache factors tuned
- [x] sysctl TCP/UDP buffer alignment on LXC 151 (`/etc/sysctl.d/99-matrix-tuning.conf`)
- [x] LiveKit: `empty_timeout: 300`, `departure_timeout: 20`,
`max_participants: 50`
- [x] LiveKit ICE port range expanded to 50000-51000
- [x] LiveKit TURN TTL reduced to 1h
- [x] LiveKit VP9/AV1 codecs enabled
- [ ] BBR congestion control — must be applied on Proxmox host

### Auth & SSO

- [x] Token-based registration
- [x] SSO/OIDC via Authelia
- [x] `allow_existing_users: true` for linking accounts to SSO
- [x] Password auth alongside SSO

### Webhooks & Integrations

- [x] matrix-hookshot 7.3.2 — 11 active webhook services
- [x] Per-service JS transformation functions (stored in git, auto-deployed)
- [x] Per-service virtual user avatars
- [x] NPM reverse proxy for `/webhook` path

### Room Structure

- [x] The Lotus Guild space with all core rooms
- [x] Correct power levels and join rules per room
- [x] Custom room avatars
- [x] Voice room visible to space members (`suggested: true`)

### Hardening

- [x] Rate limiting
- [x] E2EE on all rooms (except Spam and Stuff — intentional for hookshot)
- [x] coturn internal peer deny rules (blocks relay to RFC1918 except allowed subnet)
- [x] coturn hardening: `stale-nonce=600`, `user-quota=100`, `total-quota=1000`, strong cipher list
- [x] `pg_hba.conf` locked down — remote access restricted to Synapse LXC only
- [x] Federation open with key verification
- [x] fail2ban on Synapse login endpoint (5 retries / 24h ban)
- [x] Synapse metrics port 9000 restricted to `127.0.0.1` + `10.10.10.29`
- [x] coturn cert auto-renewal — daily sync cron on compute-storage-01
- [x] `/.well-known/matrix/client` and `/server` live on lotusguild.org
- [x] `suppress_key_server_warning: true`
- [x] Automated database + media backups
- [x] Federation bad-actor blocking via Draupnir ban lists (17,000+ entries)
- [x] Webhook HMAC-SHA256 validation on all auto-deploy endpoints

### Monitoring

- [x] Grafana dashboard — `dashboard.lotusguild.org/d/matrix-synapse-dashboard` (140+ panels)
- [x] Prometheus scraping all Matrix services (Synapse, Hookshot, LiveKit, node_exporter, postgres)
- [x] 14 active alert rules
across matrix-folder and infra-folder
- [x] Uptime Kuma monitors: Synapse, LiveKit, PostgreSQL, Cinny, coturn, lk-jwt-service, Hookshot

### Admin

- [x] Synapse admin API dashboard (synapse-admin at http://10.10.10.29:8080)
- [x] Draupnir moderation bot — LXC 110, v2.9.0, 9 protected rooms, 2 ban lists
- [ ] Cinny custom branding

---

## Monitoring & Observability

### Prometheus Scrape Jobs

| Job | Target | Metrics |
|-----|--------|---------|
| `synapse` | `10.10.10.29:9000` | Full Synapse internals |
| `matrix-admin` | `10.10.10.29:9101` | DAU, MAU, room/user/media totals |
| `livekit` | `10.10.10.29:6789` | Rooms, participants, packets, latency |
| `hookshot` | `10.10.10.29:9004` | Connections, API calls/failures, Node.js runtime |
| `matrix-node` | `10.10.10.29:9100` | CPU, RAM, network, load average, disk |
| `postgres` | `10.10.10.44:9187` | pg_stat_database, connections, WAL, block I/O |
| `postgres-node` | `10.10.10.44:9100` | CPU, RAM, network, load average, disk |

> **Disk I/O:** All servers use Ceph-backed storage. Per-device disk I/O metrics are meaningless — use Network I/O panels to see actual storage traffic.

### Alert Rules

**Matrix folder:**

| Alert | Fires when | Severity |
|-------|-----------|----------|
| Synapse Down | `up{job="synapse"}` < 1 for 2m | critical |
| PostgreSQL Down | `pg_up` < 1 for 2m | critical |
| LiveKit Down | `up{job="livekit"}` < 1 for 2m | critical |
| Hookshot Down | `up{job="hookshot"}` < 1 for 2m | critical |
| PG Connection Saturation | connections > 80% of max for 5m | warning |
| Federation Queue Backing Up | pending PDUs > 100 for 10m | warning |
| Synapse High Memory | RSS > 2000MB for 10m | warning |
| Synapse High Response Time | p99 latency (excl. /sync) > 10s for 5m | warning |
| Synapse Event Processing Lag | any processor > 30s behind for 5m | warning |
| Synapse DB Query Latency High | p99 query time > 1s for 5m | warning |

**Infrastructure folder:**

| Alert | Fires when | Severity |
|-------|-----------|----------|
| Service Exporter Down | any `up == 0` for 3m | critical |
| Node High CPU Usage | CPU > 90% for 10m | warning |
| Node High Memory Usage | RAM > 90% for 10m | warning |
| Node Disk Space Low | available < 15% (excl. tmpfs/overlay) for 10m | warning |

> **`/sync` long-poll:** The Matrix `/sync` endpoint is a long-poll (clients hold it open ≤30s). It is excluded from the High Response Time alert to prevent false positives.

> **Synapse Event Processing Lag** can fire transiently after a Synapse restart while processors drain their backlog. Self-resolves in 10–20 minutes.

---

## Tech Stack

| Component | Technology | Version |
|-----------|-----------|---------|
| Homeserver | Synapse | 1.149.0 |
| Database | PostgreSQL | 17.9 |
| TURN | coturn | latest |
| Video/voice calls | LiveKit SFU | 1.9.11 |
| LiveKit JWT | lk-jwt-service | latest |
| Moderation | Draupnir | 2.9.0 |
| SSO | Authelia (OIDC) + LLDAP | — |
| Webhook bridge | matrix-hookshot | 7.3.2 |
| Reverse proxy | Nginx Proxy Manager | — |
| Web client | Cinny (`dev` branch, nightly build) | dev |
| Auto-deploy | adnanh/webhook | 2.8.0 |
| Bot language | Python 3 | 3.x |
| Bot library | matrix-nio (E2EE) | latest |
| Bot dependencies | matrix-nio[e2ee], aiohttp, python-dotenv, mcrcon | — |
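
**Sketch: graceful-restart decision.** The check described under Livekit Graceful Restart (restart only when the pending flag is set and nothing is connected on port 7881) could be expressed roughly as below. This is an illustrative Python sketch, not the contents of `livekit-graceful-restart.sh`; the function names `count_established` and `should_restart` are hypothetical, and the parsing assumes the usual `ss -tn state established` column layout (Recv-Q, Send-Q, local address, peer address).

```python
LIVEKIT_TCP_PORT = 7881  # LiveKit RTC TCP port from the port map


def count_established(ss_output: str, port: int = LIVEKIT_TCP_PORT) -> int:
    """Count rows of `ss -tn state established` whose local port equals `port`.

    Assumed row layout: Recv-Q Send-Q Local:Port Peer:Port
    """
    count = 0
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip the header row and blank lines: data rows start with a number.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        local_port = fields[2].rsplit(":", 1)[-1]
        if local_port == str(port):
            count += 1
    return count


def should_restart(flag_present: bool, ss_output: str) -> bool:
    """Restart only if a restart is pending and no call traffic is connected."""
    return flag_present and count_established(ss_output) == 0
```

In a real timer job, `flag_present` would come from checking `/run/livekit-restart-pending` and `ss_output` from running `ss`; both are passed in here so the decision logic can be tested without touching the system.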
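
**Sketch: webhook HMAC-SHA256 validation.** On the LXCs the signature check is performed by the `webhook` binary using the secret in `/etc/webhook/hooks.json`, so no custom code is needed. For debugging a rejected delivery by hand, the same check can be reproduced with the Python sketch below. It assumes the sender puts a hex-encoded HMAC-SHA256 of the raw request body in a signature header (for Gitea this is believed to be `X-Gitea-Signature`; treat the header name as an assumption). The function name `signature_is_valid` is hypothetical.

```python
import hashlib
import hmac


def signature_is_valid(secret: str, body: bytes, signature_hex: str) -> bool:
    """Return True if signature_hex is the hex HMAC-SHA256 of body under secret."""
    expected = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    received = signature_hex.strip().lower()
    # Compare as bytes in constant time to avoid leaking how many characters matched.
    return hmac.compare_digest(expected.encode("ascii"), received.encode("utf-8"))
```

The body must be the exact bytes received, before any JSON parsing or re-serialisation, otherwise the digest will not match.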