Lotus Matrix Infrastructure
Matrix server infrastructure for the Lotus Guild homeserver (matrix.lotusguild.org).
Repo: https://code.lotusguild.org/LotusGuild/matrix
Status: Phase 7 — Moderation & Client Customisation
Repo Structure
matrix/
├── hookshot/ # Hookshot JS transformation functions (one file per webhook)
│ ├── deploy.sh # Deploys all .js files to Matrix room state via API
│ ├── proxmox.js
│ ├── grafana.js
│ ├── uptime-kuma.js
│ └── ... # One .js per webhook service
├── cinny/
│ ├── config.json # Cinny homeserver config (deployed to /var/www/html/config.json)
│ └── dev-update.sh # Nightly build script for Cinny dev branch
├── landing/
│ └── index.html # matrix.lotusguild.org landing page
├── draupnir/
│ └── production.yaml # Draupnir config (access token is redacted — see rotation docs below)
├── deploy/ # Auto-deployment infrastructure
│ ├── lxc151-hookshot.sh # Deploy script for LXC 151 (matrix/hookshot/livekit)
│ ├── lxc106-cinny.sh # Deploy script for LXC 106 (cinny)
│ ├── lxc139-landing.sh # Deploy script for LXC 139 (landing page)
│ ├── lxc110-draupnir.sh # Deploy script for LXC 110 (draupnir)
│ ├── livekit-graceful-restart.sh # Waits for zero active calls before restarting livekit
│ ├── hooks-lxc151.json # webhook binary config for LXC 151
│ ├── hooks-lxc106.json # webhook binary config for LXC 106
│ ├── hooks-lxc139.json # webhook binary config for LXC 139
│ └── hooks-lxc110.json # webhook binary config for LXC 110
└── systemd/
├── livekit-server.service # LiveKit systemd unit (with HA migration fix)
├── livekit-graceful-restart.service # oneshot — checks pending restart flag
├── livekit-graceful-restart.timer # Runs every 5 min
├── draupnir.service
└── cinny-dev-update.cron # Installed to /etc/cron.d/ on LXC 106
Infrastructure
| Service | IP | LXC | RAM | Disk | Versions |
|---|---|---|---|---|---|
| Synapse | 10.10.10.29 | 151 | 8GB | 50GB | Synapse 1.149.0, LiveKit 1.9.11, hookshot 7.3.2, coturn latest |
| PostgreSQL 17 | 10.10.10.44 | 109 | 6GB | 30GB | PostgreSQL 17.9 |
| Cinny Web | 10.10.10.6 | 106 | 2GB | 8GB | Debian 12, nginx, Node 24, Cinny dev branch (nightly build) |
| Draupnir | 10.10.10.24 | 110 | 1GB | 10GB | Draupnir v2.9.0, Node.js v22 |
| Prometheus | 10.10.10.48 | 118 | — | — | Prometheus — scrapes all Matrix services |
| Grafana | 10.10.10.49 | 107 | — | — | Grafana 12.4.0 — dashboard.lotusguild.org |
| NPM | 10.10.10.27 | 139 | — | — | Nginx Proxy Manager + matrix landing page |
| Authelia | 10.10.10.36 | 167 | — | — | SSO/OIDC provider |
| LLDAP | 10.10.10.39 | 147 | — | — | LDAP user directory |
| Uptime Kuma | 10.10.10.25 | 101 | — | — | Uptime monitoring (micro1 node) |
Key paths on Synapse LXC (151):
- Synapse config: `/etc/matrix-synapse/homeserver.yaml`
- Synapse conf.d: `/etc/matrix-synapse/conf.d/` (metrics.yaml, report_stats.yaml, server_name.yaml)
- coturn config: `/etc/turnserver.conf`
- LiveKit config: `/etc/livekit/config.yaml`
- LiveKit service: `livekit-server.service`
- lk-jwt-service: `lk-jwt-service.service` (binds :8070, serves JWT tokens for MatrixRTC)
- Hookshot: `/opt/hookshot/`, service: `matrix-hookshot.service`
- Hookshot config: `/opt/hookshot/config.yml`
- Hookshot registration: `/etc/matrix-synapse/hookshot-registration.yaml`
- Bot: `/opt/matrixbot/`, service: `matrixbot.service`
- Repo clone (auto-deploy): `/opt/matrix-config/`
- Deploy env: `/etc/matrix-deploy.env` (MATRIX_TOKEN, MATRIX_SERVER, MATRIX_ROOM)
- Deploy log: `/var/log/matrix-deploy.log`
Key paths on Draupnir LXC (110):
- Install path: `/opt/draupnir/`
- Config: `/opt/draupnir/config/production.yaml`
- Data/SQLite DBs: `/data/storage/`
- Service: `draupnir.service`
- Management room: `#management:matrix.lotusguild.org` (`!mEvR5fe3jMmzwd-FwNygD72OY_yu8H3UP_N-57oK7MI`)
- Bot account: `@draupnir:matrix.lotusguild.org` (power level 100 in all protected rooms and the Lotus Guild space)
- Subscribed ban lists: `#community-moderation-effort-bl:neko.dev`, `#matrix-org-coc-bl:matrix.org`
- Rebuild: `NODE_OPTIONS="--max-old-space-size=768" npx tsc --project tsconfig.json`
- Healthz endpoint: `http://10.10.10.24:8081/healthz` (200 = healthy, 418 = disconnected)
- Abuse reporting endpoint: `POST http://10.10.10.24:8080/_matrix/draupnir/1/report/{roomId}/{eventId}`
- Audit DBs: `/data/storage/user-restriction-audit-log.db`, `/data/storage/room-audit-log.db`
Key paths on PostgreSQL LXC (109):
- PostgreSQL config: `/etc/postgresql/17/main/postgresql.conf`
- Tuning conf.d: `/etc/postgresql/17/main/conf.d/synapse_tuning.conf`
- HBA config: `/etc/postgresql/17/main/pg_hba.conf`
- Data directory: `/var/lib/postgresql/17/main`
Key paths on Cinny LXC (106):
- Source: `/opt/cinny-dev/` (branch: dev, auto-updated nightly at 3am)
- Built files: `/var/www/html/`
- Cinny config: `/var/www/html/config.json`
- Config backup (survives rebuilds): `/opt/cinny-dev/.cinny-config.json`
- Dev update script: `/usr/local/bin/cinny-dev-update.sh`
- Cron: `/etc/cron.d/cinny-dev-update` (runs at 3:00am daily)
- Nginx site config: `/etc/nginx/sites-available/cinny`
Auto-Deployment
Pushes to main on LotusGuild/matrix automatically deploy to the relevant LXC(s) via Gitea webhooks. All 4 LXCs are fully independent — each runs its own webhook listener and deploys only its own files. No cross-LXC SSH dependencies.
How It Works
- Push to `LotusGuild/matrix` on Gitea
- Gitea fires webhooks to all 4 LXCs simultaneously (HMAC-SHA256 validated)
- Each LXC runs `/usr/local/bin/matrix-deploy.sh` via the `webhook` binary
- Script does `git fetch` + `reset --hard origin/main`, checks which files changed, deploys only the relevant ones
- Logs to `/var/log/matrix-deploy.log` on each LXC
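The flow above can be sketched as a small script — a hedged approximation, not the actual `matrix-deploy.sh` (the function names and the hookshot-only branch shown here are assumptions; paths come from this README):

```shell
#!/usr/bin/env bash
# Sketch of the per-LXC deploy logic. Illustrative only.

# True when any changed path matches this LXC's deploy filter.
needs_deploy() {
    local filter=$1 changed=$2
    printf '%s\n' "$changed" | grep -q "$filter"
}

deploy() {
    local repo=/opt/matrix-config log=/var/log/matrix-deploy.log
    cd "$repo" || return 1
    local old; old=$(git rev-parse HEAD)
    git fetch origin main && git reset --hard origin/main
    # List files changed between the old and new HEAD
    local changed; changed=$(git diff --name-only "$old" HEAD)

    # e.g. LXC 151: redeploy transforms when any hookshot/*.js changed
    if needs_deploy '^hookshot/.*\.js$' "$changed"; then
        bash "$repo/hookshot/deploy.sh" >>"$log" 2>&1
    fi
}
```

The real scripts differ per LXC; only the fetch/reset/diff pattern is implied by the text.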
Per-LXC Webhook Endpoints
| LXC | Service | IP | Port | Deploys When Changed |
|---|---|---|---|---|
| 151 | matrix/hookshot | 10.10.10.29 | 9500 | hookshot/*.js, systemd/livekit-server.service |
| 106 | cinny | 10.10.10.6 | 9000 | cinny/config.json, cinny/dev-update.sh |
| 139 | landing/NPM | 10.10.10.27 | 9000 | landing/index.html |
| 110 | draupnir | 10.10.10.24 | 9000 | draupnir/production.yaml |
LXC 151 uses port 9500 because ports 9000–9004 are occupied by Synapse and Hookshot.
What Each Deploy Does
LXC 151 — hookshot/livekit:
- `hookshot/*.js` changed → runs `hookshot/deploy.sh` (pushes transform functions to Matrix room state via API, requires `MATRIX_TOKEN` in `/etc/matrix-deploy.env`)
- `systemd/livekit-server.service` changed → copies file, `daemon-reload`, sets `/run/livekit-restart-pending` flag (actual restart deferred — see LiveKit Graceful Restart below)
LXC 106 — cinny:
- `cinny/config.json` → copies to `/var/www/html/config.json`
- `cinny/dev-update.sh` → copies to `/usr/local/bin/cinny-dev-update.sh`, `chmod +x`
LXC 139 — landing page:
- `landing/index.html` → copies to `/var/www/matrix-landing/index.html`, `nginx -s reload`
LXC 110 — draupnir:
- `draupnir/production.yaml` → extracts the live `accessToken` from the existing config, overwrites from the repo, restores the token via `sed`, restarts `draupnir.service`
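The token-preserving overwrite can be sketched as follows — an approximation of the behaviour described above, not the deployed script (`restore_token` and its sed expressions are assumptions; the paths are from this README):

```shell
# restore_token: copy the repo config over the live one, but keep the
# live file's accessToken value in place of the redacted placeholder.
restore_token() {
    local live=$1 repo=$2 token
    # Grab the live accessToken value (with or without quotes)
    token=$(sed -n 's/^accessToken:[[:space:]]*"\{0,1\}\([^" ]*\).*/\1/p' "$live" | head -n1)
    cp "$repo" "$live"
    # Put the live token back where the repo says REDACTED
    sed -i "s|^accessToken:.*|accessToken: \"$token\"|" "$live"
}

# On LXC 110 the deploy would then run:
# restore_token /opt/draupnir/config/production.yaml \
#               /opt/matrix-config/draupnir/production.yaml
# systemctl restart draupnir
```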
Installed Components (per LXC)
- `webhook` binary (Debian package `webhook` v2.8.0) listening on its respective port
- `/etc/webhook/hooks.json` — unique HMAC-SHA256 secret per LXC
- `/usr/local/bin/matrix-deploy.sh` — deploy script from this repo
- `/etc/systemd/system/webhook.service` — enabled and running
- `/opt/matrix-config/` — clone of this repo
- `/var/log/matrix-deploy.log` — deploy log
LXC 151 additionally:
- `/etc/matrix-deploy.env` — `MATRIX_TOKEN`, `MATRIX_SERVER`, `MATRIX_ROOM` (not in git)
- `/usr/local/bin/livekit-graceful-restart.sh`
- `/etc/systemd/system/livekit-graceful-restart.service` + `.timer`
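For reference, a hooks.json for the `webhook` binary in this setup would look roughly like this — a sketch, not one of the real per-LXC files (the hook `id` and secret are placeholders; `payload-hmac-sha256` and the `X-Gitea-Signature` header are the standard Gitea/adnanh-webhook pairing):

```json
[
  {
    "id": "matrix-deploy",
    "execute-command": "/usr/local/bin/matrix-deploy.sh",
    "command-working-directory": "/opt/matrix-config",
    "trigger-rule": {
      "match": {
        "type": "payload-hmac-sha256",
        "secret": "<per-LXC secret>",
        "parameter": { "source": "header", "name": "X-Gitea-Signature" }
      }
    }
  }
]
```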
LiveKit Graceful Restart
Killing livekit-server while a call is active drops everyone. Instead:
- Deploy to LXC 151 copies the new `livekit-server.service` and sets a `/run/livekit-restart-pending` flag
- `livekit-graceful-restart.timer` runs every 5 minutes
- The timer script counts established TCP connections on port 7881 (`ss -tn state established`)
- If zero connections → restarts livekit-server and clears the flag
- If connections exist → logs and exits, retries in 5 minutes
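A minimal sketch of the timer script's check, assuming the flag path and port from this README (the function bodies are illustrative, not the deployed `livekit-graceful-restart.sh`):

```shell
# Count established TCP connections on the LiveKit RTC port.
livekit_active_calls() {
    ss -Htn state established '( sport = :7881 )' 2>/dev/null | wc -l
}

# Restart only when the pending flag is set and the port is quiet.
maybe_restart_livekit() {
    [ -f /run/livekit-restart-pending ] || return 0   # nothing pending
    if [ "$(livekit_active_calls)" -eq 0 ]; then
        systemctl restart livekit-server
        rm -f /run/livekit-restart-pending
    else
        echo "livekit restart deferred: $(livekit_active_calls) connection(s)"
    fi
}
```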
Access Token Rotation
The MATRIX_TOKEN in /etc/matrix-deploy.env on LXC 151 is a Jared user token used to push hookshot transforms to Matrix room state (requires power level ≥ 50 in Spam and Stuff).
The token in draupnir/production.yaml in this repo is intentionally redacted (accessToken: REDACTED). The deploy script on LXC 110 extracts the live token from the running config before overwriting from the repo, then restores it.
To rotate the hookshot deploy token (LXC 151):
- Generate a new token via Synapse admin API or Cinny → Settings → Security → Manage Sessions
- SSH to LXC 151 (via `ssh root@10.10.10.4` then `pct enter 151`): `nano /etc/matrix-deploy.env`
- Replace `MATRIX_TOKEN=<old>` with the new token
- Test: `MATRIX_TOKEN=<new> MATRIX_SERVER=https://matrix.lotusguild.org bash /opt/matrix-config/hookshot/deploy.sh`
To rotate the Draupnir token:
- Generate a new token for `@draupnir:matrix.lotusguild.org`
- On LXC 110: `nano /opt/draupnir/config/production.yaml` → update `accessToken`
- `systemctl restart draupnir`
- Do not commit the token to git — the repo version stays redacted
Port Maps
Router → 10.10.10.29 (forwarded):
- TCP+UDP 3478 — TURN/STUN
- TCP+UDP 5349 — TURNS/TLS
- TCP 7881 — LiveKit ICE TCP fallback
- TCP+UDP 49152-65535 — TURN relay range
Internal port map (LXC 151):
| Port | Service | Bind |
|---|---|---|
| 8008 | Synapse HTTP | 0.0.0.0 |
| 9000 | Synapse metrics | 127.0.0.1 + 10.10.10.29 |
| 9001 | Hookshot widgets | 0.0.0.0 |
| 9002 | Hookshot bridge (appservice) | 127.0.0.1 |
| 9003 | Hookshot webhooks | 0.0.0.0 |
| 9004 | Hookshot metrics | 0.0.0.0 |
| 9100 | node_exporter | 0.0.0.0 |
| 9101 | matrix-admin exporter | 0.0.0.0 |
| 9500 | webhook (auto-deploy) | 0.0.0.0 |
| 6789 | LiveKit metrics | 0.0.0.0 |
| 7880 | LiveKit HTTP | 0.0.0.0 |
| 7881 | LiveKit RTC TCP | 0.0.0.0 |
| 8070 | lk-jwt-service | 0.0.0.0 |
| 8080 | synapse-admin (nginx) | 0.0.0.0 |
| 3478 | coturn STUN/TURN | 0.0.0.0 |
| 5349 | coturn TURNS/TLS | 0.0.0.0 |
Internal port map (LXC 110 — Draupnir):
| Port | Service | Bind |
|---|---|---|
| 8080 | Draupnir web (abuse reporting) | 0.0.0.0 |
| 8081 | Draupnir healthz | 0.0.0.0 |
| 9000 | webhook (auto-deploy) | 0.0.0.0 |
| 9100 | node_exporter | 0.0.0.0 |
| 9256 | process_exporter | 0.0.0.0 |
Internal port map (LXC 109 — PostgreSQL):
| Port | Service | Bind |
|---|---|---|
| 5432 | PostgreSQL | 0.0.0.0 (hba-restricted to 10.10.10.29) |
| 9100 | node_exporter | 0.0.0.0 |
| 9187 | postgres_exporter | 0.0.0.0 |
Rooms (all v12)
| Room | Room ID | Join Rule |
|---|---|---|
| The Lotus Guild (Space) | !-1ZBnAH-JiCOV8MGSKN77zDGTuI3pgSdy8Unu_DrDyc | public |
| General | !wfokQ1-pE896scu_AOcCBA2s3L4qFo-PTBAFTd0WMI0 | public |
| Commands | !ou56mVZQ8ZB7AhDYPmBV5_BR28WMZ4x5zwZkPCqjq1s | restricted (Space members) |
| Memes | !GK6v5cLEEnowIooQJv5jECfISUjADjt8aKhWv9VbG5U | restricted (Space members) |
| Music | !ktQu0gavhjpCMkgxk8SYdb6mnJRY-u7mY7_KfksV0SU | restricted (Space members) |
| Voice Room | !ARbRFSPNp2U0MslWTBGoTT3gbmJJ25dPRL6enQntvPo | restricted (Space members) |
| Management | !mEvR5fe3jMmzwd-FwNygD72OY_yu8H3UP_N-57oK7MI | invite |
| Cool Kids | !R7DT3QZHG9P8QQvX6zsZYxjkKgmUucxDz_n31qNrC94 | invite |
| Spam and Stuff | !GttT4QYd1wlGlkHU3qTmq_P3gbyYKKeSSN6R7TPcJHg | invite, no E2EE (hookshot) |
Power level roles (Cinny tags):
- 100: Owner (jared, draupnir, lotusbot)
- 50: The Nerdy Council / Panel of Geeks (enhuynh, lonely)
- 0: Member
Webhook Integrations (matrix-hookshot 7.3.2)
Generic webhooks bridged into Spam and Stuff.
Each service gets its own virtual user (@hookshot_<service>) with a unique avatar.
Webhook URL format: https://matrix.lotusguild.org/webhook/<uuid>
| Service | Webhook UUID | Notes |
|---|---|---|
| Grafana | df4a1302-2d62-4a01-b858-fb56f4d3781a | Unified alerting contact point |
| Proxmox | 9b3eafe5-7689-4011-addd-c466e524661d | Notification system (8.1+), Discord embed format |
| Sonarr | aeffc311-0686-42cb-9eeb-6757140c072e | All event types |
| Radarr | 34913454-c1ac-4cda-82ea-924d4a9e60eb | All event types |
| Readarr | e57ab4f3-56e6-4dc4-8b30-2f4fd4bbeb0b | All event types |
| Lidarr | 66ac6fdd-69f6-4f47-bb00-b7f6d84d7c1c | All event types |
| Uptime Kuma | 1a02e890-bb25-42f1-99fe-bba6a19f1811 | Status change notifications |
| Seerr | 555185af-90a1-42ff-aed5-c344e11955cf | Request/approval events |
| Owncast (Livestream) | 9993e911-c68b-4271-a178-c2d65ca88499 | STREAM_STARTED / STREAM_STOPPED |
| Bazarr | 470fb267-3436-4dd3-a70c-e6e8db1721be | Subtitle events (Apprise JSON notifier) |
| Tinker-Tickets | 6e306faf-8eea-4ba5-83ef-bf8f421f929e | Custom transformation code |
Hookshot notes:
- Spam and Stuff is intentionally unencrypted — hookshot bridges cannot join E2EE rooms
- JS transformation functions use the hookshot v2 API: `result = { version: "v2", plain, html, msgtype }`
- The `result` variable must be assigned without `var`/`let`/`const` (QuickJS IIFE sandbox)
- NPM proxies `https://matrix.lotusguild.org/webhook/*` → `http://10.10.10.29:9003`
- Proxmox sends Discord embed format: `data.embeds[0].{title,description,fields}` — NOT flat fields
- Transform functions are stored as Matrix room state (`uk.half-shot.matrix-hookshot.generic.hook`) and deployed via `hookshot/deploy.sh`
Deploying hookshot transforms manually:
# On LXC 151 or from any machine with access
export MATRIX_TOKEN=<jared_token>
export MATRIX_SERVER=https://matrix.lotusguild.org
export MATRIX_ROOM='!GttT4QYd1wlGlkHU3qTmq_P3gbyYKKeSSN6R7TPcJHg'
bash /opt/matrix-config/hookshot/deploy.sh # deploy all
bash /opt/matrix-config/hookshot/deploy.sh proxmox.js # deploy one
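Under the hood, each transform push is a single state-event PUT. A hedged sketch, assuming `jq` is available and that the state key matches the file's base name (the exact event content shape is an assumption — check `deploy.sh` itself; the event type and env vars come from this README):

```shell
# Wrap a transform .js file in the JSON body for the state event.
build_hook_body() {
    # $1 = path to a transform .js file; the source goes in as a raw string
    jq -n --rawfile fn "$1" '{transformationFunction: $fn}'
}

# Then (with MATRIX_TOKEN/MATRIX_SERVER/MATRIX_ROOM exported) roughly:
# curl -X PUT -H "Authorization: Bearer $MATRIX_TOKEN" \
#   --data "$(build_hook_body proxmox.js)" \
#   "$MATRIX_SERVER/_matrix/client/v3/rooms/$MATRIX_ROOM/state/uk.half-shot.matrix-hookshot.generic.hook/proxmox"
```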
Moderation (Draupnir v2.9.0)
Draupnir runs on LXC 110 and manages moderation across all protected rooms (including the Lotus Guild space) via #management:matrix.lotusguild.org.
Subscribed ban lists:
- `#community-moderation-effort-bl:neko.dev` — 12,599 banned users, 245 servers, 59 rooms
- `#matrix-org-coc-bl:matrix.org` — 4,589 banned users, 220 servers, 2 rooms
Common commands (send in management room):
!draupnir status — current status + protected rooms
!draupnir ban @user:server * "reason" — ban from all protected rooms
!draupnir redact @user:server — redact their recent messages
!draupnir rooms add !roomid:server — add a room to protection
!draupnir watch <alias> --no-confirm — subscribe to a ban list
Abuse Reporting
When a Matrix client user clicks "Report" on a message, Synapse receives a POST /_matrix/client/v3/rooms/{roomId}/report/{eventId} request and stores the report internally. To forward these to the Draupnir management room, a Synapse Python module must be installed on LXC 151.
Draupnir web server is enabled (port 8080). The endpoint is:
POST http://10.10.10.24:8080/_matrix/draupnir/1/report/{roomId}/{eventId}
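Room and event IDs contain `!`, `$`, and `:`, which must be percent-encoded when composing that report URL. A small hypothetical helper (not part of the repo):

```shell
# Percent-encode the characters Matrix IDs use in this URL path.
encode_id() {
    printf '%s' "$1" | sed -e 's/!/%21/g' -e 's/\$/%24/g' -e 's/:/%3A/g'
}

# e.g. (IDs are placeholders):
# curl -X POST "http://10.10.10.24:8080/_matrix/draupnir/1/report/$(encode_id '!room:server')/$(encode_id '$event')"
```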
To complete Synapse integration (one-time, on LXC 151):
- Install the module: `pip install matrix-synapse-draupnir-abuse-reports` (or equivalent — check Draupnir releases)
- Add to `/etc/matrix-synapse/homeserver.yaml`:

      modules:
        - module: "draupnir.abuse_reports.AbuseReportEndpoint"
          config:
            draupnir_endpoint: "http://10.10.10.24:8080"

- `systemctl restart matrix-synapse`
Until the Synapse module is installed, abuse reports are stored in Synapse's DB but do NOT appear in the management room. The Draupnir web server is running and ready to receive forwarded reports.
Cinny Dev Branch (chat.lotusguild.org)
chat.lotusguild.org tracks the Cinny dev branch to test the latest beta features.
Nightly build process (cinny-dev-update.sh):
- `git fetch origin dev` — checks for new commits; exits early if nothing changed
- Builds in `/opt/cinny-dev/` using Node 24 with `NODE_OPTIONS=--max_old_space_size=896`
- Validates `dist/index.html` exists before touching the live web root
- Copies `dist/` to `/var/www/html/`, restores `config.json` from `/opt/cinny-dev/.cinny-config.json`
- Runs at 3:00am daily via `/etc/cron.d/cinny-dev-update`
Manual rebuild:
# On LXC 106
/usr/local/bin/cinny-dev-update.sh
Why 2GB RAM: Vite's build was OOM-killed at 1GB. An 896MB Node heap plus OS overhead needs at least 1.5GB; 2GB gives headroom.
Known Issues
LiveKit Port Conflict After HA Migration
LXC 151 can migrate between Proxmox nodes via HA. After migration, the old livekit-server process on the source node can leave a stale entry holding port 7881 on the destination. Fixed in livekit-server.service via:
ExecStartPre=-/bin/bash -c 'pkill -x livekit-server; sleep 1'
KillMode=control-group
coturn TLS Reset Errors
Periodic TLS/TCP socket error: Connection reset by peer in coturn logs. Normal — clients probe TURN and drop once they establish a direct P2P path.
BBR Congestion Control
net.ipv4.tcp_congestion_control = bbr must be set on the Proxmox host, not inside an unprivileged LXC. All other sysctl tuning (TCP/UDP buffers, fin_timeout) is applied inside LXC 151.
Server Checklist
Quality of Life
- Migrate from SQLite to PostgreSQL
- TURN/STUN server (coturn) for reliable voice/video
- URL previews
- Upload size limit 200MB
- Full-text message search (PostgreSQL backend)
- Media retention policy (remote: 1yr, local: 3yr)
- Sliding sync (native Synapse)
- LiveKit for Element Call video rooms
- Default room version v12, all rooms upgraded
- Landing page with client recommendations
- Synapse metrics endpoint (port 9000, Prometheus-compatible)
- Cinny `dev` branch — nightly auto-build, tracks latest beta features
- Auto-deployment via Gitea webhooks (all 4 LXCs)
- Push notifications gateway (Sygnal) — needs Apple/Google developer credentials
- Cinny custom branding — Lotus Guild theme (colours, title, favicon, PWA name)
Performance Tuning
- PostgreSQL `shared_buffers` → 1500MB, `effective_cache_size`, `work_mem`, checkpoint tuning
- PostgreSQL `pg_stat_statements` extension installed
- PostgreSQL autovacuum tuned per-table (5 high-churn tables), `autovacuum_max_workers` → 5
- Synapse `event_cache_size` → 30K, per-cache factors tuned
- sysctl TCP/UDP buffer alignment on LXC 151 (`/etc/sysctl.d/99-matrix-tuning.conf`)
- LiveKit: `empty_timeout: 300`, `departure_timeout: 20`, `max_participants: 50`
- LiveKit ICE port range expanded to 50000-51000
- LiveKit TURN TTL reduced to 1h
- LiveKit VP9/AV1 codecs enabled
- TCP retransmit timeout lowered (`tcp_retries2=5`, `tcp_syn_retries=4`, `tcp_keepalive_probes=3`) — stalled outbound federation connections now fail in ~15-30s instead of ~15 min
- Unreachable routes added for servers with asymmetric connectivity (they can reach us but we can't reach their federation port) — prevents 90s TCP hangs from adding to lag; defined in `/etc/network/interfaces` post-up hooks so they survive reboots (bark.lgbt ×2, parodia.dev, chat.ohaa.xyz, matrix.k8ekat.dev)
- Stuck `device_lists_remote_resync` entries cleared for dead-server users (@dalite:bark.lgbt, @arndot:matrix.goch.social) — device list resync was firing every 30s
- BBR congestion control — must be applied on the Proxmox host
Auth & SSO
- Token-based registration
- SSO/OIDC via Authelia
- `allow_existing_users: true` for linking accounts to SSO
- Password auth alongside SSO
- Terms of Service / consent enforcement — `require_at_registration: false`, `block_events_error` set; new users cannot send messages until they explicitly accept via `/_matrix/consent`; Synapse sends a Server Notice DM with the consent URL on first blocked send
Webhooks & Integrations
- matrix-hookshot 7.3.2 — 11 active webhook services
- Per-service JS transformation functions (stored in git, auto-deployed)
- Per-service virtual user avatars
- NPM reverse proxy for the `/webhook` path
Room Structure
- The Lotus Guild space with all core rooms
- Correct power levels and join rules per room
- Custom room avatars
- Voice room visible to space members (`suggested: true`)
Hardening
- Rate limiting
- E2EE on all rooms (except Spam and Stuff — intentional for hookshot)
- coturn internal peer deny rules (blocks relay to RFC1918; `allowed-peer-ip` scoped to 10.10.10.29 only — the LiveKit host)
- coturn TCP relay disabled (`no-tcp-relay=true`) — UDP only, reduces internal network SSRF risk
- coturn hardening: `stale-nonce=600`, `user-quota=100`, `total-quota=1000`, strong cipher list
- `rc_joins` and `rc_invites` rate limits explicitly set in homeserver.yaml
- `pg_hba.conf` locked down — remote access restricted to the Synapse LXC only
- Federation open with key verification
- fail2ban on Synapse login endpoint (5 retries / 24h ban)
- Synapse metrics port 9000 restricted to `127.0.0.1` + `10.10.10.29`
- coturn cert auto-renewal — daily sync cron on compute-storage-01
- `/.well-known/matrix/client` and `/server` live on lotusguild.org
- `suppress_key_server_warning: true`
- Automated database + media backups
- Federation bad-actor blocking via Draupnir ban lists (17,000+ entries)
- Webhook HMAC-SHA256 validation on all auto-deploy endpoints
Monitoring
- Grafana dashboard — `dashboard.lotusguild.org/d/matrix-synapse-dashboard` (140+ panels, Draupnir section added)
- Prometheus scraping all Matrix services (Synapse, Hookshot, LiveKit, node_exporter, postgres, Draupnir)
- 15 active alert rules across matrix-folder and infra-folder (includes Draupnir Down)
- Uptime Kuma monitors: Synapse, LiveKit, PostgreSQL, Cinny, coturn, lk-jwt-service, Hookshot
- Draupnir: node_exporter (9100), process_exporter (9256), healthz probe via blackbox (8081)
Admin
- Synapse admin API dashboard (synapse-admin at http://10.10.10.29:8080)
- Draupnir moderation bot — LXC 110, v2.9.0, all rooms + space, 2 ban lists
- Cinny custom branding
Monitoring & Observability
Prometheus Scrape Jobs
| Job | Target | Metrics |
|---|---|---|
| synapse | 10.10.10.29:9000 | Full Synapse internals |
| matrix-admin | 10.10.10.29:9101 | DAU, MAU, room/user/media totals |
| livekit | 10.10.10.29:6789 | Rooms, participants, packets, latency |
| hookshot | 10.10.10.29:9004 | Connections, API calls/failures, Node.js runtime |
| matrix-node | 10.10.10.29:9100 | CPU, RAM, network, load average, disk |
| postgres | 10.10.10.44:9187 | pg_stat_database, connections, WAL, block I/O |
| postgres-node | 10.10.10.44:9100 | CPU, RAM, network, load average, disk |
| draupnir-node | 10.10.10.24:9100 | CPU, RAM, network, load average, disk |
| draupnir-process | 10.10.10.24:9256 | Process CPU/memory/threads/uptime (process_exporter) |
| draupnir-healthz | 10.10.10.24:8081/healthz → 127.0.0.1:9115 | probe_success (1=healthy, 0=disconnected) via blackbox exporter |
Disk I/O: All servers use Ceph-backed storage. Per-device disk I/O metrics are meaningless — use Network I/O panels to see actual storage traffic.
Alert Rules
Matrix folder:
| Alert | Fires when | Severity |
|---|---|---|
| Synapse Down | up{job="synapse"} < 1 for 2m | critical |
| PostgreSQL Down | pg_up < 1 for 2m | critical |
| LiveKit Down | up{job="livekit"} < 1 for 2m | critical |
| Hookshot Down | up{job="hookshot"} < 1 for 2m | critical |
| Draupnir Down | up{job="draupnir-node"} < 0.5 for 2m | critical |
| PG Connection Saturation | connections > 80% of max for 5m | warning |
| Federation Queue Backing Up | pending PDUs > 100 for 10m | warning |
| Synapse High Memory | RSS > 2000MB for 10m | warning |
| Synapse High Response Time | p99 latency (excl. /sync) > 10s for 5m | warning |
| Synapse Event Processing Lag | any processor > 300s behind for 15m | warning |
| Synapse DB Query Latency High | p99 query time > 1s for 5m | warning |
Infrastructure folder:
| Alert | Fires when | Severity |
|---|---|---|
| Service Exporter Down | any up == 0 for 3m | critical |
| Node High CPU Usage | CPU > 90% for 10m | warning |
| Node High Memory Usage | RAM > 90% for 10m | warning |
| Node Disk Space Low | available < 15% (excl. tmpfs/overlay) for 10m | warning |
`/sync` long-poll: The Matrix `/sync` endpoint is a long-poll (clients hold it open ≤30s). It is excluded from the High Response Time alert to prevent false positives.
The Synapse Event Processing Lag alert fires when `synapse_event_processing_lag > 300s` for 15 consecutive minutes (threshold raised from 120s/5m to reduce noise from normal federation backoff cycling).

Root cause: several federated servers (bark.lgbt, parodia.dev, etc.) have asymmetric connectivity — they can reach us but we cannot reach their federation ports. Each inbound transaction they send resets our backoff to 0, triggering a new outbound connection attempt that hangs for ~90s (TCP `User timeout`). This causes the lag metric to spike. Mitigations in place:

- `tcp_retries2=5` in `/etc/sysctl.d/99-matrix-tuning.conf` — TCP hangs now fail in ~15-30s
- `ip route add unreachable <ip>` in `/etc/network/interfaces` post-up — outbound connections to these servers fail in 0ms (ICMP unreachable)
- Alert threshold raised to 300s/15m — only fires for genuine outages, not normal 10-min backoff cycles
To find new offending servers:
grep "User timeout\|ConnectingCancell" /var/log/matrix-synapse/homeserver.log | grep -oP "\[([^\]]+)\]" | sort | uniq -c | sort -rn | head -20
Tech Stack
| Component | Technology | Version |
|---|---|---|
| Homeserver | Synapse | 1.149.0 |
| Database | PostgreSQL | 17.9 |
| TURN | coturn | latest |
| Video/voice calls | LiveKit SFU | 1.9.11 |
| LiveKit JWT | lk-jwt-service | latest |
| Moderation | Draupnir | 2.9.0 |
| SSO | Authelia (OIDC) + LLDAP | — |
| Webhook bridge | matrix-hookshot | 7.3.2 |
| Reverse proxy | Nginx Proxy Manager | — |
| Web client | Cinny (dev branch, nightly build) | dev |
| Auto-deploy | adnanh/webhook | 2.8.0 |
| Bot language | Python 3 | 3.x |
| Bot library | matrix-nio (E2EE) | latest |
| Bot dependencies | matrix-nio[e2ee], aiohttp, python-dotenv, mcrcon | — |