Jared Vititoe 0ba095ba03 docs: mark coturn hardening applied, update action items
- stale-nonce, user-quota, total-quota, cipher-list applied to /etc/turnserver.conf
- BBR noted as intentionally skipped (HA multi-host setup)
- Storj update and Synapse lag resolved

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 14:05:59 -04:00

Lotus Matrix Bot & Server Roadmap

Matrix bot and server infrastructure for the Lotus Guild homeserver (matrix.lotusguild.org).

Repo: https://code.lotusguild.org/LotusGuild/matrixBot

Status: Phase 6 — Monitoring, Observability & Hardening


Priority Order

  1. PostgreSQL migration
  2. TURN server
  3. Room structure + space setup
  4. Matrix bot (core + commands)
  5. LiveKit / Element Call
  6. SSO / OIDC (Authelia)
  7. Webhook integrations (hookshot)
  8. Voice stability & quality tuning
  9. Custom Cinny client (chat.lotusguild.org)
  10. Custom emoji packs (partially finished)
  11. Cinny custom branding (Lotus Guild theme)
  12. Draupnir moderation bot
  13. Push notifications (Sygnal)

Infrastructure

| Service | IP | LXC | RAM | vCPUs | Disk | Versions / Notes |
|---|---|---|---|---|---|---|
| Synapse | 10.10.10.29 | 151 | 8GB | 4 (Ryzen 9 7900) | 50GB (21% used) | Synapse 1.148.0, LiveKit 1.9.11, hookshot 7.3.2, coturn latest |
| PostgreSQL 17 | 10.10.10.44 | 109 | 6GB | 3 (Ryzen 9 7900) | 30GB (5% used) | PostgreSQL 17.9 |
| Cinny Web | 10.10.10.6 | 106 | 256MB runtime | 1 | 8GB (27% used) | Debian 13, nginx, Node 24, Cinny 4.10.5 |
| NPM | 10.10.10.27 | 139 | | | | Nginx Proxy Manager |
| Authelia | 10.10.10.36 | 167 | | | | SSO/OIDC provider |
| LLDAP | 10.10.10.39 | 147 | | | | LDAP user directory |
| Uptime Kuma | 10.10.10.25 | 101 | | | | Uptime monitoring (micro1 node) |
| Prometheus | 10.10.10.48 | 118 | | | | Prometheus — scrapes all Matrix services |
| Grafana | 10.10.10.49 | 107 | | | | Grafana 12.4.0 — dashboard.lotusguild.org |

Note: PostgreSQL container IP is 10.10.10.44, not .2 — update any stale references.

Key paths on Synapse/matrix LXC (151):

  • Synapse config: /etc/matrix-synapse/homeserver.yaml
  • Synapse conf.d: /etc/matrix-synapse/conf.d/ (metrics.yaml, report_stats.yaml, server_name.yaml)
  • coturn config: /etc/turnserver.conf
  • LiveKit config: /etc/livekit/config.yaml
  • LiveKit service: livekit-server.service
  • lk-jwt-service: lk-jwt-service.service (binds :8070, serves JWT tokens for MatrixRTC)
  • Hookshot: /opt/hookshot/, service: matrix-hookshot.service
  • Hookshot config: /opt/hookshot/config.yml
  • Hookshot registration: /etc/matrix-synapse/hookshot-registration.yaml
  • Landing page: /var/www/matrix-landing/index.html (on NPM LXC 139)
  • Bot: /opt/matrixbot/, service: matrixbot.service

Key paths on PostgreSQL LXC (109):

  • PostgreSQL config: /etc/postgresql/17/main/postgresql.conf
  • PostgreSQL conf.d: /etc/postgresql/17/main/conf.d/
  • HBA config: /etc/postgresql/17/main/pg_hba.conf
  • Data directory: /var/lib/postgresql/17/main

Running services on LXC 151:

| Service | Status | Memory | Notes |
|---|---|---|---|
| matrix-synapse | active, 2+ days | 231MB (peak 312MB) | No workers, single process |
| livekit-server | active, 2+ days | 22MB (peak 58MB) | v1.9.11, node IP = 162.192.14.139 |
| lk-jwt-service | active, 2+ days | 2.7MB | Binds :8070, LIVEKIT_URL=wss://matrix.lotusguild.org |
| matrix-hookshot | active, 2+ days | 76MB (peak 172MB) | Actively receiving webhooks |
| matrixbot | active, 2+ days | 26MB (peak 59MB) | Some E2EE key errors (see known issues) |
| coturn | active, 2+ days | 13MB | Periodic TCP reset errors (normal) |

Currently Open Port forwarding (router → 10.10.10.29):

  • TCP+UDP 3478 (TURN/STUN signaling)
  • TCP+UDP 5349 (TURNS/TLS)
  • TCP 7881 (LiveKit ICE TCP fallback)
  • TCP+UDP 49152-65535 (TURN relay range)
  • LiveKit WebRTC media: 50100-50500 (subset of above, only 401 ports — see improvements)

Internal port map (LXC 151):

| Port | Service | Bind |
|---|---|---|
| 8008 | Synapse HTTP | 0.0.0.0 + ::1 |
| 9000 | Synapse metrics (Prometheus) | 127.0.0.1 + 10.10.10.29 |
| 9001 | Hookshot widgets | 0.0.0.0 |
| 9002 | Hookshot bridge (appservice) | 127.0.0.1 |
| 9003 | Hookshot webhooks | 0.0.0.0 |
| 9004 | Hookshot metrics (Prometheus) | 0.0.0.0 |
| 9100 | node_exporter (Prometheus) | 0.0.0.0 |
| 9101 | matrix-admin exporter | 0.0.0.0 |
| 6789 | LiveKit metrics (Prometheus) | 0.0.0.0 |
| 7880 | LiveKit HTTP | 0.0.0.0 |
| 7881 | LiveKit RTC TCP | 0.0.0.0 |
| 8070 | lk-jwt-service | 0.0.0.0 |
| 8080 | synapse-admin (nginx) | 0.0.0.0 |
| 3478 | coturn STUN/TURN | 0.0.0.0 |
| 5349 | coturn TURNS/TLS | 0.0.0.0 |

Internal port map (LXC 109 — PostgreSQL):

| Port | Service | Bind |
|---|---|---|
| 5432 | PostgreSQL | 0.0.0.0 (hba-restricted to 10.10.10.29) |
| 9100 | node_exporter (Prometheus) | 0.0.0.0 |
| 9187 | postgres_exporter | 0.0.0.0 |

Rooms (all v12)

| Room | Room ID | Join Rule |
|---|---|---|
| The Lotus Guild (Space) | !-1ZBnAH-JiCOV8MGSKN77zDGTuI3pgSdy8Unu_DrDyc | public |
| General | !wfokQ1-pE896scu_AOcCBA2s3L4qFo-PTBAFTd0WMI0 | public |
| Commands | !ou56mVZQ8ZB7AhDYPmBV5_BR28WMZ4x5zwZkPCqjq1s | restricted (Space members) |
| Memes | !GK6v5cLEEnowIooQJv5jECfISUjADjt8aKhWv9VbG5U | restricted (Space members) |
| Management | !mEvR5fe3jMmzwd-FwNygD72OY_yu8H3UP_N-57oK7MI | invite |
| Cool Kids | !R7DT3QZHG9P8QQvX6zsZYxjkKgmUucxDz_n31qNrC94 | invite |
| Spam and Stuff | !GttT4QYd1wlGlkHU3qTmq_P3gbyYKKeSSN6R7TPcJHg | invite, no E2EE (hookshot) |

Power level roles (Cinny tags):

  • 100: Owner (jared)
  • 50: The Nerdy Council (enhuynh, lonely)
  • 48: Panel of Geeks
  • 35: Cool Kids
  • 0: Member

Webhook Integrations (matrix-hookshot 7.3.2)

Generic webhooks bridged into Spam and Stuff via matrix-hookshot. Each service gets its own virtual user (@hookshot_<service>) with a unique avatar. Webhook URL format: https://matrix.lotusguild.org/webhook/<uuid>

| Service | Webhook UUID | Notes |
|---|---|---|
| Grafana | df4a1302-2d62-4a01-b858-fb56f4d3781a | Unified alerting contact point |
| Proxmox | 9b3eafe5-7689-4011-addd-c466e524661d | Notification system (8.1+) |
| Sonarr | aeffc311-0686-42cb-9eeb-6757140c072e | All event types |
| Radarr | 34913454-c1ac-4cda-82ea-924d4a9e60eb | All event types |
| Readarr | e57ab4f3-56e6-4dc4-8b30-2f4fd4bbeb0b | All event types |
| Lidarr | 66ac6fdd-69f6-4f47-bb00-b7f6d84d7c1c | All event types |
| Uptime Kuma | 1a02e890-bb25-42f1-99fe-bba6a19f1811 | Status change notifications |
| Seerr | 555185af-90a1-42ff-aed5-c344e11955cf | Request/approval events |
| Owncast (Livestream) | 9993e911-c68b-4271-a178-c2d65ca88499 | STREAM_STARTED / STREAM_STOPPED (hookshot display name: "Livestream") |
| Bazarr | 470fb267-3436-4dd3-a70c-e6e8db1721be | Subtitle events (Apprise JSON notifier) |
| Tinker-Tickets | 6e306faf-8eea-4ba5-83ef-bf8f421f929e | Custom transformation code |

Hookshot notes:

  • Spam and Stuff is intentionally unencrypted — hookshot bridges cannot join E2EE rooms
  • Webhook tokens stored in Synapse PostgreSQL room_account_data for @hookshot
  • JS transformation functions use hookshot v2 API: set result = { version: "v2", plain, html, msgtype }
  • The result variable must be assigned without var/let/const (needs implicit global scope in the QuickJS IIFE sandbox)
  • NPM proxies https://matrix.lotusguild.org/webhook/* → http://10.10.10.29:9003
  • Virtual user avatars: set via appservice token (as_token in hookshot-registration.yaml) impersonating each user
  • Hookshot bridge port (9002) binds 127.0.0.1 only; webhook ingest (9003) binds 0.0.0.0 (NPM-proxied)

Known Issues

coturn TLS Reset Errors

Periodic TLS/TCP socket error: Connection reset by peer in coturn logs from external IPs. This is normal — clients probe TURN and drop the connection once they establish a direct P2P path. Not an issue.

BBR Congestion Control — Host-Level Only

net.ipv4.tcp_congestion_control = bbr and net.core.default_qdisc = fq cannot be set from inside an unprivileged LXC container — they affect the host kernel's network namespace. These must be applied on the Proxmox host itself to take effect for all containers. All other sysctl tuning (TCP/UDP buffers, fin_timeout) applied successfully inside LXC 151.


Optimizations & Improvements

1. LiveKit / Voice Quality Applied

Noise suppression and volume normalization are client-side only (browser/Element X handles this via WebRTC's built-in audio processing). The server cannot enforce these. Applied server-side improvements:

  • ICE port range expanded: 50100-50500 (401 ports) → 50000-51000 (1001 ports) = ~500 concurrent WebRTC streams
  • TURN TTL reduced: 86400s (24h) → 3600s (1h) — stale allocations expire faster
  • Room defaults added: empty_timeout: 300, departure_timeout: 20, max_participants: 50
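The port math above can be sanity-checked with shell arithmetic. The ~500-stream figure assumes roughly two UDP ports per stream (RTP + RTCP), which is an assumption of this note, not a figure from the LiveKit docs:

```shell
# Sanity-check the ICE port range sizes quoted above (ranges are inclusive).
old_ports=$((50500 - 50100 + 1))   # original range
new_ports=$((51000 - 50000 + 1))   # expanded range
streams=$((new_ports / 2))         # assuming ~2 ports per stream
echo "$old_ports $new_ports $streams"   # prints: 401 1001 500
```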

Client-side audio advice for users:

  • Element Web/Desktop: Settings → Voice & Video → enable "Noise Suppression" and "Echo Cancellation"
  • Element X (mobile): automatic via WebRTC stack
  • Cinny (chat.lotusguild.org): voice via embedded Element Call widget — browser WebRTC noise suppression is active automatically

2. PostgreSQL Tuning (LXC 109) Applied

/etc/postgresql/17/main/conf.d/synapse_tuning.conf written and active. pg_stat_statements extension created in the synapse database. Config applied:

# Memory — shared_buffers = 25% RAM, effective_cache_size = 75% RAM
shared_buffers = 1500MB
effective_cache_size = 4500MB
work_mem = 32MB                    # Per sort/hash operation (safe at low connection count)
maintenance_work_mem = 256MB       # VACUUM, CREATE INDEX
wal_buffers = 64MB                 # WAL write buffer

# Checkpointing
checkpoint_completion_target = 0.9 # Spread checkpoint I/O (the default since PG 14, kept explicit)
max_wal_size = 2GB

# Storage (Ceph RBD block device = SSD-equivalent random I/O)
random_page_cost = 1.1             # Default 4.0 assumes spinning disk
effective_io_concurrency = 200     # For SSDs/Ceph

# Parallel queries (3 vCPUs)
max_worker_processes = 3
max_parallel_workers_per_gather = 1
max_parallel_workers = 2

# Monitoring
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.track = all

Restarted postgresql@17-main. Expected impact: Synapse query latency drops as the DB grows — the entire current 120MB database fits in shared_buffers.
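The memory figures follow the 25%/75% rule from the comment in the config above. A quick check against the container's 6GB (taken here as 6144 binary MB) shows the configured values are those targets rounded down slightly:

```shell
# Verify the tuning numbers against LXC 109's 6GB (6144 MB) of RAM.
ram_mb=6144
sb_target=$((ram_mb * 25 / 100))    # shared_buffers target
ecs_target=$((ram_mb * 75 / 100))   # effective_cache_size target
echo "shared_buffers:       ${sb_target}MB target (configured: 1500MB)"
echo "effective_cache_size: ${ecs_target}MB target (configured: 4500MB)"
```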

3. PostgreSQL Security — pg_hba.conf (LXC 109) Applied

Removed the two open rules (0.0.0.0/24 md5 and 0.0.0.0/0 md5). Remote access is now restricted to Synapse LXC only:

host    synapse         synapse_user    10.10.10.29/32          scram-sha-256

All other remote connections are rejected. Local Unix socket and loopback remain functional for admin access.

4. Synapse Cache Tuning (LXC 151) Applied

event_cache_size bumped 15K → 30K. _get_state_group_for_events: 3.0 added to per_cache_factors (heavily hit during E2EE key sharing). Synapse restarted cleanly.

event_cache_size: 30K
caches:
  global_factor: 2.0
  per_cache_factors:
    get_users_in_room: 3.0
    get_current_state_ids: 3.0
    _get_state_group_for_events: 3.0
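Assuming Synapse multiplies the configured base size by global_factor (with per_cache_factors overriding the global factor for the named caches), the effective event cache works out to:

```shell
# Effective event cache size under the config above (assumes factors
# multiply the configured base; check the Synapse cache docs for exact semantics).
base=30000        # event_cache_size: 30K
global_factor=2   # global_factor: 2.0
effective=$((base * global_factor))
echo "effective event cache: $effective entries"
```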

5. Network / sysctl Tuning (LXC 151) Applied

/etc/sysctl.d/99-matrix-tuning.conf written and active. TCP/UDP buffers aligned and fin_timeout reduced.

# Align TCP buffers with core maximums
net.ipv4.tcp_rmem = 4096 131072 26214400
net.ipv4.tcp_wmem = 4096 65536 26214400

# UDP buffer sizing for WebRTC media streams
net.core.rmem_max = 26214400
net.core.wmem_max = 26214400
net.ipv4.udp_rmem_min = 65536
net.ipv4.udp_wmem_min = 65536

# Reduce latency for short-lived TURN connections
net.ipv4.tcp_fin_timeout = 10
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_intvl = 30

BBR note: tcp_congestion_control = bbr and default_qdisc = fq require host-level sysctl — cannot be set inside an unprivileged LXC. Apply on the Proxmox host to benefit all containers:

echo "net.ipv4.tcp_congestion_control = bbr" >> /etc/sysctl.d/99-bbr.conf
echo "net.core.default_qdisc = fq" >> /etc/sysctl.d/99-bbr.conf
sysctl --system

6. Synapse Federation Hardening

The server is effectively a private server for friends. Restricting federation prevents abuse and reduces load. Add to homeserver.yaml:

# Allow federation only with specific trusted servers (or disable entirely)
federation_domain_whitelist:
  - matrix.org        # Keep for bridging if needed
  - matrix.lotusguild.org

# OR to go fully closed (recommended for friends-only). Synapse has no
# federation_enabled switch; an empty whitelist blocks all federation:
# federation_domain_whitelist: []

7. Bot E2EE Key Fix (LXC 151) Applied

nio_store/ cleared and bot restarted cleanly. Megolm session errors resolved.


Custom Cinny Client (chat.lotusguild.org)

Cinny v4 is the preferred client — clean UI, Cinny-style rendering already used by the bot's Wordle tiles. We build from source to get voice support and full branding control.

Why Cinny over Element Web

  • Much cleaner aesthetics, already the de-facto client for guild members
  • Element Web voice suppression (Krisp) is only on app.element.io — a custom build loses it
  • Cinny add-joined-call-controls branch uses @element-hq/element-call-embedded which talks to the existing MatrixRTC → lk-jwt-service → LiveKit stack with zero new infrastructure
  • Static build (nginx serving ~5MB of files) — nearly zero runtime resource cost

Voice support status (as of March 2026)

The official add-joined-call-controls branch (maintained by ajbura, last commit March 8 2026) embeds Element Call as a widget via @element-hq/element-call-embedded: 0.16.3. This uses the same MatrixRTC protocol that lk-jwt-service already handles. Two direct LiveKit integration PRs (#2703, #2704) were proposed but closed without merge — so the embedded Element Call approach is the official path.

Since lk-jwt-service is already running on LXC 151 and configured for wss://matrix.lotusguild.org, voice calls will work out of the box once the Cinny build is deployed.

LXC Setup

Create the LXC (run on the host):

# ProxmoxVE Debian 13 community script
bash -c "$(curl -fsSL https://raw.githubusercontent.com/community-scripts/ProxmoxVE/main/ct/debian.sh)"

Recommended settings: 2GB RAM, 1-2 vCPUs, 20GB disk, Debian 13, static IP on VLAN 10 (e.g. 10.10.10.XX).

Inside the new LXC:

# Install nginx + git + nvm dependencies
apt update && apt install -y nginx git curl

# Install Node.js 24 via nvm
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.1/install.sh | bash
source ~/.bashrc
nvm install 24
nvm use 24

# Clone Cinny and switch to voice-support branch
git clone https://github.com/cinnyapp/cinny.git /opt/cinny
cd /opt/cinny
git checkout add-joined-call-controls

# Install dependencies and build
npm ci
NODE_OPTIONS=--max_old_space_size=4096 npm run build
# Output: /opt/cinny/dist/

# Deploy to nginx root
cp -r /opt/cinny/dist/* /var/www/html/

Configure Cinny — edit /var/www/html/config.json:

{
  "defaultHomeserver": 0,
  "homeserverList": ["matrix.lotusguild.org"],
  "allowCustomHomeservers": false,
  "featuredCommunities": {
    "openAsDefault": false,
    "spaces": [],
    "rooms": [],
    "servers": []
  },
  "hashRouter": {
    "enabled": false,
    "basename": "/"
  }
}

Nginx config at /etc/nginx/sites-available/cinny (matches the official docker-nginx.conf):

server {
    listen 80;
    listen [::]:80;
    server_name chat.lotusguild.org;

    root /var/www/html;
    index index.html;

    location / {
        rewrite ^/config.json$           /config.json break;
        rewrite ^/manifest.json$         /manifest.json break;
        rewrite ^/sw.js$                 /sw.js break;
        rewrite ^/pdf.worker.min.js$     /pdf.worker.min.js break;
        rewrite ^/public/(.*)$           /public/$1 break;
        rewrite ^/assets/(.*)$           /assets/$1 break;
        rewrite ^(.+)$                   /index.html break;
    }
}
ln -s /etc/nginx/sites-available/cinny /etc/nginx/sites-enabled/
nginx -t && systemctl reload nginx

Then in NPM: add a proxy host for chat.lotusguild.org → http://10.10.10.XX:80 with SSL.

Rebuilding after updates

cd /opt/cinny
git pull
npm ci
NODE_OPTIONS=--max_old_space_size=4096 npm run build
cp -r dist/* /var/www/html/
# Preserve your config.json — it gets overwritten by the copy above, so:
# Option: keep config.json outside dist and symlink/copy it in after each build
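The rebuild steps above can be wrapped into the cinny-update helper referenced later in this doc. A minimal sketch (paths assumed from this setup; the DRY_RUN guard and the config-restore step are additions of this sketch, and it defaults to dry-run so it is safe to try):

```shell
#!/usr/bin/env bash
# Sketch of /usr/local/bin/cinny-update. DRY_RUN=1 (the default) only
# prints each step; set DRY_RUN=0 to actually execute them.
set -euo pipefail

SRC=/opt/cinny
WEBROOT=/var/www/html
CONFIG_BACKUP=/opt/cinny-config.json   # survives rebuilds (see Key paths)

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

run git -C "$SRC" pull
run npm --prefix "$SRC" ci
run env NODE_OPTIONS=--max_old_space_size=4096 npm --prefix "$SRC" run build
run cp -r "$SRC/dist/." "$WEBROOT/"
run cp "$CONFIG_BACKUP" "$WEBROOT/config.json"   # restore guild config over the default
```

Review the printed plan, then run it for real with DRY_RUN=0.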

Key paths (Cinny LXC 106 — 10.10.10.6)

  • Source: /opt/cinny/ (branch: add-joined-call-controls)
  • Built files: /var/www/html/
  • Cinny config: /var/www/html/config.json
  • Config backup (survives rebuilds): /opt/cinny-config.json
  • Nginx site config: /etc/nginx/sites-available/cinny
  • Rebuild script: /usr/local/bin/cinny-update

Server Checklist

Quality of Life

  • Migrate from SQLite to PostgreSQL
  • TURN/STUN server (coturn) for reliable voice/video
  • URL previews
  • Upload size limit 200MB
  • Full-text message search (PostgreSQL backend)
  • Media retention policy (remote: 1yr, local: 3yr)
  • Sliding sync (native Synapse)
  • LiveKit for Element Call video rooms
  • Default room version v12, all rooms upgraded
  • Landing page with client recommendations (Cinny, Commet, Element, Element X mobile)
  • Synapse metrics endpoint (port 9000, Prometheus-compatible)
  • Push notifications gateway (Sygnal) for mobile clients
  • LiveKit port range expanded to 50000-51000 for voice call capacity
  • Custom Cinny client LXC 106 (10.10.10.6) — Debian 13, Cinny 4.10.5 built from add-joined-call-controls, nginx serving, HA enabled
  • NPM proxy entry for chat.lotusguild.org → 10.10.10.6:80, SSL via Cloudflare DNS challenge, HTTPS forced, HTTP/2 + HSTS enabled
  • Cinny weekly auto-update cron (/etc/cron.d/cinny-update, Sundays 3am, logs to /var/log/cinny-update.log)
  • Cinny custom branding — Lotus Guild theme (colors, title, favicon, PWA name)

Performance Tuning

  • PostgreSQL shared_buffers → 1500MB, effective_cache_size, work_mem, checkpoint tuning applied
  • PostgreSQL pg_stat_statements extension installed in synapse database
  • PostgreSQL autovacuum tuned per-table (state_groups_state, events, receipts_linearized, receipts_graph, device_lists_stream, presence_stream), autovacuum_max_workers → 5
  • Synapse event_cache_size → 30K, _get_state_group_for_events cache factor added
  • sysctl TCP/UDP buffer alignment applied to LXC 151 (/etc/sysctl.d/99-matrix-tuning.conf)
  • LiveKit room empty_timeout: 300, departure_timeout: 20, max_participants: 50
  • LiveKit ICE port range expanded to 50000-51000
  • LiveKit TURN TTL reduced from 24h to 1h
  • LiveKit VP9/AV1 codecs enabled (video_codecs: [VP8, H264, VP9, AV1])
  • BBR congestion control — must be applied on Proxmox host, not inside LXC (see Known Issues)

Auth & SSO

  • Token-based registration
  • SSO/OIDC via Authelia
  • allow_existing_users: true for linking accounts to SSO
  • Password auth alongside SSO

Webhooks & Integrations

  • matrix-hookshot 7.3.2 installed and running
  • Generic webhook bridge for 11 active services (Grafana, Proxmox, Sonarr, Radarr, Readarr, Lidarr, Uptime Kuma, Seerr, Owncast/Livestream, Bazarr, Tinker-Tickets)
  • Per-service JS transformation functions — all rewritten to handle full event payloads (all event types, health alerts, app updates, release groups, download clients)
  • Per-service virtual user avatars
  • NPM reverse proxy for /webhook path
  • Tinker Tickets custom transformation code

Room Structure

  • The Lotus Guild space
  • All core rooms with correct power levels and join rules
  • Spam and Stuff room for service notifications (hookshot)
  • Custom room avatars

Hardening

  • Rate limiting
  • E2EE on all rooms (except Spam and Stuff — intentional for hookshot)
  • coturn internal peer deny rules (blocks relay to RFC1918 except allowed subnet)
  • pg_hba.conf locked down — remote access restricted to Synapse LXC (10.10.10.29) only
  • Federation enabled with key verification (open for invite-only growth to friends/family/coworkers)
  • fail2ban on Synapse login endpoint (5 retries / 24h ban, LXC 151)
  • Synapse metrics port 9000 restricted to 127.0.0.1 + 10.10.10.29 (was 0.0.0.0)
  • coturn cert auto-renewal — daily sync cron on compute-storage-01 copies NPM cert → coturn
  • /.well-known/matrix/client and /server live on lotusguild.org (NPM advanced config)
  • suppress_key_server_warning: true in homeserver.yaml
  • Federation allow/deny lists for known bad actors
  • Regular Synapse updates
  • Automated database + media backups

Monitoring

  • Synapse metrics endpoint (port 9000, Prometheus-compatible)
  • Uptime Kuma monitors added: Synapse HTTP, LiveKit TCP, PostgreSQL TCP, Cinny Web, coturn TCP 3478, lk-jwt-service, Hookshot
  • Uptime Kuma: coturn UDP STUN monitoring (requires push/heartbeat — no native UDP type in Kuma)
  • Grafana dashboard — custom Synapse dashboard at dashboard.lotusguild.org/d/matrix-synapse-dashboard/matrix-synapse (140+ panels, see Monitoring section below)
  • Prometheus scraping all Matrix services: Synapse, Hookshot, LiveKit, matrix-node, postgres-node, matrix-admin, postgres, postgres-exporter
  • node_exporter installed on LXC 151 (Matrix) and LXC 109 (PostgreSQL)
  • LiveKit Prometheus metrics enabled (prometheus_port: 6789)
  • Hookshot metrics enabled (metrics: { enabled: true }) on dedicated port 9004
  • Grafana alert rules — 9 Matrix/infra alerts active (see Alert Rules section below)
  • Duplicate Grafana "Infrastructure" folder merged and deleted

Admin

  • Synapse admin API dashboard (synapse-admin at http://10.10.10.29:8080)
  • Power levels per room
  • Draupnir moderation bot (new LXC or alongside existing bot)
  • Cinny custom branding (Lotus Guild theme — colors, title, favicon, PWA name)
  • Storj node update: storj_uptodate=0 on LXC 138 (10.10.10.133), risk of disqualification

Improvement Audit (March 2026)

Comprehensive audit of the current infrastructure against official documentation and security best practices. Applied March 9 2026.

Priority Summary

| Issue | Severity | Status |
|---|---|---|
| coturn TLS cert expires May 12 — no auto-renewal | CRITICAL | Fixed — daily sync cron on compute-storage-01 copies NPM-renewed cert to coturn |
| Synapse metrics port 9000 bound to 0.0.0.0 | HIGH | Fixed — now binds 127.0.0.1 + 10.10.10.29 (Prometheus still works, internet blocked) |
| /.well-known/matrix/client returns 404 | MEDIUM | Fixed — NPM lotusguild.org proxy host updated, live at https://lotusguild.org/.well-known/matrix/client |
| suppress_key_server_warning not set | MEDIUM | Fixed — added to homeserver.yaml |
| No fail2ban on /_matrix/client/.*/login | MEDIUM | Fixed — fail2ban installed, matrix-synapse jail active (5 retries / 24h ban) |
| No media purge cron (retention policy set but never triggers) | MEDIUM | N/A — media_retention block already in homeserver.yaml; Synapse runs the purge internally on schedule |
| PostgreSQL autovacuum not tuned per-table | LOW | Fixed — high-churn tables tuned, autovacuum_max_workers → 5 |
| Hookshot metrics scrape unconfirmed | LOW | Fixed — metrics: { enabled: true } added to config, metrics split to dedicated port 9004, Prometheus scraping confirmed |
| LiveKit VP9/AV1 codec support | LOW | Applied — video_codecs: [VP8, H264, VP9, AV1] added to livekit config |
| Federation allow/deny list not configured | LOW | Pending — Mjolnir/Draupnir on roadmap |
| Sygnal push notifications not deployed | INFO | Deferred |

1. coturn Cert Auto-Renewal

The coturn cert is managed by NPM (cert ID 91, stored at /etc/letsencrypt/live/npm-91/ on LXC 139). NPM renews it automatically. A sync script on compute-storage-01 detects when NPM renews and copies it to coturn.

Deployed: /usr/local/bin/coturn-cert-sync.sh on compute-storage-01, cron /etc/cron.d/coturn-cert-sync (runs 03:30 daily).

Script compares cert expiry dates between LXC 139 and LXC 151. If they differ (NPM renewed), it copies fullchain.pem + privkey.pem and restarts coturn.
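The core of that comparison can be sketched with openssl. The helper below is illustrative only; the real script additionally pulls the files from LXC 139, copies them to LXC 151, and restarts coturn:

```shell
# Extract a certificate's expiry date. The sync script compares this value
# between the NPM copy and the coturn copy and syncs only when they differ.
cert_expiry() {
  openssl x509 -enddate -noout -in "$1" | cut -d= -f2
}

# Example decision logic (paths illustrative, not the real mount points):
# if [ "$(cert_expiry npm/fullchain.pem)" != "$(cert_expiry coturn/fullchain.pem)" ]; then
#   copy fullchain.pem + privkey.pem to coturn, then: systemctl restart coturn
# fi
```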

Additional coturn hardening — Applied March 2026:

# /etc/turnserver.conf
stale-nonce=600              # Nonce expires 600s (prevents replay attacks)
user-quota=100               # Max concurrent relay allocations per user
total-quota=1000             # Total relay allocations server-wide
cipher-list=ECDHE-RSA-AES256-GCM-SHA384:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-CHACHA20-POLY1305

2. Synapse Configuration Gaps

a) Metrics port exposed to 0.0.0.0 (HIGH)

Port 9000 currently binds 0.0.0.0 — exposes internal state, user counts, DB query times externally. Fix in homeserver.yaml:

metrics_flags:
  some_legacy_unrestricted_resources: false
listeners:
  - port: 9000
    bind_addresses: ['127.0.0.1', '10.10.10.29']   # loopback + VLAN scrape, NOT 0.0.0.0
    type: metrics
    resources: []

Prometheus at 10.10.10.48 (which feeds Grafana) scrapes port 9000 from within the VLAN, so this is safe to lock down.
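After restarting Synapse, the binding can be confirmed from LXC 151. The snippet below reads /proc/net/tcp directly (ss -tln | grep ':9000 ' works just as well where iproute2 is installed):

```shell
# Verify nothing still listens on 0.0.0.0:9000.
port_hex=$(printf '%04X' 9000)   # 9000 -> 2328
# local_address entries ending in :2328; a line starting 00000000: means 0.0.0.0.
awk -v p=":$port_hex" 'NR > 1 && index($2, p) { print $2 }' /proc/net/tcp 2>/dev/null || true
```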

b) suppress_key_server_warning (MEDIUM)

Fills Synapse logs with noise on every restart. One line in homeserver.yaml:

suppress_key_server_warning: true

c) Database connection pooling (LOW — track for growth)

Current defaults (cp_min: 5, cp_max: 10) are fine for single-process. When adding workers, increase cp_max to 20-30 per worker group. Add explicitly to homeserver.yaml to make it visible:

database:
  name: psycopg2
  args:
    cp_min: 5
    cp_max: 10

3. Matrix Well-Known 404

/.well-known/matrix/client returns 404. This breaks client autodiscovery — users who type lotusguild.org instead of matrix.lotusguild.org get an error. Fix in NPM with a custom location block on the lotusguild.org proxy host:

location /.well-known/matrix/client {
    add_header Content-Type application/json;
    add_header Access-Control-Allow-Origin *;
    return 200 '{"m.homeserver":{"base_url":"https://matrix.lotusguild.org"}}';
}
location /.well-known/matrix/server {
    add_header Content-Type application/json;
    add_header Access-Control-Allow-Origin *;
    return 200 '{"m.server":"matrix.lotusguild.org:443"}';
}
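Before pasting into NPM, the two payloads can be validated offline (python3 -m json.tool is used here purely as a JSON checker):

```shell
# Validate the well-known JSON bodies before deploying them via NPM.
client='{"m.homeserver":{"base_url":"https://matrix.lotusguild.org"}}'
server='{"m.server":"matrix.lotusguild.org:443"}'
echo "$client" | python3 -m json.tool >/dev/null && echo "client json OK"
echo "$server" | python3 -m json.tool >/dev/null && echo "server json OK"
```

Once live, curl https://lotusguild.org/.well-known/matrix/client should return the client payload.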

4. fail2ban for Synapse Login

No brute-force protection on /_matrix/client/*/login. Easy win.

/etc/fail2ban/jail.d/matrix-synapse.conf:

[matrix-synapse]
enabled  = true
port     = http,https
filter   = matrix-synapse
logpath  = /var/log/matrix-synapse/homeserver.log
backend  = systemd
journalmatch = _SYSTEMD_UNIT=matrix-synapse.service + PRIORITY=3
findtime = 600
maxretry = 5
bantime  = 86400

/etc/fail2ban/filter.d/matrix-synapse.conf:

[Definition]
failregex = ^.*Failed (password|SAML) login attempt for user .* from <HOST>.*$
            ^.*"POST /.*login.*" 401.*$
ignoreregex = ^.*"GET /sync.*".*$
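The intended login pattern can be smoke-tested offline with grep -E against a made-up log line (the <HOST> placeholder is replaced with an IP pattern here; fail2ban-regex against the real homeserver.log remains the authoritative test):

```shell
# Offline smoke test of the login failregex (sample line is invented).
sample='synapse.handlers.auth - Failed password login attempt for user @alice:lotusguild.org from 203.0.113.7'
echo "$sample" | grep -Eq 'Failed (password|SAML) login attempt for user .* from [0-9.]+' \
  && echo "regex matches"
```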

5. Synapse Media Purge Cron

Retention policy is configured (remote 1yr, local 3yr), and per the audit summary above the media_retention block in homeserver.yaml already triggers purges internally on a schedule. The script below is an optional, more aggressive manual purge via the Synapse admin API (trimming remote media older than 90 days).

/usr/local/bin/purge-synapse-media.sh (create on LXC 151):

#!/bin/bash
ADMIN_TOKEN="syt_your_admin_token"
# Purge remote media (cached from other homeservers) older than 90 days
CUTOFF_TS=$(($(date +%s000) - 7776000000))
curl -X POST \
  "http://localhost:8008/_synapse/admin/v1/purge_media_cache?before_ts=$CUTOFF_TS" \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -s -o /dev/null
echo "$(date): Synapse remote media purge completed" >> /var/log/synapse-purge.log

Install it:

chmod +x /usr/local/bin/purge-synapse-media.sh
echo "0 4 * * * root /usr/local/bin/purge-synapse-media.sh" > /etc/cron.d/synapse-purge
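The cutoff arithmetic in the script checks out: date +%s000 appends literal zeros to turn seconds into milliseconds, and 7776000000 ms is exactly 90 days:

```shell
# 90 days expressed in milliseconds, matching the script's constant.
ninety_days_ms=$((90 * 24 * 60 * 60 * 1000))
echo "$ninety_days_ms"   # prints: 7776000000
```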

6. PostgreSQL Autovacuum Per-Table Tuning

The high-churn Synapse tables (state_groups_state, events, receipts) are not tuned for aggressive autovacuum. As the DB grows, bloat accumulates and queries slow down. Run on LXC 109 (PostgreSQL):

-- state_groups_state: biggest bloat source
-- (autovacuum_naptime is instance-wide and cannot be set per-table,
--  so it is omitted from these storage parameters)
ALTER TABLE state_groups_state SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_analyze_scale_factor = 0.005,
    autovacuum_vacuum_cost_delay = 5
);

-- events: second priority
ALTER TABLE events SET (
    autovacuum_vacuum_scale_factor = 0.02,
    autovacuum_analyze_scale_factor = 0.01,
    autovacuum_vacuum_cost_delay = 5
);

-- receipts tables and device/presence streams
ALTER TABLE receipts_linearized SET (autovacuum_vacuum_scale_factor = 0.01, autovacuum_vacuum_cost_delay = 5);
ALTER TABLE receipts_graph SET (autovacuum_vacuum_scale_factor = 0.01, autovacuum_vacuum_cost_delay = 5);
ALTER TABLE device_lists_stream SET (autovacuum_vacuum_scale_factor = 0.02);
ALTER TABLE presence_stream SET (autovacuum_vacuum_scale_factor = 0.02);

Also bump autovacuum_max_workers from 3 → 5. This is a postmaster-level setting, so a reload is not enough; restart PostgreSQL afterwards:

ALTER SYSTEM SET autovacuum_max_workers = 5;
-- then: systemctl restart postgresql@17-main  (pg_reload_conf() will not pick this up)

Monitor vacuum health:

SELECT relname, last_autovacuum, n_dead_tup, n_live_tup
FROM pg_stat_user_tables
WHERE relname IN ('events', 'state_groups_state', 'receipts_linearized')
ORDER BY n_dead_tup DESC;

7. Hookshot Metrics + Grafana

Hookshot metrics are exposed at 127.0.0.1:9001/metrics but it's unconfirmed whether Prometheus at 10.10.10.49 is scraping them. Verify:

# On LXC 151
curl http://127.0.0.1:9001/metrics | head -20

If Prometheus is scraping, add the hookshot dashboard from the repo: contrib/hookshot-dashboard.json → import into Grafana.

Grafana Synapse dashboard — Prometheus is already scraping Synapse at port 9000. Import the official dashboard:

  • Grafana → Dashboards → Import → ID 18618 (Synapse Monitoring)
  • Set Prometheus datasource → done
  • Shows room count, message rates, federation lag, cache hit rates, DB query times in real time

8. Federation Security

Currently: open federation with key verification (correct for invite-only friends server). Recommended additions:

Server-level allow/deny in homeserver.yaml (optional, for closing federation entirely):

# Fully closed (recommended long-term for private guild). Synapse has no
# federation_enabled switch; an empty whitelist blocks all federation:
# federation_domain_whitelist: []

# OR: whitelist-only federation
federation_domain_whitelist:
  - matrix.lotusguild.org
  - matrix.org   # Keep if bridging needed

Per-room ACLs for reactive blocking of specific bad servers:

{
  "type": "m.room.server_acl",
  "content": {
    "allow": ["*"],
    "deny": ["spam.example.com"]
  }
}

Mjolnir/Draupnir (already on roadmap) handles this automatically with ban list subscriptions (t2bot spam lists etc).


9. Sygnal Push Notifications

Sygnal is the official Matrix push gateway for mobile (Element X on iOS/Android). Without it, notifications don't arrive when the app is backgrounded.

Requirements:

  • Apple Developer account (APNS cert) for iOS
  • Firebase project (FCM API key) for Android
  • New LXC or run alongside existing services

Basic config (/etc/sygnal/sygnal.yaml):

server:
  port: 8765
database:
  type: postgresql
  user: sygnal
  password: <password>
  database: sygnal
apps:
  com.element.android:
    type: gcm
    api_key: <FCM_API_KEY>
  im.riot.x.ios:
    type: apns
    platform: production
    certfile: /etc/sygnal/apns/element-x-cert.pem
    topic: im.riot.x.ios

Synapse integration: no homeserver.yaml change is needed; Synapse has no server-side push gateway list. Each mobile client registers its own pusher whose data.url points at the gateway's /_matrix/push/v1/notify endpoint, so Sygnal's port 8765 just needs to be reachable from clients (e.g. proxied through NPM).

10. LiveKit VP9/AV1 + Dynacast (Quality Improvement)

Currently H264 only. Enabling VP9/AV1 unlocks Dynacast (pauses video layers no one is watching) which significantly reduces bandwidth/CPU for low-viewer rooms.

/etc/livekit/config.yaml additions:

video:
  codecs:
    - mime: video/H264
      fmtp: "level-asymmetry-allowed=1;packetization-mode=1;profile-level-id=42e01e"
    - mime: video/VP9
      fmtp: "profile=0"
    - mime: video/AV1
      fmtp: "profile=0"
  dynacast: true

Note: Dynacast only works with VP9 or AV1 (SVC-capable codecs). H264 subscribers continue to work normally alongside VP9/AV1 subscribers.


11. Synapse Workers (Future Scaling Reference)

Current single-process handles ~100-300 concurrent users before the Python GIL becomes the bottleneck. Not needed now, but documented for when usage grows.

Stage 1 trigger: Synapse CPU >80% consistently, or >200 concurrent users.

First workers to add:

# /etc/matrix-synapse/workers/client-reader-1.yaml
worker_app: synapse.app.generic_worker   # dedicated client_reader app is deprecated in current Synapse
worker_name: client-reader-1
worker_listeners:
  - type: http
    port: 8011
    resources: [{names: [client]}]

Add federation_sender next (off-loads outgoing federation from main process). Then event_creator for write-heavy loads. Redis required at Stage 2 (500+ users) for inter-worker coordination.



Monitoring & Observability (March 2026)

Prometheus Scrape Jobs

All Matrix-related services scraped by Prometheus at 10.10.10.48 (LXC 118):

| Job | Target | Metrics |
|---|---|---|
| synapse | 10.10.10.29:9000 | Full Synapse internals (events, federation, caches, DB, HTTP) |
| matrix-admin | 10.10.10.29:9101 | DAU, MAU, room/user/media totals |
| livekit | 10.10.10.29:6789 | Rooms, participants, packets, forward latency, quality |
| hookshot | 10.10.10.29:9004 | Connections by service, API calls/failures, Node.js runtime |
| matrix-node | 10.10.10.29:9100 | CPU, RAM, network, disk space, load avg (Matrix LXC host) |
| postgres | 10.10.10.44:9187 | pg_stat_database, connections, WAL, block I/O |
| postgres-node | 10.10.10.44:9100 | CPU, RAM, network, disk space, load avg (PostgreSQL LXC host) |
| postgres-exporter-2 | 10.10.10.160:9711 | Secondary postgres exporter |

Disk I/O note: All servers use Ceph-backed storage. Per-device disk I/O metrics are meaningless; use Network I/O panels to see actual storage traffic.

Grafana Dashboard

URL: https://dashboard.lotusguild.org/d/matrix-synapse-dashboard/matrix-synapse

140+ panels across 18 sections:

| Section | Key panels |
|---------|------------|
| Synapse Overview | Up status, users, rooms, DAU/MAU, media, federation peers |
| Synapse Process Health | CPU, memory, FDs, thread pool, GC, Twisted reactor |
| HTTP API Requests | Rate, response codes, p99/p50 latency, in-flight, DB txn time |
| Federation | Outgoing/incoming PDUs, queue depth, staging, known servers |
| Events & Rooms | Event persistence, notifier, sync responses |
| Presence & Push | Presence updates, pushers, state transitions |
| Rate Limiting | Rejections, sleeps, queue wait time p99 |
| Users & Registration | Login rate, registration rate, growth over time |
| Synapse Database Performance | Txn rate/duration, schedule latency, query latency |
| Synapse Caches | Hit rate (top 5), sizes, evictions, response cache |
| Event Processing & Lag | Lag by processor, stream positions, event fetch ongoing |
| State Resolution | Forward extremities, state resolution CPU, state groups |
| App Services (Hookshot) | Events sent, transactions sent vs failed |
| HTTP Push | Push processed vs failed, badge updates |
| Sliding Sync & Slow Endpoints | Sliding sync p99, slowest endpoints, rate limit wait |
| Background Processes | In-flight by name, start rate, CPU, scheduler tasks |
| PostgreSQL Database | Size, connections, transactions, block I/O, WAL, locks |
| LiveKit SFU | Rooms, participants, network, packets out/dropped, forward latency |
| Hookshot | Matrix API calls/failures, active connections, Node.js event loop lag |
| Matrix LXC Host | CPU, RAM, network (incl. Ceph), load average, disk space |
| PostgreSQL LXC Host | CPU, RAM, network (incl. Ceph), load average, disk space |

Alert Rules

All alerts are Grafana-native (Alerting → Alert Rules). Current active rules:

Matrix folder (matrix-folder):

| Alert | Fires when | Severity |
|-------|------------|----------|
| Synapse Down | `up{job="synapse"} < 1` for 2m | critical |
| PostgreSQL Down | `pg_up < 1` for 2m | critical |
| LiveKit Down | `up{job="livekit"} < 1` for 2m | critical |
| Hookshot Down | `up{job="hookshot"} < 1` for 2m | critical |
| PG Connection Saturation | connections > 80% of max for 5m | warning |
| Federation Queue Backing Up | pending PDUs > 100 for 10m | warning |
| Synapse High Memory | RSS > 2000MB for 10m | warning |
| Synapse High Response Time | p99 latency (excl. /sync) > 10s for 5m | warning |
| Synapse Event Processing Lag | any processor > 30s behind for 5m | warning |
| Synapse DB Query Latency High | p99 query time > 1s for 5m | warning |
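
For reference, the PG Connection Saturation expression can be sketched roughly as follows (assuming postgres_exporter's standard `pg_stat_activity_count` metric and the `pg_settings_max_connections` gauge from its settings collector; verify against the exporter's enabled collectors):

```promql
sum(pg_stat_activity_count) / scalar(pg_settings_max_connections) > 0.80
```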

Infrastructure folder (infra-folder):

| Alert | Fires when | Severity |
|-------|------------|----------|
| Service Exporter Down | any `up == 0` for 3m | critical |
| Node High CPU Usage | CPU > 90% for 10m | warning |
| Node High Memory Usage | RAM > 90% for 10m | warning |
| Node Disk Space Low | available < 15% (excl. tmpfs/overlay) for 10m | warning |

Prometheus rules (/etc/prometheus/prometheus_rules.yml):

| Alert | Fires when |
|-------|------------|
| InstanceDown | any `up == 0` for 1m |
| DiskSpaceFree10Percent | available < 10% (excl. tmpfs/overlay) for 5m |
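
As a sketch, the DiskSpaceFree10Percent rule in that file likely takes a shape like this (metric names are node_exporter's standard `node_filesystem_*` gauges; the exact fstype filter should match whatever the rule already uses):

```yaml
groups:
  - name: node-disk
    rules:
      - alert: DiskSpaceFree10Percent
        expr: >
          node_filesystem_avail_bytes{fstype!~"tmpfs|overlay|squashfs|devtmpfs"}
            / node_filesystem_size_bytes < 0.10
        for: 5m
```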

/sync long-poll note: The Matrix /sync endpoint is a long-poll (clients hold it open ≤30s). It is excluded from the High Response Time alert to prevent false positives. Without exclusion, p99 reads ~10s even when the server is healthy.
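
Concretely, the exclusion can be expressed in the alert query along these lines (a sketch assuming Synapse's standard `synapse_http_server_response_time_seconds` histogram and its `servlet` label; verify the exact label value against your Synapse version):

```promql
histogram_quantile(0.99, sum by (le) (
  rate(synapse_http_server_response_time_seconds_bucket{servlet!="SyncRestServlet"}[5m])
)) > 10
```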

Known Alert False Positives / Watch Items

  • Synapse Event Processing Lag — can fire transiently after a Synapse restart while processors catch up on backlog. Self-resolves in 10–20 minutes. If the reported lag keeps growing (>10 min) and doesn't plateau, restart Synapse.
  • Node Disk Space Low — excludes tmpfs, overlay, squashfs, devtmpfs, and the /boot and /run mounts. If new filesystem types appear, add them to the fstype!~ filter in the rule.

Bot Checklist

Core

  • matrix-nio async client with E2EE
  • Device trust (auto-trust all devices)
  • Graceful shutdown (SIGTERM/SIGINT)
  • Initial sync token (ignores old messages on startup)
  • Auto-accept room invites
  • Deployed as systemd service (matrixbot.service) on LXC 151
  • Fix E2EE key errors — full store + credentials wipe, fresh device registration (BBRZSEUECZ); stale devices removed via admin API
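
The graceful-shutdown item above can be sketched as a minimal asyncio pattern (names are hypothetical; in the real bot the placeholder task would be matrix-nio's sync_forever()):

```python
import asyncio
import signal


async def run_bot() -> str:
    """Run until SIGTERM/SIGINT arrives, then tear down cleanly."""
    stop = asyncio.Event()
    loop = asyncio.get_running_loop()
    # Translate both signals into the stop event (Unix only).
    for sig in (signal.SIGTERM, signal.SIGINT):
        loop.add_signal_handler(sig, stop.set)

    # Stand-in for client.sync_forever(); cancelled on shutdown.
    work = asyncio.create_task(asyncio.sleep(3600))
    await stop.wait()
    work.cancel()  # graceful teardown point (close client, flush state, ...)
    return "shutdown complete"
```

The same event can gate state persistence (e.g. writing welcome_state.json) before the process exits.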

Commands

  • !help — list commands
  • !ping — latency check
  • !8ball <question> — magic 8-ball
  • !fortune — fortune cookie
  • !flip — coin flip
  • !roll <NdS> — dice roller
  • !random <min> <max> — random number
  • !rps <choice> — rock paper scissors
  • !poll <question> — poll with reactions
  • !trivia — trivia game (reactions, 30s reveal)
  • !champion [lane] — random LoL champion
  • !agent [role] — random Valorant agent
  • !wordle — full Wordle game (daily, hard mode, stats, share)
  • !minecraft <username> — RCON whitelist add
  • !ask <question> — Ollama LLM (lotusllm, 2min cooldown)
  • !health — bot uptime + service status

Welcome System

  • Watches Space joins and DMs new members automatically
  • React-to-join: react to the welcome DM → bot invites to General, Commands, Memes
  • Welcome event ID persisted to welcome_state.json

Wordle

  • Daily puzzles with two-pass letter evaluation
  • Hard mode with constraint validation
  • Stats persistence (wordle_stats.json)
  • Cinny-compatible rendering (inline <span> tiles)
  • DM-based gameplay, !wordle share posts result to public room
  • Virtual keyboard display
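
The two-pass letter evaluation mentioned above is what makes duplicate letters score correctly. A sketch of the standard algorithm (greens consume answer letters first, then yellows are capped by what remains; function name is illustrative):

```python
from collections import Counter


def score_guess(guess: str, answer: str) -> list[str]:
    """Two-pass Wordle scoring for 5-letter words."""
    result = ["absent"] * 5
    remaining = Counter()
    # Pass 1: exact matches consume their answer letters.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "correct"
        else:
            remaining[a] += 1
    # Pass 2: misplaced letters, limited by unconsumed answer letters.
    for i, g in enumerate(guess):
        if result[i] == "absent" and remaining[g] > 0:
            result[i] = "present"
            remaining[g] -= 1
    return result
```

Without the second pass's counter cap, guessing "speed" against "abide" would wrongly mark both e's as present instead of only the first.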

Tech Stack

| Component | Technology | Version |
|-----------|------------|---------|
| Bot language | Python 3 | 3.x |
| Bot library | matrix-nio (E2EE) | latest |
| Homeserver | Synapse | 1.148.0 |
| Database | PostgreSQL | 17.9 |
| TURN | coturn | latest |
| Video/voice calls | LiveKit SFU | 1.9.11 |
| LiveKit JWT | lk-jwt-service | latest |
| SSO | Authelia (OIDC) + LLDAP | — |
| Webhook bridge | matrix-hookshot | 7.3.2 |
| Reverse proxy | Nginx Proxy Manager | — |
| Web client | Cinny (custom build, add-joined-call-controls branch) | 4.10.5+ |
| Bot dependencies | matrix-nio[e2ee], aiohttp, python-dotenv, mcrcon | — |

Bot Files

```
matrixBot/
├── bot.py              # Entry point, client setup, event loop
├── callbacks.py        # Message + reaction event handlers
├── commands.py         # All command implementations
├── config.py           # Environment config + validation
├── utils.py            # send_text, send_html, send_reaction, get_or_create_dm
├── welcome.py          # Welcome message + react-to-join logic
├── wordle.py           # Full Wordle game engine
├── wordlist_answers.py # Wordle answer word list
├── wordlist_valid.py   # Wordle valid guess word list
├── .env.example        # Environment variable template
└── requirements.txt    # Python dependencies
```