Jared Vititoe 0ba095ba03 docs: mark coturn hardening applied, update action items
- stale-nonce, user-quota, total-quota, cipher-list applied to /etc/turnserver.conf
- BBR noted as intentionally skipped (HA multi-host setup)
- Storj update and Synapse lag resolved

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 14:05:59 -04:00

Lotus Matrix Bot & Server Roadmap

Matrix bot and server infrastructure for the Lotus Guild homeserver (matrix.lotusguild.org).

Repo: https://code.lotusguild.org/LotusGuild/matrixBot

Status: Phase 6 — Monitoring, Observability & Hardening


Priority Order

  1. PostgreSQL migration
  2. TURN server
  3. Room structure + space setup
  4. Matrix bot (core + commands)
  5. LiveKit / Element Call
  6. SSO / OIDC (Authelia)
  7. Webhook integrations (hookshot)
  8. Voice stability & quality tuning
  9. Custom Cinny client (chat.lotusguild.org)
  10. Custom emoji packs (partially finished)
  11. Cinny custom branding (Lotus Guild theme)
  12. Draupnir moderation bot
  13. Push notifications (Sygnal)

Infrastructure

| Service | IP | LXC | RAM | vCPUs | Disk | Versions / Notes |
|---|---|---|---|---|---|---|
| Synapse | 10.10.10.29 | 151 | 8GB | 4 (Ryzen 9 7900) | 50GB (21% used) | Synapse 1.148.0, LiveKit 1.9.11, hookshot 7.3.2, coturn latest |
| PostgreSQL 17 | 10.10.10.44 | 109 | 6GB | 3 (Ryzen 9 7900) | 30GB (5% used) | PostgreSQL 17.9 |
| Cinny Web | 10.10.10.6 | 106 | 256MB runtime | 1 | 8GB (27% used) | Debian 13, nginx, Node 24, Cinny 4.10.5 |
| NPM | 10.10.10.27 | 139 | | | | Nginx Proxy Manager |
| Authelia | 10.10.10.36 | 167 | | | | SSO/OIDC provider |
| LLDAP | 10.10.10.39 | 147 | | | | LDAP user directory |
| Uptime Kuma | 10.10.10.25 | 101 | | | | Uptime monitoring (micro1 node) |
| Prometheus | 10.10.10.48 | 118 | | | | Prometheus — scrapes all Matrix services |
| Grafana | 10.10.10.49 | 107 | | | | Grafana 12.4.0 — dashboard.lotusguild.org |

Note: PostgreSQL container IP is 10.10.10.44, not .2 — update any stale references.

Key paths on Synapse/matrix LXC (151):

  • Synapse config: /etc/matrix-synapse/homeserver.yaml
  • Synapse conf.d: /etc/matrix-synapse/conf.d/ (metrics.yaml, report_stats.yaml, server_name.yaml)
  • coturn config: /etc/turnserver.conf
  • LiveKit config: /etc/livekit/config.yaml
  • LiveKit service: livekit-server.service
  • lk-jwt-service: lk-jwt-service.service (binds :8070, serves JWT tokens for MatrixRTC)
  • Hookshot: /opt/hookshot/, service: matrix-hookshot.service
  • Hookshot config: /opt/hookshot/config.yml
  • Hookshot registration: /etc/matrix-synapse/hookshot-registration.yaml
  • Landing page: /var/www/matrix-landing/index.html (on NPM LXC 139)
  • Bot: /opt/matrixbot/, service: matrixbot.service

Key paths on PostgreSQL LXC (109):

  • PostgreSQL config: /etc/postgresql/17/main/postgresql.conf
  • PostgreSQL conf.d: /etc/postgresql/17/main/conf.d/
  • HBA config: /etc/postgresql/17/main/pg_hba.conf
  • Data directory: /var/lib/postgresql/17/main

Running services on LXC 151:

| Service | Status | Memory | Notes |
|---|---|---|---|
| matrix-synapse | active, 2+ days | 231MB (peak 312MB) | No workers, single process |
| livekit-server | active, 2+ days | 22MB (peak 58MB) | v1.9.11, node IP = 162.192.14.139 |
| lk-jwt-service | active, 2+ days | 2.7MB | Binds :8070, LIVEKIT_URL=wss://matrix.lotusguild.org |
| matrix-hookshot | active, 2+ days | 76MB (peak 172MB) | Actively receiving webhooks |
| matrixbot | active, 2+ days | 26MB (peak 59MB) | Some E2EE key errors (see known issues) |
| coturn | active, 2+ days | 13MB | Periodic TCP reset errors (normal) |

Currently Open Port forwarding (router → 10.10.10.29):

  • TCP+UDP 3478 (TURN/STUN signaling)
  • TCP+UDP 5349 (TURNS/TLS)
  • TCP 7881 (LiveKit ICE TCP fallback)
  • TCP+UDP 49152-65535 (TURN relay range)
  • LiveKit WebRTC media: 50100-50500 (subset of above, only 401 ports — see improvements)

Internal port map (LXC 151):

| Port | Service | Bind |
|---|---|---|
| 8008 | Synapse HTTP | 0.0.0.0 + ::1 |
| 9000 | Synapse metrics (Prometheus) | 127.0.0.1 + 10.10.10.29 |
| 9001 | Hookshot widgets | 0.0.0.0 |
| 9002 | Hookshot bridge (appservice) | 127.0.0.1 |
| 9003 | Hookshot webhooks | 0.0.0.0 |
| 9004 | Hookshot metrics (Prometheus) | 0.0.0.0 |
| 9100 | node_exporter (Prometheus) | 0.0.0.0 |
| 9101 | matrix-admin exporter | 0.0.0.0 |
| 6789 | LiveKit metrics (Prometheus) | 0.0.0.0 |
| 7880 | LiveKit HTTP | 0.0.0.0 |
| 7881 | LiveKit RTC TCP | 0.0.0.0 |
| 8070 | lk-jwt-service | 0.0.0.0 |
| 8080 | synapse-admin (nginx) | 0.0.0.0 |
| 3478 | coturn STUN/TURN | 0.0.0.0 |
| 5349 | coturn TURNS/TLS | 0.0.0.0 |

Internal port map (LXC 109 — PostgreSQL):

| Port | Service | Bind |
|---|---|---|
| 5432 | PostgreSQL | 0.0.0.0 (hba-restricted to 10.10.10.29) |
| 9100 | node_exporter (Prometheus) | 0.0.0.0 |
| 9187 | postgres_exporter | 0.0.0.0 |

Rooms (all v12)

| Room | Room ID | Join Rule |
|---|---|---|
| The Lotus Guild (Space) | !-1ZBnAH-JiCOV8MGSKN77zDGTuI3pgSdy8Unu_DrDyc | public |
| General | !wfokQ1-pE896scu_AOcCBA2s3L4qFo-PTBAFTd0WMI0 | public |
| Commands | !ou56mVZQ8ZB7AhDYPmBV5_BR28WMZ4x5zwZkPCqjq1s | restricted (Space members) |
| Memes | !GK6v5cLEEnowIooQJv5jECfISUjADjt8aKhWv9VbG5U | restricted (Space members) |
| Management | !mEvR5fe3jMmzwd-FwNygD72OY_yu8H3UP_N-57oK7MI | invite |
| Cool Kids | !R7DT3QZHG9P8QQvX6zsZYxjkKgmUucxDz_n31qNrC94 | invite |
| Spam and Stuff | !GttT4QYd1wlGlkHU3qTmq_P3gbyYKKeSSN6R7TPcJHg | invite, no E2EE (hookshot) |

Power level roles (Cinny tags):

  • 100: Owner (jared)
  • 50: The Nerdy Council (enhuynh, lonely)
  • 48: Panel of Geeks
  • 35: Cool Kids
  • 0: Member

Webhook Integrations (matrix-hookshot 7.3.2)

Generic webhooks bridged into Spam and Stuff via matrix-hookshot. Each service gets its own virtual user (@hookshot_<service>) with a unique avatar. Webhook URL format: https://matrix.lotusguild.org/webhook/<uuid>

| Service | Webhook UUID | Notes |
|---|---|---|
| Grafana | df4a1302-2d62-4a01-b858-fb56f4d3781a | Unified alerting contact point |
| Proxmox | 9b3eafe5-7689-4011-addd-c466e524661d | Notification system (8.1+) |
| Sonarr | aeffc311-0686-42cb-9eeb-6757140c072e | All event types |
| Radarr | 34913454-c1ac-4cda-82ea-924d4a9e60eb | All event types |
| Readarr | e57ab4f3-56e6-4dc4-8b30-2f4fd4bbeb0b | All event types |
| Lidarr | 66ac6fdd-69f6-4f47-bb00-b7f6d84d7c1c | All event types |
| Uptime Kuma | 1a02e890-bb25-42f1-99fe-bba6a19f1811 | Status change notifications |
| Seerr | 555185af-90a1-42ff-aed5-c344e11955cf | Request/approval events |
| Owncast (Livestream) | 9993e911-c68b-4271-a178-c2d65ca88499 | STREAM_STARTED / STREAM_STOPPED (hookshot display name: "Livestream") |
| Bazarr | 470fb267-3436-4dd3-a70c-e6e8db1721be | Subtitle events (Apprise JSON notifier) |
| Tinker-Tickets | 6e306faf-8eea-4ba5-83ef-bf8f421f929e | Custom transformation code |

Hookshot notes:

  • Spam and Stuff is intentionally unencrypted — hookshot bridges cannot join E2EE rooms
  • Webhook tokens stored in Synapse PostgreSQL room_account_data for @hookshot
  • JS transformation functions use hookshot v2 API: set result = { version: "v2", plain, html, msgtype }
  • The result variable must be assigned without var/let/const (needs implicit global scope in the QuickJS IIFE sandbox)
  • NPM proxies https://matrix.lotusguild.org/webhook/* → http://10.10.10.29:9003
  • Virtual user avatars: set via appservice token (as_token in hookshot-registration.yaml) impersonating each user
  • Hookshot bridge port (9002) binds 127.0.0.1 only; webhook ingest (9003) binds 0.0.0.0 (NPM-proxied)

Known Issues

coturn TLS Reset Errors

Periodic TLS/TCP socket error: Connection reset by peer in coturn logs from external IPs. This is normal — clients probe TURN and drop the connection once they establish a direct P2P path. Not an issue.

BBR Congestion Control — Host-Level Only

net.ipv4.tcp_congestion_control = bbr and net.core.default_qdisc = fq cannot be set from inside an unprivileged LXC container — they affect the host kernel's network namespace. These must be applied on the Proxmox host itself to take effect for all containers. All other sysctl tuning (TCP/UDP buffers, fin_timeout) applied successfully inside LXC 151.


Optimizations & Improvements

1. LiveKit / Voice Quality Applied

Noise suppression and volume normalization are client-side only (browser/Element X handles this via WebRTC's built-in audio processing). The server cannot enforce these. Applied server-side improvements:

  • ICE port range expanded: 50100-50500 (401 ports) → 50000-51000 (1001 ports) = ~500 concurrent WebRTC streams
  • TURN TTL reduced: 86400s (24h) → 3600s (1h) — stale allocations expire faster
  • Room defaults added: empty_timeout: 300, departure_timeout: 20, max_participants: 50
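The port math above can be sanity-checked with shell arithmetic. The ~500-stream figure assumes roughly two UDP ports per stream (RTP + RTCP), which is an assumption of this note, not a figure from the LiveKit docs:

```shell
# Sanity-check the ICE port range sizes quoted above (ranges are inclusive).
old_ports=$((50500 - 50100 + 1))   # original range
new_ports=$((51000 - 50000 + 1))   # expanded range
streams=$((new_ports / 2))         # assuming ~2 ports per stream
echo "$old_ports $new_ports $streams"   # prints: 401 1001 500
```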

Client-side audio advice for users:

  • Element Web/Desktop: Settings → Voice & Video → enable "Noise Suppression" and "Echo Cancellation"
  • Element X (mobile): automatic via WebRTC stack
  • Cinny (chat.lotusguild.org): voice via embedded Element Call widget — browser WebRTC noise suppression is active automatically

2. PostgreSQL Tuning (LXC 109) Applied

/etc/postgresql/17/main/conf.d/synapse_tuning.conf written and active. pg_stat_statements extension created in the synapse database. Config applied:

# Memory — shared_buffers = 25% RAM, effective_cache_size = 75% RAM
shared_buffers = 1500MB
effective_cache_size = 4500MB
work_mem = 32MB                    # Per sort/hash operation (safe at low connection count)
maintenance_work_mem = 256MB       # VACUUM, CREATE INDEX
wal_buffers = 64MB                 # WAL write buffer

# Checkpointing
checkpoint_completion_target = 0.9 # Spread checkpoint I/O (the default since PG 14, kept explicit)
max_wal_size = 2GB

# Storage (Ceph RBD block device = SSD-equivalent random I/O)
random_page_cost = 1.1             # Default 4.0 assumes spinning disk
effective_io_concurrency = 200     # For SSDs/Ceph

# Parallel queries (3 vCPUs)
max_worker_processes = 3
max_parallel_workers_per_gather = 1
max_parallel_workers = 2

# Monitoring
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.track = all

Restarted postgresql@17-main. Expected impact: Synapse query latency drops as the DB grows — the entire current 120MB database fits in shared_buffers.
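The memory figures follow the 25%/75% rule from the comment in the config above. A quick check against the container's 6GB (taken here as 6144 binary MB) shows the configured values are those targets rounded down slightly:

```shell
# Verify the tuning numbers against LXC 109's 6GB (6144 MB) of RAM.
ram_mb=6144
sb_target=$((ram_mb * 25 / 100))    # shared_buffers target
ecs_target=$((ram_mb * 75 / 100))   # effective_cache_size target
echo "shared_buffers:       ${sb_target}MB target (configured: 1500MB)"
echo "effective_cache_size: ${ecs_target}MB target (configured: 4500MB)"
```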

3. PostgreSQL Security — pg_hba.conf (LXC 109) Applied

Removed the two open rules (0.0.0.0/24 md5 and 0.0.0.0/0 md5). Remote access is now restricted to Synapse LXC only:

host    synapse         synapse_user    10.10.10.29/32          scram-sha-256

All other remote connections are rejected. Local Unix socket and loopback remain functional for admin access.

4. Synapse Cache Tuning (LXC 151) Applied

event_cache_size bumped 15K → 30K. _get_state_group_for_events: 3.0 added to per_cache_factors (heavily hit during E2EE key sharing). Synapse restarted cleanly.

event_cache_size: 30K
caches:
  global_factor: 2.0
  per_cache_factors:
    get_users_in_room: 3.0
    get_current_state_ids: 3.0
    _get_state_group_for_events: 3.0
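Assuming Synapse multiplies the configured base size by global_factor (with per_cache_factors overriding the global factor for the named caches), the effective event cache works out to:

```shell
# Effective event cache size under the config above (assumes factors
# multiply the configured base; check the Synapse cache docs for exact semantics).
base=30000        # event_cache_size: 30K
global_factor=2   # global_factor: 2.0
effective=$((base * global_factor))
echo "effective event cache: $effective entries"
```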

5. Network / sysctl Tuning (LXC 151) Applied

/etc/sysctl.d/99-matrix-tuning.conf written and active. TCP/UDP buffers aligned and fin_timeout reduced.

# Align TCP buffers with core maximums
net.ipv4.tcp_rmem = 4096 131072 26214400
net.ipv4.tcp_wmem = 4096 65536 26214400

# UDP buffer sizing for WebRTC media streams
net.core.rmem_max = 26214400
net.core.wmem_max = 26214400
net.ipv4.udp_rmem_min = 65536
net.ipv4.udp_wmem_min = 65536

# Reduce latency for short-lived TURN connections
net.ipv4.tcp_fin_timeout = 10
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_intvl = 30

BBR note: tcp_congestion_control = bbr and default_qdisc = fq require host-level sysctl — cannot be set inside an unprivileged LXC. Apply on the Proxmox host to benefit all containers:

echo "net.ipv4.tcp_congestion_control = bbr" >> /etc/sysctl.d/99-bbr.conf
echo "net.core.default_qdisc = fq" >> /etc/sysctl.d/99-bbr.conf
sysctl --system

6. Synapse Federation Hardening

The server is effectively a private server for friends. Restricting federation prevents abuse and reduces load. Add to homeserver.yaml:

# Allow federation only with specific trusted servers (or disable entirely)
federation_domain_whitelist:
  - matrix.org        # Keep for bridging if needed
  - matrix.lotusguild.org

# OR to go fully closed (recommended for friends-only). Synapse has no
# federation_enabled switch; an empty whitelist blocks all federation:
# federation_domain_whitelist: []

7. Bot E2EE Key Fix (LXC 151) Applied

nio_store/ cleared and bot restarted cleanly. Megolm session errors resolved.


Custom Cinny Client (chat.lotusguild.org)

Cinny v4 is the preferred client — clean UI, Cinny-style rendering already used by the bot's Wordle tiles. We build from source to get voice support and full branding control.

Why Cinny over Element Web

  • Much cleaner aesthetics, already the de-facto client for guild members
  • Element Web voice suppression (Krisp) is only on app.element.io — a custom build loses it
  • Cinny add-joined-call-controls branch uses @element-hq/element-call-embedded which talks to the existing MatrixRTC → lk-jwt-service → LiveKit stack with zero new infrastructure
  • Static build (nginx serving ~5MB of files) — nearly zero runtime resource cost

Voice support status (as of March 2026)

The official add-joined-call-controls branch (maintained by ajbura, last commit March 8 2026) embeds Element Call as a widget via @element-hq/element-call-embedded: 0.16.3. This uses the same MatrixRTC protocol that lk-jwt-service already handles. Two direct LiveKit integration PRs (#2703, #2704) were proposed but closed without merge — so the embedded Element Call approach is the official path.

Since lk-jwt-service is already running on LXC 151 and configured for wss://matrix.lotusguild.org, voice calls will work out of the box once the Cinny build is deployed.

LXC Setup

Create the LXC (run on the host):

# ProxmoxVE Debian 13 community script
bash -c "$(curl -fsSL https://raw.githubusercontent.com/community-scripts/ProxmoxVE/main/ct/debian.sh)"

Recommended settings: 2GB RAM, 1-2 vCPUs, 20GB disk, Debian 13, static IP on VLAN 10 (e.g. 10.10.10.XX).

Inside the new LXC:

# Install nginx + git + nvm dependencies
apt update && apt install -y nginx git curl

# Install Node.js 24 via nvm
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.1/install.sh | bash
source ~/.bashrc
nvm install 24
nvm use 24

# Clone Cinny and switch to voice-support branch
git clone https://github.com/cinnyapp/cinny.git /opt/cinny
cd /opt/cinny
git checkout add-joined-call-controls

# Install dependencies and build
npm ci
NODE_OPTIONS=--max_old_space_size=4096 npm run build
# Output: /opt/cinny/dist/

# Deploy to nginx root
cp -r /opt/cinny/dist/* /var/www/html/

Configure Cinny — edit /var/www/html/config.json:

{
  "defaultHomeserver": 0,
  "homeserverList": ["matrix.lotusguild.org"],
  "allowCustomHomeservers": false,
  "featuredCommunities": {
    "openAsDefault": false,
    "spaces": [],
    "rooms": [],
    "servers": []
  },
  "hashRouter": {
    "enabled": false,
    "basename": "/"
  }
}

Nginx config at /etc/nginx/sites-available/cinny (matches the official docker-nginx.conf):

server {
    listen 80;
    listen [::]:80;
    server_name chat.lotusguild.org;

    root /var/www/html;
    index index.html;

    location / {
        rewrite ^/config.json$           /config.json break;
        rewrite ^/manifest.json$         /manifest.json break;
        rewrite ^/sw.js$                 /sw.js break;
        rewrite ^/pdf.worker.min.js$     /pdf.worker.min.js break;
        rewrite ^/public/(.*)$           /public/$1 break;
        rewrite ^/assets/(.*)$           /assets/$1 break;
        rewrite ^(.+)$                   /index.html break;
    }
}
ln -s /etc/nginx/sites-available/cinny /etc/nginx/sites-enabled/
nginx -t && systemctl reload nginx

Then in NPM: add a proxy host for chat.lotusguild.org → http://10.10.10.XX:80 with SSL.

Rebuilding after updates

cd /opt/cinny
git pull
npm ci
NODE_OPTIONS=--max_old_space_size=4096 npm run build
cp -r dist/* /var/www/html/
# Preserve your config.json — it gets overwritten by the copy above, so:
# Option: keep config.json outside dist and symlink/copy it in after each build
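The rebuild steps above can be wrapped into the cinny-update helper referenced later in this doc. A minimal sketch (paths assumed from this setup; the DRY_RUN guard and the config-restore step are additions of this sketch, and it defaults to dry-run so it is safe to try):

```shell
#!/usr/bin/env bash
# Sketch of /usr/local/bin/cinny-update. DRY_RUN=1 (the default) only
# prints each step; set DRY_RUN=0 to actually execute them.
set -euo pipefail

SRC=/opt/cinny
WEBROOT=/var/www/html
CONFIG_BACKUP=/opt/cinny-config.json   # survives rebuilds (see Key paths)

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

run git -C "$SRC" pull
run npm --prefix "$SRC" ci
run env NODE_OPTIONS=--max_old_space_size=4096 npm --prefix "$SRC" run build
run cp -r "$SRC/dist/." "$WEBROOT/"
run cp "$CONFIG_BACKUP" "$WEBROOT/config.json"   # restore guild config over the default
```

Review the printed plan, then run it for real with DRY_RUN=0.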

Key paths (Cinny LXC 106 — 10.10.10.6)

  • Source: /opt/cinny/ (branch: add-joined-call-controls)
  • Built files: /var/www/html/
  • Cinny config: /var/www/html/config.json
  • Config backup (survives rebuilds): /opt/cinny-config.json
  • Nginx site config: /etc/nginx/sites-available/cinny
  • Rebuild script: /usr/local/bin/cinny-update

Server Checklist

Quality of Life

  • Migrate from SQLite to PostgreSQL
  • TURN/STUN server (coturn) for reliable voice/video
  • URL previews
  • Upload size limit 200MB
  • Full-text message search (PostgreSQL backend)
  • Media retention policy (remote: 1yr, local: 3yr)
  • Sliding sync (native Synapse)
  • LiveKit for Element Call video rooms
  • Default room version v12, all rooms upgraded
  • Landing page with client recommendations (Cinny, Commet, Element, Element X mobile)
  • Synapse metrics endpoint (port 9000, Prometheus-compatible)
  • Push notifications gateway (Sygnal) for mobile clients
  • LiveKit port range expanded to 50000-51000 for voice call capacity
  • Custom Cinny client LXC 106 (10.10.10.6) — Debian 13, Cinny 4.10.5 built from add-joined-call-controls, nginx serving, HA enabled
  • NPM proxy entry for chat.lotusguild.org → 10.10.10.6:80, SSL via Cloudflare DNS challenge, HTTPS forced, HTTP/2 + HSTS enabled
  • Cinny weekly auto-update cron (/etc/cron.d/cinny-update, Sundays 3am, logs to /var/log/cinny-update.log)
  • Cinny custom branding — Lotus Guild theme (colors, title, favicon, PWA name)

Performance Tuning

  • PostgreSQL shared_buffers → 1500MB, effective_cache_size, work_mem, checkpoint tuning applied
  • PostgreSQL pg_stat_statements extension installed in synapse database
  • PostgreSQL autovacuum tuned per-table (state_groups_state, events, receipts_linearized, receipts_graph, device_lists_stream, presence_stream), autovacuum_max_workers → 5
  • Synapse event_cache_size → 30K, _get_state_group_for_events cache factor added
  • sysctl TCP/UDP buffer alignment applied to LXC 151 (/etc/sysctl.d/99-matrix-tuning.conf)
  • LiveKit room empty_timeout: 300, departure_timeout: 20, max_participants: 50
  • LiveKit ICE port range expanded to 50000-51000
  • LiveKit TURN TTL reduced from 24h to 1h
  • LiveKit VP9/AV1 codecs enabled (video_codecs: [VP8, H264, VP9, AV1])
  • BBR congestion control — must be applied on Proxmox host, not inside LXC (see Known Issues)

Auth & SSO

  • Token-based registration
  • SSO/OIDC via Authelia
  • allow_existing_users: true for linking accounts to SSO
  • Password auth alongside SSO

Webhooks & Integrations

  • matrix-hookshot 7.3.2 installed and running
  • Generic webhook bridge for 11 active services (Grafana, Proxmox, Sonarr, Radarr, Readarr, Lidarr, Uptime Kuma, Seerr, Owncast/Livestream, Bazarr, Tinker-Tickets)
  • Per-service JS transformation functions — all rewritten to handle full event payloads (all event types, health alerts, app updates, release groups, download clients)
  • Per-service virtual user avatars
  • NPM reverse proxy for /webhook path
  • Tinker Tickets custom transformation code

Room Structure

  • The Lotus Guild space
  • All core rooms with correct power levels and join rules
  • Spam and Stuff room for service notifications (hookshot)
  • Custom room avatars

Hardening

  • Rate limiting
  • E2EE on all rooms (except Spam and Stuff — intentional for hookshot)
  • coturn internal peer deny rules (blocks relay to RFC1918 except allowed subnet)
  • pg_hba.conf locked down — remote access restricted to Synapse LXC (10.10.10.29) only
  • Federation enabled with key verification (open for invite-only growth to friends/family/coworkers)
  • fail2ban on Synapse login endpoint (5 retries / 24h ban, LXC 151)
  • Synapse metrics port 9000 restricted to 127.0.0.1 + 10.10.10.29 (was 0.0.0.0)
  • coturn cert auto-renewal — daily sync cron on compute-storage-01 copies NPM cert → coturn
  • /.well-known/matrix/client and /server live on lotusguild.org (NPM advanced config)
  • suppress_key_server_warning: true in homeserver.yaml
  • Federation allow/deny lists for known bad actors
  • Regular Synapse updates
  • Automated database + media backups

Monitoring

  • Synapse metrics endpoint (port 9000, Prometheus-compatible)
  • Uptime Kuma monitors added: Synapse HTTP, LiveKit TCP, PostgreSQL TCP, Cinny Web, coturn TCP 3478, lk-jwt-service, Hookshot
  • Uptime Kuma: coturn UDP STUN monitoring (requires push/heartbeat — no native UDP type in Kuma)
  • Grafana dashboard — custom Synapse dashboard at dashboard.lotusguild.org/d/matrix-synapse-dashboard/matrix-synapse (140+ panels, see Monitoring section below)
  • Prometheus scraping all Matrix services: Synapse, Hookshot, LiveKit, matrix-node, postgres-node, matrix-admin, postgres, postgres-exporter
  • node_exporter installed on LXC 151 (Matrix) and LXC 109 (PostgreSQL)
  • LiveKit Prometheus metrics enabled (prometheus_port: 6789)
  • Hookshot metrics enabled (metrics: { enabled: true }) on dedicated port 9004
  • Grafana alert rules — 9 Matrix/infra alerts active (see Alert Rules section below)
  • Duplicate Grafana "Infrastructure" folder merged and deleted

Admin

  • Synapse admin API dashboard (synapse-admin at http://10.10.10.29:8080)
  • Power levels per room
  • Draupnir moderation bot (new LXC or alongside existing bot)
  • Cinny custom branding (Lotus Guild theme — colors, title, favicon, PWA name)
  • Storj node update: storj_uptodate=0 on LXC 138 (10.10.10.133), risk of disqualification

Improvement Audit (March 2026)

Comprehensive audit of the current infrastructure against official documentation and security best practices. Applied March 9 2026.

Priority Summary

| Issue | Severity | Status |
|---|---|---|
| coturn TLS cert expires May 12 — no auto-renewal | CRITICAL | Fixed — daily sync cron on compute-storage-01 copies NPM-renewed cert to coturn |
| Synapse metrics port 9000 bound to 0.0.0.0 | HIGH | Fixed — now binds 127.0.0.1 + 10.10.10.29 (Prometheus still works, internet blocked) |
| /.well-known/matrix/client returns 404 | MEDIUM | Fixed — NPM lotusguild.org proxy host updated, live at https://lotusguild.org/.well-known/matrix/client |
| suppress_key_server_warning not set | MEDIUM | Fixed — added to homeserver.yaml |
| No fail2ban on /_matrix/client/.*/login | MEDIUM | Fixed — fail2ban installed, matrix-synapse jail active (5 retries / 24h ban) |
| No media purge cron (retention policy set but never triggers) | MEDIUM | N/A — media_retention block already in homeserver.yaml; Synapse runs the purge internally on schedule |
| PostgreSQL autovacuum not tuned per-table | LOW | Fixed — high-churn tables tuned, autovacuum_max_workers → 5 |
| Hookshot metrics scrape unconfirmed | LOW | Fixed — metrics: { enabled: true } added to config, metrics split to dedicated port 9004, Prometheus scraping confirmed |
| LiveKit VP9/AV1 codec support | LOW | Applied — video_codecs: [VP8, H264, VP9, AV1] added to livekit config |
| Federation allow/deny list not configured | LOW | Pending — Mjolnir/Draupnir on roadmap |
| Sygnal push notifications not deployed | INFO | Deferred |

1. coturn Cert Auto-Renewal

The coturn cert is managed by NPM (cert ID 91, stored at /etc/letsencrypt/live/npm-91/ on LXC 139). NPM renews it automatically. A sync script on compute-storage-01 detects when NPM renews and copies it to coturn.

Deployed: /usr/local/bin/coturn-cert-sync.sh on compute-storage-01, cron /etc/cron.d/coturn-cert-sync (runs 03:30 daily).

Script compares cert expiry dates between LXC 139 and LXC 151. If they differ (NPM renewed), it copies fullchain.pem + privkey.pem and restarts coturn.
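The core of that comparison can be sketched with openssl. The helper below is illustrative only; the real script additionally pulls the files from LXC 139, copies them to LXC 151, and restarts coturn:

```shell
# Extract a certificate's expiry date. The sync script compares this value
# between the NPM copy and the coturn copy and syncs only when they differ.
cert_expiry() {
  openssl x509 -enddate -noout -in "$1" | cut -d= -f2
}

# Example decision logic (paths illustrative, not the real mount points):
# if [ "$(cert_expiry npm/fullchain.pem)" != "$(cert_expiry coturn/fullchain.pem)" ]; then
#   copy fullchain.pem + privkey.pem to coturn, then: systemctl restart coturn
# fi
```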

Additional coturn hardening — Applied March 2026:

# /etc/turnserver.conf
stale-nonce=600              # Nonce expires 600s (prevents replay attacks)
user-quota=100               # Max concurrent relay allocations per user
total-quota=1000             # Total relay allocations server-wide
cipher-list=ECDHE-RSA-AES256-GCM-SHA384:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-CHACHA20-POLY1305

2. Synapse Configuration Gaps

a) Metrics port exposed to 0.0.0.0 (HIGH)

Port 9000 currently binds 0.0.0.0 — exposes internal state, user counts, DB query times externally. Fix in homeserver.yaml:

metrics_flags:
  some_legacy_unrestricted_resources: false
listeners:
  - port: 9000
    bind_addresses: ['127.0.0.1', '10.10.10.29']   # loopback + VLAN scrape, NOT 0.0.0.0
    type: metrics
    resources: []

Prometheus at 10.10.10.48 (which feeds Grafana) scrapes port 9000 from within the VLAN, so this is safe to lock down.
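After restarting Synapse, the binding can be confirmed from LXC 151. The snippet below reads /proc/net/tcp directly (ss -tln | grep ':9000 ' works just as well where iproute2 is installed):

```shell
# Verify nothing still listens on 0.0.0.0:9000.
port_hex=$(printf '%04X' 9000)   # 9000 -> 2328
# local_address entries ending in :2328; a line starting 00000000: means 0.0.0.0.
awk -v p=":$port_hex" 'NR > 1 && index($2, p) { print $2 }' /proc/net/tcp 2>/dev/null || true
```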

b) suppress_key_server_warning (MEDIUM)

Fills Synapse logs with noise on every restart. One line in homeserver.yaml:

suppress_key_server_warning: true

c) Database connection pooling (LOW — track for growth)

Current defaults (cp_min: 5, cp_max: 10) are fine for single-process. When adding workers, increase cp_max to 20-30 per worker group. Add explicitly to homeserver.yaml to make it visible:

database:
  name: psycopg2
  args:
    cp_min: 5
    cp_max: 10

3. Matrix Well-Known 404

/.well-known/matrix/client returns 404. This breaks client autodiscovery — users who type lotusguild.org instead of matrix.lotusguild.org get an error. Fix in NPM with a custom location block on the lotusguild.org proxy host:

location /.well-known/matrix/client {
    add_header Content-Type application/json;
    add_header Access-Control-Allow-Origin *;
    return 200 '{"m.homeserver":{"base_url":"https://matrix.lotusguild.org"}}';
}
location /.well-known/matrix/server {
    add_header Content-Type application/json;
    add_header Access-Control-Allow-Origin *;
    return 200 '{"m.server":"matrix.lotusguild.org:443"}';
}
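Before pasting into NPM, the two payloads can be validated offline (python3 -m json.tool is used here purely as a JSON checker):

```shell
# Validate the well-known JSON bodies before deploying them via NPM.
client='{"m.homeserver":{"base_url":"https://matrix.lotusguild.org"}}'
server='{"m.server":"matrix.lotusguild.org:443"}'
echo "$client" | python3 -m json.tool >/dev/null && echo "client json OK"
echo "$server" | python3 -m json.tool >/dev/null && echo "server json OK"
```

Once live, curl https://lotusguild.org/.well-known/matrix/client should return the client payload.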

4. fail2ban for Synapse Login

No brute-force protection on /_matrix/client/*/login. Easy win.

/etc/fail2ban/jail.d/matrix-synapse.conf:

[matrix-synapse]
enabled  = true
port     = http,https
filter   = matrix-synapse
logpath  = /var/log/matrix-synapse/homeserver.log
backend  = systemd
journalmatch = _SYSTEMD_UNIT=matrix-synapse.service + PRIORITY=3
findtime = 600
maxretry = 5
bantime  = 86400

/etc/fail2ban/filter.d/matrix-synapse.conf:

[Definition]
failregex = ^.*Failed (password|SAML) login attempt for user .* from <HOST>.*$
            ^.*"POST /.*login.*" 401.*$
ignoreregex = ^.*"GET /sync.*".*$
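The intended login pattern can be smoke-tested offline with grep -E against a made-up log line (the <HOST> placeholder is replaced with an IP pattern here; fail2ban-regex against the real homeserver.log remains the authoritative test):

```shell
# Offline smoke test of the login failregex (sample line is invented).
sample='synapse.handlers.auth - Failed password login attempt for user @alice:lotusguild.org from 203.0.113.7'
echo "$sample" | grep -Eq 'Failed (password|SAML) login attempt for user .* from [0-9.]+' \
  && echo "regex matches"
```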

5. Synapse Media Purge Cron

Retention policy is configured (remote 1yr, local 3yr), and per the audit summary above the media_retention block in homeserver.yaml already triggers purges internally on a schedule. The script below is an optional, more aggressive manual purge via the Synapse admin API (trimming remote media older than 90 days).

/usr/local/bin/purge-synapse-media.sh (create on LXC 151):

#!/bin/bash
ADMIN_TOKEN="syt_your_admin_token"
# Purge remote media (cached from other homeservers) older than 90 days
CUTOFF_TS=$(($(date +%s000) - 7776000000))
curl -X POST \
  "http://localhost:8008/_synapse/admin/v1/purge_media_cache?before_ts=$CUTOFF_TS" \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -s -o /dev/null
echo "$(date): Synapse remote media purge completed" >> /var/log/synapse-purge.log

Install it:

chmod +x /usr/local/bin/purge-synapse-media.sh
echo "0 4 * * * root /usr/local/bin/purge-synapse-media.sh" > /etc/cron.d/synapse-purge
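The cutoff arithmetic in the script checks out: date +%s000 appends literal zeros to turn seconds into milliseconds, and 7776000000 ms is exactly 90 days:

```shell
# 90 days expressed in milliseconds, matching the script's constant.
ninety_days_ms=$((90 * 24 * 60 * 60 * 1000))
echo "$ninety_days_ms"   # prints: 7776000000
```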

6. PostgreSQL Autovacuum Per-Table Tuning

The high-churn Synapse tables (state_groups_state, events, receipts) are not tuned for aggressive autovacuum. As the DB grows, bloat accumulates and queries slow down. Run on LXC 109 (PostgreSQL):

-- state_groups_state: biggest bloat source
-- (autovacuum_naptime is instance-wide and cannot be set per-table,
--  so it is omitted from these storage parameters)
ALTER TABLE state_groups_state SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_analyze_scale_factor = 0.005,
    autovacuum_vacuum_cost_delay = 5
);

-- events: second priority
ALTER TABLE events SET (
    autovacuum_vacuum_scale_factor = 0.02,
    autovacuum_analyze_scale_factor = 0.01,
    autovacuum_vacuum_cost_delay = 5
);

-- receipts tables and device/presence streams
ALTER TABLE receipts_linearized SET (autovacuum_vacuum_scale_factor = 0.01, autovacuum_vacuum_cost_delay = 5);
ALTER TABLE receipts_graph SET (autovacuum_vacuum_scale_factor = 0.01, autovacuum_vacuum_cost_delay = 5);
ALTER TABLE device_lists_stream SET (autovacuum_vacuum_scale_factor = 0.02);
ALTER TABLE presence_stream SET (autovacuum_vacuum_scale_factor = 0.02);

Also bump autovacuum_max_workers from 3 → 5. This is a postmaster-level setting, so a reload is not enough; restart PostgreSQL afterwards:

ALTER SYSTEM SET autovacuum_max_workers = 5;
-- then: systemctl restart postgresql@17-main  (pg_reload_conf() will not pick this up)

Monitor vacuum health:

SELECT relname, last_autovacuum, n_dead_tup, n_live_tup
FROM pg_stat_user_tables
WHERE relname IN ('events', 'state_groups_state', 'receipts_linearized')
ORDER BY n_dead_tup DESC;

7. Hookshot Metrics + Grafana

Hookshot metrics are exposed at 127.0.0.1:9001/metrics but it's unconfirmed whether Prometheus at 10.10.10.49 is scraping them. Verify:

# On LXC 151
curl http://127.0.0.1:9001/metrics | head -20

If Prometheus is scraping, add the hookshot dashboard from the repo: contrib/hookshot-dashboard.json → import into Grafana.

Grafana Synapse dashboard — Prometheus is already scraping Synapse at port 9000. Import the official dashboard:

  • Grafana → Dashboards → Import → ID 18618 (Synapse Monitoring)
  • Set Prometheus datasource → done
  • Shows room count, message rates, federation lag, cache hit rates, DB query times in real time

8. Federation Security

Currently: open federation with key verification (correct for invite-only friends server). Recommended additions:

Server-level allow/deny in homeserver.yaml (optional, for closing federation entirely):

# Fully closed (recommended long-term for private guild). Synapse has no
# federation_enabled switch; an empty whitelist blocks all federation:
# federation_domain_whitelist: []

# OR: whitelist-only federation
federation_domain_whitelist:
  - matrix.lotusguild.org
  - matrix.org   # Keep if bridging needed

Per-room ACLs for reactive blocking of specific bad servers:

{
  "type": "m.room.server_acl",
  "content": {
    "allow": ["*"],
    "deny": ["spam.example.com"]
  }
}

Mjolnir/Draupnir (already on roadmap) handles this automatically with ban list subscriptions (t2bot spam lists etc).


9. Sygnal Push Notifications

Sygnal is the official Matrix push gateway for mobile (Element X on iOS/Android). Without it, notifications don't arrive when the app is backgrounded.

Requirements:

  • Apple Developer account (APNS cert) for iOS
  • Firebase project (FCM API key) for Android
  • New LXC or run alongside existing services

Basic config (/etc/sygnal/sygnal.yaml):

server:
  port: 8765
database:
  type: postgresql
  user: sygnal
  password: <password>
  database: sygnal
apps:
  com.element.android:
    type: gcm
    api_key: <FCM_API_KEY>
  im.riot.x.ios:
    type: apns
    platform: production
    certfile: /etc/sygnal/apns/element-x-cert.pem
    topic: im.riot.x.ios

Synapse integration: no homeserver.yaml change is needed; Synapse has no server-side push gateway list. Each mobile client registers its own pusher whose data.url points at the gateway's /_matrix/push/v1/notify endpoint, so Sygnal's port 8765 just needs to be reachable from clients (e.g. proxied through NPM).

10. LiveKit VP9/AV1 + Dynacast (Quality Improvement)

Currently H264 only. Enabling VP9/AV1 unlocks Dynacast (pauses video layers no one is watching) which significantly reduces bandwidth/CPU for low-viewer rooms.

/etc/livekit/config.yaml additions:

video:
  codecs:
    - mime: video/H264
      fmtp: "level-asymmetry-allowed=1;packetization-mode=1;profile-level-id=42e01e"
    - mime: video/VP9
      fmtp: "profile=0"
    - mime: video/AV1
      fmtp: "profile=0"
  dynacast: true

Note: Dynacast only works with VP9 or AV1 (SVC-capable codecs). H264 subscribers continue to work normally alongside VP9/AV1 subscribers.


11. Synapse Workers (Future Scaling Reference)

Current single-process handles ~100-300 concurrent users before the Python GIL becomes the bottleneck. Not needed now, but documented for when usage grows.

Stage 1 trigger: Synapse CPU >80% consistently, or >200 concurrent users.

First workers to add:

# /etc/matrix-synapse/workers/client-reader-1.yaml
worker_app: synapse.app.generic_worker   # dedicated client_reader app is deprecated in current Synapse
worker_name: client-reader-1
worker_listeners:
  - type: http
    port: 8011
    resources: [{names: [client]}]

Add federation_sender next (off-loads outgoing federation from main process). Then event_creator for write-heavy loads. Redis required at Stage 2 (500+ users) for inter-worker coordination.



Monitoring & Observability (March 2026)

Prometheus Scrape Jobs

All Matrix-related services scraped by Prometheus at 10.10.10.48 (LXC 118):

| Job | Target | Metrics |
|---|---|---|
| synapse | 10.10.10.29:9000 | Full Synapse internals (events, federation, caches, DB, HTTP) |
| matrix-admin | 10.10.10.29:9101 | DAU, MAU, room/user/media totals |
| livekit | 10.10.10.29:6789 | Rooms, participants, packets, forward latency, quality |
| hookshot | 10.10.10.29:9004 | Connections by service, API calls/failures, Node.js runtime |
| matrix-node | 10.10.10.29:9100 | CPU, RAM, network, disk space, load avg (Matrix LXC host) |
| postgres | 10.10.10.44:9187 | pg_stat_database, connections, WAL, block I/O |
| postgres-node | 10.10.10.44:9100 | CPU, RAM, network, disk space, load avg (PostgreSQL LXC host) |
| postgres-exporter-2 | 10.10.10.160:9711 | Secondary postgres exporter |

Disk I/O note: All servers use Ceph-backed storage. Per-device disk I/O metrics are meaningless; use Network I/O panels to see actual storage traffic.

Grafana Dashboard

URL: https://dashboard.lotusguild.org/d/matrix-synapse-dashboard/matrix-synapse

140+ panels across 18 sections:

| Section | Key panels |
|---------|------------|
| Synapse Overview | Up status, users, rooms, DAU/MAU, media, federation peers |
| Synapse Process Health | CPU, memory, FDs, thread pool, GC, Twisted reactor |
| HTTP API Requests | Rate, response codes, p99/p50 latency, in-flight, DB txn time |
| Federation | Outgoing/incoming PDUs, queue depth, staging, known servers |
| Events & Rooms | Event persistence, notifier, sync responses |
| Presence & Push | Presence updates, pushers, state transitions |
| Rate Limiting | Rejections, sleeps, queue wait time p99 |
| Users & Registration | Login rate, registration rate, growth over time |
| Synapse Database Performance | Txn rate/duration, schedule latency, query latency |
| Synapse Caches | Hit rate (top 5), sizes, evictions, response cache |
| Event Processing & Lag | Lag by processor, stream positions, event fetch ongoing |
| State Resolution | Forward extremities, state resolution CPU, state groups |
| App Services (Hookshot) | Events sent, transactions sent vs failed |
| HTTP Push | Push processed vs failed, badge updates |
| Sliding Sync & Slow Endpoints | Sliding sync p99, slowest endpoints, rate limit wait |
| Background Processes | In-flight by name, start rate, CPU, scheduler tasks |
| PostgreSQL Database | Size, connections, transactions, block I/O, WAL, locks |
| LiveKit SFU | Rooms, participants, network, packets out/dropped, forward latency |
| Hookshot | Matrix API calls/failures, active connections, Node.js event loop lag |
| Matrix LXC Host | CPU, RAM, network (incl. Ceph), load average, disk space |
| PostgreSQL LXC Host | CPU, RAM, network (incl. Ceph), load average, disk space |

Alert Rules

All alerts are Grafana-native (Alerting → Alert Rules). Current active rules:

Matrix folder (matrix-folder):

| Alert | Fires when | Severity |
|-------|------------|----------|
| Synapse Down | `up{job="synapse"} < 1` for 2m | critical |
| PostgreSQL Down | `pg_up < 1` for 2m | critical |
| LiveKit Down | `up{job="livekit"} < 1` for 2m | critical |
| Hookshot Down | `up{job="hookshot"} < 1` for 2m | critical |
| PG Connection Saturation | connections > 80% of max for 5m | warning |
| Federation Queue Backing Up | pending PDUs > 100 for 10m | warning |
| Synapse High Memory | RSS > 2000MB for 10m | warning |
| Synapse High Response Time | p99 latency (excl. /sync) > 10s for 5m | warning |
| Synapse Event Processing Lag | any processor > 30s behind for 5m | warning |
| Synapse DB Query Latency High | p99 query time > 1s for 5m | warning |
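
For reference, the PG Connection Saturation expression can be sketched roughly as follows (assuming postgres_exporter's standard `pg_stat_activity_count` metric and the `pg_settings_max_connections` gauge from its settings collector; verify against the exporter's enabled collectors):

```promql
sum(pg_stat_activity_count) / scalar(pg_settings_max_connections) > 0.80
```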

Infrastructure folder (infra-folder):

| Alert | Fires when | Severity |
|-------|------------|----------|
| Service Exporter Down | any `up == 0` for 3m | critical |
| Node High CPU Usage | CPU > 90% for 10m | warning |
| Node High Memory Usage | RAM > 90% for 10m | warning |
| Node Disk Space Low | available < 15% (excl. tmpfs/overlay) for 10m | warning |

Prometheus rules (/etc/prometheus/prometheus_rules.yml):

| Alert | Fires when |
|-------|------------|
| InstanceDown | any `up == 0` for 1m |
| DiskSpaceFree10Percent | available < 10% (excl. tmpfs/overlay) for 5m |
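
As a sketch, the DiskSpaceFree10Percent rule in that file likely takes a shape like this (metric names are node_exporter's standard `node_filesystem_*` gauges; the exact fstype filter should match whatever the rule already uses):

```yaml
groups:
  - name: node-disk
    rules:
      - alert: DiskSpaceFree10Percent
        expr: >
          node_filesystem_avail_bytes{fstype!~"tmpfs|overlay|squashfs|devtmpfs"}
            / node_filesystem_size_bytes < 0.10
        for: 5m
```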

/sync long-poll note: The Matrix /sync endpoint is a long-poll (clients hold it open ≤30s). It is excluded from the High Response Time alert to prevent false positives. Without exclusion, p99 reads ~10s even when the server is healthy.
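
Concretely, the exclusion can be expressed in the alert query along these lines (a sketch assuming Synapse's standard `synapse_http_server_response_time_seconds` histogram and its `servlet` label; verify the exact label value against your Synapse version):

```promql
histogram_quantile(0.99, sum by (le) (
  rate(synapse_http_server_response_time_seconds_bucket{servlet!="SyncRestServlet"}[5m])
)) > 10
```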

Known Alert False Positives / Watch Items

  • Synapse Event Processing Lag — can fire transiently after a Synapse restart while processors catch up on backlog. Self-resolves in 10–20 minutes. If the reported lag keeps growing (>10 min) and doesn't plateau, restart Synapse.
  • Node Disk Space Low — excludes tmpfs, overlay, squashfs, devtmpfs, and the /boot and /run mounts. If new filesystem types appear, add them to the fstype!~ filter in the rule.

Bot Checklist

Core

  • matrix-nio async client with E2EE
  • Device trust (auto-trust all devices)
  • Graceful shutdown (SIGTERM/SIGINT)
  • Initial sync token (ignores old messages on startup)
  • Auto-accept room invites
  • Deployed as systemd service (matrixbot.service) on LXC 151
  • Fix E2EE key errors — full store + credentials wipe, fresh device registration (BBRZSEUECZ); stale devices removed via admin API
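
The graceful-shutdown item above can be sketched as a minimal asyncio pattern (names are hypothetical; in the real bot the placeholder task would be matrix-nio's sync_forever()):

```python
import asyncio
import signal


async def run_bot() -> str:
    """Run until SIGTERM/SIGINT arrives, then tear down cleanly."""
    stop = asyncio.Event()
    loop = asyncio.get_running_loop()
    # Translate both signals into the stop event (Unix only).
    for sig in (signal.SIGTERM, signal.SIGINT):
        loop.add_signal_handler(sig, stop.set)

    # Stand-in for client.sync_forever(); cancelled on shutdown.
    work = asyncio.create_task(asyncio.sleep(3600))
    await stop.wait()
    work.cancel()  # graceful teardown point (close client, flush state, ...)
    return "shutdown complete"
```

The same event can gate state persistence (e.g. writing welcome_state.json) before the process exits.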

Commands

  • !help — list commands
  • !ping — latency check
  • !8ball <question> — magic 8-ball
  • !fortune — fortune cookie
  • !flip — coin flip
  • !roll <NdS> — dice roller
  • !random <min> <max> — random number
  • !rps <choice> — rock paper scissors
  • !poll <question> — poll with reactions
  • !trivia — trivia game (reactions, 30s reveal)
  • !champion [lane] — random LoL champion
  • !agent [role] — random Valorant agent
  • !wordle — full Wordle game (daily, hard mode, stats, share)
  • !minecraft <username> — RCON whitelist add
  • !ask <question> — Ollama LLM (lotusllm, 2min cooldown)
  • !health — bot uptime + service status

Welcome System

  • Watches Space joins and DMs new members automatically
  • React-to-join: react to the welcome DM → bot invites to General, Commands, Memes
  • Welcome event ID persisted to welcome_state.json

Wordle

  • Daily puzzles with two-pass letter evaluation
  • Hard mode with constraint validation
  • Stats persistence (wordle_stats.json)
  • Cinny-compatible rendering (inline <span> tiles)
  • DM-based gameplay, !wordle share posts result to public room
  • Virtual keyboard display
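
The two-pass letter evaluation mentioned above is what makes duplicate letters score correctly. A sketch of the standard algorithm (greens consume answer letters first, then yellows are capped by what remains; function name is illustrative):

```python
from collections import Counter


def score_guess(guess: str, answer: str) -> list[str]:
    """Two-pass Wordle scoring for 5-letter words."""
    result = ["absent"] * 5
    remaining = Counter()
    # Pass 1: exact matches consume their answer letters.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "correct"
        else:
            remaining[a] += 1
    # Pass 2: misplaced letters, limited by unconsumed answer letters.
    for i, g in enumerate(guess):
        if result[i] == "absent" and remaining[g] > 0:
            result[i] = "present"
            remaining[g] -= 1
    return result
```

Without the second pass's counter cap, guessing "speed" against "abide" would wrongly mark both e's as present instead of only the first.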

Tech Stack

| Component | Technology | Version |
|-----------|------------|---------|
| Bot language | Python 3 | 3.x |
| Bot library | matrix-nio (E2EE) | latest |
| Homeserver | Synapse | 1.148.0 |
| Database | PostgreSQL | 17.9 |
| TURN | coturn | latest |
| Video/voice calls | LiveKit SFU | 1.9.11 |
| LiveKit JWT | lk-jwt-service | latest |
| SSO | Authelia (OIDC) + LLDAP | — |
| Webhook bridge | matrix-hookshot | 7.3.2 |
| Reverse proxy | Nginx Proxy Manager | — |
| Web client | Cinny (custom build, add-joined-call-controls branch) | 4.10.5+ |
| Bot dependencies | matrix-nio[e2ee], aiohttp, python-dotenv, mcrcon | — |

Bot Files

```
matrixBot/
├── bot.py              # Entry point, client setup, event loop
├── callbacks.py        # Message + reaction event handlers
├── commands.py         # All command implementations
├── config.py           # Environment config + validation
├── utils.py            # send_text, send_html, send_reaction, get_or_create_dm
├── welcome.py          # Welcome message + react-to-join logic
├── wordle.py           # Full Wordle game engine
├── wordlist_answers.py # Wordle answer word list
├── wordlist_valid.py   # Wordle valid guess word list
├── .env.example        # Environment variable template
└── requirements.txt    # Python dependencies
```