Re-enable presence, fix federation lag with TCP timeout tuning

Presence was incorrectly disabled as a workaround. The root cause of the lag
spikes was Linux's default tcp_retries2=15 (~15 min retransmit window): hung
outbound TCP connections to slow remote servers (e.g. exp.farm) blocked the
federation sender queue for minutes at a time.

Fix applied to /etc/sysctl.d/99-matrix-tuning.conf on LXC 151:
- net.ipv4.tcp_retries2 = 8   (~90s before giving up on stalled connection)
- net.ipv4.tcp_syn_retries = 4  (~30-35s for initial SYN, depending on kernel version)
- net.ipv4.tcp_keepalive_probes = 3  (dead conn detected ~6.5 min, given the tcp_keepalive_time and tcp_keepalive_intvl in effect)
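
The retransmit windows quoted above can be sanity-checked with a small model of
exponential backoff. This is a rough sketch, not the kernel's actual code path:
it assumes a 200 ms minimum RTO and 120 s maximum RTO for established
connections, a 1 s initial RTO for SYNs, and plain doubling throughout. The
function name give_up_after is made up for illustration. Real connections use
a measured RTO, and newer kernels add a few linear SYN retries
(tcp_syn_linear_timeouts), so observed times will differ somewhat.

```python
# Idealized model of Linux TCP retransmission backoff (assumed constants).
TCP_RTO_MIN = 0.2       # seconds, minimum retransmission timeout
TCP_RTO_MAX = 120.0     # seconds, cap on the backed-off timeout
TCP_TIMEOUT_INIT = 1.0  # seconds, initial RTO used for SYN retransmits


def give_up_after(retries: int, initial_rto: float) -> float:
    """Return seconds until the kernel gives up after `retries` retransmits.

    Each wait doubles the previous one (capped at TCP_RTO_MAX); the total
    covers `retries` retransmissions plus the final wait after the last one.
    """
    total = 0.0
    rto = initial_rto
    for _ in range(retries + 1):
        total += rto
        rto = min(rto * 2, TCP_RTO_MAX)
    return total


if __name__ == "__main__":
    for n in (15, 8):
        print(f"tcp_retries2={n}: {give_up_after(n, TCP_RTO_MIN):.1f}s")
    for n in (6, 4):
        print(f"tcp_syn_retries={n}: {give_up_after(n, TCP_TIMEOUT_INIT):.1f}s")
```

Under these assumptions the model gives about 925 s (~15 min) for
tcp_retries2=15 and about 102 s for tcp_retries2=8, in line with the rough
figures above, and about 31 s for tcp_syn_retries=4.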

Presence re-enabled in homeserver.yaml (presence: enabled: true).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-22 21:22:38 -04:00
parent 3db163e43d
commit 0458851a56
+2 -2
@@ -425,7 +425,7 @@ Periodic `TLS/TCP socket error: Connection reset by peer` in coturn logs. Normal
 - [x] LiveKit ICE port range expanded to 50000-51000
 - [x] LiveKit TURN TTL reduced to 1h
 - [x] LiveKit VP9/AV1 codecs enabled
-- [x] Synapse presence disabled (`presence: enabled: false`) — eliminates federation lag spikes caused by presence EDU bursts to 50+ remote servers
+- [x] TCP retransmit timeout lowered (`tcp_retries2=8`, `tcp_syn_retries=4`, `tcp_keepalive_probes=3`) — stalled outbound federation connections to slow/dead remote servers (e.g. `exp.farm`) now fail in ~90s instead of ~15 min, preventing federation queue blockage from presence EDU fan-outs
 - [ ] BBR congestion control — must be applied on Proxmox host
 ### Auth & SSO
@@ -522,7 +522,7 @@ Periodic `TLS/TCP socket error: Connection reset by peer` in coturn logs. Normal
 > **`/sync` long-poll:** The Matrix `/sync` endpoint is a long-poll (clients hold it open ≤30s). It is excluded from the High Response Time alert to prevent false positives.
-> **Synapse Event Processing Lag** can fire transiently after a Synapse restart while processors drain their backlog. Self-resolves in 10-20 minutes. Root cause of recurring lag spikes was Synapse presence EDU bursts — fixed by disabling presence in `homeserver.yaml` (`presence: enabled: false`).
+> **Synapse Event Processing Lag** can fire transiently after a Synapse restart while processors drain their backlog. Self-resolves in 10-20 minutes. Root cause of recurring lag spikes was presence EDU fan-outs to 50+ remote federated servers — when a slow server (e.g. `exp.farm`) hangs a TCP connection, the default `tcp_retries2=15` means Linux retries for ~15 minutes, blocking the federation queue. Fixed by lowering `tcp_retries2=8` in `/etc/sysctl.d/99-matrix-tuning.conf` (stalled connections now fail in ~90s).
 ---