Re-enable presence, fix federation lag with TCP timeout tuning
Presence was incorrectly disabled as a workaround. Root cause of lag spikes was Linux's default tcp_retries2=15 (~15 min retransmit window) causing hung outbound TCP connections to slow remote servers (e.g. exp.farm) to block the federation sender queue for minutes at a time.

Fix applied to /etc/sysctl.d/99-matrix-tuning.conf on LXC 151:
- net.ipv4.tcp_retries2 = 8 (~90s before giving up on stalled connection)
- net.ipv4.tcp_syn_retries = 4 (~45s for initial SYN)
- net.ipv4.tcp_keepalive_probes = 3 (dead conn detected ~6.5 min)

Presence re-enabled in homeserver.yaml (presence: enabled: true).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -425,7 +425,7 @@ Periodic `TLS/TCP socket error: Connection reset by peer` in coturn logs. Normal
- [x] LiveKit ICE port range expanded to 50000-51000
- [x] LiveKit TURN TTL reduced to 1h
- [x] LiveKit VP9/AV1 codecs enabled
-- [x] Synapse presence disabled (`presence: enabled: false`) — eliminates federation lag spikes caused by presence EDU bursts to 50+ remote servers
+- [x] TCP retransmit timeout lowered (`tcp_retries2=8`, `tcp_syn_retries=4`, `tcp_keepalive_probes=3`): stalled outbound federation connections to slow or dead remote servers (e.g. `exp.farm`) now fail in ~90s instead of ~15 min, so presence EDU fan-outs no longer block the federation queue
- [ ] BBR congestion control — must be applied on Proxmox host
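
To confirm that the TCP values in the checklist above are live in the running kernel, something like the following could be run inside the container. It is a minimal sketch: `check_tcp_tuning` and `EXPECTED` are hypothetical names, and it assumes the standard Linux `/proc/sys/net/ipv4/` layout.

```python
from pathlib import Path

# Values taken from the checklist above; the dict name is illustrative.
EXPECTED = {
    "tcp_retries2": 8,
    "tcp_syn_retries": 4,
    "tcp_keepalive_probes": 3,
}


def check_tcp_tuning(proc_root=Path("/proc/sys/net/ipv4"), expected=EXPECTED):
    """Return {name: (expected, actual)} for every setting that does not match.

    `actual` is None when the file is missing or does not contain an integer.
    """
    mismatches = {}
    for name, want in expected.items():
        try:
            got = int((Path(proc_root) / name).read_text().strip())
        except (OSError, ValueError):
            got = None
        if got != want:
            mismatches[name] = (want, got)
    return mismatches


if __name__ == "__main__":
    problems = check_tcp_tuning()
    for name, (want, got) in problems.items():
        print(f"{name}: expected {want}, found {got}")
    if not problems:
        print("all TCP tuning values match")
```

The `proc_root` parameter exists only so the check can be pointed at a scratch directory when testing; on a real host the default path is used.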
### Auth & SSO
@@ -522,7 +522,7 @@ Periodic `TLS/TCP socket error: Connection reset by peer` in coturn logs. Normal
> **`/sync` long-poll:** The Matrix `/sync` endpoint is a long-poll (clients hold it open ≤30s). It is excluded from the High Response Time alert to prevent false positives.
-> **Synapse Event Processing Lag** can fire transiently after a Synapse restart while processors drain their backlog. Self-resolves in 10–20 minutes. Root cause of recurring lag spikes was Synapse presence EDU bursts — fixed by disabling presence in `homeserver.yaml` (`presence: enabled: false`).
+> **Synapse Event Processing Lag** can fire transiently after a Synapse restart while processors drain their backlog. Self-resolves in 10–20 minutes. Root cause of recurring lag spikes was presence EDU fan-outs to 50+ remote federated servers: when a slow server (e.g. `exp.farm`) hangs a TCP connection, the default `tcp_retries2=15` means Linux keeps retransmitting for ~15 minutes, blocking the federation queue. Fixed by lowering `tcp_retries2` to 8 in `/etc/sysctl.d/99-matrix-tuning.conf` (stalled connections now fail in ~90s).
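
The ~15 minute and ~90s figures in the note above can be sanity-checked with a rough model of Linux's retransmission backoff. This is a minimal sketch, assuming the retransmission timeout (RTO) starts at the kernel's 200 ms minimum, doubles after every retransmission, and is capped at 120 s. `retransmit_window` is a hypothetical helper name, and connections with a larger measured round-trip time will take longer than this lower bound.

```python
RTO_MIN = 0.2    # seconds; assumed starting RTO (Linux minimum)
RTO_MAX = 120.0  # seconds; assumed cap on a single RTO interval


def retransmit_window(retries, rto_min=RTO_MIN, rto_max=RTO_MAX):
    """Approximate seconds before an established connection is abandoned.

    Sums the wait before each of `retries` retransmissions plus the final
    wait after the last one, doubling the RTO each time up to `rto_max`.
    """
    total = 0.0
    rto = rto_min
    for _ in range(retries + 1):
        total += min(rto, rto_max)
        rto *= 2
    return total


if __name__ == "__main__":
    for retries in (15, 8):
        seconds = retransmit_window(retries)
        print(f"tcp_retries2={retries}: ~{seconds:.1f}s ({seconds / 60:.1f} min)")
```

Under these assumptions the default of 15 works out to roughly 924.6 s (about 15.4 minutes) and the tuned value of 8 to roughly 102.2 s, which is in line with the approximate figures quoted in the note.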
---