From 0458851a56da759bd6b21e2812dc9c1e82af1da3 Mon Sep 17 00:00:00 2001
From: Jared Vititoe
Date: Sun, 22 Mar 2026 21:22:38 -0400
Subject: [PATCH] Re-enable presence, fix federation lag with TCP timeout tuning

Presence was incorrectly disabled as a workaround. Root cause of lag
spikes was Linux's default tcp_retries2=15 (~15 min retransmit window)
causing hung outbound TCP connections to slow remote servers (e.g.
exp.farm) to block the federation sender queue for minutes at a time.

Fix applied to /etc/sysctl.d/99-matrix-tuning.conf on LXC 151:
- net.ipv4.tcp_retries2 = 8 (~90s before giving up on stalled connection)
- net.ipv4.tcp_syn_retries = 4 (~45s for initial SYN)
- net.ipv4.tcp_keepalive_probes = 3 (dead conn detected ~6.5 min)

Presence re-enabled in homeserver.yaml (presence: enabled: true).

Co-Authored-By: Claude Sonnet 4.6
---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 4eba10c..2b7b156 100644
--- a/README.md
+++ b/README.md
@@ -425,7 +425,7 @@ Periodic `TLS/TCP socket error: Connection reset by peer` in coturn logs. Normal
 - [x] LiveKit ICE port range expanded to 50000-51000
 - [x] LiveKit TURN TTL reduced to 1h
 - [x] LiveKit VP9/AV1 codecs enabled
-- [x] Synapse presence disabled (`presence: enabled: false`) — eliminates federation lag spikes caused by presence EDU bursts to 50+ remote servers
+- [x] TCP retransmit timeout lowered (`tcp_retries2=8`, `tcp_syn_retries=4`, `tcp_keepalive_probes=3`) — stalled outbound federation connections to slow/dead remote servers (e.g. `exp.farm`) now fail in ~90s instead of ~15 min, preventing federation queue blockage from presence EDU fan-outs
 - [ ] BBR congestion control — must be applied on Proxmox host

 ### Auth & SSO
@@ -522,7 +522,7 @@ Periodic `TLS/TCP socket error: Connection reset by peer` in coturn logs. Normal
 > **`/sync` long-poll:** The Matrix `/sync` endpoint is a long-poll (clients hold it open ≤30s).
 It is excluded from the High Response Time alert to prevent false positives.

-> **Synapse Event Processing Lag** can fire transiently after a Synapse restart while processors drain their backlog. Self-resolves in 10–20 minutes. Root cause of recurring lag spikes was Synapse presence EDU bursts — fixed by disabling presence in `homeserver.yaml` (`presence: enabled: false`).
+> **Synapse Event Processing Lag** can fire transiently after a Synapse restart while processors drain their backlog. Self-resolves in 10–20 minutes. Root cause of recurring lag spikes was presence EDU fan-outs to 50+ remote federated servers — when a slow server (e.g. `exp.farm`) hangs a TCP connection, the default `tcp_retries2=15` means Linux retries for ~15 minutes, blocking the federation queue. Fixed by lowering `tcp_retries2=8` in `/etc/sysctl.d/99-matrix-tuning.conf` (stalled connections now fail in ~90s).

 ---
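
Note on the retransmit figures: the "~15 min" and "~90s" windows quoted in the patch can be sanity-checked with a rough model of TCP's exponential backoff. The sketch below is an approximation, not the kernel's exact logic. It assumes the retransmission timeout (RTO) starts at the 200 ms floor, doubles on every retry, and is capped at 120 s; the function name and constants are illustrative and do not come from the patch. On a real connection the starting RTO depends on the measured round-trip time, so actual windows will differ somewhat.

```python
# Rough estimate of how long Linux keeps retransmitting on an established
# connection before giving up, as a function of net.ipv4.tcp_retries2.
# Assumptions (not from the patch): RTO starts at 0.2 s, doubles per retry,
# and is capped at 120 s. Real connections start from an RTT-based RTO.

RTO_MIN_S = 0.2    # assumed RTO floor, in seconds
RTO_MAX_S = 120.0  # assumed RTO cap, in seconds


def retransmit_window(retries: int,
                      rto_min: float = RTO_MIN_S,
                      rto_max: float = RTO_MAX_S) -> float:
    """Sum the backoff intervals for `retries` retransmissions.

    There are retries + 1 intervals: one wait before each retransmission,
    plus the final wait after the last one before the connection is dropped.
    """
    return sum(min(rto_min * 2 ** i, rto_max) for i in range(retries + 1))


if __name__ == "__main__":
    for retries in (15, 8):
        window = retransmit_window(retries)
        print(f"tcp_retries2={retries}: ~{window:.1f} s (~{window / 60:.1f} min)")
```

Under these assumptions the default of 15 gives roughly 925 s (about 15 minutes) and 8 gives roughly 100 s, which is the same order as the figures in the commit message. The model deliberately covers only `tcp_retries2`; SYN retry and keepalive timing depend on other settings and on the kernel version, so they are left out.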