diff --git a/README.md b/README.md
index 2b7b156..f1c867d 100644
--- a/README.md
+++ b/README.md
@@ -425,7 +425,9 @@ Periodic `TLS/TCP socket error: Connection reset by peer` in coturn logs. Normal
 - [x] LiveKit ICE port range expanded to 50000-51000
 - [x] LiveKit TURN TTL reduced to 1h
 - [x] LiveKit VP9/AV1 codecs enabled
-- [x] TCP retransmit timeout lowered (`tcp_retries2=8`, `tcp_syn_retries=4`, `tcp_keepalive_probes=3`) — stalled outbound federation connections to slow/dead remote servers (e.g. `exp.farm`) now fail in ~90s instead of ~15 min, preventing federation queue blockage from presence EDU fan-outs
+- [x] TCP retransmit timeout lowered (`tcp_retries2=5`, `tcp_syn_retries=4`, `tcp_keepalive_probes=3`) — stalled outbound federation connections now fail in ~15-30s instead of ~15 min
+- [x] Unreachable routes added for servers with asymmetric connectivity (can reach us but we can't reach their federation port) — prevents 90s TCP hangs from being added to lag; defined in `/etc/network/interfaces` post-up hooks and survive reboots (bark.lgbt ×2, parodia.dev, chat.ohaa.xyz, matrix.k8ekat.dev)
+- [x] Stuck `device_lists_remote_resync` entries cleared for dead-server users (@dalite:bark.lgbt, @arndot:matrix.goch.social) — device list resync was firing every 30s
 - [ ] BBR congestion control — must be applied on Proxmox host
 
 ### Auth & SSO
@@ -509,7 +511,7 @@ Periodic `TLS/TCP socket error: Connection reset by peer` in coturn logs. Normal
 | Federation Queue Backing Up | pending PDUs > 100 for 10m | warning |
 | Synapse High Memory | RSS > 2000MB for 10m | warning |
 | Synapse High Response Time | p99 latency (excl. /sync) > 10s for 5m | warning |
-| Synapse Event Processing Lag | any processor > 30s behind for 5m | warning |
+| Synapse Event Processing Lag | any processor > 300s behind for 15m | warning |
 | Synapse DB Query Latency High | p99 query time > 1s for 5m | warning |
 
 **Infrastructure folder:**
@@ -522,7 +524,14 @@ Periodic `TLS/TCP socket error: Connection reset by peer` in coturn logs. Normal
 
 > **`/sync` long-poll:** The Matrix `/sync` endpoint is a long-poll (clients hold it open ≤30s). It is excluded from the High Response Time alert to prevent false positives.
 
-> **Synapse Event Processing Lag** can fire transiently after a Synapse restart while processors drain their backlog. Self-resolves in 10–20 minutes. Root cause of recurring lag spikes was presence EDU fan-outs to 50+ remote federated servers — when a slow server (e.g. `exp.farm`) hangs a TCP connection, the default `tcp_retries2=15` means Linux retries for ~15 minutes, blocking the federation queue. Fixed by lowering `tcp_retries2=8` in `/etc/sysctl.d/99-matrix-tuning.conf` (stalled connections now fail in ~90s).
+> **Synapse Event Processing Lag** alert fires when `synapse_event_processing_lag > 300s` for 15 consecutive minutes (threshold raised from 120s/5m to reduce noise from normal federation backoff cycling).
+>
+> Root cause: several federated servers (bark.lgbt, parodia.dev, etc.) have asymmetric connectivity — they can reach us but we cannot reach their federation ports. Each inbound transaction they send resets our backoff to 0, triggering a new outbound connection attempt that hangs for ~90s (TCP `User timeout`). This causes the lag metric to spike. Mitigations in place:
+> 1. `tcp_retries2=5` in `/etc/sysctl.d/99-matrix-tuning.conf` — TCP hangs now fail in ~15-30s
+> 2. `ip route add unreachable ` in `/etc/network/interfaces` post-up — outbound connections to these servers fail in 0ms (ICMP unreachable)
+> 3. Alert threshold raised to 300s/15m — only fires for genuine outages, not normal 10-min backoff cycles
+>
+> To find new offending servers: `grep "User timeout\|ConnectingCancell" /var/log/matrix-synapse/homeserver.log | grep -oP "\[([^\]]+)\]" | sort | uniq -c | sort -rn | head -20`
 
 ---
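
Reviewer note: the `tcp_retries2` timings quoted in this diff can be sanity-checked with a rough calculation. The sketch below is an approximation, not a measurement from this host. It assumes the Linux defaults of a 200 ms minimum and 120 s maximum retransmission timeout, with the timeout doubling on each retry. The names `TCP_RTO_MIN`, `TCP_RTO_MAX`, and `retransmit_timeout` are illustrative only. The results are lower bounds: the real starting timeout depends on the measured round-trip time, which is likely why the observed figures (~15-30s, ~90s) differ from the computed ones.

```python
# Approximate lower bound on how long Linux keeps retransmitting on an
# established connection before giving up, for a given tcp_retries2 value.
# Assumption: RTO starts at the 200 ms minimum and doubles up to a 120 s cap.
TCP_RTO_MIN = 0.2    # seconds (assumed Linux default)
TCP_RTO_MAX = 120.0  # seconds (assumed Linux default)


def retransmit_timeout(retries2: int) -> float:
    """Sum the capped, exponentially growing RTO over retries2 + 1 intervals."""
    total = 0.0
    for attempt in range(retries2 + 1):
        total += min(TCP_RTO_MIN * 2 ** attempt, TCP_RTO_MAX)
    return total


if __name__ == "__main__":
    for n in (5, 8, 15):
        print(f"tcp_retries2={n}: ~{retransmit_timeout(n):.1f}s")
```

Under these assumptions the values 5, 8, and 15 give roughly 12.6 s, 102.2 s, and 924.6 s, which is in line with the "~15 min" default described in the diff.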