cinny/LOTUS_E2EE_INVESTIGATION.md
jared 7a8cadc6ec feat(diag): E2EE investigation kit for the KE-1→4 cluster
LOTUS_E2EE_INVESTIGATION.md: per-KE capture runbook (console signatures, synapse
log greps + SQL against the documented LXC deployment, the KE-1⇒KE-2 causality
decision tree, ranked remediations incl. what a crypto-store reset wipes; SDK
finding: stable 41.6.0 has no OTK fix over our RC pin). Client: capture-only
console ring buffer (cryptoDiagLog, KE-signature-matched, max 200) + a Crypto
Diagnostics card in Developer Tools with a download-report button. ClientRoot
installs the capture hook at module load and mounts useSessionSync (cross-tab
sessions, prior commit).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-01 21:19:02 -04:00

# Lotus Chat — E2EE Investigation Runbook (KE-1 → KE-4)
> **Scope:** evidence-gathering only. Do **not** apply fixes from this document
> without a cross-system planning session (client rust-crypto ↔ Synapse ↔
> Element Call MatrixRTC). Symptom source: `LOTUS_BUGS.md` §"Encryption / E2EE"
> (KE-1..KE-4), observed live 2026-06-30 on `chat.lotusguild.org` during a
> 2-person Element Call.
>
> **Client:** Lotus Cinny fork, `matrix-js-sdk@41.6.0-rc.0`, rust-crypto.
> **Server:** Synapse `1.155.0` on **LXC 151** (`10.10.10.29`), PostgreSQL 17.9
> on **LXC 109** (`10.10.10.44`). Facts below are copy-pasteable against that
> deployment (paths/IPs from `/root/code/matrix/README.md`).
---
## 0. Deployment facts used by this runbook
From the matrix infra README (`/root/code/matrix/README.md`):
| Thing | Value |
|-------|-------|
| Synapse host | LXC **151**, `10.10.10.29` (Synapse 1.155.0) |
| Synapse log | `/var/log/matrix-synapse/homeserver.log` |
| Synapse config | `/etc/matrix-synapse/homeserver.yaml` (+ `conf.d/`) |
| Synapse HTTP | `10.10.10.29:8008` |
| PostgreSQL host | LXC **109**, `10.10.10.44` (PG 17.9), db `synapse` |
| synapse-admin UI | `http://10.10.10.29:8080` |
| LiveKit / lk-jwt / guard | LXC 151: LiveKit `:7880/:7881`, guard `:8070`, lk-jwt `:8071` |
| SSH path to Synapse | `ssh root@10.10.10.4` then `pct enter 151` |
| SSH path to PG | `ssh root@10.10.10.4` then `pct enter 109` |
**Getting a psql shell** (run on LXC 109, or from 151 over the network):
```bash
# On LXC 109:
sudo -u postgres psql synapse
# From LXC 151 (pg_hba allows 10.10.10.29):
psql "host=10.10.10.44 user=synapse dbname=synapse"
```
**Tailing Synapse during a call** (on LXC 151):
```bash
tail -F /var/log/matrix-synapse/homeserver.log | tee /tmp/lotus-call-$(date +%s).log
```
Synapse E2EE/to-device logging is fairly chatty at `INFO`; if a category is
nevertheless silent, temporarily raise it to `DEBUG` in
`/etc/matrix-synapse/conf.d/log.yaml` (or the `log_config` file referenced by
`homeserver.yaml`):
```yaml
loggers:
  synapse.rest.client.keys: { level: DEBUG }
  synapse.handlers.e2e_keys: { level: DEBUG }
  synapse.storage.databases.main.end_to_end_keys: { level: DEBUG }
  synapse.handlers.devicemessage: { level: DEBUG } # to-device
```
Then `systemctl reload matrix-synapse` (reload re-reads log config without a
full restart). **Revert to `INFO` after the capture** — DEBUG is very verbose.
---
## 1. Per-KE evidence matrix
Client greps assume Chrome/Firefox DevTools console (filter box or, better,
"Preserve log" + save-as). The **Crypto Diagnostics** card (Settings →
Developer Tools) auto-captures every signature below into a downloadable JSON —
use it as the primary client artifact and DevTools as the raw backup.
### KE-1 — OTK upload conflict storm (root-cause candidate)
- **Console signature (grep):**
  - `already exists`
  - full: `POST /_matrix/client/v3/keys/upload … 400 M_UNKNOWN: One time key signed_curve25519:<id> already exists. Old key: {…} new key: {…}`
- **Capture client-side:**
  - Timestamp (first occurrence + rate, "N/sec"), **device id**, **user id**.
  - DevTools → **Network** → filter `keys/upload`: for a failing call save the
    **request body** (the `one_time_keys` map; note the exact `signed_curve25519:<id>`)
    and the **response body** (the `Old key` / `new key` JSON). This diff is the
    smoking gun: same key-id, different value ⇒ store vs server divergence.
  - Whether it self-heals or loops forever (KE-1 loops).
- **Synapse log grep (LXC 151):**
```bash
grep -E "keys/upload|One time key .* already exists|OneTimeKey" \
/var/log/matrix-synapse/homeserver.log | grep "<user_id>"
```
- **Synapse SQL (LXC 109) — what the server thinks it holds:**
```sql
-- Current OTK inventory for the device (compare the key_id set, and the stored
-- key_json value, against the request body the client keeps retrying).
SELECT algorithm, key_id, ts_added_ms, key_json
FROM e2e_one_time_keys_json
WHERE user_id = '@user:matrix.lotusguild.org'
  AND device_id = '<DEVICE_ID>'
ORDER BY algorithm, key_id;
-- Server's advertised counts (this is what /sync tells the client it has,
-- and drives whether the client decides to upload more).
SELECT algorithm, count(*) FROM e2e_one_time_keys_json
WHERE user_id = '@user:matrix.lotusguild.org' AND device_id = '<DEVICE_ID>'
GROUP BY algorithm;
-- Fallback key state (used when OTKs are exhausted). The upstream schema for
-- this table has no ts_added_ms column; verify with \d e2e_fallback_keys_json.
SELECT algorithm, key_id, used
FROM e2e_fallback_keys_json
WHERE user_id = '@user:matrix.lotusguild.org' AND device_id = '<DEVICE_ID>';
```
> Table names are Synapse 1.155 (`e2e_one_time_keys_json`,
> `e2e_fallback_keys_json`). If a name is absent, list with `\dt e2e*` in psql.
- **Confirms:** if the offending `key_id` (from the 400) is **present** in
`e2e_one_time_keys_json` with a **different** stored value than the client's
request body → OTK state has diverged (rust-crypto store vs Synapse). That is
the KE-1 root condition.
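
The comparison described above can be sketched as a small offline helper. This is a minimal sketch, not part of the kit: `SignedOtk`, `OtkVerdict`, and `compareOtks` are hypothetical names, and it assumes the server-side key JSON has been exported per key id (for example from a `key_json` column, to be verified with `\d e2e_one_time_keys_json`) and that comparing the `key` field alone is enough to show divergence.

```typescript
// Hypothetical shapes for illustration; neither comes from matrix-js-sdk or Synapse.
interface SignedOtk {
  key: string;
  signatures?: Record<string, Record<string, string>>;
}

type OtkVerdict = "diverged" | "identical" | "absent-on-server";

/**
 * Compare the `one_time_keys` map saved from a failing /keys/upload request
 * with the key JSON the server holds for the same device.
 * Both maps are keyed by "<algorithm>:<key_id>", e.g. "signed_curve25519:AAAAAQ".
 */
function compareOtks(
  requestKeys: Record<string, SignedOtk>,
  serverKeys: Record<string, SignedOtk>,
): Record<string, OtkVerdict> {
  const verdicts: Record<string, OtkVerdict> = {};
  for (const [keyId, sent] of Object.entries(requestKeys)) {
    const stored = serverKeys[keyId];
    if (stored === undefined) {
      // The server has no row for this id, so this id is not the conflict.
      verdicts[keyId] = "absent-on-server";
    } else if (stored.key === sent.key) {
      verdicts[keyId] = "identical";
    } else {
      // Same key id, different value: the KE-1 root condition.
      verdicts[keyId] = "diverged";
    }
  }
  return verdicts;
}
```

Any id reported as `diverged` answers Q1 of the decision tree with YES; a run with no `diverged` ids points at the exhaustion or SDK-regression branches instead.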
### KE-2 — EC media keys not arriving/decrypting (audio/video cutouts)
- **Console signature (grep):**
  - `MissingKey`
  - `missing key at index` (e.g. `MissingKey: missing key at index N for participant @user`)
  - `key set not found`
  - `io.element.call.encryption_keys` (rust-crypto: `WARN … Received an unexpected encrypted to-device event … event_type="io.element.call.encryption_keys"`)
- **Capture client-side:**
  - Timestamp windows where a participant's audio/video cut out, and the
    `@participant` + `index N` from the message.
  - The `io.element.call.encryption_keys` warnings (these are the media-key
    to-device events failing to decrypt) with their timestamps.
  - Own device id + user id (to correlate with the sender's Olm session).
- **Synapse log grep (LXC 151) — to-device delivery of the media keys:**
```bash
grep -E "io.element.call.encryption_keys|m.room.encrypted|/sendToDevice|to_device" \
/var/log/matrix-synapse/homeserver.log | grep -E "<user_id>|<participant_id>"
```
- **Synapse SQL (LXC 109) — undelivered / queued to-device events:**
```sql
-- Backlog of to-device messages queued for the affected device. A growing
-- count here = the HS has the media-key events but the device isn't draining
-- them via /sync (or they were sent to a stale device id).
SELECT user_id, device_id, count(*) AS pending
FROM device_inbox
WHERE user_id = '@user:matrix.lotusguild.org'
GROUP BY user_id, device_id;
-- Cross-check the device id the sender is targeting actually exists / is current.
-- (The upstream devices table has no ts column; \d devices lists what this version has.)
SELECT device_id, display_name, last_seen
FROM devices WHERE user_id = '@user:matrix.lotusguild.org';
```
- **Confirms:** to-device events present but undecryptable (client shows the
`io.element.call.encryption_keys` "unexpected encrypted" warning) ⇒ there is
**no valid Olm session** to decrypt them — the expected downstream of KE-1.
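
To turn saved console lines into the `@participant` + `index N` pairs needed for correlation, a parser along these lines may help. It is a sketch only: `parseMissingKey` and `summarizeMissingKeys` are hypothetical helpers, and the regular expression assumes the message format matches the example quoted in this runbook, which the real Element Call output may not follow exactly.

```typescript
interface MissingKeyHit {
  participant: string;
  index: number;
}

// Assumed format: "... missing key at index <N> for participant <@user:server>"
const MISSING_KEY_PATTERN = /missing key at index (\d+) for participant (\S+)/;

/** Extract participant and key index from one console line, or null if it does not match. */
function parseMissingKey(line: string): MissingKeyHit | null {
  const match = MISSING_KEY_PATTERN.exec(line);
  if (match === null) return null;
  const [, index = "", participant = ""] = match;
  return { participant, index: Number(index) };
}

/** Group the distinct missing key indices by participant, in first-seen order. */
function summarizeMissingKeys(lines: readonly string[]): Map<string, number[]> {
  const byParticipant = new Map<string, number[]>();
  for (const line of lines) {
    const hit = parseMissingKey(line);
    if (hit === null) continue;
    const indices = byParticipant.get(hit.participant) ?? [];
    if (!indices.includes(hit.index)) indices.push(hit.index);
    byParticipant.set(hit.participant, indices);
  }
  return byParticipant;
}
```

The resulting map gives the participant ids to substitute into the to-device grep and the `device_inbox` query above.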
### KE-3 — Timeline decryption error: missing `algorithm` field
- **Console signature (grep):**
  - `DecryptionError`
  - full: `Error decrypting event (… type=m.room.encrypted …): DecryptionError[msg: missing field 'algorithm' at line 1 column 138 …]`
- **Capture client-side:**
  - The **event id** (`$SASBBzoqj…` was one) and the **room id**.
  - Pull the raw event JSON via DevTools or the Developer Tools account-data/event
    viewer, or directly:
    ```
    GET https://matrix.lotusguild.org/_matrix/client/v3/rooms/<roomId>/event/<eventId>
    ```
    Inspect `content` and confirm whether `algorithm` (should be
    `m.megolm.v1.aes-sha2`) is truly absent or only lost to a serialization mismatch.
- **Synapse log grep (LXC 151):**
```bash
grep -E "<eventId>" /var/log/matrix-synapse/homeserver.log
```
- **Synapse SQL (LXC 109) — the stored event content as the HS holds it:**
```sql
SELECT ej.event_id, e.type, e.sender, e.origin_server_ts,
(ej.json::json -> 'content' -> 'algorithm') AS algorithm
FROM event_json ej
JOIN events e USING (event_id)
WHERE ej.event_id = '$SASBBzoqj...';
```
- **Confirms:** if the stored `content.algorithm` is **NULL/absent** on the HS →
a malformed/legacy event was persisted (sender-side or federation). If it is
**present** on the HS but the client throws → an RC-SDK deserialization bug.
This distinction decides whether KE-3 is a data problem or a client problem.
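
The data-versus-client split can be checked mechanically once the raw event JSON has been saved. The following is a minimal sketch under the assumption that the saved object has the usual `content` field of a client-server event response; `classifyKe3` and its verdict labels are hypothetical names for illustration.

```typescript
type Ke3Verdict = "data-problem" | "client-problem";

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

/**
 * Classify a raw m.room.encrypted event as held by the homeserver.
 * - algorithm missing or not a non-empty string: the event was persisted
 *   malformed (data problem).
 * - algorithm present: the server copy is fine, so a client that still throws
 *   "missing field 'algorithm'" points at client-side deserialization.
 */
function classifyKe3(rawEvent: unknown): Ke3Verdict {
  const content = isRecord(rawEvent) ? rawEvent["content"] : undefined;
  const algorithm = isRecord(content) ? content["algorithm"] : undefined;
  return typeof algorithm === "string" && algorithm.length > 0
    ? "client-problem"
    : "data-problem";
}
```

Feeding it the JSON returned by the `GET …/event/<eventId>` request gives the same answer as the SQL check, without needing database access.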
### KE-4 — MatrixRTC delayed-event / membership timeouts
- **Console signature (grep):**
  - `update_delayed_event` (`org.matrix.msc4157.update_delayed_event`)
  - `delayed event` / `Restart delayed event timed out`
  - full: `[MembershipManager] Network local timeout error while sending event, immediate retry … AbortError: Restart delayed event timed out before the HS responded`
- **Capture client-side:**
  - Timestamps of each timeout; whether they correlate with call join/leave or
    with general sync slowness.
  - DevTools → Network: the `…/delayed_events…` / `update_delayed_event`
    requests, with their **HTTP status and latency** (timed-out vs slow-200).
- **Synapse log grep (LXC 151):**
```bash
grep -E "delayed_event|msc4140|msc4157|update_delayed" \
/var/log/matrix-synapse/homeserver.log | grep "<user_id>"
# HS responsiveness in the same window (KE-4 may be pure latency):
grep -E "Processed request|/sync" /var/log/matrix-synapse/homeserver.log | tail -50
```
- **Server-side corroboration (Grafana, `dashboard.lotusguild.org`):** Synapse
p99 response time (excl. `/sync`), event-processing lag, DB query latency for
the call window. High latency here ⇒ KE-4 is (partly) homeserver
responsiveness, not a client bug.
- **Confirms:** timeouts that line up with HS latency spikes → reliability/load;
timeouts with a healthy HS → client MembershipManager retry logic.
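
Lining timeouts up against latency spikes can be done by bucketing both series per minute. This is a sketch under stated assumptions: `LatencySample`, `splitTimeoutsBySpike`, and the spike threshold are all hypothetical, the p99 samples are assumed to have been exported from Grafana as epoch-millisecond points, and "same minute" is taken literally as the same wall-clock minute.

```typescript
interface LatencySample {
  tsMs: number; // sample time, epoch milliseconds
  p99Ms: number; // Synapse p99 response time at that sample
}

interface TimeoutSplit {
  duringSpike: number[]; // timeouts in a minute that also had a latency spike
  healthyServer: number[]; // timeouts in a minute with no spike
}

const MINUTE_MS = 60000;

function splitTimeoutsBySpike(
  timeoutsMs: readonly number[],
  samples: readonly LatencySample[],
  spikeThresholdMs: number,
): TimeoutSplit {
  // Collect every wall-clock minute in which at least one sample crossed the threshold.
  const spikeMinutes = new Set<number>();
  for (const sample of samples) {
    if (sample.p99Ms >= spikeThresholdMs) {
      spikeMinutes.add(Math.floor(sample.tsMs / MINUTE_MS));
    }
  }
  const split: TimeoutSplit = { duringSpike: [], healthyServer: [] };
  for (const ts of timeoutsMs) {
    const bucket = spikeMinutes.has(Math.floor(ts / MINUTE_MS))
      ? split.duringSpike
      : split.healthyServer;
    bucket.push(ts);
  }
  return split;
}
```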
---
## 2. Causality hypothesis
```
KE-1  OTK upload conflict storm
      (rust-crypto store ↔ Synapse OTK state DIVERGED; server rejects re-uploads)
        │  no fresh OTKs can be published/claimed
        ▼
      No new Olm (1:1) sessions can be established with this device
        │
        ▼
KE-2  EC media-key to-device events (io.element.call.encryption_keys)
      arrive but cannot be decrypted ⇒ MissingKey at index N
      ⇒ friend's audio/video cuts out
```
KE-3 (missing `algorithm`) and KE-4 (delayed-event timeouts) are **likely
independent** of the KE-1→KE-2 chain: KE-3 is a decode/serialization path,
KE-4 is a MatrixRTC-vs-HS reliability path. Confirm/refute independence with the
decision tree below.
### Decision tree — which capture confirms/refutes each link
```
Q1. Does the KE-1 offending key_id from the 400 response exist in
    e2e_one_time_keys_json with a DIFFERENT value than the client request body?
    ├─ YES → OTK divergence CONFIRMED (KE-1 root). Go to Q2.
    └─ NO  → Not divergence. Check: are OTK counts at 0 with fallback key `used=true`?
        ├─ YES → OTK exhaustion, not divergence; different remediation.
        └─ NO  → Suspect RC-SDK 41.6.0-rc.0 upload-loop regression (see §3).

Q2. During the same call, are io.element.call.encryption_keys to-device events
    present in device_inbox / Synapse to-device logs for our device id?
    ├─ YES + client shows "unexpected encrypted"/MissingKey
    │     → KE-1 ⇒ KE-2 LINK CONFIRMED (events delivered, no Olm session to open them).
    ├─ YES + client decrypts fine, but LiveKit still silent
    │     → KE-2 is downstream of LiveKit/SFU, NOT KE-1. Decouple from crypto.
    └─ NO (nothing queued/targeted our device)
          → media keys never sent to us: stale device id / membership (see KE-4)
          → KE-2 is a device-targeting problem, weakly linked to KE-1.

Q3. KE-3: is content.algorithm NULL in event_json on the HS?
    ├─ YES → malformed persisted event (sender/federation). Independent of KE-1.
    └─ NO  → client-side RC-SDK deserialization bug. Independent of KE-1.

Q4. KE-4: do delayed-event timeouts coincide with Synapse p99 latency spikes
    (Grafana) in the same minute?
    ├─ YES → homeserver responsiveness/load. Independent of KE-1..KE-3.
    └─ NO  → client MembershipManager retry behavior. Independent.
```
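
The Q1 and Q2 branches above can also be written down as two pure functions, which makes it easy to record a verdict next to the captured evidence. This is only a sketch of the tree as stated in this runbook: the evidence field names and outcome labels are hypothetical, and the inputs are whatever the captures in §1 produced.

```typescript
interface Q1Evidence {
  differentValueOnServer: boolean; // offending key_id stored with a different value
  otkCount: number; // rows in e2e_one_time_keys_json for the device
  fallbackUsed: boolean; // `used` flag of the fallback key
}

type Q1Outcome = "otk-divergence" | "otk-exhaustion" | "suspect-sdk-upload-loop";

function decideQ1(evidence: Q1Evidence): Q1Outcome {
  if (evidence.differentValueOnServer) return "otk-divergence";
  if (evidence.otkCount === 0 && evidence.fallbackUsed) return "otk-exhaustion";
  return "suspect-sdk-upload-loop";
}

interface Q2Evidence {
  eventsQueuedForDevice: boolean; // media-key to-device events seen for our device id
  clientDecryptFailures: boolean; // "unexpected encrypted" / MissingKey in the console
}

type Q2Outcome = "ke1-causes-ke2" | "downstream-of-livekit" | "device-targeting";

function decideQ2(evidence: Q2Evidence): Q2Outcome {
  if (!evidence.eventsQueuedForDevice) return "device-targeting";
  // Events were delivered: either they cannot be opened (no Olm session),
  // or they decrypt fine and the silence comes from LiveKit/SFU.
  return evidence.clientDecryptFailures ? "ke1-causes-ke2" : "downstream-of-livekit";
}
```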
---
## 3. Ranked remediation options (with blast radius)
> Ordered least-destructive → most-destructive. **Do not run any of these as a
> "fix" before the planning session** — they are listed so evidence collection
> can be paired with a recovery plan. Confirm the root condition (Q1/Q2) first.
1. **Per-device logout + re-login of the affected device** *(lowest blast radius)*
- **What:** log the one glitching device out and back in. Forces a fresh
device id, fresh device keys, and a clean OTK batch — sidesteps a diverged
OTK store without touching other sessions.
- **Blast radius:** that device only. Other sessions/devices untouched.
- **Cost:** the new device must be re-verified (cross-signing) and will need
to restore room keys from **key backup** to read old encrypted history.
- **Confirms/uses:** if KE-1 stops after this, OTK-store divergence (Q1) was
the cause.
2. **Client crypto-store reset (`clearLoginData` path)** *(medium)*
- **What:** `clearLoginData()` in `src/client/initMatrix.ts` (coordinator's
file — do not edit) **deletes ALL IndexedDB databases** (incl.
`web-sync-store` and the rust-crypto store `crypto-store`), **unregisters
service workers**, **clears all Cache Storage**, and **`localStorage.clear()`**,
then reloads. `clearCacheAndReload()` is lighter — it only calls
`mx.store.deleteAllData()` (sync cache) and does **not** wipe crypto.
- **Blast radius:** this browser profile only, but total: you are logged out,
lose all cached sync state, drafts, settings, and **the local
megolm/room-key store**.
- **⚠️ Message-history / backup implication:** wiping `crypto-store` destroys
locally-held **room keys (megolm inbound sessions)**. Any history **not
backed up to server-side Key Backup** becomes **permanently undecryptable
on this device**. Before doing this: verify Key Backup is enabled and the
recovery key / passphrase is available (Settings → Security), or the user
loses readable history. Cross-signing must be re-established too.
- **Use when:** the rust-crypto store itself is corrupt/diverged and option 1
didn't clear it.
3. **SDK pin change off the RC** *(medium — codebase change, needs rebuild)*
- **Current pin:** `package.json` → `"matrix-js-sdk": "41.6.0-rc.0"` (a
release candidate).
- **Finding (npm / GitHub changelog, checked 2026-07):** stable **`41.6.0`**
was released **2026-05-26**. Its only changelog line is *"Throw sane error
on completeLoginOnNewDevice IdP rejection"* — **no OTK / keys-upload / Olm /
to-device fix** relative to the RC. Later stable lines exist
(`41.7.0`, `41.8.0`; `41.7.0-rc.3` / `41.9.0-rc.0` seen as pre-releases).
Nearby crypto-relevant entries: `41.5.0` *"Enable encrypted history sharing
by default"*; `41.4.0` key-backup handling. **No changelog entry directly
addresses the KE-1 OTK-conflict symptom** in the immediate range — so
moving RC→`41.6.0` stable is a low-risk hygiene step but is **not expected
to fix KE-1 by itself**. Before pinning, re-read the CHANGELOG for any
`41.7.x`/`41.8.x` OTK/one-time-key/olm entry that post-dates this note.
- **Blast radius:** all users after the next `cinny-build.sh` deploy. Test the
rust-crypto IndexedDB schema — a downgrade triggers the `IDB_VERSION_CONFLICT`
path in `initMatrix.ts`.
4. **Synapse-side OTK row surgery** *(LAST RESORT — highest danger)*
- **What:** deleting/rewriting rows in `e2e_one_time_keys_json` (and/or
`e2e_fallback_keys_json`, `device_inbox`) for the affected device to force
the client to re-upload a clean batch.
- **⚠️ Danger:** direct writes to Synapse crypto tables can **desync every
device of that user**, break Olm sessions **for everyone who has claimed one
of those keys**, and are easy to get wrong (wrong `key_id`, cache not
invalidated). Synapse caches OTK counts — a raw DELETE without a restart can
leave the advertised count wrong, **worsening** the KE-1 loop.
- **Guardrails if ever done (planning session + HS owner only):** full
`pg_dump` of `synapse` first; do it during **zero active calls**; delete only
the exact diverged `key_id` for the exact `device_id`; `systemctl restart
matrix-synapse` to flush caches; then log the device out/in (option 1) so it
republishes. **Never** run this speculatively.
---
## 4. "Capture session" checklist (run during the next call)
Do these **in order**. Aim to have client + server capturing the **same call**.
1. **Prep server tail (LXC 151):** SSH in, start
`tail -F /var/log/matrix-synapse/homeserver.log | tee /tmp/lotus-call-$(date +%s).log`.
(Optionally raise the `synapse.rest.client.keys` / `handlers.e2e_keys` /
`handlers.devicemessage` loggers to DEBUG per §0 and `systemctl reload
matrix-synapse` — remember to revert after.)
2. **Prep client:** open Lotus Chat → Settings → Developer Tools → **enable
Developer Tools** so the **Crypto Diagnostics** card is visible; note that its
entry count starts at 0 (a reload also resets it to 0).
3. **Open DevTools** (F12) → Console: enable **Preserve log**; Network tab:
enable **Preserve log** + **Record**. Note your **device id** and **user id**
(shown under Settings → Devices, or on the Developer Tools → Copy access token page).
4. **Note wall-clock start time** (ISO/UTC) on both machines so logs align.
5. **Join the Element Call** with the second participant; reproduce the fault
(wait for the audio/video cutouts and let the KE-1 storm run for ~30 to 60 s).
6. **When a fault occurs, note the wall-clock timestamp** and which symptom
(audio cut / video freeze / etc.) — this bounds the log window.
7. **Client artifacts:** in the Crypto Diagnostics card click **Download report**
(`lotus-crypto-diag-<ts>.json`); in DevTools Network, save the failing
`keys/upload` request+response (right-click → Save/Copy), and the raw HAR
(Network → Save all as HAR) for the call window.
8. **Grab KE-3 event id / KE-2 participant+index** from the console (or the
diag JSON `entries[]`) for the SQL lookups.
9. **Server artifacts:** stop the tail; run the per-KE greps and SQL from §1
against the noted device id / user id / event id, saving output alongside the
client JSON. Screenshot the Grafana Synapse latency panels for the window
(for KE-4).
10. **Bundle & label:** put client JSON + HAR + server log slice + SQL output in
one folder named with the call's UTC start time. Revert any DEBUG log config
(`systemctl reload matrix-synapse`). Hand off to the planning session — **do
not apply §3 remediations yet.**
---
## 5. Client diagnostics helper (this kit)
- **`src/app/utils/cryptoDiagLog.ts`** — capture-only console instrumentation.
- `installCryptoDiagLog()` — idempotent; wraps `console.warn`/`console.error`
with pass-through wrappers (originals always called) that ring-buffer (max
**200**) any line matching the KE signatures. No network, no timers.
- `getCryptoDiagEntries()` — snapshot copy of the buffer (`{ ts, level, ke,
signature, message }`, most-recent-last).
- `buildCryptoDiagReport(mx)` — JSON string: SDK version, device id, user id,
sync state, `cryptoReady` (`mx.getCrypto()` presence), per-KE counts, and the
entry buffer. No tokens/PII beyond those ids; captured log lines are retained
verbatim as evidence.
- **Signatures → KE mapping:** `already exists`→KE-1; `missing key at index` /
`io.element.call.encryption_keys` / `MissingKey`→KE-2; `DecryptionError`→KE-3;
`update_delayed_event` / `delayed event`→KE-4.
- **`src/app/features/settings/developer/CryptoDiagnostics.tsx`** — a folds
`SequenceCard`/`SettingTile` card (mirrors `developer-tools/DevelopTools.tsx`)
showing the live matched-entry count (Badge) and a **Download report** button
(Blob → `lotus-crypto-diag-<ts>.json`, same download idiom as
`room-settings/ExportRoomHistory.tsx`).
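
For readers who want the shape of the capture hook without opening the source, here is a minimal sketch of the technique described above: pass-through console wrappers feeding a bounded buffer, with the signature table taken from the mapping listed in this section. It is not the contents of `cryptoDiagLog.ts`; `installDiagSketch`, `ConsoleLike`, and the explicit `buffer` parameter are hypothetical and chosen so the behavior can be exercised against a fake console.

```typescript
type Ke = "KE-1" | "KE-2" | "KE-3" | "KE-4";

interface DiagEntry {
  ts: number;
  level: "warn" | "error";
  ke: Ke;
  signature: string;
  message: string;
}

interface ConsoleLike {
  warn: (...args: unknown[]) => void;
  error: (...args: unknown[]) => void;
}

const MAX_ENTRIES = 200;

// Substring signatures, checked in order; the first match wins.
const SIGNATURES: ReadonlyArray<readonly [string, Ke]> = [
  ["already exists", "KE-1"],
  ["missing key at index", "KE-2"],
  ["io.element.call.encryption_keys", "KE-2"],
  ["MissingKey", "KE-2"],
  ["DecryptionError", "KE-3"],
  ["update_delayed_event", "KE-4"],
  ["delayed event", "KE-4"],
];

// Remembers which console objects are already wrapped, so a second install is a no-op.
const installedTargets = new WeakSet<ConsoleLike>();

function installDiagSketch(target: ConsoleLike, buffer: DiagEntry[]): void {
  if (installedTargets.has(target)) return;
  installedTargets.add(target);
  for (const level of ["warn", "error"] as const) {
    const original = target[level].bind(target);
    target[level] = (...args: unknown[]): void => {
      original(...args); // pass-through: the original is always called first
      const message = args.map((arg) => String(arg)).join(" ");
      const hit = SIGNATURES.find(([needle]) => message.includes(needle));
      if (hit === undefined) return;
      buffer.push({ ts: Date.now(), level, ke: hit[1], signature: hit[0], message });
      if (buffer.length > MAX_ENTRIES) buffer.shift(); // drop the oldest entry
    };
  }
}
```

Calling `installDiagSketch(console, entries)` at module scope would mirror the "install once at app entry" usage; the sketch does no network or timer work, matching the capture-only constraint.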
### Recommended mount points (coordinator)
- **Install call:** call `installCryptoDiagLog()` **as early as possible during
boot** so it captures crypto errors from first sync — ideally at the top of
the client entry module or inside `ClientRoot` before/around `initClient`
(e.g. `src/app/pages/client/ClientRoot.tsx`). It is idempotent, side-effect
only, and needs no `mx`, so a module-scope call at app entry is safe. (Do
**not** put it in `initMatrix.ts` — that file is off-limits.)
- **Settings card:** render `<CryptoDiagnostics />` inside the Developer Tools
page — in `src/app/features/settings/developer-tools/DevelopTools.tsx`, add it
to the `Box direction="Column" gap="700"` list (guarded by the existing
`developerTools` flag), right after the "Access Token" card. It pulls `mx`
from `useMatrixClient()` itself, so it just needs to be placed in the tree.