LotusGuild/cinny

jared 81904372bc

docs(e2ee): investigation update — 41.7.0 delta + web-specific KE-1 root cause

Code-read + upstream-issue triage this session:
- 41.7.0 / crypto-wasm 18.3.1 does NOT fix KE-1 (no OTK/upload change; #5200
  still open) — the SDK-pin remediation lever is closed.
- Confirmed root cause = rust-crypto store <-> Synapse OTK divergence; the
  leading web trigger is that cinny never requests persistent storage, so the
  IndexedDB crypto store is evictable while the localStorage session survives.
- New buildable preventive mitigation: navigator.storage.persist() on login
  (+ multi-tab guard, 400-loop recovery prompt). Added as §6 with a secondary
  KE-2 to-device-validation hypothesis and capture discriminators.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-07-02 15:14:46 -04:00

Lotus Chat — E2EE Investigation Runbook (KE-1 → KE-4)

Scope: evidence-gathering only. Do not apply fixes from this document without a cross-system planning session (client rust-crypto ↔ Synapse ↔ Element Call MatrixRTC). Symptom source: LOTUS_BUGS.md §"Encryption / E2EE" (KE-1..KE-4), observed live 2026-06-30 on chat.lotusguild.org during a 2-person Element Call.

Client: Lotus Cinny fork, matrix-js-sdk@41.6.0-rc.0, rust-crypto. Server: Synapse 1.155.0 on LXC 151 (10.10.10.29), PostgreSQL 17.9 on LXC 109 (10.10.10.44). Facts below are copy-pasteable against that deployment (paths/IPs from /root/code/matrix/README.md).

0. Deployment facts used by this runbook

From the matrix infra README (/root/code/matrix/README.md):

| Thing | Value |
| --- | --- |
| Synapse host | LXC 151, `10.10.10.29` (Synapse 1.155.0) |
| Synapse log | `/var/log/matrix-synapse/homeserver.log` |
| Synapse config | `/etc/matrix-synapse/homeserver.yaml` (+ `conf.d/`) |
| Synapse HTTP | `10.10.10.29:8008` |
| PostgreSQL host | LXC 109, `10.10.10.44` (PG 17.9), db `synapse` |
| synapse-admin UI | `http://10.10.10.29:8080` |
| LiveKit / lk-jwt / guard | LXC 151: LiveKit `:7880/:7881`, guard `:8070`, lk-jwt `:8071` |
| SSH path to Synapse | `ssh root@10.10.10.4` then `pct enter 151` |
| SSH path to PG | `ssh root@10.10.10.4` then `pct enter 109` |

Getting a psql shell (run on LXC 109, or from 151 over the network):

# On LXC 109:
sudo -u postgres psql synapse
# From LXC 151 (pg_hba allows 10.10.10.29):
psql "host=10.10.10.44 user=synapse dbname=synapse"

Tailing Synapse during a call (on LXC 151):

tail -F /var/log/matrix-synapse/homeserver.log | tee /tmp/lotus-call-$(date +%s).log

Synapse E2EE/to-device logging is chatty at INFO; if a category is silent, temporarily raise it in /etc/matrix-synapse/conf.d/log.yaml (or the log_config file referenced by homeserver.yaml):

loggers:
  synapse.rest.client.keys: { level: DEBUG }
  synapse.handlers.e2e_keys: { level: DEBUG }
  synapse.storage.databases.main.end_to_end_keys: { level: DEBUG }
  synapse.handlers.devicemessage: { level: DEBUG } # to-device

Then systemctl reload matrix-synapse (reload re-reads log config without a full restart). Revert to INFO after the capture — DEBUG is very verbose.

1. Per-KE evidence matrix

Client greps assume Chrome/Firefox DevTools console (filter box or, better, "Preserve log" + save-as). The Crypto Diagnostics card (Settings → Developer Tools) auto-captures every signature below into a downloadable JSON — use it as the primary client artifact and DevTools as the raw backup.

KE-1 — OTK upload conflict storm (root-cause candidate)

Console signature (grep):
- already exists
- full: POST /_matrix/client/v3/keys/upload … 400 M_UNKNOWN: One time key signed_curve25519:<id> already exists. Old key: {…} new key: {…}
Capture client-side:
- Timestamp (first occurrence + rate — "N/sec"), device id, user id.
- DevTools → Network → filter keys/upload: for a failing call save the request body (the one_time_keys map — note the exact signed_curve25519:<id>) and the response body (the Old key / new key JSON). This diff is the smoking gun: same key-id, different value ⇒ store vs server divergence.
- Whether it self-heals or loops forever (KE-1 loops).

Synapse log grep (LXC 151):

grep -E "keys/upload|One time key .* already exists|OneTimeKey" \
  /var/log/matrix-synapse/homeserver.log | grep "<user_id>"

Synapse SQL (LXC 109) — what the server thinks it holds:

-- Current OTK inventory for the device. Compare the key_id set and the stored
-- key_json against the request body the client keeps retrying.
SELECT algorithm, key_id, key_json, ts_added_ms
FROM e2e_one_time_keys_json
WHERE user_id = '@user:matrix.lotusguild.org'
  AND device_id = '<DEVICE_ID>'
ORDER BY algorithm, key_id;

-- Server's advertised counts (this is what /sync tells the client it has,
-- and drives whether the client decides to upload more).
SELECT algorithm, count(*) FROM e2e_one_time_keys_json
WHERE user_id = '@user:matrix.lotusguild.org' AND device_id = '<DEVICE_ID>'
GROUP BY algorithm;

-- Fallback key state (used when OTKs are exhausted).
SELECT algorithm, key_id, used
FROM e2e_fallback_keys_json
WHERE user_id = '@user:matrix.lotusguild.org' AND device_id = '<DEVICE_ID>';

Table names are Synapse 1.155 (e2e_one_time_keys_json, e2e_fallback_keys_json). If a name is absent, list with \dt e2e* in psql.

Confirms: if the offending key_id (from the 400) is present in e2e_one_time_keys_json with a different stored value than the client's request body → OTK state has diverged (rust-crypto store vs Synapse). That is the KE-1 root condition.
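The comparison described above (same key id, different value) can be done mechanically once the request body and the stored rows are both saved. The following is a minimal sketch, not part of the kit: `compareOtks`, `canonical`, and the row shape are hypothetical names, and it assumes the stored rows were exported with `algorithm`, `key_id`, and the `key_json` column, with `key_id` holding only the part after the colon.

```typescript
// Hypothetical shapes: `requested` is the one_time_keys map saved from DevTools,
// `stored` is the exported result of the e2e_one_time_keys_json query.
type OtkMap = Record<string, unknown>; // "signed_curve25519:<id>" -> key object

interface StoredOtkRow {
  algorithm: string;
  key_id: string; // assumed to be the id without the algorithm prefix
  key_json: string; // JSON text as held by the homeserver
}

type OtkVerdict = "diverged" | "identical" | "not_on_server";

// Serialise with sorted object keys so that key order alone never counts as a difference.
function canonical(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonical).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonical(obj[k])}`)
      .join(",");
    return `{${body}}`;
  }
  return JSON.stringify(value);
}

// Classify every key the client tried to upload against what the server holds.
function compareOtks(requested: OtkMap, stored: StoredOtkRow[]): Record<string, OtkVerdict> {
  const serverSide = new Map<string, string>();
  for (const row of stored) {
    serverSide.set(`${row.algorithm}:${row.key_id}`, canonical(JSON.parse(row.key_json)));
  }
  const verdicts: Record<string, OtkVerdict> = {};
  for (const [fullId, key] of Object.entries(requested)) {
    const held = serverSide.get(fullId);
    if (held === undefined) verdicts[fullId] = "not_on_server";
    else verdicts[fullId] = held === canonical(key) ? "identical" : "diverged";
  }
  return verdicts;
}
```

Any key reported as `diverged` corresponds to the KE-1 root condition; `not_on_server` for the offending key id would instead point away from divergence.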

KE-2 — EC media keys not arriving/decrypting (audio/video cutouts)

Console signature (grep):
- MissingKey
- missing key at index (e.g. MissingKey: missing key at index N for participant @user)
- key set not found
- io.element.call.encryption_keys (rust-crypto: WARN … Received an unexpected encrypted to-device event … event_type="io.element.call.encryption_keys")
Capture client-side:
- Timestamp windows where a participant's audio/video cut out, and the @participant + index N from the message.
- The io.element.call.encryption_keys warnings (these are the media-key to-device events failing to decrypt) with their timestamps.
- Own device id + user id (to correlate with the sender's Olm session).

Synapse log grep (LXC 151) — to-device delivery of the media keys:

grep -E "io.element.call.encryption_keys|m.room.encrypted|/sendToDevice|to_device" \
  /var/log/matrix-synapse/homeserver.log | grep -E "<user_id>|<participant_id>"

Synapse SQL (LXC 109) — undelivered / queued to-device events:

-- Backlog of to-device messages queued for the affected device. A growing
-- count here = the HS has the media-key events but the device isn't draining
-- them via /sync (or they were sent to a stale device id).
SELECT user_id, device_id, count(*) AS pending
FROM device_inbox
WHERE user_id = '@user:matrix.lotusguild.org'
GROUP BY user_id, device_id;

-- Cross-check the device id the sender is targeting actually exists / is current.
SELECT device_id, display_name, last_seen
FROM devices WHERE user_id = '@user:matrix.lotusguild.org';

Confirms: to-device events present but undecryptable (client shows the io.element.call.encryption_keys "unexpected encrypted" warning) ⇒ there is no valid Olm session to decrypt them — the expected downstream of KE-1.

KE-3 — Timeline decryption error: missing `algorithm` field

Console signature (grep):
- DecryptionError
- full: Error decrypting event (… type=m.room.encrypted …): DecryptionError[msg: missing field 'algorithm' at line 1 column 138 …]
Capture client-side:
- The event id ($SASBBzoqj… was one) and the room id.
- Pull the raw event JSON via DevTools or the Developer Tools account-data/event viewer, or directly:
```
GET https://matrix.lotusguild.org/_matrix/client/v3/rooms/<roomId>/event/<eventId>
```
  Inspect content — confirm whether algorithm (should be m.megolm.v1.aes-sha2) is truly absent vs a serialization mismatch.

Synapse log grep (LXC 151):

grep -E "<eventId>" /var/log/matrix-synapse/homeserver.log

Synapse SQL (LXC 109) — the stored event content as the HS holds it:

SELECT ej.event_id, e.type, e.sender, e.origin_server_ts,
       (ej.json::json -> 'content' -> 'algorithm') AS algorithm
FROM event_json ej
JOIN events e USING (event_id)
WHERE ej.event_id = '$SASBBzoqj...';

Confirms: if the stored content.algorithm is NULL/absent on the HS → a malformed/legacy event was persisted (sender-side or federation). If it is present on the HS but the client throws → an RC-SDK deserialization bug. This distinction decides whether KE-3 is a data problem or a client problem.
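The data-versus-client distinction above can be applied directly to the raw event JSON returned by the `/event/<eventId>` request. This is a minimal sketch under the assumption that the raw JSON was saved as text; `classifyKe3` and its verdict names are hypothetical and exist only for illustration.

```typescript
type Ke3Verdict =
  | "malformed_on_server" // content.algorithm absent: data problem
  | "client_deserialization_suspect" // content.algorithm present: client problem
  | "not_encrypted_event"; // wrong event picked for the check

// Inspect the raw event as the homeserver returns it and apply the KE-3 rule.
function classifyKe3(rawEventJson: string): Ke3Verdict {
  const event = JSON.parse(rawEventJson) as { type?: unknown; content?: unknown };
  if (event.type !== "m.room.encrypted") return "not_encrypted_event";
  const content = event.content;
  const algorithm =
    content !== null && typeof content === "object"
      ? (content as Record<string, unknown>)["algorithm"]
      : undefined;
  return typeof algorithm === "string" && algorithm.length > 0
    ? "client_deserialization_suspect"
    : "malformed_on_server";
}
```

Running it on the saved JSON of the failing event gives the same answer as the SQL check, without needing database access.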

KE-4 — MatrixRTC delayed-event / membership timeouts

Console signature (grep):
- update_delayed_event (org.matrix.msc4157.update_delayed_event)
- delayed event / Restart delayed event timed out
- full: [MembershipManager] Network local timeout error while sending event, immediate retry … AbortError: Restart delayed event timed out before the HS responded
Capture client-side:
- Timestamps of each timeout; whether they correlate with call join/leave or with general sync slowness.
- DevTools → Network: the …/delayed_events… / update_delayed_event requests — their HTTP status and latency (timed-out vs slow-200).

Synapse log grep (LXC 151):

grep -E "delayed_event|msc4140|msc4157|update_delayed" \
  /var/log/matrix-synapse/homeserver.log | grep "<user_id>"
# HS responsiveness in the same window (KE-4 may be pure latency):
grep -E "Processed request|/sync" /var/log/matrix-synapse/homeserver.log | tail -50

Server-side corroboration (Grafana, dashboard.lotusguild.org): Synapse p99 response time (excl. /sync), event-processing lag, DB query latency for the call window. High latency here ⇒ KE-4 is (partly) homeserver responsiveness, not a client bug.
Confirms: timeouts that line up with HS latency spikes → reliability/load; timeouts with a healthy HS → client MembershipManager retry logic.
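Lining up timeouts with latency spikes by eye is error-prone when a call produces dozens of them. The sketch below shows one possible way to split the timeouts, assuming the Grafana p99 series was exported as timestamped samples; `splitTimeoutsByLatency`, the sample shape, and the spike threshold are all assumptions, and the one-minute window mirrors the "same minute" wording used elsewhere in this runbook.

```typescript
interface LatencySample {
  tsMs: number; // sample time, epoch milliseconds
  p99Ms: number; // Synapse p99 response time at that moment
}

interface TimeoutSplit {
  duringSpike: number[]; // timeouts with a latency spike nearby: HS responsiveness
  healthy: number[]; // timeouts while the HS looked healthy: client retry logic
}

// Partition timeout timestamps by whether any spike sample lies within windowMs.
function splitTimeoutsByLatency(
  timeoutsMs: number[],
  samples: LatencySample[],
  spikeThresholdMs: number, // assumption: pick from the dashboard's normal baseline
  windowMs: number = 60000,
): TimeoutSplit {
  const spikes = samples.filter((s) => s.p99Ms >= spikeThresholdMs);
  const result: TimeoutSplit = { duringSpike: [], healthy: [] };
  for (const t of timeoutsMs) {
    const nearSpike = spikes.some((s) => Math.abs(s.tsMs - t) <= windowMs);
    (nearSpike ? result.duringSpike : result.healthy).push(t);
  }
  return result;
}
```

If most timeouts land in `duringSpike`, the capture supports the homeserver-load reading; a mostly `healthy` list points at the MembershipManager retry logic.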

2. Causality hypothesis

KE-1  OTK upload conflict storm
      (rust-crypto store ↔ Synapse OTK state DIVERGED; server rejects re-uploads)
        │  no fresh OTKs can be published/claimed
        ▼
      No new Olm (1:1) sessions can be established with this device
        │
        ▼
KE-2  EC media-key to-device events (io.element.call.encryption_keys)
      arrive but cannot be decrypted  ⇒  MissingKey at index N
      ⇒  friend's audio/video cuts out

KE-3 (missing algorithm) and KE-4 (delayed-event timeouts) are likely independent of the KE-1→KE-2 chain: KE-3 is a decode/serialization path, KE-4 is a MatrixRTC-vs-HS reliability path. Confirm/refute independence with the decision tree below.

Decision tree — which capture confirms/refutes each link

Q1. Does the KE-1 offending key_id from the 400 response exist in
    e2e_one_time_keys_json with a DIFFERENT value than the client request body?
    ├─ YES → OTK divergence CONFIRMED (KE-1 root). Go to Q2.
    └─ NO  → Not divergence. Check: are OTK counts at 0 with fallback key `used=true`?
             ├─ YES → OTK exhaustion, not divergence — different remediation.
             └─ NO  → Suspect RC-SDK 41.6.0-rc.0 upload-loop regression (see §3).

Q2. During the same call, are io.element.call.encryption_keys to-device events
    present in device_inbox / Synapse to-device logs for our device id?
    ├─ YES + client shows "unexpected encrypted"/MissingKey
    │        → KE-1 ⇒ KE-2 LINK CONFIRMED (events delivered, no Olm session to open them).
    ├─ YES + client decrypts fine, but LiveKit still silent
    │        → KE-2 is downstream of LiveKit/SFU, NOT KE-1. Decouple from crypto.
    └─ NO (nothing queued/targeted our device)
             → media keys never sent to us: stale device id / membership (see KE-4)
               → KE-2 is a device-targeting problem, weakly linked to KE-1.

Q3. KE-3: is content.algorithm NULL in event_json on the HS?
    ├─ YES → malformed persisted event (sender/federation). Independent of KE-1.
    └─ NO  → client-side RC-SDK deserialization bug. Independent of KE-1.

Q4. KE-4: do delayed-event timeouts coincide with Synapse p99 latency spikes
    (Grafana) in the same minute?
    ├─ YES → homeserver responsiveness/load. Independent of KE-1..KE-3.
    └─ NO  → client MembershipManager retry behavior. Independent.
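For triage notes it can help to record the answers as booleans and derive the conclusions from them, so two people reading the same capture reach the same verdict. The following is a sketch of the tree above as a plain function; `CaptureFacts` and `walkDecisionTree` are hypothetical names, and the output strings paraphrase the branches rather than quote any tool.

```typescript
interface CaptureFacts {
  otkValueDiffers: boolean; // Q1: same key_id, different value on the server
  otkCountZero: boolean; // Q1 NO branch: OTK counts at 0
  fallbackKeyUsed: boolean; // Q1 NO branch: fallback key used=true
  mediaKeysQueuedForDevice: boolean; // Q2: encryption_keys events targeted our device
  clientDecryptsMediaKeys: boolean; // Q2: client opened them without warnings
  algorithmNullOnServer: boolean; // Q3
  timeoutsMatchLatencySpikes: boolean; // Q4
}

// Walk Q1..Q4 in order; Q2 is only evaluated when Q1 confirms divergence.
function walkDecisionTree(f: CaptureFacts): string[] {
  const out: string[] = [];
  if (f.otkValueDiffers) {
    out.push("KE-1: OTK divergence confirmed");
    if (!f.mediaKeysQueuedForDevice) {
      out.push("KE-2: device-targeting problem, weakly linked to KE-1");
    } else if (!f.clientDecryptsMediaKeys) {
      out.push("KE-2: KE-1 => KE-2 link confirmed");
    } else {
      out.push("KE-2: downstream of LiveKit/SFU, not KE-1");
    }
  } else if (f.otkCountZero && f.fallbackKeyUsed) {
    out.push("KE-1: OTK exhaustion, not divergence");
  } else {
    out.push("KE-1: suspect SDK upload-loop regression");
  }
  out.push(
    f.algorithmNullOnServer
      ? "KE-3: malformed persisted event"
      : "KE-3: client-side deserialization bug",
  );
  out.push(
    f.timeoutsMatchLatencySpikes
      ? "KE-4: homeserver responsiveness/load"
      : "KE-4: client MembershipManager retry behavior",
  );
  return out;
}
```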

3. Ranked remediation options (with blast radius)

Ordered least-destructive → most-destructive. Do not run any of these as a "fix" before the planning session — they are listed so evidence collection can be paired with a recovery plan. Confirm the root condition (Q1/Q2) first.

1. Per-device logout + re-login of the affected device (lowest blast radius)
- What: log the one glitching device out and back in. Forces a fresh device id, fresh device keys, and a clean OTK batch — sidesteps a diverged OTK store without touching other sessions.
- Blast radius: that device only. Other sessions/devices untouched.
- Cost: the new device must be re-verified (cross-signing) and will need to restore room keys from key backup to read old encrypted history.
- Confirms/uses: if KE-1 stops after this, OTK-store divergence (Q1) was the cause.
2. Client crypto-store reset (clearLoginData path) (medium)
- What: clearLoginData() in src/client/initMatrix.ts (coordinator's file — do not edit) deletes ALL IndexedDB databases (incl. web-sync-store and the rust-crypto store crypto-store), unregisters service workers, clears all Cache Storage, and localStorage.clear(), then reloads. clearCacheAndReload() is lighter — it only calls mx.store.deleteAllData() (sync cache) and does not wipe crypto.
- Blast radius: this browser profile only, but total: you are logged out, lose all cached sync state, drafts, settings, and the local megolm/room-key store.
- ⚠️ Message-history / backup implication: wiping crypto-store destroys locally-held room keys (megolm inbound sessions). Any history not backed up to server-side Key Backup becomes permanently undecryptable on this device. Before doing this: verify Key Backup is enabled and the recovery key / passphrase is available (Settings → Security), or the user loses readable history. Cross-signing must be re-established too.
- Use when: the rust-crypto store itself is corrupt/diverged and option 1 didn't clear it.
3. SDK pin change off the RC (medium — codebase change, needs rebuild)
- Pin when KE-1..KE-4 were observed: package.json → "matrix-js-sdk": "41.6.0-rc.0" (a release candidate). §6.1 records the later move to 41.7.0.
- Finding (npm / GitHub changelog, checked 2026-07): stable 41.6.0 was released 2026-05-26. Its only changelog line is "Throw sane error on completeLoginOnNewDevice IdP rejection" — no OTK / keys-upload / Olm / to-device fix relative to the RC. Later stable lines exist (41.7.0, 41.8.0; 41.7.0-rc.3 / 41.9.0-rc.0 seen as pre-releases). Nearby crypto-relevant entries: 41.5.0 "Enable encrypted history sharing by default"; 41.4.0 key-backup handling. No changelog entry directly addresses the KE-1 OTK-conflict symptom in the immediate range — so moving RC→41.6.0 stable is a low-risk hygiene step but is not expected to fix KE-1 by itself. Before pinning, re-read the CHANGELOG for any 41.7.x/41.8.x OTK/one-time-key/olm entry that post-dates this note.
- Blast radius: all users after the next cinny-build.sh deploy. Test the rust-crypto IndexedDB schema — a downgrade triggers the IDB_VERSION_CONFLICT path in initMatrix.ts.
4. Synapse-side OTK row surgery (LAST RESORT — highest danger)
- What: deleting/rewriting rows in e2e_one_time_keys_json (and/or e2e_fallback_keys_json, device_inbox) for the affected device to force the client to re-upload a clean batch.
- ⚠️ Danger: direct writes to Synapse crypto tables can desync every device of that user, break Olm sessions for everyone who has claimed one of those keys, and are easy to get wrong (wrong key_id, cache not invalidated). Synapse caches OTK counts — a raw DELETE without a restart can leave the advertised count wrong, worsening the KE-1 loop.
- Guardrails if ever done (planning session + HS owner only): full pg_dump of synapse first; do it during zero active calls; delete only the exact diverged key_id for the exact device_id; systemctl restart matrix-synapse to flush caches; then log the device out/in (option 1) so it republishes. Never run this speculatively.

4. "Capture session" checklist (run during the next call)

Do these in order. Aim to have client + server capturing the same call.

1. Prep server tail (LXC 151): SSH in, start tail -F /var/log/matrix-synapse/homeserver.log | tee /tmp/lotus-call-$(date +%s).log. (Optionally raise the synapse.rest.client.keys / handlers.e2e_keys / handlers.devicemessage loggers to DEBUG per §0 and systemctl reload matrix-synapse; remember to revert afterwards.)
2. Prep client: open Lotus Chat → Settings → Developer Tools → enable Developer Tools so the Crypto Diagnostics card is visible; note that its entry count starts at 0 (a reload also resets it to 0).
3. Open DevTools (F12) → Console: enable Preserve log; Network tab: enable Preserve log + Record. Note your device id and user id (Settings → Devices, or the Developer Tools → Copy access token page, shows the ids).
4. Note wall-clock start time (ISO/UTC) on both machines so logs align.
5. Join the Element Call with the second participant; reproduce the fault (wait for the audio/video cutouts and let the KE-1 storm run ~30–60s).
6. When a fault occurs, note the wall-clock timestamp and which symptom (audio cut / video freeze / etc.); this bounds the log window.
7. Client artifacts: in the Crypto Diagnostics card click Download report (lotus-crypto-diag-<ts>.json); in DevTools Network, save the failing keys/upload request+response (right-click → Save/Copy), and the raw HAR (Network → Save all as HAR) for the call window.
8. Grab the KE-3 event id and the KE-2 participant+index from the console (or the diag JSON entries[]) for the SQL lookups.
9. Server artifacts: stop the tail; run the per-KE greps and SQL from §1 against the noted device id / user id / event id, saving output alongside the client JSON. Screenshot the Grafana Synapse latency panels for the window (for KE-4).
10. Bundle & label: put client JSON + HAR + server log slice + SQL output in one folder named with the call's UTC start time. Revert any DEBUG log config (systemctl reload matrix-synapse). Hand off to the planning session and do not apply §3 remediations yet.

5. Client diagnostics helper (this kit)

src/app/utils/cryptoDiagLog.ts — capture-only console instrumentation.
- installCryptoDiagLog() — idempotent; wraps console.warn/console.error with pass-through wrappers (originals always called) that ring-buffer (max 200) any line matching the KE signatures. No network, no timers.
- getCryptoDiagEntries() — snapshot copy of the buffer ({ ts, level, ke, signature, message }, most-recent-last).
- buildCryptoDiagReport(mx) — JSON string: SDK version, device id, user id, sync state, cryptoReady (mx.getCrypto() presence), per-KE counts, and the entry buffer. No tokens/PII beyond those ids; captured log lines are retained verbatim as evidence.
- Signatures → KE mapping: already exists→KE-1; missing key at index / io.element.call.encryption_keys / MissingKey→KE-2; DecryptionError→KE-3; update_delayed_event / delayed event→KE-4.
src/app/features/settings/developer/CryptoDiagnostics.tsx — a folds SequenceCard/SettingTile card (mirrors developer-tools/DevelopTools.tsx) showing the live matched-entry count (Badge) and a Download report button (Blob → lotus-crypto-diag-<ts>.json, same download idiom as room-settings/ExportRoomHistory.tsx).
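To make the capture behaviour concrete, the signature mapping and the bounded buffer can be pictured roughly as follows. This is an illustrative sketch only and is not the contents of cryptoDiagLog.ts: `matchSignature` and `DiagRingBuffer` are hypothetical names, matching is assumed to be a case-sensitive substring test, and the console wrapping is left out.

```typescript
type Ke = "KE-1" | "KE-2" | "KE-3" | "KE-4";

// Signature -> KE mapping as listed above; the first match wins.
const SIGNATURES: ReadonlyArray<readonly [string, Ke]> = [
  ["already exists", "KE-1"],
  ["missing key at index", "KE-2"],
  ["io.element.call.encryption_keys", "KE-2"],
  ["MissingKey", "KE-2"],
  ["DecryptionError", "KE-3"],
  ["update_delayed_event", "KE-4"],
  ["delayed event", "KE-4"],
];

interface DiagEntry {
  ts: number;
  level: "warn" | "error";
  ke: Ke;
  signature: string;
  message: string;
}

function matchSignature(message: string): { ke: Ke; signature: string } | undefined {
  for (const [signature, ke] of SIGNATURES) {
    if (message.includes(signature)) return { ke, signature };
  }
  return undefined;
}

// Bounded buffer: keeps only the most recent `max` matching lines, oldest first.
class DiagRingBuffer {
  private entries: DiagEntry[] = [];
  private readonly max: number;

  constructor(max: number = 200) {
    this.max = max;
  }

  // Returns true when the line matched a KE signature and was stored.
  record(level: "warn" | "error", message: string, ts: number = Date.now()): boolean {
    const hit = matchSignature(message);
    if (!hit) return false;
    this.entries.push({ ts, level, ke: hit.ke, signature: hit.signature, message });
    if (this.entries.length > this.max) this.entries.shift();
    return true;
  }

  snapshot(): DiagEntry[] {
    return this.entries.slice();
  }
}
```

Because non-matching lines are dropped before they reach the buffer, 200 slots cover a long KE-1 storm only partially; the first-occurrence timestamp should therefore also be noted by hand.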

Recommended mount points (coordinator)

Install call: call installCryptoDiagLog() as early as possible during boot so it captures crypto errors from first sync — ideally at the top of the client entry module or inside ClientRoot before/around initClient (e.g. src/app/pages/client/ClientRoot.tsx). It is idempotent, side-effect only, and needs no mx, so a module-scope call at app entry is safe. (Do not put it in initMatrix.ts — that file is off-limits.)
Settings card: render <CryptoDiagnostics /> inside the Developer Tools page — in src/app/features/settings/developer-tools/DevelopTools.tsx, add it to the Box direction="Column" gap="700" list (guarded by the existing developerTools flag), right after the "Access Token" card. It pulls mx from useMatrixClient() itself, so it just needs to be placed in the tree.

6. 2026-07 investigation update — 41.7.0 delta + web-specific root cause

New findings this session (code-read + upstream issue triage). These sharpen KE-1's root cause and close the "just upgrade the SDK" lever.

6.1 The 41.7.0 upgrade does NOT fix KE-1 (lever closed)

We are now on matrix-js-sdk@41.7.0 → @matrix-org/matrix-sdk-crypto-wasm@18.3.1 (was 41.6.0-rc.0 when KE-1/2 were observed). Checked both changelogs:

- 41.7.0's only crypto line is the security bump to crypto-wasm 18.3.1. No OTK / keys-upload / Olm-session change.
- crypto-wasm 17.0 → 18.3.1: no entry for one-time-keys, keys/upload, "already exists", or upload conflicts. The 18.3.x work was to-device security hardening (vodozemac 0.10; sender-spoofing check via sender_device_keys; MSC4147 validation), unrelated to the OTK loop.
- Upstream matrix-rust-sdk#5200 ("OlmMachine constantly tries to upload keys when restoring session") is still OPEN (as of mid-2025). The loop mechanism is confirmed there: on the 400, mark_request_as_sent() never fires, so the keys stay "unshared" and the SDK re-issues the identical failing upload every cycle → the storm.

⇒ Remediation option 3 (SDK pin) is exhausted for KE-1. Do not expect a version bump to help; the fix is store-hygiene, below.

6.2 Confirmed root cause + the web-specific trigger we can act on

Upstream #5200 + #1415 pin the root condition to rust-crypto store ↔ server OTK divergence, from one of:

1. Crypto store reset/restore without deregistering the device server-side: the store forgets OTKs it already published; the server still holds them.
2. Unsafe concurrent access to the crypto store, e.g. the same session open in multiple browser tabs, each running its own OlmMachine against the one IndexedDB crypto store.
3. A store that isn't durably persisted, so a restore can't track what was sent.

Cinny is a web client and hits two of these by construction (verified in code):

1. No navigator.storage.persist() anywhere (grep clean). The rust-crypto IndexedDB store is therefore evictable under storage pressure, while the access token + device id live in localStorage (N97), which browsers evict less aggressively. Partial eviction ⇒ the device resurrects with a blank crypto store but the SAME device id ⇒ it re-uploads OTKs the server still holds ⇒ the exact KE-1 "already exists" divergence, with no user action and no visible cause. This is the leading hypothesis for a self-hosted web deployment.
2. No multi-tab crypto guard (no navigator.locks / BroadcastChannel leader election in src/). initMatrix.ts calls mx.initRustCrypto() with no single-writer coordination, so 2+ tabs = concurrent store access = trigger #2.

6.3 Concrete PREVENTIVE client mitigations (new — buildable, don't need a call)

Ordered by value/effort. These reduce the recurrence of KE-1; they don't heal an already-diverged device (that still needs remediation option 1: clean logout+login).

1. Request persistent storage on login via navigator.storage.persist() (cheapest, highest value). Idempotent, side-effect only, and no behavior change if the browser denies it. Directly prevents the eviction-induced divergence in 6.2. Best placed at app entry alongside the other module-scope calls (NOT in initMatrix.ts, which is off-limits), e.g. a one-liner in ClientRoot/app bootstrap: `if (navigator.storage?.persist) navigator.storage.persist();`. Optionally surface navigator.storage.persisted() in the Crypto Diagnostics card so a capture records whether the store was evictable.
2. Multi-tab guard (medium). Detect a second tab of the same session (BroadcastChannel or the Web Locks API) and either (a) warn "Lotus is open in another tab — encryption may misbehave", or (b) make secondary tabs read-only for crypto. Prevents trigger #2.
3. Loop detection → recovery prompt (medium). Watch for repeated keys/upload 400 M_UNKNOWN … already exists (the client sees the rejection); after N in a window, stop hammering and surface a "Reset encryption on this device (log out & back in)" prompt instead of looping silently.
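The persistent-storage request in the first mitigation can be wrapped so that its outcome is recordable in the diagnostics report. What follows is a minimal sketch, assuming the storage manager is passed in rather than read from the global; `requestPersistentStorage`, `StorageLike`, and the outcome names are hypothetical, while `persist()` and `persisted()` are the standard StorageManager methods.

```typescript
// Structural subset of the browser's StorageManager, so the helper can be
// exercised outside a browser. In the app, pass navigator.storage.
interface StorageLike {
  persist?: () => Promise<boolean>;
  persisted?: () => Promise<boolean>;
}

type PersistOutcome = "already_persistent" | "granted" | "denied" | "unsupported";

// Ask for persistence once; never throws, so it is safe to call at app entry.
async function requestPersistentStorage(
  storage: StorageLike | undefined,
): Promise<PersistOutcome> {
  if (!storage || typeof storage.persist !== "function") return "unsupported";
  try {
    if (typeof storage.persisted === "function" && (await storage.persisted())) {
      return "already_persistent";
    }
    return (await storage.persist()) ? "granted" : "denied";
  } catch {
    // Treat unexpected rejections like a denial: the store stays evictable.
    return "denied";
  }
}
```

A caller at app bootstrap could run `requestPersistentStorage(navigator.storage)` and store the result for the Crypto Diagnostics report; a `denied` or `unsupported` outcome means the eviction trigger from 6.2 remains possible on that browser.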

6.4 Secondary KE-2 hypothesis to test in the capture

crypto-wasm 18.3.0 tightened Olm to-device validation (sender-spoof check + MSC4147). It's therefore possible KE-2's WARN … unexpected encrypted to-device event … io.element.call.encryption_keys is partly the new validation rejecting EC's media-key events, not only the missing-Olm-session downstream of KE-1. Capture discriminator: if KE-2 still occurs in a call where OTK counts are healthy and no KE-1 storm is present (Q1 = NO), suspect the to-device validation path (EC ↔ rust-crypto 18.3.x), not KE-1. If KE-2 only ever co-occurs with the KE-1 storm, the original KE-1⇒KE-2 chain stands.

6.5 What to do now vs. at capture

Now (no call needed): ship 6.3.1 (persist()) — it's safe and preventive. Consider 6.3.3 (loop detection) as a follow-up.
At the next glitchy call: run the §4 capture; answer Q1 (divergence?) and 6.4's discriminator. For any currently stuck device, remediation option 1 (clean logout + login, not just "clear storage" — clearing storage without mx.logout() leaves the server device + its OTKs and can re-trigger the divergence).
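The loop-detection follow-up named above amounts to counting matching rejections inside a sliding time window. Here is a sketch under stated assumptions: `UploadConflictDetector` is a hypothetical name, the threshold and window are placeholders to be chosen in the planning session, and wiring it to the SDK's actual keys/upload failures is deliberately left out.

```typescript
// Counts keys/upload "already exists" rejections in a sliding window and
// reports when the caller should stop retrying and show a recovery prompt.
class UploadConflictDetector {
  private hits: number[] = [];
  private readonly threshold: number;
  private readonly windowMs: number;

  constructor(threshold: number, windowMs: number) {
    this.threshold = threshold;
    this.windowMs = windowMs;
  }

  // status/body describe one failed keys/upload response; nowMs is epoch ms.
  // Unrelated failures are ignored and always return false.
  recordFailure(status: number, body: string, nowMs: number): boolean {
    if (status !== 400 || !body.includes("already exists")) return false;
    this.hits.push(nowMs);
    // Drop hits that have aged out of the window.
    this.hits = this.hits.filter((t) => nowMs - t < this.windowMs);
    return this.hits.length >= this.threshold;
  }
}
```

For example, `new UploadConflictDetector(5, 60000)` would trip on the fifth matching rejection within a minute. Since a healthy client should see none at all, a low threshold is unlikely to produce false prompts, but the right value is a judgment call.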
