docs(e2ee): investigation update — 41.7.0 delta + web-specific KE-1 root cause

Code-read + upstream-issue triage this session:

- 41.7.0 / crypto-wasm 18.3.1 does NOT fix KE-1 (no OTK/upload change; #5200 still open), so the SDK-pin remediation lever is closed.
- Confirmed root cause = rust-crypto store <-> Synapse OTK divergence; the leading web trigger is that cinny never requests persistent storage, so the IndexedDB crypto store is evictable while the localStorage session survives.
- New buildable preventive mitigation: navigator.storage.persist() on login (+ multi-tab guard, 400-loop recovery prompt).

Added as §6 with a secondary KE-2 to-device-validation hypothesis and capture discriminators.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-02 15:14:46 -04:00
parent c82ab5c7f5
commit 81904372bc
2 changed files with 110 additions and 0 deletions
@@ -112,6 +112,19 @@ signed_curve25519:AAAAAAAAAGQ already exists. Old key: {…} new key: {…}` —
  mismatch, OTK id-counter desync, RC-SDK (`41.6.0-rc.0`) regression, or a
  Synapse OTK bug. Repro signature: grep console for `already exists`.
  **Extreme — planning session.**
  **Update 2026-07 (investigation §6):** upstream `matrix-rust-sdk#5200` (still
  OPEN) confirms the mechanism — on the 400, `mark_request_as_sent()` never fires
  so the SDK re-issues the identical upload forever. **`41.7.0` does NOT fix it**
  (crypto-wasm 17→18.3.1 has no OTK/upload change; 18.3.x was to-device security
  only) — the SDK-pin lever is closed. Root cause = **store↔server OTK
  divergence**; the leading **web-specific** trigger is that cinny never calls
  **`navigator.storage.persist()`**, so the IndexedDB crypto store is evictable
  while the `localStorage` session/device-id survives → device resurrects with a
  blank store → re-uploads OTKs the server still holds. **Actionable preventive
  fix (buildable now, no call needed):** request persistent storage on login
  (+ optional multi-tab guard + 400-loop→recovery-prompt). Healing an
  already-diverged device still needs a clean **logout+login** (not just "clear
  storage"). See `LOTUS_E2EE_INVESTIGATION.md` §6.
 - **KE-2 — Element Call media keys not arriving/decrypting → audio & video cut out (CRITICAL).**
  `MissingKey: missing key at index N for participant @user`, `skipping decryption
@@ -405,3 +405,100 @@ signature, message }`, most-recent-last).
  to the `Box direction="Column" gap="700"` list (guarded by the existing
  `developerTools` flag), right after the "Access Token" card. It pulls `mx`
  from `useMatrixClient()` itself, so it just needs to be placed in the tree.
 ---
 ## 6. 2026-07 investigation update — 41.7.0 delta + web-specific root cause
 New findings this session (code-read + upstream issue triage). These **sharpen
 KE-1's root cause and close the "just upgrade the SDK" lever**.
 ### 6.1 The 41.7.0 upgrade does NOT fix KE-1 (lever closed)
 We are now on **`matrix-js-sdk@41.7.0`** → **`@matrix-org/matrix-sdk-crypto-wasm@18.3.1`**
 (was `41.6.0-rc.0` when KE-1/2 were observed). Checked both changelogs:
 - 41.7.0's only crypto line is the **security bump to crypto-wasm 18.3.1**. No
  OTK / keys-upload / Olm-session change.
 - crypto-wasm 17.0 → 18.3.1: **no entry** for one-time-keys, keys/upload,
  "already exists", or upload conflicts. The 18.3.x work was **to-device
  security hardening** (vodozemac 0.10; sender-spoofing check via
  `sender_device_keys`; MSC4147 validation) — unrelated to the OTK loop.
 - Upstream **`matrix-rust-sdk#5200`** ("OlmMachine constantly tries to upload
  keys when restoring session") is **still OPEN** (as of this 2026-07 check). The loop
  mechanism is confirmed there: on the 400, `mark_request_as_sent()` never
  fires, so the keys stay "unshared" and the SDK re-issues the identical failing
  upload every cycle → the storm.
 ⇒ **Remediation option 3 (SDK pin) is exhausted for KE-1.** Do not expect a
 version bump to help; the fix is store-hygiene, below.
 ### 6.2 Confirmed root cause + the web-specific trigger we can act on
 Upstream `#5200` + `#1415` pin the root condition to **rust-crypto store ↔
 server OTK divergence**, from one of:
 1. **Crypto store reset/restore without deregistering the device server-side**
   — the store forgets OTKs it already published; the server still holds them.
 2. **Unsafe concurrent access to the crypto store** — e.g. the **same session
   open in multiple browser tabs**, each running its own OlmMachine against the
   one IndexedDB crypto store.
 3. A store that isn't durably persisted, so a restore can't track what was sent.
 **Cinny is a web client and hits two of these by construction (verified in code):**
 - **No `navigator.storage.persist()` anywhere** (`grep` clean). The rust-crypto
  IndexedDB store is therefore **evictable under storage pressure** — while the
  **access token + device id live in `localStorage`** (N97), which browsers evict
  _less_ aggressively. Partial eviction ⇒ the device **resurrects with a blank
  crypto store but the SAME device id** ⇒ it re-uploads OTKs the server still
  holds ⇒ the **exact KE-1 "already exists" divergence**, with **no user action**
  and no visible cause. This is the leading hypothesis for a self-hosted web
  deployment.
 - **No multi-tab crypto guard** (no `navigator.locks` / `BroadcastChannel`
  leader election in `src/`). `initMatrix.ts` calls `mx.initRustCrypto()` with no
  single-writer coordination, so 2+ tabs = concurrent store access = trigger #2.
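The single-writer coordination found missing above could look roughly like the following sketch, which uses the Web Locks API's `ifAvailable` option so that one tab claims a named lock and holds it until the tab closes. This is an illustration under assumptions, not Cinny or SDK code: `claimCryptoWriter`, `WriterRole`, the `LockManagerLike` shape and the lock name are all hypothetical, and the lock manager is passed in (in a browser, `navigator.locks`) so the logic can be exercised outside one.

```typescript
// Hypothetical minimal subset of the Web Locks API used by this sketch.
interface LockLike {
  name: string;
}
interface LockManagerLike {
  request(
    name: string,
    options: { ifAvailable: boolean },
    callback: (lock: LockLike | null) => Promise<unknown>
  ): Promise<unknown>;
}

// 'primary'   = this tab holds the lock and may own the crypto store
// 'secondary' = another tab already holds it
// 'unknown'   = Web Locks unavailable or the request failed; nothing can be concluded
type WriterRole = 'primary' | 'secondary' | 'unknown';

function claimCryptoWriter(
  locks: LockManagerLike | undefined,
  lockName: string
): Promise<WriterRole> {
  if (!locks) return Promise.resolve<WriterRole>('unknown');
  return new Promise<WriterRole>((resolve) => {
    locks
      .request(lockName, { ifAvailable: true }, (lock) => {
        if (lock === null) {
          // With ifAvailable, a null lock means someone else holds it.
          resolve('secondary');
          return Promise.resolve();
        }
        resolve('primary');
        // Never settle: the lock stays held until the tab goes away.
        return new Promise<void>(() => {});
      })
      .then(undefined, () => resolve('unknown'));
  });
}
```

A caller would presumably derive `lockName` from the user id and device id and check the result before the crypto store is opened. What a `'secondary'` tab should then do (warn, or go read-only for crypto) is left open here.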
 ### 6.3 Concrete PREVENTIVE client mitigations (new — buildable, don't need a call)
 Ordered by value/effort. These reduce the _recurrence_ of KE-1; they don't heal
 an already-diverged device (that still needs remediation option 1: clean
 logout+login).
 1. **Request persistent storage on login — `navigator.storage.persist()`**
    _(cheapest, highest value)_. Idempotent, side-effect only, no behavior change
    if the browser denies it (note that Firefox may show a permission prompt,
    while Chromium decides silently from its own heuristics). Directly prevents
    the eviction-induced divergence in 6.2. Best placed at app entry alongside
    the other module-scope calls (NOT in `initMatrix.ts`, which is off-limits),
    e.g. a one-liner in `ClientRoot`/app bootstrap:
    `if (navigator.storage?.persist) navigator.storage.persist();`
   Optionally surface `navigator.storage.persisted()` in the Crypto Diagnostics
   card so a capture records whether the store was evictable.
 2. **Multi-tab guard** _(medium)_. Detect a second tab of the same session
   (BroadcastChannel or the Web Locks API) and either (a) warn "Lotus is open in
   another tab — encryption may misbehave", or (b) make secondary tabs read-only
   for crypto. Prevents trigger #2.
 3. **Loop detection → recovery prompt** _(medium)_. Watch for repeated
   `keys/upload` 400 `M_UNKNOWN … already exists` (the client sees the rejection);
   after N in a window, stop hammering and surface a "Reset encryption on this
   device (log out & back in)" prompt instead of looping silently.
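The loop detection in item 3 amounts to a sliding-window counter over failed uploads. A minimal sketch follows, assuming the client can observe each rejected `keys/upload` as a status, errcode and message. `UploadFailure`, `isOtkConflict` and `OtkUploadLoopDetector` are hypothetical names, and the threshold and window ("N in a window") are left to the caller because the notes above do not fix them.

```typescript
// Assumed shape of a failed keys/upload as seen by the client.
interface UploadFailure {
  status: number;
  errcode?: string;
  message?: string;
}

// Matches the KE-1 signature: HTTP 400, M_UNKNOWN, "... already exists".
function isOtkConflict(failure: UploadFailure): boolean {
  return (
    failure.status === 400 &&
    failure.errcode === 'M_UNKNOWN' &&
    (failure.message ?? '').includes('already exists')
  );
}

class OtkUploadLoopDetector {
  private readonly threshold: number;
  private readonly windowMs: number;
  private hits: number[] = [];

  constructor(threshold: number, windowMs: number) {
    this.threshold = threshold;
    this.windowMs = windowMs;
  }

  // Records one failure at time nowMs. Returns true when this failure is an
  // OTK conflict and at least `threshold` conflicts fall within the window.
  record(failure: UploadFailure, nowMs: number): boolean {
    if (!isOtkConflict(failure)) return false;
    this.hits.push(nowMs);
    // Keep only conflicts that are still inside the window.
    this.hits = this.hits.filter((t) => nowMs - t <= this.windowMs);
    return this.hits.length >= this.threshold;
  }
}
```

Once `record` returns true, the caller would stop retrying and show the recovery prompt. Passing the clock in as `nowMs` keeps the detector deterministic and easy to test.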
 ### 6.4 Secondary KE-2 hypothesis to test in the capture
 crypto-wasm **18.3.0 tightened Olm to-device validation** (sender-spoof check +
 MSC4147). It's therefore possible KE-2's `WARN … unexpected encrypted to-device
 event … io.element.call.encryption_keys` is **partly** the new validation
 rejecting EC's media-key events, not _only_ the missing-Olm-session downstream of
 KE-1. **Capture discriminator:** if KE-2 still occurs in a call where OTK counts
 are healthy and no KE-1 storm is present (Q1 = NO), suspect the to-device
 validation path (EC ↔ rust-crypto 18.3.x), not KE-1. If KE-2 only ever co-occurs
 with the KE-1 storm, the original KE-1⇒KE-2 chain stands.
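The discriminator can be written down as a small decision function over the facts recorded in one capture. This is only a sketch of the reasoning above: `CallCapture`, `Ke2Verdict` and `classifyKe2` are hypothetical names, and a single capture can at most be consistent with the KE-1⇒KE-2 chain, since "only ever co-occurs" is a claim across many captures.

```typescript
// Facts a single §4 capture is assumed to yield (field names are illustrative).
interface CallCapture {
  ke2Observed: boolean; // KE-2 symptoms seen during the call
  otkCountsHealthy: boolean; // OTK counts looked healthy
  ke1StormPresent: boolean; // the keys/upload 400 storm was present (Q1 = YES)
}

type Ke2Verdict =
  | 'no-ke2'
  | 'suspect-to-device-validation'
  | 'consistent-with-ke1-chain'
  | 'inconclusive';

function classifyKe2(capture: CallCapture): Ke2Verdict {
  if (!capture.ke2Observed) return 'no-ke2';
  // KE-2 with healthy OTKs and no storm points away from KE-1.
  if (capture.otkCountsHealthy && !capture.ke1StormPresent) {
    return 'suspect-to-device-validation';
  }
  // KE-2 alongside the storm fits the original KE-1 => KE-2 chain.
  if (capture.ke1StormPresent) return 'consistent-with-ke1-chain';
  // Unhealthy OTK counts without a storm: not covered by the discriminator.
  return 'inconclusive';
}
```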
 ### 6.5 What to do now vs. at capture
 - **Now (no call needed):** ship 6.3.1 (`persist()`) — it's safe and preventive.
  Consider 6.3.3 (loop detection) as a follow-up.
 - **At the next glitchy call:** run the §4 capture; answer Q1 (divergence?) and
  6.4's discriminator. For any _currently_ stuck device, apply remediation
  option 1: a clean **logout + login**, not just "clear storage". Clearing
  storage without `mx.logout()` leaves the server device and its OTKs in place
  and can re-trigger the divergence.
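For the "ship now" item, a slightly more defensive variant of the 6.3 one-liner might look like the sketch below. It reports the outcome so the Crypto Diagnostics card could display it, and it swallows failures so bootstrap is never affected. The names `requestPersistentStorage`, `PersistOutcome` and `StorageManagerLike` are hypothetical; only `persist()` and `persisted()` are real `StorageManager` methods, and the storage object is passed in so the function can run outside a browser.

```typescript
// Subset of the DOM StorageManager used here; both methods are optional
// because older browsers and insecure contexts may not provide them.
interface StorageManagerLike {
  persist?: () => Promise<boolean>;
  persisted?: () => Promise<boolean>;
}

type PersistOutcome = 'granted' | 'denied' | 'unsupported' | 'error';

async function requestPersistentStorage(
  storage: StorageManagerLike | undefined
): Promise<PersistOutcome> {
  if (!storage || typeof storage.persist !== 'function') return 'unsupported';
  try {
    // Skip the request if the origin is already persistent.
    if (typeof storage.persisted === 'function' && (await storage.persisted())) {
      return 'granted';
    }
    return (await storage.persist()) ? 'granted' : 'denied';
  } catch {
    // Never let a storage failure break app bootstrap.
    return 'error';
  }
}
```

In the app this would be called once at bootstrap, for example `void requestPersistentStorage(navigator.storage)`, with the resolved value optionally stored for the diagnostics card.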