diff --git a/LOTUS_BUGS.md b/LOTUS_BUGS.md index b50940495..a145eb2e0 100644 --- a/LOTUS_BUGS.md +++ b/LOTUS_BUGS.md @@ -112,6 +112,19 @@ signed_curve25519:AAAAAAAAAGQ already exists. Old key: {…} new key: {…}` — mismatch, OTK id-counter desync, RC-SDK (`41.6.0-rc.0`) regression, or a Synapse OTK bug. Repro signature: grep console for `already exists`. **Extreme — planning session.** + **Update 2026-07 (investigation §6):** upstream `matrix-rust-sdk#5200` (still + OPEN) confirms the mechanism — on the 400, `mark_request_as_sent()` never fires + so the SDK re-issues the identical upload forever. **`41.7.0` does NOT fix it** + (crypto-wasm 17→18.3.1 has no OTK/upload change; 18.3.x was to-device security + only) — the SDK-pin lever is closed. Root cause = **store↔server OTK + divergence**; the leading **web-specific** trigger is that cinny never calls + **`navigator.storage.persist()`**, so the IndexedDB crypto store is evictable + while the `localStorage` session/device-id survives → device resurrects with a + blank store → re-uploads OTKs the server still holds. **Actionable preventive + fix (buildable now, no call needed):** request persistent storage on login + (+ optional multi-tab guard + 400-loop→recovery-prompt). Healing an already- + diverged device still needs a clean **logout+login** (not just "clear + storage"). See `LOTUS_E2EE_INVESTIGATION.md` §6. - **KE-2 — Element Call media keys not arriving/decrypting → audio & video cut out (CRITICAL).** `MissingKey: missing key at index N for participant @user`, `skipping decryption diff --git a/LOTUS_E2EE_INVESTIGATION.md b/LOTUS_E2EE_INVESTIGATION.md index c1d7434a4..dee0fc1f8 100644 --- a/LOTUS_E2EE_INVESTIGATION.md +++ b/LOTUS_E2EE_INVESTIGATION.md @@ -405,3 +405,100 @@ signature, message }`, most-recent-last). to the `Box direction="Column" gap="700"` list (guarded by the existing `developerTools` flag), right after the "Access Token" card. 
It pulls `mx` from `useMatrixClient()` itself, so it just needs to be placed in the tree. + +--- + +## 6. 2026-07 investigation update — 41.7.0 delta + web-specific root cause + +New findings this session (code-read + upstream issue triage). These **sharpen +KE-1's root cause and close the "just upgrade the SDK" lever**. + +### 6.1 The 41.7.0 upgrade does NOT fix KE-1 (lever closed) + +We are now on **`matrix-js-sdk@41.7.0`** → **`@matrix-org/matrix-sdk-crypto-wasm@18.3.1`** +(was `41.6.0-rc.0` when KE-1/2 were observed). Checked both changelogs: + +- 41.7.0's only crypto line is the **security bump to crypto-wasm 18.3.1**. No + OTK / keys-upload / Olm-session change. +- crypto-wasm 17.0 → 18.3.1: **no entry** for one-time-keys, keys/upload, + "already exists", or upload conflicts. The 18.3.x work was **to-device + security hardening** (vodozemac 0.10; sender-spoofing check via + `sender_device_keys`; MSC4147 validation) — unrelated to the OTK loop. +- Upstream **`matrix-rust-sdk#5200`** ("OlmMachine constantly tries to upload + keys when restoring session") is **still OPEN** (as of mid-2025). The loop + mechanism is confirmed there: on the 400, `mark_request_as_sent()` never + fires, so the keys stay "unshared" and the SDK re-issues the identical failing + upload every cycle → the storm. + +⇒ **Remediation option 3 (SDK pin) is exhausted for KE-1.** Do not expect a +version bump to help; the fix is store-hygiene, below. + +### 6.2 Confirmed root cause + the web-specific trigger we can act on + +Upstream `#5200` + `#1415` pin the root condition to **rust-crypto store ↔ +server OTK divergence**, from one of: + +1. **Crypto store reset/restore without deregistering the device server-side** + — the store forgets OTKs it already published; the server still holds them. +2. **Unsafe concurrent access to the crypto store** — e.g. the **same session + open in multiple browser tabs**, each running its own OlmMachine against the + one IndexedDB crypto store. +3. 
A store that isn't durably persisted, so a restore can't track what was sent. + +**Cinny is a web client and hits two of these by construction (verified in code):** + +- **No `navigator.storage.persist()` anywhere** (`grep` clean). The rust-crypto + IndexedDB store is therefore **evictable under storage pressure** — while the + **access token + device id live in `localStorage`** (N97), which browsers evict + _less_ aggressively. Partial eviction ⇒ the device **resurrects with a blank + crypto store but the SAME device id** ⇒ it re-uploads OTKs the server still + holds ⇒ the **exact KE-1 "already exists" divergence**, with **no user action** + and no visible cause. This is the leading hypothesis for a self-hosted web + deployment. +- **No multi-tab crypto guard** (no `navigator.locks` / `BroadcastChannel` + leader election in `src/`). `initMatrix.ts` calls `mx.initRustCrypto()` with no + single-writer coordination, so 2+ tabs = concurrent store access = trigger #2. + +### 6.3 Concrete PREVENTIVE client mitigations (new — buildable, don't need a call) + +Ordered by value/effort. These reduce the _recurrence_ of KE-1; they don't heal +an already-diverged device (that still needs remediation option 1: clean +logout+login). + +1. **Request persistent storage on login — `navigator.storage.persist()`** + _(cheapest, highest value)_. Idempotent, side-effect only, no behavior change + if the browser denies it. Directly prevents the eviction-induced divergence in + 6.2. Best placed at app entry alongside the other module-scope calls (NOT in + `initMatrix.ts`, which is off-limits) — e.g. a one-liner in `ClientRoot`/app + bootstrap: `if (navigator.storage?.persist) navigator.storage.persist();` + Optionally surface `navigator.storage.persisted()` in the Crypto Diagnostics + card so a capture records whether the store was evictable. +2. **Multi-tab guard** _(medium)_. 
Detect a second tab of the same session + (BroadcastChannel or the Web Locks API) and either (a) warn "Lotus is open in + another tab — encryption may misbehave", or (b) make secondary tabs read-only + for crypto. Prevents trigger #2. +3. **Loop detection → recovery prompt** _(medium)_. Watch for repeated + `keys/upload` 400 `M_UNKNOWN … already exists` (the client sees the rejection); + after N in a window, stop hammering and surface a "Reset encryption on this + device (log out & back in)" prompt instead of looping silently. + +### 6.4 Secondary KE-2 hypothesis to test in the capture + +crypto-wasm **18.3.0 tightened Olm to-device validation** (sender-spoof check + +MSC4147). It's therefore possible KE-2's `WARN … unexpected encrypted to-device +event … io.element.call.encryption_keys` is **partly** the new validation +rejecting EC's media-key events, not _only_ the missing-Olm-session downstream of +KE-1. **Capture discriminator:** if KE-2 still occurs in a call where OTK counts +are healthy and no KE-1 storm is present (Q1 = NO), suspect the to-device +validation path (EC ↔ rust-crypto 18.3.x), not KE-1. If KE-2 only ever co-occurs +with the KE-1 storm, the original KE-1⇒KE-2 chain stands. + +### 6.5 What to do now vs. at capture + +- **Now (no call needed):** ship 6.3.1 (`persist()`) — it's safe and preventive. + Consider 6.3.3 (loop detection) as a follow-up. +- **At the next glitchy call:** run the §4 capture; answer Q1 (divergence?) and + 6.4's discriminator. For any _currently_ stuck device, remediation option 1 + (clean **logout + login**, not just "clear storage" — clearing storage without + `mx.logout()` leaves the server device + its OTKs and can re-trigger the + divergence).
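
The 6.3.1 mitigation recommended above can be sketched roughly as follows. This is a minimal illustration, not code from Cinny or matrix-js-sdk: `requestPersistentStorage`, `PersistOutcome`, and `StorageManagerLike` are hypothetical names introduced here. The only platform calls assumed are the standard `navigator.storage.persist()` and `navigator.storage.persisted()`, both of which return a `Promise<boolean>`. The storage manager is passed in as a parameter so the helper can be exercised outside a browser.

```typescript
// Minimal structural type for the parts of the browser StorageManager used here.
// Declared locally (hypothetical name) so the sketch type-checks without the DOM lib.
interface StorageManagerLike {
  persist?: () => Promise<boolean>;
  persisted?: () => Promise<boolean>;
}

// Hypothetical result type for illustration; not part of Cinny or matrix-js-sdk.
type PersistOutcome = "granted" | "denied" | "unsupported" | "error";

// Looks up navigator.storage if the environment provides it.
function defaultStorageManager(): StorageManagerLike | undefined {
  const g = globalThis as unknown as {
    navigator?: { storage?: StorageManagerLike };
  };
  return g.navigator?.storage;
}

// Asks the browser to exempt this origin's storage (including IndexedDB)
// from eviction under storage pressure. Never throws; a denial or a missing
// API leaves behavior unchanged.
async function requestPersistentStorage(
  storage: StorageManagerLike | undefined = defaultStorageManager(),
): Promise<PersistOutcome> {
  if (!storage || typeof storage.persist !== "function") return "unsupported";
  try {
    // Skip the request when the origin is already marked persistent.
    if (typeof storage.persisted === "function" && (await storage.persisted())) {
      return "granted";
    }
    return (await storage.persist()) ? "granted" : "denied";
  } catch {
    return "error";
  }
}
```

At app bootstrap this could be fired without awaiting, for example `void requestPersistentStorage();`, and the returned outcome could be recorded in the Crypto Diagnostics card so a capture shows whether the store was evictable. Returning a status string rather than a bare boolean is a design choice made here to distinguish "browser denied" from "API missing" in that capture.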