docs(e2ee): investigation update — 41.7.0 delta + web-specific KE-1 root cause

Code-read + upstream-issue triage this session:
- 41.7.0 / crypto-wasm 18.3.1 does NOT fix KE-1 (no OTK/upload change; #5200
  still open) — the SDK-pin remediation lever is closed.
- Confirmed root cause = rust-crypto store <-> Synapse OTK divergence; the
  leading web trigger is that cinny never requests persistent storage, so the
  IndexedDB crypto store is evictable while the localStorage session survives.
- New buildable preventive mitigation: navigator.storage.persist() on login
  (+ multi-tab guard, 400-loop recovery prompt). Added as §6 with a secondary
  KE-2 to-device-validation hypothesis and capture discriminators.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-02 15:14:46 -04:00
parent c82ab5c7f5
commit 81904372bc
2 changed files with 110 additions and 0 deletions
+13
@@ -112,6 +112,19 @@ signed_curve25519:AAAAAAAAAGQ already exists. Old key: {…} new key: {…}` —
mismatch, OTK id-counter desync, RC-SDK (`41.6.0-rc.0`) regression, or a
Synapse OTK bug. Repro signature: grep console for `already exists`.
**Extreme — planning session.**
**Update 2026-07 (investigation §6):** upstream `matrix-rust-sdk#5200` (still
OPEN) confirms the mechanism — on the 400, `mark_request_as_sent()` never fires
so the SDK re-issues the identical upload forever. **`41.7.0` does NOT fix it**
(crypto-wasm 17→18.3.1 has no OTK/upload change; 18.3.x was to-device security
only) — the SDK-pin lever is closed. Root cause = **store↔server OTK
divergence**; the leading **web-specific** trigger is that cinny never calls
**`navigator.storage.persist()`**, so the IndexedDB crypto store is evictable
while the `localStorage` session/device-id survives → device resurrects with a
blank store → re-uploads OTKs the server still holds. **Actionable preventive
fix (buildable now, no call needed):** request persistent storage on login
(+ optional multi-tab guard + 400-loop→recovery-prompt). Healing an already-
diverged device still needs a clean **logout+login** (not just "clear
storage"). See `LOTUS_E2EE_INVESTIGATION.md` §6.
- **KE-2 — Element Call media keys not arriving/decrypting → audio & video cut out (CRITICAL).**
`MissingKey: missing key at index N for participant @user`, `skipping decryption
+97
@@ -405,3 +405,100 @@ signature, message }`, most-recent-last).
to the `Box direction="Column" gap="700"` list (guarded by the existing
`developerTools` flag), right after the "Access Token" card. It pulls `mx`
from `useMatrixClient()` itself, so it just needs to be placed in the tree.
---
## 6. 2026-07 investigation update — 41.7.0 delta + web-specific root cause
New findings this session (code-read + upstream issue triage). These **sharpen
KE-1's root cause and close the "just upgrade the SDK" lever**.
### 6.1 The 41.7.0 upgrade does NOT fix KE-1 (lever closed)
We are now on **`matrix-js-sdk@41.7.0`** → **`@matrix-org/matrix-sdk-crypto-wasm@18.3.1`**
(was `41.6.0-rc.0` when KE-1/2 were observed). Checked both changelogs:
- 41.7.0's only crypto line is the **security bump to crypto-wasm 18.3.1**. No
OTK / keys-upload / Olm-session change.
- crypto-wasm 17.0 → 18.3.1: **no entry** for one-time-keys, keys/upload,
"already exists", or upload conflicts. The 18.3.x work was **to-device
security hardening** (vodozemac 0.10; sender-spoofing check via
`sender_device_keys`; MSC4147 validation) — unrelated to the OTK loop.
- Upstream **`matrix-rust-sdk#5200`** ("OlmMachine constantly tries to upload
keys when restoring session") is **still OPEN** (as of mid-2025). The loop
mechanism is confirmed there: on the 400, `mark_request_as_sent()` never
fires, so the keys stay "unshared" and the SDK re-issues the identical failing
upload every cycle → the storm.
⇒ **Remediation option 3 (SDK pin) is exhausted for KE-1.** Do not expect a
version bump to help; the fix is store-hygiene, below.
### 6.2 Confirmed root cause + the web-specific trigger we can act on
Upstream `#5200` + `#1415` pin the root condition to **rust-crypto store ↔
server OTK divergence**, from one of:
1. **Crypto store reset/restore without deregistering the device server-side**
— the store forgets OTKs it already published; the server still holds them.
2. **Unsafe concurrent access to the crypto store** — e.g. the **same session
open in multiple browser tabs**, each running its own OlmMachine against the
one IndexedDB crypto store.
3. A store that isn't durably persisted, so a restore can't track what was sent.
**Cinny is a web client and hits two of these by construction (verified in code):**
- **No `navigator.storage.persist()` anywhere** (`grep` clean). The rust-crypto
IndexedDB store is therefore **evictable under storage pressure** — while the
**access token + device id live in `localStorage`** (N97), which browsers evict
_less_ aggressively. Partial eviction ⇒ the device **resurrects with a blank
crypto store but the SAME device id** ⇒ it re-uploads OTKs the server still
holds ⇒ the **exact KE-1 "already exists" divergence**, with **no user action**
and no visible cause. This is the leading hypothesis for a self-hosted web
deployment.
- **No multi-tab crypto guard** (no `navigator.locks` / `BroadcastChannel`
leader election in `src/`). `initMatrix.ts` calls `mx.initRustCrypto()` with no
single-writer coordination, so 2+ tabs = concurrent store access = trigger #2.
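
The missing single-writer coordination could be sketched with the Web Locks API as below. This is a minimal sketch under stated assumptions, not cinny's code: `claimCryptoStore`, `LockManagerLike`, and `CryptoClaim` are hypothetical names, and the interface is a narrowed shape intended to match `navigator.locks` so the helper can be exercised outside a browser.

```typescript
// Narrowed shape of the Web Locks surface used here (assumption: in a browser
// the caller passes `navigator.locks`).
interface LockManagerLike {
  request(
    name: string,
    options: { ifAvailable: boolean },
    callback: (lock: unknown | null) => Promise<void>,
  ): Promise<void>;
}

interface CryptoClaim {
  // true if this tab holds the lock and may run the OlmMachine as sole writer
  primary: boolean;
  // releases the lock; a no-op for secondary tabs
  release: () => void;
}

// Hypothetical helper: tries to take an exclusive lock without waiting.
// With `ifAvailable: true` the callback receives `null` when another tab
// already holds the lock.
function claimCryptoStore(
  locks: LockManagerLike | undefined,
  lockName: string,
): Promise<CryptoClaim> {
  if (!locks) {
    // No Web Locks support: a second tab cannot be detected, so assume sole tab.
    return Promise.resolve({ primary: true, release: () => {} });
  }
  return new Promise<CryptoClaim>((resolve, reject) => {
    locks
      .request(lockName, { ifAvailable: true }, (lock) => {
        if (lock === null) {
          resolve({ primary: false, release: () => {} });
          return Promise.resolve();
        }
        // The lock stays held until this promise settles, i.e. until
        // release() is called (the browser also drops it when the tab closes).
        return new Promise<void>((done) => {
          resolve({ primary: true, release: done });
        });
      })
      .catch(reject);
  });
}
```

A caller would pick a lock name that is unique per session, for example a string built from the device id (the exact naming is an assumption), and show the warning or skip crypto writes when `primary` is `false`.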
### 6.3 Concrete PREVENTIVE client mitigations (new — buildable, don't need a call)
Ordered by value/effort. These reduce the _recurrence_ of KE-1; they don't heal
an already-diverged device (that still needs remediation option 1: clean
logout+login).
1. **Request persistent storage on login — `navigator.storage.persist()`**
_(cheapest, highest value)_. Idempotent and safe to call on every start; if the
browser denies the request, storage stays best-effort exactly as it is today.
The call is not invisible everywhere: Firefox shows a permission prompt, while
Chromium decides silently from its own heuristics. Directly prevents the
eviction-induced divergence in 6.2. Best placed at app entry alongside the
other module-scope calls (NOT in `initMatrix.ts`, which is off-limits), e.g. a
one-liner in `ClientRoot`/app bootstrap:
`if (navigator.storage?.persist) navigator.storage.persist().catch(() => undefined);`
Optionally surface `navigator.storage.persisted()` in the Crypto Diagnostics
card so a capture records whether the store was evictable.
2. **Multi-tab guard** _(medium)_. Detect a second tab of the same session
(BroadcastChannel or the Web Locks API) and either (a) warn "Lotus is open in
another tab — encryption may misbehave", or (b) make secondary tabs read-only
for crypto. Prevents trigger #2.
3. **Loop detection → recovery prompt** _(medium)_. Watch for repeated
`keys/upload` 400 `M_UNKNOWN … already exists` (the client sees the rejection);
after N in a window, stop hammering and surface a "Reset encryption on this
device (log out & back in)" prompt instead of looping silently.
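
Mitigation 1 could be wrapped so the outcome is available to the Crypto Diagnostics card. The following is a minimal sketch, assuming a hypothetical helper `requestPersistentStorage` and a narrowed `StorageLike` interface (neither exists in cinny or matrix-js-sdk); `persist()` and `persisted()` are the standard `StorageManager` methods.

```typescript
// Narrowed shape of StorageManager so the helper can be exercised outside a
// browser (assumption: the real caller passes `navigator.storage`).
interface StorageLike {
  persist?: () => Promise<boolean>;
  persisted?: () => Promise<boolean>;
}

type PersistOutcome = 'already-persistent' | 'granted' | 'denied' | 'unsupported';

// Hypothetical helper: requests persistent storage and reports what happened.
// It never throws, so it cannot block login or app bootstrap.
async function requestPersistentStorage(
  storage: StorageLike | undefined,
): Promise<PersistOutcome> {
  if (!storage?.persist) return 'unsupported';
  try {
    // Skip the request (and any browser prompt) if storage is already persistent.
    if (storage.persisted && (await storage.persisted())) {
      return 'already-persistent';
    }
    return (await storage.persist()) ? 'granted' : 'denied';
  } catch {
    // Treat a throwing implementation like a denial.
    return 'denied';
  }
}

// Browser usage (not executed here):
//   requestPersistentStorage(navigator.storage).then((outcome) => {
//     /* record `outcome` for the diagnostics card */
//   });
```

Returning a small string union rather than a boolean keeps `denied` distinguishable from `unsupported`, which matters when a capture needs to show whether the store was evictable.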
### 6.4 Secondary KE-2 hypothesis to test in the capture
crypto-wasm **18.3.0 tightened Olm to-device validation** (sender-spoof check +
MSC4147). It's therefore possible KE-2's `WARN … unexpected encrypted to-device
event … io.element.call.encryption_keys` is **partly** the new validation
rejecting EC's media-key events, not _only_ the missing-Olm-session downstream of
KE-1. **Capture discriminator:** if KE-2 still occurs in a call where OTK counts
are healthy and no KE-1 storm is present (Q1 = NO), suspect the to-device
validation path (EC ↔ rust-crypto 18.3.x), not KE-1. If KE-2 only ever co-occurs
with the KE-1 storm, the original KE-1⇒KE-2 chain stands.
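
The discriminator can be written down as a small decision function so each capture is classified the same way. This is a sketch only: `CallCapture`, `Ke2Suspect`, and `classifyKe2` are hypothetical names, and the three booleans are assumed to be filled in by hand from the §4 capture.

```typescript
// Hypothetical summary of one call capture.
interface CallCapture {
  ke2Observed: boolean; // KE-2 symptoms seen during the call
  ke1StormPresent: boolean; // repeated keys/upload 400 "already exists" in the same capture
  otkCountsHealthy: boolean; // Q1 = NO, i.e. no store/server OTK divergence
}

type Ke2Suspect =
  | 'none'
  | 'to-device-validation'
  | 'consistent-with-ke1-chain'
  | 'inconclusive';

function classifyKe2(c: CallCapture): Ke2Suspect {
  if (!c.ke2Observed) return 'none';
  // KE-2 without any KE-1 signal points at the to-device validation path.
  if (c.otkCountsHealthy && !c.ke1StormPresent) return 'to-device-validation';
  // KE-2 together with the storm fits the original KE-1 => KE-2 chain.
  if (c.ke1StormPresent) return 'consistent-with-ke1-chain';
  // Unhealthy OTK counts but no storm: the capture does not discriminate.
  return 'inconclusive';
}
```

A single capture can only be consistent with the KE-1 chain; the chain "stands" only if every KE-2 capture lands in that bucket.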
### 6.5 What to do now vs. at capture
- **Now (no call needed):** ship 6.3.1 (`persist()`) — it's safe and preventive.
Consider 6.3.3 (loop detection) as a follow-up.
- **At the next glitchy call:** run the §4 capture; answer Q1 (divergence?) and
6.4's discriminator. For any _currently_ stuck device, apply remediation
option 1: a clean **logout + login**, not just "clear storage". Clearing
storage without `mx.logout()` leaves the server-side device and its OTKs in
place and can re-trigger the divergence.
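
For the 6.3.3 follow-up, the "N in a window" rule could be a sliding-window counter like the one below. This is a minimal sketch, assuming a hypothetical `UploadLoopDetector` class that the caller feeds from wherever `keys/upload` failures become visible; the threshold, the window length, and the substring match on `already exists` are illustrative choices, not values from the SDK.

```typescript
// Hypothetical detector: counts keys/upload rejections that look like KE-1
// and reports when too many fall inside a time window.
class UploadLoopDetector {
  private failures: number[] = [];
  private readonly threshold: number;
  private readonly windowMs: number;

  constructor(threshold: number, windowMs: number) {
    this.threshold = threshold;
    this.windowMs = windowMs;
  }

  // Records one failed upload. Returns true once `threshold` matching
  // failures have occurred within the last `windowMs` milliseconds.
  // Failures that do not look like KE-1 are ignored and return false.
  record(status: number, errorMessage: string, nowMs: number): boolean {
    if (status !== 400 || !errorMessage.includes('already exists')) {
      return false;
    }
    this.failures.push(nowMs);
    const cutoff = nowMs - this.windowMs;
    // Drop timestamps that have aged out of the window.
    this.failures = this.failures.filter((t) => t > cutoff);
    return this.failures.length >= this.threshold;
  }

  // Clears state, e.g. after the user has logged out and back in.
  reset(): void {
    this.failures = [];
  }
}
```

When `record` returns `true`, the client would stop retrying and show the "Reset encryption on this device" prompt. Passing `nowMs` in explicitly, instead of reading the clock inside the class, keeps the detector deterministic in tests.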