docs(e2ee): investigation update — 41.7.0 delta + web-specific KE-1 root cause
Code-read + upstream-issue triage this session:
- 41.7.0 / crypto-wasm 18.3.1 does NOT fix KE-1 (no OTK/upload change; #5200 still open) — the SDK-pin remediation lever is closed.
- Confirmed root cause = rust-crypto store <-> Synapse OTK divergence; the leading web trigger is that cinny never requests persistent storage, so the IndexedDB crypto store is evictable while the localStorage session survives.
- New buildable preventive mitigation: navigator.storage.persist() on login (+ multi-tab guard, 400-loop recovery prompt). Added as §6 with a secondary KE-2 to-device-validation hypothesis and capture discriminators.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@@ -112,6 +112,19 @@ signed_curve25519:AAAAAAAAAGQ already exists. Old key: {…} new key: {…}` —
mismatch, OTK id-counter desync, RC-SDK (`41.6.0-rc.0`) regression, or a
Synapse OTK bug. Repro signature: grep console for `already exists`.
**Extreme — planning session.**
**Update 2026-07 (investigation §6):** upstream `matrix-rust-sdk#5200` (still
OPEN) confirms the mechanism — on the 400, `mark_request_as_sent()` never fires
so the SDK re-issues the identical upload forever. **`41.7.0` does NOT fix it**
(crypto-wasm 17→18.3.1 has no OTK/upload change; 18.3.x was to-device security
only) — the SDK-pin lever is closed. Root cause = **store↔server OTK
divergence**; the leading **web-specific** trigger is that cinny never calls
**`navigator.storage.persist()`**, so the IndexedDB crypto store is evictable
while the `localStorage` session/device-id survives → device resurrects with a
blank store → re-uploads OTKs the server still holds. **Actionable preventive
fix (buildable now, no call needed):** request persistent storage on login
(+ optional multi-tab guard + 400-loop→recovery-prompt). Healing an already-
diverged device still needs a clean **logout+login** (not just "clear
storage"). See `LOTUS_E2EE_INVESTIGATION.md` §6.

- **KE-2 — Element Call media keys not arriving/decrypting → audio & video cut out (CRITICAL).**
  `MissingKey: missing key at index N for participant @user`, `skipping decryption
@@ -405,3 +405,100 @@ signature, message }`, most-recent-last).
to the `Box direction="Column" gap="700"` list (guarded by the existing
`developerTools` flag), right after the "Access Token" card. It pulls `mx`
from `useMatrixClient()` itself, so it just needs to be placed in the tree.

---

## 6. 2026-07 investigation update — 41.7.0 delta + web-specific root cause

New findings this session (code-read + upstream issue triage). These **sharpen
KE-1's root cause and close the "just upgrade the SDK" lever**.

### 6.1 The 41.7.0 upgrade does NOT fix KE-1 (lever closed)

We are now on **`matrix-js-sdk@41.7.0`** → **`@matrix-org/matrix-sdk-crypto-wasm@18.3.1`**
(was `41.6.0-rc.0` when KE-1/2 were observed). Checked both changelogs:

- 41.7.0's only crypto line is the **security bump to crypto-wasm 18.3.1**. No
  OTK / keys-upload / Olm-session change.
- crypto-wasm 17.0 → 18.3.1: **no entry** for one-time-keys, keys/upload,
  "already exists", or upload conflicts. The 18.3.x work was **to-device
  security hardening** (vodozemac 0.10; sender-spoofing check via
  `sender_device_keys`; MSC4147 validation) — unrelated to the OTK loop.
- Upstream **`matrix-rust-sdk#5200`** ("OlmMachine constantly tries to upload
  keys when restoring session") is **still OPEN** (as of mid-2025). The loop
  mechanism is confirmed there: on the 400, `mark_request_as_sent()` never
  fires, so the keys stay "unshared" and the SDK re-issues the identical failing
  upload every cycle → the storm.

⇒ **Remediation option 3 (SDK pin) is exhausted for KE-1.** Do not expect a
version bump to help; the fix is store-hygiene, below.

### 6.2 Confirmed root cause + the web-specific trigger we can act on

Upstream `#5200` + `#1415` pin the root condition to **rust-crypto store ↔
server OTK divergence**, from one of:

1. **Crypto store reset/restore without deregistering the device server-side**
   — the store forgets OTKs it already published; the server still holds them.
2. **Unsafe concurrent access to the crypto store** — e.g. the **same session
   open in multiple browser tabs**, each running its own OlmMachine against the
   one IndexedDB crypto store.
3. A store that isn't durably persisted, so a restore can't track what was sent.

**Cinny is a web client and hits two of these by construction (verified in code):**

- **No `navigator.storage.persist()` anywhere** (`grep` clean). The rust-crypto
  IndexedDB store is therefore **evictable under storage pressure** — while the
  **access token + device id live in `localStorage`** (N97), which browsers evict
  _less_ aggressively. Partial eviction ⇒ the device **resurrects with a blank
  crypto store but the SAME device id** ⇒ it re-uploads OTKs the server still
  holds ⇒ the **exact KE-1 "already exists" divergence**, with **no user action**
  and no visible cause. This is the leading hypothesis for a self-hosted web
  deployment.
- **No multi-tab crypto guard** (no `navigator.locks` / `BroadcastChannel`
  leader election in `src/`). `initMatrix.ts` calls `mx.initRustCrypto()` with no
  single-writer coordination, so 2+ tabs = concurrent store access = trigger #2.

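As a rough illustration of the single-writer coordination that is missing today, the sketch below uses the Web Locks API (`navigator.locks.request` with `ifAvailable: true`) to decide whether a tab is the first one for a session. The helper name `claimCryptoWriter`, the `LockManagerLike` / `LockLike` stand-in types, and the lock name are assumptions made for illustration; nothing like this exists in the codebase yet, and where it would be wired in is left open.

```typescript
// Structural stand-ins for the Web Locks API types (Lock / LockManager), so the
// sketch can be exercised with a fake outside a browser.
interface LockLike {
  name: string;
}
interface LockManagerLike {
  request(
    name: string,
    options: { ifAvailable: boolean },
    callback: (lock: LockLike | null) => Promise<unknown>,
  ): Promise<unknown>;
}

type TabRole = "primary" | "secondary" | "unsupported";

// Hypothetical helper: the first tab to call this holds the lock until it
// closes; later tabs get "secondary" and should warn or avoid crypto writes.
function claimCryptoWriter(
  locks: LockManagerLike | undefined,
  lockName: string,
): Promise<TabRole> {
  if (!locks) return Promise.resolve<TabRole>("unsupported");
  return new Promise<TabRole>((resolve, reject) => {
    locks
      .request(lockName, { ifAvailable: true }, (lock) => {
        if (lock === null) {
          // With ifAvailable, a null lock means another holder already exists.
          resolve("secondary");
          return Promise.resolve();
        }
        resolve("primary");
        // Never settle: the lock is released only when the tab goes away.
        return new Promise<never>(() => {});
      })
      .catch(reject);
  });
}
```

In a browser the first argument would be `navigator.locks`, and the lock name would presumably be derived from the session (for example the device id) so that different accounts do not block each other. A `"secondary"` result is the point at which the warning or read-only mode could kick in.
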
### 6.3 Concrete PREVENTIVE client mitigations (new — buildable, don't need a call)

Ordered by value/effort. These reduce the _recurrence_ of KE-1; they don't heal
an already-diverged device (that still needs remediation option 1: clean
logout+login).

1. **Request persistent storage on login — `navigator.storage.persist()`**
   _(cheapest, highest value)_. Idempotent, side-effect only, no behavior change
   if the browser denies it. Directly prevents the eviction-induced divergence in
   6.2. Best placed at app entry alongside the other module-scope calls (NOT in
   `initMatrix.ts`, which is off-limits) — e.g. a one-liner in `ClientRoot`/app
   bootstrap: `if (navigator.storage?.persist) navigator.storage.persist();`
   Optionally surface `navigator.storage.persisted()` in the Crypto Diagnostics
   card so a capture records whether the store was evictable.
2. **Multi-tab guard** _(medium)_. Detect a second tab of the same session
   (BroadcastChannel or the Web Locks API) and either (a) warn "Lotus is open in
   another tab — encryption may misbehave", or (b) make secondary tabs read-only
   for crypto. Prevents trigger #2.
3. **Loop detection → recovery prompt** _(medium)_. Watch for repeated
   `keys/upload` 400 `M_UNKNOWN … already exists` (the client sees the rejection);
   after N in a window, stop hammering and surface a "Reset encryption on this
   device (log out & back in)" prompt instead of looping silently.

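A slightly fuller version of the item 1 one-liner, as a minimal sketch: it checks `persisted()` first, then calls `persist()`, and reports the outcome so the Crypto Diagnostics card could display it. The helper name `requestPersistentStorage`, the `PersistCapableStorage` stand-in type, and the outcome strings are hypothetical; only `navigator.storage.persist()` and `navigator.storage.persisted()` are real browser APIs.

```typescript
// Structural stand-in for the browser's StorageManager, so the sketch can be
// type-checked and exercised outside a browser as well.
interface PersistCapableStorage {
  persist?: () => Promise<boolean>;
  persisted?: () => Promise<boolean>;
}

type PersistOutcome = "granted" | "denied" | "unsupported";

// Hypothetical helper (not in the codebase): asks for persistent storage once
// and reports the outcome. It never rejects, so it is safe to fire and forget.
async function requestPersistentStorage(
  storage: PersistCapableStorage | undefined,
): Promise<PersistOutcome> {
  if (!storage || !storage.persist) return "unsupported";
  try {
    // Skip the request when the origin is already marked persistent.
    if (storage.persisted && (await storage.persisted())) return "granted";
    return (await storage.persist()) ? "granted" : "denied";
  } catch {
    // Treat any browser-side failure the same as a denial.
    return "denied";
  }
}
```

At app bootstrap this would be invoked as `void requestPersistentStorage(navigator.storage)`. A `"denied"` or `"unsupported"` result changes nothing for the user, which matches the "no behavior change if the browser denies it" property above.
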
### 6.4 Secondary KE-2 hypothesis to test in the capture

crypto-wasm **18.3.0 tightened Olm to-device validation** (sender-spoof check +
MSC4147). It's therefore possible KE-2's `WARN … unexpected encrypted to-device
event … io.element.call.encryption_keys` is **partly** the new validation
rejecting EC's media-key events, not _only_ the missing-Olm-session downstream of
KE-1. **Capture discriminator:** if KE-2 still occurs in a call where OTK counts
are healthy and no KE-1 storm is present (Q1 = NO), suspect the to-device
validation path (EC ↔ rust-crypto 18.3.x), not KE-1. If KE-2 only ever co-occurs
with the KE-1 storm, the original KE-1⇒KE-2 chain stands.

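The discriminator can be written down as a small decision function, which may help keep capture notes consistent between sessions. This is only a sketch: the `CallCapture` field names and the verdict strings are made up for illustration, and the `"inconclusive"` branch (unhealthy OTK counts without a visible storm) is an assumption the text above does not cover.

```typescript
// Hypothetical summary of one call capture; field names are illustrative.
interface CallCapture {
  ke2Observed: boolean; // MissingKey / unexpected to-device warnings were seen
  otkCountsHealthy: boolean; // OTK counts looked healthy during the call
  ke1StormPresent: boolean; // repeated keys/upload 400 "already exists"
}

type Ke2Verdict =
  | "no-ke2"
  | "suspect-to-device-validation"
  | "consistent-with-ke1-chain"
  | "inconclusive";

function classifyKe2(capture: CallCapture): Ke2Verdict {
  if (!capture.ke2Observed) return "no-ke2";
  // KE-2 without any KE-1 symptoms points at the 18.3.x validation path.
  if (capture.otkCountsHealthy && !capture.ke1StormPresent) {
    return "suspect-to-device-validation";
  }
  // KE-2 alongside the storm matches the original KE-1 => KE-2 chain.
  if (capture.ke1StormPresent) return "consistent-with-ke1-chain";
  // Assumption: unhealthy counts but no storm is not covered by the text.
  return "inconclusive";
}
```

A single capture only classifies one call; the "only ever co-occurs" condition needs the verdicts from several calls to agree.
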
### 6.5 What to do now vs. at capture

- **Now (no call needed):** ship 6.3.1 (`persist()`) — it's safe and preventive.
  Consider 6.3.3 (loop detection) as a follow-up.
- **At the next glitchy call:** run the §4 capture; answer Q1 (divergence?) and
  6.4's discriminator. For any _currently_ stuck device, apply remediation option 1
  (clean **logout + login**, not just "clear storage" — clearing storage without
  `mx.logout()` leaves the server device + its OTKs and can re-trigger the
  divergence).
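For the loop-detection follow-up (6.3.3), the counting part could be as small as a sliding-window counter. The sketch below is illustrative only: the class name `UploadConflictDetector` and its parameters are hypothetical, the threshold and window are placeholders for the "N in a window" mentioned in 6.3, and how the client observes the rejected `keys/upload` response is deliberately left out.

```typescript
// Hypothetical sliding-window detector for the KE-1 upload storm.
class UploadConflictDetector {
  private hits: number[] = [];
  private maxHits: number;
  private windowMs: number;

  constructor(maxHits: number, windowMs: number) {
    this.maxHits = maxHits;
    this.windowMs = windowMs;
  }

  // Record one keys/upload response. Returns true once at least `maxHits`
  // "already exists" rejections fall inside the trailing window.
  record(status: number, errorMessage: string, nowMs: number): boolean {
    const isConflict = status === 400 && errorMessage.includes("already exists");
    if (isConflict) this.hits.push(nowMs);
    // Drop rejections that have aged out of the window.
    this.hits = this.hits.filter((t) => nowMs - t < this.windowMs);
    return this.hits.length >= this.maxHits;
  }
}
```

When `record` returns true, the caller would stop retrying and show the "Reset encryption on this device (log out & back in)" prompt. Passing the timestamp in explicitly keeps the class free of timers and easy to unit-test.
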