Phase 13-12 design — offline queue hardening + D-009 7-day cap (core scope)
Phase 13-12 — offline queue hardening + D-009 7-day cap (design memo)
Date: 2026-05-22
Author: tech-architect
Status: Draft spec — implementation dispatched as separate work units.
Related canon: D-007 (anti-cheat single-writer), D-009 (heavy offline + 7-day cap + provisional state), D-016 (progression — quest grants only), D-018 (Energy economy — sink-debits must be server-authoritative).
Related ADRs: ADR-0002 §3 (idempotency-key on /step/ingest), ADR-0004 (anti-cheat layer reject taxonomy), ADR-0007 §7 (error envelope {error, message, details, requestId}), ADR-0006 §Light anti-cheat (mock-trust posture this phase).
This memo is the spec phase. No code is written here. Implementation is dispatched in two following ticks per §6.
0. Sub-deliverable map
| # | Deliverable | Owner(s) | Artifact |
|---|---|---|---|
| 1 | Idempotency-key shape + canonical TTL | tech-architect (this memo) + backend-engineer | New idempotency_keys Prisma model + Nest interceptor + DTO header binding |
| 2 | D-009 7-day cap server-side enforcement via clientGeneratedAt | tech-architect (this memo) + backend-engineer | New STALE_ACTION error code + tree/quest service guard + Zod schema additions |
| 3 | Action ordering contract (per-walker FIFO) on the wire | tech-architect (this memo) + mobile-developer | clientGeneratedAt field + worker preserves DB-insertion order |
| 4 | Mobile queue rewrite — UUIDv4 keys for ALL three action types + new error handling | mobile-developer | OfflineAction schema add + OfflineQueueWorker 409 / STALE / ORDERING branches + telemetry |
1. Context — what 13-12 closes
Sub-phase 13-5 shipped the Room queue (OfflineActionDao / OfflineQueueDb / OfflineAction / OfflineQueueRepository{,Impl} / OfflineQueueWorker) at walkrpg-mobile/android/app/src/main/java/com/walkrpg/mobile/data/offline/. It works in the happy path. The four hardening gaps 13-12 closes:
1. Idempotency is partial. Only TREE_ALLOCATE carries an idempotency key (UUIDv4 generated at TreeViewerViewModel + TreeRepositoryImpl, parsed by AllocateRequestSchema in backend/src/tree/tree.dto.ts:16-18). QUEST_ADVANCE and QUEST_COMPLETE carry no key: the worker treats HTTP 409 as “success” at the HTTP layer (OfflineQueueWorker.kt:117/124), but the backend has no record that this specific enqueue was the one that succeeded vs a different enqueue of the same logical action. This is a silent merge that violates ADR-0002’s “same key + different payload = error” convention.
2. Tree allocation idempotency is enforced at the unique-index layer, not the response layer. tree.service.ts:19-20 explicitly documents this: “A durable idempotency-key table (production) will allow returning the cached prior response for full idempotency semantics.” Today the second submission of the same key + same allocations returns 409 ALREADY_ALLOCATED instead of the original 200 response, so the client sees a different shape on replay, which forces it to issue a follow-up GET /tree/state to reconcile. That costs a wasted round trip and opens a race window where the client UI can briefly show stale state.
3. The D-009 7-day cap is client-only. OfflineQueueWorker.kt:60-64 runs dropExpired(SEVEN_DAYS_MS) based on OfflineAction.enqueuedAt (Room-side wall clock at enqueue). The server has no equivalent check on tree/quest endpoints, so a walker with a tampered system clock can replay actions tagged with a synthetic enqueuedAt indefinitely. D-009 §2 mandates the cap; today it’s enforced by a non-attested device clock.
4. D-018 sink-debit ordering is undefined on the wire. D-018 §What-not-decide says “sink debits must be server-authoritative”. Energy economy is not wired into the backend yet (no Energy ledger schema; combat already debits “implicitly” via the simulateEncounter pass-through). 13-12 needs to lock the ordering contract the future Energy ledger will inherit, without speculatively shipping the ledger schema itself.
13-12 closes (1)+(2)+(3) directly. (4) is contract-only — no Energy ledger ships; the ordering rule is documented so the future schema lands without re-litigating it.
Scope discipline. Per phase-13-plan §6 13-12’s FLAG_LEAD A (D-026 push) + B (D-027 onboarding), CEO triaged “core only”. This memo treats both as explicitly deferred. See §6.
2. Cross-cutting contracts
2.1 Idempotency-key shape — canonical
Format. UUIDv4 string. 36 chars including dashes. Validation: z.string().uuid() server-side (matches existing AllocateRequestSchema:16-18). Lowercase canonical; case-sensitive comparison.
Generation site. Client. Mobile generates the key at enqueue time, not at send time. This matters because re-enqueue of the same logical action (e.g. user taps “Complete Quest” twice while offline) must produce the same key, not two different keys; the Room layer is the source of truth.
Lifetime / storage. Server-side idempotency_keys table (Prisma model — §3.1). Composite uniqueness on (walkerId, key, endpoint). TTL = 7 days to match D-009 §2 offline cap (W-lock §5 picks this; alternative 30d explored).
Replay semantics. A second request with the same (walkerId, key, endpoint):
- Identical payload → return the cached response_body JSON byte-for-byte with the original status code. No DB mutation. Counts as a normal request for rate-limit purposes.
- Different payload → 409 IDEMPOTENCY_CONFLICT, with details.expectedHash (sha256 of the canonical-JSON-stringified original payload) + details.receivedHash. No DB mutation. This is ADR-0002’s “same key + different payload = error” convention extended to all queueable endpoints.
Header vs body. Honor ADR-0002 §3 today’s shape on /step/ingest (key in body). On the new queueable endpoints (/tree/allocate, /quest/:id/step/:n/advance, /quest/complete), the key arrives via the Idempotency-Key HTTP header. Two reasons:
- HTTP layer is the canonical place per the IETF Idempotency-Key header draft / Stripe convention; the body slot is repurposable for richer payloads later (e.g. batch operations).
- The Nest interceptor (§3.2) is wire-level, not DTO-level — it can short-circuit replay BEFORE the controller’s DTO parse runs, avoiding wasted Zod cycles.
/step/ingest’s existing body-level idempotencyKey is grandfathered — it predates this memo and the step-ingest pipeline has its own merge-on-conflict semantics per ADR-0002 §5. Not migrated in 13-12. The new interceptor opts in per-endpoint via a @Idempotent() decorator (§3.2).
2.2 clientGeneratedAt — clock-skew defense
Every queueable POST gains a new required body field: clientGeneratedAt: ISO-8601 UTC instant. This is the device’s claim of “the moment the user committed this action” — distinct from enqueuedAt (Room insertion wall clock, may equal but is not required to) and clientSubmittedAt (HTTP send wall clock, only meaningful when the request reaches the server).
Why three timestamps:
- enqueuedAt (Room column): used by the worker’s client-side 7-day expiry check.
- clientGeneratedAt (NEW wire field): what the server uses for server-side D-009 enforcement.
- clientSubmittedAt (already on /step/ingest, extended to other endpoints): informational, used by the request-id middleware for log correlation. Not anti-cheat.
In the happy online path, all three are within ~milliseconds. In the offline-then-sync path, clientGeneratedAt can be days before clientSubmittedAt. The cap is checked against clientGeneratedAt.
Cap formula (server-side). Action is STALE if server_now - clientGeneratedAt > 7 days + GRACE. GRACE = 6 hours to absorb device-vs-server clock drift in either direction (covers reasonable NTP slew + tz-jump edge cases). Phase-13-plan §6 13-12 row says “core scope”; 6h is small enough not to weaken the cap, large enough not to false-positive walkers on cellular handover at midnight.
Anti-tamper note: clientGeneratedAt is client-asserted, not attested. Per ADR-0006 mock-trust posture, the server trusts it for now; production migration adds Play Integrity / DeviceCheck attestation that the device clock is OS-reported (not user-overridable). The 6h grace is a deliberate floor — a walker who pushes clientGeneratedAt more than 6h into the past relative to their own claim of clientSubmittedAt is rejected at schema-validation time (new Zod rule, §3.3). This catches the trivial “set my phone clock to 2026-01-01” attack today, without depending on attestation.
2.3 Action ordering — per-walker FIFO
Lock: per-walker FIFO, enforced client-side at the worker. Within one walker’s queue, actions sync in insertion order. The DAO already orders enqueued_at ASC (OfflineActionDao.kt:26); the worker iterates serially. No parallelism within a walker.
Why not per-action-type FIFO (e.g. all TREE_ALLOCATE syncs together, all QUEST_* syncs together):
- Tree allocation and quest completion can be causally linked (Quest 005 keystone-allocation reveals a hidden quest; later QUEST_ADVANCE on the revealed quest is logically AFTER the allocation). Per-action-type FIFO would race the order.
- D-018 sink-debit ordering: when the Energy ledger ships, the same Energy debit can fund any of the four sinks (combat, craft, allocation, harvest). If the queue contains TREE_ALLOCATE (debit) → COMBAT_RESOLVE (debit), they must serialize against a shared Energy balance; reordering them is a balance violation.
Wire expression. None — ordering is implicit in send order. The server does not reject “out of order” submissions; it serializes them via DB row-level locks on the walker row when they touch shared columns (treePointsBanked, future currentEnergy). The contract is: the client guarantees the order it submits matches the order the user committed, and the server’s transactional writes commit in submit-order.
ORDERING_VIOLATION error — surfaced only when a strict ordering invariant is detectable server-side. Concrete case for 13-12: QUEST_ADVANCE for step N+1 arriving before QUEST_ADVANCE for step N would surface as STEP_OUT_OF_ORDER (already exists in quest.service.ts:467). 13-12 generalizes this surface to a new top-level error code ORDERING_VIOLATION only if we discover a second case; for 13-12 v1, the existing STEP_OUT_OF_ORDER covers the documented case. The new code is reserved in the taxonomy (§2.5) but not actively thrown anywhere in 13-12 — future-proofing.
2.4 D-009 7-day cap enforcement — what server-side adds
Client-side enforcement (OfflineQueueWorker.dropExpired against enqueuedAt) handles the honest offline-too-long case: walker walks for 8 days off-grid, queue has stale entries, worker drops them on reconnect. Telemetry-friendly, user-visible via the existing OfflineActionExpired event toast.
Server-side adds three things client cannot:
- Defense against client-clock tamper. A walker who manually sets the device clock backward by 30 days can keep replaying queued actions indefinitely; the worker’s System.currentTimeMillis() - enqueuedAt < 7d is always true relative to the tampered clock. Server-side checks serverNow - clientGeneratedAt. The cap holds against any client-side clock value.
- Defense against malicious queue inject. A walker (or a tool) that bypasses the Room queue entirely and posts directly to the API can claim arbitrary clientGeneratedAt values. Server-side STALE_ACTION (per §2.5) rejects them.
- Operational forensic. Every stale action attempt writes an idempotency_keys row with response_status = 422 even though no business write happens (§3.1: the row is the audit trail). Production migration’s anti-cheat layer correlates stale-action rates per walker against the existing AttestationLog (ADR-0004).
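The clock-tamper case can be made concrete with a small worked scenario. This is a minimal sketch, not the worker or guard code: the function names and the chosen dates are illustrative assumptions, and only the two comparisons are taken from the text above.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;
const CAP_MS = 7 * DAY_MS; // D-009 7-day cap
const GRACE_MS = 6 * 60 * 60 * 1000; // 6h grace from §2.2

// Client-side check as described for the worker: relies on the device clock.
function clientConsidersFresh(deviceNowMs: number, enqueuedAtMs: number): boolean {
  return deviceNowMs - enqueuedAtMs < CAP_MS;
}

// Server-side check: relies on the server clock and the wire field.
function serverConsidersStale(serverNowMs: number, clientGeneratedAtMs: number): boolean {
  return serverNowMs - clientGeneratedAtMs > CAP_MS + GRACE_MS;
}

// Scenario: the action is committed at t0, 10 real days pass,
// and the device clock has been rolled back by 30 days.
const t0 = Date.UTC(2026, 4, 1);
const realNow = t0 + 10 * DAY_MS;
const tamperedDeviceNow = realNow - 30 * DAY_MS;

// Device-relative age is -20 days, so the client check still passes.
const passesClient = clientConsidersFresh(tamperedDeviceNow, t0); // true
// Server-relative age is 10 days, beyond 7d + 6h, so the server rejects.
const rejectedByServer = serverConsidersStale(realNow, t0); // true
```

With an honest device clock the client check would already have dropped the action; the server check is what still holds when the device clock cannot be trusted.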
Critical clarification on what the cap means. D-009 §2 says “Allocations made offline are provisional until server validates step provenance against HealthKit / Health Connect authoritative source.” That’s about provisional flag flips on TreeAllocation.provisional. The 7-day cap is the hard floor beyond which the action is dropped entirely, not provisionally accepted. 13-12 enforces the hard floor on the action side; the provisional-flag reconciliation is the post-Phase-13 reconciliation worker per ADR-0002 §6 (already documented as deferred).
2.5 Error response taxonomy
All errors follow ADR-0007 §7 envelope: {error, message, details, requestId}. New codes introduced by 13-12 in bold.
| Code | HTTP | Endpoint | Meaning | Mobile handling |
|---|---|---|---|---|
| **STALE_ACTION** | 422 | tree/allocate, quest/advance, quest/complete | serverNow - clientGeneratedAt > 7d + 6h | Drop with telemetry, emit OfflineActionExpired event (extend semantics to cover server-rejected stale, not only client-side expired). W-lock §5 confirms drop-vs-surface. |
| **IDEMPOTENCY_CONFLICT** | 409 | tree/allocate, quest/advance, quest/complete | Same (walkerId, key, endpoint) + different payload (sha256 mismatch) | Critical bug signal: log + telemetry + drop from queue. UI surfaces “Sync error” toast. Should never happen in normal flow; if it does, app state is corrupt. |
| ALREADY_ALLOCATED | 409 | tree/allocate | Node/keystone already in DB (legacy) | Treated as success — replay returned cached prior response per §2.1. |
| QUEST_ALREADY_COMPLETED | 409 | quest/complete | Quest already completed (legacy) | Treated as success — replay returned cached prior response per §2.1. |
| QUEST_ALREADY_IN_PROGRESS | 409 | quest/start | Existing in-progress row (legacy) | Treated as success — replay returned cached prior response. |
| STEP_OUT_OF_ORDER | 422 | quest/advance | Step N+1 sent before step N (legacy) | Pause queue, fetch /quest/available for canonical state, re-enqueue corrected. |
| **ORDERING_VIOLATION** | 422 | (reserved) | Generalized ordering invariant (not actively thrown in 13-12) | Same as STEP_OUT_OF_ORDER. |
| PREREQUISITES_NOT_MET | 422 | quest/start, tree/allocate | Gate slipped between enqueue and sync | Drop with telemetry; UI shows “Action no longer valid” toast. Common at long offline durations near cap. |
| INSUFFICIENT_POINTS | 422 | tree/allocate | Banked points changed via another path | Drop with telemetry; UI invalidates tree state. |
2.6 Client-side retry policy
Exponential backoff with jitter, cap on attempts.
| Parameter | Value | Rationale |
|---|---|---|
| Base delay | 5s | First retry is fast — covers transient network blips. |
| Backoff factor | 2.0 | Standard doubling. |
| Max delay | 5min | Beyond 5min, WorkManager’s periodic schedule (15min) takes over. |
| Jitter | ±25% randomized per attempt | De-synchronize burst-reconnect crowds (e.g. cell-tower restoration). |
| Max attempts | 12 | At backoff 2.0 starting at 5s with the 5min delay cap, the un-jittered waits between 12 attempts (5s, 10s, 20s, 40s, 80s, 160s, then 300s each) sum to roughly 30min of in-worker retries before deferring to the next WorkManager cycle. |
| Give-up semantics | After 12 in-worker attempts that hit transient errors (5xx, network), the action stays in the queue with retry_count incremented and WorkManager re-schedules the periodic worker. | Permanent errors (4xx STALE_ACTION / IDEMPOTENCY_CONFLICT / PREREQUISITES_NOT_MET / INSUFFICIENT_POINTS) drop the action immediately, with no retry. |
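The parameters in the table can be turned into a delay schedule as follows. This is a minimal sketch in TypeScript for consistency with the backend examples (the worker itself is Kotlin); the function names and the injectable random source are assumptions made for illustration and testability.

```typescript
const BASE_MS = 5000; // base delay: 5s
const FACTOR = 2.0; // backoff factor
const MAX_DELAY_MS = 5 * 60 * 1000; // max delay: 5min
const JITTER = 0.25; // ±25%
const MAX_ATTEMPTS = 12;

// Un-jittered wait after the given failed attempt (1-based), capped at the max delay.
function baseDelayMs(failedAttempt: number): number {
  return Math.min(BASE_MS * Math.pow(FACTOR, failedAttempt - 1), MAX_DELAY_MS);
}

// Applies ±25% jitter; rand must return a value in [0, 1).
function jitteredDelayMs(failedAttempt: number, rand: () => number = Math.random): number {
  const spread = (rand() * 2 - 1) * JITTER; // in [-0.25, 0.25)
  return Math.round(baseDelayMs(failedAttempt) * (1 + spread));
}

// One wait after each failed attempt except the last,
// which defers to the next periodic WorkManager run.
function baseSchedule(): number[] {
  return Array.from({ length: MAX_ATTEMPTS - 1 }, (_, i) => baseDelayMs(i + 1));
}
```

Summing baseSchedule() is a quick way to sanity-check the total in-worker retry window against the WorkManager period.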
Permanent vs transient classification.
- Transient (retry): 5xx, network failure, 408 timeout, 429 rate-limited (honor Retry-After if present).
- Permanent (drop or treat-as-success): all 422s except STEP_OUT_OF_ORDER (re-queue corrected). 409s with cached-response replay are treated-as-success per §2.1.
- STEP_OUT_OF_ORDER: not a drop; the worker pauses the queue, fetches canonical state via GET /quest/available, and re-enqueues actions starting from the corrected position. This is the only error that triggers a queue-rebuild path.
2.7 D-018 sink-debit ordering (forward-reference, no schema this phase)
Energy ledger schema is NOT shipped in 13-12. D-018 §What-not-decide flags this as a tech-architect responsibility for “Phase 13+”; we are deferring it because:
- Combat already runs through simulateEncounter which consumes walkerStatsSnapshot.energy server-side, satisfying D-018’s “server-authoritative sink debit” requirement implicitly for the combat sink.
- Tree allocation, craft, and leak-harvest sinks are not yet on the backend wire. Shipping the Energy ledger speculatively before there are sinks to debit would be over-engineering.
What 13-12 DOES lock for the future ledger:
- Action submit order = debit commit order. When the ledger ships, debits commit inside the same transaction that writes the action effect (tree allocation, quest reward, etc.): same $transaction boundary, walker-row lock. Insufficient-energy at commit time → INSUFFICIENT_ENERGY 422 (reserved code, not active in 13-12).
- Idempotency key from §2.1 funds the ledger commit. A replay returns the cached response without re-debiting; the cached response carries the originally-debited Energy state. This is why the idempotency-key contract must be locked BEFORE the ledger lands.
This is a tech-architect-only commitment, not a CEO ratification. Documented here so the backend-engineer who ships the Energy ledger post-Phase-13 finds the contract pre-baked.
Blocker / question I can’t resolve alone: if CEO wants the Energy ledger to land in 13-12 (instead of forward-referenced), that materially expands scope by ~1.5 dispatch ticks (ledger Prisma model + commit-on-action wiring across 4+ services + replay-on-cache plumbing). See §5 Q3.
3. Backend deliverables
3.1 Prisma migration — idempotency_keys table
    /// Persistent idempotency-key table per D-009-extension + ADR-0002 §3 generalized to all
    /// queueable POST endpoints. Replaces the implicit DB-unique-constraint idempotency at
    /// tree.service:19-20 with cached-response-replay semantics.
    ///
    /// TTL: 7 days (matches D-009 offline cap; W-lock §5). Rows older than 7d + 1d grace are
    /// swept by a daily cron (B-level under tech-architect; minimum-viable in 13-12 = single
    /// scheduled task, no separate retention service).
    ///
    /// Composite uniqueness on (walkerId, key, endpoint) so the same UUID v4 can be reused
    /// across endpoints without collision (the client may not enforce uniqueness across
    /// endpoint domains; the server treats key+endpoint as the canonical identifier).
    model IdempotencyKey {
      id                String   @id @default(uuid())
      walkerId          String
      /// UUID v4 string, 36 chars. Validated by Zod at controller entry.
      key               String
      /// Endpoint route key, e.g. "tree.allocate", "quest.advance", "quest.complete".
      /// Hand-coded enum string (not a Prisma enum; adding/removing endpoints should not
      /// require a migration).
      endpoint          String
      /// sha256 of the canonical-JSON-stringified original request payload. Used to detect
      /// "same key + different payload" replay (→ 409 IDEMPOTENCY_CONFLICT).
      payloadHash       String
      /// HTTP status code of the original response (200/201/etc.).
      responseStatus    Int
      /// JSON body of the original response, byte-stable serialized. Returned verbatim on
      /// replay. Size bounded by the endpoint's natural response shape; tree.allocate's
      /// TreeStateResponseDto is the largest at ~3-10 KB.
      responseBody      Json
      /// clientGeneratedAt from the original request (informational; the cap check ran
      /// against this value at original-write time and is not re-checked on replay).
      clientGeneratedAt DateTime
      createdAt         DateTime @default(now())
      /// createdAt + 7 days. Cron sweep deletes rows past expiresAt.
      expiresAt         DateTime

      walker Walker @relation(fields: [walkerId], references: [id], onDelete: Cascade)

      @@unique([walkerId, key, endpoint])
      @@index([walkerId])
      @@index([expiresAt])
      @@map("idempotency_keys")
    }

Walker model gains idempotencyKeys IdempotencyKey[] reverse relation. Migration name: add_idempotency_keys.
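Why payloadHash and not full payload comparison. The hash is small, fixed-size, and lets the conflict check run as a single column read; full-payload comparison would require deserialising both sides and a deep-equal pass. Hash is sha256 over a deterministic, key-ordered JSON serialization of the payload, to neutralize JS object iteration order. The key sort must be applied recursively: JSON.stringify(payload, Object.keys(payload).sort()) is not sufficient, because an array replacer acts as a property allowlist at every nesting level and silently drops nested keys (such as those inside allocations[] entries) that do not also appear at the top level. Backend-engineer codifies the hashing helper at backend/src/common/canonical-hash.ts.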
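A minimal sketch of what such a hashing helper could look like, assuming Node's built-in crypto module. The function names canonicalJson and payloadHash are illustrative, not taken from the codebase. It sorts keys recursively at every nesting level instead of relying on a replacer array, and leaves array order untouched because array order is part of the payload's meaning.

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serializes JSON with object keys sorted at every nesting level.
// Arrays keep their original element order.
function canonicalJson(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return "[" + value.map((v) => canonicalJson(v)).join(",") + "]";
  const entries = Object.keys(value)
    .sort()
    .map((k) => JSON.stringify(k) + ":" + canonicalJson(value[k]));
  return "{" + entries.join(",") + "}";
}

// Hex-encoded sha256 over the canonical serialization.
function payloadHash(payload: Json): string {
  return createHash("sha256").update(canonicalJson(payload), "utf8").digest("hex");
}
```

Two payloads that differ only in key order hash identically, while a change to any nested value produces a different hash, which is the property the IDEMPOTENCY_CONFLICT check depends on.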
Retention. The cron sweep is a B-level concern under tech-architect. Minimum-viable for 13-12: a single NestJS scheduled job (@nestjs/schedule @Cron('0 3 * * *')) deletes expiresAt < now. The query is index-backed (@@index([expiresAt])). No alerting on sweep volume in 13-12; logged at info-level only.
3.2 Nest interceptor — IdempotencyInterceptor + @Idempotent() decorator
New file backend/src/common/idempotency.interceptor.ts. Activates per-endpoint via a controller-method decorator:
    @Idempotent({ endpoint: "tree.allocate" })
    @Post("/tree/allocate")
    async allocate(...) { ... }

Flow on every decorated POST:

1. Read Idempotency-Key header. If absent → 400 IDEMPOTENCY_KEY_REQUIRED (new error code; mobile must send for decorated endpoints).
2. Validate UUIDv4 format. If invalid → 400 IDEMPOTENCY_KEY_INVALID.
3. Compute payloadHash from request body (canonical-JSON sha256).
4. Look up idempotency_keys where (walkerId, key, endpoint) match.
5. If found:
   - same payloadHash → return cached responseBody with cached responseStatus. Short-circuit; controller body does not execute.
   - different payloadHash → 409 IDEMPOTENCY_CONFLICT with {expectedHash, receivedHash} in details. Controller body does not execute.
6. If not found: proceed to controller. After successful response, insert idempotency_keys row with the response body cached. Insertion happens in a tail-end interceptor branch (post-handler) inside the controller’s transaction if possible, or as a separate insert if the controller did not open a transaction.
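The pre-check part of this flow can be sketched as a pure decision function. This is an in-memory illustration under stated assumptions: the Map stands in for the Prisma lookup, the types and the preCheck name are hypothetical, and the lowercase-only UUIDv4 pattern follows the canonical form in §2.1 rather than any existing validator.

```typescript
interface StoredKey {
  payloadHash: string;
  responseStatus: number;
  responseBody: unknown;
}

type PreCheck =
  | { kind: "reject"; status: 400 | 409; error: string; details?: Record<string, string> }
  | { kind: "replay"; status: number; body: unknown }
  | { kind: "proceed" };

// UUID version 4, lowercase only (assumption based on the canonical form in §2.1).
const UUID_V4 = /^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/;

function preCheck(
  store: Map<string, StoredKey>,
  walkerId: string,
  endpoint: string,
  key: string | undefined,
  payloadHash: string,
): PreCheck {
  if (key === undefined) return { kind: "reject", status: 400, error: "IDEMPOTENCY_KEY_REQUIRED" };
  if (!UUID_V4.test(key)) return { kind: "reject", status: 400, error: "IDEMPOTENCY_KEY_INVALID" };
  // Composite identity: (walkerId, key, endpoint).
  const found = store.get(`${walkerId}|${key}|${endpoint}`);
  if (!found) return { kind: "proceed" }; // first sighting: run the controller
  if (found.payloadHash === payloadHash) {
    // Same key, same payload: return the cached response, controller does not run.
    return { kind: "replay", status: found.responseStatus, body: found.responseBody };
  }
  // Same key, different payload: conflict, controller does not run.
  return {
    kind: "reject",
    status: 409,
    error: "IDEMPOTENCY_CONFLICT",
    details: { expectedHash: found.payloadHash, receivedHash: payloadHash },
  };
}
```

Keeping the decision separate from the storage call makes each branch of the flow straightforward to cover in idempotency.interceptor.spec.ts without a database.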
Critical detail on transactional consistency. The idempotency row insertion MUST happen inside the same transaction as the business write. Otherwise a successful business write + failed key insert leaves a “ghost replay” — the second submission would re-run the business write. Backend-engineer wires this by exposing the IdempotencyInterceptor’s “cache this response” call as a method on the PrismaService request-scoped wrapper, called by the controller inside its $transaction block right before returning. The interceptor’s only job is then the pre-check + 409-on-mismatch; the post-cache is controller-driven.
Why interceptor + controller cooperate rather than pure-interceptor. Pure-interceptor would require wrapping the controller in another transaction (interceptor opens tx, runs controller, commits with insert). This breaks tree.service.ts:156 which already opens its own $transaction. Cleanest is: interceptor pre-checks, controller persists business + key inside its own transaction, interceptor never closes anything.
3.3 STALE_ACTION enforcement — guard placement
Three sites:
Site 1 — Zod schema extension (per endpoint). Add clientGeneratedAt: z.string().datetime() to AllocateRequestSchema, new AdvanceStepRequestSchema, new CompleteQuestRequestSchema. The latter two don’t exist as Zod schemas today (quest controller takes path params + empty body); 13-12 introduces them as the first body-validated quest mutations.
Site 2 — A reusable assertNotStale guard. Lives at backend/src/common/stale-action.guard.ts:
    import { UnprocessableEntityException } from "@nestjs/common";

    export function assertNotStale(clientGeneratedAt: Date, opts?: { graceHours?: number }): void {
      const graceHours = opts?.graceHours ?? 6;
      const graceMs = graceHours * 60 * 60 * 1000;
      const sevenDaysMs = 7 * 24 * 60 * 60 * 1000;
      const ageMs = Date.now() - clientGeneratedAt.getTime();
      if (ageMs > sevenDaysMs + graceMs) {
        throw new UnprocessableEntityException({
          error: "STALE_ACTION",
          message: "Action was generated more than 7 days ago (D-009 §2 offline cap).",
          details: {
            clientGeneratedAt: clientGeneratedAt.toISOString(),
            serverNow: new Date().toISOString(),
            ageDays: Math.floor(ageMs / (24 * 60 * 60 * 1000)),
            capDays: 7,
            graceHours, // report the grace actually applied, not a hard-coded 6
          },
        });
      }
      // Also reject "future-dated": clientGeneratedAt > serverNow + grace.
      // Tampered-clock-forward case. Same grace window.
      if (ageMs < -graceMs) {
        throw new UnprocessableEntityException({
          error: "STALE_ACTION",
          message: "Action clientGeneratedAt is in the server's future beyond clock-skew tolerance.",
          details: {
            clientGeneratedAt: clientGeneratedAt.toISOString(),
            serverNow: new Date().toISOString(),
            graceHours,
          },
        });
      }
    }

Called from TreeService.allocate, QuestService.advanceStep, QuestService.completeQuest as the first line, before any DB read. The guard is idempotency-key-aware: if the request is a successful replay (interceptor short-circuits), the guard never runs. This is correct: replays of an action that was accepted within-cap stay accepted forever (or rather, until the key TTL expires, which is also 7d → de-facto same window).
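The two comparisons in assertNotStale are easier to test at their boundaries when the server clock is passed in explicitly. A minimal sketch, assuming a hypothetical pure helper classifyStaleness that is not part of the memo's deliverables. It distinguishes the too-old and future-dated cases for readability, although the guard reports both as STALE_ACTION.

```typescript
type Staleness = "ok" | "stale" | "future";

const HOUR_MS = 60 * 60 * 1000;
const CAP_MS = 7 * 24 * HOUR_MS; // D-009 7-day cap

// Pure variant of the guard's comparisons with an injectable server clock.
function classifyStaleness(
  clientGeneratedAt: Date,
  serverNow: Date,
  graceHours: number = 6,
): Staleness {
  const graceMs = graceHours * HOUR_MS;
  const ageMs = serverNow.getTime() - clientGeneratedAt.getTime();
  if (ageMs > CAP_MS + graceMs) return "stale"; // older than 7d + grace
  if (ageMs < -graceMs) return "future"; // dated beyond grace into the server's future
  return "ok";
}
```

Because both comparisons are strict, an action aged exactly 7d + 6h is still accepted, and one dated exactly 6h into the future is too; the first rejected values lie just past those boundaries.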
Site 3 — Quest controller body DTO. Quest endpoints today (/quest/:id/step/:n/advance, /quest/complete) take minimal bodies. 13-12 adds:
- POST /quest/:id/step/:n/advance body: { clientGeneratedAt: ISO-8601 } (currently empty / QuestStepAdvanceRequestDto only carries stepIndex which is redundant with path).
- POST /quest/complete body: extended to { questId, clientGeneratedAt: ISO-8601 }.
3.4 Service updates — tree, quest
tree.service.ts:
- allocate(walkerId, dto): assertNotStale(new Date(dto.clientGeneratedAt)) as first line. Remove lines 19-20 stale comment; the durable idempotency-key table now exists. Per-entry validation unchanged.
- Idempotency caching: the controller (not the service) calls the interceptor’s “cache this response” inside the $transaction (line 156 block). The service returns the response DTO; the controller threads through.
quest.service.ts:
- advanceStep(walkerId, questId, stepNumber, tx?): assertNotStale(new Date(dto.clientGeneratedAt)). The optional tx param stays; the controller decides whether to wrap.
- completeQuest(walkerId, questId, dto): signature gains the dto for clientGeneratedAt. assertNotStale first line.
Existing 409 paths unchanged on the wire (QUEST_ALREADY_COMPLETED, QUEST_ALREADY_IN_PROGRESS, ALREADY_ALLOCATED) — they remain as the “no idempotency key passed” fallback for non-decorated callers. The mobile worker stops relying on these as a substitute for proper idempotency replay because the new interceptor returns the cached 200 first.
3.5 Test surface estimate
Unit tests (~30 new in backend):
- idempotency.interceptor.spec.ts: ~10 cases: key-missing 400, key-malformed 400, key-found-same-hash returns cached body, key-found-diff-hash 409 IDEMPOTENCY_CONFLICT, key-not-found proceeds to controller, post-handler insert wired into transaction, sweep cron drops expired rows.
- stale-action.guard.spec.ts: ~8 cases: exactly-at-7d passes (within grace), 7d+5h passes, 7d+7h rejects with STALE_ACTION, future-dated +1h passes, future-dated +12h rejects, tz-confusion fixture (PL-zoned vs UTC), boundary at exactly 7d+6h, negative ageMs.
- tree.service.spec.ts: extend existing 14 tests with: STALE_ACTION on 8-day-old clientGeneratedAt, IDEMPOTENCY_CONFLICT on same key + altered allocations[], cached-replay on identical resubmit returns identical TreeStateResponseDto body.
- quest.service.spec.ts: extend with: STALE on advance, STALE on complete, idempotent replay of completeQuest returns identical CompleteQuestResponseDto including pointsAwarded (D-016 invariant: never double-grant).
Integration tests (~6 new):
- E2E idempotency.e2e.spec.ts: full HTTP roundtrip via supertest: enqueue → drain → replay → mismatched-payload-conflict → STALE_ACTION rejection. Covers all three decorated endpoints.
Total estimate: +36 backend tests (current backend = 240 post-13-10 + 13-11 0 deltas; post-13-12 target ~276).
4. Mobile deliverables
4.1 OfflineAction schema additions
OfflineAction.kt gains two columns:
    @ColumnInfo(name = "idempotency_key")
    val idempotencyKey: String, // UUIDv4, generated at enqueue, never re-generated

    @ColumnInfo(name = "client_generated_at")
    val clientGeneratedAt: Long, // epoch ms at the moment the user committed

Both NOT NULL. OfflineActionDao requires no DAO changes (column reads are implicit via Room codegen); a Room migration is required (current DB version is whatever 13-5 shipped; bump + add columns). Migration backfills:

- idempotency_key: re-use payloadJson’s embedded idempotencyKey field if present (TREE_ALLOCATE already has one); else generate fresh UUIDv4 in the migration.
- client_generated_at: backfill to enqueued_at (best-effort; these are pre-13-12 rows, and the server’s 6h grace covers the small skew).
Note: a Room destructive migration is acceptable per OfflineQueueDb posture (mock-trust + no canonical persistence) — backfill is friendlier and only ~10 lines.
4.2 Repository + enqueue sites
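OfflineQueueRepository.enqueue signature change. The defaults are declared on the interface, because Kotlin does not allow an overriding function to declare default parameter values; OfflineQueueRepositoryImpl.enqueue overrides with the same parameter list and inherits them:

    // Requires: import java.util.UUID
    suspend fun enqueue(
        actionType: String,
        payloadJson: String,
        idempotencyKey: String = UUID.randomUUID().toString(),
        clientGeneratedAt: Long = System.currentTimeMillis(),
    ): Long

Defaults make existing call-sites compile; explicit overrides allowed for replay-of-replay edge cases (an action that was previously enqueued, dropped from a buggy build, and is being re-enqueued by repair logic).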
Three enqueue sites updated:
- TreeViewerViewModel.kt:204: idempotencyKey already generated; thread through.
- QuestDetailViewModel (advance enqueue site): NEW: generate UUIDv4 at enqueue, embed in payload AND pass as column.
- QuestDetailViewModel (complete enqueue site): NEW: same.
The payload JSON carries the same idempotencyKey field (for redundancy + payload-self-describing) AND it’s the column source of truth.
4.3 Worker dispatch — header attach + new error branches
OfflineQueueWorker.processAction refactor:
Each processQuestAdvance / processQuestComplete / processTreeAllocate helper gains:
- Idempotency-Key header attached via Retrofit @Header("Idempotency-Key") idempotencyKey: String (new param on QuestApi.advanceStep, QuestApi.completeQuest, TreeApi.allocate).
- Request body extended with clientGeneratedAt field.
Error handling redesigned:
    private suspend fun classifyResponse(response: Response<*>): ActionResult {
        return when {
            response.isSuccessful -> ActionResult.Success
            response.code() == 409 -> {
                val errorBody = parseErrorBody(response)
                when (errorBody?.error) {
                    "IDEMPOTENCY_CONFLICT" -> ActionResult.PermanentDrop("idempotency-conflict")
                    // Treat ALREADY_ALLOCATED / QUEST_ALREADY_COMPLETED / QUEST_ALREADY_IN_PROGRESS as success:
                    // they happen on legacy paths or near-simultaneous duplicate enqueue
                    else -> ActionResult.Success
                }
            }
            response.code() == 422 -> {
                val errorBody = parseErrorBody(response)
                when (errorBody?.error) {
                    "STALE_ACTION" -> ActionResult.PermanentDrop("stale-action")
                    "PREREQUISITES_NOT_MET" -> ActionResult.PermanentDrop("prereq-slipped")
                    "INSUFFICIENT_POINTS" -> ActionResult.PermanentDrop("points-changed")
                    "STEP_OUT_OF_ORDER" -> ActionResult.RebuildQueue
                    "ORDERING_VIOLATION" -> ActionResult.RebuildQueue
                    else -> ActionResult.TransientRetry // unknown 422: retry once to be safe
                }
            }
            response.code() == 400 -> ActionResult.PermanentDrop("bad-request") // schema bug: log loudly
            response.code() == 401 -> ActionResult.AuthRefresh // session expired
            response.code() in 500..599 -> ActionResult.TransientRetry
            else -> ActionResult.TransientRetry
        }
    }

    sealed class ActionResult {
        data object Success : ActionResult()
        data class PermanentDrop(val reason: String) : ActionResult()
        data object RebuildQueue : ActionResult()
        data object TransientRetry : ActionResult()
        data object AuthRefresh : ActionResult()
    }

Telemetry per result:

- Success: count by action type.
- PermanentDrop("stale-action"): emit OfflineActionExpired event (re-use existing event; extend semantics to include server-side rejection, not only client-side cap). UI toast unchanged.
- PermanentDrop("idempotency-conflict"): log Crashlytics-level error; this should never happen in normal flow. UI surfaces generic “Sync error” toast.
- PermanentDrop("prereq-slipped") / points-changed: toast “This action is no longer valid.”
- RebuildQueue: fetch GET /quest/available, reconcile current step, re-enqueue subsequent actions starting from the corrected position. This is a 13-12 surface; v1 implementation can simply drop downstream actions in the same quest’s chain and let the user re-tap; full rebuild is a B-level follow-up.
4.4 Stress-test scenarios
Mobile-developer must include at minimum these test cases in the worker’s unit test surface:
- Clock-skew device. Mock System.currentTimeMillis() to be 30 days in the past at enqueue, current at send. Verify worker still sends; server returns STALE_ACTION (mocked); worker drops with telemetry; UI sees the OfflineActionExpired event.
- Queue drain mid-disconnect. Enqueue 5 actions while online, then mid-drain (after 2 actions sent) simulate connection loss. Verify the 3 remaining actions stay in queue with correct retry_count increments and identical idempotency_keys on next attempt.
- Duplicate enqueue same UUID. Two enqueue calls passing the same explicit idempotencyKey (replay-of-replay path). Verify Room’s unique-or-conflict strategy preserves the first row; second enqueue is no-op. (Room OnConflictStrategy.ABORT on the new unique index (walker_id, idempotency_key, action_type). Note that ABORT raises SQLiteConstraintException on the duplicate insert, so the repository must catch it for the call to behave as a no-op; OnConflictStrategy.IGNORE would skip the duplicate silently.)
- 409 IDEMPOTENCY_CONFLICT. Mock backend response: same key, different payload. Verify worker drops action + emits Crashlytics telemetry.
- Cached-replay 200. Mock backend response: same key, identical payload, server returns cached 200 body. Verify worker treats as success + removes from queue.
- Backoff sequence. 5 consecutive 500s. Verify delays approximate 5s, 10s, 20s, 40s, 80s with jitter; 12th attempt defers to WorkManager periodic re-schedule.
- STEP_OUT_OF_ORDER queue rebuild. Enqueue advance for step 3 + advance for step 4. Mock server: step 3 → 422 STEP_OUT_OF_ORDER (walker is at step 2). Verify worker fetches /quest/available, drops both 3+4, surfaces toast.
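The backoff scenario can be pinned down with a small delay function. The sketch below is only an illustration of the arithmetic (base 5s, doubling per consecutive failure); the ±20% jitter ratio, the retryDelayMs name, and the reading of "12th attempt defers" as "after 11 failures, hand off to the periodic scheduler" are assumptions, since the retry policy itself lives in §2.6. It is written in TypeScript for brevity; the worker is Kotlin.

```typescript
const BASE_DELAY_MS = 5_000;
// Assumption: attempts 1..11 are retried in-process; the 12th is deferred.
const MAX_INLINE_FAILURES = 11;

// Returns the delay before the next in-process retry, or null when the action
// should be left for the periodic re-schedule instead.
// `failures` is the number of consecutive failed attempts so far (1-based).
// `random` is injectable so tests can remove the jitter.
function retryDelayMs(
  failures: number,
  jitterRatio: number = 0.2, // assumed value, not taken from the memo
  random: () => number = Math.random,
): number | null {
  if (failures < 1) throw new RangeError("failures must be >= 1");
  if (failures >= MAX_INLINE_FAILURES) return null;
  const base = BASE_DELAY_MS * 2 ** (failures - 1);       // 5s, 10s, 20s, ...
  const jitter = (random() * 2 - 1) * jitterRatio * base; // uniform in [-ratio, +ratio) of base
  return Math.round(base + jitter);
}
```

Injecting `random` keeps the unit test deterministic: with `() => 0.5` the jitter term is zero and the first five delays are exactly the nominal sequence, while the "approximate" assertion only needs to check that real delays stay within the jitter band.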
4.5 Test surface estimate
Unit tests (~29 new in walkrpg-mobile):
- OfflineActionTest: Room migration test, backfill existing rows correctly.
- OfflineQueueRepositoryImplTest: 4 new cases for the idempotencyKey / clientGeneratedAt columns + unique-conflict.
- OfflineQueueWorkerTest: 7 cases per §4.4 stress scenarios.
- OfflineQueueWorkerClassifyResponseTest: ~12 cases, one for each row of the 4.3 error table, plus unknown-error fallback.
- TreeViewerViewModelTest + QuestDetailViewModelTest: extend with clientGeneratedAt + idempotencyKey enqueue parameter assertions (~3 cases each = 6 total).
Total estimate: +29 mobile tests (current Android = 300 post-13-11; post-13-12 target ~329).
5. W-locks for CEO (max 3, A/B/C)
Q1 — Idempotency-key TTL
How long does the server keep cached response bodies for replay?
- A. 7 days (match D-009 §2 offline cap exactly). After 7d, the original action is no longer replayable; any replay attempt returns the same STALE_ACTION error the original would now produce. Symmetric, single mental model.
- B. 30 days. Generous slack for “I went off-grid, came back, walker queue had key but action was already stale; I want to see the original response for forensic reasons.” Adds storage cost (~4x the row count).
- C. 24 hours. Aggressive cleanup; the cap really only matters within hours of the original write. Tighter storage but risks rejecting legitimate retries from a poorly-implemented client.
Recommendation: A. D-009 is the canonical cap; making the idempotency TTL match it gives a single mental model + symmetric expiry. The “client retries 8 days later” path is by-construction STALE_ACTION on the action level; the missing cached response then doesn’t matter (the client gets 422 STALE_ACTION fresh-computed). 30d (option B) opens a window where a stale action’s cached 200 could replay AFTER the cap should have rejected it — confusing semantics.
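A minimal sketch of how the lookup could behave under option A, assuming a stored row that carries the payload hash, the cached response body, and a creation timestamp. The IdempotencyRecord shape and the decide function are hypothetical; the real idempotency_keys model may differ.

```typescript
// Option A: the TTL matches the D-009 7-day offline cap.
const IDEMPOTENCY_TTL_MS = 7 * 24 * 60 * 60 * 1000;

// Hypothetical stored row for one (walker, key, endpoint) tuple.
interface IdempotencyRecord {
  payloadHash: string;
  responseBody: unknown;
  createdAt: number; // epoch milliseconds
}

type Decision =
  | { kind: "execute" }                 // no usable record: run the business write
  | { kind: "replay"; body: unknown }   // same key + same payload: return cached body
  | { kind: "conflict" };               // same key + different payload: 409 IDEMPOTENCY_CONFLICT

function decide(
  record: IdempotencyRecord | undefined,
  payloadHash: string,
  now: number,
): Decision {
  // An expired record is treated exactly like a missing one.
  if (record === undefined || now - record.createdAt >= IDEMPOTENCY_TTL_MS) {
    return { kind: "execute" };
  }
  if (record.payloadHash !== payloadHash) return { kind: "conflict" };
  return { kind: "replay", body: record.responseBody };
}
```

The "execute" result for an expired record is only safe because, under option A, a retry that old is expected to have been rejected as STALE_ACTION before the lookup runs; the sketch assumes that guard ordering rather than enforcing it.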
Q2 — Action ordering enforcement
How strict is the ordering contract between walks / tree / quest actions on the wire?
- A. Per-walker FIFO. Client guarantees submit-order matches user-commit-order. Server serializes via DB row-level locks but does not actively reject “out of order” submissions except for STEP_OUT_OF_ORDER. Loose; simple.
- B. Per-action-type FIFO. Server tracks last-seen sequence number per (walker, action_type); any out-of-sequence rejected with ORDERING_VIOLATION. Stricter; requires sequence number on wire + per-walker counter.
- C. Global per-walker sequence number with hard-reject. Every queueable action carries a monotonic counter; server rejects gaps + out-of-order with ORDERING_VIOLATION. Strictest; bulletproof against malicious reorder; expensive (counter increment on every enqueue + send).
Recommendation: A. Per-walker FIFO is sufficient for 13-12 because (a) the existing STEP_OUT_OF_ORDER catches the only known causal-ordering violation today, (b) D-018 Energy ordering will be enforced by transactional balance check (insufficient-energy error) not by sequence numbers, (c) Phase-13 backend has no per-walker counter primitive and adding one is scope-creep. B/C are reserved for production hardening where attestation makes the sequence number trustworthy.
Q3 — D-018 Energy ledger scope for 13-12
Does 13-12 ship the Energy ledger schema speculatively, or forward-reference?
- A. Forward-reference only. Memo §2.7 documents the contract; no ledger schema, no commit-on-action wiring. 13-12 ships clean per CEO “core scope only” triage. The next sub-phase (13-N+1 or post-Phase-13) ships the ledger.
- B. Ship the ledger schema + Energy column on Walker. No commit-on-action wiring yet; just the storage layer ready. Inflates 13-12 by ~0.5 ticks; gives the schema a home and lets backend tests reference it.
- C. Ship the ledger + wire commits for all four sinks (combat, tree, craft, harvest). Materially expands scope by ~1.5 dispatch ticks. Out of “core scope only” framing.
Recommendation: A. D-018 §What-not-decide explicitly says ledger spec is “tech-architect; Phase 13+” — not pinned to 13-12. The contract in §2.7 + §2.6 (INSUFFICIENT_ENERGY reserved code) is sufficient to ensure the future ledger lands without re-litigating idempotency or ordering. CEO triage said “core only” for 13-12; ledger is out.
Q4 — STALE_ACTION user-facing treatment (B-level if CEO defers)
When the server rejects a queued action with STALE_ACTION, what does the mobile UI do?
- A. Silent drop with toast. Re-use the existing OfflineActionExpired event toast pattern from 13-5 client-side expiry. Walker sees “Some queued actions expired (offline > 7 days)” once per drain cycle.
- B. Surface as data-loss notification. Persistent in-app inbox entry listing each dropped action’s user-visible label (“Allocate node.even-stride was dropped — expired”). User can dismiss individually.
- C. Pre-warn at queue-age threshold. When any action is between 5-7 days old, show a banner: “Sync soon — actions older than 7 days will expire.” Combine with A on actual drop.
Recommendation: A. Existing surface, lowest friction. The 7-day cap is a backstop — at 13-12 maturity the actual queue depth is small (closed-beta cohort, no multiplayer). Option C is a UX nice-to-have for a later polish pass; B is over-investing in an exception case. This is B-level under ui-designer + mobile-developer — flagged as Q4 only because the answer affects mobile telemetry shape; CEO can defer to leads.
6. Out of scope (explicit defers)
Per phase-13-plan §6 13-12 row’s FLAG_LEAD A + B (CEO triaged “core only”):
- D-026 minimal push absorption. No FCM token registration; no per-class opt-in surface; no notification firing matrix wiring. The full notification spec is post-Phase-13 (ui-designer + mobile-developer + narrative-designer paired) per §10.9 of phase-13-plan.
- D-027 diegetic onboarding absorption. Current 13-3 class-pick UI is treated as Phase-13-final per phase-13-plan §10.10 recommendation (option b). Bertranda walkthrough + 50-step register surface + class-selection-as-Quest-001-beat-1-close all defer to a post-Phase-13 onboarding sub-phase.
- D-018 Energy ledger schema. Forward-reference only per §2.7. The contract (idempotency-key funds future debits, action-submit-order = debit-commit-order) is locked here; no schema ships.
- Production attestation of clientGeneratedAt. Per ADR-0006 mock-trust posture, the field is client-asserted not OS-attested. Production migration adds Play Integrity / DeviceCheck verification that the device clock is not user-overridable. Out of 13-12.
- Reconciliation worker for provisional TreeAllocation rows. ADR-0002 §6 reconciliation flow + provisional flag flipping. Out of 13-12; the 7d cap is the action-level floor, not the provisional-flag flip mechanism.
- ORDERING_VIOLATION active throws. Code is reserved in the error taxonomy (§2.5) but no 13-12 code path throws it. Reserved for the second causal-ordering invariant when it surfaces.
- D-020 pull-based encounter pool ordering. Combat encounters in 13-12 still fire only from quest beats; pool-driven tropy accumulation has its own ordering semantics (per D-020 §2) handled when that sub-phase lands.
- Multi-device queue reconciliation. A walker logged in on two devices, each with their own offline queue, could submit identical idempotency_keys if generated independently. The composite unique on (walkerId, key, endpoint) correctly merges them, but the second device’s queue still believes it submitted; UI consistency is a Phase 14+ concern.
7. Dispatch order (recommended)
- Tick 1 (backend-engineer):
  - 1a. Prisma migration add_idempotency_keys. Verify schema validate + migrate runs clean.
  - 1b. backend/src/common/canonical-hash.ts + backend/src/common/stale-action.guard.ts + backend/src/common/idempotency.interceptor.ts + @Idempotent() decorator.
  - 1c. Wire interceptor + guard into tree.controller.ts, quest.controller.ts (advance + complete routes). Update Zod schemas + DTOs for new clientGeneratedAt.
  - 1d. Daily cron sweep job (@nestjs/schedule).
  - 1e. Tests per §3.5 (~36 new).
  - 1f. Swagger doc updates: Idempotency-Key header on all three endpoints + clientGeneratedAt body field + STALE_ACTION + IDEMPOTENCY_CONFLICT error responses.
- Tick 2 (mobile-developer):
  - 2a. Room migration: OfflineAction gains idempotency_key + client_generated_at columns + unique index (walker_id, action_type, idempotency_key). Backfill.
  - 2b. OfflineQueueRepositoryImpl.enqueue signature change + 3 call-site updates (TreeViewerViewModel + QuestDetailViewModel × 2).
  - 2c. OfflineQueueWorker.processAction refactor with classifyResponse sealed-class result; Idempotency-Key header on all three Retrofit interfaces; clientGeneratedAt body field.
  - 2d. Backoff + jitter retry policy per §2.6.
  - 2e. Tests per §4.4 + §4.5 (~29 new).
Tick 1 ships independently (server tolerates absent Idempotency-Key on un-decorated endpoints; new decorated endpoints will return 400 IDEMPOTENCY_KEY_REQUIRED to any non-13-12 client — acceptable since the only client is the mobile app, gated on app version). Tick 2 needs Tick 1’s wire-shape commits.
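The tolerance rule above (un-decorated endpoints ignore a missing header, decorated endpoints reject it with 400 IDEMPOTENCY_KEY_REQUIRED) can be sketched as a pure check. The envelope fields follow the ADR-0007 shape {error, message, details, requestId}; the checkIdempotencyHeader name, the message text, and the type of details are assumptions for illustration.

```typescript
// Error envelope fields as listed for ADR-0007 §7.
interface ErrorEnvelope {
  error: string;
  message: string;
  details: Record<string, unknown>; // assumed type
  requestId: string;
}

type HeaderCheck =
  | { ok: true; key: string | undefined }
  | { ok: false; status: 400; body: ErrorEnvelope };

// `decorated` stands for "this route carries @Idempotent()".
function checkIdempotencyHeader(
  decorated: boolean,
  header: string | undefined,
  requestId: string,
): HeaderCheck {
  const trimmed = header?.trim();
  const key = trimmed ? trimmed : undefined; // treat empty/blank as absent
  if (!decorated) return { ok: true, key };  // un-decorated: header is optional
  if (key === undefined) {
    return {
      ok: false,
      status: 400,
      body: {
        error: "IDEMPOTENCY_KEY_REQUIRED",
        message: "Idempotency-Key header is required on this endpoint.",
        details: {},
        requestId,
      },
    };
  }
  return { ok: true, key };
}
```

On the mobile side a 400 maps to PermanentDrop("bad-request") in classifyResponse, so a pre-13-12 client hitting a decorated endpoint would drop the action rather than retry it, which is why the app-version gate matters.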
8. Risks + mitigations
- Transactional consistency of interceptor-cached response. The hardest part: the response body must be cached inside the same transaction as the business write. Mitigation: controller-driven post-cache hook (§3.2), where the interceptor pre-checks only and the controller calls a cacheIdempotentResponse(...) method inside its own $transaction. Pattern documented + reviewed at code-review.
- payloadHash determinism across JS engines. JSON.stringify with sorted keys must be byte-identical between Node 22 (backend) and any potential future serverless re-host. Mitigation: extract the canonical-hash helper to a single utility; lock test fixtures with known hashes; ADR-0002’s existing canonical-hash convention as model.
- Room migration data loss. Adding NOT NULL columns to an existing table requires backfill. Mitigation: explicit migration with backfill SQL + test against a fixture DB built from 13-11’s schema.
- Clock-skew false positives. 6h grace is empirically chosen; some walkers (international travelers crossing date-line offline) may exceed. Mitigation: telemetry on STALE_ACTION rejection rate per walker; production migration tunes the grace window if false-positive rate exceeds 1%. For 13-12, 6h is a safe default.
- D-018 ordering contract drift before ledger ships. The §2.7 commitment binds future tech-architect work; if a future sub-phase ships a ledger that breaks the contract, the cached responses in idempotency_keys would mis-account. Mitigation: contract documented + ADR-0002 cross-references this memo; ledger PR will be reviewed against §2.7 explicitly.
- CEO option (b) on D-027 means the §5 exit-scenario walkthrough is not updated. Phase-13-plan §10.10 already notes this; 13-13 owns the editorial pass. 13-12 stays narrow.
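For the payloadHash determinism risk, a minimal sketch of a sorted-key canonical serializer feeding SHA-256 is shown here. It is not the ADR-0002 helper; canonicalJson, payloadHash, and the choice of SHA-256 with hex output are assumptions, and createHash is the standard Node.js crypto API.

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serialize with object keys sorted at every depth, so key insertion order
// never changes the output. Array order is preserved because it is meaningful.
function canonicalJson(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(canonicalJson).join(",")}]`;
  const keys = Object.keys(value).sort(); // default sort: UTF-16 code-unit order
  const members = keys.map((k) => `${JSON.stringify(k)}:${canonicalJson(value[k])}`);
  return `{${members.join(",")}}`;
}

// Hash the canonical string; hex output is an arbitrary choice for this sketch.
function payloadHash(payload: Json): string {
  return createHash("sha256").update(canonicalJson(payload), "utf8").digest("hex");
}
```

Building the string by hand, rather than relying on the key order JSON.stringify happens to emit, is what makes the output independent of how the request body was constructed. Fixture tests can then lock a handful of payloads to their recorded hashes so that a runtime change which alters serialization is caught immediately.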
9. Deferred / NOT in 13-12 (appendix)
- D-026 push notification infrastructure (FCM, opt-in surface, voice copy bank).
- D-027 diegetic onboarding (Bertranda walkthrough, 50-step register, class-pick-as-beat-1-close).
- D-018 Energy ledger schema + commit-on-action wiring.
- Reconciliation worker for provisional TreeAllocation flag.
- Production attestation (Play Integrity / DeviceCheck) of clientGeneratedAt.
- ORDERING_VIOLATION active throw paths (code reserved only).
- Multi-device queue reconciliation.
- D-020 pull-based encounter pool ordering.
- Queue-age pre-warn UX (Q4 option C — B-level under ui-designer).
- Per-walker monotonic sequence numbers (Q2 options B/C).