Phase 13-12 design — offline queue hardening + D-009 7-day cap (core scope)

Phase 13-12 — offline queue hardening + D-009 7-day cap (design memo)

Date: 2026-05-22
Author: tech-architect
Status: Draft spec; implementation dispatched as separate work units.
Related canon: D-007 (anti-cheat single-writer), D-009 (heavy offline + 7-day cap + provisional state), D-016 (progression: quest grants only), D-018 (Energy economy: sink-debits must be server-authoritative).
Related ADRs: ADR-0002 §3 (idempotency-key on /step/ingest), ADR-0004 (anti-cheat layer reject taxonomy), ADR-0007 §7 (error envelope {error, message, details, requestId}), ADR-0006 §Light anti-cheat (mock-trust posture this phase).

This memo is the spec phase. No code is written here. Implementation is dispatched in two following ticks per §6.

0. Sub-deliverable map

#	Deliverable	Owner(s)	Artifact
1	Idempotency-key shape + canonical TTL	tech-architect (this memo) + backend-engineer	New `idempotency_keys` Prisma model + Nest interceptor + DTO header binding
2	D-009 7-day cap server-side enforcement via `clientGeneratedAt`	tech-architect (this memo) + backend-engineer	New `STALE_ACTION` error code + tree/quest service guard + Zod schema additions
3	Action ordering contract (per-walker FIFO) on the wire	tech-architect (this memo) + mobile-developer	`clientGeneratedAt` field + worker preserves DB-insertion order
4	Mobile queue rewrite — UUIDv4 keys for ALL three action types + new error handling	mobile-developer	`OfflineAction` schema add + `OfflineQueueWorker` 409 / STALE / ORDERING branches + telemetry

1. Context — what 13-12 closes

Sub-phase 13-5 shipped the Room queue (OfflineActionDao / OfflineQueueDb / OfflineAction / OfflineQueueRepository{,Impl} / OfflineQueueWorker) at walkrpg-mobile/android/app/src/main/java/com/walkrpg/mobile/data/offline/. It works in the happy path. The four hardening gaps 13-12 closes:

(1) Idempotency is partial. Only TREE_ALLOCATE carries an idempotency key (UUIDv4 generated at TreeViewerViewModel + TreeRepositoryImpl, parsed by AllocateRequestSchema in backend/src/tree/tree.dto.ts:16-18). QUEST_ADVANCE and QUEST_COMPLETE carry no key: the worker treats HTTP 409 as “success” at the HTTP layer (OfflineQueueWorker.kt:117/124), but the backend has no record that this specific enqueue was the one that succeeded vs a different enqueue of the same logical action. This is a silent merge that violates ADR-0002’s “same key + different payload = error” convention.
(2) Tree allocation idempotency is enforced at the unique-index layer, not the response layer. tree.service.ts:19-20 explicitly documents this: “A durable idempotency-key table (production) will allow returning the cached prior response for full idempotency semantics.” Today the second submission of the same key + same allocations returns 409 ALREADY_ALLOCATED instead of the original 200 response, so the client sees a different shape on replay and must issue a follow-up GET /tree/state to reconcile. That costs a wasted round trip and opens a race window where the client UI can briefly show stale state.
(3) The D-009 7-day cap is client-only. OfflineQueueWorker.kt:60-64 runs dropExpired(SEVEN_DAYS_MS) based on OfflineAction.enqueuedAt (Room-side wall clock at enqueue). The server has no equivalent check on tree/quest endpoints, so a walker with a tampered system clock can replay actions tagged with a synthetic enqueuedAt indefinitely. D-009 §2 mandates the cap; today it’s enforced by a non-attested device clock.
(4) D-018 sink-debit ordering is undefined on the wire. D-018 §What-not-decide says “sink debits must be server-authoritative”. Energy economy is not wired into the backend yet (no Energy ledger schema; combat already debits “implicitly” via the simulateEncounter pass-through). 13-12 needs to lock the ordering contract the future Energy ledger will inherit, without speculatively shipping the ledger schema itself.

13-12 closes (1)+(2)+(3) directly. (4) is contract-only — no Energy ledger ships; the ordering rule is documented so the future schema lands without re-litigating it.

Scope discipline. Per phase-13-plan §6 13-12’s FLAG_LEAD A (D-026 push) + B (D-027 onboarding), CEO triaged “core only”. This memo treats both as explicitly deferred. See §6.

2. Cross-cutting contracts

2.1 Idempotency-key shape — canonical

Format. UUIDv4 string. 36 chars including dashes. Validation: z.string().uuid() server-side (matches existing AllocateRequestSchema:16-18). Lowercase canonical; case-sensitive comparison.

Generation site. Client. Mobile generates the key at enqueue time, not at send time. This matters because re-enqueue of the same logical action (e.g. user taps “Complete Quest” twice while offline) must produce the same key, not two different keys; the Room layer is the source of truth.

Lifetime / storage. Server-side idempotency_keys table (Prisma model — §3.1). Composite uniqueness on (walkerId, key, endpoint). TTL = 7 days to match D-009 §2 offline cap (W-lock §5 picks this; alternative 30d explored).

Replay semantics. A second request with the same (walkerId, key, endpoint):

Identical payload → return the cached response_body JSON byte-for-byte with the original status code. No DB mutation. Counts as a normal request for rate-limit purposes.
Different payload → 409 IDEMPOTENCY_CONFLICT, with details.expectedHash (sha256 of the canonical-JSON-stringified original payload) + details.receivedHash. No DB mutation. This is ADR-0002’s “same key + different payload = error” convention extended to all queueable endpoints.
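
The two replay outcomes above, plus the first-time case, can be sketched as a pure decision function. This is a minimal illustration, not the interceptor itself: StoredKey, ReplayDecision and resolveReplay are hypothetical names, and the stored row is reduced to the three fields the decision needs.

```typescript
// Hypothetical shapes for illustration; the real row is the IdempotencyKey model (§3.1).
interface StoredKey {
  payloadHash: string;
  responseStatus: number;
  responseBody: unknown;
}

type ReplayDecision =
  | { kind: "proceed" }
  | { kind: "replay"; status: number; body: unknown }
  | {
      kind: "conflict";
      status: 409;
      error: "IDEMPOTENCY_CONFLICT";
      details: { expectedHash: string; receivedHash: string };
    };

// `stored` is the row found for (walkerId, key, endpoint), or undefined if none exists.
function resolveReplay(stored: StoredKey | undefined, receivedHash: string): ReplayDecision {
  if (stored === undefined) {
    return { kind: "proceed" }; // first time this key is seen: run the controller
  }
  if (stored.payloadHash === receivedHash) {
    // Identical payload: hand back the cached status and body, no DB mutation.
    return { kind: "replay", status: stored.responseStatus, body: stored.responseBody };
  }
  // Same key, different payload: surface both hashes in details.
  return {
    kind: "conflict",
    status: 409,
    error: "IDEMPOTENCY_CONFLICT",
    details: { expectedHash: stored.payloadHash, receivedHash },
  };
}
```

Keeping the decision free of I/O makes the three branches easy to unit-test before any Prisma wiring exists.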

Header vs body. Honor ADR-0002 §3 today’s shape on /step/ingest (key in body). On the new queueable endpoints (/tree/allocate, /quest/:id/step/:n/advance, /quest/complete), the key arrives via the Idempotency-Key HTTP header. Two reasons:

The HTTP layer is the canonical place per the IETF Idempotency-Key header draft (draft-ietf-httpapi-idempotency-key-header) and Stripe convention; the body slot is repurposable for richer payloads later (e.g. batch operations).
The Nest interceptor (§3.2) is wire-level, not DTO-level — it can short-circuit replay BEFORE the controller’s DTO parse runs, avoiding wasted Zod cycles.

/step/ingest’s existing body-level idempotencyKey is grandfathered — it predates this memo and the step-ingest pipeline has its own merge-on-conflict semantics per ADR-0002 §5. Not migrated in 13-12. The new interceptor opts in per-endpoint via a @Idempotent() decorator (§3.2).

2.2 `clientGeneratedAt` — clock-skew defense

Every queueable POST gains a new required body field: clientGeneratedAt: ISO-8601 UTC instant. This is the device’s claim of “the moment the user committed this action” — distinct from enqueuedAt (Room insertion wall clock, may equal but is not required to) and clientSubmittedAt (HTTP send wall clock, only meaningful when the request reaches the server).

Why three timestamps:

enqueuedAt (Room column) — used by the worker’s client-side 7-day expiry check.
clientGeneratedAt (NEW wire field) — what the server uses for server-side D-009 enforcement.
clientSubmittedAt (already on /step/ingest, extended to other endpoints) — informational, used by the request-id middleware for log correlation. Not anti-cheat.

In the happy online path, all three are within ~milliseconds. In the offline-then-sync path, clientGeneratedAt can be days before clientSubmittedAt. The cap is checked against clientGeneratedAt.

Cap formula (server-side). Action is STALE if server_now - clientGeneratedAt > 7 days + GRACE. GRACE = 6 hours to absorb device-vs-server clock drift in either direction (covers reasonable NTP slew + tz-jump edge cases). Phase-13-plan §6 13-12 row says “core scope”; 6h is small enough not to weaken the cap, large enough not to false-positive walkers on cellular handover at midnight.

Anti-tamper note: clientGeneratedAt is client-asserted, not attested. Per ADR-0006 mock-trust posture, the server trusts it for now; production migration adds Play Integrity / DeviceCheck attestation that the device clock is OS-reported (not user-overridable). The 6h grace is a deliberate bound in both directions: a clientGeneratedAt more than 7 days + 6h behind server time, or more than 6h ahead of it, is rejected by the guard in §3.3. This catches the trivial “set my phone clock to 2026-01-01” attack today, without depending on attestation.

2.3 Action ordering — per-walker FIFO

Lock: per-walker FIFO, enforced client-side at the worker. Within one walker’s queue, actions sync in insertion order. The DAO already orders enqueued_at ASC (OfflineActionDao.kt:26); the worker iterates serially. No parallelism within a walker.

Why not per-action-type FIFO (e.g. all TREE_ALLOCATE syncs together, all QUEST_* syncs together):

Tree allocation and quest completion can be causally linked (Quest 005 keystone-allocation reveals a hidden quest; later QUEST_ADVANCE on the revealed quest is logically AFTER the allocation). Per-action-type FIFO would race the order.
D-018 sink-debit ordering: when the Energy ledger ships, the same Energy debit can fund any of the four sinks (combat, craft, allocation, harvest). If the queue contains TREE_ALLOCATE (debit) → COMBAT_RESOLVE (debit), they must serialize against a shared Energy balance; reordering them is a balance violation.

Wire expression. None — ordering is implicit in send order. The server does not reject “out of order” submissions; it serializes them via DB row-level locks on the walker row when they touch shared columns (treePointsBanked, future currentEnergy). The contract is: the client guarantees the order it submits matches the order the user committed, and the server’s transactional writes commit in submit-order.
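
To make the difference between the two orderings concrete, here is a small sketch under stated assumptions: QueuedAction, fifoOrder and perTypeOrder are hypothetical names, the `id` tie-breaker assumes an auto-increment row id, and the type ranking in perTypeOrder is arbitrary (any fixed per-type batching has the same problem).

```typescript
type ActionType = "TREE_ALLOCATE" | "QUEST_ADVANCE" | "QUEST_COMPLETE";

interface QueuedAction {
  id: number; // assumed auto-increment row id, used only as a tie-breaker
  actionType: ActionType;
  enqueuedAt: number; // epoch ms
}

// Per-walker FIFO: one global order across all action types.
function fifoOrder(queue: readonly QueuedAction[]): QueuedAction[] {
  return [...queue].sort((a, b) => a.enqueuedAt - b.enqueuedAt || a.id - b.id);
}

// Rejected alternative: batch by type first, FIFO only inside each batch.
function perTypeOrder(queue: readonly QueuedAction[]): QueuedAction[] {
  const typeRank: Record<ActionType, number> = {
    QUEST_ADVANCE: 0,
    QUEST_COMPLETE: 1,
    TREE_ALLOCATE: 2,
  };
  return [...queue].sort(
    (a, b) =>
      typeRank[a.actionType] - typeRank[b.actionType] ||
      a.enqueuedAt - b.enqueuedAt ||
      a.id - b.id,
  );
}

const queue: QueuedAction[] = [
  { id: 1, actionType: "TREE_ALLOCATE", enqueuedAt: 1000 }, // reveals a hidden quest
  { id: 2, actionType: "QUEST_ADVANCE", enqueuedAt: 2000 }, // advances the revealed quest
];

console.log(fifoOrder(queue).map((a) => a.id)); // allocation (1) first, then advance (2)
console.log(perTypeOrder(queue).map((a) => a.id)); // advance (2) is sent before allocation (1)
```

With per-type batching the advance reaches the server before the allocation that makes it valid, which is the causal race described above.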

ORDERING_VIOLATION error — surfaced only when a strict ordering invariant is detectable server-side. Concrete case for 13-12: QUEST_ADVANCE for step N+1 arriving before QUEST_ADVANCE for step N would surface as STEP_OUT_OF_ORDER (already exists in quest.service.ts:467). 13-12 generalizes this surface to a new top-level error code ORDERING_VIOLATION only if we discover a second case; for 13-12 v1, the existing STEP_OUT_OF_ORDER covers the documented case. The new code is reserved in the taxonomy (§2.5) but not actively thrown anywhere in 13-12 — future-proofing.

2.4 D-009 7-day cap enforcement — what server-side adds

Client-side enforcement (OfflineQueueWorker.dropExpired against enqueuedAt) handles the honest offline-too-long case: walker walks for 8 days off-grid, queue has stale entries, worker drops them on reconnect. Telemetry-friendly, user-visible via the existing OfflineActionExpired event toast.

Server-side adds three things client cannot:

Defense against client-clock tamper. A walker who manually sets the device clock backward by 30 days can keep replaying queued actions indefinitely; the worker’s System.currentTimeMillis() - enqueuedAt < 7d is always true relative to the tampered clock. Server-side checks serverNow - clientGeneratedAt. The cap holds against any client-side clock value.
Defense against malicious queue inject. A walker (or a tool) that bypasses the Room queue entirely and posts directly to the API can claim arbitrary clientGeneratedAt values. Server-side STALE_ACTION (per §2.5) rejects them.
Operational forensic. Every stale action attempt writes an idempotency_keys row with response_status = 422 even though no business write happens (§3.1 — the row is the audit trail). Production migration’s anti-cheat layer correlates stale-action rates per walker against the existing AttestationLog (ADR-0004).

Critical clarification on what the cap means. D-009 §2 says “Allocations made offline are provisional until server validates step provenance against HealthKit / Health Connect authoritative source.” That’s about provisional flag flips on TreeAllocation.provisional. The 7-day cap is the hard floor beyond which the action is dropped entirely, not provisionally accepted. 13-12 enforces the hard floor on the action side; the provisional-flag reconciliation is the post-Phase-13 reconciliation worker per ADR-0002 §6 (already documented as deferred).

2.5 Error response taxonomy

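All errors follow ADR-0007 §7 envelope: {error, message, details, requestId}. Codes introduced by 13-12 in this table: STALE_ACTION, IDEMPOTENCY_CONFLICT, and the reserved ORDERING_VIOLATION; rows marked (legacy) already exist. The two 400 codes raised by the interceptor (IDEMPOTENCY_KEY_REQUIRED, IDEMPOTENCY_KEY_INVALID) are specified in §3.2.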

Code	HTTP	Endpoint	Meaning	Mobile handling
STALE_ACTION	422	tree/allocate, quest/advance, quest/complete	`serverNow - clientGeneratedAt > 7d + 6h`	Drop with telemetry, emit `OfflineActionExpired` event (extend semantics to cover server-rejected stale, not only client-side expired). W-lock §5 confirms drop-vs-surface.
IDEMPOTENCY_CONFLICT	409	tree/allocate, quest/advance, quest/complete	Same `(walkerId, key, endpoint)` + different payload (sha256 mismatch)	Critical bug signal — log + telemetry + drop from queue. UI surfaces “Sync error” toast. Should never happen in normal flow; if it does, app state is corrupt.
ALREADY_ALLOCATED	409	tree/allocate	Node/keystone already in DB (legacy)	Treated as success — replay returned cached prior response per §2.1.
QUEST_ALREADY_COMPLETED	409	quest/complete	Quest already completed (legacy)	Treated as success — replay returned cached prior response per §2.1.
QUEST_ALREADY_IN_PROGRESS	409	quest/start	Existing in-progress row (legacy)	Treated as success — replay returned cached prior response.
STEP_OUT_OF_ORDER	422	quest/advance	Step N+1 sent before step N (legacy)	Pause queue, fetch `/quest/available` for canonical state, re-enqueue corrected.
ORDERING_VIOLATION	422	(reserved)	Generalized ordering invariant (not actively thrown in 13-12)	Same as STEP_OUT_OF_ORDER.
PREREQUISITES_NOT_MET	422	quest/start, tree/allocate	Gate slipped between enqueue and sync	Drop with telemetry; UI shows “Action no longer valid” toast. Common at long offline durations near cap.
INSUFFICIENT_POINTS	422	tree/allocate	Banked points changed via another path	Drop with telemetry; UI invalidates tree state.

2.6 Client-side retry policy

Exponential backoff with jitter, cap on attempts.

Parameter	Value	Rationale
Base delay	5s	First retry is fast — covers transient network blips.
Backoff factor	2.0	Standard doubling.
Max delay	5min	Beyond 5min, WorkManager’s periodic schedule (15min) takes over.
Jitter	±25% randomized per attempt	De-synchronize burst-reconnect crowds (e.g. cell-tower restoration).
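Max attempts	12	At backoff 2.0 starting 5s, the first six delays (5+10+20+40+80+160s) total ~5min; every later delay is pinned at the 5min cap, so a full run of 12 attempts accumulates roughly 30min of nominal delay before deferring to the next WorkManager cycle.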
Give-up semantics	After 12 in-worker attempts that hit transient errors (5xx, network), action stays in queue with `retry_count` incremented; WorkManager re-schedules the periodic worker. Permanent errors (4xx STALE_ACTION / IDEMPOTENCY_CONFLICT / PREREQUISITES_NOT_MET / INSUFFICIENT_POINTS) drop the action immediately — no retry.
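
The schedule in the table can be checked with a short calculation. This is a sketch that assumes one wait between consecutive attempts; the function names are hypothetical and the random source is injectable so the jitter can be pinned in tests.

```typescript
const BASE_DELAY_MS = 5_000;
const BACKOFF_FACTOR = 2.0;
const MAX_DELAY_MS = 5 * 60 * 1000;
const JITTER_FRACTION = 0.25;

// Delay before retry number `retry` (1-based), without jitter, clamped to the 5min cap.
function nominalDelayMs(retry: number): number {
  return Math.min(BASE_DELAY_MS * Math.pow(BACKOFF_FACTOR, retry - 1), MAX_DELAY_MS);
}

// Applies ±25% jitter: random() = 0 gives -25%, 0.5 gives no change.
function jitteredDelayMs(retry: number, random: () => number = Math.random): number {
  const jitter = 1 + (random() * 2 - 1) * JITTER_FRACTION;
  return Math.round(nominalDelayMs(retry) * jitter);
}

// Sum of the nominal waits between `attempts` consecutive attempts.
function totalNominalWaitMs(attempts: number): number {
  let total = 0;
  for (let retry = 1; retry < attempts; retry++) {
    total += nominalDelayMs(retry);
  }
  return total;
}

// First six nominal waits: 5s, 10s, 20s, 40s, 80s, 160s (315s in total).
// The seventh would be 320s and is clamped to 300s, as is every later one.
console.log(totalNominalWaitMs(7) / 1000); // 315
console.log(totalNominalWaitMs(12) / 1000); // 1815
```

Under these assumptions the first six waits add up to about 5min, and each further attempt adds another 300s, which matters when comparing the in-worker retry budget with the 15min periodic schedule.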

Permanent vs transient classification.

Transient (retry): 5xx, network failure, 408 timeout, 429 rate-limited (honor Retry-After if present).
Permanent (drop or treat-as-success): all 422s except STEP_OUT_OF_ORDER (re-queue corrected). 409s with cached-response replay are treated-as-success per §2.1.
STEP_OUT_OF_ORDER: not a drop; the worker pauses the queue, fetches canonical state via GET /quest/available, and re-enqueues actions starting from the corrected position. This is the only error that triggers a queue-rebuild path.

2.7 D-018 sink-debit ordering (forward-reference, no schema this phase)

Energy ledger schema is NOT shipped in 13-12. D-018 §What-not-decide flags this as a tech-architect responsibility for “Phase 13+”; we are deferring it because:

Combat already runs through simulateEncounter which consumes walkerStatsSnapshot.energy server-side, satisfying D-018’s “server-authoritative sink debit” requirement implicitly for the combat sink.
Tree allocation, craft, and leak-harvest sinks are not yet on the backend wire. Shipping the Energy ledger speculatively before there are sinks to debit would be over-engineering.

What 13-12 DOES lock for the future ledger:

Action submit order = debit commit order. When the ledger ships, debits commit inside the same transaction that writes the action effect (tree allocation, quest reward, etc.) — same $transaction boundary, walker-row lock. Insufficient-energy at commit time → INSUFFICIENT_ENERGY 422 (reserved code, not active in 13-12).
Idempotency key from §2.1 funds the ledger commit. A replay returns the cached response without re-debiting — the cached response carries the originally-debited Energy state. This is why the idempotency-key contract must be locked BEFORE the ledger lands.

This is a tech-architect-only commitment, not a CEO ratification. Documented here so the backend-engineer who ships the Energy ledger post-Phase-13 finds the contract pre-baked.

Blocker / question I can’t resolve alone: if CEO wants the Energy ledger to land in 13-12 (instead of forward-referenced), that materially expands scope by ~1.5 dispatch ticks (ledger Prisma model + commit-on-action wiring across 4+ services + replay-on-cache plumbing). See §5 Q3.

3. Backend deliverables

3.1 Prisma migration — `idempotency_keys` table

/// Persistent idempotency-key table per D-009-extension + ADR-0002 §3 generalized to all
/// queueable POST endpoints. Replaces the implicit DB-unique-constraint idempotency at
/// tree.service:19-20 with cached-response-replay semantics.
///
/// TTL: 7 days (matches D-009 offline cap; W-lock §5). Rows older than 7d + 1d grace are
/// swept by a daily cron (B-level under tech-architect; minimum-viable in 13-12 = single
/// scheduled task, no separate retention service).
///
/// Composite uniqueness on (walkerId, key, endpoint) so the same UUID v4 can be reused
/// across endpoints without collision (the client may not enforce uniqueness across
/// endpoint domains; the server treats key+endpoint as the canonical identifier).
model IdempotencyKey {
  id              String   @id @default(uuid())
  walkerId        String
  /// UUID v4 string, 36 chars. Validated by Zod at controller entry.
  key             String
  /// Endpoint route key — e.g. "tree.allocate", "quest.advance", "quest.complete".
  /// Hand-coded enum string (not a Prisma enum — adding/removing endpoints should not
  /// require a migration).
  endpoint        String
  /// sha256 of the canonical-JSON-stringified original request payload. Used to detect
  /// "same key + different payload" replay (→ 409 IDEMPOTENCY_CONFLICT).
  payloadHash     String
  /// HTTP status code of the original response (200/201/etc.).
  responseStatus  Int
  /// JSON body of the original response, byte-stable serialized. Returned verbatim on
  /// replay. Size bounded by the endpoint's natural response shape; tree.allocate's
  /// TreeStateResponseDto is the largest at ~3-10 KB.
  responseBody    Json
  /// clientGeneratedAt from the original request (informational; the cap check ran
  /// against this value at original-write time and is not re-checked on replay).
  clientGeneratedAt DateTime
  createdAt       DateTime @default(now())
  /// createdAt + 7 days. Cron sweep deletes rows past expiresAt.
  expiresAt       DateTime

  walker Walker @relation(fields: [walkerId], references: [id], onDelete: Cascade)

  @@unique([walkerId, key, endpoint])
  @@index([walkerId])
  @@index([expiresAt])
  @@map("idempotency_keys")
}

Walker model gains idempotencyKeys IdempotencyKey[] reverse relation. Migration name: add_idempotency_keys.

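Why payloadHash and not full payload comparison. The hash is small, fixed-size, and lets the conflict check run as a single column read; full-payload comparison would require deserialising both sides and a deep-equal pass. Hash is sha256 over a deterministic, key-ordered serialization of the payload, to neutralize JS object iteration order. Note that JSON.stringify(payload, Object.keys(payload).sort()) is only safe for flat payloads: an array replacer is an allowlist applied at every nesting level, so nested keys that are absent from the top-level list (e.g. the fields of each allocations[] entry) are silently dropped and distinct payloads would hash identically. The helper must therefore sort keys recursively. Backend-engineer codifies the hashing helper at backend/src/common/canonical-hash.ts.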
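
One possible shape for that helper, as a sketch rather than the final canonical-hash.ts: canonicalJson and canonicalHash are hypothetical names, the sample payload fields are illustrative, and the approach assumes payloads are plain JSON values (no undefined, Dates, or BigInts). The last lines show why an array replacer is unsuitable for nested payloads.

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serializes with object keys sorted at every depth; array order is preserved.
function canonicalJson(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return "[" + value.map((v) => canonicalJson(v)).join(",") + "]";
  const members = Object.entries(value)
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
    .map(([k, v]) => JSON.stringify(k) + ":" + canonicalJson(v));
  return "{" + members.join(",") + "}";
}

function canonicalHash(payload: Json): string {
  return createHash("sha256").update(canonicalJson(payload), "utf8").digest("hex");
}

// Two payloads that differ only inside a nested object.
const payloadA = { allocations: [{ nodeId: "n1" }], clientGeneratedAt: "2026-05-22T10:00:00Z" };
const payloadB = { allocations: [{ nodeId: "n2" }], clientGeneratedAt: "2026-05-22T10:00:00Z" };

// Array-replacer form: "nodeId" is not in the top-level key list, so it is dropped.
const flatA = JSON.stringify(payloadA, Object.keys(payloadA).sort());
const flatB = JSON.stringify(payloadB, Object.keys(payloadB).sort());

console.log(flatA === flatB); // true: the two payloads collide
console.log(canonicalHash(payloadA) === canonicalHash(payloadB)); // false
```

Whichever serialization is chosen, client and server never need to agree on it: the hash is computed server-side only, from the parsed body.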

Retention. The cron sweep is a B-level concern under tech-architect. Minimum-viable for 13-12: a single NestJS scheduled job (@nestjs/schedule @Cron('0 3 * * *')) deletes expiresAt < now. The query is index-backed (@@index([expiresAt])). No alerting on sweep volume in 13-12; logged at info-level only.

3.2 Nest interceptor — `IdempotencyInterceptor` + `@Idempotent()` decorator

New file backend/src/common/idempotency.interceptor.ts. Activates per-endpoint via a controller-method decorator:

@Idempotent({ endpoint: "tree.allocate" })
@Post("/tree/allocate")
async allocate(...) { ... }

Flow on every decorated POST:

1. Read Idempotency-Key header. If absent → 400 IDEMPOTENCY_KEY_REQUIRED (new error code; mobile must send for decorated endpoints).
2. Validate UUIDv4 format. If invalid → 400 IDEMPOTENCY_KEY_INVALID.
3. Compute payloadHash from request body (canonical-JSON sha256).
4. Look up idempotency_keys where (walkerId, key, endpoint) match.
5. If found:
- same payloadHash → return cached responseBody with cached responseStatus. Short-circuit; controller body does not execute.
- different payloadHash → 409 IDEMPOTENCY_CONFLICT with {expectedHash, receivedHash} in details. Controller body does not execute.
6. If not found: proceed to controller. After successful response, insert idempotency_keys row with the response body cached. Insertion happens in a tail-end interceptor branch (post-handler) inside the controller’s transaction if possible, or as a separate insert if the controller did not open a transaction.
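
The header checks that open this flow might look like the sketch below. KeyCheck and checkIdempotencyKey are hypothetical names. Two choices here are assumptions rather than part of the spec: the regex accepts lowercase only (following the “lowercase canonical” rule in §2.1) and enforces the version-4 and variant nibbles, which is stricter than a generic UUID format check.

```typescript
type KeyCheck =
  | { ok: true; key: string }
  | { ok: false; status: 400; error: "IDEMPOTENCY_KEY_REQUIRED" | "IDEMPOTENCY_KEY_INVALID" };

// Version-4 layout: third group starts with "4", fourth group starts with 8, 9, a or b.
const UUID_V4 = /^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/;

function checkIdempotencyKey(header: string | undefined): KeyCheck {
  // A missing or blank header is a different failure from a malformed one.
  if (header === undefined || header.trim() === "") {
    return { ok: false, status: 400, error: "IDEMPOTENCY_KEY_REQUIRED" };
  }
  const key = header.trim();
  if (!UUID_V4.test(key)) {
    return { ok: false, status: 400, error: "IDEMPOTENCY_KEY_INVALID" };
  }
  return { ok: true, key };
}
```

Returning a tagged result instead of throwing keeps the check framework-neutral; a Nest interceptor would translate the failure branch into a BadRequestException carrying the error envelope.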

Critical detail on transactional consistency. The idempotency row insertion MUST happen inside the same transaction as the business write. Otherwise a successful business write + failed key insert leaves a “ghost replay” — the second submission would re-run the business write. Backend-engineer wires this by exposing the IdempotencyInterceptor’s “cache this response” call as a method on the PrismaService request-scoped wrapper, called by the controller inside its $transaction block right before returning. The interceptor’s only job is then the pre-check + 409-on-mismatch; the post-cache is controller-driven.

Why interceptor + controller cooperate rather than pure-interceptor. Pure-interceptor would require wrapping the controller in another transaction (interceptor opens tx, runs controller, commits with insert). This breaks tree.service.ts:156 which already opens its own $transaction. Cleanest is: interceptor pre-checks, controller persists business + key inside its own transaction, interceptor never closes anything.
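
The “ghost replay” hazard and its fix can be illustrated without a database. The sketch below uses an in-memory stand-in with all-or-nothing transactions; FakeDb and allocate are hypothetical, and real code would use Prisma's $transaction instead.

```typescript
// In-memory stand-in for a store with all-or-nothing transactions (illustration only).
class FakeDb {
  allocations: string[] = [];
  keys = new Map<string, { status: number; body: unknown }>();

  transaction<T>(work: (tx: FakeDb) => T): T {
    const staged = new FakeDb();
    staged.allocations = [...this.allocations];
    staged.keys = new Map(this.keys);
    const result = work(staged); // if this throws, `staged` is discarded
    this.allocations = staged.allocations;
    this.keys = staged.keys;
    return result;
  }
}

// Business write and idempotency row are written in the same transaction.
function allocate(db: FakeDb, compositeKey: string, nodeId: string, failKeyInsert = false) {
  return db.transaction((tx) => {
    tx.allocations.push(nodeId); // business write
    const body = { allocated: [...tx.allocations] };
    if (failKeyInsert) throw new Error("key insert failed"); // simulated fault
    tx.keys.set(compositeKey, { status: 200, body }); // cached response for replay
    return body;
  });
}

const db = new FakeDb();
try {
  allocate(db, "w1:key1:tree.allocate", "node-a", true);
} catch {
  // The failed key insert rolled the allocation back as well.
}
allocate(db, "w1:key1:tree.allocate", "node-a"); // the retry applies the write exactly once
console.log(db.allocations.length); // 1
```

Had the key insert run outside the transaction, the first call would have left the allocation committed with no key row, and the retry would have applied it a second time.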

3.3 STALE_ACTION enforcement — guard placement

Three sites:

Site 1 — Zod schema extension (per endpoint). Add clientGeneratedAt: z.string().datetime() to AllocateRequestSchema, new AdvanceStepRequestSchema, new CompleteQuestRequestSchema. The latter two don’t exist as Zod schemas today (quest controller takes path params + empty body); 13-12 introduces them as the first body-validated quest mutations.

Site 2 — A reusable assertNotStale guard. Lives at backend/src/common/stale-action.guard.ts:

import { UnprocessableEntityException } from "@nestjs/common";

const HOUR_MS = 60 * 60 * 1000;
const SEVEN_DAYS_MS = 7 * 24 * HOUR_MS;

// Callers must pass a valid Date (the Zod datetime check in Site 1 runs first);
// an Invalid Date makes ageMs NaN, which passes both comparisons below.
export function assertNotStale(clientGeneratedAt: Date, opts?: { graceHours?: number }): void {
  const graceHours = opts?.graceHours ?? 6;
  const graceMs = graceHours * HOUR_MS;
  // Read the server clock once so the comparison and the reported serverNow agree.
  const serverNow = new Date();
  const ageMs = serverNow.getTime() - clientGeneratedAt.getTime();
  if (ageMs > SEVEN_DAYS_MS + graceMs) {
    throw new UnprocessableEntityException({
      error: "STALE_ACTION",
      message: "Action was generated more than 7 days ago (D-009 §2 offline cap).",
      details: {
        clientGeneratedAt: clientGeneratedAt.toISOString(),
        serverNow: serverNow.toISOString(),
        ageDays: Math.floor(ageMs / (24 * HOUR_MS)),
        capDays: 7,
        graceHours, // the grace actually applied, not a hard-coded 6
      },
    });
  }
  // Also reject future-dated actions: clientGeneratedAt > serverNow + grace.
  // Tampered-clock-forward case. Same grace window.
  if (ageMs < -graceMs) {
    throw new UnprocessableEntityException({
      error: "STALE_ACTION",
      message: "Action clientGeneratedAt is in the server's future beyond clock-skew tolerance.",
      details: {
        clientGeneratedAt: clientGeneratedAt.toISOString(),
        serverNow: serverNow.toISOString(),
        graceHours,
      },
    });
  }
}

Called from TreeService.allocate, QuestService.advanceStep, QuestService.completeQuest — first line, before any DB read. The guard is idempotency-key-aware: if the request is a successful replay (interceptor short-circuits), the guard never runs. This is correct — replays of an action that was accepted within-cap stay accepted forever (or rather, until the key TTL expires, which is also 7d → de-facto same window).
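
The acceptance window the guard implements can be checked at its boundaries with a framework-free sketch. classifyAge is a hypothetical helper that mirrors the guard's two comparisons; it takes the server time as a parameter so the result is deterministic, and it separates “stale” from “future” purely for illustration (the guard reports both as STALE_ACTION).

```typescript
const HOUR = 60 * 60 * 1000;
const DAY = 24 * HOUR;
const CAP_MS = 7 * DAY;
const GRACE_MS = 6 * HOUR;

type Verdict = "accepted" | "stale" | "future";

// Same strict comparisons as the guard: exactly 7d+6h old is still accepted.
function classifyAge(clientGeneratedAtMs: number, serverNowMs: number): Verdict {
  const ageMs = serverNowMs - clientGeneratedAtMs;
  if (ageMs > CAP_MS + GRACE_MS) return "stale";
  if (ageMs < -GRACE_MS) return "future";
  return "accepted";
}

const serverNow = Date.UTC(2026, 4, 22, 12, 0, 0); // 2026-05-22T12:00:00Z

console.log(classifyAge(serverNow - (7 * DAY + 5 * HOUR), serverNow)); // accepted
console.log(classifyAge(serverNow - (7 * DAY + 6 * HOUR), serverNow)); // accepted (boundary)
console.log(classifyAge(serverNow - (7 * DAY + 7 * HOUR), serverNow)); // stale
console.log(classifyAge(serverNow + 1 * HOUR, serverNow)); // accepted
console.log(classifyAge(serverNow + 12 * HOUR, serverNow)); // future
```

Because both comparisons are strict, the accepted interval is closed at both ends: from 7d+6h in the past to 6h in the future, inclusive.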

Site 3 — Quest controller body DTO. Quest endpoints today (/quest/:id/step/:n/advance, /quest/complete) take minimal bodies. 13-12 adds:

POST /quest/:id/step/:n/advance body: { clientGeneratedAt: ISO-8601 } (currently empty / QuestStepAdvanceRequestDto only carries stepIndex which is redundant with path).
POST /quest/complete body: extends { questId, clientGeneratedAt: ISO-8601 }.

3.4 Service updates — tree, quest

tree.service.ts:

allocate(walkerId, dto): assertNotStale(new Date(dto.clientGeneratedAt)) as first line. Remove lines 19-20 stale comment; the durable idempotency-key table now exists. Per-entry validation unchanged.
Idempotency caching: the controller (not the service) calls the interceptor’s “cache this response” inside the $transaction (line 156 block). The service returns the response DTO; the controller threads through.

quest.service.ts:

advanceStep(walkerId, questId, stepNumber, dto, tx?): signature gains the dto for clientGeneratedAt. assertNotStale(new Date(dto.clientGeneratedAt)) as first line. The optional tx param stays; the controller decides whether to wrap.
completeQuest(walkerId, questId, dto): signature gains the dto for clientGeneratedAt. assertNotStale first line.

Existing 409 paths unchanged on the wire (QUEST_ALREADY_COMPLETED, QUEST_ALREADY_IN_PROGRESS, ALREADY_ALLOCATED) — they remain as the “no idempotency key passed” fallback for non-decorated callers. The mobile worker stops relying on these as a substitute for proper idempotency replay because the new interceptor returns the cached 200 first.

3.5 Test surface estimate

Unit tests (~30 new in backend):

idempotency.interceptor.spec.ts — ~10 cases: key-missing 400, key-malformed 400, key-found-same-hash returns cached body, key-found-diff-hash 409 IDEMPOTENCY_CONFLICT, key-not-found proceeds to controller, post-handler insert wired into transaction, sweep cron drops expired rows.
stale-action.guard.spec.ts — ~8 cases: exactly-at-7d passes (within grace), 7d+5h passes, 7d+7h rejects with STALE_ACTION, future-dated +1h passes, future-dated +12h rejects, tz-confusion fixture (PL-zoned vs UTC), boundary at exactly 7d+6h, negative ageMs.
tree.service.spec.ts — extend existing 14 tests with: STALE_ACTION on 8-day-old clientGeneratedAt, IDEMPOTENCY_CONFLICT on same key + altered allocations[], cached-replay on identical resubmit returns identical TreeStateResponseDto body.
quest.service.spec.ts — extend with: STALE on advance, STALE on complete, idempotent replay of completeQuest returns identical CompleteQuestResponseDto including pointsAwarded (D-016 invariant — never double-grant).

Integration tests (~6 new):

E2E idempotency.e2e.spec.ts — full HTTP roundtrip via supertest: enqueue → drain → replay → mismatched-payload-conflict → STALE_ACTION rejection. Covers all three decorated endpoints.

Total estimate: +36 backend tests (current backend = 240 post-13-10 + 13-11 0 deltas; post-13-12 target ~276).

4. Mobile deliverables

4.1 OfflineAction schema additions

OfflineAction.kt gains two columns:

@ColumnInfo(name = "idempotency_key")
val idempotencyKey: String,  // UUIDv4 — generated at enqueue, never re-generated

@ColumnInfo(name = "client_generated_at")
val clientGeneratedAt: Long,  // epoch ms at the moment the user committed

Both NOT NULL. OfflineActionDao requires no DAO changes (column reads are implicit via Room codegen); a Room migration is required (current DB version is whatever 13-5 shipped; bump + add columns). Migration backfills:

idempotency_key: re-use payloadJson’s embedded idempotencyKey field if present (TREE_ALLOCATE already has one); else generate fresh UUIDv4 in the migration.
client_generated_at: backfill to enqueued_at (best-effort — these are pre-13-12 rows; the server’s 6h grace covers the small skew).

Note: a Room destructive migration is acceptable per OfflineQueueDb posture (mock-trust + no canonical persistence), but the backfill is friendlier and only ~10 lines.

4.2 Repository + enqueue sites

OfflineQueueRepositoryImpl.enqueue signature change:

override suspend fun enqueue(
  actionType: String,
  payloadJson: String,
  idempotencyKey: String = UUID.randomUUID().toString(),
  clientGeneratedAt: Long = System.currentTimeMillis(),
): Long

Defaults make existing call-sites compile; explicit overrides allowed for replay-of-replay edge cases (an action that was previously enqueued, dropped from a buggy build, and is being re-enqueued by repair logic).

Three enqueue sites updated:

TreeViewerViewModel.kt:204 — idempotencyKey already generated; thread through.
QuestDetailViewModel (advance enqueue site) — NEW: generate UUIDv4 at enqueue, embed in payload AND pass as column.
QuestDetailViewModel (complete enqueue site) — NEW: same.

The payload JSON carries the same idempotencyKey field (for redundancy, and so the payload is self-describing), but the column is the source of truth.

4.3 Worker dispatch — header attach + new error branches

OfflineQueueWorker.processAction refactor:

Each processQuestAdvance / processQuestComplete / processTreeAllocate helper gains:

Idempotency-Key header attached via Retrofit @Header("Idempotency-Key") idempotencyKey: String (new param on QuestApi.advanceStep, QuestApi.completeQuest, TreeApi.allocate).
Request body extended with clientGeneratedAt field.

Error handling redesigned:

private suspend fun classifyResponse(response: Response<*>): ActionResult {
  return when {
    response.isSuccessful -> ActionResult.Success
    response.code() == 409 -> {
      val errorBody = parseErrorBody(response)
      when (errorBody?.error) {
        "IDEMPOTENCY_CONFLICT" -> ActionResult.PermanentDrop("idempotency-conflict")
        // Treat ALREADY_ALLOCATED / QUEST_ALREADY_COMPLETED / QUEST_ALREADY_IN_PROGRESS as success
        // — they happen on legacy paths or near-simultaneous duplicate enqueue
        else -> ActionResult.Success
      }
    }
    response.code() == 422 -> {
      val errorBody = parseErrorBody(response)
      when (errorBody?.error) {
        "STALE_ACTION" -> ActionResult.PermanentDrop("stale-action")
        "PREREQUISITES_NOT_MET" -> ActionResult.PermanentDrop("prereq-slipped")
        "INSUFFICIENT_POINTS" -> ActionResult.PermanentDrop("points-changed")
        "STEP_OUT_OF_ORDER" -> ActionResult.RebuildQueue
        "ORDERING_VIOLATION" -> ActionResult.RebuildQueue
        else -> ActionResult.TransientRetry  // unknown 422: retried under the §2.6 transient policy to be safe
      }
    }
    response.code() == 400 -> ActionResult.PermanentDrop("bad-request")  // schema bug — log loudly
    response.code() == 401 -> ActionResult.AuthRefresh  // session expired
    response.code() in 500..599 -> ActionResult.TransientRetry
    else -> ActionResult.TransientRetry
  }
}

sealed class ActionResult {
  data object Success : ActionResult()
  data class PermanentDrop(val reason: String) : ActionResult()
  data object RebuildQueue : ActionResult()
  data object TransientRetry : ActionResult()
  data object AuthRefresh : ActionResult()
}

Telemetry per result:

Success — count by action type.
PermanentDrop("stale-action") — emit OfflineActionExpired event (re-use existing event; extend semantics to include server-side rejection, not only client-side cap). UI toast unchanged.
PermanentDrop("idempotency-conflict") — log Crashlytics-level error; this should never happen in normal flow. UI surfaces generic “Sync error” toast.
PermanentDrop("prereq-slipped") / points-changed — toast “This action is no longer valid.”
RebuildQueue — fetch GET /quest/available, reconcile current step, re-enqueue subsequent actions starting from the corrected position. This is a 13-12 surface; v1 implementation can simply drop downstream actions in the same quest’s chain and let the user re-tap; full rebuild is a B-level follow-up.

4.4 Stress-test scenarios

Mobile-developer must include at minimum these test cases in the worker’s unit test surface:

Clock-skew device. Mock System.currentTimeMillis() to be 30 days in the past at enqueue, current at send. Verify worker still sends; server returns STALE_ACTION (mocked); worker drops with telemetry; UI sees the OfflineActionExpired event.
Queue drain mid-disconnect. Enqueue 5 actions while online, then mid-drain (after 2 actions sent) simulate connection loss. Verify the 3 remaining actions stay in queue with correct retry_count increments and identical idempotency_keys on next attempt.
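Duplicate enqueue same UUID. Two enqueue calls passing the same explicit idempotencyKey (replay-of-replay path). Verify the new unique index (walker_id, action_type, idempotency_key) preserves the first row and the second enqueue is a no-op. Note that Room’s OnConflictStrategy.ABORT rejects the second insert by throwing SQLiteConstraintException, so the repository must either catch that exception or use OnConflictStrategy.IGNORE for the second enqueue to behave as a no-op.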
409 IDEMPOTENCY_CONFLICT. Mock backend response: same key, different payload. Verify worker drops action + emits Crashlytics telemetry.
Cached-replay 200. Mock backend response: same key, identical payload, server returns cached 200 body. Verify worker treats as success + removes from queue.
Backoff sequence. 5 consecutive 500s. Verify delays approximate 5s, 10s, 20s, 40s, 80s with jitter. Separately, drive an action to its 12th attempt and verify it defers to the WorkManager periodic re-schedule.
STEP_OUT_OF_ORDER queue rebuild. Enqueue advance for step 3 + advance for step 4. Mock server: step 3 → 422 STEP_OUT_OF_ORDER (walker is at step 2). Verify worker fetches /quest/available, drops both 3+4, surfaces toast.
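
The delay arithmetic behind the backoff scenario can be sketched as follows. This is a language-neutral illustration written in TypeScript; the worker itself is Kotlin. The 5s base and doubling factor are inferred from the 5s, 10s, 20s, 40s, 80s sequence above. The ±20% jitter fraction and the names nominalDelayMs, jitteredDelayMs and isWithinJitter are assumptions for illustration only; the authoritative policy is §2.6.

```typescript
// Assumed constants: base and factor are inferred from the 5s..80s sequence;
// JITTER_FRACTION is a placeholder, not a value taken from §2.6.
const BASE_DELAY_MS = 5000;
const BACKOFF_FACTOR = 2;
const JITTER_FRACTION = 0.2;

/** Nominal delay before retry number `attempt` (1-based), without jitter. */
function nominalDelayMs(attempt: number): number {
  if (!Number.isInteger(attempt) || attempt < 1) {
    throw new RangeError("attempt must be a positive integer");
  }
  return BASE_DELAY_MS * Math.pow(BACKOFF_FACTOR, attempt - 1);
}

/**
 * Nominal delay plus symmetric jitter. `rand` returns a value in [0, 1)
 * and is injectable so a unit test can pin the jitter deterministically.
 */
function jitteredDelayMs(attempt: number, rand: () => number = Math.random): number {
  const nominal = nominalDelayMs(attempt);
  const offset = (rand() * 2 - 1) * JITTER_FRACTION * nominal;
  return Math.round(nominal + offset);
}

/** Tolerance check a test can apply to an observed delay. */
function isWithinJitter(attempt: number, observedMs: number): boolean {
  const nominal = nominalDelayMs(attempt);
  return Math.abs(observedMs - nominal) <= JITTER_FRACTION * nominal;
}

const nominalSequence = [1, 2, 3, 4, 5].map(nominalDelayMs);
// nominalSequence is [5000, 10000, 20000, 40000, 80000]
```

Injecting the random source keeps the “approximate … with jitter” assertion deterministic: the test pins rand and compares against the tolerance band rather than an exact value.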

4.5 Test surface estimate

Unit tests (~29 new in walkrpg-mobile):

OfflineActionTest — Room migration test: backfill existing rows correctly.
OfflineQueueRepositoryImplTest — 4 new cases for the idempotencyKey / clientGeneratedAt columns + unique-conflict.
OfflineQueueWorkerTest — 7 cases per §4.4 stress scenarios.
OfflineQueueWorkerClassifyResponseTest — ~12 cases: each row of the 4.3 error table, plus unknown-error fallback.
TreeViewerViewModelTest + QuestDetailViewModelTest — extend with clientGeneratedAt + idempotencyKey enqueue parameter assertions (~3 cases each = 6 total).

Total estimate: +29 mobile tests (current Android = 300 post-13-11; post-13-12 target ~329).

5. W-locks for CEO (max 3, A/B/C)

Q1 — Idempotency-key TTL

How long does the server keep cached response bodies for replay?

A. 7 days (match D-009 §2 offline cap exactly). After 7d, the original action is no longer replayable; any replay attempt returns the same STALE_ACTION error the original would now produce. Symmetric, single mental model.
B. 30 days. Generous slack for “I went off-grid, came back, walker queue had key but action was already stale; I want to see the original response for forensic reasons.” Adds storage cost (~4x the row count).
C. 24 hours. Aggressive cleanup; the cap really only matters within hours of the original write. Tighter storage but risks rejecting legitimate retries from a poorly-implemented client.

Recommendation: A. D-009 is the canonical cap; making the idempotency TTL match it gives a single mental model + symmetric expiry. The “client retries 8 days later” path is by-construction STALE_ACTION on the action level; the missing cached response then doesn’t matter (the client gets 422 STALE_ACTION fresh-computed). 30d (option B) opens a window where a stale action’s cached 200 could replay AFTER the cap should have rejected it — confusing semantics.
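
The argument for matching the TTL to the cap can be sketched as a pure decision function. This is a minimal sketch under stated assumptions, not the interceptor design: it assumes the cache lookup runs before the stale check, that the key row is created at the moment the action is generated (an online first submit), and it ignores the 6h clock-skew grace. The names resolveRequest, CachedKey and Outcome are hypothetical.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;
const OFFLINE_CAP_MS = 7 * DAY_MS; // D-009 7-day cap

type Outcome = "CACHED_REPLAY" | "STALE_ACTION" | "EXECUTE";

// Hypothetical shape: only the creation time matters for this illustration.
interface CachedKey {
  createdAtMs: number;
}

/**
 * Decides how a request is answered.
 * Assumed order: live cache hit first, then the stale check, then execution.
 */
function resolveRequest(
  clientGeneratedAtMs: number,
  cached: CachedKey | undefined,
  nowMs: number,
  ttlMs: number,
): Outcome {
  if (cached !== undefined && nowMs - cached.createdAtMs <= ttlMs) {
    return "CACHED_REPLAY"; // key still within its TTL: replay the stored body
  }
  if (nowMs - clientGeneratedAtMs > OFFLINE_CAP_MS) {
    return "STALE_ACTION"; // freshly computed 422, no cached body needed
  }
  return "EXECUTE";
}

// Action generated and first submitted at t = 0; the client retries on day 8.
const key: CachedKey = { createdAtMs: 0 };
const day8 = 8 * DAY_MS;

const optionA = resolveRequest(0, key, day8, 7 * DAY_MS);  // "STALE_ACTION"
const optionB = resolveRequest(0, key, day8, 30 * DAY_MS); // "CACHED_REPLAY"
```

Under option A the expired key drops out and the retry receives the same STALE_ACTION the original would now produce. Under option B the cached 200 is served a day after the cap should have rejected the action.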

Q2 — Action ordering enforcement

How strict is the ordering contract between walks / tree / quest actions on the wire?

A. Per-walker FIFO — client guarantees submit-order matches user-commit-order. Server serializes via DB row-level locks but does not actively reject “out of order” submissions except for STEP_OUT_OF_ORDER. Loose; simple.
B. Per-action-type FIFO — server tracks last-seen sequence number per (walker, action_type); any out-of-sequence rejected with ORDERING_VIOLATION. Stricter; requires sequence number on wire + per-walker counter.
C. Global per-walker sequence number with hard-reject — every queueable action carries a monotonic counter; server rejects gaps + out-of-order with ORDERING_VIOLATION. Strictest; bulletproof against malicious reorder; expensive (counter increment on every enqueue + send).

Recommendation: A. Per-walker FIFO is sufficient for 13-12 because (a) the existing STEP_OUT_OF_ORDER catches the only known causal-ordering violation today, (b) D-018 Energy ordering will be enforced by transactional balance check (insufficient-energy error) not by sequence numbers, (c) Phase-13 backend has no per-walker counter primitive and adding one is scope-creep. B/C are reserved for production hardening where attestation makes the sequence number trustworthy.

Q3 — D-018 Energy ledger scope for 13-12

Does 13-12 ship the Energy ledger schema speculatively, or forward-reference?

A. Forward-reference only. Memo §2.7 documents the contract; no ledger schema, no commit-on-action wiring. 13-12 ships clean per CEO “core scope only” triage. The next sub-phase (13-N+1 or post-Phase-13) ships the ledger.
B. Ship the ledger schema + Energy column on Walker. No commit-on-action wiring yet; just the storage layer ready. Inflates 13-12 by ~0.5 ticks; gives the schema a home and lets backend tests reference it.
C. Ship the ledger + wire commits for all four sinks (combat, tree, craft, harvest). Materially expands scope by ~1.5 dispatch ticks. Out of “core scope only” framing.

Recommendation: A. D-018 §What-not-decide explicitly says ledger spec is “tech-architect; Phase 13+” — not pinned to 13-12. The contract in §2.7 + §2.6 (INSUFFICIENT_ENERGY reserved code) is sufficient to ensure the future ledger lands without re-litigating idempotency or ordering. CEO triage said “core only” for 13-12; ledger is out.

Q4 — STALE_ACTION user-facing treatment (B-level if CEO defers)

When the server rejects a queued action with STALE_ACTION, what does the mobile UI do?

A. Silent drop with toast. Re-use the existing OfflineActionExpired event toast pattern from 13-5 client-side expiry. Walker sees “Some queued actions expired (offline > 7 days)” once per drain cycle.
B. Surface as data-loss notification. Persistent in-app inbox entry listing each dropped action’s user-visible label (“Allocate node.even-stride was dropped — expired”). User can dismiss individually.
C. Pre-warn at queue-age threshold. When any action is between 5-7 days old, show a banner: “Sync soon — actions older than 7 days will expire.” Combine with A on actual drop.

Recommendation: A. Existing surface, lowest friction. The 7-day cap is a backstop — at 13-12 maturity the actual queue depth is small (closed-beta cohort, no multiplayer). Option C is a UX nice-to-have for a later polish pass; B is over-investing in an exception case. This is B-level under ui-designer + mobile-developer — flagged as Q4 only because the answer affects mobile telemetry shape; CEO can defer to leads.

6. Out of scope (explicit defers)

Per phase-13-plan §6 13-12 row’s FLAG_LEAD A + B (CEO triaged “core only”):

D-026 minimal push absorption. No FCM token registration; no per-class opt-in surface; no notification firing matrix wiring. The full notification spec is post-Phase-13 (ui-designer + mobile-developer + narrative-designer paired) per §10.9 of phase-13-plan.
D-027 diegetic onboarding absorption. Current 13-3 class-pick UI is treated as Phase-13-final per phase-13-plan §10.10 recommendation (option b). Bertranda walkthrough + 50-step register surface + class-selection-as-Quest-001-beat-1-close all defer to a post-Phase-13 onboarding sub-phase.
D-018 Energy ledger schema. Forward-reference only per §2.7. The contract (idempotency-key funds future debits, action-submit-order = debit-commit-order) is locked here; no schema ships.
Production attestation of clientGeneratedAt. Per ADR-0006 mock-trust posture, the field is client-asserted not OS-attested. Production migration adds Play Integrity / DeviceCheck verification that the device clock is not user-overridable. Out of 13-12.
Reconciliation worker for provisional TreeAllocation rows. ADR-0002 §6 reconciliation flow + provisional flag flipping. Out of 13-12; the 7d cap is the action-level floor, not the provisional-flag flip mechanism.
ORDERING_VIOLATION active throws. Code is reserved in the error taxonomy (§2.5) but no 13-12 code path throws it. Reserved for the second causal-ordering invariant when it surfaces.
D-020 pull-based encounter pool ordering. Combat encounters in 13-12 still fire only from quest beats; pool-driven tropy accumulation has its own ordering semantics (per D-020 §2) handled when that sub-phase lands.
Multi-device queue reconciliation. A walker logged in on two devices, each with their own offline queue, could submit identical idempotency_keys if generated independently. The composite unique on (walkerId, key, endpoint) correctly merges them — but the second device’s queue still believes it submitted; UI consistency is a Phase 14+ concern.

7. Dispatch order (recommended)

Tick 1 — backend-engineer:
- 1a. Prisma migration add_idempotency_keys. Verify schema validate + migrate runs clean.
- 1b. backend/src/common/canonical-hash.ts + backend/src/common/stale-action.guard.ts + backend/src/common/idempotency.interceptor.ts + @Idempotent() decorator.
- 1c. Wire interceptor + guard into tree.controller.ts, quest.controller.ts (advance + complete routes). Update Zod schemas + DTOs for new clientGeneratedAt.
- 1d. Daily cron sweep job (@nestjs/schedule).
- 1e. Tests per §3.5 (~36 new).
- 1f. Swagger doc updates — Idempotency-Key header on all three endpoints + clientGeneratedAt body field + STALE_ACTION + IDEMPOTENCY_CONFLICT error responses.
Tick 2 — mobile-developer:
- 2a. Room migration: OfflineAction gains idempotency_key + client_generated_at columns + unique index (walker_id, action_type, idempotency_key). Backfill.
- 2b. OfflineQueueRepositoryImpl.enqueue signature change + 3 call-site updates (TreeViewerViewModel + QuestDetailViewModel × 2).
- 2c. OfflineQueueWorker.processAction refactor with classifyResponse sealed-class result; Idempotency-Key header on all three Retrofit interfaces; clientGeneratedAt body field.
- 2d. Backoff + jitter retry policy per §2.6.
- 2e. Tests per §4.4 + §4.5 (~29 new).

Tick 1 ships independently. The server tolerates an absent Idempotency-Key on un-decorated endpoints, while the newly decorated endpoints return 400 IDEMPOTENCY_KEY_REQUIRED to any pre-13-12 client. That is acceptable because the only client is the mobile app, which is gated on app version. Tick 2 depends on Tick 1’s wire-shape commits.
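
The compatibility rule above can be sketched as a framework-free check. This is an illustration only: the real logic would live in the Nest interceptor behind @Idempotent(). The function name checkIdempotencyKey is hypothetical, header names are assumed to arrive lowercased (as Node’s http module delivers them), and rejecting a malformed key with the same code as a missing one is an assumption. The envelope shape follows ADR-0007 §7.

```typescript
// Error envelope shape per ADR-0007 §7: {error, message, details, requestId}.
interface ErrorEnvelope {
  error: string;
  message: string;
  details: Record<string, unknown>;
  requestId: string;
}

// UUIDv4: version nibble is 4, variant nibble is 8, 9, a or b.
const UUID_V4 = /^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/i;

/**
 * Returns null when the request may proceed, or the 400 envelope to send.
 * `decorated` stands in for "this route carries @Idempotent()".
 */
function checkIdempotencyKey(
  headers: Record<string, string | undefined>,
  decorated: boolean,
  requestId: string,
): ErrorEnvelope | null {
  if (!decorated) {
    return null; // un-decorated endpoints tolerate an absent header
  }
  const key = headers["idempotency-key"];
  if (key !== undefined && UUID_V4.test(key)) {
    return null;
  }
  return {
    error: "IDEMPOTENCY_KEY_REQUIRED",
    message: "Idempotency-Key header with a UUIDv4 value is required",
    details: { header: "Idempotency-Key" },
    requestId,
  };
}

// A pre-13-12 client hitting a decorated route without the header:
const rejected = checkIdempotencyKey({}, true, "req-123");
// rejected?.error is "IDEMPOTENCY_KEY_REQUIRED"
```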

8. Risks + mitigations

Transactional consistency of interceptor-cached response. The hardest part: the response body must be cached inside the same transaction as the business write. Mitigation: controller-driven post-cache hook (§3.2) — interceptor pre-checks only; controller calls a cacheIdempotentResponse(...) method inside its own $transaction. Pattern documented + reviewed at code-review.
payloadHash determinism across JS engines. The canonical serialization must be byte-identical between Node 22 (backend) and any potential future serverless re-host. JSON.stringify on its own preserves insertion order and does not sort keys, so the keys must be sorted explicitly before serializing. Mitigation: extract the canonical-hash helper to a single utility, lock test fixtures with known hashes, and follow ADR-0002’s existing canonical-hash convention as the model.
Room migration data loss. Adding NOT NULL columns to an existing table requires backfill. Mitigation: explicit migration with backfill SQL + test against a fixture DB built from 13-11’s schema.
Clock-skew false positives. 6h grace is empirically chosen; some walkers (international travelers crossing date-line offline) may exceed. Mitigation: telemetry on STALE_ACTION rejection rate per walker — production migration tunes the grace window if false-positive rate exceeds 1%. For 13-12, 6h is a safe default.
D-018 ordering contract drift before ledger ships. The §2.7 commitment binds future tech-architect work; if a future sub-phase ships a ledger that breaks the contract, the cached responses in idempotency_keys would mis-account. Mitigation: contract documented + ADR-0002 cross-references this memo; ledger PR will be reviewed against §2.7 explicitly.
CEO option (b) on D-027 means the §5 exit-scenario walkthrough is not updated. Phase-13-plan §10.10 already notes this — 13-13 owns the editorial pass. 13-12 stays narrow.
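
For the payloadHash determinism risk above, a minimal sketch of a key-sorted canonical hash might look like the following. It is an assumption-laden illustration, not the contents of backend/src/common/canonical-hash.ts: the names canonicalize and payloadHash, the choice of SHA-256, and the hex encoding are all hypothetical, and ADR-0002’s actual convention takes precedence where it differs.

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

/**
 * Serializes a JSON value with object keys sorted at every depth, so two
 * payloads that differ only in key order produce the same string.
 */
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") {
    return JSON.stringify(value); // primitives: defer to JSON.stringify
  }
  if (Array.isArray(value)) {
    // Array order is significant and is preserved.
    return "[" + value.map(canonicalize).join(",") + "]";
  }
  const keys = Object.keys(value).sort(); // default sort: UTF-16 code-unit order
  const members = keys.map(
    (k) => JSON.stringify(k) + ":" + canonicalize(value[k] as Json),
  );
  return "{" + members.join(",") + "}";
}

/** Hex SHA-256 over the canonical UTF-8 serialization (algorithm is assumed). */
function payloadHash(payload: Json): string {
  return createHash("sha256").update(canonicalize(payload), "utf8").digest("hex");
}

// Same logical payload, different key order:
const h1 = payloadHash({ nodeId: "node.even-stride", meta: { b: 2, a: [1, 2] } });
const h2 = payloadHash({ meta: { a: [1, 2], b: 2 }, nodeId: "node.even-stride" });
// h1 === h2
```

Locking a handful of such payloads to their known hashes as test fixtures is what would catch a divergence after a runtime change.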

9. Deferred / NOT in 13-12 (appendix)

D-026 push notification infrastructure (FCM, opt-in surface, voice copy bank).
D-027 diegetic onboarding (Bertranda walkthrough, 50-step register, class-pick-as-beat-1-close).
D-018 Energy ledger schema + commit-on-action wiring.
Reconciliation worker for provisional TreeAllocation flag.
Production attestation (Play Integrity / DeviceCheck) of clientGeneratedAt.
ORDERING_VIOLATION active throw paths (code reserved only).
Multi-device queue reconciliation.
D-020 pull-based encounter pool ordering.
Queue-age pre-warn UX (Q4 option C — B-level under ui-designer).
Per-walker monotonic sequence numbers (Q2 options B/C).