ADR-0009 — VPS migration (Phase 14, Hetzner CX22)

Status: Accepted (CEO ratifying via deployment 2026-05-22 evening)
Date: 2026-05-22
Owner: tech-architect
Paired implementation: tech-architect + CEO (provisioning), backend-engineer (Dockerfile / compose / nginx / .gitlab-ci.yml follow-up)
Supersedes: none
Amends scope (for Phase 14 only): ADR-0006 (test-phase infrastructure; VPS replaces local + tunnel topology)
Related canon: D-015 (post-Phase-12 phasing; Phase 14 = VPS migration, Hetzner CX22 named), D-009 §3 (GDPR; EU residency mandate, satisfied by Hetzner Nuremberg / Falkenstein), D-007 §3 (anti-cheat; restrictive posture, mock auth note preserved), ADR-0006 (mock-auth backend posture, Docker Compose stack template, migration plan §2 names this ADR as Phase 14 trigger), ADR-0007 (Android network layer; §9 BASE_URL reconfiguration target)

1. Context

D-015 ratifies Phase 14 as a numbered phase: Hetzner CX22, public IP, asynchronous testing across timezones, test cohort growth past ~30 walkers without the CEO-laptop SLA ceiling that ADR-0006 §Migration plan §2 explicitly named as the trigger.

The trigger fires now because:

  • Phase 13 closed 12/13 sub-phases with 13-13 device walkthrough prep complete. The backend has the surface the closed-beta cohort needs (/auth/callback, /walker/profile, /step/ingest, /tree/state, /quest/*, /combat/encounter). No further endpoint additions block Phase 14.
  • The CEO-laptop tunnel SLA is now actively biting: every tester ping requires the laptop to be on with cloudflared running. Phase 13 close-out is blocked until testers can reach the backend asynchronously across timezones.
  • D-009 §3 EU residency mandate is satisfiable on Hetzner without engaging the paused-indefinitely GCP europe-central2-warsaw estate. Hetzner Nuremberg (NUR) and Falkenstein (FSN1) are both in Germany, both inside EU/GDPR jurisdiction, both eligible. ADR-0006’s “production-target preservation” stance holds — this ADR adds a transitional infrastructure layer, it does not unfreeze D-007/D-008/D-009 production scope.
  • Closed-beta cohort is ~20 walkers (CEO + trusted recruits). Cost-per-walker tolerance is high; cost-per-month tolerance is essentially zero until cohort signal justifies it. €5.70/month VPS + €10-15/year domain satisfies both.

This ADR is the Phase 14 spec. The paired runbook (wiki/src/content/docs/tech/phase-14-runbook.md) is CEO-executable from the ADR’s stack-layout decisions.

Portfolio coexistence (amendment, ratified 2026-05-22 evening): the same VPS additionally hosts a personal portfolio site at morrisassert.dev (apex + www). The portfolio is a Next.js 14 standalone container alongside the backend stack; nginx differentiates by Host header. This addition is B-level — it does not change the security posture, cost band, or cohort scale of the Phase 14 backend, and reuses the same Hetzner CX22 + Cloudflare + Let’s Encrypt SAN cert + nginx ingress. The portfolio code lives in portfolio/ at the monorepo root (pnpm workspace @walkrpg/portfolio). See §3 for the updated service list, §5 for the multi-domain DNS + SAN cert topology, and the paired portfolio Dockerfile at portfolio/Dockerfile.

Wiki coexistence (amendment, ratified 2026-05-23): the same VPS additionally hosts the WalkRPG wiki (Astro Starlight static site, pnpm workspace @walkrpg/wiki) at wiki.morrisassert.dev. The wiki is a build-once / serve-statically container: wiki/Dockerfile.build produces wiki/dist/ and copies it onto a shared docker named volume wiki-static, which the main nginx mounts read-only and serves directly. Cloudflare Access (Zero Trust Free tier, ≤50 users) gates the hostname AT EDGE before requests reach the origin — nginx implicitly trusts the Cf-Access-Jwt-Assertion header (defense-in-depth JWT validation against the per-application JWKS endpoint is a tracked follow-up, see §3.5). This amendment is B-level — it does not change cost band (Hetzner CX22 unchanged, CF Access free tier $0), does not change cohort scale, and reuses the same nginx + LE SAN cert + Cloudflare proxy infrastructure. A future-state walkrpg.morrisassert.dev frontend can use the identical CF Access pattern when ratified. See §3.1 for the updated service list, §3.5 for the auth/serve model, §5 for the SAN cert topology now covering five hostnames.

Multi-env amendment (2026-05-22, ratified via ADR-0010): the original ADR-0009 framing assumed a single env on this VPS (“master = prod”). ADR-0010 supersedes that implicit assumption: the same VPS now hosts two backend stacks (prod + staging) keyed off two branches (main + dev). The compose topology, nginx config, and LE SAN list extend to cover a sixth hostname (api-staging.walkrpg.morrisassert.dev). The wiki + portfolio remain prod-env-only. See §3.1 for the per-env service-list breakdown, §13.8 (known limitations) for the cross-link to ADR-0010 follow-ups, and the paired backend/docker-compose.staging.yml file shipped alongside ADR-0010.

2. Decision

Phase 14 ships against a single-host Hetzner CX22 VPS running a Docker Compose stack identical in shape to ADR-0006’s local stack, with public TLS via Cloudflare proxy + Let’s Encrypt origin certificate, and GitLab-CI-driven push-to-master deploy.

| Component | Test phase (ADR-0006) | Phase 14 (this ADR) | Production target (D-007/8/9, ADR-0001) |
| --- | --- | --- | --- |
| Compute | Local Docker Compose on CEO laptop | Hetzner CX22 single host, Docker Compose | Cloud Run + Cloud Run realtime + walkrpg-jobs |
| Database | Local Postgres 16 container | Postgres 16 container, host-mount volume | Cloud SQL Postgres 16 europe-central2-warsaw |
| Auth | Mock JWT | Mock JWT (unchanged, AUTH_MODE=mock) | Firebase Auth Path B |
| External reach | Cloudflare Tunnel (laptop-tethered) | Cloudflare proxy (orange cloud) → nginx (443) → api (3000) | Cloud Load Balancer + Cloud Armor + EU region pin |
| TLS | Cloudflare edge only (tunnel terminus) | Cloudflare Full (strict) + Let’s Encrypt origin cert via certbot | GCP-managed certs |
| Reconciliation worker | None | None (matches ADR-0006) | Cloud Tasks reconcile-steps 60s coalescing |
| Anti-cheat | Schema + impossible-burst | Schema + impossible-burst (unchanged) | Three-layer defense in depth (ADR-0004) |
| Backups | None | Daily pg_dump to host /var/backups/walkrpg/, 7-day rotation | Cloud SQL automated backups + PITR |
| Observability | docker logs on laptop | docker compose logs on VPS, stdout-structured | Cloud Logging + Cloud Monitoring + alerting |
| CI/CD | Manual pnpm dev on laptop | GitLab CI: lint → test → build-image → ssh-deploy on push-to-master | Cloud Build → Cloud Run revision deploy |
| EU residency | Cloudflare edge proxy (egress only) | Hetzner DE (Nuremberg or Falkenstein): fully EU-resident origin + data at rest | GCP europe-central2-warsaw |
| Op cost | €0/mo (laptop electricity) | €5.70/mo VPS + €10-15/yr domain ≈ €78/yr all-in | Band A ~€143/mo, scaling per ADR-0001 |
| SLA ceiling | CEO laptop uptime | Hetzner CX22 SLA (~99.9%) + single-host failure mode | Multi-zone GCP, ~99.95% target |

The Prisma schema is identical between Phase 14 and both ADR-0006 (test) and ADR-0001 (production). The data model is invariant across migration boundaries — only deployment, auth, and reconciliation layers differ.

3. Stack layout

3.1 Per-env stack — Docker Compose services

Prod stack defined in backend/docker-compose.prod.yml. Six services run side-by-side per the portfolio + wiki coexistence amendments (§1). Staging stack defined in backend/docker-compose.staging.yml per the ADR-0010 multi-env amendment — two services (api-staging + db-staging) on a separate compose project, separate volume, separate .env.staging. The prod nginx joins the staging network as external to terminate TLS for api-staging.walkrpg.morrisassert.dev (preserves the single-ingress invariant from §3.2).

Prod stack:

| Service | Image | Purpose | Network | Restart |
| --- | --- | --- | --- | --- |
| api | registry.gitlab.com/positive-walkers/walkrpg/backend:<sha> | NestJS 11 + Prisma 6, invoked via tsx src/main.ts (NOT node dist/main; see §13.2) | walkrpg-net (internal) | unless-stopped |
| db | postgres:16-alpine | Postgres 16, data volume mounted at /var/lib/postgresql/data from host ./pgdata/ | walkrpg-net (internal only, NO host port exposure) | unless-stopped |
| nginx | nginx:1.27-alpine | Reverse proxy + static file server. Five server blocks: morrisassert.dev + www.morrisassert.dev → web:3000 (portfolio), api.walkrpg.morrisassert.dev → api:3000 (backend), walkrpg.morrisassert.dev → 503 placeholder, wiki.morrisassert.dev → static files from the wiki-static named volume. Terminates TLS using the LE SAN cert mounted from host /etc/letsencrypt/, emits HSTS + standard security headers | walkrpg-net + host ports 80, 443 | unless-stopped |
| certbot | certbot/certbot:latest | Let’s Encrypt cert renewal sidecar. Webroot challenge via shared volume with nginx (/var/www/certbot/). Renews the single multi-domain SAN cert covering all hostnames (five prod hostnames, plus the staging API hostname per the ADR-0010 amendment). Cron-mode: --keep-until-expiring runs every 12h, post-renewal hook signals nginx reload. | host 80 (webroot only, shared with nginx) | unless-stopped |
| web | registry.gitlab.com/positive-walkers/walkrpg/portfolio:<sha> | Next.js 14 standalone (Morris portfolio). Built from portfolio/Dockerfile against the monorepo root. Listens on :3000 inside the docker network; NO host port. | walkrpg-net (internal only) | unless-stopped |
| wiki-builder | registry.gitlab.com/positive-walkers/walkrpg/wiki:<sha> | Astro Starlight static-build container. Multi-stage build (builder + busybox-publish) emits wiki/dist/; the container command copies the tree onto the shared wiki-static named volume at startup, then runs tail -f /dev/null to keep the container alive. NO network exposure: nginx serves the static tree directly from the volume. Cloudflare Access gates wiki.morrisassert.dev at edge. | walkrpg-net (internal, no port surface) | no (one-shot build + keepalive) |

Staging stack (added per ADR-0010 §4):

| Service | Image | Purpose | Network | Restart |
| --- | --- | --- | --- | --- |
| api-staging | registry.gitlab.com/positive-walkers/walkrpg/backend:<sha> (same image as prod, different tag at deploy time) | NestJS 11 + Prisma 6 running against the staging database. Loads .env.staging. Distinct JWT_SECRET from prod so sessions don’t cross envs. | walkrpg-net-staging (internal) | unless-stopped |
| db-staging | postgres:16-alpine | Postgres 16 for staging, data volume mounted at /var/lib/postgresql/data from host ./pgdata-staging/. Internal-only, NO host port. | walkrpg-net-staging (internal only) | unless-stopped |

The prod nginx + certbot + portfolio + wiki services are shared (not duplicated). The LE SAN cert covers all six hostnames (see §5).

3.2 Network topology

    Internet
       │
       ▼
    Cloudflare edge (proxy, orange cloud, SSL Full strict)
       │  + Cloudflare Access (Zero Trust) gates wiki.morrisassert.dev at edge
       │  HTTPS for all five hostnames (apex + www + api + walkrpg + wiki subdomains)
       ▼
    Hetzner CX22 public IPv4 (ufw: 22/80/443 only)
       │
       ▼
    nginx :443 (LE SAN cert from /etc/letsencrypt/live/morrisassert.dev/)
       ├─ Host: morrisassert.dev, www.morrisassert.dev → web :3000 (Next.js)
       ├─ Host: api.walkrpg.morrisassert.dev → api :3000 (NestJS)
       │                                          │
       │                                          ▼
       │                                       db :5432 (Postgres, internal-only,
       │                                       never bound to host port)
       ├─ Host: walkrpg.morrisassert.dev → nginx 503 placeholder
       │                                    (reserved future WalkRPG frontend)
       └─ Host: wiki.morrisassert.dev → static files from wiki-static volume
                                         (CF Access-gated; wiki-builder service
                                         populates the volume at startup)

Key invariants:

  • db has no host port binding. Postgres is reachable only from inside walkrpg-net. No 5432:5432 in compose. Reduces attack surface to zero externally.
  • web has no host port binding. The portfolio is reachable only from nginx on walkrpg-net. Same internal port (3000) as api is fine — docker service-name DNS disambiguates.
  • wiki-builder has no host port binding and no proxy_pass target. It writes to the shared wiki-static named volume; nginx serves the tree directly via root /var/www/wiki. Preserves the single-nginx-ingress invariant.
  • nginx is the only public ingress. ufw blocks all other inbound ports.
  • Cloudflare proxy is mandatory for the public TLS chain. CF “Full (strict)” mode means CF validates the origin cert; LE-issued cert on origin satisfies this. If CF is set to “Flexible”, the origin-edge hop is plaintext — that is a misconfiguration that the runbook §H smoke test detects.
  • Cloudflare Access gates the wiki + Swagger at edge, before requests reach the origin. CF Access is also mandatory for the wiki path; bypassing the orange cloud bypasses the auth gate.
  • One SAN cert for every hostname: five prod hostnames, six once the ADR-0010 staging hostname is included. Renewal updates a single fullchain; nginx reload picks it up across all server blocks.

3.3 Volumes

| Volume | Host path | Container mount | Owner | Purpose |
| --- | --- | --- | --- | --- |
| Postgres data | /home/deploy/walkrpg/backend/pgdata/ | /var/lib/postgresql/data | db | Database files. Backup target. |
| LE certs | /etc/letsencrypt/ | /etc/letsencrypt/ (ro) | nginx, certbot | TLS certs. |
| Webroot | /home/deploy/walkrpg/backend/certbot-webroot/ | /var/www/certbot/ | nginx, certbot | LE HTTP-01 challenge. |
| Backups | /var/backups/walkrpg/ | (host-side, via docker exec cron) | db (via cron) | Daily pg_dump output. |
| Wiki static | docker-managed (walkrpg_wiki_static) | /wiki/dist/ (wiki-builder, rw) / /var/www/wiki/ (nginx, ro) | wiki-builder writes, nginx reads | Astro Starlight static output. Repopulated by wiki-builder on container start; reproducible from git + lockfile, so persistence across restarts is incidental. |

3.5 Wiki coexistence — auth + serve model

The wiki.morrisassert.dev hostname is served as static files from the wiki-static docker named volume, populated at container startup by the wiki-builder service. The wiki-builder runs wiki/Dockerfile.build (multi-stage Astro build → busybox publish), executes cp -r /build/wiki/dist/. /wiki/dist/ against the shared volume, then tail -f /dev/null to keep the container alive for docker compose ps honesty.

Why Option A (volume-share + tail) over Option B (standalone wiki container behind an nginx proxy_pass):

  • Preserves the single-nginx-ingress invariant from §3.2 — only one TLS termination surface.
  • Cheaper at runtime. Wiki content is rarely-changing canon; rebuild on every CI deploy is fine. No standby nginx-like surface to keep memory-resident.
  • Simpler attack surface. The wiki container exposes zero ports; the volume is the only data path.

The tradeoff is the mild idiomatic-but-not-pretty cp + tail -f keepalive pattern. Documented inline in backend/docker-compose.prod.yml so the next reader has full context.

Cloudflare Access — auth at edge.

The walkrpg-wiki Application is a Self-hosted Application in the Zero Trust dashboard:

  • Application domain: wiki.morrisassert.dev (path: blank — gates whole host)
  • Session duration: 24 hours
  • Identity providers: One-time PIN minimum (email OTP); optional Google / GitHub fast-follows
  • Policy: morris-only — Action Allow, rule type Emails, value <CEO_EMAIL> (extendable by editing the Emails selector to include additional tester addresses; takes effect within ~30s)

A second walkrpg-swagger Application is configured against api.walkrpg.morrisassert.dev with path /api/docs/*. This gate only activates when SWAGGER_ENABLED=true (currently false in prod per §13.1); configuring it now means the gate is live the moment Swagger is re-enabled with no scramble.

nginx — implicit trust of CF Access JWT.

CF Access forwards authenticated requests to the origin with two artifacts:

  • CF_Authorization cookie (the signed JWT)
  • Cf-Access-Jwt-Assertion request header (same JWT, header form)

nginx trusts both implicitly today — there is no auth_request directive that validates the JWT against the per-application JWKS endpoint at https://<team>.cloudflareaccess.com/cdn-cgi/access/certs. This is acceptable for night 1 because:

  1. The Hetzner VPS public IP is not advertised — DNS resolves to CF edge IPs only (orange cloud mandatory).
  2. ufw blocks all inbound except 22/80/443 — no direct origin path exists for an attacker who does not know the IP.
  3. Wiki content is non-secret canon (lore, mechanics, tech ADRs). Even if CF Access were bypassed via a discovered origin-direct connection, the worst case is exposure of work-in-progress documentation, not credential leakage.

Defense-in-depth follow-up: add JWT validation in nginx. Two implementation paths:

  • auth_request → tiny sidecar (cloudflared access tools or a 50-line Node service) that hits the JWKS endpoint, caches the keys, verifies the JWT signature + claims (aud must match the Application AUD; iss must be the team subdomain). Cleanest separation.
  • lua-resty-jwt or njs inside nginx. Avoids the sidecar at the cost of an nginx-image rebuild with the OpenResty / NJS-enabled binary.

Tracked under Phase 14 follow-up items (B-level, no blocker). Revisit when (a) the wiki content includes anything genuinely sensitive (post-D-009 unfreeze artifacts, credentials in ADRs, etc.), or (b) the cohort grows past the point where origin IP discovery becomes statistically inevitable.

4. Environment / secrets management

Secrets live in two places:

  1. /home/deploy/walkrpg/backend/.env on the VPS (NOT in git, file mode 600, owner deploy). Loaded by Docker Compose via env_file: directive per service.
  2. GitLab CI Variables (project Settings → CI/CD → Variables, masked + protected). Used during deploy step to ssh into the VPS and (re-)populate .env if needed, or simply for the build-image stage.

Required environment variables on the VPS .env (mirrors backend/.env.example shape, with prod values):

| Variable | Example value | Source | Notes |
| --- | --- | --- | --- |
| NODE_ENV | production | Static | Enables NestJS prod-mode optimizations. |
| PORT | 3000 | Static | nginx upstream target. |
| DATABASE_URL | postgresql://walkrpg:<password>@db:5432/walkrpg?schema=public | Generated at provisioning | db is the compose service name, resolves on internal docker network. |
| JWT_SECRET | <64 hex chars from openssl rand -hex 32> | Generated at provisioning | HS256 signing key per ADR-0006 §Mock auth detail. |
| JWT_ISSUER | walkrpg-api-prod | Static (matches ADR-0006 + ADR-0007) | Wire-contract claim. |
| JWT_AUDIENCE | walkrpg-mobile | Static | Cross-platform (Android + iOS Phase 15 inherit per ADR-0007 §11). |
| AUTH_MODE | mock | Static (per ADR-0006) | Preserved across migration. Flips to firebase only when D-007/D-008/D-009 unfreeze. |
| CORS_ALLOWED_ORIGINS | https://api.walkrpg.<root>,https://walkrpg.<root> | Per-domain | Per ADR-0007 §10. Comma-separated. |
| SWAGGER_ENABLED | false (prod) / true (dev) | Per-env | Gated workaround per §13 known limitations. Disable in prod. |
| POSTGRES_USER | walkrpg | Static | Compose-passed to db service. |
| POSTGRES_PASSWORD | <32 random chars> | Generated at provisioning | Compose-passed to db service. Mirrored into DATABASE_URL. |
| POSTGRES_DB | walkrpg | Static | Compose-passed to db service. |

.env rotation: rotated manually by CEO at incident response or pre-cohort-expansion. No automatic rotation in Phase 14 (deferred follow-up; tracked under ops).
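The "Generated at provisioning" values above could be produced along these lines. This is a minimal sketch, not the runbook's procedure: the function names are hypothetical, and using hex output for POSTGRES_PASSWORD is an assumption made here so the password can be embedded in DATABASE_URL without URL-escaping.

```shell
#!/bin/sh
# Sketch: generate provisioning-time secrets for the VPS .env file.
# Function names are illustrative; they are not part of the repo.
set -eu

# 32 random bytes rendered as 64 hex characters (HS256 signing key).
gen_jwt_secret() {
  openssl rand -hex 32
}

# 16 random bytes rendered as 32 hex characters. Hex is an assumption:
# it keeps the password free of characters that need URL-escaping.
gen_db_password() {
  openssl rand -hex 16
}

# Assemble DATABASE_URL from user, password and database name.
# "db" is the compose service name, resolved on the internal network.
build_database_url() {
  printf 'postgresql://%s:%s@db:5432/%s?schema=public\n' "$1" "$2" "$3"
}
```

A rotation would call gen_db_password once and reuse the result for both POSTGRES_PASSWORD and build_database_url, since the table requires the two to stay mirrored. Note that postgres:16-alpine only applies POSTGRES_PASSWORD when it initialises an empty data directory, so rotating against an existing pgdata/ also needs the role's password changed inside Postgres.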

GitLab CI Variables required for deploy stage:

| Variable | Type | Used by |
| --- | --- | --- |
| SSH_DEPLOY_KEY | File (masked) | deploy stage: private ed25519 key paired with VPS deploy user’s authorized_keys |
| SSH_DEPLOY_HOST | Variable | deploy stage: VPS IPv4 or api.walkrpg.<root> |
| SSH_KNOWN_HOSTS | Variable | deploy stage: output of ssh-keyscan <host>, pinned to prevent MITM |
| REGISTRY_USER, REGISTRY_PASSWORD | Variable + masked | build-image stage: GitLab Container Registry credentials (CI_REGISTRY_USER + CI_JOB_TOKEN suffice for same-project pushes) |

5. Domain + TLS topology

Registrar: Cloudflare Registrar (CEO-managed). Root domain is CEO’s personal portfolio domain; WalkRPG subdomain is api.walkrpg.<root>. Specific root is configured at runbook §D execution time.

DNS:

  • api.walkrpg.<root> — A record → VPS IPv4, proxied (orange cloud) through Cloudflare.
  • TTL: Auto (proxied records ignore TTL; CF handles).

Cloudflare SSL/TLS mode: Full (strict). This mode means CF requires a valid TLS cert on the origin, and validates the chain. LE-issued certs are valid for this mode (LE is in CF’s trusted CA set).

| Mode | What it does | Use here? |
| --- | --- | --- |
| Off | No TLS anywhere | NO: public traffic in plaintext |
| Flexible | CF↔browser TLS, CF↔origin plaintext | NO: origin hop unencrypted; bearer tokens leak |
| Full | CF↔browser TLS, CF↔origin TLS but cert not validated | NO: MITM risk between CF and origin |
| Full (strict) | CF↔browser TLS, CF↔origin TLS with cert chain validation | YES |

Origin cert: Let’s Encrypt via certbot HTTP-01 challenge (webroot mode, shared with nginx). Renewal automated by certbot service running in cron mode (--keep-until-expiring, polling every 12h). Post-renewal hook signals nginx reload via docker compose kill -s HUP nginx.

Alternative considered: Cloudflare Origin Certificate. CF can issue an origin cert valid only for CF↔origin traffic, no public CA chain. Rejected for Phase 14 because (a) it ties the origin cert to CF infrastructure (lock-in), (b) LE is generic and survives a CF-proxy rip-out without cert reissuance, (c) the renewal automation is symmetric in complexity. Accepted: LE.

Edge cert: Cloudflare Universal SSL (free, CF-managed). No action required.

Cloudflare proxy security features engaged (free tier):

  • DDoS mitigation (always-on).
  • Bot Fight Mode (deferred — may be enabled later if scraping noise rises; off by default in Phase 14).
  • Always Use HTTPS (Page Rule or SSL/TLS setting — runbook §D step engages it).
  • HSTS at CF edge (in addition to origin nginx HSTS).

6. SSH hardening

Authentication: ed25519 key pair only. Password auth disabled. Root login disabled.

Users:

  • root — provisioning only, login disabled after §B step of runbook.
  • deploy — non-root, sudo-group + docker-group member. Owns /home/deploy/walkrpg/ repo clone. All CI deploys ssh in as this user.

sshd_config deltas:

PasswordAuthentication no
PermitRootLogin no
PubkeyAuthentication yes
ChallengeResponseAuthentication no
UsePAM yes
X11Forwarding no
PrintMotd no
ClientAliveInterval 300
ClientAliveCountMax 2

Firewall: ufw default deny incoming. Allowed ports: 22/tcp (SSH), 80/tcp (LE HTTP-01 challenge + nginx HTTP→HTTPS redirect), 443/tcp (nginx HTTPS). All other ports blocked.
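The firewall policy above maps onto a handful of ufw commands. The sketch below only prints them so they can be reviewed before being applied; the ufw_commands helper is a hypothetical name, and the exact steps in runbook §B may differ.

```shell
#!/bin/sh
# Sketch: emit the ufw commands for the Phase 14 firewall policy.
# ufw_commands is an illustrative helper, not something from the runbook.
set -eu

ALLOWED_TCP_PORTS="22 80 443"   # SSH, HTTP (LE challenge + redirect), HTTPS

ufw_commands() {
  echo "ufw default deny incoming"
  for port in $ALLOWED_TCP_PORTS; do
    echo "ufw allow ${port}/tcp"
  done
  # --force skips the interactive confirmation prompt.
  echo "ufw --force enable"
}
```

Running ufw_commands prints the list; piping it into a root shell (ufw_commands | sudo sh) applies it. The allow rule for 22/tcp is emitted before the enable step so that an SSH session is not cut off when the firewall comes up.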

Brute-force defense: fail2ban with the stock sshd jail (5 failures → 10-minute ban) and findtime set to 24h rather than fail2ban's 10-minute default. Sufficient for Phase 14 scale. Aggressive tuning deferred to ops follow-up.

SSH port change: out of scope. Port 22 with key-only auth + fail2ban is sufficient. Security-by-obscurity port change has marginal value and complicates GitLab CI / monitoring config; rejected.

7. CI/CD shape

GitLab CI pipeline triggers on push to master (and on MR pipelines for lint + test only — no deploy on MR).

Stages:

  1. lint: pnpm lint, pnpm lint:language, pnpm lint:naming, pnpm lint:canon, pnpm lint:tags (run from monorepo root; backend has its own ESLint config too).

  2. test: pnpm --filter @walkrpg/backend test (Jest unit) + pnpm --filter @walkrpg/backend test:e2e (Jest e2e, runs against ephemeral Postgres via service container).

  3. build-image: Docker buildkit, target backend/Dockerfile, push to GitLab Container Registry as registry.gitlab.com/positive-walkers/walkrpg/backend:$CI_COMMIT_SHORT_SHA AND :latest. Runs only on master.

  4. deploy: ssh into VPS as deploy@$SSH_DEPLOY_HOST, execute:

    cd /home/deploy/walkrpg
    git pull origin master
    docker compose -f backend/docker-compose.prod.yml pull api
    docker compose -f backend/docker-compose.prod.yml run --rm api pnpm prisma migrate deploy
    docker compose -f backend/docker-compose.prod.yml up -d api

    Runs only on master. Two-stage compose call: migration first (init-style, blocks until done), then the api container is recreated on the new image (a brief restart, not a zero-downtime rolling update; see §8).
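The ordering guarantee in the deploy stage (a failed migration must not touch the running api container) depends on the remote script stopping at the first non-zero exit. A minimal sketch of such a script follows; the run and deploy helpers and the DRY_RUN switch are assumptions added for illustration, and the commands themselves are the five listed above.

```shell
#!/bin/sh
# Sketch of the deploy job's remote script. DRY_RUN=1 (the default here)
# only prints each command so the ordering can be reviewed safely.
# run/deploy/DRY_RUN are illustrative names, not part of .gitlab-ci.yml.
set -eu

COMPOSE_FILE="backend/docker-compose.prod.yml"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it in this shell.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

deploy() {
  run cd /home/deploy/walkrpg
  run git pull origin master
  run docker compose -f "$COMPOSE_FILE" pull api
  # Migration runs first. With `set -e`, a non-zero exit stops the script
  # here, so the api container that is already running is left untouched.
  run docker compose -f "$COMPOSE_FILE" run --rm api pnpm prisma migrate deploy
  run docker compose -f "$COMPOSE_FILE" up -d api
}
```

Calling deploy with the default DRY_RUN=1 prints the five commands prefixed with "+"; setting DRY_RUN=0 executes them in order. The same effect can be had in the CI job by chaining the commands with && over ssh.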

Pipeline rules:

  • lint + test run on every push and every MR.
  • build-image + deploy run only on push to master (no MR previews — single-host has no preview environment surface in Phase 14).
  • .gitlab-ci.yml lives at repo root. The backend-specific test commands invoke pnpm workspace filters.

Deploy concurrency: resource_group: production on the deploy job. Serializes concurrent master pushes; prevents two deploys racing on the same docker-compose state.

Rollback: documented in runbook §K. On the VPS as deploy: git checkout <prev-sha>, then docker compose pull and docker compose up -d against backend/docker-compose.prod.yml. No GitLab automation for rollback in Phase 14 (deferred ops follow-up).

8. Database migrations

Prisma migrations (backend/prisma/migrations/*) deploy via pnpm prisma migrate deploy (NOT migrate dev).

When migrations run:

  • During CI/CD deploy stage, before the api container is restarted with the new image. The two-stage compose pattern (see §7 deploy stage) runs docker compose run --rm api pnpm prisma migrate deploy first; this exits non-zero on migration failure, aborting the deploy without bringing down the old api container.
  • On first-time bootstrap (runbook §F step “First-time bootstrap”), migrations run after db is healthy and before api is started.

Migration safety:

  • Backwards-incompatible migrations (column drops, type changes, NOT NULL adds without default) require a manual two-phase deploy: phase 1 lands a backwards-compatible migration + new app code that writes both old and new schema; phase 2 lands the cleanup migration + app code that reads only new schema. Phase 14’s single-host topology means there is no zero-downtime guarantee — a migration that takes >5s will cause request failures. Accepted: ~20-walker cohort tolerates brief migration windows. Documented in ops follow-up for cohort-growth-driven revisit.
  • migrate deploy is idempotent — re-running is safe.

9. Backup / recovery

Daily Postgres dump: cron job on the VPS host (NOT in a container — host-side cron is more reliable than dockerized cron for this scale).

0 3 * * * docker compose -f /home/deploy/walkrpg/backend/docker-compose.prod.yml exec -T db pg_dump -U walkrpg walkrpg | gzip > /var/backups/walkrpg/walkrpg-$(date +\%Y\%m\%d).sql.gz

Retention: 7 daily copies, rotated by find /var/backups/walkrpg/ -name '*.sql.gz' -mtime +7 -delete (separate daily cron at 04:00).
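The two cron entries could also live in one host-side script, which makes it easier to keep a failed dump from producing a truncated file. The sketch below assumes hypothetical function names and a scratch-file step that the cron one-liner above does not have; paths and pg_dump arguments are taken from this section.

```shell
#!/bin/sh
# Sketch of a host-side backup script covering the dump and the rotation.
# Function names and the scratch-file step are illustrative assumptions.
set -eu

BACKUP_DIR="${BACKUP_DIR:-/var/backups/walkrpg}"
COMPOSE_FILE="/home/deploy/walkrpg/backend/docker-compose.prod.yml"

# Dated dump path, e.g. /var/backups/walkrpg/walkrpg-20260522.sql.gz
backup_path() {
  printf '%s/walkrpg-%s.sql.gz\n' "$1" "$2"
}

# Dump to a scratch file first; only a successful pg_dump is compressed
# into the final name, so a failed run never replaces a good backup.
dump_database() {
  target="$(backup_path "$BACKUP_DIR" "$(date +%Y%m%d)")"
  scratch="${target%.gz}.partial"
  if docker compose -f "$COMPOSE_FILE" exec -T db pg_dump -U walkrpg walkrpg > "$scratch"; then
    gzip -c "$scratch" > "$target"
    rm -f "$scratch"
  else
    rm -f "$scratch"
    return 1
  fi
}

# Delete dumps whose age in whole days is greater than $2.
prune_backups() {
  find "$1" -name '*.sql.gz' -mtime "+$2" -delete
}
```

A cron entry would then call dump_database followed by prune_backups "$BACKUP_DIR" 7. Inside a script the date format needs no backslash escapes; those are only required in the crontab line itself. One detail worth knowing: find rounds file age down to whole days, so -mtime +7 matches files at least eight days old, and the 04:00 prune can leave eight dumps on disk rather than seven.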

Restore path:

gunzip -c /var/backups/walkrpg/walkrpg-YYYYMMDD.sql.gz | \
docker compose -f /home/deploy/walkrpg/backend/docker-compose.prod.yml exec -T db psql -U walkrpg walkrpg

Tested restore drill: required within 7 days of Phase 14 deploy. Ops follow-up.

Out of scope for Phase 14 night 1:

  • Offsite backup (S3 / B2 / Hetzner Storage Box). Tracked as ops follow-up; the 7-day local rotation is single-host-disk-failure-fatal.
  • Point-in-time recovery (PITR / WAL archiving). Phase 14 cohort tolerates daily-granularity restore; PITR deferred to ops follow-up.
  • Encrypted backups at rest. Dump file is plaintext SQL. Acceptable because (a) /var/backups/walkrpg/ is chmod 700 deploy:deploy, (b) the VPS disk is encrypted at the Hetzner block-storage layer per their default. Documented for ops review when offsite backup lands.

10. Observability

Logs: structured JSON to stdout from the NestJS app (RequestIdMiddleware + HttpExceptionFilter already emit per ADR-0007 §Backend deliverables shipped). Docker captures stdout. CEO/ops inspect via docker compose logs --tail 100 -f api.

Out of scope for Phase 14:

  • ELK / Loki / Datadog / OpenSearch stack. Single-host scale + ~20 walkers does not justify the operational complexity. Tracked as ops follow-up.
  • Metrics export (Prometheus, OpenTelemetry). Same justification.
  • Alerting (PagerDuty, OpsGenie). Same. CEO checks docker compose ps + logs manually during the closed-beta phase.

What ships in Phase 14:

  • docker compose logs rotation: Docker default log driver json-file with max-size=50m, max-file=5 per service. Configured per service in docker-compose.prod.yml.
  • Cloudflare Analytics (free tier) gives top-of-funnel traffic visibility.
  • Hetzner Cloud Console gives CPU/RAM/disk graphs per VPS (free, no opt-in).
  • htop + du -sh /home/deploy/walkrpg/backend/pgdata/ for manual capacity checks.

11. Cost

| Line | Monthly | Annual |
| --- | --- | --- |
| Hetzner CX22 (shared vCPU x2, 4GB RAM, 40GB NVMe, 20TB egress) | €5.18 (excl. VAT) / €5.70 (incl. DE VAT) | ~€68 |
| Hetzner backup feature (20% surcharge; OPTIONAL, recommended Phase 14 night 1 = SKIP, use pg_dump) | €1.04 | |
| Cloudflare Registrar (.com example) | | €10-12 |
| Cloudflare proxy + Universal SSL + DDoS + Analytics | €0 (free tier) | €0 |
| Cloudflare Access (Zero Trust Free tier, ≤50 users) | €0 (free tier) | €0 |
| Let’s Encrypt | €0 | €0 |
| GitLab CI/CD (free tier, 400 build minutes/mo) | €0 | €0 |
| GitLab Container Registry | €0 (within free-tier storage at this scale) | €0 |
| Total | ~€5.70/mo | ~€78/year all-in |

Cost ceiling before re-evaluation: if monthly cost exceeds €25 (e.g., due to CX22 → CX32 vertical scale + backup feature + offsite backup), tech-architect raises FLAG_LEAD to game-director and CEO. Single-line bump within the same VPS class (CX22 → CX22 + 20% backup) is B-level autonomous.
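The totals and the ceiling check can be reproduced with integer arithmetic. This is a sketch in euro cents (plain sh has no floating point); the function names are hypothetical and the inputs are the figures quoted in this section.

```shell
#!/bin/sh
# Sketch: Phase 14 cost arithmetic in integer euro cents.
# Function names are illustrative; figures come from the cost table.
set -eu

VPS_MONTHLY_CENTS=570       # EUR 5.70/mo incl. VAT
CEILING_MONTHLY_CENTS=2500  # EUR 25/mo re-evaluation threshold

# Annual total = 12 months of VPS ($1) + one yearly domain fee ($2).
annual_cents() {
  echo $(( $1 * 12 + $2 ))
}

# Render a cent amount as euros with two decimals.
format_eur() {
  printf '%d.%02d\n' $(( $1 / 100 )) $(( $1 % 100 ))
}

# Succeeds (exit 0) when a monthly figure in cents exceeds the ceiling.
over_ceiling() {
  [ "$1" -gt "$CEILING_MONTHLY_CENTS" ]
}
```

For example, format_eur "$(annual_cents "$VPS_MONTHLY_CENTS" 1000)" gives 78.40, which is where the ~€78/year figure comes from when the domain costs €10; a €12 domain gives 80.40. Use over_ceiling inside an if, since it signals its answer through the exit status.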

12. Mobile reconfiguration (post-Phase-14-ship)

After the VPS is live and smoke tests pass:

  • walkrpg-mobile/android/local.properties: set base.url=https://api.walkrpg.<root>/. Single field swap per ADR-0007 §9.
  • walkrpg-mobile/android/app/src/main/res/xml/network_security_config.xml: drop the IP-specific debug-overrides block. The Phase 13 config carried a domain-config allowing cleartext to the CEO laptop’s tunnel-exposed local IP for debug builds. With public HTTPS available, the default secure config applies and the override becomes dead config. A minimal config retains only cleartextTrafficPermitted="false" (already the default for apps targeting API 28 or higher; the file may be deleted entirely if no other overrides remain).
  • Rebuild APK against the new BASE_URL, install on test device, smoke test the auth/callback + walker/profile flow against the public endpoint. Runbook §J.

iOS port (Phase 15) inherits identically — Info.plist or xcconfig carries the same https://api.walkrpg.<root>/ value.

13. Known limitations / known follow-ups

13.1 Swagger gated behind SWAGGER_ENABLED=true

Per fd7d3e8 (2026-05-22), Swagger setup is conditionally gated because a circular DTO dependency crashes the NestJS Swagger module bootstrap. Phase 14 production-side keeps SWAGGER_ENABLED=false in .env. Developers re-enable locally with SWAGGER_ENABLED=true pnpm dev for the schema browsing path.

Follow-up: fix the circular DTO dep (likely in backend/src/combat/ and backend/src/quest/ DTO cross-imports). Not blocking Phase 14. Tracked as B-level backend-engineer item.

13.2 Production runtime is tsx, not compiled dist/

Per 0b0c37b (2026-05-22), nest build emits to dist/ but the compiled output references @walkrpg/data/sim through unresolvable .ts paths (tsconfig paths are compile-time only; the emitter does not rewrite them). Dev mode uses tsx watch src/main.ts and the production container uses tsx src/main.ts (no watch) for the same reason.

Performance impact: tsx adds ~200ms startup overhead vs node dist/. At Phase 14 scale (single host, ~20 walkers, no autoscaling, no cold-starts) this is invisible. Memory impact: tsx keeps the TypeScript transpiler loaded in process; +~30MB resident. CX22’s 4GB RAM tolerates this easily.

Follow-up: fix the tsconfig paths leak. Likely requires either (a) @walkrpg/data package pre-compile step in CI so the backend imports compiled .js, or (b) a custom Nest CLI emitter that resolves paths in dist. Not blocking Phase 14. Tracked as B-level backend-engineer item, co-occurs with the Swagger gating fix.

13.3 Single-host = no HA

CX22 is a single VPS. Hetzner availability SLA covers ~99.9% uptime; planned maintenance is announced via email with 24h+ notice. Single-host failure modes (hardware failure, network partition, accidental docker compose down) cause full backend outage until restoration.

Accepted: ~20-walker closed-beta cohort tolerates outage windows of minutes-to-hours. HA infrastructure (multi-host, Kubernetes, autoscaling) re-enters scope only at production migration (D-007/D-008/D-009 unfreeze).

13.4 Daily backup is the minimum bar

The Phase 14 night-1 backup strategy is daily pg_dump to local host. Loss window: up to 24h of writes on disk failure. For ~20-walker cohort, this is acceptable. Cohort-growth-driven trigger for offsite backup: ~50 walkers or one incident.

13.5 Mock auth is not production-secure

Per ADR-0006: anyone with the public hostname can POST /auth/callback {"email":"anything","displayName":"anything"} and obtain a 7-day session JWT. This is intentional for the closed-beta cohort context — testers are trusted. Production migration’s Firebase Auth + App Check engagement is the proper fix; that fix is paused indefinitely per D-009 framing.

Mitigation in Phase 14: Cloudflare proxy provides DDoS shielding + bot detection (free tier). If griefing emerges (auto-registration spam), CEO can engage CF Bot Fight Mode + custom firewall rules without code changes. CORS is the only origin gate (per ADR-0007 §10); Phase 14 inherits this posture.

13.6 No rate limiting at the application layer

NestJS-side rate limiting (e.g., @nestjs/throttler) is not configured in Phase 14. Cloudflare’s free tier rate-limits at the edge, which is sufficient for cohort scale; engineered rate limiting is deferred to the follow-up items tracked in §15.
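
For reference, application-layer limiting amounts to a per-client counter such as the one sketched below. This is a self-contained fixed-window limiter written for illustration. It is not the @nestjs/throttler API, and the class name, limits, and key choice (for example client IP) are assumptions.

```typescript
// Fixed-window limiter: each key gets `limit` requests per `windowMs`.
class FixedWindowLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly hits: Map<string, { windowStart: number; count: number }>;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.hits = new Map();
  }

  // Returns true when the request is allowed, false when the key is over budget.
  allow(key: string, nowMs: number): boolean {
    const entry = this.hits.get(key);
    // First request for this key, or the previous window has expired.
    if (entry === undefined || nowMs - entry.windowStart >= this.windowMs) {
      this.hits.set(key, { windowStart: nowMs, count: 1 });
      return true;
    }
    if (entry.count < this.limit) {
      entry.count += 1;
      return true;
    }
    return false;
  }
}
```

The state lives in process memory, which only works because Phase 14 runs a single API container on a single host. A multi-instance deployment would need shared storage for the counters.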

13.7 No structured monitoring dashboards

Per §10: docker compose logs is the observability tool. Acceptable for cohort scale; dashboards deferred.

13.8 Multi-env follow-ups (per ADR-0010)

ADR-0010 introduces a second stack (staging) on the same VPS. Follow-ups specific to that addition (cross-link from this ADR for discoverability):

  • No ephemeral per-MR preview env (ADR-0010 §11.1).
  • No automated db-migration forward-compatibility check between staging and prod (ADR-0010 §11.2).
  • No automated rollback across envs (ADR-0010 §11.3).
  • Orchestrator review is CEO-invoked, not webhook-driven, for night 1 (ADR-0010 §11.4).
  • No CODEOWNERS file yet (ADR-0010 §11.5).
  • No CI job that fails on direct-push to protected branches as belt-and-braces over GitLab branch protection (ADR-0010 §11.6).

14. When this ADR retires

ADR-0009 retires when D-007 / D-008 / D-009 unfreeze and the production migration to GCP europe-central2-warsaw ships:

  • Production migration ADRs (working name: ADR-00NN — numbered at trigger time, like this one) author the cutover. They reference ADR-0009 as the source state.
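  • Phase 14 VPS lifecycle terminates after production-migration green-light + cutover smoke tests pass + DNS swap (api.walkrpg.<root> repointed from the VPS A record to the production Cloud Run hostname; pointing at a hostname needs a CNAME or equivalent, since an A record holds only an IP address).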
  • Hetzner VPS is decommissioned after a 30-day grace window (rollback insurance).

Not pinned to a phase number. D-015 frames production migration as paused-indefinitely; the unfreeze trigger is CEO-led, cost-redesign-driven, not phase-driven.

15. Consequences

  • Closed-beta cohort unblocks. Asynchronous testing across timezones works. Test cohort can grow past ~30 walkers without CEO-laptop SLA pressure.
  • D-009 §3 EU residency satisfied by Hetzner DE — fully EU-resident origin + data at rest. GDPR posture (delete-on-demand, anonymization sweep) is unimplemented in mock mode and remains so per ADR-0006 §Out of scope.
  • Phase 13 sub-phase 13-13 device walkthrough can proceed against the public endpoint instead of the tunnel — improves test fidelity (public TLS chain exercised, not just CF Tunnel).
  • CEO-laptop dependency lifted for the backend. Local backend runs remain available for development (pnpm dev continues to work), but tester reach no longer requires laptop uptime.
  • Operational cost surface enters the budget. €78/year is a new line item; CEO is the budget owner. Cost ceiling per §11.
  • GitLab CI minutes consumed. Free tier (400 min/mo) sufficient for ~10-20 deploys/day at current pipeline complexity. Monitored at CI dashboard; upgrade trigger documented at CI usage > 350 min/mo.
  • Three Phase 13 backend deliverables (HttpExceptionFilter, RequestIdMiddleware, CORS bootstrap) have already shipped, per ADR-0007 §Backend deliverables shipped. The Phase 14 deploy carries them to the public endpoint unchanged.
  • Mock-auth risk posture unchanged. Phase 14 does NOT change the security model — it only moves the deployment location. Production-grade auth (D-009 §1) remains paused.
  • Three follow-up backend items tracked: Swagger circular dep, tsx-as-prod-runtime, application-layer rate limiting. None blocking Phase 14.
  • Follow-up ops items tracked: offsite backup, restore drill, PITR, encrypted backups at rest, structured monitoring, alerting, rate limiting, rotation policy for .env secrets. All deferred per cost / cohort tradeoffs documented above.
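
The GitLab CI minutes consequence above implies a per-pipeline budget that is worth making explicit. The sketch below is arithmetic on the figures quoted in that bullet (400 min/mo quota, 10-20 deploys/day, 350 min/mo upgrade trigger). The 30-day month and the function names are assumptions, and the source does not say whether every pipeline runs on quota-counted runners.

```typescript
// Average counted minutes each deploy may use before the monthly quota runs out.
function maxMinutesPerDeploy(
  monthlyQuotaMin: number,
  deploysPerDay: number,
  daysPerMonth: number = 30
): number {
  return monthlyQuotaMin / (deploysPerDay * daysPerMonth);
}

// Upgrade trigger from the bullet above: usage strictly above 350 min/mo.
function shouldUpgradeCiPlan(usedMin: number, triggerMin: number = 350): boolean {
  return usedMin > triggerMin;
}

// About 1.33 minutes per deploy at 10 deploys/day.
const budgetAtTenPerDay = maxMinutesPerDeploy(400, 10);
// About 0.67 minutes per deploy at 20 deploys/day.
const budgetAtTwentyPerDay = maxMinutesPerDeploy(400, 20);
```

Under these assumptions the quota covers the stated deploy rate only if each pipeline averages roughly 0.7 to 1.3 counted minutes, so the CI dashboard check matters once image builds get slower.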

16. Open questions

  1. Domain root choice. CEO confirmed Cloudflare Registrar + subdomain pattern api.walkrpg.<root>. The specific <root> (CEO’s portfolio domain) is filled in at runbook §D execution time. Not a blocker.
  2. Migration approach when Phase 14 retires. Whether the GCP cutover is a forklift migration (DNS swap + 30-day VPS overlap) or a parallel-run blue/green is decided at production-migration ADR authoring time. Out of scope here.
  3. Container Registry choice durability. GitLab Container Registry is used in Phase 14. If GitLab pricing changes or registry availability degrades, alternates (GHCR, Docker Hub paid, self-hosted) are tracked. B-level swap if needed.
  4. backend/docker-compose.prod.yml vs ops/docker/docker-compose.prod.yml path. Backend-engineer chooses at implementation time. The runbook references backend/docker-compose.prod.yml as canonical; if backend-engineer prefers ops/docker/, the runbook updates in the paired implementation session.

17. Implementation handoff

This ADR is the spec layer. Implementation deliverables (NOT authored in this ADR, paired session follows after CEO confirms domain registered):

  1. backend/Dockerfile — node:22-alpine base, pnpm install, copy workspace data/ package (required for @walkrpg/data import), CMD tsx src/main.ts. See §13.2 for the tsx rationale.
  2. backend/docker-compose.prod.yml — services per §3.1, networks per §3.2, volumes per §3.3, env files per §4.
  3. backend/nginx/walkrpg.conf — reverse proxy 443 → api:3000, static serve wiki.morrisassert.dev from the wiki-static volume, LE cert paths, HSTS + security headers, HTTP→HTTPS redirect, CORS-aware proxy_pass directives.
  4. .gitlab-ci.yml (or amendments to existing) — stages per §7, secrets per §4 GitLab CI Variables. Build + push the wiki-builder image alongside backend + portfolio.
  5. backend/scripts/backup-postgres.sh — cron-invoked dump script per §9.
  6. wiki/Dockerfile.build — multi-stage Astro build (node:22-alpine builder + busybox publish). Build context is the monorepo root; pnpm install --filter "@walkrpg/wiki..." pulls the wiki + @walkrpg/data closure. Compose command: invokes cp -r /build/wiki/dist/. /wiki/dist/ && tail -f /dev/null against the shared wiki-static volume.

These ship in a separate paired implementation session, after the runbook’s CEO-side steps (sections A through F) are confirmed against the registered domain. The runbook documents the exact filenames so the implementation session has zero ambiguity.
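
Because the filenames are fixed, the handoff list can double as a checklist. The sketch below is one possible way to verify it, assuming the paths are relative to the monorepo root. The function name and the injected exists predicate are hypothetical, and the compose path may change per open question 4 in §16.

```typescript
// The six deliverables named in this section, relative to the repo root.
const phase14Deliverables: readonly string[] = [
  "backend/Dockerfile",
  "backend/docker-compose.prod.yml",
  "backend/nginx/walkrpg.conf",
  ".gitlab-ci.yml",
  "backend/scripts/backup-postgres.sh",
  "wiki/Dockerfile.build",
];

// Return the deliverables that are not present yet. The existence check is
// injected so the function stays free of filesystem dependencies.
function missingDeliverables(
  exists: (path: string) => boolean,
  paths: readonly string[] = phase14Deliverables
): string[] {
  return paths.filter((path) => !exists(path));
}
```

In a Node script the predicate could wrap fs.existsSync resolved against the repository root. An empty result means the implementation session has produced every file the runbook names.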