ADR-0009 — VPS migration (Phase 14, Hetzner CX22)
Status: Accepted (CEO ratifying via deployment 2026-05-22 evening)
Date: 2026-05-22
Owner: tech-architect
Paired implementation: tech-architect + CEO (provisioning), backend-engineer (Dockerfile / compose / nginx / .gitlab-ci.yml follow-up)
Supersedes: none
Amends scope (for Phase 14 only): ADR-0006 (test-phase infrastructure — VPS replaces local + tunnel topology)
Related canon: D-015 (post-Phase-12 phasing — Phase 14 = VPS migration, Hetzner CX22 named), D-009 §3 (GDPR — EU residency mandate, satisfied by Hetzner Nuremberg / Falkenstein), D-007 §3 (anti-cheat — restrictive posture; mock auth note preserved), ADR-0006 (mock-auth backend posture, Docker Compose stack template, migration plan §2 names this ADR as Phase 14 trigger), ADR-0007 (Android network layer — §9 BASE_URL reconfiguration target)
1. Context
D-015 ratifies Phase 14 as a numbered phase: Hetzner CX22, public IP, asynchronous testing across timezones, test cohort growth past ~30 walkers without the CEO-laptop SLA ceiling that ADR-0006 §Migration plan §2 explicitly named as the trigger.
The trigger fires now because:
- Phase 13 closed 12/13 sub-phases with 13-13 device walkthrough prep complete. The backend has the surface the closed-beta cohort needs (`/auth/callback`, `/walker/profile`, `/step/ingest`, `/tree/state`, `/quest/*`, `/combat/encounter`). No further endpoint additions block Phase 14.
- The CEO-laptop tunnel SLA is now actively biting: every tester ping requires the laptop to be on with `cloudflared` running. Asynchronous testing across timezones is a Phase 13 close-out blocker.
- D-009 §3 EU residency mandate is satisfiable on Hetzner without engaging the paused-indefinitely GCP `europe-central2-warsaw` estate. Hetzner Nuremberg (NBG1) and Falkenstein (FSN1) are both in Germany, both inside EU/GDPR jurisdiction, both eligible. ADR-0006’s “production-target preservation” stance holds: this ADR adds a transitional infrastructure layer, it does not unfreeze D-007/D-008/D-009 production scope.
- Closed-beta cohort is ~20 walkers (CEO + trusted recruits). Cost-per-walker tolerance is high; cost-per-month tolerance is essentially zero until cohort signal justifies it. €5.70/month VPS + €10-15/year domain satisfies both.
This ADR is the Phase 14 spec. The paired runbook (wiki/src/content/docs/tech/phase-14-runbook.md) is CEO-executable from the ADR’s stack-layout decisions.
Portfolio coexistence (amendment, ratified 2026-05-22 evening): the same VPS additionally hosts a personal portfolio site at morrisassert.dev (apex + www). The portfolio is a Next.js 14 standalone container alongside the backend stack; nginx differentiates by Host header. This addition is B-level — it does not change the security posture, cost band, or cohort scale of the Phase 14 backend, and reuses the same Hetzner CX22 + Cloudflare + Let’s Encrypt SAN cert + nginx ingress. The portfolio code lives in portfolio/ at the monorepo root (pnpm workspace @walkrpg/portfolio). See §3 for the updated service list, §5 for the multi-domain DNS + SAN cert topology, and the paired portfolio Dockerfile at portfolio/Dockerfile.
Wiki coexistence (amendment, ratified 2026-05-23): the same VPS additionally hosts the WalkRPG wiki (Astro Starlight static site, pnpm workspace @walkrpg/wiki) at wiki.morrisassert.dev. The wiki is a build-once / serve-statically container: wiki/Dockerfile.build produces wiki/dist/ and copies it onto a shared docker named volume wiki-static, which the main nginx mounts read-only and serves directly. Cloudflare Access (Zero Trust Free tier, ≤50 users) gates the hostname AT EDGE before requests reach the origin — nginx implicitly trusts the Cf-Access-Jwt-Assertion header (defense-in-depth JWT validation against the per-application JWKS endpoint is a tracked follow-up, see §3.5). This amendment is B-level — it does not change cost band (Hetzner CX22 unchanged, CF Access free tier $0), does not change cohort scale, and reuses the same nginx + LE SAN cert + Cloudflare proxy infrastructure. A future-state walkrpg.morrisassert.dev frontend can use the identical CF Access pattern when ratified. See §3.1 for the updated service list, §3.5 for the auth/serve model, §5 for the SAN cert topology now covering five hostnames.
Multi-env amendment (2026-05-22, ratified via ADR-0010): the original ADR-0009 framing assumed a single env on this VPS (“master = prod”). ADR-0010 supersedes that implicit assumption: the same VPS now hosts two backend stacks (prod + staging) keyed off two branches (main + dev). The compose topology, nginx config, and LE SAN list extend to cover a sixth hostname (api-staging.walkrpg.morrisassert.dev). The wiki + portfolio remain prod-env-only. See §3.1 for the per-env service-list breakdown, §13.8 (known limitations) for the cross-link to ADR-0010 follow-ups, and the paired backend/docker-compose.staging.yml file shipped alongside ADR-0010.
2. Decision
Phase 14 ships against a single-host Hetzner CX22 VPS running a Docker Compose stack identical in shape to ADR-0006’s local stack, with public TLS via Cloudflare proxy + Let’s Encrypt origin certificate, and GitLab-CI-driven push-to-master deploy.
| Component | Test phase (ADR-0006) | Phase 14 (this ADR) | Production target (D-007/8/9, ADR-0001) |
|---|---|---|---|
| Compute | Local Docker Compose on CEO laptop | Hetzner CX22 single host, Docker Compose | Cloud Run + Cloud Run realtime + walkrpg-jobs |
| Database | Local Postgres 16 container | Postgres 16 container, host-mount volume | Cloud SQL Postgres 16 europe-central2-warsaw |
| Auth | Mock JWT | Mock JWT (unchanged, AUTH_MODE=mock) | Firebase Auth Path B |
| External reach | Cloudflare Tunnel (laptop-tethered) | Cloudflare proxy (orange cloud) → nginx (443) → api (3000) | Cloud Load Balancer + Cloud Armor + EU region pin |
| TLS | Cloudflare edge only (tunnel terminus) | Cloudflare Full (strict) + Let’s Encrypt origin cert via certbot | GCP-managed certs |
| Reconciliation worker | None | None (matches ADR-0006) | Cloud Tasks reconcile-steps 60s coalescing |
| Anti-cheat | Schema + impossible-burst | Schema + impossible-burst (unchanged) | Three-layer defense in depth (ADR-0004) |
| Backups | None | Daily pg_dump to host /var/backups/walkrpg/, 7-day rotation | Cloud SQL automated backups + PITR |
| Observability | docker logs on laptop | docker compose logs on VPS, stdout-structured | Cloud Logging + Cloud Monitoring + alerting |
| CI/CD | Manual pnpm dev on laptop | GitLab CI: lint → test → build-image → ssh-deploy on push-to-master | Cloud Build → Cloud Run revision deploy |
| EU residency | Cloudflare edge proxy (egress only) | Hetzner DE (Nuremberg or Falkenstein) — fully EU-resident origin + data at rest | GCP europe-central2-warsaw |
| Op cost | €0/mo (laptop electricity) | €5.70/mo VPS + €10-15/yr domain ≈ €78/yr all-in | Band A ~€143/mo, scaling per ADR-0001 |
| SLA ceiling | CEO laptop uptime | Hetzner CX22 SLA (~99.9%) + single-host failure mode | Multi-zone GCP, ~99.95% target |
The Prisma schema is identical between Phase 14 and both ADR-0006 (test) and ADR-0001 (production). The data model is invariant across migration boundaries — only deployment, auth, and reconciliation layers differ.
3. Stack layout
3.1 Per-env stack — Docker Compose services
Prod stack defined in backend/docker-compose.prod.yml. Six services run side-by-side per the portfolio + wiki coexistence amendments (§1). Staging stack defined in backend/docker-compose.staging.yml per the ADR-0010 multi-env amendment — two services (api-staging + db-staging) on a separate compose project, separate volume, separate .env.staging. The prod nginx joins the staging network as external to terminate TLS for api-staging.walkrpg.morrisassert.dev (preserves the single-ingress invariant from §3.2).
Prod stack:
| Service | Image | Purpose | Network | Restart |
|---|---|---|---|---|
| `api` | `registry.gitlab.com/positive-walkers/walkrpg/backend:<sha>` | NestJS 11 + Prisma 6, invoked via `tsx src/main.ts` (NOT `node dist/main`; see §13 known limitations) | walkrpg-net (internal) | unless-stopped |
| `db` | `postgres:16-alpine` | Postgres 16, data volume mounted at `/var/lib/postgresql/data` from host `./pgdata/` | walkrpg-net (internal only, NO host port exposure) | unless-stopped |
| `nginx` | `nginx:1.27-alpine` | Reverse proxy + static file server. Five server blocks: morrisassert.dev + www.morrisassert.dev → web:3000 (portfolio), api.walkrpg.morrisassert.dev → api:3000 (backend), walkrpg.morrisassert.dev → 503 placeholder, wiki.morrisassert.dev → static files from the wiki-static named volume. Terminates TLS using the LE SAN cert mounted from host `/etc/letsencrypt/`, emits HSTS + standard security headers | walkrpg-net + host ports 80, 443 | unless-stopped |
| `certbot` | `certbot/certbot:latest` | Let’s Encrypt cert renewal sidecar. Webroot challenge via shared volume with nginx (`/var/www/certbot/`). Renews the single multi-domain SAN cert covering all five hostnames. Cron-mode: `--keep-until-expiring` runs every 12h, post-renewal hook signals nginx reload. | host 80 (webroot only, shared with nginx) | unless-stopped |
| `web` | `registry.gitlab.com/positive-walkers/walkrpg/portfolio:<sha>` | Next.js 14 standalone (Morris portfolio). Built from `portfolio/Dockerfile` against the monorepo root. Listens on :3000 inside the docker network, with NO host port. | walkrpg-net (internal only) | unless-stopped |
| `wiki-builder` | `registry.gitlab.com/positive-walkers/walkrpg/wiki:<sha>` | Astro Starlight static-build container. Multi-stage build (builder + busybox-publish) emits `wiki/dist/`; the container command copies the tree onto the shared wiki-static named volume at startup, then runs `tail -f /dev/null` to keep the container alive. NO network exposure: nginx serves the static tree directly from the volume. Cloudflare Access gates wiki.morrisassert.dev at edge. | walkrpg-net (internal, no port surface) | no (one-shot build + keepalive) |
Staging stack (added per ADR-0010 §4):
| Service | Image | Purpose | Network | Restart |
|---|---|---|---|---|
| `api-staging` | `registry.gitlab.com/positive-walkers/walkrpg/backend:<sha>` (same image as prod, different tag at deploy time) | NestJS 11 + Prisma 6 running against the staging database. Loads `.env.staging`. Distinct `JWT_SECRET` from prod so sessions don’t cross envs. | walkrpg-net-staging (internal) | unless-stopped |
| `db-staging` | `postgres:16-alpine` | Postgres 16 for staging, data volume mounted at `/var/lib/postgresql/data` from host `./pgdata-staging/`. Internal-only, with NO host port. | walkrpg-net-staging (internal only) | unless-stopped |
The prod nginx + certbot + portfolio + wiki services are shared (not duplicated). The LE SAN cert covers all six hostnames (see §5).
3.2 Network topology

    Internet
       │
       ▼
    Cloudflare edge (proxy, orange cloud, SSL Full strict)
       │   + Cloudflare Access (Zero Trust) gates wiki.morrisassert.dev at edge
       │ HTTPS
       ▼   for all five hostnames (apex + www + api + walkrpg + wiki subdomains)
    Hetzner CX22 public IPv4 (ufw: 22/80/443 only)
       │
       ▼
    nginx :443 (LE SAN cert from /etc/letsencrypt/live/morrisassert.dev/)
       │
       ├─ Host: morrisassert.dev, www.morrisassert.dev → web :3000 (Next.js)
       │
       ├─ Host: api.walkrpg.morrisassert.dev → api :3000 (NestJS)
       │                                         │
       │                                         ▼
       │                                      db :5432 (Postgres,
       │                                         internal-only, never
       │                                         bound to host port)
       │
       ├─ Host: walkrpg.morrisassert.dev → nginx 503 placeholder
       │                                   (reserved future
       │                                    WalkRPG frontend)
       │
       └─ Host: wiki.morrisassert.dev → static files from wiki-static volume
                                        (CF Access-gated; wiki-builder service
                                         populates the volume at startup)

Key invariants:
- db has no host port binding. Postgres is reachable only from inside `walkrpg-net`. No `5432:5432` in compose, so Postgres has no externally reachable surface.
- web has no host port binding. The portfolio is reachable only from nginx on `walkrpg-net`. Same internal port (3000) as `api` is fine: docker service-name DNS disambiguates.
- wiki-builder has no host port binding and no proxy_pass target. It writes to the shared `wiki-static` named volume; nginx serves the tree directly via `root /var/www/wiki`. Preserves the single-nginx-ingress invariant.
- nginx is the only public ingress. ufw blocks all other inbound ports.
- Cloudflare proxy is mandatory for the public TLS chain. CF “Full (strict)” mode means CF validates the origin cert; LE-issued cert on origin satisfies this. If CF is set to “Flexible”, the origin-edge hop is plaintext — that is a misconfiguration that the runbook §H smoke test detects.
- Cloudflare Access gates the wiki + Swagger at edge, before requests reach the origin. CF Access is also mandatory for the wiki path; bypassing the orange cloud bypasses the auth gate.
- One SAN cert, five hostnames. Renewal updates a single fullchain; nginx reload picks it up across all five `server` blocks.
3.3 Volumes
| Volume | Host path | Container mount | Owner | Purpose |
|---|---|---|---|---|
| Postgres data | /home/deploy/walkrpg/backend/pgdata/ | /var/lib/postgresql/data | db | Database files. Backup target. |
| LE certs | /etc/letsencrypt/ | /etc/letsencrypt/ (ro) | nginx, certbot | TLS certs. |
| Webroot | /home/deploy/walkrpg/backend/certbot-webroot/ | /var/www/certbot/ | nginx, certbot | LE HTTP-01 challenge. |
| Backups | /var/backups/walkrpg/ | (host-side, via docker exec cron) | db (via cron) | Daily pg_dump output. |
| Wiki static | docker-managed (walkrpg_wiki_static) | /wiki/dist/ (wiki-builder, rw) / /var/www/wiki/ (nginx, ro) | wiki-builder writes, nginx reads | Astro Starlight static output. Repopulated by wiki-builder on container start; reproducible from git + lockfile, so persistence across restarts is incidental. |
3.5 Wiki coexistence — auth + serve model
The wiki.morrisassert.dev hostname is served as static files from the wiki-static docker named volume, populated at container startup by the wiki-builder service. The wiki-builder runs wiki/Dockerfile.build (multi-stage Astro build → busybox publish), executes `cp -r /build/wiki/dist/. /wiki/dist/` against the shared volume, then runs `tail -f /dev/null` to keep the container alive so that `docker compose ps` reports the service as running.
Why Option A (volume-share + tail) over Option B (standalone wiki container behind reverse_proxy):
- Preserves the single-nginx-ingress invariant from §3.2 — only one TLS termination surface.
- Cheaper at runtime. Wiki content is rarely-changing canon; rebuild on every CI deploy is fine. No standby nginx-like surface to keep memory-resident.
- Simpler attack surface. The wiki container exposes zero ports; the volume is the only data path.
The tradeoff is the mild idiomatic-but-not-pretty cp + tail -f keepalive pattern. Documented inline in backend/docker-compose.prod.yml so the next reader has full context.
Cloudflare Access — auth at edge.
The walkrpg-wiki Application is a Self-hosted Application in the Zero Trust dashboard:
- Application domain: `wiki.morrisassert.dev` (path: blank, which gates the whole host)
- Session duration: 24 hours
- Identity providers: One-time PIN minimum (email OTP); optional Google / GitHub fast-follows
- Policy: `morris-only`, with Action Allow, rule type Emails, value `<CEO_EMAIL>` (extendable by editing the Emails selector to include additional tester addresses; takes effect within ~30s)
A second walkrpg-swagger Application is configured against api.walkrpg.morrisassert.dev with path /api/docs/*. This gate only activates when SWAGGER_ENABLED=true (currently false in prod per §13.1); configuring it now means the gate is live the moment Swagger is re-enabled with no scramble.
nginx — implicit trust of CF Access JWT.
CF Access forwards authenticated requests to the origin with two artifacts:
- `CF_Authorization` cookie (the signed JWT)
- `Cf-Access-Jwt-Assertion` request header (same JWT, header form)
nginx trusts both implicitly today — there is no auth_request directive that validates the JWT against the per-application JWKS endpoint at https://<team>.cloudflareaccess.com/cdn-cgi/access/certs. This is acceptable for night 1 because:
- The Hetzner VPS public IP is not advertised — DNS resolves to CF edge IPs only (orange cloud mandatory).
- ufw blocks all inbound except 22/80/443 — no direct origin path exists for an attacker who does not know the IP.
- Wiki content is non-secret canon (lore, mechanics, tech ADRs). Even if CF Access were bypassed via a discovered origin-direct connection, the worst case is exposure of work-in-progress documentation, not credential leakage.
Defense-in-depth follow-up: add JWT validation in nginx. Two implementation paths:
- `auth_request` → tiny sidecar (`cloudflared` access tools or a 50-line Node service) that hits the JWKS endpoint, caches the keys, verifies the JWT signature + claims (aud must match the Application AUD; iss must be the team subdomain). Cleanest separation.
- `lua-resty-jwt` or `njs` inside nginx. Avoids the sidecar at the cost of an nginx-image rebuild with the OpenResty / NJS-enabled binary.
Tracked under Phase 14 follow-up items (B-level, no blocker). Revisit when (a) the wiki content includes anything genuinely sensitive (post-D-009 unfreeze artifacts, credentials in ADRs, etc.), or (b) the cohort grows past the point where origin IP discovery becomes statistically inevitable.
4. Environment / secrets management
Secrets live in two places:
- `/home/deploy/walkrpg/backend/.env` on the VPS (NOT in git, file mode `600`, owner `deploy`). Loaded by Docker Compose via `env_file:` directive per service.
- GitLab CI Variables (project Settings → CI/CD → Variables, masked + protected). Used during deploy step to ssh into the VPS and (re-)populate `.env` if needed, or simply for the build-image stage.
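
The file-mode requirement on the VPS `.env` can be checked mechanically. The following is a minimal sketch, assuming GNU coreutils `stat` (its `-c` flag) as found on a typical Linux VPS; the function name `check_env_mode` is hypothetical and not part of the runbook.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the given file exists and has mode 600.
# Assumes GNU coreutils `stat` (the -c flag), as found on typical Linux hosts.
check_env_mode() {
  local file="$1"
  [ -f "$file" ] || { echo "missing: $file" >&2; return 1; }
  local mode
  mode="$(stat -c '%a' "$file")"
  if [ "$mode" != "600" ]; then
    echo "unexpected mode $mode on $file (want 600)" >&2
    return 1
  fi
}
```

Calling `check_env_mode /home/deploy/walkrpg/backend/.env` before `docker compose up` would surface an accidentally world-readable secrets file early.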
Required environment variables on the VPS .env (mirrors backend/.env.example shape, with prod values):
| Variable | Example value | Source | Notes |
|---|---|---|---|
| `NODE_ENV` | `production` | Static | Enables NestJS prod-mode optimizations. |
| `PORT` | `3000` | Static | nginx upstream target. |
| `DATABASE_URL` | `postgresql://walkrpg:<password>@db:5432/walkrpg?schema=public` | Generated at provisioning | `db` is the compose service name, resolves on internal docker network. |
| `JWT_SECRET` | `<64 hex chars from openssl rand -hex 32>` | Generated at provisioning | HS256 signing key per ADR-0006 §Mock auth detail. |
| `JWT_ISSUER` | `walkrpg-api-prod` | Static (matches ADR-0006 + ADR-0007) | Wire-contract claim. |
| `JWT_AUDIENCE` | `walkrpg-mobile` | Static | Cross-platform (Android + iOS Phase 15 inherit per ADR-0007 §11). |
| `AUTH_MODE` | `mock` | Static (per ADR-0006) | Preserved across migration. Flips to `firebase` only when D-007/D-008/D-009 unfreeze. |
| `CORS_ALLOWED_ORIGINS` | `https://api.walkrpg.<root>,https://walkrpg.<root>` | Per-domain | Per ADR-0007 §10. Comma-separated. |
| `SWAGGER_ENABLED` | `false` (prod) / `true` (dev) | Per-env | Gated workaround per §13 known limitations. Disable in prod. |
| `POSTGRES_USER` | `walkrpg` | Static | Compose-passed to db service. |
| `POSTGRES_PASSWORD` | `<32 random chars>` | Generated at provisioning | Compose-passed to db service. Mirrored into `DATABASE_URL`. |
| `POSTGRES_DB` | `walkrpg` | Static | Compose-passed to db service. |
.env rotation: rotated manually by CEO at incident response or pre-cohort-expansion. No automatic rotation in Phase 14 (deferred follow-up; tracked under ops).
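
For the values marked "Generated at provisioning", a sketch along these lines may help at provisioning or manual rotation time. It assumes `openssl` is installed on the provisioning machine; the function names are hypothetical, and using hex output for the Postgres password is an assumption made here so that the password needs no percent-encoding when mirrored into `DATABASE_URL`.

```shell
#!/usr/bin/env bash
# Sketch: generate provisioning-time secrets and derive DATABASE_URL.
# Function names are hypothetical. Hex output is an assumption chosen so the
# password contains no characters that need escaping inside a URL.

gen_jwt_secret()  { openssl rand -hex 32; }   # 32 random bytes -> 64 hex chars
gen_db_password() { openssl rand -hex 16; }   # 16 random bytes -> 32 hex chars

# Usage: build_database_url USER PASSWORD DB
# Host "db" and port 5432 follow the compose service name used in this ADR.
build_database_url() {
  printf 'postgresql://%s:%s@db:5432/%s?schema=public\n' "$1" "$2" "$3"
}
```

A rotation would then write the new `POSTGRES_PASSWORD` and the matching `DATABASE_URL` into `.env` together, since the two must stay in sync.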
GitLab CI Variables required for deploy stage:
| Variable | Type | Used by |
|---|---|---|
| `SSH_DEPLOY_KEY` | File (masked) | deploy stage: private ed25519 key paired with VPS deploy user’s authorized_keys |
| `SSH_DEPLOY_HOST` | Variable | deploy stage: VPS IPv4 or `api.walkrpg.<root>` |
| `SSH_KNOWN_HOSTS` | Variable | deploy stage: output of `ssh-keyscan <host>`, pinned to prevent MITM |
| `REGISTRY_USER`, `REGISTRY_PASSWORD` | Variable + masked | build-image stage: GitLab Container Registry credentials (`CI_REGISTRY_USER` + `CI_JOB_TOKEN` suffice for same-project pushes) |
5. Domain + TLS topology
Registrar: Cloudflare Registrar (CEO-managed). Root domain is CEO’s personal portfolio domain; WalkRPG subdomain is api.walkrpg.<root>. Specific root is configured at runbook §D execution time.
DNS:
- `api.walkrpg.<root>`: A record → VPS IPv4, proxied (orange cloud) through Cloudflare.
- TTL: Auto (proxied records ignore TTL; CF handles).
Cloudflare SSL/TLS mode: Full (strict). This mode means CF requires a valid TLS cert on the origin, and validates the chain. LE-issued certs are valid for this mode (LE is in CF’s trusted CA set).
| Mode | What it does | Use here? |
|---|---|---|
| Off | No TLS anywhere | NO — public traffic in plaintext |
| Flexible | CF↔browser TLS, CF↔origin plaintext | NO — origin hop unencrypted; bearer tokens leak |
| Full | CF↔browser TLS, CF↔origin TLS but cert not validated | NO — MITM risk between CF and origin |
| Full (strict) | CF↔browser TLS, CF↔origin TLS with cert chain validation | YES |
Origin cert: Let’s Encrypt via certbot HTTP-01 challenge (webroot mode, shared with nginx). Renewal automated by certbot service running in cron mode (--keep-until-expiring, polling every 12h). Post-renewal hook signals nginx reload via docker compose kill -s HUP nginx.
Alternative considered: Cloudflare Origin Certificate. CF can issue an origin cert valid only for CF↔origin traffic, no public CA chain. Rejected for Phase 14 because (a) it ties the origin cert to CF infrastructure (lock-in), (b) LE is generic and survives a CF-proxy rip-out without cert reissuance, (c) the renewal automation is symmetric in complexity. Accepted: LE.
Edge cert: Cloudflare Universal SSL (free, CF-managed). No action required.
Cloudflare proxy security features engaged (free tier):
- DDoS mitigation (always-on).
- Bot Fight Mode (deferred — may be enabled later if scraping noise rises; off by default in Phase 14).
- Always Use HTTPS (Page Rule or SSL/TLS setting — runbook §D step engages it).
- HSTS at CF edge (in addition to origin nginx HSTS).
6. SSH hardening
Authentication: ed25519 key pair only. Password auth disabled. Root login disabled.
Users:
- `root`: provisioning only, login disabled after §B step of runbook.
- `deploy`: non-root, sudo-group + docker-group member. Owns `/home/deploy/walkrpg/` repo clone. All CI deploys ssh in as this user.
sshd_config deltas:

    PasswordAuthentication no
    PermitRootLogin no
    PubkeyAuthentication yes
    ChallengeResponseAuthentication no
    UsePAM yes
    X11Forwarding no
    PrintMotd no
    ClientAliveInterval 300
    ClientAliveCountMax 2

Firewall: ufw default deny incoming. Allowed ports: 22/tcp (SSH), 80/tcp (LE HTTP-01 challenge + nginx HTTP→HTTPS redirect), 443/tcp (nginx HTTPS). All other ports blocked.
Brute-force defense: fail2ban with default sshd jail (5 failures → 10-minute ban; 24h findtime). Sufficient for Phase 14 scale. Aggressive tuning deferred to ops follow-up.
SSH port change: out of scope. Port 22 with key-only auth + fail2ban is sufficient. Security-by-obscurity port change has marginal value and complicates GitLab CI / monitoring config; rejected.
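
The security-relevant sshd_config deltas above can be spot-checked with a small script. This is a sketch only: the function name `check_sshd_hardening` is hypothetical, it does a plain text match on "Keyword value" lines, and it does not evaluate `Match` blocks or `Include` files.

```shell
#!/usr/bin/env bash
# Hypothetical helper: verify that an sshd_config-style file sets the
# security-relevant directives from the list above. Plain text matching only;
# Match blocks and Include files are not evaluated.
check_sshd_hardening() {
  local file="$1" missing=0 want
  while IFS= read -r want; do
    # ${want% *} is the keyword, ${want#* } the value. sshd keywords are
    # case-insensitive; for brevity this sketch ignores case in values too.
    if ! grep -Eiq "^[[:space:]]*${want% *}[[:space:]]+${want#* }[[:space:]]*$" "$file"; then
      echo "missing or different: $want" >&2
      missing=1
    fi
  done <<'EOF'
PasswordAuthentication no
PermitRootLogin no
PubkeyAuthentication yes
ChallengeResponseAuthentication no
X11Forwarding no
ClientAliveInterval 300
ClientAliveCountMax 2
EOF
  return "$missing"
}
```

On a live host, `sshd -T` prints the effective configuration, so its output saved to a file is a more faithful input than the raw `sshd_config`.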
7. CI/CD shape
GitLab CI pipeline triggers on push to master (and on MR pipelines for lint + test only — no deploy on MR).
Stages:
- `lint`: `pnpm lint`, `pnpm lint:language`, `pnpm lint:naming`, `pnpm lint:canon`, `pnpm lint:tags` (run from monorepo root; backend has its own ESLint config too).
- `test`: `pnpm --filter @walkrpg/backend test` (Jest unit) + `pnpm --filter @walkrpg/backend test:e2e` (Jest e2e, runs against ephemeral Postgres via service container).
- `build-image`: Docker buildkit, target `backend/Dockerfile`, push to GitLab Container Registry as `registry.gitlab.com/positive-walkers/walkrpg/backend:$CI_COMMIT_SHORT_SHA` AND `:latest`. Runs only on `master`.
- `deploy`: ssh into VPS as `deploy@$SSH_DEPLOY_HOST`, execute:

      cd /home/deploy/walkrpg
      git pull origin master
      docker compose -f backend/docker-compose.prod.yml pull api
      docker compose -f backend/docker-compose.prod.yml run --rm api pnpm prisma migrate deploy
      docker compose -f backend/docker-compose.prod.yml up -d api

  Runs only on `master`. Two-stage compose call: migration first (init-style, blocks until done), then the api container is restarted on the new image.
Pipeline rules:
- `lint` + `test` run on every push and every MR.
- `build-image` + `deploy` run only on push to `master` (no MR previews: single-host has no preview environment surface in Phase 14).
- `.gitlab-ci.yml` lives at repo root. The backend-specific test commands invoke pnpm workspace filters.
Deploy concurrency: resource_group: production on the deploy job. Serializes concurrent master pushes; prevents two deploys racing on the same docker-compose state.
Rollback: documented in runbook §K. git checkout <prev-sha> && docker compose pull && up -d on the VPS as deploy. No GitLab automation for rollback in Phase 14 (deferred ops follow-up).
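
The deploy stage's remote command sequence from §7 can be sketched as a script that stops at the first failing step, so a failed migration never reaches `up -d`. The `run` and `deploy` function names and the `DRY_RUN` switch are assumptions for illustration; the actual `.gitlab-ci.yml` may structure this differently.

```shell
#!/usr/bin/env bash
# Sketch of the deploy stage's remote command sequence. With DRY_RUN=1 the
# commands are printed instead of executed. Names here are hypothetical.
COMPOSE_FILE="backend/docker-compose.prod.yml"

# Print the command in dry-run mode, otherwise execute it as given.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

deploy() {
  run cd /home/deploy/walkrpg || return 1
  run git pull origin master || return 1
  run docker compose -f "$COMPOSE_FILE" pull api || return 1
  # Migration runs first; a non-zero exit returns here, before `up -d`,
  # so the old api container keeps serving.
  run docker compose -f "$COMPOSE_FILE" run --rm api pnpm prisma migrate deploy || return 1
  run docker compose -f "$COMPOSE_FILE" up -d api
}
```

Running `DRY_RUN=1 deploy` prints the five commands in order without touching the host, which is a cheap way to review the sequence before wiring it into CI.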
8. Database migrations
Prisma migrations (backend/prisma/migrations/*) deploy via pnpm prisma migrate deploy (NOT migrate dev).
When migrations run:
- During CI/CD deploy stage, before the api container is restarted with the new image. The two-stage compose pattern (see §7 deploy stage) runs `docker compose run --rm api pnpm prisma migrate deploy` first; this exits non-zero on migration failure, aborting the deploy without bringing down the old api container.
- On first-time bootstrap (runbook §F step “First-time bootstrap”), migrations run after `db` is healthy and before `api` is started.
Migration safety:
- Backwards-incompatible migrations (column drops, type changes, NOT NULL adds without default) require a manual two-phase deploy: phase 1 lands a backwards-compatible migration + new app code that writes both old and new schema; phase 2 lands the cleanup migration + app code that reads only new schema. Phase 14’s single-host topology means there is no zero-downtime guarantee — a migration that takes >5s will cause request failures. Accepted: ~20-walker cohort tolerates brief migration windows. Documented in ops follow-up for cohort-growth-driven revisit.
- `migrate deploy` is idempotent: re-running is safe.
9. Backup / recovery
Daily Postgres dump: cron job on the VPS host (NOT in a container — host-side cron is more reliable than dockerized cron for this scale).

    0 3 * * * docker compose -f /home/deploy/walkrpg/backend/docker-compose.prod.yml exec -T db pg_dump -U walkrpg walkrpg | gzip > /var/backups/walkrpg/walkrpg-$(date +\%Y\%m\%d).sql.gz

Retention: 7 daily copies, rotated by `find /var/backups/walkrpg/ -name '*.sql.gz' -mtime +7 -delete` (separate daily cron at 04:00).
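
One detail of the rotation command is worth illustrating: `find -mtime +N` counts age in whole 24-hour periods and matches only files whose count is greater than N. A dump taken 7 days and 1 hour before the 04:00 job therefore survives a `+7` prune, so up to eight dumps may be on disk at that moment. The sketch below wraps the prune in a hypothetical `prune_backups` function so the threshold can be compared; it assumes a `find` that supports `-delete` (GNU or BSD).

```shell
#!/usr/bin/env bash
# Sketch: prune compressed dumps older than a given age.
# find's -mtime +N matches files whose age, in whole 24h periods, is
# greater than N, i.e. files that are at least N+1 days old.
# Usage: prune_backups DIR DAYS   (function name is hypothetical)
prune_backups() {
  find "$1" -name '*.sql.gz' -mtime "+$2" -delete
}
```

With dumps taken daily at 03:00 and pruning at 04:00, `prune_backups /var/backups/walkrpg 7` keeps the eight most recent files, while a threshold of `6` keeps seven.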
Restore path:

    gunzip -c /var/backups/walkrpg/walkrpg-YYYYMMDD.sql.gz | \
      docker compose -f /home/deploy/walkrpg/backend/docker-compose.prod.yml exec -T db psql -U walkrpg walkrpg

Tested restore drill: required within 7 days of Phase 14 deploy. Ops follow-up.
Out of scope for Phase 14 night 1:
- Offsite backup (S3 / B2 / Hetzner Storage Box). Tracked as ops follow-up; the 7-day local rotation is single-host-disk-failure-fatal.
- Point-in-time recovery (PITR / WAL archiving). Phase 14 cohort tolerates daily-granularity restore; PITR deferred to ops follow-up.
- Encrypted backups at rest. Dump file is plaintext SQL. Acceptable because (a) `/var/backups/walkrpg/` is mode `700`, owned by `deploy:deploy`, (b) the VPS disk is encrypted at the Hetzner block-storage layer per their default. Documented for ops review when offsite backup lands.
10. Observability
Logs: structured JSON to stdout from the NestJS app (RequestIdMiddleware + HttpExceptionFilter already emit per ADR-0007 §Backend deliverables shipped). Docker captures stdout. CEO/ops inspect via docker compose logs --tail 100 -f api.
Out of scope for Phase 14:
- ELK / Loki / Datadog / OpenSearch stack. Single-host scale + ~20 walkers does not justify the operational complexity. Tracked as ops follow-up.
- Metrics export (Prometheus, OpenTelemetry). Same justification.
- Alerting (PagerDuty, OpsGenie). Same. CEO checks `docker compose ps` + logs manually during the closed-beta phase.
What ships in Phase 14:
- `docker compose logs` rotation: Docker default log driver `json-file` with `max-size=50m`, `max-file=5` per service. Configured per service in `docker-compose.prod.yml`.
- Cloudflare Analytics (free tier) gives top-of-funnel traffic visibility.
- Hetzner Cloud Console gives CPU/RAM/disk graphs per VPS (free, no opt-in).
- `htop` + `du -sh /home/deploy/walkrpg/backend/pgdata/` for manual capacity checks.
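
The log-rotation settings put an upper bound on disk used by container logs, which matters on a 40GB disk shared with Postgres data and backups. As a rough sketch (the function name is hypothetical, and applying the same limits to all six prod services is an assumption), the worst case is services × max-size × max-file:

```shell
#!/usr/bin/env bash
# Sketch: worst-case disk used by json-file logs, in megabytes.
# Usage: log_budget_mb SERVICES MAX_SIZE_MB MAX_FILE   (hypothetical name)
log_budget_mb() { echo $(( $1 * $2 * $3 )); }
```

For the six prod services at `max-size=50m` and `max-file=5`, `log_budget_mb 6 50 5` gives 1500 MB.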
11. Cost
| Line | Monthly | Annual |
|---|---|---|
| Hetzner CX22 (shared vCPU x2, 4GB RAM, 40GB NVMe, 20TB egress) | €5.18 (excl. VAT) / €5.70 (incl. DE VAT) | ~€68 |
| Hetzner backup feature (20% surcharge — OPTIONAL, recommended Phase 14 night 1 = SKIP, use pg_dump) | €1.04 | — |
| Cloudflare Registrar (.com example) | — | €10-12 |
| Cloudflare proxy + Universal SSL + DDoS + Analytics | €0 (free tier) | €0 |
| Cloudflare Access (Zero Trust Free tier, ≤50 users) | €0 (free tier) | €0 |
| Let’s Encrypt | €0 | €0 |
| GitLab CI/CD (free tier — 400 build minutes/mo) | €0 | €0 |
| GitLab Container Registry | €0 (within free-tier storage at this scale) | €0 |
| Total | ~€5.70/mo | ~€78/year all-in |
Cost ceiling before re-evaluation: if monthly cost exceeds €25 (e.g., due to CX22 → CX32 vertical scale + backup feature + offsite backup), tech-architect raises FLAG_LEAD to game-director and CEO. Single-line bump within the same VPS class (CX22 → CX22 + 20% backup) is B-level autonomous.
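
The totals and the re-evaluation ceiling above reduce to simple arithmetic. The sketch below uses integer euro cents to avoid floating point in the shell; the function names are hypothetical, and the inputs are the VAT-inclusive figures from the table.

```shell
#!/usr/bin/env bash
# Sketch of the cost arithmetic, in integer euro cents.
# Function names are hypothetical.

# Usage: annual_cost_cents MONTHLY_CENTS DOMAIN_CENTS_PER_YEAR
annual_cost_cents() { echo $(( $1 * 12 + $2 )); }

# Succeeds (exit 0) when the monthly cost exceeds the EUR 25 ceiling.
exceeds_ceiling() { [ "$1" -gt 2500 ]; }
```

With €5.70/month and a €10 domain, `annual_cost_cents 570 1000` yields 7840 (€78.40); at a €12 domain it yields 8040 (€80.40). Adding the optional €1.04 backup surcharge gives 674 cents per month, well under the ceiling.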
12. Mobile reconfiguration (post-Phase-14-ship)
After the VPS is live and smoke tests pass:
- `walkrpg-mobile/android/local.properties` → `base.url=https://api.walkrpg.<root>/`. Single field swap per ADR-0007 §9.
- `walkrpg-mobile/android/app/src/main/res/xml/network_security_config.xml`: drop the IP-specific debug-overrides block. The Phase 13 config carried a `domain-config` allowing cleartext to the CEO laptop’s tunnel-exposed local IP for debug builds. With public HTTPS available, the default secure config applies and the override becomes dead config. Minimal config retains only `cleartextTrafficPermitted="false"` (which is the default; the file may be deleted entirely if no other overrides remain).
- Rebuild APK against the new BASE_URL, install on test device, smoke test the auth/callback + walker/profile flow against the public endpoint. Runbook §J.
iOS port (Phase 15) inherits identically — Info.plist or xcconfig carries the same https://api.walkrpg.<root>/ value.
13. Known limitations / known follow-ups
13.1 Swagger gated behind SWAGGER_ENABLED=true
Per fd7d3e8 (2026-05-22), Swagger setup is conditionally gated because a circular DTO dependency crashes the NestJS Swagger module bootstrap. Phase 14 production-side keeps SWAGGER_ENABLED=false in .env. Developers re-enable locally with SWAGGER_ENABLED=true pnpm dev for the schema browsing path.
Follow-up: fix the circular DTO dep (likely in backend/src/combat/ and backend/src/quest/ DTO cross-imports). Not blocking Phase 14. Tracked as B-level backend-engineer item.
13.2 Production runtime is tsx, not compiled dist/
Per 0b0c37b (2026-05-22), nest build emits to dist/ but the compiled output references @walkrpg/data/sim through unresolvable .ts paths (tsconfig paths are compile-time only; the emitter does not rewrite them). Dev mode uses tsx watch src/main.ts and the production container uses tsx src/main.ts (no watch) for the same reason.
Performance impact: tsx adds ~200ms startup overhead vs node dist/. At Phase 14 scale (single host, ~20 walkers, no autoscaling, no cold-starts) this is invisible. Memory impact: tsx keeps the TypeScript transpiler loaded in process; +~30MB resident. CX22’s 4GB RAM tolerates this easily.
Follow-up: fix the tsconfig paths leak. Likely requires either (a) @walkrpg/data package pre-compile step in CI so the backend imports compiled .js, or (b) a custom Nest CLI emitter that resolves paths in dist. Not blocking Phase 14. Tracked as B-level backend-engineer item, co-occurs with the Swagger gating fix.
13.3 Single-host = no HA
CX22 is a single VPS. Hetzner availability SLA covers ~99.9% uptime; planned maintenance is announced via email with 24h+ notice. Single-host failure modes (hardware failure, network partition, accidental docker compose down) cause full backend outage until restoration.
Accepted: ~20-walker closed-beta cohort tolerates outage windows of minutes-to-hours. HA infrastructure (multi-host, Kubernetes, autoscaling) re-enters scope only at production migration (D-007/D-008/D-009 unfreeze).
13.4 Daily backup is the minimum bar
The Phase 14 night-1 backup strategy is a daily pg_dump to the local host. Loss window: up to 24h of writes on disk failure. For a ~20-walker cohort, this is acceptable. Cohort-growth-driven trigger for offsite backup: ~50 walkers or one incident.
13.5 Mock auth is not production-secure
Per ADR-0006: anyone with the public hostname can POST /auth/callback {"email":"anything","displayName":"anything"} and obtain a 7-day session JWT. This is intentional for the closed-beta cohort context — testers are trusted. Production migration’s Firebase Auth + App Check engagement is the proper fix; that fix is paused indefinitely per D-009 framing.
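To make the exposure concrete, the sketch below builds the request described above without sending it. The base URL https://api.example.test is a placeholder (the real hostname is not fixed in this ADR), and buildMockAuthRequest is a hypothetical helper; only the path and the two body fields come from the text.

```typescript
// Hypothetical helper: assembles the unauthenticated mock-auth request.
// Nothing is sent over the network; this only shows the request shape.
interface MockAuthRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildMockAuthRequest(
  baseUrl: string,
  email: string,
  displayName: string,
): MockAuthRequest {
  // Strip trailing slashes so the path is joined exactly once.
  const origin = baseUrl.replace(/\/+$/, "");
  return {
    url: `${origin}/auth/callback`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ email, displayName }),
  };
}

// Placeholder hostname (assumption); any caller-chosen values are accepted.
const mockAuthRequest = buildMockAuthRequest(
  "https://api.example.test/",
  "anything",
  "anything",
);
console.log(mockAuthRequest.url, mockAuthRequest.body);
```

No credential or proof of identity appears anywhere in the request, which is why the posture is acceptable only for a trusted cohort.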
Mitigation in Phase 14: Cloudflare proxy provides DDoS shielding + bot detection (free tier). If griefing emerges (auto-registration spam), CEO can engage CF Bot Fight Mode + custom firewall rules without code changes. CORS is the only origin gate (per ADR-0007 §10); Phase 14 inherits this posture.
13.6 No rate limiting at the application layer
NestJS-side rate limiting (e.g., @nestjs/throttler) is not configured in Phase 14. Cloudflare’s free tier rate-limits at edge. Sufficient for cohort scale; engineered rate limiting deferred to ops follow-up.
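For orientation only, the deferred application-layer limiting could be as small as a fixed-window counter. The sketch below is a generic in-memory limiter, not the @nestjs/throttler API and not a committed design; the class name, the limit, and the window length are all assumptions chosen for illustration.

```typescript
// Generic fixed-window rate limiter (illustrative; not @nestjs/throttler).
// State lives in process memory, so it only suits a single-host deployment.
interface WindowEntry {
  windowStart: number;
  count: number;
}

class FixedWindowLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly entries: Map<string, WindowEntry>;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.entries = new Map<string, WindowEntry>();
  }

  // Returns true if the request identified by `key` is allowed at `nowMs`.
  allow(key: string, nowMs: number): boolean {
    const entry = this.entries.get(key);
    if (entry === undefined || nowMs - entry.windowStart >= this.windowMs) {
      // First request for this key, or the previous window has expired.
      this.entries.set(key, { windowStart: nowMs, count: 1 });
      return true;
    }
    if (entry.count < this.limit) {
      entry.count += 1;
      return true;
    }
    return false; // limit reached inside the current window
  }
}

// Usage with assumed values: 2 requests per 1000 ms per client key.
const limiter = new FixedWindowLimiter(2, 1000);
const decisions = [
  limiter.allow("client-a", 0),
  limiter.allow("client-a", 10),
  limiter.allow("client-a", 20),
  limiter.allow("client-a", 1000),
];
console.log(decisions);
```

An in-memory counter is adequate on a single host; it would stop being correct the moment a second backend instance appears, which is one reason to defer the engineered version.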
13.7 No structured monitoring dashboards
Per §10: docker compose logs is the observability tool. Acceptable for cohort scale; dashboards deferred.
13.8 Multi-env follow-ups (per ADR-0010)
ADR-0010 introduces a second stack (staging) on the same VPS. Follow-ups specific to that addition (cross-link from this ADR for discoverability):
- No ephemeral per-MR preview env (ADR-0010 §11.1).
- No automated db-migration forward-compatibility check between staging and prod (ADR-0010 §11.2).
- No automated rollback across envs (ADR-0010 §11.3).
- Orchestrator review is CEO-invoked, not webhook-driven, for night 1 (ADR-0010 §11.4).
- No CODEOWNERS file yet (ADR-0010 §11.5).
- No CI job that fails on direct-push to protected branches as belt-and-braces over GitLab branch protection (ADR-0010 §11.6).
14. When this ADR retires
ADR-0009 retires when D-007 / D-008 / D-009 unfreeze and the production migration to GCP europe-central2-warsaw ships:
- Production migration ADRs (working name: ADR-00NN — numbered at trigger time, like this one) author the cutover. They reference ADR-0009 as the source state.
- Phase 14 VPS lifecycle terminates after production-migration green-light + cutover smoke tests pass + DNS swap (api.walkrpg.<root> A record → production Cloud Run hostname).
- Hetzner VPS is decommissioned after a 30-day grace window (rollback insurance).
Not pinned to a phase number. D-015 frames production migration as paused-indefinitely; the unfreeze trigger is CEO-led, cost-redesign-driven, not phase-driven.
15. Consequences
- Closed-beta cohort unblocks. Asynchronous testing across timezones works. Test cohort can grow past ~30 walkers without CEO-laptop SLA pressure.
- D-009 §3 EU residency satisfied by Hetzner DE — fully EU-resident origin + data at rest. GDPR posture (delete-on-demand, anonymization sweep) is unimplemented in mock mode and remains so per ADR-0006 §Out of scope.
- Phase 13 sub-phase 13-13 device walkthrough can proceed against the public endpoint instead of the tunnel — improves test fidelity (public TLS chain exercised, not just CF Tunnel).
- CEO-laptop dependency lifted for the backend. Local backend runs remain available for development (pnpm dev continues to work), but tester reach no longer requires laptop uptime.
- Operational cost surface enters the budget. €78/year is a new line item; CEO is the budget owner. Cost ceiling per §11.
- GitLab CI minutes consumed. Free tier (400 min/mo) sufficient for ~10-20 deploys/day at current pipeline complexity. Monitored at CI dashboard; upgrade trigger documented at CI usage > 350 min/mo.
- Three Phase 13 backend deliverables (HttpExceptionFilter, RequestIdMiddleware, CORS bootstrap) already ship from ADR-0007 §Backend deliverables shipped. Phase 14 deploy carries them to the public endpoint unchanged.
- Mock-auth risk posture unchanged. Phase 14 does NOT change the security model — it only moves the deployment location. Production-grade auth (D-009 §1) remains paused.
- Three follow-up backend items tracked: Swagger circular dep, tsx-as-prod-runtime, application-layer rate limiting. None blocking Phase 14.
- Follow-up ops items tracked: offsite backup, restore drill, PITR, encrypted backups at rest, structured monitoring, alerting, rate limiting, rotation policy for .env secrets. All deferred per cost / cohort tradeoffs documented above.
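The two budget figures in the list above can be sanity-checked with simple arithmetic. The sketch below uses the €5.70/month VPS and €10-15/year domain prices from §1 and the 400 min/mo CI quota; the 30-day month and the assumption that every deploy consumes one full pipeline run are simplifications, and the function names are hypothetical.

```typescript
// Budget arithmetic in integer cents to avoid floating-point drift.
const VPS_CENTS_PER_MONTH = 570; // €5.70/month (from §1)
const DOMAIN_CENTS_PER_YEAR_MIN = 1000; // €10/year (from §1)
const DOMAIN_CENTS_PER_YEAR_MAX = 1500; // €15/year (from §1)

function annualCostCents(domainCentsPerYear: number): number {
  return VPS_CENTS_PER_MONTH * 12 + domainCentsPerYear;
}

// Average CI minutes available per deploy for a given monthly quota.
// Assumption: every deploy triggers exactly one pipeline run.
function minutesPerDeploy(
  quotaMinutes: number,
  deploysPerDay: number,
  daysPerMonth: number,
): number {
  return quotaMinutes / (deploysPerDay * daysPerMonth);
}

const annualMinCents = annualCostCents(DOMAIN_CENTS_PER_YEAR_MIN); // 7840
const annualMaxCents = annualCostCents(DOMAIN_CENTS_PER_YEAR_MAX); // 8340
const budgetAt10PerDay = minutesPerDeploy(400, 10, 30); // about 1.33 min
const budgetAt20PerDay = minutesPerDeploy(400, 20, 30); // about 0.67 min

console.log(annualMinCents, annualMaxCents);
console.log(budgetAt10PerDay.toFixed(2), budgetAt20PerDay.toFixed(2));
```

Under these assumptions the annual cost lands between €78.40 and €83.40, consistent with the €78/year line item at the low end of the domain price. The CI quota leaves roughly 0.67 to 1.33 minutes per deploy at 10-20 deploys/day, so the 350 min/mo upgrade trigger is the number to watch if pipelines run longer than that.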
16. Open questions
- Domain root choice. CEO confirmed Cloudflare Registrar + subdomain pattern api.walkrpg.<root>. The specific <root> (CEO’s portfolio domain) is filled in at runbook §D execution time. Not a blocker.
- Migration approach when Phase 14 retires. Whether the GCP cutover is a forklift migration (DNS swap + 30-day VPS overlap) or a parallel-run blue/green is decided at production-migration ADR authoring time. Out of scope here.
- Container Registry choice durability. GitLab Container Registry is used in Phase 14. If GitLab pricing changes or registry availability degrades, alternates (GHCR, Docker Hub paid, self-hosted) are tracked. B-level swap if needed.
- docker-compose.prod.yml vs ops/docker/docker-compose.prod.yml path. Backend-engineer chooses at implementation time. The runbook references backend/docker-compose.prod.yml as canonical; if backend-engineer prefers ops/docker/, the runbook updates in the paired implementation session.
17. Implementation handoff
This ADR is the spec layer. Implementation deliverables (NOT authored in this ADR, paired session follows after CEO confirms domain registered):
- backend/Dockerfile: node:22-alpine base, pnpm install, copy workspace data/ package (required for @walkrpg/data import), CMD tsx src/main.ts. See §13.2 for the tsx rationale.
- backend/docker-compose.prod.yml: services per §3.1, networks per §3.2, volumes per §3.3, env files per §4.
- backend/nginx/walkrpg.conf: reverse proxy 443 → api:3000, static serve wiki.morrisassert.dev from the wiki-static volume, LE cert paths, HSTS + security headers, HTTP→HTTPS redirect, CORS-aware proxy_pass directives.
- .gitlab-ci.yml (or amendments to existing): stages per §7, secrets per §4 GitLab CI Variables. Build + push the wiki-builder image alongside backend + portfolio.
- backend/scripts/backup-postgres.sh: cron-invoked dump script per §9.
- wiki/Dockerfile.build: multi-stage Astro build (node:22-alpine builder + busybox publish). Build context is the monorepo root; pnpm install --filter "@walkrpg/wiki..." pulls the wiki + @walkrpg/data closure. Compose command: invokes cp -r /build/wiki/dist/. /wiki/dist/ && tail -f /dev/null against the shared wiki-static volume.
These ship as a separate paired implementation session after the runbook’s CEO-side steps (sections A through F) are confirmed against the registered domain. The runbook documents the exact filenames so the implementation session has zero ambiguity.