Phase 14 — VPS provisioning runbook

Paired with ADR-0009 — VPS migration and ADR-0010 — branch strategy + envs. Sequential CLI commands; copy-paste in order. CEO runs sections A through H. Orchestrator pairs on section I (CI/CD) and J (mobile reconfig). Sections K (day-2 ops), L (branch migration), and M (hotfix workflow) are reference material added by ADR-0010.

Time estimate: ~90 minutes CEO time end-to-end if domain is already registered. Add ~30 minutes if registering domain first.

Required before starting:

  • Hetzner Cloud account (or signup → KYC pass)
  • Cloudflare account
  • Domain registered or ready to register at Cloudflare Registrar
  • Local ~/.ssh/id_ed25519 keypair (generate with ssh-keygen -t ed25519 -C "your@email" if missing)
  • GitLab access to gitlab.com:positive-walkers/walkrpg

Placeholders used throughout:

  • <VPS_IP> — Hetzner-assigned IPv4 (from §A step 4)
  • <root> — CEO’s domain root (e.g., morris.example)
  • <deploy-pubkey> — output of cat ~/.ssh/id_ed25519.pub on CEO laptop

A — Hetzner provisioning (~10 min)

A1. Sign in to Hetzner Cloud Console

Open https://console.hetzner.cloud/. If no account: sign up, complete KYC (Hetzner requires ID verification for new accounts; can take up to 24h on weekends — plan accordingly).

A2. Create project (or reuse existing)

Hetzner organizes resources by project. Create a walkrpg-prod project if none exists. Open it.

A3. Add SSH key to project

Console → Security → SSH Keys → “Add SSH Key”. Paste contents of cat ~/.ssh/id_ed25519.pub from CEO laptop. Name it ceo-laptop-ed25519.

A4. Create server

Console → Servers → “Add Server”:

| Field | Value |
| --- | --- |
| Location | Nuremberg (NBG1) or Falkenstein (FSN1): pick whichever has lower latency from CEO location (both DE/EU, GDPR-compliant) |
| Image | Ubuntu 24.04 |
| Type | CX22 (shared vCPU x2, x86, 4GB RAM, 40GB NVMe, 20TB egress), €5.18/mo + VAT |
| Networking | IPv4 + IPv6 (default; keep both) |
| SSH keys | Check the ceo-laptop-ed25519 key from A3 |
| Volumes | None |
| Firewalls | None (we use ufw on the host instead; Hetzner-side firewall optional, skip for now) |
| Backups | OFF (we use pg_dump + 7-day local rotation per ADR-0009 §9; Hetzner backup feature is +20% surcharge, defer to ops follow-up) |
| Placement groups | None |
| Labels | phase=14, env=prod |
| Cloud config | Leave blank |
| Name | walkrpg-api-1 |

Click “Create & Buy now”.

A5. Note the IPv4

Once provisioning completes (~30s), copy the assigned IPv4 from the server detail page. Record as <VPS_IP>.

A6. First connectivity test

From CEO laptop:

Terminal window
ssh root@<VPS_IP>

Should connect without password prompt. Type exit to disconnect.

If it prompts for password or fails: check SSH key was attached at creation (Hetzner does not let you add SSH keys to an already-created server without console-level recovery). Easiest fix: destroy + recreate.


B — SSH hardening + base user (~10 min)

B1. SSH back in as root

Terminal window
ssh root@<VPS_IP>

B2. Create deploy user

Terminal window
adduser --disabled-password --gecos "" deploy
usermod -aG sudo deploy

B3. Allow deploy passwordless sudo (provisioning only — tighten later if desired)

Terminal window
echo "deploy ALL=(ALL) NOPASSWD:ALL" > /etc/sudoers.d/deploy
chmod 440 /etc/sudoers.d/deploy

B4. Copy authorized_keys to deploy

Terminal window
mkdir -p /home/deploy/.ssh
cp /root/.ssh/authorized_keys /home/deploy/.ssh/authorized_keys
chown -R deploy:deploy /home/deploy/.ssh
chmod 700 /home/deploy/.ssh
chmod 600 /home/deploy/.ssh/authorized_keys

B5. Test deploy login (from CEO laptop, separate terminal)

Terminal window
ssh deploy@<VPS_IP>

Should connect. Do NOT close the root session yet — keep it open in case the next step breaks SSH.

B6. Harden sshd config (still as root)

Terminal window
sed -i 's/^#*PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
sed -i 's/^#*PermitRootLogin.*/PermitRootLogin no/' /etc/ssh/sshd_config
sed -i 's/^#*PubkeyAuthentication.*/PubkeyAuthentication yes/' /etc/ssh/sshd_config
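# Current OpenSSH names this directive KbdInteractiveAuthentication (ChallengeResponseAuthentication is the deprecated alias)
sed -i 's/^#*KbdInteractiveAuthentication.*/KbdInteractiveAuthentication no/' /etc/ssh/sshd_config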
sed -i 's/^#*X11Forwarding.*/X11Forwarding no/' /etc/ssh/sshd_config

Append client-alive directives:

Terminal window
cat >> /etc/ssh/sshd_config <<EOF
# WalkRPG Phase 14 hardening
ClientAliveInterval 300
ClientAliveCountMax 2
EOF

B7. Restart sshd

Terminal window
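# Validate the config before restarting; on Ubuntu 24.04 the systemd unit is named ssh, not sshd
sshd -t
systemctl restart ssh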

B8. Verify deploy still works (from CEO laptop, NEW terminal — keep the root session open)

Terminal window
ssh deploy@<VPS_IP>

If this works: root login is now disabled, password auth is disabled, key-only auth works. Close the root session.

If this fails: do NOT close the root session. Diagnose and fix from there.
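
The sed edits in B6 only touch /etc/ssh/sshd_config, while Ubuntu's stock config also includes drop-in files from /etc/ssh/sshd_config.d/, and for each directive the first value sshd reads wins. A minimal sketch for checking the effective settings, assuming `sshd -T` is run as root on the VPS; the helper name check_sshd_hardening is hypothetical and not part of the runbook:

```shell
# Hypothetical helper: reads `sshd -T` output on stdin and reports any
# directive from B6 that is missing or not at its hardened value.
check_sshd_hardening() {
  awk '
    BEGIN {
      want["passwordauthentication"] = "no"
      want["permitrootlogin"] = "no"
      want["pubkeyauthentication"] = "yes"
      want["x11forwarding"] = "no"
    }
    ($1 in want) {
      seen[$1] = 1
      if ($2 != want[$1]) { print "MISMATCH: " $1 " " $2; bad = 1 }
    }
    END {
      for (k in want) if (!(k in seen)) { print "MISSING: " k; bad = 1 }
      exit (bad ? 1 : 0)
    }
  '
}

# Usage on the VPS, from the still-open root session:
#   sshd -T | check_sshd_hardening && echo "sshd hardening OK"
```

A MISMATCH line usually means a drop-in file sets the directive before the main config does; fixing the drop-in and restarting the service again should clear it.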

B9. Install ufw + fail2ban (as deploy via sudo)

Terminal window
sudo apt update
sudo apt install -y ufw fail2ban

B10. Configure ufw

Terminal window
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow 22/tcp
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw --force enable
sudo ufw status verbose

Expected output: Status: active, with allow rules for 22/tcp, 80/tcp, and 443/tcp visible (each is listed a second time for IPv6 when IPv6 is enabled in ufw).
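
The same check can be scripted rather than eyeballed. This is a sketch only: ufw_rules_ok is a hypothetical helper that parses the text `ufw status` prints, and it assumes the rule lines begin with the port/protocol pair as they do in stock ufw output.

```shell
# Hypothetical helper: reads `ufw status` (or `ufw status verbose`) output
# on stdin and confirms the firewall is active with the three B10 rules.
ufw_rules_ok() {
  ufw_out=$(cat)
  printf '%s\n' "$ufw_out" | grep -q '^Status: active' || {
    echo "ufw is not active"; return 1
  }
  for port in 22 80 443; do
    printf '%s\n' "$ufw_out" | grep -Eq "^${port}/tcp[[:space:]]+ALLOW" || {
      echo "missing allow rule for ${port}/tcp"; return 1
    }
  done
}

# Usage on the VPS:
#   sudo ufw status | ufw_rules_ok && echo "ufw rules OK"
```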

B11. Confirm fail2ban is running

Terminal window
sudo systemctl status fail2ban

Should show active (running). Default sshd jail is enabled out of the box on Ubuntu 24.04.


C — Docker + Docker Compose install (~5 min)

C1. Install Docker via the official convenience script (as deploy)

Terminal window
curl -fsSL https://get.docker.com -o /tmp/get-docker.sh
sudo sh /tmp/get-docker.sh
rm /tmp/get-docker.sh

C2. Add deploy to the docker group

Terminal window
sudo usermod -aG docker deploy

C3. Apply group change to current shell (or re-login)

Terminal window
exit
ssh deploy@<VPS_IP>

C4. Verify Docker + Docker Compose

Terminal window
docker --version
docker compose version

Both should print versions. Docker Compose v2 ships as a Docker CLI plugin, so docker compose (no hyphen) is the canonical invocation.

C5. Smoke test

Terminal window
docker run --rm hello-world

Should print “Hello from Docker!”. If not: investigate before proceeding.


D — Cloudflare DNS + Registrar setup (CEO does on Cloudflare side, ~10 min)

This section is web UI, not CLI. Follow each step in the Cloudflare dashboard.

The VPS hosts the WalkRPG backend (prod + staging per ADR-0010), the personal portfolio site (per ADR-0009 §3.1 portfolio coexistence amendment), AND the WalkRPG wiki (per ADR-0009 §3.5 wiki coexistence amendment). Six hostnames share the single Hetzner CX22 + nginx + Let’s Encrypt SAN cert:

| Hostname | Routes to | Purpose |
| --- | --- | --- |
| morrisassert.dev | web:3000 | Next.js portfolio (apex) |
| www.morrisassert.dev | web:3000 | Next.js portfolio (www alias) |
| api.walkrpg.morrisassert.dev | api:3000 | NestJS backend, prod env (main branch) |
| api-staging.walkrpg.morrisassert.dev | api-staging:3000 | NestJS backend, staging env (dev branch), ADR-0010 §4 |
| walkrpg.morrisassert.dev | nginx 503 | Reserved for future public WalkRPG frontend |
| wiki.morrisassert.dev | static (volume) | Astro Starlight wiki, gated by Cloudflare Access (§D2) |

D1. Confirm domain at Cloudflare Registrar

If the domain is not yet registered or not yet at Cloudflare Registrar:

  • Cloudflare Dashboard → Registrar → Register a domain (or transfer-in an existing one).
  • Choose a .com or similar; Cloudflare Registrar is at-cost (~$10-12/year for .com).

For the canonical configuration the registrar’s domain is morrisassert.dev (.dev is a Google TLD, ~$13/year, mandatory-HTTPS via the Chrome HSTS preload list — strictly upgrades security posture).

If the domain is already in your Cloudflare account, skip ahead.

D2. Open the zone

Cloudflare Dashboard → choose the domain (morrisassert.dev).

D3. Add the six A records

DNS tab → Add record. Repeat for each row below:

| Type | Name | IPv4 address | Proxy status | TTL |
| --- | --- | --- | --- | --- |
| A | @ (or morrisassert.dev) | <VPS_IP> | Proxied (orange cloud) | Auto |
| A | www | <VPS_IP> | Proxied (orange cloud) | Auto |
| A | api.walkrpg | <VPS_IP> | Proxied (orange cloud) | Auto |
| A | api-staging.walkrpg | <VPS_IP> | Proxied (orange cloud) | Auto |
| A | walkrpg | <VPS_IP> | Proxied (orange cloud) | Auto |
| A | wiki | <VPS_IP> | Proxied (orange cloud) | Auto |

Save each record. All six point at the same VPS IPv4; nginx differentiates by Host header / server_name. The Cloudflare orange cloud is mandatory for the wiki record because Cloudflare Access (configured in §D2 below) operates at edge — bypassing the proxy bypasses the auth gate.

D4. Set SSL/TLS mode to Full (strict)

SSL/TLS tab → Overview → Encryption mode → Select Full (strict).

This is critical. Not Flexible, not Full — Full (strict). ADR-0009 §5 explains why.

D5. Always Use HTTPS

SSL/TLS tab → Edge Certificates → Always Use HTTPS → ON.

D6. Enable HSTS

SSL/TLS tab → Edge Certificates → HTTP Strict Transport Security (HSTS) → Enable. Max age: 6 months. Include subdomains: OFF (only the six hostnames from D3 are known to serve HTTPS through Cloudflare, so a blanket subdomain policy is not safe). Preload: OFF.

D7. Universal SSL certificate

SSL/TLS tab → Edge Certificates → Universal SSL → ensure it’s enabled (default). This handles the browser-facing cert. Cloudflare provisions automatically, ~15 minutes.

D8. Verify DNS propagation

From CEO laptop (or any external machine):

Terminal window
dig morrisassert.dev +short
dig www.morrisassert.dev +short
dig api.walkrpg.morrisassert.dev +short
dig api-staging.walkrpg.morrisassert.dev +short
dig walkrpg.morrisassert.dev +short
dig wiki.morrisassert.dev +short

All six should return Cloudflare IPs (104.x.x.x or 172.x.x.x range), not your <VPS_IP>. That confirms the proxy is active for every hostname. Allow 1-2 minutes after creating the A records.
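
The six lookups can also be run as one loop. The sketch below is an illustration, not part of the runbook: is_proxied and check_all_hosts are hypothetical helpers, and the only test applied is that each name resolves to something other than the origin IP.

```shell
HOSTS="morrisassert.dev www.morrisassert.dev api.walkrpg.morrisassert.dev api-staging.walkrpg.morrisassert.dev walkrpg.morrisassert.dev wiki.morrisassert.dev"

# Reads resolved addresses (one per line) on stdin. Succeeds when at least
# one address came back and none of them equals the origin IP in $1.
is_proxied() {
  awk -v origin="$1" '
    NF { n++; if ($1 == origin) leak = 1 }
    END { exit (n == 0 || leak) ? 1 : 0 }
  '
}

# Resolves every hostname and prints one status line per host.
# $1 is the origin IPv4 address recorded in A5.
check_all_hosts() {
  for h in $HOSTS; do
    if dig "$h" +short | is_proxied "$1"; then
      echo "proxied      $h"
    else
      echo "NOT PROXIED  $h"
    fi
  done
}

# Usage from the CEO laptop, substituting the real address:
#   check_all_hosts 203.0.113.10
```

A host reported as NOT PROXIED either has not propagated yet or has its orange cloud switched off in D3.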


D2 — Cloudflare Access setup for wiki + Swagger (~10 min, web UI)

The wiki hostname (wiki.morrisassert.dev) is gated by Cloudflare Access (Zero Trust free tier, ≤50 users). The Swagger UI on the API hostname (path /api/docs/*) is gated by the same mechanism when re-enabled. Both Applications run on the free tier and require no additional Cloudflare paid plan.

CF Access intercepts every request at edge, redirects unauthenticated users to its identity-broker flow, then forwards authenticated requests to the origin with a Cf-Access-Jwt-Assertion header. nginx trusts this header implicitly today; defense-in-depth JWT validation against the per-application JWKS endpoint is a follow-up (see ADR-0009 §3.5 wiki coexistence amendment for the follow-up tracking).

D2.1 Enable Zero Trust on the Cloudflare account

Cloudflare Dashboard → sidebar → Zero Trust.

If the team has never been onboarded: Cloudflare prompts you to pick a team subdomain (e.g., walkrpg.cloudflareaccess.com) and plan. Select the Free plan — covers up to 50 users, all the features Phase 14 needs, no card on file required (Cloudflare may still ask for a card for plan-tier compliance; the free tier is genuinely $0).

Pick a team name; the team subdomain becomes <team>.cloudflareaccess.com (this is the identity broker hostname).

D2.2 Create the wiki Access Application

Zero Trust → Access → Applications → Add an application → Self-hosted.

| Field | Value |
| --- | --- |
| Application name | walkrpg-wiki |
| Session duration | 24 hours |
| Application domain | Subdomain wiki, Domain morrisassert.dev (the form splits these into two selectors) |
| Application path | (leave blank; gates the whole host) |
| App launcher visibility | OFF (no need for the CF App Launcher UI for a single-tenant tool) |

Click Next.

D2.3 Choose identity providers

Identity providers screen → at minimum select One-time PIN (CF emails a 6-digit code to the address on the allowlist). Optional fast-follows:

  • Google (free, requires OAuth client setup at console.cloud.google.com — defer to follow-up)
  • GitHub (free, requires OAuth app at github.com/settings/developers — defer to follow-up)

One-time PIN alone is fine for the closed cohort.

Click Next.

D2.4 Create the access policy

Policies screen → Add a policy.

| Field | Value |
| --- | --- |
| Policy name | morris-only |
| Action | Allow |
| Session duration | (inherit application: 24 hours) |
| Configure rules → Include → Selector | Emails |
| Configure rules → Include → Value | <CEO_EMAIL> (the email address used for cohort access) |

Save policy.

Future tester onboarding: edit this policy’s Emails selector to include additional addresses, comma-separated. No code changes required.

Click Next, then Add application.

D2.5 Create the Swagger Access Application (path-gated)

Repeat D2.2-D2.4 with these differences:

| Field | Value |
| --- | --- |
| Application name | walkrpg-swagger |
| Application domain | Subdomain api.walkrpg, Domain morrisassert.dev |
| Application path | /api/docs/* (path-gated: ONLY this prefix requires auth; the rest of the API stays open per the mock-auth JWT model from ADR-0006) |
| Identity providers | Same: One-time PIN minimum |
| Policy | Same morris-only shape, same Email allowlist |

This Application only activates when SWAGGER_ENABLED=true is set on the API (currently false in prod per ADR-0009 §13.1). Configuring the Access policy now means the gate is live the moment Swagger is re-enabled; no scramble at the time of re-enable.

D2.6 Verify the wiki gate

From CEO laptop (in a fresh browser session, NOT logged in to CF):

https://wiki.morrisassert.dev/

Expected: CF Access splash page asks for the email. Enter the CEO email → CF emails the 6-digit PIN → enter the PIN → wiki loads.

If the CF Access splash does NOT appear and the wiki loads directly: the Application is either misconfigured or the orange cloud is off for the wiki A record. Re-check D3 + D2.2.

If the wiki returns 502 or 404 after auth: the wiki-builder container has not populated the wiki-static volume yet. Check docker compose ps on the VPS — walkrpg-wiki-builder should be running with the latest image SHA.
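
The unauthenticated half of this check can also be done from a terminal, since an Access-gated host answers an anonymous request with a redirect to the team's cloudflareaccess.com login. A rough sketch, assuming that redirect behaviour; is_access_redirect is a hypothetical helper and the exact redirect URL format is not something the runbook specifies:

```shell
# Hypothetical helper: reads HTTP response headers on stdin and succeeds
# when the response is a redirect whose Location is on cloudflareaccess.com.
is_access_redirect() {
  tr -d '\r' | awk '
    NR == 1 && $2 ~ /^30[1-8]$/ { redirect = 1 }
    tolower($1) == "location:" && $2 ~ /^https:/ && index($2, ".cloudflareaccess.com/") { broker = 1 }
    END { exit (redirect && broker) ? 0 : 1 }
  '
}

# Usage from the CEO laptop:
#   curl -sI https://wiki.morrisassert.dev/ | is_access_redirect && echo "Access gate is live"
```

This only proves the gate intercepts anonymous requests; the PIN flow itself still needs the browser walk-through above.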

D2.7 Adding a new tester later

CF dashboard → Zero Trust → Access → Applications → click walkrpg-wiki → Policies tab → click morris-only → Configure rules → Include → Emails → add the new address → Save.

No origin restart, no DNS change, no cert reissuance. Takes effect within ~30s.


E — Repo deploy keys + secrets (~10 min)

E1. Generate deploy key on VPS (as deploy)

Terminal window
ssh-keygen -t ed25519 -C "walkrpg-vps-deploy" -f ~/.ssh/walkrpg_deploy -N ""
cat ~/.ssh/walkrpg_deploy.pub

Copy the public key output.

E2. Add deploy key to GitLab project

In browser: GitLab → positive-walkers/walkrpg → Settings → Repository → Deploy Keys → Add new key.

| Field | Value |
| --- | --- |
| Title | walkrpg-vps deploy key |
| Key | (paste public key from E1) |
| Grant write permissions | OFF (read-only sufficient: CI pushes images, VPS only pulls source) |

Add key.

E3. Configure git on VPS to use the deploy key

Terminal window
cat >> ~/.ssh/config <<EOF
Host gitlab.com
HostName gitlab.com
User git
IdentityFile ~/.ssh/walkrpg_deploy
IdentitiesOnly yes
EOF
chmod 600 ~/.ssh/config

E4. Accept GitLab’s host key

Terminal window
ssh -T git@gitlab.com

Type yes when prompted. Expected response: Welcome to GitLab, @deploy-key-name! (or similar). Connection then closes; that’s normal — GitLab does not allow shell sessions.

E5. Clone the repo

Terminal window
cd /home/deploy
git clone git@gitlab.com:positive-walkers/walkrpg.git
cd walkrpg
git status

Should show On branch master (or main once the §L branch migration has been applied), working tree clean.

E6. Create .env file

The .env lives at /home/deploy/walkrpg/backend/.env. It is NOT committed to git.

Generate strong secrets first:

Terminal window
echo "JWT_SECRET=$(openssl rand -hex 32)"
echo "POSTGRES_PASSWORD=$(openssl rand -base64 24 | tr -d '/+=')"

Copy both values. Now create the .env:

Terminal window
mkdir -p /home/deploy/walkrpg/backend
cat > /home/deploy/walkrpg/backend/.env <<EOF
NODE_ENV=production
PORT=3000
# Postgres — service name 'db' resolves on docker network
POSTGRES_USER=walkrpg
POSTGRES_PASSWORD=<paste-from-above>
POSTGRES_DB=walkrpg
DATABASE_URL=postgresql://walkrpg:<paste-from-above>@db:5432/walkrpg?schema=public
# JWT (ADR-0006 mock-auth posture)
JWT_SECRET=<paste-from-above>
JWT_ISSUER=walkrpg-api-prod
JWT_AUDIENCE=walkrpg-mobile
AUTH_MODE=mock
# CORS — adjust if the mobile build needs additional origins
CORS_ALLOWED_ORIGINS=https://api.walkrpg.<root>,https://walkrpg.<root>
# Swagger gating (ADR-0009 §13.1) — OFF in prod
SWAGGER_ENABLED=false
EOF
chmod 600 /home/deploy/walkrpg/backend/.env

Edit the file with nano or vi to paste the actual values into the <paste-from-above> slots. The Postgres password appears twice (POSTGRES_PASSWORD and inside DATABASE_URL) and both copies must match; the JWT_SECRET slot takes the other generated value.
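
Because a mismatch here only surfaces later as a failed migration in F3, it can help to check the file right after editing. A minimal sketch, assuming the .env layout from E6; env_file_consistent is a hypothetical helper and it deliberately never prints the secret values:

```shell
# Hypothetical helper: checks that no placeholder is left in the env file
# and that the password inside DATABASE_URL equals POSTGRES_PASSWORD.
env_file_consistent() {
  env_file=$1
  if grep -q '<paste-from-above>' "$env_file"; then
    echo "unfilled <paste-from-above> placeholder"; return 1
  fi
  # Value after POSTGRES_PASSWORD=
  pw=$(sed -n 's/^POSTGRES_PASSWORD=//p' "$env_file")
  # Text between "user:" and "@host" in the connection URL
  url_pw=$(sed -n 's|^DATABASE_URL=postgresql://[^:]*:\([^@]*\)@.*|\1|p' "$env_file")
  if [ -z "$pw" ] || [ "$pw" != "$url_pw" ]; then
    echo "POSTGRES_PASSWORD and DATABASE_URL disagree"; return 1
  fi
}

# Usage on the VPS:
#   env_file_consistent /home/deploy/walkrpg/backend/.env && echo ".env looks consistent"
```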

E7. Verify .env mode

Terminal window
ls -l /home/deploy/walkrpg/backend/.env

Expected: -rw------- 1 deploy deploy .... If group/world readable: chmod 600 /home/deploy/walkrpg/backend/.env.


F — Docker compose stack (~10 min)

Prerequisite: the implementation files referenced below must exist in the repo. They ship as a separate paired implementation session AFTER the CEO confirms the domain is registered and ready (see “Implementation handoff” in ADR-0009 §17). The runbook references the canonical filenames so the paired session has zero ambiguity:

| File | Purpose |
| --- | --- |
| backend/Dockerfile | node:22-alpine base, pnpm install, copy data/ workspace, CMD tsx src/main.ts (ADR-0009 §13.2 explains why tsx in prod) |
| backend/docker-compose.prod.yml | api + db + nginx + certbot + web + wiki-builder services per ADR-0009 §3.1 |
| backend/docker-compose.staging.yml | api-staging + db-staging services per ADR-0010 §4; added at ADR-0010 ratification |
| backend/nginx/walkrpg.conf | reverse proxy 443→api:3000, 443→api-staging:3000 (ADR-0010), LE cert paths, HSTS, HTTP→HTTPS redirect |
| backend/scripts/backup-postgres.sh | daily pg_dump script invoked by cron per ADR-0009 §9 |
| backend/.env.prod.example | env template for the prod stack |
| backend/.env.staging.example | env template for the staging stack; added at ADR-0010 ratification |

If those files do not yet exist when you reach this section, STOP and surface to the orchestrator. The orchestrator runs the paired implementation session, commits the files to main, and resumes the runbook from F1 below.

Bring-up order matters. docker-compose.prod.yml references the staging stack’s network as external: true. Bring staging UP FIRST (it owns the network’s lifecycle), THEN prod. Tear down in reverse order. The two .env files (.env.prod, .env.staging) must both be populated on the VPS at /home/deploy/walkrpg/backend/ with mode 600 before bring-up.

F1. Pull the latest source on the VPS

Terminal window
cd /home/deploy/walkrpg
git pull origin master

F2. First-time bootstrap — bring up the staging network owner, then prod db

The staging compose file owns the walkrpg-net-staging docker network (the prod compose references it as external: true). Bring up the staging db FIRST so the network exists when prod starts:

Terminal window
cd /home/deploy/walkrpg
docker compose -f backend/docker-compose.staging.yml up -d db-staging
docker compose -f backend/docker-compose.prod.yml up -d db

Wait for both DBs to be healthy:

Terminal window
docker compose -f backend/docker-compose.prod.yml ps
docker compose -f backend/docker-compose.staging.yml ps

Status should be running (healthy) for both db (prod) and db-staging. If starting, wait 10s and re-check.
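
The wait-and-re-check step can be scripted as a small polling loop. This is a sketch under assumptions: wait_until and db_healthy are hypothetical helpers, and db_healthy only works if the compose services define a healthcheck, which the (healthy) status above implies.

```shell
# wait_until TRIES DELAY CMD...: re-runs CMD until it succeeds or TRIES
# attempts have been used, sleeping DELAY seconds after each failure.
wait_until() {
  tries=$1; delay=$2; shift 2
  i=1
  while [ "$i" -le "$tries" ]; do
    "$@" && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# db_healthy COMPOSE_FILE SERVICE: succeeds when the service's container
# reports a "healthy" healthcheck status.
db_healthy() {
  cid=$(docker compose -f "$1" ps -q "$2") && [ -n "$cid" ] &&
    [ "$(docker inspect --format '{{.State.Health.Status}}' "$cid")" = "healthy" ]
}

# Usage on the VPS, polling every 10s for up to 2 minutes:
#   wait_until 12 10 db_healthy backend/docker-compose.staging.yml db-staging
#   wait_until 12 10 db_healthy backend/docker-compose.prod.yml db
```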

F3. Run Prisma migrations against both DBs

Terminal window
docker compose -f backend/docker-compose.prod.yml run --rm api prisma migrate deploy
docker compose -f backend/docker-compose.staging.yml run --rm api-staging prisma migrate deploy

Expected output for each: All migrations have been successfully applied. or No pending migrations to apply. Exit code 0.

If exit code non-zero: check docker compose logs db (or db-staging) for connection issues, verify the matching .env file’s DATABASE_URL matches POSTGRES_PASSWORD exactly, retry.

F4. Start the rest of both stacks — temporarily without TLS

Before the LE cert exists, nginx cannot start in TLS mode. The bootstrap profile in docker-compose.prod.yml (the nginx-bootstrap service, profile bootstrap) serves plaintext :80 for ACME validation and smoke tests across all six hostnames.

Terminal window
# Bring up the staging stack first (full — api-staging + db-staging),
# then the prod stack with the bootstrap nginx profile + api + web +
# wiki-builder.
docker compose -f backend/docker-compose.staging.yml up -d
docker compose -f backend/docker-compose.prod.yml --profile bootstrap up -d \
nginx-bootstrap api web wiki-builder

nginx-bootstrap joins both walkrpg-net and walkrpg-net-staging so the ACME HTTP-01 challenge resolves for every hostname including api-staging.walkrpg.morrisassert.dev.

F5. Verify api is reachable over plain HTTP

From the VPS:

Terminal window
curl -i http://localhost:3000/

Expected: NestJS root response (likely 404 with JSON error envelope — that’s fine, it means the app is running).

From CEO laptop:

Terminal window
curl -i http://api.walkrpg.<root>/

Expected: response makes it through Cloudflare and back. With Always Use HTTPS enabled (D5), Cloudflare may answer with a redirect to HTTPS instead; that’s expected and means the next step (LE cert) is needed.


G — Let’s Encrypt cert (~5 min)

The cert is a multi-domain SAN cert covering all six hostnames on this VPS (portfolio apex + www + prod api subdomain + staging api subdomain + reserved walkrpg subdomain + wiki subdomain). LE issues one fullchain that all six nginx server blocks reference. Renewal is shared. Standard LE limit is 100 SAN entries per cert — six is far below.

The cert lands under the first -d argument’s directory. With morrisassert.dev listed first, the live path is /etc/letsencrypt/live/morrisassert.dev/fullchain.pem (and privkey.pem). The walkrpg.conf nginx config matches this path.

G1. Run certbot against the LE staging server (verify the flow)

Terminal window
docker compose -f backend/docker-compose.prod.yml run --rm certbot \
certonly --webroot --webroot-path=/var/www/certbot \
--staging \
--email <ceo-email> \
--agree-tos \
--no-eff-email \
-d morrisassert.dev \
-d www.morrisassert.dev \
-d api.walkrpg.morrisassert.dev \
-d api-staging.walkrpg.morrisassert.dev \
-d walkrpg.morrisassert.dev \
-d wiki.morrisassert.dev

Expected: Successfully received certificate. (staging LE-server cert is untrusted but proves the SAN flow works for all six hostnames; note “staging” here means the LE staging acme directory, not the WalkRPG staging env).

If failure: most common cause is the HTTP-01 challenge not reaching the webroot for one of the hostnames. Check:

  • Cloudflare proxy is on (orange cloud) for every A record.
  • nginx is serving /.well-known/acme-challenge/ from the webroot volume for every server_name.
  • ufw allows port 80.

If only one hostname fails: certbot lists exactly which -d flag could not be challenged. Re-check that A record’s proxy + DNS propagation.

G2. Delete the staging cert before requesting production

Terminal window
docker compose -f backend/docker-compose.prod.yml run --rm certbot \
delete --cert-name morrisassert.dev

G3. Run certbot against the production LE server

Terminal window
docker compose -f backend/docker-compose.prod.yml run --rm certbot \
certonly --webroot --webroot-path=/var/www/certbot \
--email <ceo-email> \
--agree-tos \
--no-eff-email \
-d morrisassert.dev \
-d www.morrisassert.dev \
-d api.walkrpg.morrisassert.dev \
-d api-staging.walkrpg.morrisassert.dev \
-d walkrpg.morrisassert.dev \
-d wiki.morrisassert.dev

Expected: Successfully received certificate. Cert lands at /etc/letsencrypt/live/morrisassert.dev/fullchain.pem + privkey.pem. The fullchain SAN list contains all six hostnames; openssl x509 -in /etc/letsencrypt/live/morrisassert.dev/fullchain.pem -noout -text | grep DNS confirms.
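
Rather than reading the grep output by eye, the SAN list can be compared against the six expected hostnames. A minimal sketch: san_covers_all is a hypothetical helper that takes the openssl text output on stdin and assumes the usual `DNS:name, DNS:name` layout.

```shell
EXPECTED_SANS="morrisassert.dev www.morrisassert.dev api.walkrpg.morrisassert.dev api-staging.walkrpg.morrisassert.dev walkrpg.morrisassert.dev wiki.morrisassert.dev"

# Reads certificate text on stdin, splits it into one token per line, and
# requires an exact DNS:<host> token for every expected hostname.
san_covers_all() {
  san_list=$(tr -s ', \t' '\n\n\n')
  for h in $EXPECTED_SANS; do
    printf '%s\n' "$san_list" | grep -Fxq "DNS:$h" || {
      echo "missing SAN: $h"; return 1
    }
  done
}

# Usage, wherever the cert path is readable:
#   openssl x509 -in /etc/letsencrypt/live/morrisassert.dev/fullchain.pem -noout -text \
#     | grep DNS | san_covers_all && echo "all six SANs present"
```

The exact-match comparison matters here: a substring match on morrisassert.dev would be satisfied by any of the subdomains.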

G4. Switch nginx to full TLS mode

Restore the nginx config to its TLS-enabled form (depends on the bootstrap pattern used in F4):

Terminal window
docker compose -f backend/docker-compose.prod.yml down nginx
docker compose -f backend/docker-compose.prod.yml up -d nginx

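Or, if using the profile pattern from F4, stop the bootstrap nginx so port 80 is free, then start the TLS nginx without --profile bootstrap:

Terminal window
docker compose -f backend/docker-compose.prod.yml --profile bootstrap stop nginx-bootstrap
docker compose -f backend/docker-compose.prod.yml up -d nginx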

G5. Verify TLS

From CEO laptop:

Terminal window
curl -I https://api.walkrpg.<root>/

Expected: HTTP/2 404 (or 200 if a root route exists). TLS handshake succeeded. Check headers include strict-transport-security from nginx.

In a browser, open https://api.walkrpg.<root>/. Lock icon should be present. Click the lock → cert details → issued by Let’s Encrypt (or Cloudflare Inc ECC CA-3 — that’s the edge cert; both being valid is the goal). No “Not Secure” warning.


H — Smoke test (~5 min)

H1. End-to-end auth/callback test

From CEO laptop:

Terminal window
curl -i -X POST https://api.walkrpg.<root>/auth/callback \
-H "Content-Type: application/json" \
-H "X-Request-Id: $(uuidgen)" \
-d '{"email":"<smoke-test-email>","displayName":"Phase 14 Smoke Tester"}'

Expected: HTTP/2 200 or HTTP/2 201 with response body containing session.accessToken, walker.id, isFirstLogin: true. Per ADR-0006 mock-auth shape.

H2. Use the returned token to fetch profile

Terminal window
TOKEN="<paste accessToken from H1 response>"
curl -i -X GET https://api.walkrpg.<root>/walker/profile \
-H "Authorization: Bearer ${TOKEN}" \
-H "X-Request-Id: $(uuidgen)"

Expected: HTTP/2 200 with walker profile body.
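
H1 and H2 can be chained without copy-pasting the token. The sketch below assumes the response is a single JSON document with an accessToken string field, as described in H1; extract_access_token is a hypothetical helper and the sed pattern is a naive extraction, not a JSON parser.

```shell
# Hypothetical helper: reads a JSON response body on stdin and prints the
# value of the first "accessToken" string field it finds.
extract_access_token() {
  sed -n 's/.*"accessToken"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Usage from the CEO laptop (substitute the real hostname and test email):
#   TOKEN=$(curl -s -X POST "https://api.walkrpg.example/auth/callback" \
#     -H "Content-Type: application/json" \
#     -d '{"email":"tester@example.com","displayName":"Phase 14 Smoke Tester"}' \
#     | extract_access_token)
#   [ -n "$TOKEN" ] && curl -i "https://api.walkrpg.example/walker/profile" \
#     -H "Authorization: Bearer ${TOKEN}"
```

If jq is installed, `jq -r '.session.accessToken'` is a sturdier replacement for the sed line.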

H3. Inspect logs

On VPS:

Terminal window
docker compose -f /home/deploy/walkrpg/backend/docker-compose.prod.yml logs --tail 50 api

Should show structured JSON log lines including the X-Request-Id values from H1 and H2.

H4. Common failure modes

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| curl: (6) Could not resolve host | DNS not propagated yet | Wait 1-2 minutes, retry. dig api.walkrpg.<root> should return Cloudflare IPs. |
| HTTP/2 502 or HTTP/2 504 | api container not running or unreachable from nginx | docker compose ps, docker compose logs api. Likely tsx boot failure; check .env syntax. |
| HTTP/2 503 after a few minutes | api container restarting in a loop | Same diagnosis path. Check DATABASE_URL matches POSTGRES_PASSWORD. |
| Browser shows ERR_SSL_VERSION_OR_CIPHER_MISMATCH | Cloudflare SSL mode wrong (must be Full strict, not Flexible) | Cloudflare dashboard → SSL/TLS → Overview → Full (strict). |
| Origin SSL certificate is not trusted (CF error 526) | LE cert not present on origin or cert mismatch | Re-run G3, verify /etc/letsencrypt/live/morrisassert.dev/fullchain.pem exists (the cert directory is named after the first -d argument, per §G), docker compose restart nginx. |
| Authentication errors after successful 200 on /auth/callback | JWT_SECRET mismatch between mint and verify, or env not loaded | Verify api container sees env via `docker compose exec api printenv …` |

If H1 and H2 both succeed against the prod hostname: Phase 14 prod is functionally live. Proceed to H5 for the staging smoke test, then section I (CI/CD).

H5. Repeat H1+H2 against staging

Terminal window
curl -i -X POST https://api-staging.walkrpg.morrisassert.dev/auth/callback \
-H "Content-Type: application/json" \
-H "X-Request-Id: $(uuidgen)" \
-d '{"email":"<smoke-test-email>","displayName":"Staging Smoke"}'

Expected: same shape as H1 — 200/201 with session.accessToken. The staging endpoint is a fully independent stack — its JWT_SECRET differs from prod, so the token returned here cannot verify against prod and vice versa.

Terminal window
STAGING_TOKEN="<paste accessToken from above>"
curl -i https://api-staging.walkrpg.morrisassert.dev/walker/profile \
-H "Authorization: Bearer ${STAGING_TOKEN}"

Expected: 200 with walker profile body.

If H5 fails but H1+H2 pass: the staging stack is independently broken — check docker compose -f backend/docker-compose.staging.yml ps + logs, verify .env.staging is populated, verify the walkrpg-net-staging network exists (docker network ls | grep staging), and that the prod nginx joined it (docker network inspect walkrpg-net-staging shows the walkrpg-nginx-1 container).

If H1, H2, and H5 all succeed: both envs are live. Proceed to section I (CI/CD).


I — GitLab CI/CD pipeline (~20 min, orchestrator-paired)

This section is paired with the orchestrator + backend-engineer. The runbook lists the moving parts; the orchestrator authors .gitlab-ci.yml in a separate paired session if not already present.

I1. Confirm .gitlab-ci.yml exists at repo root

If not present, surface to orchestrator. The orchestrator authors it per ADR-0009 §7 + ADR-0010 §9.

The pipeline shape is multi-env (per ADR-0010 §9):

  • lint + test run on every push (any branch) and every MR.
  • build:image runs on push to dev OR main only. Pushes one image tagged with :<short-sha> AND :<branch-name>.
  • deploy-staging runs on push to dev only. SSH to VPS, pull image, docker compose -f backend/docker-compose.staging.yml .... Serialized via resource_group: staging.
  • deploy-prod runs on push to main only. SSH to VPS, pull image, docker compose -f backend/docker-compose.prod.yml .... Serialized via resource_group: production.

The two deploy jobs CAN run in parallel (different resource groups, different envs), but each one serializes against itself so concurrent merges to the same branch don’t race on docker-compose state.

I2. Add GitLab CI Variables

In browser: GitLab → positive-walkers/walkrpg → Settings → CI/CD → Variables → Add variable:

| Key | Type | Value | Flags |
| --- | --- | --- | --- |
| SSH_DEPLOY_KEY | File | (paste contents of VPS ~/.ssh/walkrpg_deploy PRIVATE key, from VPS cat /home/deploy/.ssh/walkrpg_deploy) | Protected: ON, Masked: ON (note: File-type masking masks the file contents in logs; the value itself does not show) |
| SSH_DEPLOY_HOST | Variable | <VPS_IP> or api.walkrpg.<root> (IP is more reliable for SSH; CF proxy doesn’t proxy SSH); recommend <VPS_IP> | Protected: ON |
| SSH_KNOWN_HOSTS | Variable | Output of ssh-keyscan <VPS_IP> from CEO laptop (paste all lines) | Protected: ON |

The GitLab Container Registry credentials ($CI_REGISTRY_USER, $CI_JOB_TOKEN) are automatically injected by GitLab for same-project pushes — no manual variable needed.
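
For the deploy jobs to log in to the VPS with SSH_DEPLOY_KEY, the matching public key has to be present in deploy's authorized_keys, and no earlier step adds it. The following is a sketch that assumes the walkrpg_deploy keypair from E1 is reused for CI-to-VPS access, as the table above suggests; authorize_pubkey is a hypothetical helper.

```shell
# Hypothetical helper: appends the public key in $1 to the authorized_keys
# file in $2 unless an identical line is already there.
authorize_pubkey() {
  pub=$1; auth=$2
  touch "$auth" && chmod 600 "$auth"
  grep -qxF -- "$(cat "$pub")" "$auth" || cat "$pub" >> "$auth"
}

# Usage on the VPS as deploy:
#   authorize_pubkey ~/.ssh/walkrpg_deploy.pub ~/.ssh/authorized_keys
```

A dedicated CI keypair would keep the GitLab pull key and the VPS login key separate; reusing one key is simply the shortest path consistent with the table.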

I3. Push a no-op commit to main to trigger the prod pipeline

From CEO laptop (or any clone). NOTE: post-ADR-0010 the default branch is main, not master. The migration steps live in §L below.

Terminal window
git checkout main
git pull origin main
# Direct push to main is denied by branch protection, so go through an MR.
# For the very first pipeline kick, create a throwaway feature branch first
# and put the empty commit on it (not on local main):
git checkout -b feature/ci-bootstrap
git commit --allow-empty -m "ci: trigger Phase 14 first deploy"
git push -u origin feature/ci-bootstrap
# Open MR feature/ci-bootstrap -> dev in GitLab UI, merge it. Then open
# MR dev -> main and merge that to trigger deploy-prod.

For ongoing day-2 deploys, follow the workflow in ADR-0010 §5.

I4. Watch the pipeline

GitLab → positive-walkers/walkrpg → CI/CD → Pipelines. The new pipeline should show lint → test → build → deploy-staging (for the dev push) or lint → test → build → deploy-prod (for the main push). Each stage runs sequentially within its branch.

Expected total runtime: 4-8 minutes first run (Docker layer cache cold).

I5. Verify deploy succeeded

After the deploy stage finishes green:

Terminal window
# From CEO laptop — repeat the H1 smoke test
curl -i -X POST https://api.walkrpg.<root>/auth/callback \
-H "Content-Type: application/json" \
-d '{"email":"<smoke-test-email>","displayName":"Post Deploy"}'

Expected: 200/201 with session.

On VPS, check the image SHA:

Terminal window
docker compose -f /home/deploy/walkrpg/backend/docker-compose.prod.yml images api

Should show registry.gitlab.com/positive-walkers/walkrpg/backend:<short-sha> matching the latest main commit.

I6. Pipeline troubleshooting

| Stage fails | Look at | Common fix |
| --- | --- | --- |
| lint | Job log | Local pnpm lint not run; fix locally and re-push |
| test | Job log | Test failures from unrelated changes; investigate |
| build-image | Job log | Dockerfile path wrong, dependencies fail to install, or registry auth misconfigured |
| deploy | Job log + docker compose logs on VPS | SSH key mismatch (re-check I2 SSH_DEPLOY_KEY paste), wrong known_hosts, or migration failure |

J — Mobile reconfig (CEO + orchestrator pairing, post-J0)

Prerequisite (J0): Phase 14 backend is live + smoke tests pass.

J1. Update local.properties

In walkrpg-mobile/android/local.properties:

base.url=https://api.walkrpg.<root>/

Replace whatever local-tunnel hostname was there during Phase 13.

J2. Update network security config

Edit walkrpg-mobile/android/app/src/main/res/xml/network_security_config.xml. Drop the IP-specific debug-overrides block that allowed cleartext to the CEO-laptop tunnel.

Minimal config:

<?xml version="1.0" encoding="utf-8"?>
<network-security-config>
<base-config cleartextTrafficPermitted="false" />
</network-security-config>

Or, if no overrides remain, delete the file entirely and remove the android:networkSecurityConfig attribute from AndroidManifest.xml.

J3. Rebuild APK

Terminal window
cd walkrpg-mobile/android
./gradlew clean assembleDebug

J4. Install on test device

Terminal window
adb install -r app/build/outputs/apk/debug/app-debug.apk

J5. Smoke test on device

Open app → mock-auth screen → enter email + displayName → submit. Expected: auth succeeds against the public endpoint, walker profile loads.

In the device logcat:

Terminal window
adb logcat | grep -i walkrpg

Should show outgoing requests to https://api.walkrpg.<root>/.... No cleartext warnings.
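The "no cleartext" check can be automated by scanning a log dump. This sketch assumes that a cleartext problem shows up either as a log line containing the word "cleartext" or as a plain `http://` URL; `no_cleartext_in_log` is a hypothetical helper name.

```shell
# Hypothetical helper: reads log text on stdin and succeeds only if it finds
# no cleartext warning and no plain http:// request URL.
no_cleartext_in_log() {
  ! grep -Eiq 'cleartext|http://'
}
```

A possible invocation is `adb logcat -d | grep -i walkrpg | no_cleartext_in_log && echo clean`, where `-d` dumps the current buffer and exits instead of following.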

J6. Commit the mobile changes

In walkrpg-mobile repo:

Terminal window
git add android/local.properties android/app/src/main/res/xml/network_security_config.xml
git commit -m "chore(mobile): point base.url at Phase 14 VPS endpoint"
git push origin master

local.properties is typically gitignored — if so, commit the change to a local.properties.example template instead and document the swap in the mobile README.


K — Day-2 operational reference

K1. Postgres backup cron

Install the cron job (one-time on VPS as deploy):

Terminal window
sudo mkdir -p /var/backups/walkrpg
sudo chown deploy:deploy /var/backups/walkrpg
sudo chmod 700 /var/backups/walkrpg
crontab -e

Add lines:

0 3 * * * docker compose -f /home/deploy/walkrpg/backend/docker-compose.prod.yml exec -T db pg_dump -U walkrpg walkrpg | gzip > /var/backups/walkrpg/walkrpg-$(date +\%Y\%m\%d).sql.gz
0 4 * * * find /var/backups/walkrpg/ -name 'walkrpg-*.sql.gz' -mtime +7 -delete

Verify with:

Terminal window
crontab -l
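One caveat with the cron line: in a `pg_dump | gzip` pipeline the exit status is gzip's, so a failed dump can still leave a well-formed but empty `.gz` file. A minimal sketch of a validity check, with `backup_is_valid` as a hypothetical helper name:

```shell
# Hypothetical helper: succeeds only if the file is a valid gzip archive
# that decompresses to at least one byte.
backup_is_valid() {
  [ -f "$1" ] || return 1
  gzip -t "$1" 2>/dev/null || return 1
  [ "$(gunzip -c "$1" | head -c 1 | wc -c | tr -d ' ')" -gt 0 ]
}
```

Running it against the most recent file in /var/backups/walkrpg/ after the first scheduled run confirms that the job produces real dumps and not only files.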

K2. Manual backup (ad-hoc)

Terminal window
docker compose -f /home/deploy/walkrpg/backend/docker-compose.prod.yml exec -T db \
pg_dump -U walkrpg walkrpg | gzip > /var/backups/walkrpg/walkrpg-manual-$(date +%Y%m%d-%H%M).sql.gz

K3. Restore from backup

Terminal window
gunzip -c /var/backups/walkrpg/walkrpg-YYYYMMDD.sql.gz | \
docker compose -f /home/deploy/walkrpg/backend/docker-compose.prod.yml exec -T db \
psql -U walkrpg walkrpg

Schedule a restore drill within 7 days of Phase 14 launch. Documented as an ops follow-up.
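For a restore or a drill, the newest scheduled dump can be picked by name, because the YYYYMMDD suffix sorts chronologically. This is a sketch; `latest_backup` is a hypothetical helper name, and manual dumps are deliberately excluded.

```shell
# Hypothetical helper: prints the newest scheduled dump in a directory.
# Glob results are sorted, so the last match carries the latest date.
# Manual dumps (walkrpg-manual-*) are skipped by the [0-9] in the pattern.
latest_backup() {
  latest=""
  for f in "$1"/walkrpg-[0-9]*.sql.gz; do
    if [ -e "$f" ]; then
      latest=$f
    fi
  done
  if [ -z "$latest" ]; then
    return 1
  fi
  printf '%s\n' "$latest"
}
```

Its output can replace the hand-typed file name in the restore command, for example `gunzip -c "$(latest_backup /var/backups/walkrpg)"`.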

K4. Cert renewal verification

certbot runs in the compose stack and polls every 12h. Force-check renewal:

Terminal window
docker compose -f /home/deploy/walkrpg/backend/docker-compose.prod.yml run --rm certbot renew

Cert expiry visible at:

Terminal window
docker compose -f /home/deploy/walkrpg/backend/docker-compose.prod.yml run --rm certbot certificates

LE issues 90-day certs; renewal happens at 30 days remaining.
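The 30-day rule is simple arithmetic on the expiry date. The sketch below works on epoch seconds so it stays independent of date-parsing differences; `days_remaining` and `renewal_due` are hypothetical helper names.

```shell
# Hypothetical helpers. Both take epoch seconds.
# $1 = certificate expiry, $2 = current time.
days_remaining() {
  echo $(( ($1 - $2) / 86400 ))
}

# Succeeds once fewer than 30 days remain, the point at which renewal starts.
renewal_due() {
  [ "$(days_remaining "$1" "$2")" -lt 30 ]
}
```

On the VPS (GNU date), the expiry shown by `certbot certificates` can be converted with `date -d "<expiry date>" +%s`, and the current time with `date +%s`.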

K5. Log inspection

Terminal window
# All services, last 100 lines, follow mode
docker compose -f /home/deploy/walkrpg/backend/docker-compose.prod.yml logs --tail 100 -f
# api only
docker compose -f /home/deploy/walkrpg/backend/docker-compose.prod.yml logs --tail 100 -f api
# nginx only
docker compose -f /home/deploy/walkrpg/backend/docker-compose.prod.yml logs --tail 100 -f nginx

Filter by X-Request-Id:

Terminal window
docker compose -f /home/deploy/walkrpg/backend/docker-compose.prod.yml logs api | grep "<request-id>"

K6. Rollback to previous deploy

Terminal window
cd /home/deploy/walkrpg
git log --oneline -5 origin/main # find target SHA (master was renamed to main in §L)
git checkout <prev-sha>
docker compose -f backend/docker-compose.prod.yml pull api
docker compose -f backend/docker-compose.prod.yml run --rm api pnpm prisma migrate deploy
docker compose -f backend/docker-compose.prod.yml up -d api

Note: a rollback that crosses a Prisma migration is risky if the migration is destructive (column drops). Inspect the diff before rolling back across migrations.
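A quick way to inspect that diff is to scan the migration SQL for statements that discard data. This is a sketch, not a complete safety check; `has_destructive_sql` is a hypothetical helper name, and the migrations path in the usage note is an assumption about the repo layout.

```shell
# Hypothetical helper: reads SQL (or a diff of SQL files) on stdin and
# succeeds if it contains a statement that discards data.
has_destructive_sql() {
  grep -Eiq 'DROP[[:space:]]+(TABLE|COLUMN)|TRUNCATE'
}
```

For example, `git diff <prev-sha> HEAD -- backend/prisma/migrations | has_destructive_sql && echo "inspect before rolling back"`. A clean result does not prove the rollback is safe; it only flags the obvious cases.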

Return to current after recovery:

Terminal window
git checkout main
docker compose -f backend/docker-compose.prod.yml pull api
docker compose -f backend/docker-compose.prod.yml up -d api

K7. Full stack stop / start

Terminal window
# Stop everything (api, db, nginx, certbot)
docker compose -f /home/deploy/walkrpg/backend/docker-compose.prod.yml stop
# Start everything
docker compose -f /home/deploy/walkrpg/backend/docker-compose.prod.yml start
# Down (stop + remove containers, keep volumes)
docker compose -f /home/deploy/walkrpg/backend/docker-compose.prod.yml down
# Down + remove volumes (DESTRUCTIVE — wipes db)
docker compose -f /home/deploy/walkrpg/backend/docker-compose.prod.yml down -v

K8. Update Docker images outside of CI

Manual pull (skips CI deploy stage — use only for hotfixes):

Terminal window
cd /home/deploy/walkrpg
git pull origin main
docker compose -f backend/docker-compose.prod.yml pull
docker compose -f backend/docker-compose.prod.yml run --rm api pnpm prisma migrate deploy
docker compose -f backend/docker-compose.prod.yml up -d

K9. VPS resource check

Terminal window
# CPU + RAM
htop
# Disk
df -h
du -sh /home/deploy/walkrpg/backend/pgdata/
du -sh /var/backups/walkrpg/
# Network
ss -tunap | grep LISTEN

Hetzner Cloud Console also shows CPU/RAM/disk/network graphs per VPS, free, no opt-in.
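For a scripted disk check, the used percentage can be extracted from `df`. This is a minimal sketch that assumes the filesystem name contains no spaces; `disk_used_pct` is a hypothetical helper name.

```shell
# Hypothetical helper: prints the used percentage (without the % sign) of
# the filesystem holding the given path, using POSIX-format df output.
disk_used_pct() {
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}
```

A check such as `[ "$(disk_used_pct /var/backups/walkrpg)" -lt 80 ] || echo "disk filling up"` could run alongside the backup cron; the threshold of 80 is an arbitrary example.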

K10. Secret rotation

To rotate JWT_SECRET or POSTGRES_PASSWORD:

Terminal window
# Edit .env
nano /home/deploy/walkrpg/backend/.env
# (paste new values, save)
# Postgres password rotation requires DB-side update too
docker compose -f /home/deploy/walkrpg/backend/docker-compose.prod.yml exec db \
psql -U walkrpg -c "ALTER USER walkrpg WITH PASSWORD '<new-password>';"
# Restart api to pick up new env
docker compose -f /home/deploy/walkrpg/backend/docker-compose.prod.yml up -d api

Note: rotating JWT_SECRET invalidates all live sessions immediately. All testers must re-auth. Schedule outside of active test windows.
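New values should come from a random source, not be typed by hand. The sketch below emits hex only, which avoids quoting problems when the value is pasted into .env and into the ALTER USER statement; `gen_secret` is a hypothetical helper name and 32 bytes is an assumed length.

```shell
# Hypothetical helper: prints 32 random bytes as 64 lowercase hex characters.
gen_secret() {
  head -c 32 /dev/urandom | od -An -v -tx1 | tr -d ' \n'
  echo
}
```

Run it once per secret so that JWT_SECRET and POSTGRES_PASSWORD never share a value.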

K11. Adding a new tester (currently: no special action)

In mock-auth mode, any caller can POST /auth/callback and self-register. No allowlist exists. Sharing the public hostname api.walkrpg.<root> is the only “invite” step.

When the mock-auth posture flips to Firebase (post-production-migration unfreeze), this changes — tester onboarding becomes a Firebase Auth user invite.

K12. Incident response checklist (skeleton)

If the API is down:

  1. docker compose ps — are containers running?
  2. docker compose logs --tail 200 api — boot failure or runtime crash?
  3. docker compose logs --tail 200 nginx — TLS / upstream errors?
  4. curl http://localhost:3000/ from VPS — does api respond on internal network?
  5. curl https://api.walkrpg.<root>/ from CEO laptop — does Cloudflare reach origin?
  6. Hetzner Cloud Console — is the VPS up? Network OK? Disk full?
  7. Cloudflare dashboard → Analytics — is traffic hitting CF at all?

If a security incident is suspected:

  1. Pull VPS off-network at Hetzner Cloud Console (network detach) — preserves forensic state.
  2. Snapshot the disk (Hetzner Cloud Console → Server → Snapshots) before any cleanup.
  3. Write to ~/.claude/walkrpg/CRITICAL.md for CEO surfacing.
  4. Investigate from a separate, clean VPS or local environment.

L — Branch migration from master to main + protection rules (one-time, ~10 min)

This section is the operational counterpart to ADR-0010 §10. Run once at ADR-0010 ratification time. Subsequent day-2 workflow follows the patterns in §M and ADR-0010 §5.

L1. Confirm clean working tree

From CEO laptop:

Terminal window
cd /path/to/walkrpg
git status

Must report nothing to commit, working tree clean on master. If not, commit or stash the outstanding work before proceeding.

L2. Rename master to main in GitLab

In browser: GitLab → positive-walkers/walkrpg → Settings → Repository → Default branch.

| Field | Value |
| --- | --- |
| Default branch | main |

If main does not yet exist, GitLab UI offers a one-click rename of the existing default branch. Click “Rename master to main”. The rename is non-destructive — all history is preserved verbatim under the new name.

Confirm in the Branches list that master is gone and main is the new default.

L3. Update local clones

Terminal window
git fetch origin
git branch -m master main # rename local master to main
git branch -u origin/main main # re-track to the new remote
git remote set-head origin -a # update the symbolic-ref HEAD
git pull origin main

Verify:

Terminal window
git branch -vv

Should show * main <sha> [origin/main].

L4. Branch dev from main

Terminal window
git checkout main
git pull origin main
git checkout -b dev
git push -u origin dev

L5. Apply branch protection rules (GitLab UI)

GitLab → positive-walkers/walkrpg → Settings → Repository → Protected branches.

Add main protection:

| Field | Value |
| --- | --- |
| Branch | main |
| Allowed to merge | Maintainers |
| Allowed to push and merge | No one |
| Allowed to force push | OFF |
| Allow deletion | OFF |
| Required approvals (merge requests) | 1 |

Add dev protection:

| Field | Value |
| --- | --- |
| Branch | dev |
| Allowed to merge | Maintainers |
| Allowed to push and merge | No one |
| Allowed to force push | OFF |
| Allow deletion | OFF |
| Required approvals (merge requests) | 1 |

feature/* and hotfix/* need no protection. Force-push is fine on those during MR iteration.

L6. Enable auto-delete of merged source branches (GitLab UI)

GitLab → positive-walkers/walkrpg → Settings → Merge requests → “Enable ‘Delete source branch’ option by default” → ON.

This keeps the branch list clean — merged feature/hotfix branches disappear automatically.

L7. Require squash merges (GitLab UI)

GitLab → positive-walkers/walkrpg → Settings → Merge requests → Squash commits when merging → “Require”. This forces squash merges, keeping the protected-branch log clean.

L8. Verify the migration

From CEO laptop:

Terminal window
# Direct push to main must FAIL.
git checkout main
git commit --allow-empty -m "test: should fail"
git push origin main
# Expected: `! [remote rejected] main -> main (pre-receive hook declined)`

If the push succeeds: branch protection is misconfigured. Re-check L5.

Reset the noise:

Terminal window
git reset --hard HEAD~1

L9. From this point forward

All new work flows through feature/* branches and MRs per ADR-0010 §5. The next deploy of WalkRPG goes through:

  1. git checkout dev && git pull && git checkout -b feature/<name>
  2. Work + push.
  3. MR feature/<name> → dev → orchestrator review → CEO merge → auto-deploy staging.
  4. MR dev → main → orchestrator review → CEO merge → auto-deploy prod.
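Because direct pushes to main and dev are rejected, it helps to catch work on the wrong branch before committing. This is a sketch of such a guard; `is_work_branch` is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds for the working-branch prefixes this runbook
# uses (feature/<name>, hotfix/<name>) and rejects everything else.
is_work_branch() {
  case "$1" in
    feature/?*|hotfix/?*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `is_work_branch "$(git rev-parse --abbrev-ref HEAD)" || echo "create a feature/ or hotfix/ branch first"`.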

M — Hotfix workflow (reference)

Use when a bad commit reaches prod and needs a fix faster than the normal feature → dev → main release cadence. The hotfix flow bypasses staging (which would slow the fix) but still requires an MR + orchestrator review pass.

M1. Branch from main

Terminal window
git checkout main
git pull origin main
git checkout -b hotfix/<short-name>

M2. Fix + commit + push

Terminal window
# (edit + test locally)
git add <files>
git commit -m "fix(<scope>): <short description>"
git push -u origin hotfix/<short-name>

M3. Open MR hotfix/<name> → main

GitLab UI → Merge requests → New merge request → source hotfix/<short-name>, target main.

Title: same as the commit subject. Body: brief paragraph + what was broken + what was fixed + how the fix was tested locally.

M4. Orchestrator review

From CEO laptop:

Terminal window
claude /review <MR-URL>

Orchestrator dispatches relevant leads (tech-architect for backend infra, narrative-designer for content, etc.), reviews the diff inline, posts a summary verdict.

Expected verdict: APPROVE (hotfixes are small + scoped). If NEEDS_CHANGES, iterate on the hotfix branch (force-push OK) and re-run the review.

M5. CEO merges to main

Squash merge. main advances; auto-deploys to prod.

M6. Verify the fix in prod

Terminal window
# Smoke test against prod
curl -i -X POST https://api.walkrpg.<root>/auth/callback \
-H "Content-Type: application/json" \
-d '{"email":"<tester-email>","displayName":"Hotfix Verify"}'

Or whatever endpoint exercises the fix. Confirm the fix lands as expected.

M7. Sync dev to track prod

The dev branch must carry the hotfix forward; otherwise the next release MR (dev → main) would re-introduce the bug. Two patterns:

Pattern A — cherry-pick onto a sync branch (cleaner if dev has diverged from main):

Terminal window
git checkout dev
git pull origin dev
git checkout -b feature/sync-hotfix-<short-name>
git cherry-pick <merged-squash-sha-on-main>
git push -u origin feature/sync-hotfix-<short-name>

Open MR feature/sync-hotfix-<short-name> → dev. Orchestrator review (typically a quick APPROVE because the diff is identical to the already-reviewed hotfix). Merge.

Pattern B — MR the hotfix branch itself to dev (works if dev is close to main):

Terminal window
# The hotfix branch survives the merge only if "Delete source branch" was unticked on the MR (see L6).
# Open a SECOND MR: hotfix/<short-name> → dev.

Pattern A is more robust when dev has open features that haven’t shipped yet — the squash-merge SHA is the cleanest carrier of the fix.

M8. Clean up

GitLab deletes the hotfix branch automatically on merge if auto-delete is on (L6); otherwise delete it manually:

Terminal window
git push origin --delete hotfix/<short-name>
git branch -D hotfix/<short-name> # local; -D because a squash merge leaves the branch unmerged in git's view

Appendix — Sequenced overview

| Section | Time | Run by | Blocker if it fails |
| --- | --- | --- | --- |
| A — Hetzner provisioning | ~10 min | CEO | yes |
| B — SSH hardening | ~10 min | CEO | yes |
| C — Docker install | ~5 min | CEO | yes |
| D — Cloudflare DNS | ~10 min | CEO | yes |
| E — Repo deploy keys + secrets | ~10 min | CEO | yes |
| F — Docker compose stacks (prod + staging) | ~15 min | CEO (or paused for orchestrator) | yes — requires compose / Dockerfile / nginx files to exist |
| G — Let’s Encrypt cert (SAN over 6 hostnames) | ~5 min | CEO | yes |
| H — Smoke test (prod + staging) | ~5 min | CEO | yes — must pass to consider Phase 14 live |
| I — GitLab CI/CD (multi-env) | ~20 min | Orchestrator + CEO | no — Phase 14 is live even without CI; this enables push-to-deploy |
| J — Mobile reconfig | ~15 min | Orchestrator + CEO | no — separate concern from backend liveness |
| K — Day-2 ops | reference | CEO + ops | no — reference material |
| L — Branch migration master → main + protection rules (one-time) | ~10 min | CEO | once; gates ADR-0010 going live |
| M — Hotfix workflow (reference) | ~15 min when invoked | CEO + orchestrator | per-incident |

End of runbook.