Phase 14 — VPS provisioning runbook
Paired with ADR-0009 — VPS migration and ADR-0010 — branch strategy + envs. Sequential CLI commands; copy-paste in order. CEO runs sections A through H. Orchestrator pairs on section I (CI/CD) and J (mobile reconfig). Sections K (day-2 ops), L (branch migration), and M (hotfix workflow) are reference material added by ADR-0010.
Time estimate: ~90 minutes CEO time end-to-end if domain is already registered. Add ~30 minutes if registering domain first.
Required before starting:
- Hetzner Cloud account (or signup → KYC pass)
- Cloudflare account
- Domain registered or ready to register at Cloudflare Registrar
- Local ~/.ssh/id_ed25519 keypair (generate with ssh-keygen -t ed25519 -C "your@email" if missing)
- GitLab access to gitlab.com:positive-walkers/walkrpg
Placeholders used throughout:
- <VPS_IP>: Hetzner-assigned IPv4 (from §A step 4)
- <root>: CEO’s domain root (e.g., morris.example)
- <deploy-pubkey>: output of cat ~/.ssh/id_ed25519.pub on CEO laptop
A — Hetzner provisioning (~10 min)
A1. Sign in to Hetzner Cloud Console
Open https://console.hetzner.cloud/. If no account: sign up, complete KYC (Hetzner requires ID verification for new accounts; can take up to 24h on weekends — plan accordingly).
A2. Create project (or reuse existing)
Hetzner organizes resources by project. Create a walkrpg-prod project if none exists. Open it.
A3. Add SSH key to project
Console → Security → SSH Keys → “Add SSH Key”. Paste contents of cat ~/.ssh/id_ed25519.pub from CEO laptop. Name it ceo-laptop-ed25519.
A4. Create server
Console → Servers → “Add Server”:
| Field | Value |
|---|---|
| Location | Nuremberg (NBG1) or Falkenstein (FSN1); pick whichever has lower latency from CEO location (both DE/EU, GDPR-compliant) |
| Image | Ubuntu 24.04 |
| Type | CX22 (shared vCPU x2, x86, 4GB RAM, 40GB NVMe, 20TB egress), €5.18/mo + VAT |
| Networking | IPv4 + IPv6 (default; keep both) |
| SSH keys | Check the ceo-laptop-ed25519 key from A3 |
| Volumes | None |
| Firewalls | None (we use ufw on the host instead — Hetzner-side firewall optional, skip for now) |
| Backups | OFF (we use pg_dump + 7-day local rotation per ADR-0009 §9; Hetzner backup feature is +20% surcharge, defer to ops follow-up) |
| Placement groups | None |
| Labels | phase=14, env=prod |
| Cloud config | Leave blank |
| Name | walkrpg-api-1 |
Click “Create & Buy now”.
A5. Note the IPv4
Once provisioning completes (~30s), copy the assigned IPv4 from the server detail page. Record as <VPS_IP>.
A6. First connectivity test
From CEO laptop:

    ssh root@<VPS_IP>

Should connect without password prompt. Type exit to disconnect.
If it prompts for password or fails: check SSH key was attached at creation (Hetzner does not let you add SSH keys to an already-created server without console-level recovery). Easiest fix: destroy + recreate.
B — SSH hardening + base user (~10 min)
B1. SSH back in as root

    ssh root@<VPS_IP>

B2. Create deploy user

    adduser --disabled-password --gecos "" deploy
    usermod -aG sudo deploy

B3. Allow deploy passwordless sudo (provisioning only; tighten later if desired)

    echo "deploy ALL=(ALL) NOPASSWD:ALL" > /etc/sudoers.d/deploy
    chmod 440 /etc/sudoers.d/deploy

B4. Copy authorized_keys to deploy

    mkdir -p /home/deploy/.ssh
    cp /root/.ssh/authorized_keys /home/deploy/.ssh/authorized_keys
    chown -R deploy:deploy /home/deploy/.ssh
    chmod 700 /home/deploy/.ssh
    chmod 600 /home/deploy/.ssh/authorized_keys

B5. Test deploy login (from CEO laptop, separate terminal)

    ssh deploy@<VPS_IP>

Should connect. Do NOT close the root session yet; keep it open in case the next step breaks SSH.
B6. Harden sshd config (still as root)

    sed -i 's/^#*PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
    sed -i 's/^#*PermitRootLogin.*/PermitRootLogin no/' /etc/ssh/sshd_config
    sed -i 's/^#*PubkeyAuthentication.*/PubkeyAuthentication yes/' /etc/ssh/sshd_config
    sed -i 's/^#*ChallengeResponseAuthentication.*/ChallengeResponseAuthentication no/' /etc/ssh/sshd_config
    sed -i 's/^#*X11Forwarding.*/X11Forwarding no/' /etc/ssh/sshd_config

Append client-alive directives:

    cat >> /etc/ssh/sshd_config <<EOF
    # WalkRPG Phase 14 hardening
    ClientAliveInterval 300
    ClientAliveCountMax 2
    EOF

B7. Restart sshd

    systemctl restart ssh

On Ubuntu the service unit is named ssh; the sshd alias may not exist on 24.04, in which case systemctl restart sshd fails with a unit-not-found error.
B8. Verify deploy still works (from CEO laptop, NEW terminal; keep the root session open)

    ssh deploy@<VPS_IP>

If this works: root login is now disabled, password auth is disabled, key-only auth works. Close the root session.
If this fails: do NOT close the root session. Diagnose and fix from there.
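Before closing the root session it can help to confirm that the directives from B6 actually landed. The helper below is a minimal sketch, not part of the ratified runbook: check_sshd_hardening is a hypothetical name, and it only inspects the single file it is given, so drop-in files under /etc/ssh/sshd_config.d/ are not considered. For the effective configuration, sshd -T (run as root) remains the authoritative source.

```shell
# Hypothetical helper: report which hardening directives from B6 are missing
# or set differently in an sshd_config-style file.
# Commented-out lines are ignored; the first uncommented occurrence wins,
# which matches how sshd treats repeated keywords.
check_sshd_hardening() (
  config_file="$1"
  status=0
  while read -r key want; do
    # sshd keywords are case-insensitive, so compare in lower case.
    have=$(awk -v k="$key" 'tolower($1) == tolower(k) { print $2; exit }' "$config_file")
    if [ "$have" != "$want" ]; then
      echo "MISMATCH $key: want '$want', have '${have:-unset}'"
      status=1
    fi
  done <<'EOF'
PasswordAuthentication no
PermitRootLogin no
PubkeyAuthentication yes
X11Forwarding no
EOF
  exit "$status"
)

# Example (as root on the VPS):
#   check_sshd_hardening /etc/ssh/sshd_config && echo "hardening directives present"
```

The function prints one line per mismatch and returns non-zero if any directive is off, so it can gate the "close the root session" step.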
B9. Install ufw + fail2ban (as deploy via sudo)

    sudo apt update
    sudo apt install -y ufw fail2ban

B10. Configure ufw

    sudo ufw default deny incoming
    sudo ufw default allow outgoing
    sudo ufw allow 22/tcp
    sudo ufw allow 80/tcp
    sudo ufw allow 443/tcp
    sudo ufw --force enable
    sudo ufw status verbose

Expected output: Status: active, three allow rules visible.
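The "three allow rules" check can also be scripted. The sketch below reads ufw status output on stdin and lists any of the three ports from B10 that lack an ALLOW rule. missing_ufw_rules is a hypothetical helper name, and the match assumes the usual column layout in which each rule line starts with PORT/tcp followed by ALLOW.

```shell
# Hypothetical helper: read `ufw status` (or `ufw status verbose`) output on
# stdin and print the ports from B10 that have no ALLOW rule.
# Returns 0 when all three ports are allowed, 1 otherwise.
missing_ufw_rules() (
  status_text=$(cat)
  missing=""
  for port in 22 80 443; do
    # Rule lines look like "22/tcp   ALLOW IN   Anywhere".
    if ! printf '%s\n' "$status_text" | grep -Eq "^${port}/tcp[[:space:]]+ALLOW"; then
      missing="$missing $port"
    fi
  done
  if [ -n "$missing" ]; then
    echo "${missing# }"
    exit 1
  fi
)

# Example (on the VPS):
#   sudo ufw status verbose | missing_ufw_rules && echo "22, 80, 443 allowed"
```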
B11. Confirm fail2ban is running

    sudo systemctl status fail2ban

Should show active (running). Default sshd jail is enabled out of the box on Ubuntu 24.04.
C — Docker + Docker Compose install (~5 min)
C1. Install Docker via the official convenience script (as deploy)

    curl -fsSL https://get.docker.com -o /tmp/get-docker.sh
    sudo sh /tmp/get-docker.sh
    rm /tmp/get-docker.sh

C2. Add deploy to the docker group

    sudo usermod -aG docker deploy

C3. Apply group change to current shell (or re-login)

    exit
    ssh deploy@<VPS_IP>

C4. Verify Docker + Docker Compose

    docker --version
    docker compose version

Both should print versions. Docker Compose v2 ships as a Docker CLI plugin, so docker compose (no hyphen) is the canonical invocation.
C5. Smoke test

    docker run --rm hello-world

Should print “Hello from Docker!”. If not: investigate before proceeding.
D — Cloudflare DNS + Registrar setup (CEO does on Cloudflare side, ~10 min)
This section is web UI, not CLI. Follow each step in the Cloudflare dashboard.
The VPS hosts the WalkRPG backend (prod + staging per ADR-0010), the personal portfolio site (per ADR-0009 §3.1 portfolio coexistence amendment), AND the WalkRPG wiki (per ADR-0009 §3.5 wiki coexistence amendment). Six hostnames share the single Hetzner CX22 + nginx + Let’s Encrypt SAN cert:
| Hostname | Routes to | Purpose |
|---|---|---|
| morrisassert.dev | web:3000 | Next.js portfolio (apex) |
| www.morrisassert.dev | web:3000 | Next.js portfolio (www alias) |
| api.walkrpg.morrisassert.dev | api:3000 | NestJS backend, prod env (main branch) |
| api-staging.walkrpg.morrisassert.dev | api-staging:3000 | NestJS backend, staging env (dev branch), ADR-0010 §4 |
| walkrpg.morrisassert.dev | nginx 503 | Reserved for future public WalkRPG frontend |
| wiki.morrisassert.dev | static (volume) | Astro Starlight wiki, gated by Cloudflare Access (§D2) |
D1. Confirm domain at Cloudflare Registrar
If the domain is not yet registered or not yet at Cloudflare Registrar:
- Cloudflare Dashboard → Registrar → Register a domain (or transfer-in an existing one).
- Choose a .com or similar; Cloudflare Registrar is at-cost (~$10-12/year for .com).
For the canonical configuration the registrar’s domain is morrisassert.dev (.dev is a Google TLD, ~$13/year, mandatory-HTTPS via the Chrome HSTS preload list — strictly upgrades security posture).
If the domain is already in your Cloudflare account, skip ahead.
D2. Open the zone
Cloudflare Dashboard → choose the domain (morrisassert.dev).
D3. Add the six A records
DNS tab → Add record. Repeat for each row below:
| Type | Name | IPv4 address | Proxy status | TTL |
|---|---|---|---|---|
| A | @ (or morrisassert.dev) | <VPS_IP> | Proxied (orange cloud) | Auto |
| A | www | <VPS_IP> | Proxied (orange cloud) | Auto |
| A | api.walkrpg | <VPS_IP> | Proxied (orange cloud) | Auto |
| A | api-staging.walkrpg | <VPS_IP> | Proxied (orange cloud) | Auto |
| A | walkrpg | <VPS_IP> | Proxied (orange cloud) | Auto |
| A | wiki | <VPS_IP> | Proxied (orange cloud) | Auto |
Save each record. All six point at the same VPS IPv4; nginx differentiates by Host header / server_name. The Cloudflare orange cloud is mandatory for the wiki record because Cloudflare Access (configured in §D2 below) operates at edge — bypassing the proxy bypasses the auth gate.
D4. Set SSL/TLS mode to Full (strict)
SSL/TLS tab → Overview → Encryption mode → Select Full (strict).
This is critical. Not Flexible, not Full — Full (strict). ADR-0009 §5 explains why.
D5. Always Use HTTPS
SSL/TLS tab → Edge Certificates → Always Use HTTPS → ON.
D6. HSTS (optional, recommended)
SSL/TLS tab → Edge Certificates → HTTP Strict Transport Security (HSTS) → Enable. Max age: 6 months. Include subdomains: OFF (only the six hostnames from D3 are on Cloudflare; not safe to include all subdomains). Preload: OFF.
D7. Universal SSL certificate
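SSL/TLS tab → Edge Certificates → Universal SSL → ensure it’s enabled (default). This handles the browser-facing cert. Cloudflare provisions it automatically within ~15 minutes. Note that Universal SSL covers only the apex and first-level subdomains (www, wiki, walkrpg); the two-level names api.walkrpg and api-staging.walkrpg fall outside that coverage and need an additional edge certificate (for example via Advanced Certificate Manager), so confirm edge coverage for them before G5.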
D8. Verify DNS propagation
From CEO laptop (or any external machine):

    dig morrisassert.dev +short
    dig www.morrisassert.dev +short
    dig api.walkrpg.morrisassert.dev +short
    dig api-staging.walkrpg.morrisassert.dev +short
    dig walkrpg.morrisassert.dev +short
    dig wiki.morrisassert.dev +short

All six should return Cloudflare IPs (104.x.x.x or 172.x.x.x range), not your <VPS_IP>. That confirms the proxy is active for every hostname. Allow 1-2 minutes after creating the A records.
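The "not your VPS IP" comparison in D8 can be automated per hostname. This is a minimal sketch: resolves_via_proxy is a hypothetical helper, and it only checks that none of the returned addresses equals the origin IP. It does not verify that the addresses belong to Cloudflare's published ranges.

```shell
# Hypothetical helper: read resolved addresses (one per line, as printed by
# `dig +short`) on stdin and fail if any of them is the origin IP, which
# would mean the record is DNS-only (grey cloud) rather than proxied.
resolves_via_proxy() (
  origin_ip="$1"
  seen=0
  while read -r addr; do
    [ -n "$addr" ] || continue
    seen=$((seen + 1))
    if [ "$addr" = "$origin_ip" ]; then
      echo "origin exposed: $addr"
      exit 1
    fi
  done
  if [ "$seen" -eq 0 ]; then
    echo "no A record returned"
    exit 1
  fi
)

# Example (VPS_IP is assumed to hold the address recorded in A5):
#   for h in morrisassert.dev www.morrisassert.dev wiki.morrisassert.dev; do
#     dig +short "$h" | resolves_via_proxy "$VPS_IP" || echo "check $h"
#   done
```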
D2 — Cloudflare Access setup for wiki + Swagger (~10 min, web UI)
The wiki hostname (wiki.morrisassert.dev) is gated by Cloudflare Access (Zero Trust free tier, ≤50 users). The Swagger UI on the API hostname (path /api/docs/*) is gated by the same mechanism when re-enabled. Both Applications run on the free tier and require no additional Cloudflare paid plan.
CF Access intercepts every request at edge, redirects unauthenticated users to its identity-broker flow, then forwards authenticated requests to the origin with a Cf-Access-Jwt-Assertion header. nginx trusts this header implicitly today; defense-in-depth JWT validation against the per-application JWKS endpoint is a follow-up (see ADR-0009 §3.5 wiki coexistence amendment for the follow-up tracking).
D2.1 Enable Zero Trust on the Cloudflare account
Cloudflare Dashboard → sidebar → Zero Trust.
If the team has never been onboarded: Cloudflare prompts you to pick a team subdomain (e.g., walkrpg.cloudflareaccess.com) and plan. Select the Free plan — covers up to 50 users, all the features Phase 14 needs, no card on file required (Cloudflare may still ask for a card for plan-tier compliance; the free tier is genuinely $0).
Pick a team name; the team subdomain becomes <team>.cloudflareaccess.com (this is the identity broker hostname).
D2.2 Create the wiki Access Application
Zero Trust → Access → Applications → Add an application → Self-hosted.
| Field | Value |
|---|---|
| Application name | walkrpg-wiki |
| Session duration | 24 hours |
| Application domain | Subdomain wiki, Domain morrisassert.dev (the form splits these into two selectors) |
| Application path | (leave blank — gates the whole host) |
| App launcher visibility | OFF (no need for the CF App Launcher UI for a single-tenant tool) |
Click Next.
D2.3 Choose identity providers
Identity providers screen → at minimum select One-time PIN (CF emails a 6-digit code to the address on the allowlist). Optional fast-follows:
- Google (free, requires OAuth client setup at console.cloud.google.com — defer to follow-up)
- GitHub (free, requires OAuth app at github.com/settings/developers — defer to follow-up)
One-time PIN alone is fine for the closed cohort.
Click Next.
D2.4 Create the access policy
Policies screen → Add a policy.
| Field | Value |
|---|---|
| Policy name | morris-only |
| Action | Allow |
| Session duration | (inherit application — 24 hours) |
| Configure rules → Include → Selector | Emails |
| Configure rules → Include → Value | <CEO_EMAIL> (the email address used for cohort access) |
Save policy.
Future tester onboarding: edit this policy’s Emails selector to include additional addresses, comma-separated. No code changes required.
Click Next, then Add application.
D2.5 Create the Swagger Access Application (path-gated)
Repeat D2.2-D2.4 with these differences:
| Field | Value |
|---|---|
| Application name | walkrpg-swagger |
| Application domain | Subdomain api.walkrpg, Domain morrisassert.dev |
| Application path | /api/docs/* (path-gated — ONLY this prefix requires auth; the rest of the API stays open per the mock-auth JWT model from ADR-0006) |
| Identity providers | Same — One-time PIN minimum |
| Policy | Same morris-only shape, same Email allowlist |
This Application only activates when SWAGGER_ENABLED=true is set on the API (currently false in prod per ADR-0009 §13.1). Configuring the Access policy now means the gate is live the moment Swagger is re-enabled; no scramble at the time of re-enable.
D2.6 Verify the wiki gate
From CEO laptop (in a fresh browser session, NOT logged in to CF):

    https://wiki.morrisassert.dev/

Expected: CF Access splash page asks for the email. Enter the CEO email → CF emails the 6-digit PIN → enter the PIN → wiki loads.
If the CF Access splash does NOT appear and the wiki loads directly: the Application is either misconfigured or the orange cloud is off for the wiki A record. Re-check D3 + D2.2.
If the wiki returns 502 or 404 after auth: the wiki-builder container has not populated the wiki-static volume yet. Check docker compose ps on the VPS — walkrpg-wiki-builder should be running with the latest image SHA.
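The gate can also be probed without a browser. The sketch below classifies the headers of an unauthenticated request. It assumes Cloudflare Access answers such requests with a 302 whose Location points at the team's cloudflareaccess.com domain; access_gate_status is a hypothetical helper name.

```shell
# Hypothetical helper: read HTTP response headers (as printed by `curl -sI`)
# on stdin. Prints "gated" when the response is a 302 redirect to a
# cloudflareaccess.com login URL, "open" in every other case.
access_gate_status() (
  headers=$(tr -d '\r')
  code=$(printf '%s\n' "$headers" | awk 'NR == 1 { print $2 }')
  location=$(printf '%s\n' "$headers" | awk 'tolower($1) == "location:" { print $2; exit }')
  case "$code:$location" in
    302:*cloudflareaccess.com/*) echo "gated" ;;
    *) echo "open" ;;
  esac
)

# Example (from the CEO laptop, no Access cookie present):
#   curl -sI https://wiki.morrisassert.dev/ | access_gate_status
```

An "open" result for the wiki hostname points at the same two causes listed above: a misconfigured Application or a grey-cloud A record.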
D2.7 Adding a new tester later
CF dashboard → Zero Trust → Access → Applications → click walkrpg-wiki → Policies tab → click morris-only → Configure rules → Include → Emails → add the new address → Save.
No origin restart, no DNS change, no cert reissuance. Takes effect within ~30s.
E — Repo deploy keys + secrets (~10 min)
E1. Generate deploy key on VPS (as deploy)

    ssh-keygen -t ed25519 -C "walkrpg-vps-deploy" -f ~/.ssh/walkrpg_deploy -N ""
    cat ~/.ssh/walkrpg_deploy.pub

Copy the public key output.
E2. Add deploy key to GitLab project
In browser: GitLab → positive-walkers/walkrpg → Settings → Repository → Deploy Keys → Add new key.
| Field | Value |
|---|---|
| Title | walkrpg-vps deploy key |
| Key | (paste public key from E1) |
| Grant write permissions | OFF (read-only sufficient — CI pushes images, VPS only pulls source) |
Add key.
E3. Configure git on VPS to use the deploy key

    cat >> ~/.ssh/config <<EOF
    Host gitlab.com
      HostName gitlab.com
      User git
      IdentityFile ~/.ssh/walkrpg_deploy
      IdentitiesOnly yes
    EOF
    chmod 600 ~/.ssh/config

E4. Accept GitLab’s host key

    ssh -T git@gitlab.com

Type yes when prompted. Expected response: Welcome to GitLab, @deploy-key-name! (or similar). Connection then closes; that’s normal, since GitLab does not allow shell sessions.
E5. Clone the repo

    cd /home/deploy
    git clone git@gitlab.com:positive-walkers/walkrpg.git
    cd walkrpg
    git status

Should show On branch master, working tree clean.
E6. Create .env file
The .env lives at /home/deploy/walkrpg/backend/.env. It is NOT committed to git.
Generate strong secrets first:

    echo "JWT_SECRET=$(openssl rand -hex 32)"
    echo "POSTGRES_PASSWORD=$(openssl rand -base64 24 | tr -d '/+=')"

Copy both values. Now create the .env:

    mkdir -p /home/deploy/walkrpg/backend
    cat > /home/deploy/walkrpg/backend/.env <<EOF
    NODE_ENV=production
    PORT=3000

    # Postgres: service name 'db' resolves on docker network
    POSTGRES_USER=walkrpg
    POSTGRES_PASSWORD=<paste-from-above>
    POSTGRES_DB=walkrpg
    DATABASE_URL=postgresql://walkrpg:<paste-from-above>@db:5432/walkrpg?schema=public

    # JWT (ADR-0006 mock-auth posture)
    JWT_SECRET=<paste-from-above>
    JWT_ISSUER=walkrpg-api-prod
    JWT_AUDIENCE=walkrpg-mobile
    AUTH_MODE=mock

    # CORS: adjust if the mobile build needs additional origins
    CORS_ALLOWED_ORIGINS=https://api.walkrpg.<root>,https://walkrpg.<root>

    # Swagger gating (ADR-0009 §13.1): OFF in prod
    SWAGGER_ENABLED=false
    EOF

    chmod 600 /home/deploy/walkrpg/backend/.env

Edit the file with nano or vi to paste the actual values into the <paste-from-above> slots. The Postgres password goes in two places (POSTGRES_PASSWORD and inside DATABASE_URL) and both copies must match; the JWT secret goes into JWT_SECRET.
E7. Verify .env mode

    ls -l /home/deploy/walkrpg/backend/.env

Expected: -rw------- 1 deploy deploy .... If group/world readable: chmod 600 /home/deploy/walkrpg/backend/.env.
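A mismatched password between POSTGRES_PASSWORD and DATABASE_URL is the failure mode that F3 and H4 keep pointing back to, so a quick consistency check right after E6 may save a debugging round. This is a sketch under assumptions: check_env_file is a hypothetical helper, and the parsing relies on the exact KEY=value layout written in E6.

```shell
# Hypothetical helper: sanity-check the .env written in E6.
# Fails if a <paste-from-above> placeholder is still present, or if the
# password embedded in DATABASE_URL differs from POSTGRES_PASSWORD.
check_env_file() (
  env_file="$1"
  if grep -q '<paste-from-above>' "$env_file"; then
    echo "placeholder still present"
    exit 1
  fi
  password=$(sed -n 's/^POSTGRES_PASSWORD=//p' "$env_file")
  # Password segment of the URL: between "user:" and "@host".
  url_password=$(sed -n 's|^DATABASE_URL=postgresql://[^:]*:\([^@]*\)@.*|\1|p' "$env_file")
  if [ -z "$password" ] || [ "$password" != "$url_password" ]; then
    echo "POSTGRES_PASSWORD and DATABASE_URL disagree"
    exit 1
  fi
  echo "ok"
)

# Example (on the VPS):
#   check_env_file /home/deploy/walkrpg/backend/.env
```

The helper never prints the secret values themselves, so its output is safe to paste into a chat or ticket.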
F — Docker compose stack (~10 min)
Prerequisite: the implementation files referenced below must exist in the repo. They ship as a separate paired implementation session AFTER the CEO confirms the domain is registered and ready (see “Implementation handoff” in ADR-0009 §17). The runbook references the canonical filenames so the paired session has zero ambiguity:
| File | Purpose |
|---|---|
| backend/Dockerfile | node:22-alpine base, pnpm install, copy data/ workspace, CMD tsx src/main.ts (ADR-0009 §13.2 explains why tsx in prod) |
| backend/docker-compose.prod.yml | api + db + nginx + certbot + web + wiki-builder services per ADR-0009 §3.1 |
| backend/docker-compose.staging.yml | api-staging + db-staging services per ADR-0010 §4 (added at ADR-0010 ratification) |
| backend/nginx/walkrpg.conf | reverse proxy 443→api:3000, 443→api-staging:3000 (ADR-0010), LE cert paths, HSTS, HTTP→HTTPS redirect |
| backend/scripts/backup-postgres.sh | daily pg_dump script invoked by cron per ADR-0009 §9 |
| backend/.env.prod.example | env template for the prod stack |
| backend/.env.staging.example | env template for the staging stack (added at ADR-0010 ratification) |
If those files do not yet exist when you reach this section, STOP and surface to the orchestrator. The orchestrator runs the paired implementation session, commits the files to main, and resumes the runbook from F1 below.
Bring-up order matters. docker-compose.prod.yml references the staging stack’s network as external: true. Bring staging UP FIRST (it owns the network’s lifecycle), THEN prod. Tear down in reverse order. The two .env files (.env.prod, .env.staging) must both be populated on the VPS at /home/deploy/walkrpg/backend/ with mode 600 before bring-up.
F1. Pull the latest source on the VPS

    cd /home/deploy/walkrpg
    git pull origin master

F2. First-time bootstrap: bring up the staging network owner, then prod db
The staging compose file owns the walkrpg-net-staging docker network (the prod compose references it as external: true). Bring up the staging db FIRST so the network exists when prod starts:

    cd /home/deploy/walkrpg
    docker compose -f backend/docker-compose.staging.yml up -d db-staging
    docker compose -f backend/docker-compose.prod.yml up -d db

Wait for both DBs to be healthy:

    docker compose -f backend/docker-compose.prod.yml ps
    docker compose -f backend/docker-compose.staging.yml ps

Status should be running (healthy) for both db (prod) and db-staging. If starting, wait 10s and re-check.
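The "wait 10s and re-check" loop can be scripted with a small retry wrapper. This is a sketch rather than the runbook's prescribed method: retry_until and container_healthy are hypothetical helpers, the container name in the usage comment is an assumption, and container_healthy only works if the compose service defines a healthcheck.

```shell
# retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted, sleeping
# DELAY_SECONDS between attempts. Returns 0 on success, 1 on exhaustion.
retry_until() (
  attempts="$1"
  delay="$2"
  shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then
      exit 0
    fi
    if [ "$n" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    n=$((n + 1))
  done
  exit 1
)

# Hypothetical probe: succeeds once the named container reports "healthy".
container_healthy() {
  [ "$(docker inspect --format '{{.State.Health.Status}}' "$1" 2>/dev/null)" = "healthy" ]
}

# Example (container name is an assumption; check `docker compose ps` for the real one):
#   retry_until 12 10 container_healthy walkrpg-db-1 || echo "db never became healthy"
```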
F3. Run Prisma migrations against both DBs

    docker compose -f backend/docker-compose.prod.yml run --rm api prisma migrate deploy
    docker compose -f backend/docker-compose.staging.yml run --rm api-staging prisma migrate deploy

Expected output for each: All migrations have been successfully applied. or No pending migrations to apply. Exit code 0.
If exit code non-zero: check docker compose logs db (or db-staging) for connection issues, verify the matching .env file’s DATABASE_URL matches POSTGRES_PASSWORD exactly, retry.
F4. Start the rest of both stacks — temporarily without TLS
Before the LE cert exists, nginx cannot start in TLS mode. The bootstrap profile in docker-compose.prod.yml (the nginx-bootstrap service, profile bootstrap) serves plaintext :80 for ACME validation and smoke tests across all six hostnames.

    # Bring up the staging stack first (full: api-staging + db-staging),
    # then the prod stack with the bootstrap nginx profile + api + web +
    # wiki-builder.
    docker compose -f backend/docker-compose.staging.yml up -d
    docker compose -f backend/docker-compose.prod.yml --profile bootstrap up -d \
      nginx-bootstrap api web wiki-builder

nginx-bootstrap joins both walkrpg-net and walkrpg-net-staging so the ACME HTTP-01 challenge resolves for every hostname including api-staging.walkrpg.morrisassert.dev.
F5. Verify api is reachable over plain HTTP
From the VPS:

    curl -i http://localhost:3000/

Expected: NestJS root response (likely 404 with JSON error envelope; that’s fine, it means the app is running).
From CEO laptop:

    curl -i http://api.walkrpg.<root>/

Expected: response makes it through Cloudflare and back. If Cloudflare redirects HTTP to HTTPS (Always Use HTTPS, per D5), you will see a redirect instead; that’s expected and means the next step (LE cert) is needed.
G — Let’s Encrypt cert (~5 min)
The cert is a multi-domain SAN cert covering all six hostnames on this VPS (portfolio apex + www + prod api subdomain + staging api subdomain + reserved walkrpg subdomain + wiki subdomain). LE issues one fullchain that all six nginx server blocks reference. Renewal is shared. Standard LE limit is 100 SAN entries per cert — six is far below.
The cert lands under the first -d argument’s directory. With morrisassert.dev listed first, the live path is /etc/letsencrypt/live/morrisassert.dev/fullchain.pem (and privkey.pem). The walkrpg.conf nginx config matches this path.
G1. Run certbot against the LE staging server (verify the flow)

    docker compose -f backend/docker-compose.prod.yml run --rm certbot \
      certonly --webroot --webroot-path=/var/www/certbot \
      --staging \
      --email <ceo-email> \
      --agree-tos \
      --no-eff-email \
      -d morrisassert.dev \
      -d www.morrisassert.dev \
      -d api.walkrpg.morrisassert.dev \
      -d api-staging.walkrpg.morrisassert.dev \
      -d walkrpg.morrisassert.dev \
      -d wiki.morrisassert.dev

Expected: Successfully received certificate. (The staging LE-server cert is untrusted but proves the SAN flow works for all six hostnames; note “staging” here means the LE staging ACME directory, not the WalkRPG staging env.)
If failure: most common cause is the HTTP-01 challenge not reaching the webroot for one of the hostnames. Check:
- Cloudflare proxy is on (orange cloud) for every A record.
- nginx is serving /.well-known/acme-challenge/ from the webroot volume for every server_name.
- ufw allows port 80.
If only one hostname fails: certbot lists exactly which -d flag could not be challenged. Re-check that A record’s proxy + DNS propagation.
G2. Delete the staging cert before requesting production

    docker compose -f backend/docker-compose.prod.yml run --rm certbot \
      delete --cert-name morrisassert.dev

G3. Run certbot against the production LE server

    docker compose -f backend/docker-compose.prod.yml run --rm certbot \
      certonly --webroot --webroot-path=/var/www/certbot \
      --email <ceo-email> \
      --agree-tos \
      --no-eff-email \
      -d morrisassert.dev \
      -d www.morrisassert.dev \
      -d api.walkrpg.morrisassert.dev \
      -d api-staging.walkrpg.morrisassert.dev \
      -d walkrpg.morrisassert.dev \
      -d wiki.morrisassert.dev

Expected: Successfully received certificate. Cert lands at /etc/letsencrypt/live/morrisassert.dev/fullchain.pem + privkey.pem. The fullchain SAN list contains all six hostnames; openssl x509 -in /etc/letsencrypt/live/morrisassert.dev/fullchain.pem -noout -text | grep DNS confirms.
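Eyeballing six names in the grep DNS output is easy to get wrong. As a sketch, the hypothetical helper missing_sans below reads the openssl text dump on stdin and prints every expected hostname that is absent from the SAN list. It assumes the usual "DNS:name, DNS:name" layout of the Subject Alternative Name line.

```shell
# Hypothetical helper: read `openssl x509 -noout -text` output on stdin and
# print each expected hostname (given as arguments) that does not appear as
# a DNS entry in the certificate. Returns 1 if anything is missing.
missing_sans() (
  sans=$(grep -o 'DNS:[^,[:space:]]*' | sed 's/^DNS://')
  missing=0
  for host in "$@"; do
    # -F: literal match, -x: whole line, so "morrisassert.dev" does not
    # accidentally match "www.morrisassert.dev".
    if ! printf '%s\n' "$sans" | grep -Fxq "$host"; then
      echo "$host"
      missing=1
    fi
  done
  exit "$missing"
)

# Example (wherever the letsencrypt volume is readable):
#   openssl x509 -in /etc/letsencrypt/live/morrisassert.dev/fullchain.pem -noout -text |
#     missing_sans morrisassert.dev www.morrisassert.dev api.walkrpg.morrisassert.dev \
#       api-staging.walkrpg.morrisassert.dev walkrpg.morrisassert.dev wiki.morrisassert.dev
```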
G4. Switch nginx to full TLS mode
Restore the nginx config to its TLS-enabled form (depends on the bootstrap pattern used in F4):

    docker compose -f backend/docker-compose.prod.yml down nginx
    docker compose -f backend/docker-compose.prod.yml up -d nginx

Or, if using the profile pattern:

    docker compose -f backend/docker-compose.prod.yml up -d nginx

(without --profile bootstrap).
G5. Verify TLS
From CEO laptop:

    curl -I https://api.walkrpg.<root>/

Expected: HTTP/2 404 (or 200 if a root route exists). TLS handshake succeeded. Check headers include strict-transport-security from nginx.
In a browser, open https://api.walkrpg.<root>/. Lock icon should be present. Click the lock → cert details → issued by Let’s Encrypt (or Cloudflare Inc ECC CA-3 — that’s the edge cert; both being valid is the goal). No “Not Secure” warning.
H — Smoke test (~5 min)
H1. End-to-end auth/callback test
From CEO laptop:

    curl -i -X POST https://api.walkrpg.<root>/auth/callback \
      -H "Content-Type: application/json" \
      -H "X-Request-Id: $(uuidgen)" \

Expected: HTTP/2 200 or HTTP/2 201 with response body containing session.accessToken, walker.id, isFirstLogin: true. Per ADR-0006 mock-auth shape.
H2. Use the returned token to fetch profile

    TOKEN="<paste accessToken from H1 response>"
    curl -i -X GET https://api.walkrpg.<root>/walker/profile \
      -H "Authorization: Bearer ${TOKEN}" \
      -H "X-Request-Id: $(uuidgen)"

Expected: HTTP/2 200 with walker profile body.
H3. Inspect logs
On VPS:

    docker compose -f /home/deploy/walkrpg/backend/docker-compose.prod.yml logs --tail 50 api

Should show structured JSON log lines including the X-Request-Id values from H1 and H2.
H4. Common failure modes
| Symptom | Likely cause | Fix |
|---|---|---|
| curl: (6) Could not resolve host | DNS not propagated yet | Wait 1-2 minutes, retry. dig api.walkrpg.<root> should return Cloudflare IPs. |
| HTTP/2 502 or HTTP/2 504 | api container not running or unreachable from nginx | docker compose ps, docker compose logs api. Likely tsx boot failure; check .env syntax. |
| HTTP/2 503 after a few minutes | api container restarting in a loop | Same diagnosis path. Check DATABASE_URL matches POSTGRES_PASSWORD. |
| Browser shows ERR_SSL_VERSION_OR_CIPHER_MISMATCH | Cloudflare SSL mode wrong (must be Full strict, not Flexible) | Cloudflare dashboard → SSL/TLS → Overview → Full (strict). |
| Origin SSL certificate is not trusted (CF error 526) | LE cert not present on origin or cert mismatch | Re-run G3, verify /etc/letsencrypt/live/morrisassert.dev/fullchain.pem exists (the cert directory is named after the first -d argument, see §G), docker compose restart nginx. |
| Authentication errors after successful 200 on /auth/callback | JWT_SECRET mismatch between mint and verify, or env not loaded | Verify api container sees env via `docker compose exec api printenv` |
If H1 and H2 both succeed against the prod hostname: Phase 14 prod is functionally live. Proceed to H5 for the staging smoke test, then section I (CI/CD).
H5. Repeat H1+H2 against staging

    curl -i -X POST https://api-staging.walkrpg.morrisassert.dev/auth/callback \
      -H "Content-Type: application/json" \
      -H "X-Request-Id: $(uuidgen)" \

Expected: same shape as H1: 200/201 with session.accessToken. The staging endpoint is a fully independent stack. Its JWT_SECRET differs from prod, so the token returned here cannot verify against prod and vice versa.

    STAGING_TOKEN="<paste accessToken from above>"
    curl -i https://api-staging.walkrpg.morrisassert.dev/walker/profile \
      -H "Authorization: Bearer ${STAGING_TOKEN}"

Expected: 200 with walker profile body.
If H5 fails but H1+H2 pass: the staging stack is independently broken — check docker compose -f backend/docker-compose.staging.yml ps + logs, verify .env.staging is populated, verify the walkrpg-net-staging network exists (docker network ls | grep staging), and that the prod nginx joined it (docker network inspect walkrpg-net-staging shows the walkrpg-nginx-1 container).
If H1, H2, and H5 all succeed: both envs are live. Proceed to section I (CI/CD).
I — GitLab CI/CD pipeline (~20 min, orchestrator-paired)
This section is paired with the orchestrator + backend-engineer. The runbook lists the moving parts; the orchestrator authors .gitlab-ci.yml in a separate paired session if not already present.
I1. Confirm .gitlab-ci.yml exists at repo root
If not present, surface to orchestrator. The orchestrator authors it per ADR-0009 §7 + ADR-0010 §9.
The pipeline shape is multi-env (per ADR-0010 §9):
- lint + test run on every push (any branch) and every MR.
- build:image runs on push to dev OR main only. Pushes one image tagged with :<short-sha> AND :<branch-name>.
- deploy-staging runs on push to dev only. SSH to VPS, pull image, docker compose -f backend/docker-compose.staging.yml .... Serialized via resource_group: staging.
- deploy-prod runs on push to main only. SSH to VPS, pull image, docker compose -f backend/docker-compose.prod.yml .... Serialized via resource_group: production.
The two deploy jobs CAN run in parallel (different resource groups, different envs), but each one serializes against itself so concurrent merges to the same branch don’t race on docker-compose state.
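The branch rules above amount to a small decision table, sketched here as shell so the expected job set for a push can be checked at a glance. This is an illustration only, not the contents of .gitlab-ci.yml: jobs_for_branch and image_tags are hypothetical helpers, and the 8-character short SHA is an assumption that mirrors GitLab's CI_COMMIT_SHORT_SHA.

```shell
# Hypothetical helper: print the jobs that would run for a push to BRANCH,
# following the multi-env rules listed above.
jobs_for_branch() {
  case "$1" in
    dev)  echo "lint test build:image deploy-staging" ;;
    main) echo "lint test build:image deploy-prod" ;;
    *)    echo "lint test" ;;
  esac
}

# Hypothetical helper: print the two tags build:image would push for a
# given branch and full commit SHA (short SHA first, then branch name).
image_tags() {
  branch="$1"
  sha="$2"
  short_sha=$(printf '%s' "$sha" | cut -c1-8)
  echo "$short_sha $branch"
}

# Example:
#   jobs_for_branch feature/ci-bootstrap
#   image_tags main "$(git rev-parse HEAD)"
```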
I2. Add GitLab CI Variables
In browser: GitLab → positive-walkers/walkrpg → Settings → CI/CD → Variables → Add variable:
| Key | Type | Value | Flags |
|---|---|---|---|
| SSH_DEPLOY_KEY | File | (paste contents of VPS ~/.ssh/walkrpg_deploy PRIVATE key, from VPS cat /home/deploy/.ssh/walkrpg_deploy) | Protected: ON. Masked: leave OFF (GitLab only masks single-line values, so a multi-line private key cannot be masked; with the File type the job sees a path to a temp file rather than the key text) |
| SSH_DEPLOY_HOST | Variable | <VPS_IP> or api.walkrpg.<root> (IP is more reliable for SSH; CF proxy doesn’t proxy SSH); recommend <VPS_IP> | Protected: ON |
| SSH_KNOWN_HOSTS | Variable | Output of ssh-keyscan <VPS_IP> from CEO laptop (paste all lines) | Protected: ON |
The GitLab Container Registry credentials ($CI_REGISTRY_USER, $CI_JOB_TOKEN) are automatically injected by GitLab for same-project pushes — no manual variable needed.
I3. Push a no-op commit to main to trigger the prod pipeline
From CEO laptop (or any clone). NOTE: post-ADR-0010 the default branch is main, not master. The migration steps live in §L below.

    git checkout main
    git pull origin main
    git commit --allow-empty -m "ci: trigger Phase 14 first deploy"
    # Direct push to main is denied by branch protection; go through an MR.
    # For the very first pipeline kick, use a throwaway feature branch:
    git checkout -b feature/ci-bootstrap
    git push -u origin feature/ci-bootstrap
    # Open MR feature/ci-bootstrap -> dev in GitLab UI, merge it. Then open
    # MR dev -> main and merge that to trigger deploy-prod.

For ongoing day-2 deploys, follow the workflow in ADR-0010 §5.
I4. Watch the pipeline
GitLab → positive-walkers/walkrpg → CI/CD → Pipelines. The new pipeline should show lint → test → build → deploy-staging (for the dev push) or lint → test → build → deploy-prod (for the main push). Each stage runs sequentially within its branch.
Expected total runtime: 4-8 minutes first run (Docker layer cache cold).
I5. Verify deploy succeeded
After the deploy stage finishes green:

    # From CEO laptop: repeat the H1 smoke test
    curl -i -X POST https://api.walkrpg.<root>/auth/callback \
      -H "Content-Type: application/json" \

Expected: 200/201 with session.
On VPS, check the image SHA:

    docker compose -f /home/deploy/walkrpg/backend/docker-compose.prod.yml images api

Should show registry.gitlab.com/positive-walkers/walkrpg/backend:<short-sha> matching the latest main commit (deploy-prod only runs on main).
I6. Pipeline troubleshooting
| Stage fails | Look at | Common fix |
|---|---|---|
| lint | Job log | Local pnpm lint not run; fix locally and re-push |
| test | Job log | Test failures from unrelated changes; investigate |
| build-image | Job log | Dockerfile path wrong, dependencies fail to install, or registry auth misconfigured |
| deploy | Job log + docker compose logs on VPS | SSH key mismatch (re-check I2 SSH_DEPLOY_KEY paste), wrong known_hosts, or migration failure |
J — Mobile reconfig (CEO + orchestrator pairing, post-J0)
Prerequisite (J0): Phase 14 backend is live + smoke tests pass.
J1. Update local.properties
In walkrpg-mobile/android/local.properties:

    base.url=https://api.walkrpg.<root>/

Replace whatever local-tunnel hostname was there during Phase 13.
J2. Update network security config
Edit walkrpg-mobile/android/app/src/main/res/xml/network_security_config.xml. Drop the IP-specific debug-overrides block that allowed cleartext to the CEO-laptop tunnel.
Minimal config:

    <?xml version="1.0" encoding="utf-8"?>
    <network-security-config>
      <base-config cleartextTrafficPermitted="false" />
    </network-security-config>

Or, if no overrides remain, delete the file entirely and remove the android:networkSecurityConfig attribute from AndroidManifest.xml.
J3. Rebuild APK

    cd walkrpg-mobile/android
    ./gradlew clean assembleDebug

J4. Install on test device

    adb install -r app/build/outputs/apk/debug/app-debug.apk

J5. Smoke test on device
Open app → mock-auth screen → enter email + displayName → submit. Expected: auth succeeds against the public endpoint, walker profile loads.
In the device logcat:

    adb logcat | grep -i walkrpg

Should show outgoing requests to https://api.walkrpg.<root>/.... No cleartext warnings.
J6. Commit the mobile changes
In walkrpg-mobile repo:
git add android/local.properties android/app/src/main/res/xml/network_security_config.xml
git commit -m "chore(mobile): point base.url at Phase 14 VPS endpoint"
git push origin master
local.properties is typically gitignored. If so, commit the change to a local.properties.example template instead and document the swap in the mobile README.
K — Day-2 operational reference
K1. Postgres backup cron
Install the cron job (one-time on VPS as deploy):
sudo mkdir -p /var/backups/walkrpg
sudo chown deploy:deploy /var/backups/walkrpg
sudo chmod 700 /var/backups/walkrpg
crontab -e
Add lines:
0 3 * * * docker compose -f /home/deploy/walkrpg/backend/docker-compose.prod.yml exec -T db pg_dump -U walkrpg walkrpg | gzip > /var/backups/walkrpg/walkrpg-$(date +\%Y\%m\%d).sql.gz
0 4 * * * find /var/backups/walkrpg/ -name 'walkrpg-*.sql.gz' -mtime +7 -delete
Verify with:
crontab -l
K2. Manual backup (ad-hoc)
docker compose -f /home/deploy/walkrpg/backend/docker-compose.prod.yml exec -T db \
  pg_dump -U walkrpg walkrpg | gzip > /var/backups/walkrpg/walkrpg-manual-$(date +%Y%m%d-%H%M).sql.gz
K3. Restore from backup
gunzip -c /var/backups/walkrpg/walkrpg-YYYYMMDD.sql.gz | \
  docker compose -f /home/deploy/walkrpg/backend/docker-compose.prod.yml exec -T db \
  psql -U walkrpg walkrpg
Schedule a restore drill within 7 days of Phase 14 launch. Documented as an ops follow-up.
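Before a restore or drill, it helps to confirm that the archive holds data. In a `pg_dump | gzip > file` pipeline the shell reports only gzip's exit status by default, so a failed dump can still leave a valid but empty archive. The sketch below is one possible pre-restore check; `verify_backup` is a hypothetical helper name.

```shell
# Hypothetical check: a backup passes only if the file is non-empty, gzip can
# read it end to end, and the decompressed stream contains some data.
verify_backup() {
  backup_file=$1
  [ -s "$backup_file" ] || { echo "missing or empty: $backup_file" >&2; return 1; }
  gzip -t "$backup_file" 2>/dev/null || { echo "corrupt gzip: $backup_file" >&2; return 1; }
  # A failed pg_dump can leave a valid gzip stream with zero bytes inside.
  bytes=$(gunzip -c "$backup_file" | wc -c | tr -d ' ')
  [ "$bytes" -gt 0 ] || { echo "empty dump: $backup_file" >&2; return 1; }
}

# Example usage (not run here):
#   verify_backup /var/backups/walkrpg/walkrpg-YYYYMMDD.sql.gz && echo "backup looks readable"
```

This confirms only that the archive is readable and non-empty. It does not replace the restore drill.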
K4. Cert renewal verification
certbot runs in the compose stack and polls every 12h. Force-check renewal:
docker compose -f /home/deploy/walkrpg/backend/docker-compose.prod.yml run --rm certbot renew
Cert expiry visible at:
docker compose -f /home/deploy/walkrpg/backend/docker-compose.prod.yml run --rm certbot certificates
Let’s Encrypt issues 90-day certs; renewal happens at 30 days remaining.
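The 30-day renewal window is simple date arithmetic. This sketch assumes timestamps in Unix seconds and uses two hypothetical helpers, `days_between` and `renewal_due`. The commented usage additionally assumes openssl, GNU date, and a placeholder certificate path.

```shell
# Hypothetical helper: whole days between two Unix timestamps (seconds).
days_between() {
  now_epoch=$1
  expiry_epoch=$2
  echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# Succeeds when fewer than 30 days remain, matching the renewal window
# described above.
renewal_due() {
  [ "$1" -lt 30 ]
}

# Example usage (not run here; needs openssl and GNU date):
#   end=$(openssl x509 -enddate -noout -in fullchain.pem | cut -d= -f2)
#   days=$(days_between "$(date +%s)" "$(date -d "$end" +%s)")
#   renewal_due "$days" && echo "renewal due"
```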
K5. Log inspection
# All services, last 100 lines, follow mode
docker compose -f /home/deploy/walkrpg/backend/docker-compose.prod.yml logs --tail 100 -f
# api only
docker compose -f /home/deploy/walkrpg/backend/docker-compose.prod.yml logs --tail 100 -f api
# nginx only
docker compose -f /home/deploy/walkrpg/backend/docker-compose.prod.yml logs --tail 100 -f nginx
Filter by X-Request-Id:
docker compose -f /home/deploy/walkrpg/backend/docker-compose.prod.yml logs api | grep "<request-id>"
K6. Rollback to previous deploy
cd /home/deploy/walkrpg
git log --oneline -5 origin/master   # find target SHA
git checkout <prev-sha>
docker compose -f backend/docker-compose.prod.yml pull api
docker compose -f backend/docker-compose.prod.yml run --rm api pnpm prisma migrate deploy
docker compose -f backend/docker-compose.prod.yml up -d api
Note: a rollback that crosses a Prisma migration is risky if the migration is destructive (column drops). Inspect the diff before rolling back across migrations.
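Part of that diff inspection can be automated with a scan for statements that drop data. This is a rough sketch: the pattern covers only `DROP TABLE`, `DROP COLUMN`, and `TRUNCATE`, the migrations path in the usage comment is an assumption about the repo layout, and `find_destructive_sql` is a hypothetical name.

```shell
# Hypothetical filter: read SQL (or a diff of SQL) on stdin and print numbered
# lines that drop data. Exit status 0 means at least one match was found.
find_destructive_sql() {
  grep -inE 'DROP[[:space:]]+(TABLE|COLUMN)|TRUNCATE'
}

# Example usage (not run here); the migrations path is an assumption:
#   git diff <prev-sha> origin/master -- backend/prisma/migrations | find_destructive_sql
```

An empty result is a hint, not proof, that the rollback is safe. Type changes and constraint changes can also lose data.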
Return to current after recovery:
git checkout master
docker compose -f backend/docker-compose.prod.yml pull api
docker compose -f backend/docker-compose.prod.yml up -d api
K7. Full stack stop / start
# Stop everything (api, db, nginx, certbot)
docker compose -f /home/deploy/walkrpg/backend/docker-compose.prod.yml stop
# Start everything
docker compose -f /home/deploy/walkrpg/backend/docker-compose.prod.yml start
# Down (stop + remove containers, keep volumes)
docker compose -f /home/deploy/walkrpg/backend/docker-compose.prod.yml down
# Down + remove volumes (DESTRUCTIVE: wipes db)
docker compose -f /home/deploy/walkrpg/backend/docker-compose.prod.yml down -v
K8. Update Docker images outside of CI
Manual pull (skips CI deploy stage — use only for hotfixes):
cd /home/deploy/walkrpg
git pull origin master
docker compose -f backend/docker-compose.prod.yml pull
docker compose -f backend/docker-compose.prod.yml run --rm api pnpm prisma migrate deploy
docker compose -f backend/docker-compose.prod.yml up -d
K9. VPS resource check
# CPU + RAM
htop
# Disk
df -h
du -sh /home/deploy/walkrpg/backend/pgdata/
du -sh /var/backups/walkrpg/
# Network
ss -tunap | grep LISTEN
Hetzner Cloud Console also shows CPU/RAM/disk/network graphs per VPS, free, no opt-in.
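For a scripted disk check, the use percentage can be pulled out of `df -P`, which prints a stable six-column layout. This is a minimal sketch that assumes mount points contain no spaces; `disk_pct_used` and the 80% threshold are hypothetical.

```shell
# Hypothetical parser: read `df -P` output on stdin and print the use percentage
# (digits only) of the filesystem mounted at the given path.
disk_pct_used() {
  mount_point=$1
  awk -v m="$mount_point" 'NR > 1 && $6 == m { sub(/%/, "", $5); print $5 }'
}

# Example usage (not run here): warn above 80% on the root filesystem.
#   pct=$(df -P | disk_pct_used /)
#   [ "$pct" -lt 80 ] || echo "disk usage high: ${pct}%"
```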
K10. Secret rotation
To rotate JWT_SECRET or POSTGRES_PASSWORD:
# Edit .env
nano /home/deploy/walkrpg/backend/.env
# (paste new values, save)
# Postgres password rotation requires DB-side update too
docker compose -f /home/deploy/walkrpg/backend/docker-compose.prod.yml exec db \
  psql -U walkrpg -c "ALTER USER walkrpg WITH PASSWORD '<new-password>';"
# Restart api to pick up new env
docker compose -f /home/deploy/walkrpg/backend/docker-compose.prod.yml up -d api
Note: rotating JWT_SECRET invalidates all live sessions immediately. All testers must re-auth. Schedule outside of active test windows.
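The runbook does not say how to produce the new values. One option is hex-encoded random bytes, sketched below on the assumption that a 32-byte secret is acceptable for both JWT_SECRET and POSTGRES_PASSWORD. `gen_secret` is a hypothetical helper name.

```shell
# Hypothetical generator: print N random bytes as lowercase hex (2N characters).
# Hex output contains no quotes or shell metacharacters, so it pastes cleanly
# into .env and into a single-quoted SQL string.
gen_secret() {
  nbytes=${1:-32}
  head -c "$nbytes" /dev/urandom | od -An -v -tx1 | tr -d ' \n'
  echo
}

# Example usage (not run here):
#   gen_secret 32
```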
K11. Adding a new tester (currently: no special action)
In mock-auth mode, any caller can POST /auth/callback and self-register. No allowlist exists. Sharing the public hostname api.walkrpg.<root> is the only “invite” step.
When the mock-auth posture flips to Firebase (post-production-migration unfreeze), this changes — tester onboarding becomes a Firebase Auth user invite.
K12. Incident response checklist (skeleton)
If the API is down:
- docker compose ps: are containers running?
- docker compose logs --tail 200 api: boot failure or runtime crash?
- docker compose logs --tail 200 nginx: TLS / upstream errors?
- curl http://localhost:3000/ from VPS: does api respond on internal network?
- curl https://api.walkrpg.<root>/ from CEO laptop: does Cloudflare reach origin?
- Hetzner Cloud Console: is the VPS up? Network OK? Disk full?
- Cloudflare dashboard → Analytics: is traffic hitting CF at all?
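The command-line items in this checklist can be run in order with a small wrapper that labels each result. This is only a sketch: `run_check` is a hypothetical helper, and the commented usage assumes that the api answers `/` with a non-error status, which the runbook does not confirm.

```shell
# Hypothetical runner: execute one check, print OK/FAIL with its label, and
# pass the check's success or failure through as the exit status.
run_check() {
  label=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "OK   $label"
  else
    echo "FAIL $label"
    return 1
  fi
}

# Example usage (not run here). curl -f treats HTTP 4xx/5xx as failure.
#   run_check "api answers on VPS" curl -fsS http://localhost:3000/
#   run_check "public endpoint answers" curl -fsS "https://api.walkrpg.<root>/"
```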
If a security incident is suspected:
- Pull VPS off-network at Hetzner Cloud Console (network detach) — preserves forensic state.
- Snapshot the disk (Hetzner Cloud Console → Server → Snapshots) before any cleanup.
- Write to ~/.claude/walkrpg/CRITICAL.md for CEO surfacing.
- Investigate from a separate, clean VPS or local environment.
L — Branch migration from master to main + protection rules (one-time, ~10 min)
This section is the operational counterpart to ADR-0010 §10. Run once at ADR-0010 ratification time. Subsequent day-2 workflow follows the patterns in §M and ADR-0010 §5.
L1. Confirm clean working tree
From CEO laptop:
cd /path/to/walkrpg
git status
Must report nothing to commit, working tree clean on master. If not, commit or stash the outstanding work before proceeding.
L2. Rename master to main in GitLab
In browser: GitLab → positive-walkers/walkrpg → Settings → Repository → Default branch.
| Field | Value |
|---|---|
| Default branch | main |
If main does not yet exist, GitLab UI offers a one-click rename of the existing default branch. Click “Rename master to main”. The rename is non-destructive — all history is preserved verbatim under the new name.
Confirm in the Branches list that master is gone and main is the new default.
L3. Update local clones
git fetch origin
git branch -m master main            # rename local master to main
git branch -u origin/main main       # re-track to the new remote
git remote set-head origin -a        # update the symbolic-ref HEAD
git pull origin main
Verify:
git branch -vv
Should show * main <sha> [origin/main].
L4. Branch dev from main
git checkout main
git pull origin main
git checkout -b dev
git push -u origin dev
L5. Apply branch protection rules (GitLab UI)
GitLab → positive-walkers/walkrpg → Settings → Repository → Protected branches.
Add main protection:
| Field | Value |
|---|---|
| Branch | main |
| Allowed to merge | Maintainers |
| Allowed to push and merge | No one |
| Allowed to force push | OFF |
| Allow deletion | OFF |
| Required approvals (merge requests) | 1 |
Add dev protection:
| Field | Value |
|---|---|
| Branch | dev |
| Allowed to merge | Maintainers |
| Allowed to push and merge | No one |
| Allowed to force push | OFF |
| Allow deletion | OFF |
| Required approvals (merge requests) | 1 |
feature/* and hotfix/* need no protection. Force-push is fine on those during MR iteration.
L6. Enable auto-delete-source-branch (recommended)
GitLab → positive-walkers/walkrpg → Settings → Merge requests → “Enable ‘Delete source branch’ option by default” → ON.
This keeps the branch list clean — merged feature/hotfix branches disappear automatically.
L7. Enable squash-only (recommended)
GitLab → positive-walkers/walkrpg → Settings → Merge requests → Squash commits when merging → “Require”. This forces squash merges, keeping the protected-branch log clean.
L8. Verify the migration
From CEO laptop:
# Direct push to main must FAIL.
git checkout main
git commit --allow-empty -m "test: should fail"
git push origin main
# Expected: `! [remote rejected] main -> main (pre-receive hook declined)`
If the push succeeds: branch protection is misconfigured. Re-check L5.
Reset the noise:
git reset --hard HEAD~1
L9. From this point forward
All new work flows through feature/* branches and MRs per ADR-0010 §5. The next deploy of WalkRPG goes through:
- git checkout dev && git pull && git checkout -b feature/<name>
- Work + push.
- MR feature/<name> → dev → orchestrator review → CEO merge → auto-deploy staging.
- MR dev → main → orchestrator review → CEO merge → auto-deploy prod.
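The branch flow reduces to a small mapping from source branch to MR target. The sketch below encodes the feature and dev steps listed here plus the hotfix path from section M; `mr_target_for` is a hypothetical helper, not something defined in ADR-0010.

```shell
# Hypothetical helper: print the MR target branch for a source branch,
# following the flow described in this runbook.
mr_target_for() {
  case $1 in
    feature/*) echo dev ;;
    hotfix/*) echo main ;;
    dev) echo main ;;
    *) echo "no MR target defined for branch: $1" >&2; return 1 ;;
  esac
}

# Example usage (not run here):
#   mr_target_for "$(git rev-parse --abbrev-ref HEAD)"
```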
M — Hotfix workflow (reference)
Use when a bad commit reaches prod and needs a fix faster than the normal feature → dev → main release cadence. The hotfix flow bypasses staging (which would slow the fix) but still requires an MR + orchestrator review pass.
M1. Branch from main
git checkout main
git pull origin main
git checkout -b hotfix/<short-name>
M2. Fix + commit + push
# (edit + test locally)
git add <files>
git commit -m "fix(<scope>): <short description>"
git push -u origin hotfix/<short-name>
M3. Open MR hotfix/<name> → main
GitLab UI → Merge requests → New merge request → source hotfix/<short-name>, target main.
Title: same as the commit subject. Body: brief paragraph + what was broken + what was fixed + how the fix was tested locally.
M4. Orchestrator review
From CEO laptop:
claude /review <MR-URL>
Orchestrator dispatches relevant leads (tech-architect for backend infra, narrative-designer for content, etc.), reviews the diff inline, posts a summary verdict.
Expected verdict: APPROVE (hotfixes are small + scoped). If NEEDS_CHANGES, iterate on the hotfix branch (force-push OK) and re-run the review.
M5. CEO merges to main
Squash merge. main advances; auto-deploys to prod.
M6. Verify the fix in prod
# Smoke test against prod
curl -i -X POST https://api.walkrpg.<root>/auth/callback \
  -H "Content-Type: application/json" \
  -d '<same JSON body as in H1>'
Or use whichever endpoint exercises the fix. Confirm the fix behaves as expected.
M7. Sync dev to track prod
The dev branch must carry the hotfix forward; otherwise the next release MR (dev → main) would re-introduce the bug. Two patterns:
Pattern A — cherry-pick onto a sync branch (cleaner if dev has diverged from main):
git checkout dev
git pull origin dev
git checkout -b feature/sync-hotfix-<short-name>
git cherry-pick <merged-squash-sha-on-main>
git push -u origin feature/sync-hotfix-<short-name>
Open MR feature/sync-hotfix-<short-name> → dev. Orchestrator review (typically a quick APPROVE because the diff is identical to the already-reviewed hotfix). Merge.
Pattern B — MR the hotfix branch itself to dev (works if dev is close to main):
# The hotfix branch still exists if you haven't deleted it.
# Open a SECOND MR: hotfix/<short-name> → dev.
Pattern A is more robust when dev has open features that haven’t shipped yet, because the squash-merge SHA is the cleanest carrier of the fix.
M8. Clean up
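If auto-delete is on (L6), GitLab removes the remote hotfix branch when the MR merges, so only the local branch needs deleting; otherwise delete both. After a squash merge, git may report the local branch as not fully merged, in which case use -D instead of -d.
git push origin --delete hotfix/<short-name>
git branch -d hotfix/<short-name>   # local
Appendix — Sequenced overview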
| Section | Time | Run by | Blocker if it fails |
|---|---|---|---|
| A — Hetzner provisioning | ~10 min | CEO | yes |
| B — SSH hardening | ~10 min | CEO | yes |
| C — Docker install | ~5 min | CEO | yes |
| D — Cloudflare DNS | ~10 min | CEO | yes |
| E — Repo deploy keys + secrets | ~10 min | CEO | yes |
| F — Docker compose stacks (prod + staging) | ~15 min | CEO (or paused for orchestrator) | yes — requires compose / Dockerfile / nginx files to exist |
| G — Let’s Encrypt cert (SAN over 6 hostnames) | ~5 min | CEO | yes |
| H — Smoke test (prod + staging) | ~5 min | CEO | yes — must pass to consider Phase 14 live |
| I — GitLab CI/CD (multi-env) | ~20 min | Orchestrator + CEO | no — Phase 14 is live even without CI; this enables push-to-deploy |
| J — Mobile reconfig | ~15 min | Orchestrator + CEO | no — separate concern from backend liveness |
| K — Day-2 ops | reference | CEO + ops | no — reference material |
| L — Branch migration master → main + protection rules (one-time) | ~10 min | CEO | once; gates ADR-0010 going live |
| M — Hotfix workflow (reference) | ~15 min when invoked | CEO + orchestrator | per-incident |
End of runbook.