Spanner disaster-recovery plan and restore drill runbook

Closes INF-09 from the launch-readiness audit. This runbook is the canonical procedure for restoring the gateway Spanner database from backup, the stated RPO / RTO targets, and the plan for the first drill.

NEXT STEP (manual, requires human authorization): run one end-to-end restore drill against the staging Spanner instance and update the RTO/RPO capture template with the measured time. The drill is intentionally gated on human authorization because it touches GCP infrastructure and briefly reroutes staging traffic to a restored instance. Do not automate this drill.

What this runbook does not prove yet

Until the first staging drill is executed and captured:

  • the 2-hour RTO remains an estimate, not a measured promise
  • the repo does not claim that restore timing is already validated
  • the repo does not claim that quarterly rehearsal evidence exists yet

The runbook is complete enough to execute. The evidence package is what closes the remaining gap.

This lane has reached docs-side closeout; the only remaining work is operational evidence:

  • the runbook is already concrete enough to execute without repo archaeology
  • the only missing proof is one real drill plus its durable evidence package
  • the issue should stay open only for measured evidence and any resulting follow-up issue, not for more runbook definition

Scope and assumptions

  • In scope: the production database gateway-db on the gateway-spanner regional instance (Terraform: infra/terraform/modules/spanner/main.tf). Daily full backups with 7-day retention (google_spanner_backup_schedule.daily, cron 0 2 * * *, retention 604800s) written to the same instance under CMEK (var.kms_key_id).
  • Current runtime wiring: every DB-using Cloud Run service (gateway-auth, gateway-card-present, gateway-management, gateway-online-txn, gateway-processing, gateway-merchant-onboarding) reads the Spanner instance path from the gateway-database-url Secret Manager secret and selects the database via the DATABASE_NAME env var. gateway-status does not connect to Spanner. (A read-only inspection sketch follows this list.)
  • Staging drill caveat: Terraform schedules backups only for gateway-db. gateway-db-staging does not currently have a scheduled backup policy, so the rehearsal plan must create an on-demand staging backup first.
  • Out of scope (P2 follow-up — see launch-readiness spec):
    • Multi-region failover. The instance is single-region regional-${var.region}. A GCP regional outage means waiting for GCP to restore the region; there is no hot standby.
    • Cross-region backup replication. Backups live in the same region as the instance.
    • PITR (point-in-time recovery) continuous backups. We do not currently enable version_retention_period, so recovery granularity is limited to the nightly backup cadence.
    • Application-level replay. Processing-service side-effects (TransIT authorizations, webhooks sent) cannot be "undone" by a database restore — the restored state may reference transactions the downstream systems still consider final.
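
As flagged in the runtime-wiring bullet above, both halves of the wiring can be inspected read-only before being relied on mid-incident. A sketch, assuming the operator can read the secret and describe the service:

# Instance path the services read at startup (Secret Manager).
gcloud secrets versions access latest \
--secret="gateway-database-url" \
--project="pinpoint-gateway"

# Database selector on one representative service.
gcloud run services describe gateway-auth \
--project="pinpoint-gateway" \
--region="us-east1" \
--format='value(spec.template.spec.containers[0].env)'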

Recovery targets

RPO — recovery point objective

24 hours. Backups run once per day at 02:00 UTC (Spanner backup schedule cron 0 2 * * *). A disaster at 01:59 UTC loses up to 24 hours of writes; a disaster at 02:30 UTC loses roughly 30 minutes of writes, and only if the 02:00 backup has already reached state=READY (see the note below). If it has not, the loss window stretches back to the previous day's backup.

Note: the backup is triggered at 02:00 UTC and may take 1-2 hours to complete depending on database size. The "last usable backup" is the most recent backup whose state=READY, which in practice lags the cron by up to 2 hours. Verify state before attempting restore:

gcloud spanner backups list \
--instance=gateway-spanner \
--project=pinpoint-gateway \
--filter="state=READY" \
--sort-by=~createTime \
--limit=5

RTO — recovery time objective

Initial best-estimate target: 2 hours for full restore-and-reroute. This is a placeholder until the first drill produces a measured number. Components of the estimate:

| Step | Est. time |
| --- | --- |
| Identify last known-good backup, confirm state=READY | 5 min |
| Run gcloud spanner databases restore into a new database | 30-60 min (scales with data volume) |
| Validate schema + row counts on restored DB | 15 min |
| Update Cloud Run environment variables to point at the restored DB | 10 min |
| Roll all Cloud Run services to pick up new config | 20 min |
| Cutover verification (smoke transactions, health checks) | 15 min |

The 2-hour target must be validated against the drill. Replace this table with the observed numbers after the first run (see RTO/RPO capture template).

Restore procedure (production incident)

All commands assume project pinpoint-gateway, region us-east1, and instance gateway-spanner. Substitute as appropriate for the disaster scenario.

Operator roles

Assign these humans explicitly before changing any Cloud Run service:

  • Incident commander: owns the recovery-path decision, customer/internal comms, and final go/no-go on cutover.
  • Infra operator: runs the Spanner backup/restore and Cloud Run DATABASE_NAME updates.
  • Application verifier: runs smoke checks, validates revision config, and confirms writes land in the intended database.
  • Scribe / evidence owner: timestamps each step, captures artifact filenames/links, and ensures the drill or incident note is complete before closure.

In a two-person response, the incident commander may also be the scribe, but the infra operator and application verifier should still be distinct humans when possible.

Preflight checklist

Do not start the restore until all of the following are true:

  • The incident commander has explicitly chosen restore-over-rollback as the current recovery path.
  • The candidate backup is already in state=READY.
  • The target database name is new and timestamped.
  • One operator is assigned to infra actions and one operator is assigned to application verification.
  • Customer-facing comms are aware that the restore may roll data back to the selected backup timestamp.

Production evidence storage

During a real incident, keep all restore artifacts in one durable place:

  • incident Slack thread for live status only
  • incident issue / postmortem doc for the canonical evidence links
  • one timestamped artifact directory or ticket attachment set containing:
    • backup selection output
    • restore operation output
    • DDL diff
    • row-count and freshness results
    • Cloud Run env-var verification
    • smoke-transaction proof
    • rollback proof if rollback occurs

Do not rely on terminal scrollback or ephemeral shell history as the evidence package.
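
One low-friction way to honor this during a live restore: create the timestamped artifact directory up front and tee each command's output into it as you go. A sketch, using the evidence naming convention from later in this runbook (EVIDENCE_DIR is illustrative):

# Illustrative local staging area; upload it to the incident issue afterwards.
EVIDENCE_DIR="dr-evidence-$(date +%Y%m%d-%H%M)"
mkdir -p "${EVIDENCE_DIR}"

# Example: capture the backup-selection output as the first artifact.
gcloud spanner backups list \
--instance=gateway-spanner \
--project=pinpoint-gateway \
--filter="state=READY" \
| tee "${EVIDENCE_DIR}/01-backup-selection.txt"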

Production service list

These are the DB-using services that must be checked during cutover:

  • gateway-auth
  • gateway-card-present
  • gateway-management
  • gateway-online-txn
  • gateway-processing
  • gateway-merchant-onboarding

gateway-status is intentionally excluded because it does not connect to Spanner.

Step 1 — identify the last known-good backup

PROJECT="pinpoint-gateway"
INSTANCE="gateway-spanner"
REGION="us-east1"

gcloud spanner backups list \
--project="${PROJECT}" \
--instance="${INSTANCE}" \
--filter="state=READY AND database:gateway-db" \
--sort-by=~createTime \
--limit=10 \
--format="table(name.basename(),createTime,expireTime,sizeBytes)"

Pick the most recent backup from the incident window. Record its full resource name (projects/pinpoint-gateway/instances/gateway-spanner/backups/<id>).
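
To capture the resource name programmatically instead of copying it by hand, something like this works (a sketch: it grabs the newest READY backup, which must still be sanity-checked against the incident window):

# Full resource name of the newest READY backup for gateway-db.
# Always verify createTime against the incident timeline before using it.
BACKUP_NAME=$(gcloud spanner backups list \
--project="${PROJECT}" \
--instance="${INSTANCE}" \
--filter="state=READY AND database:gateway-db" \
--sort-by=~createTime \
--limit=1 \
--format="value(name)")
echo "Selected backup: ${BACKUP_NAME}"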

Step 2 — restore to a NEW database

Do not restore into the existing gateway-db — the existing database may still be the right target once the root cause is understood, and overwriting it destroys the ability to roll back the restore itself. Pick a distinct name with a timestamp:

NEW_DB="gateway-db-restore-$(date +%Y%m%d-%H%M)"
BACKUP_NAME="<from step 1>"

gcloud spanner databases restore \
--project="${PROJECT}" \
--destination-instance="${INSTANCE}" \
--destination-database="${NEW_DB}" \
--source-backup="${BACKUP_NAME}" \
--async

The restore runs as a long-running operation. Monitor:

gcloud spanner operations list \
--instance="${INSTANCE}" \
--project="${PROJECT}" \
--filter="metadata.@type:RestoreDatabaseMetadata" \
--format="table(name.basename(),done,metadata.progress.progressPercent)"

The new database is usable for reads and writes the moment the operation completes, but it remains in the READY_OPTIMIZING state for several hours while Spanner optimizes the restored data; performance may be degraded during that window, which is acceptable for cutover.
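
For hands-off monitoring, a simple poll loop against the operation works (a sketch; OPERATION_NAME is the operation id printed by the restore command):

# Poll the restore operation until done=True.
OPERATION_NAME="<operation id from the restore output>"
until [ "$(gcloud spanner operations describe "${OPERATION_NAME}" \
--instance="${INSTANCE}" \
--project="${PROJECT}" \
--format='value(done)')" = "True" ]; do
echo "$(date -u +%H:%M:%S) restore still running..."
sleep 60
done
echo "Restore operation complete."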

Step 3 — validate the restored database

Before pointing any service at ${NEW_DB}, confirm its shape:

gcloud spanner databases ddl describe "${NEW_DB}" \
--project="${PROJECT}" --instance="${INSTANCE}" \
> /tmp/restored-ddl.sql

gcloud spanner databases ddl describe "gateway-db" \
--project="${PROJECT}" --instance="${INSTANCE}" \
> /tmp/live-ddl.sql

diff /tmp/live-ddl.sql /tmp/restored-ddl.sql

Any schema difference is a red flag — it means the backup predates a migration that is live. Investigate before proceeding.

Then capture a concrete integrity worksheet against both the live DB and the restored DB. Use a fixed table set so each drill is comparable quarter to quarter:

SOURCE_DB="gateway-db"

COUNT_SQL=$(cat <<'SQL'
SELECT 'merchants' AS check_name, COUNT(*) AS observed_value FROM merchants
UNION ALL
SELECT 'transactions', COUNT(*) FROM transactions
UNION ALL
SELECT 'checkout_sessions', COUNT(*) FROM checkout_sessions
UNION ALL
SELECT 'credential_profiles', COUNT(*) FROM credential_profiles
UNION ALL
SELECT 'webhook_events', COUNT(*) FROM webhook_events
SQL
)

FRESHNESS_SQL=$(cat <<'SQL'
SELECT 'transactions.max(created_at)' AS check_name, CAST(MAX(created_at) AS STRING) AS observed_value FROM transactions
UNION ALL
SELECT 'checkout_sessions.max(created_at)', CAST(MAX(created_at) AS STRING) FROM checkout_sessions
UNION ALL
SELECT 'credential_profiles.max(updated_at)', CAST(MAX(updated_at) AS STRING) FROM credential_profiles
UNION ALL
SELECT 'webhook_events.max(created_at)', CAST(MAX(created_at) AS STRING) FROM webhook_events
SQL
)

for DB in "${SOURCE_DB}" "${NEW_DB}"; do
echo "=== ${DB}: row counts ==="
gcloud spanner databases execute-sql "${DB}" \
--project="${PROJECT}" --instance="${INSTANCE}" \
--sql="${COUNT_SQL}"

echo "=== ${DB}: freshness markers ==="
gcloud spanner databases execute-sql "${DB}" \
--project="${PROJECT}" --instance="${INSTANCE}" \
--sql="${FRESHNESS_SQL}"
done

Expected result:

  • DDL matches exactly.
  • Row counts on ${NEW_DB} are less than or equal to ${SOURCE_DB} and consistent with the selected backup timestamp.
  • Freshness markers on ${NEW_DB} are no newer than the backup time and no older than expected for the stated RPO window.

If any critical table is missing, or if the restored counts/freshness are materially older than the selected backup should allow, stop before cutover.
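
To make the comparison durable for the evidence package rather than eyeballed in scrollback, capture both sides to files (a sketch, reusing the illustrative EVIDENCE_DIR from the evidence-storage section above; differences are expected, since restored counts should be less than or equal to live):

# Row-count evidence for both databases (05-row-counts in the naming convention).
gcloud spanner databases execute-sql "${SOURCE_DB}" \
--project="${PROJECT}" --instance="${INSTANCE}" \
--sql="${COUNT_SQL}" > "${EVIDENCE_DIR}/05-row-counts-live.txt"

gcloud spanner databases execute-sql "${NEW_DB}" \
--project="${PROJECT}" --instance="${INSTANCE}" \
--sql="${COUNT_SQL}" > "${EVIDENCE_DIR}/05-row-counts-restored.txt"

# Side-by-side view; restored <= live is the expected shape.
diff --side-by-side "${EVIDENCE_DIR}/05-row-counts-live.txt" \
"${EVIDENCE_DIR}/05-row-counts-restored.txt" || true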

Stop conditions before cutover

Abort the cutover and re-evaluate if any of the following are true:

  • the selected backup is no longer the incident-approved restore point
  • DDL differs between gateway-db and ${NEW_DB}
  • row counts imply materially more data loss than the selected backup window should allow
  • freshness markers are newer than the backup time or suspiciously stale relative to the selected backup
  • operators cannot identify a canary transaction path for post-cutover verification

Step 4 — cut over Cloud Run services

Because this restore stays on the same Spanner instance, the gateway-database-url secret does not change. Cutover is done by updating DATABASE_NAME on each DB-using Cloud Run service:

for SVC in gateway-auth gateway-card-present gateway-management \
gateway-online-txn gateway-processing gateway-merchant-onboarding; do
gcloud run services update "${SVC}" \
--project="${PROJECT}" \
--region="${REGION}" \
--update-env-vars="DATABASE_NAME=${NEW_DB}"
done

This rolls a new revision on each service. Traffic flips to the restored DB as each revision becomes ready.

Rotate gateway-database-url only if the recovery procedure moves to a different Spanner instance or project. For the normal same-instance restore case, only DATABASE_NAME changes.
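
Before treating the cutover as complete, confirm the newest revision on each service actually became ready; a created-but-not-ready revision means traffic is still serving the old config. A sketch:

# Each line should print the same revision name twice
# (latest created == latest ready).
for SVC in gateway-auth gateway-card-present gateway-management \
gateway-online-txn gateway-processing gateway-merchant-onboarding; do
echo -n "${SVC}: "
gcloud run services describe "${SVC}" \
--project="${PROJECT}" \
--region="${REGION}" \
--format="value(status.latestCreatedRevisionName,status.latestReadyRevisionName)"
done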

Step 5 — verify cutover

# Confirm the new revision template points at the restored DB.
for SVC in gateway-auth gateway-card-present gateway-management \
gateway-online-txn gateway-processing gateway-merchant-onboarding; do
CURRENT_DB=$(gcloud run services describe "${SVC}" \
--project="${PROJECT}" \
--region="${REGION}" \
--format=json | jq -r '.spec.template.spec.containers[0].env[] | select(.name=="DATABASE_NAME") | .value')
test "${CURRENT_DB}" = "${NEW_DB}" || echo "MISMATCH: ${SVC} -> ${CURRENT_DB}"
done

Then hit the public service health endpoints and confirm they return success from the new revisions. For internal-only services (gateway-processing, gateway-merchant-onboarding), rely on the revision config check above plus the smoke transaction below.
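
A minimal health sweep over the four public services might look like this (a sketch; the /healthz path is an assumption, not confirmed from the repo, so substitute the real health endpoint):

# NOTE: "/healthz" is an assumed path; adjust to the actual endpoint.
for SVC in gateway-auth gateway-card-present gateway-management gateway-online-txn; do
URL=$(gcloud run services describe "${SVC}" \
--project="${PROJECT}" \
--region="${REGION}" \
--format="value(status.url)")
echo "${SVC}: $(curl -s -o /dev/null -w '%{http_code}' "${URL}/healthz")"
done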

Run a smoke transaction against a canary merchant path of your choice. Minimum bar:

  • create one synthetic checkout session;
  • complete one authorization path that writes to Spanner;
  • verify the new row is visible in ${NEW_DB}.
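
The visibility check can be a direct query against ${NEW_DB} (a sketch; the transactions table comes from the integrity worksheet above, but the column names are assumptions to match to the real schema):

# Newest rows should include the smoke transaction just created.
# Column names (id, created_at) are illustrative.
gcloud spanner databases execute-sql "${NEW_DB}" \
--project="${PROJECT}" --instance="${INSTANCE}" \
--sql="SELECT id, created_at FROM transactions ORDER BY created_at DESC LIMIT 5"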

Step 6 — rename for long-term operation

After 24 hours of stable operation on ${NEW_DB}, promote it to the canonical name. Spanner does not support direct rename, so:

  1. Update Terraform (infra/terraform/modules/spanner/main.tf) to manage a database whose name = "${NEW_DB}". This is a destructive Terraform plan — the old gateway-db resource is removed from state, the new one is imported. Coordinate with the infra owner.
  2. Alternatively, if ops process requires the name gateway-db specifically: restore again from backup, this time into a database named gateway-db — but only after deleting the original (and only with deletion_protection temporarily set to false).

Either path is explicitly a post-incident operation, not part of initial cutover. The restored-DB name carrying a timestamp is fine for weeks.

Rollback procedure (restored data turns out to be corrupt)

If step 5 verification fails, or if downstream reconciliation discovers the backup predates a migration or contains integrity issues after cutover has occurred, flip traffic back to the original database:

for SVC in gateway-auth gateway-card-present gateway-management \
gateway-online-txn gateway-processing gateway-merchant-onboarding; do
gcloud run services update "${SVC}" \
--project="${PROJECT}" \
--region="${REGION}" \
--update-env-vars="DATABASE_NAME=gateway-db"
done

This restores the original DB as the live target. The restored ${NEW_DB} is left in place for forensic analysis and is NOT deleted automatically — it is deleted manually only after the root cause is understood.

If the original gateway-db is itself destroyed or corrupted (the actual disaster scenario we're planning for), rollback means picking a different backup from step 1 — e.g., the next-oldest state=READY backup — and repeating steps 2-5 with a different backup source. You are trading RPO (more data loss) for correctness.

Do not ack the incident as resolved until:

  • All DB-using services are pinned back to the intended DATABASE_NAME.
  • A smoke transaction succeeds end-to-end.
  • Reconciliation has run over the restored data and found no unexpected gaps.
  • A Jira / GitHub issue captures what was lost (all writes between the backup timestamp and the disaster time) for customer communication.

Rehearsal plan (first drill)

The drill is a real procedure run against the staging database on the same Spanner instance, following every step above with three modifications:

  1. Staging database only. The instance remains gateway-spanner; the source database is gateway-db-staging, and the drill creates a restored database alongside it.
  2. Staging Cloud Run services only. gateway-<svc>-staging instead of gateway-<svc>.
  3. No real customer traffic at any step. The drill should run off-peak (US evening) and be announced in the ops channel before starting.

Staging drill prerequisites

Before the drill window opens, confirm all of the following:

  • the operators can run gcloud against project pinpoint-gateway
  • the chosen operator identities have permission to inspect Spanner, update Cloud Run services, and read the relevant revisions
  • a canary staging flow is chosen ahead of time for smoke verification
  • the issue, doc, or incident note that will hold evidence already exists
  • the restored staging database naming convention is agreed in advance
  • the rollback owner is named before the first cutover step

Staging service list

Use these concrete service names during the drill:

  • gateway-auth-staging
  • gateway-card-present-staging
  • gateway-management-staging
  • gateway-online-txn-staging
  • gateway-processing-staging
  • gateway-merchant-onboarding-staging

Staging restore checklist

Use this as the working checklist during the drill. Do not advance until the prior step is complete and timestamped.

  • Announce the drill in the ops channel and capture the start time.
  • Confirm the source database is gateway-db-staging.
  • Create or identify a READY on-demand backup for gateway-db-staging.
  • Record backup_name, backup_create_time, and the observed backup state.
  • Confirm the restore will target a new database name, not the existing staging database.
  • Start the restore operation and record the operation id.
  • Wait for restore completion and record the completion time.
  • Run schema validation against live staging and the restored database.
  • Run row-count and freshness checks on the fixed table set from Step 3 above.
  • Cut one staging Cloud Run service over to the restored database.
  • Run a synthetic transaction through the switched service.
  • Verify the transaction is visible in the restored database.
  • Flip the service back to the original staging database.
  • Delete the restored database after verification and rollback are complete (deletion sketch after this checklist).
  • Capture evidence links / filenames and close out the drill notes.
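
For the deletion step above, a sketch (RESTORED_DB is whatever timestamped name the drill created; --quiet skips the interactive confirmation):

RESTORED_DB="<restored staging db from this drill>"

gcloud spanner databases delete "${RESTORED_DB}" \
--project="${PROJECT}" \
--instance="${INSTANCE}" \
--quiet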

For the first rehearsal, switch one staging service first instead of cutting over the entire stack in one move. Preferred order:

  1. gateway-management-staging for configuration/read validation
  2. gateway-online-txn-staging for write-path smoke validation

Expand to more services only after the one-service cutover and rollback path has been proven in the same drill.

Because staging has no scheduled backup policy today, create an on-demand backup at least one business day before the drill so the rehearsal proves restore-from-backup, not restore-from-minutes-ago:

PROJECT="pinpoint-gateway"
INSTANCE="gateway-spanner"
STAGING_DB="gateway-db-staging"
STAGING_BACKUP="gateway-db-staging-drill-$(date +%Y%m%d)"

gcloud spanner backups create "${STAGING_BACKUP}" \
--project="${PROJECT}" \
--instance="${INSTANCE}" \
--database="${STAGING_DB}" \
--retention-period="7d" \
--async

Wait until the backup is READY, then record its create time in the drill notes below. Use that backup as the source for the drill.
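
A readiness poll for the on-demand backup, mirroring the restore-operation poll earlier (a sketch):

# Wait until the staging backup reaches READY, then record its create time.
until [ "$(gcloud spanner backups describe "${STAGING_BACKUP}" \
--instance="${INSTANCE}" \
--project="${PROJECT}" \
--format='value(state)')" = "READY" ]; do
echo "$(date -u +%H:%M:%S) backup still creating..."
sleep 60
done

gcloud spanner backups describe "${STAGING_BACKUP}" \
--instance="${INSTANCE}" \
--project="${PROJECT}" \
--format="value(createTime)"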

Staging env-var verification

Before and after the drill cutover, confirm the service template points at the expected database:

PROJECT="pinpoint-gateway"
REGION="us-east1"
SERVICE="gateway-management-staging"

gcloud run services describe "${SERVICE}" \
--project="${PROJECT}" \
--region="${REGION}" \
--format='value(spec.template.spec.containers[0].env)'

Expected:

  • before cutover: DATABASE_NAME=gateway-db-staging
  • during the drill cutover: DATABASE_NAME=<restored staging db>
  • after rollback: DATABASE_NAME=gateway-db-staging

RTO/RPO capture template

Fill in these values during the drill:

| Metric | Target | Observed | Notes |
| --- | --- | --- | --- |
| Recovery point objective (RPO) | 24 hours | | Backup timestamp and write-loss window |
| Recovery time objective (RTO) to restore completion | 60 min or less | | Start to restore-done |
| Recovery time objective (RTO) to cutover complete | 45 min or less | | Restore-done to service cutover |
| End-to-end drill duration | 2 hours or less | | Start to rollback complete |
| Backup state at selection time | READY | | Confirm the backup used for the drill |
| Schema diff result | 0 diffs | | Live staging vs restored database |
| Row-count delta summary | Within expected backup lag | | Fixed table set |
| Freshness delta summary | No newer than backup time | | Fixed table set |
| Smoke transaction result | Pass | | Reference/id below |

If the observed RTO or RPO is outside target, note the gap here before filing follow-up work. The target itself does not change until the business owner accepts a new bound.

If observed timings miss target

If the drill succeeds functionally but misses the stated timing target:

  • do not mark the lane operationally complete
  • attach the measured numbers anyway
  • open a follow-up issue for the slowest step with the captured evidence
  • record whether the current target is still accepted or needs explicit revision
  • keep #417 open only for the measured-gap follow-up and/or the first real evidence package, not for more runbook writing

Drill failure criteria

The drill is not considered successful if any of these occur:

  • restore completes but DDL diff is non-zero
  • the switched staging service cannot read from the restored database
  • the smoke transaction path fails or writes only to the original database
  • rollback to gateway-db-staging fails
  • evidence artifacts are missing for any cutover/rollback step

Issue-close criteria

This runbook alone does not mean the DR lane is operationally complete. Treat the issue as closure-ready only when all of the following are true:

  • one staging restore drill has actually been run end to end
  • the RTO/RPO capture template is filled with observed numbers
  • the evidence checklist artifacts are attached somewhere durable
  • the canonical drill note names the evidence owner and the durable artifact location
  • the operator summary says whether the measured timings met the stated target

Do not close the issue on runbook presence alone, and do not reopen it for further runbook definition once the evidence package is attached.

Evidence checklist

Attach or link the following artifacts in the drill note / incident issue:

  • Link to the canonical drill note or incident document.
  • Backup selection output showing the chosen READY backup.
  • Restore operation id and completion output.
  • DDL diff output.
  • Row-count query output for the fixed table set.
  • Freshness query output for the fixed table set.
  • Cloud Run service update output for the staging cutover.
  • Smoke transaction reference or request id.
  • Rollback output showing the service returned to the original staging database.
  • Database deletion output for the restored staging database.
  • Final note with timestamps, operator name, and follow-up actions.

Evidence naming convention

Use filenames that sort chronologically in the issue attachment set or artifact directory. Recommended pattern:

  • 01-backup-selection.txt
  • 02-restore-start.txt
  • 03-restore-complete.txt
  • 04-ddl-diff.txt
  • 05-row-counts.txt
  • 06-freshness.txt
  • 07-cutover-env.txt
  • 08-smoke-transaction.txt
  • 09-rollback-env.txt
  • 10-restored-db-delete.txt
  • 11-drill-summary.md

Ready-to-paste drill note template

Use this template in the issue, incident note, or drill document so the first run is captured in a uniform shape:

## Spanner DR Drill Summary

- Drill date:
- Operators:
- Incident commander:
- Infra operator:
- Application verifier:
- Evidence owner:
- Canonical drill note / issue link:
- Evidence artifact directory / attachment set:
- Source database: gateway-db-staging
- Restored database:
- Selected backup:
- Backup create time:
- Restore operation id:
- Restore completion time:
- Cutover completion time:
- Rollback completion time:
- Did observed RPO meet target?
- Did observed RTO meet target?
- Closure decision (`met target` / `follow-up required`):
- Follow-up issue link:
- Next rehearsal due by:
- Smoke transaction reference:
- Follow-up actions:

### Attached evidence

- [ ] backup selection output
- [ ] restore completion output
- [ ] ddl diff
- [ ] row-count output
- [ ] freshness output
- [ ] cutover env-var output
- [ ] smoke transaction proof
- [ ] rollback output
- [ ] restored database deletion output

Post-restore verification log

Record each verification step as it happens. This is the closeout log that shows the restored database was usable, the cutover succeeded, and the rollback path was exercised.

| Field | Value |
| --- | --- |
| Drill date | |
| Operator | |
| Incident commander | |
| Evidence owner | |
| Source database | gateway-db-staging |
| Backup name | |
| Backup create time | |
| Restored database name | |
| Restore operation id | |
| Restore start time | |
| Restore complete time | |
| Validation start time | |
| Validation complete time | |
| Staging service switched | |
| Cutover complete time | |
| Smoke transaction id / reference | |
| Rollback complete time | |
| Restored database deleted at | |
| Closure decision | |
| Follow-up issue | |
| Next rehearsal due by | |
| Follow-up actions | |

After the first drill, replace the placeholder values in the RTO/RPO template with the measured numbers and keep the verification log attached to the issue. If the observed duration exceeds the target, file a follow-up ticket to reduce restore time and record the accepted operational bound in the same drill note.

Drill cadence

After the first drill, quarterly thereafter. A restore procedure that has not been rehearsed in the last three months is untrusted — the Rehearsal plan section should be re-run and the RTO/RPO capture template refreshed each quarter.

Related references

  • Secrets rotation policy — if the Spanner instance is recreated during recovery, the gateway-database-url secret must be rotated.
  • Alert runbook index — Spanner-originated Cloud Run 5xx bursts are triaged via the SLO fast-burn and slow-burn runbooks.
  • infra/terraform/modules/spanner/main.tf — source of truth for backup schedule, retention, and CMEK config.