Spanner disaster-recovery plan and restore drill runbook
Closes INF-09 from the launch-readiness audit. This runbook is the canonical procedure for restoring the gateway Spanner database from backup, the stated RPO / RTO targets, and the plan for the first drill.
NEXT STEP (manual, requires human authorization): run one end-to-end restore drill against the staging Spanner instance and update the RTO/RPO capture template with the measured time. The drill is intentionally gated on human authorization because it touches GCP infrastructure and briefly reroutes staging traffic to a restored instance. Do not automate this drill.
What this runbook does not prove yet
Until the first staging drill is executed and captured:
- the 2-hour RTO remains an estimate, not a measured promise
- the repo does not claim that restore timing is already validated
- the repo does not claim that quarterly rehearsal evidence exists yet
The runbook is complete enough to execute. The evidence package is what closes the remaining gap.
The docs-side work for this lane is complete; the only remaining work is operational evidence:
- the runbook is already concrete enough to execute without repo archaeology
- the only missing proof is one real drill plus its durable evidence package
- the issue should stay open only for measured evidence and any resulting follow-up issue, not for more runbook definition
Scope and assumptions
- In scope: the production database `gateway-db` on the `gateway-spanner` regional instance (Terraform: `infra/terraform/modules/spanner/main.tf`). Daily full backups with 7-day retention (`google_spanner_backup_schedule.daily`, cron `0 2 * * *`, retention `604800s`) written to the same instance under CMEK (`var.kms_key_id`).
- Current runtime wiring: every DB-using Cloud Run service (`gateway-auth`, `gateway-card-present`, `gateway-management`, `gateway-online-txn`, `gateway-processing`, `gateway-merchant-onboarding`) reads the Spanner instance path from the `gateway-database-url` Secret Manager secret and selects the database via the `DATABASE_NAME` env var. `gateway-status` does not connect to Spanner.
- Staging drill caveat: Terraform schedules backups only for `gateway-db`. `gateway-db-staging` does not currently have a scheduled backup policy, so the rehearsal plan must create an on-demand staging backup first.
- Out of scope (P2 follow-up — see launch-readiness spec):
  - Multi-region failover. The instance is single-region `regional-${var.region}`. A GCP regional outage means waiting for GCP to restore the region; there is no hot standby.
  - Cross-region backup replication. Backups live in the same region as the instance.
  - PITR (point-in-time recovery) continuous backups. We do not currently enable `version_retention_period`, so recovery granularity is limited to the nightly backup cadence.
  - Application-level replay. Processing-service side-effects (TransIT authorizations, webhooks sent) cannot be "undone" by a database restore — the restored state may reference transactions the downstream systems still consider final.
Recovery targets
RPO — recovery point objective
24 hours. Backups run once per day at 02:00 UTC (Spanner backup schedule cron `0 2 * * *`). A disaster occurring at 01:59 UTC loses up to 24 hours of writes; a disaster at 02:30 UTC loses only the writes since the 02:00 snapshot, assuming that backup completes.
Note: the backup is triggered at 02:00 UTC and may take 1-2 hours to complete depending on database size. The "last usable backup" is the most recent backup whose state=READY, which in practice lags the cron by up to 2 hours. Verify state before attempting restore:
```shell
gcloud spanner backups list \
  --instance=gateway-spanner \
  --project=pinpoint-gateway \
  --filter="state=READY" \
  --sort-by=~createTime \
  --limit=5
```
RTO — recovery time objective
Initial best-estimate target: 2 hours for full restore-and-reroute. This is a placeholder until the first drill produces a measured number. Components of the estimate:
| Step | Est. time |
|---|---|
| Identify last known-good backup, confirm `state=READY` | 5 min |
| Run `gcloud spanner databases restore` into a new database | 30-60 min (scales with data volume) |
| Validate schema + row counts on restored DB | 15 min |
| Update Cloud Run environment variables to point at the restored DB | 10 min |
| Roll all Cloud Run services to pick up new config | 20 min |
| Cutover verification (smoke transactions, health checks) | 15 min |
The 2-hour target must be validated against the drill. Replace this table with the observed numbers after the first run (see RTO/RPO capture template).
Restore procedure (production incident)
All commands assume project pinpoint-gateway, region us-east1, and instance gateway-spanner. Substitute as appropriate for the disaster scenario.
Operator roles
Assign these humans explicitly before changing any Cloud Run service:
- Incident commander: owns the recovery-path decision, customer/internal comms, and final go/no-go on cutover.
- Infra operator: runs the Spanner backup/restore and Cloud Run `DATABASE_NAME` updates.
- Application verifier: runs smoke checks, validates revision config, and confirms writes land in the intended database.
- Scribe / evidence owner: timestamps each step, captures artifact filenames/links, and ensures the drill or incident note is complete before closure.
In a two-person response, the incident commander may also be the scribe, but the infra operator and application verifier should still be distinct humans when possible.
Preflight checklist
Do not start the restore until all of the following are true:
- The incident commander has explicitly chosen restore-over-rollback as the current recovery path.
- The candidate backup is already in `state=READY`.
- The target database name is new and timestamped.
- One operator is assigned to infra actions and one operator is assigned to application verification.
- Customer-facing comms are aware that the restore may roll data back to the selected backup timestamp.
Production evidence storage
During a real incident, keep all restore artifacts in one durable place:
- incident Slack thread for live status only
- incident issue / postmortem doc for the canonical evidence links
- one timestamped artifact directory or ticket attachment set containing:
- backup selection output
- restore operation output
- DDL diff
- row-count and freshness results
- Cloud Run env-var verification
- smoke-transaction proof
- rollback proof if rollback occurs
Do not rely on terminal scrollback or ephemeral shell history as the evidence package.
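One low-ceremony way to satisfy this is to create the timestamped artifact directory up front and `tee` each command's output into it. A minimal sketch — the `~/dr-evidence` location is illustrative, not a mandated path; substitute whatever durable location your ticketing workflow uses:

```shell
# Illustrative only: pick whatever durable location your attachment workflow uses.
DRILL_ROOT="${HOME}/dr-evidence"
EVIDENCE_DIR="${DRILL_ROOT}/$(date +%Y%m%d-%H%M)-spanner-restore"
mkdir -p "${EVIDENCE_DIR}"

# Then pipe each step through tee so the artifact is written as you work, e.g.:
#   gcloud spanner backups list ... | tee "${EVIDENCE_DIR}/01-backup-selection.txt"
echo "evidence dir: ${EVIDENCE_DIR}"
```

Upload the directory contents as the attachment set when the incident or drill note is closed out.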
Production service list
These are the DB-using services that must be checked during cutover:
- `gateway-auth`
- `gateway-card-present`
- `gateway-management`
- `gateway-online-txn`
- `gateway-processing`
- `gateway-merchant-onboarding`

`gateway-status` is intentionally excluded because it does not connect to Spanner.
Step 1 — identify the last known-good backup
```shell
PROJECT="pinpoint-gateway"
INSTANCE="gateway-spanner"
REGION="us-east1"

gcloud spanner backups list \
  --project="${PROJECT}" \
  --instance="${INSTANCE}" \
  --filter="state=READY AND database:gateway-db" \
  --sort-by=~createTime \
  --limit=10 \
  --format="table(name.basename(),createTime,expireTime,sizeBytes)"
```
Pick the most recent backup from the incident window. Record its full resource name (projects/pinpoint-gateway/instances/gateway-spanner/backups/<id>).
Step 2 — restore to a NEW database
Do not restore into the existing gateway-db — the existing database may still be the right target once the root cause is understood, and overwriting it destroys the ability to roll back the restore itself. Pick a distinct name with a timestamp:
```shell
NEW_DB="gateway-db-restore-$(date +%Y%m%d-%H%M)"
BACKUP_NAME="<from step 1>"

gcloud spanner databases restore \
  --project="${PROJECT}" \
  --destination-instance="${INSTANCE}" \
  --destination-database="${NEW_DB}" \
  --source-backup="${BACKUP_NAME}" \
  --async
```
The restore runs as a long-running operation. Monitor:
```shell
gcloud spanner operations list \
  --instance="${INSTANCE}" \
  --project="${PROJECT}" \
  --filter="metadata.@type:RestoreDatabaseMetadata" \
  --format="table(name.basename(),done,metadata.progress.progressPercent)"
```
The new database is usable for reads the moment the operation completes. Writes are permitted but the database remains in READY_OPTIMIZING state for several hours while Spanner rebuilds indices — acceptable for cutover.
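If the operator wants a blocking wait instead of re-running the list command, a bounded poll on the operation's `done` field works. This is a sketch: the operation name comes from the `--async` restore output, and the `--database` flag assumes the restore operation is addressed under the destination database.

```shell
# Sketch — OPERATION is the long-running operation name printed by the restore.
OPERATION="<operation name from the restore output>"
for _ in $(seq 1 120); do   # give up after roughly an hour
  DONE=$(gcloud spanner operations describe "${OPERATION}" \
    --project="${PROJECT}" \
    --instance="${INSTANCE}" \
    --database="${NEW_DB}" \
    --format="value(done)")
  [ "${DONE}" = "True" ] && echo "restore complete" && break
  sleep 30
done
```

Record the completion timestamp when the loop exits; it feeds the RTO capture template.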
Step 3 — validate the restored database
Before pointing any service at ${NEW_DB}, confirm its shape:
```shell
gcloud spanner databases ddl describe "${NEW_DB}" \
  --project="${PROJECT}" --instance="${INSTANCE}" \
  > /tmp/restored-ddl.sql
gcloud spanner databases ddl describe "gateway-db" \
  --project="${PROJECT}" --instance="${INSTANCE}" \
  > /tmp/live-ddl.sql
diff /tmp/live-ddl.sql /tmp/restored-ddl.sql
```
Any schema difference is a red flag — it means the backup predates a migration that is live. Investigate before proceeding.
Then capture a concrete integrity worksheet against both the live DB and the restored DB. Use a fixed table set so each drill is comparable quarter to quarter:
```shell
SOURCE_DB="gateway-db"

COUNT_SQL=$(cat <<'SQL'
SELECT 'merchants' AS check_name, COUNT(*) AS observed_value FROM merchants
UNION ALL
SELECT 'transactions', COUNT(*) FROM transactions
UNION ALL
SELECT 'checkout_sessions', COUNT(*) FROM checkout_sessions
UNION ALL
SELECT 'credential_profiles', COUNT(*) FROM credential_profiles
UNION ALL
SELECT 'webhook_events', COUNT(*) FROM webhook_events
SQL
)

FRESHNESS_SQL=$(cat <<'SQL'
SELECT 'transactions.max(created_at)' AS check_name, CAST(MAX(created_at) AS STRING) AS observed_value FROM transactions
UNION ALL
SELECT 'checkout_sessions.max(created_at)', CAST(MAX(created_at) AS STRING) FROM checkout_sessions
UNION ALL
SELECT 'credential_profiles.max(updated_at)', CAST(MAX(updated_at) AS STRING) FROM credential_profiles
UNION ALL
SELECT 'webhook_events.max(created_at)', CAST(MAX(created_at) AS STRING) FROM webhook_events
SQL
)

for DB in "${SOURCE_DB}" "${NEW_DB}"; do
  echo "=== ${DB}: row counts ==="
  gcloud spanner databases execute-sql "${DB}" \
    --project="${PROJECT}" --instance="${INSTANCE}" \
    --sql="${COUNT_SQL}"
  echo "=== ${DB}: freshness markers ==="
  gcloud spanner databases execute-sql "${DB}" \
    --project="${PROJECT}" --instance="${INSTANCE}" \
    --sql="${FRESHNESS_SQL}"
done
```
Expected result:
- DDL matches exactly.
- Row counts on `${NEW_DB}` are less than or equal to `${SOURCE_DB}` and consistent with the selected backup timestamp.
- Freshness markers on `${NEW_DB}` are no newer than the backup time and no older than expected for the stated RPO window.
If any critical table is missing, or if the restored counts/freshness are materially older than the selected backup should allow, stop before cutover.
Stop conditions before cutover
Abort the cutover and re-evaluate if any of the following are true:
- the selected backup is no longer the incident-approved restore point
- DDL differs between `gateway-db` and `${NEW_DB}`
- row counts imply materially more data loss than the selected backup window should allow
- freshness markers are newer than the backup time or suspiciously stale relative to the selected backup
- operators cannot identify a canary transaction path for post-cutover verification
Step 4 — cut over Cloud Run services
Because this restore stays on the same Spanner instance, the gateway-database-url secret does not change. Cutover is done by updating DATABASE_NAME on each DB-using Cloud Run service:
```shell
for SVC in gateway-auth gateway-card-present gateway-management \
    gateway-online-txn gateway-processing gateway-merchant-onboarding; do
  gcloud run services update "${SVC}" \
    --project="${PROJECT}" \
    --region="${REGION}" \
    --update-env-vars="DATABASE_NAME=${NEW_DB}"
done
```
This rolls a new revision on each service. Traffic flips to the restored DB as each revision becomes ready.
Rotate `gateway-database-url` only if the recovery procedure moves to a different Spanner instance or project. For the normal same-instance restore case, only `DATABASE_NAME` changes.
Step 5 — verify cutover
```shell
# Confirm the new revision template points at the restored DB.
# Note: in the Knative-shaped describe output, containers live under
# .spec.template.spec.containers, not .spec.template.containers.
for SVC in gateway-auth gateway-card-present gateway-management \
    gateway-online-txn gateway-processing gateway-merchant-onboarding; do
  CURRENT_DB=$(gcloud run services describe "${SVC}" \
    --project="${PROJECT}" \
    --region="${REGION}" \
    --format=json | jq -r '.spec.template.spec.containers[0].env[] | select(.name=="DATABASE_NAME") | .value')
  test "${CURRENT_DB}" = "${NEW_DB}"
done
```
Then hit the public service health endpoints and confirm they return success from the new revisions. For internal-only services (gateway-processing, gateway-merchant-onboarding), rely on the revision config check above plus the smoke transaction below.
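The health sweep over the public services can be scripted. A sketch — the `/healthz` path is an assumption, not confirmed from the repo; substitute whatever health endpoint the gateway actually exposes:

```shell
for SVC in gateway-auth gateway-card-present gateway-management gateway-online-txn; do
  # status.url is the service's serving URL in the Knative-shaped describe output.
  URL=$(gcloud run services describe "${SVC}" \
    --project="${PROJECT}" \
    --region="${REGION}" \
    --format="value(status.url)")
  # /healthz is a placeholder path — adjust to the real health endpoint.
  curl -fsS "${URL}/healthz" > /dev/null && echo "${SVC}: healthy"
done
```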
Run a smoke transaction against a canary merchant path of your choice. Minimum bar:
- create one synthetic checkout session;
- complete one authorization path that writes to Spanner;
- verify the new row is visible in `${NEW_DB}`.
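The visibility check can be done directly with `execute-sql` against the restored database. Sketch only: the `checkout_sessions` table comes from the integrity worksheet above, but the `id` column name is an assumption — adjust to the real schema.

```shell
# SESSION_ID is whatever identifier the canary checkout flow returned.
SESSION_ID="<id returned by the canary checkout flow>"
# Column name "id" is assumed; substitute the actual primary-key column.
gcloud spanner databases execute-sql "${NEW_DB}" \
  --project="${PROJECT}" --instance="${INSTANCE}" \
  --sql="SELECT COUNT(*) AS smoke_rows FROM checkout_sessions WHERE id = '${SESSION_ID}'"
```

A `smoke_rows` value of 1 is the pass condition for this step.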
Step 6 — rename for long-term operation
After 24 hours of stable operation on ${NEW_DB}, promote it to the canonical name. Spanner does not support direct rename, so:
- Update Terraform (`infra/terraform/modules/spanner/main.tf`) to manage a database whose `name = "${NEW_DB}"`. This is a destructive Terraform plan — the old `gateway-db` resource is removed from state, the new one is imported. Coordinate with the infra owner.
- Alternatively, if ops process requires the name `gateway-db` specifically: restore again from backup, this time into a database named `gateway-db` — but only after deleting the original (and only with `deletion_protection` temporarily set to `false`).
Either path is explicitly a post-incident operation, not part of initial cutover. The restored-DB name carrying a timestamp is fine for weeks.
Rollback procedure (restored data turns out to be corrupt)
If step 5 verification fails, or if downstream reconciliation discovers the backup predates a migration or contains integrity issues after cutover has occurred, flip traffic back to the original database:
```shell
for SVC in gateway-auth gateway-card-present gateway-management \
    gateway-online-txn gateway-processing gateway-merchant-onboarding; do
  gcloud run services update "${SVC}" \
    --project="${PROJECT}" \
    --region="${REGION}" \
    --update-env-vars="DATABASE_NAME=gateway-db"
done
```
This restores the original DB as the live target. The restored ${NEW_DB} is left in place for forensic analysis and is NOT deleted automatically — it is deleted manually only after the root cause is understood.
If the original gateway-db is itself destroyed or corrupted (the actual disaster scenario we're planning for), rollback means picking a different backup from step 1 — e.g., the next-oldest state=READY backup — and repeating steps 2-5 with a different backup source. You are trading RPO (more data loss) for correctness.
Do not ack the incident as resolved until:
- All DB-using services are pinned back to the intended `DATABASE_NAME`.
- A smoke transaction succeeds end-to-end.
- Reconciliation has run over the restored data and found no unexpected gaps.
- A Jira / GitHub issue captures what was lost (all writes between the backup timestamp and the disaster time) for customer communication.
Rehearsal plan (first drill)
The drill is a real procedure run against the staging database on the same Spanner instance, following every step above with three modifications:
- Staging database only. The instance remains `gateway-spanner`; the source database is `gateway-db-staging`, and the drill creates a restored database alongside it.
- Staging Cloud Run services only: `gateway-<svc>-staging` instead of `gateway-<svc>`.
- No real customer traffic at any step. The drill should run off-peak (US evening) and be communicated in the ops channel before starting.
Staging drill prerequisites
Before the drill window opens, confirm all of the following:
- the operators can run `gcloud` against project `pinpoint-gateway`
- the chosen operator identities have permission to inspect Spanner, update Cloud Run services, and read the relevant revisions
- a canary staging flow is chosen ahead of time for smoke verification
- the issue, doc, or incident note that will hold evidence already exists
- the restored staging database naming convention is agreed in advance
- the rollback owner is named before the first cutover step
Staging service list
Use these concrete service names during the drill:
- `gateway-auth-staging`
- `gateway-card-present-staging`
- `gateway-management-staging`
- `gateway-online-txn-staging`
- `gateway-processing-staging`
- `gateway-merchant-onboarding-staging`
Staging restore checklist
Use this as the working checklist during the drill. Do not advance until the prior step is complete and timestamped.
- Announce the drill in the ops channel and capture the start time.
- Confirm the source database is `gateway-db-staging`.
- Create or identify a `READY` on-demand backup for `gateway-db-staging`.
- Record `backup_name`, `backup_create_time`, and the observed backup `state`.
- Confirm the restore will target a new database name, not the existing staging database.
- Start the restore operation and record the operation id.
- Wait for restore completion and record the completion time.
- Run schema validation against live staging and the restored database.
- Run row-count and freshness checks on the fixed table set below.
- Cut one staging Cloud Run service over to the restored database.
- Run a synthetic transaction through the switched service.
- Verify the transaction is visible in the restored database.
- Flip the service back to the original staging database.
- Delete the restored database after verification and rollback are complete.
- Capture evidence links / filenames and close out the drill notes.
Recommended drill blast radius
For the first rehearsal, switch one staging service first instead of cutting over the entire stack in one move. Preferred order:
- `gateway-management-staging` for configuration/read validation
- `gateway-online-txn-staging` for write-path smoke validation
Expand to more services only after the one-service cutover and rollback path has been proven in the same drill.
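The one-service first move is just step 4's loop reduced to a single service. Sketch — `RESTORED_STAGING_DB` stands in for whatever restored database name the drill agreed on in advance:

```shell
# Placeholder: the restored staging database name agreed before the drill.
RESTORED_STAGING_DB="<restored staging db name>"

gcloud run services update "gateway-management-staging" \
  --project="${PROJECT}" \
  --region="${REGION}" \
  --update-env-vars="DATABASE_NAME=${RESTORED_STAGING_DB}"
```

The rollback step is the same command with `DATABASE_NAME=gateway-db-staging`.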
Because staging has no scheduled backup policy today, create an on-demand backup at least one business day before the drill so the rehearsal proves restore-from-backup, not restore-from-minutes-ago:
```shell
PROJECT="pinpoint-gateway"
INSTANCE="gateway-spanner"
STAGING_DB="gateway-db-staging"
STAGING_BACKUP="gateway-db-staging-drill-$(date +%Y%m%d)"

gcloud spanner backups create "${STAGING_BACKUP}" \
  --project="${PROJECT}" \
  --instance="${INSTANCE}" \
  --database="${STAGING_DB}" \
  --retention-period="7d" \
  --async
```
Wait until the backup is READY, then record its create time in the drill notes below. Use that backup as the source for the drill.
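The wait can be a simple bounded poll on the backup's `state` (sketch):

```shell
for _ in $(seq 1 60); do   # give up after roughly an hour
  STATE=$(gcloud spanner backups describe "${STAGING_BACKUP}" \
    --project="${PROJECT}" \
    --instance="${INSTANCE}" \
    --format="value(state)")
  echo "backup state: ${STATE}"
  [ "${STATE}" = "READY" ] && break
  sleep 60
done
```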
Staging env-var verification
Before and after the drill cutover, confirm the service template points at the expected database:
```shell
PROJECT="pinpoint-gateway"
REGION="us-east1"
SERVICE="gateway-management-staging"

# Containers sit under spec.template.spec in the Knative-shaped output.
gcloud run services describe "${SERVICE}" \
  --project="${PROJECT}" \
  --region="${REGION}" \
  --format='value(spec.template.spec.containers[0].env)'
```
Expected:
- before cutover: `DATABASE_NAME=gateway-db-staging`
- during the drill cutover: `DATABASE_NAME=<restored staging db>`
- after rollback: `DATABASE_NAME=gateway-db-staging`
RTO/RPO capture template
Fill in these values during the drill:
| Metric | Target | Observed | Notes |
|---|---|---|---|
| Recovery point objective (RPO) | 24 hours | | Backup timestamp and write-loss window |
| Recovery time objective (RTO) to restore completion | 60 min or less | | Start to restore-done |
| Recovery time objective (RTO) to cutover complete | 45 min or less | | Restore-done to service cutover |
| End-to-end drill duration | 2 hours or less | | Start to rollback complete |
| Backup state at selection time | READY | | Confirm the backup used for the drill |
| Schema diff result | 0 diffs | | Live staging vs restored database |
| Row-count delta summary | Within expected backup lag | | Fixed table set |
| Freshness delta summary | No newer than backup time | | Fixed table set |
| Smoke transaction result | Pass | | Reference/id below |
If the observed RTO or RPO is outside target, note the gap here before filing follow-up work. The target itself does not change until the business owner accepts a new bound.
If observed timings miss target
If the drill succeeds functionally but misses the stated timing target:
- do not mark the lane operationally complete
- attach the measured numbers anyway
- open a follow-up issue for the slowest step with the captured evidence
- record whether the current target is still accepted or needs explicit revision
- keep #417 open only for the measured-gap follow-up and/or the first real evidence package, not for more runbook writing
Drill failure criteria
The drill is not considered successful if any of these occur:
- restore completes but DDL diff is non-zero
- the switched staging service cannot read from the restored database
- the smoke transaction path fails or writes only to the original database
- rollback to `gateway-db-staging` fails
- evidence artifacts are missing for any cutover/rollback step
Issue-close criteria
This runbook alone does not mean the DR lane is operationally complete. Treat the issue as closure-ready only when all of the following are true:
- one staging restore drill has actually been run end to end
- the RTO/RPO capture template is filled with observed numbers
- the evidence checklist artifacts are attached somewhere durable
- the canonical drill note names the evidence owner and the durable artifact location
- the operator summary says whether the measured timings met the stated target
Do not close the issue on runbook presence alone, and do not reopen it for further runbook definition once the evidence package is attached.
Evidence checklist
Attach or link the following artifacts in the drill note / incident issue:
- Link to the canonical drill note or incident document.
- Backup selection output showing the chosen `READY` backup.
- Restore operation id and completion output.
- DDL diff output.
- Row-count query output for the fixed table set.
- Freshness query output for the fixed table set.
- Cloud Run service update output for the staging cutover.
- Smoke transaction reference or request id.
- Rollback output showing the service returned to the original staging database.
- Database deletion output for the restored staging database.
- Final note with timestamps, operator name, and follow-up actions.
Evidence naming convention
Use filenames that sort chronologically in the issue attachment set or artifact directory. Recommended pattern:
- `01-backup-selection.txt`
- `02-restore-start.txt`
- `03-restore-complete.txt`
- `04-ddl-diff.txt`
- `05-row-counts.txt`
- `06-freshness.txt`
- `07-cutover-env.txt`
- `08-smoke-transaction.txt`
- `09-rollback-env.txt`
- `10-restored-db-delete.txt`
- `11-drill-summary.md`
Ready-to-paste drill note template
Use this template in the issue, incident note, or drill document so the first run is captured in a uniform shape:
## Spanner DR Drill Summary
- Drill date:
- Operators:
  - Incident commander:
  - Infra operator:
  - Application verifier:
  - Evidence owner:
- Canonical drill note / issue link:
- Evidence artifact directory / attachment set:
- Source database: gateway-db-staging
- Restored database:
- Selected backup:
- Backup create time:
- Restore operation id:
- Restore completion time:
- Cutover completion time:
- Rollback completion time:
- Did observed RPO meet target?
- Did observed RTO meet target?
- Closure decision (`met target` / `follow-up required`):
- Follow-up issue link:
- Next rehearsal due by:
- Smoke transaction reference:
- Follow-up actions:
### Attached evidence
- [ ] backup selection output
- [ ] restore completion output
- [ ] ddl diff
- [ ] row-count output
- [ ] freshness output
- [ ] cutover env-var output
- [ ] smoke transaction proof
- [ ] rollback output
- [ ] restored database deletion output
Post-restore verification log
Record each verification step as it happens. This is the closeout log that shows the restored database was usable, the cutover succeeded, and the rollback path was exercised.
| Field | Value |
|---|---|
| Drill date | |
| Operator | |
| Incident commander | |
| Evidence owner | |
| Source database | gateway-db-staging |
| Backup name | |
| Backup create time | |
| Restored database name | |
| Restore operation id | |
| Restore start time | |
| Restore complete time | |
| Validation start time | |
| Validation complete time | |
| Staging service switched | |
| Cutover complete time | |
| Smoke transaction id / reference | |
| Rollback complete time | |
| Restored database deleted at | |
| Closure decision | |
| Follow-up issue | |
| Next rehearsal due by | |
| Follow-up actions | |
After the first drill, replace the placeholder values in the RTO/RPO template with the measured numbers and keep the verification log attached to the issue. If the observed duration exceeds the target, file a follow-up ticket to reduce restore time and record the accepted operational bound in the same drill note.
Drill cadence
After the first drill, quarterly thereafter. A restore procedure that has not been rehearsed in the last three months is untrusted — the Rehearsal plan section should be re-run and the RTO/RPO capture template refreshed each quarter.
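When closing each drill note, the "next rehearsal due by" field can be stamped mechanically. This assumes GNU `date` (as on Cloud Shell or a typical Linux operator host):

```shell
# GNU date; on macOS use `date -v+3m +%Y-%m-%d` instead.
NEXT_DRILL_DUE=$(date -d "+3 months" +%Y-%m-%d)
echo "Next rehearsal due by: ${NEXT_DRILL_DUE}"
```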
Related
- Secrets rotation policy — if the Spanner instance is recreated during recovery, the `gateway-database-url` secret must be rotated.
- Alert runbook index — Spanner-originated Cloud Run 5xx bursts are triaged via the SLO fast-burn and slow-burn runbooks.
- `infra/terraform/modules/spanner/main.tf` — source of truth for backup schedule, retention, and CMEK config.