Deploy Rollback Runbook

When a production release shows an elevated error rate, elevated latency, or a regression that cannot be fixed by a config change, flip Cloud Run traffic back to the last-known-good revision.

The approval gate on the deploy-production job in deploy.yml (see INF-02 / #411) means a human approved the bad deploy. Either that approver or a different operator is expected to run this runbook.

1. Identify the last-known-good revision

SERVICE=gateway-<service> # e.g. gateway-processing
PROJECT=peak-gateway-prod
REGION=us-central1

gcloud run revisions list \
--service "${SERVICE}" \
--region "${REGION}" \
--project "${PROJECT}" \
--format='table(metadata.name,status.conditions[0].status,metadata.creationTimestamp,spec.containers[0].image)' \
--limit=20

The most recent revision is the one currently serving traffic (the bad one). The previous revision (one row down) is usually the rollback target. Confirm that its image tag matches the previous release before flipping traffic.
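Picking the row by eye is error-prone during an incident. As a minimal sketch, assuming the listing is requested newest first and the newest revision is the bad one, the rollback candidate could be extracted with two small helpers. The function names list_revisions_newest_first and previous_revision are hypothetical and not part of this repository.

```shell
# Hypothetical helper: print revision names for SERVICE, newest first.
# Uses the SERVICE, REGION, and PROJECT variables set in step 1.
list_revisions_newest_first() {
  gcloud run revisions list \
    --service "${SERVICE}" \
    --region "${REGION}" \
    --project "${PROJECT}" \
    --sort-by='~metadata.creationTimestamp' \
    --format='value(metadata.name)'
}

# Hypothetical helper: read newest-first revision names on stdin and print
# the second one (the candidate rollback target). Exits non-zero if fewer
# than two revisions are listed.
previous_revision() {
  awk 'NR == 2 { print; found = 1 } END { exit found ? 0 : 1 }'
}
```

With both helpers loaded, PREVIOUS="$(list_revisions_newest_first | previous_revision)" would populate the variable used in step 2. The image tag check still applies; the helper only automates the "one row down" lookup.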

2. Flip traffic

PREVIOUS=gateway-<service>-<prev-revision-suffix>

gcloud run services update-traffic "${SERVICE}" \
--to-revisions="${PREVIOUS}=100" \
--region "${REGION}" \
--project "${PROJECT}"

Cloud Run will drain connections to the bad revision within seconds. Monitor the 5xx rate in Cloud Monitoring; it should collapse in under a minute. If it does not, something else is wrong (a downstream dependency, a schema change, or an external API outage) and a revision flip will not help.

3. Confirm rollback

# Traffic should now read 100% on the previous revision.
gcloud run services describe "${SERVICE}" \
--region "${REGION}" \
--project "${PROJECT}" \
--format='value(status.traffic)'

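# Smoke-probe the public endpoint if the service is externally routed.
# The path segment is assumed to be the short service name, i.e. SERVICE
# without its "gateway-" prefix (gateway-processing -> processing).
curl -fsS "https://api.peakgateway.co/${SERVICE#gateway-}/health"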
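Reading the raw status.traffic output under pressure is easy to get wrong. The following is a sketch of a scripted check, assuming jq is installed on the operator's machine; traffic_split and serves_all_traffic are hypothetical helper names, not existing tooling.

```shell
# Hypothetical helper: print "<revision> <percent>" for each traffic entry.
# Assumes jq is available; entries without a percent are treated as 0.
traffic_split() {
  gcloud run services describe "${SERVICE}" \
    --region "${REGION}" \
    --project "${PROJECT}" \
    --format=json |
    jq -r '.status.traffic[] | "\(.revisionName) \(.percent // 0)"'
}

# Hypothetical helper: read "<revision> <percent>" lines on stdin and succeed
# only if the revision named in $1 holds 100% of traffic.
serves_all_traffic() {
  awk -v target="$1" '
    $1 == target && $2 == 100 { ok = 1 }
    END { exit ok ? 0 : 1 }
  '
}
```

Running traffic_split | serves_all_traffic "${PREVIOUS}" && echo "rollback confirmed" would then give a yes/no answer. Keeping the parsing separate from the gcloud call means the check can be exercised against canned input without touching production.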

4. Follow-up

  • File an issue with the Git SHA of the bad revision, the symptom observed, and the image tag that was rolled back.
  • If the revision history is long, prune older revisions with gcloud run revisions delete to stay under the 1000-revision per-service cap. Never delete the bad revision until the postmortem is complete.
  • If the bad revision shipped a schema change, coordinate with the database migration owner before rolling forward again.
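The pruning rule above (remove old revisions, but keep the bad one for the postmortem) can be sketched as a filter over a newest-first list of revision names. This is an illustration under assumptions: prune_candidates is a hypothetical helper, and the number of recent revisions to keep is left to the operator.

```shell
# Hypothetical helper: read newest-first revision names on stdin and print
# the ones eligible for deletion: everything older than the newest $1
# revisions, excluding the bad revision named in $2.
prune_candidates() {
  awk -v keep="$1" -v bad="$2" 'NR > keep && $0 != bad { print }'
}

# Example wiring (not executed here). BAD is a placeholder for the bad
# revision's name, and 50 is an arbitrary number of revisions to keep:
#
#   gcloud run revisions list --service "${SERVICE}" --region "${REGION}" \
#     --project "${PROJECT}" --sort-by='~metadata.creationTimestamp' \
#     --format='value(metadata.name)' |
#     prune_candidates 50 "${BAD}"
```

Review the printed names before passing any of them to gcloud run revisions delete; the filter only narrows the list and does not delete anything itself.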

Prevention notes

  • The production deploy gate (INF-02) requires manual approval on the production GitHub environment. Use that approval step to pause and confirm the staging smoke passed cleanly.
  • The staging-smoke job in deploy.yml probes the public staging URLs (status, auth OAuth metadata, online-txn health) before the prod deploy runs. Any non-200 response blocks promotion.