Deploy Rollback Runbook
When a production release shows an elevated error rate, elevated latency, or a regression that a config change cannot fix, flip Cloud Run traffic back to the last-known-good revision.
The approval gate on the deploy.yml deploy-production job (see
INF-02 / #411) means a human approved the bad deploy. That operator, or
a different one, is expected to run this runbook.
1. Identify the last-known-good revision
SERVICE=gateway-<service> # e.g. gateway-processing
PROJECT=peak-gateway-prod
REGION=us-central1
gcloud run revisions list \
--service "${SERVICE}" \
--region "${REGION}" \
--project "${PROJECT}" \
--format='table(metadata.name,status.conditions[0].status,metadata.creationTimestamp,spec.containers[0].image)' \
--limit=20
The most recent revision is normally the one currently serving traffic (the bad one). The previous revision (one row down) is usually the rollback target. Before flipping traffic, confirm that its image tag matches the previous release.
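Reading the table by eye is usually enough, but the same choice can be scripted. The following is a minimal sketch, assuming revision names arrive newest-first on standard input and that the name of the bad revision is already known. `pick_previous` and the `BAD` variable are hypothetical names introduced for illustration, not part of gcloud or of this repository.

```shell
# pick_previous CURRENT: read revision names (newest first, one per line)
# and print the first name listed after CURRENT. Hypothetical helper.
pick_previous() {
  awk -v current="$1" '
    found { print; exit }          # first line after the match
    $0 == current { found = 1 }    # remember that CURRENT was seen
  '
}

# Usage sketch (not run here; assumes SERVICE, REGION, PROJECT from above
# and that BAD holds the name of the revision being rolled back):
#   gcloud run revisions list --service "${SERVICE}" --region "${REGION}" \
#     --project "${PROJECT}" --sort-by='~metadata.creationTimestamp' \
#     --format='value(metadata.name)' | pick_previous "${BAD}"
```

The helper prints nothing if the bad revision is not in the list. Its result is only a candidate: the image tag still needs to be checked against the previous release.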
2. Flip traffic
PREVIOUS=gateway-<service>-<prev-revision-suffix>
gcloud run services update-traffic "${SERVICE}" \
--to-revisions="${PREVIOUS}=100" \
--region "${REGION}" \
--project "${PROJECT}"
Cloud Run will drain connections to the bad revision within seconds. Monitor the 5xx rate in Cloud Monitoring; it should drop back toward baseline in under a minute. If it doesn't, something else is wrong (downstream dependency, schema change, external API outage) and a revision flip won't help.
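For externally routed services, a rough client-side cross-check of the 5xx trend can complement the Cloud Monitoring chart. This is a sketch, not a replacement for the dashboard. `count_5xx` is a hypothetical helper, and `HEALTH_URL` in the usage comment is an assumed variable holding a health endpoint such as the one probed in step 3.

```shell
# count_5xx: read HTTP status codes (one per line) and print
# "<5xx count>/<total>". Hypothetical helper for a quick client-side check.
count_5xx() {
  awk '
    /^5[0-9][0-9]$/ { bad++ }      # any 5xx status counts as a failure
    NF { total++ }                 # ignore blank lines
    END { printf "%d/%d\n", bad, total }
  '
}

# Usage sketch (not run here): probe once per second for 30 seconds.
#   for i in $(seq 1 30); do
#     curl -s -o /dev/null -w '%{http_code}\n' "${HEALTH_URL}"  # HEALTH_URL is an assumption
#     sleep 1
#   done | count_5xx
```

A count that stays well above zero after the first minute points to one of the other causes listed above rather than to the revision itself.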
3. Confirm rollback
# Traffic should now read 100% on the previous revision.
gcloud run services describe "${SERVICE}" \
--region "${REGION}" \
--project "${PROJECT}" \
--format='value(status.traffic)'
# Smoke-probe the public endpoint if the service is externally routed.
# ${SERVICE#gateway-} strips the "gateway-" prefix to get the short service name.
curl -fsS "https://api.peakgateway.co/${SERVICE#gateway-}/health"
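The traffic check can also be made pass/fail instead of read by eye. The following is a minimal sketch: `traffic_is_full` is a hypothetical helper that expects one "revision percent" pair per line, and the `--flatten`/`--format` projection in the usage comment is an assumption about the shape of `status.traffic` that should be checked against real output before being relied on.

```shell
# traffic_is_full REVISION: read "<revision> <percent>" lines (whitespace or
# tab separated) and succeed only if REVISION carries 100 percent.
# Hypothetical helper.
traffic_is_full() {
  awk -v want="$1" '
    $1 == want && $2 == 100 { ok = 1 }
    END { exit (ok ? 0 : 1) }
  '
}

# Usage sketch (not run here; assumes SERVICE, REGION, PROJECT, PREVIOUS
# from the earlier steps):
#   gcloud run services describe "${SERVICE}" \
#     --region "${REGION}" --project "${PROJECT}" \
#     --flatten='status.traffic[]' \
#     --format='value(status.traffic.revisionName,status.traffic.percent)' \
#     | traffic_is_full "${PREVIOUS}" && echo "rollback confirmed"
```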
4. Follow-up
- File an issue with the Git SHA of the bad revision, the symptom observed, and the image tag that was rolled back.
- If the revision history is long, prune older revisions via gcloud run revisions delete to avoid hitting the 1000-revision per-service cap, but NEVER delete the bad revision until the postmortem is complete.
- If the bad revision shipped a schema change, coordinate with the database migration owner before rolling forward again.
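The pruning rule above (drop old revisions, never the bad one) can be sketched as a filter that only prints the delete commands for review. `prune_candidates`, the keep count of 50, and the `BAD` variable are hypothetical choices made for illustration. Nothing is deleted unless an operator runs the printed commands.

```shell
# prune_candidates KEEP BAD: read revision names (newest first, one per
# line), skip the newest KEEP entries and the BAD revision, print the rest.
# Hypothetical helper.
prune_candidates() {
  awk -v keep="$1" -v bad="$2" 'NR > keep && $0 != bad'
}

# Dry-run usage sketch (not run here; assumes SERVICE, REGION, PROJECT from
# step 1 and that BAD names the revision under postmortem):
#   gcloud run revisions list --service "${SERVICE}" --region "${REGION}" \
#     --project "${PROJECT}" --sort-by='~metadata.creationTimestamp' \
#     --format='value(metadata.name)' \
#     | prune_candidates 50 "${BAD}" \
#     | while read -r rev; do
#         echo gcloud run revisions delete "${rev}" \
#           --region "${REGION}" --project "${PROJECT}"
#       done
```

Printing the commands instead of executing them keeps a human in the loop, which matches the caution about preserving the bad revision until the postmortem is complete.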
Prevention notes
- The production deploy gate (INF-02) requires manual approval on the production GitHub environment. Use that approval step to pause and confirm the staging smoke passed cleanly.
- The staging-smoke job in deploy.yml probes the public staging URLs (status, auth OAuth metadata, online-txn health) before the prod deploy runs. Any non-200 blocks promotion.
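The "any non-200 blocks promotion" rule can be illustrated with a small gate function. This is a sketch of the idea only, not the actual contents of the staging-smoke job: `probe_status` and `smoke_gate` are hypothetical names, and the URLs in the usage comment are placeholders because the real staging URLs are not listed in this runbook.

```shell
# probe_status URL: print the HTTP status code for URL
# (curl reports 000 when no response is received).
probe_status() {
  curl -s -o /dev/null -w '%{http_code}' "$1"
}

# smoke_gate PROBE URL...: run PROBE on each URL and return non-zero if any
# response is not 200. Hypothetical stand-in for the staging-smoke logic.
smoke_gate() {
  probe="$1"; shift
  failed=0
  for url in "$@"; do
    code=$("$probe" "$url")
    if [ "$code" != "200" ]; then
      echo "FAIL ${code} ${url}" >&2
      failed=1
    fi
  done
  return "$failed"
}

# Usage sketch (not run here; URLs are placeholders):
#   smoke_gate probe_status "https://<staging-host>/status" \
#     "https://<staging-host>/health" || exit 1
```

Passing the probe as a parameter lets the gate be exercised with a fake probe, so the blocking logic can be checked without touching staging.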