Distributed Tracing Verification
This runbook is used to verify end-to-end distributed tracing is working
Alert runbooks
5 items
Pub/Sub DLQ Drain Runbook
Use this runbook when the "Pub/Sub DLQ has messages" alert fires. Any message in a DLQ represents a genuine processing failure that the subscriber could not recover from after 5 delivery attempts (20 for webhook-delivery, which tolerates more transient 5xx retries from merchant endpoints before quarantining).
On-Call Rotation
This document is authoritative for who gets paged when a Gateway alert fires, how quickly they must respond, and how escalation works. Changes to the rotation (vacations, new engineers, handoff time) are made here first and mirrored into PagerDuty.
Internal-Only Services
Peak Gateway uses a two-tier security model. Two services — processing and
Cloud Storage Bucket IAM Audit
Checklist for verifying every google_storage_bucket in
Secrets rotation policy
Source of truth for every secret the gateway holds: name, scope, storage location, rotation cadence, and procedure. Closes SEC-09 from the launch-readiness audit.
Deploy Rollback Runbook
When a production release exhibits elevated error rate, latency, or a
Spanner disaster recovery
Closes INF-09 from the launch-readiness audit. This runbook is the canonical procedure for restoring the gateway Spanner database from backup, the stated RPO / RTO targets, and the plan for the first drill.
SLO error budgets
Gateway services ship under a 99.9% availability SLO over a rolling 30-day window. That leaves a 0.1% error budget — approximately 43 minutes 12 seconds of allowed bad time per service per 30 days. This doc explains what the budget means, what triggers a page, what happens when the budget runs out, and how resets work.
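The budget figure above can be checked with a few lines of arithmetic. This is a minimal sketch, not the gateway's alerting code: the function name `error_budget` and its parameters are hypothetical and chosen only for illustration.

```python
from datetime import timedelta
from fractions import Fraction


def error_budget(slo: str, window: timedelta) -> timedelta:
    """Allowed bad time for an availability SLO over a rolling window.

    `slo` is a decimal string such as "0.999" so the arithmetic stays exact.
    Hypothetical helper for illustration only.
    """
    bad_fraction = 1 - Fraction(slo)                      # 0.999 -> 1/1000
    window_us = window // timedelta(microseconds=1)       # window in whole microseconds
    return timedelta(microseconds=round(window_us * bad_fraction))


budget = error_budget("0.999", timedelta(days=30))
print(budget)  # 0:43:12
```

30 days is 43,200 minutes, and 0.1% of that is 43.2 minutes, which is the 43 minutes 12 seconds quoted above.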
Schema Migrations Runbook
db/migrations/ is intentionally empty after the launch-cutover baseline
TransIT Certification Run Runbook
Audience: Anyone who needs to re-run the TransIT v6.2 cert suite (Case 00230402) against the sandbox or production merchant credentials.