# Alert runbook index
Every Cloud Monitoring alert policy defined in `infra/terraform/modules/monitoring/*.tf` has a matching runbook in this directory. When a pager fires, find the row below and click through.

The canonical public URL for a runbook is `https://support.pinpointgateway.com/docs/ops/runbooks/<slug>`. That URL belongs in the alert policy's `documentation.content` block so the Cloud Monitoring notification carries the link inline.
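As a hedged sketch of that wiring, using the Pub/Sub DLQ policy from the catalog below — the subscription id, duration, and aggregation are illustrative, not copied from the real module:

```hcl
resource "google_monitoring_alert_policy" "pubsub_dlq_depth" {
  display_name = "Pub/Sub DLQ has messages: orders-dlq"
  combiner     = "OR"
  severity     = "ERROR"

  conditions {
    display_name = "Undelivered messages on the dead-letter subscription"
    condition_threshold {
      # num_undelivered_messages is the standard Pub/Sub backlog metric;
      # the subscription id here is a made-up example.
      filter          = "resource.type = \"pubsub_subscription\" AND resource.labels.subscription_id = \"orders-dlq\" AND metric.type = \"pubsub.googleapis.com/subscription/num_undelivered_messages\""
      comparison      = "COMPARISON_GT"
      threshold_value = 0
      duration        = "300s"

      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_MAX"
      }
    }
  }

  # The point of this sketch: documentation.content rides along with the
  # notification, so the page links straight to the runbook.
  documentation {
    mime_type = "text/markdown"
    content   = "Runbook: https://support.pinpointgateway.com/docs/ops/runbooks/pubsub-dlq-depth"
  }
}
```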
## Catalog
| Alert policy (Terraform resource) | Display name | Severity | Runbook |
|---|---|---|---|
| `google_monitoring_alert_policy.slo_fast_burn` | `<service>` SLO fast burn (10x / 5 min) | WARNING (operator-inferred; no explicit severity in TF) | [slo-fast-burn](https://support.pinpointgateway.com/docs/ops/runbooks/slo-fast-burn) |
| `google_monitoring_alert_policy.slo_slow_burn` | `<service>` SLO slow burn (2x / 1 hr) | WARNING (operator-inferred) | [slo-slow-burn](https://support.pinpointgateway.com/docs/ops/runbooks/slo-slow-burn) |
| `google_monitoring_alert_policy.pubsub_dlq_depth` | Pub/Sub DLQ has messages: `<subscription>` | ERROR | [pubsub-dlq-depth](https://support.pinpointgateway.com/docs/ops/runbooks/pubsub-dlq-depth) |
| `google_monitoring_alert_policy.webhook_dlq_depth_persistent` | Webhook DLQ depth > 0 for any organization | WARNING | [webhook-dlq-depth](https://support.pinpointgateway.com/docs/ops/runbooks/webhook-dlq-depth) |
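The last row differs from the raw Pub/Sub alert in that it watches a custom metric exported by the management service. A minimal sketch of its condition, assuming a hypothetical metric path, resource type, and organization label:

```hcl
# Sketch of the conditions block inside
# google_monitoring_alert_policy.webhook_dlq_depth_persistent.
# The custom-metric path, resource type, and label name below are
# assumptions, not confirmed against the management service.
conditions {
  display_name = "Webhook DLQ depth > 0 for any organization"
  condition_threshold {
    filter          = "metric.type = \"custom.googleapis.com/webhook_dlq_depth\" AND resource.type = \"global\""
    comparison      = "COMPARISON_GT"
    threshold_value = 0
    duration        = "600s"

    aggregations {
      alignment_period     = "60s"
      per_series_aligner   = "ALIGN_MAX"
      cross_series_reducer = "REDUCE_MAX"
      # Keep one series per organization so the alert is keyed per org.
      group_by_fields      = ["metric.labels.organization"]
    }
  }
}
```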
Each runbook covers:
- What the alert means — what is broken and what the user-visible impact is.
- First three diagnostic steps — specific greps, dashboards, and revision checks.
- Rollback criteria — what evidence justifies flipping Cloud Run back to the previous revision.
- Escalation path — who gets paged if the first responder cannot resolve within the stated window.
## When a runbook does not exist
If an alert fires and there is no runbook here, treat it as a process bug: file a ticket labelled ops / runbook-gap, and in the meantime follow the generic incident-response path — page the on-call engineer, snapshot the dashboard, and escalate after 15 minutes without progress.
Planned operator drills are not always alert-driven. The canonical example is the Spanner disaster-recovery runbook, which is used for manual restore rehearsal and evidence capture even when no alert is active.
For those manual drills, treat the runbook page itself as procedure only, not as proof of readiness. Closure-grade evidence still means:
- one canonical drill note or issue
- one durable attachment set or artifact directory
- one operator summary saying whether the measured target was met or a follow-up issue is required
## Cross-service context
- SLO burn-rate alerts use Cloud Monitoring's `select_slo_burn_rate` function over the per-service availability SLO (`google_monitoring_slo.availability`, target 0.999 over 30 days). Both fast and slow variants fire against the same SLO resource with different windows; see the condition sketch after this list.
- The Pub/Sub DLQ-depth alert is keyed per dead-letter subscription. For the message-triage procedure (replay vs. quarantine), see the deeper Pub/Sub DLQ drain runbook.
- The webhook DLQ-depth alert is keyed per organization via the management service's `webhook_dlq_depth` custom metric and is distinct from the raw Pub/Sub dead-letter alert.
- Notification channels are email-only today (`google_monitoring_notification_channel.email`). Pager-rotation wiring is tracked separately under OBS-04 and is not yet in place.
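A minimal sketch of the fast-burn condition using that selector — the window and threshold come from the display name in the catalog, while the lookback format and nesting are illustrative:

```hcl
# Sketch of the conditions block inside google_monitoring_alert_policy.slo_fast_burn.
conditions {
  display_name = "SLO fast burn (10x / 5 min)"
  condition_threshold {
    # select_slo_burn_rate(SLO_NAME, LOOKBACK) returns the error-budget
    # burn rate over the lookback window; > 10 means the service is
    # burning budget at 10x the sustainable rate.
    filter          = "select_slo_burn_rate(\"${google_monitoring_slo.availability.name}\", \"300s\")"
    comparison      = "COMPARISON_GT"
    threshold_value = 10
    duration        = "0s"
  }
}
```

The slow-burn variant would swap in a `"3600s"` lookback and a threshold of 2, matching its display name; both reference the same `google_monitoring_slo.availability` resource.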