# Alert runbook index
Every Cloud Monitoring alert policy defined in `infra/terraform/modules/monitoring/*.tf` has a matching runbook in this directory. When a pager fires, find the row below and click through.

The canonical public URL for a runbook is `https://support.pinpointgateway.com/docs/ops/runbooks/<slug>`. That URL belongs in the alert policy's `documentation.content` block so the Cloud Monitoring notification carries the link inline.
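As a hedged sketch of that wiring, using the Pub/Sub DLQ policy from the catalog below — the subscription id, duration, and aggregation are illustrative, not copied from the real module:

```hcl
resource "google_monitoring_alert_policy" "pubsub_dlq_depth" {
  display_name = "Pub/Sub DLQ has messages: orders-dlq"
  combiner     = "OR"
  severity     = "ERROR"

  conditions {
    display_name = "Undelivered messages on the dead-letter subscription"
    condition_threshold {
      # num_undelivered_messages is the standard Pub/Sub backlog metric;
      # the subscription id here is a made-up example.
      filter          = "resource.type = \"pubsub_subscription\" AND resource.labels.subscription_id = \"orders-dlq\" AND metric.type = \"pubsub.googleapis.com/subscription/num_undelivered_messages\""
      comparison      = "COMPARISON_GT"
      threshold_value = 0
      duration        = "300s"

      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_MAX"
      }
    }
  }

  # The point of this sketch: documentation.content rides along with the
  # notification, so the page links straight to the runbook.
  documentation {
    mime_type = "text/markdown"
    content   = "Runbook: https://support.pinpointgateway.com/docs/ops/runbooks/pubsub-dlq-depth"
  }
}
```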
## Catalog
| Alert policy (Terraform resource) | Display name | Severity | Runbook |
|---|---|---|---|
| `google_monitoring_alert_policy.slo_fast_burn` | `<service>` SLO fast burn (10x / 5 min) | WARNING (operator-inferred; no explicit severity in TF) | [slo-fast-burn](https://support.pinpointgateway.com/docs/ops/runbooks/slo-fast-burn) |
| `google_monitoring_alert_policy.slo_slow_burn` | `<service>` SLO slow burn (2x / 1 hr) | WARNING (operator-inferred) | [slo-slow-burn](https://support.pinpointgateway.com/docs/ops/runbooks/slo-slow-burn) |
| `google_monitoring_alert_policy.pubsub_dlq_depth` | Pub/Sub DLQ has messages: `<subscription>` | ERROR | [pubsub-dlq-depth](https://support.pinpointgateway.com/docs/ops/runbooks/pubsub-dlq-depth) |
| `google_monitoring_alert_policy.webhook_dlq_depth_persistent` | Webhook DLQ depth > 0 for any organization | WARNING | [webhook-dlq-depth](https://support.pinpointgateway.com/docs/ops/runbooks/webhook-dlq-depth) |
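The last row differs from the raw Pub/Sub alert in that it watches a custom metric exported by the management service. A minimal sketch of its condition, assuming a hypothetical metric path, resource type, and organization label:

```hcl
# Sketch of the conditions block inside
# google_monitoring_alert_policy.webhook_dlq_depth_persistent.
# The custom-metric path, resource type, and label name below are
# assumptions, not confirmed against the management service.
conditions {
  display_name = "Webhook DLQ depth > 0 for any organization"
  condition_threshold {
    filter          = "metric.type = \"custom.googleapis.com/webhook_dlq_depth\" AND resource.type = \"global\""
    comparison      = "COMPARISON_GT"
    threshold_value = 0
    duration        = "600s"

    aggregations {
      alignment_period     = "60s"
      per_series_aligner   = "ALIGN_MAX"
      cross_series_reducer = "REDUCE_MAX"
      # Keep one series per organization so the alert is keyed per org.
      group_by_fields      = ["metric.labels.organization"]
    }
  }
}
```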
Each runbook covers:
- What the alert means — what is broken and what the user-visible impact is.
- First three diagnostic steps — specific greps, dashboards, and revision checks.
- Rollback criteria — what evidence justifies flipping Cloud Run back to the previous revision.
- Escalation path — who gets paged if the first responder cannot resolve within the stated window.
## When a runbook does not exist
If an alert fires and there is no runbook here, treat it as a process bug: file a ticket labelled ops / runbook-gap, and in the meantime follow the generic incident-response path — page the on-call engineer, snapshot the dashboard, and escalate after 15 minutes without progress.
Planned operator drills are not always alert-driven. The canonical example is the Spanner disaster-recovery runbook, which is used for manual restore rehearsal and evidence capture even when no alert is active.
For those manual drills, treat the runbook page itself as procedure only, not as proof of readiness. Closure-grade evidence still means:
- one canonical drill note or issue
- one durable attachment set or artifact directory
- one operator summary saying whether the measured target was met or a follow-up issue is required
## Cross-service context
- SLO burn-rate alerts use Cloud Monitoring's `select_slo_burn_rate` function over the per-service availability SLO (`google_monitoring_slo.availability`, target 0.999 over 30 days). Both fast and slow variants fire against the same SLO resource with different windows; see the condition sketch after this list.
- The Pub/Sub DLQ-depth alert is keyed per dead-letter subscription. For the message-triage procedure (replay vs. quarantine), see the deeper Pub/Sub DLQ drain runbook.
- The webhook DLQ-depth alert is keyed per organization via the management service's `webhook_dlq_depth` custom metric and is distinct from the raw Pub/Sub dead-letter alert.
- Notification channels are email-only today (`google_monitoring_notification_channel.email`). Pager-rotation wiring is tracked separately under OBS-04 and is not yet in place.
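A minimal sketch of the fast-burn condition using that selector — the window and threshold come from the display name in the catalog, while the lookback format and nesting are illustrative:

```hcl
# Sketch of the conditions block inside google_monitoring_alert_policy.slo_fast_burn.
conditions {
  display_name = "SLO fast burn (10x / 5 min)"
  condition_threshold {
    # select_slo_burn_rate(SLO_NAME, LOOKBACK) returns the error-budget
    # burn rate over the lookback window; > 10 means the service is
    # burning budget at 10x the sustainable rate.
    filter          = "select_slo_burn_rate(\"${google_monitoring_slo.availability.name}\", \"300s\")"
    comparison      = "COMPARISON_GT"
    threshold_value = 10
    duration        = "0s"
  }
}
```

The slow-burn variant would swap in a `"3600s"` lookback and a threshold of 2, matching its display name; both reference the same `google_monitoring_slo.availability` resource.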