On-Call Rotation & Paging Policy

This document is authoritative for who gets paged when a Gateway alert fires, how quickly they must respond, and how escalation works. Changes to the rotation (vacations, new engineers, handoff time) are made here first and mirrored into PagerDuty.

Why this exists

Per the launch-readiness audit (OBS-04), the only configured notification channel before this policy existed was email — which is not on-call paging. Critical alerts now route through PagerDuty with SMS as a deep-sleep fallback. The Terraform in infra/terraform/modules/monitoring/main.tf wires the channels; this document wires the humans.

Rotation

| Tier | Person | Contact | Notes |
| --- | --- | --- | --- |
| Primary | Jack Nelson | PagerDuty user jack@myriadnetworks.com + verified SMS | First page target during the current window. |
| Secondary | Co-founder | PagerDuty user (add) + verified SMS | Auto-escalation target after 15 min non-ack. |
| Tertiary | All-hands SMS broadcast | All verified numbers | Deep-sleep safety net, 30 min after initial page. |

Handoff: 09:00 America/New_York, daily. Either engineer can extend their shift by notifying the other on Slack; do not carry a shift over silently.

Updating the rotation

Until the team grows past two engineers, this is a manual 50/50 rotation. When hiring a third on-call engineer, move the schedule into PagerDuty's rotation primitives and reduce this table to a pointer to the PagerDuty schedule URL.

What gets paged

Only these policies page the on-call channel (PagerDuty + SMS fallback). Everything else lands in the email channel and is reviewed during business hours.

| Alert policy | Severity | Source |
| --- | --- | --- |
| gateway-<service>-slo-fast-burn (10× burn / 5 min) | P1 | modules/monitoring/main.tf |
| gateway-<service>-slo-slow-burn (2× burn / 1 hr) | P2 | modules/monitoring/main.tf |
| gateway-pubsub-dlq-depth (any DLQ non-empty ≥ 5 min) | P1 | modules/monitoring/main.tf |
| gateway-error-budget-low (budget remaining < 10%) | P2 | modules/monitoring/dashboards.tf (OBS-11) |
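To sanity-check that live routing matches this table, the policy-to-channel wiring can be inspected from the CLI. A sketch only, assuming the pinpoint-gateway project used elsewhere in this document and the alpha gcloud monitoring surface:

```shell
# List alert policies alongside the notification channels they route to.
# Channel resource names map to the PagerDuty / SMS / email channels
# defined in modules/monitoring/main.tf.
gcloud alpha monitoring policies list \
  --project=pinpoint-gateway \
  --format="table(displayName, notificationChannels)"
```

Any policy in the table above that is missing the PagerDuty channel here is a routing regression.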

Runbooks:

  • SLO burn: see the widgets on the SLO Burn Cloud Monitoring dashboard; the full SLO Error Budget runbook lands with OBS-11.
  • DLQ: Pub/Sub DLQ Drain Runbook.
  • Data-plane restore / Spanner-originated storage incident: Spanner Disaster Recovery.
  • Error-budget low: SLO Error Budget runbook ("budget exhausted" section) lands with OBS-11.

The Spanner DR runbook also covers a planned quarterly restore rehearsal. That drill is manual and does not page by itself, but it still requires named IC/infra/verifier/evidence-owner roles and a durable drill note with attached evidence before the lane is treated as closure-ready.

Responsibilities

Incident commander (IC)

The first human to acknowledge the page becomes IC until they explicitly hand off. The IC:

  1. Acknowledges in PagerDuty within 15 minutes. Acknowledging silences the 15-min re-page but does not stop the 30-min SMS fallback — resolve the incident or explicitly snooze the fallback once you're actively working it.
  2. Opens a Slack thread in #incidents. First message states: alert, suspected blast radius, and "I am IC" so secondary stands down.
  3. Drives triage through to resolution. Follow the relevant runbook. When uncertain, err toward a rollback or feature-flag kill — you can always reopen the investigation after traffic is safe.
  4. Writes the postmortem. Due 72 h after incident close. Blameless, rooted in the 5-whys, with explicit follow-up Issues filed against the relevant service.
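For step 1's "explicitly snooze the fallback", one option is a Cloud Monitoring snooze scoped to the fallback policy. A sketch only — the policy ID and time window below are placeholders, and this assumes the SMS fallback is implemented as its own alert policy:

```shell
# Placeholder policy ID and times: look up the real fallback policy ID
# (e.g. via `gcloud alpha monitoring policies list`) before running.
gcloud monitoring snoozes create \
  --project=pinpoint-gateway \
  --display-name="Incident in progress: silence SMS fallback" \
  --criteria-policies="projects/pinpoint-gateway/alertPolicies/1234567890" \
  --start-time="2026-04-12T14:00:00Z" \
  --end-time="2026-04-12T16:00:00Z"
```

Keep the window short; a forgotten snooze defeats the deep-sleep safety net.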

Secondary

Stands by until escalation fires. When paged:

  • If primary has Slack'd "I am IC", acknowledge and stay passive (you're a backup pair of hands).
  • If primary has not responded within 15 minutes, take IC. Do not wait for the 30-minute SMS re-page.

Escalation timers

| Time | Action |
| --- | --- |
| T+0 | PagerDuty pages primary |
| T+15 min | PagerDuty re-pages and auto-escalates to secondary |
| T+30 min | Cloud Monitoring SMS fallback fires to both primary and secondary |
| T+60 min | If still un-ack'd, PagerDuty repeats the escalation cycle |

The 30-minute SMS fallback is the "two founders asleep" safety net — it bypasses the PagerDuty mobile-app dependency entirely. The numbers live in the sms_fallback_numbers Terraform variable and must be verified through the Cloud Monitoring console before they can deliver.
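Verification status can be audited from the CLI rather than clicking through the console. A sketch, assuming the beta gcloud monitoring surface and the project ID used elsewhere in this document:

```shell
# Show SMS notification channels and whether their numbers are verified.
# Only channels with verificationStatus VERIFIED can deliver the T+30 fallback.
gcloud beta monitoring channels list \
  --project=pinpoint-gateway \
  --filter='type="sms"' \
  --format="table(displayName, labels.number, verificationStatus)"
```

Run this after any change to sms_fallback_numbers: Terraform can create the channel, but the verification handshake is manual.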

PagerDuty service key rotation

The integration key is stored in GCP Secret Manager (secret id: gateway-pagerduty-service-key). To rotate:

  1. Create a new integration in the PagerDuty service (Settings → Integrations → Add).
  2. Add the new key as a new version of the Secret Manager secret:

     ```shell
     echo -n '<new-key>' | gcloud secrets versions add gateway-pagerduty-service-key \
       --data-file=- --project=pinpoint-gateway
     ```
  3. Run terraform apply in infra/terraform/ — the data source picks up the latest secret version and the notification channel is updated in place.
  4. Confirm a test alert routes through the new integration, then delete the old integration in PagerDuty.
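For step 4, one way to confirm routing without waiting for a real alert is to send a manual event through PagerDuty's public Events API v2. A sketch — it assumes the new integration was created as an Events API v2 integration, with its routing key in ROUTING_KEY:

```shell
# Trigger a low-severity test incident through the NEW integration key
# before deleting the old one. Resolve it in PagerDuty afterwards.
curl -sS -X POST https://events.pagerduty.com/v2/enqueue \
  -H 'Content-Type: application/json' \
  -d '{
        "routing_key": "'"${ROUTING_KEY}"'",
        "event_action": "trigger",
        "payload": {
          "summary": "Test page: service key rotation",
          "source": "key-rotation-drill",
          "severity": "info"
        }
      }'
```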

Never paste the service key into a tfvars file or commit it to version control.
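A cheap guard against the warning above is a pre-commit scan of tracked Terraform files. A sketch only — the regex is a heuristic for 32-hex-character keys near suggestive variable names, not PagerDuty's documented key format, so tune it if it false-positives:

```shell
# Fail if something that looks like a PagerDuty integration key is
# sitting in tracked *.tf / *.tfvars files. git grep exits 0 on a match,
# so a hit falls into the if-branch and we abort the commit.
if git grep -nE '(pagerduty|service_key|integration_key).*[0-9a-f]{32}' -- '*.tf' '*.tfvars'; then
  echo "Possible PagerDuty key committed to VCS" >&2
  exit 1
fi
```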

Environment coverage

| Environment | PagerDuty | SMS fallback | Email |
| --- | --- | --- | --- |
| Production | Yes | Yes | Yes |
| Staging | No | No | Yes |
| Preview / ephemeral | No | No | Yes |

Staging intentionally does not page humans — noisy staging alerts erode the signal from production pages. If an issue in staging warrants paging, promote it to an Issue and triage during business hours.

Planned drill ownership

Manual restore drills are still on-call work even when no alert fires. For the quarterly Spanner rehearsal:

  • The primary on-call engineer owns scheduling the drill window, or explicitly delegates it.
  • The drill is not complete until the evidence package and operator summary are attached in a durable note.
  • If the measured timings miss target, the same owner opens the follow-up issue before treating the lane as operationally complete.

References

  • Monitoring module: infra/terraform/modules/monitoring/main.tf
  • PagerDuty variable wiring: infra/terraform/monitoring_paging.tf
  • Audit finding: OBS-04 in docs/superpowers/specs/2026-04-12-launch-readiness-audit-design.md