SLO error budgets
Gateway services ship under a 99.9% availability SLO over a rolling 30-day window. That leaves a 0.1% error budget — approximately 43 minutes 12 seconds of allowed bad time per service per 30 days. This doc explains what the budget means, what triggers a page, what happens when the budget runs out, and how resets work.
Terraform source for alerts: `infra/terraform/modules/monitoring/slo_budget.tf`.
What the budget measures
For every service in the `services` map passed to the monitoring module (auth, processing, management, online_txn, merchant_onboarding, status), Cloud Monitoring tracks a single availability SLI against its Cloud Run service. The SLI uses the `basic_sli { availability {} }` signal — any request with a 5xx response (or a transport-level failure) counts as bad.
The budget is 1 − 0.999 = 0.001 (0.1%) of the rolling-window traffic volume.
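For orientation, a minimal `google_monitoring_slo` for one of these Cloud Run services could be sketched as below. This is not the module's actual code: the `for_each` shape, the `monitoring_service_id` lookup, and the variable names are assumptions, and the authoritative definitions live in `slo_budget.tf`.

```hcl
# Sketch only: names below (var.services, monitoring_service_id) are assumptions,
# not the monitoring module's real interface.
resource "google_monitoring_slo" "availability" {
  for_each = var.services                       # auth, processing, management, ...

  service             = each.value.monitoring_service_id  # Cloud Run's auto-created Monitoring service
  slo_id              = "${each.key}-availability"
  display_name        = "${each.key} availability 99.9% / rolling 30d"
  goal                = 0.999                   # leaves a 0.1% error budget
  rolling_period_days = 30

  basic_sli {
    availability {
      enabled = true                            # 5xx responses and transport failures count as bad
    }
  }
}
```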
| SLO goal | Error budget per 30 days (time equivalent over the full window) |
|---|---|
| 99.9% | ~43m 12s |
| 99.95% | ~21m 36s |
| 99.99% | ~4m 19s |
The budget recovers continuously: as bad minutes older than 30 days fall out of the rolling window, the remaining budget rises again. So a bad incident doesn't trap a service in a permanent feature-freeze — it just forces a cooldown until the old errors age out.
What pages on-call
Three alert policies watch each service's SLO. They escalate in urgency:
| Alert | Condition | Severity | Use for |
|---|---|---|---|
| Fast burn | Burn rate > 10× over 5 min | CRITICAL | Active incident — wake someone |
| Slow burn | Burn rate > 2× over 1 hour | CRITICAL | Persistent issue — wake someone |
| Error budget < 10% remaining | Budget fraction < 0.1 for 15 min sustained | WARNING | Feature-freeze trigger — don't wake, but act next morning |
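For orientation, a 10× burn rate against a 99.9% goal corresponds to roughly a 1% error rate sustained over the 5-minute lookback. Below is a hedged sketch of how a fast-burn policy can be expressed in Terraform, reusing the names from the SLO sketch above; resource names and `var.paging_channels` are illustrative, and the real policies live in the monitoring module's `main.tf`.

```hcl
# Illustrative only: resource and variable names are assumptions; the real
# burn-rate policies are defined in the monitoring module.
resource "google_monitoring_alert_policy" "fast_burn" {
  for_each = var.services

  display_name = "${each.key} SLO fast burn (>10x over 5m)"
  combiner     = "OR"
  severity     = "CRITICAL"

  conditions {
    display_name = "Burn rate > 10x over 5 minutes"
    condition_threshold {
      # select_slo_burn_rate(SLO_NAME, LOOKBACK) is Cloud Monitoring's SLO burn-rate selector.
      filter          = "select_slo_burn_rate(\"${google_monitoring_slo.availability[each.key].name}\", \"300s\")"
      comparison      = "COMPARISON_GT"
      threshold_value = 10
      duration        = "0s"                    # page as soon as the threshold is crossed
    }
  }

  notification_channels = var.paging_channels   # hypothetical variable for the paging channels
}
```

A slow-burn policy in this shape would differ only in the lookback (`"3600s"`) and the threshold (2).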
The budget-remaining alert is the new one (OBS-11). It fires with a 15-minute sustain to avoid paging on a transient dip, auto-closes after 7 days if the budget recovers, and routes to whatever channels are configured in `var.slo_budget_alert_channels`.
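A sketch of how the budget-remaining policy can be expressed, with the same caveat that the resource names are illustrative and the authoritative definition is in `slo_budget.tf`:

```hcl
# Illustrative only: the authoritative definition lives in slo_budget.tf.
resource "google_monitoring_alert_policy" "budget_remaining" {
  for_each = var.services

  display_name = "${each.key} SLO error budget < 10% remaining"
  combiner     = "OR"
  severity     = "WARNING"                      # feature-freeze trigger, not a page

  conditions {
    display_name = "Budget fraction < 0.1 for 15 minutes"
    condition_threshold {
      # select_slo_budget_fraction(SLO_NAME) returns the fraction of error budget remaining.
      filter          = "select_slo_budget_fraction(\"${google_monitoring_slo.availability[each.key].name}\")"
      comparison      = "COMPARISON_LT"
      threshold_value = 0.1
      duration        = "900s"                  # 15-minute sustain to ignore transient dips
    }
  }

  alert_strategy {
    auto_close = "604800s"                      # auto-close after 7 days if the budget recovers
  }

  notification_channels = var.slo_budget_alert_channels
}
```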
Alert routing (transitional state)
`var.slo_budget_alert_channels` accepts raw notification-channel IDs today. Once PR #540 lands and registers the PagerDuty channel as a first-class resource inside the monitoring module, the variable will be replaced with a direct reference. Until then, each environment's caller (`infra/terraform/envs/staging/main.tf`, `infra/terraform/envs/production/main.tf`) must pass the pre-provisioned PagerDuty channel IDs explicitly, or leave the list empty, in which case the alert surfaces only in the Cloud Monitoring UI and pages no one.
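A minimal sketch of that caller wiring under the current interface; the module path follows the repo layout described above, other inputs are elided, and the channel ID is a placeholder rather than a real channel:

```hcl
# envs/production/main.tf (abridged sketch; other module inputs elided).
module "monitoring" {
  source = "../../modules/monitoring"

  # Pre-provisioned PagerDuty notification channel ID(s). An empty list means the
  # budget-remaining alert only appears in the Cloud Monitoring UI and pages no one.
  slo_budget_alert_channels = [
    "projects/PROJECT_ID/notificationChannels/1234567890",  # placeholder, not a real channel
  ]
}
```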
Consequences of exhaustion
When the remaining error budget drops below 10%, the service enters feature-freeze until one of:
- The budget recovers above 20% through natural rolling-window drift.
- The next quarterly budget reset (see below) explicitly restores it.
Feature-freeze rules:
- No new features merge into the affected service. This is enforced by convention — the SRE on-call reviews PRs against the frozen service and applies the `freeze-waiver` label only for changes that measurably reduce burn rate (bug fixes, error-handling improvements, rollbacks, retry-budget reductions).
- Existing migrations, deploys, and config changes still proceed if they're reliability-neutral or reliability-positive. A strict freeze would slow down the very work needed to escape the freeze.
- Customer-impacting patches always merge, even if they technically add behaviour — weigh customer risk against SLO compliance on a case-by-case basis.
Exit criteria:
- Budget returns above 20% and no active incident.
- OR 14 days have elapsed since entering freeze with no further burn.
- OR the next quarterly reset lands.
In all cases the SRE on-call documents the exit in the service's monthly reliability review.
Quarterly resets
Every fiscal quarter (1 Jan, 1 Apr, 1 Jul, 1 Oct), the platform team reviews each service's SLO against observed performance and explicitly resets the budget. The reset does three things:
- Adjusts the SLO goal if the service is systematically over- or under-performing. If a service has spent two straight quarters with < 40% of its budget used, that's a signal the SLO is too loose; if it has spent two straight quarters in feature-freeze, the SLO is too tight. Tightening is a breaking change for on-call load — announce it one quarter ahead.
- Reconciles the burn-rate alert thresholds so they match the new SLO goal. Fast-burn is always 10× over 5 min and slow-burn is always 2× over 1 hour, but the absolute bad-minute count these resolve to shifts with the goal.
- Lifts any active feature-freezes, regardless of recovery state. The reset is an explicit platform-team call that it's safe to ship again — on-call carries the risk of resuming at elevated burn until the rolling window catches up.
Reset cadence is documented in `infra/terraform/modules/monitoring/slo_budget.tf` and will be surfaced in the SLO dashboard panel once PR #573 lands.
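Purely as an illustration of how a reset could land in Terraform, assuming the `services` map carries a per-service goal (an assumption about the module interface, not its documented shape; the goals below are made up):

```hcl
# Hypothetical per-service goals after a quarterly reset; values are illustrative only.
services = {
  auth                = { slo_goal = 0.999 }    # unchanged
  processing          = { slo_goal = 0.9995 }   # tightened: two straight quarters with < 40% budget used
  management          = { slo_goal = 0.999 }
  online_txn          = { slo_goal = 0.999 }
  merchant_onboarding = { slo_goal = 0.995 }    # loosened: two straight quarters in feature-freeze
  status              = { slo_goal = 0.999 }
}
```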
Escalation
If the budget-remaining alert fires outside of business hours and no one is paged (channels empty, or the alert was accidentally silenced):
- The SRE on-call rotation is the primary escalation path — PagerDuty schedule `gateway-sre-primary`.
- If SRE on-call is unreachable for 15 minutes, the fallback is the platform engineering lead.
- After resolution, open a reliability post-mortem issue in `pinpointpos/gateway` labelled `post-mortem` + `slo-budget`. The post-mortem template is in `.github/ISSUE_TEMPLATE/post-mortem.md`.
Diagnosing high burn
- Check whether the fast-burn or slow-burn alerts are also firing. If yes, there's probably an active incident — start the relevant incident-response runbook (`docs/ops/runbooks/*` — e.g. `pubsub-dlq-drain.md` for consumer-side burn, `spanner-disaster-recovery.md` for data-plane burn). If no, the burn has likely accumulated from many small incidents over the rolling window.
- Pull recent error logs for the affected service and correlate with deploy timestamps. Error spikes aligned with a deploy point to a regression — roll back first, diagnose later.
- Compare current budget consumption against the historical trend. A step change in burn indicates a regression; a gradual upward drift indicates the SLO target is too tight for the current traffic pattern.
Related docs
- `docs/ops/distributed-tracing.md` — service-to-service latency and error correlation.
- `docs/ops/pubsub-dlq-drain.md` — dead-letter-queue failures (a common source of async error-budget burn).
- `infra/terraform/modules/monitoring/main.tf` — burn-rate alert definitions + SLO goals (per service).
- `infra/terraform/modules/monitoring/slo_budget.tf` — this doc's implementation source.