SLO slow burn (2x / 1 hr)
Alert policy: google_monitoring_alert_policy.slo_slow_burn (infra/terraform/modules/monitoring/main.tf). One per service, same var.services set as the fast-burn policy.
Canonical URL: https://support.pinpointgateway.com/docs/ops/runbooks/slo-slow-burn
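To confirm the policy is actually deployed for a given service, one option is the alpha Monitoring CLI. A sketch, assuming the gcloud alpha component is installed and that the deployed display name contains "slow burn" (an assumption; match it to whatever the Terraform module actually sets):

```
# sketch: list deployed alert policies whose display name mentions "slow burn"
gcloud alpha monitoring policies list --project="pinpoint-gateway" \
  --filter='displayName~"slow burn"' --format='value(name,displayName)'
```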
What this alert means
The named service is burning error budget at 2x the steady-state rate over a rolling 1-hour window. Against the 99.9% / 30-day availability SLO, a sustained 2x burn will exhaust the month's budget in ~15 days. This is the "something is wrong but it isn't on fire" signal — customers may or may not be noticing yet, but budget is leaking faster than we can afford long-term.
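The arithmetic behind that figure is just the SLO window divided by the burn rate; a quick sanity check for other burn rates:

```
# sketch: days to exhaust a 30-day error budget at a given burn rate
BURN_RATE=2
echo "scale=1; 30 / ${BURN_RATE}" | bc   # => 15.0 days
```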
Typical causes:
- a slow regression introduced by a recent deploy (an edge case throwing 500s a few percent of the time);
- a flaky dependency (intermittent Spanner aborts, TransIT 5xx on a specific card BIN, a Pub/Sub subscription with a poison message driving repeated retries);
- cold-start-driven 5xx on a service that has not yet been set to min_instances >= 1 (see INF-10);
- a new input edge case that slipped past validation and throws inside a generic handler.
User impact: intermittent failures at a low rate. Individual customers may see sporadic 500s on retry; most will not notice. The risk is cumulative — this is the alert you use to prevent a multi-day slow bleed that becomes a credibility problem before it becomes an outage.
First three diagnostic steps
1. Check whether a fast-burn alert is also firing (or recently fired and cleared) for the same service. If yes, investigate per the fast-burn runbook; slow burn is just the long-window confirmation of the same event. If no, continue — this is a genuine slow regression.
2. Run a Cloud Logging breakdown of 5xx responses by endpoint for the last hour.

```
resource.type="cloud_run_revision"
resource.labels.service_name="gateway-<service>"
httpRequest.status>=500
timestamp>="<now - 1h>"
```

Export the count grouped by httpRequest.requestUrl. If one endpoint dominates the 5xx stream, that is your target. If errors are evenly distributed across every endpoint, the cause is almost certainly infrastructure (cold starts, Spanner, or the revision itself) — jump to step 3.
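A minimal CLI version of this breakdown, assuming jq is available (the project ID matches the one used in step 3):

```
# sketch: count last-hour 5xx log entries by request URL, busiest first
gcloud logging read '
    resource.type="cloud_run_revision"
    resource.labels.service_name="gateway-<service>"
    httpRequest.status>=500' \
  --project="pinpoint-gateway" --freshness=1h --format=json \
  | jq -r '.[].httpRequest.requestUrl' | sort | uniq -c | sort -rn | head
```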
3. Diff the last two Cloud Run revisions.

```
PROJECT="pinpoint-gateway"
REGION="us-east1"
SERVICE="gateway-<service>"
gcloud run revisions list --service="${SERVICE}" --project="${PROJECT}" --region="${REGION}" --limit=5
```

If the burn began within a few hours of a deploy, run git log between the two revisions' container tags and look for changed code on the endpoint identified in step 2. This is the most common cause of a slow burn — a regression that only trips on ~1-3% of traffic.
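To find the two container tags to diff, a sketch (the spec.containers field path assumes the Knative-style revision resource gcloud returns; the revision names and the tags-are-git-SHAs convention are illustrative):

```
# sketch: print each revision's container image, then diff the corresponding tags in git
for REV in "<new-revision>" "<old-revision>"; do
  gcloud run revisions describe "${REV}" --project="${PROJECT}" \
    --region="${REGION}" --format='value(spec.containers[0].image)'
done
git log --oneline "<old-tag>..<new-tag>"
```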
Rollback criteria
Slow burn rollbacks are rarely urgent — the error budget usually has hours of headroom — but roll back when:
- The burn started within 6 hours of a deploy AND
- The endpoint identified in step 2 maps to code changed in that deploy AND
- The previous revision is still within the last 7 days' retained traffic history.
Use the rollback command from the fast-burn runbook. After rollback, re-check the slow-burn metric in 2 hours (the window is 1h, so it takes at least that long to clear even after the underlying cause is gone).
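The canonical command lives in the fast-burn runbook. For orientation only, a Cloud Run rollback is a traffic pin to the last known-good revision; a sketch with an illustrative revision name:

```
# sketch: route 100% of traffic back to the previous revision
# (revision name is illustrative; use the fast-burn runbook's canonical command)
gcloud run services update-traffic "gateway-<service>" \
  --project="pinpoint-gateway" --region="us-east1" \
  --to-revisions="gateway-<service>-00041-abc=100"
```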
Do not roll back when:
- The burn is correlated with a dependency incident (Spanner, TransIT) rather than our code — rollback won't help.
- The error rate is "bursty" rather than steady (e.g., one spike of 50 errors every 10 minutes) — that is often poison-message / DLQ-related and the Pub/Sub DLQ runbook applies.
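For the bursty case above, a quick check of the dead-letter setup before reaching for the Pub/Sub DLQ runbook (the subscription names here are assumptions; substitute the real ones):

```
# sketch: confirm a DLQ is configured, then peek at dead-lettered messages
# (pull does not ack by default, so this is non-destructive)
gcloud pubsub subscriptions describe "<service>-sub" --project="pinpoint-gateway" \
  --format='value(deadLetterPolicy.deadLetterTopic,deadLetterPolicy.maxDeliveryAttempts)'
gcloud pubsub subscriptions pull "<service>-dlq-sub" --project="pinpoint-gateway" --limit=5
```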
Escalation path
Slow burn has a more relaxed SLA than fast burn:
- T+0 — on-call engineer acknowledges.
- T+1 hour without identified cause — ping the service owner in the alert channel (not a page).
- T+4 hours without identified cause — page the service owner; log a ticket labelled ops.
- T+24 hours without mitigation — escalate to engineering lead; slow burns that persist a full day are effectively permanent budget damage and need a roadmap response, not just an operational one.
Customer communication: slow-burn alerts are internal by default. Only post to the public status page if the underlying incident has also tripped a fast-burn threshold or if a single merchant reports the same pattern on their side.
If there is no known fix
If steps 2 and 3 turn up nothing (errors are evenly distributed, there is no recent deploy, and dependencies are green), this is frequently an intermittent Cloud Run infrastructure issue: cold starts on a service without min_instances, or a transient regional network hiccup. The correct response is to:
- Document the observed pattern in a ticket.
- Raise min_instances (INF-10) for the service if cold starts are a plausible cause, as sketched below.
- Escalate immediately if the pattern does not match a known cause — a persistent slow burn with no identified root cause is a monitoring gap, not an acceptable steady state.
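If you do raise min_instances, a minimal sketch (INF-10 is the governing policy; the value 1 matches the threshold named under Typical causes):

```
# sketch: keep at least one warm instance to eliminate cold-start 5xx
gcloud run services update "gateway-<service>" \
  --project="pinpoint-gateway" --region="us-east1" --min-instances=1
```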