SLO fast burn (10x / 5 min)
Alert policy: google_monitoring_alert_policy.slo_fast_burn (infra/terraform/modules/monitoring/main.tf). One per service; <service> in the display name is substituted from the var.services map (auth, management, processing, online_txn, merchant_onboarding, status).
Canonical URL: https://support.pinpointgateway.com/docs/ops/runbooks/slo-fast-burn
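To confirm from a terminal which policy fired and whether it is enabled, something like the sketch below works; the displayName filter is a guess at how the per-service policies are named, so adjust it to match what you see in the console.
# List fast-burn alert policies and their enabled state.
gcloud alpha monitoring policies list \
  --project="pinpoint-gateway" \
  --filter='displayName:"fast burn"' \
  --format="table(displayName, enabled, name)"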
What this alert means
The named service is burning its error budget at 10x the steady-state rate over a 5-minute window. Against the 99.9% / 30-day availability SLO, that is enough to exhaust a full month's error budget in about three days (30 days / 10) if the burn continues; the point of the fast-burn alert is to catch it within minutes, not days. In practice this means one of:
- a Cloud Run revision just started returning 5xx for a meaningful fraction of requests (bad deploy, OOM loop, crashing startup probe);
- a downstream dependency (Spanner, TransIT, XTMS, Pub/Sub) is unavailable or throttled and our wrappers are not degrading gracefully;
- a traffic spike is hitting cold-start paths on services without min_instances (INF-10), and the resulting latency is propagating as 5xx at the edge (a quick check for this is sketched after the list).
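For the cold-start case, a quick way to see whether the service has minimum instances configured (a sketch using the standard Cloud Run autoscaling annotation; project and region as in the rollback command below):
# Show autoscaling annotations; a missing or zero minScale means a spike
# can land on cold starts.
gcloud run services describe "gateway-<service>" \
  --project="pinpoint-gateway" \
  --region="us-east1" \
  --format="yaml(spec.template.metadata.annotations)"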
User impact: the alert is a fast-burn signal, so customers are seeing failed checkouts, failed API calls, or failed merchant portal loads right now. Treat as incident until proven otherwise.
First three diagnostic steps
1. Open the Cloud Run revision history for the named service. In the GCP console navigate to Cloud Run → gateway-<service> → Revisions. If a new revision was promoted within the last ~30 minutes, suspect the deploy first. Note the previous known-good revision name; you may need it for rollback (see below). A CLI equivalent for this step and step 2 is sketched after the list.
2. Filter logs by severity for the last 15 minutes. In Cloud Logging:
   resource.type="cloud_run_revision"
   resource.labels.service_name="gateway-<service>"
   severity>=ERROR
   timestamp>="<now - 15m>"
   Look for repeating stack traces, the same exception class across many requests, or a cluster of UncaughtExceptionHandler messages. If a single root cause dominates the error stream, that is your fix target. Common patterns:
   - SpannerException / DEADLINE_EXCEEDED — Spanner side; see the Spanner DR runbook.
   - ConnectException / SocketTimeoutException on a TransIT host — TSYS upstream; escalate to the TransIT liaison.
   - OutOfMemoryError — revert the deploy; the JVM heap sizing regressed.
3. Open the service's Cloud Run dashboard tab and check the 5xx rate and p95 latency. Correlate the spike timestamp with the alert fire time. If the 5xx rate returns to baseline while p95 is still elevated, the underlying problem is latency-driven (a dependency is slow, not failing outright); shift the investigation to the dependency rather than the service itself.
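If the console is slow or you prefer the CLI, here is a minimal sketch of steps 1 and 2 using gcloud; it assumes the same project, region, and service-name pattern as the rollback command below.
PROJECT="pinpoint-gateway"
REGION="us-east1"
SERVICE="gateway-<service>"
# Step 1: list revisions and note their creation timestamps to spot a fresh promotion.
gcloud run revisions list \
  --project="${PROJECT}" --region="${REGION}" --service="${SERVICE}"
# Step 2: the same severity filter as above, read from the terminal.
gcloud logging read \
  "resource.type=\"cloud_run_revision\" AND resource.labels.service_name=\"${SERVICE}\" AND severity>=ERROR" \
  --project="${PROJECT}" --freshness=15m --limit=50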
Rollback criteria
Flip to the previous Cloud Run revision when all of these are true:
- The burn correlates in time (within ~5 min) with a new revision becoming the traffic target.
- The previous revision's dashboard for the same window shows baseline 5xx rate.
- The error stream in step 2 surfaces a stack trace that references code changed in the new revision, OR the startup probe is failing on the new revision.
Rollback command:
PROJECT="pinpoint-gateway"
REGION="us-east1"
SERVICE="gateway-<service>"
PREV_REVISION="<from revision history>"
gcloud run services update-traffic "${SERVICE}" \
  --project="${PROJECT}" \
  --region="${REGION}" \
  --to-revisions="${PREV_REVISION}=100"
Watch the 5xx rate in the dashboard. If it does not return to baseline within 5 minutes, the bad revision was not the cause — keep traffic on the good revision but escalate, because the root cause is upstream.
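If the dashboard is lagging, a rough proxy is to count 5xx request-log entries since the traffic switch; this is a sketch that assumes Cloud Run request logging is on and uses the standard httpRequest.status field.
# Count recent 5xx request-log entries for the service; capped at 200 by --limit.
gcloud logging read \
  "resource.type=\"cloud_run_revision\" AND resource.labels.service_name=\"gateway-<service>\" AND httpRequest.status>=500" \
  --project="pinpoint-gateway" --freshness=5m --limit=200 \
  --format="value(timestamp)" | wc -l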
Do not roll back when:
- The burn predates the most recent deploy by more than 10 minutes.
- The error stream points cleanly at a downstream dependency (Spanner, TransIT, XTMS) — a rollback won't help.
- Multiple services are burning simultaneously — that is infrastructure-wide, and rolling back one service is noise (a quick fleet-wide check is sketched below).
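To check whether the burn is fleet-wide before deciding, one rough sketch is to count error-level log entries per service; the loop assumes the var.services keys map directly to the gateway-<service> Cloud Run names, so adjust the list if the actual service names differ.
# Rough fleet-wide check: error-log volume per service over the last 10 minutes.
for SVC in auth management processing online_txn merchant_onboarding status; do
  COUNT=$(gcloud logging read \
    "resource.type=\"cloud_run_revision\" AND resource.labels.service_name=\"gateway-${SVC}\" AND severity>=ERROR" \
    --project="pinpoint-gateway" --freshness=10m --limit=500 \
    --format="value(timestamp)" | wc -l)
  echo "gateway-${SVC}: ${COUNT} error entries (capped at 500)"
done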
Escalation path
- T+0 — first responder (on-call engineer) acknowledges in the alert channel and starts step 1.
- T+15 min without identified root cause — page the service owner (per CODEOWNERS on the service's
BUILD.bazel). - T+30 min without mitigation — page the engineering lead; declare an incident in the status system.
- T+45 min without mitigation AND customer-visible impact confirmed — open an incident on the public status page (
websites/status-page) and notify the CEO / head of customer success.
At every escalation step, timestamp the hand-off and paste the alert link into the incident channel. Even if the first responder resolves the issue in step 1, post a one-line root-cause note so the eventual postmortem is cheap.
If there is no known fix
If the error stream shows no recognizable pattern, logs are quiet, and the dependency dashboards are green: escalate immediately to the service owner rather than stalling at step 2. A silent fast burn is often a monitoring gap or a poison metric; do not wait it out.