Pub/Sub DLQ has messages
Alert policy: google_monitoring_alert_policy.pubsub_dlq_depth (infra/terraform/modules/monitoring/main.tf). One per dead-letter subscription listed in var.dlq_subscription_short_names.
Canonical URL: https://support.pinpointgateway.com/docs/ops/runbooks/pubsub-dlq-depth
Severity: ERROR (set explicitly in Terraform, unlike the SLO burn-rate policies).
What this alert means
A message landed in the named dead-letter queue. That means the consumer of the parent subscription failed to acknowledge the message after its max_delivery_attempts (5 for most subscriptions; 20 for webhook-delivery, which tolerates merchant endpoint flakes). Every message in a DLQ represents a real processing failure that the subscriber could not recover from — transient retries have already been exhausted.
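To confirm what a given parent subscription is actually configured with, you can project the dead-letter policy out of gcloud pubsub subscriptions describe (a quick sketch; the subscription name is a placeholder, and the field names follow the Pub/Sub API):
# Show only the dead-letter settings of the parent subscription (placeholder name).
gcloud pubsub subscriptions describe "<parent subscription from the DLQ drain quick reference>" \
  --project="pinpoint-gateway" \
  --format="yaml(deadLetterPolicy)"
# Expect deadLetterPolicy.maxDeliveryAttempts to be 5 (20 for webhook-delivery).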
User impact varies by subscription:
- transaction-events-dlq-sub — a transaction record was not written to the management service's log. Affects internal reporting and merchant dashboards; no direct customer impact at payment time.
- settlement-events-dlq-sub — a settlement / batch-close notification was not processed. Affects reconciliation accuracy; merchants may see stale settlement status.
- webhook-delivery-dead-letter-drain — a merchant webhook was never delivered. The merchant's own system is missing an event. Direct customer impact.
- reconciliation-requests-dlq-sub / reconciliation-sweep-dlq-sub — reconciliation worker failures. Back-office-only impact; does not block payment flow.
DLQs are not supposed to hold messages. A non-zero depth is always a real problem. Do not acknowledge the alert until the queue is drained to zero.
First three diagnostic steps
- Pull the top 10 DLQ messages without ack-ing. Follow the exact procedure in the Pub/Sub DLQ drain runbook step 1. The drain runbook is the deep operational procedure; this file is the triage frontend. The alert's own documentation.content field (in monitoring/main.tf) already includes the gcloud pubsub subscriptions pull command as a quick reference.
- Correlate with consumer service logs. Take the publishTime of the oldest DLQ message, widen the window 1 minute before, and query logs for the consuming service (from the DLQ drain quick reference table). You are looking for the stack trace that caused the first failure — that tells you whether this is a schema regression, a transient infra issue, or a bad external reference. A sketch of this step and the previous one follows the list.
- Decide replay vs. quarantine. Use the decision table in the drain runbook step 3 to pick option A (simple replay), B (fix-then-replay), C (quarantine + split + republish), or D (quarantine + notify merchant).
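A minimal combined sketch of steps 1 and 2, assuming jq is available and that the consumer runs on Cloud Run (as the rollback section below implies); the subscription and service names are placeholders from the quick reference table:
PROJECT="pinpoint-gateway"
DLQ_SUB="<DLQ subscription, e.g. transaction-events-dlq-sub>"
CONSUMER="<consumer service from the DLQ drain quick reference>"

# Step 1: peek at up to 10 messages. Without --auto-ack nothing is acknowledged,
# so the messages remain in the DLQ once their ack deadline passes.
gcloud pubsub subscriptions pull "${DLQ_SUB}" \
  --project="${PROJECT}" \
  --limit=10 \
  --format=json > /tmp/dlq-sample.json

# Step 2: find the oldest publishTime, then read consumer error logs from around that time
# (move the timestamp back a minute by hand to catch the very first failure).
OLDEST=$(jq -r '[.[].message.publishTime] | sort | first' /tmp/dlq-sample.json)
echo "Oldest DLQ publishTime: ${OLDEST}"
gcloud logging read \
  "resource.type=\"cloud_run_revision\" AND resource.labels.service_name=\"${CONSUMER}\" AND severity>=ERROR AND timestamp>=\"${OLDEST}\"" \
  --project="${PROJECT}" \
  --order=asc \
  --limit=50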
Rollback criteria
The Cloud Run rollback framing doesn't apply directly here — Pub/Sub does not have per-revision traffic splits. But if a DLQ spike correlates with a recent consumer-service deploy (for example, the gateway-management service was redeployed 20 minutes before the alert fired, and the stack traces reference code changed in that deploy), roll back the consumer service:
PROJECT="pinpoint-gateway"
REGION="us-east1"
SERVICE="<consumer service from the DLQ drain quick reference>"
PREV_REVISION="<from revision history>"
gcloud run services update-traffic "${SERVICE}" \
--project="${PROJECT}" \
--region="${REGION}" \
--to-revisions="${PREV_REVISION}=100"
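To fill in PREV_REVISION, list the service's recent revisions and pick the last known-good one (a sketch reusing the variables above):
# Show recent revisions for the consumer service; PREV_REVISION is normally the
# revision that was serving before the suspect deploy.
gcloud run revisions list \
  --service="${SERVICE}" \
  --project="${PROJECT}" \
  --region="${REGION}" \
  --limit=5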
After rollback, the DLQ will still contain the poisoned messages — they will stay there until you replay or quarantine them. Rollback only stops new messages from landing; it does not drain the queue.
Do not roll back when:
- The stack trace in step 2 points at a dependency (Spanner, TransIT, merchant webhook URL) rather than our code.
- The DLQ contents are a single poison message (quarantine it directly, do not roll back).
- The DLQ is for webhook-delivery-dead-letter-drain and the target URLs are merchant-owned — rolling back our service does not fix a broken merchant endpoint.
Escalation path
- T+0 — on-call engineer acknowledges and opens the drain runbook.
- T+30 min without identified cause — ping the consumer service owner (see CODEOWNERS on the consumer service's BUILD.bazel).
- T+1 hour with growing DLQ depth — page the service owner; this may be a loop.
- T+2 hours for webhook-delivery-* DLQs specifically — notify the merchant-success team, because merchants will begin noticing missing webhooks. For other DLQs, no customer-facing comms by default.
- T+4 hours without mitigation — escalate to engineering lead and declare an incident.
The webhook-delivery DLQs are elevated relative to the others because they represent direct merchant impact. For those, compress each escalation step by roughly half.
If there is no known fix
If pulling messages shows them as unparseable JSON or a shape no code path in the consumer service handles: the message should never have been published. File a ticket for the publisher; meanwhile, quarantine per option D in the drain runbook. Never ack a DLQ message without either successful replay or an explicit quarantine record — silently ack-ing hides the data loss.
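When judging whether a payload is genuinely malformed, decode it from the sample pulled in diagnostic step 1 rather than eyeballing the base64 (a sketch, assuming jq and the /tmp/dlq-sample.json file from that step; the data field is base64-encoded per the Pub/Sub message format):
# Decode the first sampled message's payload for the publisher ticket; this does not ack anything.
jq -r '.[0].message.data' /tmp/dlq-sample.json | base64 --decode
# The attributes can help identify the publisher and schema version for the ticket.
jq '.[0].message.attributes' /tmp/dlq-sample.json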