Pub/Sub Dead-Letter Queue Drain Runbook
Use this runbook when the "Pub/Sub DLQ has messages: <subscription-name>" alert fires. Any message in a DLQ represents a genuine processing failure that the subscriber could not recover from after 5 delivery attempts (20 for webhook-delivery, which tolerates more transient 5xx retries from merchant endpoints before quarantining).
DLQs are not supposed to hold messages. A non-zero depth is always a real problem. Do not acknowledge the alert until the queue is drained to zero or every remaining message has been explicitly quarantined with a tracked reason.
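To confirm a subscription's configured retry budget before triaging, you can read its dead-letter policy directly. A minimal check (deadLetterPolicy.maxDeliveryAttempts and deadLetterPolicy.deadLetterTopic are the standard Pub/Sub subscription fields; substitute any subscription from the table below):
# Show the dead-letter topic and the max delivery attempts for a subscription
gcloud pubsub subscriptions describe transaction-events-sub \
  --project=pinpoint-gateway \
  --format='value(deadLetterPolicy.deadLetterTopic,deadLetterPolicy.maxDeliveryAttempts)'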
Quick reference
| Subscription | DLQ drain subscription | Consumer service | What a poisoned message usually means |
|---|---|---|---|
| transaction-events-sub | transaction-events-dlq-sub | management (transaction log writer) | A transaction with a malformed schema or a Spanner mutation that violates a constraint |
| settlement-events-sub | settlement-events-dlq-sub | processing (batch reconciliation) | A TransIT batch close response the parser cannot handle |
| webhook-delivery-sub | webhook-delivery-dead-letter-drain | management (webhook delivery worker) | A merchant webhook endpoint is permanently 4xx (misconfigured URL, deprecated HMAC) |
| staging-webhook-delivery-sub | staging-webhook-delivery-dead-letter-drain | management staging | Same as above, but in staging |
| reconciliation-requests-sub | reconciliation-requests-dlq-sub | processing (reconciliation worker) | A recon request referencing a merchant that has since been deleted |
| reconciliation-sweep-sub | reconciliation-sweep-dlq-sub | processing (reconciliation scheduler) | A scheduler fan-out produced an invalid merchant list |
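If the alert does not tell you which queue is affected, a quick way to see which drain subscriptions currently hold visible messages is to pull one message from each without acking (a sketch over the subscription names above; this shows only messages not currently outstanding, not the true backlog depth):
PROJECT="pinpoint-gateway"
for sub in \
    transaction-events-dlq-sub \
    settlement-events-dlq-sub \
    webhook-delivery-dead-letter-drain \
    staging-webhook-delivery-dead-letter-drain \
    reconciliation-requests-dlq-sub \
    reconciliation-sweep-dlq-sub; do
  # 1 = at least one message waiting; 0 = nothing visible right now
  n="$(gcloud pubsub subscriptions pull "${sub}" --project="${PROJECT}" \
         --limit=1 --auto-ack=false --format='json' | jq 'length')"
  echo "${sub}: ${n}"
done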
Webhook-specific note
For the webhook-delivery-* queues, this runbook covers only the raw Pub/Sub layer. Once the management drain consumer has translated the failed message into an application-level delivery row (DLQ_PENDING / DLQ_EXHAUSTED), switch to the higher-level operator workflow.
Triage — step by step
Replace <DLQ_SUB> with the drain subscription name from the alert (e.g., webhook-delivery-dead-letter-drain).
1. Pull the top 10 messages without ack-ing
PROJECT="pinpoint-gateway"
DLQ_SUB="<DLQ_SUB>"
gcloud pubsub subscriptions pull "${DLQ_SUB}" \
--project="${PROJECT}" \
--limit=10 \
--auto-ack=false \
--format='json' \
> /tmp/dlq-dump.json
jq '.[] | {
  ackId,
  publishTime: .message.publishTime,
  attributes: .message.attributes,
  dataPreview: (.message.data | @base64d | .[0:500]),
  deliveryAttempt: .deliveryAttempt
}' /tmp/dlq-dump.json
The deliveryAttempt field tells you how many times Pub/Sub has delivered the message without receiving an ack. By the time a message is in a DLQ it should be >= 5 (or >= 20 for webhook-delivery). If it's lower than the configured max_delivery_attempts, something is weird — stop and investigate before replaying.
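A quick check against the dump from step 1 (a sketch; set MAX to 5 or 20 per the quick-reference table):
# Count messages that landed in the DLQ with fewer deliveries than expected
MAX=5   # 20 for webhook-delivery
jq --argjson max "${MAX}" \
  '[.[] | select((.deliveryAttempt // 0) < $max)] | length' /tmp/dlq-dump.json
# Expected output: 0. Anything else: stop and investigate.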
2. Correlate with consumer logs
Take the publishTime of the oldest message and look at the consumer service logs from that time minus 1 minute:
PUBLISH_TIME="2026-04-13T12:34:56Z" # from the dump above
SERVICE="gateway-management" # the consumer of the parent subscription
# Widen the window to 1 minute BEFORE the first delivery so you see the
# initial failure (not just the retry cascade that ran a few seconds later).
LOOKBACK_START="$(date -u -d "${PUBLISH_TIME} - 1 minute" '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null \
|| python3 -c "import datetime, sys; t=datetime.datetime.fromisoformat('${PUBLISH_TIME}'.rstrip('Z')) - datetime.timedelta(minutes=1); print(t.isoformat()+'Z')")"
gcloud logging read \
"resource.labels.service_name=\"${SERVICE}\" AND timestamp >= \"${LOOKBACK_START}\"" \
--project="${PROJECT}" \
--limit=50 \
--format='value(timestamp,severity,jsonPayload.message,jsonPayload.exception)'
You are looking for the stack trace that caused the original failure. Once you find it, you know what category of issue this is:
| Log pattern | Category | Next step |
|---|---|---|
| JsonParseException / MismatchedInputException | Schema regression | Option B (replay after code fix) |
| Spanner mutation exceeds... limit | Too-large payload | Option C (quarantine, split, republish) |
| 403 Forbidden on webhook endpoint | Merchant misconfig | Option D (quarantine, notify merchant) |
| ResourceNotFoundException: merchant | Stale reference | Option C or D depending on whether the merchant actually existed |
| Intermittent Spanner aborts or timeouts | Transient | Option A (simple replay) |
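To get a rough count per category before committing to a path, you can bucket the error logs from step 2 against these patterns (a sketch reusing the PROJECT, SERVICE, and LOOKBACK_START variables set above):
gcloud logging read \
  "resource.labels.service_name=\"${SERVICE}\" AND timestamp >= \"${LOOKBACK_START}\" AND severity>=ERROR" \
  --project="${PROJECT}" \
  --limit=200 \
  --format='value(jsonPayload.message,jsonPayload.exception)' \
  | grep -oE 'JsonParseException|MismatchedInputException|mutation exceeds|403 Forbidden|ResourceNotFoundException' \
  | sort | uniq -c | sort -rn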
3. Choose a path — A, B, C, or D
Option A — Simple replay
The failure was transient (e.g., Spanner unavailable for 3 minutes). Fix: move the messages back to the parent topic.
PARENT_TOPIC="webhook-delivery" # the topic that the DLQ drains from
# Pull each message and republish to the parent topic. This is a
# write-then-ack sequence so if the write fails the message stays in the DLQ.
while true; do
msg="$(gcloud pubsub subscriptions pull "${DLQ_SUB}" --project="${PROJECT}" --limit=1 --auto-ack=false --format='json' | jq -c '.[0]')"
if [[ -z "${msg}" || "${msg}" == "null" ]]; then
echo "DLQ drained."
break
fi
data="$(echo "${msg}" | jq -r '.message.data')"
attrs="$(echo "${msg}" | jq -c '.message.attributes // {}')"
ack_id="$(echo "${msg}" | jq -r '.ackId')"
# `gcloud pubsub topics publish --attribute=KEY=VALUE` accepts exactly one
# key=value per flag (a single comma-joined list would be collapsed into one
# attribute value and the other keys would be lost). Build an array of flags.
attr_flags=()
while IFS= read -r entry; do
[[ -n "${entry}" ]] && attr_flags+=("--attribute=${entry}")
done < <(echo "${attrs}" | jq -r 'to_entries[] | "\(.key)=\(.value)"')
  # Ack only if the publish succeeded, so a failed write really does leave
  # the message in the DLQ as the comment above promises.
  if gcloud pubsub topics publish "${PARENT_TOPIC}" \
      --project="${PROJECT}" \
      --message="$(echo "${data}" | base64 -d)" \
      "${attr_flags[@]}"; then
    gcloud pubsub subscriptions ack "${DLQ_SUB}" --project="${PROJECT}" --ack-ids="${ack_id}"
  else
    echo "Publish failed; leaving message in the DLQ" >&2
    break
  fi
done
Do NOT run this loop without first understanding why the messages failed in the first place. Replaying poisoned messages against a still-broken consumer puts them right back in the DLQ.
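When the root cause is ambiguous, a safer variant is a canary replay: replay one message and confirm it does not bounce straight back. A sketch (the wait is a heuristic taken from the appendix timings, not a guarantee):
# Replay exactly ONE message by hand (one pull/publish/ack round from the
# loop above), then give the consumer its full retry budget and re-check:
sleep 120   # rough for the max=5 subscriptions; webhook-delivery needs
            # longer since its 20-attempt budget takes ~10-15 min to exhaust
gcloud pubsub subscriptions pull "${DLQ_SUB}" --project="${PROJECT}" \
  --limit=10 --auto-ack=false --format='json' | jq 'length'
# If the count climbs back up, the consumer is still broken; use B, C, or D.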
Option B — Replay after code fix
The consumer had a bug. Fix: deploy the fix (cut a hotfix release, approve the production deploy), then run Option A's replay loop.
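Before replaying, confirm the fixed build is actually serving. A minimal sketch, assuming the consumer is a Cloud Run service (the resource.labels.service_name filter in step 2 suggests this); the region here is a placeholder:
# Hypothetical region; substitute the service's real one
gcloud run services describe gateway-management \
  --project="${PROJECT}" \
  --region=us-central1 \
  --format='value(status.latestReadyRevisionName)'
# The revision shown should be the one built from the hotfix commit.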
Option C — Quarantine, split, republish
The message is legitimate but can't be processed in its current form (e.g., a batch of 5000 items that exceeds the Spanner mutation limit). Fix: write a one-off script that splits the payload into processable chunks, publish the chunks, then ack the original:
# Dump the poisoned message to a file for inspection
gcloud pubsub subscriptions pull "${DLQ_SUB}" --project="${PROJECT}" --limit=1 --auto-ack=false --format='json' \
| jq '.[0]' > /tmp/poisoned-message.json
# ... manually craft the fix and republish ...
# Finally, ack the DLQ message so it stops firing the alert
ACK_ID="$(jq -r '.ackId' < /tmp/poisoned-message.json)"
gcloud pubsub subscriptions ack "${DLQ_SUB}" --project="${PROJECT}" --ack-ids="${ACK_ID}"
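As a starting point for the one-off script, here is a minimal split-and-republish sketch. It assumes the payload is a JSON object with an items array — a hypothetical shape; inspect /tmp/poisoned-message.json first and adapt the jq paths to the real schema:
PARENT_TOPIC="settlement-events"   # the topic the DLQ drains from
CHUNK=500                          # pick a size that fits the Spanner mutation limit
jq -r '.message.data' /tmp/poisoned-message.json | base64 -d > /tmp/payload.json
TOTAL="$(jq '.items | length' /tmp/payload.json)"
for ((i = 0; i < TOTAL; i += CHUNK)); do
  # Republish the original payload with only this slice of items
  chunk="$(jq -c --argjson i "${i}" --argjson n "${CHUNK}" \
             '.items |= .[($i):($i + $n)]' /tmp/payload.json)"
  gcloud pubsub topics publish "${PARENT_TOPIC}" \
    --project="${PROJECT}" \
    --message="${chunk}"
done
Ack the original message (as in the snippet above) only after every chunk has published successfully.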
File a follow-up issue documenting the split script so the next occurrence can reuse it.
Option D — Quarantine and drop
The message cannot be successfully processed by any version of the consumer (e.g., the merchant's webhook hostname has not resolved in DNS for 6 months). Fix: dump the message to a quarantine bucket for audit and ack it.
# Dump to the audit bucket (capture the filename so the later steps reference
# exactly this file, not whatever /tmp/quarantine-*.json happens to match)
QUARANTINE_FILE="/tmp/quarantine-$(date +%s).json"
gcloud pubsub subscriptions pull "${DLQ_SUB}" --project="${PROJECT}" --limit=1 --auto-ack=false --format='json' \
  > "${QUARANTINE_FILE}"
gcloud storage cp "${QUARANTINE_FILE}" gs://pinpoint-gateway-audit-logs/dlq-quarantine/
# Ack the message
ACK_ID="$(jq -r '.[0].ackId' < "${QUARANTINE_FILE}")"
gcloud pubsub subscriptions ack "${DLQ_SUB}" --project="${PROJECT}" --ack-ids="${ACK_ID}"
Every quarantine action MUST be recorded in a tracking issue with the merchant ID, message body (redacted as needed), and reason. Silent drops are prohibited.
After the drain
- Re-pull the DLQ with --limit=100 --auto-ack=false (full command below this list). Expected: empty result.
- Watch the alert in Cloud Monitoring — it should auto-close within 5 minutes of the last ack.
- File a post-mortem issue with: what the failure was, root cause, the fix, which option path you used (A/B/C/D), and the count of messages processed per path.
- If you used Option B (code fix), link the hotfix PR.
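The verification pull from the first bullet, in full:
gcloud pubsub subscriptions pull "${DLQ_SUB}" --project="${PROJECT}" \
  --limit=100 --auto-ack=false --format='json' | jq 'length'
# Expected output: 0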
Appendix: Publishing a test poison pill to validate the DLQ flow
Use this once after every infra change to confirm the DLQ-to-alert path still works:
# Publish a message the consumer will definitely reject (e.g., malformed JSON)
gcloud pubsub topics publish webhook-delivery \
--project=pinpoint-gateway \
--message='{"this":"will":"break":"the":"consumer"}'
# Wait for the DLQ transition. Time-to-DLQ = max_delivery_attempts × retry
# backoff, which varies by subscription:
# - transaction-events / settlement-events / reconciliation-* : max=5 → typically ~1–2 min
# - webhook-delivery (prod + staging) : max=20 → typically ~10–15 min
# For a faster validation loop, run this against a max=5 subscription instead.
# Then check that the DLQ has a message:
gcloud pubsub subscriptions pull webhook-delivery-dead-letter-drain \
--project=pinpoint-gateway \
--limit=10 \
--auto-ack=false
# Expected: 1 message visible.
# Within 5-10 more minutes, the Cloud Monitoring alert fires.
# Acknowledge the alert, drain the DLQ (Option D — drop the test), and move on.