
Pub/Sub Dead-Letter Queue Drain Runbook

Use this runbook when the "Pub/Sub DLQ has messages: <subscription-name>" alert fires. Any message in a DLQ represents a genuine processing failure that the subscriber could not recover from after 5 delivery attempts (20 for webhook-delivery, which tolerates more transient 5xx retries from merchant endpoints before quarantining).

DLQs are not supposed to hold messages. A non-zero depth is always a real problem. Do not acknowledge the alert until the queue is drained to zero or every remaining message has been explicitly quarantined with a tracked reason.

Quick reference

| Subscription | DLQ drain subscription | Consumer service | What a poisoned message usually means |
| --- | --- | --- | --- |
| transaction-events-sub | transaction-events-dlq-sub | management (transaction log writer) | A transaction with a malformed schema or a Spanner mutation that violates a constraint |
| settlement-events-sub | settlement-events-dlq-sub | processing (batch reconciliation) | A TransIT batch close response the parser cannot handle |
| webhook-delivery-sub | webhook-delivery-dead-letter-drain | management (webhook delivery worker) | A merchant webhook endpoint is permanently 4xx (misconfigured URL, deprecated HMAC) |
| staging-webhook-delivery-sub | staging-webhook-delivery-dead-letter-drain | management staging | Same as above, but in staging |
| reconciliation-requests-sub | reconciliation-requests-dlq-sub | processing (reconciliation worker) | A recon request referencing a merchant that has since been deleted |
| reconciliation-sweep-sub | reconciliation-sweep-dlq-sub | processing (reconciliation scheduler) | A scheduler fan-out produced an invalid merchant list |

Webhook-specific note

For webhook-delivery-* queues, this runbook covers only the raw Pub/Sub layer. Once the management drain consumer has translated the failed message into an application-level delivery row (DLQ_PENDING / DLQ_EXHAUSTED), switch to the higher-level operator workflow.

Triage — step by step

Replace <DLQ_SUB> with the drain subscription name from the alert (e.g., webhook-delivery-dead-letter-drain).

1. Pull the top 10 messages without ack-ing

PROJECT="pinpoint-gateway"
DLQ_SUB="<DLQ_SUB>"

gcloud pubsub subscriptions pull "${DLQ_SUB}" \
  --project="${PROJECT}" \
  --limit=10 \
  --auto-ack=false \
  --format='json' \
  > /tmp/dlq-dump.json

jq '.[] | {
  ackId,
  publishTime: .message.publishTime,
  attributes: .message.attributes,
  dataPreview: (.message.data | @base64d | .[0:500]),
  deliveryAttempt: .deliveryAttempt
}' /tmp/dlq-dump.json

The deliveryAttempt field tells you how many times the consumer has failed to ack the message. By the time it's in a DLQ, it should be >= 5 (or >= 20 for webhook-delivery). If it's lower than the configured max_delivery_attempts, something is weird — stop and investigate before replaying.
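That sanity check is easy to script over the step-1 dump before touching anything. A minimal sketch; the `dlq_check_attempts` name is ours, not an existing tool, and the threshold is the 5 (or 20 for webhook-delivery) described above:

```shell
# Sketch: fail loudly if any message in the step-1 dump landed in the DLQ
# before exhausting max_delivery_attempts (pass 5, or 20 for webhook-delivery).
dlq_check_attempts() {
  local dump="$1" max="$2"
  local early
  # Messages whose deliveryAttempt is below the configured max are suspicious.
  early="$(jq --argjson max "${max}" \
    '[.[] | select((.deliveryAttempt // 0) < $max)] | length' "${dump}")"
  if [[ "${early}" -gt 0 ]]; then
    echo "WARNING: ${early} message(s) below max_delivery_attempts=${max}"
    return 1
  fi
  echo "OK: all messages exhausted ${max} delivery attempts"
}
```

e.g. `dlq_check_attempts /tmp/dlq-dump.json 5` before proceeding to step 2.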

2. Correlate with consumer logs

Take the publishTime of the oldest message and look at the consumer service logs from that time minus 1 minute:

PUBLISH_TIME="2026-04-13T12:34:56Z" # from the dump above
SERVICE="gateway-management" # the consumer of the parent subscription

# Widen the window to 1 minute BEFORE the first delivery so you see the
# initial failure (not just the retry cascade that ran a few seconds later).
LOOKBACK_START="$(date -u -d "${PUBLISH_TIME} - 1 minute" '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null \
  || python3 -c "import datetime; t=datetime.datetime.fromisoformat('${PUBLISH_TIME}'.rstrip('Z')) - datetime.timedelta(minutes=1); print(t.isoformat()+'Z')")"

gcloud logging read \
  "resource.labels.service_name=\"${SERVICE}\" AND timestamp >= \"${LOOKBACK_START}\"" \
  --project="${PROJECT}" \
  --limit=50 \
  --format='value(timestamp,severity,jsonPayload.message,jsonPayload.exception)'

You are looking for the stack trace that caused the original failure. Once you find it, you know what category of issue this is:

| Log pattern | Category | Next step |
| --- | --- | --- |
| JsonParseException / MismatchedInputException | Schema regression | Option B (replay after code fix) |
| Spanner mutation exceeds... limit | Too-large payload | Option C (quarantine, split, republish) |
| 403 Forbidden on webhook endpoint | Merchant misconfig | Option D (quarantine, notify merchant) |
| ResourceNotFoundException: merchant | Stale reference | Option C or D, depending on whether the merchant actually existed |
| Intermittent Spanner aborts or timeouts | Transient | Option A (simple replay) |
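The mapping in the table above can be sketched as a small case statement to run over suspect log lines. The pattern strings come from the table, except ABORTED/DEADLINE_EXCEEDED, which are our stand-ins for "intermittent Spanner aborts or timeouts"; adjust all of them to your actual log format:

```shell
# Sketch: map a consumer-log excerpt to the triage option suggested by the
# log-pattern table. Real log lines may need looser or stricter matching.
dlq_triage_option() {
  local line="$1"
  case "${line}" in
    *JsonParseException*|*MismatchedInputException*) echo "B (replay after code fix)" ;;
    *"Spanner mutation exceeds"*)                    echo "C (quarantine, split, republish)" ;;
    *"403 Forbidden"*)                               echo "D (quarantine, notify merchant)" ;;
    *"ResourceNotFoundException: merchant"*)         echo "C or D (stale reference)" ;;
    *ABORTED*|*DEADLINE_EXCEEDED*)                   echo "A (simple replay)" ;;
    *)                                               echo "unknown - triage manually" ;;
  esac
}
```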

3. Choose a path — A, B, C, or D

Option A — Simple replay

The failure was transient (e.g., Spanner unavailable for 3 minutes). Fix: move the messages back to the parent topic.

PARENT_TOPIC="webhook-delivery" # the topic the failed messages were originally published to

# Pull each message and republish to the parent topic. This is a
# write-then-ack sequence so if the write fails the message stays in the DLQ.
while true; do
  msg="$(gcloud pubsub subscriptions pull "${DLQ_SUB}" --project="${PROJECT}" --limit=1 --auto-ack=false --format='json' | jq -c '.[0]')"
  if [[ -z "${msg}" || "${msg}" == "null" ]]; then
    echo "DLQ drained."
    break
  fi
  data="$(echo "${msg}" | jq -r '.message.data')"
  attrs="$(echo "${msg}" | jq -c '.message.attributes // {}')"
  ack_id="$(echo "${msg}" | jq -r '.ackId')"

  # `gcloud pubsub topics publish --attribute` takes a single comma-separated
  # KEY=VALUE,... list; repeating the flag does not merge the values. Join the
  # attributes into one flag. (Attribute values containing commas or "=" would
  # need escaping that this simple loop does not attempt.)
  attr_list="$(echo "${attrs}" | jq -r 'to_entries | map("\(.key)=\(.value)") | join(",")')"

  publish_flags=(--project="${PROJECT}" --message="$(echo "${data}" | base64 -d)")
  if [[ -n "${attr_list}" ]]; then
    publish_flags+=(--attribute="${attr_list}")
  fi

  gcloud pubsub topics publish "${PARENT_TOPIC}" "${publish_flags[@]}"

  gcloud pubsub subscriptions ack "${DLQ_SUB}" --project="${PROJECT}" --ack-ids="${ack_id}"
done

Do NOT run this loop without first understanding why the messages failed in the first place. Replaying poisoned messages against a still-broken consumer puts them right back in the DLQ.

Option B — Replay after code fix

The consumer had a bug. Fix: deploy the fix (cut a hotfix release, approve the production deploy), then run Option A's replay loop.

Option C — Quarantine, split, republish

The message is legitimate but can't be processed in its current form (e.g., a batch of 5000 items that exceeds the Spanner mutation limit). Fix: write a one-off script that splits the payload into processable chunks, publish the chunks, then ack the original:

# Dump the poisoned message to a file for inspection
gcloud pubsub subscriptions pull "${DLQ_SUB}" --project="${PROJECT}" --limit=1 --auto-ack=false --format='json' \
| jq '.[0]' > /tmp/poisoned-message.json

# ... manually craft the fix and republish ...

# Finally, ack the DLQ message so it stops firing the alert
ACK_ID="$(jq -r '.ackId' < /tmp/poisoned-message.json)"
gcloud pubsub subscriptions ack "${DLQ_SUB}" --project="${PROJECT}" --ack-ids="${ACK_ID}"

File a follow-up issue documenting the split script so the next occurrence can reuse it.
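For the common "batch too large" case, the split step can be sketched with jq alone. Assumptions: the decoded payload is a JSON object with an items array, and the chunk size is illustrative; neither is a confirmed schema detail, so check the real payload first.

```shell
# Sketch: split an oversized batch payload into consumer-sized chunks.
# Assumes the decoded payload looks like {"batchId": ..., "items": [...]};
# the "items" field name is illustrative.
split_payload() {
  local payload_file="$1" chunk_size="$2"
  # Emit one compact JSON object per line, each carrying a slice of .items.
  jq -c --argjson n "${chunk_size}" '
    . as $p
    | range(0; ($p.items | length); $n) as $i
    | $p + {items: $p.items[$i : $i + $n]}
  ' "${payload_file}"
}

# Each emitted line can then be republished to the parent topic, e.g.:
#   split_payload /tmp/payload.json 500 | while IFS= read -r chunk; do
#     gcloud pubsub topics publish "${PARENT_TOPIC}" --project="${PROJECT}" --message="${chunk}"
#   done
```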

Option D — Quarantine and drop

The message cannot be successfully processed by any version of the consumer (e.g., a merchant's webhook domain has not resolved in DNS for 6 months). Fix: dump the message to a quarantine bucket for audit and ack it.

# Dump to the audit bucket
# Dump to the audit bucket (keep the filename in a variable so the copy and
# the ack below read the exact same file; a wildcard would also match stale
# dumps from earlier quarantines)
QFILE="/tmp/quarantine-$(date +%s).json"
gcloud pubsub subscriptions pull "${DLQ_SUB}" --project="${PROJECT}" --limit=1 --auto-ack=false --format='json' \
  > "${QFILE}"

gcloud storage cp "${QFILE}" gs://pinpoint-gateway-audit-logs/dlq-quarantine/

# Ack the message
ACK_ID="$(jq -r '.[0].ackId' < "${QFILE}")"
gcloud pubsub subscriptions ack "${DLQ_SUB}" --project="${PROJECT}" --ack-ids="${ACK_ID}"

Every quarantine action MUST be recorded in a tracking issue with the merchant ID, message body (redacted as needed), and reason. Silent drops are prohibited.
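Drafting that record can be scripted from the quarantine dump. A sketch, assuming the pulled message carries a merchantId attribute (hypothetical; substitute whatever your messages actually carry) and leaving body redaction to the operator:

```shell
# Sketch: render a tracking-issue body from a quarantine dump file (the JSON
# array written by the pull command above). The merchantId attribute is an
# assumption, not a confirmed part of the message schema.
quarantine_issue_body() {
  local qfile="$1" reason="$2"
  jq -r --arg reason "${reason}" '
    .[0]
    | ("## DLQ quarantine record",
       "- Merchant ID: \(.message.attributes.merchantId // "unknown")",
       "- Published: \(.message.publishTime // "unknown")",
       "- Delivery attempts: \(.deliveryAttempt // "unknown")",
       "- Reason: \($reason)",
       "- Body (base64, redact before filing): \(.message.data)")
  ' "${qfile}"
}
```

e.g. `quarantine_issue_body "${QFILE}" "merchant endpoint permanently 4xx"` and paste the output into the tracking issue.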

After the drain

  1. Re-pull the DLQ with --limit=100 --auto-ack=false. Expected: empty result.
  2. Watch the alert in Cloud Monitoring — it should auto-close within 5 minutes of the last ack.
  3. File a post-mortem issue with: what the failure was, root cause, the fix, which option path you used (A/B/C/D), and the count of messages processed per path.
  4. If you used Option B (code fix), link the hotfix PR.
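The empty-queue check in step 1 can be scripted against the pulled dump; `dlq_is_drained` is our name, not an existing command:

```shell
# Sketch: confirm the post-drain pull came back empty. Assumes the pull was
# saved with --format='json', which serializes an empty result as [].
dlq_is_drained() {
  local dump="$1"
  [[ "$(jq 'length' "${dump}")" -eq 0 ]]
}

# e.g.
#   gcloud pubsub subscriptions pull "${DLQ_SUB}" --project="${PROJECT}" \
#     --limit=100 --auto-ack=false --format='json' > /tmp/post-drain.json
#   dlq_is_drained /tmp/post-drain.json && echo "drained"
```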

Appendix: Publishing a test poison pill to validate the DLQ flow

Use this once after every infra change to confirm the DLQ to alert path still works:

# Publish a message the consumer will definitely reject (e.g., malformed JSON)
gcloud pubsub topics publish webhook-delivery \
  --project=pinpoint-gateway \
  --message='{"this":"will":"break":"the":"consumer"}'

# Wait for the DLQ transition. Time-to-DLQ = max_delivery_attempts × retry
# backoff, which varies by subscription:
# - transaction-events / settlement-events / reconciliation-* : max=5 → typically ~1–2 min
# - webhook-delivery (prod + staging) : max=20 → typically ~10–15 min
# For a faster validation loop, run this against a max=5 subscription instead.
# Then check that the DLQ has a message:
gcloud pubsub subscriptions pull webhook-delivery-dead-letter-drain \
  --project=pinpoint-gateway \
  --limit=10 \
  --auto-ack=false

# Expected: 1 message visible.
# Within 5-10 more minutes, the Cloud Monitoring alert fires.
# Acknowledge the alert, drain the DLQ (Option D — drop the test), and move on.