
Pub/Sub Dead-Letter Queue Drain Runbook

Use this runbook when the "Pub/Sub DLQ has messages: <subscription-name>" alert fires. Any message in a DLQ represents a genuine processing failure that the subscriber could not recover from after 5 delivery attempts (20 for webhook-delivery, which tolerates more transient 5xx retries from merchant endpoints before quarantining).

DLQs are not supposed to hold messages. A non-zero depth is always a real problem. Do not acknowledge the alert until the queue is drained to zero or every remaining message has been explicitly quarantined with a tracked reason.

Quick reference

| Subscription | DLQ drain subscription | Consumer service | What a poisoned message usually means |
| --- | --- | --- | --- |
| transaction-events-sub | transaction-events-dlq-sub | management (transaction log writer) | A transaction with a malformed schema or a Spanner mutation that violates a constraint |
| settlement-events-sub | settlement-events-dlq-sub | processing (batch reconciliation) | A TransIT batch close response the parser cannot handle |
| webhook-delivery-sub | webhook-delivery-dead-letter-drain | management (webhook delivery worker) | A merchant webhook endpoint is permanently 4xx (misconfigured URL, deprecated HMAC) |
| staging-webhook-delivery-sub | staging-webhook-delivery-dead-letter-drain | management staging | Same as above, but in staging |
| reconciliation-requests-sub | reconciliation-requests-dlq-sub | processing (reconciliation worker) | A recon request referencing a merchant that has since been deleted |
| reconciliation-sweep-sub | reconciliation-sweep-dlq-sub | processing (reconciliation scheduler) | A scheduler fan-out produced an invalid merchant list |

Webhook-specific note

For webhook-delivery-* queues, this runbook covers only the raw Pub/Sub layer. Once the management drain consumer has translated the failed message into an application-level delivery row (DLQ_PENDING / DLQ_EXHAUSTED), switch to the higher-level operator workflow.

Triage — step by step

Replace <DLQ_SUB> with the drain subscription name from the alert (e.g., webhook-delivery-dead-letter-drain).

1. Pull the top 10 messages without ack-ing

PROJECT="pinpoint-gateway"
DLQ_SUB="<DLQ_SUB>"

gcloud pubsub subscriptions pull "${DLQ_SUB}" \
  --project="${PROJECT}" \
  --limit=10 \
  --auto-ack=false \
  --format='json' \
  > /tmp/dlq-dump.json

jq '.[] | {
  ackId,
  publishTime: .message.publishTime,
  attributes: .message.attributes,
  dataPreview: (.message.data | @base64d | .[0:500]),
  deliveryAttempt: .deliveryAttempt
}' /tmp/dlq-dump.json

The deliveryAttempt field tells you how many times the consumer has failed to ack the message. By the time it's in a DLQ, it should be >= 5 (or >= 20 for webhook-delivery). If it's lower than the configured max_delivery_attempts, something is weird — stop and investigate before replaying.
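That sanity check is easy to script over the step-1 dump before touching anything. A minimal sketch; the `dlq_check_attempts` name is ours, not an existing tool, and the threshold is the 5 (or 20 for webhook-delivery) described above:

```shell
# Sketch: fail loudly if any message in the step-1 dump landed in the DLQ
# before exhausting max_delivery_attempts (pass 5, or 20 for webhook-delivery).
dlq_check_attempts() {
  local dump="$1" max="$2"
  local early
  # Messages whose deliveryAttempt is below the configured max are suspicious.
  early="$(jq --argjson max "${max}" \
    '[.[] | select((.deliveryAttempt // 0) < $max)] | length' "${dump}")"
  if [[ "${early}" -gt 0 ]]; then
    echo "WARNING: ${early} message(s) below max_delivery_attempts=${max}"
    return 1
  fi
  echo "OK: all messages exhausted ${max} delivery attempts"
}
```

e.g. `dlq_check_attempts /tmp/dlq-dump.json 5` before proceeding to step 2.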

2. Correlate with consumer logs

Take the publishTime of the oldest message and look at the consumer service logs from that time minus 1 minute:

PUBLISH_TIME="2026-04-13T12:34:56Z" # from the dump above
SERVICE="gateway-management" # the consumer of the parent subscription

# Widen the window to 1 minute BEFORE the first delivery so you see the
# initial failure (not just the retry cascade that ran a few seconds later).
LOOKBACK_START="$(date -u -d "${PUBLISH_TIME} - 1 minute" '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null \
  || python3 -c "import datetime; t=datetime.datetime.fromisoformat('${PUBLISH_TIME}'.rstrip('Z')) - datetime.timedelta(minutes=1); print(t.isoformat()+'Z')")"

gcloud logging read \
  "resource.labels.service_name=\"${SERVICE}\" AND timestamp >= \"${LOOKBACK_START}\"" \
  --project="${PROJECT}" \
  --limit=50 \
  --format='value(timestamp,severity,jsonPayload.message,jsonPayload.exception)'

You are looking for the stack trace that caused the original failure. Once you find it, you know what category of issue this is:

| Log pattern | Category | Next step |
| --- | --- | --- |
| JsonParseException / MismatchedInputException | Schema regression | Option B (replay after code fix) |
| Spanner mutation exceeds... limit | Too-large payload | Option C (quarantine, split, republish) |
| 403 Forbidden on webhook endpoint | Merchant misconfig | Option D (quarantine, notify merchant) |
| ResourceNotFoundException: merchant | Stale reference | Option C or D, depending on whether the merchant actually existed |
| Intermittent Spanner aborts or timeouts | Transient | Option A (simple replay) |
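The mapping in the table above can be sketched as a small case statement to run over suspect log lines. The pattern strings come from the table, except ABORTED/DEADLINE_EXCEEDED, which are our stand-ins for "intermittent Spanner aborts or timeouts"; adjust all of them to your actual log format:

```shell
# Sketch: map a consumer-log excerpt to the triage option suggested by the
# log-pattern table. Real log lines may need looser or stricter matching.
dlq_triage_option() {
  local line="$1"
  case "${line}" in
    *JsonParseException*|*MismatchedInputException*) echo "B (replay after code fix)" ;;
    *"Spanner mutation exceeds"*)                    echo "C (quarantine, split, republish)" ;;
    *"403 Forbidden"*)                               echo "D (quarantine, notify merchant)" ;;
    *"ResourceNotFoundException: merchant"*)         echo "C or D (stale reference)" ;;
    *ABORTED*|*DEADLINE_EXCEEDED*)                   echo "A (simple replay)" ;;
    *)                                               echo "unknown - triage manually" ;;
  esac
}
```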

3. Choose a path — A, B, C, or D

Option A — Simple replay

The failure was transient (e.g., Spanner unavailable for 3 minutes). Fix: move the messages back to the parent topic.

PARENT_TOPIC="webhook-delivery" # the topic the failed messages were originally published to

# Pull each message and republish to the parent topic. This is a
# write-then-ack sequence so if the write fails the message stays in the DLQ.
while true; do
  msg="$(gcloud pubsub subscriptions pull "${DLQ_SUB}" --project="${PROJECT}" --limit=1 --auto-ack=false --format='json' | jq -c '.[0]')"
  if [[ -z "${msg}" || "${msg}" == "null" ]]; then
    echo "DLQ drained."
    break
  fi
  data="$(echo "${msg}" | jq -r '.message.data')"
  attrs="$(echo "${msg}" | jq -c '.message.attributes // {}')"
  ack_id="$(echo "${msg}" | jq -r '.ackId')"

  # `gcloud pubsub topics publish --attribute` takes a single comma-separated
  # KEY=VALUE,... list; repeating the flag does not merge the values. Join the
  # attributes into one flag. (Attribute values containing commas or "=" would
  # need escaping that this simple loop does not attempt.)
  attr_list="$(echo "${attrs}" | jq -r 'to_entries | map("\(.key)=\(.value)") | join(",")')"

  publish_flags=(--project="${PROJECT}" --message="$(echo "${data}" | base64 -d)")
  if [[ -n "${attr_list}" ]]; then
    publish_flags+=(--attribute="${attr_list}")
  fi

  gcloud pubsub topics publish "${PARENT_TOPIC}" "${publish_flags[@]}"

  gcloud pubsub subscriptions ack "${DLQ_SUB}" --project="${PROJECT}" --ack-ids="${ack_id}"
done

Do NOT run this loop without first understanding why the messages failed in the first place. Replaying poisoned messages against a still-broken consumer puts them right back in the DLQ.

Option B — Replay after code fix

The consumer had a bug. Fix: deploy the fix (cut a hotfix release, approve the production deploy), then run Option A's replay loop.

Option C — Quarantine, split, republish

The message is legitimate but can't be processed in its current form (e.g., a batch of 5000 items that exceeds the Spanner mutation limit). Fix: write a one-off script that splits the payload into processable chunks, publish the chunks, then ack the original:

# Dump the poisoned message to a file for inspection
gcloud pubsub subscriptions pull "${DLQ_SUB}" --project="${PROJECT}" --limit=1 --auto-ack=false --format='json' \
| jq '.[0]' > /tmp/poisoned-message.json

# ... manually craft the fix and republish ...

# Finally, ack the DLQ message so it stops firing the alert
ACK_ID="$(jq -r '.ackId' < /tmp/poisoned-message.json)"
gcloud pubsub subscriptions ack "${DLQ_SUB}" --project="${PROJECT}" --ack-ids="${ACK_ID}"

File a follow-up issue documenting the split script so the next occurrence can reuse it.
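For the common "batch too large" case, the split step can be sketched with jq alone. Assumptions: the decoded payload is a JSON object with an items array, and the chunk size is illustrative; neither is a confirmed schema detail, so check the real payload first.

```shell
# Sketch: split an oversized batch payload into consumer-sized chunks.
# Assumes the decoded payload looks like {"batchId": ..., "items": [...]};
# the "items" field name is illustrative.
split_payload() {
  local payload_file="$1" chunk_size="$2"
  # Emit one compact JSON object per line, each carrying a slice of .items.
  jq -c --argjson n "${chunk_size}" '
    . as $p
    | range(0; ($p.items | length); $n) as $i
    | $p + {items: $p.items[$i : $i + $n]}
  ' "${payload_file}"
}

# Each emitted line can then be republished to the parent topic, e.g.:
#   split_payload /tmp/payload.json 500 | while IFS= read -r chunk; do
#     gcloud pubsub topics publish "${PARENT_TOPIC}" --project="${PROJECT}" --message="${chunk}"
#   done
```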

Option D — Quarantine and drop

The message cannot be successfully processed by any version of the consumer (e.g., a merchant's webhook domain has not resolved in DNS for 6 months). Fix: dump the message to a quarantine bucket for audit and ack it.

# Dump to the audit bucket
# Dump to the audit bucket (keep the filename in a variable so the copy and
# the ack below read the exact same file; a wildcard would also match stale
# dumps from earlier quarantines)
QFILE="/tmp/quarantine-$(date +%s).json"
gcloud pubsub subscriptions pull "${DLQ_SUB}" --project="${PROJECT}" --limit=1 --auto-ack=false --format='json' \
  > "${QFILE}"

gcloud storage cp "${QFILE}" gs://pinpoint-gateway-audit-logs/dlq-quarantine/

# Ack the message
ACK_ID="$(jq -r '.[0].ackId' < "${QFILE}")"
gcloud pubsub subscriptions ack "${DLQ_SUB}" --project="${PROJECT}" --ack-ids="${ACK_ID}"

Every quarantine action MUST be recorded in a tracking issue with the merchant ID, message body (redacted as needed), and reason. Silent drops are prohibited.
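Drafting that record can be scripted from the quarantine dump. A sketch, assuming the pulled message carries a merchantId attribute (hypothetical; substitute whatever your messages actually carry) and leaving body redaction to the operator:

```shell
# Sketch: render a tracking-issue body from a quarantine dump file (the JSON
# array written by the pull command above). The merchantId attribute is an
# assumption, not a confirmed part of the message schema.
quarantine_issue_body() {
  local qfile="$1" reason="$2"
  jq -r --arg reason "${reason}" '
    .[0]
    | ("## DLQ quarantine record",
       "- Merchant ID: \(.message.attributes.merchantId // "unknown")",
       "- Published: \(.message.publishTime // "unknown")",
       "- Delivery attempts: \(.deliveryAttempt // "unknown")",
       "- Reason: \($reason)",
       "- Body (base64, redact before filing): \(.message.data)")
  ' "${qfile}"
}
```

e.g. `quarantine_issue_body "${QFILE}" "merchant endpoint permanently 4xx"` and paste the output into the tracking issue.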

After the drain

  1. Re-pull the DLQ with --limit=100 --auto-ack=false. Expected: empty result.
  2. Watch the alert in Cloud Monitoring — it should auto-close within 5 minutes of the last ack.
  3. File a post-mortem issue with: what the failure was, root cause, the fix, which option path you used (A/B/C/D), and the count of messages processed per path.
  4. If you used Option B (code fix), link the hotfix PR.
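The empty-queue check in step 1 can be scripted against the pulled dump; `dlq_is_drained` is our name, not an existing command:

```shell
# Sketch: confirm the post-drain pull came back empty. Assumes the pull was
# saved with --format='json', which serializes an empty result as [].
dlq_is_drained() {
  local dump="$1"
  [[ "$(jq 'length' "${dump}")" -eq 0 ]]
}

# e.g.
#   gcloud pubsub subscriptions pull "${DLQ_SUB}" --project="${PROJECT}" \
#     --limit=100 --auto-ack=false --format='json' > /tmp/post-drain.json
#   dlq_is_drained /tmp/post-drain.json && echo "drained"
```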

Appendix: Publishing a test poison pill to validate the DLQ flow

Use this once after every infra change to confirm the DLQ to alert path still works:

# Publish a message the consumer will definitely reject (e.g., malformed JSON)
gcloud pubsub topics publish webhook-delivery \
  --project=pinpoint-gateway \
  --message='{"this":"will":"break":"the":"consumer"}'

# Wait for the DLQ transition. Time-to-DLQ = max_delivery_attempts × retry
# backoff, which varies by subscription:
# - transaction-events / settlement-events / reconciliation-* : max=5 → typically ~1–2 min
# - webhook-delivery (prod + staging) : max=20 → typically ~10–15 min
# For a faster validation loop, run this against a max=5 subscription instead.
# Then check that the DLQ has a message:
gcloud pubsub subscriptions pull webhook-delivery-dead-letter-drain \
  --project=pinpoint-gateway \
  --limit=10 \
  --auto-ack=false

# Expected: 1 message visible.
# Within 5-10 more minutes, the Cloud Monitoring alert fires.
# Acknowledge the alert, drain the DLQ (Option D — drop the test), and move on.