Webhook DLQ depth > 0

Alert policy: google_monitoring_alert_policy.webhook_dlq_depth_persistent (infra/terraform/modules/monitoring/webhook-dlq-alert.tf)

Canonical URL: https://support.pinpointgateway.com/docs/ops/runbooks/webhook-dlq-depth

Severity: WARNING

What this alert means

One or more webhook deliveries are already marked DLQ_PENDING in the management service. This is not the raw Pub/Sub dead-letter-subscription alert. By the time this alert fires, the Pub/Sub drain consumer has already translated the dead-lettered message into an application-level delivery row that an admin or merchant can inspect and retry.

The alert is keyed by organizationId from the custom metric webhook_dlq_depth. A non-zero value means at least one merchant-facing webhook was never delivered successfully and still needs operator or merchant action.

First three diagnostic steps

  1. Pull the affected Organization from the metric label. The alert groups by metric.label."organizationId". Use that Organization ID as the primary filter for every subsequent step.
  2. Inspect the application-level failures first. In the admin Event Log, filter by the Organization and delivery_status=DLQ_PENDING. Look at last_error, last_response_headers, destination URL, and whether the same subscription is failing repeatedly. The persisted-event surface and SDK helpers are summarized in Webhook Event Log and Replay.
  3. Decide retry vs. wait-for-expiry. If the endpoint is fixed, trigger POST /api/v1/webhook-deliveries/{id}/retry. Use DLQ retry for a specific failed delivery row; use event replay only when you intentionally want to enqueue the event again from the event record. If the endpoint is still bad or merchant-owned infrastructure is down, leave the row pending while coordinating with the merchant, or allow the daily back-stop sweep to flip it to DLQ_EXHAUSTED.
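As a minimal sketch of the retry call in step 3, the snippet below builds (but does not send) the POST request for a single delivery row. Only the path /api/v1/webhook-deliveries/{id}/retry comes from this runbook; the base URL, the bearer-token auth scheme, and the helper name build_retry_request are assumptions for illustration.

```python
import urllib.parse
import urllib.request

# Assumption: placeholder host; substitute the real management API base URL.
BASE_URL = "https://management.example.internal"


def build_retry_request(delivery_id: str, token: str) -> urllib.request.Request:
    """Build, without sending, the retry request for one DLQ_PENDING delivery row."""
    # Percent-encode the ID so unexpected characters cannot alter the path.
    safe_id = urllib.parse.quote(delivery_id, safe="")
    url = f"{BASE_URL}/api/v1/webhook-deliveries/{safe_id}/retry"
    return urllib.request.Request(
        url,
        method="POST",
        # Assumption: bearer-token auth; the runbook does not state the auth scheme.
        headers={"Authorization": f"Bearer {token}"},
    )
```

Passing the returned object to urllib.request.urlopen would send it. Building the request separately makes it easy to review the target delivery ID before retrying.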

When to use the Pub/Sub DLQ drain runbook instead

Use the deeper Pub/Sub DLQ drain runbook only if you suspect the management drain consumer is not translating dead-lettered Pub/Sub messages into DLQ_PENDING rows correctly. Typical signals:

  • Pub/Sub webhook-delivery-dead-letter-drain depth is growing, but webhook_dlq_depth is flat.
  • The alert fired immediately after a management deploy and no matching DLQ_PENDING rows are visible for the Organization.
  • The management logs show drain-consumer deserialization failures or lookup failures around the same time.

If the DLQ_PENDING rows already exist, stay in the application-level workflow in this runbook; do not start by replaying raw Pub/Sub messages.
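The first signal above (Pub/Sub depth growing while webhook_dlq_depth stays flat) can be checked roughly as follows. This is a sketch under stated assumptions: both inputs are depth samples you have already pulled for the same time window, oldest first, and "growing" and "flat" are defined here in the simplest possible way rather than by any threshold the alert policy uses.

```python
from typing import Sequence


def drain_consumer_suspect(
    pubsub_dlq_depth: Sequence[int],
    webhook_dlq_depth: Sequence[int],
) -> bool:
    """Return True when Pub/Sub depth rose over the window while webhook_dlq_depth did not move."""
    if len(pubsub_dlq_depth) < 2 or len(webhook_dlq_depth) < 2:
        return False  # not enough samples to call a trend
    # Assumption: "growing" means the last sample exceeds the first.
    pubsub_growing = pubsub_dlq_depth[-1] > pubsub_dlq_depth[0]
    # Assumption: "flat" means every sample in the window is identical.
    app_flat = max(webhook_dlq_depth) == min(webhook_dlq_depth)
    return pubsub_growing and app_flat
```

A True result is only a hint to open the Pub/Sub DLQ drain runbook; the deploy-timing and log signals still need a manual look.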

Rollback criteria

Do not roll back management just because a merchant endpoint is broken. Roll back only if all of the following are true:

  • The alert started immediately after a recent gateway-management deploy.
  • Multiple Organizations began accumulating DLQ_PENDING rows at once.
  • The failure signatures point at our delivery/drain behavior, not merchant-owned endpoints.

If the failure is isolated to one merchant or one destination URL, treat it as an integration incident, not a service rollback candidate.
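The rollback criteria reduce to a conjunction of three conditions, which the sketch below makes explicit. The function and parameter names are hypothetical, and each input is something the on-call judges from the deploy history, the Event Log, and the failure signatures.

```python
def should_roll_back(
    started_after_deploy: bool,
    affected_org_count: int,
    failure_in_our_path: bool,
) -> bool:
    """Return True only when every rollback criterion holds."""
    # "Multiple Organizations" is read here as more than one.
    multiple_orgs = affected_org_count > 1
    return started_after_deploy and multiple_orgs and failure_in_our_path
```

For example, should_roll_back(True, 1, True) is False: a single affected Organization is an integration incident even if the timing lines up with a deploy.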

Escalation path

  • T+0 — on-call acknowledges, identifies the affected Organization, and inspects the Event Log.
  • T+30 min — if the failure mode is unclear, page the management service owner.
  • T+1 hour — if the rows are still accumulating for the same Organization, notify merchant-success because the merchant is actively missing webhooks.
  • T+2 hours — escalate to engineering lead if no mitigation path exists or multiple Organizations are affected.
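The escalation timeline can be expressed as a small lookup so that tooling or a checklist can show which steps have come due. This is an illustrative sketch: the thresholds and actions mirror the list above, while the names ESCALATION_STEPS and steps_due are hypothetical, and the "if" conditions remain a human judgment.

```python
from typing import List, Tuple

# (minutes since the alert fired, action), copied from the escalation path above.
ESCALATION_STEPS: List[Tuple[int, str]] = [
    (0, "On-call acknowledges, identifies the affected Organization, and inspects the Event Log"),
    (30, "If the failure mode is unclear, page the management service owner"),
    (60, "If rows are still accumulating for the same Organization, notify merchant-success"),
    (120, "If no mitigation path exists or multiple Organizations are affected, escalate to engineering lead"),
]


def steps_due(minutes_since_alert: int) -> List[str]:
    """Return every escalation action whose time threshold has been reached."""
    return [
        action
        for threshold, action in ESCALATION_STEPS
        if minutes_since_alert >= threshold
    ]
```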