Webhook DLQ depth > 0
Alert policy: google_monitoring_alert_policy.webhook_dlq_depth_persistent (infra/terraform/modules/monitoring/webhook-dlq-alert.tf)
Canonical URL: https://support.pinpointgateway.com/docs/ops/runbooks/webhook-dlq-depth
Severity: WARNING
What this alert means
One or more webhook deliveries are already marked `DLQ_PENDING` in the management service. This is not the raw Pub/Sub dead-letter-subscription alert. By the time this alert fires, the Pub/Sub drain consumer has already translated the dead-lettered message into an application-level delivery row that an admin or merchant can inspect and retry.
The alert is keyed by `organizationId` from the custom metric `webhook_dlq_depth`. A non-zero value means at least one merchant-facing webhook was never delivered successfully and still needs operator or merchant action.
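For orientation, the policy's threshold condition might look roughly like the following Terraform fragment. This is a sketch only: the metric path, aligner, and duration are assumptions, and `infra/terraform/modules/monitoring/webhook-dlq-alert.tf` is the authoritative definition.

```hcl
# Hedged sketch -- see webhook-dlq-alert.tf for the real policy.
condition_threshold {
  # Assumption: the custom metric lives under custom.googleapis.com/.
  filter          = "metric.type = \"custom.googleapis.com/webhook_dlq_depth\""
  comparison      = "COMPARISON_GT"
  threshold_value = 0
  duration        = "300s" # assumption: "persistent" suggests a hold window

  aggregations {
    alignment_period     = "60s"
    per_series_aligner   = "ALIGN_MAX"
    # One time series (and one incident) per Organization:
    group_by_fields      = ["metric.label.organizationId"]
    cross_series_reducer = "REDUCE_SUM"
  }
}
```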
First three diagnostic steps
- Pull the affected Organization from the metric label. The alert groups by `metric.label."organizationId"`. Use that Organization ID as the primary filter for every next step.
- Inspect the application-level failures first. In the admin Event Log, filter by the Organization and `delivery_status=DLQ_PENDING`. Look at `last_error`, `last_response_headers`, the destination URL, and whether the same subscription is failing repeatedly. The persisted-event surface and SDK helpers are summarized in Webhook Event Log and Replay.
- Decide retry vs. wait-for-expiry. If the endpoint is fixed, trigger `POST /api/v1/webhook-deliveries/{id}/retry`. Use DLQ retry for a specific failed delivery row; use event replay only when you intentionally want to enqueue the event again from the event record. If the endpoint is still bad or merchant-owned infrastructure is down, leave the row pending while coordinating with the merchant, or allow the daily back-stop sweep to flip it to `DLQ_EXHAUSTED`.
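The retry call in the last step can be issued with any HTTP client. A minimal sketch follows; only the path comes from this runbook, while the base URL and bearer-token auth header are assumptions to adapt to your environment:

```python
# Hedged sketch: construct the DLQ retry request for one failed delivery row.
# Base URL and auth scheme are assumptions; the path is from this runbook.
import urllib.request

BASE_URL = "https://management.example.internal"  # assumption: replace with the real host


def build_retry_request(delivery_id: str, token: str) -> urllib.request.Request:
    """Build (but do not send) POST /api/v1/webhook-deliveries/{id}/retry."""
    url = f"{BASE_URL}/api/v1/webhook-deliveries/{delivery_id}/retry"
    req = urllib.request.Request(url, method="POST")
    req.add_header("Authorization", f"Bearer {token}")
    return req


req = build_retry_request("dlv_123", "REDACTED")
print(req.get_method(), req.full_url)
```

Sending it is then `urllib.request.urlopen(req)`; keep retries targeted at specific delivery rows rather than looping over every `DLQ_PENDING` row blindly.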
When to use the Pub/Sub DLQ drain runbook instead
Use the deeper Pub/Sub DLQ drain runbook only if you suspect the management drain consumer is not translating dead-lettered Pub/Sub messages into `DLQ_PENDING` rows correctly. Typical signals:
- Pub/Sub `webhook-delivery-dead-letter-drain` depth is growing, but `webhook_dlq_depth` is flat.
- The alert fired immediately after a management deploy and no matching `DLQ_PENDING` rows are visible for the Organization.
- The management logs show drain-consumer deserialization failures or lookup failures around the same time.
If the `DLQ_PENDING` rows already exist, stay in the application-level workflow in this runbook; do not start by replaying raw Pub/Sub messages.
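The routing decision above can be sketched as a tiny helper. The function name and its inputs are illustrative only, not a real API:

```python
# Hedged sketch of the triage fork: compare the raw Pub/Sub dead-letter
# backlog with the application-level webhook_dlq_depth metric.
def next_runbook(pubsub_dlq_backlog: int, app_dlq_depth: int) -> str:
    """Pick the runbook to follow based on where the failures are visible."""
    if pubsub_dlq_backlog > 0 and app_dlq_depth == 0:
        # Messages are dead-lettered but never became DLQ_PENDING rows:
        # the drain consumer is suspect, so go to the Pub/Sub DLQ drain runbook.
        return "pubsub-dlq-drain"
    # Rows exist at the application level: stay in this runbook.
    return "webhook-dlq-depth"


print(next_runbook(pubsub_dlq_backlog=12, app_dlq_depth=0))  # pubsub-dlq-drain
```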
Rollback criteria
Do not roll back management just because a merchant endpoint is broken. Roll back only if all of the following are true:
- The alert started immediately after a recent `gateway-management` deploy.
- Multiple Organizations began accumulating `DLQ_PENDING` rows at once.
- The failure signatures point at our delivery/drain behavior, not merchant-owned endpoints.
If the failure is isolated to one merchant or one destination URL, treat it as an integration incident, not a service rollback candidate.
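The three rollback criteria combine as a strict AND; a throwaway sketch (names are illustrative):

```python
# Hedged sketch of the rollback gate: all three criteria in this runbook
# must hold before rolling back gateway-management.
def rollback_candidate(started_after_deploy: bool,
                       multiple_orgs_affected: bool,
                       failure_in_our_code: bool) -> bool:
    """True only when every rollback criterion is met."""
    return started_after_deploy and multiple_orgs_affected and failure_in_our_code


# A single merchant's broken endpoint never qualifies:
print(rollback_candidate(True, False, False))  # False
```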
Escalation path
- T+0 — on-call acknowledges, identifies the affected Organization, and inspects the Event Log.
- T+30 min — if the failure mode is unclear, page the management service owner.
- T+1 hour — if the rows are still accumulating for the same Organization, notify merchant-success because the merchant is actively missing webhooks.
- T+2 hours — escalate to engineering lead if no mitigation path exists or multiple Organizations are affected.
Related docs
- Pub/Sub DLQ drain runbook — raw Pub/Sub dead-letter triage and replay procedure.
- Webhook Event Log and Replay — persisted event APIs, replay vs. DLQ retry semantics, and SDK helper surface.
- Alert runbook index — catalog of Cloud Monitoring alerts and their primary runbooks.