
Distributed Tracing Verification

This runbook verifies that end-to-end distributed tracing works across the gateway's service-to-service call graph. Run it once per week and before every production deployment.

Prerequisites

  • gcloud CLI authenticated against the staging project
  • curl and jq
  • Access to the staging Cloud Trace UI in the GCP console

Happy-path check (automated)

From a developer machine:

    bazel test //libs/security:internal_service_client_trace_test \
      //libs/security:distributed_tracing_chain_test \
      //libs/transit-client:transit_client_trace_test

Expected: all three targets pass. These three tests collectively prove the traceparent header is injected on outbound RestTemplate and OkHttp calls, and that a trace-id survives two hops.
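For reference when reading wire captures or test output: the header these tests assert on follows the W3C Trace Context format, `version-traceid-parentid-flags`, all lowercase hex. A minimal sketch of a format check (the sample value below is illustrative, not a real trace):

```shell
#!/bin/sh
# Check that a header value matches the W3C traceparent shape:
#   version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
is_valid_traceparent() {
  printf '%s' "$1" | grep -Eq '^[0-9a-f]{2}-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$'
}

# Sample value taken from the W3C spec's examples, not from a live trace.
if is_valid_traceparent "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"; then
  echo "valid"
else
  echo "invalid"
fi
```

A trace-id of all zeros is also invalid per the spec; the regex above deliberately stays loose and only checks the shape, which is enough for eyeballing a capture.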

Manual staging verification

  1. Pick a test merchant that is safe to transact against in staging. Record its merchant ID.

  2. Fire a test checkout session against the staging online-txn service:

    CHECKOUT_URL="https://checkout.staging.peakgateway.co/api/v1/checkout-sessions"
    curl -sS -D /tmp/checkout-headers.txt -X POST "$CHECKOUT_URL" \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $STAGING_TOKEN" \
      -d '{"merchantId":"<merchant-id>","lineItems":[{"priceId":"price_test","quantity":1}]}' \
      | jq '.'

  3. Note the X-Cloud-Trace-Context response header, dumped to /tmp/checkout-headers.txt by the -D flag above. If it is absent, take the most recent trace from the Cloud Trace UI filtered by service.name=online-txn-service AND http.response.status_code=200.

  4. Open the trace in the GCP Cloud Trace UI. The waterfall should show at least three spans in order:

    • POST /api/v1/checkout-sessions on online-txn-service
    • POST /api/v1/transactions/sale on processing-service
    • An outbound TransIT call (the span carries no custom name, but it appears as an OkHttp child of the processing span)
  5. Confirm the three spans share the same trace-id in the URL.
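Steps 3–5 can be partly scripted: an X-Cloud-Trace-Context value has the shape TRACE_ID/SPAN_ID;o=OPTIONS, and the trace-id portion is enough to build a console link. A sketch, assuming the standard Cloud Trace list URL with a tid query parameter (the header value and project ID below are placeholders, not real identifiers):

```shell
#!/bin/sh
# Turn an X-Cloud-Trace-Context value into a Cloud Trace console link.
PROJECT_ID="<project-id>"                           # placeholder
HEADER="105445aa7843bc8bf206b12000100000/1;o=1"     # sample header value

# The trace-id is everything before the first "/".
TRACE_ID="${HEADER%%/*}"

echo "https://console.cloud.google.com/traces/list?tid=${TRACE_ID}&project=${PROJECT_ID}"
```

In practice you would populate HEADER with `grep -i x-cloud-trace-context` over the dumped response headers rather than a literal.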

When a gap appears

If step 4 shows only the top-level span and no children:

  1. Check Cloud Trace filter is not set to parent:none.
  2. Run the following and confirm log lines from both services carry the same trace value:

     gcloud logging read \
       'resource.type="cloud_run_revision" AND trace="projects/<project-id>/traces/<trace-id>"' \
       --limit=50
  3. If the child service's logs carry a different trace-id, the outbound interceptor is the suspect. Run the automated test suite to confirm TraceContextClientHttpRequestInterceptor is still registered and that TraceContextHeaders.currentHeaders() returns a non-empty map at that hop.
  4. If the child service's logs carry no trace-id, RequestMdcFilter is not processing the inbound header. Confirm the filter chain order in services/*/src/main/kotlin/.../config/SecurityConfig.kt still registers requestMdcFilter via addFilterBefore.
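The trace comparison in step 2 can be done mechanically once you have the JSON log entries (e.g. from gcloud logging read with --format=json). A minimal sketch that extracts and compares the trace field from two structured log lines; the sample entries below are fabricated for illustration:

```shell
#!/bin/sh
# Compare the "trace" field of two structured log entries.
# Both sample entries are fabricated; substitute real gcloud output.
LOG_A='{"trace":"projects/demo/traces/abc123","resource":{"labels":{"service_name":"online-txn-service"}}}'
LOG_B='{"trace":"projects/demo/traces/abc123","resource":{"labels":{"service_name":"processing-service"}}}'

# Extract the value of the "trace" key (sed keeps this jq-free).
TRACE_A=$(printf '%s' "$LOG_A" | sed -n 's/.*"trace":"\([^"]*\)".*/\1/p')
TRACE_B=$(printf '%s' "$LOG_B" | sed -n 's/.*"trace":"\([^"]*\)".*/\1/p')

if [ "$TRACE_A" = "$TRACE_B" ]; then
  echo "same trace: $TRACE_A"
else
  echo "trace mismatch: $TRACE_A vs $TRACE_B"
fi
```

If the two values differ, you have the exact symptom step 3 describes, and the outbound interceptor on the calling side is the first suspect.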

Cloud Trace sampling

Soft launch runs at 100% sampling (GCP_TRACE_SAMPLING_RATIO=1.0, set in services/*/src/main/resources/application.yml), so every transaction produces a trace. This is intentional: at early traffic levels we cannot afford to miss an incident because the one relevant request was sampled out.

Dial this back (e.g. to 0.10) once sustained production RPS climbs past ~50 and Cloud Trace cost signals justify the drop. The per-service @Value fallback in SecurityConfig.kt / WebFilterConfig.kt mirrors the YAML default so local runs behave the same as staging. Tracked as OBS-09 in the launch-readiness spec (#441).
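For context, the YAML side of this wiring might look like the fragment below. The property path is an assumption based on Spring Boot's Micrometer tracing support (management.tracing.sampling.probability); the repo's actual keys may differ:

```yaml
management:
  tracing:
    sampling:
      # GCP_TRACE_SAMPLING_RATIO defaults to 1.0 (100% sampling) for soft launch;
      # lower toward 0.10 once traffic and cost justify it (OBS-09, #441).
      probability: ${GCP_TRACE_SAMPLING_RATIO:1.0}
```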

Why this exists

A distributed trace that stops at the service boundary makes incident response blind past the first hop. Before OBS-01 landed, outbound RestTemplate calls in InternalServiceClient and OkHttp calls in TransitClient carried no trace headers, so every downstream service started a fresh root span. The fix (TraceContextClientHttpRequestInterceptor and TransitTraceInterceptor) copies the current span's W3C traceparent onto every outbound request. This runbook is the human check that validates the machine check.