Distributed Tracing Verification
This runbook verifies that end-to-end distributed tracing works across the gateway's service-to-service call graph. Run it once per week and before every production deployment.
Prerequisites
- gcloud CLI authenticated against the staging project
- curl and jq
- Access to the staging Cloud Trace UI in the GCP console
Happy-path check (automated)
From a developer machine:
bazel test //libs/security:internal_service_client_trace_test \
//libs/security:distributed_tracing_chain_test \
//libs/transit-client:transit_client_trace_test
Expected: all three targets pass. These three tests collectively prove the
traceparent header is injected on outbound RestTemplate and OkHttp calls,
and that a trace-id survives two hops.
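To make the assertion concrete, the sketch below checks that a header value follows the W3C traceparent layout for version 00 (00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>) and prints its trace-id, so the values seen at two hops can be compared by hand. The helper names trace_id_from_traceparent and same_trace are hypothetical and are not part of the repository, and the sample values in the usage note are illustrative, not captured from a real request.

```shell
# Sketch only: helper names are hypothetical, not taken from the codebase.

# Print the trace-id of a version-00 traceparent value, or fail if malformed.
trace_id_from_traceparent() {
  # Accept only version 00 with lowercase hex fields of the expected widths.
  printf '%s\n' "$1" \
    | grep -Eq '^00-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$' || return 1
  trace_id=$(printf '%s\n' "$1" | cut -d- -f2)
  parent_id=$(printf '%s\n' "$1" | cut -d- -f3)
  # All-zero trace-id or parent-id values are invalid under the spec.
  case "$trace_id" in *[!0]*) ;; *) return 1 ;; esac
  case "$parent_id" in *[!0]*) ;; *) return 1 ;; esac
  printf '%s\n' "$trace_id"
}

# Succeed only if both header values are valid and carry the same trace-id.
same_trace() {
  first=$(trace_id_from_traceparent "$1") || return 1
  second=$(trace_id_from_traceparent "$2") || return 1
  [ "$first" = "$second" ]
}
```

For example, same_trace "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01" "00-4bf92f3577b34da6a3ce929d0e0e4736-b7ad6b7169203331-01" succeeds: the parent-id changes at each hop while the trace-id stays fixed.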
Manual staging verification
1. Pick a test merchant that is safe to transact against in staging. Record its merchant ID.
2. Fire a test checkout session against the staging online-txn service:
   CHECKOUT_URL="https://checkout.staging.peakgateway.co/api/v1/checkout-sessions"
   curl -s -X POST "$CHECKOUT_URL" \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer $STAGING_TOKEN" \
     -d '{"merchantId":"<merchant-id>","lineItems":[{"priceId":"price_test","quantity":1}]}' \
     | jq '.'
3. Note the X-Cloud-Trace-Context response header if present, or take the most recent trace from the Cloud Trace UI filtered by service.name=online-txn-service AND http.response.status_code=200.
4. Open the trace in the GCP Cloud Trace UI. The waterfall should show at least three spans in order:
   - POST /api/v1/checkout-sessions on online-txn-service
   - POST /api/v1/transactions/sale on processing-service
   - An outbound TransIT call (no span name in the kernel, but it appears as an OkHttp child of the processing span)
5. Confirm the three spans share the same trace-id in the URL.
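The X-Cloud-Trace-Context value mentioned above has the form TRACE_ID/SPAN_ID;o=OPTIONS, and only the TRACE_ID part is needed to look the trace up. The following is a minimal sketch for pulling it out of a saved header dump. It assumes the response headers were written to a file with curl's -D option (the checkout command as written pipes only the body to jq, so headers are not otherwise visible). The helper names and the headers.txt file name are hypothetical.

```shell
# Sketch only: helper names are hypothetical, not taken from the codebase.

# Print the trace-id from an X-Cloud-Trace-Context value (TRACE_ID/SPAN_ID;o=OPTIONS).
trace_id_from_cloud_trace_context() {
  value=$1
  trace_id=${value%%[/;]*}   # keep the text before the first "/" or ";"
  printf '%s\n' "$trace_id" | grep -Eq '^[0-9a-fA-F]{32}$' || return 1
  printf '%s\n' "$trace_id"
}

# Print the trace-id found in a header dump written by `curl -D <file>`.
trace_id_from_header_dump() {
  header_file=$1
  # Header names are case-insensitive; strip the name, spaces, and the trailing CR.
  value=$(grep -i '^x-cloud-trace-context:' "$header_file" \
    | head -n 1 | cut -d: -f2- | tr -d ' \r')
  [ -n "$value" ] || return 1
  trace_id_from_cloud_trace_context "$value"
}

# Example (not executed here): add `-D headers.txt` to the checkout curl call, then
#   trace_id_from_header_dump headers.txt
```

If the header is absent the second helper returns a non-zero status, which is the cue to fall back to the Cloud Trace UI filter.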
When a gap appears
If step 4 shows only the top-level span and no children:
- Check that the Cloud Trace filter is not set to parent:none.
- Run
  gcloud logging read 'resource.type="cloud_run_revision" AND trace="projects/<project-id>/traces/<trace-id>"' --limit=50
  and confirm log lines from both services carry the same trace value.
- If the child service's logs carry a different trace-id, the outbound interceptor is the suspect. Run the automated test suite to confirm TraceContextClientHttpRequestInterceptor is still registered and that TraceContextHeaders.currentHeaders() returns a non-empty map at that hop.
- If the child service's logs carry no trace-id, RequestMdcFilter is not processing the inbound header. Confirm the filter chain order in services/*/src/main/kotlin/.../config/SecurityConfig.kt still registers requestMdcFilter via addFilterBefore.
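The gcloud logging read check above can be wrapped so that the output shows at a glance which services logged under the trace. This is a sketch, not a supported tool: the helper names are hypothetical, and it assumes the Cloud Run log entries expose the service name under resource.labels.service_name and that jq is installed, as listed in the prerequisites.

```shell
# Sketch only: helper names are hypothetical, not taken from the codebase.

# Build the Cloud Logging filter for one trace in one project.
trace_log_filter() {
  project_id=$1
  trace_id=$2
  printf 'resource.type="cloud_run_revision" AND trace="projects/%s/traces/%s"' \
    "$project_id" "$trace_id"
}

# Count log lines per Cloud Run service for the given trace.
# Assumption: each entry carries resource.labels.service_name.
services_logging_trace() {
  gcloud logging read "$(trace_log_filter "$1" "$2")" --limit=50 --format=json \
    | jq -r '.[].resource.labels.service_name' \
    | sort | uniq -c
}

# Example (not executed here):
#   services_logging_trace <project-id> <trace-id>
```

Because the filter already pins the trace value, a healthy chain lists both services in the output. If only the first service appears, the child service is logging under a different trace-id or none at all, which maps onto the last two checks above.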
Cloud Trace sampling
Soft launch runs at 100% sampling (GCP_TRACE_SAMPLING_RATIO=1.0, set in
services/*/src/main/resources/application.yml), so every transaction
produces a trace. This is intentional: at early traffic levels we cannot
afford to miss an incident because the one relevant request was sampled out.
Dial this back (e.g. to 0.10) once sustained production RPS climbs past
~50 and Cloud Trace cost signals justify the drop. The per-service
@Value fallback in SecurityConfig.kt / WebFilterConfig.kt mirrors
the YAML default so local runs behave the same as staging. Tracked as
OBS-09 in the launch-readiness spec (#441).
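As a rough guide to what the ratio means in volume terms, the sketch below estimates sampled traces per day. It assumes one root trace per request and a constant request rate, which real traffic will not match exactly. The helper name is hypothetical.

```shell
# Sketch only: the helper name is hypothetical, and the estimate assumes
# one root trace per request at a constant request rate.

# Usage: sampled_traces_per_day <requests-per-second> <sampling-ratio>
sampled_traces_per_day() {
  awk -v rps="$1" -v ratio="$2" \
    'BEGIN { printf "%.0f\n", rps * ratio * 86400 }'
}

# Example (not executed here):
#   sampled_traces_per_day 50 1.0    # prints 4320000
#   sampled_traces_per_day 50 0.10   # prints 432000
```

At the ~50 RPS threshold, dropping from 1.0 to 0.10 cuts the estimate from about 4.3 million traces per day to about 432 thousand.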
Why this exists
A distributed trace that stops at the service boundary makes incident
response blind past the first hop. Before OBS-01 landed, outbound
RestTemplate calls in InternalServiceClient and OkHttp calls in
TransitClient carried no trace headers, so every downstream service
started a fresh root span. The fix (TraceContextClientHttpRequestInterceptor
and TransitTraceInterceptor) copies the current span's W3C traceparent
onto every outbound request. This runbook is the human check that validates
the machine check.