feat(aws): cross-invocation tracecontext propagation #8182
Benchmarks
Benchmark execution time: 2026-05-20 19:53:38
Comparing candidate commit 9a25134 in PR branch
Found 0 performance improvements and 0 performance regressions! Performance is the same for 1491 metrics, 102 unstable metrics.
Overall package size
Self size: 5.86 MB
Dependency sizes
| name | version | self size | total size |
|------|---------|-----------|------------|
| import-in-the-middle | 3.0.1 | 82.56 kB | 817.39 kB |
| opentracing | 0.14.7 | 194.81 kB | 194.81 kB |
| dc-polyfill | 0.1.11 | 25.74 kB | 25.74 kB |
🤖 This report was automatically generated by heaviest-objects-in-the-universe
🎉 All green!
🧪 All tests passed
🔗 Commit SHA: 9a25134
Codecov Report❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## joey/apm-ai-toolkit/aws-durable-execution-sdk-js #8182 +/- ##
====================================================================================
- Coverage 89.79% 89.76% -0.03%
====================================================================================
Files 852 853 +1
Lines 45911 46020 +109
Branches 8534 8533 -1
====================================================================================
+ Hits 41225 41310 +85
- Misses 4686 4710 +24
…s-invocation continuity
Persist the current trace context as a synthetic `_datadog_{N}` STEP operation
when the SDK suspends to PENDING, so subsequent invocations (read by the
upstream datadog-lambda-js wrapper) can resume the same trace.
Files:
- src/handler.js: install a hook on the SDK's terminationManager.terminate
inside bindStart. Save fires only for resumable reasons (PENDING_TERMINATION_REASONS
allow-list mirrors the SDK's TerminationReason enum entries that result in
Status: PENDING). Gated by DD_DURABLE_CROSS_INVOCATION_TRACING_ENABLED
(default on; opt out with 'false'/'0').
- src/trace-checkpoint.js: NEW. Datadog-only header inject (private
TextMapPropagator with tracePropagationStyle.inject = ['datadog'], shadows
the live tracer config), dedup against prior _datadog_N op via
JSON.stringify-after-stripping-x-datadog-parent-id, deterministic blake2b
stepId so the save is idempotent within an execution.
- test/handler.checkpoint.spec.js: unit tests for the termination hook
(pending vs non-pending reasons, env-var gate, idempotency, default reason).
- test/trace-checkpoint.spec.js: unit tests for the save module
(queue START+SUCCEED before terminating, dedup on parent-id-only changes).
- test/index.spec.js: integration coverage for SDK safe-paths
(single cycle, child-context, step-suspend-step).
- packages/dd-trace/src/config/supported-configurations.json and
generated-config-types.d.ts: register DD_DURABLE_CROSS_INVOCATION_TRACING_ENABLED.
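The three mechanisms named in the file list above (the env-var gate, the dedup that ignores `x-datadog-parent-id`, and the deterministic blake2b step id) can be sketched as follows. This is a minimal illustration, not the code in `src/handler.js` or `src/trace-checkpoint.js`: the function names, the `executionId` argument, the hash input string, and the 32-character truncation are all assumptions made for the example.

```javascript
'use strict'

const { createHash } = require('node:crypto')

// Hypothetical helper: the feature is on unless the variable is
// explicitly set to 'false' or '0' (the opt-out values named above).
function isCrossInvocationTracingEnabled (env = process.env) {
  const raw = env.DD_DURABLE_CROSS_INVOCATION_TRACING_ENABLED
  if (raw === undefined) return true
  const value = String(raw).trim().toLowerCase()
  return value !== 'false' && value !== '0'
}

// Hypothetical helper: two injected header maps count as "the same
// context" when they differ only in x-datadog-parent-id.
// Assumes both maps come from the same injector, so key order is stable
// and JSON.stringify yields comparable strings.
function sameTraceContext (prevHeaders, nextHeaders) {
  const strip = (headers) => {
    const { 'x-datadog-parent-id': _ignored, ...rest } = headers
    return JSON.stringify(rest)
  }
  return strip(prevHeaders) === strip(nextHeaders)
}

// Hypothetical helper: derive a step id from stable inputs so that a
// repeated save for the same checkpoint index maps to the same id.
// 'blake2b512' is the BLAKE2b digest exposed by OpenSSL-backed Node builds.
function checkpointStepId (executionId, index) {
  return createHash('blake2b512')
    .update(`${executionId}:_datadog_${index}`)
    .digest('hex')
    .slice(0, 32)
}
```

Because the id depends only on its inputs, calling `checkpointStepId('exec-1', 0)` twice yields the same string, which is what makes a repeated save within one execution idempotent in this sketch.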
…merScheduler bug
Skip wait_for_callback (happy path) and the entire invoke describe block (happy + error). All three fail deterministically in CI under @aws/durable-execution-sdk-js-testing's current TimerScheduler, whose hasScheduledFunction() undercounts in-flight scheduled functions and trips the test orchestrator's "Cannot return PENDING status with no pending operations." validation. Production (real AWS backend) is not affected: the validation is mock-only. Fix is open upstream as aws/aws-durable-execution-sdk-js#544; re-enable these tests once a release containing it is pinned in packages/dd-trace/test/plugins/versions/package.json.
…led guard
The guard was defensive against a "same terminationManager passed to bindStart twice" scenario that cannot happen in the SDK as it stands: each Lambda invocation calls initializeExecutionContext, which constructs a fresh `new TerminationManager()`, so warm starts share the wrapper closure but not the terminationManager instance. Removing the Symbol + the guard + the explicit "twice across invocations" unit test that only covered a contrived re-entry.
Drive-by: fix four pre-existing space-before-function-paren lint errors in the same file.
…cute span, not its parent
Drop the `getParentSpanId` helper and inline the read directly during
`state` initialization. While inlining, switch the anchor from the
execute span's *parent* (typically `aws.lambda`'s id) to the execute
span's *own* id (`span.context().toSpanId()`).
Why anchor at the execute span:
- It's a span this integration owns and just created, so always defined
and never depends on what upstream context happened to be active when
`bindStart` fired.
- Topology becomes "resumed invocations are continuations of the first
execute" — matching the user-facing model of a single durable
execution. The old shape made resumes look like sibling Lambda
invocations under whatever upstream span happened to be there.
- In the no-upstream case the old code already fell through to the
propagator default (= execute span's own id) via `if (parentId)` —
so this just makes the behavior consistent across environments.
Rename for clarity:
- `saveTraceContextCheckpointIfUpdated`'s `checkpointAnchorSpanId`
parameter -> `firstExecutionSpanId`. JSDoc spells out it's only
consulted on the very first save; once a prior `_datadog_{N}` exists,
the function reuses that checkpoint's `x-datadog-parent-id` verbatim.
- The local `latestParentId` (the value carried forward across saves)
-> `anchoredSpanId`, reflecting that it IS the anchor we've been
using since the first save.
- handler.js's `state.parentSpanId` -> `state.firstExecutionSpanId`.
Note: dd-trace-py's `_resolve_override_parent_id` currently anchors at
the execute span's parent (matching the old JS behavior). A follow-up
should bring Python in line with this change so both languages produce
the same trace shape.
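The anchoring rule described in this commit (use `firstExecutionSpanId` only on the very first save, otherwise reuse the prior checkpoint's `x-datadog-parent-id` verbatim) can be sketched as below. The operation shape `{ name, headers }` and both function names are assumptions for illustration; the real module reads prior operations through the SDK and may store the headers differently.

```javascript
'use strict'

// Hypothetical shape: each operation is { name, headers }, where headers
// is the injected Datadog header map stored with a _datadog_{N} step.
function findLatestCheckpoint (operations) {
  let latest = null
  for (const op of operations) {
    const match = /^_datadog_(\d+)$/.exec(op.name ?? '')
    if (!match) continue // not one of our synthetic steps
    const index = Number(match[1])
    if (latest === null || index > latest.index) {
      latest = { index, headers: op.headers }
    }
  }
  return latest
}

// On the first save there is no prior checkpoint, so the execute span's
// own id becomes the anchor. On later saves the stored anchor is reused,
// which keeps every resumed invocation attached to the first execute span.
function resolveAnchorSpanId (operations, firstExecutionSpanId) {
  const latest = findLatestCheckpoint(operations)
  return latest?.headers?.['x-datadog-parent-id'] ?? firstExecutionSpanId
}
```

With no `_datadog_{N}` operations present, `resolveAnchorSpanId([], '111')` falls back to `'111'`; once a checkpoint exists, its stored parent id wins regardless of the id passed in.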
…ainst TimerScheduler bug"
This reverts commit 8baa8ce.
…gainst TimerScheduler bug"
This reverts commit 748a826.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7f6e5f5fe4
Summary
Adds cross-invocation trace-context continuity for the @aws/durable-execution-sdk-js integration. Each invocation of a durable execution now writes a _datadog_{N} checkpoint on suspend whenever the trace context has changed, so subsequent invocations of the same execution can extract the trace context from the latest checkpoint and attach to a common anchor span.
NOTE: The extraction side of these checkpoints is implemented in DataDog/datadog-lambda-js#774.
Motivation
A durable execution is a logically single workflow that the SDK transparently runs across N Lambda invocations (suspending on ctx.wait, ctx.waitForCallback, ctx.invoke, retries, etc.).
Before this PR, dd-trace produced one isolated trace per physical invocation. Customers couldn't see the workflow end-to-end in APM.
This PR makes those invocations show up under a single anchor, while leaving the per-invocation aws.durable.execute spans intact.
Without this, every resume of a suspended durable execution starts a fresh, unconnected trace.
Changes
New module: packages/datadog-plugin-aws-durable-execution-sdk-js/src/trace-checkpoint.js
operation via the SDK's checkpoint manager.
anchor verbatim.
_installTerminationCheckpointHook will: checkpointManager.checkpoint(stepId, START) + checkpoint(stepId, SUCCEED) calls.
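A termination hook of the kind described above, which queues the START and SUCCEED checkpoints before the original `terminate` runs and only for resumable reasons, might look like the sketch below. The reason strings are placeholders rather than real `TerminationReason` entries, the `checkpoint(stepId, action)` argument shape is an assumption, and swallowing save errors is a design choice made for the example, not something the PR states.

```javascript
'use strict'

// Placeholder allow-list. The real one mirrors the SDK's TerminationReason
// entries that result in Status: PENDING; those names are not shown here.
const PENDING_REASONS = new Set(['EXAMPLE_PENDING_REASON'])

// Hypothetical installer: wraps terminate() on a single terminationManager
// instance so the trace-context step is queued before termination proceeds.
function installTerminationHook (terminationManager, checkpointManager, stepId) {
  const originalTerminate = terminationManager.terminate

  terminationManager.terminate = function (options) {
    if (PENDING_REASONS.has(options?.reason)) {
      try {
        // Queue the synthetic step as a START followed by a SUCCEED.
        checkpointManager.checkpoint(stepId, 'START')
        checkpointManager.checkpoint(stepId, 'SUCCEED')
      } catch (err) {
        // Assumption: a tracing failure should not block termination.
      }
    }
    // Always delegate to the SDK's own terminate with the original arguments.
    return originalTerminate.apply(this, arguments)
  }
}
```

In this sketch a non-pending reason passes straight through to the original `terminate`, so completed or failed executions never get a synthetic step.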
Tests
Why some tests are skipped
All three tests race against a TimerScheduler bug in @aws/durable-execution-sdk-js that is fixed upstream in aws/aws-durable-execution-sdk-js#544. Once that fix is published in a release we pin against, the skips will be removed.
The race only manifests at the suspend → resume boundary when resume is driven externally (by sendCallbackSuccess() for wait_for_callback, or by the target function completing for chained invoke()). Timer-driven resumes (ctx.wait, ctx.waitForCondition, etc.) take a single, ordered code path through TimerScheduler and are unaffected.
This PR adds the cross-invocation trace-context checkpoint hook. On every PENDING termination, the hook calls back into the SDK via checkpointManager.checkpoint(stepId, START) + checkpoint(stepId, SUCCEED). That extra async work overlaps with the SDK's own pending-state bookkeeping at the same boundary where TimerScheduler is coordinating drain, which is precisely the state machine #544 cleans up.
Why production is unaffected
The TimerScheduler code path involved is only used by @aws/durable-execution-sdk-js-testing's LocalDurableTestRunner (simulated clock + in-process callback resolution). Real Lambda invocations don't drive resume through TimerScheduler: the resume of a suspended execution is a fresh invocation initiated by the Durable Execution service, not a continuation inside the same process. The checkpoint writes themselves complete normally; what races is the test harness's observation of the resumed invocation.
Other Notes
REPLAY -> NEW transitions
In dd-trace-py PR-17773, the mark_trace_context_checkpoints_visited method was added to address a glitch caused by the _datadog_{N} steps we added. That issue doesn't exist for Node.js.
Python SDK pattern:
visited set.
Node.js SDK pattern (durable-context.ts:209):
So our synthetic ops are invisible to the SDK's replay-mode bookkeeping — they can't keep it stuck in REPLAY.