Summary
Datadog APM error spans tagged @error.type:"In-Progress X-Ray Segment" are overwhelmingly false positives for SQS- (and other async-) triggered Lambda functions. They originate from an AWS-side X-Ray quirk in which the AWS::Lambda wrapper segment is never closed by AWS Lambda's control plane, even though the function itself completes successfully. Filtering these out at the extension/integration layer would significantly improve signal-to-noise for any team monitoring async Lambdas through Datadog APM.
Root cause (AWS-side, not Datadog-side)
For each Lambda invocation with tracing_config.mode = Active, X-Ray emits two segments:
- AWS::Lambda::Function — the function execution; closed by the Lambda runtime when the handler returns.
- AWS::Lambda — the service-level wrapper segment, emitted by Lambda's control plane.
For SQS / SNS / EventBridge / S3-triggered Lambdas, AWS's internal pipeline frequently fails to close the wrapper segment, leaving it in_progress: true indefinitely. The function segment closes normally.
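Concretely, the two segment documents have roughly this shape (abbreviated to the relevant fields; the function name and timestamps below are placeholders, not taken from a specific trace):

```json
[
  {
    "name": "my-function",
    "origin": "AWS::Lambda::Function",
    "start_time": 1700000000.00,
    "end_time": 1700000001.59
  },
  {
    "name": "my-function",
    "origin": "AWS::Lambda",
    "start_time": 1700000000.00,
    "in_progress": true
  }
]
```

The closed segment carries an end_time; the stuck wrapper instead carries in_progress: true and no end_time, which is exactly what the X-Ray-to-Datadog pipeline later renders as an error span.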
Verification (via aws xray batch-get-traces on real production trace IDs):
| segment origin | in_progress | has_end_time | duration |
|---|---|---|---|
| AWS::Lambda::Function | false | true | 1.59 s |
| AWS::Lambda | true | false | null |
The function's CloudWatch logs also confirm the invocation completed normally (START → END → REPORT trio, with billed duration well under the configured timeout).
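The same check can be scripted; here is a minimal sketch using aws-sdk-go-v2 (the trace ID is a placeholder, and the struct only captures the standard segment-document fields used above):

```go
// Fetch a trace and print each segment's origin and completion state,
// to spot AWS::Lambda wrappers that never closed.
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/xray"
)

// segmentDoc captures the X-Ray segment-document fields relevant here.
type segmentDoc struct {
	Origin     string   `json:"origin"`
	InProgress bool     `json:"in_progress"`
	EndTime    *float64 `json:"end_time"` // nil while the segment is open
}

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := xray.NewFromConfig(cfg)

	out, err := client.BatchGetTraces(ctx, &xray.BatchGetTracesInput{
		TraceIds: []string{"1-00000000-000000000000000000000000"}, // placeholder
	})
	if err != nil {
		log.Fatal(err)
	}

	for _, trace := range out.Traces {
		for _, seg := range trace.Segments {
			if seg.Document == nil {
				continue
			}
			var doc segmentDoc
			if err := json.Unmarshal([]byte(*seg.Document), &doc); err != nil {
				continue // skip documents we cannot parse
			}
			fmt.Printf("%-24s in_progress=%-5v has_end_time=%v\n",
				doc.Origin, doc.InProgress, doc.EndTime != nil)
		}
	}
}
```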
A note on existing public references
I haven't found an AWS-side GitHub issue that specifically describes the wrapper-segment-fails-to-close symptom. The publicly discussed adjacent issues are about trace context propagation across async boundaries — e.g. aws-xray-sdk-node#208 and aws-xray-sdk-go#218 (the latter resolved in 2023 with SNS/SQS active tracing) — which are related but mechanically different (continuity vs. closure). AWS's X-Ray troubleshooting docs acknowledge the broader category of "incomplete traces" but don't single out this specific symptom. The trace-data evidence and span-attribute comparison below are therefore the strongest signal.
Why Datadog can filter this
These error spans reach Datadog via the AWS X-Ray integration (out-of-process polling), not via the Datadog Lambda Extension. They have distinctive tags that make them easy to identify:
- service: aws.lambda.<function-name>
- @error.type: "In-Progress X-Ray Segment"
- @error.stack: "This X-Ray segment is still in-progress... likely terminated due to a timeout or out-of-memory error..."
- ingestion_reason: xray
- trace_type: xray
Compare this to spans emitted by the Datadog Lambda Extension, which carry ingestion_reason: lambda and rich Lambda metadata (runtime-id, dd_extension_version, init_type, language, go_path). The two pipelines are cleanly distinguishable, so a filter targeting ingestion_reason: xray will never accidentally drop dd-trace-go spans.
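In the meantime, teams hit by this can exclude the artifact spans themselves with a query along these lines (same tag syntax as the retention-filter query in option A below; exact form depends on whether it is applied as a trace-search exclusion, monitor filter, or retention filter):

```
ingestion_reason:xray @error.type:"In-Progress X-Ray Segment"
```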
Production impact (single team's measurement)
Over a 15-day window on one SQS-triggered production Lambda:
- 336 Datadog error spans with @error.type:"In-Progress X-Ray Segment"
- 19 real Lambda errors per CloudWatch aws.lambda.errors
- ~95% false-positive ratio in the APM error tracker
This corrupts error-tracking dashboards, monitors, and on-call paging. Multiple engineers across the org have been burned chasing these as real Lambda timeouts (the canned "timeout/OOM" stack text is misleading — the function actually completed cleanly in 1-2 s).
Requested fix
In rough order of preference:
A. Default-off Retention Filter / ingestion drop rule shipped with the Datadog AWS Lambda integration, matching:
service:aws.lambda.* @error.type:"In-Progress X-Ray Segment"
With clear documentation that customers can opt in.
B. Ingestion-time correlation drop: when ingesting an in_progress wrapper segment (origin: AWS::Lambda), check whether the corresponding AWS::Lambda::Function child segment has closed normally. If it has, drop the wrapper from APM ingestion as a known AWS artefact (see the sketch after this list).
C. Documentation in Datadog Lambda Troubleshooting explicitly addressing this error type and recommending the manual exclusion filter, with the rationale that it's an AWS-side artefact, not a real failure.
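To make option B concrete, here is a sketch of the intended correlation logic. The types and function names are hypothetical (this is not Datadog's actual ingestion code), and it assumes a trace's segments are visible in one batch:

```go
// Sketch of the option-B correlation drop: discard an in-progress
// AWS::Lambda wrapper segment when its AWS::Lambda::Function
// counterpart in the same trace closed without error.
package ingest

// Segment is a hypothetical, minimal view of an ingested X-Ray segment.
type Segment struct {
	TraceID    string
	Origin     string // e.g. "AWS::Lambda", "AWS::Lambda::Function"
	InProgress bool
	Error      bool
}

// DropKnownArtifacts filters out wrapper segments that only look like
// errors because AWS never closed them.
func DropKnownArtifacts(segs []Segment) []Segment {
	// Index traces whose function segment completed without error.
	functionClosed := make(map[string]bool)
	for _, s := range segs {
		if s.Origin == "AWS::Lambda::Function" && !s.InProgress && !s.Error {
			functionClosed[s.TraceID] = true
		}
	}

	kept := segs[:0]
	for _, s := range segs {
		isArtifact := s.Origin == "AWS::Lambda" && s.InProgress && functionClosed[s.TraceID]
		if !isArtifact {
			kept = append(kept, s)
		}
	}
	return kept
}
```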
Happy to provide additional trace IDs, X-Ray segment JSON dumps, and CloudWatch correlation data if useful.