Skip to content

Feature request: filter "In-Progress X-Ray Segment" error spans from async-triggered Lambda traces (AWS-side quirk, ~95% false-positive ratio) #1229

@raghav-221B

Description

@raghav-221B

Summary

Datadog APM error spans tagged @error.type:"In-Progress X-Ray Segment" are dominantly false positives for SQS- (and other async-) triggered Lambda functions. They originate from an AWS-side X-Ray quirk where the AWS::Lambda wrapper segment fails to be closed by AWS Lambda's control plane, despite the function itself completing successfully. Filtering these out at the extension/integration layer would significantly improve signal-to-noise for any team monitoring async Lambdas through Datadog APM.

Root cause (AWS-side, not Datadog-side)

For each Lambda invocation with tracing_config.mode = Active, X-Ray emits two segments:

  1. AWS::Lambda::Function — function execution; closed by the Lambda runtime when the handler returns.
  2. AWS::Lambda — service-level wrapper segment, emitted by Lambda's control plane.

For SQS / SNS / EventBridge / S3-triggered Lambdas, AWS's internal pipeline frequently fails to close the wrapper segment, leaving it in_progress: true indefinitely. The function segment closes normally.

Verification (via aws xray batch-get-traces on real production trace IDs):

AWS::Lambda::Function   in_progress: false   has_end_time: true   duration: 1.59s
AWS::Lambda             in_progress: true    has_end_time: false  duration: null

The function's CloudWatch REPORT log also confirms the invocation completed normally (STARTREPORTEND trio with billed duration well under the configured timeout).

A note on existing public references

I haven't found an AWS-side GitHub issue that specifically describes the wrapper-segment-fails-to-close symptom. The publicly-discussed adjacent issues are about trace context propagation across async boundaries — e.g. aws-xray-sdk-node#208, aws-xray-sdk-go#218 (the latter resolved in 2023 with SNS/SQS active tracing) — which are related but mechanically different (continuity vs. closure). AWS's X-Ray troubleshooting docs acknowledge the broader category of "incomplete traces" but don't single out this specific symptom. The trace-data evidence and span-attribute comparison below are therefore the strongest signal.

Why Datadog can filter this

These error spans reach Datadog via the AWS X-Ray integration (out-of-process polling), not via the Datadog Lambda Extension. They have distinctive tags that make them easy to identify:

service: aws.lambda.<function-name>
@error.type: "In-Progress X-Ray Segment"
@error.stack: "This X-Ray segment is still in-progress... likely terminated due
              to a timeout or out-of-memory error..."
ingestion_reason: xray
trace_type: xray

Compare to Datadog-Extension-emitted spans which have ingestion_reason: lambda and rich Lambda metadata (runtime-id, dd_extension_version, init_type, language, go_path). The two pipelines are cleanly distinguishable, so a filter that targets ingestion_reason: xray will never accidentally drop dd-trace-go spans.

Production impact (single team's measurement)

Over a 15-day window on one SQS-triggered production Lambda:

  • 336 Datadog error spans with @error.type:"In-Progress X-Ray Segment"
  • 19 real Lambda errors per CloudWatch aws.lambda.errors
  • ~95% false-positive ratio in the APM error tracker

This corrupts error tracking dashboards, monitors, and on-call paging. Multiple engineers across the org have been burned chasing these as real Lambda timeouts (the canned "timeout/OOM" stack text is misleading — function actually completed cleanly in 1-2 s).

Requested fix

In rough order of preference:

A. Default-off Retention Filter / ingestion drop rule shipped with the Datadog AWS Lambda integration, matching:

service:aws.lambda.* @error.type:"In-Progress X-Ray Segment"

With clear documentation that customers can opt in.

B. Ingestion-time correlation drop: when ingesting an in_progress wrapper segment (origin: AWS::Lambda), check if the corresponding AWS::Lambda::Function child segment has closed normally. If yes, drop the wrapper from APM ingestion as a known-AWS-artefact.

C. Documentation in Datadog Lambda Troubleshooting explicitly addressing this error type and recommending the manual exclusion filter, with the rationale that it's an AWS-side artefact, not a real failure.

Happy to provide additional trace IDs, X-Ray segment JSON dumps, and CloudWatch correlation data if useful.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions