fix(debugger): retry agent /info on 5xx by tylfin · Pull Request #18124 · DataDog/dd-trace-py

tylfin · 2026-05-17T13:07:05Z

Description

Adds a private _info_with_retry helper in ddtrace/internal/agent.py that wraps the /info HTTP call with the same fibonacci backoff already used by the Remote Config client. AgentCheckPeriodicService._agent_check now calls the retrying helper directly, so a single transient 5xx from the local trace agent no longer flips subclasses (most visibly the Dynamic Instrumentation signal uploader) back to _agent_check state until the next periodic interval. Snapshots/diagnostics keep flushing and the instance stops looking "not heartbeating" in the UI.
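For readers unfamiliar with the pattern, here is a minimal sketch of retrying on 5xx with a fibonacci backoff. It is not the code from this PR: `info_with_retry_sketch`, `fibonacci_delays`, the `fetch` callable, and the default of 3 attempts with a 0.1 s initial wait are all hypothetical names and values chosen for illustration.

```python
import time
from typing import Callable, Optional


def fibonacci_delays(initial: float, count: int):
    """Yield `count` delays following the Fibonacci sequence scaled by `initial`."""
    a, b = initial, initial
    for _ in range(count):
        yield a
        a, b = b, a + b


def info_with_retry_sketch(
    fetch: Callable[[], tuple],
    attempts: int = 3,            # assumed value, not taken from the PR diff
    initial_wait: float = 0.1,    # assumed value, not taken from the PR diff
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Call `fetch` up to `attempts` times, retrying only on 5xx responses.

    `fetch` is a hypothetical callable returning (status_code, payload).
    """
    delays = fibonacci_delays(initial_wait, attempts - 1)
    for attempt in range(attempts):
        status, payload = fetch()
        if 200 <= status < 300:
            return payload
        if status < 500:
            return None  # non-5xx failures are not retried in this sketch
        if attempt < attempts - 1:
            sleep(next(delays))  # back off before the next attempt
    return None  # persistent 5xx: give up, mirroring the "returns None" contract
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a caller that wants single-attempt behavior simply does not use the wrapper.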

The public agent.info() helper is intentionally left unchanged (single attempt, returns None and logs a warning on non-2xx). It is shared with Remote Config polling, CI Visibility, LLM Obs and the LLM Obs writer, which all have their own scheduling and timing expectations; adding implicit retries there would change those callers' behavior.

Context: observed on a single staging instance whose local agent intermittently returned 500 on /info. RC polling itself was healthy end-to-end; the DI uploader on that host was the blast radius.

Testing

tests/tracer/test_agent.py::test_info_does_not_retry_on_5xx pins that info() keeps its single-attempt semantics.
tests/tracer/test_agent.py::test_agent_check_retries_info_on_5xx exercises the retry through AgentCheckPeriodicService._agent_check (3 total attempts before giving up, info_check receives None).
The existing test_info parametrization (including the 500 case) still passes, confirming the contract that a persistent 5xx returns None.
Both new tests verified locally on py3.10.
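The two behaviors pinned by the new tests can be illustrated with a self-contained sketch. The helpers below (`single_attempt`, `with_retry`) are hypothetical stand-ins, not the ddtrace functions, and the tests only mirror the shape of the assertions described above: one call for the single-attempt path, three calls before giving up on the retrying path.

```python
from unittest import mock


def single_attempt(fetch):
    """Stand-in for a single-attempt helper: one call, None on non-2xx."""
    status, payload = fetch()
    return payload if 200 <= status < 300 else None


def with_retry(fetch, attempts=3, sleep=lambda seconds: None):
    """Stand-in for a retrying helper: retry on 5xx, at most `attempts` calls."""
    for i in range(attempts):
        status, payload = fetch()
        if 200 <= status < 300:
            return payload
        if status < 500:
            return None
        if i < attempts - 1:
            sleep(0)  # real code would back off here; the sketch does not wait
    return None


def test_single_attempt_does_not_retry_on_5xx():
    fetch = mock.Mock(return_value=(500, None))
    assert single_attempt(fetch) is None
    assert fetch.call_count == 1


def test_retry_gives_up_after_three_attempts():
    fetch = mock.Mock(return_value=(500, None))
    assert with_retry(fetch) is None
    assert fetch.call_count == 3


def test_retry_recovers_from_one_transient_5xx():
    fetch = mock.Mock(side_effect=[(500, None), (200, {"ok": True})])
    assert with_retry(fetch) == {"ok": True}
    assert fetch.call_count == 2
```

The third test is not listed in the PR description; it is included here only to show the recovery case that motivates the change.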

Risks

Low. The worst case is up to ~0.5s of added latency on the periodic agent health check when the agent persistently returns 5xx. Other callers of agent.info() (Remote Config, CI Visibility, LLM Obs) are unaffected because their code path is untouched.
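As a rough illustration of where a bound like this comes from, the added latency is the sum of the sleeps between attempts. The sketch below assumes a plain fibonacci schedule with no jitter; the 0.25 s initial wait is a hypothetical value picked so that three attempts add up to the quoted figure, since the description does not state the real parameters. Time spent in the requests themselves is not counted.

```python
def worst_case_backoff(attempts: int, initial_wait: float) -> float:
    """Total sleep time if every attempt fails, for a fibonacci schedule.

    There are `attempts - 1` sleeps; the delays are initial_wait * (1, 1, 2, 3, ...).
    """
    a, b = initial_wait, initial_wait
    total = 0.0
    for _ in range(attempts - 1):
        total += a
        a, b = b, a + b
    return total


# Three attempts means two sleeps: 0.25 + 0.25 (hypothetical initial wait).
print(worst_case_backoff(attempts=3, initial_wait=0.25))  # 0.5
```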

Additional Notes

None.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9f02b954be


cit-pr-commenter-54b7da · 2026-05-17T13:10:13Z

Codeowners resolved as

ddtrace/internal/agent.py                                               @DataDog/apm-core-python
releasenotes/notes/agent-info-retry-0a8acdda8d13a505.yaml               @DataDog/apm-python
tests/tracer/test_agent.py                                              @DataDog/apm-sdk-capabilities-python @DataDog/apm-core-python

pr-commenter · 2026-05-17T13:36:23Z

Benchmarks

Benchmark execution time: 2026-05-19 18:18:52

Comparing candidate commit db4e231 in PR branch tyler.finethy/agent-info-resilience with baseline commit 2166252 in branch main.

Found 0 performance improvements and 4 performance regressions! Performance is the same for 578 metrics, 10 unstable metrics.

scenario:iastaspects-lstrip_aspect

🟥 execution_time [+55.977µs; +61.418µs] or [+21.118%; +23.171%]

scenario:iastaspectsospath-ospathbasename_aspect

🟥 execution_time [+106.723µs; +113.994µs] or [+26.351%; +28.146%]

scenario:span-start

🟥 execution_time [+1.417ms; +1.574ms] or [+9.117%; +10.125%]

scenario:telemetryaddmetric-1-count-metric-1-times

🟥 execution_time [+267.777ns; +309.471ns] or [+12.999%; +15.023%]

Add a private _info_with_retry helper used by AgentCheckPeriodicService that retries the agent /info call on transient 5xx responses with a fibonacci backoff. Subsystems extending AgentCheckPeriodicService (e.g. the Dynamic Instrumentation signal uploader) no longer stay offline until the next periodic interval after a single transient 5xx. The public agent.info() helper is unchanged and continues to issue a single request, so existing callers retain their original semantics.

datadog-prod-us1-4 · 2026-05-19T18:12:37Z

Tests

🎉 All green!

🧪 All tests passed
❄️ No new flaky tests detected

This comment will be updated automatically if new data arrives.

🔗 Commit SHA: db4e231

tylfin · 2026-05-21T20:38:11Z

cc @P403n1x87 did we want to move forward with this or were you going to improve the resiliency alongside the UDS timeout?

tylfin requested review from a team as code owners May 17, 2026 13:07

tylfin requested review from emmettbutler and mabdinur May 17, 2026 13:07

chatgpt-codex-connector Bot reviewed May 17, 2026


Comment thread ddtrace/internal/agent.py

tylfin changed the title ~~fix(tracing): retry agent /info on 5xx~~ fix(debugger): retry agent /info on 5xx May 19, 2026

tylfin requested a review from P403n1x87 May 19, 2026 13:40

tylfin force-pushed the tyler.finethy/agent-info-resilience branch from 3f7902d to 1062343 Compare May 19, 2026 17:43

tylfin force-pushed the tyler.finethy/agent-info-resilience branch from 1062343 to db4e231 Compare May 19, 2026 17:54

tylfin wants to merge 1 commit into main from tyler.finethy/agent-info-resilience
