
fix(debugger): retry agent /info on 5xx#18124

Open
tylfin wants to merge 1 commit into main from tyler.finethy/agent-info-resilience

@tylfin
Member

@tylfin tylfin commented May 17, 2026

Description

Adds a private _info_with_retry helper in ddtrace/internal/agent.py that wraps the /info HTTP call with the same Fibonacci backoff already used by the Remote Config client. AgentCheckPeriodicService._agent_check now calls the retrying helper directly. As a result, a single transient 5xx from the local trace agent no longer sends subclasses (most visibly the Dynamic Instrumentation signal uploader) back to the _agent_check state, where they previously stayed until the next periodic interval. Snapshots and diagnostics keep flushing, and the instance no longer appears as "not heartbeating" in the UI.
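For readers unfamiliar with the pattern, here is a minimal standalone sketch of a Fibonacci-backoff retry around a status-returning call. It is not the code from this PR and does not use ddtrace: the names info_with_retry, fibonacci_delays and fetch, the (status, payload) return shape, and the initial_wait and max_wait values are all assumptions made for illustration. Only the total of 3 attempts comes from the Testing section.

```python
import time
from typing import Callable, Iterator, Optional, Tuple

# Hypothetical shape: (HTTP status, decoded payload or None).
Response = Tuple[int, Optional[dict]]


def fibonacci_delays(initial: float, maximum: float) -> Iterator[float]:
    """Yield an endless Fibonacci-shaped sequence of delays capped at `maximum`."""
    a, b = initial, initial
    while True:
        yield min(a, maximum)
        a, b = b, a + b


def info_with_retry(
    fetch: Callable[[], Response],
    attempts: int = 3,          # total attempts, as stated in the PR's tests
    initial_wait: float = 0.1,  # assumed value, not taken from the PR
    max_wait: float = 1.0,      # assumed value, not taken from the PR
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Call `fetch`, retrying only on 5xx; return the payload or None."""
    delays = fibonacci_delays(initial_wait, max_wait)
    for attempt in range(attempts):
        status, payload = fetch()
        if 200 <= status < 300:
            return payload
        if not 500 <= status < 600:
            # Non-5xx failures are treated as permanent in this sketch.
            return None
        if attempt < attempts - 1:
            # Back off before the next attempt, but not after the last one.
            sleep(next(delays))
    return None
```

Injecting sleep as a parameter keeps the sketch testable without real waiting; a caller that wants the single-attempt behavior simply keeps calling the non-retrying function.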

The public agent.info() helper is intentionally left unchanged (single attempt, returns None and logs a warning on non-2xx). It is shared with Remote Config polling, CI Visibility, LLM Obs and the LLM Obs writer, which all have their own scheduling and timing expectations; adding implicit retries there would change those callers' behavior.

Context: observed on a single staging instance whose local agent intermittently returned 500 on /info. RC polling itself was healthy end-to-end; the DI uploader on that host was the blast radius.

Testing

  • tests/tracer/test_agent.py::test_info_does_not_retry_on_5xx pins that info() keeps its single-attempt semantics.
  • tests/tracer/test_agent.py::test_agent_check_retries_info_on_5xx exercises the retry through AgentCheckPeriodicService._agent_check (3 total attempts before giving up, info_check receives None).
  • The existing test_info parametrization (including the 500 case) still passes and upholds the contract: a persistent 5xx returns None.
  • Both new tests verified locally on py3.10.
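The two new tests pin complementary behaviors, which can be illustrated with unittest.mock against stand-in functions. This is a hedged sketch of the testing pattern only: the info and agent_check functions below are hypothetical stand-ins written for this example, not the ddtrace implementations, and the real tests in tests/tracer/test_agent.py may be structured differently.

```python
from unittest import mock

ATTEMPTS = 3  # total attempts stated in the PR description


def info(request):
    """Stand-in for the single-attempt helper: None on a non-2xx status."""
    status, payload = request()
    return payload if 200 <= status < 300 else None


def agent_check(request, info_check):
    """Stand-in for the retrying check: retry 5xx, then hand the result on."""
    result = None
    for _ in range(ATTEMPTS):
        status, payload = request()
        if 200 <= status < 300:
            result = payload
            break
        if not 500 <= status < 600:
            break  # only 5xx responses are retried in this sketch
    info_check(result)


def test_info_does_not_retry_on_5xx():
    request = mock.Mock(return_value=(500, None))
    assert info(request) is None
    assert request.call_count == 1  # exactly one attempt


def test_agent_check_retries_info_on_5xx():
    request = mock.Mock(return_value=(500, None))
    info_check = mock.Mock()
    agent_check(request, info_check)
    assert request.call_count == ATTEMPTS  # retried up to the limit
    info_check.assert_called_once_with(None)  # gives up with None
```

Asserting on call_count is what distinguishes the two contracts: both paths end with None on a persistent 5xx, so only the number of requests shows whether a retry happened.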

Risks

Low. In the worst case, the periodic agent health check takes up to ~0.5s longer when the agent persistently returns 5xx. Other callers of agent.info() (Remote Config, CI Visibility, LLM Obs) are unaffected because their code path is untouched.

Additional Notes

None.

@tylfin tylfin requested review from a team as code owners May 17, 2026 13:07
@tylfin tylfin requested review from emmettbutler and mabdinur May 17, 2026 13:07

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9f02b954be


Comment thread ddtrace/internal/agent.py
@cit-pr-commenter-54b7da

Codeowners resolved as

ddtrace/internal/agent.py                                               @DataDog/apm-core-python
releasenotes/notes/agent-info-retry-0a8acdda8d13a505.yaml               @DataDog/apm-python
tests/tracer/test_agent.py                                              @DataDog/apm-sdk-capabilities-python @DataDog/apm-core-python

@pr-commenter

pr-commenter Bot commented May 17, 2026

Benchmarks

Benchmark execution time: 2026-05-19 18:18:52

Comparing candidate commit db4e231 in PR branch tyler.finethy/agent-info-resilience with baseline commit 2166252 in branch main.

Found 0 performance improvements and 4 performance regressions! Performance is the same for 578 metrics, 10 unstable metrics.

scenario:iastaspects-lstrip_aspect

  • 🟥 execution_time [+55.977µs; +61.418µs] or [+21.118%; +23.171%]

scenario:iastaspectsospath-ospathbasename_aspect

  • 🟥 execution_time [+106.723µs; +113.994µs] or [+26.351%; +28.146%]

scenario:span-start

  • 🟥 execution_time [+1.417ms; +1.574ms] or [+9.117%; +10.125%]

scenario:telemetryaddmetric-1-count-metric-1-times

  • 🟥 execution_time [+267.777ns; +309.471ns] or [+12.999%; +15.023%]

@tylfin tylfin changed the title from "fix(tracing): retry agent /info on 5xx" to "fix(debugger): retry agent /info on 5xx" May 19, 2026
@tylfin tylfin requested a review from P403n1x87 May 19, 2026 13:40
@tylfin tylfin force-pushed the tyler.finethy/agent-info-resilience branch from 3f7902d to 1062343 Compare May 19, 2026 17:43
Add a private _info_with_retry helper used by AgentCheckPeriodicService
that retries the agent /info call on transient 5xx responses with a
fibonacci backoff. Subsystems extending AgentCheckPeriodicService (e.g.
the Dynamic Instrumentation signal uploader) no longer stay offline
until the next periodic interval after a single transient 5xx. The
public agent.info() helper is unchanged and continues to issue a single
request, so existing callers retain their original semantics.
@tylfin tylfin force-pushed the tyler.finethy/agent-info-resilience branch from 1062343 to db4e231 Compare May 19, 2026 17:54
@datadog-prod-us1-4
Contributor

datadog-prod-us1-4 Bot commented May 19, 2026

Tests

🎉 All green!

🧪 All tests passed
❄️ No new flaky tests detected

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: db4e231

@tylfin
Member Author

tylfin commented May 21, 2026

cc @P403n1x87 did we want to move forward with this or were you going to improve the resiliency alongside the UDS timeout?
