fix(debugger): retry agent /info on 5xx #18124
Benchmarks
Benchmark execution time: 2026-05-19 18:18:52
Comparing candidate commit db4e231 in PR branch.
Found 0 performance improvements and 4 performance regressions! Performance is the same for 578 metrics, 10 unstable metrics.
- scenario:iastaspects-lstrip_aspect
- scenario:iastaspectsospath-ospathbasename_aspect
- scenario:span-start
- scenario:telemetryaddmetric-1-count-metric-1-times
3f7902d to 1062343
Add a private _info_with_retry helper used by AgentCheckPeriodicService that retries the agent /info call on transient 5xx responses with a fibonacci backoff. Subsystems extending AgentCheckPeriodicService (e.g. the Dynamic Instrumentation signal uploader) no longer stay offline until the next periodic interval after a single transient 5xx. The public agent.info() helper is unchanged and continues to issue a single request, so existing callers retain their original semantics.
1062343 to db4e231
🎉 All green! 🧪 All tests passed. 🔗 Commit SHA: db4e231
cc @P403n1x87 did we want to move forward with this, or were you going to improve the resiliency alongside the UDS timeout?
Description
Adds a private `_info_with_retry` helper in `ddtrace/internal/agent.py` that wraps the `/info` HTTP call with the same fibonacci backoff already used by the Remote Config client. `AgentCheckPeriodicService._agent_check` now calls the retrying helper directly, so a single transient 5xx from the local trace agent no longer flips subclasses (most visibly the Dynamic Instrumentation signal uploader) back to the `_agent_check` state until the next periodic interval. Snapshots/diagnostics keep flushing and the instance stops looking "not heartbeating" in the UI.

The public `agent.info()` helper is intentionally left unchanged (single attempt, returns `None` and logs a warning on non-2xx). It is shared with Remote Config polling, CI Visibility, LLM Obs and the LLM Obs writer, which all have their own scheduling and timing expectations; adding implicit retries there would change those callers' behavior.

Context: observed on a single staging instance whose local agent intermittently returned 500 on `/info`. RC polling itself was healthy end-to-end; the DI uploader on that host was the blast radius.

Testing
- `tests/tracer/test_agent.py::test_info_does_not_retry_on_5xx` pins that `info()` keeps its single-attempt semantics.
- `tests/tracer/test_agent.py::test_agent_check_retries_info_on_5xx` exercises the retry through `AgentCheckPeriodicService._agent_check` (3 total attempts before giving up, `info_check` receives `None`).
- `test_info` parametrization (incl. 500 case) still passes the contract: persistent 5xx returns `None`.

Risks
Low. Worst case is up to ~0.5s of added latency on the periodic agent health check when the agent is persistently failing with 5xx. Other callers of `agent.info()` (Remote Config, CI Visibility, LLM Obs) are unaffected because their code path is untouched.

Additional Notes
None.
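
For readers who have not seen the diff, the retry behavior outlined in the Description can be approximated by the sketch below. It is not the code from `ddtrace/internal/agent.py`: the names `fibonacci_delays` and `info_with_retry_sketch`, the `fetch` callable returning a `(status, payload)` tuple, and the `initial_wait=0.1` default are all assumptions made for illustration. Only the three-attempt limit, the retry-on-5xx rule, and the `None` result on persistent failure come from the PR text.

```python
import time
from typing import Callable, Iterator, Optional, Tuple


def fibonacci_delays(initial_wait: float, attempts: int) -> Iterator[float]:
    """Yield the waits between attempts: initial_wait * 1, 1, 2, 3, 5, ..."""
    a, b = 1, 1
    for _ in range(attempts - 1):  # no wait is needed after the final attempt
        yield initial_wait * a
        a, b = b, a + b


def info_with_retry_sketch(
    fetch: Callable[[], Tuple[int, Optional[dict]]],  # hypothetical transport hook
    attempts: int = 3,           # "3 total attempts" per the Testing section
    initial_wait: float = 0.1,   # assumed value, not taken from the PR
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Call fetch() up to `attempts` times, retrying only on 5xx statuses."""
    delays = fibonacci_delays(initial_wait, attempts)
    for attempt in range(attempts):
        status, payload = fetch()
        if 200 <= status < 300:
            return payload
        if not 500 <= status < 600:
            return None  # other non-2xx responses are not retried
        if attempt < attempts - 1:
            sleep(next(delays))
    return None  # persistent 5xx: the caller sees None, as with info()


if __name__ == "__main__":
    # Two transient failures followed by a success.
    responses = iter([(500, None), (503, None), (200, {"endpoints": []})])
    waits = []
    result = info_with_retry_sketch(lambda: next(responses), sleep=waits.append)
    print(result, waits)
```

Injecting `sleep` keeps the sketch testable without real delays. With the assumed `initial_wait`, a persistently failing agent costs two waits per health check; the actual worst-case figure depends on the backoff parameters used in the PR.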