Skip to content

feat(trace_buffer): adopt libdatadog's TraceBuffer & borrow the host tokio runtime#226

Open
Aaalibaba42 wants to merge 6 commits into
mainfrom
jwiriath/APMSP-2990-adopt-libdatadog-s-trace-buffer
Open

feat(trace_buffer): adopt libdatadog's TraceBuffer & borrow the host tokio runtime#226
Aaalibaba42 wants to merge 6 commits into
mainfrom
jwiriath/APMSP-2990-adopt-libdatadog-s-trace-buffer

Conversation

@Aaalibaba42

@Aaalibaba42 Aaalibaba42 commented May 13, 2026

Copy link
Copy Markdown

What?

Replace the in-tree AsyncTraceExporter (buffering, flushing, shutdown
orchestration) with libdatadog's TraceBuffer + SharedRuntime, the same
implementation that was upstreamed from this repo. The DatadogExporter is now
a thin wrapper that owns a TraceBuffer and a SharedRuntime instead of
managing its own background thread, batch queue, and condvar synchronisation.

When the exporter is constructed from inside a tokio runtime — the common case
for an async application booting up dd-trace-rs from its #[tokio::main] — the
underlying SharedRuntime now borrows the host runtime instead of spawning
a second one. Borrowed mode is what makes Drop and shutdown work cleanly
from a host worker thread without nested-runtime panics.

Why?

  • The trace buffer logic was moved to libdatadog so it can be shared across
    language tracers. Keeping a duplicate copy here means diverging bug fixes,
    duplicated tests, and extra maintenance. Adopting the libdatadog version
    lets us delete ~860 lines of exporter/mod.rs and benefit from upstream
    improvements (e.g. runtime-aware shutdown) for free.
  • The previous in-tree exporter, and naïve adoption of the libdatadog one,
    both panicked with Cannot start a runtime from within a runtime when the
    user owned a tokio runtime — which is the only sane way to run a modern
    Rust web service. Borrowed-mode SharedRuntime is the proper fix.

How?

1. Adopt TraceBuffer + SharedRuntime

  • Deleted datadog-opentelemetry/src/exporter/mod.rs entirely.
  • Rewired DatadogExporter in span_exporter.rs to construct a
    TraceExporter via TraceExporterBuilder and wrap it in a TraceBuffer.
    MapperExporter implements Export<SpanData> for the OTel→DD conversion
    and lives alongside the new DatadogExporter.

2. Pick the right runtime backing at construction time

let runtime = match tokio::runtime::Handle::try_current() {
    Ok(handle) => Arc::new(SharedRuntime::from_handle(handle)),
    Err(_)     => Arc::new(SharedRuntime::new()?),
};

If we are constructed from inside a tokio runtime, borrow it; otherwise spin
up our own. Borrowed mode gives up fork-safety in exchange for letting
Drop/shutdown work without block_on — see the borrowed-runtime work in
libdd-shared-runtime.

3. Drive the async builder from a sync constructor

DatadogExporter::new is sync (-> Self) but the new
TraceExporterBuilder::build_async is async. Calling
SharedRuntime::block_on in borrowed mode returns
Unsupported; calling tokio::runtime::Handle::block_on from a tokio worker
panics. Solution: a small poll_to_completion helper — a hand-rolled,
thread-park-based executor — that drives build_async to completion. This is
safe because build_async's only suspension point is a single non-blocking
mpsc send into a freshly-created telemetry-worker channel, which is always
Ready on first poll. The contract is documented at the helper's call site.

4. Non-blocking shutdown

DatadogExporter::shutdown now uses
SharedRuntime::trigger_shutdown_signal + wait_shutdown_done (Condvar-based,
no block_on). To avoid deadlocking the single worker thread of a
current_thread host runtime, the wait is conditional on the host runtime's
flavor:

  • Multi-threaded hosttokio::task::block_in_place around
    wait_shutdown_done so other workers can keep driving the shutdown tasks.
  • current_thread host — fire-and-forget trigger_shutdown_signal and
    skip the Condvar wait entirely; blocking the only worker thread would
    prevent the shutdown tasks from ever running. This is best-effort; an
    async-aware shutdown_async API is the long-term fix.

A small tokio_is_multi_thread() helper does the flavor detection.

5. Drop for DatadogExporter

Callers that forget to call shutdown() now get a best-effort teardown with a
1s timeout. The impl is idempotent: a prior explicit shutdown makes the
Drop a no-op. Regression tests exercise drop-inside-tokio and
drop-after-explicit-shutdown, both asserting the borrowed-runtime path
(exporter.runtime.is_borrowed()).

Tests

  • Re-enabled (and rewrote) test_drop_inside_tokio_runtime and
    test_drop_after_explicit_shutdown_is_noop, which used to be #[ignore]d
    while the tokio-in-tokio panic was open. Both now run under
    #[tokio::test(flavor = "multi_thread", worker_threads = 2)] and assert
    exporter.runtime.is_borrowed().
  • Added test_new_outside_tokio_picks_owned_runtime for the owned-runtime
    fallback.
  • Existing snapshot and integration tests pass; flaky test agent /
    global-state tests are unchanged.

Additional notes

  • libdatadog deps in Cargo.toml currently point to the
    jwiriath/tracebuffer-sharedruntime-rusttracer branch via path deps for
    local development; the workspace comment documents that these need to flip
    back to git refs (or pinned versions) once the libdatadog PR lands and is
    published.
  • The test-utils feature gate exposes a wait_ready hook on
    MapperExporter so snapshot tests can wait for agent info before asserting
    on stats-derived metrics.
  • No change to the public DatadogExporter API surface: send_chunk,
    force_flush, shutdown, and queue_metrics all keep their existing
    signatures.

@datadog-prod-us1-5

This comment has been minimized.

@Aaalibaba42 Aaalibaba42 force-pushed the jwiriath/APMSP-2990-adopt-libdatadog-s-trace-buffer branch 3 times, most recently from 3ba789c to e6304ac Compare May 15, 2026 13:35
@Aaalibaba42 Aaalibaba42 changed the title feat(trace_buffer): adopt libdatadog's trace buffer implementation feat(trace_buffer): adopt libdatadog's TraceBuffer and SharedRuntime May 19, 2026
@Aaalibaba42 Aaalibaba42 force-pushed the jwiriath/APMSP-2990-adopt-libdatadog-s-trace-buffer branch from 3fdeef4 to 167d226 Compare May 19, 2026 12:31
@Aaalibaba42 Aaalibaba42 marked this pull request as ready for review May 19, 2026 12:31
@Aaalibaba42 Aaalibaba42 requested a review from a team as a code owner May 19, 2026 12:31
@Aaalibaba42 Aaalibaba42 changed the title feat(trace_buffer): adopt libdatadog's TraceBuffer and SharedRuntime feat(trace_buffer): adopt libdatadog's TraceBuffer & borrow the host tokio runtime May 26, 2026

@paullegranddc paullegranddc left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, the way this currently works , inheriting the runtime and trying to perform a shutdown in the caller tokio context seems dodgy.

I think it would be cleaner and less brittle if we

  1. always created a new runtime (possibly a current thread one)
  2. sent it in a separate thread
  3. create the trace exporter and spawn tasks in the new thread
  4. wait on a shutdown signal, where we shutdown the shared runtime and drop it.

Comment thread Cargo.toml
Comment on lines +27 to +38
# Local development uses sibling path deps so changes in ../libdatadog are picked up
# without a roundtrip through GitHub. CI / publish must switch this back to the `git`
# form (see commit `3e5ed24` for the precedent). Once the borrowed-runtime work in
# libdatadog lands and is pushed, replace these with the corresponding git refs.
libdd-data-pipeline = { path = "../libdatadog/libdd-data-pipeline", default-features = false, features = ["telemetry"] }
libdd-trace-utils = { path = "../libdatadog/libdd-trace-utils", default-features = false }
libdd-telemetry = { path = "../libdatadog/libdd-telemetry", default-features = false }
libdd-common = { path = "../libdatadog/libdd-common", default-features = false }
libdd-tinybytes = { path = "../libdatadog/libdd-tinybytes", default-features = false }
libdd-library-config = { path = "../libdatadog/libdd-library-config", default-features = false }
libdd-shared-runtime = { path = "../libdatadog/libdd-shared-runtime", default-features = false }
libdd-capabilities-impl = { path = "../libdatadog/libdd-capabilities-impl", default-features = false }

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a reminder that this shouldn't be merged

@@ -0,0 +1,28 @@
[[

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why have the snapshot test changed?

Comment on lines +42 to +47
let runtime = match tokio::runtime::Handle::try_current() {
Ok(handle) => Arc::new(SharedRuntime::from_handle(handle)),
Err(_) => Arc::new(
SharedRuntime::new().expect("failed to create SharedRuntime for trace exporter"),
),
};

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DIscussed IRL but I would prefer if we always created a new runtime for the SharedRuntime, and sent it in a brand new thread, not risking interacting with it from the customer side

Comment on lines +82 to +87
// Drive the async builder to completion. The future's only suspension point is
// a single non-blocking mpsc `send` into a freshly-created telemetry-worker
// channel, which is always Ready on first poll. We deliberately use
// [`poll_to_completion`] instead of `SharedRuntime::block_on` (which errors in
// borrowed mode) or `tokio::Handle::block_on` (which panics from inside a tokio
// worker thread). See [`poll_to_completion`]'s docs for the contract.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a very brittle expectations

// alive SharedRuntime) before we tear the workers down. Without this, fast-finishing
// applications would shut the workers down before the stats worker ever ran, dropping
// the stats payload and breaking expectations downstream.
let flush_result = if self.runtime.is_borrowed() && tokio_is_multi_thread() {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if the runtime is not multi threaded we just block?
This is not correct as the shutdown is an async operation, so if the shared runtime inherited the same current thread runtime as the app, we will block it and the trace buffer flush won't be able to make progress

// trace-buffer worker's flush while we park on the Condvar inside
// `flush_and_wait`. Without this, the Condvar wait would block the only
// tokio worker thread available to make progress.
tokio::task::block_in_place(|| self.trace_buffer.flush_and_wait(remaining()))

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also shutting down the trace buffer should already be flushing lefotver spans in the pipe. No need to flush and wait here

}
}

fn poll_to_completion<F: std::future::Future>(future: F) -> F::Output {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer if we used https://docs.rs/futures/latest/futures/executor/fn.block_on.html which does the same

@mabdinur

mabdinur commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Hey @Aaalibaba42,

Hope you’re doing well! I have a PR that’s currently blocked on this change, so I wanted to check in and see if you have a rough ETA for when this might be merged.

No rush if you’re still working through it, just trying to plan around the dependency 😄

@Aaalibaba42

Copy link
Copy Markdown
Author

Hey @mabdinur, this might still be a few days. The changes here lead to discussions and changes in libdatadog that must be settled before this is merged, I'm hoping to have this merged before the end of next week but it's the last PR of the effort :/.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants