feat(trace-id): add DD_TRACE_SECURE_RANDOM option for guaranteed ID uniqueness #224

Draft
litianningdatadog wants to merge 4 commits into main from tianning.li/dd-trace-secure-random

@litianningdatadog litianningdatadog commented May 11, 2026

Tech Doc

https://datadoghq.atlassian.net/browse/SVLS-9140

Background

To reduce cold-start latency, Firecracker-based container platforms snapshot the entire process memory of a warmed-up instance and reuse it to launch new ones. Every resumed instance starts from the same frozen memory image, including any userspace PRNG state that was initialized before the snapshot was taken.

Motivation

Standard PRNGs seed once at startup and produce a deterministic sequence from that point forward. When Firecracker resumes thousands of instances from the same snapshot, every one of them begins at the same position in that sequence. Concurrent instances then generate identical trace IDs and span IDs, corrupting distributed traces and making sampled data statistically meaningless.

The fix cannot live inside each language tracer: from inside a resumed process, a cold start and a snapshot restore are indistinguishable. The process simply wakes up already initialized — no changed PID, no kernel signal, no env var that differs between the two cases.
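
The duplication effect described above can be reproduced in a few lines. This is a minimal sketch, not the tracer's code: `XorShift64` and `ids_after_restore` are hypothetical names, the xorshift generator stands in for any userspace PRNG such as SmallRng, and `clone()` plays the role of a snapshot restore copying process memory.

```rust
/// Hypothetical stand-in for a userspace PRNG (for example SmallRng).
/// All of its state lives in process memory, so a memory snapshot copies it.
#[derive(Clone)]
struct XorShift64 {
    state: u64,
}

impl XorShift64 {
    /// Seed once, as a thread-local PRNG would at initialization.
    /// xorshift needs a nonzero state, so a zero seed is replaced.
    fn new(seed: u64) -> Self {
        let state = if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed };
        XorShift64 { state }
    }

    /// Advance the state deterministically and return the next value.
    fn next_u64(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        x
    }
}

/// Simulate `instances` restores from one snapshot and return the first
/// `ids_per_instance` IDs that each restored instance would generate.
fn ids_after_restore(
    snapshot: &XorShift64,
    instances: usize,
    ids_per_instance: usize,
) -> Vec<Vec<u64>> {
    (0..instances)
        .map(|_| {
            // A restore is modeled as a copy of the frozen PRNG state.
            let mut rng = snapshot.clone();
            (0..ids_per_instance).map(|_| rng.next_u64()).collect()
        })
        .collect()
}
```

Every inner vector returned by `ids_after_restore` is identical, because each simulated instance resumes from the same position in the same sequence.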

Issue

Trace and span IDs come from a thread-local SmallRng (rand 0.8), which is neither fork-safe nor snapshot-safe.

Solution

With DD_TRACE_SECURE_RANDOM=true, ID generation switches to OsRng (getrandom(2) per call). With no userspace PRNG state to freeze, every resumed instance generates an independent sequence, regardless of when the snapshot was taken.
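
To illustrate what "no userspace PRNG state" means in practice, here is a minimal sketch that uses only the standard library. It reads `/dev/urandom` directly as a stand-in for OsRng; that substitution, the Unix-only path, and the function name `os_random_u64` are assumptions made for illustration and are not taken from the PR.

```rust
use std::fs::File;
use std::io::{self, Read};

/// Draw 8 fresh bytes from the kernel on every call.
/// Nothing is cached between calls, so the process holds no PRNG state
/// that a memory snapshot could freeze.
/// Assumption: a Unix-like system that exposes /dev/urandom.
fn os_random_u64() -> io::Result<u64> {
    let mut buf = [0u8; 8];
    File::open("/dev/urandom")?.read_exact(&mut buf)?;
    Ok(u64::from_ne_bytes(buf))
}
```

The trade-off in this sketch is that every ID costs a trip to the kernel, whereas a userspace PRNG only updates a few words of memory.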

Summary

  • SmallRng is seeded from OS entropy once at thread-local initialization; its state is deterministic thereafter. In environments where process memory is cloned, all copies share the same thread-local PRNG state and generate identical trace/span ID sequences. A fork-safety gap for this same reason is noted in the existing TODO comment.
  • When DD_TRACE_SECURE_RANDOM=true, new_trace_id and new_span_id use OsRng (getrandom(2) per call) instead of the SmallRng thread-local. OsRng holds no userspace state, so copies of the same process image no longer replay one another's ID sequence, however the image was duplicated. This also addresses the fork-safety gap noted in the TODO.
  • DD_TRACE_SECURE_RANDOM is registered in the config abstraction layer (supported-configurations.json, SupportedConfigurations enum, Config struct) so it is read through the same path as all other env vars, consistent with the clippy::disallowed-methods rule that bans direct std::env::var calls.
  • TraceidGenerator now takes secure_random: bool at construction time (injected from Config) rather than reading the env var lazily via OnceLock.
  • Cargo.toml: added "getrandom" to the rand features to make OsRng available.
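
The constructor-injection design in the summary can be sketched roughly as follows. This is not the PR's code: `Config`, `IdGenerator`, and the `seed` parameter are hypothetical, a xorshift generator stands in for SmallRng, `/dev/urandom` stands in for OsRng, zero is assumed to be the INVALID ID, and the trace ID layout (32-bit Unix seconds, 32 zero bits, 64 random bits) is an assumption based on the "timestamp | zeroes | random" wording in the test plan.

```rust
use std::fs::File;
use std::io::Read;
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical slice of the tracer config; only the flag matters here.
struct Config {
    trace_secure_random: bool,
}

/// Hypothetical generator: the flag is fixed at construction time,
/// so no environment lookup happens on the ID-generation path.
struct IdGenerator {
    secure_random: bool,
    /// State of the default userspace PRNG (xorshift64 as a SmallRng stand-in).
    state: u64,
}

impl IdGenerator {
    fn new(config: &Config, seed: u64) -> Self {
        IdGenerator {
            secure_random: config.trace_secure_random,
            state: seed | 1, // xorshift state must be nonzero
        }
    }

    /// One random u64 from whichever source the flag selects.
    fn random_u64(&mut self) -> u64 {
        if self.secure_random {
            // Stand-in for OsRng: fresh kernel entropy on every call.
            let mut buf = [0u8; 8];
            File::open("/dev/urandom")
                .and_then(|mut f| f.read_exact(&mut buf))
                .expect("reading /dev/urandom failed");
            u64::from_ne_bytes(buf)
        } else {
            self.state ^= self.state << 13;
            self.state ^= self.state >> 7;
            self.state ^= self.state << 17;
            self.state
        }
    }

    /// Retry until the ID is nonzero (zero is assumed to be INVALID).
    fn new_span_id(&mut self) -> u64 {
        loop {
            let id = self.random_u64();
            if id != 0 {
                return id;
            }
        }
    }

    /// Assumed layout: 32-bit Unix seconds | 32 zero bits | 64 random bits.
    fn new_trace_id(&mut self) -> u128 {
        let secs = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .expect("system clock is before the Unix epoch")
            .as_secs() as u32;
        ((secs as u128) << 96) | self.new_span_id() as u128
    }
}
```

Passing the flag in through the constructor keeps the generator testable without touching process environment variables: a test can build one generator per mode and compare their behavior directly.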

Test plan

  • test_trace_id_generator: trace ID follows timestamp | zeroes | random format, timestamp within bounds
  • test_new_trace_id_nonzero: generated trace ID is not INVALID
  • test_new_span_id_nonzero: generated span ID is not INVALID
  • test_osrng_produces_varied_values: 100 OsRng calls → >90 unique u64 values
  • test_secure_random_trace_id_format: secure_random=true path still produces correct timestamp | zeroes | random format
  • test_secure_random_span_id_nonzero: OsRng-backed span ID is not INVALID
  • test_secure_random_produces_varied_values: 100 calls with secure_random=true → >90 unique span IDs
  • test_trace_secure_random_default: config defaults to false with no env var
  • test_trace_secure_random_from_env: "true" → true, "false" → false, invalid value → false (default)
  • Existing propagation/mapping tests unaffected (13 total pass)
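
The parsing behavior that the two config tests describe can be sketched as a pure function. The name `parse_secure_random` and the `Option<&str>` input are hypothetical; the real value is read through the config abstraction layer, and only the three cases listed in the test plan (plus the unset default) are covered here. Whether other spellings such as "TRUE" or "1" are accepted is not stated in the PR, so this sketch treats them as invalid.

```rust
/// Default when the variable is unset or cannot be parsed.
const DEFAULT_SECURE_RANDOM: bool = false;

/// Map the raw DD_TRACE_SECURE_RANDOM value to a bool.
/// Taking the raw value as an argument avoids a direct environment read,
/// which keeps the helper trivial to unit test.
fn parse_secure_random(raw: Option<&str>) -> bool {
    match raw {
        Some("true") => true,
        Some("false") => false,
        // Unset or invalid values fall back to the default.
        _ => DEFAULT_SECURE_RANDOM,
    }
}
```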

🤖 Generated with Claude Code

litianningdatadog and others added 4 commits May 11, 2026 14:52
…niqueness

SmallRng is seeded from OS entropy once at thread-local initialization, but
its state is deterministic thereafter. In environments where process memory is
cloned (e.g. VM snapshots, certain fork patterns), all clones share the same
thread-local PRNG state and generate identical trace and span ID sequences.
A fork-safety concern for this same reason is also noted in the existing TODO
comment in the source.

When DD_TRACE_SECURE_RANDOM=true, new_trace_id and new_span_id use OsRng
(getrandom(2) per call) instead of the SmallRng thread-local. OsRng holds no
userspace state, so ID uniqueness is guaranteed regardless of how the process
image was created or duplicated. This also addresses the fork-safety gap
noted in the TODO at no additional cost.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…n layer

Direct std::env::var call violated the clippy::disallowed-methods rule.
Registers DD_TRACE_SECURE_RANDOM in the config abstraction layer and injects
the value into TraceidGenerator at construction time instead of reading env
directly, removing the OnceLock. Adds unit tests for env parsing and the
secure-random ID generation path.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>