feat(trace-id): add DD_TRACE_SECURE_RANDOM option for guaranteed ID uniqueness #224
Draft
litianningdatadog wants to merge 4 commits into
…niqueness

SmallRng is seeded from OS entropy once at thread-local initialization, but its state is deterministic thereafter. In environments where process memory is cloned (e.g. VM snapshots, certain fork patterns), all clones share the same thread-local PRNG state and generate identical trace and span ID sequences. A fork-safety concern for this same reason is also noted in the existing TODO comment in the source.

When DD_TRACE_SECURE_RANDOM=true, new_trace_id and new_span_id use OsRng (getrandom(2) per call) instead of the SmallRng thread-local. OsRng holds no userspace state, so ID uniqueness is guaranteed regardless of how the process image was created or duplicated. This also addresses the fork-safety gap noted in the TODO at no additional cost.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…n layer

Direct std::env::var call violated the clippy::disallowed-methods rule. Registers DD_TRACE_SECURE_RANDOM in the config abstraction layer and injects the value into TraceidGenerator at construction time instead of reading env directly, removing the OnceLock. Adds unit tests for env parsing and the secure-random ID generation path.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Tech Doc
https://datadoghq.atlassian.net/browse/SVLS-9140
Background
To reduce cold-start latency, Firecracker-based container platforms snapshot the entire process memory of a warmed-up instance and reuse it to launch new ones. Every resumed instance starts from the same frozen memory image, including any userspace PRNG state that was initialized before the snapshot was taken.
Motivation
Standard PRNGs seed once at startup and produce a deterministic sequence from that point forward. When Firecracker resumes thousands of instances from the same snapshot, every one of them begins at the same position in that sequence. Concurrent instances then generate identical trace IDs and span IDs, corrupting distributed traces and making sampled data statistically meaningless.
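To make the failure mode concrete, here is a minimal sketch of what happens when PRNG state is copied along with process memory. `XorShift64` and `ids_after_restore` are hypothetical names, and xorshift64 is only a stand-in for a userspace PRNG; it is not the algorithm `SmallRng` uses. Cloning the struct plays the role of the snapshot restore.

```rust
/// Minimal xorshift64 generator, used here only as a stand-in for a
/// userspace PRNG. It is NOT the algorithm behind `SmallRng`.
#[derive(Clone)]
pub struct XorShift64 {
    state: u64,
}

impl XorShift64 {
    pub fn new(seed: u64) -> Self {
        // xorshift stays at zero forever, so swap in a fixed nonzero seed.
        let state = if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed };
        XorShift64 { state }
    }

    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        x
    }
}

/// Simulates a snapshot restore: the warmed-up generator is copied into
/// `instances` clones, and each clone then produces `ids_per_instance` IDs.
pub fn ids_after_restore(
    warm: &XorShift64,
    instances: usize,
    ids_per_instance: usize,
) -> Vec<Vec<u64>> {
    (0..instances)
        .map(|_| {
            // Every "instance" starts from the same frozen state.
            let mut rng = warm.clone();
            (0..ids_per_instance).map(|_| rng.next_u64()).collect()
        })
        .collect()
}
```

Every inner vector returned by `ids_after_restore` is identical, because nothing distinguishes one clone's state from another's. Seeding from OS entropy before the snapshot does not help, since the seed is part of what gets copied.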
A tracer cannot fix this by detecting the restore and reseeding: from inside a resumed process, a cold start and a snapshot restore are indistinguishable. The process simply wakes up already initialized, with no changed PID, no kernel signal, and no env var that differs between the two cases.
Issue
Trace and span IDs come from a thread-local SmallRng (rand 0.8), which is neither fork-safe nor snapshot-safe.
Solution
With DD_TRACE_SECURE_RANDOM=true, ID generation switches to OsRng (getrandom(2) per call). With no userspace PRNG state to freeze, every resumed instance generates an independent sequence, regardless of when the snapshot was taken.
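The per-call idea can be sketched with the standard library alone. This is an illustration under stated assumptions, not the PR's code: it reads `/dev/urandom` directly as a stand-in for `OsRng`, so it assumes a Unix-like system, and `os_random_u64` and `secure_span_id` are hypothetical names. It also assumes that zero is the reserved invalid ID.

```rust
use std::fs::File;
use std::io::{self, Read};

/// Reads 8 fresh bytes from the kernel on every call. No generator state is
/// kept in process memory between calls, so there is nothing for a snapshot
/// to freeze. ASSUMPTION: `/dev/urandom` exists (Unix-like systems only).
pub fn os_random_u64() -> io::Result<u64> {
    let mut buf = [0u8; 8];
    File::open("/dev/urandom")?.read_exact(&mut buf)?;
    Ok(u64::from_ne_bytes(buf))
}

/// Draws until the value is nonzero.
/// ASSUMPTION: zero is the reserved "invalid" span ID.
pub fn secure_span_id() -> io::Result<u64> {
    loop {
        let id = os_random_u64()?;
        if id != 0 {
            return Ok(id);
        }
    }
}
```

The trade-off is a kernel round trip per ID instead of a few arithmetic operations, which is presumably why the option is opt-in rather than the default.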
Summary
- SmallRng is seeded from OS entropy once at thread-local initialization; its state is deterministic thereafter. In environments where process memory is cloned, all copies share the same thread-local PRNG state and generate identical trace/span ID sequences. A fork-safety gap for this same reason is noted in the existing TODO comment.
- When DD_TRACE_SECURE_RANDOM=true, new_trace_id and new_span_id use OsRng (getrandom(2) per call) instead of the SmallRng thread-local. OsRng holds no userspace state, ensuring ID uniqueness regardless of how the process image was duplicated. This also addresses the fork-safety gap noted in the TODO.
- DD_TRACE_SECURE_RANDOM is registered in the config abstraction layer (supported-configurations.json, SupportedConfigurations enum, Config struct) so it is read through the same path as all other env vars, consistent with the clippy::disallowed-methods rule that bans direct std::env::var calls.
- TraceidGenerator now takes secure_random: bool at construction time (injected from Config) rather than reading the env var lazily via OnceLock.
- Cargo.toml: added "getrandom" to the rand features to make OsRng available.

Test plan
- test_trace_id_generator: trace ID follows timestamp | zeroes | random format, timestamp within bounds
- test_new_trace_id_nonzero: generated trace ID is not INVALID
- test_new_span_id_nonzero: generated span ID is not INVALID
- test_osrng_produces_varied_values: 100 OsRng calls → >90 unique u64 values
- test_secure_random_trace_id_format: secure_random=true path still produces correct timestamp | zeroes | random format
- test_secure_random_span_id_nonzero: OsRng-backed span ID is not INVALID
- test_secure_random_produces_varied_values: 100 calls with secure_random=true → >90 unique span IDs
- test_trace_secure_random_default: config defaults to false with no env var
- test_trace_secure_random_from_env: "true" → true, "false" → false, invalid value → false (default)

🤖 Generated with Claude Code
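The properties checked in the test plan can be sketched with a few small helpers. All names here are hypothetical and none of this is the PR's code. Two assumptions are worth flagging: the flag parser matches "true" and "false" exactly, and the `timestamp | zeroes | random` layout is taken to be a 128-bit ID with a 32-bit timestamp in seconds, 32 zero bits, and 64 random bits. The source names the three fields but does not state their widths.

```rust
use std::collections::HashSet;

/// Parses a DD_TRACE_SECURE_RANDOM value. Only the exact string "true"
/// enables the option; unset, "false", and invalid input all yield the
/// default, `false`.
/// ASSUMPTION: exact matching; the real parser may accept other spellings.
pub fn parse_secure_random(raw: Option<&str>) -> bool {
    matches!(raw, Some("true"))
}

/// Packs a 128-bit trace ID as `timestamp | zeroes | random`.
/// ASSUMPTION: 32-bit timestamp, 32 zero bits, 64 random bits.
pub fn compose_trace_id(timestamp_secs: u32, random: u64) -> u128 {
    ((timestamp_secs as u128) << 96) | (random as u128)
}

/// Splits a trace ID back into (timestamp, middle bits, random) so a test
/// can assert on each field separately.
pub fn split_trace_id(id: u128) -> (u32, u32, u64) {
    ((id >> 96) as u32, (id >> 64) as u32, id as u64)
}

/// Counts distinct values, as in the "100 calls, more than 90 unique" checks.
pub fn count_unique(values: &[u64]) -> usize {
    values.iter().collect::<HashSet<_>>().len()
}
```

A format test under these assumptions would compose an ID, split it, and assert that the middle field is zero and the timestamp falls between the clock readings taken before and after the call.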