revert(profiling): revert _DD_PROFILING_STACK_FAST_COPY default to false#17929
Merged
Conversation
Revert the default flip from #17757 while we investigate the profiler shutdown segfault tracked in PROF-14568 (incident-53849). The fast-copy path can still be opted into via _DD_PROFILING_STACK_FAST_COPY=true. Refs: PROF-14568
Codeowners resolved as |
This comment has been minimized.
This comment has been minimized.
BenchmarksBenchmark execution time: 2026-05-07 17:26:50 Comparing candidate commit 339c95e in PR branch Found 0 performance improvements and 6 performance regressions! Performance is the same for 587 metrics, 4 unstable metrics. scenario:iastaspects-lower_aspect
scenario:iastaspects-stringio_noaspect
scenario:iastaspectsospath-ospathbasename_aspect
scenario:iastaspectssplit-rsplit_aspect
scenario:span-start
scenario:telemetryaddmetric-1-count-metric-1-times
|
wantsui
reviewed
May 6, 2026
wantsui
approved these changes
May 6, 2026
Collaborator
wantsui
left a comment
There was a problem hiding this comment.
Approving, just note the concern with the release note.
brettlangdon
approved these changes
May 7, 2026
dubloom
pushed a commit
that referenced
this pull request
May 11, 2026
…lse (#17929) ## Description Reverts the default of `_DD_PROFILING_STACK_FAST_COPY` from `true` (set in #17757) back to `false` while we investigate the profiler-shutdown segfault tracked in [PROF-14568](https://datadoghq.atlassian.net/browse/PROF-14568) (Slack `#incident-53849`, channel `C0B1H299S4R`). Reproduces 5/8 natively on x86_64 Linux Python 3.11.13 with default `safe_memcpy`. With `pytest -p no:faulthandler` it drops to **0/8**, which combined with the post-mortem core analysis pinpoints the actual mechanism: `safe_memcpy`'s `sigsetjmp/siglongjmp` recovery is incompatible with pytest's built-in faulthandler plugin owning the SIGSEGV `sigaction`. Switching the default to `process_vm_readv` (a kernel syscall that returns `-1/EFAULT` cleanly on bad source) sidesteps the SIGSEGV path entirely and is the fastest unblock for the serverless team's release. A proper fix for the underlying profiler-shutdown race (sampler walking frames during `scheduler.flush()`-driven imports) is tracked separately in PROF-14568. Concrete changes: - `ddtrace/internal/settings/profiling.py` — `fast_copy` default `True` → `False`. Users opting in via `_DD_PROFILING_STACK_FAST_COPY=1` are unaffected. - `riotfile.py` — flipped the four dedicated profile-uwsgi venvs from `_DD_PROFILING_STACK_FAST_COPY=0` to `=1` (and updated the comment) so the non-default path is still exercised in riot. - `releasenotes/notes/profiling-phase-out-process-vm-readv-97af2e74953bb9e9.yaml` — dropped (the release note from #17757 announced a default flip that we are no longer landing). ## Findings since the original PLAN.md write-up The crash decomposes into three layers, only the first of which this PR fixes: 1. **dd-trace-py: `safe_memcpy`'s recovery is fragile when another component owns SIGSEGV.** pytest's built-in faulthandler plugin runs `faulthandler.enable()` in `pytest_configure`, installing `faulthandler_fatal_error` as the SIGSEGV `sigaction`. Our `init_segv_catcher` is wrapped in `call_once` (`sampler.cpp:172`) and never re-installs after another component overwrites it. Post-mortem core confirms: `t_handler_armed = 1` on the crashing thread (we *did* arm), `g_old_segv.sa_handler = faulthandler_fatal_error` (we saved pytest's handler when ours got installed), and the call stack shows the kernel called `faulthandler_fatal_error`, not our `segv_handler`. `pytest -p no:faulthandler` eliminates the crash 0/8. 2. **dd-trace-py: profiler shutdown invokes Python imports while sampler is live.** `Profiler._stop_service` keeps collectors alive across `scheduler.flush()` "for snapshot"; flush triggers `code_provenance.get_code_provenance_file()` → `_package_for_root_module_mapping()` → cold imports of `setuptools`/`_distutils_hack`/`packaging.*`. Sampler races, reads stale `PyCodeObject*`, faults. With layer 1 fixed (via this PR using `process_vm_readv`) this becomes a dropped sample instead of a crash, but should be properly fixed in a PROF-14568 follow-up. 3. **datadog-lambda-python: `tests/test_api.py:175` calls `os.environ.clear()` without restoring env.** Strips `DD_PROFILING_STACK_FAST_COPY` for the rest of the pytest session, so any subsequent `Profiler.start()` re-reads `config.stack.fast_copy` which defaults to `True` and re-flips `safe_copy` back to `safe_memcpy_wrapper`. Confirmed via gdb breakpoint on `set_fast_copy_enabled` showing `arg=0` (env present) then `arg=1` (env cleared) when test_api runs before test_wrapper. To be filed as a separate `datadog-lambda-python` issue. This PR sidesteps layer 1 universally by defaulting to `process_vm_readv`. With layer 1 gone, layer 3 becomes harmless to the profiler regardless of test pollution. ## Testing - `hatch run lint:fmt` and `hatch run lint:riot` pass. - The C++ runtime path is unchanged; only the Python config default and the riot env values move. The C++ static-init opt-out at `ddtrace/internal/datadog/profiling/stack/src/echion/vm.cc:6-16` already treated unset/truthy as fast-copy-enabled, so behavior with the env var unset is now: Python config = `False` → `set_fast_copy_enabled(false)` → `safe_copy = process_vm_readv` (Linux) or `mach_vm_read_overwrite` (Darwin). - Existing `tests/profiling/collector/test_copy_memory_stats.py` covers both branches explicitly via env var. - Reproduction validation on workspace-tg: pre-fix `safe_memcpy` default = 5/8 crashes; with this PR's default flipped to `process_vm_readv` (or equivalently `pytest -p no:faulthandler` on the old default) = 0/8. ## Risks - **Hardened-kernel cohort.** With `VmReader` removed in #17755, the only non-fast-copy reader is `process_vm_readv`. On Linux systems where `process_vm_readv` is unavailable (e.g. `kernel.yama.ptrace_scope=3`, certain seccomp/sandboxed containers), `set_fast_copy_enabled(false)` will set `failed_safe_copy = true` and disable stack profiling, even though `safe_memcpy` would have worked. This reproduces the pre-#17757 behavior exactly, so it does not regress any user who was already running on a release without #17757. Users who upgraded to a release containing #17757 and were silently relying on `safe_memcpy` in such an environment will lose stack profiling until they set `_DD_PROFILING_STACK_FAST_COPY=1` explicitly. Accepted as a trade-off vs. the shutdown-crash risk. - **No public API change.** `_DD_PROFILING_STACK_FAST_COPY` is a private (`_DD_*`) env var. ## Additional Notes - Tracking: [PROF-14568](https://datadoghq.atlassian.net/browse/PROF-14568) - Original culprit PR: #17757 - Related upstream test PR: #17872 - `changelog/no-changelog` label applied because the env var is private and we are deleting the user-facing release note from #17757 (which announced a default flip that is no longer landing). [PROF-14568]: https://datadoghq.atlassian.net/browse/PROF-14568?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ [PROF-14568]: https://datadoghq.atlassian.net/browse/PROF-14568?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ Co-authored-by: Brett Langdon <brett.langdon@datadoghq.com>
Contributor
|
Thanks for taking care of that, I'll work on the follow-ups. |
P403n1x87
pushed a commit
that referenced
this pull request
May 14, 2026
…lse (#17929) ## Description Reverts the default of `_DD_PROFILING_STACK_FAST_COPY` from `true` (set in #17757) back to `false` while we investigate the profiler-shutdown segfault tracked in [PROF-14568](https://datadoghq.atlassian.net/browse/PROF-14568) (Slack `#incident-53849`, channel `C0B1H299S4R`). Reproduces 5/8 natively on x86_64 Linux Python 3.11.13 with default `safe_memcpy`. With `pytest -p no:faulthandler` it drops to **0/8**, which combined with the post-mortem core analysis pinpoints the actual mechanism: `safe_memcpy`'s `sigsetjmp/siglongjmp` recovery is incompatible with pytest's built-in faulthandler plugin owning the SIGSEGV `sigaction`. Switching the default to `process_vm_readv` (a kernel syscall that returns `-1/EFAULT` cleanly on bad source) sidesteps the SIGSEGV path entirely and is the fastest unblock for the serverless team's release. A proper fix for the underlying profiler-shutdown race (sampler walking frames during `scheduler.flush()`-driven imports) is tracked separately in PROF-14568. Concrete changes: - `ddtrace/internal/settings/profiling.py` — `fast_copy` default `True` → `False`. Users opting in via `_DD_PROFILING_STACK_FAST_COPY=1` are unaffected. - `riotfile.py` — flipped the four dedicated profile-uwsgi venvs from `_DD_PROFILING_STACK_FAST_COPY=0` to `=1` (and updated the comment) so the non-default path is still exercised in riot. - `releasenotes/notes/profiling-phase-out-process-vm-readv-97af2e74953bb9e9.yaml` — dropped (the release note from #17757 announced a default flip that we are no longer landing). ## Findings since the original PLAN.md write-up The crash decomposes into three layers, only the first of which this PR fixes: 1. **dd-trace-py: `safe_memcpy`'s recovery is fragile when another component owns SIGSEGV.** pytest's built-in faulthandler plugin runs `faulthandler.enable()` in `pytest_configure`, installing `faulthandler_fatal_error` as the SIGSEGV `sigaction`. Our `init_segv_catcher` is wrapped in `call_once` (`sampler.cpp:172`) and never re-installs after another component overwrites it. Post-mortem core confirms: `t_handler_armed = 1` on the crashing thread (we *did* arm), `g_old_segv.sa_handler = faulthandler_fatal_error` (we saved pytest's handler when ours got installed), and the call stack shows the kernel called `faulthandler_fatal_error`, not our `segv_handler`. `pytest -p no:faulthandler` eliminates the crash 0/8. 2. **dd-trace-py: profiler shutdown invokes Python imports while sampler is live.** `Profiler._stop_service` keeps collectors alive across `scheduler.flush()` "for snapshot"; flush triggers `code_provenance.get_code_provenance_file()` → `_package_for_root_module_mapping()` → cold imports of `setuptools`/`_distutils_hack`/`packaging.*`. Sampler races, reads stale `PyCodeObject*`, faults. With layer 1 fixed (via this PR using `process_vm_readv`) this becomes a dropped sample instead of a crash, but should be properly fixed in a PROF-14568 follow-up. 3. **datadog-lambda-python: `tests/test_api.py:175` calls `os.environ.clear()` without restoring env.** Strips `DD_PROFILING_STACK_FAST_COPY` for the rest of the pytest session, so any subsequent `Profiler.start()` re-reads `config.stack.fast_copy` which defaults to `True` and re-flips `safe_copy` back to `safe_memcpy_wrapper`. Confirmed via gdb breakpoint on `set_fast_copy_enabled` showing `arg=0` (env present) then `arg=1` (env cleared) when test_api runs before test_wrapper. To be filed as a separate `datadog-lambda-python` issue. This PR sidesteps layer 1 universally by defaulting to `process_vm_readv`. With layer 1 gone, layer 3 becomes harmless to the profiler regardless of test pollution. ## Testing - `hatch run lint:fmt` and `hatch run lint:riot` pass. - The C++ runtime path is unchanged; only the Python config default and the riot env values move. The C++ static-init opt-out at `ddtrace/internal/datadog/profiling/stack/src/echion/vm.cc:6-16` already treated unset/truthy as fast-copy-enabled, so behavior with the env var unset is now: Python config = `False` → `set_fast_copy_enabled(false)` → `safe_copy = process_vm_readv` (Linux) or `mach_vm_read_overwrite` (Darwin). - Existing `tests/profiling/collector/test_copy_memory_stats.py` covers both branches explicitly via env var. - Reproduction validation on workspace-tg: pre-fix `safe_memcpy` default = 5/8 crashes; with this PR's default flipped to `process_vm_readv` (or equivalently `pytest -p no:faulthandler` on the old default) = 0/8. ## Risks - **Hardened-kernel cohort.** With `VmReader` removed in #17755, the only non-fast-copy reader is `process_vm_readv`. On Linux systems where `process_vm_readv` is unavailable (e.g. `kernel.yama.ptrace_scope=3`, certain seccomp/sandboxed containers), `set_fast_copy_enabled(false)` will set `failed_safe_copy = true` and disable stack profiling, even though `safe_memcpy` would have worked. This reproduces the pre-#17757 behavior exactly, so it does not regress any user who was already running on a release without #17757. Users who upgraded to a release containing #17757 and were silently relying on `safe_memcpy` in such an environment will lose stack profiling until they set `_DD_PROFILING_STACK_FAST_COPY=1` explicitly. Accepted as a trade-off vs. the shutdown-crash risk. - **No public API change.** `_DD_PROFILING_STACK_FAST_COPY` is a private (`_DD_*`) env var. ## Additional Notes - Tracking: [PROF-14568](https://datadoghq.atlassian.net/browse/PROF-14568) - Original culprit PR: #17757 - Related upstream test PR: #17872 - `changelog/no-changelog` label applied because the env var is private and we are deleting the user-facing release note from #17757 (which announced a default flip that is no longer landing). [PROF-14568]: https://datadoghq.atlassian.net/browse/PROF-14568?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ [PROF-14568]: https://datadoghq.atlassian.net/browse/PROF-14568?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ Co-authored-by: Brett Langdon <brett.langdon@datadoghq.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Description
Reverts the default of
_DD_PROFILING_STACK_FAST_COPYfromtrue(set in #17757) back tofalsewhile we investigate the profiler-shutdown segfault tracked in PROF-14568 (Slack#incident-53849, channelC0B1H299S4R).Reproduces 5/8 natively on x86_64 Linux Python 3.11.13 with default
safe_memcpy. Withpytest -p no:faulthandlerit drops to 0/8, which combined with the post-mortem core analysis pinpoints the actual mechanism:safe_memcpy'ssigsetjmp/siglongjmprecovery is incompatible with pytest's built-in faulthandler plugin owning the SIGSEGVsigaction. Switching the default toprocess_vm_readv(a kernel syscall that returns-1/EFAULTcleanly on bad source) sidesteps the SIGSEGV path entirely and is the fastest unblock for the serverless team's release. A proper fix for the underlying profiler-shutdown race (sampler walking frames duringscheduler.flush()-driven imports) is tracked separately in PROF-14568.Concrete changes:
ddtrace/internal/settings/profiling.py—fast_copydefaultTrue→False. Users opting in via_DD_PROFILING_STACK_FAST_COPY=1are unaffected.riotfile.py— flipped the four dedicated profile-uwsgi venvs from_DD_PROFILING_STACK_FAST_COPY=0to=1(and updated the comment) so the non-default path is still exercised in riot.releasenotes/notes/profiling-phase-out-process-vm-readv-97af2e74953bb9e9.yaml— dropped (the release note from feat(profiling): phase outprocess_vm_readvin favour ofsafe_memcpy#17757 announced a default flip that we are no longer landing).Findings since the original PLAN.md write-up
The crash decomposes into three layers, only the first of which this PR fixes:
safe_memcpy's recovery is fragile when another component owns SIGSEGV. pytest's built-in faulthandler plugin runsfaulthandler.enable()inpytest_configure, installingfaulthandler_fatal_erroras the SIGSEGVsigaction. Ourinit_segv_catcheris wrapped incall_once(sampler.cpp:172) and never re-installs after another component overwrites it. Post-mortem core confirms:t_handler_armed = 1on the crashing thread (we did arm),g_old_segv.sa_handler = faulthandler_fatal_error(we saved pytest's handler when ours got installed), and the call stack shows the kernel calledfaulthandler_fatal_error, not oursegv_handler.pytest -p no:faulthandlereliminates the crash 0/8.Profiler._stop_servicekeeps collectors alive acrossscheduler.flush()"for snapshot"; flush triggerscode_provenance.get_code_provenance_file()→_package_for_root_module_mapping()→ cold imports ofsetuptools/_distutils_hack/packaging.*. Sampler races, reads stalePyCodeObject*, faults. With layer 1 fixed (via this PR usingprocess_vm_readv) this becomes a dropped sample instead of a crash, but should be properly fixed in a PROF-14568 follow-up.tests/test_api.py:175callsos.environ.clear()without restoring env. StripsDD_PROFILING_STACK_FAST_COPYfor the rest of the pytest session, so any subsequentProfiler.start()re-readsconfig.stack.fast_copywhich defaults toTrueand re-flipssafe_copyback tosafe_memcpy_wrapper. Confirmed via gdb breakpoint onset_fast_copy_enabledshowingarg=0(env present) thenarg=1(env cleared) when test_api runs before test_wrapper. To be filed as a separatedatadog-lambda-pythonissue.This PR sidesteps layer 1 universally by defaulting to
process_vm_readv. With layer 1 gone, layer 3 becomes harmless to the profiler regardless of test pollution.Testing
hatch run lint:fmtandhatch run lint:riotpass.ddtrace/internal/datadog/profiling/stack/src/echion/vm.cc:6-16already treated unset/truthy as fast-copy-enabled, so behavior with the env var unset is now: Python config =False→set_fast_copy_enabled(false)→safe_copy = process_vm_readv(Linux) ormach_vm_read_overwrite(Darwin).tests/profiling/collector/test_copy_memory_stats.pycovers both branches explicitly via env var.safe_memcpydefault = 5/8 crashes; with this PR's default flipped toprocess_vm_readv(or equivalentlypytest -p no:faulthandleron the old default) = 0/8.Risks
VmReaderremoved in refactor(profiling): removeVmReader#17755, the only non-fast-copy reader isprocess_vm_readv. On Linux systems whereprocess_vm_readvis unavailable (e.g.kernel.yama.ptrace_scope=3, certain seccomp/sandboxed containers),set_fast_copy_enabled(false)will setfailed_safe_copy = trueand disable stack profiling, even thoughsafe_memcpywould have worked. This reproduces the pre-feat(profiling): phase outprocess_vm_readvin favour ofsafe_memcpy#17757 behavior exactly, so it does not regress any user who was already running on a release without feat(profiling): phase outprocess_vm_readvin favour ofsafe_memcpy#17757. Users who upgraded to a release containing feat(profiling): phase outprocess_vm_readvin favour ofsafe_memcpy#17757 and were silently relying onsafe_memcpyin such an environment will lose stack profiling until they set_DD_PROFILING_STACK_FAST_COPY=1explicitly. Accepted as a trade-off vs. the shutdown-crash risk._DD_PROFILING_STACK_FAST_COPYis a private (_DD_*) env var.Additional Notes
process_vm_readvin favour ofsafe_memcpy#17757changelog/no-changeloglabel applied because the env var is private and we are deleting the user-facing release note from feat(profiling): phase outprocess_vm_readvin favour ofsafe_memcpy#17757 (which announced a default flip that is no longer landing).