Skip to content

revert(profiling): revert _DD_PROFILING_STACK_FAST_COPY default to false#17929

Merged
brettlangdon merged 2 commits into
mainfrom
taegyunkim/prof-14568-revert-stack-fast-copy
May 7, 2026
Merged

revert(profiling): revert _DD_PROFILING_STACK_FAST_COPY default to false#17929
brettlangdon merged 2 commits into
mainfrom
taegyunkim/prof-14568-revert-stack-fast-copy

Conversation

@taegyunkim
Copy link
Copy Markdown
Contributor

@taegyunkim taegyunkim commented May 6, 2026

Description

Reverts the default of _DD_PROFILING_STACK_FAST_COPY from true (set in #17757) back to false while we investigate the profiler-shutdown segfault tracked in PROF-14568 (Slack #incident-53849, channel C0B1H299S4R).

Reproduces 5/8 natively on x86_64 Linux Python 3.11.13 with default safe_memcpy. With pytest -p no:faulthandler it drops to 0/8, which combined with the post-mortem core analysis pinpoints the actual mechanism: safe_memcpy's sigsetjmp/siglongjmp recovery is incompatible with pytest's built-in faulthandler plugin owning the SIGSEGV sigaction. Switching the default to process_vm_readv (a kernel syscall that returns -1/EFAULT cleanly on bad source) sidesteps the SIGSEGV path entirely and is the fastest unblock for the serverless team's release. A proper fix for the underlying profiler-shutdown race (sampler walking frames during scheduler.flush()-driven imports) is tracked separately in PROF-14568.

Concrete changes:

  • ddtrace/internal/settings/profiling.pyfast_copy default TrueFalse. Users opting in via _DD_PROFILING_STACK_FAST_COPY=1 are unaffected.
  • riotfile.py — flipped the four dedicated profile-uwsgi venvs from _DD_PROFILING_STACK_FAST_COPY=0 to =1 (and updated the comment) so the non-default path is still exercised in riot.
  • releasenotes/notes/profiling-phase-out-process-vm-readv-97af2e74953bb9e9.yaml — dropped (the release note from feat(profiling): phase out process_vm_readv in favour of safe_memcpy #17757 announced a default flip that we are no longer landing).

Findings since the original PLAN.md write-up

The crash decomposes into three layers, only the first of which this PR fixes:

  1. dd-trace-py: safe_memcpy's recovery is fragile when another component owns SIGSEGV. pytest's built-in faulthandler plugin runs faulthandler.enable() in pytest_configure, installing faulthandler_fatal_error as the SIGSEGV sigaction. Our init_segv_catcher is wrapped in call_once (sampler.cpp:172) and never re-installs after another component overwrites it. Post-mortem core confirms: t_handler_armed = 1 on the crashing thread (we did arm), g_old_segv.sa_handler = faulthandler_fatal_error (we saved pytest's handler when ours got installed), and the call stack shows the kernel called faulthandler_fatal_error, not our segv_handler. pytest -p no:faulthandler eliminates the crash 0/8.
  2. dd-trace-py: profiler shutdown invokes Python imports while sampler is live. Profiler._stop_service keeps collectors alive across scheduler.flush() "for snapshot"; flush triggers code_provenance.get_code_provenance_file()_package_for_root_module_mapping() → cold imports of setuptools/_distutils_hack/packaging.*. Sampler races, reads stale PyCodeObject*, faults. With layer 1 fixed (via this PR using process_vm_readv) this becomes a dropped sample instead of a crash, but should be properly fixed in a PROF-14568 follow-up.
  3. datadog-lambda-python: tests/test_api.py:175 calls os.environ.clear() without restoring env. Strips DD_PROFILING_STACK_FAST_COPY for the rest of the pytest session, so any subsequent Profiler.start() re-reads config.stack.fast_copy which defaults to True and re-flips safe_copy back to safe_memcpy_wrapper. Confirmed via gdb breakpoint on set_fast_copy_enabled showing arg=0 (env present) then arg=1 (env cleared) when test_api runs before test_wrapper. To be filed as a separate datadog-lambda-python issue.

This PR sidesteps layer 1 universally by defaulting to process_vm_readv. With layer 1 gone, layer 3 becomes harmless to the profiler regardless of test pollution.

Testing

  • hatch run lint:fmt and hatch run lint:riot pass.
  • The C++ runtime path is unchanged; only the Python config default and the riot env values move. The C++ static-init opt-out at ddtrace/internal/datadog/profiling/stack/src/echion/vm.cc:6-16 already treated unset/truthy as fast-copy-enabled, so behavior with the env var unset is now: Python config = Falseset_fast_copy_enabled(false)safe_copy = process_vm_readv (Linux) or mach_vm_read_overwrite (Darwin).
  • Existing tests/profiling/collector/test_copy_memory_stats.py covers both branches explicitly via env var.
  • Reproduction validation on workspace-tg: pre-fix safe_memcpy default = 5/8 crashes; with this PR's default flipped to process_vm_readv (or equivalently pytest -p no:faulthandler on the old default) = 0/8.

Risks

Additional Notes

Revert the default flip from #17757 while we investigate the profiler
shutdown segfault tracked in PROF-14568 (incident-53849). The fast-copy
path can still be opted into via _DD_PROFILING_STACK_FAST_COPY=true.

Refs: PROF-14568
@taegyunkim taegyunkim added the changelog/no-changelog A changelog entry is not required for this PR. label May 6, 2026
@cit-pr-commenter-54b7da
Copy link
Copy Markdown

Codeowners resolved as

ddtrace/internal/settings/profiling.py                                  @DataDog/profiling-python
riotfile.py                                                             @DataDog/apm-python

@taegyunkim taegyunkim changed the title fix(profiling): revert _DD_PROFILING_STACK_FAST_COPY default to false revert(profiling): revert _DD_PROFILING_STACK_FAST_COPY default to false May 6, 2026
@taegyunkim taegyunkim added the Profiling Continous Profling label May 6, 2026
@datadog-prod-us1-6

This comment has been minimized.

@pr-commenter
Copy link
Copy Markdown

pr-commenter Bot commented May 6, 2026

Benchmarks

Benchmark execution time: 2026-05-07 17:26:50

Comparing candidate commit 339c95e in PR branch taegyunkim/prof-14568-revert-stack-fast-copy with baseline commit e3bdccd in branch main.

Found 0 performance improvements and 6 performance regressions! Performance is the same for 587 metrics, 4 unstable metrics.

scenario:iastaspects-lower_aspect

  • 🟥 execution_time [+55.717µs; +61.185µs] or [+18.817%; +20.663%]

scenario:iastaspects-stringio_noaspect

  • 🟥 execution_time [+37.368µs; +44.435µs] or [+10.477%; +12.458%]

scenario:iastaspectsospath-ospathbasename_aspect

  • 🟥 execution_time [+86.009µs; +92.376µs] or [+20.181%; +21.675%]

scenario:iastaspectssplit-rsplit_aspect

  • 🟥 execution_time [+14.467µs; +20.041µs] or [+9.339%; +12.937%]

scenario:span-start

  • 🟥 execution_time [+1.219ms; +1.378ms] or [+7.850%; +8.876%]

scenario:telemetryaddmetric-1-count-metric-1-times

  • 🟥 execution_time [+153.930ns; +202.844ns] or [+7.284%; +9.598%]

@taegyunkim taegyunkim marked this pull request as ready for review May 6, 2026 20:22
@taegyunkim taegyunkim requested review from a team as code owners May 6, 2026 20:22
Copy link
Copy Markdown
Collaborator

@wantsui wantsui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving, just note the concern with the release note.

@taegyunkim taegyunkim requested a review from r1viollet May 6, 2026 20:26
Copy link
Copy Markdown
Contributor

@vlad-scherbich vlad-scherbich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Image

@brettlangdon brettlangdon merged commit 7834861 into main May 7, 2026
1165 checks passed
@brettlangdon brettlangdon deleted the taegyunkim/prof-14568-revert-stack-fast-copy branch May 7, 2026 17:36
dubloom pushed a commit that referenced this pull request May 11, 2026
…lse (#17929)

## Description

Reverts the default of `_DD_PROFILING_STACK_FAST_COPY` from `true` (set
in #17757) back to `false` while we investigate the profiler-shutdown
segfault tracked in
[PROF-14568](https://datadoghq.atlassian.net/browse/PROF-14568) (Slack
`#incident-53849`, channel `C0B1H299S4R`).

Reproduces 5/8 natively on x86_64 Linux Python 3.11.13 with default
`safe_memcpy`. With `pytest -p no:faulthandler` it drops to **0/8**,
which combined with the post-mortem core analysis pinpoints the actual
mechanism: `safe_memcpy`'s `sigsetjmp/siglongjmp` recovery is
incompatible with pytest's built-in faulthandler plugin owning the
SIGSEGV `sigaction`. Switching the default to `process_vm_readv` (a
kernel syscall that returns `-1/EFAULT` cleanly on bad source) sidesteps
the SIGSEGV path entirely and is the fastest unblock for the serverless
team's release. A proper fix for the underlying profiler-shutdown race
(sampler walking frames during `scheduler.flush()`-driven imports) is
tracked separately in PROF-14568.

Concrete changes:
- `ddtrace/internal/settings/profiling.py` — `fast_copy` default `True`
→ `False`. Users opting in via `_DD_PROFILING_STACK_FAST_COPY=1` are
unaffected.
- `riotfile.py` — flipped the four dedicated profile-uwsgi venvs from
`_DD_PROFILING_STACK_FAST_COPY=0` to `=1` (and updated the comment) so
the non-default path is still exercised in riot.
-
`releasenotes/notes/profiling-phase-out-process-vm-readv-97af2e74953bb9e9.yaml`
— dropped (the release note from #17757 announced a default flip that we
are no longer landing).

## Findings since the original PLAN.md write-up

The crash decomposes into three layers, only the first of which this PR
fixes:

1. **dd-trace-py: `safe_memcpy`'s recovery is fragile when another
component owns SIGSEGV.** pytest's built-in faulthandler plugin runs
`faulthandler.enable()` in `pytest_configure`, installing
`faulthandler_fatal_error` as the SIGSEGV `sigaction`. Our
`init_segv_catcher` is wrapped in `call_once` (`sampler.cpp:172`) and
never re-installs after another component overwrites it. Post-mortem
core confirms: `t_handler_armed = 1` on the crashing thread (we *did*
arm), `g_old_segv.sa_handler = faulthandler_fatal_error` (we saved
pytest's handler when ours got installed), and the call stack shows the
kernel called `faulthandler_fatal_error`, not our `segv_handler`.
`pytest -p no:faulthandler` eliminates the crash 0/8.
2. **dd-trace-py: profiler shutdown invokes Python imports while sampler
is live.** `Profiler._stop_service` keeps collectors alive across
`scheduler.flush()` "for snapshot"; flush triggers
`code_provenance.get_code_provenance_file()` →
`_package_for_root_module_mapping()` → cold imports of
`setuptools`/`_distutils_hack`/`packaging.*`. Sampler races, reads stale
`PyCodeObject*`, faults. With layer 1 fixed (via this PR using
`process_vm_readv`) this becomes a dropped sample instead of a crash,
but should be properly fixed in a PROF-14568 follow-up.
3. **datadog-lambda-python: `tests/test_api.py:175` calls
`os.environ.clear()` without restoring env.** Strips
`DD_PROFILING_STACK_FAST_COPY` for the rest of the pytest session, so
any subsequent `Profiler.start()` re-reads `config.stack.fast_copy`
which defaults to `True` and re-flips `safe_copy` back to
`safe_memcpy_wrapper`. Confirmed via gdb breakpoint on
`set_fast_copy_enabled` showing `arg=0` (env present) then `arg=1` (env
cleared) when test_api runs before test_wrapper. To be filed as a
separate `datadog-lambda-python` issue.

This PR sidesteps layer 1 universally by defaulting to
`process_vm_readv`. With layer 1 gone, layer 3 becomes harmless to the
profiler regardless of test pollution.

## Testing

- `hatch run lint:fmt` and `hatch run lint:riot` pass.
- The C++ runtime path is unchanged; only the Python config default and
the riot env values move. The C++ static-init opt-out at
`ddtrace/internal/datadog/profiling/stack/src/echion/vm.cc:6-16` already
treated unset/truthy as fast-copy-enabled, so behavior with the env var
unset is now: Python config = `False` → `set_fast_copy_enabled(false)` →
`safe_copy = process_vm_readv` (Linux) or `mach_vm_read_overwrite`
(Darwin).
- Existing `tests/profiling/collector/test_copy_memory_stats.py` covers
both branches explicitly via env var.
- Reproduction validation on workspace-tg: pre-fix `safe_memcpy` default
= 5/8 crashes; with this PR's default flipped to `process_vm_readv` (or
equivalently `pytest -p no:faulthandler` on the old default) = 0/8.

## Risks

- **Hardened-kernel cohort.** With `VmReader` removed in #17755, the
only non-fast-copy reader is `process_vm_readv`. On Linux systems where
`process_vm_readv` is unavailable (e.g. `kernel.yama.ptrace_scope=3`,
certain seccomp/sandboxed containers), `set_fast_copy_enabled(false)`
will set `failed_safe_copy = true` and disable stack profiling, even
though `safe_memcpy` would have worked. This reproduces the pre-#17757
behavior exactly, so it does not regress any user who was already
running on a release without #17757. Users who upgraded to a release
containing #17757 and were silently relying on `safe_memcpy` in such an
environment will lose stack profiling until they set
`_DD_PROFILING_STACK_FAST_COPY=1` explicitly. Accepted as a trade-off
vs. the shutdown-crash risk.
- **No public API change.** `_DD_PROFILING_STACK_FAST_COPY` is a private
(`_DD_*`) env var.

## Additional Notes

- Tracking:
[PROF-14568](https://datadoghq.atlassian.net/browse/PROF-14568)
- Original culprit PR: #17757
- Related upstream test PR: #17872
- `changelog/no-changelog` label applied because the env var is private
and we are deleting the user-facing release note from #17757 (which
announced a default flip that is no longer landing).


[PROF-14568]:
https://datadoghq.atlassian.net/browse/PROF-14568?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[PROF-14568]:
https://datadoghq.atlassian.net/browse/PROF-14568?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Co-authored-by: Brett Langdon <brett.langdon@datadoghq.com>
@KowalskiThomas
Copy link
Copy Markdown
Contributor

Thanks for taking care of that, I'll work on the follow-ups.

P403n1x87 pushed a commit that referenced this pull request May 14, 2026
…lse (#17929)

## Description

Reverts the default of `_DD_PROFILING_STACK_FAST_COPY` from `true` (set
in #17757) back to `false` while we investigate the profiler-shutdown
segfault tracked in
[PROF-14568](https://datadoghq.atlassian.net/browse/PROF-14568) (Slack
`#incident-53849`, channel `C0B1H299S4R`).

Reproduces 5/8 natively on x86_64 Linux Python 3.11.13 with default
`safe_memcpy`. With `pytest -p no:faulthandler` it drops to **0/8**,
which combined with the post-mortem core analysis pinpoints the actual
mechanism: `safe_memcpy`'s `sigsetjmp/siglongjmp` recovery is
incompatible with pytest's built-in faulthandler plugin owning the
SIGSEGV `sigaction`. Switching the default to `process_vm_readv` (a
kernel syscall that returns `-1/EFAULT` cleanly on bad source) sidesteps
the SIGSEGV path entirely and is the fastest unblock for the serverless
team's release. A proper fix for the underlying profiler-shutdown race
(sampler walking frames during `scheduler.flush()`-driven imports) is
tracked separately in PROF-14568.

Concrete changes:
- `ddtrace/internal/settings/profiling.py` — `fast_copy` default `True`
→ `False`. Users opting in via `_DD_PROFILING_STACK_FAST_COPY=1` are
unaffected.
- `riotfile.py` — flipped the four dedicated profile-uwsgi venvs from
`_DD_PROFILING_STACK_FAST_COPY=0` to `=1` (and updated the comment) so
the non-default path is still exercised in riot.
-
`releasenotes/notes/profiling-phase-out-process-vm-readv-97af2e74953bb9e9.yaml`
— dropped (the release note from #17757 announced a default flip that we
are no longer landing).

## Findings since the original PLAN.md write-up

The crash decomposes into three layers, only the first of which this PR
fixes:

1. **dd-trace-py: `safe_memcpy`'s recovery is fragile when another
component owns SIGSEGV.** pytest's built-in faulthandler plugin runs
`faulthandler.enable()` in `pytest_configure`, installing
`faulthandler_fatal_error` as the SIGSEGV `sigaction`. Our
`init_segv_catcher` is wrapped in `call_once` (`sampler.cpp:172`) and
never re-installs after another component overwrites it. Post-mortem
core confirms: `t_handler_armed = 1` on the crashing thread (we *did*
arm), `g_old_segv.sa_handler = faulthandler_fatal_error` (we saved
pytest's handler when ours got installed), and the call stack shows the
kernel called `faulthandler_fatal_error`, not our `segv_handler`.
`pytest -p no:faulthandler` eliminates the crash 0/8.
2. **dd-trace-py: profiler shutdown invokes Python imports while sampler
is live.** `Profiler._stop_service` keeps collectors alive across
`scheduler.flush()` "for snapshot"; flush triggers
`code_provenance.get_code_provenance_file()` →
`_package_for_root_module_mapping()` → cold imports of
`setuptools`/`_distutils_hack`/`packaging.*`. Sampler races, reads stale
`PyCodeObject*`, faults. With layer 1 fixed (via this PR using
`process_vm_readv`) this becomes a dropped sample instead of a crash,
but should be properly fixed in a PROF-14568 follow-up.
3. **datadog-lambda-python: `tests/test_api.py:175` calls
`os.environ.clear()` without restoring env.** Strips
`DD_PROFILING_STACK_FAST_COPY` for the rest of the pytest session, so
any subsequent `Profiler.start()` re-reads `config.stack.fast_copy`
which defaults to `True` and re-flips `safe_copy` back to
`safe_memcpy_wrapper`. Confirmed via gdb breakpoint on
`set_fast_copy_enabled` showing `arg=0` (env present) then `arg=1` (env
cleared) when test_api runs before test_wrapper. To be filed as a
separate `datadog-lambda-python` issue.

This PR sidesteps layer 1 universally by defaulting to
`process_vm_readv`. With layer 1 gone, layer 3 becomes harmless to the
profiler regardless of test pollution.

## Testing

- `hatch run lint:fmt` and `hatch run lint:riot` pass.
- The C++ runtime path is unchanged; only the Python config default and
the riot env values move. The C++ static-init opt-out at
`ddtrace/internal/datadog/profiling/stack/src/echion/vm.cc:6-16` already
treated unset/truthy as fast-copy-enabled, so behavior with the env var
unset is now: Python config = `False` → `set_fast_copy_enabled(false)` →
`safe_copy = process_vm_readv` (Linux) or `mach_vm_read_overwrite`
(Darwin).
- Existing `tests/profiling/collector/test_copy_memory_stats.py` covers
both branches explicitly via env var.
- Reproduction validation on workspace-tg: pre-fix `safe_memcpy` default
= 5/8 crashes; with this PR's default flipped to `process_vm_readv` (or
equivalently `pytest -p no:faulthandler` on the old default) = 0/8.

## Risks

- **Hardened-kernel cohort.** With `VmReader` removed in #17755, the
only non-fast-copy reader is `process_vm_readv`. On Linux systems where
`process_vm_readv` is unavailable (e.g. `kernel.yama.ptrace_scope=3`,
certain seccomp/sandboxed containers), `set_fast_copy_enabled(false)`
will set `failed_safe_copy = true` and disable stack profiling, even
though `safe_memcpy` would have worked. This reproduces the pre-#17757
behavior exactly, so it does not regress any user who was already
running on a release without #17757. Users who upgraded to a release
containing #17757 and were silently relying on `safe_memcpy` in such an
environment will lose stack profiling until they set
`_DD_PROFILING_STACK_FAST_COPY=1` explicitly. Accepted as a trade-off
vs. the shutdown-crash risk.
- **No public API change.** `_DD_PROFILING_STACK_FAST_COPY` is a private
(`_DD_*`) env var.

## Additional Notes

- Tracking:
[PROF-14568](https://datadoghq.atlassian.net/browse/PROF-14568)
- Original culprit PR: #17757
- Related upstream test PR: #17872
- `changelog/no-changelog` label applied because the env var is private
and we are deleting the user-facing release note from #17757 (which
announced a default flip that is no longer landing).


[PROF-14568]:
https://datadoghq.atlassian.net/browse/PROF-14568?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[PROF-14568]:
https://datadoghq.atlassian.net/browse/PROF-14568?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Co-authored-by: Brett Langdon <brett.langdon@datadoghq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/no-changelog A changelog entry is not required for this PR. Profiling Continous Profling

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants