fix(profiling): detect when faulthandler is enabled by KowalskiThomas · Pull Request #17998 · DataDog/dd-trace-py

KowalskiThomas · 2026-05-11T09:19:47Z

What is this PR?

https://datadoghq.atlassian.net/browse/PROF-14600

This PR is a follow-up from #17929 where we had to remove the default enablement of "fast copy_memory" (guarded memcpy) because we realised it didn't play well with Python's own faulthandler module in certain cases (despite us having put guardrails for it to work...)

The problem essentially was that Python's faulthandler would be installed after ours (at least in some cases, e.g. pytest), so it would be notified/called first when a segmentation fault happened, and would cease execution. We never had a chance to say it was fine and we should continue, because Python's fault handler stops execution (it doesn't raise / relay to the next fault handler).

The only way to work around this is to make sure our fault handler is installed when we call memcpy. Doing this every time we take a sample would be costly (it is one system call) and racy (Python's fault handler could be installed after we check but before we memcpy); checking every time we do a memcpy is simply prohibitive and would make using memcpy irrelevant.

The proposed solution is to patch faulthandler's enable function to immediately reinstall our fault handler when it is called, after Python has installed its fault handler. Note that this does not break any existing functionality:

When a fault happens because of our memcpy call, we catch it and continue as if nothing happened
When a fault happens for some other reason, our boolean flag (indicating whether we are currently making a risky copy) is false, so our fault handler calls the next fault handler in the chain, which is... Python's. So it still works.

Note that just doing this would not be enough, because we would introduce a cycle in fault handling (ours; then faulthandler's; then ours again; then... faulthandler -- we only have one slot to keep track of fault handlers).
There is no simple way around this. The simplest thing to do is to uninstall our fault handler before installing faulthandler's, and re-install it right after. This, however, races with the Sampling Thread as it may be calling into safe_memcpy while we do the uninstall/re-install dance, meaning we could hard crash. Not good.
To work around this second problem, what I did is that I added a new API to pause and resume (not stop and start, it's much more lightweight) the Sampling Thread when we want to install the new fault handler. That way, races are eliminated. This, though, comes at the cost of delaying the installation of the new fault handler by the time it takes to finish capturing the current Sample (should be a few milliseconds at most, so I think it's OK).

One central question we need to answer is: "is that enough?" -- in other words, are there any other Python packages that install signal handlers and that we need to worry about? The short answer, based on my research, is no. The most notable packages that install signal handlers and that are relevant to us are gunicorn / uvicorn / celery / uwsgi and they all don't touch SIGSEGV / SIGBUS. So I think we're good with the current fix.

Testing

I was able to reproduce the original issues (in datadog-lambda-python) and to confirm that the changes from this branch consistently make the crash go away.

cit-pr-commenter-54b7da · 2026-05-11T09:20:21Z

Codeowners resolved as

ddtrace/internal/datadog/profiling/stack/__init__.pyi                   @DataDog/profiling-python
ddtrace/internal/datadog/profiling/stack/echion/echion/danger.h         @DataDog/profiling-python
ddtrace/internal/datadog/profiling/stack/include/sampler.hpp            @DataDog/profiling-python
ddtrace/internal/datadog/profiling/stack/src/echion/danger.cc           @DataDog/profiling-python
ddtrace/internal/datadog/profiling/stack/src/sampler.cpp                @DataDog/profiling-python
ddtrace/internal/datadog/profiling/stack/src/stack.cpp                  @DataDog/profiling-python
ddtrace/profiling/_faulthandler.py                                      @DataDog/profiling-python
ddtrace/profiling/collector/stack.py                                    @DataDog/profiling-python
ddtrace/profiling/collector/threading.py                                @DataDog/profiling-python
releasenotes/notes/profiling-detect-faulthandler-6502b3486e16212e.yaml  @DataDog/apm-python
tests/profiling/test_faulthandler.py                                    @DataDog/profiling-python

datadog-datadog-prod-us1 · 2026-05-11T11:52:38Z

✨ Fix all issues with BitsAI or with Cursor

⚠️ Warnings

❄️ 1 New flaky test detected

test_faulthandler_enable_concurrent_threads from test_faulthandler.py

(Fix with Cursor)

Expected status 0, got 1.
=== Captured STDOUT ===
=== End of captured STDOUT ===
=== Captured STDERR ===
Traceback (most recent call last):
  File "tests/profiling/test_faulthandler.py", line 398, in <module>
    assert not still_alive, f"{len(still_alive)} thread(s) deadlocked"
AssertionError: 1 thread(s) deadlocked
=== End of captured STDERR ===

ℹ️ Info

No other issues found (see more)

🧪 All tests passed

Useful? React with 👍 / 👎

_{This comment will be updated automatically if new data arrives.

🔗 Commit SHA: ef59b90 | Docs | Datadog PR Page | Give us feedback!}

KowalskiThomas · 2026-05-11T15:10:47Z

@codex review

KowalskiThomas · 2026-05-13T07:24:32Z

@codex review this

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0622e501e4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

KowalskiThomas · 2026-05-13T07:59:34Z

@codex review again

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d27d1756a1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

brettlangdon

general approach and release note lgtm

vlad-scherbich

vlad-scherbich · 2026-05-18T19:26:39Z

@KowalskiThomas Since this code is not familiar to me, I've asked Claude for help after a manual review.

Here's the result:

Must address before merge:

#1 (Pause timeout fallthrough) -- If the 3-second timeout fires, the handler swap proceeds without protection. This is the same race condition the entire pause mechanism was built to prevent. A timeout under normal conditions should never happen (samples take milliseconds), but if it does, the process could crash. The fix is small: skip the swap on timeout, or at least only do reinstall_segv_handler() without the full uninstall dance.

Worth raising but can be a follow-up:

#2 (Concurrent enable() calls) -- Technically a race, but faulthandler.enable() is almost never called from multiple threads simultaneously. A threading.Lock is a one-liner fix, so it could go in this PR or a fast follow-up. Low real-world risk.

KowalskiThomas · 2026-05-19T08:23:23Z

@vlad-scherbich

Pause timeout fallthrough We now don't uninstall if we time out, but we still reinstall. There's pretty much nothing else we can do unfortunately, so it's best effort (and we don't really expect that to happen in real life).

Concurrent enable calls Added a lock as suggested :)

## What is this PR? https://datadoghq.atlassian.net/browse/PROF-14600 This PR is a follow-up from #17929 where we had to remove the default enablement of "fast `copy_memory`" (guarded `memcpy`) because we realised it didn't play well with Python's own `faulthandler` module in certain cases (despite us having put guardrails for it to work...) The problem essentially was that Python's `faulthandler` would be installed after ours (at least in some cases, e.g. `pytest`), so it would be notified/called _first_ when a segmentation fault happened, and would cease execution. We never had a chance to say it was fine and we should continue, because Python's fault handler stops execution (it doesn't raise / relay to the next fault handler). The only way to work around this is to make sure our fault handler is installed when we call `memcpy`. Doing this every time we take a sample would be costly (it is one system call) and racy (Python's fault handler could be installed after we check but before we `memcpy`); checking every time we do a `memcpy` is simply prohibitive and would make using `memcpy` irrelevant. The proposed solution is to patch `faulthandler`'s `enable` function to immediately reinstall our fault handler when it is called, after Python has installed its fault handler. Note that this does not break any existing functionality: * When a fault happens because of our `memcpy` call, we catch it and continue as if nothing happened * When a fault happens for some other reason, our boolean flag (indicating whether we are currently making a risky copy) is false, so our fault handler calls the next fault handler in the chain, which is... Python's. So it still works. Note that just doing this would not be enough, because we would introduce a cycle in fault handling (ours; then `faulthandler`'s; then ours again; then... `faulthandler` -- we only have one slot to keep track of fault handlers). There is no simple way around this. The simplest thing to do is to uninstall our fault handler before installing `faulthandler`'s, and re-install it right after. This, however, races with the Sampling Thread as it may be calling into `safe_memcpy` while we do the uninstall/re-install dance, meaning we could hard crash. Not good. To work around this second problem, what I did is that I added a new API to _pause_ and _resume_ (not _stop_ and _start_, it's much more lightweight) the Sampling Thread when we want to install the new fault handler. That way, races are eliminated. This, though, comes at the cost of delaying the installation of the new fault handler by the time it takes to finish capturing the current Sample (should be a few milliseconds at most, so I think it's OK). One central question we need to answer is: "is that enough?" -- in other words, are there any other Python packages that install signal handlers and that we need to worry about? The short answer, based on my research, is no. The most notable packages that install signal handlers and that are relevant to us are gunicorn / uvicorn / celery / uwsgi and they all don't touch `SIGSEGV` / `SIGBUS`. So I think we're good with the current fix. ## Testing I was able to reproduce the original issues (in `datadog-lambda-python`) and to confirm that the changes from this branch consistently make the crash go away. Co-authored-by: thomas.kowalski <thomas.kowalski@datadoghq.com>

## Description As reported by the [Codex Security Scan](https://chatgpt.com/codex/cloud/security/findings/535de99d265c8191b2bb3d8639b2f523?repo=https%3A%2F%2Fgithub.com%2FDataDog%2Fdd-trace-py&sev=low), follow-up to #17998 to also support disabling Python's fault handler. Co-authored-by: thomas.kowalski <thomas.kowalski@datadoghq.com>

KowalskiThomas force-pushed the kowalski/fix-profiling-detect-when-faulthandler-is-enabled branch 2 times, most recently from 6bfa8b9 to 9f7846e Compare May 11, 2026 11:43

KowalskiThomas force-pushed the kowalski/fix-profiling-detect-when-faulthandler-is-enabled branch from 9f7846e to 99763a9 Compare May 11, 2026 12:14

KowalskiThomas changed the title ~~fix(profiling): detect when faulthandler is enabled~~ fix(profiling): detect when faulthandler is enabled May 11, 2026

KowalskiThomas added the Profiling Continous Profling label May 11, 2026

KowalskiThomas force-pushed the kowalski/fix-profiling-detect-when-faulthandler-is-enabled branch 2 times, most recently from cd9725e to 0622e50 Compare May 11, 2026 13:04

chatgpt-codex-connector Bot reviewed May 13, 2026

View reviewed changes

Comment thread ddtrace/profiling/_faulthandler.py Outdated

KowalskiThomas changed the title ~~fix(profiling): detect when faulthandler is enabled~~ chore: try to reproduce May 13, 2026

KowalskiThomas changed the title ~~chore: try to reproduce~~ fix(profiling): detect when faulthandler is enabled May 13, 2026

KowalskiThomas force-pushed the kowalski/fix-profiling-detect-when-faulthandler-is-enabled branch from b1829ac to d27d175 Compare May 18, 2026 08:18

KowalskiThomas marked this pull request as ready for review May 18, 2026 08:19

KowalskiThomas requested review from a team as code owners May 18, 2026 08:19

KowalskiThomas requested review from emmettbutler and vlad-scherbich May 18, 2026 08:19

chatgpt-codex-connector Bot reviewed May 18, 2026

View reviewed changes

Comment thread ddtrace/profiling/_faulthandler.py Outdated

KowalskiThomas added 3 commits May 18, 2026 11:01

fix(profiling): detect when faulthandler is enabled

973f8e0

misc: add release notes

f73c338

refactor: guard against cyclic fault handling

8803c65

KowalskiThomas force-pushed the kowalski/fix-profiling-detect-when-faulthandler-is-enabled branch from d27d175 to a97ab37 Compare May 18, 2026 09:03

refactor: avoid race condition in sampler

e5620b5

KowalskiThomas force-pushed the kowalski/fix-profiling-detect-when-faulthandler-is-enabled branch from a97ab37 to e5620b5 Compare May 18, 2026 09:10

test: add tests

709d1e8

KowalskiThomas force-pushed the kowalski/fix-profiling-detect-when-faulthandler-is-enabled branch from 7ee05c1 to 709d1e8 Compare May 18, 2026 13:09

brettlangdon approved these changes May 18, 2026

View reviewed changes

vlad-scherbich approved these changes May 18, 2026

View reviewed changes

Comment thread ddtrace/internal/datadog/profiling/stack/__init__.pyi Outdated

Comment thread ddtrace/profiling/collector/stack.py

Comment thread ddtrace/profiling/collector/threading.py

Comment thread ddtrace/profiling/_faulthandler.py Outdated

test: better testing

39aabbc

KowalskiThomas force-pushed the kowalski/fix-profiling-detect-when-faulthandler-is-enabled branch from e90deb5 to 39aabbc Compare May 18, 2026 22:45

KowalskiThomas force-pushed the kowalski/fix-profiling-detect-when-faulthandler-is-enabled branch from 858bfb4 to 3f49fe4 Compare May 19, 2026 08:27

review: address Vlad's comments

ef59b90

KowalskiThomas force-pushed the kowalski/fix-profiling-detect-when-faulthandler-is-enabled branch from 3f49fe4 to ef59b90 Compare May 19, 2026 09:09

gh-worker-dd-mergequeue-cf854d Bot merged commit ddab0a7 into main May 19, 2026
414 checks passed

gh-worker-dd-mergequeue-cf854d Bot deleted the kowalski/fix-profiling-detect-when-faulthandler-is-enabled branch May 19, 2026 10:22

KowalskiThomas mentioned this pull request May 19, 2026

chore(profiling): also wrap faulthandler.disable #18167

Merged

Conversation

KowalskiThomas commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What is this PR?

Testing

Uh oh!

cit-pr-commenter-54b7da Bot commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codeowners resolved as

Uh oh!

datadog-datadog-prod-us1 Bot commented May 11, 2026 • edited by datadog-prod-us1-6 Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚠️ Warnings

ℹ️ Info

Uh oh!

KowalskiThomas commented May 11, 2026

Uh oh!

KowalskiThomas commented May 13, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

KowalskiThomas commented May 13, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

brettlangdon left a comment

Choose a reason for hiding this comment

Uh oh!

vlad-scherbich left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

vlad-scherbich commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

KowalskiThomas commented May 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

KowalskiThomas commented May 11, 2026 •

edited

Loading

cit-pr-commenter-54b7da Bot commented May 11, 2026 •

edited

Loading

datadog-datadog-prod-us1 Bot commented May 11, 2026 •

edited by datadog-prod-us1-6 Bot

Loading

vlad-scherbich commented May 18, 2026 •

edited

Loading