Skip to content

fix(profiling): detect when faulthandler is enabled#17998

Merged
gh-worker-dd-mergequeue-cf854d[bot] merged 7 commits into
mainfrom
kowalski/fix-profiling-detect-when-faulthandler-is-enabled
May 19, 2026
Merged

fix(profiling): detect when faulthandler is enabled#17998
gh-worker-dd-mergequeue-cf854d[bot] merged 7 commits into
mainfrom
kowalski/fix-profiling-detect-when-faulthandler-is-enabled

Conversation

@KowalskiThomas
Copy link
Copy Markdown
Contributor

@KowalskiThomas KowalskiThomas commented May 11, 2026

What is this PR?

https://datadoghq.atlassian.net/browse/PROF-14600

This PR is a follow-up from #17929 where we had to remove the default enablement of "fast copy_memory" (guarded memcpy) because we realised it didn't play well with Python's own faulthandler module in certain cases (despite us having put guardrails for it to work...)

The problem essentially was that Python's faulthandler would be installed after ours (at least in some cases, e.g. pytest), so it would be notified/called first when a segmentation fault happened, and would cease execution. We never had a chance to say it was fine and we should continue, because Python's fault handler stops execution (it doesn't raise / relay to the next fault handler).

The only way to work around this is to make sure our fault handler is installed when we call memcpy. Doing this every time we take a sample would be costly (it is one system call) and racy (Python's fault handler could be installed after we check but before we memcpy); checking every time we do a memcpy is simply prohibitive and would make using memcpy irrelevant.

The proposed solution is to patch faulthandler's enable function to immediately reinstall our fault handler when it is called, after Python has installed its fault handler. Note that this does not break any existing functionality:

  • When a fault happens because of our memcpy call, we catch it and continue as if nothing happened
  • When a fault happens for some other reason, our boolean flag (indicating whether we are currently making a risky copy) is false, so our fault handler calls the next fault handler in the chain, which is... Python's. So it still works.

Note that just doing this would not be enough, because we would introduce a cycle in fault handling (ours; then faulthandler's; then ours again; then... faulthandler -- we only have one slot to keep track of fault handlers).
There is no simple way around this. The simplest thing to do is to uninstall our fault handler before installing faulthandler's, and re-install it right after. This, however, races with the Sampling Thread as it may be calling into safe_memcpy while we do the uninstall/re-install dance, meaning we could hard crash. Not good.
To work around this second problem, what I did is that I added a new API to pause and resume (not stop and start, it's much more lightweight) the Sampling Thread when we want to install the new fault handler. That way, races are eliminated. This, though, comes at the cost of delaying the installation of the new fault handler by the time it takes to finish capturing the current Sample (should be a few milliseconds at most, so I think it's OK).

One central question we need to answer is: "is that enough?" -- in other words, are there any other Python packages that install signal handlers and that we need to worry about? The short answer, based on my research, is no. The most notable packages that install signal handlers and that are relevant to us are gunicorn / uvicorn / celery / uwsgi and they all don't touch SIGSEGV / SIGBUS. So I think we're good with the current fix.

Testing

I was able to reproduce the original issues (in datadog-lambda-python) and to confirm that the changes from this branch consistently make the crash go away.

@cit-pr-commenter-54b7da
Copy link
Copy Markdown

cit-pr-commenter-54b7da Bot commented May 11, 2026

Codeowners resolved as

ddtrace/internal/datadog/profiling/stack/__init__.pyi                   @DataDog/profiling-python
ddtrace/internal/datadog/profiling/stack/echion/echion/danger.h         @DataDog/profiling-python
ddtrace/internal/datadog/profiling/stack/include/sampler.hpp            @DataDog/profiling-python
ddtrace/internal/datadog/profiling/stack/src/echion/danger.cc           @DataDog/profiling-python
ddtrace/internal/datadog/profiling/stack/src/sampler.cpp                @DataDog/profiling-python
ddtrace/internal/datadog/profiling/stack/src/stack.cpp                  @DataDog/profiling-python
ddtrace/profiling/_faulthandler.py                                      @DataDog/profiling-python
ddtrace/profiling/collector/stack.py                                    @DataDog/profiling-python
ddtrace/profiling/collector/threading.py                                @DataDog/profiling-python
releasenotes/notes/profiling-detect-faulthandler-6502b3486e16212e.yaml  @DataDog/apm-python
tests/profiling/test_faulthandler.py                                    @DataDog/profiling-python

@KowalskiThomas KowalskiThomas force-pushed the kowalski/fix-profiling-detect-when-faulthandler-is-enabled branch 2 times, most recently from 6bfa8b9 to 9f7846e Compare May 11, 2026 11:43
@datadog-datadog-prod-us1
Copy link
Copy Markdown
Contributor

datadog-datadog-prod-us1 Bot commented May 11, 2026

Tests

Fix all issues with BitsAI or with Cursor

⚠️ Warnings

❄️ 1 New flaky test detected

test_faulthandler_enable_concurrent_threads from test_faulthandler.py   View in Datadog   (Fix with Cursor)
Expected status 0, got 1.
=== Captured STDOUT ===
=== End of captured STDOUT ===
=== Captured STDERR ===
Traceback (most recent call last):
  File "tests/profiling/test_faulthandler.py", line 398, in <module>
    assert not still_alive, f"{len(still_alive)} thread(s) deadlocked"
AssertionError: 1 thread(s) deadlocked
=== End of captured STDERR ===

ℹ️ Info

No other issues found (see more)

🧪 All tests passed

Useful? React with 👍 / 👎

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: ef59b90 | Docs | Datadog PR Page | Give us feedback!

@KowalskiThomas KowalskiThomas force-pushed the kowalski/fix-profiling-detect-when-faulthandler-is-enabled branch from 9f7846e to 99763a9 Compare May 11, 2026 12:14
@KowalskiThomas KowalskiThomas changed the title fix(profiling): detect when faulthandler is enabled fix(profiling): detect when faulthandler is enabled May 11, 2026
@KowalskiThomas KowalskiThomas added the Profiling Continous Profling label May 11, 2026
@KowalskiThomas KowalskiThomas force-pushed the kowalski/fix-profiling-detect-when-faulthandler-is-enabled branch 2 times, most recently from cd9725e to 0622e50 Compare May 11, 2026 13:04
@KowalskiThomas
Copy link
Copy Markdown
Contributor Author

@codex review

@KowalskiThomas
Copy link
Copy Markdown
Contributor Author

@codex review this

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0622e501e4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread ddtrace/profiling/_faulthandler.py Outdated
@KowalskiThomas
Copy link
Copy Markdown
Contributor Author

@codex review again

@KowalskiThomas KowalskiThomas changed the title fix(profiling): detect when faulthandler is enabled chore: try to reproduce May 13, 2026
@KowalskiThomas KowalskiThomas changed the title chore: try to reproduce fix(profiling): detect when faulthandler is enabled May 13, 2026
@KowalskiThomas KowalskiThomas force-pushed the kowalski/fix-profiling-detect-when-faulthandler-is-enabled branch from b1829ac to d27d175 Compare May 18, 2026 08:18
@KowalskiThomas KowalskiThomas marked this pull request as ready for review May 18, 2026 08:19
@KowalskiThomas KowalskiThomas requested review from a team as code owners May 18, 2026 08:19
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d27d1756a1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread ddtrace/profiling/_faulthandler.py Outdated
@KowalskiThomas KowalskiThomas force-pushed the kowalski/fix-profiling-detect-when-faulthandler-is-enabled branch from d27d175 to a97ab37 Compare May 18, 2026 09:03
@KowalskiThomas KowalskiThomas force-pushed the kowalski/fix-profiling-detect-when-faulthandler-is-enabled branch from a97ab37 to e5620b5 Compare May 18, 2026 09:10
@KowalskiThomas KowalskiThomas force-pushed the kowalski/fix-profiling-detect-when-faulthandler-is-enabled branch from 7ee05c1 to 709d1e8 Compare May 18, 2026 13:09
Copy link
Copy Markdown
Member

@brettlangdon brettlangdon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

general approach and release note lgtm

Copy link
Copy Markdown
Contributor

@vlad-scherbich vlad-scherbich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Image

Comment thread ddtrace/internal/datadog/profiling/stack/__init__.pyi Outdated
Comment thread ddtrace/profiling/collector/stack.py
Comment thread ddtrace/profiling/collector/threading.py
Comment thread ddtrace/profiling/_faulthandler.py Outdated
@vlad-scherbich
Copy link
Copy Markdown
Contributor

vlad-scherbich commented May 18, 2026

@KowalskiThomas Since this code is not familiar to me, I've asked Claude for help after a manual review.

Here's the result:

Must address before merge:

#1 (Pause timeout fallthrough) -- If the 3-second timeout fires, the handler swap proceeds without protection. This is the same race condition the entire pause mechanism was built to prevent. A timeout under normal conditions should never happen (samples take milliseconds), but if it does, the process could crash. The fix is small: skip the swap on timeout, or at least only do reinstall_segv_handler() without the full uninstall dance.

Worth raising but can be a follow-up:

#2 (Concurrent enable() calls) -- Technically a race, but faulthandler.enable() is almost never called from multiple threads simultaneously. A threading.Lock is a one-liner fix, so it could go in this PR or a fast follow-up. Low real-world risk.

@KowalskiThomas KowalskiThomas force-pushed the kowalski/fix-profiling-detect-when-faulthandler-is-enabled branch from e90deb5 to 39aabbc Compare May 18, 2026 22:45
@KowalskiThomas
Copy link
Copy Markdown
Contributor Author

@vlad-scherbich

Pause timeout fallthrough We now don't uninstall if we time out, but we still reinstall. There's pretty much nothing else we can do unfortunately, so it's best effort (and we don't really expect that to happen in real life).

Concurrent enable calls Added a lock as suggested :)

@KowalskiThomas KowalskiThomas force-pushed the kowalski/fix-profiling-detect-when-faulthandler-is-enabled branch from 858bfb4 to 3f49fe4 Compare May 19, 2026 08:27
@KowalskiThomas KowalskiThomas force-pushed the kowalski/fix-profiling-detect-when-faulthandler-is-enabled branch from 3f49fe4 to ef59b90 Compare May 19, 2026 09:09
@gh-worker-dd-mergequeue-cf854d gh-worker-dd-mergequeue-cf854d Bot merged commit ddab0a7 into main May 19, 2026
414 checks passed
@gh-worker-dd-mergequeue-cf854d gh-worker-dd-mergequeue-cf854d Bot deleted the kowalski/fix-profiling-detect-when-faulthandler-is-enabled branch May 19, 2026 10:22
vlad-scherbich pushed a commit that referenced this pull request May 20, 2026
## What is this PR?

https://datadoghq.atlassian.net/browse/PROF-14600

This PR is a follow-up from #17929 where we had to remove the default enablement of "fast `copy_memory`" (guarded `memcpy`) because we realised it didn't play well with Python's own `faulthandler` module in certain cases (despite us having put guardrails for it to work...)

The problem essentially was that Python's `faulthandler` would be installed after ours (at least in some cases, e.g. `pytest`), so it would be notified/called _first_ when a segmentation fault happened, and would cease execution. We never had a chance to say it was fine and we should continue, because Python's fault handler stops execution (it doesn't raise / relay to the next fault handler).   

The only way to work around this is to make sure our fault handler is installed when we call `memcpy`. Doing this every time we take a sample would be costly (it is one system call) and racy (Python's fault handler could be installed after we check but before we `memcpy`); checking every time we do a `memcpy` is simply prohibitive and would make using `memcpy` irrelevant. 

The proposed solution is to patch `faulthandler`'s `enable` function to immediately reinstall our fault handler when it is called, after Python has installed its fault handler. Note that this does not break any existing functionality: 
* When a fault happens because of our `memcpy` call, we catch it and continue as if nothing happened 
* When a fault happens for some other reason, our boolean flag (indicating whether we are currently making a risky copy) is false, so our fault handler calls the next fault handler in the chain, which is... Python's. So it still works.

Note that just doing this would not be enough, because we would introduce a cycle in fault handling (ours; then `faulthandler`'s; then ours again; then... `faulthandler` -- we only have one slot to keep track of fault handlers).  
There is no simple way around this. The simplest thing to do is to uninstall our fault handler before installing `faulthandler`'s, and re-install it right after. This, however, races with the Sampling Thread as it may be calling into `safe_memcpy` while we do the uninstall/re-install dance, meaning we could hard crash. Not good.  
To work around this second problem, what I did is that I added a new API to _pause_ and _resume_ (not _stop_ and _start_, it's much more lightweight) the Sampling Thread when we want to install the new fault handler. That way, races are eliminated. This, though, comes at the cost of delaying the installation of the new fault handler by the time it takes to finish capturing the current Sample (should be a few milliseconds at most, so I think it's OK).

One central question we need to answer is: "is that enough?" -- in other words, are there any other Python packages that install signal handlers and that we need to worry about? The short answer, based on my research, is no.  The most notable packages that install signal handlers and that are relevant to us are gunicorn / uvicorn / celery / uwsgi and they all don't touch `SIGSEGV` / `SIGBUS`. So I think we're good with the current fix.

## Testing 

I was able to reproduce the original issues (in `datadog-lambda-python`) and to confirm that the changes from this branch consistently make the crash go away. 

Co-authored-by: thomas.kowalski <thomas.kowalski@datadoghq.com>
gh-worker-dd-mergequeue-cf854d Bot pushed a commit that referenced this pull request May 20, 2026
## Description

As reported by the [Codex Security Scan](https://chatgpt.com/codex/cloud/security/findings/535de99d265c8191b2bb3d8639b2f523?repo=https%3A%2F%2Fgithub.com%2FDataDog%2Fdd-trace-py&sev=low), follow-up to #17998 to also support disabling Python's fault handler.	

Co-authored-by: thomas.kowalski <thomas.kowalski@datadoghq.com>
vlad-scherbich pushed a commit that referenced this pull request May 20, 2026
## Description

As reported by the [Codex Security Scan](https://chatgpt.com/codex/cloud/security/findings/535de99d265c8191b2bb3d8639b2f523?repo=https%3A%2F%2Fgithub.com%2FDataDog%2Fdd-trace-py&sev=low), follow-up to #17998 to also support disabling Python's fault handler.	

Co-authored-by: thomas.kowalski <thomas.kowalski@datadoghq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Profiling Continous Profling

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants