fix(profiling): detect when faulthandler is enabled#17998
Conversation
Codeowners resolved as |
6bfa8b9 to
9f7846e
Compare
|
✨ Fix all issues with BitsAI or with Cursor
|
9f7846e to
99763a9
Compare
faulthandler is enabled
cd9725e to
0622e50
Compare
|
@codex review |
|
@codex review this |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0622e501e4
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@codex review again |
faulthandler is enabledb1829ac to
d27d175
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d27d1756a1
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
d27d175 to
a97ab37
Compare
a97ab37 to
e5620b5
Compare
7ee05c1 to
709d1e8
Compare
brettlangdon
left a comment
There was a problem hiding this comment.
general approach and release note lgtm
|
@KowalskiThomas Since this code is not familiar to me, I've asked Claude for help after a manual review. Here's the result: Must address before merge: #1 (Pause timeout fallthrough) -- If the 3-second timeout fires, the handler swap proceeds without protection. This is the same race condition the entire pause mechanism was built to prevent. A timeout under normal conditions should never happen (samples take milliseconds), but if it does, the process could crash. The fix is small: skip the swap on timeout, or at least only do reinstall_segv_handler() without the full uninstall dance. Worth raising but can be a follow-up: #2 (Concurrent enable() calls) -- Technically a race, but faulthandler.enable() is almost never called from multiple threads simultaneously. A threading.Lock is a one-liner fix, so it could go in this PR or a fast follow-up. Low real-world risk. |
e90deb5 to
39aabbc
Compare
|
Pause timeout fallthrough We now don't uninstall if we time out, but we still reinstall. There's pretty much nothing else we can do unfortunately, so it's best effort (and we don't really expect that to happen in real life). Concurrent |
858bfb4 to
3f49fe4
Compare
3f49fe4 to
ef59b90
Compare
## What is this PR? https://datadoghq.atlassian.net/browse/PROF-14600 This PR is a follow-up from #17929 where we had to remove the default enablement of "fast `copy_memory`" (guarded `memcpy`) because we realised it didn't play well with Python's own `faulthandler` module in certain cases (despite us having put guardrails for it to work...) The problem essentially was that Python's `faulthandler` would be installed after ours (at least in some cases, e.g. `pytest`), so it would be notified/called _first_ when a segmentation fault happened, and would cease execution. We never had a chance to say it was fine and we should continue, because Python's fault handler stops execution (it doesn't raise / relay to the next fault handler). The only way to work around this is to make sure our fault handler is installed when we call `memcpy`. Doing this every time we take a sample would be costly (it is one system call) and racy (Python's fault handler could be installed after we check but before we `memcpy`); checking every time we do a `memcpy` is simply prohibitive and would make using `memcpy` irrelevant. The proposed solution is to patch `faulthandler`'s `enable` function to immediately reinstall our fault handler when it is called, after Python has installed its fault handler. Note that this does not break any existing functionality: * When a fault happens because of our `memcpy` call, we catch it and continue as if nothing happened * When a fault happens for some other reason, our boolean flag (indicating whether we are currently making a risky copy) is false, so our fault handler calls the next fault handler in the chain, which is... Python's. So it still works. Note that just doing this would not be enough, because we would introduce a cycle in fault handling (ours; then `faulthandler`'s; then ours again; then... `faulthandler` -- we only have one slot to keep track of fault handlers). There is no simple way around this. The simplest thing to do is to uninstall our fault handler before installing `faulthandler`'s, and re-install it right after. This, however, races with the Sampling Thread as it may be calling into `safe_memcpy` while we do the uninstall/re-install dance, meaning we could hard crash. Not good. To work around this second problem, what I did is that I added a new API to _pause_ and _resume_ (not _stop_ and _start_, it's much more lightweight) the Sampling Thread when we want to install the new fault handler. That way, races are eliminated. This, though, comes at the cost of delaying the installation of the new fault handler by the time it takes to finish capturing the current Sample (should be a few milliseconds at most, so I think it's OK). One central question we need to answer is: "is that enough?" -- in other words, are there any other Python packages that install signal handlers and that we need to worry about? The short answer, based on my research, is no. The most notable packages that install signal handlers and that are relevant to us are gunicorn / uvicorn / celery / uwsgi and they all don't touch `SIGSEGV` / `SIGBUS`. So I think we're good with the current fix. ## Testing I was able to reproduce the original issues (in `datadog-lambda-python`) and to confirm that the changes from this branch consistently make the crash go away. Co-authored-by: thomas.kowalski <thomas.kowalski@datadoghq.com>
## Description As reported by the [Codex Security Scan](https://chatgpt.com/codex/cloud/security/findings/535de99d265c8191b2bb3d8639b2f523?repo=https%3A%2F%2Fgithub.com%2FDataDog%2Fdd-trace-py&sev=low), follow-up to #17998 to also support disabling Python's fault handler. Co-authored-by: thomas.kowalski <thomas.kowalski@datadoghq.com>
## Description As reported by the [Codex Security Scan](https://chatgpt.com/codex/cloud/security/findings/535de99d265c8191b2bb3d8639b2f523?repo=https%3A%2F%2Fgithub.com%2FDataDog%2Fdd-trace-py&sev=low), follow-up to #17998 to also support disabling Python's fault handler. Co-authored-by: thomas.kowalski <thomas.kowalski@datadoghq.com>

What is this PR?
https://datadoghq.atlassian.net/browse/PROF-14600
This PR is a follow-up from #17929 where we had to remove the default enablement of "fast
copy_memory" (guardedmemcpy) because we realised it didn't play well with Python's ownfaulthandlermodule in certain cases (despite us having put guardrails for it to work...)The problem essentially was that Python's
faulthandlerwould be installed after ours (at least in some cases, e.g.pytest), so it would be notified/called first when a segmentation fault happened, and would cease execution. We never had a chance to say it was fine and we should continue, because Python's fault handler stops execution (it doesn't raise / relay to the next fault handler).The only way to work around this is to make sure our fault handler is installed when we call
memcpy. Doing this every time we take a sample would be costly (it is one system call) and racy (Python's fault handler could be installed after we check but before wememcpy); checking every time we do amemcpyis simply prohibitive and would make usingmemcpyirrelevant.The proposed solution is to patch
faulthandler'senablefunction to immediately reinstall our fault handler when it is called, after Python has installed its fault handler. Note that this does not break any existing functionality:memcpycall, we catch it and continue as if nothing happenedNote that just doing this would not be enough, because we would introduce a cycle in fault handling (ours; then
faulthandler's; then ours again; then...faulthandler-- we only have one slot to keep track of fault handlers).There is no simple way around this. The simplest thing to do is to uninstall our fault handler before installing
faulthandler's, and re-install it right after. This, however, races with the Sampling Thread as it may be calling intosafe_memcpywhile we do the uninstall/re-install dance, meaning we could hard crash. Not good.To work around this second problem, what I did is that I added a new API to pause and resume (not stop and start, it's much more lightweight) the Sampling Thread when we want to install the new fault handler. That way, races are eliminated. This, though, comes at the cost of delaying the installation of the new fault handler by the time it takes to finish capturing the current Sample (should be a few milliseconds at most, so I think it's OK).
One central question we need to answer is: "is that enough?" -- in other words, are there any other Python packages that install signal handlers and that we need to worry about? The short answer, based on my research, is no. The most notable packages that install signal handlers and that are relevant to us are gunicorn / uvicorn / celery / uwsgi and they all don't touch
SIGSEGV/SIGBUS. So I think we're good with the current fix.Testing
I was able to reproduce the original issues (in
datadog-lambda-python) and to confirm that the changes from this branch consistently make the crash go away.