Required prerequisites
What version (or hash if on master) of pybind11 are you using?
3.0.2
Problem description
When using py::subinterpreter_scoped_activate in a multithreaded context, if the current thread's active PyThreadState does not match the target subinterpreter, pybind11 appears to always create a new PyThreadState via PyThreadState_New.
In my case, this leads to a crash when running SciPy FFT code inside a subinterpreter. The crash is reproducible in the attached test function run_fft_diagnostics(), specifically at Step 10 (scipy.fft.fft(..., n=256)).
Observed behavior
My understanding is that subinterpreter_scoped_activate checks whether the current thread state matches the target subinterpreter. When it does not match, a new PyThreadState is created rather than reusing an existing one previously associated with the same (thread, interpreter) pair.
This causes repeated creation of PyThreadState objects for the same thread and interpreter over time. Under my workload, this eventually causes SciPy to crash.
Hypothesis
I believe the core issue is that PyThreadState should be reused on a per-thread, per-interpreter basis.
Conceptually, PyThreadState is bound to both:
- a specific thread
- a specific interpreter
So the correct model seems to be:
- each thread should keep its own PyThreadState* for a given interpreter
- when switching into that interpreter on that thread, pybind11 should reuse the existing PyThreadState instead of always creating a new one on mismatch
My workaround
I implemented a TLS-based workaround on my side, and after that the crash no longer occurs.
The basic idea is:
- store one PyThreadState* per thread for the interpreter
- when the thread enters the interpreter again, reuse that stored thread state
- activate it with PyThreadState_Swap
- only call PyThreadState_New once for that thread if no cached state exists yet
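A minimal, self-contained sketch of the caching idea above. To keep it compilable without Python.h, `FakeInterp`, `FakeThreadState`, and `new_state` are hypothetical stand-ins for `PyInterpreterState`, `PyThreadState`, and `PyThreadState_New`; the namespace and all function names are illustrative only and are not part of pybind11 or my real code. It only models the lookup (one cached state per thread and interpreter) and leaves out activation and cleanup.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <unordered_map>

namespace sketch_cache {

// Hypothetical stand-ins for PyInterpreterState / PyThreadState.
struct FakeInterp { int id; };
struct FakeThreadState { FakeInterp* interp; };

// Counts creations so that reuse is observable.
inline std::atomic<int>& created_count() {
    static std::atomic<int> n{0};
    return n;
}

// Stand-in for PyThreadState_New(interp).
inline FakeThreadState* new_state(FakeInterp* interp) {
    ++created_count();
    return new FakeThreadState{interp};
}

// Returns the calling thread's cached state for `interp`,
// creating it only on the first request from this thread.
// Cleanup at thread exit is intentionally omitted in this sketch.
inline FakeThreadState* state_for(FakeInterp* interp) {
    thread_local std::unordered_map<FakeInterp*, FakeThreadState*> cache;
    auto it = cache.find(interp);
    if (it == cache.end()) {
        it = cache.emplace(interp, new_state(interp)).first;
    }
    return it->second;
}

// Usage: same (thread, interpreter) pair reuses; a different
// interpreter or a different thread gets its own state.
inline bool reuse_demo() {
    FakeInterp a{1}, b{2};
    FakeThreadState* first = state_for(&a);
    assert(first->interp == &a);
    bool same_pair_reuses = (state_for(&a) == first);
    bool other_interp_differs = (state_for(&b) != first);

    FakeThreadState* from_worker = nullptr;
    std::thread worker([&] { from_worker = state_for(&a); });
    worker.join();
    bool other_thread_differs = (from_worker != first);

    return same_pair_reuses && other_interp_differs &&
           other_thread_differs && created_count().load() == 3;
}

}  // namespace sketch_cache
```

In the real workaround, the returned pointer is what gets passed to PyThreadState_Swap before entering py::subinterpreter_scoped_activate.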
My code is roughly like this:
// Create the thread state for this (thread, interpreter) pair only once.
if (tss_threadstate_.ts_object() == 0) {
    zce::SmartPtr pyobj(new ZpyThread(PyThreadState_New(zvm_state_->interp), this));
    tss_threadstate_.ts_object(pyobj);
}
// Restore the cached thread state before activating the subinterpreter.
ZpyThread* pyobj = dynamic_cast<ZpyThread*>(tss_threadstate_.ts_object());
zpy_state_swap gswap(pyobj->op());
ZCE_ASSERT(pyobj->pvm() == this);
VmInterpreterGuard guard(this);
And VmInterpreterGuard is:
class VmInterpreterGuard {
public:
    explicit VmInterpreterGuard(ZpyMachine* pvm)
        : sub_(py::subinterpreter_scoped_activate(pvm->sub_interpreter())) {
    }

private:
    py::subinterpreter_scoped_activate sub_;
};
In other words, before entering py::subinterpreter_scoped_activate, I first restore the previously created thread state from thread-local storage using PyThreadState_Swap. Once I do this, the SciPy crash disappears.
This strongly suggests that the issue is related to repeated creation of mismatched PyThreadState objects instead of reusing the existing thread-local one.
However, with this approach, I am not sure how to properly and cleanly release the PyThreadState objects stored in TLS.
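One direction I am considering for the cleanup question, sketched under the same assumptions as a model that does not include Python.h: keep the cache in a thread_local object whose destructor releases every stored state when the thread exits. `FakeInterp`, `FakeThreadState`, `new_state`, and `release_state` are hypothetical stand-ins. In real code the release step would presumably be PyThreadState_Clear followed by PyThreadState_Delete, and their preconditions (which state must be current, whether the interpreter is still alive at thread exit) would need to be checked against the CPython documentation. I have not verified that part.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <unordered_map>

namespace sketch_cleanup {

// Hypothetical stand-ins for PyInterpreterState / PyThreadState.
struct FakeInterp { int id; };
struct FakeThreadState { FakeInterp* interp; };

// Number of states currently alive, to make leaks observable.
inline std::atomic<int>& live_states() {
    static std::atomic<int> n{0};
    return n;
}

// Stand-in for PyThreadState_New.
inline FakeThreadState* new_state(FakeInterp* interp) {
    ++live_states();
    return new FakeThreadState{interp};
}

// Stand-in for the real release sequence (assumed to be
// PyThreadState_Clear + PyThreadState_Delete in actual code).
inline void release_state(FakeThreadState* state) {
    --live_states();
    delete state;
}

// Per-thread cache; its destructor runs at thread exit and
// releases every state this thread created.
struct ThreadStateCache {
    std::unordered_map<FakeInterp*, FakeThreadState*> by_interp;

    ~ThreadStateCache() {
        for (auto& entry : by_interp) {
            release_state(entry.second);
        }
    }

    FakeThreadState* get(FakeInterp* interp) {
        auto it = by_interp.find(interp);
        if (it == by_interp.end()) {
            it = by_interp.emplace(interp, new_state(interp)).first;
        }
        return it->second;
    }
};

inline FakeThreadState* state_for(FakeInterp* interp) {
    thread_local ThreadStateCache cache;
    return cache.get(interp);
}

// Usage: a worker enters the interpreter twice, then exits.
// Returns how many states are still alive after the join.
inline int live_after_worker_exit() {
    FakeInterp interp{1};
    std::thread worker([&] {
        FakeThreadState* s = state_for(&interp);
        assert(state_for(&interp) == s);  // second entry reuses the state
    });
    worker.join();
    return live_states().load();
}

}  // namespace sketch_cleanup
```

This only covers threads that exit before the interpreter is destroyed. If the subinterpreter is torn down first, the cached pointers would dangle, so the real solution probably needs the interpreter side to invalidate them.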
Reproduction code
The following Python function reproduces the issue in my environment. Without the TLS-based reuse, it crashes at Step 10.
import sys
import threading

# P is my logging helper (not shown in the original); assumed here to be a flushing print.
def P(msg):
    print(msg, flush=True)

def run_fft_diagnostics():
    P(f"thread_id={threading.get_ident()} thread_native={threading.current_thread().name}")
    P(f"python={sys.version.split()[0]}")
    P("STEP 1: import numpy")
    import numpy as np
    P(f" numpy={np.__version__} OK")
    P("STEP 2: np.zeros / np.random.randn")
    a = np.zeros(256, dtype=np.float64)
    b = np.random.randn(256).astype(np.float64)
    P(f" a.shape={a.shape} b.shape={b.shape} OK")
    P("STEP 3: np.fft.fft n=64")
    y = np.fft.fft(np.random.randn(64))
    P(f" shape={y.shape} OK")
    P("STEP 4: np.fft.fft n=256")
    y = np.fft.fft(np.random.randn(256))
    P(f" shape={y.shape} OK")
    P("STEP 5: np.fft.fft n=1024")
    y = np.fft.fft(np.random.randn(1024))
    P(f" shape={y.shape} OK")
    P("STEP 6: np.fft.fft n=4096")
    y = np.fft.fft(np.random.randn(4096))
    P(f" shape={y.shape} OK")
    P("STEP 7: np.fft.rfft n=1024")
    y = np.fft.rfft(np.random.randn(1024))
    P(f" shape={y.shape} OK")
    P("STEP 8: import scipy.fft")
    try:
        import scipy.fft as sfft
        P(" scipy imported OK")
    except ImportError as e:
        P(f" scipy not available: {e} (skip scipy steps)")
        sfft = None
    if sfft is not None:
        P("STEP 9: scipy.fft.fft n=64")
        y = sfft.fft(np.random.randn(64))
        P(f" shape={y.shape} OK")
        # Without the TLS-based reuse, the crash happens in this step.
        P("STEP 10: scipy.fft.fft n=256")
        y = sfft.fft(np.random.randn(256))
        P(f" shape={y.shape} OK")
        P("STEP 11: scipy.fft.fft n=1024")
        y = sfft.fft(np.random.randn(1024))
        P(f" shape={y.shape} OK")
        P("STEP 12: scipy.fft.fft n=4096")
        y = sfft.fft(np.random.randn(4096))
        P(f" shape={y.shape} OK")
Summary
What I observe is:
- py::subinterpreter_scoped_activate creates a new PyThreadState when the current one does not match the target interpreter
- this seems to happen instead of reusing a thread-local PyThreadState already associated with the same interpreter
- in my case, this leads to a crash when calling scipy.fft
- caching and restoring PyThreadState* in TLS before activation avoids the crash completely
So I suspect subinterpreter_scoped_activate may need to reuse PyThreadState on a per-thread, per-interpreter basis rather than always allocating a fresh one on mismatch.
I can provide a patch if this approach is considered acceptable.
Reproducible example code
Is this a regression? Put the last known working version here if it is.
Not a regression