This document describes the internal architecture of spprof, a high-performance sampling profiler for Python.
spprof is a high-performance sampling profiler for Python that captures call stacks with minimal overhead. The architecture varies by platform:
| Platform | Sampling Method | Free-Threading Safe |
|---|---|---|
| Linux | SIGPROF signal + per-thread timers | ✅ Yes (speculative capture) |
| macOS | Mach thread suspension | ✅ Yes |
| Windows | Timer queue + GIL acquisition | ✅ Yes |
The architecture consists of:
- Sampler - Platform-specific sampling (signal handler or thread suspension)
- Ring Buffer - Lock-free queue for sampler→resolver communication
- Code Registry - Reference tracking to prevent use-after-free
- Resolver - Converts raw pointers to human-readable symbols
- Python API - User-facing interface
┌─────────────────────────────────────────────────────────────────────┐
│ Python Application │
├─────────────────────────────────────────────────────────────────────┤
│ spprof Python API │
│ start() / stop() / profile() │
├─────────────────────────────────────────────────────────────────────┤
│ Native Extension Layer │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌────────────┐ │
│ │ Platform │ │ Sampler │ │ Ring │ │ Resolver │ │
│ │ (Linux: │─▶│ (Linux: │─▶│ Buffer │─▶│ (GIL, sym │ │
│ │ timer, │ │ signal, │ │ (lock-free │ │ cache, │ │
│ │ Darwin: │ │ Darwin: │ │ SPSC) │ │ dladdr) │ │
│ │ Mach, │ │ suspend, │ │ │ │ │ │
│ │ Windows: │ │ Windows: │ │ │ │ │ │ │
│ │ timer Q) │ │ GIL+walk) │ │ ▼ │ │ │ │
│ └─────────────┘ └─────────────┘ │ Code Regist │ └────────────┘ │
│ │ (ref track) │ │
│ └─────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
src/spprof/
├── __init__.py # Public Python API
├── output.py # Output formatters (Speedscope, FlameGraph)
├── _ext/ # C extension source code
│ ├── module.c # Python extension entry point
│ ├── signal_handler.c # Async-signal-safe signal handler (Linux)
│ ├── ringbuffer.c # Lock-free SPSC queue
│ ├── resolver.c # Symbol resolution with mixed-mode merging
│ ├── framewalker.c # Frame walking (vtable dispatch)
│ ├── unwind.c # Native stack unwinding (libunwind/backtrace)
│ ├── code_registry.c # Code object reference tracking
│ ├── internal/ # Python internal structure definitions
│ │ ├── pycore_frame.h # _PyInterpreterFrame for 3.11-3.14
│ │ └── pycore_tstate.h # Async-signal-safe frame capture
│ ├── compat/ # Public API compatibility (fallback)
│ │ └── py*.h # Version-specific headers
│ └── platform/ # Platform-specific implementations
│ ├── linux.c # timer_create + SIGEV_THREAD_ID + SIGPROF
│ ├── darwin.c # Legacy setitimer wrapper (delegates to Mach)
│ ├── darwin_mach.c # Mach sampler: thread suspension + frame walking
│ └── windows.c # Timer queue + GIL + CaptureStackBackTrace
└── py.typed # PEP 561 marker
The signal handler is the most critical component. It runs in async-signal-safe context, which means:
- No malloc/free - All storage is pre-allocated
- No locks - Uses atomic operations only
- No Python API - Reads internal structures directly
- No I/O - No printf, no file operations
void spprof_signal_handler(int signum, siginfo_t* info, void* ucontext) {
// 1. Check if profiler is active
// 2. Get timestamp (async-signal-safe)
// 3. Capture Python frames (direct memory access)
// 4. Write to ring buffer (lock-free)
}The handler captures:
- Timestamp (nanoseconds, monotonic clock)
- Thread ID
- Stack of
PyCodeObject*pointers - Instruction pointers (for line number resolution)
A lock-free Single-Producer Single-Consumer (SPSC) queue:
┌─────────────────────────────────────────────────────────────────┐
│ write_idx ──────────────▶ │
│ [Sample][Sample][Sample][Sample][ empty ][ empty ][ empty ] │
│ ▲── read_idx │
└─────────────────────────────────────────────────────────────────┘
- Producer: Signal handler (writes samples)
- Consumer: Resolver thread (reads and resolves)
- Overflow handling: Drops samples rather than blocking
- Memory ordering: Uses acquire/release semantics
The buffer is pre-allocated to avoid any allocation in the signal handler.
Async-signal-safe frame walking using Python's internal structures:
// Get thread state from TLS (async-signal-safe)
PyThreadState* tstate = _spprof_tstate_get();
// Get current frame (version-specific struct access)
_spprof_InterpreterFrame* frame = _spprof_get_current_frame(tstate);
// Walk the frame chain
while (frame != NULL) {
PyCodeObject* code = _spprof_frame_get_code(frame);
// Store code pointer for later resolution
frame = _spprof_frame_get_previous(frame);
}Version Compatibility:
| Python | Frame Access | Notes |
|---|---|---|
| 3.11 | cframe->current_frame |
Uses _PyCFrame |
| 3.12 | cframe->current_frame |
Similar to 3.11 |
| 3.13+ | tstate->current_frame |
Direct access, f_executable |
Runs in normal context (can use GIL, malloc, Python API):
int resolver_resolve_frame(uintptr_t code_addr, ResolvedFrame* out) {
PyGILState_STATE gstate = PyGILState_Ensure();
PyCodeObject* co = (PyCodeObject*)code_addr;
// Extract: function name, filename, line number
PyGILState_Release(gstate);
}Features:
- Symbol cache: 4-way set-associative cache with pseudo-LRU eviction
- Line number resolution: Uses instruction pointer for accuracy
- Mixed-mode merging: "Trim & Sandwich" algorithm combines native and Python frames
- Native symbol resolution: Uses
dladdr()(POSIX) orDbgHelp(Windows) - Batch processing: Drains ring buffer efficiently via streaming API
The Code Registry tracks references to PyCodeObject pointers captured during sampling to prevent use-after-free bugs.
Problem Solved:
Timeline without registry:
T1: Sample captured, frame.code = 0x7fff12345678 (PyCodeObject*)
T2: Python GC runs, frees the code object
T3: Memory at 0x7fff12345678 is reused for something else
T4: Resolver tries to read code->co_name → crash or garbage
Solution:
// During sampling (Darwin Mach sampler, with GIL held):
python_depth = capture_frames(tstate, frames, ...);
code_registry_add_refs_batch(frames, python_depth, gc_epoch); // Py_INCREF
// During resolution (later):
CodeValidationResult result = code_registry_validate(code_addr, epoch);
if (result == CODE_VALID) {
// Safe to dereference
}
// After resolution:
code_registry_release_refs_batch(frames, python_depth); // Py_DECREFPlatform Differences:
| Platform | Reference Holding | Safety Guarantee |
|---|---|---|
| Darwin/Mach | ✅ INCREF during capture (GIL held) | Guaranteed valid |
| Linux/SIGPROF | ❌ Cannot INCREF (signal context) | Best-effort validation |
| Windows | ✅ INCREF during capture (GIL held) | Guaranteed valid |
Safe Mode:
For Linux signal-handler samples (where we can't INCREF), Safe Mode discards any code objects not already tracked:
code_registry_set_safe_mode(1); // Enable safe mode
// In validation:
if (g_safe_mode && !code_registry_is_held(code_addr)) {
g_safe_mode_rejects++;
return CODE_INVALID_NOT_HELD; // Discard this frame
}This trades profile completeness for guaranteed memory safety.
Uses POSIX per-thread timers for accurate CPU-time sampling:
// Create per-thread timer
struct sigevent sev = {
.sigev_notify = SIGEV_THREAD_ID,
.sigev_signo = SIGPROF,
._sigev_un._tid = gettid(),
};
timer_create(CLOCK_THREAD_CPUTIME_ID, &sev, &timer);Benefits:
- Samples only when thread is executing (not sleeping)
- Each thread gets its own timer
- True per-thread CPU profiling
Container Fallback: When CLOCK_THREAD_CPUTIME_ID fails (common in containers with restricted syscalls), spprof automatically falls back to CLOCK_MONOTONIC (wall-time). This is transparent to the user but means sleeping threads will also be sampled.
Thread Registry (Dynamic Thread Tracking)
The Linux implementation uses a hash table (uthash) for dynamic thread tracking:
typedef struct ThreadTimerEntry {
pid_t tid; // Key: Linux thread ID
timer_t timer_id; // POSIX timer handle
uint64_t overruns; // Accumulated timer overruns
int active; // Timer running state
UT_hash_handle hh; // uthash handle
} ThreadTimerEntry;Key features:
- No thread limits: Dynamic growth replaces the old 256-thread limit
- O(1) operations: Hash table provides constant-time lookup
- RWLock protection: Concurrent reads during profiling, serialized writes
- Overrun tracking: Per-thread overrun counts aggregated globally
- Race-free shutdown: Signal blocking during cleanup prevents crashes
Thread registry operations:
| Operation | Complexity | Thread Safety |
|---|---|---|
registry_add_thread |
O(1) avg | Write lock |
registry_find_thread |
O(1) avg | Read lock |
registry_remove_thread |
O(1) avg | Write lock |
registry_count |
O(1) | Read lock |
Timer Cleanup
Safe timer deletion uses signal blocking:
// Block SIGPROF to prevent races
sigset_t block_set;
sigemptyset(&block_set);
sigaddset(&block_set, SIGPROF);
pthread_sigmask(SIG_BLOCK, &block_set, &old_set);
// Delete timer safely
timer_delete(timer_id);
// Drain pending signals
while (sigtimedwait(&block_set, &info, &timeout) > 0) {}
// Restore signal mask
pthread_sigmask(SIG_SETMASK, &old_set, NULL);Pause/Resume Support
Timers can be paused without destruction:
// Pause: Set zero interval (disarm)
struct itimerspec zero = {0};
timer_settime(timer_id, 0, &zero, NULL);
// Resume: Restore saved interval
timer_settime(timer_id, 0, &saved_interval, NULL);Python 3.13+ can be built with free-threading support (--disable-gil). spprof fully supports free-threaded Python on Linux via a speculative capture mechanism with multi-layer validation.
The Challenge:
Signal-based sampling on free-threaded Python is inherently racy because frame chains can change while being read. Unlike macOS where we can suspend threads, Linux signal handlers must work with a potentially-changing frame chain.
Solution: Speculative Capture with Validation
// Speculative capture with validation
_spprof_InterpreterFrame *frame = SPPROF_ATOMIC_LOAD_PTR(&tstate->current_frame);
while (frame && depth < max_frames) {
// 1. Validate pointer bounds and alignment
if (!_spprof_ptr_valid_speculative(frame)) break;
// 2. Cycle detection (prevent infinite loops from corruption)
if (seen_before(frame)) { drop_sample(); return 0; }
// 3. Type-check code object using cached PyCode_Type pointer
if (_spprof_looks_like_code(code)) {
frames[depth++] = code;
}
// 4. Move to next frame with atomic load
frame = SPPROF_ATOMIC_LOAD_PTR(&frame->previous);
}Safety Model:
| Check | Purpose |
|---|---|
| Pointer bounds | Detect freed/corrupted memory (heap range validation) |
| 8-byte alignment | All Python objects are aligned |
| Cycle detection | Prevent infinite loops from corrupted frame chains |
| Type validation | Verify code object via cached PyCode_Type pointer |
Performance Characteristics:
- Race window during frame updates: ~10-50ns
- Sampling interval: 10ms (default)
- Collision probability: ~0.0005% per sample
- Corrupted samples are safely dropped (increments
validation_dropscounter)
Memory Ordering:
- x86-64: Strong memory model, plain loads sufficient
- ARM64: Uses acquire semantics via
SPPROF_ATOMIC_LOAD_PTR
Monitoring Validation Drops:
stats = spprof.stats()
# Check validation_drops in extended stats for free-threading healthmacOS uses a Mach-based "Suspend-Walk-Resume" sampling pattern that is fundamentally different from signal-based sampling on Linux. This design is required for several reasons and provides important safety guarantees.
Core Approach:
// Sampler thread loop (runs at configurable interval)
void sample_all_threads() {
PyGILState_STATE gstate = PyGILState_Ensure();
for (each python_thread in interpreter) {
thread_suspend(mach_port); // Fully stop target thread
capture_python_frames(tstate); // Walk Python frame chain
capture_native_frames(registers); // Walk C stack via frame pointers
code_registry_add_refs_batch(...); // INCREF code objects
thread_resume(mach_port); // Resume target thread
write_sample_to_ringbuffer(...);
}
PyGILState_Release(gstate);
}Key Components:
| Component | Purpose |
|---|---|
pthread_introspection_hook |
Automatic thread discovery (lifecycle events) |
thread_suspend()/thread_resume() |
Safely stop thread for inspection |
thread_get_state() |
Capture registers (PC, SP, FP) for native stack |
mach_wait_until() |
Precise timing without signal overhead |
| Thread Registry | Track all threads with Mach ports and stack bounds |
Thread Discovery:
// Automatically called when threads start/terminate
void introspection_hook(unsigned int event, pthread_t thread, ...) {
switch (event) {
case PTHREAD_INTROSPECTION_THREAD_START:
registry_add(&g_state.registry, thread);
break;
case PTHREAD_INTROSPECTION_THREAD_TERMINATE:
registry_remove(&g_state.registry, thread);
break;
}
}Benefits over Signal-Based Sampling:
- Free-Threading Safe: Works with Python 3.13+
--disable-gilbuilds - Deterministic Sampling: Precise control over which threads are sampled
- No Signal Handler Constraints: Can use any API during frame capture
- Native Stack Integration: Frame pointer walking from captured registers
This section documents key design decisions in the Mach sampler and explains why certain alternative approaches were rejected.
A common suggestion is to minimize GIL hold time by:
- Using
task_threads()to discover threads instead of Python APIs - Suspending threads and walking stacks WITHOUT the GIL
- Only acquiring GIL at the end to INCREF captured code objects
This approach is fundamentally flawed. Here's why:
Problem 1: task_threads() Returns Mach Ports, Not Python Thread States
// This does NOT work for finding Python frames:
thread_act_port_array_t threads;
task_threads(mach_task_self(), &threads, &count);
// We get Mach port numbers, but:
// - No mapping to PyThreadState
// - No access to Python frame chain
// - No way to know which threads are Python threadsThe Python frame chain (_PyInterpreterFrame) is a linked list starting from PyThreadState->current_frame. There is no way to find this from a Mach port alone. We MUST iterate Python's thread state list.
Problem 2: Thread State List Requires GIL
// These APIs require GIL for safe linked-list traversal:
PyThreadState* tstate = PyInterpreterState_ThreadHead(interp);
while (tstate != NULL) {
// Process thread...
tstate = PyThreadState_Next(tstate); // Follows linked list
}Without the GIL, another thread could modify the linked list (add/remove thread states) while we're iterating, leading to use-after-free or infinite loops.
Problem 3: Code Object INCREFs Cannot Be Deferred
// In the current design:
python_depth = _spprof_capture_frames_from_tstate(tstate, frames, ...);
code_registry_add_refs_batch(frames, python_depth, gc_epoch); // INCREF here
thread_resume(mach_port);If we tried to defer INCREF to after all threads are resumed:
- GC could run between resume and INCREF
- Code objects could be freed and memory reused
- INCREF would operate on invalid/reused memory
The Safe Ordering:
suspend → capture frames → INCREF (while suspended) → resume
The INCREF must happen while the thread is still suspended AND we hold the GIL, ensuring the captured pointer is valid.
Problem 4: Caching PyThreadState Pointers is Unsafe
// This is UNSAFE:
// Step 1: Cache all thread states (with GIL)
for (tstate = ThreadHead(interp); tstate; tstate = Next(tstate)) {
cached_tstates[i++] = tstate;
}
PyGILState_Release(gstate); // Release GIL
// Step 2: Suspend and sample (without GIL)
for (i = 0; i < count; i++) {
suspend_and_sample(cached_tstates[i]); // CRASH: tstate may be freed!
}A Python thread could exit between step 1 and step 2. The cached PyThreadState* would become a dangling pointer.
Another suggestion is to maintain our own lock-free cache of thread states, updated by Python callbacks.
Problems:
- Python doesn't provide callbacks for thread state changes that work in free-threaded builds
- The
threadingmodule's hooks run too late (after thread is fully initialized) - Would need to solve the same dangling pointer problem
The pthread_introspection_hook we use provides thread LIFECYCLE events (start/terminate), but gives us pthread_t handles, not PyThreadState* pointers. We still need the GIL to map these to Python thread states.
Current Design Performance (measured on Apple M1):
| Thread Count | GIL Hold Time | Per-Thread Overhead |
|---|---|---|
| 1 thread | ~30μs | ~30μs |
| 10 threads | ~300μs | ~30μs each |
| 50 threads | ~1.5ms | ~30μs each |
| 100 threads | ~3ms | ~30μs each |
Impact Analysis:
For a 10ms sampling interval (100 Hz):
- 10 threads: 300μs / 10ms = 3% overhead
- 50 threads: 1.5ms / 10ms = 15% overhead
- 100 threads: 3ms / 10ms = 30% overhead
Recommendations for High-Thread-Count Workloads:
- Increase sampling interval (e.g., 20ms instead of 10ms)
- Profile only a subset of threads (feature TODO)
- Use shorter profiling sessions
Why This Overhead is Acceptable:
- The GIL is only held during the sampling tick, not continuously
- Python bytecode execution pauses, but C extensions that release the GIL continue
- Individual threads are suspended for only ~10-50μs each
- The alternative (unsafe sampling) risks crashes and corrupted data
The Mach sampler is safe for free-threaded Python builds because:
- Thread Suspension:
thread_suspend()fully stops the target thread - Stable Frame Chain: Frame pointers cannot change while suspended
- Critical Section:
PyGILState_Ensure()acquires the appropriate lock - Order of Operations: Suspend → Read → INCREF → Resume
Linux signal-based sampling required special handling for free-threading safety. See the Linux Free-Threading section for details on how speculative capture with validation makes this safe.
// From pycore_frame.h:
#if SPPROF_FREE_THREADED
#if defined(__APPLE__)
#define SPPROF_FREE_THREADING_SAFE 1 // Mach sampler is safe
#elif defined(__linux__)
#define SPPROF_FREE_THREADING_SAFE 1 // Speculative capture is safe
#else
#define SPPROF_FREE_THREADING_SAFE 0 // Other platforms not yet supported
#endif
#endifNative symbol resolution via dladdr() is not done during thread suspension:
// WRONG (deadlock risk):
thread_suspend(port);
dladdr(pc, &info); // dladdr may take loader lock → deadlock if target holds it
thread_resume(port);
// CORRECT (current implementation):
thread_suspend(port);
native_stack.frames[i].return_addr = pc; // Just copy PC address
thread_resume(port);
// ... later, in resolver (after resume) ...
dladdr(pc, &info); // Safe - target thread is runningIf the target thread was suspended while holding the loader lock (e.g., during dlopen), calling dladdr() would deadlock. We defer symbol resolution to the resolver, which runs after all threads are resumed.
| Requirement | Current Design | Alternative (No GIL) |
|---|---|---|
| Find Python threads | ✅ PyInterpreterState_ThreadHead (safe) | ❌ task_threads() (no PyThreadState mapping) |
| Iterate thread list | ✅ GIL protects linked list | ❌ List can change during iteration |
| Access frame chain | ✅ Thread suspended, chain stable | ❌ Would need to cache, risk dangling ptrs |
| INCREF code objects | ✅ Done while suspended + GIL | ❌ Cannot defer (GC race) |
| Free-threading safe | ✅ Works with Py_GIL_DISABLED | ✅ Linux uses speculative capture |
The GIL hold time is an acceptable trade-off for correctness and safety. The profiler is designed for observability, not to be invisible. A few milliseconds of GIL hold time per sample is preferable to crashes, corrupted profiles, or use-after-free bugs.
Uses timer queue with GIL acquisition (no thread suspension needed):
// Timer callback (runs in thread pool thread)
void CALLBACK timer_callback(PVOID param, BOOLEAN timer_fired) {
PyGILState_STATE gstate = PyGILState_Ensure();
// With GIL held, iterate all Python threads
PyThreadState* tstate = PyInterpreterState_ThreadHead(interp);
while (tstate != NULL) {
// Walk frames using public API (PyThreadState_GetFrame, etc.)
walk_thread_frames(tstate, &sample, line_numbers);
// Optionally capture native stack
if (g_native_unwinding) {
capture_native_stack(native_frames, MAX_NATIVE_FRAMES, 2);
}
tstate = PyThreadState_Next(tstate);
}
PyGILState_Release(gstate);
}Key Features:
- Uses
CaptureStackBackTrace()for native frames (fast, no DbgHelp lock) - Accurate line numbers via
PyFrame_GetLineNumber() - Sample batching for reduced ring buffer contention
- No thread suspension needed (GIL provides synchronization)
Why No Thread Suspension on Windows?
Unlike macOS Mach, Windows thread suspension (SuspendThread) has several issues:
- Can deadlock if target holds critical section
- Requires elevated privileges in some configurations
- GIL already provides sufficient synchronization for Python frame walking
Since we need the GIL anyway to safely iterate thread states and INCREF code objects, we simply hold it throughout the sample cycle.
The resolver merges native C/C++ frames with Python frames to create coherent mixed-mode stack traces:
Captured Native Stack: Captured Python Stack:
[0] some_c_function [0] my_python_func
[1] PyObject_Call [1] call_helper
[2] _PyEval_EvalFrameDefault [2] main
[3] PyEval_EvalCode
[4] run_mod
[5] PyRun_FileExFlags
[6] Py_RunMain
[7] main
Merged Result:
[0] some_c_function (native - leaf)
[1] my_python_func (python)
[2] call_helper (python)
[3] main (python)
[4] Py_RunMain (native - entry)
[5] main (native - entry)
Algorithm:
int merge_native_and_python_frames(...) {
// Walk native stack from leaf (index 0)
for (i = 0; i < native_depth; i++) {
resolve_native_frame(native_pcs[i], &frame, &is_interpreter);
if (is_interpreter && !python_inserted) {
// Hit Python interpreter - INSERT PYTHON STACK HERE
for (j = 0; j < python_depth; j++) {
resolve_code_object(python_frames[j], &out_frames[out_idx++]);
}
python_inserted = 1;
// Skip remaining interpreter frames
} else if (!is_interpreter) {
// Non-interpreter native frame - include it
out_frames[out_idx++] = frame;
}
}
}Interpreter Detection:
Uses address-based comparison for robustness:
// During resolver init:
Dl_info info;
dladdr((void*)&Py_Initialize, &info);
g_python_lib_base = info.dli_fbase; // Base address of Python library
// During frame processing:
int is_python_interpreter_frame(void* lib_base, const char* lib_path) {
// Primary: Compare library base addresses
if (g_python_lib_base != NULL && lib_base == g_python_lib_base) {
return 1;
}
// Fallback: String heuristics
if (strstr(lib_path, "Python.framework") || strstr(lib_path, "libpython")) {
return 1;
}
return 0;
}1. Timer fires (OS)
↓
2. SIGPROF delivered to thread
↓
3. Signal handler executes:
- Read PyThreadState from TLS
- Walk _PyInterpreterFrame chain
- Copy code pointers to RawSample
- Write to ring buffer
↓
4. Resolver drains buffer (after profiling stops):
- Read RawSample from ring buffer
- Resolve code pointers to symbols
- Build ResolvedSample with names/files/lines
↓
5. Python layer creates Profile object
↓
6. User exports to Speedscope/FlameGraph
┌─────────────────────────────────────────────────────┐
│ timestamp (8 bytes) │
├─────────────────────────────────────────────────────┤
│ thread_id (8 bytes) │
├─────────────────────────────────────────────────────┤
│ depth (4 bytes) │ padding (4 bytes) │
├─────────────────────────────────────────────────────┤
│ frames[0..127] - PyCodeObject* pointers (1024 B) │
├─────────────────────────────────────────────────────┤
│ instr_ptrs[0..127] - instruction pointers (1024 B) │
└─────────────────────────────────────────────────────┘
Total: ~2KB per sample
┌─────────────────────────────────────────────────────┐
│ frames[0..127] - ResolvedFrame array │
│ ├─ function_name (256 bytes) │
│ ├─ filename (1024 bytes) │
│ ├─ lineno (4 bytes) │
│ └─ is_native (4 bytes) │
├─────────────────────────────────────────────────────┤
│ depth (4 bytes) │
├─────────────────────────────────────────────────────┤
│ timestamp (8 bytes) │
├─────────────────────────────────────────────────────┤
│ thread_id (8 bytes) │
└─────────────────────────────────────────────────────┘
| Component | Thread Safety | Notes |
|---|---|---|
| Signal handler | Async-signal-safe | Single producer |
| Ring buffer write | Single producer | Lock-free |
| Ring buffer read | Single consumer | Lock-free |
| Resolver | GIL-protected | Safe Python API access |
| Platform timers | Thread-safe | OS-managed |
| Operation | Time | Notes |
|---|---|---|
| Signal handler | <10μs | No allocations, direct reads |
| Frame walk (per frame) | ~100ns | Pointer chasing |
| Ring buffer write | ~50ns | Memory copy |
| Symbol resolution | ~1-10μs | Cached |
| Symbol resolution (miss) | ~10-100μs | GIL + Python API |
The profiler is designed to fail safely:
- Invalid pointers in signal handler: Pointer validation before dereference
- Ring buffer full: Samples dropped, counter incremented (check
profile.dropped_count) - Invalid code objects in resolver: Skip frame, log error
- Thread termination: Timers cleaned up automatically
- Container restrictions: Automatic fallback to wall-time sampling when CPU-time timers fail
For troubleshooting common issues, see the Troubleshooting Guide.
- ✅ Sample aggregation: Reduce memory for repeated stacks via
profile.aggregate() - ✅ Native frame capture: Mixed-mode profiling with C/C++ frames
- ✅ Linux free-threading support: Speculative capture with validation for Python 3.13+
- ✅ macOS free-threading support: Mach-based thread suspension sampling
- Native frame interleaving: Merge C and Python frames in call order
- Streaming output: Write samples to disk for long profiles
- Per-thread buffers: Reduce contention in multi-threaded apps
PEP 669 (sys.monitoring) provides a callback-based profiling API that could offer deterministic tracing as an alternative to statistical sampling:
Potential Benefits:
- Zero-overhead when disabled (no signal delivery)
- Deterministic function entry/exit events
- Built-in safe points for free-threading
Current Status: The speculative capture approach provides good coverage with minimal drops. PEP 669 may be explored for use cases requiring deterministic tracing.
For future enhancement, Linux's perf_event_open syscall with PERF_SAMPLE_CALLCHAIN could provide hardware-assisted sampling with lower overhead than timer-based approaches.
Potential Benefits:
- Hardware-assisted sampling (uses CPU performance counters)
- Lower overhead than timer interrupts
- More accurate CPU cycle attribution
- Can sample on specific events (cache misses, branches, etc.)
Implementation Considerations:
// Conceptual perf_event_open approach
struct perf_event_attr attr = {
.type = PERF_TYPE_SOFTWARE,
.config = PERF_COUNT_SW_CPU_CLOCK,
.sample_type = PERF_SAMPLE_CALLCHAIN | PERF_SAMPLE_TID,
.sample_period = 100000, // ~100Hz sampling
.exclude_kernel = 1,
.exclude_hv = 1,
};
int fd = perf_event_open(&attr, tid, -1, -1, 0);
// Memory-map ring buffer for samples
void* mmap_buf = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);Challenges:
- Requires elevated privileges or kernel
perf_event_paranoidconfiguration - Complex API with many edge cases
- Callchain only captures native frames (need separate Python frame walking)
- Memory-mapped ring buffer management adds complexity
- Not portable (Linux-only)
Current Status: Research phase. The current timer_create implementation provides good accuracy and works without elevated privileges. perf_event_open may be explored for scenarios requiring ultra-low overhead or hardware event correlation.
References: