Fix Python reference counting for declared_transform product store caching #313
Conversation
Force-pushed from 5bde8f4 to 94478bb.
@phlexbot clang-fix

Automatic clang-format fixes pushed (commit f466e7a).

Review the full CodeQL report for details.

@phlexbot clang-fix

No automatic clang-format fixes were necessary.

@greenc-FNAL, I suggest placing the …

@phlexbot clang-fix
Automatic clang-format fixes pushed (commit 56c45ee). |
Definitely: a lot of this is pointless thrashing around that will have to be pruned anyway (assuming it actually fixes either problem, which is not a given right now). |
Pull request overview
This PR addresses two correctness issues: premature fold completion due to a race in fold counters, and Python converter failures propagating null pointers into PyObject_CallFunctionObjArgs, which can truncate varargs and crash Python-based tests.
Changes:
- Harden Python callback/converter handling by ensuring lifeline unwrapping is consistent, converter failures throw, and decref is null-safe.
- Fix fold completion race by initializing ready_to_flush_ to false, adding a "pending work" guard, and keeping the TBB accessor across increment+completion checks.
- Improve transform exception safety by erasing stale stores_ entries when callbacks throw.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| plugins/python/src/modulewrap.cpp | Prevent null/sentinel propagation into Python C-API calls; make conversions fail-fast and decref null-safe. |
| phlex/core/store_counters.hpp | Add pending-work tracking and adjust flush readiness initialization for fold completion safety. |
| phlex/core/store_counters.cpp | Implement pending-aware completion checks and accessor-held increment/flush completion helpers. |
| phlex/core/declared_transform.hpp | Erase inserted-but-unassigned cache entries when callback throws to avoid stale null reuse. |
| phlex/core/declared_fold.hpp | Switch fold counter updates to new pending/atomic-safe helper methods. |
Comments suppressed due to low confidence (1)
phlex/core/store_counters.cpp:59
++counts_[layer_hash] default-constructs a std::atomic<std::size_t> when the key is new. std::atomic's default constructor does not value-initialize the contained value, so the first increment can read an indeterminate value (undefined behavior). Initialize the mapped atomic explicitly on first insert (e.g., emplace with std::atomic<std::size_t>{0} and then fetch_add(1)).
void store_counter::increment(data_cell_index::hash_type const layer_hash)
{
++counts_[layer_hash];
}
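A minimal sketch of the suggested fix, assuming counts_ were an ordinary std::unordered_map keyed by the layer hash (the real container appears to be a TBB concurrent map, where the same idea applies through its accessor API):

```cpp
#include <atomic>
#include <cstddef>
#include <unordered_map>

std::unordered_map<std::size_t, std::atomic<std::size_t>> counts_;

void increment(std::size_t const layer_hash)
{
  // Zero-initialize the mapped atomic on first insert, then increment via
  // fetch_add so the first read is never indeterminate.
  auto const it = counts_.try_emplace(layer_hash, 0u).first;
  it->second.fetch_add(1, std::memory_order_relaxed);
  // Note: std::unordered_map itself is not safe for concurrent insertion; this
  // only illustrates the initialization point raised in the comment above.
}
```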
…letion check

The store_counter::is_complete() check was racing with concurrent increment() calls. In TBB multifunction_nodes with unlimited concurrency, the counter_for() accessor was released before increment() was called, allowing is_complete() in another thread's done_with() to see a partially-updated counts_ map. This caused folds with multiple layer types (like different_hierarchies) to complete prematurely when only a subset of layer hashes had been fully counted.

Fix: Add increment_and_check() and flush_and_check() methods to count_stores that hold the TBB hash map accessor across both the increment/flush and the completion check, ensuring atomic read-modify-check semantics.

Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
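A hedged sketch of the accessor-held read-modify-check idea described in that commit, using a plain tbb::concurrent_hash_map rather than the repository's count_stores/store_counter types (the function name and the expected parameter are illustrative):

```cpp
#include <cstddef>
#include <tbb/concurrent_hash_map.h>

using counter_map = tbb::concurrent_hash_map<std::size_t, std::size_t>;

// Returns true only when this increment reaches 'expected'.  The accessor acts
// as a per-entry write lock, so the update and the completion check are seen
// atomically by any other thread using an accessor on the same key.
bool increment_and_check(counter_map& counts, std::size_t const hash, std::size_t const expected)
{
  counter_map::accessor a;        // held until the end of the function
  counts.insert(a, hash);         // find-or-create; new values start at zero
  ++a->second;                    // increment...
  return a->second == expected;   // ...and check while still holding the lock
}
```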
…RAY_CONVERTER null check

Two fixes for py:vectypes:

1. declared_transform.hpp: When stores_.insert() creates a new entry with a default (null) value and the user callback throws, the null entry remained in the cache. On subsequent messages with the same hash, the null store was reused and sent downstream, causing null pointer propagation through the graph. Fix: wrap the callback in try/catch and erase the entry on exception.

2. modulewrap.cpp: Restore the null check for pyobj in NUMPY_ARRAY_CONVERTER that was removed in PR #213. Without it, PyArray_Check(nullptr) and Py_DECREF(nullptr) cause undefined behavior, leading to intermittent SEGFAULTs or incorrect behavior.

Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
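A minimal sketch of the restored guard from point 2 above (function name and error text are illustrative), assuming the converter receives the PyObject* directly:

```cpp
#include <Python.h>
#include <stdexcept>
#include <vector>

std::vector<double> numpy_to_vector(PyObject* pyobj)
{
  // Guard before any CPython/NumPy call: PyArray_Check(nullptr) and
  // Py_DECREF(nullptr) are undefined behavior.
  if (pyobj == nullptr) {
    throw std::runtime_error("NUMPY_ARRAY_CONVERTER: received null PyObject");
  }
  std::vector<double> result;
  // ... PyArray_Check(pyobj), element copy, and reference handling as before ...
  return result;
}
```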
Use 'hash' parameter name matching existing counter_for/done_with conventions. Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
- Initialize ready_to_flush_ to false; set to true only in set_flush_value() to prevent premature fold completion before flush is received
- Add pending counter to track in-flight fold operations, preventing is_complete() from returning true while events are still being processed
- Add mark_pending/unmark_pending to store_counter and count_stores
- Call mark_pending before fold computation and unmark after increment
- Fix decref_all to use safe_decref that checks for null before Py_DECREF
- Add null check in BASIC_CONVERTER py_to_* to prevent null dereference

Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
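A sketch of the null-safe decref mentioned above; in effect it matches CPython's Py_XDECREF, which skips the decrement for null pointers:

```cpp
#include <Python.h>

// Decrement only when the pointer is non-null; Py_DECREF(nullptr) is UB.
inline void safe_decref(PyObject* obj)
{
  if (obj != nullptr) {
    Py_DECREF(obj);
  }
}
```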
All converter functions that previously returned null intptr_t or null shared_ptr now throw std::runtime_error with descriptive messages:

1. VECTOR_CONVERTER: throw instead of returning (intptr_t)nullptr when:
   - Input shared_ptr is null
   - PyArray_SimpleNewFromData fails
   - PhlexLifeline allocation fails

2. BASIC_CONVERTER _to_py: check result of Python C API conversion function and throw if it returns null.

3. NUMPY_ARRAY_CONVERTER: throw instead of returning empty vector when input pyobj is null.

4. callv: use lifeline_transform for consistency with call, ensuring PhlexLifeline arguments are properly unwrapped for observers.

Returning null intptr_t from converters causes PyObject_CallFunctionObjArgs to interpret the null as its argument-list sentinel, silently truncating arguments. This manifests as "missing required positional argument" errors (py:vectypes) or SEGFAULT from null pointer dereference (py:veclists).

Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
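A sketch of the failure mode and the throwing behaviour described above. PyObject_CallFunctionObjArgs takes a NULL-terminated argument list, so a converter that hands back a null PyObject* terminates the list early instead of reporting an error; make_numpy_array and the error text below are illustrative, not the repository's names:

```cpp
#include <Python.h>
#include <memory>
#include <stdexcept>
#include <vector>

PyObject* make_numpy_array(std::vector<double> const&); // hypothetical wrapper over PyArray_SimpleNewFromData

PyObject* vector_to_py(std::shared_ptr<std::vector<double>> const& data)
{
  if (!data) {
    throw std::runtime_error("VECTOR_CONVERTER: null input product");
  }
  PyObject* arr = make_numpy_array(*data);
  if (arr == nullptr) {
    // Returning nullptr here would be read by PyObject_CallFunctionObjArgs as
    // the end-of-arguments sentinel, silently truncating the call's arguments.
    throw std::runtime_error("VECTOR_CONVERTER: failed to create numpy array");
  }
  return arr; // new reference owned by the caller
}
```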
Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
Per maintainer review:
- Revert declared_fold.hpp to main (separate PR per @knoepfel)
- Revert store_counters.hpp/cpp to main (separate PR per @knoepfel)
- Revert declared_transform.hpp try/catch to main (speculative armoring)
- Remove BASIC_CONVERTER _to_py null check on topy() result (cannot fail by construction)
- Remove BASIC_CONVERTER py_to_ null check on pyobj (non-null by construction)
- Remove NUMPY_ARRAY_CONVERTER null check on pyobj (non-null by construction)

Retained real fixes:
- callv uses lifeline_transform for consistency with call
- safe_decref prevents UB on null Py_DECREF
- VECTOR_CONVERTER throws instead of returning (intptr_t)nullptr

Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
Add Py_XINCREF in call()/callv() before Python function invocation so that the product store's cached reference survives the subsequent decref_all. Without this, if the framework's declared_transform caching reuses the same product store, the PyObject could be freed by one call's decref and then accessed as a dangling pointer by the next.

Restore callv() to use (PyObject*)args... instead of lifeline_transform(args)..., matching PR 213's intentional asymmetric design.

Fix null pointer dereference in ll_new when tp_alloc fails.

Revert prior changes to core framework files (store_counters, declared_transform, declared_fold) since PR 314 addresses the thread-unsafe caching patterns that are the likely root cause of the intermittent failures.

Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
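A hedged sketch of the protection described in this commit, reduced to a single-argument case (the real call()/callv() are variadic and use decref_all): take an extra reference on the cached argument around the Python call so the product store's reference survives the post-call decref.

```cpp
#include <Python.h>

PyObject* guarded_call(PyObject* func, PyObject* cached_arg)
{
  Py_XINCREF(cached_arg);  // +1: keep the cached object alive across the call
  PyObject* result =
    PyObject_CallFunctionObjArgs(func, cached_arg, static_cast<PyObject*>(nullptr));
  Py_XDECREF(cached_arg);  // -1: release only the extra reference taken above
  return result;           // new reference (or nullptr with a Python error set)
}
```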
Per Wim (original author), lifeline_transform() should be used symmetrically in both call() and callv(). Restore callv() to use lifeline_transform(args)... matching call(). Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
Fully restore store_counters.hpp/cpp, declared_fold.hpp, declared_transform.hpp to their main branch versions. The fold race condition fix belongs in PR #314. Remove speculative null checks from BASIC_CONVERTER and NUMPY_ARRAY_CONVERTER that are not needed by construction.

Retain only targeted Python fixes:
- Py_XINCREF in call()/callv() for refcount safety with caching
- lifeline_transform symmetric in call()/callv() per original author
- safe_decref in decref_all for null safety
- VECTOR_CONVERTER throws instead of returning null sentinel
- lifelinewrap.cpp: return nullptr on tp_alloc failure

Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
When stores_.insert() creates a new entry with a null product store and call() throws, the null entry persists in the cache. Subsequent messages with the same hash hit the else branch and send the null product store downstream, causing SEGFAULTs (e.g., when downstream converters call PyArray_Check(nullptr)). The try/catch erases the stale entry on exception, ensuring the cache remains consistent.

Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
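A minimal sketch of the cache-consistency fix described above; stores_ follows the commit text, while the map type and the helper names (product_store, run_callback) are assumptions for illustration:

```cpp
#include <cstddef>
#include <map>
#include <memory>

struct product_store;  // placeholder for the framework's product store type

std::map<std::size_t, std::shared_ptr<product_store>> stores_;

std::shared_ptr<product_store> run_callback(std::size_t hash);  // may throw

std::shared_ptr<product_store> cached_result(std::size_t const hash)
{
  auto [it, inserted] = stores_.try_emplace(hash);  // new entries start out null
  if (inserted) {
    try {
      it->second = run_callback(hash);              // user callback may throw
    }
    catch (...) {
      stores_.erase(it);  // drop the stale null entry so later hits recompute
      throw;              // propagate the original exception
    }
  }
  return it->second;      // either freshly computed or reused from the cache
}
```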
Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
Output converters (py_to_* in BASIC_CONVERTER and NUMPY_ARRAY_CONVERTER) were calling Py_DECREF on their input, which is a borrowed reference from the upstream product store cache. When declared_transform's caching reuses the same product store for multiple events with the same hash, the first event's converter frees the PyObject, and subsequent events access freed memory, causing intermittent SEGFAULTs in py:coverage, py:veclists, etc.

Fix: Remove Py_DECREF from all output converters. The input is a borrowed reference owned by the product store cache. Only py_callback::call()/callv() needs XINCREF/XDECREF around the Python function call to protect against concurrent cache eviction.

Also integrates PR #314's thread-safety fixes for store_counters, products.hpp, multiplexer, filter_impl, and all declared_* node types.

Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
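A sketch of the ownership rule described above for an output converter (illustrative signature): the PyObject* handed in from the product store cache is treated as a borrowed reference, so the converter reads through it but never decrements it.

```cpp
#include <Python.h>

double py_to_double(PyObject* pyobj)
{
  double const value = PyFloat_AsDouble(pyobj);  // read via the borrowed reference
  // No Py_DECREF(pyobj): the product store cache still owns this object and
  // may hand the same pointer to later events that hit the same hash.
  return value;
}
```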
Separate concerns:
- Remove all PR #314 core framework changes (store_counters, products, multiplexer, declared_observer/predicate/provider/unfold, edge_maker, filter_impl, node_catalog, test files). These belong in PR #314.
- Keep only the try/catch in declared_transform.hpp (exception safety for cache entries when call() throws).

Add Python reference counting model documentation:
- plugins/python/src/python-refcounting.md describes the ownership rules for PyObject* references flowing through declared_transform caching, including why the bug was masked by CPython's small integer cache.

Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
- Not germane to the goal of this PR
- This reverts commit feb5639.
Force-pushed from 97efc5f to 483a1a2.
I don't understand this one at all. It should not have been leaking before, but now you make it leak. The idea was that each input has a unique pointer providing the conversion. This is why the node is named with a combo of the product and the consumer. The converter creates a new reference, the consumer removes the reference after using the object, so no leaks. It's all one-to-one. The addition of the ref-count before adds a total of +1 compared to the prior code, so now it leaks. The INCREF/DECREF around the call don't serve a purpose. It is perfectly possible that there's some misunderstanding on my side about the TBB graph, but leaking the products from converter nodes cannot possibly be a solution in any way or form: there could be potentially millions of products.
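For illustration only (a hypothetical converter/consumer pair, not the repository's code), the one-to-one model described in this comment: the converter node creates exactly one new reference per product, and its single consumer releases exactly that reference after use, so nothing accumulates.

```cpp
#include <Python.h>

PyObject* to_py(double value)
{
  return PyFloat_FromDouble(value);  // +1: new reference owned by the edge
}

void consume(PyObject* pyobj)
{
  double const v = PyFloat_AsDouble(pyobj);
  Py_DECREF(pyobj);                  // -1: balances the converter's +1, no leak
  (void)v;
}
```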
The previous PR #313 document described an incorrect model with XINCREF/XDECREF wrapping. This version documents the correct one-to-one ownership model: each converter creates a reference, each consumer DECREFs it exactly once. Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
Python
py: tests (vectypes, veclists, coverage) SEGFAULT intermittently because declared_transform caches product stores by hash, but Python converters were decrementing references owned by the cache, freeing objects that get reused on cache hits. Masked for small integers by CPython's immortal integer cache; crashes on floats and numpy arrays.

Reference counting fixes (modulewrap.cpp)
- Py_DECREF removed from output converters (BASIC_CONVERTER py_to_*, NUMPY_ARRAY_CONVERTER py_to_*): input is a borrowed reference from the product store cache, not an owned reference
- Py_XINCREF/Py_XDECREF in call()/callv(): protect cached references during the Python function call
- lifeline_transform() symmetric in call() and callv() per original author
- VECTOR_CONVERTER error paths throw instead of returning (intptr_t)nullptr, which acts as the PyObject_CallFunctionObjArgs sentinel and silently truncates arguments
- decref_all/safe_decref helpers: replaced by inline Py_XDECREF fold expressions

Transform exception safety (declared_transform.hpp)
- call(): stores_.insert() creates a null entry; if call() throws before assignment, the stale entry causes SEGFAULT on cache reuse

Lifeline allocation fix (lifelinewrap.cpp)
- ll_new: return nullptr on tp_alloc failure instead of falling through to a null dereference

Documentation (python-refcounting.md)
Documents the ownership model for PyObject* references flowing through declared_transform caching, including why CPython's small integer cache masked this class of bug.