Fix Python reference counting for declared_transform product store caching by Copilot · Pull Request #313 · Framework-R-D/phlex

Copilot · 2026-02-11T14:54:28Z

Python py: tests (vectypes, veclists, coverage) SEGFAULT intermittently because declared_transform caches product stores by hash, but Python converters were decrementing references owned by the cache—freeing objects that get reused on cache hits. Masked for small integers by CPython's immortal integer cache; crashes on floats and numpy arrays.

Reference counting fixes (`modulewrap.cpp`)

Remove Py_DECREF from output converters (BASIC_CONVERTER py_to_*, NUMPY_ARRAY_CONVERTER py_to_*): input is a borrowed reference from the product store cache, not an owned reference
Add Py_XINCREF/Py_XDECREF in call()/callv(): protect cached references during the Python function call
Make lifeline_transform() symmetric in call() and callv() per original author
VECTOR_CONVERTER error paths throw instead of returning (intptr_t)nullptr, which acts as PyObject_CallFunctionObjArgs sentinel and silently truncates arguments
Remove decref_all/safe_decref helpers — replaced by inline Py_XDECREF fold expressions

Transform exception safety (`declared_transform.hpp`)

try/catch around call(): stores_.insert() creates a null entry; if call() throws before assignment, the stale entry causes SEGFAULT on cache reuse

Lifeline allocation fix (`lifelinewrap.cpp`)

ll_new: return nullptr on tp_alloc failure instead of falling through to null dereference

Documentation (`python-refcounting.md`)

Documents the ownership model for PyObject* references flowing through declared_transform caching, including why CPython's small integer cache masked this class of bug.

💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.

greenc-FNAL · 2026-02-11T15:46:27Z

@phlexbot clang-fix

github-actions · 2026-02-11T15:46:55Z

Automatic clang-format fixes pushed (commit f466e7a).
⚠️ Note: Some issues may require manual review and fixing.

greenc-FNAL · 2026-02-11T15:49:33Z

Review the full CodeQL report for details.

greenc-FNAL · 2026-02-11T16:28:04Z

@phlexbot clang-fix

github-actions · 2026-02-11T16:28:35Z

No automatic clang-format fixes were necessary.

knoepfel · 2026-02-11T16:29:21Z

@greenc-FNAL, I suggest placing the declared_fold and store_counters changes into a separate PR. How all of that caching works will be changing substantially, and the different_hierarchies failure is unrelated to the Python failures.

greenc-FNAL · 2026-02-11T17:10:48Z

@phlexbot clang-fix

github-actions · 2026-02-11T17:11:27Z

Automatic clang-format fixes pushed (commit 56c45ee).
⚠️ Note: Some issues may require manual review and fixing.

greenc-FNAL · 2026-02-11T18:30:32Z

@greenc-FNAL, I suggest placing the declared_fold and store_counters changes into a separate PR. How all of that caching works will be changing substantially, and the different_hierarchies failure is unrelated to the Python failures.

Definitely: a lot of this is pointless thrashing around that will have to be pruned anyway (assuming it actually fixes either problem, which is not a given right now).

Copilot

Pull request overview

This PR addresses two correctness issues: premature fold completion due to a race in fold counters, and Python converter failures propagating null pointers into PyObject_CallFunctionObjArgs, which can truncate varargs and crash Python-based tests.

Changes:

Harden Python callback/converter handling by ensuring lifeline unwrapping is consistent, converter failures throw, and decref is null-safe.
Fix fold completion race by initializing ready_to_flush_ to false, adding a “pending work” guard, and keeping the TBB accessor across increment+completion checks.
Improve transform exception safety by erasing stale stores_ entries when callbacks throw.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
`plugins/python/src/modulewrap.cpp`	Prevent null/sentinel propagation into Python C-API calls; make conversions fail-fast and decref null-safe.
`phlex/core/store_counters.hpp`	Add pending-work tracking and adjust flush readiness initialization for fold completion safety.
`phlex/core/store_counters.cpp`	Implement pending-aware completion checks and accessor-held increment/flush completion helpers.
`phlex/core/declared_transform.hpp`	Erase inserted-but-unassigned cache entries when callback throws to avoid stale null reuse.
`phlex/core/declared_fold.hpp`	Switch fold counter updates to new pending/atomic-safe helper methods.

Comments suppressed due to low confidence (1)

phlex/core/store_counters.cpp:59

++counts_[layer_hash]; default-constructs a std::atomic<std::size_t> when the key is new. std::atomic's default constructor does not value-initialize the contained value, so the first increment can read an indeterminate value (undefined behavior). Initialize the mapped atomic explicitly on first insert (e.g., emplace with std::atomic<std::size_t>{0} and then fetch_add(1)).

  void store_counter::increment(data_cell_index::hash_type const layer_hash)
  {
    ++counts_[layer_hash];
  }

phlex/core/declared_fold.hpp

…letion check The store_counter::is_complete() check was racing with concurrent increment() calls. In TBB multifunction_nodes with unlimited concurrency, the counter_for() accessor was released before increment() was called, allowing is_complete() in another thread's done_with() to see a partially-updated counts_ map. This caused folds with multiple layer types (like different_hierarchies) to complete prematurely when only a subset of layer hashes had been fully counted. Fix: Add increment_and_check() and flush_and_check() methods to count_stores that hold the TBB hash map accessor across both the increment/flush and the completion check, ensuring atomic read-modify-check semantics. Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>

…RAY_CONVERTER null check Two fixes for py:vectypes: 1. declared_transform.hpp: When stores_.insert() creates a new entry with a default (null) value and the user callback throws, the null entry remained in the cache. On subsequent messages with the same hash, the null store was reused and sent downstream, causing null pointer propagation through the graph. Fix: wrap the callback in try/catch and erase the entry on exception. 2. modulewrap.cpp: Restore the null check for pyobj in NUMPY_ARRAY_CONVERTER that was removed in PR #213. Without it, PyArray_Check(nullptr) and Py_DECREF(nullptr) cause undefined behavior, leading to intermittent SEGFAULTs or incorrect behavior. Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>

Use 'hash' parameter name matching existing counter_for/done_with conventions. Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>

- Initialize ready_to_flush_ to false; set to true only in set_flush_value() to prevent premature fold completion before flush is received - Add pending counter to track in-flight fold operations, preventing is_complete() from returning true while events are still being processed - Add mark_pending/unmark_pending to store_counter and count_stores - Call mark_pending before fold computation and unmark after increment - Fix decref_all to use safe_decref that checks for null before Py_DECREF - Add null check in BASIC_CONVERTER py_to_* to prevent null dereference Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>

All converter functions that previously returned null intptr_t or null shared_ptr now throw std::runtime_error with descriptive messages: 1. VECTOR_CONVERTER: throw instead of returning (intptr_t)nullptr when: - Input shared_ptr is null - PyArray_SimpleNewFromData fails - PhlexLifeline allocation fails 2. BASIC_CONVERTER _to_py: check result of Python C API conversion function and throw if it returns null. 3. NUMPY_ARRAY_CONVERTER: throw instead of returning empty vector when input pyobj is null. 4. callv: use lifeline_transform for consistency with call, ensuring PhlexLifeline arguments are properly unwrapped for observers. Returning null intptr_t from converters causes PyObject_CallFunctionObjArgs to interpret the null as its argument-list sentinel, silently truncating arguments. This manifests as "missing required positional argument" errors (py:vectypes) or SEGFAULT from null pointer dereference (py:veclists). Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>

Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>

@knoepfel

Per maintainer review: - Revert declared_fold.hpp to main (separate PR per @knoepfel) - Revert store_counters.hpp/cpp to main (separate PR per @knoepfel) - Revert declared_transform.hpp try/catch to main (speculative armoring) - Remove BASIC_CONVERTER _to_py null check on topy() result (cannot fail by construction) - Remove BASIC_CONVERTER py_to_ null check on pyobj (non-null by construction) - Remove NUMPY_ARRAY_CONVERTER null check on pyobj (non-null by construction) Retained real fixes: - callv uses lifeline_transform for consistency with call - safe_decref prevents UB on null Py_DECREF - VECTOR_CONVERTER throws instead of returning (intptr_t)nullptr Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>

Add Py_XINCREF in call()/callv() before Python function invocation so that the product store's cached reference survives the subsequent decref_all. Without this, if the framework's declared_transform caching reuses the same product store, the PyObject could be freed by one call's decref and then accessed as a dangling pointer by the next. Restore callv() to use (PyObject*)args... instead of lifeline_transform(args)..., matching PR 213's intentional asymmetric design. Fix null pointer dereference in ll_new when tp_alloc fails. Revert prior changes to core framework files (store_counters, declared_transform, declared_fold) since PR 314 addresses the thread-unsafe caching patterns that are the likely root cause of the intermittent failures. Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>

Per Wim (original author), lifeline_transform() should be used symmetrically in both call() and callv(). Restore callv() to use lifeline_transform(args)... matching call(). Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>

Fully restore store_counters.hpp/cpp, declared_fold.hpp, declared_transform.hpp to their main branch versions. The fold race condition fix belongs in PR #314. Remove speculative null checks from BASIC_CONVERTER and NUMPY_ARRAY_CONVERTER that are not needed by construction. Retain only targeted Python fixes: - Py_XINCREF in call()/callv() for refcount safety with caching - lifeline_transform symmetric in call()/callv() per original author - safe_decref in decref_all for null safety - VECTOR_CONVERTER throws instead of returning null sentinel - lifelinewrap.cpp: return nullptr on tp_alloc failure Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>

When stores_.insert() creates a new entry with a null product store and call() throws, the null entry persists in the cache. Subsequent messages with the same hash hit the else branch and send the null product store downstream, causing SEGFAULTs (e.g., when downstream converters call PyArray_Check(nullptr)). The try/catch erases the stale entry on exception, ensuring the cache remains consistent. Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>

Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>

Output converters (py_to_* in BASIC_CONVERTER and NUMPY_ARRAY_CONVERTER) were calling Py_DECREF on their input, which is a borrowed reference from the upstream product store cache. When declared_transform's caching reuses the same product store for multiple events with the same hash, the first event's converter frees the PyObject, and subsequent events access freed memory — causing intermittent SEGFAULTs in py:coverage, py:veclists, etc. Fix: Remove Py_DECREF from all output converters. The input is a borrowed reference owned by the product store cache. Only py_callback::call()/callv() needs XINCREF/XDECREF around the Python function call to protect against concurrent cache eviction. Also integrates PR #314's thread-safety fixes for store_counters, products.hpp, multiplexer, filter_impl, and all declared_* node types. Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>

Separate concerns: - Remove all PR #314 core framework changes (store_counters, products, multiplexer, declared_observer/predicate/provider/unfold, edge_maker, filter_impl, node_catalog, test files). These belong in PR #314. - Keep only the try/catch in declared_transform.hpp (exception safety for cache entries when call() throws). Add Python reference counting model documentation: - plugins/python/src/python-refcounting.md describes the ownership rules for PyObject* references flowing through declared_transform caching, including why the bug was masked by CPython's small integer cache. Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>

- Not germaine to the goal of this PR - This reverts commit feb5639.

wlav · 2026-02-11T23:11:40Z

I don't understand this one at all. It should not have been leaking before, but now you make it leak.

The idea was that each input has a unique pointer providing the conversion. This is why the node is named with a combo of the product and the consumer. The converter creates a new reference, the consumer removes the reference after using the object, so no leaks. It's all one-to-one.

The addition of the ref-count before, adds a total +1 compared to the prior code, so now leak.

The INCREF/DECREF around the call don't serve a purpose.

It is perfectly well possible that there's some misunderstanding on my side about the TBB graph, but leaking the products from converter nodes can not possibly be a solution in any way or form: there could be potentially millions of products.

…store caching (#313)" This reverts commit e348b4d.

The previous PR #313 document described an incorrect model with XINCREF/XDECREF wrapping. This version documents the correct one-to-one ownership model: each converter creates a reference, each consumer DECREFs it exactly once. Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>

Copilot AI assigned Copilot and greenc-FNAL Feb 11, 2026

Copilot started work on behalf of greenc-FNAL February 11, 2026 14:54 View session

Copilot AI changed the title ~~[WIP] Analyze and resolve test failures in CI action~~ Fix race conditions in fold completion and transform exception handling Feb 11, 2026

Copilot AI requested a review from greenc-FNAL February 11, 2026 15:43

Copilot finished work on behalf of greenc-FNAL February 11, 2026 15:43

greenc-FNAL force-pushed the copilot/analyze-test-failures branch from 5bde8f4 to 94478bb Compare February 11, 2026 15:45

Copilot started work on behalf of greenc-FNAL February 11, 2026 15:56 View session

Copilot AI changed the title ~~Fix race conditions in fold completion and transform exception handling~~ Fix fold race condition and Python null pointer dereferences Feb 11, 2026

Copilot finished work on behalf of greenc-FNAL February 11, 2026 16:24

Copilot started work on behalf of greenc-FNAL February 11, 2026 16:48 View session

Copilot AI changed the title ~~Fix fold race condition and Python null pointer dereferences~~ Fix fold race condition and Python converter null propagation Feb 11, 2026

Copilot finished work on behalf of greenc-FNAL February 11, 2026 17:05

Copilot started work on behalf of greenc-FNAL February 11, 2026 18:27 View session

Copilot AI changed the title ~~Fix fold race condition and Python converter null propagation~~ Fix fold race condition and harden Python converter null propagation Feb 11, 2026

Copilot finished work on behalf of greenc-FNAL February 11, 2026 18:34

greenc-FNAL marked this pull request as ready for review February 11, 2026 18:46

Copilot AI review requested due to automatic review settings February 11, 2026 18:46

Copilot started reviewing on behalf of greenc-FNAL February 11, 2026 18:46 View session

Copilot started work on behalf of greenc-FNAL February 11, 2026 18:48 View session

Copilot AI reviewed Feb 11, 2026

View reviewed changes

phlex/core/declared_fold.hpp Outdated Show resolved Hide resolved

greenc-FNAL requested review from knoepfel and wlav February 11, 2026 22:56

Copilot AI and others added 17 commits February 11, 2026 16:59

Address review feedback: rename counter_hash to hash for consistency

8d43722

Use 'hash' parameter name matching existing counter_for/done_with conventions. Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>

Apply clang-format fixes

f884400

Improve error messages in NUMPY_ARRAY_CONVERTER null check

4e73d0c

Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>

Apply clang-format fixes

6606f0c

Plan: correct reference counting model + integrate PR 314

b3fbdd1

Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>

Revert "Re-add declared_transform try/catch for exception safety"

483a1a2

- Not germaine to the goal of this PR - This reverts commit feb5639.

greenc-FNAL force-pushed the copilot/analyze-test-failures branch from 97efc5f to 483a1a2 Compare February 11, 2026 22:59

knoepfel approved these changes Feb 11, 2026

View reviewed changes

knoepfel merged commit e348b4d into main Feb 11, 2026
47 checks passed

knoepfel deleted the copilot/analyze-test-failures branch February 11, 2026 23:11

knoepfel added a commit that referenced this pull request Feb 11, 2026

Revert "Fix Python reference counting for declared_transform product …

b4db88e

…store caching (#313)" This reverts commit e348b4d.

greenc-FNAL restored the copilot/analyze-test-failures branch February 11, 2026 23:16

Copilot AI mentioned this pull request Feb 11, 2026

Fix Python/C++ interop bugs exposed by PR #213 tests #315

Closed

greenc-FNAL deleted the copilot/analyze-test-failures branch February 13, 2026 00:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix Python reference counting for declared_transform product store caching#313

Fix Python reference counting for declared_transform product store caching#313
knoepfel merged 17 commits intomainfrom
copilot/analyze-test-failures

Copilot AI commented Feb 11, 2026 •

edited

Loading

Uh oh!

greenc-FNAL commented Feb 11, 2026

Uh oh!

github-actions bot commented Feb 11, 2026

Uh oh!

greenc-FNAL commented Feb 11, 2026

Uh oh!

greenc-FNAL commented Feb 11, 2026

Uh oh!

github-actions bot commented Feb 11, 2026

Uh oh!

knoepfel commented Feb 11, 2026

Uh oh!

greenc-FNAL commented Feb 11, 2026

Uh oh!

github-actions bot commented Feb 11, 2026

Uh oh!

greenc-FNAL commented Feb 11, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

wlav commented Feb 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

Copilot AI commented Feb 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reference counting fixes (modulewrap.cpp)

Transform exception safety (declared_transform.hpp)

Lifeline allocation fix (lifelinewrap.cpp)

Documentation (python-refcounting.md)

Uh oh!

greenc-FNAL commented Feb 11, 2026

Uh oh!

github-actions bot commented Feb 11, 2026

Uh oh!

greenc-FNAL commented Feb 11, 2026

Uh oh!

greenc-FNAL commented Feb 11, 2026

Uh oh!

github-actions bot commented Feb 11, 2026

Uh oh!

knoepfel commented Feb 11, 2026

Uh oh!

greenc-FNAL commented Feb 11, 2026

Uh oh!

github-actions bot commented Feb 11, 2026

Uh oh!

greenc-FNAL commented Feb 11, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

wlav commented Feb 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Copilot AI commented Feb 11, 2026 •

edited

Loading

Reference counting fixes (`modulewrap.cpp`)

Transform exception safety (`declared_transform.hpp`)

Lifeline allocation fix (`lifelinewrap.cpp`)

Documentation (`python-refcounting.md`)