Skip to content

Fix Python reference counting for declared_transform product store caching#313

Merged
knoepfel merged 17 commits intomainfrom
copilot/analyze-test-failures
Feb 11, 2026
Merged

Fix Python reference counting for declared_transform product store caching#313
knoepfel merged 17 commits intomainfrom
copilot/analyze-test-failures

Conversation

Copy link
Contributor

Copilot AI commented Feb 11, 2026

Python py: tests (vectypes, veclists, coverage) SEGFAULT intermittently because declared_transform caches product stores by hash, but Python converters were decrementing references owned by the cache—freeing objects that get reused on cache hits. Masked for small integers by CPython's immortal integer cache; crashes on floats and numpy arrays.

Reference counting fixes (modulewrap.cpp)

  • Remove Py_DECREF from output converters (BASIC_CONVERTER py_to_*, NUMPY_ARRAY_CONVERTER py_to_*): input is a borrowed reference from the product store cache, not an owned reference
  • Add Py_XINCREF/Py_XDECREF in call()/callv(): protect cached references during the Python function call
  • Make lifeline_transform() symmetric in call() and callv() per original author
  • VECTOR_CONVERTER error paths throw instead of returning (intptr_t)nullptr, which acts as PyObject_CallFunctionObjArgs sentinel and silently truncates arguments
  • Remove decref_all/safe_decref helpers — replaced by inline Py_XDECREF fold expressions

Transform exception safety (declared_transform.hpp)

  • try/catch around call(): stores_.insert() creates a null entry; if call() throws before assignment, the stale entry causes SEGFAULT on cache reuse

Lifeline allocation fix (lifelinewrap.cpp)

  • ll_new: return nullptr on tp_alloc failure instead of falling through to null dereference

Documentation (python-refcounting.md)

Documents the ownership model for PyObject* references flowing through declared_transform caching, including why CPython's small integer cache masked this class of bug.


💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.

Copilot AI changed the title [WIP] Analyze and resolve test failures in CI action Fix race conditions in fold completion and transform exception handling Feb 11, 2026
Copilot AI requested a review from greenc-FNAL February 11, 2026 15:43
@greenc-FNAL greenc-FNAL force-pushed the copilot/analyze-test-failures branch from 5bde8f4 to 94478bb Compare February 11, 2026 15:45
@greenc-FNAL
Copy link
Contributor

@phlexbot clang-fix

@github-actions
Copy link
Contributor

Automatic clang-format fixes pushed (commit f466e7a).
⚠️ Note: Some issues may require manual review and fixing.

@greenc-FNAL
Copy link
Contributor

Review the full CodeQL report for details.

Copilot AI changed the title Fix race conditions in fold completion and transform exception handling Fix fold race condition and Python null pointer dereferences Feb 11, 2026
@greenc-FNAL
Copy link
Contributor

@phlexbot clang-fix

@github-actions
Copy link
Contributor

No automatic clang-format fixes were necessary.

@knoepfel
Copy link
Member

@greenc-FNAL, I suggest placing the declared_fold and store_counters changes into a separate PR. How all of that caching works will be changing substantially, and the different_hierarchies failure is unrelated to the Python failures.

Copilot AI changed the title Fix fold race condition and Python null pointer dereferences Fix fold race condition and Python converter null propagation Feb 11, 2026
@greenc-FNAL
Copy link
Contributor

@phlexbot clang-fix

@github-actions
Copy link
Contributor

Automatic clang-format fixes pushed (commit 56c45ee).
⚠️ Note: Some issues may require manual review and fixing.

@greenc-FNAL
Copy link
Contributor

@greenc-FNAL, I suggest placing the declared_fold and store_counters changes into a separate PR. How all of that caching works will be changing substantially, and the different_hierarchies failure is unrelated to the Python failures.

Definitely: a lot of this is pointless thrashing around that will have to be pruned anyway (assuming it actually fixes either problem, which is not a given right now).

Copilot AI changed the title Fix fold race condition and Python converter null propagation Fix fold race condition and harden Python converter null propagation Feb 11, 2026
@greenc-FNAL greenc-FNAL marked this pull request as ready for review February 11, 2026 18:46
Copilot AI review requested due to automatic review settings February 11, 2026 18:46
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses two correctness issues: premature fold completion due to a race in fold counters, and Python converter failures propagating null pointers into PyObject_CallFunctionObjArgs, which can truncate varargs and crash Python-based tests.

Changes:

  • Harden Python callback/converter handling by ensuring lifeline unwrapping is consistent, converter failures throw, and decref is null-safe.
  • Fix fold completion race by initializing ready_to_flush_ to false, adding a “pending work” guard, and keeping the TBB accessor across increment+completion checks.
  • Improve transform exception safety by erasing stale stores_ entries when callbacks throw.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
plugins/python/src/modulewrap.cpp Prevent null/sentinel propagation into Python C-API calls; make conversions fail-fast and decref null-safe.
phlex/core/store_counters.hpp Add pending-work tracking and adjust flush readiness initialization for fold completion safety.
phlex/core/store_counters.cpp Implement pending-aware completion checks and accessor-held increment/flush completion helpers.
phlex/core/declared_transform.hpp Erase inserted-but-unassigned cache entries when callback throws to avoid stale null reuse.
phlex/core/declared_fold.hpp Switch fold counter updates to new pending/atomic-safe helper methods.
Comments suppressed due to low confidence (1)

phlex/core/store_counters.cpp:59

  • ++counts_[layer_hash]; default-constructs a std::atomic<std::size_t> when the key is new. std::atomic's default constructor does not value-initialize the contained value, so the first increment can read an indeterminate value (undefined behavior). Initialize the mapped atomic explicitly on first insert (e.g., emplace with std::atomic<std::size_t>{0} and then fetch_add(1)).
  void store_counter::increment(data_cell_index::hash_type const layer_hash)
  {
    ++counts_[layer_hash];
  }

@greenc-FNAL greenc-FNAL requested review from knoepfel and wlav February 11, 2026 22:56
Copilot AI and others added 17 commits February 11, 2026 16:59
…letion check

The store_counter::is_complete() check was racing with concurrent increment()
calls. In TBB multifunction_nodes with unlimited concurrency, the counter_for()
accessor was released before increment() was called, allowing is_complete() in
another thread's done_with() to see a partially-updated counts_ map. This caused
folds with multiple layer types (like different_hierarchies) to complete
prematurely when only a subset of layer hashes had been fully counted.

Fix: Add increment_and_check() and flush_and_check() methods to count_stores
that hold the TBB hash map accessor across both the increment/flush and the
completion check, ensuring atomic read-modify-check semantics.

Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
…RAY_CONVERTER null check

Two fixes for py:vectypes:

1. declared_transform.hpp: When stores_.insert() creates a new entry with a
   default (null) value and the user callback throws, the null entry remained
   in the cache. On subsequent messages with the same hash, the null store was
   reused and sent downstream, causing null pointer propagation through the
   graph. Fix: wrap the callback in try/catch and erase the entry on exception.

2. modulewrap.cpp: Restore the null check for pyobj in NUMPY_ARRAY_CONVERTER
   that was removed in PR #213. Without it, PyArray_Check(nullptr) and
   Py_DECREF(nullptr) cause undefined behavior, leading to intermittent
   SEGFAULTs or incorrect behavior.

Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
Use 'hash' parameter name matching existing counter_for/done_with conventions.

Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
- Initialize ready_to_flush_ to false; set to true only in set_flush_value()
  to prevent premature fold completion before flush is received
- Add pending counter to track in-flight fold operations, preventing
  is_complete() from returning true while events are still being processed
- Add mark_pending/unmark_pending to store_counter and count_stores
- Call mark_pending before fold computation and unmark after increment
- Fix decref_all to use safe_decref that checks for null before Py_DECREF
- Add null check in BASIC_CONVERTER py_to_* to prevent null dereference

Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
All converter functions that previously returned null intptr_t or null
shared_ptr now throw std::runtime_error with descriptive messages:

1. VECTOR_CONVERTER: throw instead of returning (intptr_t)nullptr when:
   - Input shared_ptr is null
   - PyArray_SimpleNewFromData fails
   - PhlexLifeline allocation fails

2. BASIC_CONVERTER _to_py: check result of Python C API conversion
   function and throw if it returns null.

3. NUMPY_ARRAY_CONVERTER: throw instead of returning empty vector when
   input pyobj is null.

4. callv: use lifeline_transform for consistency with call, ensuring
   PhlexLifeline arguments are properly unwrapped for observers.

Returning null intptr_t from converters causes PyObject_CallFunctionObjArgs
to interpret the null as its argument-list sentinel, silently truncating
arguments. This manifests as "missing required positional argument" errors
(py:vectypes) or SEGFAULT from null pointer dereference (py:veclists).

Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
Per maintainer review:
- Revert declared_fold.hpp to main (separate PR per @knoepfel)
- Revert store_counters.hpp/cpp to main (separate PR per @knoepfel)
- Revert declared_transform.hpp try/catch to main (speculative armoring)
- Remove BASIC_CONVERTER _to_py null check on topy() result (cannot fail by construction)
- Remove BASIC_CONVERTER py_to_ null check on pyobj (non-null by construction)
- Remove NUMPY_ARRAY_CONVERTER null check on pyobj (non-null by construction)

Retained real fixes:
- callv uses lifeline_transform for consistency with call
- safe_decref prevents UB on null Py_DECREF
- VECTOR_CONVERTER throws instead of returning (intptr_t)nullptr

Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
Add Py_XINCREF in call()/callv() before Python function invocation so
that the product store's cached reference survives the subsequent
decref_all. Without this, if the framework's declared_transform
caching reuses the same product store, the PyObject could be freed
by one call's decref and then accessed as a dangling pointer by
the next.

Restore callv() to use (PyObject*)args... instead of
lifeline_transform(args)..., matching PR 213's intentional asymmetric
design.

Fix null pointer dereference in ll_new when tp_alloc fails.

Revert prior changes to core framework files (store_counters,
declared_transform, declared_fold) since PR 314 addresses the
thread-unsafe caching patterns that are the likely root cause
of the intermittent failures.

Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
Per Wim (original author), lifeline_transform() should be used
symmetrically in both call() and callv(). Restore callv() to use
lifeline_transform(args)... matching call().

Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
Fully restore store_counters.hpp/cpp, declared_fold.hpp,
declared_transform.hpp to their main branch versions. The fold
race condition fix belongs in PR #314.

Remove speculative null checks from BASIC_CONVERTER and
NUMPY_ARRAY_CONVERTER that are not needed by construction.

Retain only targeted Python fixes:
- Py_XINCREF in call()/callv() for refcount safety with caching
- lifeline_transform symmetric in call()/callv() per original author
- safe_decref in decref_all for null safety
- VECTOR_CONVERTER throws instead of returning null sentinel
- lifelinewrap.cpp: return nullptr on tp_alloc failure

Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
When stores_.insert() creates a new entry with a null product store
and call() throws, the null entry persists in the cache. Subsequent
messages with the same hash hit the else branch and send the null
product store downstream, causing SEGFAULTs (e.g., when downstream
converters call PyArray_Check(nullptr)).

The try/catch erases the stale entry on exception, ensuring the
cache remains consistent.

Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
Output converters (py_to_* in BASIC_CONVERTER and NUMPY_ARRAY_CONVERTER)
were calling Py_DECREF on their input, which is a borrowed reference from
the upstream product store cache. When declared_transform's caching reuses
the same product store for multiple events with the same hash, the first
event's converter frees the PyObject, and subsequent events access freed
memory — causing intermittent SEGFAULTs in py:coverage, py:veclists, etc.

Fix: Remove Py_DECREF from all output converters. The input is a borrowed
reference owned by the product store cache. Only py_callback::call()/callv()
needs XINCREF/XDECREF around the Python function call to protect against
concurrent cache eviction.

Also integrates PR #314's thread-safety fixes for store_counters,
products.hpp, multiplexer, filter_impl, and all declared_* node types.

Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
Separate concerns:
- Remove all PR #314 core framework changes (store_counters, products,
  multiplexer, declared_observer/predicate/provider/unfold, edge_maker,
  filter_impl, node_catalog, test files). These belong in PR #314.
- Keep only the try/catch in declared_transform.hpp (exception safety
  for cache entries when call() throws).

Add Python reference counting model documentation:
- plugins/python/src/python-refcounting.md describes the ownership
  rules for PyObject* references flowing through declared_transform
  caching, including why the bug was masked by CPython's small integer
  cache.

Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
- Not germaine to the goal of this PR
- This reverts commit feb5639.
@greenc-FNAL greenc-FNAL force-pushed the copilot/analyze-test-failures branch from 97efc5f to 483a1a2 Compare February 11, 2026 22:59
@knoepfel knoepfel merged commit e348b4d into main Feb 11, 2026
47 checks passed
@knoepfel knoepfel deleted the copilot/analyze-test-failures branch February 11, 2026 23:11
@wlav
Copy link
Contributor

wlav commented Feb 11, 2026

I don't understand this one at all. It should not have been leaking before, but now you make it leak.

The idea was that each input has a unique pointer providing the conversion. This is why the node is named with a combo of the product and the consumer. The converter creates a new reference, the consumer removes the reference after using the object, so no leaks. It's all one-to-one.

The addition of the ref-count before, adds a total +1 compared to the prior code, so now leak.

The INCREF/DECREF around the call don't serve a purpose.

It is perfectly well possible that there's some misunderstanding on my side about the TBB graph, but leaking the products from converter nodes can not possibly be a solution in any way or form: there could be potentially millions of products.

knoepfel added a commit that referenced this pull request Feb 11, 2026
@greenc-FNAL greenc-FNAL restored the copilot/analyze-test-failures branch February 11, 2026 23:16
Copilot AI added a commit that referenced this pull request Feb 11, 2026
The previous PR #313 document described an incorrect model with
XINCREF/XDECREF wrapping. This version documents the correct
one-to-one ownership model: each converter creates a reference,
each consumer DECREFs it exactly once.

Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
greenc-FNAL added a commit that referenced this pull request Feb 12, 2026
The previous PR #313 document described an incorrect model with
XINCREF/XDECREF wrapping. This version documents the correct
one-to-one ownership model: each converter creates a reference,
each consumer DECREFs it exactly once.

Co-authored-by: greenc-FNAL <2372949+greenc-FNAL@users.noreply.github.com>
@greenc-FNAL greenc-FNAL deleted the copilot/analyze-test-failures branch February 13, 2026 00:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants