
[Perf] Adstack max-reducer: launch cache + zero-copy result map; content-stable registry_id #671

Merged
duburcqa merged 6 commits into main from duburcqa/adstack_max_reducer_perf
May 9, 2026

Conversation

@duburcqa
Contributor

@duburcqa duburcqa commented May 9, 2026

Adstack max-reducer: launch cache + zero-copy result map; content-stable registry_id

Three commits stacked on top of #635. Two restore the AMDGPU autodiff hot-path FPS that #635 lost (280 → 350 FPS, +25% on the rigid-step repro), one fixes a latent offline-cache correctness gap that surfaced as _get_max_reducer_dispatch_count() >= 1 failing across the AMDGPU test_max_reducer_* suite. All three are scoped to the LLVM-side adstack plumbing; the SPIR-V backend is untouched.

TL;DR

QD_OFFLINE_CACHE=0 GS_ENABLE_NDARRAY=0 python repro_amdgpu_rigid_step.py (RX 7900 XTX, 100 substeps + 1 backward, 3-trial median):

| state | trial 0 | trial 1 | trial 2 |
| --- | --- | --- | --- |
| #635 (regressed) | 277.6 FPS | 283.3 FPS | 283.8 FPS |
| #635 + commit 1 (drop deep copy + per-launch cache) | 318.2 FPS | 322.5 FPS | 324.7 FPS |
| #635 + commits 1 + 2 (shared_ptr) | 342.0 FPS | 348.1 FPS | 351.9 FPS |
| this PR (= #635 + commits 1 + 2 + 3) | 348.1 FPS | 348.1 FPS | 349.2 FPS |
| upper bound (full skip via env-gate) | 357.8 FPS | 367.6 FPS | 368.9 FPS |

AMDGPU test_max_reducer_* suite (QD_WANTED_ARCHS=amdgpu pytest -k max_reducer):

| state | passed | failed |
| --- | --- | --- |
| #635 | 1 | 15 |
| this PR | 16 | 0 |

Why

#635 introduced LlvmRuntimeExecutor::dispatch_max_reducers_for_tasks to dispatch captured MaxOverRange specs in parallel. The dispatcher's design expectation was that the per-spec AdStackCache::max_reducer_cache_ would short-circuit every launch after the first - and instrumentation confirms it does (99.75% per-spec hit rate, near-zero GPU dispatches in steady state). But the hot-path host-side bookkeeping the dispatcher does on the way to that short-circuit added ~290 ms across 2008 launches in the rigid-step bench - about 27% of trial wallclock - dominated by:

  1. The OffloadedTask overload's per-launch deep copy of every task's AdStackSizingInfo into a std::vector<AdStackSizingInfo> ad_stacks_view, including the embedded size_exprs / allocas / bound_expr / max_reducer_specs sub-vectors. Probe data: 125.2 ms / 290 ms (43%) was this single copy loop.
  2. An O(specs) hash-lookup-and-observation-replay loop walking every captured spec through try_max_reducer_cache_hit. Probe data: 62.8 ms / 290 ms (22%).
  3. The current_max_reducer_results_ value-typed map: copied once per launch from the cache entry, then snapshotted again per task in publish_adstack_metadata (the recursive-reentry guard). With ~600 entries per map and ~1.8 tasks per call, that's ~3.4 M map entries copied across the trial.

Separately, AdStackSizingInfo::registry_id was excluded from QD_IO_DEF under the rationale "ids are per-Program lifetime; a deserialised task re-registers itself at the next launch." That comment was aspirational on the LLVM path - only the SPIR-V launcher actually re-registered (in adstack_sizer_launch.cpp). After offline-cache reload an LLVM kernel had registry_id == 0, so the dispatcher's if (registry_id == 0) continue gate skipped every spec, the per-thread sizer fell through to its 1<<24 host-eval cap, and the dispatch silently no-op'd. This is what test_max_reducer_pins_stride_for_oversized_axis[arch=amdgpu-*] and the dispatch-counter tests were exposing. The gate also affected the diagnose-on-overflow path: the codegen-baked cmpxchg(0, registry_id) immediate IS preserved through the LLVM IR text in the offline cache, so on cache reload the LLVM IR wrote an old sequential id while the host registry was empty - any overflow on a freshly cache-loaded kernel would print the generic dual-cause fallback instead of the offending kernel + task identity.

Surface behavior

  • AMDGPU autodiff hot path recovers to ~350 FPS from [AutoDiff] Adstack max-reducer: parallel multi-axis MaxOverRange dispatch #635's ~280 FPS on the rigid-step repro. CUDA / CPU paths are equivalent on the same code paths (they go through the same LLVM dispatcher) but the perf win there is smaller because their per-launch baseline was already lower-overhead. No API or output-numerics change on any backend.
  • _get_max_reducer_dispatch_count() now matches between cache-miss and cache-hit launches on AMDGPU. Before this PR a fresh test process on a warm ~/.cache/quadrants/ returned 0 dispatches and silently used the per-thread sizer's worst-case enumeration; now the dispatcher fires correctly on the first launch and the subsequent per-spec / per-launch caches short-circuit identically to the cache-miss path.
  • An overflow on a cache-loaded kernel now resolves to the matching kernel + task identity in the diagnose message instead of the generic dual-cause fallback. This was a latent bug that affected all LLVM backends, not just AMDGPU - the AMDGPU test suite is what surfaced it.

Mechanism end-to-end

1. Drop the per-launch AdStackSizingInfo deep copy

Before, dispatch_max_reducers_for_tasks(const std::vector<OffloadedTask> &) materialised std::vector<AdStackSizingInfo> by copy and forwarded to the AdStackSizingInfo overload. This commit refactors both public overloads to forward a std::vector<const AdStackSizingInfo *> pointer-view to a private dispatch_max_reducers_impl(launch_cache_key, ad_stacks_view, ctx, dev_ctx) - no AdStackSizingInfo copies, no per-launch allocation cliff. The launch cache key is the address of the launcher's stable per-handle vector (KernelLauncher::contexts_[i].offloaded_tasks for CUDA / AMDGPU, ad_stacks for the CPU launcher); the address is reused on every launch of the same kernel handle so it serves as a stable identity.

| file | change |
| --- | --- |
| quadrants/runtime/llvm/adstack_lazy_claim/bound_eval.cpp | Both public overloads build std::vector<const AdStackSizingInfo *> pointer-views (no AdStackSizingInfo copies) and forward to the new private dispatch_max_reducers_impl. The inner loop's ad_stack accessor changes from ad_stacks[ti] to *ad_stacks[ti]. |
| quadrants/runtime/llvm/llvm_runtime_executor.h | Adds the private dispatch_max_reducers_impl declaration. |

2. Add a per-kernel-handle launch cache

AdStackCache::try_max_reducer_launch_cache_hit and record_max_reducer_launch_cache factor the launch-cache logic into the existing adstack cache module. The cache entry stores a deduplicated (snode_id, gen) / (arg_id, devalloc, gen) snapshot covering every spec's observation deps; the fast path replays the snapshot in O(distinct deps) and short-circuits the per-spec walk on full match. Slow path falls through to today's per-spec cache.

| file | change |
| --- | --- |
| quadrants/program/adstack/cache.h | New MaxReducerLaunchCacheEntry / ArgGenObservation types nested in AdStackCache; new max_reducer_launch_cache_ field; try_max_reducer_launch_cache_hit / record_max_reducer_launch_cache / invalidate_max_reducer_launch declarations; invalidate_all_per_task extended to clear the new map. |
| quadrants/program/adstack/cache.cpp | Method bodies. The dependency aggregation walks each spec's recorded observations via lookup_max_reducer_reads; deduplication keeps the replay O(distinct deps) instead of O(specs * obs/spec). |
| quadrants/runtime/llvm/adstack_lazy_claim/bound_eval.cpp | dispatch_max_reducers_impl calls try_max_reducer_launch_cache_hit at the top and record_max_reducer_launch_cache at the bottom of every successful dispatch. |

3. Switch current_max_reducer_results_ to shared_ptr<const map>

The cache entry from step 2 stored its result map by value; the executor field was also value-typed and copied from the entry on every fast-path hit. Replacing both with std::shared_ptr<const std::unordered_map<uint64_t, int64_t>> collapses the fast-path assignment to a refcount bump (no map data is copied). The cache entry retains its own shared_ptr so the per-task local_max_reducer_results = current_max_reducer_results_ snapshot in publish_adstack_metadata (the recursive-reentry defence) becomes a refcount bump too, and the cache-entry's allocation stays alive even if a recursive snode-reader-kernel reentry repoints the executor's transient mid-walk. The slow path wraps result in a shared_ptr once at the end and hands the same allocation to both the cache entry and the executor field. Both dispatch_max_reducers_for_tasks overloads (and the private dispatch_max_reducers_impl) now return void since callers only read the result through current_max_reducer_results_.

| file | change |
| --- | --- |
| quadrants/program/adstack/max_reducer.h | New MaxReducerResultMapPtr = std::shared_ptr<const MaxReducerResultMap> alias (shared by the cache entry and the executor field). |
| quadrants/program/adstack/cache.{h,cpp} | MaxReducerLaunchCacheEntry::result switches to shared_ptr<const map>; try_max_reducer_launch_cache_hit / record_max_reducer_launch_cache signatures updated. |
| quadrants/runtime/llvm/llvm_runtime_executor.h | current_max_reducer_results_ field switches to shared_ptr<const map>; both public overloads + the private impl return void. |
| quadrants/runtime/llvm/adstack_lazy_claim/bound_eval.cpp | Slow path wraps result in make_shared<const ...> once and shares the allocation between the cache entry and the executor field. The fast path's current_max_reducer_results_ = std::move(hit) is now a refcount move. |
| quadrants/runtime/llvm/adstack_lazy_claim/metadata_publish.cpp | Consumer dereferences. Per-task snapshot is a shared_ptr copy (refcount only); encoder call sites pass *current_max_reducer_results_ (dispatch_max_reducers_impl initialises the field to a non-null empty-map sentinel so the deref is unconditional). |

4. Make registry_id content-stable and serialise it

Replaces sequential id assignment with a 32-bit FNV-1a folded from (kernel_name, task_id_in_kernel). Same (kernel_name, task_id_in_kernel) pair always yields the same id across Program lifetimes, re-compiles, and offline-cache reloads. Codegen-baked id and host-side id match by construction. The registry storage switches from std::vector<Entry> (sequential indexing) to std::unordered_map<uint32_t, Entry> (arbitrary hash keys); collisions linear-probe past occupied slots (1.2e-4 collision probability for 1000 distinct keys with 32-bit FNV-1a). AdStackSizingInfo gains kernel_name + task_id_in_kernel fields so the runtime registration call (the Program registry seed for the diagnose path) can re-derive the hash inputs without parsing OffloadedTask::name.

| file | change |
| --- | --- |
| quadrants/codegen/llvm/llvm_compiled_data.h | AdStackSizingInfo gains kernel_name and task_id_in_kernel fields; registry_id + the two new fields added to QD_IO_DEF. |
| quadrants/codegen/llvm/codegen_llvm.cpp | Both register_adstack_sizing_info call sites in finalize_offloaded_task_function populate the two new fields on current_task->ad_stack before registering. |
| quadrants/program/adstack/cache.h | Registry storage type switches from std::vector to std::unordered_map<uint32_t, AdStackSizingInfoEntry>; is_adstack_sizing_info_registered / ensure_runtime_registry_ids_for_max_reducer declarations. |
| quadrants/program/adstack/cache.cpp | New fnv1a32_for_registry helper; register_adstack_sizing_info rewritten to compute id by hash and linear-probe past collisions; lookup_adstack_sizing_info / update_adstack_sizing_info_size_exprs switch to map find. ensure_runtime_registry_ids_for_max_reducer seeds the per-Program registry on the first launch of each cache-loaded kernel handle, gated by is_adstack_sizing_info_registered(&ad_stack) so steady-state launches are O(1). |
| quadrants/runtime/llvm/adstack_lazy_claim/bound_eval.cpp | dispatch_max_reducers_for_tasks(OffloadedTask) calls ensure_runtime_registry_ids_for_max_reducer once before forwarding into dispatch_max_reducers_impl. The const_cast is documented; the OffloadedTasks live in non-const launcher storage. |

Per-backend coverage matrix

| backend | per-launch deep copy dropped | launch cache active | shared_ptr fast path | content-stable registry_id |
| --- | --- | --- | --- | --- |
| CPU LLVM (x64 / arm64) | ✓ (AdStackSizingInfo overload also routed through pointer-view) | ✓ | ✓ | ✓ (LLVM IR codegen path) |
| CUDA | ✓ | ✓ | ✓ | ✓ |
| AMDGPU | ✓ | ✓ | ✓ | ✓ |
| SPIR-V (Vulkan / Metal) | N/A (gfx runtime has its own dispatcher) | N/A | N/A | N/A (gfx already runtime-registers via adstack_sizer_launch.cpp:236) |

Tests

Existing tests/python/test_adstack.py test_max_reducer_* suite is the regression coverage:

  • test_max_reducer_pins_stride_for_oversized_axis[arch=amdgpu-*] (5 parametrisations) - pins that the dispatcher fires for above-1<<24 ndarray axes on cache-hit, and that _get_max_reducer_dispatch_count() >= 1 after the first compute + grad. These were the visible AMDGPU CI failures; before this PR they all returned dispatch_count == 0 on cache hit. Now they all pass on first run AND on cache-hit re-run.
  • test_max_reducer_dispatch_counts_advance_on_input_mutation[arch=amdgpu] - pins that a host mutation of the gating ndarray bumps ndarray_data_gen and forces re-dispatch; would silently regress if the launch cache fast path ignored the new dependency snapshot.
  • test_max_reducer_field_load_bound_var_dispatch[arch=amdgpu-*] (8 parametrisations) - exercise the SNode-backed bound_var-indexed FieldLoad body grammar across the full body-shape matrix; verifies the fast-path replay handles SNode generation deps the same way the per-spec cache does.
  • test_max_reducer_field_load_bound_var_cache_invalidates_on_snode_mutation[arch=amdgpu] - pins that a SNode write between launches invalidates both the per-spec cache and the new launch cache (the launch-cache snode_gens snapshot must carry the post-mutation gen forward).

repro_amdgpu_rigid_step.py (the user's profiling script) is the perf regression coverage; numbers in the TL;DR table.

Side-effect audit

| concern | where checked | verdict |
| --- | --- | --- |
| Offline cache binary compatibility | AdStackSizingInfo::QD_IO_DEF adds registry_id / kernel_name / task_id_in_kernel | Cache invalidation needed. Existing cached LLVM IR has the old sequential registry_id baked in; on reload the JSON metadata won't match the old QD_IO_DEF field set. Users must clear ~/.cache/quadrants/qdcache/ once. |
| Recursive snode-reader-kernel reentry into dispatch_max_reducers_for_tasks | metadata_publish.cpp:327 snapshot via local_max_reducer_results = current_max_reducer_results_ (now a shared_ptr copy) | Safe. The cache entry's shared_ptr keeps the result map alive across the recursive call's current_max_reducer_results_ repoint; the outer task continues to read the snapshot it took. |
| Per-spec max_reducer_cache_ invalidation on dependency change | Unchanged. The launch cache layers on top via the same (snode_write_gen, ndarray_data_gen) counters; invalidate_max_reducer and invalidate_max_reducer_launch are always invalidated together via invalidate_all_per_task. | Safe. |
| Hash collisions in fnv1a32_for_registry | register_adstack_sizing_info linear-probes past occupied slots; the same identity_key short-circuits via adstack_sizing_info_id_by_ptr_ before hash lookup | Safe. ~1.2e-4 collision probability for 1000 distinct keys; if a real collision appears the linear probe mints the next free slot. |
| Diagnose-on-overflow registry resolution | ensure_runtime_registry_ids_for_max_reducer seeds the registry on first launch of each cache-loaded kernel handle, gated by is_adstack_sizing_info_registered(&ad_stack) | Now correct on cache-hit launches. Was silently broken before this PR (registry empty after cache load). |
| Forward-only kernels (no max_reducer_specs) | dispatch_max_reducers_for_tasks early-out unchanged. The ensure_runtime_registry_ids_for_max_reducer loop also early-skips when max_reducer_specs.empty(). | Zero added overhead for forward-only kernels. |
| SPIR-V backend | Untouched. The gfx runtime has its own dispatcher in runtime/gfx/adstack_max_reducer_launch.cpp and its own runtime registration in adstack_sizer_launch.cpp:236. | Unaffected. |
| C++-only test setup (null Program *) | All paths gate on prog != nullptr / cache != nullptr; ensure_runtime_registry_ids_for_max_reducer defensive-seeds ad_stack.registry_id from the just-minted id when codegen-time registration was skipped. | Safe. |


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0f5a25f35a


Comment thread quadrants/program/adstack/cache.cpp Outdated
Comment thread quadrants/runtime/llvm/adstack_lazy_claim/bound_eval.cpp
@duburcqa duburcqa force-pushed the duburcqa/adstack_max_reducer_perf branch 2 times, most recently from 4025e0b to 5d2cb11 Compare May 9, 2026 17:09
@github-actions

github-actions Bot commented May 9, 2026

@duburcqa duburcqa force-pushed the duburcqa/adstack_max_reducer_perf branch from 7247a3d to b89c3f7 Compare May 9, 2026 18:10
@duburcqa duburcqa force-pushed the duburcqa/adstack_max_reducer_perf branch from b89c3f7 to 6aa04c4 Compare May 9, 2026 18:38

@hughperkins
Collaborator

Some line-wrapping issues to address: https://github.com/Genesis-Embodied-AI/quadrants/actions/runs/25610119418/job/75178772513?pr=671

@hughperkins
Collaborator

Since this (1) contains many changes, (2) touches core, non-adstack files, and (3) modifies the launch cache, I would prefer to have a genesis unit-test report and genesis benchmark results, please.

@duburcqa
Contributor Author

duburcqa commented May 9, 2026

| env | batch_size | backend | gjk_collision | constraint_solver | runtime_fps_main | runtime_fps_671 | runtime_fps_delta_pct |
| --- | --- | --- | --- | --- | --- | --- | --- |
| anymal_random | 30000 | cuda | - | - | 9274564 | 9346562 | +0.78 |
| anymal_uniform | 30000 | cuda | - | - | 12303670 | 12242896 | -0.49 |
| anymal_uniform_kinematic | 0 | cpu | - | - | 1953 | 1956 | +0.15 |
| anymal_uniform_kinematic | 30000 | cuda | - | - | 10440441 | 10386176 | -0.52 |
| anymal_zero | 0 | cpu | - | - | 7215 | 7045 | -2.36 |
| anymal_zero | 30000 | cuda | - | - | 18906213 | 19191135 | +1.51 |
| box_pyramid_3 | 4096 | cuda | - | - | 976550 | 993128 | +1.70 |
| box_pyramid_4 | 4096 | cuda | - | - | 386731 | 391566 | +1.25 |
| box_pyramid_5 | 4096 | cuda | - | - | 140041 | 138435 | -1.15 |
| box_pyramid_6 | 4096 | cuda | False | - | 58296 | 58500 | +0.35 |
| box_pyramid_6 | 4096 | cuda | True | - | 59085 | 60558 | +2.49 |
| dex_hand | 4096 | cuda | - | - | 17198 | 17322 | +0.72 |
| duck_in_box_easy | 30000 | cuda | False | - | 27162758 | 27039291 | -0.45 |
| duck_in_box_easy | 30000 | cuda | True | - | 9576866 | 9562423 | -0.15 |
| duck_in_box_hard | 0 | cpu | - | - | 5139 | 5092 | -0.91 |
| duck_in_box_hard | 30000 | cuda | False | - | 10222575 | 10288079 | +0.64 |
| duck_in_box_hard | 30000 | cuda | True | - | 3393998 | 3402313 | +0.24 |
| franka | 30000 | cuda | - | - | 21898508 | 21495589 | -1.84 |
| franka_accessors | 0 | cpu | - | - | 1167 | 1142 | -2.14 |
| franka_accessors | 30000 | cuda | - | - | 15408945 | 15615816 | +1.34 |
| franka_free | 30000 | cuda | - | - | 32234122 | 32824114 | +1.83 |
| franka_random | 0 | cpu | - | - | 6223 | 6073 | -2.41 |
| franka_random | 30000 | cuda | - | CG | 16853652 | 16823133 | -0.18 |
| franka_random | 30000 | cuda | - | Newton | 16561090 | 16690425 | +0.78 |
| franka_random | 30000 | cuda | False | - | 16272814 | 16677558 | +2.49 |
| franka_random | 30000 | cuda | True | - | 11412186 | 11460639 | +0.42 |
| g1_fall | 4096 | cuda | - | Newton | 928269 | 930055 | +0.19 |
| go2 | 4096 | cuda | False | CG | 3724945 | 3733568 | +0.23 |
| go2 | 4096 | cuda | False | Newton | 4554251 | 4584400 | +0.66 |
| go2 | 4096 | cuda | True | - | 3306931 | 3321830 | +0.45 |
| shadow_hand_cubes | 0 | cpu | - | - | 41 | 41 | +0.00 |
| shadow_hand_cubes_sparse | 0 | cpu | - | - | 66 | 64 | -3.03 |

speed_test_671.txt


@duburcqa
Contributor Author

duburcqa commented May 9, 2026

====== 642 passed, 3 skipped, 2 xfailed in 1169.75s (0:19:29) ======

@hughperkins
Collaborator

=> ok to merge


@duburcqa duburcqa merged commit 255d80a into main May 9, 2026
56 checks passed
@duburcqa duburcqa deleted the duburcqa/adstack_max_reducer_perf branch May 9, 2026 22:30