
[Perf] Adstack max-reducer: launch cache + zero-copy result map; content-stable registry_id #671

Merged
duburcqa merged 6 commits into main from duburcqa/adstack_max_reducer_perf
May 9, 2026

Conversation

@duburcqa
Contributor

@duburcqa duburcqa commented May 9, 2026

Adstack max-reducer: launch cache + zero-copy result map; content-stable registry_id

Three commits stacked on top of #635. Two restore the AMDGPU autodiff hot-path FPS that #635 lost (280 → 350 FPS, +25% on the rigid-step repro), one fixes a latent offline-cache correctness gap that surfaced as _get_max_reducer_dispatch_count() >= 1 failing across the AMDGPU test_max_reducer_* suite. All three are scoped to the LLVM-side adstack plumbing; the SPIR-V backend is untouched.

TL;DR

QD_OFFLINE_CACHE=0 GS_ENABLE_NDARRAY=0 python repro_amdgpu_rigid_step.py (RX 7900 XTX, 100 substeps + 1 backward, 3-trial median):

| state | trial 0 | trial 1 | trial 2 |
| --- | --- | --- | --- |
| #635 (regressed) | 277.6 FPS | 283.3 FPS | 283.8 FPS |
| #635 + commit 1 (drop deep copy + per-launch cache) | 318.2 FPS | 322.5 FPS | 324.7 FPS |
| #635 + commits 1 + 2 (shared_ptr) | 342.0 FPS | 348.1 FPS | 351.9 FPS |
| this PR (= #635 + commits 1 + 2 + 3) | 348.1 FPS | 348.1 FPS | 349.2 FPS |
| upper bound (full skip via env-gate) | 357.8 FPS | 367.6 FPS | 368.9 FPS |

AMDGPU test_max_reducer_* suite (QD_WANTED_ARCHS=amdgpu pytest -k max_reducer):

| state | passed | failed |
| --- | --- | --- |
| #635 | 1 | 15 |
| this PR | 16 | 0 |

Why

#635 introduced LlvmRuntimeExecutor::dispatch_max_reducers_for_tasks to dispatch captured MaxOverRange specs in parallel. The dispatcher's design expectation was that the per-spec AdStackCache::max_reducer_cache_ would short-circuit every launch after the first - and instrumentation confirms it does (99.75% per-spec hit rate, near-zero GPU dispatches in steady state). But the hot-path host-side bookkeeping the dispatcher does on the way to that short-circuit added ~290 ms across 2008 launches in the rigid-step bench - about 27% of trial wallclock - dominated by:

  1. The OffloadedTask overload's per-launch deep copy of every task's AdStackSizingInfo into a std::vector<AdStackSizingInfo> ad_stacks_view, including the embedded size_exprs / allocas / bound_expr / max_reducer_specs sub-vectors. Probe data: 125.2 ms / 290 ms (43%) was this single copy loop.
  2. An O(specs) hash-lookup-and-observation-replay loop walking every captured spec through try_max_reducer_cache_hit. Probe data: 62.8 ms / 290 ms (22%).
  3. The current_max_reducer_results_ value-typed map: copied once per launch from the cache entry, then snapshotted again per task in publish_adstack_metadata (the recursive-reentry guard). With ~600 entries per map and ~1.8 tasks per call, that's ~3.4 M map entries copied across the trial.

Separately, AdStackSizingInfo::registry_id was excluded from QD_IO_DEF under the rationale "ids are per-Program lifetime; a deserialised task re-registers itself at the next launch." That comment was aspirational on the LLVM path - only the SPIR-V launcher actually re-registered (in adstack_sizer_launch.cpp). After offline-cache reload an LLVM kernel had registry_id == 0, so the dispatcher's if (registry_id == 0) continue gate skipped every spec, the per-thread sizer fell through to its 1<<24 host-eval cap, and the dispatch silently no-op'd. This is what test_max_reducer_pins_stride_for_oversized_axis[arch=amdgpu-*] and the dispatch-counter tests were exposing. The gate also affected the diagnose-on-overflow path: the codegen-baked cmpxchg(0, registry_id) immediate IS preserved through the LLVM IR text in the offline cache, so on cache reload the LLVM IR wrote an old sequential id while the host registry was empty - any overflow on a freshly cache-loaded kernel would print the generic dual-cause fallback instead of the offending kernel + task identity.

Surface behavior

  • AMDGPU autodiff hot path recovers to ~350 FPS from [AutoDiff] Adstack max-reducer: parallel multi-axis MaxOverRange dispatch #635's ~280 FPS on the rigid-step repro. CUDA / CPU paths are equivalent on the same code paths (they go through the same LLVM dispatcher) but the perf win there is smaller because their per-launch baseline was already lower-overhead. No API or output-numerics change on any backend.
  • _get_max_reducer_dispatch_count() now matches between cache-miss and cache-hit launches on AMDGPU. Before this PR a fresh test process on a warm ~/.cache/quadrants/ returned 0 dispatches and silently used the per-thread sizer's worst-case enumeration; now the dispatcher fires correctly on the first launch and the subsequent per-spec / per-launch caches short-circuit identically to the cache-miss path.
  • An overflow on a cache-loaded kernel now resolves to the matching kernel + task identity in the diagnose message instead of the generic dual-cause fallback. This was a latent bug that affected all LLVM backends, not just AMDGPU - the AMDGPU test suite is what surfaced it.

Mechanism end-to-end

1. Drop the per-launch AdStackSizingInfo deep copy

Before, dispatch_max_reducers_for_tasks(const std::vector<OffloadedTask> &) materialised std::vector<AdStackSizingInfo> by copy and forwarded to the AdStackSizingInfo overload. This commit refactors both public overloads to forward a std::vector<const AdStackSizingInfo *> pointer-view to a private dispatch_max_reducers_impl(launch_cache_key, ad_stacks_view, ctx, dev_ctx) - no AdStackSizingInfo copies, no per-launch allocation cliff. The launch cache key is the address of the launcher's stable per-handle vector (KernelLauncher::contexts_[i].offloaded_tasks for CUDA / AMDGPU, ad_stacks for the CPU launcher); the address is reused on every launch of the same kernel handle so it serves as a stable identity.

| file | change |
| --- | --- |
| quadrants/runtime/llvm/adstack_lazy_claim/bound_eval.cpp | Both public overloads build std::vector<const AdStackSizingInfo *> pointer-views (no AdStackSizingInfo copies) and forward to the new private dispatch_max_reducers_impl. The inner loop's ad_stack accessor changes from ad_stacks[ti] to *ad_stacks[ti]. |
| quadrants/runtime/llvm/llvm_runtime_executor.h | Adds the private dispatch_max_reducers_impl declaration. |

2. Add a per-kernel-handle launch cache

AdStackCache::try_max_reducer_launch_cache_hit and record_max_reducer_launch_cache factor the launch-cache logic into the existing adstack cache module. The cache entry stores a deduplicated (snode_id, gen) / (arg_id, devalloc, gen) snapshot covering every spec's observation deps; the fast path replays the snapshot in O(distinct deps) and short-circuits the per-spec walk on full match. Slow path falls through to today's per-spec cache.

| file | change |
| --- | --- |
| quadrants/program/adstack/cache.h | New MaxReducerLaunchCacheEntry / ArgGenObservation types nested in AdStackCache; new max_reducer_launch_cache_ field; try_max_reducer_launch_cache_hit / record_max_reducer_launch_cache / invalidate_max_reducer_launch declarations; invalidate_all_per_task extended to clear the new map. |
| quadrants/program/adstack/cache.cpp | Method bodies. The dependency aggregation walks each spec's recorded observations via lookup_max_reducer_reads; deduplication keeps the replay O(distinct deps) instead of O(specs * obs/spec). |
| quadrants/runtime/llvm/adstack_lazy_claim/bound_eval.cpp | dispatch_max_reducers_impl calls try_max_reducer_launch_cache_hit at the top and record_max_reducer_launch_cache at the bottom of every successful dispatch. |

3. Switch current_max_reducer_results_ to shared_ptr<const map>

The cache entry from step 2 stored its result map by value; the executor field was also value-typed and copied from the entry on every fast-path hit. Replacing both with std::shared_ptr<const std::unordered_map<uint64_t, int64_t>> collapses the fast-path assignment to a refcount bump (no map data is copied). The cache entry retains its own shared_ptr so the per-task local_max_reducer_results = current_max_reducer_results_ snapshot in publish_adstack_metadata (the recursive-reentry defence) becomes a refcount bump too, and the cache-entry's allocation stays alive even if a recursive snode-reader-kernel reentry repoints the executor's transient mid-walk. The slow path wraps result in a shared_ptr once at the end and hands the same allocation to both the cache entry and the executor field. Both dispatch_max_reducers_for_tasks overloads (and the private dispatch_max_reducers_impl) now return void since callers only read the result through current_max_reducer_results_.

| file | change |
| --- | --- |
| quadrants/program/adstack/max_reducer.h | New MaxReducerResultMapPtr = std::shared_ptr<const MaxReducerResultMap> alias (shared by the cache entry and the executor field). |
| quadrants/program/adstack/cache.{h,cpp} | MaxReducerLaunchCacheEntry::result switches to shared_ptr<const map>; try_max_reducer_launch_cache_hit / record_max_reducer_launch_cache signatures updated. |
| quadrants/runtime/llvm/llvm_runtime_executor.h | current_max_reducer_results_ field switches to shared_ptr<const map>; both public overloads + the private impl return void. |
| quadrants/runtime/llvm/adstack_lazy_claim/bound_eval.cpp | Slow path wraps result in make_shared<const ...> once and shares the allocation between the cache entry and the executor field. The fast path's current_max_reducer_results_ = std::move(hit) is now a refcount move. |
| quadrants/runtime/llvm/adstack_lazy_claim/metadata_publish.cpp | Consumer dereferences. Per-task snapshot is a shared_ptr copy (refcount only); encoder call sites pass *current_max_reducer_results_ (dispatch_max_reducers_impl initialises the field to a non-null empty-map sentinel so the deref is unconditional). |

4. Make registry_id content-stable and serialise it

Replaces sequential id assignment with a 32-bit FNV-1a folded from (kernel_name, task_id_in_kernel). Same (kernel_name, task_id_in_kernel) pair always yields the same id across Program lifetimes, re-compiles, and offline-cache reloads. Codegen-baked id and host-side id match by construction. The registry storage switches from std::vector<Entry> (sequential indexing) to std::unordered_map<uint32_t, Entry> (arbitrary hash keys); collisions linear-probe past occupied slots (1.2e-4 collision probability for 1000 distinct keys with 32-bit FNV-1a). AdStackSizingInfo gains kernel_name + task_id_in_kernel fields so the runtime registration call (the Program registry seed for the diagnose path) can re-derive the hash inputs without parsing OffloadedTask::name.

| file | change |
| --- | --- |
| quadrants/codegen/llvm/llvm_compiled_data.h | AdStackSizingInfo gains kernel_name and task_id_in_kernel fields; registry_id + the two new fields added to QD_IO_DEF. |
| quadrants/codegen/llvm/codegen_llvm.cpp | Both register_adstack_sizing_info call sites in finalize_offloaded_task_function populate the two new fields on current_task->ad_stack before registering. |
| quadrants/program/adstack/cache.h | Registry storage type switches from std::vector to std::unordered_map<uint32_t, AdStackSizingInfoEntry>; is_adstack_sizing_info_registered / ensure_runtime_registry_ids_for_max_reducer declarations. |
| quadrants/program/adstack/cache.cpp | New fnv1a32_for_registry helper; register_adstack_sizing_info rewritten to compute id by hash and linear-probe past collisions; lookup_adstack_sizing_info / update_adstack_sizing_info_size_exprs switch to map find. ensure_runtime_registry_ids_for_max_reducer seeds the per-Program registry on the first launch of each cache-loaded kernel handle, gated by is_adstack_sizing_info_registered(&ad_stack) so steady-state launches are O(1). |
| quadrants/runtime/llvm/adstack_lazy_claim/bound_eval.cpp | dispatch_max_reducers_for_tasks(OffloadedTask) calls ensure_runtime_registry_ids_for_max_reducer once before forwarding into dispatch_max_reducers_impl. The const_cast is documented; the OffloadedTasks live in non-const launcher storage. |

Per-backend coverage matrix

| backend | per-launch deep copy dropped | launch cache active | shared_ptr fast path | content-stable registry_id |
| --- | --- | --- | --- | --- |
| CPU LLVM (x64 / arm64) | ✓ (AdStackSizingInfo overload also routed through pointer-view) | ✓ | ✓ | ✓ (LLVM IR codegen path) |
| CUDA | ✓ | ✓ | ✓ | ✓ |
| AMDGPU | ✓ | ✓ | ✓ | ✓ |
| SPIR-V (Vulkan / Metal) | N/A (gfx runtime has its own dispatcher) | N/A | N/A | N/A (gfx already runtime-registers via adstack_sizer_launch.cpp:236) |

Tests

Existing tests/python/test_adstack.py test_max_reducer_* suite is the regression coverage:

  • test_max_reducer_pins_stride_for_oversized_axis[arch=amdgpu-*] (5 parametrisations) - pins that the dispatcher fires for above-1<<24 ndarray axes on cache-hit, and that _get_max_reducer_dispatch_count() >= 1 after the first compute + grad. These were the visible AMDGPU CI failures; before this PR they all returned dispatch_count == 0 on cache hit. Now they all pass on first run AND on cache-hit re-run.
  • test_max_reducer_dispatch_counts_advance_on_input_mutation[arch=amdgpu] - pins that a host mutation of the gating ndarray bumps ndarray_data_gen and forces re-dispatch; would silently regress if the launch cache fast path ignored the new dependency snapshot.
  • test_max_reducer_field_load_bound_var_dispatch[arch=amdgpu-*] (8 parametrisations) - exercise the SNode-backed bound_var-indexed FieldLoad body grammar across the full body-shape matrix; verifies the fast-path replay handles SNode generation deps the same way the per-spec cache does.
  • test_max_reducer_field_load_bound_var_cache_invalidates_on_snode_mutation[arch=amdgpu] - pins that a SNode write between launches invalidates both the per-spec cache and the new launch cache (the launch-cache snode_gens snapshot must carry the post-mutation gen forward).

repro_amdgpu_rigid_step.py (the user's profiling script) is the perf regression coverage; numbers in the TL;DR table.

Side-effect audit

| concern | where checked | verdict |
| --- | --- | --- |
| Offline cache binary compatibility | AdStackSizingInfo::QD_IO_DEF adds registry_id / kernel_name / task_id_in_kernel | Cache invalidation needed. Existing cached LLVM IR has the old sequential registry_id baked in; on reload the JSON metadata won't match the old QD_IO_DEF field set. Users must clear ~/.cache/quadrants/qdcache/ once. |
| Recursive snode-reader-kernel reentry into dispatch_max_reducers_for_tasks | metadata_publish.cpp:327 snapshot via local_max_reducer_results = current_max_reducer_results_ (now a shared_ptr copy) | Safe. The cache entry's shared_ptr keeps the result map alive across the recursive call's current_max_reducer_results_ repoint; the outer task continues to read the snapshot it took. |
| Per-spec max_reducer_cache_ invalidation on dependency change | Unchanged. The launch cache layers on top via the same (snode_write_gen, ndarray_data_gen) counters; invalidate_max_reducer and invalidate_max_reducer_launch are always invalidated together via invalidate_all_per_task. | Safe. |
| Hash collisions in fnv1a32_for_registry | register_adstack_sizing_info linear-probes past occupied slots; the same identity_key short-circuits via adstack_sizing_info_id_by_ptr_ before hash lookup | Safe. ~1.2e-4 collision probability for 1000 distinct keys; if a real collision appears the linear probe mints the next free slot. |
| Diagnose-on-overflow registry resolution | ensure_runtime_registry_ids_for_max_reducer seeds the registry on first launch of each cache-loaded kernel handle, gated by is_adstack_sizing_info_registered(&ad_stack) | Now correct on cache-hit launches. Was silently broken before this PR (registry empty after cache load). |
| Forward-only kernels (no max_reducer_specs) | dispatch_max_reducers_for_tasks early-out unchanged. The ensure_runtime_registry_ids_for_max_reducer loop also early-skips when max_reducer_specs.empty(). | Zero added overhead for forward-only kernels. |
| SPIR-V backend | Untouched. The gfx runtime has its own dispatcher in runtime/gfx/adstack_max_reducer_launch.cpp and its own runtime registration in adstack_sizer_launch.cpp:236. | Unaffected. |
| C++-only test setup (null Program *) | All paths gate on prog != nullptr / cache != nullptr; ensure_runtime_registry_ids_for_max_reducer defensive-seeds ad_stack.registry_id from the just-minted id when codegen-time registration was skipped. | Safe. |


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0f5a25f35a


Comment thread quadrants/program/adstack/cache.cpp Outdated
Comment thread quadrants/runtime/llvm/adstack_lazy_claim/bound_eval.cpp
@duburcqa duburcqa force-pushed the duburcqa/adstack_max_reducer_perf branch 2 times, most recently from 4025e0b to 5d2cb11 Compare May 9, 2026 17:09
@github-actions

github-actions Bot commented May 9, 2026

@duburcqa duburcqa force-pushed the duburcqa/adstack_max_reducer_perf branch from 7247a3d to b89c3f7 Compare May 9, 2026 18:10
@duburcqa duburcqa force-pushed the duburcqa/adstack_max_reducer_perf branch from b89c3f7 to 6aa04c4 Compare May 9, 2026 18:38

@hughperkins
Collaborator

Some line-wrapping issues to address: https://github.com/Genesis-Embodied-AI/quadrants/actions/runs/25610119418/job/75178772513?pr=671

@hughperkins
Collaborator

Since this (1) contains many changes, (2) touches core, non-adstack files, and (3) modifies the launch cache, I would prefer to have a genesis unit-test report and genesis benchmark results, please.

@duburcqa
Contributor Author

duburcqa commented May 9, 2026

| env | batch_size | backend | gjk_collision | constraint_solver | runtime_fps_main | runtime_fps_671 | runtime_fps_delta_pct |
| --- | --- | --- | --- | --- | --- | --- | --- |
| anymal_random | 30000 | cuda | - | - | 9274564 | 9346562 | +0.78 |
| anymal_uniform | 30000 | cuda | - | - | 12303670 | 12242896 | -0.49 |
| anymal_uniform_kinematic | 0 | cpu | - | - | 1953 | 1956 | +0.15 |
| anymal_uniform_kinematic | 30000 | cuda | - | - | 10440441 | 10386176 | -0.52 |
| anymal_zero | 0 | cpu | - | - | 7215 | 7045 | -2.36 |
| anymal_zero | 30000 | cuda | - | - | 18906213 | 19191135 | +1.51 |
| box_pyramid_3 | 4096 | cuda | - | - | 976550 | 993128 | +1.70 |
| box_pyramid_4 | 4096 | cuda | - | - | 386731 | 391566 | +1.25 |
| box_pyramid_5 | 4096 | cuda | - | - | 140041 | 138435 | -1.15 |
| box_pyramid_6 | 4096 | cuda | False | - | 58296 | 58500 | +0.35 |
| box_pyramid_6 | 4096 | cuda | True | - | 59085 | 60558 | +2.49 |
| dex_hand | 4096 | cuda | - | - | 17198 | 17322 | +0.72 |
| duck_in_box_easy | 30000 | cuda | False | - | 27162758 | 27039291 | -0.45 |
| duck_in_box_easy | 30000 | cuda | True | - | 9576866 | 9562423 | -0.15 |
| duck_in_box_hard | 0 | cpu | - | - | 5139 | 5092 | -0.91 |
| duck_in_box_hard | 30000 | cuda | False | - | 10222575 | 10288079 | +0.64 |
| duck_in_box_hard | 30000 | cuda | True | - | 3393998 | 3402313 | +0.24 |
| franka | 30000 | cuda | - | - | 21898508 | 21495589 | -1.84 |
| franka_accessors | 0 | cpu | - | - | 1167 | 1142 | -2.14 |
| franka_accessors | 30000 | cuda | - | - | 15408945 | 15615816 | +1.34 |
| franka_free | 30000 | cuda | - | - | 32234122 | 32824114 | +1.83 |
| franka_random | 0 | cpu | - | - | 6223 | 6073 | -2.41 |
| franka_random | 30000 | cuda | - | CG | 16853652 | 16823133 | -0.18 |
| franka_random | 30000 | cuda | - | Newton | 16561090 | 16690425 | +0.78 |
| franka_random | 30000 | cuda | False | - | 16272814 | 16677558 | +2.49 |
| franka_random | 30000 | cuda | True | - | 11412186 | 11460639 | +0.42 |
| g1_fall | 4096 | cuda | - | Newton | 928269 | 930055 | +0.19 |
| go2 | 4096 | cuda | False | CG | 3724945 | 3733568 | +0.23 |
| go2 | 4096 | cuda | False | Newton | 4554251 | 4584400 | +0.66 |
| go2 | 4096 | cuda | True | - | 3306931 | 3321830 | +0.45 |
| shadow_hand_cubes | 0 | cpu | - | - | 41 | 41 | +0.00 |
| shadow_hand_cubes_sparse | 0 | cpu | - | - | 66 | 64 | -3.03 |

speed_test_671.txt


@duburcqa
Contributor Author

duburcqa commented May 9, 2026

====== 642 passed, 3 skipped, 2 xfailed in 1169.75s (0:19:29) ======

@hughperkins
Collaborator

=> ok to merge


@duburcqa duburcqa merged commit 255d80a into main May 9, 2026
56 checks passed
@duburcqa duburcqa deleted the duburcqa/adstack_max_reducer_perf branch May 9, 2026 22:30