[Perf] Adstack max-reducer: launch cache + zero-copy result map; content-stable registry_id#671
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0f5a25f35a
> some line wrapping issues to address: https://github.com/Genesis-Embodied-AI/quadrants/actions/runs/25610119418/job/75178772513?pr=671

> Since this 1. contains many changes, 2. touches core, non-adstack files, and 3. modifies the launch cache, I would prefer to have the genesis unit tests report and genesis benchmark results, please.

=> ok to merge
Adstack max-reducer: launch cache + zero-copy result map; content-stable registry_id

## TL;DR

- `QD_OFFLINE_CACHE=0 GS_ENABLE_NDARRAY=0 python repro_amdgpu_rigid_step.py` (RX 7900 XTX, 100 substeps + 1 backward, 3-trial median): AMDGPU
- `test_max_reducer_*` suite (`QD_WANTED_ARCHS=amdgpu pytest -k max_reducer`):

## Why
#635 introduced `LlvmRuntimeExecutor::dispatch_max_reducers_for_tasks` to dispatch captured `MaxOverRange` specs in parallel. The dispatcher's design expectation was that the per-spec `AdStackCache::max_reducer_cache_` would short-circuit every launch after the first, and instrumentation confirms it does (99.75% per-spec hit rate, near-zero GPU dispatches in steady state). But the hot-path host-side bookkeeping the dispatcher does on the way to that short-circuit added ~290 ms across 2008 launches in the rigid-step bench, about 27% of trial wallclock, dominated by:

- deep-copying each `AdStackSizingInfo` into a `std::vector<AdStackSizingInfo> ad_stacks_view`, including the embedded `size_exprs` / `allocas` / `bound_expr` / `max_reducer_specs` sub-vectors. Probe data: 125.2 ms / 290 ms (43%) was this single copy loop.
- the per-spec walk through `try_max_reducer_cache_hit`. Probe data: 62.8 ms / 290 ms (22%).
- the `current_max_reducer_results_` value-typed map: copied once per launch from the cache entry, then snapshotted again per task in `publish_adstack_metadata` (the recursive-reentry guard). With ~600 entries per map and ~1.8 tasks per call, that's ~3.4 M map entries copied across the trial.

Separately,
`AdStackSizingInfo::registry_id` was excluded from `QD_IO_DEF` under the rationale "ids are per-`Program` lifetime; a deserialised task re-registers itself at the next launch." That comment was aspirational on the LLVM path: only the SPIR-V launcher actually re-registered (in `adstack_sizer_launch.cpp`). After offline-cache reload an LLVM kernel had `registry_id == 0`, so the dispatcher's `if (registry_id == 0) continue` gate skipped every spec, the per-thread sizer fell through to its `1<<24` host-eval cap, and the dispatch silently no-op'd. This is what `test_max_reducer_pins_stride_for_oversized_axis[arch=amdgpu-*]` and the dispatch-counter tests were exposing.

The gate also affected the diagnose-on-overflow path: the codegen-baked `cmpxchg(0, registry_id)` immediate IS preserved through the LLVM IR text in the offline cache, so on cache reload the LLVM IR wrote an old sequential id while the host registry was empty; any overflow on a freshly cache-loaded kernel would print the generic dual-cause fallback instead of the offending kernel + task identity.

## Surface behavior
- `_get_max_reducer_dispatch_count()` now matches between cache-miss and cache-hit launches on AMDGPU. Before this PR a fresh test process on a warm `~/.cache/quadrants/` returned 0 dispatches and silently used the per-thread sizer's worst-case enumeration; now the dispatcher fires correctly on the first launch and the subsequent per-spec / per-launch caches short-circuit identically to the cache-miss path.
- Overflow diagnostics on a cache-loaded kernel now report the kernel + task identity in the diagnose message instead of the generic dual-cause fallback. This was a latent bug that affected all LLVM backends, not just AMDGPU; the AMDGPU test suite is what surfaced it.

## Mechanism end-to-end
### 1. Drop the per-launch `AdStackSizingInfo` deep copy

Before, `dispatch_max_reducers_for_tasks(const std::vector<OffloadedTask> &)` materialised a `std::vector<AdStackSizingInfo>` by copy and forwarded to the `AdStackSizingInfo` overload. This commit refactors both public overloads to forward a `std::vector<const AdStackSizingInfo *>` pointer-view to a private `dispatch_max_reducers_impl(launch_cache_key, ad_stacks_view, ctx, dev_ctx)`: no `AdStackSizingInfo` copies, no per-launch allocation cliff. The launch cache key is the address of the launcher's stable per-handle vector (`KernelLauncher::contexts_[i].offloaded_tasks` for CUDA / AMDGPU, `ad_stacks` for the CPU launcher); the address is reused on every launch of the same kernel handle so it serves as a stable identity.

- `quadrants/runtime/llvm/adstack_lazy_claim/bound_eval.cpp`: both public overloads build `std::vector<const AdStackSizingInfo *>` pointer-views (no `AdStackSizingInfo` copies) and forward to the new private `dispatch_max_reducers_impl`. The inner loop's `ad_stack` accessor changes from `ad_stacks[ti]` to `*ad_stacks[ti]`.
- `quadrants/runtime/llvm/llvm_runtime_executor.h`: new `dispatch_max_reducers_impl` declaration.
### 2. Add a per-kernel-handle launch cache

`AdStackCache::try_max_reducer_launch_cache_hit` and `record_max_reducer_launch_cache` factor the launch-cache logic into the existing adstack cache module. The cache entry stores a deduplicated `(snode_id, gen)` / `(arg_id, devalloc, gen)` snapshot covering every spec's observation deps; the fast path replays the snapshot in O(distinct deps) and short-circuits the per-spec walk on full match. The slow path falls through to today's per-spec cache.

- `quadrants/program/adstack/cache.h`: new `MaxReducerLaunchCacheEntry` / `ArgGenObservation` types nested in `AdStackCache`; new `max_reducer_launch_cache_` field; `try_max_reducer_launch_cache_hit` / `record_max_reducer_launch_cache` / `invalidate_max_reducer_launch` declarations; `invalidate_all_per_task` extended to clear the new map.
- `quadrants/program/adstack/cache.cpp`: implementation shares `lookup_max_reducer_reads`; deduplication keeps the replay O(distinct deps) instead of O(specs * obs/spec).
- `quadrants/runtime/llvm/adstack_lazy_claim/bound_eval.cpp`: `dispatch_max_reducers_impl` calls `try_max_reducer_launch_cache_hit` at the top and `record_max_reducer_launch_cache` at the bottom of every successful dispatch.

### 3. Switch `current_max_reducer_results_` to `shared_ptr<const map>`

The cache entry from step 2 stored its result map by value; the executor field was also value-typed and copied from the entry on every fast-path hit. Replacing both with `std::shared_ptr<const std::unordered_map<uint64_t, int64_t>>` collapses the fast-path assignment to a refcount bump (no map data is copied). The cache entry retains its own `shared_ptr`, so the per-task `local_max_reducer_results = current_max_reducer_results_` snapshot in `publish_adstack_metadata` (the recursive-reentry defence) becomes a refcount bump too, and the cache entry's allocation stays alive even if a recursive snode-reader-kernel reentry repoints the executor's transient mid-walk. The slow path wraps `result` in a `shared_ptr` once at the end and hands the same allocation to both the cache entry and the executor field. Both `dispatch_max_reducers_for_tasks` overloads (and the private `dispatch_max_reducers_impl`) now return `void`, since callers only read the result through `current_max_reducer_results_`.

- `quadrants/program/adstack/max_reducer.h`: new `MaxReducerResultMapPtr = std::shared_ptr<const MaxReducerResultMap>` alias (shared by the cache entry and the executor field).
- `quadrants/program/adstack/cache.{h,cpp}`: `MaxReducerLaunchCacheEntry::result` switches to `shared_ptr<const map>`; `try_max_reducer_launch_cache_hit` / `record_max_reducer_launch_cache` signatures updated.
- `quadrants/runtime/llvm/llvm_runtime_executor.h`: `current_max_reducer_results_` field switches to `shared_ptr<const map>`; both public overloads + the private impl return `void`.
- `quadrants/runtime/llvm/adstack_lazy_claim/bound_eval.cpp`: wraps `result` in `make_shared<const ...>` once and shares the allocation between the cache entry and the executor field. The fast path's `current_max_reducer_results_ = std::move(hit)` is now a refcount move.
- `quadrants/runtime/llvm/adstack_lazy_claim/metadata_publish.cpp`: the per-task snapshot becomes a `shared_ptr` copy (refcount only); encoder call sites pass `*current_max_reducer_results_` (`dispatch_max_reducers_impl` initialises the field to a non-null empty-map sentinel so the deref is unconditional).
### 4. Make `registry_id` content-stable and serialise it

Replaces sequential id assignment with a 32-bit FNV-1a hash folded from `(kernel_name, task_id_in_kernel)`. The same `(kernel_name, task_id_in_kernel)` pair always yields the same id across `Program` lifetimes, re-compiles, and offline-cache reloads, so the codegen-baked id and the host-side id match by construction. The registry storage switches from `std::vector<Entry>` (sequential indexing) to `std::unordered_map<uint32_t, Entry>` (arbitrary hash keys); collisions linear-probe past occupied slots (1.2e-4 collision probability for 1000 distinct keys with 32-bit FNV-1a). `AdStackSizingInfo` gains `kernel_name` + `task_id_in_kernel` fields so the runtime registration call (the `Program` registry seed for the diagnose path) can re-derive the hash inputs without parsing `OffloadedTask::name`.

- `quadrants/codegen/llvm/llvm_compiled_data.h`: `AdStackSizingInfo` gains `kernel_name` and `task_id_in_kernel` fields; `registry_id` + the two new fields added to `QD_IO_DEF`.
- `quadrants/codegen/llvm/codegen_llvm.cpp`: the `register_adstack_sizing_info` call sites in `finalize_offloaded_task_function` populate the two new fields on `current_task->ad_stack` before registering.
- `quadrants/program/adstack/cache.h`: registry storage switches from `std::vector` to `std::unordered_map<uint32_t, AdStackSizingInfoEntry>`; new `is_adstack_sizing_info_registered` / `ensure_runtime_registry_ids_for_max_reducer` declarations.
- `quadrants/program/adstack/cache.cpp`: new `fnv1a32_for_registry` helper; `register_adstack_sizing_info` rewritten to compute the id by hash and linear-probe past collisions; `lookup_adstack_sizing_info` / `update_adstack_sizing_info_size_exprs` switch to map find. `ensure_runtime_registry_ids_for_max_reducer` seeds the per-`Program` registry on the first launch of each cache-loaded kernel handle, gated by `is_adstack_sizing_info_registered(&ad_stack)` so steady-state launches are O(1).
- `quadrants/runtime/llvm/adstack_lazy_claim/bound_eval.cpp`: `dispatch_max_reducers_for_tasks(OffloadedTask)` calls `ensure_runtime_registry_ids_for_max_reducer` once before forwarding into `dispatch_max_reducers_impl`. The `const_cast` is documented; the `OffloadedTask`s live in non-const launcher storage.
## Per-backend coverage matrix

The SPIR-V (gfx) launcher keeps its own runtime registration (`adstack_sizer_launch.cpp:236`).

## Tests

The existing `tests/python/test_adstack.py` `test_max_reducer_*` suite is the regression coverage:

- `test_max_reducer_pins_stride_for_oversized_axis[arch=amdgpu-*]` (5 parametrisations): pins that the dispatcher fires for above-`1<<24` ndarray axes on cache-hit, and that `_get_max_reducer_dispatch_count() >= 1` after the first compute + grad. These were the visible AMDGPU CI failures; before this PR they all returned `dispatch_count == 0` on cache hit. Now they all pass on first run AND on cache-hit re-run.
- `test_max_reducer_dispatch_counts_advance_on_input_mutation[arch=amdgpu]`: pins that a host mutation of the gating ndarray bumps `ndarray_data_gen` and forces re-dispatch; would silently regress if the launch cache fast path ignored the new dependency snapshot.
- `test_max_reducer_field_load_bound_var_dispatch[arch=amdgpu-*]` (8 parametrisations): exercises the SNode-backed `bound_var`-indexed FieldLoad body grammar across the full body-shape matrix; verifies the fast-path replay handles SNode generation deps the same way the per-spec cache does.
- `test_max_reducer_field_load_bound_var_cache_invalidates_on_snode_mutation[arch=amdgpu]`: pins that a SNode write between launches invalidates both the per-spec cache and the new launch cache (the launch-cache `snode_gens` snapshot must carry the post-mutation gen forward).

`repro_amdgpu_rigid_step.py` (the user's profiling script) is the perf regression coverage; numbers in the TL;DR table.
## Side-effect audit

- Offline-cache format: `AdStackSizingInfo::QD_IO_DEF` adds `registry_id` / `kernel_name` / `task_id_in_kernel`. Old cache files have a sequential `registry_id` baked in; on reload the JSON metadata won't match the old `QD_IO_DEF` field set. Users must clear `~/.cache/quadrants/qdcache/` once.
- Recursive reentry through `dispatch_max_reducers_for_tasks`: the `metadata_publish.cpp:327` snapshot via `local_max_reducer_results = current_max_reducer_results_` (now a `shared_ptr` copy) keeps the result map alive across the recursive call's `current_max_reducer_results_` repoint; the outer task continues to read the snapshot it took.
- `max_reducer_cache_` invalidation on dependency change: the `(snode_write_gen, ndarray_data_gen)` counters are unchanged; the per-spec and launch caches are always invalidated together via `invalidate_all_per_task` (`invalidate_max_reducer` and `invalidate_max_reducer_launch`).
- `fnv1a32_for_registry` collisions: `register_adstack_sizing_info` linear-probes past occupied slots; the same `identity_key` short-circuits via `adstack_sizing_info_id_by_ptr_` before the hash lookup.
- Offline-cache reload: `ensure_runtime_registry_ids_for_max_reducer` seeds the registry on the first launch of each cache-loaded kernel handle, gated by `is_adstack_sizing_info_registered(&ad_stack)`.
- Kernels with no `max_reducer_specs`: the `dispatch_max_reducers_for_tasks` early-out is unchanged. The `ensure_runtime_registry_ids_for_max_reducer` loop also early-skips when `max_reducer_specs.empty()`.
- SPIR-V path: uses `runtime/gfx/adstack_max_reducer_launch.cpp` and its own runtime registration in `adstack_sizer_launch.cpp:236`.
- Null safety: `(Program *)prog != nullptr` / `cache != nullptr`; `ensure_runtime_registry_ids_for_max_reducer` defensive-seeds `ad_stack.registry_id` from the just-minted id when codegen-time registration was skipped.