Commits (20)
53710fb
[AutoDiff] Stage 1.1: recognize MaxOverRange specs reducible by a par…
duburcqa May 2, 2026
04555c9
[AutoDiff] Stage 1.2: SPIR-V max-reducer shader (option D); body byte…
duburcqa May 5, 2026
6141399
[AutoDiff] Stage 1.3: LLVM runtime function for the option-D max redu…
duburcqa May 5, 2026
f3f7678
[AutoDiff] Stage 1.4a: AdStackCache max-reducer cache methods + body …
duburcqa May 5, 2026
ff71ec2
[AutoDiff] Stage 1.6: substitute_precomputed_max_over_range helper (r…
duburcqa May 5, 2026
ee88986
[AutoDiff] Stage 1.4b: GfxRuntime::dispatch_max_reducers + adstack_ma…
duburcqa May 6, 2026
893bbeb
[AutoDiff] Stage 1.4+1.6: launch_kernel wires dispatch_max_reducers a…
duburcqa May 6, 2026
4d2a8d6
[AutoDiff] Hard-require PSB+Int64 at the adstack reverse-mode entry; …
duburcqa May 6, 2026
a5808c0
[AutoDiff] Stage 1.5 + comment cleanup: LLVM dispatch_max_reducers_fo…
duburcqa May 6, 2026
99ce158
[AutoDiff] Adstack max-reducer: dispatch fixes, Metal u32 atomic, cap…
duburcqa May 6, 2026
17bef65
[AutoDiff] Adstack: short-circuit MaxOverRange walk on cap-hit (avoid…
duburcqa May 6, 2026
e85efd7
[AutoDiff] Adstack: drop LLVM device sizer overflow-flag write to avo…
duburcqa May 6, 2026
07c009d
[AutoDiff] Adstack: scope cap-hit tripwire test to backends with expl…
duburcqa May 6, 2026
5431e11
[AutoDiff] Adstack: drop arch restriction on cap-hit tripwire test
duburcqa May 6, 2026
6e72837
[Docs] Document the per-task sizer iteration cap and its parallel-eva…
duburcqa May 6, 2026
e80f9dd
[AutoDiff] Adstack max-reducer: capture nested MaxOverRange chains ac…
duburcqa May 6, 2026
7f12ce5
[AutoDiff] Adstack max-reducer: round-based dispatch substitutes capt…
duburcqa May 7, 2026
43de9a6
[AutoDiff] Adstack max-reducer: support bound-var-indexed FieldLoad i…
duburcqa May 7, 2026
efd3f69
[AutoDiff] LLVM adstack lazy-claim: split into stage-grouped subdir (…
duburcqa May 7, 2026
9c59fc0
[CI debug] fprintf trace dispatch_max_reducers + cuda allocate_memory…
duburcqa May 7, 2026
1 change: 1 addition & 0 deletions cmake/QuadrantsCore.cmake
Original file line number Diff line number Diff line change
@@ -66,6 +66,7 @@ file(GLOB QUADRANTS_CORE_SOURCE
"quadrants/jit/*"
"quadrants/math/*"
"quadrants/program/*"
"quadrants/program/adstack/*"
"quadrants/struct/*"
"quadrants/system/*"
"quadrants/transforms/*"
5 changes: 4 additions & 1 deletion docs/source/user_guide/autodiff.md
@@ -350,10 +350,13 @@ A large `ndrange` combined with several loop-carried variables multiplies quickly
## What can go wrong

- **Adstack overflow.** Surfaces as `QuadrantsAssertionError: Adstack overflow ...` at the next Quadrants Python entry. The message names the offending kernel + offload task and the most likely cause:
- *Untracked tensor mutation between launches.* A tensor backing a data-dependent loop bound was written outside Quadrants' tracking - typically a DLPack zero-copy mutation through a torch tensor that shares storage with a Quadrants ndarray, or a raw pointer write through a non-torch consumer. The cached adstack capacity was sized against the pre-mutation value; if the mutation grew the bound, the next launch overflows. Workaround: route the write through a Quadrants API (`Ndarray.write` / `Ndarray.fill` / a kernel that writes the value). Alternatively, catch the exception and re-launch - Quadrants invalidates the cached bound on raise, so the retry runs against the live state. Note that kernel state may be inconsistent after an overflow, so restart the step from a clean state before retrying.
- *Sizer under-estimated the bound (Quadrants bug).* On unusually intricate nested loops - typically deeply nested `for i in range(arr[...])` with cumulative-index arithmetic - the sizer can compute a bound that falls short of the actual push count. To file a bug: clear `/tmp/ir/`, rerun your script with `QD_DUMP_IR=1` set in the environment so Quadrants dumps the kernel IR there, then open an issue on the Quadrants repo with the contents of `/tmp/ir/` attached as a zip. Workaround: pass a generous `ad_stack_size=N` to `qd.init()`, with `N` large enough to cover the real push count; this bypasses the sizer.
- **Out-of-memory before the kernel even runs.** A reverse pass through many loop-carried variables at a large ndrange can ask the runtime for more adstack memory than the device can physically back, even when the sizer's number is correct. Surfaces as an allocator OOM at launch time. Remedies are the ones listed under *Avoiding OOM on GPU* above: fewer loop-carried variables, a smaller ndrange, manual checkpointing, or more device-memory headroom.
- **Loop bounds backed by a mutated ndarray.** A reverse-mode kernel with `for i in range(n[j])` requires `n[j]` to hold the same value at the forward call and at `.grad()`. If anything writes to `n[j]` between those two points - the differentiable kernel itself, or any other kernel call - the backward call either triggers an `Adstack overflow` exception or silently computes a wrong gradient. The safe rule: populate loop-bound ndarrays before the forward call and leave them untouched until `.grad()` returns. This follows from the adstack sizer design: Quadrants reads the loop bound afresh at each dispatch, forward and backward alike. Tape-based eager AD such as [PyTorch's autograd](https://pytorch.org/docs/stable/notes/autograd.html) is not affected, since the trip count is recorded as the forward runs and reused at backward time.
- **Inner reverse-mode loop with a complex bound at very large extent.** An arbitrarily large enclosing range works only when the inner trip count is built from a fixed subset of expression shapes; other shapes cap at ~16 million enclosing iterations and raise `RuntimeError: ... iteration count ... exceeds the 16777216 guard` beyond that. Workaround: rewrite the trip count to stay within the supported subset, or shrink the enclosing loop below the threshold.
- *Works at any enclosing-range size:* integer ndarray reads up to 32 bits wide (single- or multi-axis, indexed by literal constants or enclosing loop variables), field reads of the same width indexed by literal constants or enclosing loop variables (`my_field[None]`, `my_field[k]` for a constant `k`, `my_field[i]` where `i` is an enclosing loop variable), `arr.shape[k]` shape terms, literal integer constants, and `+`, `-`, `*`, `max` of those.
- *Caps at the threshold:* 64-bit integer ndarray or field reads, arithmetic-indexed reads (`arr[i // 2]`), and ragged inner ranges whose own bound depends on an enclosing loop variable through an unsupported leaf shape.
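The mutated-loop-bound failure mode above can be modelled in a few lines of plain Python. This is an illustrative sketch, not the Quadrants API: the point is that the sizer re-reads the bound at every dispatch, so a write between the forward call and `.grad()` desynchronises the cached capacity from what the backward pass needs.

```python
# Illustrative model only -- not the Quadrants API. The sizer evaluates the
# loop bound independently at each dispatch (forward and backward alike).
def sized_capacity(bounds, j):
    # Capacity derived for `for i in range(n[j])` at one dispatch.
    return bounds[j]

n = [5]                                  # ndarray backing the loop bound
forward_capacity = sized_capacity(n, 0)  # adstack sized for 5 pushes
n[0] = 8                                 # untracked mutation before .grad()
backward_demand = sized_capacity(n, 0)   # backward now expects 8 entries
overflows = backward_demand > forward_capacity
print(overflows)  # → True
```

Tape-based AD avoids this by construction: the forward pass records the trip count, so there is no second evaluation to disagree with the first.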

## Performance characteristics

5 changes: 5 additions & 0 deletions quadrants/codegen/llvm/codegen_llvm.cpp
@@ -17,6 +17,7 @@
#include "quadrants/codegen/llvm/struct_llvm.h"
#include "quadrants/util/file_sequence_writer.h"
#include "quadrants/codegen/codegen_utils.h"
#include "quadrants/program/adstack_size_expr_eval.h"
#include "llvm/Support/SourceMgr.h"
#include "llvm/AsmParser/Parser.h"
#include "quadrants/codegen/ir_dump.h"
@@ -1993,6 +1994,10 @@ void TaskCodeGenLLVM::finalize_offloaded_task_function() {
current_task->ad_stack.allocas = ad_stack_allocas_info_;
current_task->ad_stack.size_exprs = ad_stack_size_exprs_;
current_task->ad_stack.bound_expr = ad_stack_static_bound_expr_;
// Recognize `MaxOverRange` nodes that the runtime can reduce in parallel via the dedicated max-reducer dispatch
// instead of letting the per-thread sizer enumerate the range serially. Indexing matches `ad_stack_size_exprs_`
// (same iteration order as the pre-scan above).
current_task->ad_stack.max_reducer_specs = recognize_adstack_max_reducer_specs(ad_stack_size_exprs_);
// SNodes the task body mutates. Persisted on `OffloadedTask::snode_writes` so the LLVM
// launcher can invalidate the per-task adstack metadata cache when a kernel that runs in
// between mutates a SNode that an enclosing `size_expr::FieldLoad` reads. Mirrors the SPIR-V
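The recognizer wired in above, `recognize_adstack_max_reducer_specs`, is only named in this hunk. A minimal Python sketch of the shape such a pass could take - the tree representation and field names here are assumptions, not the Quadrants IR - walks each captured size expression and records the reducible `MaxOverRange` nodes by `(stack_id, mor_node_idx)`:

```python
# Hypothetical sketch: flag MaxOverRange nodes the runtime can reduce in
# parallel. Trees are modelled as pre-order lists of dicts; the real IR
# node types are not visible in this diff.
def recognize_adstack_max_reducer_specs(size_exprs):
    specs = []
    for stack_id, expr in enumerate(size_exprs):
        for mor_node_idx, node in enumerate(expr):
            if node["kind"] == "MaxOverRange" and node.get("reducible", False):
                specs.append((stack_id, mor_node_idx))
    return specs

size_exprs = [
    [{"kind": "Const", "value": 4}],
    [{"kind": "Add"}, {"kind": "MaxOverRange", "reducible": True}],
]
print(recognize_adstack_max_reducer_specs(size_exprs))  # → [(1, 1)]
```

Keeping the indexing aligned with `ad_stack_size_exprs_` is what lets the runtime later address a node purely by its `(stack_id, mor_node_idx)` pair.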
8 changes: 7 additions & 1 deletion quadrants/codegen/llvm/llvm_compiled_data.h
@@ -81,6 +81,11 @@ struct AdStackSizingInfo {
// ids are assigned per `Program` lifetime, not per-kernel-content; a deserialised task re-registers
// itself at the next launch.
uint32_t registry_id{0};
// Per-task list of `MaxOverRange` nodes the runtime reduces in parallel via a dedicated max-reducer dispatch (see the
// max-reducer recognizer). Empty when no captured `size_expr` contains a recognized shape. Each entry references one
// alloca's `size_expr` by `(stack_id, mor_node_idx)`; the runtime substitutes the dispatched value as a `Const` into
// the tree before the per-thread sizer walks it.
std::vector<StaticAdStackMaxReducerSpec> max_reducer_specs;
QD_IO_DEF(per_thread_stride,
per_thread_stride_float,
per_thread_stride_int,
@@ -92,7 +97,8 @@
end_offset_bytes,
allocas,
size_exprs,
bound_expr);
bound_expr,
max_reducer_specs);
};

class OffloadedTask {
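The substitution step the `max_reducer_specs` comment describes (and that the `substitute_precomputed_max_over_range` commit names) can be sketched as follows. This is a hedged model with assumed data shapes, not the real helper's signature: the dispatched max-reducer result is spliced into the size-expression tree as a `Const` so the per-thread sizer never enumerates the range.

```python
# Hypothetical model: splice a dispatched max-reducer result into a size
# expression as a Const before the per-thread sizer walks the tree.
def substitute_precomputed_max_over_range(expr, mor_node_idx, value):
    out = list(expr)  # leave the cached tree untouched
    assert out[mor_node_idx]["kind"] == "MaxOverRange"
    out[mor_node_idx] = {"kind": "Const", "value": value}
    return out

expr = [{"kind": "Add"}, {"kind": "MaxOverRange", "reducible": True}]
sized = substitute_precomputed_max_over_range(expr, 1, 4096)
print(sized[1])  # → {'kind': 'Const', 'value': 4096}
```

Copying rather than mutating matters in this model: the cached per-task tree must stay generic so the next launch can substitute a fresh dispatched value.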
1 change: 1 addition & 0 deletions quadrants/codegen/spirv/CMakeLists.txt
@@ -4,6 +4,7 @@ add_library(spirv_codegen)
target_sources(spirv_codegen
PRIVATE
adstack_bound_reducer_shader.cpp
adstack_max_reducer_shader.cpp
adstack_sizer_shader.cpp
kernel_utils.cpp
snode_struct_compiler.cpp