Merged
32 commits
c56d388
[AutoDiff] Stage 1.1: recognize MaxOverRange specs reducible by a par…
duburcqa May 2, 2026
5b7c464
[AutoDiff] Stage 1.2: SPIR-V max-reducer shader (option D); body byte…
duburcqa May 5, 2026
cf1fdd2
[AutoDiff] Stage 1.3: LLVM runtime function for the option-D max redu…
duburcqa May 5, 2026
872e2a6
[AutoDiff] Stage 1.4a: AdStackCache max-reducer cache methods + body …
duburcqa May 5, 2026
73f1f36
[AutoDiff] Stage 1.6: substitute_precomputed_max_over_range helper (r…
duburcqa May 5, 2026
961349f
[AutoDiff] Stage 1.4b: GfxRuntime::dispatch_max_reducers + adstack_ma…
duburcqa May 6, 2026
639e1cd
[AutoDiff] Stage 1.4+1.6: launch_kernel wires dispatch_max_reducers a…
duburcqa May 6, 2026
0df62e1
[AutoDiff] Hard-require PSB+Int64 at the adstack reverse-mode entry; …
duburcqa May 6, 2026
a9c9b95
[AutoDiff] Stage 1.5 + comment cleanup: LLVM dispatch_max_reducers_fo…
duburcqa May 6, 2026
c03016f
[AutoDiff] Adstack max-reducer: dispatch fixes, Metal u32 atomic, cap…
duburcqa May 6, 2026
f73c157
[AutoDiff] Adstack: short-circuit MaxOverRange walk on cap-hit (avoid…
duburcqa May 6, 2026
3e6a03e
[AutoDiff] Adstack: drop LLVM device sizer overflow-flag write to avo…
duburcqa May 6, 2026
d0b908f
[AutoDiff] Adstack: scope cap-hit tripwire test to backends with expl…
duburcqa May 6, 2026
9748cc9
[AutoDiff] Adstack: drop arch restriction on cap-hit tripwire test
duburcqa May 6, 2026
98dd82d
[Docs] Document the per-task sizer iteration cap and its parallel-eva…
duburcqa May 6, 2026
9a8bc2d
[AutoDiff] Adstack max-reducer: capture nested MaxOverRange chains ac…
duburcqa May 6, 2026
47fc8d2
[AutoDiff] Adstack max-reducer: round-based dispatch substitutes capt…
duburcqa May 7, 2026
df42498
[AutoDiff] Adstack max-reducer: support bound-var-indexed FieldLoad i…
duburcqa May 7, 2026
f6c146b
[AutoDiff] LLVM adstack lazy-claim: split into stage-grouped subdir (…
duburcqa May 7, 2026
91aa148
[Runtime] Split adstack runtime helpers into a separate translation u…
duburcqa May 7, 2026
3dc7253
[Docs] Reformat 'What can go wrong' as FAQ-style subsections; tighten…
duburcqa May 7, 2026
f34db99
[CI] Search $LLVM_DIR/bin for llvm-link so the runtime bitcode link s…
duburcqa May 7, 2026
85ceb31
[CI] chmod 0755 LLVM toolchain binaries after extract so the bitcode …
duburcqa May 7, 2026
d92fee3
[Docs] Reflow three comment blocks in adstack max-reducer files to wr…
duburcqa May 7, 2026
279baf6
[Runtime] Revert separate-TU build to single-TU include-cpp; llvm-lin…
duburcqa May 7, 2026
1d695c8
[AutoDiff] Skip LLVM max-reducer dispatch on pre-Ampere CUDA where th…
May 7, 2026
e830d60
Fix CUDA Graph grad for adstack.
duburcqa May 7, 2026
d9397bb
[AutoDiff] Pin max-reducer dispatch to nullptr stream on CUDA to matc…
duburcqa May 7, 2026
157ddef
[AutoDiff] LLVM max-reducer: split CPU serial vs CUDA/AMDGPU parallel…
duburcqa May 7, 2026
fbe4c6e
[Docs] Reword 'Inner reverse-mode loop with a complex bound' to use c…
duburcqa May 7, 2026
f19244c
[Perf] Adstack max-reducer: gate per-launch dispatch on captured spec…
duburcqa May 8, 2026
9ca862f
[Docs] Reword 'Inner reverse-mode loop with a complex bound' section …
duburcqa May 8, 2026
1 change: 1 addition & 0 deletions cmake/QuadrantsCore.cmake
@@ -66,6 +66,7 @@ file(GLOB QUADRANTS_CORE_SOURCE
"quadrants/jit/*"
"quadrants/math/*"
"quadrants/program/*"
"quadrants/program/adstack/*"
"quadrants/struct/*"
"quadrants/system/*"
"quadrants/transforms/*"
81 changes: 76 additions & 5 deletions docs/source/user_guide/autodiff.md
@@ -349,11 +349,28 @@ A large `ndrange` combined with several loop-carried variables multiplies quickl

## What can go wrong

- **Adstack overflow.** Surfaces as `QuadrantsAssertionError: Adstack overflow ...` at the next Quadrants Python entry. The message names the offending kernel + offload task and the most likely cause:
- *Untracked tensor mutation between launches.* A tensor backing a data-dependent loop bound was written to outside Quadrants's tracking - typically a DLPack zero-copy mutation through a torch tensor sharing storage with a Quadrants ndarray, or a raw pointer write through a non-torch consumer. The cached adstack capacity was sized against the value before the mutation; if the mutation grew the bound, the next launch overflows. Fix: route the write through a Quadrants API (`Ndarray.write` / `Ndarray.fill` / a kernel that writes the value). Alternatively, catch the exception and re-launch - Quadrants invalidates the cached bound on raise, so the retry runs against the live state. Kernel state may be inconsistent after an overflow; do not retry the same step without restarting from a clean state.
- *Sizer under-estimated the bound (Quadrants bug).* On unusually intricate nested loops - typically deeply nested `for i in range(arr[...])` with cumulative-index arithmetic - the sizer can compute a bound that is mathematically tighter than the actual push count. To file a bug: clear `/tmp/ir/`, rerun your script with `QD_DUMP_IR=1` set in the environment so Quadrants dumps the kernel IR there, then open an issue on the Quadrants repo with the contents of `/tmp/ir/` attached as a zip. Workaround: pass a generous `ad_stack_size=N` to `qd.init()` with `N` large enough to cover the real push count (bypasses the sizer).
- **Out-of-memory before the kernel even runs.** A reverse pass through many loop-carried variables at a large ndrange can ask the runtime for more adstack memory than the device can physically back, even when the sizer's number is correct. Surfaces as an allocator OOM at launch time. Remedies are the ones listed under *Avoiding OOM on GPU* above: fewer loop-carried variables, a smaller ndrange, manual checkpointing, or more device-memory headroom.
- **Loop bounds backed by a mutated ndarray.** A reverse-mode kernel with `for i in range(n[j])` requires `n[j]` to hold the same value at the forward call and at `.grad()`. If anything writes to `n[j]` between those two points - the differentiable kernel itself, or any other kernel call - the backward call will trigger an `Adstack overflow` exception or the computed gradient would come out silently wrong. The safe rule: populate loop-bound ndarrays before the forward call and leave them untouched until `.grad()` returns. The reason for that is Quadrants' adstack sizer design: it reads the loop bound separately at each dispatch, which includes forward and backward calls. Tape-based eager AD like [PyTorch's autograd](https://pytorch.org/docs/stable/notes/autograd.html) is not affected, since the trip count is recorded as the forward runs and reused at backward time.
### Adstack overflow

Surfaces as `QuadrantsAssertionError: Adstack overflow ...` at the next Quadrants Python entry. The message names the offending kernel + offload task and the most likely cause.

The two cases the runtime distinguishes:

- *Untracked tensor mutation between launches.* A tensor backing a data-dependent loop bound was written to outside Quadrants's tracking - typically a DLPack zero-copy mutation through a torch tensor sharing storage with a Quadrants ndarray, or a raw pointer write through a non-torch consumer. The cached adstack capacity was sized against the value before the mutation; if the mutation grew the bound, the next launch overflows. Workaround: route the write through a Quadrants API (`Ndarray.write` / `Ndarray.fill` / a kernel that writes the value). Alternatively, catch the exception and re-launch - Quadrants invalidates the cached bound on raise, so the retry runs against the live state. Kernel state may be inconsistent after an overflow; do not retry the same step without restarting from a clean state.
- *Sizer under-estimated the bound (Quadrants bug).* On unusually intricate nested loops - typically deeply nested `for i in range(arr[...])` with cumulative-index arithmetic - the sizer can compute a bound that is mathematically tighter than the actual push count. To file a bug: clear `/tmp/ir/`, rerun your script with `QD_DUMP_IR=1` set in the environment so Quadrants dumps the kernel IR there, then open an issue on the Quadrants repo with the contents of `/tmp/ir/` attached as a zip. Workaround: pass a generous `ad_stack_size=N` to `qd.init()` with `N` large enough to cover the real push count (bypasses the sizer).

### Out-of-memory before the kernel even runs

A reverse pass through many loop-carried variables at a large ndrange can ask the runtime for more adstack memory than the device can physically back, even when the sizer's number is correct. Surfaces as an allocator OOM at launch time. Remedies are the ones listed under *Avoiding OOM on GPU* above: fewer loop-carried variables, a smaller ndrange, manual checkpointing, or more device-memory headroom.

### Loop bounds backed by a mutated ndarray

A reverse-mode kernel with `for i in range(n[j])` requires `n[j]` to hold the same value at the forward call and at `.grad()`. If anything writes to `n[j]` between those two points - the differentiable kernel itself, or any other kernel call - the backward call will either raise an `Adstack overflow` exception or compute a silently wrong gradient.

The safe rule: populate loop-bound ndarrays before the forward call and leave them untouched until `.grad()` returns. The reason for that is Quadrants' adstack sizer design: it reads the loop bound separately at each dispatch, which includes forward and backward calls. Tape-based eager AD like [PyTorch's autograd](https://pytorch.org/docs/stable/notes/autograd.html) is not affected, since the trip count is recorded as the forward runs and reused at backward time.

### Inner reverse-mode loop with a complex bound at very large extent

A reverse-mode kernel with two nested loops is in some cases limited to an outer-loop extent of at most `1 << 24`, in particular when the enclosed loop's trip count is an uncommon expression of the outer-loop variable, e.g. `for i in range(arr.shape[0]): ... for j in range(arr[i // 2]):`. See [Appendix C](#appendix-c-evaluation-of-the-enclosed-loops-bound-expression) for a complete walkthrough of the enclosed loop's bound expression and workarounds. When the limit applies and the outer extent exceeds it, the kernel raises `RuntimeError: ... iteration count ... exceeds the 16777216 guard` at launch.

Collaborator


I can't help suspecting this is very likely the original bullet point, just with a reference to Appendix C added :) However, having the link to Appendix C does reduce the burden of being easily understandable, I feel. So, ok :)

Contributor Author


It is not!


## Performance characteristics

@@ -394,6 +411,12 @@ def k_data_dependent(a):
    for i in range(a.shape[0]):
        while a[i] < 10:  # bound that can only be known by running the loop body
            a[i] = a[i] + 1

@qd.kernel
def k_inner_struct_for(a, field):
    for i in range(a.shape[0]):
        for j in field:  # struct-for as the enclosed loop with reverse-mode pushes
            ...
```

## Appendix B: gate-index shapes that capture vs fall back to the worst-case heap
@@ -414,3 +437,51 @@ Patterns that fall back to the worst-case heap:
- **Constant-index gate**: `field[42]`, or any axis that is a literal constant.
- **Kernel-argument index, no iterating axis**: `field[arg]` where every axis is launch-constant.
- **Indirect index via runtime load**: `field[other_field[i]]`; the compiler cannot prove `other_field` is injective.

## Appendix C: evaluation of the enclosed loop's bound expression

This appendix details how the runtime computes the worst-case trip count of an enclosed reverse-mode loop and which expression shapes each evaluation path accepts. It backs the *Inner reverse-mode loop with a complex bound at very large extent* entry under [What can go wrong](#what-can-go-wrong).

Consider a reverse-mode kernel with two nested loops where the enclosed loop's iteration count depends on the outer loop variable through an arithmetic expression on an ndarray index:

```python
for i in range(arr.shape[0]):  # outer loop
    for j in range(arr[i // 2]):  # enclosed loop: for <var> in range(<bound expression>)
        ...
```

The enclosed loop's iteration count `arr[i // 2]` is what we call the enclosed loop's *bound expression*. It is a function of the outer-loop variable `i`: as `i` ranges over `[0, arr.shape[0])`, the bound expression evaluates to a different integer at each iteration. Reverse-mode autodiff needs the adstack sized for the worst case - the largest inner-loop trip count that will ever occur across the outer loop's full range, i.e. `max(arr[i // 2] for i in range(arr.shape[0]))`. For example, if `arr = [3, 5, 1]` and the outer loop runs `i` over `[0, 6)`:

| `i` | `i // 2` | bound expression `arr[i // 2]` |
| --- | --- | --- |
| 0 | 0 | 3 |
| 1 | 0 | 3 |
| 2 | 1 | 5 |
| 3 | 1 | 5 |
| 4 | 2 | 1 |
| 5 | 2 | 1 |

Quadrants computes that worst case at launch time - in this example, the max of the column above, 5 - and sizes the adstack accordingly: each outer iteration accommodates up to 5 pushes and the adstack never overflows. With deeper loop nests each enclosed loop's bound expression is reduced separately and the adstack is sized as the product of those maxes.
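
As a minimal sketch of the worst-case computation described above, the following plain-Python model reproduces the table's maximum. It is illustrative only and does not use Quadrants: `bound_expression` and `outer_extent` are hypothetical names introduced here, and the runtime's actual evaluation happens on the device at launch time.

```python
# Illustrative pure-Python model of the worst-case trip count; not Quadrants code.
arr = [3, 5, 1]
outer_extent = 6  # the outer loop runs i over [0, 6), as in the table above


def bound_expression(i):
    """Enclosed loop's trip count for outer-loop index i (hypothetical helper)."""
    return arr[i // 2]


# Largest enclosed-loop trip count across the full outer range.
worst_case = max(bound_expression(i) for i in range(outer_extent))
print(worst_case)
```

With the values from the table, `worst_case` is 5, matching the maximum of the right-hand column.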

### Evaluation paths

The compiler picks one of two evaluation paths to compute the maximum based on the bound expression's structure:

- **Parallel:** the maximum is computed with a tiny parallel reduction kernel for efficiency. The reducer accepts a common subset of bound expressions:
  - **Integer ndarray or field read** up to 32 bits wide, indexed by literal constants or outer-loop variables: `arr[i, j]`, `field[i]`.
  - **Shape term**: `arr.shape[k]`.
  - **Literal integer constant**: `42`.
  - **Arithmetic combinator**: any `+`, `-`, `*`, `max` of the above.
- **Sequential:** the fallback path, used whenever the parallel path doesn't support the bound expression. Quadrants walks the bound expression one outer-loop iteration at a time on a single thread; the adstack is sized identically, only the upfront cost differs. This path accepts everything the parallel path does, plus:
  - **Arithmetic-indexed read**: `arr[i // 2]`, `arr[i % 4]`.
  - **Indirect / nested read**: `arr1[arr2[i]]`, `my_field[arr[i]]`.

### Nested loops

Quadrants supports arbitrarily nested loops. When the bound expression itself contains another enclosed loop whose own bound expression must be reduced first, the enclosing bound expression takes the parallel path only if every nested bound expression also fits the parallel-path grammar; otherwise it falls back to the sequential walk. This keeps the runtime from mixing parallel and sequential evaluators inside a single bound expression, which would otherwise force per-iteration kernel launches.

### Sequential walk cap

The sequential walk's outer loop is artificially capped at 2^24 = 16 777 216 iterations to keep both the walk time and the read-tracking memory bounded; past that the kernel raises `RuntimeError: ... iteration count ... exceeds the 16777216 guard`. In the example above, the iteration count of the enclosed loop takes the sequential path because of the `i // 2` index, so it would raise at launch if `arr.shape[0] > (1 << 24)`.
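
The cap can be pictured with a small plain-Python model of the sequential walk. This is a sketch under stated assumptions, not the runtime's implementation: `sequential_max`, `WALK_CAP`, and the exact error text are hypothetical, and the real walk also tracks reads, which is omitted here.

```python
# Illustrative model of a capped sequential walk; names and message are hypothetical.
WALK_CAP = 1 << 24  # 16 777 216 outer-loop iterations


def sequential_max(bound_expression, outer_extent, cap=WALK_CAP):
    """Walk the outer range one index at a time and return the largest bound."""
    if outer_extent > cap:
        # Refuse up front instead of walking an unbounded number of iterations.
        raise RuntimeError(f"iteration count {outer_extent} exceeds the {cap} guard")
    worst = 0
    for i in range(outer_extent):
        worst = max(worst, bound_expression(i))
    return worst


arr = [3, 5, 1]
print(sequential_max(lambda i: arr[i // 2], 6))
```

Passing a smaller `cap` makes it easy to exercise the guard without looping over millions of indices.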

To circumvent this limitation, rewrite the bound expression to unlock the parallel path (e.g. precompute `bounds[i] = arr[i // 2]` into a persistent separate buffer, pass `bounds` in as an input, and use `for j in range(bounds[i]):`), or keep the outer loop count below 2^24.
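
The precompute workaround can be sketched in plain Python as follows. Lists stand in for ndarrays here, which is an assumption made for illustration; in a real program `bounds` would be a persistent buffer populated before the forward call, and how it is filled is not shown in this document.

```python
# Illustrative host-side precompute; plain lists stand in for ndarrays.
arr = [3, 5, 1]
outer_extent = 6

# Materialize the arithmetic-indexed read once, so the kernel's bound
# expression becomes a plain outer-variable-indexed read `bounds[i]`.
bounds = [arr[i // 2] for i in range(outer_extent)]

# The enclosed loop would then read `for j in range(bounds[i]):`.
# The worst case is unchanged by the rewrite:
original_worst = max(arr[i // 2] for i in range(outer_extent))
rewritten_worst = max(bounds[i] for i in range(outer_extent))
print(bounds, original_worst, rewritten_worst)
```

The rewrite trades an extra buffer of `outer_extent` integers for a bound expression of the plain `arr[i]` shape that the parallel path lists as accepted.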
5 changes: 5 additions & 0 deletions quadrants/codegen/llvm/codegen_llvm.cpp
@@ -17,6 +17,7 @@
#include "quadrants/codegen/llvm/struct_llvm.h"
#include "quadrants/util/file_sequence_writer.h"
#include "quadrants/codegen/codegen_utils.h"
#include "quadrants/program/adstack_size_expr_eval.h"
#include "llvm/Support/SourceMgr.h"
#include "llvm/AsmParser/Parser.h"
#include "quadrants/codegen/ir_dump.h"
@@ -1993,6 +1994,10 @@ void TaskCodeGenLLVM::finalize_offloaded_task_function() {
current_task->ad_stack.allocas = ad_stack_allocas_info_;
current_task->ad_stack.size_exprs = ad_stack_size_exprs_;
current_task->ad_stack.bound_expr = ad_stack_static_bound_expr_;
// recognize `MaxOverRange` nodes that the runtime can reduce in parallel via the dedicated max-reducer dispatch
// instead of letting the per-thread sizer enumerate. Indexing matches `ad_stack_size_exprs_` (same iteration order
// as the pre-scan above).
current_task->ad_stack.max_reducer_specs = recognize_adstack_max_reducer_specs(ad_stack_size_exprs_);
// Snodes the task body mutates. Persisted on `OffloadedTask::snode_writes` so the LLVM
// launcher can invalidate the per-task adstack metadata cache when a kernel that runs in
// between mutated a SNode an enclosing `size_expr::FieldLoad` reads. Mirrors the SPIR-V
8 changes: 7 additions & 1 deletion quadrants/codegen/llvm/llvm_compiled_data.h
@@ -81,6 +81,11 @@ struct AdStackSizingInfo {
// ids are assigned per `Program` lifetime, not per-kernel-content; a deserialised task re-registers
// itself at the next launch.
uint32_t registry_id{0};
// Per-task list of `MaxOverRange` nodes the runtime reduces in parallel via a dedicated max-reducer dispatch (see the
// max-reducer recognizer). Empty when no captured `size_expr` contains a recognized shape. Each entry references one
// alloca's `size_expr` by `(stack_id, mor_node_idx)`; the runtime substitutes the dispatched value as a `Const` into
// the tree before the per-thread sizer walks it.
std::vector<StaticAdStackMaxReducerSpec> max_reducer_specs;
QD_IO_DEF(per_thread_stride,
per_thread_stride_float,
per_thread_stride_int,
@@ -92,7 +97,8 @@ struct AdStackSizingInfo {
end_offset_bytes,
allocas,
size_exprs,
bound_expr);
bound_expr,
max_reducer_specs);
};

class OffloadedTask {
1 change: 1 addition & 0 deletions quadrants/codegen/spirv/CMakeLists.txt
@@ -4,6 +4,7 @@ add_library(spirv_codegen)
target_sources(spirv_codegen
PRIVATE
adstack_bound_reducer_shader.cpp
adstack_max_reducer_shader.cpp
adstack_sizer_shader.cpp
kernel_utils.cpp
snode_struct_compiler.cpp