[Docs] Add user-guide page for atomics and bit operations #640
hughperkins wants to merge 6 commits into main from hp/docs-atomics
Conversation
Documents `qd.atomic_*` (add, sub, mul, min, max, and, or, xor) and the bit-counting helpers `qd.math.popcnt` / `qd.math.clz`. Covers semantics (returns old value, per-location atomicity, no fence on the rest of memory), backend / dtype support, performance pitfalls (contention, `f64` fast path, `atomic_mul` CAS loop), and worked examples (slot reservation, histogram, popcnt sum, MSB). Adds `atomics.md` to the SIMT-primitives toctree in `index.md`.
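To make the slot-reservation example concrete, here is a minimal sketch. It assumes a Taichi-style surface API (`qd.init()`, `qd.field`, an `@qd.kernel` decorator, and struct-for loops); none of those names are confirmed by this PR beyond `qd.atomic_add` itself returning the old value.

```python
import quadrants as qd  # assumed package import name

qd.init()  # assumed initialisation entry point

n = 1024
src = qd.field(qd.i32, shape=n)       # input values (assumed field API)
out = qd.field(qd.i32, shape=n)       # compacted output
counter = qd.field(qd.i32, shape=())  # next free output slot


@qd.kernel  # assumed decorator name
def compact_positive():
    for i in src:
        if src[i] > 0:
            # qd.atomic_add returns the OLD value, so each thread that
            # takes this branch reserves a unique slot even under contention.
            slot = qd.atomic_add(counter[None], 1)
            out[slot] = src[i]
```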
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 96eaa1904b
> ### `qd.math.popcnt(x)`
>
> Counts set bits in `x` and returns an `i32`. Lowers to `__popc` / `__popcll` on CUDA, `OpBitCount` on SPIR-V, `__builtin_amdgcn_popcnt` on AMDGPU. Defined for all integer dtypes.
Stop advertising AMDGPU popcnt support
When users run on qd.amdgpu, qd.math.popcnt is currently not lowered: tests/python/test_unary_ops.py::test_popcnt explicitly xfails AMDGPU with “codegen does not lower this op,” and quadrants/codegen/amdgpu/codegen_amdgpu.cpp::emit_extra_unary has no UnaryOpType::popcnt branch. This line (and the later “fully cross-backend” note) will send AMDGPU users to an API path that hits QD_NOT_IMPLEMENTED, so the docs should mark AMDGPU as unsupported until that lowering exists.
> ### `qd.math.clz(x)`
>
> Counts leading zero bits in `x` and returns an `i32`. For a 32-bit input, `clz(0) = 32`; otherwise the result is in `[0, 31]`. Lowers to `__nv_clz` / `__nv_clzll` on CUDA, `FindMSB` on SPIR-V (with `bitwidth - 1 - FindMSB` to convert MSB index into leading-zero count), `__builtin_amdgcn_sffbh_i32` on AMDGPU. See the cross-backend caveats in the support table.
Stop advertising AMDGPU clz support
For the same AMDGPU context, qd.math.clz is not implemented today: tests/python/test_unary_ops.py::test_clz xfails AMDGPU because the codegen does not lower it, and the AMDGPU unary lowering override does not handle UnaryOpType::clz. Documenting an AMDGPU builtin here makes the new user guide claim support for kernels that currently fail at codegen time.
The previous single page covered two unrelated topics: the `qd.atomic_*` read-modify-write ops, and the `qd.math.popcnt` / `qd.math.clz` bit-counting helpers. They share no concerns (memory coordination vs single-thread register ops) and live in different namespaces. Split into:

- `atomics.md` -- title shortened to "Atomics". Drops the bit-counting table, the popcnt / clz semantics subsections, the bitset-popcount and Morton-MSB examples, and the bit-trick portability bullet. Adds a one-line cross-reference to `math.md` in the intro and the Related section.
- `math.md` (new) -- framed as a `qd.math.*` page that currently documents only the bit operations. Lists the broader `qd.math` surface (vec / mat / GLSL ops / trig / rotations / float-classification / constants) in the intro per the project's own `python/quadrants/math/__init__.py` docstring, and notes explicitly that those are exported but not yet documented here. Reuses the bit-counting table, semantics, examples, and portability notes verbatim from the previous `atomics.md`.

The toctree gains a "math" entry next to "atomics" under the SIMT-primitives caption, preserving the existing placement of atomics rather than re-categorising it as part of this split.
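For orientation, the resulting toctree in `index.md` might look roughly like the sketch below, assuming MyST-flavoured Markdown; the caption text and entry order are guesses based on this description, not taken from the PR diff.

````markdown
```{toctree}
:caption: SIMT primitives
:maxdepth: 1

atomics
math
```
````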
Change "All atomic ops work on both global memory (fields, ndarrays) and block-shared memory (qd.simt.block.SharedArray)." to "All atomic ops can be called on either global memory (fields, ndarrays) or block-shared memory (qd.simt.block.SharedArray)." The earlier wording read as "they work in both places", which is true but a little off-target; "can be called on either" matches how a user picks one memory kind per call site.
Removes "## Examples" and its two subsections ("Reserving a
slot in an output array" and "Histogram"). No other content
changes; no in-page anchors referenced the dropped section.
Removes the second sentence of the math.md intro, which listed every category of qd.math.* helpers (vec / mat constructors, GLSL-style ops, trig, rotation builders, float-classification, constants). The remaining intro is just "qd.math is the quadrants standard library of math helpers", followed by the existing note that this page currently documents only the bit-counting helpers.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 74ff985659
> There is no `atomic_cas` (compare-and-swap) exposed in Python today. The C++ runtime uses CmpXchg internally; surfacing it requires extending `AtomicOpType`.
>
> All atomic ops can be called on either global memory (fields, ndarrays) or block-shared memory (`qd.simt.block.SharedArray`). They are sequentially consistent on the location they touch; they are **not** memory fences for the rest of the address space — to publish other writes alongside an atomic, pair the atomic with `qd.simt.block.mem_sync()` (block scope) or `qd.simt.grid.memfence()` (device scope).
Don't present mem_sync as portable
This guidance points every block-scope publishing pattern at qd.simt.block.mem_sync(), but that helper is not available on AMDGPU: python/quadrants/lang/simt/block.py only returns for CUDA or SPIR-V-backed arches and otherwise raises ValueError("qd.block.mem_sync is not supported..."). On AMDGPU kernels, following this new atomics page to pair a shared-memory atomic with the documented fence will fail during tracing, so the doc should either caveat AMDGPU or point AMDGPU users at a supported synchronization primitive.
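A hedged sketch of the publish pattern under discussion, using the device-scope fence and noting the AMDGPU gap this comment raises. The `@qd.kernel` decorator and `qd.types.ndarray` annotation style are assumptions, not confirmed by this PR.

```python
@qd.kernel  # assumed decorator name
def publish(data: qd.types.ndarray(dtype=qd.i32, ndim=1),
            ready_count: qd.types.ndarray(dtype=qd.i32, ndim=1)):  # assumed annotation style
    for i in data:
        data[i] = i * 2                   # the ordinary write we want other threads to observe
        qd.simt.grid.memfence()           # device-scope fence: orders the write above...
        qd.atomic_add(ready_count[0], 1)  # ...before the atomic that announces it
        # For block-scope publishing the page pairs the atomic with
        # qd.simt.block.mem_sync() instead, but per this review that call
        # raises ValueError on AMDGPU today.
```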
> ### `qd.math.popcnt(x)`
>
> Counts set bits in `x` and returns an `i32`. Lowers to `__popc` / `__popcll` on CUDA, `OpBitCount` on SPIR-V, `__builtin_amdgcn_popcnt` on AMDGPU. Defined for all integer dtypes.
Don't claim popcnt covers all integer widths
For CUDA, qd.math.popcnt is only lowered for i32/u32/i64/u64 in quadrants/codegen/cuda/codegen_cuda.cpp; any other integer width falls into QD_NOT_IMPLEMENTED. Saying it is “Defined for all integer dtypes” will send users with e.g. qd.i16/qd.u16 masks down a path that compiles on the frontend but fails in CUDA codegen, so this should be limited to the documented 32- and 64-bit types or explicitly marked backend-dependent.
alanray-tech left a comment
Code Review
Five inline findings below — three factual inaccuracies that should be fixed before merge, plus two suggested improvements. All are backed by source-code evidence and official specs.
> ### `qd.math.popcnt(x)`
>
> Counts set bits in `x` and returns an `i32`. Lowers to `__popc` / `__popcll` on CUDA, `OpBitCount` on SPIR-V, `__builtin_amdgcn_popcnt` on AMDGPU. Defined for all integer dtypes.
Must fix — AMDGPU lowering does not exist; dtype claim is too broad
Two issues with this line:
1. AMDGPU `popcnt` is not lowered. The AMDGPU codegen overrides `emit_extra_unary` but has no `UnaryOpType::popcnt` branch — it falls through to `QD_NOT_IMPLEMENTED` (codegen_amdgpu.cpp:47–175). The test suite confirms this: `test_popcnt` explicitly xfails AMDGPU with "BUG: AMDGPU codegen does not lower this op." (test_unary_ops.py:99–100). Mentioning `__builtin_amdgcn_popcnt` here will send AMDGPU users to a path that crashes at codegen time.
2. "Defined for all integer dtypes" is incorrect on CUDA. The CUDA codegen only handles `i32`/`u32` (`__nv_popc`) and `i64`/`u64` (`__nv_popcll`); any other integer width (e.g. `i8`, `i16`) hits `QD_NOT_IMPLEMENTED` (codegen_cuda.cpp:269–279). See also CUDA Math API — `__popc`: only `unsigned int` and `unsigned long long int` overloads exist.

Suggested rewrite:

> Counts set bits in `x` and returns an `i32`. Lowers to `__popc` / `__popcll` on CUDA (i32/u32/i64/u64 only), `OpBitCount` on SPIR-V. AMDGPU does not currently lower this op. Narrower integer types (i8, i16, u8, u16) are not supported on CUDA.
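As a usage illustration of the narrowed contract (32- and 64-bit integers only), here is a hedged sketch of the popcnt-sum example the PR description mentions; the kernel decorator and ndarray annotations are assumptions, not taken from this PR.

```python
@qd.kernel  # assumed decorator name
def count_set_bits(bitset: qd.types.ndarray(dtype=qd.u32, ndim=1),
                   total: qd.types.ndarray(dtype=qd.i32, ndim=1)):
    for i in bitset:
        # qd.math.popcnt returns an i32; sticking to u32 input keeps us on
        # the documented CUDA / SPIR-V fast path.
        qd.atomic_add(total[0], qd.math.popcnt(bitset[i]))
```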
> ### `qd.math.clz(x)`
>
> Counts leading zero bits in `x` and returns an `i32`. For a 32-bit input, `clz(0) = 32`; otherwise the result is in `[0, 31]`. Lowers to `__nv_clz` / `__nv_clzll` on CUDA, `FindMSB` on SPIR-V (with `bitwidth - 1 - FindMSB` to convert MSB index into leading-zero count), `__builtin_amdgcn_sffbh_i32` on AMDGPU. See the cross-backend caveats in the support table.
Must fix — AMDGPU clz lowering does not exist
Same situation as popcnt: the AMDGPU codegen's emit_extra_unary override has no UnaryOpType::clz branch (codegen_amdgpu.cpp:47–175), and test_clz xfails AMDGPU with "BUG: AMDGPU codegen does not lower this op." (test_unary_ops.py:134–135).
Referencing __builtin_amdgcn_sffbh_i32 here documents an API path that does not exist in the codebase.
Suggested rewrite:

> Counts leading zero bits in `x` and returns an `i32`. For a 32-bit input, `clz(0) = 32`; otherwise the result is in `[0, 31]`. Lowers to `__nv_clz` / `__nv_clzll` on CUDA, `FindMSB` on SPIR-V (with `bitwidth - 1 - FindMSB` to convert MSB index into leading-zero count). AMDGPU does not currently lower this op. See the cross-backend caveats in the support table.
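To show how the `clz(0) = 32` convention plays out, a hedged sketch of the MSB example (same assumed decorator and annotation style as in the earlier sketches):

```python
@qd.kernel  # assumed decorator name
def highest_set_bit(x: qd.types.ndarray(dtype=qd.u32, ndim=1),
                    msb: qd.types.ndarray(dtype=qd.i32, ndim=1)):
    for i in x:
        # For a non-zero 32-bit value, the most significant set bit sits at
        # index 31 - clz(x); clz(0) == 32 maps cleanly to -1 here.
        msb[i] = 31 - qd.math.clz(x[i])
```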
> ## Performance and portability notes
>
> - `qd.math.popcnt` is fully cross-backend.
Must fix — popcnt is not fully cross-backend
As noted above, AMDGPU does not lower popcnt (xfail in tests, no codegen branch). This line should be corrected.
Suggested rewrite:

> `qd.math.popcnt` is supported on CUDA (32- and 64-bit integers), SPIR-V, and CPU. AMDGPU does not currently support it.
> There is no `atomic_cas` (compare-and-swap) exposed in Python today. The C++ runtime uses CmpXchg internally; surfacing it requires extending `AtomicOpType`.
>
> All atomic ops can be called on either global memory (fields, ndarrays) or block-shared memory (`qd.simt.block.SharedArray`). They are sequentially consistent on the location they touch; they are **not** memory fences for the rest of the address space — to publish other writes alongside an atomic, pair the atomic with `qd.simt.block.mem_sync()` (block scope) or `qd.simt.grid.memfence()` (device scope).
Suggestion — mem_sync() is not available on AMDGPU
qd.simt.block.mem_sync() only supports CUDA and SPIR-V arches. On AMDGPU it raises ValueError("qd.block.mem_sync is not supported for arch ...") (block.py:43–49).
This line (and the repeated references at lines 42 and 60) recommends mem_sync() as the block-scope fence without noting the AMDGPU gap. An AMDGPU user following this guide will hit a runtime error.
Consider appending:

> Note: `qd.simt.block.mem_sync()` is currently supported on CUDA and SPIR-V only; on AMDGPU it raises `ValueError`.
> ### `qd.atomic_min(x, y)` / `qd.atomic_max(x, y)`
>
> Atomically writes back `min(x, y)` (resp. `max(x, y)`). Returns the old value of `x`. Floating-point min/max follow IEEE rules — `NaN` propagates: if either input is `NaN`, the result is `NaN`.
Suggestion — NaN propagation claim is likely incorrect
The doc states: "NaN propagates: if either input is NaN, the result is NaN." However:
- The `f16` CAS-loop path uses LLVM's `CreateMinNum` / `CreateMaxNum` (codegen_llvm.cpp:1337–1342), which implement IEEE 754-2008 minNum / maxNum semantics — when exactly one operand is NaN, the non-NaN value is returned (LLVM LangRef — `llvm.minnum`).
- CUDA's hardware `atomicMin` / `atomicMax` for floats (sm_80+) also do not propagate NaN — NaN is treated as larger than any non-NaN value (CUDA Programming Guide §B.12.1.6).
- The SPIR-V path uses GLSL `FMin` / `FMax` (extended instruction set ids 37/40), which per the GLSL 4.50 spec §8.3 return the non-NaN operand when one input is NaN.

Suggested rewrite:

> Floating-point min/max use minNum / maxNum-style semantics — if exactly one input is NaN, the non-NaN value is returned. Behavior when both inputs are NaN is backend-dependent.
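To make the suggested semantics concrete, here is a small host-side Python illustration of minNum-style behaviour; this is plain Python, not Quadrants code, and only mirrors the reference behaviour the comment describes.

```python
import math


def min_num(a: float, b: float) -> float:
    # IEEE 754-2008 minNum: if exactly one operand is NaN, return the other
    # operand rather than propagating the NaN.
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return min(a, b)


nan = float("nan")
print(min_num(nan, 1.0))              # 1.0 -- the non-NaN operand wins
print(min_num(2.0, nan))              # 2.0
print(math.isnan(min_num(nan, nan)))  # True here; backend-dependent on GPUs
```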
…ope, NaN semantics)
Summary of changes from Opus: pushed to hp/docs-atomics. Summary of what I changed in response to each review comment:

math.md:
- Support table rewritten per-backend (CUDA / AMDGPU / SPIR-V columns instead of dtype columns) — drops the false "i32, u32, i64, u64 all yes everywhere" implication.

atomics.md:
- Backend-caveat paragraph added directly after the "pair atomics with mem_sync / memfence" recommendation. Spells out that on AMDGPU block.mem_sync raises and grid.memfence …

Stale codex bot comments (the atomics.md ones for popcnt/clz, from before the split into math.md) are now moot — the canonical fix lives in math.md.
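A sketch of what such a per-backend table might look like; the cell values are inferred only from fragments in this thread and are illustrative, not copied from math.md.

```markdown
| op               | CUDA                        | AMDGPU      | SPIR-V                |
|------------------|-----------------------------|-------------|-----------------------|
| `qd.math.popcnt` | yes (i32 / u32 / i64 / u64) | not lowered | yes                   |
| `qd.math.clz`    | yes (i32 / u32 / i64 / u64) | not lowered | yes (32-bit hard-cap) |
```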
Summary
New user-guide page `docs/source/user_guide/atomics.md` covering the per-thread / scalar tier of Quadrants primitives — atomic read-modify-write ops (`qd.atomic_*`) and bit-counting helpers (`qd.math.popcnt`, `qd.math.clz`). Companion to the in-flight subgroup (#639) and block (#638) docs; this is one of a series of tier-by-tier user-guide pages. Independent of the others — no cross-links to unmerged docs.

Includes:
- backend / dtype support notes (`f64` fast path, `clz` u32 / u64 + SPV 32-bit hard-cap)
- performance notes (`f64` and `atomic_mul` cost, `clz` portability)

Adds `atomics` to the SIMT-primitives toctree in `index.md`.

Test plan
- `cd docs && make html` builds without warnings (no broken cross-links — all unmerged-doc references are plain prose, not Markdown links).

Made with Cursor