[Docs] Add user-guide page for atomics and bit operations #640
hughperkins wants to merge 6 commits into main from hp/docs-atomics
Conversation
Documents `qd.atomic_*` (add, sub, mul, min, max, and, or, xor) and the bit-counting helpers `qd.math.popcnt` / `qd.math.clz`. Covers semantics (returns old value, per-location atomicity, no fence on the rest of memory), backend / dtype support, performance pitfalls (contention, `f64` fast path, `atomic_mul` CAS loop), and worked examples (slot reservation, histogram, popcnt sum, MSB). Adds `atomics.md` to the SIMT-primitives toctree in `index.md`.
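To make the slot-reservation example concrete, here is a minimal sketch. It assumes a Taichi-style surface API (`qd.init()`, `qd.field`, an `@qd.kernel` decorator, and struct-for loops); none of those names are confirmed by this PR beyond `qd.atomic_add` itself returning the old value.

```python
import quadrants as qd  # assumed package import name

qd.init()  # assumed initialisation entry point

n = 1024
src = qd.field(qd.i32, shape=n)       # input values (assumed field API)
out = qd.field(qd.i32, shape=n)       # compacted output
counter = qd.field(qd.i32, shape=())  # next free output slot


@qd.kernel  # assumed decorator name
def compact_positive():
    for i in src:
        if src[i] > 0:
            # qd.atomic_add returns the OLD value, so each thread that
            # takes this branch reserves a unique slot even under contention.
            slot = qd.atomic_add(counter[None], 1)
            out[slot] = src[i]
```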
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 96eaa1904b
> ### `qd.math.popcnt(x)`
>
> Counts set bits in `x` and returns an `i32`. Lowers to `__popc` / `__popcll` on CUDA, `OpBitCount` on SPIR-V, `__builtin_amdgcn_popcnt` on AMDGPU. Defined for all integer dtypes.
Stop advertising AMDGPU popcnt support
When users run on qd.amdgpu, qd.math.popcnt is currently not lowered: tests/python/test_unary_ops.py::test_popcnt explicitly xfails AMDGPU with “codegen does not lower this op,” and quadrants/codegen/amdgpu/codegen_amdgpu.cpp::emit_extra_unary has no UnaryOpType::popcnt branch. This line (and the later “fully cross-backend” note) will send AMDGPU users to an API path that hits QD_NOT_IMPLEMENTED, so the docs should mark AMDGPU as unsupported until that lowering exists.
> ### `qd.math.clz(x)`
>
> Counts leading zero bits in `x` and returns an `i32`. For a 32-bit input, `clz(0) = 32`; otherwise the result is in `[0, 31]`. Lowers to `__nv_clz` / `__nv_clzll` on CUDA, `FindMSB` on SPIR-V (with `bitwidth - 1 - FindMSB` to convert MSB index into leading-zero count), `__builtin_amdgcn_sffbh_i32` on AMDGPU. See the cross-backend caveats in the support table.
Stop advertising AMDGPU clz support
For the same AMDGPU context, qd.math.clz is not implemented today: tests/python/test_unary_ops.py::test_clz xfails AMDGPU because the codegen does not lower it, and the AMDGPU unary lowering override does not handle UnaryOpType::clz. Documenting an AMDGPU builtin here makes the new user guide claim support for kernels that currently fail at codegen time.
The previous single page covered two unrelated topics: the `qd.atomic_*` read-modify-write ops, and the `qd.math.popcnt` / `qd.math.clz` bit-counting helpers. They share no concerns (memory coordination vs single-thread register ops) and live in different namespaces. Split into:

- `atomics.md` -- title shortened to "Atomics". Drops the bit-counting table, the popcnt / clz semantics subsections, the bitset-popcount and Morton-MSB examples, and the bit-trick portability bullet. Adds a one-line cross-reference to `math.md` in the intro and the Related section.
- `math.md` (new) -- framed as a `qd.math.*` page that currently documents only the bit operations. Lists the broader `qd.math` surface (vec / mat / GLSL ops / trig / rotations / float-classification / constants) in the intro per the project's own `python/quadrants/math/__init__.py` docstring, and notes explicitly that those are exported but not yet documented here. Reuses the bit-counting table, semantics, examples, and portability notes verbatim from the previous `atomics.md`.

The toctree gains a "math" entry next to "atomics" under the SIMT-primitives caption, preserving the existing placement of atomics rather than re-categorising it as part of this split.
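For orientation, the resulting toctree in `index.md` might look roughly like the sketch below, assuming MyST-flavoured Markdown; the caption text and entry order are guesses based on this description, not taken from the PR diff.

````markdown
```{toctree}
:caption: SIMT primitives
:maxdepth: 1

atomics
math
```
````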
Change "All atomic ops work on both global memory (fields, ndarrays) and block-shared memory (qd.simt.block.SharedArray)." to "All atomic ops can be called on either global memory (fields, ndarrays) or block-shared memory (qd.simt.block.SharedArray)." The earlier wording read as "they work in both places", which is true but a little off-target; "can be called on either" matches how a user picks one memory kind per call site.
Removes "## Examples" and its two subsections ("Reserving a
slot in an output array" and "Histogram"). No other content
changes; no in-page anchors referenced the dropped section.
Removes the second sentence of the math.md intro, which listed every category of qd.math.* helpers (vec / mat constructors, GLSL-style ops, trig, rotation builders, float-classification, constants). The remaining intro is just "qd.math is the quadrants standard library of math helpers", followed by the existing note that this page currently documents only the bit-counting helpers.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 74ff985659
> There is no `atomic_cas` (compare-and-swap) exposed in Python today. The C++ runtime uses CmpXchg internally; surfacing it requires extending `AtomicOpType`.
>
> All atomic ops can be called on either global memory (fields, ndarrays) or block-shared memory (`qd.simt.block.SharedArray`). They are sequentially consistent on the location they touch; they are **not** memory fences for the rest of the address space — to publish other writes alongside an atomic, pair the atomic with `qd.simt.block.mem_sync()` (block scope) or `qd.simt.grid.memfence()` (device scope).
Don't present mem_sync as portable
This guidance points every block-scope publishing pattern at qd.simt.block.mem_sync(), but that helper is not available on AMDGPU: python/quadrants/lang/simt/block.py only returns for CUDA or SPIR-V-backed arches and otherwise raises ValueError("qd.block.mem_sync is not supported..."). On AMDGPU kernels, following this new atomics page to pair a shared-memory atomic with the documented fence will fail during tracing, so the doc should either caveat AMDGPU or point AMDGPU users at a supported synchronization primitive.
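A hedged sketch of the publish pattern under discussion, using the device-scope fence and noting the AMDGPU gap this comment raises. The `@qd.kernel` decorator and `qd.types.ndarray` annotation style are assumptions, not confirmed by this PR.

```python
@qd.kernel  # assumed decorator name
def publish(data: qd.types.ndarray(dtype=qd.i32, ndim=1),
            ready_count: qd.types.ndarray(dtype=qd.i32, ndim=1)):  # assumed annotation style
    for i in data:
        data[i] = i * 2                   # the ordinary write we want other threads to observe
        qd.simt.grid.memfence()           # device-scope fence: orders the write above...
        qd.atomic_add(ready_count[0], 1)  # ...before the atomic that announces it
        # For block-scope publishing the page pairs the atomic with
        # qd.simt.block.mem_sync() instead, but per this review that call
        # raises ValueError on AMDGPU today.
```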
> ### `qd.math.popcnt(x)`
>
> Counts set bits in `x` and returns an `i32`. Lowers to `__popc` / `__popcll` on CUDA, `OpBitCount` on SPIR-V, `__builtin_amdgcn_popcnt` on AMDGPU. Defined for all integer dtypes.
Don't claim popcnt covers all integer widths
For CUDA, qd.math.popcnt is only lowered for i32/u32/i64/u64 in quadrants/codegen/cuda/codegen_cuda.cpp; any other integer width falls into QD_NOT_IMPLEMENTED. Saying it is “Defined for all integer dtypes” will send users with e.g. qd.i16/qd.u16 masks down a path that compiles on the frontend but fails in CUDA codegen, so this should be limited to the documented 32- and 64-bit types or explicitly marked backend-dependent.
alanray-tech left a comment
Code Review
Five inline findings below — three factual inaccuracies that should be fixed before merge, plus two suggested improvements. All are backed by source-code evidence and official specs.
> ### `qd.math.popcnt(x)`
>
> Counts set bits in `x` and returns an `i32`. Lowers to `__popc` / `__popcll` on CUDA, `OpBitCount` on SPIR-V, `__builtin_amdgcn_popcnt` on AMDGPU. Defined for all integer dtypes.
Must fix — AMDGPU lowering does not exist; dtype claim is too broad
Two issues with this line:
1. AMDGPU `popcnt` is not lowered. The AMDGPU codegen overrides `emit_extra_unary` but has no `UnaryOpType::popcnt` branch — it falls through to `QD_NOT_IMPLEMENTED` (codegen_amdgpu.cpp:47–175). The test suite confirms this: `test_popcnt` explicitly xfails AMDGPU with "BUG: AMDGPU codegen does not lower this op." (test_unary_ops.py:99–100). Mentioning `__builtin_amdgcn_popcnt` here will send AMDGPU users to a path that crashes at codegen time.
2. "Defined for all integer dtypes" is incorrect on CUDA. The CUDA codegen only handles `i32`/`u32` (`__nv_popc`) and `i64`/`u64` (`__nv_popcll`); any other integer width (e.g. `i8`, `i16`) hits `QD_NOT_IMPLEMENTED` (codegen_cuda.cpp:269–279). See also CUDA Math API — `__popc`: only `unsigned int` and `unsigned long long int` overloads exist.

Suggested rewrite:

> Counts set bits in `x` and returns an `i32`. Lowers to `__popc` / `__popcll` on CUDA (i32/u32/i64/u64 only), `OpBitCount` on SPIR-V. AMDGPU does not currently lower this op. Narrower integer types (i8, i16, u8, u16) are not supported on CUDA.
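As a usage illustration of the narrowed contract (32- and 64-bit integers only), here is a hedged sketch of the popcnt-sum example the PR description mentions; the kernel decorator and ndarray annotations are assumptions, not taken from this PR.

```python
@qd.kernel  # assumed decorator name
def count_set_bits(bitset: qd.types.ndarray(dtype=qd.u32, ndim=1),
                   total: qd.types.ndarray(dtype=qd.i32, ndim=1)):
    for i in bitset:
        # qd.math.popcnt returns an i32; sticking to u32 input keeps us on
        # the documented CUDA / SPIR-V fast path.
        qd.atomic_add(total[0], qd.math.popcnt(bitset[i]))
```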
> ### `qd.math.clz(x)`
>
> Counts leading zero bits in `x` and returns an `i32`. For a 32-bit input, `clz(0) = 32`; otherwise the result is in `[0, 31]`. Lowers to `__nv_clz` / `__nv_clzll` on CUDA, `FindMSB` on SPIR-V (with `bitwidth - 1 - FindMSB` to convert MSB index into leading-zero count), `__builtin_amdgcn_sffbh_i32` on AMDGPU. See the cross-backend caveats in the support table.
Must fix — AMDGPU clz lowering does not exist
Same situation as popcnt: the AMDGPU codegen's emit_extra_unary override has no UnaryOpType::clz branch (codegen_amdgpu.cpp:47–175), and test_clz xfails AMDGPU with "BUG: AMDGPU codegen does not lower this op." (test_unary_ops.py:134–135).
Referencing __builtin_amdgcn_sffbh_i32 here documents an API path that does not exist in the codebase.
Suggested rewrite:

> Counts leading zero bits in `x` and returns an `i32`. For a 32-bit input, `clz(0) = 32`; otherwise the result is in `[0, 31]`. Lowers to `__nv_clz` / `__nv_clzll` on CUDA, `FindMSB` on SPIR-V (with `bitwidth - 1 - FindMSB` to convert MSB index into leading-zero count). AMDGPU does not currently lower this op. See the cross-backend caveats in the support table.
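To show how the `clz(0) = 32` convention plays out, a hedged sketch of the MSB example (same assumed decorator and annotation style as in the earlier sketches):

```python
@qd.kernel  # assumed decorator name
def highest_set_bit(x: qd.types.ndarray(dtype=qd.u32, ndim=1),
                    msb: qd.types.ndarray(dtype=qd.i32, ndim=1)):
    for i in x:
        # For a non-zero 32-bit value, the most significant set bit sits at
        # index 31 - clz(x); clz(0) == 32 maps cleanly to -1 here.
        msb[i] = 31 - qd.math.clz(x[i])
```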
> ## Performance and portability notes
>
> - `qd.math.popcnt` is fully cross-backend.
Must fix — popcnt is not fully cross-backend
As noted above, AMDGPU does not lower popcnt (xfail in tests, no codegen branch). This line should be corrected.
Suggested rewrite:

> `qd.math.popcnt` is supported on CUDA (32- and 64-bit integers), SPIR-V, and CPU. AMDGPU does not currently support it.
> There is no `atomic_cas` (compare-and-swap) exposed in Python today. The C++ runtime uses CmpXchg internally; surfacing it requires extending `AtomicOpType`.
>
> All atomic ops can be called on either global memory (fields, ndarrays) or block-shared memory (`qd.simt.block.SharedArray`). They are sequentially consistent on the location they touch; they are **not** memory fences for the rest of the address space — to publish other writes alongside an atomic, pair the atomic with `qd.simt.block.mem_sync()` (block scope) or `qd.simt.grid.memfence()` (device scope).
Suggestion — mem_sync() is not available on AMDGPU
qd.simt.block.mem_sync() only supports CUDA and SPIR-V arches. On AMDGPU it raises ValueError("qd.block.mem_sync is not supported for arch ...") (block.py:43–49).
This line (and the repeated references at lines 42 and 60) recommends mem_sync() as the block-scope fence without noting the AMDGPU gap. An AMDGPU user following this guide will hit a runtime error.
Consider appending:

> Note: `qd.simt.block.mem_sync()` is currently supported on CUDA and SPIR-V only; on AMDGPU it raises `ValueError`.
> ### `qd.atomic_min(x, y)` / `qd.atomic_max(x, y)`
>
> Atomically writes back `min(x, y)` (resp. `max(x, y)`). Returns the old value of `x`. Floating-point min/max follow IEEE rules — `NaN` propagates: if either input is `NaN`, the result is `NaN`.
Suggestion — NaN propagation claim is likely incorrect
The doc states: "NaN propagates: if either input is NaN, the result is NaN." However:
- The `f16` CAS-loop path uses LLVM's `CreateMinNum` / `CreateMaxNum` (codegen_llvm.cpp:1337–1342), which implement IEEE 754-2008 minNum / maxNum semantics — when exactly one operand is NaN, the non-NaN value is returned (LLVM LangRef — `llvm.minnum`).
- CUDA's hardware `atomicMin` / `atomicMax` for floats (sm_80+) also do not propagate NaN — NaN is treated as larger than any non-NaN value (CUDA Programming Guide §B.12.1.6).
- The SPIR-V path uses GLSL `FMin` / `FMax` (extended instruction set ids 37/40), which per the GLSL 4.50 spec §8.3 return the non-NaN operand when one input is NaN.

Suggested rewrite:

> Floating-point min/max use minNum / maxNum-style semantics — if exactly one input is NaN, the non-NaN value is returned. Behavior when both inputs are NaN is backend-dependent.
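To make the suggested semantics concrete, here is a small host-side Python illustration of minNum-style behaviour; this is plain Python, not Quadrants code, and only mirrors the reference behaviour the comment describes.

```python
import math


def min_num(a: float, b: float) -> float:
    # IEEE 754-2008 minNum: if exactly one operand is NaN, return the other
    # operand rather than propagating the NaN.
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return min(a, b)


nan = float("nan")
print(min_num(nan, 1.0))              # 1.0 -- the non-NaN operand wins
print(min_num(2.0, nan))              # 2.0
print(math.isnan(min_num(nan, nan)))  # True here; backend-dependent on GPUs
```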
…ope, NaN semantics)
Summary of changes from Opus: pushed to hp/docs-atomics. Summary of what I changed in response to each review comment:

math.md:
- Support table rewritten per-backend (CUDA / AMDGPU / SPIR-V columns instead of dtype columns) — drops the false "i32, u32, i64, u64 all yes everywhere" implication.

atomics.md:
- Backend-caveat paragraph added directly after the "pair atomics with mem_sync / memfence" recommendation. Spells out that on AMDGPU block.mem_sync raises and grid.memfence …

Stale codex bot comments (the atomics.md ones for popcnt/clz, from before the split into math.md) are now moot — the canonical fix lives in math.md.
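A sketch of what such a per-backend table might look like; the cell values are inferred only from fragments in this thread and are illustrative, not copied from math.md.

```markdown
| op               | CUDA                        | AMDGPU      | SPIR-V                |
|------------------|-----------------------------|-------------|-----------------------|
| `qd.math.popcnt` | yes (i32 / u32 / i64 / u64) | not lowered | yes                   |
| `qd.math.clz`    | yes (i32 / u32 / i64 / u64) | not lowered | yes (32-bit hard-cap) |
```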
Summary
New user-guide page `docs/source/user_guide/atomics.md` covering the per-thread / scalar tier of Quadrants primitives — atomic read-modify-write ops (`qd.atomic_*`) and bit-counting helpers (`qd.math.popcnt`, `qd.math.clz`). Companion to the in-flight subgroup (#639) and block (#638) docs; this is one of a series of tier-by-tier user-guide pages. Independent of the others — no cross-links to unmerged docs.

Includes:
- backend / dtype support notes (`f64` fast path, `clz` u32 / u64 + SPV 32-bit hard-cap)
- performance notes (`f64` and `atomic_mul` cost, `clz` portability)

Adds `atomics` to the SIMT-primitives toctree in `index.md`.

Test plan
- `cd docs && make html` builds without warnings (no broken cross-links — all unmerged-doc references are plain prose, not Markdown links).

Made with Cursor