[Docs] Add user-guide page for qd.algorithms.* device-wide algorithms#642
[Docs] Add user-guide page for qd.algorithms.* device-wide algorithms#642hughperkins wants to merge 13 commits intomainfrom
Conversation
Documents the two ops currently in qd.algorithms: parallel_sort (odd-even merge sort, key or key-value, all backends, not stable) and PrefixSumExecutor (Kogge-Stone hierarchical inclusive scan, i32 only, CUDA + Vulkan only). Covers semantics, the i32 / CUDA + Vulkan limitation that cross-platform code most commonly hits, the allocate-once / run-many pattern, and worked examples (key-value sort, scan-based compact). Adds a new 'Algorithms' caption to the toctree in index.md.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3c80d6083f
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
|
|
||
| `run(input_arr)`: | ||
|
|
||
| - `input_arr` must be a 1-D `qd.field(qd.i32, shape=(L,))` with `L <= length`. |
There was a problem hiding this comment.
Require the scanned array to match the executor length
This documents length as a maximum and says any L <= length input is valid, but PrefixSumExecutor.run() always copies exactly self.sorting_length elements from and back to input_arr (python/quadrants/algorithms/_algorithms.py:90 and :114) and never switches to the runtime array length. If a user follows this and reuses an executor for a shorter field, the kernel reads/writes past that field's extent rather than scanning only L elements, so the docs should state that the field length must match the constructor length unless the implementation is changed.
Useful? React with 👍 / 👎.
…already covered upstream)
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0fd07ef95e
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
|
|
||
| ### `qd.algorithms.parallel_sort(keys, values=None)` | ||
|
|
||
| In-place sort. Reorders `keys` ascending; if `values` is provided, applies the same permutation to `values` (key-value sort). Both arguments must be 1-D `field` or `ndarray`. |
There was a problem hiding this comment.
This documents parallel_sort as accepting qd.ndarray, but the implementation passes the argument to sort_stage, which immediately accesses keys.snode.ptr.offset / values.snode.ptr.offset in python/quadrants/_kernels.py; ScalarNdarray stores an arr handle and has no snode attribute. In the scenario where a user follows this new guide and calls qd.algorithms.parallel_sort(qd.ndarray(...)), the call fails during tracing instead of sorting, so the public docs should not advertise ndarray support unless the kernel is changed to handle ndarray arguments.
Useful? React with 👍 / 👎.
alanray-tech
left a comment
There was a problem hiding this comment.
Code Review
Two inline findings — both are factual inaccuracies that will mislead users into hitting runtime crashes. Backed by source-code evidence.
|
|
||
| ### `qd.algorithms.parallel_sort(keys, values=None)` | ||
|
|
||
| In-place sort. Reorders `keys` ascending; if `values` is provided, applies the same permutation to `values` (key-value sort). Both arguments must be 1-D `field` or `ndarray`. |
There was a problem hiding this comment.
Must fix — parallel_sort does not support ndarray, only field
This line says:
Both arguments must be 1-D
fieldorndarray.
But parallel_sort delegates to the sort_stage kernel, which directly accesses keys.snode.ptr.offset and values.snode.ptr.offset (_kernels.py:237–238). ndarray (ScalarNdarray) has no snode attribute — it uses an arr handle with a different storage mechanism. Passing an ndarray will fail during tracing with an AttributeError.
The test suite (test_sort.py) exclusively uses qd.field() and never passes ndarray to parallel_sort.
Suggested fix:
Both arguments must be 1-D
field.
|
|
||
| `run(input_arr)`: | ||
|
|
||
| - `input_arr` must be a 1-D `qd.field(qd.i32, shape=(L,))` with `L <= length`. |
There was a problem hiding this comment.
Must fix — L <= length contract is incorrect; L must equal length
This line says:
input_arrmust be a 1-Dqd.field(qd.i32, shape=(L,))withL <= length.
But run() always uses the constructor's length, not the runtime array size. The relevant code path in _algorithms.py:
# line 75-76: length is fixed at construction time
def run(self, input_arr):
length = self.sorting_length
# line 90: copies `length` elements FROM input_arr
blit_from_field_to_field(self.large_arr, input_arr, 0, length)
# line 114: copies `length` elements BACK to input_arr
blit_from_field_to_field(input_arr, self.large_arr, 0, length)If input_arr has L < length elements, both blit_from_field_to_field calls read/write past the field's extent — out-of-bounds access with undefined behavior.
The constructor description at line 51 also says length is "the maximum number of elements", which reinforces the incorrect impression that shorter arrays are safe.
Suggested fix — either:
- Change the doc to state that
input_arrmust have exactlylengthelements (and update line 51 accordingly), or - Fix the implementation to use
min(self.sorting_length, input_arr.shape[0])so shorter arrays are actually supported.
…tor length contract (PR #642 review)
Summary
New user-guide page
docs/source/user_guide/algorithms.mddocumentingqd.algorithms.parallel_sortandqd.algorithms.PrefixSumExecutor.Companion to the in-flight subgroup (#639), block (#638), atomics (#640), and grid (#641) docs; this is one of a series of tier-by-tier user-guide pages. Independent — no cross-links to unmerged docs.
Includes:
PrefixSumExecutorisi32-only and CUDA + Vulkan only.parallel_sort).O(N log² N)vs radix, allocate-once / run-many for the executor, where the cost actually goes.Adds a new "Algorithms" caption to the user-guide toctree in
index.md.Test plan
cd docs && make htmlbuilds without warnings.Made with Cursor