Expand dimensional mapreduce / reduce #83
@maleadt and @christiangnrd, please take a look.
This seems to crash on
Also, please keep the LLM attribution.
9997d93 to 3f9e269
Currently 2-4× slower than PyTorch, and beating or matching GPUArrays' own kernel on most cases. The implementation has dimension canonicalization,
Warp shuffle reduction: PyTorch replaces the last 5 levels of the shared memory tree with __shfl_down_sync, eliminating 5 @synchronize calls and shared memory writes. This alone is probably worth ~1.5× on the block reduction.
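The "last 5 levels" figure follows from a 32-lane warp: a tree reduction over 32 values needs log2(32) = 5 halving steps. The following is a CPU-only simulation of that shuffle-down pattern, not GPU code; `simulated_shfl_down_reduce` is a hypothetical helper written for this illustration.

```julia
# CPU simulation of a shuffle-down tree reduction over one warp's lanes.
# `simulated_shfl_down_reduce` is a hypothetical name, not part of any library.
function simulated_shfl_down_reduce(op, lanes::AbstractVector)
    vals = collect(lanes)
    width = length(vals)
    ispow2(width) || throw(ArgumentError("lane count must be a power of two"))
    levels = 0
    offset = width >> 1
    while offset > 0
        old = copy(vals)  # every lane reads values from before this level
        for lane in 1:width
            src = lane + offset
            # Like __shfl_down_sync, an out-of-range source yields the lane's own value.
            other = src <= width ? old[src] : old[lane]
            vals[lane] = op(old[lane], other)
        end
        offset >>= 1
        levels += 1
    end
    return vals[1], levels  # lane 1 ends up holding the full reduction
end

total, levels = simulated_shfl_down_reduce(+, 1:32)
```

Each level here would correspond to one shared-memory write plus one barrier in a tree that does not use shuffles.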
3f9e269 to 93f3a8d
Nit: I don't see you detecting duplicate dims (e.g.
Performance-wise, I think you should also compare against the current AK.jl implementation; this work shouldn't significantly regress that. Testing locally with Metal.jl:
I'm guessing this is because of still doing
AFAIU you're also specializing on the exact runtime array sizes (Val(outer_sizes_tup) / Val(reduce_sizes_tup)). That's not a viable path. PyTorch's OffsetCalculator exists exactly to avoid this.
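To illustrate the concern about `Val`-specializing on sizes: a value lifted into `Val` becomes part of the type, so every distinct size tuple is a distinct type and triggers a separate compilation of anything dispatched on it. A minimal sketch with hypothetical stand-in functions (`setup_val` and `setup_runtime` are not names from the PR):

```julia
# Hypothetical stand-ins for a kernel setup step; only the dispatch behaviour matters.
setup_val(::Val{sizes}) where {sizes} = prod(sizes)        # sizes live in the type
setup_runtime(sizes::NTuple{N,Int}) where {N} = prod(sizes) # sizes are plain values

# Two different shapes produce two different argument types under Val...
val_types_match = typeof(Val((4, 4))) == typeof(Val((4, 5)))
# ...but the same argument type when passed as a runtime tuple.
tuple_types_match = typeof((4, 4)) == typeof((4, 5))
```

With the runtime tuple, one compiled specialization can serve every 2-D shape; with `Val`, each new shape would compile again.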
93f3a8d to 38ef5c5
Pushed some optimizations that improve performance significantly on my M1 (purposefully chosen as an old, weak GPU). Core fixes:
Performance, in microseconds:
3194f0d to b65af66
With GPU: NVIDIA GeForce RTX 3060
The per-element offset decode for non-contiguous reduced dim sets (e.g.
For power-of-two segment sizes (the common case), replace div/rem with
Results
The single reduce/outer segment that canonicalization produces in the common case needs no division: the index is in-range, so the offset is just j*stride. Restores the multiply-add inner loop the divmod decode had regressed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
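A sketch of the two decode shapes this commit message contrasts, assuming zero-based linear indices and per-segment `(extent, stride)` pairs. The function names are hypothetical and the PR's kernels may be organised differently.

```julia
# General multi-segment decode: peel one coordinate per segment with divrem.
function decode_offset(j::Int, extents::NTuple{N,Int}, strides::NTuple{N,Int}) where {N}
    offset = 0
    for d in 1:N
        j, r = divrem(j, extents[d])  # r is the zero-based coordinate in segment d
        offset += r * strides[d]
    end
    return offset
end

# Single segment: j is already in range, so the offset is one multiply.
single_segment_offset(j::Int, stride::Int) = j * stride

# Check against Base on a dense 4×5×6 array with strides (1, 4, 20).
A = reshape(collect(1:120), 4, 5, 6)
flat = vec(A)
# dims=(1, 3) leaves two non-adjacent reduce segments: extents (4, 6), strides (1, 20).
sums_13 = [sum(flat[k * 4 + decode_offset(j, (4, 6), (1, 20)) + 1] for j in 0:23) for k in 0:4]
```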
The 4-way @unroll hurt coalesced stride-1 reductions and gave only noisy, non-reproducible gains on strided ones (in-process A/B on Metal). Revert to the plain strided while-loop, matching the pre-existing 1D reduction. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The old gate fired multi-group whenever dst_size >= block_size, exploding normal reductions into thousands of blocks + a second pass (e.g. 1000x1000 dims=1). Only split when there are too few outputs to fill the GPU and the reduction is large; cap reduce_groups at block_size so the second pass is single-level, dropping the recursive fallback. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Extract the repeated alloc-or-validate-temp block into _alloc_or_temp (also adds the backend check the zero-dim case was missing), flatten the GPU path behind an early return, drop the unused CPU `dims` argument, and rewrite the stale header/section comments to match the implementation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
9bb4f68 to 4bd1fc1
Did a bunch of work yesterday, and managed to improve performance across the board some more, as verified on M1/M3/5080/Iris Xe:
I also added support for multi-input mapreduce (
A bit of analysis by Codex:
The stride-based fast-path kernels index the source by a flat linear offset (`src[offset + 1]`), which is only valid for dense column-major arrays. Until now any other GPU source was rejected with an ArgumentError, and a Broadcasted over a non-trivial wrapper (e.g. PermutedDimsArray) failed to compile at all because `@Const(src)` keeps the wrapper's bounds-throw from being elided.
Replace the hard dense-stride rejection with a route to a new generic kernel `_mapreduce_nd_generic!`: one thread per output, reducing over the reduced extents via Cartesian indexing (`src[J]`, `J = max(Iother, Ireduce)`), mirroring the CPU `_mapreduce_nd_cpu_sections!` path. It makes no layout assumption and deliberately does not wrap the source in `@Const`, so it compiles for strided views, adjoints, permuted dims, and broadcasts over them.
Dense arrays keep the fast paths via the `_mapreduce_fastpath_dense` predicate; all Broadcasted sources currently take the fallback, which also removes the synthetic `_mapreduce_strides(::Broadcasted)` values that previously misled the coalescing/tiled dispatch heuristics.
Tests that asserted rejection of strided GPU sources now assert correct results (strided view and PermutedDimsArray). Verified on Metal (non-dense now correct, dense fast paths unchanged) and the CPU suite.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
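The `J = max(Iother, Ireduce)` indexing trick can be sketched on the CPU using only Base. This is an illustration under assumptions (1-based axes, a serial loop in place of one GPU thread per output) and `mapreduce_cartesian_sketch` is a hypothetical name, not the PR's `_mapreduce_nd_generic!`.

```julia
# Serial sketch: the outer loop plays the role of "one thread per output".
function mapreduce_cartesian_sketch(f, op, src::AbstractArray, init; dims)
    N = ndims(src)
    reduced = ntuple(d -> d in dims, N)
    keep_size   = ntuple(d -> reduced[d] ? 1 : size(src, d), N)  # 1 on reduced dims
    reduce_size = ntuple(d -> reduced[d] ? size(src, d) : 1, N)  # 1 on kept dims
    dst = fill(init, keep_size)
    for Iother in CartesianIndices(keep_size)
        acc = init
        for Ireduce in CartesianIndices(reduce_size)
            # Each index is 1 wherever the other one varies, so the elementwise
            # max merges them into the full source index.
            acc = op(acc, f(src[max(Iother, Ireduce)]))
        end
        dst[Iother] = acc
    end
    return dst
end

A = reshape(collect(1:24), 2, 3, 4)
P = PermutedDimsArray(A, (3, 1, 2))  # a wrapper with no dense linear layout
result = mapreduce_cartesian_sketch(abs2, +, P, 0; dims=(1, 3))
```

Because the source is only ever indexed as `src[J]`, the sketch makes no assumption about memory layout, which is the property the commit message relies on.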
Extend the stride-based fast paths to any source backed by a single dense column-major buffer (strided views, adjoints, permuted dims, and reshapes) instead of sending them to the generic Cartesian fallback. The kernels already do stride arithmetic; they were only limited by indexing the wrapper logically (`src[offset+1]`), which assumed offset == linear index (dense only).
`_mapreduce_strided_layout` resolves a source to `(buffer, base_offset, strides)`: the dense buffer to index, the wrapper's element offset within it, and the real strides. The kernels now index `buffer[base_offset + Σ coordᵈ·strideᵈ + 1]`, with `base_offset` folded into the per-output base so dense reductions pay at most one extra add per output (never per reduced element). Passing the dense parent buffer rather than the wrapper also keeps `@Const` valid, so this compiles for wrappers (e.g. PermutedDimsArray) that the wrapper-as-`@Const` path cannot.
The coalescing and tiled-strided dispatch heuristics now see the true strides instead of the fabricated dense strides a Broadcasted source used to supply. Only `Broadcasted` and sources not backed by a dense buffer (lazy/computed arrays, nested wrappers, complex adjoints without `strides`) take the generic fallback.
Verified vs Base on CPU (Pkg.test), Metal (M1/M3), and CUDA, covering strided views with nonzero base offset, adjoints, permuted dims over the by_thread / by_block / multigroup paths, and multi-input. No measurable dense-path regression (CUDA/Metal A/B within run-to-run noise); strided sources that previously hit the generic fallback are now ~8x faster on CUDA (e.g. strided-view dims=2 0.17→0.02 ms).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
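The `(buffer, base_offset, strides)` idea can be sketched on the CPU with Base alone. `strided_layout` below is a hypothetical, much narrower stand-in for `_mapreduce_strided_layout`: it assumes the parent is a plain `Array` and handles only a direct view or a `PermutedDimsArray` of it.

```julia
# Hypothetical stand-in: resolve a source to (dense buffer, element offset, real strides).
strided_layout(A::Array) = (A, 0, strides(A))
strided_layout(p::PermutedDimsArray{<:Any,<:Any,<:Any,<:Any,<:Array}) =
    (parent(p), 0, strides(p))
function strided_layout(v::SubArray{<:Any,<:Any,<:Array})
    buffer = parent(v)
    first_idx = map(first, parentindices(v))
    base_offset = sum((first_idx .- 1) .* strides(buffer))  # zero-based offset of v[1, 1, ...]
    return (buffer, base_offset, strides(v))
end

# Sum over dims=2 by indexing the flat buffer, as the fast-path kernels are described to do.
function sum_dims2_flat(src::AbstractMatrix)
    buffer, base_offset, (s1, s2) = strided_layout(src)
    dst = zeros(eltype(src), size(src, 1), 1)
    for i in 1:size(src, 1)
        base = base_offset + (i - 1) * s1     # base_offset folded in once per output
        acc = zero(eltype(src))
        for j in 0:size(src, 2)-1
            acc += buffer[base + j * s2 + 1]  # flat 1-based index into the dense buffer
        end
        dst[i, 1] = acc
    end
    return dst
end

A = reshape(collect(1:60), 5, 12)
v = view(A, 2:4, 3:2:11)           # nonzero base offset, column stride 10
p = PermutedDimsArray(A, (2, 1))   # strides (5, 1)
```

For `v`, the first element is `A[2, 3]`, which sits 11 elements into the buffer, so the inner loop only ever adds `j * s2` to a per-output base.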
…ard)
The earlier "destructure Val dims, drop div/rem intrinsics" cleanup changed the tiled-strided and multigroup row/lane and output/group decode to `unsigned(x) ÷ unsigned(const)`. That regressed square `dims=2` (tiled) by ~15-20% and wide `dims=2` (multigroup) by ~4% on CUDA only.
Root cause is signedness, not the divide. The divide-by-constant folds to shift/and on every backend (no div/rem instruction, no divide-by-zero check). But the unsigned *result* flows into the signed index arithmetic (`iout = iblock*rows + row`, …), forcing a checked `Int(::Unsigned)` conversion and a cold `throw(InexactError)` guard. GPUCompiler's PTX backend (ptx.jl `lower_unreachable!`) keeps that guard in the kernel (the compare plus a `call julia_throw_inexacterror`), only turning the final `unreachable` into a trap; the guard then sits in the latency-bound hot path. The Metal backend (metal.jl `lower_unreachable_control_flow!`) force-inlines the throwing function and rewrites unreachable→ret, scrubbing it, so Metal/oneAPI never regressed.
Fix: wrap the constant-divisor div/rem in `Int(...)` so the result stays signed. The divide still folds to shift/and, and the signed result avoids the Int-conversion guard entirely. PTX for the tiled kernel goes 157→134 lines, 3→0 inexact-throw sites.
Confirmed on an RTX 5080 vs tb/mapreduce_wip (the MAREDUCE reference): square dims=2 0.0207→0.0178 ms, 512² 0.0184→0.0167, wide dims=2 0.0281→0.0280; all dense shapes now within run-to-run noise of WIP. CPU/Metal/CUDA correctness re-verified.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
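A small CPU sketch of the two facts this message leans on: for non-negative indices, a power-of-two constant divisor gives the same result as shift/and, and converting an unsigned value back to `Int` is a checked operation that can throw. `LANES` and the `decode_*` names are hypothetical; the kernels' actual constants and code are not shown in this thread.

```julia
const LANES = 32                     # hypothetical power-of-two constant divisor
const SHIFT = trailing_zeros(LANES)  # log2(LANES)

# Signed decode: results stay Int, so no conversion is needed afterwards.
decode_signed(i::Int) = (i ÷ LANES, i % LANES)
# The shift/and form the divide is expected to fold to (valid for i >= 0).
decode_shift(i::Int) = (i >> SHIFT, i & (LANES - 1))
# Unsigned decode: results are UInt and must be converted before signed index math.
decode_unsigned(i::Int) = (unsigned(i) ÷ unsigned(LANES), unsigned(i) % unsigned(LANES))

# Int(::Unsigned) is checked: a value that does not fit throws InexactError.
conversion_throws = try
    Int(typemax(UInt))
    false
catch err
    err isa InexactError
end
```

The values agree across all three variants; only the result type differs, and that type is what decides whether a checked conversion is needed downstream.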
mapreduce / reduce
This PR extends `AcceleratedKernels.mapreduce` and `reduce` from single-dimension reductions to a more Base-compatible dimensional reduction implementation. It adds tuple `dims`, `dims=:`, `dims=()`, oversized and duplicate dims handling, type-changing `mapreduce`, and multi-input `mapreduce(f, op, A, B, ...; dims=...)`.
Notable API changes include:
- `dims` is now accepted as `Union{Nothing, Int, Tuple{Vararg{Int}}, Colon}` across `reduce`, `mapreduce`, and the arithmetic wrappers.
- Dimensional reductions now follow Base behavior for tuple dims, duplicate dims, dims beyond `ndims`, empty dims, colon dims, and zero-sized reduced or kept dimensions.
- Multi-input `mapreduce` is also supported, including an explicit backend argument after the input arrays.
- `neutral` is now derived from `typeof(init)`, preserving the documented result-type behavior for type-changing reductions.
Implementation
The original CartesianIndices-only GPU approach was correct but too expensive for common reductions because it moved index decoding and integer division into hot loops. The final implementation keeps Cartesian indexing as a generic fallback, but uses stride-based fast paths for dense and strided GPU sources backed by a single dense column-major buffer.
For fast-path sources, `mapreduce_nd` resolves the layout to `(buffer, base_offset, strides)`. This lets views, offset views, adjoints, permuted dims, and reshapes use optimized flat-buffer kernels while still indexing the right underlying storage. Broadcasted, lazy, or otherwise non-dense sources use the generic fallback, which mirrors the CPU Cartesian path.
The kernel setup canonicalizes reduced and kept dimensions into contiguous stride segments. Common cases such as `dims=2` or `dims=(1, 2)` collapse to a single reduce segment and avoid per-element division entirely. Non-contiguous reductions such as `dims=(1, 3)` still use multi-segment decoding, with power-of-two segment sizes optimized via shift/mask operations.
GPU reductions dispatch between one-thread-per-output, tiled strided, one-block-per-output, and multigroup strategies. The heuristics were tuned to avoid the earlier regressions from overusing multigroup reductions while preserving occupancy and coalescing-sensitive cases.
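The canonicalization step described above can be sketched as a merge of adjacent dimensions that share the same reduced/kept status and are contiguous in memory. This is a sketch under assumptions: `canonicalize_segments` and its `(extent, stride, is_reduced)` tuple layout are hypothetical, not the PR's internal representation.

```julia
# Merge adjacent dims with the same reduced/kept status when the next dim's
# stride equals the current segment's stride times its extent.
function canonicalize_segments(sizes, strides, reduced)
    segs = Tuple{Int,Int,Bool}[]          # (extent, stride, is_reduced)
    for d in eachindex(sizes)
        sizes[d] == 1 && continue         # size-1 dims contribute nothing to the offset
        if !isempty(segs)
            ext, st, red = segs[end]
            if red == reduced[d] && strides[d] == st * ext
                segs[end] = (ext * sizes[d], st, red)  # extend the current segment
                continue
            end
        end
        push!(segs, (sizes[d], strides[d], reduced[d]))
    end
    return segs
end

# Dense 4×5×6 array: strides (1, 4, 20).
segs_12 = canonicalize_segments((4, 5, 6), (1, 4, 20), (true, true, false))  # dims=(1, 2)
segs_2  = canonicalize_segments((4, 5, 6), (1, 4, 20), (false, true, false)) # dims=2
segs_13 = canonicalize_segments((4, 5, 6), (1, 4, 20), (true, false, true))  # dims=(1, 3)
```

In this sketch `dims=(1, 2)` and `dims=2` each leave exactly one reduce segment, while `dims=(1, 3)` leaves two, which matches the cases the text singles out as needing multi-segment decoding.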
Performance
Several performance fixes were made: fast paths avoid hot-loop Cartesian indexing, common single-segment reductions avoid per-element division, multigroup reductions are gated to low-output large-reduction cases, `by_block` can grid-stride over outputs, and strided GPU sources now route through fast kernels instead of falling back unnecessarily.
The current implementation is broadly competitive with, and often faster than, the previous AK implementation and GPUArrays-style reductions across CUDA, Metal, and oneAPI. Remaining gaps are mostly backend-specialization opportunities such as subgroup collectives, vectorized loads/stores, small fixed-extent output-vectorized kernels, hardware-aware launch sizing, and selected vendor primitive dispatch.