Merged
39 commits
ceffc52
Support tuple dims in mapreduce_nd using CartesianIndices
shreyas-omkar Jun 1, 2026
e721a22
WIP: rewrite mapreduce_nd with stride arithmetic and multi-group redu…
shreyas-omkar Jun 6, 2026
368ffe3
perf: canonicalization, Val{sizes}, multi-group, input staging, corre…
shreyas-omkar Jun 9, 2026
0842d12
fix: OpenCL.jl and SPIRVIntrinsics branch renamed master to main
shreyas-omkar Jun 10, 2026
9e7cddd
fix: correct neutral element in partial reduction, add duplicate dims…
shreyas-omkar Jun 10, 2026
fad213a
perf: drop per-element division in mapreduce_nd index decode
maleadt Jun 11, 2026
8ee753a
perf: remove register staging from block reductions
maleadt Jun 11, 2026
4accf21
fix: gate multi-group reduction on GPU occupancy
maleadt Jun 11, 2026
7352ef1
refactor: dedupe destination allocation and tidy mapreduce_nd
maleadt Jun 11, 2026
113a2de
docs: note tuple dims support in reduce/mapreduce docstrings
maleadt Jun 11, 2026
378ca8c
perf: grid-stride over outputs, coalescing-aware dispatch, multigroup…
shreyas-omkar Jun 12, 2026
ad87399
fix: Reverted target blocks to 256
shreyas-omkar Jun 12, 2026
7af1f69
perf: avoid integer division in multi-segment _reduce_offset
shreyas-omkar Jun 13, 2026
f018929
fix: match Base dims semantics in mapreduce_nd
maleadt Jun 17, 2026
d0fde28
perf: reroute low-occupancy nd reductions to by-block
maleadt Jun 17, 2026
402e959
perf: constrain low-occupancy nd reduction dispatch
maleadt Jun 17, 2026
0cd10a6
fix: support dims colon in mapreduce APIs
maleadt Jun 17, 2026
5981ed1
perf: specialize multigroup output mapping
maleadt Jun 17, 2026
63637d8
feat: support multi-input mapreduce
maleadt Jun 17, 2026
24fd7b1
fix: align multi-input mapreduce semantics
maleadt Jun 17, 2026
ff281c6
fix: support CPU broadcasted mapreduce sources
maleadt Jun 17, 2026
2084ccd
test: cover type-changing mapreduce
maleadt Jun 17, 2026
152d6a8
docs: document multi-input mapreduce
maleadt Jun 17, 2026
8fa0dcd
test: cover multi-input empty dims cases
maleadt Jun 17, 2026
8e34d20
fix: derive mapreduce neutral from init
maleadt Jun 17, 2026
0c4a4f0
docs: clean mapreduce neutral wording
maleadt Jun 17, 2026
0a5f66c
perf: add tiled strided mapreduce path
maleadt Jun 17, 2026
7e45b8d
test: cover tiled strided mapreduce cases
maleadt Jun 17, 2026
d8ad558
perf: merge contiguous mapreduce output segments
maleadt Jun 17, 2026
48e708c
docs: clarify mapreduce dispatch paths
maleadt Jun 17, 2026
83ff72d
docs: fix dimensional temp examples
maleadt Jun 17, 2026
bff0ba7
fix: validate mapreduce block size
maleadt Jun 17, 2026
1b8b006
fix: reject strided gpu mapreduce dims sources
maleadt Jun 17, 2026
9151a95
fix: accept explicit backend for multi-input mapreduce
maleadt Jun 17, 2026
4bd1fc1
refactor: destructure Val dims, drop div/rem intrinsics
maleadt Jun 18, 2026
526a67f
fix: materialize broadcasts on older Julia versions.
maleadt Jun 18, 2026
2dbfd28
feat: generic fallback for non-dense GPU mapreduce sources
maleadt Jun 18, 2026
a57e94d
feat: route strided GPU mapreduce sources through the fast kernels
maleadt Jun 18, 2026
b578145
perf: keep tiled/multigroup index decode signed (avoids CUDA throw gu…
maleadt Jun 18, 2026
32 changes: 16 additions & 16 deletions src/arithmetics.jl
@@ -2,7 +2,7 @@
sum(
src::AbstractArray, backend::Backend=get_backend(src);
init=zero(eltype(src)),
- dims::Union{Nothing, Int}=nothing,
+ dims::Union{Nothing, Int, Tuple{Vararg{Int}}, Colon}=nothing,

# CPU settings
max_tasks=Threads.nthreads(),
@@ -33,11 +33,11 @@ m = MtlArray(rand(Int32(1):Int32(100), 10, 100_000))
s = AK.sum(m, dims=1)
```

- If you know the shape of the resulting array (in case of a axis-wise sum, i.e. `dims` is not
+ If you know the shape of the resulting array (in case of a dimensionwise sum, i.e. `dims` is not
`nothing`), you can provide the `temp` argument to save results into and avoid allocations:
```julia
m = MtlArray(rand(Int32(1):Int32(100), 10, 100_000))
- temp = MtlArray(zeros(Int32, 10))
+ temp = MtlArray(zeros(Int32, 10, 1))
s = AK.sum(m, dims=2, temp=temp)
```
"""
@@ -58,7 +58,7 @@ end
prod(
src::AbstractArray, backend::Backend=get_backend(src);
init=one(eltype(src)),
- dims::Union{Nothing, Int}=nothing,
+ dims::Union{Nothing, Int, Tuple{Vararg{Int}}, Colon}=nothing,

# CPU settings
max_tasks=Threads.nthreads(),
@@ -89,11 +89,11 @@ m = ROCArray(rand(Int32(1):Int32(100), 10, 100_000))
p = AK.prod(m, dims=1)
```

- If you know the shape of the resulting array (in case of a axis-wise product, i.e. `dims` is not
+ If you know the shape of the resulting array (in case of a dimensionwise product, i.e. `dims` is not
`nothing`), you can provide the `temp` argument to save results into and avoid allocations:
```julia
m = ROCArray(rand(Int32(1):Int32(100), 10, 100_000))
- temp = ROCArray(ones(Int32, 10))
+ temp = ROCArray(ones(Int32, 10, 1))
p = AK.prod(m, dims=2, temp=temp)
```
"""
@@ -114,7 +114,7 @@ end
maximum(
src::AbstractArray, backend::Backend=get_backend(src);
init=typemin(eltype(src)),
- dims::Union{Nothing, Int}=nothing,
+ dims::Union{Nothing, Int, Tuple{Vararg{Int}}, Colon}=nothing,

# CPU settings
max_tasks=Threads.nthreads(),
@@ -145,11 +145,11 @@ m = oneArray(rand(Int32(1):Int32(100), 10, 100_000))
m = AK.maximum(m, dims=1)
```

- If you know the shape of the resulting array (in case of a axis-wise maximum, i.e. `dims` is not
+ If you know the shape of the resulting array (in case of a dimensionwise maximum, i.e. `dims` is not
`nothing`), you can provide the `temp` argument to save results into and avoid allocations:
```julia
m = oneArray(rand(Int32(1):Int32(100), 10, 100_000))
- temp = oneArray(zeros(Int32, 10))
+ temp = oneArray(zeros(Int32, 10, 1))
m = AK.maximum(m, dims=2, temp=temp)
```
"""
@@ -170,7 +170,7 @@ end
minimum(
src::AbstractArray, backend::Backend=get_backend(src);
init=typemax(eltype(src)),
- dims::Union{Nothing, Int}=nothing,
+ dims::Union{Nothing, Int, Tuple{Vararg{Int}}, Colon}=nothing,

# CPU settings
max_tasks=Threads.nthreads(),
@@ -201,11 +201,11 @@ m = CuArray(rand(Int32(1):Int32(100), 10, 100_000))
m = AK.minimum(m, dims=1)
```

- If you know the shape of the resulting array (in case of a axis-wise minimum, i.e. `dims` is not
+ If you know the shape of the resulting array (in case of a dimensionwise minimum, i.e. `dims` is not
`nothing`), you can provide the `temp` argument to save results into and avoid allocations:
```julia
m = CuArray(rand(Int32(1):Int32(100), 10, 100_000))
- temp = CuArray(ones(Int32, 10))
+ temp = CuArray(ones(Int32, 10, 1))
m = AK.minimum(m, dims=2, temp=temp)
```
"""
@@ -226,7 +226,7 @@ end
count(
[f=identity], src::AbstractArray, backend::Backend=get_backend(src);
init=0,
- dims::Union{Nothing, Int}=nothing,
+ dims::Union{Nothing, Int, Tuple{Vararg{Int}}, Colon}=nothing,

# CPU settings
max_tasks=Threads.nthreads(),
@@ -263,12 +263,12 @@ m = MtlArray(rand(Bool, 10, 100_000))
c = AK.count(m, dims=1)
```

- If you know the shape of the resulting array (in case of a axis-wise count, i.e. `dims` is not
+ If you know the shape of the resulting array (in case of a dimensionwise count, i.e. `dims` is not
`nothing`), you can provide the `temp` argument to save results into and avoid allocations:
```julia
m = MtlArray(rand(Bool, 10, 100_000))
- temp = MtlArray(zeros(Int32, 10))
- c = AK.count(m, dims=2, temp=temp)
+ temp = MtlArray(zeros(Int32, 10, 1))
+ c = AK.count(m; init=Int32(0), dims=2, temp=temp)
```
"""
function count(
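The `temp` examples in this file change from a length-10 vector to a 10×1 matrix, which fits the commit titled "fix: match Base dims semantics in mapreduce_nd": in Base, a reduction over `dims` keeps each reduced dimension with length 1. The following is a minimal CPU-only sketch using plain Base functions, not the AK API. That AK produces exactly these shapes for every accepted form of `dims` is an assumption drawn from the commit titles and the updated docstrings.

```julia
# Base-only illustration; no GPU backend or AcceleratedKernels call is involved.
m = rand(Int32(1):Int32(100), 10, 1_000)

# Reducing along dimension 2 keeps that dimension with length 1.
s2 = sum(m, dims=2)
@assert size(s2) == (10, 1)

# A destination of the same shape can be filled in place with Base.sum!.
temp = zeros(Int32, 10, 1)
sum!(temp, m)
@assert temp == s2

# Tuple dims reduce several dimensions at once; each reduced dimension becomes 1.
a = rand(Float32, 4, 5, 6)
@assert size(sum(a, dims=(1, 3))) == (1, 5, 1)

# dims=: reduces over everything and returns a scalar.
@assert sum(a, dims=:) isa Float32
```

Under this reading, a preallocated destination for `dims=2` on a 10×N source needs shape `(10, 1)` rather than `(10,)`, which is what the updated docstring examples show.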
6 changes: 5 additions & 1 deletion src/reduce/mapreduce_1d_cpu.jl
@@ -1,5 +1,5 @@
function mapreduce_1d_cpu(
- f, op, src::AbstractArray, backend::Backend;
+ f, op, src::MapReduceSource, backend::Backend;
init,
neutral,

@@ -12,6 +12,10 @@ function mapreduce_1d_cpu(
temp::Union{Nothing, AbstractArray},
switch_below::Int,
)
+ if src isa Base.Broadcast.Broadcasted
+ return op(init, Base.mapreduce(f, op, src; init=neutral))
+ end
+
if max_tasks == 1
return op(init, Base.mapreduce(f, op, src; init=neutral))
end
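The new early return sends lazy `Broadcasted` sources straight to `Base.mapreduce`, seeding the reduction with `neutral` and folding `init` in once at the end. A minimal sketch of that expression on the CPU follows; the variable values are illustrative only, and it assumes a recent Julia version on which `Base.mapreduce` accepts an instantiated `Broadcasted` object.

```julia
using Base.Broadcast: broadcasted, instantiate

x = [1, 2, 3]
y = [4, 5, 6]

# Lazy elementwise product; instantiate computes the axes so it can be iterated.
bc = instantiate(broadcasted(*, x, y))

# Names mirror the diff: `neutral` seeds the reduction, `init` is applied once afterwards.
f, op = identity, +
init, neutral = 10, 0

result = op(init, Base.mapreduce(f, op, bc; init=neutral))
@assert result == 10 + (1*4 + 2*5 + 3*6)
```

No intermediate array holding `x .* y` is materialized here; the products are computed as the reduction iterates over `bc`.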
12 changes: 8 additions & 4 deletions src/reduce/mapreduce_1d_gpu.jl
@@ -47,7 +47,7 @@ end


function mapreduce_1d_gpu(
- f, op, src::AbstractArray, backend::Backend;
+ f, op, src::MapReduceSource, backend::Backend;
init,
neutral,

@@ -61,12 +61,13 @@ function mapreduce_1d_gpu(
switch_below::Int,
)
@argcheck 1 <= block_size <= 1024
+ @argcheck ispow2(block_size)
@argcheck switch_below >= 0

# Degenerate cases
len = length(src)
len == 0 && return init
- len == 1 && return @allowscalar f(src[1])
+ len == 1 && return op(init, @allowscalar f(src[1]))
if len < switch_below
h_src = Vector(src)
return Base.mapreduce(f, op, h_src; init)
@@ -87,8 +88,8 @@ function mapreduce_1d_gpu(
dst = KernelAbstractions.allocate(backend, dst_type, blocks * 2)
end

- # Later the kernel will be compiled for views anyways, so use same types
- src_view = @view src[1:end]
+ # Later the kernel will be compiled for views anyways, so use same types for arrays.
+ src_view = _mapreduce_1d_src_view(src)
dst_view = @view dst[1:blocks]

kernel! = _mapreduce_block!(backend, block_size)
@@ -125,3 +126,6 @@ function mapreduce_1d_gpu(
# The GPU kernel reduced all elements to one, but without the init value
return op(init, @allowscalar(p1[1]))
end
+
+ _mapreduce_1d_src_view(src::AbstractArray) = @view src[1:end]
+ _mapreduce_1d_src_view(src::Base.Broadcast.Broadcasted) = src
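Two of the changes in this file can be checked on the CPU with Base alone: the single-element case now folds `init` into the result, and the source is wrapped in a view only when it is an array. The sketch below uses a hypothetical stand-in named `src_view` for the helper in the diff, and compares against `Base.mapreduce` as the reference behaviour; it does not call the GPU code path.

```julia
using Base.Broadcast: Broadcasted, broadcasted, instantiate

# Hypothetical stand-in for the helper in the diff:
# arrays become views, lazy broadcasts pass through unchanged.
src_view(src::AbstractArray) = @view src[1:end]
src_view(src::Broadcasted) = src

v = [1.0, 2.0, 3.0]
@assert src_view(v) isa SubArray

bc = instantiate(broadcasted(+, v, 1.0))
@assert src_view(bc) === bc

# Single-element case: folding init in matches what Base returns.
f, op, init = x -> x^2, +, 10
src = [3]
@assert op(init, f(src[1])) == Base.mapreduce(f, op, src; init=init)

# The previous return value, f(src[1]), dropped init and so disagreed with Base.
@assert f(src[1]) != Base.mapreduce(f, op, src; init=init)
```

Dispatching on the source type keeps the kernel call site identical for both kinds of input, at the cost of one extra helper function.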