Default to SharedStorage#822

Open
christiangnrd wants to merge 6 commits into main from shared

Conversation

@christiangnrd

Member

No description provided.

@github-actions github-actions Bot left a comment

Contributor

Metal Benchmarks

Benchmark suite Current: 97ae406 Previous: a637d21 Ratio
array/accumulate/Float32/1d 825208 ns 810709 ns 1.02
array/accumulate/Float32/dims=1 1036396 ns 1044104.5 ns 0.99
array/accumulate/Float32/dims=1L 10566833 ns 10212459 ns 1.03
array/accumulate/Float32/dims=2 1316958 ns 1321041 ns 1.00
array/accumulate/Float32/dims=2L 5597250 ns 5482166.5 ns 1.02
array/accumulate/Int64/1d 1005833 ns 971167 ns 1.04
array/accumulate/Int64/dims=1 1159375 ns 1195417 ns 0.97
array/accumulate/Int64/dims=1L 12571166.5 ns 12106667 ns 1.04
array/accumulate/Int64/dims=2 1523833 ns 1504958 ns 1.01
array/accumulate/Int64/dims=2L 9164583 ns 9371250 ns 0.98
array/broadcast 348395.5 ns 386166.5 ns 0.90
array/construct 5666 ns 5666 ns 1
array/permutedims/2d 662458.5 ns 629125 ns 1.05
array/permutedims/3d 1138458 ns 1127041.5 ns 1.01
array/permutedims/4d 1511375 ns 2002334 ns 0.75
array/private/copy 424000 ns 449291.5 ns 0.94
array/private/copyto!/cpu_to_gpu 365125 ns 364729 ns 1.00
array/private/copyto!/gpu_to_cpu 348333 ns 382334 ns 0.91
array/private/copyto!/gpu_to_gpu 325333 ns 358667 ns 0.91
array/private/iteration/findall/bool 1163250 ns 1093437.5 ns 1.06
array/private/iteration/findall/int 1319812.5 ns 1266208 ns 1.04
array/private/iteration/findfirst/bool 1404750 ns 1488416 ns 0.94
array/private/iteration/findfirst/int 1469500 ns 1458833 ns 1.01
array/private/iteration/findmin/1d 1580333.5 ns 1602833 ns 0.99
array/private/iteration/findmin/2d 1302291.5 ns 1331208.5 ns 0.98
array/private/iteration/logical 1912958 ns 1774646 ns 1.08
array/private/iteration/scalar 1853208 ns 2913459 ns 0.64
array/random/rand/Float32 643792 ns 659333 ns 0.98
array/random/rand/Int64 692375 ns 734333 ns 0.94
array/random/rand!/Float32 545875 ns 601375 ns 0.91
array/random/rand!/Int64 498167 ns 524458 ns 0.95
array/random/randn/Float32 599333 ns 621083.5 ns 0.96
array/random/randn!/Float32 501250 ns 550583 ns 0.91
array/reductions/mapreduce/Float32/1d 338666 ns 506042 ns 0.67
array/reductions/mapreduce/Float32/dims=1 481916 ns 523958 ns 0.92
array/reductions/mapreduce/Float32/dims=1L 745708 ns 767916.5 ns 0.97
array/reductions/mapreduce/Float32/dims=2 486167 ns 535625 ns 0.91
array/reductions/mapreduce/Float32/dims=2L 1120333 ns 1366000 ns 0.82
array/reductions/mapreduce/Int64/1d 615334 ns 979500 ns 0.63
array/reductions/mapreduce/Int64/dims=1 781708 ns 854459 ns 0.91
array/reductions/mapreduce/Int64/dims=1L 1305208 ns 1441520.5 ns 0.91
array/reductions/mapreduce/Int64/dims=2 959854 ns 975313 ns 0.98
array/reductions/mapreduce/Int64/dims=2L 2242042 ns 2238375 ns 1.00
array/reductions/reduce/Float32/1d 336750 ns 502209 ns 0.67
array/reductions/reduce/Float32/dims=1 476584 ns 524750 ns 0.91
array/reductions/reduce/Float32/dims=1L 744896 ns 782270.5 ns 0.95
array/reductions/reduce/Float32/dims=2 482666 ns 477083 ns 1.01
array/reductions/reduce/Float32/dims=2L 1121833 ns 1357459 ns 0.83
array/reductions/reduce/Int64/1d 626687.5 ns 965959 ns 0.65
array/reductions/reduce/Int64/dims=1 793729 ns 814250 ns 0.97
array/reductions/reduce/Int64/dims=1L 1307729.5 ns 1609583 ns 0.81
array/reductions/reduce/Int64/dims=2 966000 ns 972875 ns 0.99
array/reductions/reduce/Int64/dims=2L 2225292 ns 2215667 ns 1.00
array/shared/copy 188333 ns 244354 ns 0.77
array/shared/copyto!/cpu_to_gpu 40500 ns 39917 ns 1.01
array/shared/copyto!/gpu_to_cpu 40125 ns 40500 ns 0.99
array/shared/copyto!/gpu_to_gpu 40958 ns 40834 ns 1.00
array/shared/iteration/findall/bool 1173625 ns 1102958 ns 1.06
array/shared/iteration/findall/int 1322708 ns 1300875 ns 1.02
array/shared/iteration/findfirst/bool 1122375 ns 1196875 ns 0.94
array/shared/iteration/findfirst/int 1206187.5 ns 1225417 ns 0.98
array/shared/iteration/findmin/1d 1342354 ns 1347667 ns 1.00
array/shared/iteration/findmin/2d 1303041.5 ns 1337708 ns 0.97
array/shared/iteration/logical 1770375 ns 1627500 ns 1.09
array/shared/iteration/scalar 6158.4 ns 5777.833333333333 ns 1.07
integration/byval/reference 1182417 ns 1169916 ns 1.01
integration/byval/slices=1 1184166 ns 1170583 ns 1.01
integration/byval/slices=2 2128666 ns 2096042 ns 1.02
integration/byval/slices=3 18881854 ns 8003333 ns 2.36
integration/metaldevrt 474042 ns 502500 ns 0.94
kernel/indexing 352625 ns 379084 ns 0.93
kernel/indexing_checked 510875 ns 563583 ns 0.91
kernel/launch 13458 ns 13375 ns 1.01
kernel/rand 528500 ns 576083.5 ns 0.92
latency/import 1521048500 ns 1500519500 ns 1.01
latency/precompile 32325091520.5 ns 32065662458 ns 1.01
latency/ttfp 1836443584 ns 1812250688 ns 1.01
metal/synchronization/context 717.8137931034482 ns 685.126582278481 ns 1.05
metal/synchronization/stream 480.015306122449 ns 459.81218274111677 ns 1.04

This comment was automatically generated by workflow using github-action-benchmark.

Comment thread src/array.jl Outdated
@codecov

codecov Bot commented Jun 9, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.36%. Comparing base (a637d21) to head (97ae406).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #822      +/-   ##
==========================================
- Coverage   83.42%   83.36%   -0.06%     
==========================================
  Files          67       67              
  Lines        3669     3668       -1     
==========================================
- Hits         3061     3058       -3     
- Misses        608      610       +2     


@christiangnrd christiangnrd force-pushed the shared branch 2 times, most recently from db53574 to fdf7a32 Compare June 16, 2026 22:49
@christiangnrd

Member Author

Could the GitHub Actions failure be related to the compilation failures?

@maleadt

maleadt commented Jun 19, 2026

Member

I'm not sure how. There's a known issue with ObjC errors leaking outside of their retain/release scope and crashing during error reporting, but that's read-only and shouldn't cause a crash in LLVM.

@maleadt

maleadt commented Jun 19, 2026

Member

I guess the remaining question is semantics. Do we want to allow scalar iteration on shared memory so that it can be used with CPU code? It'll never be as fast as Array, but if we port the dirty memory flag tracking from CUDA.jl we can get it down to a couple of ns (as opposed to ~1ns for an Array getindex or so). If we want to add back scalar iteration checking, we add another couple of ns for the TLS check on every access.
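
To make the dirty-flag idea concrete, here is a minimal pure-Julia sketch. It is not the CUDA.jl or Metal.jl implementation: `TrackedBuffer`, `launch!`, and `synchronize!` are hypothetical names, the "GPU work" runs eagerly on the CPU, and `syncs` exists only to count how often the expensive path is taken. The point is that a CPU-side read synchronizes only while GPU work may still be outstanding.

```julia
# Hypothetical sketch: none of these names are Metal.jl or CUDA.jl API.
mutable struct TrackedBuffer{T}
    data::Vector{T}   # stands in for memory visible to both CPU and GPU
    dirty::Bool       # true while queued GPU work may still touch `data`
    syncs::Int        # counts synchronizations, for illustration only
end

TrackedBuffer(data::Vector{T}) where {T} = TrackedBuffer{T}(data, false, 0)

# Pretend to enqueue a GPU kernel that writes to the buffer.
function launch!(f, buf::TrackedBuffer)
    buf.data .= f.(buf.data)   # runs eagerly here; asynchronous on a real GPU
    buf.dirty = true           # CPU reads must now wait for the device
    return buf
end

# Stand-in for waiting on the device: the expensive step we want to skip.
function synchronize!(buf::TrackedBuffer)
    buf.syncs += 1
    buf.dirty = false
    return nothing
end

# CPU-side scalar read: synchronize only if GPU work may be outstanding.
function Base.getindex(buf::TrackedBuffer, i::Integer)
    buf.dirty && synchronize!(buf)
    return buf.data[i]
end

buf = TrackedBuffer([1, 2, 3])
launch!(x -> 2x, buf)
buf[1]; buf[2]; buf[3]   # only the first read takes the synchronize! path
```

In this sketch the common case (no outstanding work) costs a single branch on `dirty`, which is the kind of per-access overhead being discussed here.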

@christiangnrd

Member Author

i.e. unsafe_wrap(Array, ... wouldn't be needed?

@maleadt

maleadt commented Jun 19, 2026

Member

Right, that's the current design of CUDA.jl:

Precompiling CUDA finished.
  12 dependencies successfully precompiled in 43 seconds. 90 already precompiled.

julia> CUDA.allowscalar(false)

julia> a = cu([1])
1-element CuArray{Int64, 1, CUDACore.DeviceMemory}:
 1

julia> a[]
ERROR: Scalar indexing is disallowed.

julia> b = cu([1]; unified=true)
1-element CuArray{Int64, 1, CUDACore.UnifiedMemory}:
 1

julia> b[]
1

I'm not entirely convinced this is the best option though. It makes sense, but people often use allowscalar for detecting GPU code. We could tell them they have to use private memory for that, but it still makes allowscalar(false) a lie.

@christiangnrd

Member Author

This is also how shared storage currently works

@christiangnrd

Member Author

But without the dirty flag, so there might be some latent race conditions with shared MtlArrays.

@maleadt

maleadt commented Jun 19, 2026

Member

> This is also how shared storage currently works

Right, but it never was the default. It would break users who call allowscalar(false) to detect GPU functionality executing on the CPU.

And we definitely need the dirty flag to improve performance here. But that can be follow-up work.

@christiangnrd

Member Author

What if we add a @warn when they call allowscalar the first time if default storage mode is shared?

@maleadt

maleadt commented Jun 19, 2026

Member

I guess that could work, but the interface is not extensible like that right now.

EDIT: I guess we could implement Metal.allowscalar, have it warn, and then call GPUArrays.allowscalar.
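
A rough sketch of that wrapper, under stated assumptions: `gpuarrays_allowscalar` is a local stand-in for `GPUArrays.allowscalar` so the snippet runs without any packages, `metal_allowscalar` plays the role of the proposed `Metal.allowscalar`, and `DEFAULT_STORAGE` is a hypothetical setting (how the default storage mode would actually be queried is not specified in this thread). Warning only on the first call that passes `false` is also an assumption.

```julia
# Stand-in for GPUArrays.allowscalar so the sketch runs without packages.
const SCALAR_ALLOWED = Ref(true)
gpuarrays_allowscalar(flag::Bool=true) = (SCALAR_ALLOWED[] = flag; nothing)

# Hypothetical setting: how the default storage mode is queried is an assumption.
const DEFAULT_STORAGE = Ref(:shared)
const WARNED = Ref(false)

# Possible shape of a Metal-specific allowscalar: warn once, then forward.
function metal_allowscalar(flag::Bool=true)
    if !flag && DEFAULT_STORAGE[] === :shared && !WARNED[]
        @warn "Default storage is shared: scalar indexing on shared arrays is not blocked by allowscalar(false)."
        WARNED[] = true
    end
    return gpuarrays_allowscalar(flag)
end

metal_allowscalar(false)   # emits the warning
metal_allowscalar(false)   # silent: the warning is only shown once
```

Forwarding to the generic function keeps the existing behavior intact, so the wrapper only adds the one-time notice.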
