Default MtlArray storage to SharedStorage #820

Closed
KaanKesginLW wants to merge 1 commit into JuliaGPU:main from KaanKesginLW:feature/default-shared-storage

@KaanKesginLW

Contributor

Summary

Change the default MtlArray storage mode from PrivateStorage to SharedStorage. On unified-memory (Apple Silicon) GPUs — essentially every supported Mac — SharedStorage is zero-copy between CPU and GPU and matches Apple's guidance. Discrete GPUs keep working (and get a one-time notice steering them to PrivateStorage).

using Metal
a = MtlArray(rand(Float32, 1024, 1024))   # SharedStorage by default now
w = unsafe_wrap(Array, a)                  # zero-copy CPU view — no Array() copy

This supersedes #717 with a clean, minimal diff (+31 / −10 across 4 files, vs #717's +655 / −606).

Motivation

  • Apple's guidance. For unified memory, the Metal Best Practices Guide states "the Shared mode is usually the correct choice." The previous PrivateStorage default came from discrete-GPU guidance.
  • All supported Macs are Apple Silicon (unified memory), so this is the right default for ~every user.
  • Zero-copy CPU access. SharedStorage enables unsafe_wrap(Array, x), avoiding the allocate-and-copy of Array().

Benchmarks

No performance regression on an M2 Max, SharedStorage vs. PrivateStorage:

| size | op | Shared | Private |
|---|---|---|---|
| 512² | broadcast | 0.057 ms | 0.056 ms |
| 512² | matmul | 0.286 ms | 0.291 ms |
| 1024² | broadcast | 0.135 ms | 0.132 ms |
| 1024² | matmul | 0.664 ms | 0.788 ms |

Consistent with the original 67-benchmark sweep: 78% ties, identical for copyto! / fill! / MPS-matmul, and all ties at ≥ 512 MB.

Faster CPU access via zero-copy unsafe_wrap(Array, x) instead of Array()
(M2 Max, best of 5; both paths sum() to force a full read):

| size | Array() + use | unsafe_wrap + use | speedup |
|---|---|---|---|
| 512 MB | 17.1 ms | 9.0 ms | 1.9× |
| 1 GB | 37.5 ms | 18.2 ms | 2.1× |
| 2 GB | 249.7 ms | 36.4 ms | 6.9× |
| 4 GB | 539.2 ms | 72.5 ms | 7.4× |

(The jump at ≥ 2 GB is Array()'s GPU→CPU blit crossing into chunked copies;
unsafe_wrap stays flat because it never copies.)
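
As a quick sanity check on the table above, the sketch below recomputes the speedup column and normalizes each timing per GB. It uses only the numbers quoted in this PR; treating 512 MB as 0.5 GB and all variable names are assumptions made for illustration.

```julia
# Timings copied from the table above (M2 Max, best of 5), in milliseconds.
# Assumption: 512 MB is treated as 0.5 GB for the per-GB normalization.
sizes_gb = [0.5, 1.0, 2.0, 4.0]
copy_ms  = [17.1, 37.5, 249.7, 539.2]   # Array() + use
wrap_ms  = [9.0, 18.2, 36.4, 72.5]      # unsafe_wrap + use

# Ratio of the copying path to the zero-copy path.
speedup = copy_ms ./ wrap_ms

# Time per GB: a flat series means the cost scales linearly with size.
copy_per_gb = copy_ms ./ sizes_gb
wrap_per_gb = wrap_ms ./ sizes_gb

for i in eachindex(sizes_gb)
    println(sizes_gb[i], " GB: ",
            "speedup ", round(speedup[i]; digits=1), "x, ",
            "Array() ", round(copy_per_gb[i]; digits=1), " ms/GB, ",
            "unsafe_wrap ", round(wrap_per_gb[i]; digits=1), " ms/GB")
end
```

A per-GB figure for unsafe_wrap that stays near 18 ms, next to an Array() figure that jumps between 1 GB and 2 GB, is consistent with the chunked-copy explanation, though the arithmetic alone cannot confirm the cause.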

Implementation

  • The default_storage preference now defaults to "shared". Every constructor (mtl, zeros, fill, similar, …) already resolves through DefaultStorageMode, so they follow automatically.
  • The default stays a compile-time const — type-stable and with no device access during precompilation. (It is not a MTLDevice(…).hasUnifiedMemory call baked into a precompiled constant.)
  • On the rare non-unified-memory (discrete) GPU, __init__ emits a one-time notice pointing the user to set_preferences!(Metal, "default_storage" => "private"). SharedStorage still works there; this just flags the faster option.
  • Also fixes a latent bug in the old error message ($default_storage → $str).
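
The bullets above can be illustrated with a small, package-free sketch of resolving a preference string into a storage-mode type once, at load time. This is not the code in src/array.jl: the empty structs are stand-ins that only borrow the type names from this PR, `resolve_storage_mode` is a hypothetical helper, and the literal "shared" stands in for the real preference lookup.

```julia
# Stand-in types: the real package defines its own storage-mode types;
# these empty structs only mirror the names used in this PR.
abstract type StorageMode end
struct PrivateStorage <: StorageMode end
struct SharedStorage  <: StorageMode end

# Hypothetical helper: map a preference string to a storage-mode type.
# The real mapping may contain more entries than the two shown here.
function resolve_storage_mode(str::AbstractString)
    modes = Dict("private" => PrivateStorage, "shared" => SharedStorage)
    haskey(modes, str) && return modes[str]
    # Interpolate the local `str`. Interpolating a name that is not in
    # scope would throw an UndefVarError instead of this message.
    error("unknown default storage mode: $str")
end

# Resolved once when the code is loaded, so the default is a constant type
# and no device query is needed. The literal "shared" is a stand-in for
# reading the `default_storage` preference.
const DefaultStorageMode = let str = "shared"
    resolve_storage_mode(str)
end

println(DefaultStorageMode)
println(resolve_storage_mode("private"))
```

Because the value is fixed when the code is loaded, changing the preference (for example with the set_preferences! call quoted above) would presumably only take effect in a fresh Julia session.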

Notes

Supersedes #717, closed as "this will probably eventually happen but it'll be easier to do it in a separate PR than to rebase and clean up this one." This is that clean PR: no formatting churn, no test rewrite, just the default plus the discrete-GPU notice, docs, and a focused test.

Related: #717

On unified-memory (Apple Silicon) GPUs SharedStorage is zero-copy and matches
Apple's guidance; it is also valid on discrete GPUs. Change the default_storage
preference default from "private" to "shared".

- src/array.jl: DefaultStorageMode defaults to "shared" (was "private"); fix the
  error message interpolation ($str). All constructors (mtl, zeros, fill, ...)
  already use DefaultStorageMode, so they follow automatically. Docstrings updated.
- src/initialization.jl: on the rare non-unified-memory (discrete) GPU, warn once
  and point the user at the "private" preference.
- docs + test/array.jl: document and test the new default.

Supersedes the closed JuliaGPU#717 with a minimal, clean diff: no formatting churn, no
test rewrite, and the default is decided at the preference level rather than via a
device call inside a precompile-time const.
@codecov

codecov Bot commented Jun 9, 2026

Codecov Report

❌ Patch coverage is 50.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 81.30%. Comparing base (bfb9ba3) to head (12e9d2f).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/initialization.jl | 50.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #820      +/-   ##
==========================================
- Coverage   81.43%   81.30%   -0.14%     
==========================================
  Files          66       66              
  Lines        3318     3321       +3     
==========================================
- Hits         2702     2700       -2     
- Misses        616      621       +5     


@github-actions github-actions Bot left a comment

Contributor

Metal Benchmarks

Details
| Benchmark suite | Current: 12e9d2f | Previous: bfb9ba3 | Ratio |
|---|---|---|---|
| array/accumulate/Float32/1d | 818917 ns | 813334 ns | 1.01 |
| array/accumulate/Float32/dims=1 | 1004917 ns | 982750 ns | 1.02 |
| array/accumulate/Float32/dims=1L | 10042458 ns | 10003208 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 1307166 ns | 1259354 ns | 1.04 |
| array/accumulate/Float32/dims=2L | 6166208.5 ns | 5629958 ns | 1.10 |
| array/accumulate/Int64/1d | 975125 ns | 950125 ns | 1.03 |
| array/accumulate/Int64/dims=1 | 1107645.5 ns | 1125625.5 ns | 0.98 |
| array/accumulate/Int64/dims=1L | 11925167 ns | 12141625 ns | 0.98 |
| array/accumulate/Int64/dims=2 | 1478124.5 ns | 1476416 ns | 1.00 |
| array/accumulate/Int64/dims=2L | 9481646 ns | 9438083 ns | 1.00 |
| array/broadcast | 376750 ns | 374625 ns | 1.01 |
| array/construct | 6125 ns | 5666 ns | 1.08 |
| array/permutedims/2d | 641312 ns | 630750 ns | 1.02 |
| array/permutedims/3d | 1129708 ns | 1117000 ns | 1.01 |
| array/permutedims/4d | 1992854 ns | 1994209 ns | 1.00 |
| array/private/copy | 440687.5 ns | 412292 ns | 1.07 |
| array/private/copyto!/cpu_to_gpu | 371417 ns | 368583 ns | 1.01 |
| array/private/copyto!/gpu_to_cpu | 368792 ns | 358916 ns | 1.03 |
| array/private/copyto!/gpu_to_gpu | 342917 ns | 342666 ns | 1.00 |
| array/private/iteration/findall/bool | 1076500 ns | 1073250 ns | 1.00 |
| array/private/iteration/findall/int | 1255209 ns | 1252000 ns | 1.00 |
| array/private/iteration/findfirst/bool | 1468708 ns | 1458437 ns | 1.01 |
| array/private/iteration/findfirst/int | 1508291.5 ns | 1487958 ns | 1.01 |
| array/private/iteration/findmin/1d | 1608750 ns | 1592041 ns | 1.01 |
| array/private/iteration/findmin/2d | 1310792 ns | 1315875 ns | 1.00 |
| array/private/iteration/logical | 1754354 ns | 1743542 ns | 1.01 |
| array/private/iteration/scalar | 2683833 ns | 2638375.5 ns | 1.02 |
| array/random/rand/Float32 | 584458 ns | 634917 ns | 0.92 |
| array/random/rand/Int64 | 697833 ns | 669834 ns | 1.04 |
| array/random/rand!/Float32 | 587667 ns | 580958 ns | 1.01 |
| array/random/rand!/Int64 | 508250 ns | 509000 ns | 1.00 |
| array/random/randn/Float32 | 597416.5 ns | 597958 ns | 1.00 |
| array/random/randn!/Float32 | 535542 ns | 531209 ns | 1.01 |
| array/reductions/mapreduce/Float32/1d | 342500 ns | 750833 ns | 0.46 |
| array/reductions/mapreduce/Float32/dims=1 | 514250 ns | 499041.5 ns | 1.03 |
| array/reductions/mapreduce/Float32/dims=1L | 916395.5 ns | 780791 ns | 1.17 |
| array/reductions/mapreduce/Float32/dims=2 | 511042 ns | 502750 ns | 1.02 |
| array/reductions/mapreduce/Float32/dims=2L | 1364583 ns | 1356041 ns | 1.01 |
| array/reductions/mapreduce/Int64/1d | 671750 ns | 934917 ns | 0.72 |
| array/reductions/mapreduce/Int64/dims=1 | 795979.5 ns | 786666 ns | 1.01 |
| array/reductions/mapreduce/Int64/dims=1L | 1666167 ns | 1712500 ns | 0.97 |
| array/reductions/mapreduce/Int64/dims=2 | 972209 ns | 966667 ns | 1.01 |
| array/reductions/mapreduce/Int64/dims=2L | 2271354.5 ns | 2260917 ns | 1.00 |
| array/reductions/reduce/Float32/1d | 340500 ns | 743333 ns | 0.46 |
| array/reductions/reduce/Float32/dims=1 | 511542 ns | 499625 ns | 1.02 |
| array/reductions/reduce/Float32/dims=1L | 876250 ns | 813625 ns | 1.08 |
| array/reductions/reduce/Float32/dims=2 | 509084 ns | 505833 ns | 1.01 |
| array/reductions/reduce/Float32/dims=2L | 1360042 ns | 1346625 ns | 1.01 |
| array/reductions/reduce/Int64/1d | 668500 ns | 930625 ns | 0.72 |
| array/reductions/reduce/Int64/dims=1 | 800250 ns | 783875 ns | 1.02 |
| array/reductions/reduce/Int64/dims=1L | 1629125 ns | 1680125 ns | 0.97 |
| array/reductions/reduce/Int64/dims=2 | 971542 ns | 980563 ns | 0.99 |
| array/reductions/reduce/Int64/dims=2L | 2272541.5 ns | 2260250 ns | 1.01 |
| array/shared/copy | 231917 ns | 238375 ns | 0.97 |
| array/shared/copyto!/cpu_to_gpu | 48083 ns | 40667 ns | 1.18 |
| array/shared/copyto!/gpu_to_cpu | 47250 ns | 40667 ns | 1.16 |
| array/shared/copyto!/gpu_to_gpu | 42333 ns | 41292 ns | 1.03 |
| array/shared/iteration/findall/bool | 1080083 ns | 1079166 ns | 1.00 |
| array/shared/iteration/findall/int | 1255333 ns | 1250333 ns | 1.00 |
| array/shared/iteration/findfirst/bool | 1199333 ns | 1192416.5 ns | 1.01 |
| array/shared/iteration/findfirst/int | 1235625 ns | 1274104.5 ns | 0.97 |
| array/shared/iteration/findmin/1d | 1343208 ns | 1282291 ns | 1.05 |
| array/shared/iteration/findmin/2d | 1326584 ns | 1266125 ns | 1.05 |
| array/shared/iteration/logical | 1601666 ns | 1594459 ns | 1.00 |
| array/shared/iteration/scalar | 8666.666666666666 ns | 5868.166666666667 ns | 1.48 |
| integration/byval/reference | 1162750 ns | 1157708 ns | 1.00 |
| integration/byval/slices=1 | 1168479.5 ns | 1159208 ns | 1.01 |
| integration/byval/slices=2 | 2092334 ns | 2086791.5 ns | 1.00 |
| integration/byval/slices=3 | 7914208 ns | 7931979 ns | 1.00 |
| integration/metaldevrt | 489250 ns | 468520.5 ns | 1.04 |
| kernel/indexing | 370833 ns | 366500 ns | 1.01 |
| kernel/indexing_checked | 550208 ns | 540834 ns | 1.02 |
| kernel/launch | 14875 ns | 13208 ns | 1.13 |
| kernel/rand | 563916 ns | 558042 ns | 1.01 |
| latency/import | 1404497583 ns | 1399005208 ns | 1.00 |
| latency/precompile | 31277763959 ns | 31215035583.5 ns | 1.00 |
| latency/ttfp | 1710765542 ns | 1710418937.5 ns | 1.00 |
| metal/synchronization/context | 1045.8 ns | 838.258064516129 ns | 1.25 |
| metal/synchronization/stream | 696.7483443708609 ns | 436.9748743718593 ns | 1.59 |

This comment was automatically generated by workflow using github-action-benchmark.

@christiangnrd

Member

#822
