Default MtlArray storage to SharedStorage #820
Closed
KaanKesginLW wants to merge 1 commit into
On unified-memory (Apple Silicon) GPUs SharedStorage is zero-copy and matches Apple's guidance; it is also valid on discrete GPUs. Change the default_storage preference default from "private" to "shared".

- src/array.jl: DefaultStorageMode defaults to "shared" (was "private"); fix the error message interpolation ($str). All constructors (mtl, zeros, fill, ...) already use DefaultStorageMode, so they follow automatically. Docstrings updated.
- src/initialization.jl: on the rare non-unified-memory (discrete) GPU, warn once and point the user at the "private" preference.
- docs + test/array.jl: document and test the new default.

Supersedes the closed JuliaGPU#717 with a minimal, clean diff: no formatting churn, no test rewrite, and the default is decided at the preference level rather than via a device call inside a precompile-time const.
Codecov Report❌ Patch coverage is
Additional details and impacted files

Coverage Diff

| | main | #820 | +/- |
|---|---|---|---|
| Coverage | 81.43% | 81.30% | -0.14% |
| Files | 66 | 66 | |
| Lines | 3318 | 3321 | +3 |
| Hits | 2702 | 2700 | -2 |
| Misses | 616 | 621 | +5 |
Metal Benchmarks

| Benchmark suite | Current: 12e9d2f | Previous: bfb9ba3 | Ratio |
|---|---|---|---|
| `array/accumulate/Float32/1d` | 818917 ns | 813334 ns | 1.01 |
| `array/accumulate/Float32/dims=1` | 1004917 ns | 982750 ns | 1.02 |
| `array/accumulate/Float32/dims=1L` | 10042458 ns | 10003208 ns | 1.00 |
| `array/accumulate/Float32/dims=2` | 1307166 ns | 1259354 ns | 1.04 |
| `array/accumulate/Float32/dims=2L` | 6166208.5 ns | 5629958 ns | 1.10 |
| `array/accumulate/Int64/1d` | 975125 ns | 950125 ns | 1.03 |
| `array/accumulate/Int64/dims=1` | 1107645.5 ns | 1125625.5 ns | 0.98 |
| `array/accumulate/Int64/dims=1L` | 11925167 ns | 12141625 ns | 0.98 |
| `array/accumulate/Int64/dims=2` | 1478124.5 ns | 1476416 ns | 1.00 |
| `array/accumulate/Int64/dims=2L` | 9481646 ns | 9438083 ns | 1.00 |
| `array/broadcast` | 376750 ns | 374625 ns | 1.01 |
| `array/construct` | 6125 ns | 5666 ns | 1.08 |
| `array/permutedims/2d` | 641312 ns | 630750 ns | 1.02 |
| `array/permutedims/3d` | 1129708 ns | 1117000 ns | 1.01 |
| `array/permutedims/4d` | 1992854 ns | 1994209 ns | 1.00 |
| `array/private/copy` | 440687.5 ns | 412292 ns | 1.07 |
| `array/private/copyto!/cpu_to_gpu` | 371417 ns | 368583 ns | 1.01 |
| `array/private/copyto!/gpu_to_cpu` | 368792 ns | 358916 ns | 1.03 |
| `array/private/copyto!/gpu_to_gpu` | 342917 ns | 342666 ns | 1.00 |
| `array/private/iteration/findall/bool` | 1076500 ns | 1073250 ns | 1.00 |
| `array/private/iteration/findall/int` | 1255209 ns | 1252000 ns | 1.00 |
| `array/private/iteration/findfirst/bool` | 1468708 ns | 1458437 ns | 1.01 |
| `array/private/iteration/findfirst/int` | 1508291.5 ns | 1487958 ns | 1.01 |
| `array/private/iteration/findmin/1d` | 1608750 ns | 1592041 ns | 1.01 |
| `array/private/iteration/findmin/2d` | 1310792 ns | 1315875 ns | 1.00 |
| `array/private/iteration/logical` | 1754354 ns | 1743542 ns | 1.01 |
| `array/private/iteration/scalar` | 2683833 ns | 2638375.5 ns | 1.02 |
| `array/random/rand/Float32` | 584458 ns | 634917 ns | 0.92 |
| `array/random/rand/Int64` | 697833 ns | 669834 ns | 1.04 |
| `array/random/rand!/Float32` | 587667 ns | 580958 ns | 1.01 |
| `array/random/rand!/Int64` | 508250 ns | 509000 ns | 1.00 |
| `array/random/randn/Float32` | 597416.5 ns | 597958 ns | 1.00 |
| `array/random/randn!/Float32` | 535542 ns | 531209 ns | 1.01 |
| `array/reductions/mapreduce/Float32/1d` | 342500 ns | 750833 ns | 0.46 |
| `array/reductions/mapreduce/Float32/dims=1` | 514250 ns | 499041.5 ns | 1.03 |
| `array/reductions/mapreduce/Float32/dims=1L` | 916395.5 ns | 780791 ns | 1.17 |
| `array/reductions/mapreduce/Float32/dims=2` | 511042 ns | 502750 ns | 1.02 |
| `array/reductions/mapreduce/Float32/dims=2L` | 1364583 ns | 1356041 ns | 1.01 |
| `array/reductions/mapreduce/Int64/1d` | 671750 ns | 934917 ns | 0.72 |
| `array/reductions/mapreduce/Int64/dims=1` | 795979.5 ns | 786666 ns | 1.01 |
| `array/reductions/mapreduce/Int64/dims=1L` | 1666167 ns | 1712500 ns | 0.97 |
| `array/reductions/mapreduce/Int64/dims=2` | 972209 ns | 966667 ns | 1.01 |
| `array/reductions/mapreduce/Int64/dims=2L` | 2271354.5 ns | 2260917 ns | 1.00 |
| `array/reductions/reduce/Float32/1d` | 340500 ns | 743333 ns | 0.46 |
| `array/reductions/reduce/Float32/dims=1` | 511542 ns | 499625 ns | 1.02 |
| `array/reductions/reduce/Float32/dims=1L` | 876250 ns | 813625 ns | 1.08 |
| `array/reductions/reduce/Float32/dims=2` | 509084 ns | 505833 ns | 1.01 |
| `array/reductions/reduce/Float32/dims=2L` | 1360042 ns | 1346625 ns | 1.01 |
| `array/reductions/reduce/Int64/1d` | 668500 ns | 930625 ns | 0.72 |
| `array/reductions/reduce/Int64/dims=1` | 800250 ns | 783875 ns | 1.02 |
| `array/reductions/reduce/Int64/dims=1L` | 1629125 ns | 1680125 ns | 0.97 |
| `array/reductions/reduce/Int64/dims=2` | 971542 ns | 980563 ns | 0.99 |
| `array/reductions/reduce/Int64/dims=2L` | 2272541.5 ns | 2260250 ns | 1.01 |
| `array/shared/copy` | 231917 ns | 238375 ns | 0.97 |
| `array/shared/copyto!/cpu_to_gpu` | 48083 ns | 40667 ns | 1.18 |
| `array/shared/copyto!/gpu_to_cpu` | 47250 ns | 40667 ns | 1.16 |
| `array/shared/copyto!/gpu_to_gpu` | 42333 ns | 41292 ns | 1.03 |
| `array/shared/iteration/findall/bool` | 1080083 ns | 1079166 ns | 1.00 |
| `array/shared/iteration/findall/int` | 1255333 ns | 1250333 ns | 1.00 |
| `array/shared/iteration/findfirst/bool` | 1199333 ns | 1192416.5 ns | 1.01 |
| `array/shared/iteration/findfirst/int` | 1235625 ns | 1274104.5 ns | 0.97 |
| `array/shared/iteration/findmin/1d` | 1343208 ns | 1282291 ns | 1.05 |
| `array/shared/iteration/findmin/2d` | 1326584 ns | 1266125 ns | 1.05 |
| `array/shared/iteration/logical` | 1601666 ns | 1594459 ns | 1.00 |
| `array/shared/iteration/scalar` | 8666.666666666666 ns | 5868.166666666667 ns | 1.48 |
| `integration/byval/reference` | 1162750 ns | 1157708 ns | 1.00 |
| `integration/byval/slices=1` | 1168479.5 ns | 1159208 ns | 1.01 |
| `integration/byval/slices=2` | 2092334 ns | 2086791.5 ns | 1.00 |
| `integration/byval/slices=3` | 7914208 ns | 7931979 ns | 1.00 |
| `integration/metaldevrt` | 489250 ns | 468520.5 ns | 1.04 |
| `kernel/indexing` | 370833 ns | 366500 ns | 1.01 |
| `kernel/indexing_checked` | 550208 ns | 540834 ns | 1.02 |
| `kernel/launch` | 14875 ns | 13208 ns | 1.13 |
| `kernel/rand` | 563916 ns | 558042 ns | 1.01 |
| `latency/import` | 1404497583 ns | 1399005208 ns | 1.00 |
| `latency/precompile` | 31277763959 ns | 31215035583.5 ns | 1.00 |
| `latency/ttfp` | 1710765542 ns | 1710418937.5 ns | 1.00 |
| `metal/synchronization/context` | 1045.8 ns | 838.258064516129 ns | 1.25 |
| `metal/synchronization/stream` | 696.7483443708609 ns | 436.9748743718593 ns | 1.59 |

This comment was automatically generated by workflow using github-action-benchmark.
Summary

Change the default `MtlArray` storage mode from `PrivateStorage` to `SharedStorage`. On unified-memory (Apple Silicon) GPUs, which covers essentially every supported Mac, `SharedStorage` is zero-copy between CPU and GPU and matches Apple's guidance. Discrete GPUs keep working (and get a one-time notice steering them to `PrivateStorage`).

This supersedes #717 with a clean, minimal diff (+31 / −10 across 4 files, vs #717's +655 / −606).

Motivation

- The `PrivateStorage` default came from discrete-GPU guidance.
- `SharedStorage` enables `unsafe_wrap(Array, x)`, avoiding the allocate-and-copy of `Array()`.

Benchmarks

No performance regression (M2 Max): `SharedStorage` ≈ `PrivateStorage`. This is consistent with the original 67-benchmark sweep: 78% ties, identical for `copyto!` / `fill!` / MPS-matmul, and all ties at ≥ 512 MB.

Faster CPU access via zero-copy `unsafe_wrap(Array, x)` instead of `Array()` (M2 Max, best of 5; both paths `sum()` to force a full read), comparing `Array()` + use against `unsafe_wrap` + use.

(The jump at ≥ 2 GB is `Array()`'s GPU→CPU blit crossing into chunked copies; `unsafe_wrap` stays flat because it never copies.)

Implementation

- The `default_storage` preference now defaults to `"shared"`. Every constructor (`mtl`, `zeros`, `fill`, `similar`, …) already resolves through `DefaultStorageMode`, so they follow automatically.
- `DefaultStorageMode` stays a `const`: type-stable and with no device access during precompilation. (It is not a `MTLDevice(…).hasUnifiedMemory` call baked into a precompiled constant.)
- On a discrete (non-unified-memory) GPU, `__init__` emits a one-time notice pointing the user to `set_preferences!(Metal, "default_storage" => "private")`. `SharedStorage` still works there; this just flags the faster option.
- Fixes the error message interpolation (`$default_storage` → `$str`).

Notes

Supersedes #717, closed as "this will probably eventually happen but it'll be easier to do it in a separate PR than to rebase and clean up this one." This is that clean PR: no formatting churn, no test rewrite, just the default plus the discrete-GPU notice, docs, and a focused test.
Related: #717
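
For readers who want a concrete picture of the preference-level default described under Implementation, here is a minimal, self-contained sketch. It is not the code from this PR. The storage types are hypothetical stand-ins defined locally, the helper names `storage_from_preference` and `notice_if_discrete` are made up for illustration, and an environment variable (`DEMO_DEFAULT_STORAGE`, also made up) stands in for the `default_storage` preference so the snippet runs without Metal or a GPU.

```julia
# Hypothetical stand-ins for the package's storage-mode marker types.
abstract type StorageMode end
struct PrivateStorage <: StorageMode end
struct SharedStorage  <: StorageMode end

# Map a preference string to a storage-mode type.
function storage_from_preference(str::AbstractString)
    if str == "private"
        return PrivateStorage
    elseif str == "shared"
        return SharedStorage
    else
        # Interpolate `str`, the variable that is actually in scope here.
        error("unknown default storage mode: $str")
    end
end

# Resolved once at load time from a plain string, with no device query.
# Assumption: in the package the string comes from the "default_storage"
# preference; here an environment variable stands in, defaulting to "shared".
const DefaultStorageMode =
    storage_from_preference(get(ENV, "DEMO_DEFAULT_STORAGE", "shared"))

# One-time notice for devices without unified memory.
# Returns true when the device is discrete (the case that warrants a notice).
function notice_if_discrete(has_unified_memory::Bool)
    has_unified_memory && return false
    # maxlog=1 limits this call site to a single log message per session.
    @warn "Discrete GPU detected; consider setting default_storage to \"private\"." maxlog=1
    return true
end
```

Because the default is derived from a string rather than from the device, it can be a `const` and an unknown value fails immediately with the offending string in the message. On a real installation the switch back would be the call quoted above, `set_preferences!(Metal, "default_storage" => "private")`.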