Port non-blocking synchronization from CUDA.jl #783
Wait for command buffers via a completion handler that notifies the Julia scheduler, instead of parking the calling thread in waitUntilCompleted. Fixes task switches from command buffer callbacks (#532).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
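The idea in the description can be sketched in plain Julia without touching the GPU. This is a minimal illustration, not the code from this PR: `FakeCommandBuffer`, `add_completed_handler!`, `commit!` and `wait_completed` are hypothetical names, and the "GPU work" is simulated with `sleep`. A real completion handler may run on a thread owned by Metal, which this sketch does not model.

```julia
# Hypothetical stand-in for a command buffer; not Metal.jl's actual type.
struct FakeCommandBuffer
    duration::Float64            # simulated GPU run time in seconds
    handlers::Vector{Function}   # completion handlers to call when done
end
FakeCommandBuffer(duration) = FakeCommandBuffer(duration, Function[])

# Register a function to be called once the simulated work has finished.
add_completed_handler!(buf::FakeCommandBuffer, f) = push!(buf.handlers, f)

# Simulate committing the buffer: the handlers fire later, from another task.
function commit!(buf::FakeCommandBuffer)
    Threads.@spawn begin
        sleep(buf.duration)
        foreach(h -> h(), buf.handlers)
    end
    return buf
end

# Non-blocking wait: the handler notifies an event and the caller waits on
# that event. `wait` suspends only the current task, so the scheduler can
# keep running other tasks on this thread.
function wait_completed(buf::FakeCommandBuffer)
    done = Base.Event()
    add_completed_handler!(buf, () -> notify(done))
    commit!(buf)
    wait(done)
    return nothing
end

wait_completed(FakeCommandBuffer(0.05))
```

The contrast is with a call that blocks the operating-system thread until the buffer completes: during such a call no other Julia task can be scheduled on that thread.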
|
This essentially reverts #690, but if synchronization is faster I don't mind doing this until we want to support Metal 4. |
It's not. The only reason is to avoid blocking the calling thread, since that can cause deadlocks when it is thread 0 and a callback from Metal also wants to do I/O (which in Julia can only happen on thread 0). It also makes it possible to run other code during synchronization, but that's not the immediate goal here. |
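The "running code during synchronization" point can be illustrated with a small sketch, again in plain Julia and without Metal. `fake_gpu_work` and `overlap_demo` are hypothetical names, and a `Timer` stands in for the completion callback. While the main task waits on the event, a second task on the same thread keeps making progress.

```julia
# Stand-in for GPU completion: an event that a timer notifies after `seconds`.
function fake_gpu_work(seconds)
    done = Base.Event()
    Timer(_ -> notify(done), seconds)
    return done
end

function overlap_demo()
    ticks = Ref(0)
    done = fake_gpu_work(0.2)
    # A second task that needs the scheduler in order to make progress.
    @async for _ in 1:5
        sleep(0.01)
        ticks[] += 1
    end
    wait(done)        # suspends this task only; the ticker runs meanwhile
    return ticks[]    # number of ticks observed by the time the wait returned
end

println("ticks during wait: ", overlap_demo())
```

If the wait were a blocking foreign call instead, the ticker task would not get to run on that thread until the call returned, which is the situation the comment above describes for thread 0 and I/O.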
Codecov Report
❌ Patch coverage is
Additional details and impacted files

| | main | #783 | +/- |
|---|---|---|---|
| Coverage | 80.84% | 80.96% | +0.11% |
| Files | 63 | 64 | +1 |
| Lines | 3017 | 3057 | +40 |
| Hits | 2439 | 2475 | +36 |
| Misses | 578 | 582 | +4 |
|
Metal Benchmarks
| Benchmark suite | Current: 50f8832 | Previous: c8d960b | Ratio |
|---|---|---|---|
| array/accumulate/Float32/1d | 752208.5 ns | 1112417 ns | 0.68 |
| array/accumulate/Float32/dims=1 | 962833 ns | 1556166 ns | 0.62 |
| array/accumulate/Float32/dims=1L | 9793500 ns | 9750875 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 1272708 ns | 1853083 ns | 0.69 |
| array/accumulate/Float32/dims=2L | 5430479 ns | 7085833 ns | 0.77 |
| array/accumulate/Int64/1d | 915666 ns | 1244792 ns | 0.74 |
| array/accumulate/Int64/dims=1 | 1093084 ns | 1825708.5 ns | 0.60 |
| array/accumulate/Int64/dims=1L | 11746625 ns | 11623917 ns | 1.01 |
| array/accumulate/Int64/dims=2 | 1463000 ns | 2166041 ns | 0.68 |
| array/accumulate/Int64/dims=2L | 9319958.5 ns | 9786417 ns | 0.95 |
| array/broadcast | 371042 ns | 634000 ns | 0.59 |
| array/construct | 5791 ns | 6125 ns | 0.95 |
| array/permutedims/2d | 616583 ns | 1161792 ns | 0.53 |
| array/permutedims/3d | 1122208.5 ns | 1646292 ns | 0.68 |
| array/permutedims/4d | 1981312.5 ns | 2585542 ns | 0.77 |
| array/private/copy | 425417 ns | 552542 ns | 0.77 |
| array/private/copyto!/cpu_to_gpu | 353104.5 ns | 783625 ns | 0.45 |
| array/private/copyto!/gpu_to_cpu | 354625 ns | 783000 ns | 0.45 |
| array/private/copyto!/gpu_to_gpu | 335541 ns | 612583 ns | 0.55 |
| array/private/iteration/findall/bool | 1045042 ns | 1398750 ns | 0.75 |
| array/private/iteration/findall/int | 1187166 ns | 1564083.5 ns | 0.76 |
| array/private/iteration/findfirst/bool | 1337250 ns | 1967959 ns | 0.68 |
| array/private/iteration/findfirst/int | 1384625 ns | 1988500 ns | 0.70 |
| array/private/iteration/findmin/1d | 1449458.5 ns | 2287333 ns | 0.63 |
| array/private/iteration/findmin/2d | 1207604 ns | 1598145.5 ns | 0.76 |
| array/private/iteration/logical | 1594312.5 ns | 2651562 ns | 0.60 |
| array/private/iteration/scalar | 2684562.5 ns | 4420479.5 ns | 0.61 |
| array/random/rand/Float32 | 630042 ns | 1122042 ns | 0.56 |
| array/random/rand/Int64 | 690146 ns | 1266542 ns | 0.54 |
| array/random/rand!/Float32 | 542084 ns | 899875 ns | 0.60 |
| array/random/rand!/Int64 | 501083 ns | 853437.5 ns | 0.59 |
| array/random/randn/Float32 | 596208 ns | 1081042 ns | 0.55 |
| array/random/randn!/Float32 | 508916 ns | 845645.5 ns | 0.60 |
| array/reductions/mapreduce/Float32/1d | 498500 ns | 1008333 ns | 0.49 |
| array/reductions/mapreduce/Float32/dims=1 | 499792 ns | 842875 ns | 0.59 |
| array/reductions/mapreduce/Float32/dims=1L | 734333 ns | 1374166.5 ns | 0.53 |
| array/reductions/mapreduce/Float32/dims=2 | 497812.5 ns | 837250 ns | 0.59 |
| array/reductions/mapreduce/Float32/dims=2L | 1351459 ns | 1791354 ns | 0.75 |
| array/reductions/mapreduce/Int64/1d | 931479.5 ns | 1518667 ns | 0.61 |
| array/reductions/mapreduce/Int64/dims=1 | 788042 ns | 1144437.5 ns | 0.69 |
| array/reductions/mapreduce/Int64/dims=1L | 1649645.5 ns | 2042229 ns | 0.81 |
| array/reductions/mapreduce/Int64/dims=2 | 982500 ns | 1325292 ns | 0.74 |
| array/reductions/mapreduce/Int64/dims=2L | 2253459 ns | 4595187.5 ns | 0.49 |
| array/reductions/reduce/Float32/1d | 721250 ns | 1011458 ns | 0.71 |
| array/reductions/reduce/Float32/dims=1 | 498084 ns | 843833 ns | 0.59 |
| array/reductions/reduce/Float32/dims=1L | 713083 ns | 1374167 ns | 0.52 |
| array/reductions/reduce/Float32/dims=2 | 498042 ns | 843625 ns | 0.59 |
| array/reductions/reduce/Float32/dims=2L | 1346833 ns | 1787520.5 ns | 0.75 |
| array/reductions/reduce/Int64/1d | 926083.5 ns | 1376062 ns | 0.67 |
| array/reductions/reduce/Int64/dims=1 | 797125 ns | 1114583 ns | 0.72 |
| array/reductions/reduce/Int64/dims=1L | 1435375 ns | 2030542 ns | 0.71 |
| array/reductions/reduce/Int64/dims=2 | 981271 ns | 1308500 ns | 0.75 |
| array/reductions/reduce/Int64/dims=2L | 2256458 ns | 4207916 ns | 0.54 |
| array/shared/copy | 212000 ns | 244208 ns | 0.87 |
| array/shared/copyto!/cpu_to_gpu | 46250 ns | 80292 ns | 0.58 |
| array/shared/copyto!/gpu_to_cpu | 40416 ns | 80750 ns | 0.50 |
| array/shared/copyto!/gpu_to_gpu | 48000 ns | 80583 ns | 0.60 |
| array/shared/iteration/findall/bool | 1052208 ns | 1416084 ns | 0.74 |
| array/shared/iteration/findall/int | 1189375 ns | 1535667 ns | 0.77 |
| array/shared/iteration/findfirst/bool | 1067875 ns | 1574458.5 ns | 0.68 |
| array/shared/iteration/findfirst/int | 1080000 ns | 1588625 ns | 0.68 |
| array/shared/iteration/findmin/1d | 1199584 ns | 1906750 ns | 0.63 |
| array/shared/iteration/findmin/2d | 1208021 ns | 1601292 ns | 0.75 |
| array/shared/iteration/logical | 1428166.5 ns | 2298125 ns | 0.62 |
| array/shared/iteration/scalar | 5138.833333333333 ns | 193083 ns | 0.026614633775802806 |
| integration/byval/reference | 1155917 ns | 1575250 ns | 0.73 |
| integration/byval/slices=1 | 1158083 ns | 1573500 ns | 0.74 |
| integration/byval/slices=2 | 2084792 ns | 2633000 ns | 0.79 |
| integration/byval/slices=3 | 10213854.5 ns | 7919354.5 ns | 1.29 |
| integration/metaldevrt | 458208 ns | 786708 ns | 0.58 |
| kernel/indexing | 353917 ns | 646458 ns | 0.55 |
| kernel/indexing_checked | 355542 ns | 660812.5 ns | 0.54 |
| kernel/launch | 11875 ns | 12666 ns | 0.94 |
| kernel/rand | 362375 ns | 589042 ns | 0.62 |
| latency/import | 1384144062.5 ns | 1381318041.5 ns | 1.00 |
| latency/precompile | 29321346833 ns | 29203062750 ns | 1.00 |
| latency/ttfp | 1649868125 ns | 1646438625 ns | 1.00 |
| metal/synchronization/context | 770.1475409836065 ns | 19458 ns | 0.0395799949112759 |
| metal/synchronization/stream | 363.4354066985646 ns | 18583 ns | 0.019557413049484183 |
This comment was automatically generated by workflow using github-action-benchmark.
|
New approach. Safer and faster now. Benchmark by Claude: Results
What's actually happening per scenario
Notes
|
Closes #532