Port non-blocking synchronization from CUDA.jl#783

Merged: maleadt merged 4 commits into main from tb/nonblocking_sync on May 28, 2026

@maleadt maleadt commented May 26, 2026

Member

Closes #532

maleadt and others added 2 commits May 26, 2026 21:59
Wait for command buffers via a completion handler that notifies the Julia
scheduler, instead of parking the calling thread in waitUntilCompleted.
Fixes task switches from command buffer callbacks (#532).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@christiangnrd

Member

This essentially reverts #690, but if synchronization is faster I don't mind doing this until we want to support Metal 4.

maleadt commented May 26, 2026

Member Author

> if synchronization is faster I don't mind doing this until we want to support Metal 4.

It's not. The only reason is to avoid blocking the calling thread, since that can cause deadlocks when it's thread 0 and a callback from Metal also wants to do I/O (which in Julia can only happen on thread 0).

Well, it also enables running code during synchronization, but that's not the immediate goal here.
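
To illustrate the distinction drawn in this comment, here is a minimal, language-neutral sketch in Python (it is not Metal.jl code). It assumes a hypothetical `fake_gpu_work` helper standing in for a command buffer whose completion handler fires on a foreign thread. The handler only notifies the scheduler (here, the asyncio event loop), so the waiting task is suspended while other tasks on the same thread keep running.

```python
import asyncio
import threading
import time


def fake_gpu_work(duration_s, on_completed):
    """Hypothetical stand-in for a command buffer: runs on a foreign thread
    and invokes the completion handler when the work is done."""
    def run():
        time.sleep(duration_s)
        on_completed()

    worker = threading.Thread(target=run)
    worker.start()
    return worker


async def nonblocking_wait(duration_s):
    """Wait for completion without parking the scheduler thread."""
    loop = asyncio.get_running_loop()
    done = asyncio.Event()
    # The handler runs on the worker thread, so it hands the notification
    # to the event loop in a thread-safe way instead of touching `done` directly.
    worker = fake_gpu_work(
        duration_s, lambda: loop.call_soon_threadsafe(done.set)
    )
    await done.wait()  # the loop stays free to run other tasks here
    worker.join()


async def main():
    ticks = 0

    async def ticker():
        # Stand-in for unrelated work (for example I/O) on the same thread.
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0.01)

    task = asyncio.create_task(ticker())
    await nonblocking_wait(0.1)
    task.cancel()
    return ticks


ticks_during_wait = asyncio.run(main())
print("other task ran", ticks_during_wait, "times during the wait")
```

Had `nonblocking_wait` called `worker.join()` directly instead of awaiting the event, the ticker could not have run until the work finished, which is the analogue of parking the calling thread in `waitUntilCompleted`.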

codecov Bot commented May 26, 2026


Codecov Report

❌ Patch coverage is 92.06349% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.96%. Comparing base (08dd32c) to head (50f8832).
⚠️ Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/synchronization.jl | 90.47% | 4 Missing ⚠️ |
| lib/mps/ndarray.jl | 0.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #783      +/-   ##
==========================================
+ Coverage   80.84%   80.96%   +0.11%     
==========================================
  Files          63       64       +1     
  Lines        3017     3057      +40     
==========================================
+ Hits         2439     2475      +36     
- Misses        578      582       +4     

@github-actions github-actions Bot left a comment

Contributor

Metal Benchmarks

Details
| Benchmark suite | Current: 50f8832 | Previous: c8d960b | Ratio |
|---|---|---|---|
| array/accumulate/Float32/1d | 752208.5 ns | 1112417 ns | 0.68 |
| array/accumulate/Float32/dims=1 | 962833 ns | 1556166 ns | 0.62 |
| array/accumulate/Float32/dims=1L | 9793500 ns | 9750875 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 1272708 ns | 1853083 ns | 0.69 |
| array/accumulate/Float32/dims=2L | 5430479 ns | 7085833 ns | 0.77 |
| array/accumulate/Int64/1d | 915666 ns | 1244792 ns | 0.74 |
| array/accumulate/Int64/dims=1 | 1093084 ns | 1825708.5 ns | 0.60 |
| array/accumulate/Int64/dims=1L | 11746625 ns | 11623917 ns | 1.01 |
| array/accumulate/Int64/dims=2 | 1463000 ns | 2166041 ns | 0.68 |
| array/accumulate/Int64/dims=2L | 9319958.5 ns | 9786417 ns | 0.95 |
| array/broadcast | 371042 ns | 634000 ns | 0.59 |
| array/construct | 5791 ns | 6125 ns | 0.95 |
| array/permutedims/2d | 616583 ns | 1161792 ns | 0.53 |
| array/permutedims/3d | 1122208.5 ns | 1646292 ns | 0.68 |
| array/permutedims/4d | 1981312.5 ns | 2585542 ns | 0.77 |
| array/private/copy | 425417 ns | 552542 ns | 0.77 |
| array/private/copyto!/cpu_to_gpu | 353104.5 ns | 783625 ns | 0.45 |
| array/private/copyto!/gpu_to_cpu | 354625 ns | 783000 ns | 0.45 |
| array/private/copyto!/gpu_to_gpu | 335541 ns | 612583 ns | 0.55 |
| array/private/iteration/findall/bool | 1045042 ns | 1398750 ns | 0.75 |
| array/private/iteration/findall/int | 1187166 ns | 1564083.5 ns | 0.76 |
| array/private/iteration/findfirst/bool | 1337250 ns | 1967959 ns | 0.68 |
| array/private/iteration/findfirst/int | 1384625 ns | 1988500 ns | 0.70 |
| array/private/iteration/findmin/1d | 1449458.5 ns | 2287333 ns | 0.63 |
| array/private/iteration/findmin/2d | 1207604 ns | 1598145.5 ns | 0.76 |
| array/private/iteration/logical | 1594312.5 ns | 2651562 ns | 0.60 |
| array/private/iteration/scalar | 2684562.5 ns | 4420479.5 ns | 0.61 |
| array/random/rand/Float32 | 630042 ns | 1122042 ns | 0.56 |
| array/random/rand/Int64 | 690146 ns | 1266542 ns | 0.54 |
| array/random/rand!/Float32 | 542084 ns | 899875 ns | 0.60 |
| array/random/rand!/Int64 | 501083 ns | 853437.5 ns | 0.59 |
| array/random/randn/Float32 | 596208 ns | 1081042 ns | 0.55 |
| array/random/randn!/Float32 | 508916 ns | 845645.5 ns | 0.60 |
| array/reductions/mapreduce/Float32/1d | 498500 ns | 1008333 ns | 0.49 |
| array/reductions/mapreduce/Float32/dims=1 | 499792 ns | 842875 ns | 0.59 |
| array/reductions/mapreduce/Float32/dims=1L | 734333 ns | 1374166.5 ns | 0.53 |
| array/reductions/mapreduce/Float32/dims=2 | 497812.5 ns | 837250 ns | 0.59 |
| array/reductions/mapreduce/Float32/dims=2L | 1351459 ns | 1791354 ns | 0.75 |
| array/reductions/mapreduce/Int64/1d | 931479.5 ns | 1518667 ns | 0.61 |
| array/reductions/mapreduce/Int64/dims=1 | 788042 ns | 1144437.5 ns | 0.69 |
| array/reductions/mapreduce/Int64/dims=1L | 1649645.5 ns | 2042229 ns | 0.81 |
| array/reductions/mapreduce/Int64/dims=2 | 982500 ns | 1325292 ns | 0.74 |
| array/reductions/mapreduce/Int64/dims=2L | 2253459 ns | 4595187.5 ns | 0.49 |
| array/reductions/reduce/Float32/1d | 721250 ns | 1011458 ns | 0.71 |
| array/reductions/reduce/Float32/dims=1 | 498084 ns | 843833 ns | 0.59 |
| array/reductions/reduce/Float32/dims=1L | 713083 ns | 1374167 ns | 0.52 |
| array/reductions/reduce/Float32/dims=2 | 498042 ns | 843625 ns | 0.59 |
| array/reductions/reduce/Float32/dims=2L | 1346833 ns | 1787520.5 ns | 0.75 |
| array/reductions/reduce/Int64/1d | 926083.5 ns | 1376062 ns | 0.67 |
| array/reductions/reduce/Int64/dims=1 | 797125 ns | 1114583 ns | 0.72 |
| array/reductions/reduce/Int64/dims=1L | 1435375 ns | 2030542 ns | 0.71 |
| array/reductions/reduce/Int64/dims=2 | 981271 ns | 1308500 ns | 0.75 |
| array/reductions/reduce/Int64/dims=2L | 2256458 ns | 4207916 ns | 0.54 |
| array/shared/copy | 212000 ns | 244208 ns | 0.87 |
| array/shared/copyto!/cpu_to_gpu | 46250 ns | 80292 ns | 0.58 |
| array/shared/copyto!/gpu_to_cpu | 40416 ns | 80750 ns | 0.50 |
| array/shared/copyto!/gpu_to_gpu | 48000 ns | 80583 ns | 0.60 |
| array/shared/iteration/findall/bool | 1052208 ns | 1416084 ns | 0.74 |
| array/shared/iteration/findall/int | 1189375 ns | 1535667 ns | 0.77 |
| array/shared/iteration/findfirst/bool | 1067875 ns | 1574458.5 ns | 0.68 |
| array/shared/iteration/findfirst/int | 1080000 ns | 1588625 ns | 0.68 |
| array/shared/iteration/findmin/1d | 1199584 ns | 1906750 ns | 0.63 |
| array/shared/iteration/findmin/2d | 1208021 ns | 1601292 ns | 0.75 |
| array/shared/iteration/logical | 1428166.5 ns | 2298125 ns | 0.62 |
| array/shared/iteration/scalar | 5138.833333333333 ns | 193083 ns | 0.026614633775802806 |
| integration/byval/reference | 1155917 ns | 1575250 ns | 0.73 |
| integration/byval/slices=1 | 1158083 ns | 1573500 ns | 0.74 |
| integration/byval/slices=2 | 2084792 ns | 2633000 ns | 0.79 |
| integration/byval/slices=3 | 10213854.5 ns | 7919354.5 ns | 1.29 |
| integration/metaldevrt | 458208 ns | 786708 ns | 0.58 |
| kernel/indexing | 353917 ns | 646458 ns | 0.55 |
| kernel/indexing_checked | 355542 ns | 660812.5 ns | 0.54 |
| kernel/launch | 11875 ns | 12666 ns | 0.94 |
| kernel/rand | 362375 ns | 589042 ns | 0.62 |
| latency/import | 1384144062.5 ns | 1381318041.5 ns | 1.00 |
| latency/precompile | 29321346833 ns | 29203062750 ns | 1.00 |
| latency/ttfp | 1649868125 ns | 1646438625 ns | 1.00 |
| metal/synchronization/context | 770.1475409836065 ns | 19458 ns | 0.0395799949112759 |
| metal/synchronization/stream | 363.4354066985646 ns | 18583 ns | 0.019557413049484183 |

This comment was automatically generated by workflow using github-action-benchmark.

maleadt commented May 28, 2026

Member Author

New approach: safer and faster now. Benchmark by Claude:

Results

| Scenario | main (run 1) | main (run 2) | branch (run 1) | branch (run 2) | speed-up |
|---|---|---|---|---|---|
| 1. synchronize(q) on a queue that has never had work committed | 15.87 µs | 15.83 µs | 0.185 µs | 0.183 µs | ~86× |
| 2. synchronize() when the queue is idle (had work, fully drained) | 15.55 µs | 15.56 µs | 0.368 µs | 0.363 µs | ~42× |
| 3. @metal big + synchronize(): long kernel forces block path | 13 874 µs | 13 873 µs | 13 964 µs | 13 912 µs | ~1.00× (kernel-dominated) |
| 4. @metal small + synchronize(): realistic tight loop | 359.0 µs | 365.1 µs | 148.6 µs | 149.6 µs | ~2.4× |
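
The speed-up column can be approximately reproduced from the four timing columns. The comment does not say how the column was computed, so the sketch below assumes it is the ratio of the mean main time to the mean branch time; `timings_us` and `speedup` are names introduced here for illustration only.

```python
# (main run 1, main run 2, branch run 1, branch run 2), all in microseconds,
# copied from the results table above.
timings_us = {
    "1. never-used queue": (15.87, 15.83, 0.185, 0.183),
    "2. idle queue": (15.55, 15.56, 0.368, 0.363),
    "3. big kernel + sync": (13874, 13873, 13964, 13912),
    "4. small kernel + sync": (359.0, 365.1, 148.6, 149.6),
}


def speedup(main1, main2, branch1, branch2):
    """Assumed definition: mean main time divided by mean branch time."""
    return ((main1 + main2) / 2) / ((branch1 + branch2) / 2)


speedups = {name: speedup(*t) for name, t in timings_us.items()}
for name, s in speedups.items():
    print(f"{name}: {s:.2f}x")
```

Under this assumption the computed ratios land within rounding distance of the reported values, with scenario 3 slightly below 1 because the branch runs were marginally slower there.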

What's actually happening per scenario

  • Scenario 1 (empty queue). main always allocates a fresh sentinel
    MTLCommandBuffer, encodes a signal into a per-queue MTLSharedEvent,
    commits, and calls waitUntilSignaledValue. The branch consults the
    per-queue last-committed dict, finds nothing, and returns immediately.
  • Scenario 2 (idle queue). Same main cost as scenario 1 — it can't
    tell the queue is drained. The branch finds the prior cmdbuf in the
    dict, sees cmdbuf.status == Completed, returns immediately.
  • Scenario 3 (long kernel + sync). Both versions spend most of the
    time waiting for the GPU. main waits via waitUntilSignaledValue on
    the per-queue event; the branch falls through the spin and waits via a
    fresh sentinel + addCompletedHandler:. Per-call sync overhead is small
    compared to the ~14 ms of GPU work, so the two are indistinguishable
    here. (Both still allocate one sentinel per call; the new path doesn't
    win because waiting on an in-flight buffer can't avoid blocking.)
  • Scenario 4 (small kernel + sync, tight loop). This is the case most
    user code hits. main blocks the calling thread inside Metal each call
    (~360 µs round trip dominated by GPU completion + sync overhead). The
    branch's spin frequently catches completion before the spin budget runs
    out, so most syncs avoid the libdispatch → scheduler wakeup, and the
    GPU completion latency is shorter end-to-end. ~2.4× faster overall.
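
The decision order described in the scenarios above (nothing committed, already completed, caught by the spin, fall back to a completion handler) can be sketched as follows. This is a simplified Python model, not the Metal.jl implementation: `FakeCommandBuffer`, `last_committed`, `commit` and `synchronize` are hypothetical names, the slow path attaches the handler to the in-flight buffer rather than committing a fresh sentinel, and the final wait blocks a Python thread where the real code would suspend only the Julia task.

```python
import threading
import time
from enum import Enum


class Status(Enum):
    COMMITTED = 1
    COMPLETED = 2


class FakeCommandBuffer:
    """Hypothetical stand-in for a committed command buffer."""

    def __init__(self, duration_s=None):
        self.status = Status.COMMITTED
        self._lock = threading.Lock()
        self._handlers = []
        if duration_s is not None:
            # Simulate the GPU finishing after `duration_s` seconds.
            threading.Timer(duration_s, self.complete).start()

    def complete(self):
        with self._lock:
            self.status = Status.COMPLETED
            handlers, self._handlers = self._handlers, []
        for handler in handlers:
            handler()

    def add_completed_handler(self, handler):
        with self._lock:
            if self.status is not Status.COMPLETED:
                self._handlers.append(handler)
                return
        handler()  # already completed: run the handler right away


last_committed = {}  # queue name -> most recently committed buffer


def commit(queue, cmdbuf):
    last_committed[queue] = cmdbuf


def synchronize(queue, spin_budget=256):
    """Return which path was taken, for illustration."""
    cmdbuf = last_committed.get(queue)
    if cmdbuf is None:
        return "empty"    # scenario 1: no work was ever committed
    if cmdbuf.status is Status.COMPLETED:
        return "idle"     # scenario 2: queue already drained
    for _ in range(spin_budget):
        if cmdbuf.status is Status.COMPLETED:
            return "spin"  # scenario 4: short kernel caught by the spin
        time.sleep(0)      # give other threads a chance to run
    done = threading.Event()
    cmdbuf.add_completed_handler(done.set)
    done.wait()            # scenario 3: long kernel, wait for the handler
    return "handler"


print(synchronize("q"))
buf = FakeCommandBuffer()
commit("q", buf)
buf.complete()
print(synchronize("q"))
commit("q", FakeCommandBuffer(duration_s=0.2))
print(synchronize("q"))
```

The first two calls never touch a lock or a thread, which is the property the sub-microsecond fast-path numbers in the results table rely on.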

Notes

  • The "speed-up" column for scenario 3 is GPU-bound; the small absolute
    difference (~50 µs out of 14 ms) is within run-to-run noise.
  • The branch's fast-path cost (sub-microsecond) is dominated by the dict
    lookup + an MTLCommandBuffer.status ObjC property read. A
    Threads.@spawn-driven 8-thread × 200-iteration stress
    (`@metal small + synchronize`) ran clean with the per-commit
    retain/release bookkeeping active.
  • Spin budget is the same 256-iteration shape as CUDA.jl
    (spinning_synchronization in
    CUDACore/lib/cudadrv/synchronization.jl): first 32 spins do
    jl_cpu_pause + jl_gc_safepoint, the remainder yield().
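
The two-phase spin budget from the last note can be modelled roughly as below. This is a Python sketch of the shape only: `spin_until` and `done_after` are hypothetical names, Python has no counterpart to `jl_cpu_pause` or `jl_gc_safepoint` so the first phase simply re-polls, and `time.sleep(0)` stands in for `yield()`.

```python
import time


def spin_until(is_done, budget=256, busy_spins=32):
    """Poll `is_done` up to `budget` times.

    Returns (completed, polls_before_completion). The first `busy_spins`
    iterations re-poll immediately (the Julia code pauses the CPU and hits
    a GC safepoint there); the remaining iterations yield between polls.
    """
    for i in range(budget):
        if is_done():
            return True, i
        if i >= busy_spins:
            time.sleep(0)  # stand-in for yield()
    return False, budget   # budget exhausted: caller takes the slow path


def done_after(n):
    """Hypothetical completion check that turns true after n failed polls."""
    calls = 0

    def poll():
        nonlocal calls
        calls += 1
        return calls > n

    return poll


print(spin_until(done_after(0)))     # completes on the first poll
print(spin_until(done_after(40)))    # completes during the yielding phase
print(spin_until(done_after(1000)))  # never completes within the budget
```

A caller would treat the `False` result as the signal to fall back to the completion-handler wait.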

@maleadt maleadt merged commit d6ff4aa into main May 28, 2026
15 checks passed
@maleadt maleadt deleted the tb/nonblocking_sync branch May 28, 2026 15:37
Development

Successfully merging this pull request may close these issues.

Support task switches from command buffer callbacks
