Port non-blocking synchronization from CUDA.jl by maleadt · Pull Request #783 · JuliaGPU/Metal.jl

maleadt · 2026-05-26T20:03:24Z

Closes #532

Wait for command buffers via a completion handler that notifies the Julia scheduler, instead of parking the calling thread in waitUntilCompleted. Fixes task switches from command buffer callbacks (#532).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

christiangnrd · 2026-05-26T20:42:29Z

This essentially reverts #690, but if synchronization is faster I don't mind doing this until we want to support Metal 4.

maleadt · 2026-05-26T20:54:47Z

> if synchronization is faster I don't mind doing this until we want to support Metal 4.

It's not. The only reason is to avoid blocking the calling thread, since that can cause deadlocks when it's thread 0 and a callback from Metal also wants to do I/O (which in Julia can only happen on thread 0).

Well, it also enables running code during synchronization, but that's not the immediate goal here.
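
To illustrate the difference between parking a thread and letting a completion handler wake the scheduler, here is a minimal sketch in Python's asyncio rather than Julia. Everything in it is an assumption made for illustration: `submit_work` is a hypothetical stand-in for committing a command buffer, and the helper thread stands in for the thread Metal would invoke the handler on. It is not the code from this PR.

```python
import asyncio
import threading
import time

def submit_work(duration, on_complete):
    """Hypothetical stand-in for committing a command buffer: the work runs
    on a helper thread, which calls on_complete when it finishes."""
    def run():
        time.sleep(duration)
        on_complete()
    threading.Thread(target=run, daemon=True).start()

async def nonblocking_synchronize(duration):
    loop = asyncio.get_running_loop()
    done = loop.create_future()
    # The handler runs on the foreign thread and only notifies the scheduler.
    submit_work(duration, lambda: loop.call_soon_threadsafe(done.set_result, True))
    await done  # suspends this task only; the scheduler thread stays free

async def main():
    ticks = 0

    async def other_task():
        # Work that can make progress while the synchronize call is pending.
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0.01)

    background = asyncio.create_task(other_task())
    await nonblocking_synchronize(0.1)
    background.cancel()
    return ticks

ticks_during_sync = asyncio.run(main())
print("other task ran", ticks_during_sync, "times during synchronization")
```

Because the waiting task is suspended rather than blocked, the scheduler thread remains available for other work, which is the property the comment above relies on to avoid the deadlock.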

codecov · 2026-05-26T21:23:06Z

Codecov Report

❌ Patch coverage is 92.06349% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.96%. Comparing base (08dd32c) to head (50f8832).
⚠️ Report is 2 commits behind head on main.

Files with missing lines	Patch %	Lines
src/synchronization.jl	90.47%	4 Missing ⚠️
lib/mps/ndarray.jl	0.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #783      +/-   ##
==========================================
+ Coverage   80.84%   80.96%   +0.11%     
==========================================
  Files          63       64       +1     
  Lines        3017     3057      +40     
==========================================
+ Hits         2439     2475      +36     
- Misses        578      582       +4


github-actions

Metal Benchmarks

Benchmark suite	Current: `50f8832`	Previous: `c8d960b`	Ratio
`array/accumulate/Float32/1d`	`752208.5` ns	`1112417` ns	`0.68`
`array/accumulate/Float32/dims=1`	`962833` ns	`1556166` ns	`0.62`
`array/accumulate/Float32/dims=1L`	`9793500` ns	`9750875` ns	`1.00`
`array/accumulate/Float32/dims=2`	`1272708` ns	`1853083` ns	`0.69`
`array/accumulate/Float32/dims=2L`	`5430479` ns	`7085833` ns	`0.77`
`array/accumulate/Int64/1d`	`915666` ns	`1244792` ns	`0.74`
`array/accumulate/Int64/dims=1`	`1093084` ns	`1825708.5` ns	`0.60`
`array/accumulate/Int64/dims=1L`	`11746625` ns	`11623917` ns	`1.01`
`array/accumulate/Int64/dims=2`	`1463000` ns	`2166041` ns	`0.68`
`array/accumulate/Int64/dims=2L`	`9319958.5` ns	`9786417` ns	`0.95`
`array/broadcast`	`371042` ns	`634000` ns	`0.59`
`array/construct`	`5791` ns	`6125` ns	`0.95`
`array/permutedims/2d`	`616583` ns	`1161792` ns	`0.53`
`array/permutedims/3d`	`1122208.5` ns	`1646292` ns	`0.68`
`array/permutedims/4d`	`1981312.5` ns	`2585542` ns	`0.77`
`array/private/copy`	`425417` ns	`552542` ns	`0.77`
`array/private/copyto!/cpu_to_gpu`	`353104.5` ns	`783625` ns	`0.45`
`array/private/copyto!/gpu_to_cpu`	`354625` ns	`783000` ns	`0.45`
`array/private/copyto!/gpu_to_gpu`	`335541` ns	`612583` ns	`0.55`
`array/private/iteration/findall/bool`	`1045042` ns	`1398750` ns	`0.75`
`array/private/iteration/findall/int`	`1187166` ns	`1564083.5` ns	`0.76`
`array/private/iteration/findfirst/bool`	`1337250` ns	`1967959` ns	`0.68`
`array/private/iteration/findfirst/int`	`1384625` ns	`1988500` ns	`0.70`
`array/private/iteration/findmin/1d`	`1449458.5` ns	`2287333` ns	`0.63`
`array/private/iteration/findmin/2d`	`1207604` ns	`1598145.5` ns	`0.76`
`array/private/iteration/logical`	`1594312.5` ns	`2651562` ns	`0.60`
`array/private/iteration/scalar`	`2684562.5` ns	`4420479.5` ns	`0.61`
`array/random/rand/Float32`	`630042` ns	`1122042` ns	`0.56`
`array/random/rand/Int64`	`690146` ns	`1266542` ns	`0.54`
`array/random/rand!/Float32`	`542084` ns	`899875` ns	`0.60`
`array/random/rand!/Int64`	`501083` ns	`853437.5` ns	`0.59`
`array/random/randn/Float32`	`596208` ns	`1081042` ns	`0.55`
`array/random/randn!/Float32`	`508916` ns	`845645.5` ns	`0.60`
`array/reductions/mapreduce/Float32/1d`	`498500` ns	`1008333` ns	`0.49`
`array/reductions/mapreduce/Float32/dims=1`	`499792` ns	`842875` ns	`0.59`
`array/reductions/mapreduce/Float32/dims=1L`	`734333` ns	`1374166.5` ns	`0.53`
`array/reductions/mapreduce/Float32/dims=2`	`497812.5` ns	`837250` ns	`0.59`
`array/reductions/mapreduce/Float32/dims=2L`	`1351459` ns	`1791354` ns	`0.75`
`array/reductions/mapreduce/Int64/1d`	`931479.5` ns	`1518667` ns	`0.61`
`array/reductions/mapreduce/Int64/dims=1`	`788042` ns	`1144437.5` ns	`0.69`
`array/reductions/mapreduce/Int64/dims=1L`	`1649645.5` ns	`2042229` ns	`0.81`
`array/reductions/mapreduce/Int64/dims=2`	`982500` ns	`1325292` ns	`0.74`
`array/reductions/mapreduce/Int64/dims=2L`	`2253459` ns	`4595187.5` ns	`0.49`
`array/reductions/reduce/Float32/1d`	`721250` ns	`1011458` ns	`0.71`
`array/reductions/reduce/Float32/dims=1`	`498084` ns	`843833` ns	`0.59`
`array/reductions/reduce/Float32/dims=1L`	`713083` ns	`1374167` ns	`0.52`
`array/reductions/reduce/Float32/dims=2`	`498042` ns	`843625` ns	`0.59`
`array/reductions/reduce/Float32/dims=2L`	`1346833` ns	`1787520.5` ns	`0.75`
`array/reductions/reduce/Int64/1d`	`926083.5` ns	`1376062` ns	`0.67`
`array/reductions/reduce/Int64/dims=1`	`797125` ns	`1114583` ns	`0.72`
`array/reductions/reduce/Int64/dims=1L`	`1435375` ns	`2030542` ns	`0.71`
`array/reductions/reduce/Int64/dims=2`	`981271` ns	`1308500` ns	`0.75`
`array/reductions/reduce/Int64/dims=2L`	`2256458` ns	`4207916` ns	`0.54`
`array/shared/copy`	`212000` ns	`244208` ns	`0.87`
`array/shared/copyto!/cpu_to_gpu`	`46250` ns	`80292` ns	`0.58`
`array/shared/copyto!/gpu_to_cpu`	`40416` ns	`80750` ns	`0.50`
`array/shared/copyto!/gpu_to_gpu`	`48000` ns	`80583` ns	`0.60`
`array/shared/iteration/findall/bool`	`1052208` ns	`1416084` ns	`0.74`
`array/shared/iteration/findall/int`	`1189375` ns	`1535667` ns	`0.77`
`array/shared/iteration/findfirst/bool`	`1067875` ns	`1574458.5` ns	`0.68`
`array/shared/iteration/findfirst/int`	`1080000` ns	`1588625` ns	`0.68`
`array/shared/iteration/findmin/1d`	`1199584` ns	`1906750` ns	`0.63`
`array/shared/iteration/findmin/2d`	`1208021` ns	`1601292` ns	`0.75`
`array/shared/iteration/logical`	`1428166.5` ns	`2298125` ns	`0.62`
`array/shared/iteration/scalar`	`5138.833333333333` ns	`193083` ns	`0.026614633775802806`
`integration/byval/reference`	`1155917` ns	`1575250` ns	`0.73`
`integration/byval/slices=1`	`1158083` ns	`1573500` ns	`0.74`
`integration/byval/slices=2`	`2084792` ns	`2633000` ns	`0.79`
`integration/byval/slices=3`	`10213854.5` ns	`7919354.5` ns	`1.29`
`integration/metaldevrt`	`458208` ns	`786708` ns	`0.58`
`kernel/indexing`	`353917` ns	`646458` ns	`0.55`
`kernel/indexing_checked`	`355542` ns	`660812.5` ns	`0.54`
`kernel/launch`	`11875` ns	`12666` ns	`0.94`
`kernel/rand`	`362375` ns	`589042` ns	`0.62`
`latency/import`	`1384144062.5` ns	`1381318041.5` ns	`1.00`
`latency/precompile`	`29321346833` ns	`29203062750` ns	`1.00`
`latency/ttfp`	`1649868125` ns	`1646438625` ns	`1.00`
`metal/synchronization/context`	`770.1475409836065` ns	`19458` ns	`0.0395799949112759`
`metal/synchronization/stream`	`363.4354066985646` ns	`18583` ns	`0.019557413049484183`

This comment was automatically generated by workflow using github-action-benchmark.

maleadt · 2026-05-28T15:37:42Z

New approach. Safer and faster now. Benchmark by Claude:

Results

Scenario	`main` (run 1)	`main` (run 2)	branch (run 1)	branch (run 2)	speed-up
1. `synchronize(q)` on a queue that has never had work committed	15.87 µs	15.83 µs	0.185 µs	0.183 µs	~86×
2. `synchronize()` when the queue is idle (had work, fully drained)	15.55 µs	15.56 µs	0.368 µs	0.363 µs	~42×
3. `@metal big + synchronize()` — long kernel forces block path	13 874 µs	13 873 µs	13 964 µs	13 912 µs	~1.00× (kernel-dominated)
4. `@metal small + synchronize()` — realistic tight loop	359.0 µs	365.1 µs	148.6 µs	149.6 µs	~2.4×
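
As a quick check on the speed-up column, the ratios can be recomputed from the timings in the table. The sketch below assumes the column is the mean of the two `main` runs divided by the mean of the two branch runs; the comment does not say how it was derived, so treat that as a guess.

```python
from statistics import mean

# Timings in microseconds, copied from the table: (main runs, branch runs).
timings_us = {
    1: ((15.87, 15.83), (0.185, 0.183)),
    2: ((15.55, 15.56), (0.368, 0.363)),
    3: ((13874.0, 13873.0), (13964.0, 13912.0)),
    4: ((359.0, 365.1), (148.6, 149.6)),
}

# Assumed definition: mean(main) / mean(branch).
speedups = {
    scenario: mean(main_runs) / mean(branch_runs)
    for scenario, (main_runs, branch_runs) in timings_us.items()
}

for scenario, ratio in speedups.items():
    print(f"scenario {scenario}: {ratio:.2f}x")
```

Under that assumption the recomputed ratios agree with the rounded figures in the table to within about one unit for scenarios 1 and 2 and a few percent for scenarios 3 and 4.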

What's actually happening per scenario

Scenario 1 (empty queue). main always allocates a fresh sentinel MTLCommandBuffer, encodes a signal into a per-queue MTLSharedEvent, commits, and calls waitUntilSignaledValue. The branch consults the per-queue last-committed dict, finds nothing, and returns immediately.

Scenario 2 (idle queue). Same main cost as scenario 1, because main can't tell the queue is drained. The branch finds the prior cmdbuf in the dict, sees cmdbuf.status == Completed, and returns immediately.

Scenario 3 (long kernel + sync). Both main and the branch spend most of the time waiting for the GPU. main waits via waitUntilSignaledValue on the per-queue event; the branch falls through the spin and waits via a fresh sentinel + addCompletedHandler:. Per-call sync overhead is small compared to the ~14 ms of GPU work, so the two are indistinguishable here. (Both still allocate one sentinel per call; the new path doesn't win because waiting on an in-flight buffer can't avoid blocking.)

Scenario 4 (small kernel + sync, tight loop). This is the case most user code hits. main blocks the calling thread inside Metal on each call (~360 µs round trip dominated by GPU completion + sync overhead). The branch's spin frequently catches completion before the spin budget runs out, so most syncs avoid the libdispatch → scheduler wakeup, and the GPU completion latency is shorter end-to-end. ~2.4× faster overall.
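
The fast path described in scenarios 1 and 2 can be sketched as follows. This is an illustration in Python, not the Julia code from the PR: `Queue`, `CommandBuffer`, `Status`, `commit`, and `slow_wait` are hypothetical names, and the dict here simply holds a strong reference where the real code does retain/release bookkeeping.

```python
from enum import Enum

class Status(Enum):
    COMMITTED = 1
    COMPLETED = 2

class CommandBuffer:
    """Hypothetical stand-in for a command buffer with a status property."""
    def __init__(self):
        self.status = Status.COMMITTED

class Queue:
    """Hypothetical stand-in for a command queue (hashable by identity)."""

# Per-queue record of the most recently committed buffer.
last_committed = {}

def commit(queue, buf):
    last_committed[queue] = buf

def synchronize(queue, slow_wait):
    buf = last_committed.get(queue)
    if buf is None:
        return "never-committed"      # scenario 1: nothing to wait for
    if buf.status is Status.COMPLETED:
        return "already-completed"    # scenario 2: queue has drained
    slow_wait(buf)                    # scenarios 3 and 4: work is in flight
    return "waited"

# Usage: walk one queue through the three states.
q = Queue()
results = [synchronize(q, slow_wait=lambda b: None)]
buf = CommandBuffer()
commit(q, buf)

def finish(b):
    # Pretend the slow path waited until the GPU finished the buffer.
    b.status = Status.COMPLETED

results.append(synchronize(q, slow_wait=finish))
results.append(synchronize(q, slow_wait=finish))
print(results)
```

The point of the design is that the two cheap checks (a dict lookup and a status read) happen before any sentinel buffer is allocated, which is where the sub-microsecond numbers in scenarios 1 and 2 come from.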

Notes

The "speed-up" column for scenario 3 is GPU-bound; the small absolute difference (~50 µs out of 14 ms) is within run-to-run noise.

The branch's fast-path cost (sub-microsecond) is dominated by the dict lookup + an MTLCommandBuffer.status ObjC property read. A Threads.@spawn-driven 8-thread × 200-iteration stress (`@metal small + synchronize`) ran clean with the per-commit retain/release bookkeeping active.

Spin budget is the same 256-iteration shape as CUDA.jl (spinning_synchronization in CUDACore/lib/cudadrv/synchronization.jl): first 32 spins do jl_cpu_pause + jl_gc_safepoint, the remainder yield().
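
The spin-then-block shape described in the last note can be sketched like this. It is a Python illustration under stated assumptions, not a port of `spinning_synchronization`: `is_done` and `block_until_done` are hypothetical callbacks, the busy phase is a no-op where the real code calls `jl_cpu_pause` and `jl_gc_safepoint`, and `time.sleep(0)` stands in for `yield()`. Polling once per iteration is also an assumption.

```python
import time

SPIN_BUDGET = 256   # total polling iterations before giving up on spinning
BUSY_SPINS = 32     # iterations that spin without yielding

def spinning_wait(is_done, block_until_done):
    """Poll is_done() up to SPIN_BUDGET times, then fall back to blocking."""
    for i in range(SPIN_BUDGET):
        if is_done():
            return ("spun", i)
        if i < BUSY_SPINS:
            pass            # stand-in for jl_cpu_pause + jl_gc_safepoint
        else:
            time.sleep(0)   # stand-in for yield(): let other work run
    block_until_done()      # slow path: wait for a completion notification
    return ("blocked", SPIN_BUDGET)

# Usage 1: work that completes after 40 polls is caught by the spin.
polls = 0
def done_after_40():
    global polls
    polls += 1
    return polls > 40

fast = spinning_wait(done_after_40, block_until_done=lambda: None)

# Usage 2: work that never completes during the spin falls back to blocking.
blocked_calls = []
slow = spinning_wait(lambda: False, block_until_done=lambda: blocked_calls.append(1))
print(fast, slow)
```

Short kernels tend to finish inside the budget and never pay for a wakeup, while long kernels exhaust the budget quickly relative to their runtime and then wait without burning CPU.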

maleadt and others added 2 commits May 26, 2026 21:59

Use nonblocking synchronize for command buffer waits.

04a0ddf

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

This was referenced May 26, 2026

Promote async block to feature instead of just a workaround. JuliaInterop/ObjectiveC.jl#72

Merged

Simple throwing kernel hangs #433

Closed

github-actions Bot reviewed May 26, 2026

maleadt added 2 commits May 28, 2026 15:11

Revert to synchronizing on a sentinel buffer.

272e786

Keep track of last command buffer to simplify fast-path sync.

50f8832

maleadt merged commit d6ff4aa into main May 28, 2026
15 checks passed

maleadt deleted the tb/nonblocking_sync branch May 28, 2026 15:37

christiangnrd mentioned this pull request May 29, 2026

Metal 1.10 blog post JuliaGPU/juliagpu.org#58

Closed

2 tasks

maleadt merged 4 commits into main from tb/nonblocking_sync

2 participants
