Benchmarking Apple Metal GPU kernels against MLX reference implementations. Modeled on KernelBench, swapping CUDA → Metal and PyTorch → MLX.
While working on inference for Apple Silicon, I found that an agent-based loop of kernel writing and testing against a perf/accuracy test helped a lot along the way. This repo contains the harness and code used to benchmark against baseline MLX versions. Kernels don't differ much, mostly in how threadgroups are utilized. One of the main differences was performance across the newer M-series chips. I soon realized this was a we-have-kernel-bench-at-home version, so I polished it and am releasing it as a benchmark + agent harness for Metal kernel authoring (Agent Steel). Much of this repo is organized after / inspired by KernelBench, so props to them!
Feel free to contribute kernels for other M-series chips; I only had my hands on an M4 Mac mini and an M2.
agent_steel/ is an LLM-driven closed-loop harness that profiles a kernel,
writes a new candidate, and verifies the change against ./bench, repeating
until it can't improve further. New kernels are bookended by a fourth agent
that authors the MLX/registry/Metal scaffolding and polishes the result for PR.
- Welder - New-kernel mode only. Authors mlx/kernels/<set>/<name>.py + appends the registry entry + writes the first .metal. Two-stage accuracy gate (Metal vs MLX, MLX vs an external PyTorch reference). Invoked again after the perf loop to polish for PR.
- Profiler - Parses the .gputrace + bench output, runs the chip-aware synthesizer (chip_metrics/m{N}.py), emits a 2-3 paragraph diagnosis.
- Optimizer - Reads the diagnosis + the kernel's AttemptDB log + current .metal + MLX reference, writes the next .metal in-place to metal/kernels/<set>/<kernel>/<chip>.metal, runs an accuracy gate (./bench correctness ≥ 99%), and retries up to 4× on accuracy fail.
- Verifier - Benches the promoted kernel (--warmup 30 --iters 100), compares mean vs session.json's leaderboard best, logs ±Δ% to AttemptDB, reverts on regression.
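The round structure described by the four roles above can be sketched as a plain loop. This is only an illustration, not the code in agent_steel/: the callables `profile`, `optimize`, `passes_accuracy`, and `bench_mean_ms` are hypothetical stand-ins for the Profiler, Optimizer, accuracy gate, and Verifier bench, and reading "retries up to 4×" as one attempt plus four retries is an assumption.

```python
from typing import Callable, Optional

MAX_ACCURACY_RETRIES = 4  # assumed reading: first attempt + up to 4 retries


def run_loop(
    source: str,
    best_ms: float,
    profile: Callable[[str], str],
    optimize: Callable[[str, str], str],
    passes_accuracy: Callable[[str], bool],
    bench_mean_ms: Callable[[str], float],
    max_rounds: int = 5,
) -> tuple[str, float]:
    """Return the best kernel source found and its mean time in ms."""
    for _ in range(max_rounds):
        diagnosis = profile(source)

        # Optimizer: request candidates until one passes the accuracy gate.
        candidate: Optional[str] = None
        for _ in range(1 + MAX_ACCURACY_RETRIES):
            attempt = optimize(source, diagnosis)
            if passes_accuracy(attempt):
                candidate = attempt
                break
        if candidate is None:
            continue  # no accurate candidate this round; keep current source

        # Verifier: promote only if the candidate beats the recorded best.
        mean_ms = bench_mean_ms(candidate)
        if mean_ms < best_ms:
            source, best_ms = candidate, mean_ms
        # Otherwise it is a regression and the previous source stays in place.
    return source, best_ms
```

Passing the roles in as callables keeps the control flow testable without a GPU; the real harness additionally logs every attempt to AttemptDB.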
AttemptDB is one JSONL per (kernel, chip) at .agent-steel/history/. Every
entry carries the .metal source snapshot + the chip-aware metrics dict from
that profile. AttemptDB.top_n_by_time(kernel, chip, n) returns the N fastest
kept attempts.
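To make the storage model concrete, here is a toy append-only store in the same spirit. It is not the real AttemptDB: the class name `TinyAttemptDB`, the file naming, and the entry fields `time_ms`, `kept`, and `source` are all assumptions made for illustration; only the `top_n_by_time(kernel, chip, n)` signature comes from the description above.

```python
import json
import tempfile
from pathlib import Path


class TinyAttemptDB:
    """Toy append-only store: one JSONL file per (kernel, chip)."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, kernel: str, chip: str) -> Path:
        # File naming is an assumption; the real layout may differ.
        return self.root / f"{kernel}__{chip}.jsonl"

    def log(self, kernel: str, chip: str, entry: dict) -> None:
        """Append one attempt as a single JSON line."""
        with self._path(kernel, chip).open("a", encoding="utf-8") as f:
            f.write(json.dumps(entry) + "\n")

    def top_n_by_time(self, kernel: str, chip: str, n: int) -> list[dict]:
        """Return the n fastest attempts that were kept."""
        path = self._path(kernel, chip)
        if not path.exists():
            return []
        with path.open(encoding="utf-8") as f:
            entries = [json.loads(line) for line in f if line.strip()]
        kept = [e for e in entries if e.get("kept")]
        return sorted(kept, key=lambda e: e["time_ms"])[:n]


# Usage with a throwaway directory and made-up timings.
demo = TinyAttemptDB(Path(tempfile.mkdtemp()) / "history")
demo.log("relu", "m2", {"time_ms": 0.30, "kept": True, "source": "// v1"})
demo.log("relu", "m2", {"time_ms": 0.25, "kept": True, "source": "// v2"})
print([e["source"] for e in demo.top_n_by_time("relu", "m2", 1)])
```

JSONL suits this use because each attempt is appended as one line, so a crashed run cannot corrupt earlier entries.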
Run it:
# optimize an existing kernel
python -m agent_steel --kernel-name relu --loop --max-rounds 5
python -m agent_steel --kernel-name relu,softmax --parallel 2 --loop
# author a brand-new kernel (Welder), then chain into the perf loop
python -m agent_steel --welder my_kernel --description '...' --reference ref.py --loop

Concurrent agent-steel instances are safe: a process-wide flock serializes
./bench (GPU is one resource); a per-(kernel, chip) flock prevents two
processes from racing on the same kernel's lineage.
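The two-level locking described above can be approximated with advisory file locks from Python's `fcntl` module. This is a sketch under assumptions, not the harness's code: the lock file names and the helper `run_serialized` are hypothetical, and `bench` is any callable standing in for a ./bench invocation.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def file_lock(path: str):
    """Hold an exclusive advisory lock on `path` for the duration of the block."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the lock is free
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)


def run_serialized(lock_dir: str, kernel: str, chip: str, bench):
    """Run `bench` while holding the per-kernel lock and the shared bench lock."""
    # Lock file names are assumptions made for this sketch.
    lineage_lock = os.path.join(lock_dir, f"{kernel}.{chip}.lock")
    bench_lock = os.path.join(lock_dir, "bench.lock")
    with file_lock(lineage_lock):      # one process per (kernel, chip) lineage
        with file_lock(bench_lock):    # one bench at a time on the GPU
            return bench()
```

Taking the locks in a fixed order (lineage first, then bench) avoids deadlock between processes working on different kernels.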
Sample run on a slice of the standard set. Each curve is best-so-far perf gain vs the leaderboard baseline as the loop iterates; legend shows the final % beat.
Three sets, increasing in size/complexity:
| set | what |
|---|---|
| Common | basic ops: activations, matmuls, norms, convs, scans |
| Standard | fused 2+ op kernels: attention, SwiGLU, RMSNorm + linear |
| Full | end-to-end model blocks: transformer block, AlexNet, ResNet, LLaMA decoder, DenseNet |
See KERNELS.md for the full registry.
Kernels can be split per M-chip generation when one impl genuinely needs different code than another (e.g. <name>/default.metal + <name>/m4.metal). The harness auto-picks the right variant at runtime based on the chip detected via sysctl. See Per-chip variants below.
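As a rough idea of what sysctl-based chip detection can look like, the sketch below reads the CPU brand string and reduces it to a generation tag. Which sysctl key the harness actually reads is not stated here, so `machdep.cpu.brand_string` and the helper names `chip_bucket` and `detect_chip` are assumptions.

```python
import re
import subprocess
from typing import Optional


def chip_bucket(brand: str) -> Optional[str]:
    """Map a brand string such as 'Apple M4 Pro' to a generation tag like 'm4'."""
    match = re.search(r"\bApple M(\d+)\b", brand)
    return f"m{match.group(1)}" if match else None


def detect_chip() -> Optional[str]:
    """Query sysctl for the CPU brand string; returns None off Apple Silicon."""
    try:
        out = subprocess.run(
            ["sysctl", "-n", "machdep.cpu.brand_string"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return None
    return chip_bucket(out.strip())


print(chip_bucket("Apple M2"))      # m2
print(chip_bucket("Apple M4 Pro"))  # m4
```

Collapsing Pro/Max/Ultra parts into one generation tag is a simplification of this sketch; the harness may bucket chips more finely.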
The benchmark measures accuracy against the MLX reference and performance across five targets: speedup vs MLX, compute throughput (GFLOPS), memory bandwidth (GB/s), run-to-run stability (0-1), and a balanced composite score. Every kernel that passes correctness gets a row in best_times.md.
Two commands from a fresh clone to your first benchmark:
python3 setup.py # checks toolchains, installs Metal toolchain + Python deps, builds host
./bench sqr_mm # build kernel, run, save, print report

setup.py checks (and tries to fix) all of:
- macOS + Apple Silicon
- Xcode developer tools (xcode-select -p)
- Metal toolchain: runs xcodebuild -downloadComponent MetalToolchain if missing (a few hundred MB; the usual blocker on a fresh Mac)
- Python dependencies (mlx, numpy, pydantic)
- Host binary builds (make host)
- Chip detection works
If any step fails it tells you exactly what to run.
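A few of these checks are easy to reproduce by hand. The sketch below is not setup.py itself; the function name `preflight` and the suggested fixes are illustrative assumptions, and it covers only the platform and command-line-tool checks.

```python
import platform
import shutil


def preflight() -> list[tuple[str, bool, str]]:
    """Return (check name, passed, suggested fix) for a few environment checks."""
    return [
        ("macOS", platform.system() == "Darwin", "run on a Mac"),
        ("Apple Silicon", platform.machine() == "arm64", "use an M-series machine"),
        ("xcode-select", shutil.which("xcode-select") is not None, "xcode-select --install"),
        ("make", shutil.which("make") is not None, "xcode-select --install"),
    ]


for name, ok, fix in preflight():
    print(f"{'ok  ' if ok else 'FAIL'} {name}" + ("" if ok else f"  -> {fix}"))
```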
./bench <name> # default: both MLX + Metal, paired
./bench <name> --mlx # MLX only
./bench <name> --metal # Metal only
./bench <name> --no-session # don't write results/<chip>/<name>.json
./bench <name> -- --target compute --iters 500
./bench <name> -- --cold-start # measure first-launch latency
./bench --all # run every kernel in the registry

Defaults are set so a single command gives a stable, publishable number. Bump --iters for tighter measurement.
A human-readable report on the terminal:
sqr_mm target=speed score=1.428
device : Apple M2 (m2) 8 CPU / 8 GPU / 9 GB
occupancy : tg_mem=16KB max_thr/tg=896
correctness : ✓ correct max_err=0.000e+00
speedup : 1.43× vs MLX
kernel : 1.266 ms (min 1.194, mean 1.289, n=300)
mlx ref : 1.808 ms
compute : 1695.8 GFLOPS
bandwidth : 9.9 GB/s
arith int. : 170.7 FLOPs/byte
stability : 0.98
target score
---------- ----------
speed 1.428
compute 1695.83
memory 9.94
stable 0.98
balanced 1.23
Plus three artifacts:
- results/<chip-bucket>/<name>.json: full result, every run (with --save, default on)
- session.json: per-chip leaderboard. Auto-updated when a run beats the recorded best for that kernel; stores the entire .metal source of the winning version so the result is reproducible from the file alone
- stderr: updated session.json [apple-m2/sqr_mm]: new best 1.266 ms (was 1.808 ms, Δ +0.542) when a run wins
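The "update only on a new best" rule for the leaderboard can be sketched as follows. The JSON layout here (a flat dict keyed by `<chip>/<kernel>` with `time_ms` and `source` fields) is an assumption for illustration; the real session.json schema may differ.

```python
import json
from pathlib import Path


def update_leaderboard(session_path: Path, chip: str, kernel: str,
                       time_ms: float, metal_source: str) -> bool:
    """Record the run if it beats the stored best; return True when it wins."""
    session = json.loads(session_path.read_text()) if session_path.exists() else {}
    key = f"{chip}/{kernel}"  # hypothetical key format
    best = session.get(key)
    if best is not None and time_ms >= best["time_ms"]:
        return False  # not faster than the recorded best; leave the file alone
    # Store the full kernel source so the entry is reproducible on its own.
    session[key] = {"time_ms": time_ms, "source": metal_source}
    session_path.write_text(json.dumps(session, indent=2))
    return True
```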
Every run computes all five scores and prints them in a table. The --target flag only changes which one becomes the headline score.
| target | metric | good | bad | what it means |
|---|---|---|---|---|
| speed | speedup vs MLX | > 1.0× = faster than MLX | < 1.0× = slower than MLX | How your kernel compares to Apple's reference |
| compute | GFLOPS | higher = better throughput | low = GPU underutilized | Raw compute. Ignore for memory-bound kernels (element-wise ops, copies) |
| memory | GB/s | near M2 peak ~89 GB/s | well below peak | Memory bandwidth utilization. The primary metric for element-wise kernels |
| stable | 0-1 | > 0.95 = solid | < 0.90 = noisy | Run-to-run consistency. Low stability means thermal throttling or OS interference |
| balanced | composite | higher = better overall | - | 0.5·speedup + 0.3·gflops/1000 + 0.2·stability |
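For reference, the composite in the last row can be written out directly. This is a literal transcription of the formula in the table with a hypothetical function name; it has not been checked against the harness, which may normalize or weight terms differently, so do not expect it to reproduce reported scores exactly.

```python
def balanced_score(speedup: float, gflops: float, stability: float) -> float:
    """Composite as written in the table: 0.5*speedup + 0.3*gflops/1000 + 0.2*stability."""
    return 0.5 * speedup + 0.3 * (gflops / 1000.0) + 0.2 * stability


# Made-up inputs: 2x speedup, 1000 GFLOPS, perfectly stable.
print(round(balanced_score(2.0, 1000.0, 1.0), 3))  # 1.5
```

Note that dividing GFLOPS by 1000 means the compute term only rivals the speedup term for kernels in the TFLOPS range, so memory-bound kernels are scored mostly on speedup and stability.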
Which target matters for your kernel:
- Element-wise ops (relu, sigmoid, add, etc.): look at memory; they're bandwidth-bound, so GFLOPS will be low by nature
- Matmuls: look at compute; they're compute-bound, so GB/s will be low by nature
- Reductions (layernorm, softmax, dot product): look at speed; they're latency-sensitive, with mixed compute/bandwidth
- Scans (cumsum): look at speed or balanced
GOOD (sqr_mm): BAD (naive kernel):
target score target score
---------- ---------- ---------- ----------
speed 1.428 speed 0.120
compute 1695.83 compute 142.50
memory 9.94 memory 0.83
stable 0.98 stable 0.52
balanced 1.23 balanced 0.17
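Why matmuls are judged on compute and element-wise ops on memory comes down to arithmetic intensity (FLOPs per byte moved). The sketch below works this out under stated assumptions: a 1024×1024 float32 square matmul counted as 2·N³ FLOPs with each of the three matrices moved once, and an element-wise op doing one FLOP per element with one read and one write. These counting conventions are guesses, not taken from the harness, although for N = 1024 they land on the 170.7 FLOPs/byte shown in the sample sqr_mm report.

```python
def matmul_metrics(n: int, time_ms: float, bytes_per_elem: int = 4) -> dict:
    """Throughput figures for an n x n matmul under the assumptions above."""
    flops = 2 * n**3                      # one multiply + one add per term
    traffic = 3 * n * n * bytes_per_elem  # read A and B, write C, each once
    seconds = time_ms / 1e3
    return {
        "gflops": flops / seconds / 1e9,
        "gb_per_s": traffic / seconds / 1e9,
        "flops_per_byte": flops / traffic,
    }


def elementwise_metrics(num_elems: int, time_ms: float,
                        flops_per_elem: int = 1, bytes_per_elem: int = 4) -> dict:
    """Throughput figures for a unary element-wise op such as relu."""
    flops = flops_per_elem * num_elems
    traffic = 2 * num_elems * bytes_per_elem  # one read + one write per element
    seconds = time_ms / 1e3
    return {
        "gflops": flops / seconds / 1e9,
        "gb_per_s": traffic / seconds / 1e9,
        "flops_per_byte": flops / traffic,
    }


mm = matmul_metrics(1024, 1.266)
print(round(mm["flops_per_byte"], 1), round(mm["gb_per_s"], 2))
print(elementwise_metrics(1 << 20, 0.1)["flops_per_byte"])
```

At roughly 171 FLOPs per byte the matmul barely touches memory bandwidth, while the element-wise op at 0.125 FLOPs per byte is limited almost entirely by it.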
For Agent Steel internals see agent_steel/README.md. Short version for human contributors:
- mlx/kernels/<set>/<name>.py: the MLX baseline (don't edit; it defines the problem).
- metal/kernels/<set>/<name>.metal: your kernel.
- Run ./bench <name> until correct=true.
- Edit only the .metal file. Open a PR. best_times.md and LINK.md are auto-generated from session.json.
Most kernels stay as a single flat file. When a kernel genuinely needs different code per M-generation, promote it to a directory:
metal/kernels/common/sqr_mm/
default.metal # fallback (used by chips without their own variant)
m4.metal # M4-specific impl
The harness auto-picks <name>__<chip>.metallib → __default → flat <name>.metallib
based on the chip you're running on. Only split when you have a measured perf reason.
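The fallback order can be expressed as a short lookup. This is a sketch, not the harness's resolver: the function name `resolve_metallib`, the `build_dir` argument, and expanding `__default` to `<name>__default.metallib` are assumptions.

```python
from pathlib import Path
from typing import Optional


def resolve_metallib(build_dir: Path, name: str, chip: str) -> Optional[Path]:
    """Pick the most specific compiled variant that exists, or None."""
    candidates = [
        build_dir / f"{name}__{chip}.metallib",    # chip-specific variant
        build_dir / f"{name}__default.metallib",   # directory-style fallback
        build_dir / f"{name}.metallib",            # flat single-file kernel
    ]
    for path in candidates:
        if path.exists():
            return path
    return None
```

With only `default.metal` and `m4.metal` built for sqr_mm, an M4 would get the first candidate and every other chip would fall through to the second.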
If you use MetalBench in published work, please cite it as:
@misc{metalbench2026,
title = {MetalBench: Apple Metal GPU Kernel Benchmarks},
author = {Manakelew, Alazar},
year = {2026},
url = {https://github.com/Lazarus-931/MetalBench},
note = {Live leaderboard: https://lazarus-931.github.io/leaderboard.html}
}

