diff --git a/benchmarks/community_results.json b/benchmarks/community_results.json
index e975925..1272db7 100644
--- a/benchmarks/community_results.json
+++ b/benchmarks/community_results.json
@@ -94,6 +94,26 @@
       "peak_tflops_inmem": 12.17,
       "notes": "inmem_peak only, no training data submitted.",
       "contributor": "elijah-pelton"
+    },
+    {
+      "chip": "M5 Max",
+      "cores": "18-core (6P+12E)",
+      "ram_gb": 128,
+      "macos": "26.4.1",
+      "ane_subtype": "h17",
+      "ms_per_step": [89.7, 90.2],
+      "ane_ms": [7.8, 8.1],
+      "compile_ms": [3335, 3384],
+      "ane_tflops": 1.03,
+      "ane_util_pct": 5.4,
+      "peak_tflops_inmem": 13.80,
+      "peak_tflops_int8_w8a8": 35.61,
+      "peak_tflops_fp16_64x64": 19.27,
+      "ms_per_step_dynamic_stories110m": 73.5,
+      "ms_per_step_dynamic_qwen3_06b": 320.0,
+      "compile_ms_dynamic": 421,
+      "notes": "First H17 ANE on record — distinct subtype from M4/M5 base (both H16). FP16/INT8 peak compute matches M4 within 4%; training gains over M5 base are CPU-driven (12P cores + Accelerate). See benchmarks/m5max_result.md for the full probe report.",
+      "contributor": "lixiang.ict@gmail.com"
+    }
   ],
   "neural_engine_specs": {
@@ -108,6 +128,7 @@
     "M3_Ultra": {"ne_cores": 32, "rated_tops": 31.6},
     "M4": {"ne_cores": 16, "rated_tops": 38, "note": "INT8/mixed-precision spec"},
     "M4_Max": {"ne_cores": 16, "rated_tops": 38, "note": "INT8/mixed-precision spec"},
-    "M5": {"ne_cores": 16, "rated_tops": null, "estimated_tops": 19}
+    "M5": {"ne_cores": 16, "rated_tops": null, "estimated_tops": 19, "ane_subtype": "h16"},
+    "M5_Max": {"ne_cores": 16, "rated_tops": null, "measured_fp16_tflops": 19.27, "measured_int8_tops": 35.61, "ane_subtype": "h17", "note": "First chip on record reporting H17 ANE subtype; peak math matches M4 H16."}
   }
 }
diff --git a/benchmarks/m5max_result.md b/benchmarks/m5max_result.md
new file mode 100644
index 0000000..5d2fb0e
--- /dev/null
+++ b/benchmarks/m5max_result.md
@@ -0,0 +1,229 @@
+# M5 Max ANE Probe & Training Benchmark
+
+**Machine**: MacBook Pro · Apple M5 Max (6P + 12E CPU) · 128 GB RAM
+**macOS**: 26.4.1 (Darwin 25.4.0), Model Identifier `Mac17,7`
+**Date**: 2026-04-23
+**ANE Family**: **`h17`** (new — M4 and base M5 are `h16`)
+
+All data were gathered with the repo's probes and training harness as-is, with
+no source changes. Compared against:
+- [README.md](../README.md) M4 reference figures
+- [training/m5result.md](../training/m5result.md) — base M5 (10-core, 16 GB) probe notes
+- [benchmarks/community_results.json](./community_results.json) — M1/M3/M4/M5 submissions
+
+---
+
+## Hardware identification
+
+**Question**: Is the ANE in M5 Max the same silicon block as M4 / base M5?
+
+**Result**: **No — `_ANEDeviceInfo.aneSubType` returns `h17`**, a version not
+seen in any community submission. The base M5 (per `m5result.md`) still reports
+`h16`, same as M4. M5 Max is the first `h17` on record.
+
+```
+=== ANE INT8 W8A8 Benchmark (M4, h17) === ← header label is hardcoded "M4",
+                                            the "h17" is read from the device.
+```
+
+Everything else still works: `program(1.3)` MIL, `_ANEInMemoryModelDescriptor`,
+`constexpr_affine_dequantize`, `quantize` / `dequantize`. No APIs have been
+closed off on macOS 26.4.1.
+
+---
+
+## inmem_peak — deep conv stacks (FP16)
+
+**Question**: What is the peak FP16 throughput on the same 128-layer conv sweep
+([inmem_peak.m](../inmem_peak.m)) the README reports for M4?
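For reference, the arithmetic behind the table's columns, as a minimal sketch
(not repo code): an N-layer stack of 1×1 convs with C channels over S spatial
positions does 2·N·C²·S FLOPs per eval (a multiply-accumulate counted as 2 ops),
and GFLOP per ms is numerically equal to TFLOPS.

```
// Sketch only: constants are the peak row's config and its measured latency.
#include <stdio.h>

int main(void) {
    int layers = 128, ch = 512, sp = 64;   // "128x conv 512ch sp64"
    double ms_per_eval = 0.311;            // measured ms/eval from the sweep
    double w_mb  = (double)layers * ch * ch * 2 / (1024.0 * 1024.0);  // FP16
    double gflop = (double)layers * 2 * ch * ch * sp / 1e9;
    printf("W=%.1f MB  %.2f GFLOP  %.2f TFLOPS\n",
           w_mb, gflop, gflop / ms_per_eval);
    return 0;  // -> W=64.0 MB  4.29 GFLOP  13.81 TFLOPS (the row below says
}              //    13.80, computed from the unrounded latency)
```

The measured sweep: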
+ +``` +Config W(MB) GFLOP ms/eval TFLOPS +----------------------------------------------------------------- + 32x conv 512ch sp64 16.0 1.07 0.135 ms 7.95 + 48x conv 512ch sp64 24.0 1.61 0.171 ms 9.42 + 64x conv 512ch sp64 32.0 2.15 0.206 ms 10.40 + 96x conv 512ch sp64 48.0 3.22 0.266 ms 12.13 +128x conv 512ch sp64 64.0 4.29 0.311 ms 13.80 ← peak + 64x conv 256ch sp64 8.0 0.54 0.168 ms 3.19 +128x conv 256ch sp64 16.0 1.07 0.132 ms 8.16 +256x conv 256ch sp64 32.0 2.15 0.216 ms 9.94 + 64x conv 384ch sp64 18.0 1.21 0.142 ms 8.52 +128x conv 384ch sp64 36.0 2.42 0.203 ms 11.91 +``` + +**Peak: 13.80 TFLOPS** at `128× conv 512ch sp=64`. + +| Chip | inmem_peak FP16 (TFLOPS) | +|------------|--------------------------| +| M3 Pro | 9.98 | +| M4 Pro | 12.57 | +| M4 Max | 10.93 | +| M5 (16 GB) | 12.17 | +| M5 (32 GB) | 12.44 | +| **M5 Max** | **13.80** | + +--- + +## ane_int8_bench — FP16 vs INT8 W8A8 (larger spatial 64×64) + +**Question**: How close does M5 Max come to the M4 blog's 19 TFLOPS / 35 TOPS +figures when the conv is large enough to saturate the array? + +``` +Config GOP ms/eval TOPS Ratio +----------------------------------------------------------------- +FP16 128x conv 512ch 64x64 274.88 14.263 ms 19.27 +W8A8 128x conv 512ch 64x64 274.88 7.720 ms 35.61 1.85x +FP16 64x conv 512ch 64x64 137.44 7.153 ms 19.21 +W8A8 64x conv 512ch 64x64 137.44 3.824 ms 35.94 1.87x +FP16 256x conv 256ch 64x64 137.44 7.318 ms 18.78 +W8A8 256x conv 256ch 64x64 137.44 4.118 ms 33.37 1.78x +FP16 128x conv 256ch 64x64 68.72 3.696 ms 18.59 +W8A8 128x conv 256ch 64x64 68.72 2.112 ms 32.54 1.75x +FP16 128x conv 384ch 64x64 154.62 8.154 ms 18.96 +W8A8 128x conv 384ch 64x64 154.62 4.389 ms 35.23 1.86x +``` + +| Precision | M5 Max | M4 (README `H16G`) | +|-----------|--------|---------------------| +| FP16 peak | **19.27 TFLOPS** | 18.6 TFLOPS | +| INT8 W8A8 peak | **35.61 TOPS** | 35.1 TOPS | +| INT8/FP16 ratio | 1.85× | 1.88× | + +**Implication**: the `h17` ANE's raw compute is **within 4 % of `h16`** +(run-to-run noise). Apple kept the ~19 TFLOPS FP16 / ~35 TOPS INT8 ceiling +across two chip generations. The "38 TOPS" spec remains the INT8 path. + +--- + +## sram_bench — working-set cliff + +**Question**: Where does the on-chip SRAM spill to DRAM? + +``` +Config W(MB) Act(MB) Tot(MB) ms/eval TFLOPS +--------------------------------------------------------------------- + 256ch x 64sp 0.1 0.03 0.2 0.212 ms 0.04 + 512ch x 64sp 0.5 0.06 0.6 0.085 ms 0.40 +1024ch x 64sp 2.0 0.12 2.2 0.335 ms 0.40 +2048ch x 64sp 8.0 0.25 8.5 0.141 ms 3.80 +3072ch x 64sp 18.0 0.38 18.8 0.204 ms 5.92 +4096ch x 64sp 32.0 0.50 33.0 0.300 ms 7.17 +5120ch x 64sp 50.0 0.62 51.2 0.432 ms 7.76 +6144ch x 64sp 72.0 0.75 73.5 0.565 ms 8.56 +8192ch x 32sp 128.0 0.50 129.0 0.965 ms 4.45 +``` + +**M4 in the blog shows the cliff around ~32 MB**. On M5 Max throughput is still +climbing past **73 MB** and only breaks at **129 MB**. Caveat: the last row +also halves `sp` from 64 to 32 — a pipeline-starvation confound we can't rule +out without an independent probe. What's unambiguous: the effective SRAM +working set is **at least as large as M4's**, plausibly larger. 
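To put numbers on the confound, here is the working-set accounting behind the
W / Act columns, as a sketch (not repo code; FP16 everywhere, MB meaning MiB):

```
// Per-layer working set and throughput of a 1x1 conv, ch -> ch channels
// over sp spatial positions, FP16 = 2 bytes per element.
#include <stdio.h>

static void row(int ch, int sp, double ms) {
    double w_mb   = (double)ch * ch * 2 / (1024.0 * 1024.0);  // weights
    double act_mb = (double)ch * sp * 2 / (1024.0 * 1024.0);  // one act tensor
    double gflop  = 2.0 * ch * ch * sp / 1e9;
    printf("%4dch x %2dsp  W=%6.1f MB  Act=%.2f MB  %5.2f TFLOPS\n",
           ch, sp, w_mb, act_mb, gflop / ms);
}

int main(void) {
    row(6144, 64, 0.565);  // last pre-break row: W=72.0 MB, ~8.55 TFLOPS
    row(8192, 32, 0.965);  // break row: W=128.0 MB, but sp is also halved,
    return 0;              // so less work per eval is mixed into the drop
}
```

A cleaner follow-up would hold sp = 64 and step ch between 6144 and 8192; for
example, a 7168ch × 64sp point (W = 98 MB) would separate SRAM spill from the
workload change.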
+ +--- + +## inmem_bench — single 1×1 conv latency scan + +``` +Config W(MB) ms/eval TFLOPS +-------------------------------------------- + 256ch x64sp 0.1 0.088 ms 0.09 + 512ch x64sp 0.5 0.089 ms 0.38 +1024ch x64sp 2.0 0.313 ms 0.43 +2048ch x64sp 8.0 0.131 ms 4.10 +3072ch x64sp 18.0 0.189 ms 6.38 +4096ch x64sp 32.0 0.302 ms 7.11 +``` + +Dispatch floor ≈ 0.09 ms, matching the M4 blog's ~0.095 ms XPC/IOKit overhead. + +--- + +## Training — dynamic pipeline (`training_dynamic/train.m`) + +**Synthetic token data** (5 M random uint16 in [0, 5000) to mimic a compressed +TinyStories vocab), random init, `--accum 10`. + +| Model | Params | Layers | Kernels compiled once | **M5 Max ms/step** | +|-------|--------|--------|-----------------------|--------------------| +| Stories110M (MHA 12/12) | 109 M | 12 | 421 ms | **73.5 ms** | +| Qwen3-0.6B (GQA 16/8) | 596 M | 28 | 398 ms | **320.0 ms** | + +Qwen3-0.6B per-step timing breakdown (stable from step 10+): + +``` +ane_fwd=54.6 io_fwd=15.2 rms=4.5 ane_bwd=70.5 io_bwd=43.3 +silu=27.0 rms_bwd=12.4 cls=8.7 cblas_wait=0.0 dw_copy=9.9 +``` + +ANE time = 125 ms (39 %) · CPU time = 195 ms (61 %). Bottleneck is unchanged +from the README's diagnosis: ANE is idle most of the step waiting for +RMSNorm / SiLU / classifier / dW / Adam on CPU. + +--- + +## Training — static pipeline (`train_large.m`) + +For apples-to-apples with `community_results.json` (all existing entries use +this path). + +``` +[batch 10: compile=3384ms train=902.5ms (90.2ms/step) compiles=72] + ane=8.0 io=2.8 cls=7.6 elem=11.7 rms=0.1 cblas_wait=0.0 ms/step +[batch 20: compile=3335ms train=897.1ms (89.7ms/step) compiles=72] +[batch 30: compile=3353ms train=900.6ms (90.1ms/step) compiles=72] +Total steps: 30 +Wall time: 13.1 s +Compile time: 10072 ms (76.9 %) +Train time: 2700 ms (20.6 %) +Avg train: 90.0 ms/step +``` + +| Chip | ms/step | ane ms | compile / 10 | +|-------------|---------|--------|--------------| +| M1 Pro | 148–163 | 32–35 | 7.9–8.5 s | +| M1 Max | 143–167 | 35–45 | ~7.1 s | +| M3 Ultra\* | 91 | ~10 | ~3.7 s | +| M4 Pro | 69–73 | 8.9 | ~3.5 s | +| M4 Max | 64 | 10.2 | ~3.5 s | +| M5 (16 GB) | 101–120 | 9.1–9.8| 3.2–3.4 s | +| **M5 Max** | **90.0**| **8.0**| **~3.35 s** | + +\* repo reference platform. + +--- + +## Speedup summary — M5 Max vs baselines + +| Metric | M4 (README) | M5 base | M5 Max | vs M4 | vs M5 base | +|--------|-------------|---------|--------|-------|------------| +| FP16 peak (TFLOPS) | 18.6 | 12.17–12.44 | **19.27** | 1.04× | 1.55× | +| INT8 W8A8 (TOPS) | 35.1 | — | **35.61** | 1.01× | — | +| Stories110M static (ms/step) | 91 | 101–120 | **90.0** | 1.01× | 1.22× | +| Stories110M dynamic (ms/step)| — | — | **73.5** | — | — | +| Qwen3-0.6B dynamic (ms/step) | 412| — | **320.0** | 1.29× | — | + +**Takeaways**: + +1. **Peak ANE compute has not moved between M4 and M5 Max** (≈ 19 TFLOPS FP16, + ≈ 35 TOPS INT8). The `h16 → h17` version bump does not show up in peak math. +2. **Training gains of 1.22–1.29× are CPU-driven**, not ANE-driven. The 12 + performance cores plus Accelerate's `cblas_sgemm` on M5 Max close the gap + that made base M5 (4P + 6E) slower than M4 Pro despite a newer ANE. +3. **M5 Max's effective SRAM working set is ≥ M4's.** The `sram_bench` cliff + sits past 70 MB where M4's was at ~32 MB, though a cleaner probe is needed + (the 128 MB row changes two variables at once). 
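Takeaway 2's split can be re-derived from the Qwen3-0.6B breakdown quoted
above; a quick arithmetic check (the printout does not itemize the remaining
CPU time, e.g. Adam):

```
// Summing the quoted Qwen3-0.6B per-step timers (all in ms).
#include <stdio.h>

int main(void) {
    double ane = 54.6 + 70.5;                 // ane_fwd + ane_bwd
    double cpu_itemized = 15.2 + 43.3         // io_fwd + io_bwd
                        + 4.5 + 12.4 + 27.0   // rms + rms_bwd + silu
                        + 8.7 + 9.9;          // cls + dw_copy
    double step = 320.0;                      // measured ms/step
    printf("ANE %.0f ms (%.0f%%), CPU %.0f ms (%.0f%%, %.0f itemized)\n",
           ane, 100 * ane / step,
           step - ane, 100 * (step - ane) / step, cpu_itemized);
    return 0;  // -> ANE 125 ms (39%), CPU 195 ms (61%, 121 itemized)
}
```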
+ +--- + +## Strategic implications + +- Anyone optimizing training on this repo for M5 Max should focus on pushing + RMSNorm / SiLU / classifier onto the ANE, not on peak-throughput MIL tricks — + the ANE already has 60 % idle headroom per step. +- `h17` is worth re-probing with the tests under `training/test_*.m` — the + `m5result.md` findings (weight-reload fails, weightsBuffer is inert, + procedureIndex is accepted but ignored, QoS has no effect) were recorded on + `h16` and may or may not hold on `h17`. +- No evidence that Apple has tightened the private-API surface on macOS 26.4.1.
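For re-running the subtype check elsewhere, a minimal standalone sketch. To be
explicit about assumptions: the framework path, the runtime/KVC lookup, and
`aneSubType` being reachable as a class-level accessor are guesses modeled on
the `_ANEDeviceInfo.aneSubType` call the probes use; they are not confirmed
API, and the repo's own probes remain the authoritative version.

```
// probe_subtype.m: hedged sketch, NOT the repo's probe.
// Build: clang -fobjc-arc -framework Foundation probe_subtype.m -o probe
#import <Foundation/Foundation.h>
#include <dlfcn.h>

int main(void) {
    @autoreleasepool {
        // Load the private framework so its classes register with the
        // runtime (path is a guess; adjust if dlopen returns NULL).
        dlopen("/System/Library/PrivateFrameworks/ANEServices.framework/"
               "ANEServices", RTLD_NOW);
        Class info = NSClassFromString(@"_ANEDeviceInfo");
        if (!info) { NSLog(@"_ANEDeviceInfo not found"); return 1; }
        @try {
            // KVC against the class object; resolves +aneSubType if the
            // accessor exists (an assumption, not verified here).
            id sub = [(id)info valueForKey:@"aneSubType"];
            NSLog(@"aneSubType = %@", sub);  // expect "h17" on this machine
        } @catch (NSException *e) {
            NSLog(@"aneSubType lookup failed: %@", e);
        }
    }
    return 0;
}
```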