Releases: lablup/mlxcel
v0.3.0
New Features
- Nine new model families: BitNet b1.58 (1.58-bit ternary, #252), IBM Granite dense (#254) and GraniteMoeHybrid (Mamba2 plus attention hybrid, #259), LFM2 and LFM2-MoE (#255), Falcon-H1 (Mamba2 plus attention parallel hybrid, #256), PLaMo 2 (Mamba plus attention hybrid, #257) with PlamoTokenizer (#264), Apertus (xIELU, QK-norm, llama3 RoPE scaling, #260), ByteDance Seed-OSS (#261), and dots.llm1 MoE (#263).
- Configurable allowed-origins for server CORS, replacing the any-origin default when set (#253).
Improvements
mlxcel runwith no model argument now defaults tomlx-community/gemma-4-e2b-it-4bit(wasLlama-3.2-3B-Instruct-4bit): a smaller checkpoint that downloads faster and runs in less memory.- Cleaned up
--helpoutput: multi-line example and value-legend blocks now render one item per line, and the help text reads as a standalone runtime. - Fused decode-MoE Metal kernel is now on by default (
MLXCEL_FUSED_MOE, set to0to disable): faster single-token MoE decode, about 13% on gemma4 (#285). - Two-kernel fused decode-MoE that beats
gather_qmm, extended to 6-bit and mixed-bit experts for dots.llm1 and wired to qwen3-next / Qwen 3.5 / 3.6 and gemma4 (#274, #275, #276, #278, #279, #281); the squared-ReLU kernel stays behind a dedicated flag (#280). - Gate the Mamba2 and nemotron_h per-mixer eval to M5 Max so SSM-hybrid decode is not slowed on other Apple Silicon (#266, #271).
- CCCL header resolution at runtime handles relative invocations and nodes without the build-machine path, and a persistent PTX kernel cache reuses JIT-compiled kernels across runs (#270).
Bug Fixes
- Quantized models now stay bf16, fixing a 33-41% M1 Ultra decode regression on bf16-scale checkpoints (qwen3, nemotron, gpt-oss, solar, and others). The blanket bf16-to-f16 quant-scale promotion added with Apertus had created a bf16-activation by f16-scale mismatch in
quantized_matmul/gather_qmm(#290). - Infer per-tensor quantization bits for embeddings, so mixed-precision exports that store the embedding at a different bit width than the top-level config load instead of aborting in dequant. For example diffusiongemma stores its embedding at 8-bit under a 4-bit default (#292).
CI/CD Improvements
- Linux x86_64 and aarch64 CUDA release build jobs with bundled CCCL headers and GPU smoke tests, producing prebuilt CUDA artifacts (#262).
Technical Details
- Split
mlx_cxx_bridge.cppinto domain-specific translation units (#277). - Refreshed the M1 Ultra and M5 Max benchmark results for the 0.3.0 sweep (#295, #296).
Dependencies
- Bumped the minor-and-patch dependency group (#288).
Breaking Changes
None. The fused decode-MoE kernel changes the default decode path, but greedy output is unchanged; set MLXCEL_FUSED_MOE=0 to restore the previous path.
Known Issues
- bf16 exports of very large MoE models do not fit a 128GB host (for example Hunyuan-A13B at about 160GB). Use the 4-bit variant, which runs at about 44 tok/s on M1 Ultra and 64 tok/s on M5 Max.
What's Changed
- feat(server): add configurable allowed-origins for CORS by @inureyes in #253
- feat(models): add IBM Granite dense (granite) by @inureyes in #254
- feat(models): add LFM2 and LFM2-MoE (lfm2, lfm2_moe) by @inureyes in #255
- feat(models): add Falcon-H1 (Mamba2 + attention parallel hybrid) by @inureyes in #256
- feat(models): add PLaMo 2 (plamo2, Mamba + attention hybrid) by @inureyes in #257
- feat(models): add GraniteMoeHybrid (Granite 4.x Mamba2 + attention hybrid) by @inureyes in #259
- feat(models): add Apertus (xIELU, QK-norm, llama3 RoPE scaling) by @inureyes in #260
- feat(models): add ByteDance Seed-OSS (seed_oss) by @inureyes in #261
- feat(models): add dots.llm1 (dots1) MoE by @inureyes in #263
- feat(tokenizer): support PLaMo PlamoTokenizer (tokenizer.jsonl Unigram) by @inureyes in #264
- perf(models): gate Mamba2 mixer eval boundary to M5 Max by @inureyes in #266
- feat(ci): add Linux x86_64 + aarch64 CUDA release builds with bundled CCCL by @inureyes in #262
- perf(models): gate nemotron_h Mamba2 per-mixer eval to M5 Max by @inureyes in #271
- feat(cuda): robust CCCL resolution, first-run JIT notice, and persistent kernel cache by @inureyes in #270
- docs(benchmarks): MoE decode gap investigation (#268) by @inureyes in #273
- perf(moe): fused decode-MoE kernel foundation (#268) by @inureyes in #274
- perf(moe): fused MoE expert decode kernel, correctness-validated (#268 step 2a) by @inureyes in #275
- perf(moe): two-kernel decode-MoE that beats gather_qmm (#268 step 2b) by @inureyes in #276
- refactor(core): split mlx_cxx_bridge.cpp by domain; bump mlxcel-core 0.2.0 by @inureyes in #277
- perf(moe): 6-bit/mixed-bit fused decode-MoE; wire dots.llm1 (#268 step 2c) by @inureyes in #278
- perf(moe): wire qwen3_next fused decode-MoE (qwen3.5/3.6) (#268 step 3a) by @inureyes in #279
- perf(moe): preserve squared-ReLU fused MoE kernel behind a dedicated flag (#268) by @inureyes in #280
- perf(moe): GeGLU fused decode-MoE; wire gemma4 (+13%) (#268 step 3b) by @inureyes in #281
- docs(moe): document MLXCEL_FUSED_MOE flags and per-model gains (#268) by @inureyes in #283
- perf(moe): default-on MLXCEL_FUSED_MOE, validated on M5 (#282) by @inureyes in #285
- feat(models): add BitNet b1.58 (1.58-bit ternary) support (#252) by @inureyes in #287
- perf(models): keep quantized models bf16 to fix M1 Ultra decode regression (#289) by @inureyes in #290
- docs(benchmarks): refresh M5 Max results for the 0.2.1 full sweep by @inureyes in #293
- deps(deps): bump the minor-and-patch group with 2 updates by @dependabot[bot] in #288
- fix(models): infer per-tensor bits for quantized embeddings (#291) by @inureyes in #292
- docs(benchmarks): refresh M1 Ultra + M5 Max for mlxcel 0.2.1 by @inureyes in #295
- docs(benchmarks): correct M5 Max #294 entries; both failures were environmental by @inureyes in #296
Full Changelog: v0.2.1...v0.3.0
v0.2.1
New Features
- Exact-prefix prompt-cache snapshots now cover model-owned recurrent and mixed-cache families: Mamba, Mamba2, Jamba, Nemotron-H, Qwen 3.5 / 3.6 text, MoE, and VLM wrappers (#241).
- Gemma 4 text, VLM, and Unified wrappers now donate and restore exact-prefix prompt-cache snapshots for model-owned standard and rotating caches (#243).
Improvements
mlxcel serve --helpandmlxcel-server --helpnow describe disaggregated peer roles consistently for--prefill-peers,--decode-peers, and--serving-bind.- Documentation now lists Gemma 4 snapshot-cache support and the
MLXCEL_KV_CACHE_BUDGETandMLXCEL_ENABLE_VLM_PREFIX_CACHEenvironment variables.
Bug Fixes
None
CI/CD Improvements
None
Technical Details
- Gemma 4 snapshot restore preserves rotating-cache metadata, including write index, window size, buffered state, and seed.
- Real checkpoint validation on
models/gemma-4-26b-a4b-it-4bitinserted a 10,568,520-byte snapshot withsnapshot_rejections_oversized=0.
Dependencies
None
Breaking Changes
None
Known Issues
None
What's Changed
- feat: add exact-prefix snapshot prompt cache by @inureyes in #241
- feat: add Gemma 4 snapshot prompt-cache reuse by @inureyes in #243
Full Changelog: v0.2.0...v0.2.1
v0.2.0
mlxcel v0.2.0 lands the unified paged KV cache as a live serving path and adds disaggregated prefill/decode/router serving, the DiffusionGemma block-diffusion model, and a per-hardware speculative-decode default. Changes are measured against v0.1.4.
New Features
- Unified paged KV cache, live in the batching server (epic #116). Prefix reuse and paged block storage now operate together: a concurrent shared prefix is stored once with reference counting and copy-on-write, so a second request that shares a prefix adopts the existing blocks and re-prefills only its divergent suffix. Pool-backed decode is byte-identical to the prior dense path on qwen3 and llama3 across single, batched, and prefix-share cases (#152, #167, #168).
- Disaggregated serving: prefill, decode, and router roles over TCP.
mlxcel-server --node-role {prefill,decode,router}with--serving-bind,--prefill-peers, and--decode-peerssplits the three roles across processes. A model-free router fronts HTTP, prefill hands the sequence to decode, and the router merges the token stream back to the client. A 3-process run is byte-identical to a single hybrid node (#185, #187, #188, #189, #190, #191, #192, #193). - DiffusionGemma block-diffusion model: text generation, image input, and
mlxcel-serverserving. Temperature-0 output is byte-identical across the MLX bump (#217, #218, #219, #220). - Qwen3-Coder XML tool-call parsing, surfaced as OpenAI
tool_calls(#206). --kv-cache-budget <BYTES|auto>knob (envMLXCEL_KV_CACHE_BUDGET) caps the paged KV pool with an admission gate and cold-prefix eviction, with usage exposed atGET /v1/cache/statsand on/metrics. Opt-in, unbounded by default (#174, #175, #176, #178).- Architecture-aware KV-cache memory estimation for
mlxcel inspectand the--estimate-memorypreflight, plus an explicit activation term. Sliding-window, MLA, hybrid, and pure-SSM models no longer use a flat formula that was off by about 100x for Gemma, DeepSeek, and Mamba (#172, #173). - Opt-in VLM prompt-prefix cache sharing for multi-turn same-image conversations, behind
--enable-vlm-prefix-cache, verified byte-identical to a cold prefill on qwen2-vl-2b (#182, #184). - Fused paged-attention decode Metal kernel (split-K flash-decoding), built and numerically correct but gated off behind
MLXCEL_PAGED_ATTENTION_NATIVEbecause it does not beat MLX gather-then-SDPA at long context on Apple Silicon (#181).
Improvements
- Automatic Prefix Caching is now enabled by default; the output is unchanged (#233).
- The prompt-prefix KV cache now serves the Anthropic
/v1/messagesand OpenAI Responses/v1/responsesendpoints, not just/v1/chat/completionsand/v1/completions(#240). - The B=1 MTP speculative-burst default is now chosen per hardware. Batch-capable targets (such as Gemma 4 31B) default on only on M5-class hardware with a neural accelerator, since they regressed 0.75x to 0.96x on M1 Ultra while gaining 1.2x to 1.4x on M5; non-batchable targets stay always-on;
MLXCEL_ENABLE_MTP_B1overrides either way (#216). - Partially matched paged prefixes are adopted instead of declined, and paged adoption clones and pins the shared blocks rather than consuming them (#230, #232).
- Chunked slab storage for the paged pool, presized per prefill span with eager slab eval (#237, #229).
- Stream decode continuation tokens one frame at a time from the disaggregated decode role (#214).
- Hardened the ragged B>1 MTP batching masks and verify tail so variable-length prompts in one burst keep greedy parity (#202).
- Vendored MLX bumped to upstream main (2026-06-11); the steel GEMM overlay was retired now that the fix is upstream (#223).
Bug Fixes
- Per-row position holes broke B>1 batched MTP greedy parity after divergent accepts. The surviving K/V is now compacted to each row's accepted end with per-row RoPE and a precise mask, so a divergent round no longer shifts later rows off their true positions (#211).
- Guard the empty-batch paged-decode fallbacks against a
drain(..1)panic, and use absolute block indexing in append, trim, restore, and serde validation so alogical_start > 0write addresses the correct block (#215). - Support chunked-prefill prompts in the disaggregated serving handoff, driving start and continue-chunked to completion with a 1M-token admission cap and pool release on extract error (#213).
- Apply the chat stream filter to disaggregated router output so reasoning-content splitting and structural-token cleanup match the single-node path (#212).
- Finish a chunked prefill when the first chunk already reaches the prompt end (#179).
- Release paged KV block pins on prompt-cache evict or decline, closing a pre-existing leak that left the origin allocation pinned at reference count 1 (#170).
- Account real paged pool bytes in the prompt-cache ledger and
/v1/cache/statsinstead of a nominal placeholder (#231). - Enforce the pack3 size contracts in release builds so a mis-sized packed buffer fails fast instead of corrupting silently (#236).
- Render assistant
tool_calls.argumentsas a JSON object rather than a string on multi-turn requests (#210). - Render the request's
toolsinto the prompt so templates that inspect the tool list receive the real definitions (#207). - Expand bare model names to the default org in the
downloadsubcommand, matching the other-mconsumers (#177).
CI/CD Improvements
- None.
Technical Details
- Hardened the paged KV handoff deserialization boundary: frame-size cap, block-geometry anchor, per-layer consistency check, and empty-sequence rejection, so a malformed handoff payload from a peer cannot drive an out-of-bounds read or an unbounded allocation. A restore that fails partway now releases the blocks it already took instead of leaking them (#186).
- Extended the paged KV cache scheduler and prefix-share parity suites to llama3 alongside qwen3, all byte-identical (#169).
- Added hybrid-SSM cache carve-out tests and multimodal-digest plumbing so SSM and VLM families stay correctly excluded from or included in block sharing (#182).
- New
docs/CONTINUOUS_BATCHING.mdcovering continuous batching, paged decode, and the disaggregated prefill/decode/router topology, plus an expanded unified-cache section indocs/turbo-kv-cache.md(#194). - Recorded upstream attribution for ported third-party code (#238).
Dependencies
- Bumped the minor-and-patch dependency group with 3 updates (#180).
- Vendored MLX pinned to upstream main, 2026-06-11 (#223).
Breaking Changes
- None.
Known Issues
- None.
What's Changed
- feat: transparent pool-backed KVCache + real-model paged parity test by @inureyes in #152
- feat: back paged scheduler sequences with the shared KV block pool by @inureyes in #167
- feat: unify radix prompt cache with paged block pool (sub-step b of #121) by @inureyes in #168
- test: extend paged KV cache parity to llama3 by @inureyes in #169
- fix: release paged KV block pins on prompt-cache evict/decline by @inureyes in #170
- feat: architecture-aware KV-cache memory estimation by @inureyes in #172
- feat: explicit activation term in memory estimate by @inureyes in #173
- feat: opt-in block budget for the paged KV pool by @inureyes in #174
- feat: paged KV block-budget admission gate and eviction by @inureyes in #175
- feat: wire paged KV block budget to a --kv-cache-budget knob by @inureyes in #176
- feat: surface paged KV block budget in cache stats and metrics by @inureyes in #178
- feat: add fused paged-attention decode Metal kernel (gated off) by @inureyes in #181
- feat: hybrid-SSM cache carve-out tests and multimodal digest plumbing by @inureyes in #182
- feat: opt-in VLM prompt-prefix cache sharing for multi-turn images by @inureyes in #184
- feat: serialize paged KV block contents for node handoff by @inureyes in #185
- feat: harden paged KV handoff deserialization boundary by @inureyes in #186
- feat: in-process paged KV serving-role handoff mechanism by @inureyes in #187
- deps(deps): bump the minor-and-patch group with 3 updates by @dependabot[bot] in #180
- fix: expand bare model names in the download subcommand by @inureyes in #177
- feat: disaggregated serving-mode plumbing and coordinator skeleton by @inureyes in #188
- fix: finish chunked prefill when first chunk reaches prompt end (#179) by @ujwal-setlur in #183
- feat: serving-role KV handoff scheduler entries (B2b) by @inureyes in #189
- feat: disaggregated serving-role loops over a real TCP transport (B3a) by @inureyes in #190
- feat: serving-role worker-flip mechanism + disaggregated peer CLI (B3b1) by @inureyes in #191
- feat: live 2-process disaggregated serving handoff (B3b2a) by @inureyes in #192
- feat: dedicated disaggregated serving router front (B3b2b) by @inureyes in #193
- docs: unified paged KV cache + disaggregated serving (B3c) by @inureyes in #194
*...
v0.1.4
Changes since v0.1.3.
New Features
- Gemma 4 Unified (
gemma4_unified) multimodal architecture (#153, closes #151). - Gemma 4 Unified MTP speculative drafter (
gemma4_unified_assistant) (#157, closes #158). The Gemma 4 Unified decode target routes through the existing MTP speculative burst dispatch, reusing the MTP drafter and round loop unchanged. The drafter's pre/post projections load through the quantization-awareUnifiedLinear, so a 4-bit assistant such asgemma-4-12B-it-assistant-4bitno longer crashes at forward time with a matmul shape mismatch. On thegemma-4-12b-it-4bittarget plus the 4-bit assistant, temperature-0 output is byte-identical to classic decode at about 1.87x decode speedup (39 to 74 tok/s). - Variable-length prompts in B>1 batched MTP bursts (#162, closes #161), behind the new
MLXCEL_ENABLE_MTP_BATCH_RAGGEDopt-in (subordinate toMLXCEL_ENABLE_MTP_BATCH). Rows of different prompt lengths join a single burst via per-row left-padding and a windowed left-padding causal mask; greedy parity holds because every token in a row is shifted by the same constant left-padding offset. Eligibility is limited tomax_prompt_len <= sliding_window, and out-of-regime windows fall back to per-row B=1 service. Off by default (measured 0.94x to 1.13x on the 31B), so the production path is byte-for-byte unchanged. - Unified paged KV cache foundation (epic #116), landed in four additive phases:
- Phase 0: decode-time page-gather microbench and ADR 0001, selecting the
[num_blocks, block_size, n_kv_heads, head_dim]pool layout (about 2.1x faster on gather-then-SDPA than the head-split layout) and the gather-then-SDPA strategy (#145, closes #117). - Phase 1: physical block-pool K/V tensor storage in
PagedBlockPool, lazily allocated per layer withwrite_block/gather_visibleprimitives (#148, closes #118). - Phase 2: pooled paged-decode read path over real, possibly fragmented block tables, bit-identical to the dense fallback over 200 steps (#149, closes #119).
- Phase 3: paged prefill writer with shared-prefix copy-on-write, so a suffix write after a shared prefix allocates only the divergent blocks (#150, closes #120).
- Phase 0: decode-time page-gather microbench and ADR 0001, selecting the
Improvements
- B=1 MTP speculative decoding now runs by default for every MTP target, including batch-capable ones such as Gemma 4 31B (#159, closes #158). Previously batch-capable targets declined singleton MTP unless
MLXCEL_ENABLE_MTP_B1=1was set, a calibration from an earlier "B=1 is slower" measurement. M5 Max measurement shows B=1 MTP is profitable with byte-identical output at temperature 0: about 1.2x to 1.4x on the 31B plus bf16 assistant and about 1.87x on the 12B Unified pair. Opt out withMLXCEL_ENABLE_MTP_B1=0.
Bug Fixes
- Quantized fused MoE experts in the
gemma4_unifiedloader are now split correctly (#156). The fused-expert split insanitize_gemma4_unified_weightsonly matched the bare non-quantized.weight, so a quantized MoE checkpoint's.weight/.scales/.biaseslegs fell through unsplit andswitch_gluconstruction could not find its per-projection quantized parts. The split now matches each quantized component leg and slices it on the output (doubled-FFN) axis at the same half boundary, with a dequantize-equivalence test proving no group straddling.
CI/CD Improvements
None.
Technical Details
- The four unified paged KV cache phases (#145, #148, #149, #150) are additive machinery exercised by tests. The live decode path stays byte-for-byte unchanged until the scheduler wiring lands in a later phase of epic #116.
- Recorded the measured Gemma 4 31B B>1 batched MTP numbers and aligned the related code comments (#160).
Dependencies
- Bumped the
minor-and-patchgroup:uuid1.23.1 to 1.23.2 andhyper1.9.0 to 1.10.1 (#147).
Breaking Changes
None.
Known Issues
None.
What's Changed
- perf: page-gather decode microbench and paged-attention ADR (#117) by @inureyes in #145
- feat: add physical block-pool K/V tensor storage to PagedBlockPool by @inureyes in #148
- feat: pooled paged decode read path over real block tables (#119) by @inureyes in #149
- feat: paged prefill writer with shared-prefix copy-on-write by @inureyes in #150
- feat: add Gemma 4 Unified (gemma4_unified) multimodal architecture by @inureyes in #153
- deps(deps): bump the minor-and-patch group with 2 updates by @dependabot[bot] in #147
- fix: split quantized fused MoE experts in gemma4_unified sanitize by @inureyes in #156
- feat: Gemma 4 Unified MTP speculative drafter (gemma4_unified_assistant) by @inureyes in #157
- perf: run B=1 MTP by default for all targets (incl. batch-capable 31B) by @inureyes in #159
- docs: record measured 31B B>1 batched MTP numbers; align comments by @inureyes in #160
- perf: support variable-length prompts in B>1 batched MTP bursts by @inureyes in #162
Full Changelog: v0.1.3...v0.1.4
v0.1.3
Changes since v0.1.2.
New Features
mlxcel arch(aliasmlxcel supported) prints the supported model-architecture catalog, separating it from the local-model listing (#138).mlxcel listgained scripting and machine-readable modes:--json(a stable{repo_id, size_bytes, path, modified}array),-q/--quiet(repo-ids only, formlxcel list -q | xargs -n1 mlxcel rm),-v/--verbose(restores the absolute PATH column), and--sort name|size|modified(#141).
Improvements
mlxcel listdefault table redesigned to NAME / SIZE / MODIFIED, dropping the absolute PATH from the default view. MODIFIED shows a relative time from the snapshot directory mtime ("just now", "2 days ago", "3 weeks ago"); the header contracts$HOMEto~and dims secondary columns on a TTY, respectingNO_COLOR. Long repo-ids are ellipsized so one outlier does not break column alignment (#141).
Bug Fixes
- Security: bounded chat-template rendering with a minijinja fuel cap of 50M VM instructions, so a pathological model template (for example unbounded
forloops) can no longer cause a denial of service. Exhaustion returns a cleanOutOfFuelerror instead of panicking, and the load-timesupports_toolsprobe degrades to a string heuristic. Audited across 91 templates and 267 scenarios with 0 failures. This is RCE-safe and matters most for multi-tenant deployments (#129, PR #139). - The base-model warning no longer presents
-itas a universal instruction-tuned suffix. It now names the per-family conventions (Gemma-it; Llama and Qwen2.5-Instruct; Qwen3 and Qwen3.5 plain name vs.-Base), so runningQwen3.5-0.8B-Baseadvises dropping-Baseinstead of pointing at a non-existent-itrepo (#137).
CI/CD Improvements
- Added a pull-request check (
scripts/ci/check_cross_repo_refs.py) that flags unqualified bare#NNNissue/PR references, so cross-repository references are written asorg/repo#NNNand caught in review (#144).
Technical Details
StoredModelnow carries an optionalmodifiedtimestamp from the snapshot directory mtime; the list renderers take an injectednowanduse_color, so table output is deterministic and TTY styling is gated outside the render path (#141).- Removed leaked internal issue/PR number references from public source comments and docs (#144).
Dependencies
- minijinja: enabled the
fuelfeature (version unchanged at 2.20) to bound chat-template render cost (#139).
Breaking Changes
mlxcel list(and itslsalias) now lists local downloaded models by default instead of the supported-architecture catalog, and themlxcel list --localflag was removed (clap rejects it as an unknown argument). Migration: usemlxcel archfor the catalog, and drop--localfrom anymlxcel list --localinvocation, since the baremlxcel listnow does the same thing (#138).
Known Issues
- None.
What's Changed
- fix: make base-model warning's instruct advice family-aware by @inureyes in #137
- fix(security): bound chat-template rendering with minijinja fuel by @inureyes in #139
- feat!: list local models by default, move catalog to
mlxcel archby @inureyes in #142 - feat: redesign
mlxcel listoutput (modified, --json/-q/-v, styling) by @inureyes in #143 - chore: purge leaked mlxcel-internal issue/PR numbers from public source and docs by @inureyes in #144
Full Changelog: v0.1.2...v0.1.3
v0.1.2
New Features
None
Improvements
None
Bug Fixes
- Chat fallback for models without a
chat_templateno longer collapses into echo loops on base / non-instruction-tuned models. Whentokenizer_config.jsonships nochat_templatefield and there is nochat_template.jinja,render_promptpreviously calledconcat_plaintext, which is bare content-only concatenation with no role markers, so a base completion model took the most natural continuation of an unstructured prompt and parroted the user's last turn indefinitely (the symptom reported in #133). The implicit "no template found" path now uses a genericUser: ... Assistant: ...pseudo-template viaconcat_userassistant_fallback, with a trailingAssistant:cue (no newline) that nudges the model to produce an assistant turn next instead of completing its own prompt with anotherUser:line. Theprocessor.is_none()warning still fires and still recommends the-itHub counterpart.--no-chat-templateis unchanged and remains the documented escape hatch for completion-style raw concatenation, parallelingmlxcel generate --no-chat-template. Template-render failure inside the chat-template path now falls back to the structured form as well, rather than raw concat, since by then the user is already in chat mode. Unknown roles such astoolare preserved verbatim with the sameRole:pattern instead of silently merging into the prior turn (#133, PR #136).
CI/CD Improvements
None
Technical Details
- New
concat_userassistant_fallbackhelper insrc/commands/chat.rscomplements the existingconcat_plaintext(which is kept verbatim for the--no-chat-templateopt-in). render_promptnow dispatches three paths in priority order:--no-chat-template(raw concat, unchanged), chat template present (template apply, structured-form fallback on render failure), no template + not opted out (new structured fallback).- Test coverage added in
src/commands/chat_tests.rs:user_assistant_fallback_labels_all_turns_and_cues_assistant,user_assistant_fallback_marks_unknown_roles_instead_of_dropping_them,render_prompt_without_template_uses_user_assistant_fallback,render_prompt_no_chat_template_flag_uses_raw_concatenation. The pre-existingconcat_plaintext_joins_turns_with_newlinestest is retained and clarified as the--no-chat-templatepath.
Dependencies
None
Breaking Changes
None
Known Issues
None
Full changelog: v0.1.1...v0.1.2
v0.1.1
New Features
None.
Improvements
mlxcel runwarning for models without a chat template is now actionable: it states that the model is likely a base / non-instruction-tuned variant, that chat replies will be incoherent or repetitive, suggests trying an-itvariant on the Hub (e.g.gemma-4-e4b-it-4bitforgemma-4-e4b-4bit), and documents--no-chat-templatefor silent raw-text mode andmlxcel generate -p <prompt>for one-shot completion. The explicit--no-chat-templatepath remains completely silent (#132, PR #134).
Bug Fixes
chat_template.jinjais now downloaded alongside the rest of the model snapshot. The downloader allow-list insrc/downloader/filters.rs::is_wanted_fileonly accepted exact-namechat_template(no extension) plus*.json,*.safetensors,*.tiktoken,*.model, and constrained*.txt, but the actual HuggingFace convention ischat_template.jinja. The file was filtered out at download time, leaving theChatTemplateProcessor::from_model_pathchat_template.jinjafallback dead and forcing the REPL into the raw-text path for any model that ships its template as a separate Jinja file (e.g.mlx-community/gemma-4-e4b-it-4bit).is_wanted_filenow also accepts*.jinjafiles; theis_safe_relative_pathandis_explicitly_deniedguards still run before the allow-list so no new attack surface is opened (#132, PR #134).
CI/CD Improvements
- macOS release binaries are now notarized. The release workflow submits signed
mlxcelandmlxcel-serverto Apple's notary service viarcodesign notary-submit --wait, so Gatekeeper no longer blocks first launch with "developer cannot be verified". Stapling is skipped because bare Mach-O executables do not support stapling, andspctl --assessruns as a soft warn-only check since the notary ticket may still be propagating. Paired withrcodesign verifyafter signing,set -euo pipefailon the prepare-cert and code-sign steps, surfacedopenssl pkcs12stderr on extraction failure, up-front validation ofAPPLE_CERTIFICATE/APPLE_CERTIFICATE_PASSWORD/AC_API_*secrets,chmod 600on the materialized PEM and API key files, and an always-run cleanup that scrubssigning.pem,original.p12,AuthKey.p8,ac-key.json, and the notarization zip from\$RUNNER_TEMPso self-hosted runners no longer carry an unencrypted Developer ID private key across jobs. - Per-target `workflow_dispatch` filter on the release workflow (`targets`: `all` / `macos` / `linux`). Re-uploading a single platform's artifact to an existing release (for example retrofitting notarized macOS binaries onto a release that was cut before notarization landed) no longer rebuilds and replaces the other platforms' bit-different (timestamp-driven) zips, so any sha256 pinned by a downstream consumer remains valid. Release events still build everything; the filter is dispatch-only. Modeled after the per-family `targets` filter in `all-smi`'s release workflow.
- `actions/checkout` ref pinned to the target release tag in both the macOS and Linux CUDA jobs (release event: `github.event.release.tag_name`; `workflow_dispatch`: `github.event.inputs.release_tag`; otherwise `github.sha`). Without an explicit ref, `actions/checkout` would grab the dispatched ref (which is `main` for `workflow_dispatch`), so re-dispatching a build for an older tag would silently use `main` HEAD's source instead of the tag's source. The workflow YAML itself still runs from the dispatched ref, matching `all-smi`'s self-healing release pattern.
Technical Details
- GB10 (NVIDIA Grace Blackwell) doc refreshed to the 2026-05-28 full sweep on mlxcel 0.1.0 with MLX pin `84961223` and the warm same-process harness (`--cooldown 0`). Adds the recovered `internvl3-1b` and `molmo-7b` text rows and three VLM image-path entries (`qwen2-vl-2b`, `qwen2-vl-2b-4bit`, `qwen3-vl-30b-a3b`). The cross-hardware decode table in `model_tests.md` now reflects the canonical state of each per-hardware doc: GB10 2026-05-28, M1 Ultra 2026-05-28, M5 Max 2026-05-27. The "vs 2026-05-19" delta framing is dropped so the doc reads as a current-state snapshot, and the `Partial (
⚠️ )` status is collapsed into `Pass (✅)` because the partial-token information already lives in the Notes column. GB10 Overall Status counts: 101 text pass / 8 fail, 38 VLM image-path pass / 0 fail (#131).
Dependencies
None.
Breaking Changes
None.
Known Issues
- The qwen3-0.6b decode throughput still trails the mlx-lm baseline; the gap is flagged in the M5 Max benchmark notes (carried over from v0.1.0).
- Gemma 3n VLM checkpoints (`gemma3n-e2b-4bit`, `gemma3n-e4b-4bit`, `gemma3n-e4b-bf16`) still fail on the M5 Max VLM path with a `[broadcast_shapes]` mismatch; the text-only path on the same checkpoints is unaffected (carried over from v0.0.31).
Full Changelog: v0.1.0...v0.1.1
v0.1.0
New Features
mlxcel run <repo-id-or-path>subcommand (#102, epic #92). With no-p, enters the interactive chat REPL; with-p, one-shot output byte-identical to the equivalentmlxcel generate; with no model argument, falls back tomlx-community/Llama-3.2-3B-Instruct-4bit, matching mlx-lm'sDEFAULT_MODEL. The model is a positional argument, somlxcel run <repo-id>reads likeollama run.- Interactive multi-turn chat REPL (#101).
mlxcel generatewithout-p/--promptnow enters a chat loop that streams the assistant reply token-by-token, preserves conversation context across turns, and supports/bye,/clear,/?(alias/help) slash commands plus ollama-style"""multiline input blocks. Reuses the existing model resolver, tokenizer, chat template, sampling, generator, and the server's byte-fallback-safe streaming detokenizer, so the offline REPL stays in lock-step with the server's behavior. - Local model management (#99).
mlxcel list --localenumerates downloaded snapshots with repo-id, on-disk size, and absolute path.mlxcel rm <repo-id>deletes from the mlxcel store with confirmation on a TTY, refuses on a non-TTY without--yes, contains deletion to the store root, and refuses to touch the read-only HuggingFace cache. - Repo-id-aware
-m/--modelacrossgenerate,serve,inspect,mlxcel-server, andrun(#100, #92).-mnow accepts a local path or a HuggingFaceowner/namerepo-id. Resolution precedence: existing on-disk path (byte-identical to the pre-#100 behavior); otherwise legacy./models/<name>then HF cache (read-only reuse) then mlxcel global store then auto-download into the store. Legacy and store branches gate on a presentconfig.json. - Global model store and HuggingFace cache read-reuse (#98). Default download destination moves from per-CWD
./models/<basename>to${MLXCEL_CACHE_DIR:-$HOME/.cache/mlxcel}/models/<owner>/<name>, namespaced to avoid same-name collisions across owners.download_reporeuses an existing snapshot under$HF_HUB_CACHE/$HF_HOME/~/.cache/huggingface/hubwhen no--local-diris pinned; mlxcel never writes into the HF content-addressed layout. MLXCEL_MODELS_DIRenvironment variable and uniform--models-dirflag (#108). Precedence:--models-dir>MLXCEL_MODELS_DIR>${MLXCEL_CACHE_DIR:-$HOME/.cache/mlxcel}/models. Wired throughdownload,generate,serve,inspect,run,list, andrm. Closes #107.- Bare model names default to the
mlx-communityorg (#113). A value with no slash (e.g.Qwen3-4B-4bit) now resolves asmlx-community/<name>instead of erroring. Overridable viaMLXCEL_DEFAULT_ORG; invalid org values are caught up front with no network call.
Improvements
None.
Bug Fixes
mlxcel-serverlegacy startup-m(epic #92 hardening) now uses the same repo-id resolver and--models-dirstore override asmlxcel serve, including the safetensors-only presence check.- Video resource caps (
MLXCEL_VIDEO_MAX_PIXELS/_DURATION_SEC/_PNG_FRAME_BYTES) are now threaded through aVideoLimitsstruct resolved once at the boundary instead of being read fromstd::envdeep in the decode path. Removes a libc setenv/getenv data race that made the cap tests flake under the threadedcargo testrunner (#104). - Pipeline runtime tests bind
127.0.0.1:0directly throughTcpTransportand resolve the real port vialocal_addr(), removing the release-then-rebind window that caused intermittent "stub stage startup channel dropped" failures when another concurrently running test grabbed the freed ephemeral port (#106). require_secure_endpoint_refuses_plaintext_with_tokentest acquiresenv_lock()so it no longer races with sibling tests that setMLXCEL_ALLOW_INSECURE_ENDPOINT(#111).
CI/CD Improvements
None.
Technical Details
- README "Run a model" section now leads with the
mlxcel runone-liner and collapses the core verbs (generate,serve,inspect,--estimate-memory) into a single one-line-comment block; env-var detail moves behinddocs/environment-variables.md. README drops about 58 lines with no loss of documented behavior (#114). - M5 Max detailed table, README headline version, and benchmark report method table refreshed to the 2026-05-27 full-sweep state on mlxcel 0.1.0.
internvl3-1bnow passes in both text (661 tok/s) and VLM (601 tok/s, ahead of mlx-vlm's 529 tok/s) (#115). - M5 Max decode aggregates recomputed against the unchanged
mlx-lm/mlx-vlmbaselines. Text decode 99% median, 62 of 66 at >=90% parity; VLM decode 102% average, 101% median, 22 comparable pairs, 18 of 22 at >=90% parity. README decode tables add Gemma 2 2B, Phi-3.5-mini, Jamba, and InternVL3 1B (#127). - 2026-05-28 M1 Ultra full sweep mirrored into the public benchmark docs. M1 Ultra text 99% median (64 of 74 at >=90%); VLM 98% median (12 of 18 at >=90%);
internvl3now passes on M1 (#130). - v0.0.31 #86 entry in CHANGELOG.md and
debian/changelogreworded to credit the batch-scheduler concurrency race as the root cause, matching the GitHub release note for v0.0.31.
Dependencies
None.
Breaking Changes
- The default download destination of
mlxcel download(and the implicit auto-download under any-mconsumer) moves from per-CWD./models/<basename>to the global model store${MLXCEL_CACHE_DIR:-$HOME/.cache/mlxcel}/models/<owner>/<name>. Existing local paths passed to-mcontinue to be used verbatim. Pass--local-dir <path>, setMLXCEL_MODELS_DIR, or pass--models-dir <path>to override the destination.
Known Issues
- The qwen3-0.6b decode throughput still trails the mlx-lm baseline; the gap is flagged in the M5 Max benchmark notes.
- Gemma 3n VLM checkpoints (
gemma3n-e2b-4bit,gemma3n-e4b-4bit,gemma3n-e4b-bf16) still fail on the M5 Max VLM path with a[broadcast_shapes]mismatch; the text-only path on the same checkpoints is unaffected (carried over from v0.0.31).
What's Changed
- feat(download): global model store + default migration + HF-cache read reuse by @inureyes in #98
- feat(cli): repo-id-aware -m resolver with auto-download (generate/serve/inspect) by @inureyes in #100
- feat(cli): local model management — list downloaded models + remove by @inureyes in #99
- feat(cli): interactive chat REPL (multi-turn, streaming, slash commands, multiline) by @inureyes in #101
- feat(cli): add
mlxcel runverb (one-shot + REPL dispatch + default-model fallback) by @inureyes in #102 - fix(multimodal): inject video resource caps instead of reading env in the decode path by @inureyes in #104
- fix(distributed): bind ephemeral ports directly in pipeline runtime tests by @inureyes in #106
- feat(cli): MLXCEL_MODELS_DIR env var + uniform --models-dir model-store override by @inureyes in #108
- fix: harden epic 92 server model resolver by @inureyes in #109
- fix(downloader): serialize the secure-endpoint refusal test through ENV_LOCK by @inureyes in #111
- feat(cli): default bare model names to the mlx-community org by @inureyes in #113
- docs: slim down README Quick start section by @inureyes in #114
- docs: refresh M5 Max benchmark results to mlxcel 0.1.0 (2026-05-27) by @inureyes in #115
- docs: recompute M5 Max decode aggregates and feature optimized models by @inureyes in #127
- docs: reflect the 2026-05-28 M1 Ultra re-benchmark in public docs by @inureyes in #130
Full Changelog: v0.0.31...v0.1.0
v0.0.31
New Features
- MiniCPM-V 4.6 VLM architecture, including hardened image grid handling for the multi-slice image processor. (#82, #83)
- RT-DETRv2 object detection model, exposed through the new
mlxcel detectsubcommand. (#80) - Anthropic-style
/v1/messagesAPI endpoint on the server, so Messages API clients can talk to mlxcel alongside the existing OpenAI-compatible routes. (#74)
Improvements
- Documented the
MLXCEL_CAPTURE_DECODEenvironment variable and clarified the memory headroom wording. (#72)
Bug Fixes
- Chat message
contentthat is missing or explicitlynull, such as assistant tool-call turns, is now tolerated instead of being rejected with an HTTP 422. This restores multi-turn tool loops for OpenAI-compatible clients that omitcontenton tool-call messages. (#91) - Gemma 3n VLM
per_layer_inputsis now keyed per sequence id, mirroring the Gemma 4 container. It was previously stashed in a single shared cell, so a burst of concurrent VLM requests in the batch scheduler could overwrite it before prefill consumed it and read the wrong sequence's tensor (or panic). (#86) - Qwen3.5 MTP speculative decoding now uses per-position verify attention so the draft and verify passes stay in parity. (#78)
- Batched quantized KV caches now apply the correct mask offset. (#76)
CI/CD Improvements
- Pinned the Rust toolchain to 1.93.1 for reproducible builds. (#87, #90)
- Excluded the root
modelssymlink (#88) and AI assistant temporary directories from.gitignore.
Technical Details
- New model coverage: MiniCPM-V 4.6 on the VLM path and RT-DETRv2 for object detection, the latter wired through a dedicated
mlxcel detectsubcommand. - The server now speaks the Anthropic Messages API (
/v1/messages) in addition to the OpenAI-compatible chat and responses endpoints. - Correctness work this cycle targets multi-turn tool calling (missing or
nullcontent), Gemma 3n VLM per-sequence state isolation under the batch scheduler, Qwen3.5 MTP speculative-decode parity, and batched quantized KV cache masking.
Dependencies
Breaking Changes
None.
Known Issues
- Gemma 3n VLM checkpoints (
gemma3n-e2b-4bit,gemma3n-e4b-4bit,gemma3n-e4b-bf16) still fail on the M5 Max VLM path with a[broadcast_shapes]mismatch. The text-only path on the same checkpoints is unaffected. - The qwen3-0.6b decode throughput still trails the mlx-lm baseline; the gap is flagged in the M5 Max benchmark notes.
What's Changed
- docs: document MLXCEL_CAPTURE_DECODE and clarify headroom wording by @inureyes in #72
- feat(server): add Anthropic-style /v1/messages API endpoint by @inureyes in #74
- fix(cache): correct mask offset for batched quantized KV caches by @inureyes in #76
- fix(speculative): per-position verify attention for Qwen3.5 MTP parity by @inureyes in #78
- feat(models): add RT-DETRv2 object detection model by @inureyes in #80
- feat(models): add MiniCPM-V 4.6 VLM support by @inureyes in #82
- fix: harden MiniCPM-V 4.6 image grid handling by @inureyes in #83
- deps(deps): bump the minor-and-patch group with 2 updates by @dependabot[bot] in #84
- fix(vision): make Gemma 3n VLM per_layer_inputs sequence-aware by @inureyes in #86
- chore: pin Rust toolchain to 1.95.0 by @inureyes in #87
- chore: ignore root models symlink in .gitignore by @inureyes in #88
- chore: pin Rust toolchain to 1.93.1 by @inureyes in #90
- fix(server): tolerate missing/null content in chat messages by @inureyes in #91
Full Changelog: v0.0.30...v0.0.31
v0.0.30
New Features
- Unified pre-load memory estimator.
mlxcel inspectis a new read-only subcommand that prints a byte-level breakdown of model weights, KV cache, and runtime headroom against available unified memory without loading any tensors.--estimate-memoryonmlxcel generateandmlxcel serveruns the same estimator as a preflight and aborts when the model will not fit; pass--force(alias--no-memory-check) to override, and setMLXCEL_MEMORY_LIMIT=NGBto tighten the available figure to a soft cap. (#67) - Exact weight footprint from the safetensors header, parsed without materializing tensors so the estimator works from real per-dtype byte counts. (#64)
- KV cache memory estimator with 256-token rounding that matches the runtime's pre-allocation steps. (#65)
- MLX runtime memory API bindings that expose the active, peak, and limit byte counters through FFI. (#66)
- Molmo v1 (molmo-7b) VLM architecture. (#41)
- InternVL (internvl_chat) VLM architecture. (#37)
Improvements
- Gemma 3n bf16 decode: reduced AltUp/MLP graph overhead (#60) and improved M5 decode bandwidth with pretransposed weights (#62).
- Phi-3.5 SuScaledRoPE decode speedup. (#42)
- Gemma dense GeGLU aligned with the mlx-lm reference for faster decode. (#43)
- Jamba hybrid decode speedup. (#44)
- Server parallel context sizing:
--ctx-sizeis now treated as a total context budget shared across active request slots, matching llama.cpp server semantics.--parallel N --ctx-size Cyields an effective per-slot window offloor(C / N),/slotsreports the per-slot window, and startup rejects per-slot windows below 512 tokens. (#57)
Bug Fixes
- CLI boolean cache flags are now validated, and CLI flags correctly take precedence over their environment-variable equivalents. (#70)
- Prompt cache radix trie is now iterative, preventing a stack overflow on deep prompt prefixes. (#63)
- Gemma 3n gates the bf16 fused decode path off the M5 Neural Accelerator so output stays correct on that hardware. (#61)
- CUDA Hopper builds append the
90aarchitecture suffix for auto-detect and fallback builds. (#51) - VLM server image decoding is hardened to skip invalid entries instead of failing the whole request. (#50)
- Qwen2-VL image placeholder is expanded to the full grid count. (#39)
- Tightened the memory estimator preflight coverage so the abort path is exercised across
generateandserve.
CI/CD Improvements
None.
Technical Details
- The memory estimator composes three layers: static weight bytes from the safetensors header, a KV cache sized to the context window with 256-token rounding, and a runtime headroom factor that defaults to
1.20x. The total is checked against available unified memory and the optionalMLXCEL_MEMORY_LIMITsoft cap using live MLX runtime memory counters. - Docs: refreshed the M1 Ultra and README decode benchmark figures for the Molmo / Phi-3.5 / Gemma / Jamba / InternVL work (#45, #46), corrected the M5 Max baichuan-m1-14b decode comparison (#49), and dropped change-cause notes from the result tables to keep them current-state only (#47, #48).
- Tests: added a qwen2.5-vl-3b-4bit warmup regression guard. (#38)
Dependencies
None. The only Cargo.lock change is the mlxcel package version bump.
Breaking Changes
- Server
--ctx-sizesemantics changed under--parallel. Previously each request slot received the full--ctx-sizewindow; it is now a shared budget, so--parallel N --ctx-size Cgives each slotfloor(C / N)tokens. Deployments that relied on the old per-slot sizing should raise--ctx-sizeaccordingly. Startup now rejects per-slot windows below 512 tokens. (#57)
Known Issues
- The qwen3-0.6b decode throughput still trails the mlx-lm baseline; the gap is flagged in the M5 Max benchmark notes. (#49)
What's Changed
- feat(models): add InternVL (internvl_chat) VLM architecture by @inureyes in #37
- test(vision): add qwen2.5-vl-3b-4bit warmup regression guard by @inureyes in #38
- fix(vision): expand Qwen2-VL image placeholder to grid count by @inureyes in #39
- feat(models): add Molmo v1 (molmo-7b) VLM architecture by @inureyes in #41
- fix(perf): speed up Phi-3.5 SuScaledRoPE decode by @inureyes in #42
- fix(perf): align Gemma dense GeGLU with mlx-lm by @inureyes in #43
- fix(perf): speed up Jamba hybrid decode by @inureyes in #44
- docs(benchmarks): refresh M1 Ultra results for Molmo / Phi-3.5 / Gemma / Jamba / InternVL by @inureyes in #45
- docs(readme): refresh decode benchmark figures after the Gemma / Phi-3.5 / Jamba work by @inureyes in #46
- docs(benchmarks): drop change-cause notes from the result tables by @inureyes in #47
- docs(benchmarks): drop legacy change-cause notes from the M1 Ultra table by @inureyes in #48
- docs(benchmarks): correct the M5 Max baichuan-m1-14b decode comparison and flag qwen3-0.6b by @inureyes in #49
- fix: harden VLM server image decoding by @inureyes in #50
- fix(cuda): append 90a arch suffix for Hopper auto-detect/fallback builds by @inureyes in #51
- perf(gemma3n): reduce bf16 decode AltUp/MLP graph overhead by @inureyes in #60
- fix(gemma3n): gate bf16 fused decode path off M5 Neural Accelerator by @inureyes in #61
- perf(gemma3n): improve M5 decode bandwidth with pretransposed weights by @inureyes in #62
- fix(prompt_cache): make radix trie iterative to prevent stack overflow by @inureyes in #63
- feat(ffi): wrap MLX runtime memory APIs (active/peak/limit) by @inureyes in #66
- feat(weights): add exact weight footprint from safetensors header by @inureyes in #64
- feat(hardware): add KV cache memory estimator with 256-token rounding by @inureyes in #65
- feat: unified memory estimator with mlxcel inspect and generate/serve preflight by @inureyes in #67
- fix: tighten memory estimator preflight coverage by @inureyes in #68
- fix: align parallel context sizing with slots by @inureyes in #69
- fix(cli): validate boolean cache flags, fix CLI-over-env precedence by @inureyes in #70
Full Changelog: v0.0.29...v0.0.30