Releases: lablup/mlxcel

v0.3.0

15 Jun 09:59

New Features

  • Nine new model families: BitNet b1.58 (1.58-bit ternary, #252), IBM Granite dense (#254) and GraniteMoeHybrid (Mamba2 plus attention hybrid, #259), LFM2 and LFM2-MoE (#255), Falcon-H1 (Mamba2 plus attention parallel hybrid, #256), PLaMo 2 (Mamba plus attention hybrid, #257) with PlamoTokenizer (#264), Apertus (xIELU, QK-norm, llama3 RoPE scaling, #260), ByteDance Seed-OSS (#261), and dots.llm1 MoE (#263).
  • Configurable allowed-origins for server CORS, replacing the any-origin default when set (#253).
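
The allowed-origins behavior described above can be sketched as a small decision function. This is an illustration only, not mlxcel's code: the function name `cors_allow_origin` and the exact matching rule (exact string match against the configured list) are assumptions.

```python
from typing import Optional, Sequence


def cors_allow_origin(request_origin: Optional[str],
                      allowed_origins: Sequence[str]) -> Optional[str]:
    """Return the Access-Control-Allow-Origin value, or None to omit the header.

    Hypothetical helper: the matching rule is an assumption for illustration.
    """
    if not allowed_origins:
        # Nothing configured: keep the permissive any-origin default.
        return "*"
    if request_origin is not None and request_origin in allowed_origins:
        # Echo back only an origin that was explicitly listed.
        return request_origin
    # Configured, but this origin is not on the list: send no CORS header.
    return None
```

With an empty list the function keeps the any-origin default; once a list is set, only listed origins are echoed back.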

Improvements

  • mlxcel run with no model argument now defaults to mlx-community/gemma-4-e2b-it-4bit (was Llama-3.2-3B-Instruct-4bit): a smaller checkpoint that downloads faster and runs in less memory.
  • Cleaned up --help output: multi-line example and value-legend blocks now render one item per line, and the help text now presents mlxcel as a standalone runtime.
  • Fused decode-MoE Metal kernel is now on by default (MLXCEL_FUSED_MOE, set to 0 to disable): faster single-token MoE decode, about 13% on gemma4 (#285).
  • Two-kernel fused decode-MoE that beats gather_qmm, extended to 6-bit and mixed-bit experts for dots.llm1 and wired to qwen3-next / Qwen 3.5 / 3.6 and gemma4 (#274, #275, #276, #278, #279, #281); the squared-ReLU kernel stays behind a dedicated flag (#280).
  • Gate the Mamba2 and nemotron_h per-mixer eval to M5 Max so SSM-hybrid decode is not slowed on other Apple Silicon (#266, #271).
  • CCCL header resolution at runtime handles relative invocations and nodes without the build-machine path, and a persistent PTX kernel cache reuses JIT-compiled kernels across runs (#270).

Bug Fixes

  • Quantized models now stay bf16, fixing a 33-41% M1 Ultra decode regression on bf16-scale checkpoints (qwen3, nemotron, gpt-oss, solar, and others). The blanket bf16-to-f16 quant-scale promotion added with Apertus had created a mismatch between bf16 activations and f16 scales in quantized_matmul / gather_qmm (#290).
  • Infer per-tensor quantization bits for embeddings, so mixed-precision exports whose embedding is stored at a different bit width than the top-level config now load instead of aborting in dequant. For example, diffusiongemma stores its embedding at 8-bit under a 4-bit default (#292).
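
One way a per-tensor bit width can be recovered from tensor shapes alone is sketched below. It assumes MLX-style affine quantization, where each row packs `in_features * bits / 32` 32-bit words and carries `in_features / group_size` scales. Whether mlxcel infers the width this way is an assumption; `infer_bits` is a hypothetical name.

```python
def infer_bits(packed_cols: int, scale_cols: int, group_size: int) -> int:
    """Infer the bit width of one quantized tensor from its shapes.

    Assumes MLX-style packing into 32-bit words with one scale per group
    along the input axis. Hypothetical helper for illustration.
    """
    in_features = scale_cols * group_size   # one scale per group of inputs
    total_bits = packed_cols * 32           # bits stored per row
    if in_features == 0 or total_bits % in_features != 0:
        raise ValueError("shapes are inconsistent with a whole bit width")
    return total_bits // in_features
```

For a hidden size of 2048 with group size 64 (32 scale columns), a row of 512 packed words corresponds to 8-bit storage and a row of 256 words to 4-bit storage, which is the kind of mismatch the fix tolerates.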

CI/CD Improvements

  • Linux x86_64 and aarch64 CUDA release build jobs with bundled CCCL headers and GPU smoke tests, producing prebuilt CUDA artifacts (#262).

Technical Details

  • Split mlx_cxx_bridge.cpp into domain-specific translation units (#277).
  • Refreshed the M1 Ultra and M5 Max benchmark results for the 0.3.0 sweep (#295, #296).

Dependencies

  • Bumped the minor-and-patch dependency group (#288).

Breaking Changes

None. The fused decode-MoE kernel changes the default decode path, but greedy output is unchanged; set MLXCEL_FUSED_MOE=0 to restore the previous path.

Known Issues

  • bf16 exports of very large MoE models do not fit a 128GB host (for example Hunyuan-A13B at about 160GB). Use the 4-bit variant, which runs at about 44 tok/s on M1 Ultra and 64 tok/s on M5 Max.
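
The size figure above follows from simple weight-storage arithmetic, sketched here. The total parameter count of roughly 80 billion and the effective 4.5 bits per weight for a 4-bit export (4 bits plus a 16-bit scale and bias per group of 64) are assumptions not stated in these notes; activations and KV cache are ignored.

```python
def weight_gigabytes(n_params: int, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10**9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumed parameter count for illustration; not taken from the release notes.
ASSUMED_PARAMS = 80_000_000_000

bf16_gb = weight_gigabytes(ASSUMED_PARAMS, 16)     # 16 bits per weight
four_bit_gb = weight_gigabytes(ASSUMED_PARAMS, 4.5)  # 4 bits + per-group overhead
```

Under these assumptions the bf16 weights alone exceed a 128GB host, while the 4-bit weights fit with room left for the KV cache.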

What's Changed

  • feat(server): add configurable allowed-origins for CORS by @inureyes in #253
  • feat(models): add IBM Granite dense (granite) by @inureyes in #254
  • feat(models): add LFM2 and LFM2-MoE (lfm2, lfm2_moe) by @inureyes in #255
  • feat(models): add Falcon-H1 (Mamba2 + attention parallel hybrid) by @inureyes in #256
  • feat(models): add PLaMo 2 (plamo2, Mamba + attention hybrid) by @inureyes in #257
  • feat(models): add GraniteMoeHybrid (Granite 4.x Mamba2 + attention hybrid) by @inureyes in #259
  • feat(models): add Apertus (xIELU, QK-norm, llama3 RoPE scaling) by @inureyes in #260
  • feat(models): add ByteDance Seed-OSS (seed_oss) by @inureyes in #261
  • feat(models): add dots.llm1 (dots1) MoE by @inureyes in #263
  • feat(tokenizer): support PLaMo PlamoTokenizer (tokenizer.jsonl Unigram) by @inureyes in #264
  • perf(models): gate Mamba2 mixer eval boundary to M5 Max by @inureyes in #266
  • feat(ci): add Linux x86_64 + aarch64 CUDA release builds with bundled CCCL by @inureyes in #262
  • perf(models): gate nemotron_h Mamba2 per-mixer eval to M5 Max by @inureyes in #271
  • feat(cuda): robust CCCL resolution, first-run JIT notice, and persistent kernel cache by @inureyes in #270
  • docs(benchmarks): MoE decode gap investigation (#268) by @inureyes in #273
  • perf(moe): fused decode-MoE kernel foundation (#268) by @inureyes in #274
  • perf(moe): fused MoE expert decode kernel, correctness-validated (#268 step 2a) by @inureyes in #275
  • perf(moe): two-kernel decode-MoE that beats gather_qmm (#268 step 2b) by @inureyes in #276
  • refactor(core): split mlx_cxx_bridge.cpp by domain; bump mlxcel-core 0.2.0 by @inureyes in #277
  • perf(moe): 6-bit/mixed-bit fused decode-MoE; wire dots.llm1 (#268 step 2c) by @inureyes in #278
  • perf(moe): wire qwen3_next fused decode-MoE (qwen3.5/3.6) (#268 step 3a) by @inureyes in #279
  • perf(moe): preserve squared-ReLU fused MoE kernel behind a dedicated flag (#268) by @inureyes in #280
  • perf(moe): GeGLU fused decode-MoE; wire gemma4 (+13%) (#268 step 3b) by @inureyes in #281
  • docs(moe): document MLXCEL_FUSED_MOE flags and per-model gains (#268) by @inureyes in #283
  • perf(moe): default-on MLXCEL_FUSED_MOE, validated on M5 (#282) by @inureyes in #285
  • feat(models): add BitNet b1.58 (1.58-bit ternary) support (#252) by @inureyes in #287
  • perf(models): keep quantized models bf16 to fix M1 Ultra decode regression (#289) by @inureyes in #290
  • docs(benchmarks): refresh M5 Max results for the 0.2.1 full sweep by @inureyes in #293
  • deps(deps): bump the minor-and-patch group with 2 updates by @dependabot[bot] in #288
  • fix(models): infer per-tensor bits for quantized embeddings (#291) by @inureyes in #292
  • docs(benchmarks): refresh M1 Ultra + M5 Max for mlxcel 0.2.1 by @inureyes in #295
  • docs(benchmarks): correct M5 Max #294 entries; both failures were environmental by @inureyes in #296

Full Changelog: v0.2.1...v0.3.0

v0.2.1

13 Jun 01:35

New Features

  • Exact-prefix prompt-cache snapshots now cover model-owned recurrent and mixed-cache families: Mamba, Mamba2, Jamba, Nemotron-H, Qwen 3.5 / 3.6 text, MoE, and VLM wrappers (#241).
  • Gemma 4 text, VLM, and Unified wrappers now donate and restore exact-prefix prompt-cache snapshots for model-owned standard and rotating caches (#243).

Improvements

  • mlxcel serve --help and mlxcel-server --help now describe disaggregated peer roles consistently for --prefill-peers, --decode-peers, and --serving-bind.
  • Documentation now lists Gemma 4 snapshot-cache support and the MLXCEL_KV_CACHE_BUDGET and MLXCEL_ENABLE_VLM_PREFIX_CACHE environment variables.

Bug Fixes

None

CI/CD Improvements

None

Technical Details

  • Gemma 4 snapshot restore preserves rotating-cache metadata, including write index, window size, buffered state, and seed.
  • Real checkpoint validation on models/gemma-4-26b-a4b-it-4bit inserted a 10,568,520-byte snapshot with snapshot_rejections_oversized=0.

Dependencies

None

Breaking Changes

None

Known Issues

None

What's Changed

  • feat: add exact-prefix snapshot prompt cache by @inureyes in #241
  • feat: add Gemma 4 snapshot prompt-cache reuse by @inureyes in #243

Full Changelog: v0.2.0...v0.2.1

v0.2.0

12 Jun 15:41

mlxcel v0.2.0 lands the unified paged KV cache as a live serving path and adds disaggregated prefill/decode/router serving, the DiffusionGemma block-diffusion model, and a per-hardware speculative-decode default. Changes are measured against v0.1.4.

New Features

  • Unified paged KV cache, live in the batching server (epic #116). Prefix reuse and paged block storage now operate together: a concurrent shared prefix is stored once with reference counting and copy-on-write, so a second request that shares a prefix adopts the existing blocks and re-prefills only its divergent suffix. Pool-backed decode is byte-identical to the prior dense path on qwen3 and llama3 across single, batched, and prefix-share cases (#152, #167, #168).
  • Disaggregated serving: prefill, decode, and router roles over TCP. mlxcel-server --node-role {prefill,decode,router} with --serving-bind, --prefill-peers, and --decode-peers splits the three roles across processes. A model-free router fronts HTTP, prefill hands the sequence to decode, and the router merges the token stream back to the client. A 3-process run is byte-identical to a single hybrid node (#185, #187, #188, #189, #190, #191, #192, #193).
  • DiffusionGemma block-diffusion model: text generation, image input, and mlxcel-server serving. Temperature-0 output is byte-identical across the MLX bump (#217, #218, #219, #220).
  • Qwen3-Coder XML tool-call parsing, surfaced as OpenAI tool_calls (#206).
  • --kv-cache-budget <BYTES|auto> knob (env MLXCEL_KV_CACHE_BUDGET) caps the paged KV pool with an admission gate and cold-prefix eviction, with usage exposed at GET /v1/cache/stats and on /metrics. Opt-in, unbounded by default (#174, #175, #176, #178).
  • Architecture-aware KV-cache memory estimation for mlxcel inspect and the --estimate-memory preflight, plus an explicit activation term. Sliding-window, MLA, hybrid, and pure-SSM models no longer use a flat formula that was off by about 100x for Gemma, DeepSeek, and Mamba (#172, #173).
  • Opt-in VLM prompt-prefix cache sharing for multi-turn same-image conversations, behind --enable-vlm-prefix-cache, verified byte-identical to a cold prefill on qwen2-vl-2b (#182, #184).
  • Fused paged-attention decode Metal kernel (split-K flash-decoding), built and numerically correct but gated off behind MLXCEL_PAGED_ATTENTION_NATIVE because it does not beat MLX gather-then-SDPA at long context on Apple Silicon (#181).
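
The shared-prefix behavior of the unified paged KV cache (store once, reference-count, copy-on-write) can be illustrated with a toy block pool. This is a sketch, not mlxcel's data structure: token ids stand in for K/V tensors, and `BlockPool` and `prefill` are hypothetical names.

```python
class BlockPool:
    """Toy reference-counted block pool; token ids stand in for K/V data."""

    def __init__(self, block_size: int):
        self.block_size = block_size
        self.blocks = {}     # block id -> list of tokens
        self.refcount = {}   # block id -> number of sequences using it
        self._next_id = 0

    def alloc(self, tokens):
        bid = self._next_id
        self._next_id += 1
        self.blocks[bid] = list(tokens)
        self.refcount[bid] = 1
        return bid

    def share(self, bid):
        self.refcount[bid] += 1
        return bid

    def write(self, table, idx, offset, token):
        """Overwrite one slot, copying the block first if it is shared."""
        bid = table[idx]
        if self.refcount[bid] > 1:
            self.refcount[bid] -= 1
            bid = self.alloc(self.blocks[bid])  # private copy for this writer
            table[idx] = bid
        self.blocks[bid][offset] = token


def prefill(pool, tokens, donor_tokens=None, donor_table=None):
    """Build a block table, adopting full blocks that match a donor sequence.

    Returns (block_table, number_of_tokens_reused).
    """
    bs = pool.block_size
    table, shared = [], 0
    if donor_tokens is not None:
        limit = min(len(tokens), len(donor_tokens))
        while (shared + bs <= limit
               and tokens[shared:shared + bs] == donor_tokens[shared:shared + bs]):
            table.append(pool.share(donor_table[shared // bs]))
            shared += bs
    # Only the divergent suffix gets fresh blocks.
    for start in range(shared, len(tokens), bs):
        table.append(pool.alloc(tokens[start:start + bs]))
    return table, shared
```

A second request that shares two full blocks with the first adopts those blocks and allocates only its own suffix; a later write into an adopted block triggers a private copy, leaving the donor's data untouched.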

Improvements

  • Automatic Prefix Caching is now enabled by default; the output is unchanged (#233).
  • The prompt-prefix KV cache now serves the Anthropic /v1/messages and OpenAI Responses /v1/responses endpoints, not just /v1/chat/completions and /v1/completions (#240).
  • The B=1 MTP speculative-burst default is now chosen per hardware. Batch-capable targets (such as Gemma 4 31B) default on only on M5-class hardware with a neural accelerator, since they regressed 0.75x to 0.96x on M1 Ultra while gaining 1.2x to 1.4x on M5; non-batchable targets stay always-on; MLXCEL_ENABLE_MTP_B1 overrides either way (#216).
  • Partially matched paged prefixes are adopted instead of declined, and paged adoption clones and pins the shared blocks rather than consuming them (#230, #232).
  • Chunked slab storage for the paged pool, presized per prefill span with eager slab eval (#237, #229).
  • Stream decode continuation tokens one frame at a time from the disaggregated decode role (#214).
  • Hardened the ragged B>1 MTP batching masks and verify tail so variable-length prompts in one burst keep greedy parity (#202).
  • Vendored MLX bumped to upstream main (2026-06-11); the steel GEMM overlay was retired now that the fix is upstream (#223).
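
The per-hardware B=1 MTP default described in this list reduces to a three-way decision, sketched below. The function and parameter names are hypothetical, and treating any override value other than "0" as "on" is an assumption; only the values 0 and 1 appear in these notes.

```python
from typing import Optional


def mtp_b1_enabled(batch_capable: bool, m5_class: bool,
                   env_override: Optional[str] = None) -> bool:
    """Decide whether the B=1 MTP speculative burst runs by default.

    env_override models the value of MLXCEL_ENABLE_MTP_B1 (None when unset).
    """
    if env_override is not None:
        return env_override != "0"   # an explicit override wins either way
    if not batch_capable:
        return True                  # non-batchable targets stay always-on
    return m5_class                  # batch-capable: on only for M5-class hardware
```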

Bug Fixes

  • Per-row position holes broke B>1 batched MTP greedy parity after divergent accepts. The surviving K/V is now compacted to each row's accepted end with per-row RoPE and a precise mask, so a divergent round no longer shifts later rows off their true positions (#211).
  • Guard the empty-batch paged-decode fallbacks against a drain(..1) panic, and use absolute block indexing in append, trim, restore, and serde validation so a logical_start > 0 write addresses the correct block (#215).
  • Support chunked-prefill prompts in the disaggregated serving handoff, driving start and continue-chunked to completion with a 1M-token admission cap and pool release on extract error (#213).
  • Apply the chat stream filter to disaggregated router output so reasoning-content splitting and structural-token cleanup match the single-node path (#212).
  • Finish a chunked prefill when the first chunk already reaches the prompt end (#179).
  • Release paged KV block pins on prompt-cache evict or decline, closing a pre-existing leak that left the origin allocation pinned at reference count 1 (#170).
  • Account real paged pool bytes in the prompt-cache ledger and /v1/cache/stats instead of a nominal placeholder (#231).
  • Enforce the pack3 size contracts in release builds so a mis-sized packed buffer fails fast instead of corrupting silently (#236).
  • Render assistant tool_calls.arguments as a JSON object rather than a string on multi-turn requests (#210).
  • Render the request's tools into the prompt so templates that inspect the tool list receive the real definitions (#207).
  • Expand bare model names to the default org in the download subcommand, matching the other -m consumers (#177).
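
The tool_calls.arguments fix above concerns a format mismatch: in OpenAI-style chat requests, `function.arguments` arrives as a JSON-encoded string, while a chat template typically wants a structured object. A minimal sketch of that normalization follows; the helper name is hypothetical and this is not mlxcel's implementation.

```python
import copy
import json


def arguments_as_objects(messages):
    """Return a copy of messages with tool-call arguments decoded to objects."""
    out = copy.deepcopy(messages)
    for msg in out:
        for call in msg.get("tool_calls") or []:
            fn = call.get("function", {})
            args = fn.get("arguments")
            if isinstance(args, str):
                try:
                    fn["arguments"] = json.loads(args)
                except json.JSONDecodeError:
                    pass  # leave malformed strings untouched
    return out
```

The input list is not mutated, so the original request can still be logged or forwarded as received.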

CI/CD Improvements

  • None.

Technical Details

  • Hardened the paged KV handoff deserialization boundary: frame-size cap, block-geometry anchor, per-layer consistency check, and empty-sequence rejection, so a malformed handoff payload from a peer cannot drive an out-of-bounds read or an unbounded allocation. A restore that fails partway now releases the blocks it already took instead of leaking them (#186).
  • Extended the paged KV cache scheduler and prefix-share parity suites to llama3 alongside qwen3, all byte-identical (#169).
  • Added hybrid-SSM cache carve-out tests and multimodal-digest plumbing so SSM and VLM families stay correctly excluded from or included in block sharing (#182).
  • New docs/CONTINUOUS_BATCHING.md covering continuous batching, paged decode, and the disaggregated prefill/decode/router topology, plus an expanded unified-cache section in docs/turbo-kv-cache.md (#194).
  • Recorded upstream attribution for ported third-party code (#238).
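
The frame-size cap mentioned in the handoff hardening can be illustrated with a length-prefixed reader that validates the declared size before reading the payload. The wire format (4-byte big-endian length), the cap value, and all names here are assumptions for illustration; the actual handoff format is not described in these notes.

```python
import io
import struct

MAX_FRAME_BYTES = 64 * 1024 * 1024  # hypothetical cap, chosen for the example


class FrameError(ValueError):
    """Raised for any frame that fails validation."""


def read_frame(stream) -> bytes:
    """Read one length-prefixed frame, rejecting bad sizes up front."""
    header = stream.read(4)
    if len(header) != 4:
        raise FrameError("truncated header")
    (length,) = struct.unpack(">I", header)
    if length == 0:
        raise FrameError("empty frame rejected")
    if length > MAX_FRAME_BYTES:
        # Checked before reading, so a peer cannot force a huge allocation.
        raise FrameError("frame exceeds size cap")
    payload = stream.read(length)
    if len(payload) != length:
        raise FrameError("truncated payload")
    return payload
```

The key design point is ordering: the declared length is treated as untrusted input and checked before it drives any read or allocation.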

Dependencies

  • Bumped the minor-and-patch dependency group with 3 updates (#180).
  • Vendored MLX pinned to upstream main, 2026-06-11 (#223).

Breaking Changes

  • None.

Known Issues

  • None.

What's Changed

  • feat: transparent pool-backed KVCache + real-model paged parity test by @inureyes in #152
  • feat: back paged scheduler sequences with the shared KV block pool by @inureyes in #167
  • feat: unify radix prompt cache with paged block pool (sub-step b of #121) by @inureyes in #168
  • test: extend paged KV cache parity to llama3 by @inureyes in #169
  • fix: release paged KV block pins on prompt-cache evict/decline by @inureyes in #170
  • feat: architecture-aware KV-cache memory estimation by @inureyes in #172
  • feat: explicit activation term in memory estimate by @inureyes in #173
  • feat: opt-in block budget for the paged KV pool by @inureyes in #174
  • feat: paged KV block-budget admission gate and eviction by @inureyes in #175
  • feat: wire paged KV block budget to a --kv-cache-budget knob by @inureyes in #176
  • feat: surface paged KV block budget in cache stats and metrics by @inureyes in #178
  • feat: add fused paged-attention decode Metal kernel (gated off) by @inureyes in #181
  • feat: hybrid-SSM cache carve-out tests and multimodal digest plumbing by @inureyes in #182
  • feat: opt-in VLM prompt-prefix cache sharing for multi-turn images by @inureyes in #184
  • feat: serialize paged KV block contents for node handoff by @inureyes in #185
  • feat: harden paged KV handoff deserialization boundary by @inureyes in #186
  • feat: in-process paged KV serving-role handoff mechanism by @inureyes in #187
  • deps(deps): bump the minor-and-patch group with 3 updates by @dependabot[bot] in #180
  • fix: expand bare model names in the download subcommand by @inureyes in #177
  • feat: disaggregated serving-mode plumbing and coordinator skeleton by @inureyes in #188
  • fix: finish chunked prefill when first chunk reaches prompt end (#179) by @ujwal-setlur in #183
  • feat: serving-role KV handoff scheduler entries (B2b) by @inureyes in #189
  • feat: disaggregated serving-role loops over a real TCP transport (B3a) by @inureyes in #190
  • feat: serving-role worker-flip mechanism + disaggregated peer CLI (B3b1) by @inureyes in #191
  • feat: live 2-process disaggregated serving handoff (B3b2a) by @inureyes in #192
  • feat: dedicated disaggregated serving router front (B3b2b) by @inureyes in #193
  • docs: unified paged KV cache + disaggregated serving (B3c) by @inureyes in #194
    *...

v0.1.4

05 Jun 03:40

Changes since v0.1.3.

New Features

  • Gemma 4 Unified (gemma4_unified) multimodal architecture (#153, closes #151).
  • Gemma 4 Unified MTP speculative drafter (gemma4_unified_assistant) (#157, closes #158). The Gemma 4 Unified decode target routes through the existing MTP speculative burst dispatch, reusing the MTP drafter and round loop unchanged. The drafter's pre/post projections load through the quantization-aware UnifiedLinear, so a 4-bit assistant such as gemma-4-12B-it-assistant-4bit no longer crashes at forward time with a matmul shape mismatch. On the gemma-4-12b-it-4bit target plus the 4-bit assistant, temperature-0 output is byte-identical to classic decode at about 1.87x decode speedup (39 to 74 tok/s).
  • Variable-length prompts in B>1 batched MTP bursts (#162, closes #161), behind the new MLXCEL_ENABLE_MTP_BATCH_RAGGED opt-in (subordinate to MLXCEL_ENABLE_MTP_BATCH). Rows of different prompt lengths join a single burst via per-row left-padding and a windowed left-padding causal mask; greedy parity holds because every token in a row is shifted by the same constant left-padding offset. Eligibility is limited to max_prompt_len <= sliding_window, and out-of-regime windows fall back to per-row B=1 service. Off by default (measured 0.94x to 1.13x on the 31B), so the production path is byte-for-byte unchanged.
  • Unified paged KV cache foundation (epic #116), landed in four additive phases:
    • Phase 0: decode-time page-gather microbench and ADR 0001, selecting the [num_blocks, block_size, n_kv_heads, head_dim] pool layout (about 2.1x faster on gather-then-SDPA than the head-split layout) and the gather-then-SDPA strategy (#145, closes #117).
    • Phase 1: physical block-pool K/V tensor storage in PagedBlockPool, lazily allocated per layer with write_block / gather_visible primitives (#148, closes #118).
    • Phase 2: pooled paged-decode read path over real, possibly fragmented block tables, bit-identical to the dense fallback over 200 steps (#149, closes #119).
    • Phase 3: paged prefill writer with shared-prefix copy-on-write, so a suffix write after a shared prefix allocates only the divergent blocks (#150, closes #120).
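
The windowed left-padding causal mask used for ragged MTP bursts can be sketched in plain Python. This is an illustration under assumed conventions (boolean mask, True means "may attend"); the function name is hypothetical.

```python
def left_padded_mask(prompt_len: int, max_len: int, window: int):
    """Build mask[q][k] for one row that is left-padded to max_len.

    A query at position q may attend to key k when k is a real token
    (k >= pad), k is not in the future (k <= q), and k lies inside the
    sliding window (q - k < window).
    """
    pad = max_len - prompt_len
    return [
        [k >= pad and k <= q and q - k < window for k in range(max_len)]
        for q in range(max_len)
    ]
```

Because every token in a row is shifted by the same `pad`, the distance `q - k` between any two real tokens is the same as in the unpadded row, which is why the windowing condition is unaffected by the padding. Rows for the padded query positions themselves come out all False.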

Improvements

  • B=1 MTP speculative decoding now runs by default for every MTP target, including batch-capable ones such as Gemma 4 31B (#159, closes #158). Previously batch-capable targets declined singleton MTP unless MLXCEL_ENABLE_MTP_B1=1 was set, a calibration from an earlier "B=1 is slower" measurement. M5 Max measurement shows B=1 MTP is profitable with byte-identical output at temperature 0: about 1.2x to 1.4x on the 31B plus bf16 assistant and about 1.87x on the 12B Unified pair. Opt out with MLXCEL_ENABLE_MTP_B1=0.

Bug Fixes

  • Quantized fused MoE experts in the gemma4_unified loader are now split correctly (#156). The fused-expert split in sanitize_gemma4_unified_weights only matched the bare non-quantized .weight, so a quantized MoE checkpoint's .weight / .scales / .biases legs fell through unsplit and switch_glu construction could not find its per-projection quantized parts. The split now matches each quantized component leg and slices it on the output (doubled-FFN) axis at the same half boundary, with a dequantize-equivalence test proving no group straddling.
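
The split described above can be pictured as slicing every quantized leg at the same row boundary. The sketch below uses nested lists as stand-ins for tensors; the function name is hypothetical, and which half corresponds to which projection is deliberately left unnamed because these notes do not say.

```python
def split_fused_expert(legs):
    """Split each quantized leg in half along the output (row) axis.

    legs maps a leg name ("weight", "scales", "biases") to a list of rows.
    All legs are cut at the same half boundary.
    """
    first, second = {}, {}
    for name, rows in legs.items():
        if len(rows) % 2 != 0:
            raise ValueError(f"{name}: odd number of output rows")
        half = len(rows) // 2
        first[name], second[name] = rows[:half], rows[half:]
    return first, second
```

In MLX-style quantization the groups run along the input (last) axis, so a cut between output rows never lands inside a group; that is the property the dequantize-equivalence test checks.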

CI/CD Improvements

None.

Technical Details

  • The four unified paged KV cache phases (#145, #148, #149, #150) are additive machinery exercised by tests. The live decode path stays byte-for-byte unchanged until the scheduler wiring lands in a later phase of epic #116.
  • Recorded the measured Gemma 4 31B B>1 batched MTP numbers and aligned the related code comments (#160).

Dependencies

  • Bumped the minor-and-patch group: uuid 1.23.1 to 1.23.2 and hyper 1.9.0 to 1.10.1 (#147).

Breaking Changes

None.

Known Issues

None.

What's Changed

  • perf: page-gather decode microbench and paged-attention ADR (#117) by @inureyes in #145
  • feat: add physical block-pool K/V tensor storage to PagedBlockPool by @inureyes in #148
  • feat: pooled paged decode read path over real block tables (#119) by @inureyes in #149
  • feat: paged prefill writer with shared-prefix copy-on-write by @inureyes in #150
  • feat: add Gemma 4 Unified (gemma4_unified) multimodal architecture by @inureyes in #153
  • deps(deps): bump the minor-and-patch group with 2 updates by @dependabot[bot] in #147
  • fix: split quantized fused MoE experts in gemma4_unified sanitize by @inureyes in #156
  • feat: Gemma 4 Unified MTP speculative drafter (gemma4_unified_assistant) by @inureyes in #157
  • perf: run B=1 MTP by default for all targets (incl. batch-capable 31B) by @inureyes in #159
  • docs: record measured 31B B>1 batched MTP numbers; align comments by @inureyes in #160
  • perf: support variable-length prompts in B>1 batched MTP bursts by @inureyes in #162

Full Changelog: v0.1.3...v0.1.4

v0.1.3

30 May 13:08

Changes since v0.1.2.

New Features

  • mlxcel arch (alias mlxcel supported) prints the supported model-architecture catalog, separating it from the local-model listing (#138).
  • mlxcel list gained scripting and machine-readable modes: --json (a stable {repo_id, size_bytes, path, modified} array), -q / --quiet (repo-ids only, for mlxcel list -q | xargs -n1 mlxcel rm), -v / --verbose (restores the absolute PATH column), and --sort name|size|modified (#141).
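
A script consuming the --json output might look like the sketch below. It relies only on the field names listed above (repo_id, size_bytes, path, modified); the helper name is hypothetical, and the representation of `modified` is not specified in these notes, so the sketch does not interpret it.

```python
import json


def largest_models(list_json: str, n: int = 3):
    """Return (repo_id, size_bytes) for the n largest entries.

    list_json is the text of a JSON array of objects with at least
    repo_id and size_bytes keys.
    """
    entries = json.loads(list_json)
    entries.sort(key=lambda e: e["size_bytes"], reverse=True)
    return [(e["repo_id"], e["size_bytes"]) for e in entries[:n]]
```

In practice the input string would come from capturing the output of mlxcel list --json, for example via the standard subprocess module.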

Improvements

  • mlxcel list default table redesigned to NAME / SIZE / MODIFIED, dropping the absolute PATH from the default view. MODIFIED shows a relative time from the snapshot directory mtime ("just now", "2 days ago", "3 weeks ago"); the header contracts $HOME to ~ and dims secondary columns on a TTY, respecting NO_COLOR. Long repo-ids are ellipsized so one outlier does not break column alignment (#141).
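
The relative MODIFIED column can be sketched as a pure function of the modification time and an injected current time, which keeps the output deterministic in tests. The unit thresholds below are assumptions; only the example strings ("just now", "2 days ago", "3 weeks ago") come from the notes.

```python
from datetime import datetime, timedelta


def relative_time(modified: datetime, now: datetime) -> str:
    """Format the age of a snapshot as a short relative string."""
    seconds = int((now - modified).total_seconds())
    if seconds < 60:
        return "just now"  # also covers timestamps slightly in the future
    units = (("week", 604800), ("day", 86400), ("hour", 3600), ("minute", 60))
    for unit, size in units:
        if seconds >= size:
            n = seconds // size
            return f"{n} {unit}{'' if n == 1 else 's'} ago"
    return "just now"  # unreachable; kept so every path returns a string
```

Passing `now` as an argument rather than reading the clock inside the function is what makes the table output reproducible.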

Bug Fixes

  • Security: bounded chat-template rendering with a minijinja fuel cap of 50M VM instructions, so a pathological model template (for example unbounded for loops) can no longer cause a denial of service. Exhaustion returns a clean OutOfFuel error instead of panicking, and the load-time supports_tools probe degrades to a string heuristic. Audited across 91 templates and 267 scenarios with 0 failures. This is RCE-safe and matters most for multi-tenant deployments (#129, PR #139).
  • The base-model warning no longer presents -it as a universal instruction-tuned suffix. It now names the per-family conventions (Gemma -it; Llama and Qwen2.5 -Instruct; Qwen3 and Qwen3.5 plain name vs. -Base), so running Qwen3.5-0.8B-Base advises dropping -Base instead of pointing at a non-existent -it repo (#137).
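
The idea behind the fuel cap in the security fix above is a step budget that is charged as rendering proceeds and aborts cleanly when exhausted. The sketch below is a generic illustration in Python, not minijinja's API: `Fuel`, `render_loop`, and the one-step-per-iteration cost are all assumptions, and only the `OutOfFuel` name is borrowed from the notes.

```python
class OutOfFuel(Exception):
    """Raised when a render exceeds its step budget."""


class Fuel:
    """A simple countdown budget."""

    def __init__(self, budget: int):
        self.remaining = budget

    def consume(self, steps: int = 1) -> None:
        if steps > self.remaining:
            self.remaining = 0
            raise OutOfFuel("render exceeded its step budget")
        self.remaining -= steps


def render_loop(item: str, count: int, fuel: Fuel) -> str:
    """Stand-in for a template `for` loop: every iteration costs one step."""
    parts = []
    for _ in range(count):
        fuel.consume()
        parts.append(item)
    return "".join(parts)
```

A well-behaved template finishes within budget, while a pathological loop stops with an ordinary error as soon as the budget runs out instead of running unbounded.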

CI/CD Improvements

  • Added a pull-request check (scripts/ci/check_cross_repo_refs.py) that flags unqualified bare #NNN issue/PR references, so cross-repository references are written as org/repo#NNN and caught in review (#144).

Technical Details

  • StoredModel now carries an optional modified timestamp from the snapshot directory mtime; the list renderers take an injected now and use_color, so table output is deterministic and TTY styling is gated outside the render path (#141).
  • Removed leaked internal issue/PR number references from public source comments and docs (#144).

Dependencies

  • minijinja: enabled the fuel feature (version unchanged at 2.20) to bound chat-template render cost (#139).

Breaking Changes

  • mlxcel list (and its ls alias) now lists local downloaded models by default instead of the supported-architecture catalog, and the mlxcel list --local flag was removed (clap rejects it as an unknown argument). Migration: use mlxcel arch for the catalog, and drop --local from any mlxcel list --local invocation, since the bare mlxcel list now does the same thing (#138).

Known Issues

  • None.

What's Changed

  • fix: make base-model warning's instruct advice family-aware by @inureyes in #137
  • fix(security): bound chat-template rendering with minijinja fuel by @inureyes in #139
  • feat!: list local models by default, move catalog to mlxcel arch by @inureyes in #142
  • feat: redesign mlxcel list output (modified, --json/-q/-v, styling) by @inureyes in #143
  • chore: purge leaked mlxcel-internal issue/PR numbers from public source and docs by @inureyes in #144

Full Changelog: v0.1.2...v0.1.3

v0.1.2

28 May 15:28

New Features

None

Improvements

None

Bug Fixes

  • Chat fallback for models without a chat_template no longer collapses into echo loops on base / non-instruction-tuned models. Previously, when tokenizer_config.json shipped no chat_template field and there was no chat_template.jinja, render_prompt called concat_plaintext, a bare content-only concatenation with no role markers. A base completion model then took the most natural continuation of the unstructured prompt and parroted the user's last turn indefinitely (the symptom reported in #133). The implicit "no template found" path now uses a generic User: ... Assistant: ... pseudo-template via concat_userassistant_fallback, with a trailing Assistant: cue (no newline) that nudges the model to produce an assistant turn instead of completing its own prompt with another User: line. The processor.is_none() warning still fires and still recommends the -it Hub counterpart. --no-chat-template is unchanged and remains the documented escape hatch for completion-style raw concatenation, paralleling mlxcel generate --no-chat-template. A template-render failure inside the chat-template path now also falls back to the structured form rather than raw concat, since by then the user is already in chat mode. Unknown roles such as tool are preserved verbatim with the same Role: pattern instead of being silently merged into the prior turn (#133, PR #136).
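
The structured fallback described above can be sketched as follows. The exact separators and capitalization used by mlxcel are not given in these notes, so the newline-joined layout, the `LABELS` mapping, and the function name are assumptions; the trailing cue without a newline and the verbatim handling of unknown roles follow the description.

```python
LABELS = {"user": "User", "assistant": "Assistant"}


def user_assistant_prompt(messages):
    """Render turns as 'Role: content' lines and end with an 'Assistant:' cue."""
    lines = [
        # Unknown roles (for example "tool") keep their own name as the label.
        f"{LABELS.get(m['role'], m['role'])}: {m['content']}"
        for m in messages
    ]
    lines.append("Assistant:")  # trailing cue, deliberately without a newline
    return "\n".join(lines)
```

Ending on the bare cue leaves the model positioned at the start of an assistant turn, which is what discourages a base model from continuing with another user line.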

CI/CD Improvements

None

Technical Details

  • New concat_userassistant_fallback helper in src/commands/chat.rs complements the existing concat_plaintext (which is kept verbatim for the --no-chat-template opt-in).
  • render_prompt now dispatches three paths in priority order: --no-chat-template (raw concat, unchanged), chat template present (template apply, structured-form fallback on render failure), no template + not opted out (new structured fallback).
  • Test coverage added in src/commands/chat_tests.rs: user_assistant_fallback_labels_all_turns_and_cues_assistant, user_assistant_fallback_marks_unknown_roles_instead_of_dropping_them, render_prompt_without_template_uses_user_assistant_fallback, render_prompt_no_chat_template_flag_uses_raw_concatenation. The pre-existing concat_plaintext_joins_turns_with_newlines test is retained and clarified as the --no-chat-template path.

Dependencies

None

Breaking Changes

None

Known Issues

None

Full Changelog: v0.1.1...v0.1.2

v0.1.1

28 May 13:33

New Features

None.

Improvements

  • mlxcel run warning for models without a chat template is now actionable: it states that the model is likely a base / non-instruction-tuned variant, that chat replies will be incoherent or repetitive, suggests trying an -it variant on the Hub (e.g. gemma-4-e4b-it-4bit for gemma-4-e4b-4bit), and documents --no-chat-template for silent raw-text mode and mlxcel generate -p <prompt> for one-shot completion. The explicit --no-chat-template path remains completely silent (#132, PR #134).

Bug Fixes

  • chat_template.jinja is now downloaded alongside the rest of the model snapshot. The downloader allow-list in src/downloader/filters.rs::is_wanted_file only accepted exact-name chat_template (no extension) plus *.json, *.safetensors, *.tiktoken, *.model, and constrained *.txt, but the actual HuggingFace convention is chat_template.jinja. The file was filtered out at download time, leaving the ChatTemplateProcessor::from_model_path chat_template.jinja fallback dead and forcing the REPL into the raw-text path for any model that ships its template as a separate Jinja file (e.g. mlx-community/gemma-4-e4b-it-4bit). is_wanted_file now also accepts *.jinja files; the is_safe_relative_path and is_explicitly_denied guards still run before the allow-list so no new attack surface is opened (#132, PR #134).
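
The allow-list change above can be illustrated with a simplified filter. This sketch is written in Python for illustration and is not the code in src/downloader/filters.rs: it omits the constrained *.txt rule (whose constraint is not described here) and the path-safety guards that run before the allow-list.

```python
from pathlib import PurePosixPath

# Suffixes named in the notes, with ".jinja" being the new addition.
ALLOWED_SUFFIXES = {".json", ".safetensors", ".tiktoken", ".model", ".jinja"}


def is_wanted_file_sketch(relative_path: str) -> bool:
    """Simplified allow-list check on a repo-relative file path."""
    name = PurePosixPath(relative_path).name
    if name == "chat_template":
        return True  # legacy extension-less name
    return PurePosixPath(name).suffix in ALLOWED_SUFFIXES
```

Before the fix, the equivalent of this set lacked ".jinja", so chat_template.jinja was dropped even though the loader had a fallback that looked for it.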

CI/CD Improvements

  • macOS release binaries are now notarized. The release workflow submits signed mlxcel and mlxcel-server to Apple's notary service via rcodesign notary-submit --wait, so Gatekeeper no longer blocks first launch with "developer cannot be verified". Stapling is skipped because bare Mach-O executables do not support stapling, and spctl --assess runs as a soft warn-only check since the notary ticket may still be propagating. This change is paired with several hardening steps: rcodesign verify after signing; set -euo pipefail on the prepare-cert and code-sign steps; openssl pkcs12 stderr surfaced on extraction failure; up-front validation of the APPLE_CERTIFICATE / APPLE_CERTIFICATE_PASSWORD / AC_API_* secrets; chmod 600 on the materialized PEM and API key files; and an always-run cleanup that scrubs signing.pem, original.p12, AuthKey.p8, ac-key.json, and the notarization zip from $RUNNER_TEMP, so self-hosted runners no longer carry an unencrypted Developer ID private key across jobs.
  • Per-target `workflow_dispatch` filter on the release workflow (`targets`: `all` / `macos` / `linux`). Re-uploading a single platform's artifact to an existing release (for example retrofitting notarized macOS binaries onto a release that was cut before notarization landed) no longer rebuilds and replaces the other platforms' bit-different (timestamp-driven) zips, so any sha256 pinned by a downstream consumer remains valid. Release events still build everything; the filter is dispatch-only. Modeled after the per-family `targets` filter in `all-smi`'s release workflow.
  • `actions/checkout` ref pinned to the target release tag in both the macOS and Linux CUDA jobs (release event: `github.event.release.tag_name`; `workflow_dispatch`: `github.event.inputs.release_tag`; otherwise `github.sha`). Without an explicit ref, `actions/checkout` would grab the dispatched ref (which is `main` for `workflow_dispatch`), so re-dispatching a build for an older tag would silently use `main` HEAD's source instead of the tag's source. The workflow YAML itself still runs from the dispatched ref, matching `all-smi`'s self-healing release pattern.
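
The ref-selection rule for `actions/checkout` described above amounts to a three-way choice on the triggering event. The sketch below models it in Python with the event payload as a nested dict; the function name is hypothetical and the real logic lives in the workflow YAML as an expression.

```python
def checkout_ref(event_name: str, event: dict, sha: str) -> str:
    """Pick the git ref to check out for a release build."""
    if event_name == "release":
        return event["release"]["tag_name"]     # github.event.release.tag_name
    if event_name == "workflow_dispatch":
        return event["inputs"]["release_tag"]   # github.event.inputs.release_tag
    return sha                                  # github.sha for anything else
```

The dispatch branch is the one that matters for rebuilding an older tag: without it, the source would come from whatever ref the dispatch ran on.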

Technical Details

  • GB10 (NVIDIA Grace Blackwell) doc refreshed to the 2026-05-28 full sweep on mlxcel 0.1.0 with MLX pin `84961223` and the warm same-process harness (`--cooldown 0`). Adds the recovered `internvl3-1b` and `molmo-7b` text rows and three VLM image-path entries (`qwen2-vl-2b`, `qwen2-vl-2b-4bit`, `qwen3-vl-30b-a3b`). The cross-hardware decode table in `model_tests.md` now reflects the canonical state of each per-hardware doc: GB10 2026-05-28, M1 Ultra 2026-05-28, M5 Max 2026-05-27. The "vs 2026-05-19" delta framing is dropped so the doc reads as a current-state snapshot, and the `Partial (⚠️)` status is collapsed into `Pass (✅)` because the partial-token information already lives in the Notes column. GB10 Overall Status counts: 101 text pass / 8 fail, 38 VLM image-path pass / 0 fail (#131).

Dependencies

None.

Breaking Changes

None.

Known Issues

  • The qwen3-0.6b decode throughput still trails the mlx-lm baseline; the gap is flagged in the M5 Max benchmark notes (carried over from v0.1.0).
  • Gemma 3n VLM checkpoints (`gemma3n-e2b-4bit`, `gemma3n-e4b-4bit`, `gemma3n-e4b-bf16`) still fail on the M5 Max VLM path with a `[broadcast_shapes]` mismatch; the text-only path on the same checkpoints is unaffected (carried over from v0.0.31).

Full Changelog: v0.1.0...v0.1.1

v0.1.0

28 May 05:31

New Features

  • mlxcel run <repo-id-or-path> subcommand (#102, epic #92). With no -p, it enters the interactive chat REPL; with -p, it produces one-shot output byte-identical to the equivalent mlxcel generate; with no model argument, it falls back to mlx-community/Llama-3.2-3B-Instruct-4bit, matching mlx-lm's DEFAULT_MODEL. The model is a positional argument, so mlxcel run <repo-id> reads like ollama run.
  • Interactive multi-turn chat REPL (#101). mlxcel generate without -p/--prompt now enters a chat loop that streams the assistant reply token-by-token, preserves conversation context across turns, and supports /bye, /clear, /? (alias /help) slash commands plus ollama-style """ multiline input blocks. Reuses the existing model resolver, tokenizer, chat template, sampling, generator, and the server's byte-fallback-safe streaming detokenizer, so the offline REPL stays in lock-step with the server's behavior.
  • Local model management (#99). mlxcel list --local enumerates downloaded snapshots with repo-id, on-disk size, and absolute path. mlxcel rm <repo-id> deletes from the mlxcel store with confirmation on a TTY, refuses on a non-TTY without --yes, confines deletion to the store root, and refuses to touch the read-only HuggingFace cache.
  • Repo-id-aware -m/--model across generate, serve, inspect, mlxcel-server, and run (#100, #92). -m now accepts a local path or a HuggingFace owner/name repo-id. Resolution precedence: an existing on-disk path is used first (byte-identical to the pre-#100 behavior); otherwise the resolver tries the legacy ./models/<name>, then the HF cache (read-only reuse), then the mlxcel global store, and finally auto-downloads into the store. The legacy and store branches apply only when a config.json is present.
  • Global model store and HuggingFace cache read-reuse (#98). Default download destination moves from per-CWD ./models/<basename> to ${MLXCEL_CACHE_DIR:-$HOME/.cache/mlxcel}/models/<owner>/<name>, namespaced to avoid same-name collisions across owners. download_repo reuses an existing snapshot under $HF_HUB_CACHE / $HF_HOME / ~/.cache/huggingface/hub when no --local-dir is pinned; mlxcel never writes into the HF content-addressed layout.
  • MLXCEL_MODELS_DIR environment variable and uniform --models-dir flag (#108). Precedence: --models-dir > MLXCEL_MODELS_DIR > ${MLXCEL_CACHE_DIR:-$HOME/.cache/mlxcel}/models. Wired through download, generate, serve, inspect, run, list, and rm. Closes #107.
  • Bare model names default to the mlx-community org (#113). A value with no slash (e.g. Qwen3-4B-4bit) now resolves as mlx-community/<name> instead of erroring. Overridable via MLXCEL_DEFAULT_ORG; invalid org values are caught up front with no network call.
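
The resolution order for -m and the bare-name default can be sketched as follows. This is a minimal model under stated assumptions, not mlxcel's resolver: `resolve_model`, `qualify_repo_id`, and the `hf_snapshot_lookup` callable are hypothetical names, and the HF cache lookup is left abstract because its layout is not described here.

```python
from pathlib import Path

DEFAULT_ORG = "mlx-community"  # overridable via MLXCEL_DEFAULT_ORG per the notes


def qualify_repo_id(value, default_org=DEFAULT_ORG):
    """A value with no slash resolves as <default_org>/<name>."""
    return value if "/" in value else f"{default_org}/{value}"


def resolve_model(value, cwd, hf_snapshot_lookup, store_root):
    """Return (source, path) following the documented precedence.

    hf_snapshot_lookup is a hypothetical callable: repo_id -> Path or None.
    """
    direct = Path(value)
    if direct.exists():
        return "path", direct                      # existing on-disk path wins
    repo_id = qualify_repo_id(value)
    name = repo_id.split("/", 1)[1]
    legacy = Path(cwd) / "models" / name
    if (legacy / "config.json").is_file():
        return "legacy", legacy                    # legacy ./models/<name>
    cached = hf_snapshot_lookup(repo_id)
    if cached is not None:
        return "hf-cache", cached                  # read-only reuse
    stored = Path(store_root) / repo_id
    if (stored / "config.json").is_file():
        return "store", stored                     # global store, <owner>/<name>
    return "download", stored                      # auto-download target
```

The `"download"` result only names the destination; the actual fetch is out of scope for the sketch.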

Improvements

None.

Bug Fixes

  • mlxcel-server legacy startup -m (epic #92 hardening) now uses the same repo-id resolver and --models-dir store override as mlxcel serve, including the safetensors-only presence check.
  • Video resource caps (MLXCEL_VIDEO_MAX_PIXELS / _DURATION_SEC / _PNG_FRAME_BYTES) are now threaded through a VideoLimits struct resolved once at the boundary instead of being read from std::env deep in the decode path. Removes a libc setenv/getenv data race that made the cap tests flake under the threaded cargo test runner (#104).
  • Pipeline runtime tests bind 127.0.0.1:0 directly through TcpTransport and resolve the real port via local_addr(), removing the release-then-rebind window that caused intermittent "stub stage startup channel dropped" failures when another concurrently running test grabbed the freed ephemeral port (#106).
  • require_secure_endpoint_refuses_plaintext_with_token test acquires env_lock() so it no longer races with sibling tests that set MLXCEL_ALLOW_INSECURE_ENDPOINT (#111).
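
The port fix in #106 follows a general pattern: bind to port 0, keep the socket, and ask the socket which port it received, instead of probing for a free port, releasing it, and binding again. A minimal Python sketch of that pattern follows. The project itself uses TcpTransport and local_addr(); `bind_ephemeral` is a hypothetical name.

```python
import socket


def bind_ephemeral(host="127.0.0.1"):
    """Bind to port 0 and return the live listener plus the port the OS chose.

    The listener is never released between choosing the port and using it,
    so no concurrent test can grab the port in between.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind((host, 0))
    listener.listen()
    port = listener.getsockname()[1]
    return listener, port
```

The caller owns the returned listener and is responsible for closing it.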

CI/CD Improvements

None.

Technical Details

  • README "Run a model" section now leads with the mlxcel run one-liner and collapses the core verbs (generate, serve, inspect, --estimate-memory) into a single one-line-comment block; env-var detail moves behind docs/environment-variables.md. README drops about 58 lines with no loss of documented behavior (#114).
  • M5 Max detailed table, README headline version, and benchmark report method table refreshed to the 2026-05-27 full-sweep state on mlxcel 0.1.0. internvl3-1b now passes in both text (661 tok/s) and VLM (601 tok/s, ahead of mlx-vlm's 529 tok/s) (#115).
  • M5 Max decode aggregates recomputed against the unchanged mlx-lm / mlx-vlm baselines. Text decode 99% median, 62 of 66 at >=90% parity; VLM decode 102% average, 101% median, 22 comparable pairs, 18 of 22 at >=90% parity. README decode tables add Gemma 2 2B, Phi-3.5-mini, Jamba, and InternVL3 1B (#127).
  • 2026-05-28 M1 Ultra full sweep mirrored into the public benchmark docs. M1 Ultra text 99% median (64 of 74 at >=90%); VLM 98% median (12 of 18 at >=90%); internvl3 now passes on M1 (#130).
  • v0.0.31 #86 entry in CHANGELOG.md and debian/changelog reworded to credit the batch-scheduler concurrency race as the root cause, matching the GitHub release note for v0.0.31.

Dependencies

None.

Breaking Changes

  • The default download destination of mlxcel download (and the implicit auto-download under any -m consumer) moves from per-CWD ./models/<basename> to the global model store ${MLXCEL_CACHE_DIR:-$HOME/.cache/mlxcel}/models/<owner>/<name>. Existing local paths passed to -m continue to be used verbatim. Pass --local-dir <path>, set MLXCEL_MODELS_DIR, or pass --models-dir <path> to override the destination.
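
The new default destination can be sketched from the precedence stated in these notes (--models-dir, then MLXCEL_MODELS_DIR, then ${MLXCEL_CACHE_DIR:-$HOME/.cache/mlxcel}/models). The function names are hypothetical, and --local-dir is left out because its exact semantics are not spelled out here.

```python
import os
from pathlib import Path


def resolve_models_dir(cli_models_dir=None, env=None):
    """Store root: --models-dir > MLXCEL_MODELS_DIR > default cache location."""
    env = os.environ if env is None else env
    if cli_models_dir:
        return Path(cli_models_dir)
    if env.get("MLXCEL_MODELS_DIR"):
        return Path(env["MLXCEL_MODELS_DIR"])
    # Mirrors the shell ":-" default: an unset or empty variable falls back.
    home = env.get("HOME") or str(Path.home())
    cache_dir = env.get("MLXCEL_CACHE_DIR") or str(Path(home) / ".cache" / "mlxcel")
    return Path(cache_dir) / "models"


def download_destination(repo_id, cli_models_dir=None, env=None):
    """Namespaced destination <store>/<owner>/<name> for an owner/name repo-id."""
    owner, name = repo_id.split("/", 1)
    return resolve_models_dir(cli_models_dir, env) / owner / name
```

Namespacing by owner is what keeps two repositories with the same name from colliding in the store.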

Known Issues

  • The qwen3-0.6b decode throughput still trails the mlx-lm baseline; the gap is flagged in the M5 Max benchmark notes.
  • Gemma 3n VLM checkpoints (gemma3n-e2b-4bit, gemma3n-e4b-4bit, gemma3n-e4b-bf16) still fail on the M5 Max VLM path with a [broadcast_shapes] mismatch; the text-only path on the same checkpoints is unaffected (carried over from v0.0.31).

What's Changed

  • feat(download): global model store + default migration + HF-cache read reuse by @inureyes in #98
  • feat(cli): repo-id-aware -m resolver with auto-download (generate/serve/inspect) by @inureyes in #100
  • feat(cli): local model management — list downloaded models + remove by @inureyes in #99
  • feat(cli): interactive chat REPL (multi-turn, streaming, slash commands, multiline) by @inureyes in #101
  • feat(cli): add mlxcel run verb (one-shot + REPL dispatch + default-model fallback) by @inureyes in #102
  • fix(multimodal): inject video resource caps instead of reading env in the decode path by @inureyes in #104
  • fix(distributed): bind ephemeral ports directly in pipeline runtime tests by @inureyes in #106
  • feat(cli): MLXCEL_MODELS_DIR env var + uniform --models-dir model-store override by @inureyes in #108
  • fix: harden epic 92 server model resolver by @inureyes in #109
  • fix(downloader): serialize the secure-endpoint refusal test through ENV_LOCK by @inureyes in #111
  • feat(cli): default bare model names to the mlx-community org by @inureyes in #113
  • docs: slim down README Quick start section by @inureyes in #114
  • docs: refresh M5 Max benchmark results to mlxcel 0.1.0 (2026-05-27) by @inureyes in #115
  • docs: recompute M5 Max decode aggregates and feature optimized models by @inureyes in #127
  • docs: reflect the 2026-05-28 M1 Ultra re-benchmark in public docs by @inureyes in #130

Full Changelog: v0.0.31...v0.1.0

v0.0.31

26 May 16:42

New Features

  • MiniCPM-V 4.6 VLM architecture, including hardened image grid handling for the multi-slice image processor. (#82, #83)
  • RT-DETRv2 object detection model, exposed through the new mlxcel detect subcommand. (#80)
  • Anthropic-style /v1/messages API endpoint on the server, so Messages API clients can talk to mlxcel alongside the existing OpenAI-compatible routes. (#74)

Improvements

  • Documented the MLXCEL_CAPTURE_DECODE environment variable and clarified the memory headroom wording. (#72)

Bug Fixes

  • Chat message content that is missing or explicitly null, such as assistant tool-call turns, is now tolerated instead of being rejected with an HTTP 422. This restores multi-turn tool loops for OpenAI-compatible clients that omit content on tool-call messages. (#91)
  • Gemma 3n VLM per_layer_inputs is now keyed per sequence id, mirroring the Gemma 4 container. It was previously stashed in a single shared cell, so under a burst of concurrent VLM requests the batch scheduler could overwrite it before prefill consumed it, and prefill would then read the wrong sequence's tensor (or panic). (#86)
  • Qwen3.5 MTP speculative decoding now uses per-position verify attention so the draft and verify passes stay in parity. (#78)
  • Batched quantized KV caches now apply the correct mask offset. (#76)
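
The Gemma 3n fix (#86) replaces one shared slot with a map keyed by sequence id. The sketch below contrasts the two shapes in Python. Both class names are hypothetical, strings stand in for tensors, and the real container presumably also needs synchronization that this sketch leaves out.

```python
class SharedCell:
    """Old shape: one slot for all sequences, so a later put overwrites an earlier one."""

    def __init__(self):
        self._value = None

    def put(self, seq_id, tensor):
        self._value = tensor  # seq_id is ignored: this is the bug

    def take(self, seq_id):
        value, self._value = self._value, None
        return value


class PerSequenceStash:
    """New shape: each sequence id owns its entry until prefill consumes it."""

    def __init__(self):
        self._by_seq = {}

    def put(self, seq_id, tensor):
        self._by_seq[seq_id] = tensor

    def take(self, seq_id):
        # Raises KeyError if nothing was stashed for this sequence.
        return self._by_seq.pop(seq_id)
```

With two requests stashed back to back, the shared cell hands the second request's value to the first sequence, while the keyed stash returns each sequence its own.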

CI/CD Improvements

  • Pinned the Rust toolchain to 1.93.1 for reproducible builds. (#87, #90)
  • Added the root models symlink (#88) and AI assistant temporary directories to .gitignore.

Technical Details

  • New model coverage: MiniCPM-V 4.6 on the VLM path and RT-DETRv2 for object detection, the latter wired through a dedicated mlxcel detect subcommand.
  • The server now speaks the Anthropic Messages API (/v1/messages) in addition to the OpenAI-compatible chat and responses endpoints.
  • Correctness work this cycle targets multi-turn tool calling (missing or null content), Gemma 3n VLM per-sequence state isolation under the batch scheduler, Qwen3.5 MTP speculative-decode parity, and batched quantized KV cache masking.

Dependencies

  • serde_json 1.0.149 to 1.0.150. (#84)
  • minijinja 2.19.0 to 2.20.0. (#84)

Breaking Changes

None.

Known Issues

  • Gemma 3n VLM checkpoints (gemma3n-e2b-4bit, gemma3n-e4b-4bit, gemma3n-e4b-bf16) still fail on the M5 Max VLM path with a [broadcast_shapes] mismatch. The text-only path on the same checkpoints is unaffected.
  • The qwen3-0.6b decode throughput still trails the mlx-lm baseline; the gap is flagged in the M5 Max benchmark notes.

What's Changed

  • docs: document MLXCEL_CAPTURE_DECODE and clarify headroom wording by @inureyes in #72
  • feat(server): add Anthropic-style /v1/messages API endpoint by @inureyes in #74
  • fix(cache): correct mask offset for batched quantized KV caches by @inureyes in #76
  • fix(speculative): per-position verify attention for Qwen3.5 MTP parity by @inureyes in #78
  • feat(models): add RT-DETRv2 object detection model by @inureyes in #80
  • feat(models): add MiniCPM-V 4.6 VLM support by @inureyes in #82
  • fix: harden MiniCPM-V 4.6 image grid handling by @inureyes in #83
  • deps(deps): bump the minor-and-patch group with 2 updates by @dependabot[bot] in #84
  • fix(vision): make Gemma 3n VLM per_layer_inputs sequence-aware by @inureyes in #86
  • chore: pin Rust toolchain to 1.95.0 by @inureyes in #87
  • chore: ignore root models symlink in .gitignore by @inureyes in #88
  • chore: pin Rust toolchain to 1.93.1 by @inureyes in #90
  • fix(server): tolerate missing/null content in chat messages by @inureyes in #91

Full Changelog: v0.0.30...v0.0.31

v0.0.30

23 May 01:38

New Features

  • Unified pre-load memory estimator. mlxcel inspect is a new read-only subcommand that prints a byte-level breakdown of model weights, KV cache, and runtime headroom against available unified memory without loading any tensors. --estimate-memory on mlxcel generate and mlxcel serve runs the same estimator as a preflight and aborts when the model will not fit; pass --force (alias --no-memory-check) to override, and set MLXCEL_MEMORY_LIMIT=NGB to tighten the available figure to a soft cap. (#67)
  • Exact weight footprint from the safetensors header, parsed without materializing tensors so the estimator works from real per-dtype byte counts. (#64)
  • KV cache memory estimator with 256-token rounding that matches the runtime's pre-allocation steps. (#65)
  • MLX runtime memory API bindings that expose the active, peak, and limit byte counters through FFI. (#66)
  • Molmo v1 (molmo-7b) VLM architecture. (#41)
  • InternVL (internvl_chat) VLM architecture. (#37)
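
Reading the weight footprint from the header alone (#64) works because a safetensors file starts with an 8-byte little-endian header length followed by a JSON header that lists each tensor's dtype, shape, and data_offsets. The sketch below sums bytes per dtype from that header without touching tensor data. The function name is hypothetical and this is not mlxcel's parser.

```python
import io
import json
import struct
from collections import defaultdict


def weight_bytes_by_dtype(fp):
    """Sum tensor byte counts per dtype using only the safetensors header."""
    (header_len,) = struct.unpack("<Q", fp.read(8))
    header = json.loads(fp.read(header_len))
    totals = defaultdict(int)
    for name, entry in header.items():
        if name == "__metadata__":  # optional free-form metadata, not a tensor
            continue
        begin, end = entry["data_offsets"]
        totals[entry["dtype"]] += end - begin
    return dict(totals)


# Usage with an in-memory file holding one 2x2 F16 tensor (8 bytes of data).
header = json.dumps({"w": {"dtype": "F16", "shape": [2, 2], "data_offsets": [0, 8]}}).encode()
blob = struct.pack("<Q", len(header)) + header + bytes(8)
print(weight_bytes_by_dtype(io.BytesIO(blob)))  # {'F16': 8}
```

Because only the first `8 + header_len` bytes are read, the cost is independent of checkpoint size.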

Improvements

  • Gemma 3n bf16 decode: reduced AltUp/MLP graph overhead (#60) and improved M5 decode bandwidth with pretransposed weights (#62).
  • Phi-3.5 SuScaledRoPE decode speedup. (#42)
  • Gemma dense GeGLU aligned with the mlx-lm reference for faster decode. (#43)
  • Jamba hybrid decode speedup. (#44)
  • Server parallel context sizing: --ctx-size is now treated as a total context budget shared across active request slots, matching llama.cpp server semantics. --parallel N --ctx-size C yields an effective per-slot window of floor(C / N), /slots reports the per-slot window, and startup rejects per-slot windows below 512 tokens. (#57)
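
The per-slot arithmetic for --parallel N --ctx-size C is small enough to write out. This sketch uses hypothetical names and only encodes what the notes state: the window is floor(C / N), and startup rejects windows below 512 tokens.

```python
MIN_SLOT_WINDOW = 512  # startup rejects per-slot windows below this


def per_slot_window(ctx_size, parallel):
    """Effective per-slot context window when ctx_size is a shared budget."""
    if parallel < 1:
        raise ValueError("parallel must be at least 1")
    window = ctx_size // parallel
    if window < MIN_SLOT_WINDOW:
        raise ValueError(
            f"per-slot window {window} is below {MIN_SLOT_WINDOW}; raise --ctx-size"
        )
    return window
```

For instance, `per_slot_window(8192, 4)` gives 2048 tokens per slot, so a deployment that previously relied on 8192 tokens per slot with four slots would need a --ctx-size of 32768 to keep the same window.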

Bug Fixes

  • CLI boolean cache flags are now validated, and CLI flags correctly take precedence over their environment-variable equivalents. (#70)
  • Prompt cache radix trie is now iterative, preventing a stack overflow on deep prompt prefixes. (#63)
  • Gemma 3n gates the bf16 fused decode path off the M5 Neural Accelerator so output stays correct on that hardware. (#61)
  • CUDA Hopper builds append the 90a architecture suffix for auto-detect and fallback builds. (#51)
  • VLM server image decoding is hardened to skip invalid entries instead of failing the whole request. (#50)
  • Qwen2-VL image placeholder is expanded to the full grid count. (#39)
  • Tightened the memory estimator preflight coverage so the abort path is exercised across generate and serve.
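
The prompt-cache fix (#63) swaps recursion for loops so that depth no longer consumes call stack. The sketch below shows that idea on a plain token trie, which is simpler than a path-compressed radix trie and is not mlxcel's data structure. The class and method names are hypothetical.

```python
class TrieNode:
    __slots__ = ("children", "terminal")

    def __init__(self):
        self.children = {}
        self.terminal = False


class PrefixTrie:
    """Token trie whose operations use loops, never recursion."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens):
        node = self.root
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                child = TrieNode()
                node.children[tok] = child
            node = child
        node.terminal = True

    def longest_cached_prefix(self, tokens):
        """Length of the longest inserted sequence that prefixes `tokens`."""
        node = self.root
        best = 0
        for depth, tok in enumerate(tokens, start=1):
            node = node.children.get(tok)
            if node is None:
                break
            if node.terminal:
                best = depth
        return best
```

A 5000-token prefix exceeds Python's default recursion limit of 1000, so a recursive version would fail on it where this loop-based one does not.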

CI/CD Improvements

None.

Technical Details

  • The memory estimator composes three layers: static weight bytes from the safetensors header, a KV cache sized to the context window with 256-token rounding, and a runtime headroom factor that defaults to 1.20x. The total is checked against available unified memory and the optional MLXCEL_MEMORY_LIMIT soft cap using live MLX runtime memory counters.
  • Docs: refreshed the M1 Ultra and README decode benchmark figures for the Molmo / Phi-3.5 / Gemma / Jamba / InternVL work (#45, #46), corrected the M5 Max baichuan-m1-14b decode comparison (#49), and dropped change-cause notes from the result tables to keep them current-state only (#47, #48).
  • Tests: added a qwen2.5-vl-3b-4bit warmup regression guard. (#38)
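
The three estimator layers described above can be combined in a short sketch. Several parts are assumptions rather than the project's formulas: the KV byte count uses the common 2 x layers x kv_heads x head_dim x bytes-per-element x tokens form, the headroom factor is applied to the sum of weights and KV cache, and all function names are hypothetical. Only the 256-token rounding, the 1.20x default, and the soft cap come from the notes.

```python
def round_up(n, multiple=256):
    """Round n up to the next multiple (256-token steps for the KV cache)."""
    return -(-n // multiple) * multiple


def kv_cache_bytes(ctx_tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    """Assumed KV size: one K and one V entry per layer, head, and token."""
    tokens = round_up(ctx_tokens, 256)
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


def preflight(weight_bytes, kv_bytes, available_bytes, limit_bytes=None,
              headroom_percent=120):
    """Return (fits, required_bytes); headroom is applied to weights + KV (assumed)."""
    total = weight_bytes + kv_bytes
    # Integer ceiling of total * 1.20, avoiding float rounding.
    required = (total * headroom_percent + 99) // 100
    # MLXCEL_MEMORY_LIMIT acts as a soft cap on the available figure.
    budget = available_bytes if limit_bytes is None else min(available_bytes, limit_bytes)
    return required <= budget, required
```

A 1000-token context is sized as 1024 tokens, so the estimate matches what a runtime that pre-allocates in 256-token steps would actually reserve.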

Dependencies

None. The only Cargo.lock change is the mlxcel package version bump.

Breaking Changes

  • Server --ctx-size semantics changed under --parallel. Previously each request slot received the full --ctx-size window; it is now a shared budget, so --parallel N --ctx-size C gives each slot floor(C / N) tokens. Deployments that relied on the old per-slot sizing should raise --ctx-size accordingly. Startup now rejects per-slot windows below 512 tokens. (#57)

Known Issues

  • The qwen3-0.6b decode throughput still trails the mlx-lm baseline; the gap is flagged in the M5 Max benchmark notes. (#49)

What's Changed

  • feat(models): add InternVL (internvl_chat) VLM architecture by @inureyes in #37
  • test(vision): add qwen2.5-vl-3b-4bit warmup regression guard by @inureyes in #38
  • fix(vision): expand Qwen2-VL image placeholder to grid count by @inureyes in #39
  • feat(models): add Molmo v1 (molmo-7b) VLM architecture by @inureyes in #41
  • fix(perf): speed up Phi-3.5 SuScaledRoPE decode by @inureyes in #42
  • fix(perf): align Gemma dense GeGLU with mlx-lm by @inureyes in #43
  • fix(perf): speed up Jamba hybrid decode by @inureyes in #44
  • docs(benchmarks): refresh M1 Ultra results for Molmo / Phi-3.5 / Gemma / Jamba / InternVL by @inureyes in #45
  • docs(readme): refresh decode benchmark figures after the Gemma / Phi-3.5 / Jamba work by @inureyes in #46
  • docs(benchmarks): drop change-cause notes from the result tables by @inureyes in #47
  • docs(benchmarks): drop legacy change-cause notes from the M1 Ultra table by @inureyes in #48
  • docs(benchmarks): correct the M5 Max baichuan-m1-14b decode comparison and flag qwen3-0.6b by @inureyes in #49
  • fix: harden VLM server image decoding by @inureyes in #50
  • fix(cuda): append 90a arch suffix for Hopper auto-detect/fallback builds by @inureyes in #51
  • perf(gemma3n): reduce bf16 decode AltUp/MLP graph overhead by @inureyes in #60
  • fix(gemma3n): gate bf16 fused decode path off M5 Neural Accelerator by @inureyes in #61
  • perf(gemma3n): improve M5 decode bandwidth with pretransposed weights by @inureyes in #62
  • fix(prompt_cache): make radix trie iterative to prevent stack overflow by @inureyes in #63
  • feat(ffi): wrap MLX runtime memory APIs (active/peak/limit) by @inureyes in #66
  • feat(weights): add exact weight footprint from safetensors header by @inureyes in #64
  • feat(hardware): add KV cache memory estimator with 256-token rounding by @inureyes in #65
  • feat: unified memory estimator with mlxcel inspect and generate/serve preflight by @inureyes in #67
  • fix: tighten memory estimator preflight coverage by @inureyes in #68
  • fix: align parallel context sizing with slots by @inureyes in #69
  • fix(cli): validate boolean cache flags, fix CLI-over-env precedence by @inureyes in #70

Full Changelog: v0.0.29...v0.0.30