Phase 3: Paged prefill into the block pool #120

@inureyes

Context

Prefill currently writes into dense buffers and the paged path is decode-only (is_paged_decode requires q_len == 1). For an end-to-end paged sequence, prefill must allocate and write pool blocks, and it must be able to start after a shared prefix.

Phase 0 outcome (ADR 0001, #117): prefill writes into the same layout A pool established in #118. Block writes must reassign the pool tensor (pool = slice_update(pool, ...)) so MLX donates the buffer and writes are O(block) in place rather than O(pool), the same append discipline #118 sets. See docs/adr/0001-paged-attention-gather-vs-fused-kernel.md.
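
The reassign-on-write discipline can be illustrated with a toy model. This is a minimal sketch, not the MLX API: a flat `Vec<f32>` stands in for the pool tensor, and the `slice_update` and `write_block` functions below are hypothetical stand-ins that only mirror the ownership pattern (the caller hands over its handle and gets the same buffer back, so a write touches one block rather than the whole pool).

```rust
/// Toy stand-in for a pool tensor update (hypothetical; NOT the MLX call).
/// Taking `pool` by value means the caller gives up its handle, so the
/// buffer can be mutated in place and only `src.len()` elements are touched.
fn slice_update(mut pool: Vec<f32>, start: usize, src: &[f32]) -> Vec<f32> {
    pool[start..start + src.len()].copy_from_slice(src);
    pool
}

/// Write one block's worth of data at `block_id` (hypothetical helper).
fn write_block(pool: Vec<f32>, block_id: usize, block_len: usize, data: &[f32]) -> Vec<f32> {
    assert!(data.len() <= block_len, "data must fit in one block");
    slice_update(pool, block_id * block_len, data)
}

/// Usage: a pool of 3 blocks, 4 elements each; write into block 1.
fn demo_block_write() -> Vec<f32> {
    let block_len = 4;
    let mut pool = vec![0.0_f32; 3 * block_len];
    // Reassign the pool on every write, as in `pool = slice_update(pool, ...)`.
    pool = write_block(pool, 1, block_len, &[1.0, 2.0, 3.0, 4.0]);
    pool
}
```

In this toy, moving the `Vec` keeps the same allocation, which is the analogue of buffer donation; holding a second handle to the pool across the write would force a copy instead.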

Tasks

  • Allocate blocks during prefill and write prefill K/V into the pool. (Implemented as PagedBlockPool::write_prefill.)
  • Support starting prefill after a shared prefix: reference the matched prefix's shared blocks and write only the divergent suffix into fresh blocks. (The suffix is written at the sequence's absolute tail; a shared, partially filled tail block gets a real copy-on-write via copy_on_write_block.)
  • Keep prefill attention numerically identical to the dense prefill path (no change to logits). (Pool storage is byte-identical to dense; pooled vs dense decode parity shows max RMS = 0 over the prefill→decode test.)
  • Decide whether to keep the dense prefill fast path for single-stream / non-batched runs, or unify on the pool. (Deferred to #121, Phase 4: Radix-trie and block-pool unification (scheduler wiring), which owns the live forward/scheduler wiring decision.)
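
The shared-prefix task can be sketched with a toy pool. Everything here is an assumption made for illustration: `ToyPool`, `Stats`, `BLOCK_SIZE`, the `u32` token payloads standing in for K/V rows, and the signatures of `write_prefill` and `copy_on_write_block` (the issue names these two methods but does not give their signatures). The sketch shows prefix blocks being referenced rather than copied, fresh blocks allocated only for the suffix, and a shared partial tail block copied before it is written.

```rust
const BLOCK_SIZE: usize = 4; // tokens per block (assumed value)

#[derive(Default, Debug, Clone, Copy, PartialEq)]
struct Stats {
    allocated: usize,  // blocks allocated so far
    cow_copies: usize, // copy-on-write copies performed
}

#[derive(Default)]
struct ToyPool {
    blocks: Vec<Vec<u32>>, // per-block payload (stand-in for K/V rows)
    refcount: Vec<usize>,  // number of sequences referencing each block
    stats: Stats,
}

impl ToyPool {
    fn alloc(&mut self) -> usize {
        let id = self.blocks.len();
        self.blocks.push(Vec::with_capacity(BLOCK_SIZE));
        self.refcount.push(1);
        self.stats.allocated += 1;
        id
    }

    /// Return a privately owned block: `id` itself if unshared, else a copy.
    fn copy_on_write_block(&mut self, id: usize) -> usize {
        if self.refcount[id] == 1 {
            return id;
        }
        self.refcount[id] -= 1;
        let copy = self.blocks[id].clone();
        let new_id = self.alloc();
        self.blocks[new_id] = copy;
        self.stats.cow_copies += 1;
        new_id
    }

    /// Share `prefix_blocks` (covering `prefix_len` tokens), then write
    /// `suffix` starting at absolute position `prefix_len`.
    /// Returns the new sequence's block table.
    fn write_prefill(&mut self, prefix_blocks: &[usize], prefix_len: usize, suffix: &[u32]) -> Vec<usize> {
        assert_eq!(prefix_blocks.len(), (prefix_len + BLOCK_SIZE - 1) / BLOCK_SIZE);
        let mut table = prefix_blocks.to_vec();
        for &b in &table {
            self.refcount[b] += 1; // reference the shared prefix, no copy
        }
        let mut pos = prefix_len;
        for &tok in suffix {
            if pos % BLOCK_SIZE == 0 {
                let id = self.alloc(); // block boundary: fresh block
                table.push(id);
            } else {
                // Partial tail block: make it private before writing.
                let last = table.len() - 1;
                let old = table[last];
                let private = self.copy_on_write_block(old);
                // Drop any donor tokens beyond the matched prefix.
                self.blocks[private].truncate(pos % BLOCK_SIZE);
                table[last] = private;
            }
            let tail = *table.last().unwrap();
            self.blocks[tail].push(tok);
            pos += 1;
        }
        table
    }

    /// Concatenate a sequence's blocks in table order.
    fn gather(&self, table: &[usize]) -> Vec<u32> {
        table.iter().flat_map(|&b| self.blocks[b].iter().copied()).collect()
    }
}
```

With a cold 6-token request followed by a request that shares all 6 tokens and adds 3 more, this toy allocates two blocks for the cold request and two more for the second one (one copy of the shared partial tail block, one fresh block), and the first request's blocks are left untouched.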

Acceptance criteria

  • A cold request and a request that shares a prefix both produce logits identical to the dense path. (proven at the pool layer: cold round-trip byte-identical to dense; shared-prefix suffix gathers correctly; end-to-end prefill→decode parity RMS = 0)
  • The shared-prefix request allocates blocks only for its suffix (verified via PagedCacheStats).
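
The "RMS = 0" parity criterion can be checked with a helper along these lines. This is a sketch under assumptions: `rms_diff` and `max_rms` are hypothetical names, and logits are taken to be plain `f32` slices rather than whatever tensor type the real test uses.

```rust
/// Root-mean-square difference between two equally sized logit vectors.
/// Accumulates in f64 so the check itself does not add rounding noise.
fn rms_diff(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len(), "logit vectors must have the same length");
    assert!(!a.is_empty(), "logit vectors must be non-empty");
    let sum_sq: f64 = a
        .iter()
        .zip(b)
        .map(|(x, y)| {
            let d = f64::from(*x) - f64::from(*y);
            d * d
        })
        .sum();
    (sum_sq / a.len() as f64).sqrt()
}

/// Largest RMS difference over a series of (dense, pooled) logit pairs,
/// e.g. one pair per decode step after prefill.
fn max_rms(pairs: &[(Vec<f32>, Vec<f32>)]) -> f64 {
    pairs.iter().map(|(a, b)| rms_diff(a, b)).fold(0.0, f64::max)
}
```

A parity test would then assert `max_rms(&pairs) == 0.0` exactly, which holds only if the pooled path reproduces the dense logits bit for bit at every step.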

Scope note: #120 is pool-layer only. The model's forward(&self, caches: &mut [KVCache], ...) sees only dense KVCache, so the pool is not reachable from the live forward pass until #121 wires in a paged-aware cache mode and scheduler. write_prefill is therefore an additive pool-layer prefill-write capability with tests, not yet wired into the model forward; this is the same additive/deferred pattern as #118 and #119. Removing the dense placeholders and deciding between the dense fast path and the pool belong to #121; the fused C++ kernel is #123.

Dependencies

Blocked by Phase 1 and Phase 2.

Part of #116

Labels

  • area:core (mlxcel-core: MLX FFI, primitives, KV cache, layers)
  • area:inference (Generation, sampling, decoding (incl. speculative, DRY))
  • status:done (Completed)
  • type:enhancement (New features, capabilities, or significant additions)