Phase 3: Paged prefill into the block pool

## Context

Prefill currently writes into dense buffers and the paged path is decode-only (`is_paged_decode` requires `q_len == 1`). For an end-to-end paged sequence, prefill must allocate and write pool blocks, and it must be able to start after a shared prefix.

> **Phase 0 outcome (ADR 0001, #117):** prefill writes into the same **layout A** pool established in #118. Block writes must reassign the pool tensor (`pool = slice_update(pool, ...)`) so that MLX donates the buffer and each write is O(block) in place rather than O(pool). This is the same append discipline that #118 sets. See `docs/adr/0001-paged-attention-gather-vs-fused-kernel.md`.

## Tasks

- [x] Allocate blocks during prefill and write prefill K/V into the pool. (`PagedBlockPool::write_prefill`)
- [x] Support starting prefill after a shared prefix: reference the matched prefix's shared blocks and write only the divergent suffix into fresh blocks. (writes at the absolute tail; real copy-on-write of a shared partial tail block via `copy_on_write_block`)
- [x] Keep prefill attention numerically identical to the dense prefill path (no change to logits). (byte-identical pool storage; pooled vs dense decode parity max RMS = 0 over the prefill→decode test)
- [ ] Decide whether to keep the dense prefill fast path for single-stream / non-batched runs, or unify on the pool. (deferred to #121 — the live forward/scheduler wiring decision)
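
The shared-prefix task above comes down to block arithmetic, which the following minimal sketch illustrates. It is not the actual `PagedBlockPool` code: `PrefillPlan`, `plan_prefill`, and the `block_size` parameter are hypothetical names, and the sketch assumes that a copy-on-write of a partial tail block counts as one fresh allocation.

```rust
/// Hypothetical plan for a prefill that starts after a matched prefix.
/// These names are illustrative and are not part of the real pool API.
#[derive(Debug, PartialEq, Eq)]
pub struct PrefillPlan {
    /// Prefix blocks that are referenced as-is, without copying.
    pub shared_blocks: usize,
    /// True when the prefix ends mid-block and a suffix follows, so the
    /// partial tail block must be copied before it is appended to.
    pub needs_tail_copy: bool,
    /// Blocks to allocate (assumption: includes the copied tail block).
    pub fresh_blocks: usize,
}

/// Ceiling division for block counts.
fn ceil_div(n: usize, d: usize) -> usize {
    (n + d - 1) / d
}

/// Plan allocations for a sequence of `total_len` tokens whose first
/// `prefix_len` tokens already live in shared blocks.
pub fn plan_prefill(prefix_len: usize, total_len: usize, block_size: usize) -> PrefillPlan {
    assert!(block_size > 0, "block_size must be positive");
    assert!(prefix_len <= total_len, "prefix cannot exceed the sequence");

    if prefix_len == total_len {
        // Nothing to write: every prefix block, including a partial
        // tail, can be referenced without copying.
        return PrefillPlan {
            shared_blocks: ceil_div(prefix_len, block_size),
            needs_tail_copy: false,
            fresh_blocks: 0,
        };
    }

    // Only full prefix blocks can be shared once a suffix is appended.
    let shared_blocks = prefix_len / block_size;
    let needs_tail_copy = prefix_len % block_size != 0;
    let fresh_blocks = ceil_div(total_len, block_size) - shared_blocks;

    PrefillPlan { shared_blocks, needs_tail_copy, fresh_blocks }
}
```

Under these assumptions, `plan_prefill(20, 50, 16)` shares one full block, copies the partial second block, and allocates three blocks in total, while a cold request `plan_prefill(0, 50, 16)` allocates all four.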

## Acceptance criteria

- [x] A cold request and a request that shares a prefix both produce logits identical to the dense path. (proven at the pool layer: cold round-trip byte-identical to dense; shared-prefix suffix gathers correctly; end-to-end prefill→decode parity RMS = 0)
- [x] The shared-prefix request allocates blocks only for its suffix (verified via `PagedCacheStats`).
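
For readers unfamiliar with the parity figure quoted above, here is a minimal sketch of how a "max RMS" check could be computed. It assumes the metric is the root-mean-square of the element-wise difference, maximised over the compared tensors; `rms_diff` and `max_rms` are hypothetical helpers, and the repository's actual test may compute parity differently.

```rust
/// Root-mean-square of the element-wise difference between two
/// equally sized buffers. Accumulates in f64 to limit rounding error.
pub fn rms_diff(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len(), "buffers must have the same length");
    if a.is_empty() {
        return 0.0;
    }
    let sum_sq: f64 = a
        .iter()
        .zip(b)
        .map(|(&x, &y)| {
            let d = f64::from(x) - f64::from(y);
            d * d
        })
        .sum();
    (sum_sq / a.len() as f64).sqrt()
}

/// Largest RMS difference over a set of (pooled, dense) buffer pairs,
/// for example one pair per decode step.
pub fn max_rms(pairs: &[(&[f32], &[f32])]) -> f64 {
    pairs
        .iter()
        .map(|&(a, b)| rms_diff(a, b))
        .fold(0.0, f64::max)
}
```

A result of exactly `0.0` from `max_rms` means every compared pair matched element for element, which is the condition the checklist reports.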

> **Scope note (#120 is pool-layer only).** The model `forward(&self, caches: &mut [KVCache], ...)` only sees dense `KVCache`; the pool is not reachable from the live forward until **#121** wires in a paged-aware cache mode and scheduler. `write_prefill` is therefore an additive pool-layer prefill-write capability with tests. It is not yet wired into the live model forward, following the same additive/deferred pattern as #118/#119. Removing the dense placeholders and deciding the dense-vs-pool fast path belong to #121; the fused C++ kernel is #123.

## Dependencies

Blocked by Phase 1 and Phase 2.

Part of #116

