Context
Final integration checkpoint that proves the unified store is correct, measures its payoff on Apple Silicon, and brings the docs in line.
Phase 0 note (ADR 0001, #117): the #117 microbench (examples/page_gather_microbench.rs) measures only the attention sub-step, so its overhead figures (~15-67% single-sequence, higher under batching) are diluted at the model level, where attention accounts for only a fraction of the time per decoded token. End-to-end benchmarks here must therefore measure model-level decode throughput (examples/profile_paged_decode_kernel.rs provides the paged-vs-dense model-level comparison) and must include concurrent / batched-decode scenarios, since Phase 0 found that batching amplifies the gather cost far more than context length does. See docs/adr/0001-paged-attention-gather-vs-fused-kernel.md.
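To make the dilution argument concrete, here is a minimal sketch of the arithmetic. It assumes a simple model in which only the attention sub-step gets slower and the rest of the per-token work is unchanged. The function names and the 25% attention share are hypothetical and chosen for illustration only; the 15% and 67% inputs are the single-sequence microbench bounds quoted above. The real attention share has to come from profiling.

```rust
/// Relative increase in per-token decode time when only the attention
/// sub-step slows down.
///
/// If attention takes a fraction `f` of the baseline token time and the
/// sub-step becomes `(1 + o)` times slower, the new token time is
/// `(1 - f) + f * (1 + o) = 1 + f * o` times the baseline.
fn model_level_overhead(attention_fraction: f64, attention_overhead: f64) -> f64 {
    attention_fraction * attention_overhead
}

/// Decode throughput relative to the baseline (1.0 means no slowdown).
fn relative_throughput(attention_fraction: f64, attention_overhead: f64) -> f64 {
    1.0 / (1.0 + model_level_overhead(attention_fraction, attention_overhead))
}

fn main() {
    // Hypothetical share of decode time spent in attention; not a measured value.
    let attention_fraction = 0.25;

    // Single-sequence attention-level overhead bounds from the #117 microbench.
    for attention_overhead in [0.15, 0.67] {
        let overhead = model_level_overhead(attention_fraction, attention_overhead);
        let throughput = relative_throughput(attention_fraction, attention_overhead);
        println!(
            "attention overhead {:.0}% -> model-level overhead {:.2}%, relative throughput {:.3}",
            attention_overhead * 100.0,
            overhead * 100.0,
            throughput
        );
    }
}
```

Under this model the model-level slowdown scales with the attention share, which is why the end-to-end numbers should be measured directly rather than extrapolated from the microbench.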
Tasks
Acceptance criteria
Dependencies
Blocked by Phases 4, 5, 6, and the coverage/distributed sub-issues.
Part of #116
Tasks
docs/model_tests*.md.
docs/turbo-kv-cache.md and docs/en/prompt_cache.md; update docs/CONTINUOUS_BATCHING.md; remove the "Paged backend adopt/donate disabled" note from the prompt-cache Limitations section.
Acceptance criteria