Phase 6: Fused Metal paged-attention kernel #123
Open
Labels
area:core (mlxcel-core: MLX FFI, primitives, KV cache, layers)
platform:macos (macOS (Apple Silicon) specific)
status:backlog (In the backlog, not yet ready)
type:performance (Performance improvements)
Context
Gather-then-SDPA (Phase 2) re-materializes a contiguous K/V each decode step, so its cost grows with context length. Phase 0 (ADR 0001) confirmed that cost is material at long or batched context, so this issue replaces it with a fused Metal kernel that reads scattered blocks directly.
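To make the cost argument concrete, here is a minimal toy sketch of the gather step in plain Python. The block size, pool layout, and function name are illustrative assumptions, not the mlxcel data structures; the point is only that the number of rows copied per decode step equals the current context length.

```python
# Hypothetical toy layout: a pool of fixed-size blocks plus a per-sequence
# block table. None of these names come from the mlxcel code base.
BLOCK_SIZE = 4  # tokens per block (assumed value)


def gather_kv(pool, block_table, context_len):
    """Copy one sequence's scattered rows into a contiguous list.

    Returns the contiguous rows and the number of rows copied.
    """
    rows = []
    for pos in range(context_len):
        block_id = block_table[pos // BLOCK_SIZE]  # logical -> physical block
        rows.append(pool[block_id][pos % BLOCK_SIZE])
    return rows, len(rows)


# Pool of 8 blocks; each row is tagged (block, slot) so copies can be traced.
pool = [[(b, s) for s in range(BLOCK_SIZE)] for b in range(8)]
block_table = [5, 2, 7, 0]  # logical block i lives in physical block block_table[i]

copied_per_step = []
for context_len in range(9, 13):  # four consecutive decode steps
    _, copied = gather_kv(pool, block_table, context_len)
    copied_per_step.append(copied)

print(copied_per_step)  # [9, 10, 11, 12]
```

Every step repeats the copy of all earlier rows plus one new row, which is the per-step overhead a kernel that reads the blocks in place would avoid.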
Tasks
- Fused kernel under src/lib/mlx-cpp/turbo/, modeled on sparse_v_sdpa.metal, consuming Q + block table + the pool tensors.
- use_native_paged_kernel plus a feature/env flag; keep gather-then-SDPA as the fallback path.
- CLAUDE.md: compile on Apple Silicon and confirm RMS < 5e-3 vs the graph reference over 200 steps.
Acceptance criteria
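As a reference point for the tasks above, the following plain-Python sketch shows the three pieces side by side: a gather-then-SDPA fallback, a fused-style path that walks the block table and reads each block in place with an online softmax, and an RMS comparison over 200 decode steps. It is a CPU sketch under stated assumptions, not the Metal kernel: the block size, head dimension, single-head layout, the `USE_NATIVE_PAGED_KERNEL` environment variable, and the way RMS is aggregated over steps are all hypothetical choices made for illustration.

```python
import math
import os
import random

BLOCK_SIZE = 4  # tokens per block (assumed)
HEAD_DIM = 8    # single head, assumed size


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gather_then_sdpa(q, k_pool, v_pool, block_table, context_len):
    """Fallback path: materialize contiguous K/V, then plain softmax attention."""
    ks, vs = [], []
    for pos in range(context_len):
        blk, slot = block_table[pos // BLOCK_SIZE], pos % BLOCK_SIZE
        ks.append(k_pool[blk][slot])
        vs.append(v_pool[blk][slot])
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in ks]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    total = sum(w)
    return [sum(wi * v[d] for wi, v in zip(w, vs)) / total for d in range(len(q))]


def paged_attention(q, k_pool, v_pool, block_table, context_len):
    """Fused-style path: read rows in place via the block table, keeping a
    running max and running sum (online softmax) instead of a contiguous copy."""
    scale = 1.0 / math.sqrt(len(q))
    run_max, run_sum = -math.inf, 0.0
    acc = [0.0] * len(q)
    for pos in range(context_len):
        blk, slot = block_table[pos // BLOCK_SIZE], pos % BLOCK_SIZE
        s = dot(q, k_pool[blk][slot]) * scale
        new_max = max(run_max, s)
        correction = math.exp(run_max - new_max)  # rescales earlier partial sums
        p = math.exp(s - new_max)
        run_sum = run_sum * correction + p
        acc = [a * correction + p * vd for a, vd in zip(acc, v_pool[blk][slot])]
        run_max = new_max
    return [a / run_sum for a in acc]


def attend(q, k_pool, v_pool, block_table, context_len, use_native_paged_kernel=None):
    """Pick the path from an explicit flag, else from a hypothetical env var."""
    if use_native_paged_kernel is None:
        use_native_paged_kernel = os.environ.get("USE_NATIVE_PAGED_KERNEL") == "1"
    impl = paged_attention if use_native_paged_kernel else gather_then_sdpa
    return impl(q, k_pool, v_pool, block_table, context_len)


rng = random.Random(0)
num_blocks = 64  # enough for 200 tokens at BLOCK_SIZE = 4


def rand_pool():
    return [[[rng.gauss(0, 1) for _ in range(HEAD_DIM)] for _ in range(BLOCK_SIZE)]
            for _ in range(num_blocks)]


k_pool, v_pool = rand_pool(), rand_pool()
block_table = list(range(num_blocks))
rng.shuffle(block_table)  # scatter logical blocks across the pool

sq_err, count = 0.0, 0
for step in range(200):  # one new token of context per decode step
    q = [rng.gauss(0, 1) for _ in range(HEAD_DIM)]
    ref = attend(q, k_pool, v_pool, block_table, step + 1, use_native_paged_kernel=False)
    out = attend(q, k_pool, v_pool, block_table, step + 1, use_native_paged_kernel=True)
    sq_err += sum((a - b) ** 2 for a, b in zip(ref, out))
    count += HEAD_DIM

rms = math.sqrt(sq_err / count)
print(rms < 5e-3)
```

Both paths here run in double precision, so the two outputs agree far more closely than the 5e-3 threshold; a real check would compare the compiled Metal kernel's output against the graph reference, where reduced-precision accumulation makes the tolerance meaningful.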
Dependencies
Blocked by Phase 2; gated by the Phase 0 kernel-strategy decision.
Part of #116