Phase 2: Paged decode attention over real block tables #119

@inureyes

Context

The decode kernel currently slices per-sequence dense buffers using an identity block table ([0, 1, 2, ...]) built by PagedDecodeMetadata::from_visible_lengths. With real pool storage in place, attention must instead gather each sequence's actual physical blocks, which may be non-contiguous.

Phase 0 outcome (ADR 0001, #117): option A is gather-then-SDPA: take(pool, block_ids, axis=0) (layout A), then reshape, then transpose [0,2,1,3], then fused scaled_dot_product_attention, reusing the existing FFI with no new kernel. MLX fuses the gather into the SDPA read, so gather+SDPA stays within ~15% of contiguous SDPA at <=4k single-sequence context (the overhead grows with context length and batch size; the fused-kernel alternative is deferred to #123). Build the gather on the layout-A pool from #118. See docs/adr/0001-paged-attention-gather-vs-fused-kernel.md.
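To make the gather step concrete, here is a minimal plain-Rust sketch of the indexing that take, reshape and transpose [0,2,1,3] perform together. It does not call MLX. The function name gather_kv and the pool layout [num_blocks, block_size, n_kv_heads, head_dim] are assumptions made for illustration; the real layout A is defined in the ADR and #118 and may differ.

```rust
/// Gather one sequence's KV rows from a block pool by following its block table.
///
/// Assumed (hypothetical) pool layout, flattened row-major:
///   [num_blocks, block_size, n_kv_heads, head_dim]
/// Output layout: [n_kv_heads, seq_len, head_dim], i.e. the result of
/// take(axis=0) -> reshape -> transpose [0,2,1,3] with the batch dim dropped.
pub fn gather_kv(
    pool: &[f32],
    block_ids: &[usize],
    block_size: usize,
    n_kv_heads: usize,
    head_dim: usize,
    seq_len: usize,
) -> Vec<f32> {
    assert!(
        seq_len <= block_ids.len() * block_size,
        "block table does not cover seq_len tokens"
    );
    let block_stride = block_size * n_kv_heads * head_dim;
    let mut out = vec![0.0f32; n_kv_heads * seq_len * head_dim];

    for t in 0..seq_len {
        // Logical token t lives in physical block block_ids[t / block_size].
        let block = block_ids[t / block_size];
        let slot = t % block_size;
        for h in 0..n_kv_heads {
            let src = block * block_stride + (slot * n_kv_heads + h) * head_dim;
            let dst = (h * seq_len + t) * head_dim;
            out[dst..dst + head_dim].copy_from_slice(&pool[src..src + head_dim]);
        }
    }
    out
}
```

With an identity table this reproduces the current dense slicing; with a fragmented table such as [2, 0] it reads the same logical tokens from scattered physical blocks.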

Tasks

  • Replace the identity tables with the sequence's real PagedSequenceState block tables in the decode dispatch (src/models/model_owned.rs and the per-model paths in qwen3.rs / llama3.rs / gemma3.rs / llama4.rs).
  • Update paged_decode_attention_dense_compat (C++ in cpp/mlx_cxx_bridge.cpp) and the Rust fallback paged_decode_attention_dense_fallback (layers.rs) to gather by physical block id from the pool (Phase 0 option A).
  • Apply the same change to the rotating / sliding-window variant paged_decode_attention_rotating_compat.
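As a rough illustration of what the fallback path has to compute once it reads through a real block table, the sketch below implements single-head decode attention in plain Rust. The name paged_decode_attention_ref, the single-head simplification, and the pool layout [num_blocks, block_size, head_dim] are all assumptions; this is not the signature of paged_decode_attention_dense_fallback, and it does not cover the rotating variant.

```rust
/// Reference single-head decode attention that reads K/V through a block table.
///
/// Assumed (hypothetical) pool layout, row-major: [num_blocks, block_size, head_dim].
/// `q` holds the single decode-step query of length head_dim.
pub fn paged_decode_attention_ref(
    q: &[f32],
    k_pool: &[f32],
    v_pool: &[f32],
    block_ids: &[usize],
    block_size: usize,
    seq_len: usize,
) -> Vec<f32> {
    let head_dim = q.len();
    assert!(seq_len > 0 && seq_len <= block_ids.len() * block_size);
    let scale = 1.0 / (head_dim as f32).sqrt();

    // Physical offset of logical token t, resolved via the block table.
    let row = |t: usize| (block_ids[t / block_size] * block_size + t % block_size) * head_dim;

    // Scaled dot-product scores against every visible key.
    let scores: Vec<f32> = (0..seq_len)
        .map(|t| {
            let k = &k_pool[row(t)..row(t) + head_dim];
            q.iter().zip(k).map(|(a, b)| a * b).sum::<f32>() * scale
        })
        .collect();

    // Numerically stable softmax.
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let denom: f32 = exps.iter().sum();

    // Weighted sum of the gathered values.
    let mut out = vec![0.0f32; head_dim];
    for (t, w) in exps.iter().enumerate() {
        let v = &v_pool[row(t)..row(t) + head_dim];
        for (o, x) in out.iter_mut().zip(v) {
            *o += (w / denom) * x;
        }
    }
    out
}
```

A scalar reference like this can serve as a test oracle: the same logical sequence stored under an identity table and under a fragmented table should yield the same output.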

Acceptance criteria

  • Decode output with a fragmented (non-contiguous) block table matches the dense-backend reference within RMS < 5e-3 over 200 steps.
  • Supported models (llama3/llama4/qwen3/qwen3.5/gemma3 and their VLM wrappers) are functionally unaffected.
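The first criterion could be checked with a helper along the following lines. This is a sketch only: rms_error and first_failing_step are hypothetical names, and it assumes the RMS bound applies to each decode step's output separately, which the criterion does not spell out.

```rust
/// Root-mean-square difference between two equally sized outputs.
pub fn rms_error(actual: &[f32], reference: &[f32]) -> f32 {
    assert_eq!(actual.len(), reference.len(), "length mismatch");
    assert!(!actual.is_empty(), "empty output");
    // Accumulate in f64 to keep the sum of squares accurate.
    let sum_sq: f64 = actual
        .iter()
        .zip(reference)
        .map(|(a, r)| {
            let d = (*a - *r) as f64;
            d * d
        })
        .sum();
    (sum_sq / actual.len() as f64).sqrt() as f32
}

/// Index of the first decode step whose RMS error is not below `tol`, if any.
/// Assumption: the tolerance is applied per step, not over all steps pooled.
pub fn first_failing_step(paged: &[Vec<f32>], dense: &[Vec<f32>], tol: f32) -> Option<usize> {
    assert_eq!(paged.len(), dense.len(), "step count mismatch");
    paged
        .iter()
        .zip(dense)
        .position(|(p, d)| rms_error(p, d) >= tol)
}
```

A test would collect the per-step outputs from a run with a fragmented block table and from the dense backend over 200 steps, then assert that first_failing_step(&paged, &dense, 5e-3) returns None.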

Dependencies

Blocked by Phase 1 (global block-pool tensor storage).

Part of #116

Metadata

Assignees

No one assigned

    Labels

      • area:core (mlxcel-core: MLX FFI, primitives, KV cache, layers)
      • area:inference (Generation, sampling, decoding, incl. speculative, DRY)
      • platform:macos (macOS (Apple Silicon) specific)
      • status:backlog (In the backlog, not yet ready)
      • type:enhancement (New features, capabilities, or significant additions)