Phase 5: Block-budget admission, eviction, and preemption #122

@inureyes

Context

With physical blocks shared across requests, scheduling and eviction must be tied to the global block budget rather than per-sequence buffers. Freed prefix blocks must return to the pool when no sequence references them.
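To make the refcount rule concrete, here is a minimal sketch of a pool where a block returns to the free list only when its last reference is dropped. `BlockPool` and its methods are hypothetical names for illustration, not the existing mlxcel types, and the real pool will likely carry more state (block size, device buffers).

```rust
/// Hypothetical fixed-size pool of physical KV blocks with per-block refcounts.
#[derive(Debug)]
pub struct BlockPool {
    /// refcounts[id] == 0 means the block is free.
    refcounts: Vec<u32>,
    /// Stack of free block ids.
    free: Vec<usize>,
}

impl BlockPool {
    pub fn new(total_blocks: usize) -> Self {
        Self {
            refcounts: vec![0; total_blocks],
            free: (0..total_blocks).rev().collect(),
        }
    }

    pub fn free_blocks(&self) -> usize {
        self.free.len()
    }

    /// Allocate a block with refcount 1, or None if the pool is exhausted.
    pub fn alloc(&mut self) -> Option<usize> {
        let id = self.free.pop()?;
        self.refcounts[id] = 1;
        Some(id)
    }

    /// Add a reference to an allocated block (e.g. a second sequence
    /// reusing a shared prefix block).
    pub fn retain(&mut self, id: usize) {
        assert!(self.refcounts[id] > 0, "retain on a free block");
        self.refcounts[id] += 1;
    }

    /// Drop one reference. Returns true if this was the last reference
    /// and the block went back to the free list.
    pub fn release(&mut self, id: usize) -> bool {
        assert!(self.refcounts[id] > 0, "release on a free block");
        self.refcounts[id] -= 1;
        if self.refcounts[id] == 0 {
            self.free.push(id);
            true
        } else {
            false
        }
    }
}
```

With two sequences sharing one block, the first `release` leaves the free count unchanged and only the second one returns the block to the pool.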

Tasks

  • Admit or queue sequences based on free block count; integrate with the existing PreemptionPolicy.
  • LRU/TTL eviction over radix-tree leaves that releases physical blocks when their refcount reaches 0; merge this with the current prompt-cache capacity/TTL accounting (prompt_cache/policy.rs, store.rs).
  • Implement recompute-on-preempt (drop blocks, re-prefill on resume). Host swap-out is out of scope because unified memory makes it low value.
  • Surface pool block usage in GET /v1/cache/stats and the Prometheus gauges (extend prompt_cache/metrics.rs).
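A rough sketch of how the first three tasks could fit together, under stated assumptions: all names here (`Scheduler`, `CachedLeaf`, `try_admit`, `preempt`) are hypothetical, the radix tree is flattened to a list of leaves, TTL is omitted, and the existing PreemptionPolicy is not modeled at all.

```rust
use std::collections::VecDeque;

/// Hypothetical cached prefix leaf: blocks held, live references, last-use tick.
pub struct CachedLeaf {
    pub blocks: usize,
    pub refcount: u32,
    pub last_used: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Admission {
    Admitted,
    Queued,
}

pub struct Scheduler {
    pub free_blocks: usize,
    pub leaves: Vec<CachedLeaf>,
    /// Block demand of each waiting sequence, front = next to run.
    pub waiting: VecDeque<usize>,
}

impl Scheduler {
    /// Blocks that eviction could reclaim: leaves no sequence references.
    fn reclaimable(&self) -> usize {
        self.leaves
            .iter()
            .filter(|l| l.refcount == 0)
            .map(|l| l.blocks)
            .sum()
    }

    /// Evict unreferenced leaves in LRU order until `needed` blocks are free.
    fn evict_until(&mut self, needed: usize) {
        while self.free_blocks < needed {
            let victim = self
                .leaves
                .iter()
                .enumerate()
                .filter(|(_, l)| l.refcount == 0)
                .min_by_key(|(_, l)| l.last_used)
                .map(|(i, _)| i);
            match victim {
                Some(i) => self.free_blocks += self.leaves.swap_remove(i).blocks,
                None => break,
            }
        }
    }

    /// Admit if the budget can be met (evicting cold prefixes if necessary);
    /// otherwise queue without evicting anything.
    pub fn try_admit(&mut self, needed: usize) -> Admission {
        if self.free_blocks + self.reclaimable() < needed {
            self.waiting.push_back(needed);
            return Admission::Queued;
        }
        self.evict_until(needed);
        self.free_blocks -= needed;
        Admission::Admitted
    }

    /// Recompute-on-preempt: drop the sequence's blocks and put it at the
    /// front of the queue; it re-prefills when it is admitted again.
    pub fn preempt(&mut self, held_blocks: usize) {
        self.free_blocks += held_blocks;
        self.waiting.push_front(held_blocks);
    }
}
```

Checking the reclaimable total before evicting is a design choice in this sketch: it avoids throwing away cached prefixes for a request that would end up queued anyway.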

Acceptance criteria

  • Under block pressure, the scheduler evicts cold prefixes and admits new work without running out of memory (OOM).
  • Metrics report block-level usage (allocated / live / free blocks and bytes).
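One possible reading of the allocated / live / free split, sketched below. This interpretation is an assumption, not something the issue defines: "allocated" is taken to mean any block holding data (including cached prefixes with refcount 0), "live" means referenced by at least one sequence, and "free" means sitting in the pool. `BlockUsage` and `block_usage` are hypothetical names.

```rust
/// Hypothetical block-level usage snapshot for stats and gauges.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct BlockUsage {
    pub allocated_blocks: u64,
    pub live_blocks: u64,
    pub free_blocks: u64,
    pub allocated_bytes: u64,
    pub live_bytes: u64,
    pub free_bytes: u64,
}

/// Compute usage from per-block state.
/// Assumed encoding: `None` = free, `Some(0)` = allocated but only cached,
/// `Some(n > 0)` = allocated and referenced by n sequences.
pub fn block_usage(blocks: &[Option<u32>], block_bytes: u64) -> BlockUsage {
    let mut u = BlockUsage::default();
    for slot in blocks {
        match slot {
            None => u.free_blocks += 1,
            Some(refcount) => {
                u.allocated_blocks += 1;
                if *refcount > 0 {
                    u.live_blocks += 1;
                }
            }
        }
    }
    u.allocated_bytes = u.allocated_blocks * block_bytes;
    u.live_bytes = u.live_blocks * block_bytes;
    u.free_bytes = u.free_blocks * block_bytes;
    u
}
```

Under this reading, `allocated - live` is the number of blocks that eviction could reclaim, which may be worth exposing alongside the raw counts.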

Dependencies

Blocked by Phase 4 (radix-trie / block-pool unification).

Part of #116

Labels

  • area:core - mlxcel-core: MLX FFI, primitives, KV cache, layers
  • area:inference - Generation, sampling, decoding (incl. speculative, DRY)
  • status:backlog - In the backlog, not yet ready
  • type:enhancement - New features, capabilities, or significant additions
