More aggressive fusion of small neighboring kernels (cumsum etc.) for inference #18

@icavan

Description

@icavan

Description

Fuse small neighboring kernels (e.g., cumsum, element-wise ops) with main linear attention kernels to reduce kernel launch overhead and memory traffic in inference scenarios.

Context

In inference workloads, small auxiliary kernels like cumsum can become a significant fraction of total execution time due to kernel launch overhead and extra memory round-trips. Fusing these into the main kernel can yield meaningful speedups, especially for latency-sensitive serving.
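To illustrate the memory-traffic argument, here is a toy CPU sketch (NumPy stand-in for GPU kernels; all names hypothetical, not from this repo): the unfused path materializes the cumsum in memory and re-reads it in a second elementwise pass, while the fused loop keeps the running sum in a register and writes each output element once.

```python
import numpy as np

def cumsum_then_gate_unfused(x, g):
    # Kernel 1: standalone cumsum (reads x, writes intermediate s).
    s = np.cumsum(x)
    # Kernel 2: elementwise gate (re-reads s, writes out) — an extra
    # launch and an extra memory round-trip for the intermediate.
    return s * g

def cumsum_then_gate_fused(x, g):
    # Single pass: the running sum lives in a register (acc), the gate
    # is applied before the store, and the intermediate never hits memory.
    out = np.empty_like(x)
    acc = 0.0
    for i in range(x.shape[0]):
        acc += x[i]
        out[i] = acc * g[i]
    return out
```

In a real Triton/CUDA kernel the same idea applies per-block: compute the (block-local) cumsum in registers or shared memory and apply the neighboring elementwise op before the global store.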

Tasks

  • Identify small kernels adjacent to main linear attention kernels in the inference pipeline
  • Evaluate fusion opportunities (cumsum, gating, normalization, etc.)
  • Implement fused kernels
  • Benchmark latency improvements
  • Validate correctness
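For the correctness task, one possible shape for the validation harness (hypothetical names; the unfused composition serves as the reference) is to check any candidate fused kernel against the separate-kernel result on random inputs:

```python
import numpy as np

def reference_cumsum_gate(x, g):
    # Unfused reference: standalone cumsum followed by an elementwise gate.
    return np.cumsum(x) * g

def validate(candidate, n=4096, seed=0, rtol=1e-5, atol=1e-8):
    # Compare a candidate fused kernel against the unfused reference on
    # random inputs; True when outputs agree within tolerance.
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    g = rng.standard_normal(n)
    return np.allclose(candidate(x, g), reference_cumsum_gate(x, g),
                       rtol=rtol, atol=atol)
```

For the actual GPU kernels, tolerances would need to account for accumulation-order differences (fp16/bf16 inputs, per-block partial sums), so a looser `rtol` than the fp64 default above is likely appropriate.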
