Description
Fuse small neighboring kernels (e.g., cumsum, element-wise ops) with main linear attention kernels to reduce kernel launch overhead and memory traffic in inference scenarios.
Context
In inference workloads, small auxiliary kernels like cumsum can become a significant fraction of total execution time due to kernel launch overhead and extra memory round-trips. Fusing these into the main kernel can yield meaningful speedups, especially for latency-sensitive serving.
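To illustrate the fusion idea outside the GPU context, here is a minimal CPU sketch (hypothetical names, pure Python) contrasting two separate passes — cumsum followed by an element-wise op — with a single fused pass that does both while touching memory once:

```python
def unfused(x, scale):
    # Pass 1: cumulative sum (one read + one write per element).
    c = []
    total = 0.0
    for v in x:
        total += v
        c.append(total)
    # Pass 2: element-wise scale (a second read + write per element).
    return [v * scale for v in c]

def fused(x, scale):
    # Single pass: the running sum and the scale are applied together,
    # analogous to folding a small cumsum kernel into the main kernel.
    out = []
    total = 0.0
    for v in x:
        total += v
        out.append(total * scale)
    return out
```

On a GPU the same transformation additionally removes one kernel launch and the intermediate buffer's global-memory round-trip, which is where the latency win comes from in serving workloads.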
Tasks