Description
Fuse small neighboring kernels (e.g., cumsum, element-wise ops) with main linear attention kernels to reduce kernel launch overhead and memory traffic in inference scenarios.
Context
In inference workloads, small auxiliary kernels like cumsum can become a significant fraction of total execution time due to kernel launch overhead and extra memory round-trips. Fusing these into the main kernel can yield meaningful speedups, especially for latency-sensitive serving.
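To illustrate the fusion idea outside the GPU context, here is a minimal CPU sketch (hypothetical names, pure Python) contrasting two separate passes — cumsum followed by an element-wise op — with a single fused pass that does both while touching memory once:

```python
def unfused(x, scale):
    # Pass 1: cumulative sum (one read + one write per element).
    c = []
    total = 0.0
    for v in x:
        total += v
        c.append(total)
    # Pass 2: element-wise scale (a second read + write per element).
    return [v * scale for v in c]

def fused(x, scale):
    # Single pass: the running sum and the scale are applied together,
    # analogous to folding a small cumsum kernel into the main kernel.
    out = []
    total = 0.0
    for v in x:
        total += v
        out.append(total * scale)
    return out
```

On a GPU the same transformation additionally removes one kernel launch and the intermediate buffer's global-memory round-trip, which is where the latency win comes from in serving workloads.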
Tasks