Description
Implement an optimized KDA prefill kernel for SM10X (Blackwell) GPUs.
Context
A fused KDA prefill kernel is already available for SM90 (Hopper). Porting it to SM10X (Blackwell) and optimizing it for that architecture would let it take advantage of Blackwell's architectural improvements and deliver better inference performance.
Tasks