KT-Kernel

High-performance kernel operations for KTransformers, featuring CPU-optimized MoE inference with AMX, AVX, KML, and BLIS (AMD library) support.
Current Support Status:
- ✅ Intel CPUs with AMX: Fully supported (using weights converted to INT4/INT8 format)
- ✅ Universal CPU (llamafile backend): Supported (using GGUF-format weights)
- ✅ AMD CPUs with BLIS: Supported (INT8 prefill & decode)
- ✅ Kimi-K2 Native INT4 (RAWINT4): Supported on AVX512 CPUs (CPU-GPU shared INT4 weights) - Guide
- CPU-Optimized MoE Kernels: High-throughput MoE expert kernels optimized for instruction sets.
- AMX INT4/INT8 Backend: INT4 / INT8 quantized expert inference backend for AMX-capable servers.
- Llamafile CPU Backend: AVX2/AVX512-based MoE backend built on Llamafile for universal CPU deployment.
- NUMA-Aware Execution: Thread pool and memory layout designed for multi-socket / multi-NUMA machines.
Coming soon...
Choose the version matching your CUDA installation:
# For CUDA 11.8
pip install kt-kernel==0.4.2.cu118
# For CUDA 12.1
pip install kt-kernel==0.4.2.cu121
# For CUDA 12.4
pip install kt-kernel==0.4.2.cu124
# For CUDA 12.6
pip install kt-kernel==0.4.2.cu126

Note: Replace `0.4.2` with the latest version if available.
Features:
- ✅ Automatic CPU detection: Detects your CPU and loads the optimal kernel variant
- ✅ Multi-variant wheel: Includes AMX, AVX512, and AVX2 variants in a single package
- ✅ No compilation needed: Pre-built wheels for Python 3.10, 3.11, 3.12
- ✅ Multiple CUDA versions: Choose the version matching your environment
Requirements:
- CUDA 11.8+ or 12.x runtime (must match the package version you install)
- PyTorch 2.0+ (install separately, must match CUDA version)
- Linux x86-64
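To confirm which CUDA runtime your PyTorch build targets before picking a wheel, a quick check using standard PyTorch attributes:

```python
import torch

# The wheel suffix you install (cu118 / cu121 / cu124 / cu126)
# should match the CUDA version PyTorch was built against.
print(torch.version.cuda)    # e.g. "12.4"
print(torch.__version__)     # PyTorch 2.0+ required
```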
CPU Variants Included:
| Variant | CPU Support | Use Case |
|---|---|---|
| AMX | Intel Sapphire Rapids+ | Best performance on latest Intel CPUs |
| AVX512 | Intel Skylake-X/Ice Lake/Cascade Lake | AVX512-capable CPUs without AMX |
| AVX2 | Intel Haswell+, AMD Zen+ | Maximum compatibility |
Check which variant is loaded:
import kt_kernel
print(f"CPU variant: {kt_kernel.__cpu_variant__}") # 'amx', 'avx512', or 'avx2'
print(f"Version: {kt_kernel.__version__}")

Environment Variables:
# Override automatic CPU detection
export KT_KERNEL_CPU_VARIANT=avx2 # or 'avx512', 'amx'
# Enable debug output
export KT_KERNEL_DEBUG=1
python -c "import kt_kernel"

Build from source for local installation or when you need AMD (BLIS), ARM (KML), or custom CUDA versions.
First, initialize git submodules and create a conda environment:
git submodule update --init --recursive
conda create -n kt-kernel python=3.11 -y
conda activate kt-kernel

Simply run the install script - it will auto-detect your CPU and optimize for best performance:
./install.sh

What happens automatically:
- Auto-detects CPU capabilities (AMX, AVX512_VNNI, AVX512_BF16)
- Installs system dependencies (`cmake`, `libhwloc-dev`, `pkg-config`)
- Builds an optimized binary for your CPU only (using `-march=native`)
- Software fallbacks: automatically enabled for CPUs without VNNI/BF16
Optional: Two-step installation
./install.sh deps # Install dependencies only
./install.sh build   # Build and install kt-kernel

CPU Requirements by Backend:
| Backend | Minimum CPU Requirement | Example CPUs | Notes |
|---|---|---|---|
| LLAMAFILE | AVX2 | Intel Haswell (2013+), AMD Zen+ | Universal compatibility |
| RAWINT4 | AVX512F + AVX512BW | Intel Skylake-X (2017+), Ice Lake, Cascade Lake | Software fallbacks for VNNI/BF16 |
| AMXINT4/INT8 | AMX | Intel Sapphire Rapids (2023+) | Best performance, requires AMX hardware |
Software Fallback Support (AVX512 backends):
- ✅ VNNI fallback: Uses AVX512BW instructions
- ✅ BF16 fallback: Uses AVX512F instructions
- ✅ Older AVX512 CPUs (Skylake-X, Cascade Lake) can run RAWINT4 with fallbacks
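If you're unsure which backend your machine supports, you can probe the CPU feature flags directly. A minimal sketch using the standard Linux `/proc/cpuinfo` flag names, mirroring the table above:

```python
# Probe /proc/cpuinfo for the extensions each backend requires.
def cpu_flags() -> set:
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
print("LLAMAFILE (AVX2):      ", "avx2" in flags)
print("RAWINT4 (AVX512F+BW):  ", {"avx512f", "avx512bw"} <= flags)
print("AMXINT4/INT8 (AMX):    ", "amx_tile" in flags)
print("hardware VNNI:         ", "avx512_vnni" in flags)  # else software fallback
print("hardware BF16:         ", "avx512_bf16" in flags)  # else software fallback
```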
python -c "from kt_kernel import KTMoEWrapper; print('✓ kt-kernel installed successfully')"

KT-Kernel can be used standalone via the Direct Python API or integrated with SGLang for production deployment. This section describes the SGLang integration, which enables CPU-GPU heterogeneous inference: "hot" experts run on the GPU and "cold" experts on the CPU for optimal resource utilization.
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"

You need both GPU weights and CPU-side expert weights for heterogeneous inference. The exact format depends on the backend:
GPU Weights (for all backends):
Use the model weights required by SGLang for GPU inference (for example, the original or already-quantized model directory from Hugging Face).
CPU Weights (AMX backend: AMXINT4 / AMXINT8):
Quantize weights to AMX-optimized INT4/INT8 format using the provided script:
python scripts/convert_cpu_weights.py \
--input-path /path/to/model \
--input-type bf16 \
--output /path/to/cpu-weights \
--quant-method int8   # or int4, or moe_int8 (currently for AMD)

- `--input-path`: Path to the GPU-side original weights
- `--input-type`: Matches your GPU weight type (`fp8`, `fp16`, or `bf16`)
In the SGLang integration, `--kt-weight-path` should point to this converted CPU weights directory.
Supported input formats: FP8, FP16, BF16 → INT4/INT8.
CPU Weights (LLAMAFILE backend: LLAMAFILE):
LLAMAFILE uses pre-quantized GGUF weights on the CPU side directly, without running `convert_cpu_weights.py`. You need to:
- Download a GGUF model directly from the web (e.g., GGUF repos on Hugging Face / Modelscope);
- In SGLang integration, use that GGUF directory as `--kt-weight-path`. KT-Kernel supports multiple GGUF quantization formats such as `Q4_K_M`, `Q4_K`, `Q5_K`, etc. Choose based on your latency and accuracy requirements.
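If you prefer scripting the GGUF download over the CLI, the equivalent call via the standard `huggingface_hub` API looks like this (repo and file names are examples, matching the walkthrough below):

```python
from huggingface_hub import hf_hub_download

# Fetch a single pre-quantized GGUF file; point --kt-weight-path
# at the resulting directory when launching SGLang.
gguf_path = hf_hub_download(
    repo_id="Qwen/Qwen3-30B-A3B-GGUF",
    filename="Qwen3-30B-A3B-Q4_K_M.gguf",
    local_dir="/mnt/data/models/Qwen3-30B-A3B-Q4_K_M",
)
print(gguf_path)
```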
Start the SGLang server with your normal SGLang parameters, and add the following KT-Kernel specific parameters to enable CPU-GPU heterogeneous inference:
KT-Kernel Parameters to Add:
- `--kt-method`: Backend method (`AMXINT4`, `AMXINT8`, or `LLAMAFILE`)
- `--kt-weight-path`: Path to the converted CPU weights
- `--kt-cpuinfer`: Number of CPU inference threads (set to physical cores)
- `--kt-threadpool-count`: Number of thread pools (set to NUMA node count)
- `--kt-num-gpu-experts`: Number of experts to keep on GPU
- `--kt-max-deferred-experts-per-token`: Deferred experts for pipelined execution
Example:
python -m sglang.launch_server \
[your normal SGLang parameters...] \
--kt-method AMXINT8 \
--kt-weight-path /path/to/cpu-weights \
--kt-cpuinfer 64 \
--kt-threadpool-count 2 \
--kt-num-gpu-experts 32 \
--kt-max-deferred-experts-per-token 2

See the KT-Kernel Parameters section below for detailed parameter tuning guidelines.
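Once the server is up, you can smoke-test the KT-Kernel path end to end with an ordinary OpenAI-compatible request (SGLang exposes the /v1 API; the host, port, and model name below are assumptions matching the walkthrough that follows):

```python
import requests

# Minimal smoke test against a running SGLang + KT-Kernel server.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen3-30B-A3B",  # must match --served-model-name
        "messages": [{"role": "user", "content": "Reply in one short sentence."}],
        "max_tokens": 64,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```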
This example demonstrates the full workflow from downloading weights to launching the server, showing both AMX backend and LLAMAFILE backend options.
Hardware Configuration:
- GPU: NVIDIA RTX 4090 24GB
- CPU: 2x Intel Xeon Gold 6454S (64 physical cores total, 128 threads, 2 NUMA nodes)
- Model: Qwen3-30B-A3B
How to verify your system configuration:
# Check CPU configuration
lscpu | grep -E "^CPU\(s\)|Thread\(s\) per core|Socket\(s\)|NUMA node\(s\)"
# Expected output example:
CPU(s): 128
Thread(s) per core: 2
Socket(s): 2
NUMA node(s): 2
# → Physical cores = CPU(s) / Thread(s) per core = 128 / 2 = 64

Parameter Rationale:
- `--kt-cpuinfer 64`: Set to physical cores (64), not hyperthreads (128)
- `--kt-threadpool-count 2`: 2 NUMA nodes detected (dual-socket system)
- `--kt-num-gpu-experts 32`: With 24GB of GPU memory, roughly 32 experts fit on the GPU for this model (varies by model architecture and actual memory usage)
- `--kt-max-deferred-experts-per-token 2`: Enables pipelined execution; the CPU can process the next batch while the GPU completes the current one
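If you'd rather compute these two values programmatically than read them off `lscpu`, a small sketch using standard Linux sysfs paths:

```python
import glob

def physical_cores() -> int:
    # Count unique (package, core) pairs across all logical CPUs.
    seen = set()
    for cpu in glob.glob("/sys/devices/system/cpu/cpu[0-9]*"):
        try:
            with open(f"{cpu}/topology/physical_package_id") as f:
                pkg = f.read().strip()
            with open(f"{cpu}/topology/core_id") as f:
                core = f.read().strip()
            seen.add((pkg, core))
        except FileNotFoundError:
            pass  # offline CPUs may lack topology entries
    return len(seen)

def numa_nodes() -> int:
    # One /sys/devices/system/node/nodeN directory per memory domain.
    return len(glob.glob("/sys/devices/system/node/node[0-9]*"))

print(f"--kt-cpuinfer {physical_cores()}")
print(f"--kt-threadpool-count {numa_nodes()}")
```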
For Intel CPUs with AMX instruction set support.
Step 1: Download model weights
# Install huggingface-cli if not already installed
pip install huggingface-hub
# Download model from Hugging Face
huggingface-cli download Qwen/Qwen3-30B-A3B --local-dir /mnt/data/models/Qwen3-30B-A3B

Step 2: Convert to CPU weights (AMXINT8)
python scripts/convert_cpu_weights.py \
--input-path /mnt/data/models/Qwen3-30B-A3B \
--input-type bf16 \
--output /mnt/data/models/Qwen3-30B-A3B-INT8 \
--quant-method int8

Step 3: Launch SGLang server
python -m sglang.launch_server \
--host 0.0.0.0 \
--port 8000 \
--model /mnt/data/models/Qwen3-30B-A3B \
--trust-remote-code \
--mem-fraction-static 0.92 \
--chunked-prefill-size 4096 \
--served-model-name Qwen3-30B-A3B \
--enable-mixed-chunk \
--kt-method AMXINT8 \
--kt-weight-path /mnt/data/models/Qwen3-30B-A3B-INT8 \
--kt-cpuinfer 64 \
--kt-threadpool-count 2 \
--kt-num-gpu-experts 32 \
--kt-max-deferred-experts-per-token 2

For universal CPUs (no AMX required), using pre-quantized GGUF weights directly.
Step 1: Download GPU weights (original model)
pip install huggingface-hub
huggingface-cli download Qwen/Qwen3-30B-A3B --local-dir /mnt/data/models/Qwen3-30B-A3B

Step 2: Download CPU weights (GGUF format)
huggingface-cli download Qwen/Qwen3-30B-A3B-GGUF Qwen3-30B-A3B-Q4_K_M.gguf \
--local-dir /mnt/data/models/Qwen3-30B-A3B-Q4_K_M

Step 3: Launch SGLang server
python -m sglang.launch_server \
--host 0.0.0.0 \
--port 8000 \
--model /mnt/data/models/Qwen3-30B-A3B \
--trust-remote-code \
--mem-fraction-static 0.92 \
--chunked-prefill-size 4096 \
--served-model-name Qwen3-30B-A3B \
--enable-mixed-chunk \
--kt-method LLAMAFILE \
--kt-weight-path /mnt/data/models/Qwen3-30B-A3B-Q4_K_M \
--kt-cpuinfer 64 \
--kt-threadpool-count 2 \
--kt-num-gpu-experts 32 \
--kt-max-deferred-experts-per-token 2

| Parameter | Description | Example Value |
|---|---|---|
| `--kt-method` | CPU inference backend method | `AMXINT4`, `AMXINT8`, `RAWINT4`, or `LLAMAFILE` |
| `--kt-weight-path` | Path to quantized CPU weights | `/path/to/cpu-weights` |
| `--kt-cpuinfer` | Number of CPU inference threads | 64 (adjust based on CPU cores) |
| `--kt-threadpool-count` | Number of thread pools for parallel execution | 2 (typically 1-4) |
| `--kt-num-gpu-experts` | Number of experts to keep on GPU | 32 (remaining experts go to CPU) |
| `--kt-max-deferred-experts-per-token` | Number of experts per token to defer for pipelined execution | 2 (0 to disable, 1-4 recommended) |
| `--kt-gpu-prefill-token-threshold` | Token count threshold for prefill strategy (RAWINT4 only) | ~400 |
Parameter Guidelines:
- `kt-method`: Choose based on your CPU and weight format:
  - `AMXINT4`: Best performance on AMX CPUs with INT4-quantized weights (may cause a large accuracy drop for some models, e.g., Qwen3-30B-A3B)
  - `AMXINT8`: Higher accuracy with INT8-quantized weights on AMX CPUs
  - `RAWINT4`: Native INT4 weights shared by CPU and GPU (AMX backend only; currently supports the Kimi-K2-Thinking model). See the Kimi-K2-Thinking Native Tutorial for details.
  - `LLAMAFILE`: GGUF-based backend
- `kt-cpuinfer`: Set to the number of physical CPU cores (not hyperthreads).
  - Check physical cores: `lscpu | grep -E "^CPU\(s\)|Thread\(s\) per core"`
  - Physical cores = CPU(s) / Thread(s) per core
  - Example: if CPU(s) = 128 and Thread(s) per core = 2, physical cores = 64
  - Important: do NOT set this to the hyperthread count; it will degrade performance
- `kt-threadpool-count`: Set to the number of NUMA nodes.
  - Check NUMA count: `lscpu | grep "NUMA node(s)"`
  - Or use: `numactl --hardware | grep "available"`
  - Note: the NUMA node count is NOT necessarily the number of physical CPUs; it represents memory domains, which may be divided within a single CPU or span multiple CPUs
  - Use the NUMA node count from `lscpu`, regardless of the physical CPU count
  - Typical values: 1-2 for single-socket, 2-4 for dual-socket systems
  - This enables better memory bandwidth utilization across NUMA domains
- `kt-num-gpu-experts`: Determine based on GPU memory and profiling (see the sketch after this list):
  - More GPU experts = lower latency but higher GPU memory usage (may cause OOM)
- `kt-max-deferred-experts-per-token`: Enables pipelined execution:
  - 0: Synchronous execution (simpler, higher latency)
  - 1-4: Deferred execution (recommended range; good latency/quality balance, requires tuning)
  - 5-7: Highest latency reduction, but may introduce noticeable accuracy loss; use with care
- `kt-gpu-prefill-token-threshold` (RAWINT4 only): Controls the prefill strategy for native INT4 inference:
  - Token count ≤ threshold: uses hybrid CPU+GPU prefill. No extra VRAM needed, but performance degrades slowly as the token count increases.
  - Token count > threshold: uses layerwise GPU prefill. Performance scales better with longer sequences, but requires ~9GB+ extra VRAM.
  - Only applicable with `--kt-method RAWINT4`. Currently supports the Kimi-K2-Thinking model only.
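For a rough starting point on `--kt-num-gpu-experts` (the sketch referenced above), you can estimate per-expert weight size from the model's config and divide it into whatever VRAM remains after dense weights, KV cache, and activations. The dimensions and the 4 GiB budget below are illustrative assumptions; the real fit depends on KT-Kernel's placement and runtime overheads, so profile to confirm:

```python
# Each MoE expert holds three projections (gate, up, down),
# each of shape hidden_size x moe_intermediate_size.
hidden_size = 2048             # check your model's config.json
moe_intermediate_size = 768
bytes_per_param = 2            # bf16 GPU weights

expert_bytes = 3 * hidden_size * moe_intermediate_size * bytes_per_param

free_vram_gib = 4.0            # assumed VRAM left over for experts
budget = free_vram_gib * 1024**3

print(f"~{expert_bytes / 1024**2:.1f} MiB per expert per layer")
print(f"~{int(budget // expert_bytes)} expert-layer slots fit in {free_vram_gib} GiB")
```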
For standalone usage without SGLang, you can use KT-Kernel directly via Python API:
from kt_kernel import KTMoEWrapper
# Initialize the MoE wrapper
wrapper = KTMoEWrapper(
layer_idx=0,
num_experts=8,
num_experts_per_tok=2,
hidden_size=4096,
moe_intermediate_size=14336,
num_gpu_experts=2,
cpuinfer_threads=32,
threadpool_count=2,
weight_path="/path/to/weights",
chunked_prefill_size=512,
method="AMXINT4" # Options: "AMXINT4", "AMXINT8", "LLAMAFILE"
)
# Load weights (from disk - pre-quantized)
wrapper.load_weights(physical_to_logical_map)
# Or load weights from tensors (online quantization)
wrapper.load_weights_from_tensors(gate_proj, up_proj, down_proj, physical_to_logical_map)
# Run inference
output = wrapper.forward(hidden_states, topk_ids, topk_weights, cuda_stream)
# Or use async API for better performance
wrapper.submit_forward(hidden_states, topk_ids, topk_weights, cuda_stream)
# ... do other work ...
output = wrapper.sync_forward(hidden_states, cuda_stream)

# Initialize with additional options
wrapper = KTMoEWrapper(
layer_idx=0,
num_experts=8,
num_experts_per_tok=2,
hidden_size=4096,
moe_intermediate_size=14336,
num_gpu_experts=2,
cpuinfer_threads=32,
threadpool_count=2,
weight_path="/path/to/weights",
chunked_prefill_size=512,
method="AMXINT4",
    cpu_save=False,  # Whether to keep weights in CPU memory after loading
max_deferred_experts_per_token=0 # Number of experts to defer (for pipelined execution)
)
# Pre-allocate buffers for specific batch sizes (improves performance)
KTMoEWrapper.set_capture_batch_sizes([1, 2, 4, 8, 16])
# Query captured batch sizes
batch_sizes = KTMoEWrapper.get_capture_batch_sizes()
# Clear buffer cache to free memory
KTMoEWrapper.clear_buffer_cache()

For portable builds, binary distribution, or cross-machine deployment, you need to manually specify target instruction sets:
# General distribution (works on any AVX512 CPU from 2017+)
export CPUINFER_CPU_INSTRUCT=AVX512
export CPUINFER_ENABLE_AMX=OFF
./install.sh build --manual
# Maximum compatibility (works on any CPU from 2013+)
export CPUINFER_CPU_INSTRUCT=AVX2
export CPUINFER_ENABLE_AMX=OFF
./install.sh build --manual
# Modern CPUs only (Ice Lake+, Zen 4+)
export CPUINFER_CPU_INSTRUCT=FANCY
export CPUINFER_ENABLE_AMX=OFF
./install.sh build --manual

Optional: Override VNNI/BF16 detection
# Force enable/disable VNNI and BF16 (for testing fallbacks)
export CPUINFER_ENABLE_AVX512_VNNI=OFF
export CPUINFER_ENABLE_AVX512_BF16=OFF
./install.sh

See `./install.sh --help` for all available options.
If you prefer manual installation without the install.sh script:
Prerequisites:
- `cmake` (recommended: `conda install -y cmake`)
- `libhwloc-dev` and `pkg-config`
Core Options:
| Variable | Options | Description |
|---|---|---|
| `CPUINFER_CPU_INSTRUCT` | `NATIVE`, `AVX512`, `AVX2`, `FANCY` | CPU instruction set to use |
| `CPUINFER_ENABLE_AMX` | `ON`, `OFF` | Enable Intel AMX support |
| `CPUINFER_BUILD_TYPE` | `Release`, `Debug`, `RelWithDebInfo` | Build type (default: `Release`) |
| `CPUINFER_PARALLEL` | Number | Parallel build jobs (default: auto-detect) |
| `CPUINFER_VERBOSE` | `0`, `1` | Verbose build output (default: `0`) |
Instruction Set Details:
| Option | Target CPUs | Use Case |
|---|---|---|
| `NATIVE` | Your specific CPU only | Local builds (best performance, default) |
| `AVX512` | Skylake-X, Ice Lake, Cascade Lake, Zen 4+ | General distribution |
| `AVX2` | Haswell (2013) and newer | Maximum compatibility |
| `FANCY` | Ice Lake+, Zen 4+ | Modern CPUs with full AVX512 extensions |
Example Configurations:
# Local use - maximum performance (default behavior)
export CPUINFER_CPU_INSTRUCT=NATIVE
export CPUINFER_ENABLE_AMX=ON # or OFF
# Distribution build - works on any AVX512 CPU
export CPUINFER_CPU_INSTRUCT=AVX512
export CPUINFER_ENABLE_AMX=OFF
# Maximum compatibility - works on CPUs since 2013
export CPUINFER_CPU_INSTRUCT=AVX2
export CPUINFER_ENABLE_AMX=OFF
# Debug build
export CPUINFER_BUILD_TYPE=Debug
export CPUINFER_VERBOSE=1

# Editable installation (for development)
pip install -e .
# Standard installation
pip install .

If the build fails because CMake cannot find a CUDA compiler:

-- Looking for a CUDA compiler - NOTFOUND
CMake Error at CMakeLists.txt:389 (message):
KTRANSFORMERS_USE_CUDA=ON but CUDA compiler not found
Make sure you have the CUDA toolkit installed and nvcc is in your system PATH.
Try `export CMAKE_ARGS="-D CMAKE_CUDA_COMPILER=$(which nvcc)"` and reinstall.
If hwloc is missing, run `sudo apt install libhwloc-dev` on a Debian-based system, or build it from source: https://www.open-mpi.org/projects/hwloc/
wget https://download.open-mpi.org/release/hwloc/v2.12/hwloc-2.12.2.tar.gz
tar -xzf hwloc-2.12.2.tar.gz
cd hwloc-2.12.2
./configure
make
sudo make install
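To confirm the freshly built library is visible to the dynamic loader (a minimal check; if it prints None, run `sudo ldconfig` so /usr/local/lib is re-indexed):

```python
# Locate the hwloc shared library via the standard ctypes helper.
from ctypes.util import find_library

print(find_library("hwloc"))  # e.g. "libhwloc.so.15"; None if not found
```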
For AMX backends (AMXINT4 / AMXINT8), CPU-side experts must be converted to AMX-friendly INT4/INT8 format using the provided script:
python scripts/convert_cpu_weights.py \
--input-path /path/to/model \
--input-type bf16 \
--output /path/to/output \
--quant-method int4

Supported formats: FP8, FP16, BF16 → INT4/INT8
For the LLAMAFILE backend, CPU-side experts are loaded directly from GGUF weights. You do not need to run the AMX conversion script; instead, download a GGUF model from the web (e.g., a GGUF repo on Hugging Face) and point `weight_path` / SGLang `--kt-weight-path` (or `--model` when appropriate) to that GGUF directory. KT-Kernel supports multiple GGUF quantization types such as `Q4_K_M`, `Q4_K`, `Q5_K`, etc.
For detailed documentation, advanced options, and low-memory mode, see scripts/README.md.
Commit messages should follow the Conventional Commits specification: https://www.conventionalcommits.org/
Please format your code before committing:
cmake -B build
cd build
make format

You may need a newer clang-format (at least version 18). In a conda environment:
conda install -c conda-forge clang-format=18
rm -rf build

It's also recommended to install `black` for Python code formatting:
conda install black