Skip to content

inclusionAI/humming

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

62 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Humming

Humming is a high-performance, lightweight, and highly flexible JIT (Just-In-Time) compiled GEMM kernel library specifically designed for quantized inference.

Key Features

  • High Flexibility
    • Supports inference for any weight type under 8-bit across FP16 / BF16 / FP8 / FP4 / INT8 / INT4 activations (provided the activation's dynamic range covers the weight type).
    • Supports various quantization strategies.
    • Supports various scale types (BF16, FP16, E4M3, E5M2, and UE8M0).
    • Supports both Dense GEMM and MoE GEMM.
  • High Compatibility: supports all NVIDIA GPUs from SM75+ (Turing architecture) and beyond.
  • High Performance
    • Delivers State-of-the-Art (SOTA) throughput and efficiency across a wide range of computational scenarios.
  • Ultra-Lightweight
    • Minimal dependencies: Requires only PyTorch and NVCC.
    • Compact footprint: The package size is only 100+KB.

Support Matrix

Activation Type Supported Devices Supported Weight Types
FP16 (e5m10) SM75+ • Symmetric INT1-8
• INT1-8 with dynamic zero point
• Arbitrary signed FP (kBits ≤ 8, kExp ≤ 5)
BF16 (e8m7) SM80+ • Symmetric INT1-8
• INT1-8 with dynamic zero point
• Arbitrary signed FP (kBits ≤ 8)
FP8 (e4m3) SM89+ • Symmetric INT1-5
• INT1-4 with dynamic zero point
• Arbitrary signed FP (kExp ≤ 4, kMan ≤ 3)
FP8 (e5m2) SM89+ • Symmetric INT1-4
• INT1-3 with dynamic zero point
• Arbitrary signed FP (kExp ≤ 5, kMan ≤ 2)
FP4 (e2m1) SM120+ • Symmetric INT1-3
• INT1-2 with dynamic zero point
• Arbitrary signed FP (kExp ≤ 2, kMan ≤ 1)
INT8 SM75+ • Symmetric INT1-8
• INT1-7 with dynamic zero point
INT4 SM80+ • Symmetric INT1-4
• INT1-3 with dynamic zero point

Getting Started

Installation

pip install git+https://github.com/inclusionAI/humming.git

Usage Example

from humming.layer import HummingLayer
import torch


layer = HummingLayer(
    shape_n=8192,
    shape_k=8192,
    weight_config={"dtype": "int6"},
    torch_dtype=torch.float16,
).cuda()

torch.cuda.manual_seed(0)
inputs = torch.randn((8192, 8192), dtype=torch.float16, device="cuda:0")
weight = torch.randn((8192, 8192), dtype=torch.float16, device="cuda:0")
# load unquantized weight and quantize to layer quantization format
layer.load_from_unquantized(weight)
# transform weight to humming format
layer.transform()
# you can add a kernel for a shape_m range (min_shape_m, max_shape_m]
layer.add_kernel_config(
    # min_shape_m (not included)
    min_shape_m=64,
    # max_shape_m (included)
    max_shape_m=128,
    # block_shape and warp_shape are required
    block_shape=(64, 128, 128),
    warp_shape=(64, 64, 32),
    # other args are optional
    num_stages=3,
    use_stream_k=False,
    use_f16_accum=True,
)
# or use default kernels
layer.prepare_default_kernel_configs()


print("\n\nQuantized GEMM Output:\n")
print(layer(inputs))
print("\n\nUnquantized GEMM Output:\n")
print(inputs.matmul(weight.T))

Acknowledgement

This project is highly inspired by

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors