InfiniTensor/InfiniTrain

InfiniTrain

A from-scratch C++ training framework for large-scale models with multi-dimensional distributed parallelism.

🚀 Quick Start

System Requirements

Hardware Requirements

  • Recommended: NVIDIA Ampere-class GPUs (A100/A800) or newer

Software Requirements

  • CUDA / NCCL: Latest stable versions
  • gcc / g++: Version 13+
  • CMake: Version 3.13+

Installation

mkdir build
cd build
cmake .. -DUSE_CUDA=ON -DUSE_NCCL=ON
make -j

Build Options:

  • USE_CUDA=ON

    Enable CUDA backend support.

  • USE_NCCL=ON

    Enable NCCL-based distributed communication.

Both options are optional; disable both for a CPU-only build.
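
As a rough sketch of how the two build options combine, the helper below maps a backend choice to cmake flags. The function name `cmake_flags` and the backend labels are hypothetical; only `USE_CUDA` and `USE_NCCL` come from this README. Whether CUDA without NCCL is a supported combination is an assumption based on both options being optional.

```shell
# Hypothetical helper, not part of InfiniTrain: map a backend choice to
# cmake flags. Only the option names USE_CUDA and USE_NCCL come from the README.
cmake_flags() {
  case "$1" in
    cpu)       echo "-DUSE_CUDA=OFF -DUSE_NCCL=OFF" ;;
    cuda)      echo "-DUSE_CUDA=ON -DUSE_NCCL=OFF" ;;  # assumed valid: CUDA without NCCL
    cuda+nccl) echo "-DUSE_CUDA=ON -DUSE_NCCL=ON" ;;
    *)         echo "unknown backend: $1" >&2; return 1 ;;
  esac
}

# Dry run: print the configure command for a CPU-only build.
# Run the printed command from inside build/, then run make -j.
echo "cmake .. $(cmake_flags cpu)"
```

The script only prints the command, so it is safe to run outside the repository.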

✨ InfiniTrain Overview

✔ Support Matrix

| Category | Feature | Description | Status |
|---|---|---|---|
| Model Support | GPT-2 | Decoder-only Transformer language model | ✔ Supported |
| | LLaMA 3 | Modern LLaMA-family Transformer architecture | ✔ Supported |
| | Qwen3-8B | Qwen3 8B language model | 🗓 Planned |
| | DeepSeek-V3 | Large-scale MoE-based language model | 🗓 Planned |
| Precision | Multiple Data Type | FP32, BF16 | ✔ Supported |
| | Mixed Precision | Autocast-based BF16 compute with FP32 accumulation | ✔ Supported |
| Distributed Training | Data Parallel (DP) | Parameter-server-style data parallelism | ✔ Supported |
| | Distributed Data Parallel (DDP) | Collective-based data parallelism | ✔ Supported |
| | Tensor Parallelism (TP) | Intra-layer tensor sharding | ✔ Supported |
| | Sequence Parallelism (SP) | Sequence dimension sharding | ✔ Supported |
| | Pipeline Parallelism (PP) | GPipe, 1F1B scheduling, Virtual Pipeline (vPP) | ✔ Supported |
| | Hybrid Parallelism | Arbitrary combination of DDP + TP + SP + PP | ✔ Supported |
| Core Components | Multi-backend | CPU and CUDA execution backends | ✔ Supported |
| | Multi-node Distributed Training | Distributed execution across multiple nodes | ✔ Supported |
| | Transformer Abstraction | Generic Transformer structure abstraction | ✔ Supported |
| | Backend Registries | Device / CCL / dtype abstraction and registration | ✔ Supported |
| | Kernel Dispatcher | Kernel registration and dynamic dispatch mechanism | ✔ Supported |
| | Autograd | Automatic differentiation engine | ✔ Supported |
| | Autocast | Automatic mixed precision runtime | ✔ Supported |
| | Checkpointing | Training checkpoint save and restore | 🗓 Planned |
| Fine-tuning | LoRA | Memory-efficient fine-tuning with merge / unmerge | ✔ Supported |
| Memory Optimizations | ZeRO Stage-1 | Sharded optimizer states for DDP | ✔ Supported |
| | ZeRO Stage-2 | Sharded gradients across DDP ranks | ✔ Supported |
| | Activation Recomputation | Recompute activations to reduce memory usage | 🗓 Planned |
| Performance Optimizations | Compute–Comm Overlap | Explicit scheduling to hide communication latency | ✔ Supported |
| | DDP Gradient Bucketing | Deferred and bucketed gradient synchronization | ✔ Supported |
| Execution Mode | Training Mode | Full forward–backward training with autograd | ✔ Supported |
| | no_grad Inference | Forward-only execution without gradient tracking | ✔ Supported |
| Debugging & Tooling | Built-in Profiler | Kernel-level performance profiling | ✔ Supported |
| | Precision Alignment Checker | Function / Module precision checks and E2E loss diff | ✔ Supported |
| | CTest + GTest Infrastructure | Automated unit tests with CTest integration | ✔ Supported |
| | Automated Benchmarking | One-click execution, log analysis and Feishu export | ✔ Supported |

🏋️ Training

Each model in the example/ directory is compiled into an independent executable.
For example, the llama3 example produces a binary named llama3.

To view available runtime options:

./llama3 --help

Getting Started

The following examples demonstrate LLaMA 3 supervised fine-tuning (SFT) using InfiniTrain.

Single-node Training Example

./llama3 \
  --device cuda \
  --input_bin [training_data_path] \
  --llmc_filepath [model_path] \
  --num_iteration 10

Multi-node Training Example (3D parallelism)

./infini_run \
  --nnodes=2 \
  --nproc_per_node=1 \
  --node_rank=[rank_id] \
  -- ./llama3 \
     --device cuda \
     --input_bin [training_data_path] \
     --llmc_filepath [model_path] \
     --num_iteration 10 \
     --nthread_per_process 8 \
     --batch_size 40 \
     --total_batch_size 10240 \
     --tensor_parallel 2 \
     --pipeline_parallel 2 \
     --sequence_parallel
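
The `[rank_id]` placeholder differs per node. The dry-run sketch below prints one launcher line per node for the 2-node job above. It assumes that every node runs the same command and that ranks are numbered from 0; neither is stated in this README. `launch_cmd` is a hypothetical helper, and the data and model path flags are left out for brevity.

```shell
NNODES=2

# Hypothetical helper: print (do not run) the launcher line for one node.
# Flags are copied from the multi-node example; path flags are omitted.
launch_cmd() {
  echo "./infini_run --nnodes=$NNODES --nproc_per_node=1 --node_rank=$1 -- ./llama3 --device cuda --tensor_parallel 2 --pipeline_parallel 2 --sequence_parallel"
}

# Assumption: node ranks run from 0 to NNODES - 1.
rank=0
while [ "$rank" -lt "$NNODES" ]; do
  echo "node $rank: $(launch_cmd "$rank")"
  rank=$((rank + 1))
done
```

Nothing is launched here; the output is only meant to show which part of the command changes between nodes.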

Parallelism Strategies

Distributed Data Parallelism (DDP)

--nthread_per_process 8    # ddp_size = nthread_per_process / (tensor_parallel × pipeline_parallel)
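
The formula in the comment above can be worked through with a small helper. This is a minimal sketch: `ddp_size` is a hypothetical function, not an InfiniTrain command, and the divisibility check is an assumption about what a valid configuration looks like.

```shell
# Hypothetical helper, not an InfiniTrain command.
# Usage: ddp_size NTHREAD_PER_PROCESS TENSOR_PARALLEL PIPELINE_PARALLEL
ddp_size() (
  nthread=$1; tp=$2; pp=$3
  model_parallel=$((tp * pp))
  # Assumption: the thread count must split evenly into model-parallel groups.
  if [ $((nthread % model_parallel)) -ne 0 ]; then
    echo "nthread_per_process=$nthread is not a multiple of TP*PP=$model_parallel" >&2
    exit 1
  fi
  echo $((nthread / model_parallel))
)

# Flags from the multi-node example: 8 / (2 * 2), prints 2.
ddp_size 8 2 2
```

With `--nthread_per_process 8` and no tensor or pipeline parallelism, the same formula gives a DDP size of 8.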

Tensor Parallelism (TP)

--tensor_parallel 4        # 4-way tensor parallelism
--sequence_parallel        # Enable sequence parallelism (requires TP > 1)

Pipeline Parallelism (PP)

--pipeline_parallel 8            # 8 pipeline stages
--virtual_pipeline_parallel 4    # Virtual pipeline for better load balancing
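
One way to picture the two flags together, assuming Megatron-style interleaving: the model is cut into pipeline_parallel × virtual_pipeline_parallel chunks, and chunk c is placed on stage c mod pipeline_parallel. This README does not state how InfiniTrain assigns chunks, so treat the mapping as an assumption. `NUM_LAYERS=32` and the helper `stage_of_chunk` are illustrative only.

```shell
PP=8            # --pipeline_parallel
VPP=4           # --virtual_pipeline_parallel
NUM_LAYERS=32   # illustrative layer count, not taken from the README

NUM_CHUNKS=$((PP * VPP))
LAYERS_PER_CHUNK=$((NUM_LAYERS / NUM_CHUNKS))

# Hypothetical helper: stage that holds chunk $1 under round-robin interleaving.
stage_of_chunk() { echo $(($1 % PP)); }

# Collect the chunks owned by stage 0.
chunk=0
owned=""
while [ "$chunk" -lt "$NUM_CHUNKS" ]; do
  if [ "$(stage_of_chunk "$chunk")" -eq 0 ]; then
    owned="$owned $chunk"
  fi
  chunk=$((chunk + 1))
done
echo "stage 0 holds chunks:$owned ($LAYERS_PER_CHUNK layer(s) each)"
```

Under these assumptions each stage holds 4 non-adjacent chunks instead of one contiguous block of layers, which is the property interleaved schedules rely on.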

Combining Parallelism Strategies

Multiple parallelism strategies (DDP, TP, SP, PP) can be freely combined to scale training across devices and nodes.
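
Before launching a combined configuration, it can help to check the constraints mentioned in this section. The sketch below is a hypothetical pre-flight check, not part of InfiniTrain. It encodes the rule that sequence parallelism requires TP > 1 and the DDP size formula; the divisibility requirement is an assumption.

```shell
# Hypothetical pre-flight check, not part of InfiniTrain.
# Usage: check_parallel_config NTHREAD TP PP SP
#   SP is 1 when --sequence_parallel is passed, 0 otherwise.
check_parallel_config() (
  nthread=$1; tp=$2; pp=$3; sp=$4
  # From this README: sequence parallelism requires TP > 1.
  if [ "$sp" -eq 1 ] && [ "$tp" -le 1 ]; then
    echo "sequence parallelism requires tensor_parallel > 1" >&2
    exit 1
  fi
  # Assumption: TP * PP must divide the thread count evenly.
  if [ $((nthread % (tp * pp))) -ne 0 ]; then
    echo "nthread_per_process must be a multiple of TP * PP" >&2
    exit 1
  fi
  echo "DDP=$((nthread / (tp * pp))) TP=$tp PP=$pp SP=$sp"
)

# Values from the multi-node example; prints: DDP=2 TP=2 PP=2 SP=1
check_parallel_config 8 2 2 1
```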

🗺 Roadmap

  • 2025/03/10: InfiniTrain v0.1.0

    Initial framework prototype with MNIST CPU training.

  • 2025/04/30: InfiniTrain v0.3.0

    Added Autograd support and GPT-2 training on CPU/CUDA.

  • 2025/07/09: InfiniTrain v0.4.0

    Introduced kernel registration, LLaMA training on CPU/CUDA, BF16 precision, and Data Parallelism.

  • 2025/12/31: InfiniTrain v0.5.0

    Added Autocast, multi-dimensional distributed parallelism (DDP, TP, SP, PP with GPipe / 1F1B / vPP), multi-node training, no_grad mode, and communication–computation overlap with bucketed gradient synchronization.

  • 2026/06/08: InfiniTrain v0.6.0

    Added loss alignment tooling for Function / Module level precision checks and end-to-end loss comparison, with a unified hook mechanism.

    Added memory optimizations for DDP training and Autograd execution. ZeRO Stage-1 shards optimizer states across DDP ranks, while ZeRO Stage-2 further shards gradients. Autograd Tensor release timing was also optimized to reduce peak memory usage.

    Introduced LoRA fine-tuning with merge / unmerge support for efficient training and inference-time weight merging.

    Refactored core backend abstractions around device, communication, and low-precision dtype registration. The framework layer now uses DeviceGuard, CclGroupGuard, and backend-registered FP16 / BF16 native types to avoid hardware-specialized framework code.

    Introduced a generic Transformer structure abstraction backed by TransformerConfig, providing a common foundation for GPT-2 and LLaMA 3 style model construction.

    Improved BF16 training performance through autocast and elementwise kernel optimizations.

    Integrated a CTest + GTest based testing infrastructure to strengthen the framework's automated test workflow.
