[26.06] AutoModel Roadmap
This post tracks the planned work items for NeMo AutoModel in the 26.06 cycle. Plans may shift as we validate performance, hardware coverage, and community priorities. If you have feature requests, use cases, or model families you want us to prioritize, please comment below.
Developer and User Experience
- Installation and Container UX
Make CUDA installs and containers more predictable, including deterministic CUDA stack options for pip, clearer optional dependency groups, validation on common base containers, and first-class launcher and model-specific dependencies in the NVIDIA container. This includes AArch64 container fixes such as qwen_vl_utils, tokenizer dependencies such as sentencepiece and tiktoken, and validation on non-AutoModel base containers.
- Configuration Quality
Improve recipe authoring with typed configuration work, annotated canonical configs, and YAML linting so examples are easier to read, validate, and maintain.
- Include NeMo Run in Containers
Include NeMo Run in the released container so users can use the same CLI for local and remote training.
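As an illustration of the typed-configuration idea under Configuration Quality, here is a minimal sketch, assuming a recipe has already been parsed from YAML into a plain dict. The `RecipeConfig` class, its field names, and `load_recipe` are hypothetical and do not reflect AutoModel's actual schema.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RecipeConfig:
    """Hypothetical typed view of a recipe; field names are illustrative only."""
    model_id: str
    learning_rate: float
    global_batch_size: int
    precision: str = "bf16"


def load_recipe(raw: dict) -> RecipeConfig:
    """Validate a raw dict (e.g. parsed from YAML) against RecipeConfig."""
    known = {f.name: f.type for f in fields(RecipeConfig)}
    unknown = sorted(set(raw) - set(known))
    if unknown:
        # Misspelled keys would otherwise be ignored silently.
        raise ValueError(f"unknown config keys: {unknown}")
    clean = {}
    for name, value in raw.items():
        expected = known[name]
        if expected is float and type(value) is int:
            value = float(value)  # accept 1 where 1.0 was meant
        if type(value) is not expected:
            raise TypeError(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
        clean[name] = value
    # Missing required keys surface here as a TypeError from the constructor.
    return RecipeConfig(**clean)
```

A linter could run a check like this over every example recipe in CI, so that a mistyped key or a string where a number is expected fails before a job is launched.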
Model Support and Registry
- Continuous Day-0 Model Support
Continue rapid support for new Hugging Face model releases across LLM, VLM, omnimodal, diffusion, and retrieval workflows.
- Model Capability Registry
Expand the registry that maps model families to supported capabilities, recipes, tests, and docs so users can quickly see supported parallelism, precision, and functionality such as packed sequences. This includes introducing a standardized test suite to verify each model's capabilities and publishing model-specific CI results. This is a large effort that may be split across multiple cycles.
- DeepSeek V4 Support, MTP, and Stabilization
Add and validate DeepSeek V4 support, including MTP support and follow-up correctness work for HyperConnection, sparse attention, FSDP2 dtype handling, checkpointing, and parity-sensitive paths.
- Gemma 4 Coverage
Expand Gemma 4 support across MTP, context parallelism, Transformer Engine-backed MoE, TP/PP recipe quality, DGX Spark validation for 26B and 31B MoE recipes, and convergence or NaN-loss investigations.
- Qwen Family Coverage
Add Qwen Image fine-tuning and optimization, continue Qwen3.5 MoE and MTP support, and close Qwen3.5 VLM recipe gaps such as neat packing, checkpointing, and container dependency coverage.
- BAGEL and Community-Requested Models
Improve support for BAGEL pretraining and fine-tuning, and other community-requested model families as they land.
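To show what the Model Capability Registry item could look like in practice, here is a minimal sketch of a registry that maps model families to capabilities and answers "which families support X" queries. The entry names, capability strings, and the `families_supporting` helper are all hypothetical placeholders, not real support claims or AutoModel APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCapabilities:
    """Hypothetical registry entry; the capability names are illustrative."""
    parallelism: frozenset = frozenset()
    precision: frozenset = frozenset()
    features: frozenset = frozenset()


# Placeholder entries only; these do not describe any real model family.
REGISTRY = {
    "example-dense": ModelCapabilities(
        parallelism=frozenset({"fsdp2", "tp"}),
        precision=frozenset({"bf16"}),
        features=frozenset({"packed_sequences"}),
    ),
    "example-moe": ModelCapabilities(
        parallelism=frozenset({"fsdp2", "ep"}),
        precision=frozenset({"bf16", "fp8"}),
    ),
}


def families_supporting(registry, *, parallelism=(), precision=(), features=()):
    """Return sorted family names whose entry covers every requested capability."""
    return sorted(
        name
        for name, caps in registry.items()
        if set(parallelism) <= caps.parallelism
        and set(precision) <= caps.precision
        and set(features) <= caps.features
    )
```

Keeping entries as plain data would let the same table drive docs, test selection, and published CI results, which is one possible way to keep the three in sync.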
Core Infrastructure
- Provide an RL Library Interface
Create a clear interface for RL libraries to build on AutoModel’s model support and parallelism features, making it easy to integrate AutoModel-backed models into RL training workflows.
Performance and Parallelism: MoE, PEFT, and More
- MoE Backend Upgrades
Upgrade and validate the MoE communication and expert-dispatch stack, including DeepEP v2, Blackwell paths, MXFP8, and NVFP4 coverage where feasible.
- MoE Compile and CUDA Graph Paths
Continue reducing MoE overhead with torch.compile and CUDA Graph work, with a focus on end-to-end throughput and stability for sparse models.
- Context Parallelism and Packing
Expand CP coverage for TE-backed Gemma 4, sequence packing plus CP, and larger hybrid or long-context recipes.
- Backend and Optimizer Compatibility
Keep the Transformer Engine stack current and close compatibility gaps such as DTensor support for fused optimizer paths and tensor-parallel synchronization for custom module weights.
- Low-Memory PEFT, Expert LoRA, and Kernel Support
Add low-memory PEFT and kernel coverage, including Transformers v5 expert LoRA behavior, fused adapter outputs for MoE layers, and memory-efficient configs for large single-GPU or constrained fine-tuning jobs.
- AutoShard
Explore compiler-assisted automatic model sharding to reduce manual parallelism work for new architectures.
Multimodal, VLM, and Retrieval
- Gemma 4 and Omnimodal Coverage
Broaden Gemma 4 recipe coverage, MTP support, CP support, DGX Spark validation, and related VLM/omnimodal reliability work.
- VLM Knowledge Distillation
Add VLM knowledge distillation coverage so multimodal teacher-student workflows are supported in the 26.06 cycle.
- Retriever Training Coverage
Add retrieval CI coverage and extend dataset and vision-language retrieval support for bi-encoder and cross-encoder workflows.
Diffusion
- BAGEL Pretraining, Fine-Tuning, and Scale Validation
Add and validate BAGEL support as part of the multimodal training stack, including model onboarding, fine-tuning, scale validation, and performance characterization.
- Flux 2 Fine-Tuning
Add support for Flux 2 fine-tuning and keep the diffusion recipes aligned with current Diffusers model APIs.
- Add WAN 2.2 Support
Add first-class support for WAN 2.2 in AutoModel, including model loading, training recipes, parallelism compatibility, and validation against reference behavior.
- Video Diffusion Scale Validation
Use HunyuanVideo and related large video models as scale milestones for validating multi-node diffusion training performance.
- Diffusion Performance Improvements
Continue improving throughput and memory efficiency for image and video diffusion fine-tuning and pretraining.
Checkpointing, State, and Robustness
- Checkpointing Robustness and Speed
Improve save/resume reliability, including async checkpoint coverage, DCP planner regressions, Qwen3.5 neat-packing checkpoint failures, and checkpoint robustness thresholds.
- S3 and Remote Checkpoint Storage
Add DCP-compatible S3 checkpoint support for cloud and shared-storage training workflows.
- Lower-Memory State Dict and PEFT Adapters
Reduce peak VRAM and CPU memory during state-dict adaptation, checkpoint export, and PEFT adapter conversion, including cases where adapter generation currently requires roughly 2x VRAM.
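One way to reason about the roughly 2x figure in the Lower-Memory State Dict item is the back-of-the-envelope model below. It is a sketch under stated assumptions, not a diagnosis of AutoModel's code: it assumes each converted tensor is the same size as its source, and it compares building every converted tensor up front against converting one tensor at a time and releasing each converted copy (for example, after writing it to disk) before building the next. Both function names are hypothetical.

```python
def peak_all_at_once(sizes):
    """Peak bytes when every converted tensor exists before any source is released.

    Assumes each converted tensor is as large as its source.
    """
    return 2 * sum(sizes)


def peak_streamed(sizes):
    """Peak bytes when tensors are converted one at a time.

    Assumes all sources stay resident, and each converted copy is written out
    and released before the next one is built, so at most one converted
    tensor is alive at any moment.
    """
    return sum(sizes) + max(sizes, default=0)
```

For three tensors of 4, 2, and 2 GiB, the first model gives 16 GiB and the second 12 GiB; with many similarly sized tensors, the streamed peak approaches the size of the source state dict alone.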
Release Quality, CI, and Validation
- CI Stability
Harden release coverage across AArch64 containers, tokenizer dependencies, trust_remote_code cache behavior, checkpoint reloads, and known model-specific CI failures.
- Convergence Robustness
Continue convergence investigations and publishable loss-curve validation for representative dense, MoE, VLM, retrieval, and diffusion recipes. This includes targeted investigations such as Llama 3.2 pretraining instability and Gemma 4 TP/PP NaN-loss regressions.
- Recipe Consolidation and Benchmarks
Reduce duplicated benchmark and fine-tuning recipe trees, align configs across AutoModel training and benchmark jobs, and publish clearer benchmark numbers.
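For the Recipe Consolidation item, a first step could be finding recipe files whose parsed content is identical even though they live in different trees. The sketch below assumes recipes are already parsed into dicts keyed by path; the function names and the example paths are hypothetical.

```python
import hashlib
import json
from collections import defaultdict


def config_fingerprint(config: dict) -> str:
    """Hash a config by content so key order and formatting do not matter."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_duplicates(recipes: dict) -> list:
    """Take {path: parsed config} and return groups of paths with identical content."""
    groups = defaultdict(list)
    for path, config in recipes.items():
        groups[config_fingerprint(config)].append(path)
    # Only groups with more than one path are duplicates.
    return sorted(sorted(paths) for paths in groups.values() if len(paths) > 1)


if __name__ == "__main__":
    # Hypothetical paths and values, for illustration only.
    recipes = {
        "benchmark/a.yaml": {"lr": 1e-5, "model": "m"},
        "finetune/b.yaml": {"model": "m", "lr": 1e-5},
        "finetune/c.yaml": {"model": "m", "lr": 2e-5},
    }
    print(find_duplicates(recipes))
```

Exact-content matching only catches verbatim copies; near-duplicates that differ in one or two fields would need a diff-based comparison instead.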
Tracker-Backed 26.06 Items
The following tracker-backed items were used to align the public roadmap with the 26.06 Linear plan.
| Area | Planned item | Tracker reference |
| --- | --- | --- |
| Model support | DeepSeek V4 support | #2034, #2143, #2088, #2086, #2154, #2170 |
| Model support | DeepSeek V4 MTP support | #2191 |
| Model support | BAGEL pretraining and fine-tuning | #1015, #2275 |
| Model support | Qwen Image fine-tuning and optimization | #1700 |
| Diffusion | Flux 2 fine-tuning | #418, #2145 |
| Training/performance | Low-memory PEFT and kernel support | #2166 |
| Training/performance | DeepEP v2 upgrade | #2021 |
| Training/performance | AutoShard | #909 |
| Training/performance | MoE and torch.compile | #1438 |
| Training/performance | MoE on Blackwell | #1187, #2115 |
| Multimodal/retrieval | VLM knowledge distillation | #2195 |
| Multimodal/retrieval | Retriever VL and dataset support | #1342, #1407 |
| Multimodal/retrieval | Retriever CI tests | #1449 |
| Infrastructure/docs | Separate model zoo from infrastructure | #2163 |
| Infrastructure/docs | Component refactor | #2163, #2266 |
| Infrastructure/docs | Model capability registry | #1438 |
| Infrastructure/docs | Fern upgrade | #2196, #2291 |
We Want Your Input
Have a feature request or use case that is not covered above? Please comment and include:
- What you would like to see.
- Why it matters for your workflow.
- Any context that helps us prioritize, such as model family, scale, hardware, precision, or deployment target.
We will prioritize based on community feedback, engineering feasibility, and release validation risk.