# [26.06] AutoModel Roadmap

> This post tracks the planned work items for NeMo AutoModel in the 26.06 cycle. Plans may shift as we validate performance, hardware coverage, and community priorities. If you have feature requests, use cases, or model families you want us to prioritize, please comment below.

---

## Developer and User Experience

* **Installation and Container UX**
Make CUDA installs and containers more predictable, including deterministic CUDA stack options for `pip`, clearer optional dependency groups, validation on common base containers, and first-class launcher and model-specific dependencies in the NVIDIA container. This includes AArch64 container fixes such as `qwen_vl_utils`, tokenizer dependencies such as `sentencepiece` and `tiktoken`, and validation on non-AutoModel base containers.

* **Configuration Quality**
Improve recipe authoring with typed configuration work, annotated canonical configs, and YAML linting so examples are easier to read, validate, and maintain.

* **Include NeMo Run in Containers**
Include NeMo Run in the released container so users can use the same CLI for local and remote training.
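
As one illustration of the typed-configuration idea under Configuration Quality, the sketch below validates a recipe-like mapping against a dataclass schema. The `OptimizerConfig` and `RecipeConfig` classes, their field names, and the `build` helper are hypothetical and are not AutoModel's actual configuration schema; the plain dict stands in for a parsed YAML file.

```python
from dataclasses import dataclass, fields, is_dataclass


@dataclass(frozen=True)
class OptimizerConfig:
    # Hypothetical fields, chosen only for illustration.
    lr: float
    weight_decay: float = 0.0


@dataclass(frozen=True)
class RecipeConfig:
    # Hypothetical fields, chosen only for illustration.
    model_id: str
    max_steps: int
    optimizer: OptimizerConfig


def build(cls, raw: dict):
    """Construct `cls` from `raw`, rejecting unknown keys and mistyped values."""
    known = {f.name: f.type for f in fields(cls)}
    unknown = sorted(set(raw) - set(known))
    if unknown:
        raise ValueError(f"{cls.__name__}: unknown keys {unknown}")
    kwargs = {}
    for name, value in raw.items():
        expected = known[name]
        if is_dataclass(expected):
            # Recurse into nested config sections.
            value = build(expected, value)
        elif expected is float and isinstance(value, int) and not isinstance(value, bool):
            # YAML often parses `1` as an int where a float is intended.
            value = float(value)
        elif not isinstance(value, expected):
            raise TypeError(
                f"{cls.__name__}.{name}: expected {expected.__name__}, "
                f"got {type(value).__name__}"
            )
        kwargs[name] = value
    # Missing required keys surface here as a TypeError from the dataclass.
    return cls(**kwargs)


recipe = build(
    RecipeConfig,
    {"model_id": "example/model", "max_steps": 100, "optimizer": {"lr": 1}},
)
```

A schema like this lets a linter or CI job reject a misspelled key or a quoted number before a training job is launched, which is the kind of early feedback the typed-config work is aiming for.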

---

## Model Support and Registry

* **Continuous Day-0 Model Support**
Continue rapid support for new Hugging Face model releases across LLM, VLM, omnimodal, diffusion, and retrieval workflows.

* **Model Capability Registry**
Expand the registry that maps model families to supported capabilities, recipes, tests, and docs so users can quickly see supported parallelism, precision, and functionality (e.g., packed sequences). This includes introducing a standardized test suite that verifies each model's capabilities and publishing model-specific CI results. This is a large effort that may be split across multiple cycles.

* **DeepSeek V4 Support, MTP, and Stabilization**
Add and validate DeepSeek V4 support, including MTP support and follow-up correctness work for HyperConnection, sparse attention, FSDP2 dtype handling, checkpointing, and parity-sensitive paths.

* **Gemma 4 Coverage**
Expand Gemma 4 support across MTP, context parallelism, Transformer Engine-backed MoE, TP/PP recipe quality, DGX Spark validation for 26B and 31B MoE recipes, and convergence or NaN-loss investigations.

* **Qwen Family Coverage**
Add Qwen Image fine-tuning and optimization, continue Qwen3.5 MoE and MTP support, and close Qwen3.5 VLM recipe gaps such as neat packing, checkpointing, and container dependency coverage.

* **BAGEL and Community-Requested Models**
Improve support for BAGEL pretraining and fine-tuning, and other community-requested model families as they land.
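
To make the Model Capability Registry item more concrete, here is a minimal sketch of a lookup table that maps a model family to its supported parallelism, precision, and features. The `ModelCapabilities` class, the `REGISTRY` dict, the `supports` helper, and every entry are hypothetical placeholders; they do not describe what any real model supports in AutoModel.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable


@dataclass(frozen=True)
class ModelCapabilities:
    parallelism: FrozenSet[str] = frozenset()
    precision: FrozenSet[str] = frozenset()
    features: FrozenSet[str] = frozenset()


# Placeholder entries: they do not describe what any real model supports.
REGISTRY: Dict[str, ModelCapabilities] = {
    "example-dense": ModelCapabilities(
        parallelism=frozenset({"fsdp2", "tp"}),
        precision=frozenset({"bf16"}),
        features=frozenset({"packed_sequences"}),
    ),
    "example-moe": ModelCapabilities(
        parallelism=frozenset({"fsdp2", "ep"}),
        precision=frozenset({"bf16", "fp8"}),
    ),
}


def supports(
    family: str,
    *,
    parallelism: Iterable[str] = (),
    precision: Iterable[str] = (),
    features: Iterable[str] = (),
) -> bool:
    """Return True only if the family is registered and covers every request."""
    caps = REGISTRY.get(family)
    if caps is None:
        return False
    return (
        set(parallelism) <= caps.parallelism
        and set(precision) <= caps.precision
        and set(features) <= caps.features
    )


ok = supports("example-dense", parallelism=["tp"], features=["packed_sequences"])
```

Keeping the table as plain data means the same entries could drive documentation tables and parametrized capability tests, so the published support matrix and the CI results cannot drift apart.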

---

## Core Infrastructure

* **Provide an RL Library Interface**
Create a clear interface for RL libraries to build on AutoModel’s model support and parallelism features, making it easy to integrate AutoModel-backed models into RL training workflows.
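
One possible shape for such an interface is a small structural protocol that an RL library codes against. The sketch below is an assumption for illustration only: `TrainablePolicy`, its three methods, `ToyPolicy`, and `rollout_and_update` are hypothetical names and are not an AutoModel API.

```python
import math
from typing import Dict, List, Protocol, Sequence, runtime_checkable


@runtime_checkable
class TrainablePolicy(Protocol):
    """Hypothetical surface an RL library might expect from a model backend."""

    def logprobs(self, token_ids: Sequence[int]) -> List[float]: ...

    def train_step(self, batch: Dict[str, Sequence[float]]) -> float: ...

    def export_weights(self) -> Dict[str, List[float]]: ...


class ToyPolicy:
    """Stand-in backend; a real one would wrap a sharded model."""

    def __init__(self, vocab_size: int) -> None:
        self.vocab_size = vocab_size
        self.steps = 0

    def logprobs(self, token_ids: Sequence[int]) -> List[float]:
        # Uniform distribution over the vocabulary.
        return [-math.log(self.vocab_size)] * len(token_ids)

    def train_step(self, batch: Dict[str, Sequence[float]]) -> float:
        # REINFORCE-style surrogate loss: -mean(reward * logprob).
        self.steps += 1
        pairs = list(zip(batch["rewards"], batch["logprobs"]))
        return -sum(r * lp for r, lp in pairs) / len(pairs)

    def export_weights(self) -> Dict[str, List[float]]:
        return {"bias": [0.0] * self.vocab_size}


def rollout_and_update(
    policy: TrainablePolicy, token_ids: Sequence[int], rewards: Sequence[float]
) -> float:
    """RL-side code depends only on the protocol, not on a concrete backend."""
    lp = policy.logprobs(token_ids)
    return policy.train_step({"logprobs": lp, "rewards": rewards})


policy = ToyPolicy(vocab_size=4)
loss = rollout_and_update(policy, [0, 1, 2], [1.0, 0.0, 1.0])
```

A structural protocol keeps the RL library free of imports from the training backend, so parallelism and model-specific details can change behind the interface.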


---

## Performance and Parallelism: MoE, PEFT, and More

* **MoE Backend Upgrades**
Upgrade and validate the MoE communication and expert-dispatch stack, including DeepEP v2, Blackwell paths, MXFP8, and NVFP4 coverage where feasible.

* **MoE Compile and CUDA Graph Paths**
Continue reducing MoE overhead with `torch.compile` and CUDA Graph work, with a focus on end-to-end throughput and stability for sparse models.

* **Context Parallelism and Packing**
Expand CP coverage for TE-backed Gemma 4, sequence packing plus CP, and larger hybrid or long-context recipes.

* **Backend and Optimizer Compatibility**
Keep the Transformer Engine stack current and close compatibility gaps such as DTensor support for fused optimizer paths and tensor-parallel synchronization for custom module weights.

* **Low-Memory PEFT, Expert LoRA, and Kernel Support**
Add low-memory PEFT and kernel coverage, including Transformers v5 expert LoRA behavior, fused adapter outputs for MoE layers, and memory-efficient configs for large single-GPU or constrained fine-tuning jobs.

* **AutoShard**
Explore compiler-assisted automatic model sharding to reduce manual parallelism work for new architectures.
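
For readers unfamiliar with the sequence packing mentioned under Context Parallelism and Packing, the sketch below shows the general idea with a first-fit-decreasing heuristic, plus the cumulative sequence offsets that attention kernels typically use to keep packed samples from attending to each other. This is a generic illustration, not AutoModel's packing implementation; `pack_sequences` and `cu_seqlens` are hypothetical helper names.

```python
from typing import List


def pack_sequences(lengths: List[int], max_len: int) -> List[List[int]]:
    """First-fit-decreasing: return bins of sequence indices, each summing to <= max_len."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins: List[List[int]] = []
    room: List[int] = []  # remaining capacity per bin
    for i in order:
        n = lengths[i]
        if n > max_len:
            raise ValueError(f"sequence {i} has length {n} > max_len {max_len}")
        for b, free in enumerate(room):
            if n <= free:
                bins[b].append(i)
                room[b] -= n
                break
        else:
            # No existing bin has room, so open a new one.
            bins.append([i])
            room.append(max_len - n)
    return bins


def cu_seqlens(lengths: List[int], bin_indices: List[int]) -> List[int]:
    """Cumulative offsets marking sample boundaries inside one packed row."""
    out = [0]
    for i in bin_indices:
        out.append(out[-1] + lengths[i])
    return out


lengths = [5, 3, 7, 2, 4]
bins = pack_sequences(lengths, max_len=8)
print(bins)                          # [[2], [0, 1], [4, 3]]
print(cu_seqlens(lengths, bins[1]))  # [0, 5, 8]
```

Combining packing with CP adds a further constraint that this sketch ignores: each packed row must also be split across CP ranks, which is part of what makes the combination nontrivial.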

---

## Multimodal, VLM, and Retrieval

* **Gemma 4 and Omnimodal Coverage**
Broaden Gemma 4 recipe coverage, MTP support, CP support, DGX Spark validation, and related VLM/omnimodal reliability work.

* **VLM Knowledge Distillation**
Add knowledge distillation coverage for VLMs so that multimodal teacher-student workflows are covered in the 26.06 cycle.

* **Retriever Training Coverage**
Add retrieval CI coverage and extend dataset and vision-language retrieval support for bi-encoder and cross-encoder workflows.

---

## Diffusion

* **BAGEL Pretraining, Fine-Tuning, and Scale Validation**
Add and validate BAGEL support as part of the multimodal training stack, including model onboarding, fine-tuning, scale validation, and performance characterization.

* **Flux 2 Fine-Tuning**
Add support for Flux 2 fine-tuning and keep the diffusion recipes aligned with current Diffusers model APIs.

* **Add WAN 2.2 Support**
Add first-class support for WAN 2.2 in AutoModel, including model loading, training recipes, parallelism compatibility, and validation against reference behavior.

* **Video Diffusion Scale Validation**
Use HunyuanVideo and related large video models as scale milestones for validating multi-node diffusion training performance.

* **Diffusion Performance Improvements**
Continue improving throughput and memory efficiency for image and video diffusion fine-tuning and pretraining.

---

## Checkpointing, State, and Robustness

* **Checkpointing Robustness and Speed**
Improve save/resume reliability, including async checkpoint coverage, DCP planner regressions, Qwen3.5 neat-packing checkpoint failures, and checkpoint robustness thresholds.

* **S3 and Remote Checkpoint Storage**
Add DCP-compatible S3 checkpoint support for cloud and shared-storage training workflows.

* **Lower-Memory State Dict and PEFT Adapters**
Reduce peak VRAM and CPU memory during state-dict adaptation, checkpoint export, and PEFT adapter conversion, including cases where adapter generation currently requires roughly 2x VRAM.

---

## Release Quality, CI, and Validation

* **CI Stability**
Harden release coverage across AArch64 containers, tokenizer dependencies, `trust_remote_code` cache behavior, checkpoint reloads, and known model-specific CI failures.

* **Convergence Robustness**
Continue convergence investigations and publishable loss-curve validation for representative dense, MoE, VLM, retrieval, and diffusion recipes. This includes targeted investigations such as Llama 3.2 pretraining instability and Gemma 4 TP/PP NaN-loss regressions.

* **Recipe Consolidation and Benchmarks**
Reduce duplicated benchmark and fine-tuning recipe trees, align configs across AutoModel training and benchmark jobs, and publish clearer benchmark numbers.
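
As a rough illustration of the kind of automated check that could back the Convergence Robustness item, the sketch below scans a logged loss curve for non-finite values and sudden spikes. The `find_loss_anomalies` helper, the window size, and the spike factor are assumptions made for this example, not values or tooling used by AutoModel, and the spike test assumes positive losses.

```python
import math
from typing import List, Sequence, Tuple


def find_loss_anomalies(
    losses: Sequence[float], window: int = 3, spike_factor: float = 2.0
) -> List[Tuple[int, str]]:
    """Flag steps whose loss is NaN/inf or exceeds spike_factor x the recent mean."""
    anomalies: List[Tuple[int, str]] = []
    recent: List[float] = []  # last `window` healthy losses
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            anomalies.append((step, "non-finite"))
            continue
        if len(recent) == window:
            mean = sum(recent) / window
            if loss > spike_factor * mean:
                anomalies.append((step, "spike"))
                continue  # keep the spike out of the running baseline
        recent.append(loss)
        recent = recent[-window:]
    return anomalies


curve = [2.0, 1.9, 1.8, 1.7, 5.0, 1.6, float("nan")]
print(find_loss_anomalies(curve))  # [(4, 'spike'), (6, 'non-finite')]
```

A check like this only flags where to look; deciding whether a spike reflects a real regression still requires comparing against a reference run.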

---

## Tracker-Backed 26.06 Items

The following tracker-backed items were used to align the public roadmap with the 26.06 Linear plan.

| Area | Planned item | Tracker reference |
| --- | --- | --- |
| Model support | DeepSeek V4 support | [#2034](https://github.com/NVIDIA-NeMo/Automodel/issues/2034), [#2143](https://github.com/NVIDIA-NeMo/Automodel/issues/2143), [#2088](https://github.com/NVIDIA-NeMo/Automodel/issues/2088), [#2086](https://github.com/NVIDIA-NeMo/Automodel/issues/2086), [#2154](https://github.com/NVIDIA-NeMo/Automodel/issues/2154), [#2170](https://github.com/NVIDIA-NeMo/Automodel/issues/2170) |
| Model support | DeepSeek V4 MTP support | [#2191](https://github.com/NVIDIA-NeMo/Automodel/pull/2191) |
| Model support | BAGEL pretraining and fine-tuning | [#1015](https://github.com/NVIDIA-NeMo/Automodel/issues/1015), [#2275](https://github.com/NVIDIA-NeMo/Automodel/pull/2275) |
| Model support | Qwen Image fine-tuning and optimization | [#1700](https://github.com/NVIDIA-NeMo/Automodel/issues/1700) |
| Diffusion | Flux 2 fine-tuning | [#418](https://github.com/NVIDIA-NeMo/Automodel/issues/418), [#2145](https://github.com/NVIDIA-NeMo/Automodel/pull/2145) |
| Training/performance | Low-memory PEFT and kernel support | [#2166](https://github.com/NVIDIA-NeMo/Automodel/issues/2166) |
| Training/performance | DeepEP v2 upgrade | [#2021](https://github.com/NVIDIA-NeMo/Automodel/issues/2021) |
| Training/performance | AutoShard | [#909](https://github.com/NVIDIA-NeMo/Automodel/issues/909) |
| Training/performance | MoE and `torch.compile` | [#1438](https://github.com/NVIDIA-NeMo/Automodel/issues/1438) |
| Training/performance | MoE on Blackwell | [#1187](https://github.com/NVIDIA-NeMo/Automodel/issues/1187), [#2115](https://github.com/NVIDIA-NeMo/Automodel/issues/2115) |
| Multimodal/retrieval | VLM knowledge distillation | [#2195](https://github.com/NVIDIA-NeMo/Automodel/issues/2195) |
| Multimodal/retrieval | Retriever VL and dataset support | [#1342](https://github.com/NVIDIA-NeMo/Automodel/pull/1342), [#1407](https://github.com/NVIDIA-NeMo/Automodel/pull/1407) |
| Multimodal/retrieval | Retriever CI tests | [#1449](https://github.com/NVIDIA-NeMo/Automodel/pull/1449) |
| Infrastructure/docs | Separate model zoo from infrastructure | [#2163](https://github.com/NVIDIA-NeMo/Automodel/issues/2163) |
| Infrastructure/docs | Component refactor | [#2163](https://github.com/NVIDIA-NeMo/Automodel/issues/2163), [#2266](https://github.com/NVIDIA-NeMo/Automodel/pull/2266) |
| Infrastructure/docs | Model capability registry | [#1438](https://github.com/NVIDIA-NeMo/Automodel/issues/1438) |
| Infrastructure/docs | Fern upgrade | [#2196](https://github.com/NVIDIA-NeMo/Automodel/pull/2196), [#2291](https://github.com/NVIDIA-NeMo/Automodel/pull/2291) |

---

## We Want Your Input

Have a feature request or use case that is not covered above? Please comment and include:

1. What you would like to see.
2. Why it matters for your workflow.
3. Any context that helps us prioritize, such as model family, scale, hardware, precision, or deployment target.

We will prioritize based on community feedback, engineering feasibility, and release validation risk.
