[26.06] AutoModel Roadmap #2340

@akoumpa

This post tracks the planned work items for NeMo AutoModel in the 26.06 cycle. Plans may shift as we validate performance, hardware coverage, and community priorities. If you have feature requests, use cases, or model families you want us to prioritize, please comment below.


Developer and User Experience

  • Installation and Container UX
    Make CUDA installs and containers more predictable, including deterministic CUDA stack options for pip, clearer optional dependency groups, validation on common base containers, and first-class launcher and model-specific dependencies in the NVIDIA container. This includes AArch64 container fixes such as qwen_vl_utils, tokenizer dependencies such as sentencepiece and tiktoken, and validation on non-AutoModel base containers.

  • Configuration Quality
    Improve recipe authoring with typed configuration work, annotated canonical configs, and YAML linting so examples are easier to read, validate, and maintain.

  • Include NeMo Run in Containers
    Include NeMo Run in the released container so users can use the same CLI for local and remote training.
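
As a rough illustration of what typed configuration could buy recipe authors, the sketch below validates a plain dict (such as one parsed from a YAML recipe section) against a dataclass schema. The class name, field names, and the load_typed helper are hypothetical and are not AutoModel's actual config API; the check only handles simple, non-generic field types.

```python
from dataclasses import dataclass, fields


@dataclass
class ParallelismConfig:
    # Hypothetical fields for illustration; not AutoModel's real config keys.
    tp_size: int = 1
    pp_size: int = 1
    cp_size: int = 1


def load_typed(cls, raw):
    """Build `cls` from a plain dict, rejecting unknown keys and wrong types."""
    expected = {f.name: f.type for f in fields(cls)}
    unknown = sorted(set(raw) - set(expected))
    if unknown:
        raise ValueError(f"unknown config keys: {unknown}")
    for key, value in raw.items():
        if not isinstance(value, expected[key]):
            raise TypeError(
                f"{key}: expected {expected[key].__name__}, got {type(value).__name__}"
            )
    return cls(**raw)


# A dict such as one parsed from a YAML recipe section.
cfg = load_typed(ParallelismConfig, {"tp_size": 2, "cp_size": 4})
```

With a schema like this, a misspelled key or a quoted number in a recipe fails at load time with a readable message rather than surfacing later during training.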


Model Support and Registry

  • Continuous Day-0 Model Support
    Continue rapid support for new Hugging Face model releases across LLM, VLM, omnimodal, diffusion, and retrieval workflows.

  • Model Capability Registry
    Expand the registry that maps model families to supported capabilities, recipes, tests, and docs so users can quickly see supported parallelism, precision, and functionality (e.g., packed sequences). This includes introducing a standardized test suite to verify each individual model's capabilities, and publishing model-specific CI results. This is a large effort that may be split across multiple cycles.

  • DeepSeek V4 Support, MTP, and Stabilization
    Add and validate DeepSeek V4 support, including MTP support and follow-up correctness work for HyperConnection, sparse attention, FSDP2 dtype handling, checkpointing, and parity-sensitive paths.

  • Gemma 4 Coverage
    Expand Gemma 4 support across MTP, context parallelism, Transformer Engine-backed MoE, TP/PP recipe quality, DGX Spark validation for 26B and 31B MoE recipes, and convergence or NaN-loss investigations.

  • Qwen Family Coverage
    Add Qwen Image fine-tuning and optimization, continue Qwen3.5 MoE and MTP support, and close Qwen3.5 VLM recipe gaps such as neat packing, checkpointing, and container dependency coverage.

  • BAGEL and Community-Requested Models
    Improve support for BAGEL pretraining and fine-tuning, and for other community-requested model families as they land.
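
To make the Model Capability Registry item more concrete, here is a minimal sketch of a registry that maps model families to capabilities and can be queried by feature. Everything in it is an assumption for illustration: the Capabilities fields, the families_supporting helper, and the two placeholder entries do not reflect the real registry schema or actual support status for any model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capabilities:
    # Hypothetical capability fields; the real registry schema may differ.
    parallelism: frozenset = frozenset()
    precision: frozenset = frozenset()
    packed_sequences: bool = False


# Placeholder entries only; these do not describe real model support.
REGISTRY = {
    "example-dense": Capabilities(
        parallelism=frozenset({"fsdp2", "tp"}),
        precision=frozenset({"bf16"}),
        packed_sequences=True,
    ),
    "example-moe": Capabilities(
        parallelism=frozenset({"fsdp2", "ep"}),
        precision=frozenset({"bf16", "fp8"}),
        packed_sequences=False,
    ),
}


def families_supporting(parallelism=None, precision=None, packed_sequences=None):
    """Return sorted family names matching every filter that is not None."""
    matches = []
    for name, caps in REGISTRY.items():
        if parallelism is not None and parallelism not in caps.parallelism:
            continue
        if precision is not None and precision not in caps.precision:
            continue
        if packed_sequences is not None and caps.packed_sequences != packed_sequences:
            continue
        matches.append(name)
    return sorted(matches)


tp_families = families_supporting(parallelism="tp")
bf16_families = families_supporting(precision="bf16")
```

A structure along these lines could also drive the standardized per-model test suite, since each declared capability is a natural test case to verify in CI.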


Core Infrastructure

  • Provide an RL Library Interface
    Create a clear interface for RL libraries to build on AutoModel’s model support and parallelism features, making it easy to integrate AutoModel-backed models into RL training workflows.

Performance and Parallelism: MoE, PEFT, and More

  • MoE Backend Upgrades
    Upgrade and validate the MoE communication and expert-dispatch stack, including DeepEP v2, Blackwell paths, MXFP8, and NVFP4 coverage where feasible.

  • MoE Compile and CUDA Graph Paths
    Continue reducing MoE overhead with torch.compile and CUDA Graph work, with a focus on end-to-end throughput and stability for sparse models.

  • Context Parallelism and Packing
    Expand CP coverage for TE-backed Gemma 4, sequence packing plus CP, and larger hybrid or long-context recipes.

  • Backend and Optimizer Compatibility
    Keep the Transformer Engine stack current and close compatibility gaps such as DTensor support for fused optimizer paths and tensor-parallel synchronization for custom module weights.

  • Low-Memory PEFT, Expert LoRA, and Kernel Support
    Add low-memory PEFT and kernel coverage, including Transformers v5 expert LoRA behavior, fused adapter outputs for MoE layers, and memory-efficient configs for large single-GPU or constrained fine-tuning jobs.

  • AutoShard
    Explore compiler-assisted automatic model sharding to reduce manual parallelism work for new architectures.


Multimodal, VLM, and Retrieval

  • Gemma 4 and Omnimodal Coverage
    Broaden Gemma 4 recipe coverage, MTP support, CP support, DGX Spark validation, and related VLM/omnimodal reliability work.

  • VLM Knowledge Distillation
    Add VLM knowledge distillation coverage for multimodal teacher-student workflows.

  • Retriever Training Coverage
    Add retrieval CI coverage and extend dataset and vision-language retrieval support for bi-encoder and cross-encoder workflows.


Diffusion

  • BAGEL Pretraining, Fine-Tuning, and Scale Validation
    Add and validate BAGEL support as part of the multimodal training stack, including model onboarding, fine-tuning, scale validation, and performance characterization.

  • Flux 2 Fine-Tuning
    Add support for Flux 2 fine-tuning and keep the diffusion recipes aligned with current Diffusers model APIs.

  • Add WAN 2.2 Support
    Add first-class support for WAN 2.2 in AutoModel, including model loading, training recipes, parallelism compatibility, and validation against reference behavior.

  • Video Diffusion Scale Validation
    Use HunyuanVideo and related large video models as scale milestones for validating multi-node diffusion training performance.

  • Diffusion Performance Improvements
    Continue improving throughput and memory efficiency for image and video diffusion fine-tuning and pretraining.


Checkpointing, State, and Robustness

  • Checkpointing Robustness and Speed
    Improve save/resume reliability, including async checkpoint coverage, DCP planner regressions, Qwen3.5 neat-packing checkpoint failures, and checkpoint robustness thresholds.

  • S3 and Remote Checkpoint Storage
    Add DCP-compatible S3 checkpoint support for cloud and shared-storage training workflows.

  • Lower-Memory State Dict and PEFT Adapters
    Reduce peak VRAM and CPU memory during state-dict adaptation, checkpoint export, and PEFT adapter conversion, including cases where adapter generation currently requires roughly 2x VRAM.


Release Quality, CI, and Validation

  • CI Stability
    Harden release coverage across AArch64 containers, tokenizer dependencies, trust_remote_code cache behavior, checkpoint reloads, and known model-specific CI failures.

  • Convergence Robustness
    Continue convergence investigations and publishable loss-curve validation for representative dense, MoE, VLM, retrieval, and diffusion recipes. This includes targeted investigations such as Llama 3.2 pretraining instability and Gemma 4 TP/PP NaN-loss regressions.

  • Recipe Consolidation and Benchmarks
    Reduce duplicated benchmark and fine-tuning recipe trees, align configs across AutoModel training and benchmark jobs, and publish clearer benchmark numbers.


Tracker-Backed 26.06 Items

The following tracker-backed items were used to align the public roadmap with the 26.06 Linear plan.

Area                 | Planned item                            | Tracker reference
---------------------|-----------------------------------------|-----------------------------------------
Model support        | DeepSeek V4 support                     | #2034, #2143, #2088, #2086, #2154, #2170
Model support        | DeepSeek V4 MTP support                 | #2191
Model support        | BAGEL pretraining and fine-tuning       | #1015, #2275
Model support        | Qwen Image fine-tuning and optimization | #1700
Diffusion            | Flux 2 fine-tuning                      | #418, #2145
Training/performance | Low-memory PEFT and kernel support      | #2166
Training/performance | DeepEP v2 upgrade                       | #2021
Training/performance | AutoShard                               | #909
Training/performance | MoE and torch.compile                   | #1438
Training/performance | MoE on Blackwell                        | #1187, #2115
Multimodal/retrieval | VLM knowledge distillation              | #2195
Multimodal/retrieval | Retriever VL and dataset support        | #1342, #1407
Multimodal/retrieval | Retriever CI tests                      | #1449
Infrastructure/docs  | Separate model zoo from infrastructure  | #2163
Infrastructure/docs  | Component refactor                      | #2163, #2266
Infrastructure/docs  | Model capability registry               | #1438
Infrastructure/docs  | Fern upgrade                            | #2196, #2291

We Want Your Input

Have a feature request or use case that is not covered above? Please comment and include:

  1. What you would like to see.
  2. Why it matters for your workflow.
  3. Any context that helps us prioritize, such as model family, scale, hardware, precision, or deployment target.

We will prioritize based on community feedback, engineering feasibility, and release validation risk.
