feat(speculative): add Qwen3-MoE target support to EAGLE-1/2/3 by khazic · Pull Request #2317 · NVIDIA-NeMo/Automodel

khazic · 2026-05-26T06:54:18Z

What does this PR do?

Add Qwen3MoeForCausalLM (Qwen3-MoE family) as a supported target for
EAGLE-1 / EAGLE-2 / EAGLE-3 training, alongside the existing dense
Llama / Phi-3 paths. Also wires the EAGLE recipes to AutoModel's existing
FSDP2 sharding so 30B-class MoE targets actually fit on standard 80GB
multi-GPU nodes -- the previous code replicated the full target per rank
and OOMed on anything bigger than ~13B.

Changelog

Qwen3-MoE target support:

- components/speculative/eagle/registry.py: append
  Qwen3MoeForCausalLM to _DENSE_ARCHITECTURES. The dense draft
  works as-is for MoE backbones because the EAGLE draft only consumes
  post-block hidden states emitted by register_forward_hook on each
  decoder layer -- per-expert routing internals are not part of the
  draft's input.
- recipes/llm/train_eagle{1,3}.py: docstrings extended from
  "Llama, Phi-3" to mention Qwen3-MoE.
- Example YAMLs: examples/speculative/eagle{1,2,3}/qwen3_moe_*_perfectblend.yaml.
- tests/unit_tests/speculative/test_eagle_registry.py: add four
  tests covering EAGLE-1 / EAGLE-3 registry containment and dispatch
  resolution for Qwen3MoeForCausalLM -> LlamaEagle*DraftModel.
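
The registry change amounts to one more name in a membership check. A minimal sketch of that dispatch, under stated assumptions: only `_DENSE_ARCHITECTURES`, `Qwen3MoeForCausalLM` and the `LlamaEagle*DraftModel` class names come from this PR. The function `resolve_draft_class_name`, the version-to-name mapping, and the exact contents of the tuple are hypothetical stand-ins for `resolve_eagle{1,3}_draft_spec`, whose real signature is not shown here.

```python
# Hypothetical, simplified registry. The real one lives in
# components/speculative/eagle/registry.py and may be structured differently.
_DENSE_ARCHITECTURES = (
    "LlamaForCausalLM",
    "Phi3ForCausalLM",
    "Qwen3MoeForCausalLM",  # the entry this PR adds
)

# Illustrative mapping from EAGLE version to the dense draft class name.
_DRAFT_CLASS_BY_VERSION = {
    1: "LlamaEagleDraftModel",   # EAGLE-1/2
    3: "LlamaEagle3DraftModel",  # EAGLE-3
}


def resolve_draft_class_name(architecture: str, eagle_version: int) -> str:
    """Return the dense draft class name for a supported target architecture."""
    if architecture not in _DENSE_ARCHITECTURES:
        # Without the registry entry, dispatch fails here.
        raise ValueError(f"Unsupported EAGLE target architecture: {architecture!r}")
    return _DRAFT_CLASS_BY_VERSION[eagle_version]
```

With the new entry in place, `resolve_draft_class_name("Qwen3MoeForCausalLM", 3)` resolves to the existing dense draft instead of raising.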

FSDP2 target sharding for the EAGLE recipes:

- recipes/llm/train_eagle{1,3}.py: when a distributed: section
  is present in the YAML, call setup_distributed and thread
  device_mesh / moe_mesh / distributed_config / moe_config /
  activation_checkpointing into NeMoAutoModelForCausalLM.from_pretrained.
  The infrastructure layer then shards the target with FSDP2 (and EP
  when ep_size > 1); the brute-force .to(self.device) is skipped
  in the sharded path because the infra has already placed the
  parameters.
- The draft model keeps its existing DistributedDataParallel wrap.
  FSDP2 on the target and DDP on the draft coexist on the same world,
  each using its own communication scheme.
- When distributed: is absent the recipes fall back to the original
  single-GPU-per-rank behavior, so existing Llama / Phi-3 / Qwen3-dense
  recipes are unaffected.
- New Qwen3-MoE example YAMLs ship with strategy: fsdp2 and
  activation_checkpointing: true to leave headroom for the
  TTT-unrolled draft forwards. ep_size can be raised in YAML for
  larger MoEs; the recipe forwards moe_config already.
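
The opt-in branching described above can be sketched as a small pure function. This is an illustration under stated assumptions, not the recipe's actual code: `build_target_load_plan` is a hypothetical helper, the config is modelled as a plain dict, and `mesh_ctx` is a dict standing in for whatever `setup_distributed` returns. Only the keyword names forwarded to `from_pretrained` are taken from the PR description.

```python
from typing import Any, Optional


def build_target_load_plan(cfg: dict, mesh_ctx: Optional[dict] = None) -> tuple[dict, bool]:
    """Return (from_pretrained kwargs, whether to call .to(device) afterwards).

    Hypothetical helper: the real recipes inline this logic.
    """
    kwargs: dict[str, Any] = {"trust_remote_code": cfg.get("trust_remote_code", False)}

    if "distributed" not in cfg:
        # Legacy path: every rank loads the full target and moves it to its GPU.
        return kwargs, True

    dist_cfg = cfg["distributed"]
    mesh_ctx = mesh_ctx or {}
    kwargs.update(
        device_mesh=mesh_ctx.get("device_mesh"),
        moe_mesh=mesh_ctx.get("moe_mesh"),
        distributed_config=dist_cfg,
        moe_config=cfg.get("moe_config"),
        activation_checkpointing=dist_cfg.get("activation_checkpointing", False),
    )
    # Sharded path: the infrastructure places the parameters, so skip .to(device).
    return kwargs, False
```

For example, a config containing `{"distributed": {"strategy": "fsdp2", "activation_checkpointing": True}}` yields kwargs that carry the mesh handles and a `False` flag for the device move, while an empty config yields the legacy behavior.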

DTensor copy fix:

- components/speculative/eagle/draft_llama.py and
  draft_llama_v12.py: in copy_embeddings_from_target, gather a
  DTensor target weight via .full_tensor() before copying into the
  unsharded draft parameter. Without this, FSDP2-wrapped targets raise
  RuntimeError: aten.copy_.default got mixed torch.Tensor and DTensor.
  The hasattr guard makes the gather a no-op on the existing unsharded path.
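
The guard pattern itself is small. The sketch below shows it with duck typing so it runs without PyTorch or a process group: `materialize_full` is a hypothetical helper name, and `FakeShardedWeight` is a stand-in that only mimics the `full_tensor()` method that `DTensor` provides. In the real draft code the result would be passed to the draft parameter's `copy_`.

```python
def materialize_full(weight):
    """Gather a sharded weight if it exposes full_tensor(); otherwise return it unchanged."""
    if hasattr(weight, "full_tensor"):
        # Sharded (DTensor-like) path: collect the complete weight.
        return weight.full_tensor()
    # Unsharded path: nothing to gather.
    return weight


class FakeShardedWeight:
    """Stand-in for a DTensor: holds one local shard, returns all data on full_tensor()."""

    def __init__(self, full_data, local_shard):
        self._full_data = full_data
        self.local_shard = local_shard

    def full_tensor(self):
        return list(self._full_data)


plain = [1.0, 2.0, 3.0, 4.0]
sharded = FakeShardedWeight(full_data=plain, local_shard=plain[:2])

full_from_plain = materialize_full(plain)      # same object, untouched
full_from_sharded = materialize_full(sharded)  # complete weight, not just the shard
```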

Verification

End-to-end smoke test on 8 x A800 80GB, Qwen3-30B-A3B-Instruct-2507
(non-thinking variant) as target, EAGLE-3 recipe, FSDP2 strategy
(dp_size=8, activation_checkpointing=true):

- Target weights sharded across ranks (~16 GB per rank instead of the full
  60 GB), no OOM.
- Draft training enters the loop and converges over a 250-step sanity
  run: training loss falls from 9.07 to 4.17 with small step-to-step
  fluctuations, draft accuracy rises from near-zero to ~0.31, and LR
  follows the cosine schedule after warmup.
- Each 10-step window completes in ~6 s end-to-end.

Training-loop log excerpt (step 10 -> 250, every 10 steps):

epoch=0 step=10  train_loss=9.065207 train_acc=0.006130 lr=3.548e-05
epoch=0 step=20  train_loss=7.670224 train_acc=0.062447 lr=6.774e-05
epoch=0 step=30  train_loss=6.768910 train_acc=0.068292 lr=1.000e-04
epoch=0 step=40  train_loss=6.126582 train_acc=0.109515 lr=9.995e-05
epoch=0 step=50  train_loss=5.978342 train_acc=0.113623 lr=9.977e-05
epoch=0 step=60  train_loss=5.815319 train_acc=0.135431 lr=9.947e-05
epoch=0 step=70  train_loss=5.840969 train_acc=0.145151 lr=9.905e-05
epoch=0 step=80  train_loss=5.370955 train_acc=0.180499 lr=9.850e-05
epoch=0 step=90  train_loss=5.316124 train_acc=0.202069 lr=9.783e-05
epoch=0 step=100 train_loss=5.357407 train_acc=0.205463 lr=9.704e-05
epoch=0 step=110 train_loss=5.378894 train_acc=0.200543 lr=9.613e-05
epoch=0 step=120 train_loss=4.823973 train_acc=0.251581 lr=9.511e-05
epoch=0 step=130 train_loss=4.989300 train_acc=0.237234 lr=9.397e-05
epoch=0 step=140 train_loss=4.670279 train_acc=0.273315 lr=9.273e-05
epoch=0 step=150 train_loss=4.925109 train_acc=0.246173 lr=9.138e-05
epoch=0 step=160 train_loss=4.756421 train_acc=0.264386 lr=8.993e-05
epoch=0 step=170 train_loss=4.565536 train_acc=0.270532 lr=8.838e-05
epoch=0 step=180 train_loss=4.455242 train_acc=0.286340 lr=8.674e-05
epoch=0 step=190 train_loss=4.445095 train_acc=0.273486 lr=8.500e-05
epoch=0 step=200 train_loss=4.605785 train_acc=0.260287 lr=8.319e-05
epoch=0 step=210 train_loss=4.307686 train_acc=0.307715 lr=8.130e-05
epoch=0 step=220 train_loss=4.183267 train_acc=0.310303 lr=7.933e-05
epoch=0 step=230 train_loss=4.380021 train_acc=0.307971 lr=7.729e-05
epoch=0 step=240 train_loss=4.345554 train_acc=0.283203 lr=7.520e-05
epoch=0 step=250 train_loss=4.174419 train_acc=0.306598 lr=7.304e-05
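
The excerpt is regular enough to parse mechanically, which makes it easy to check the overall trend and to spot windows where the loss ticks up. A minimal sketch, assuming the `key=value` layout shown above; the helper names `parse_log` and `loss_increases` are illustrative, and the sample holds four lines copied from the excerpt.

```python
import re

# One record per line: epoch, step, train_loss, train_acc, lr.
LOG_LINE = re.compile(
    r"epoch=(?P<epoch>\d+)\s+step=(?P<step>\d+)\s+"
    r"train_loss=(?P<loss>[\d.]+)\s+train_acc=(?P<acc>[\d.]+)\s+lr=(?P<lr>[\d.eE+-]+)"
)


def parse_log(text: str) -> list[dict]:
    """Extract step, loss, accuracy and learning rate from each log line."""
    return [
        {
            "step": int(m["step"]),
            "loss": float(m["loss"]),
            "acc": float(m["acc"]),
            "lr": float(m["lr"]),
        }
        for m in LOG_LINE.finditer(text)
    ]


def loss_increases(rows: list[dict]) -> list[tuple[int, int]]:
    """Return (prev_step, step) pairs where the loss rose between consecutive records."""
    return [(a["step"], b["step"]) for a, b in zip(rows, rows[1:]) if b["loss"] > a["loss"]]


sample = """\
epoch=0 step=10  train_loss=9.065207 train_acc=0.006130 lr=3.548e-05
epoch=0 step=60  train_loss=5.815319 train_acc=0.135431 lr=9.947e-05
epoch=0 step=70  train_loss=5.840969 train_acc=0.145151 lr=9.905e-05
epoch=0 step=250 train_loss=4.174419 train_acc=0.306598 lr=7.304e-05
"""
rows = parse_log(sample)
```

On this sample, `loss_increases(rows)` flags the step 60 to 70 window, while the first and last records show the overall drop in loss.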

The verification path exercises every change in this PR: the registry
entry (otherwise dispatch raises), the FSDP2 sharding (otherwise OOM at
load), the DTensor copy fix (otherwise crash at
copy_embeddings_from_target before the first step).

Before your PR is "Ready for review"

Pre checks:

Contributor guidelines followed
Unit tests added (test_eagle_registry.py -- 4 new tests)
Example configs added under examples/speculative/eagle{1,2,3}/
DCO sign-off on every commit

HF Qwen3MoeForCausalLM exposes the same post-block hidden states as a dense causal LM via register_forward_hook -- only the per-expert routing internals differ, and the dense LlamaEagle* drafts never look at those. Register the architecture in _DENSE_ARCHITECTURES so resolve_eagle{1,3}_draft_spec returns the existing dense draft for it; no changes to HFEagle3TargetModel or to any draft module are required.

Adds:
- "Qwen3MoeForCausalLM" entry in components/speculative/eagle/registry.py with an updated comment confirming MoE backbones are validated, not just anticipated.
- examples/speculative/eagle{1,2,3}/qwen3_moe_*_perfectblend.yaml mirroring the dense Qwen3 recipes but pointing at Qwen/Qwen3-30B-A3B.
- Four registry unit tests covering EAGLE-1 and EAGLE-3 dispatch for Qwen3MoeForCausalLM.

Recipe docstrings extended from "Llama, Phi-3" to mention Qwen3-MoE.

Signed-off-by: khazic <khazzz1c@gmail.com>

The EAGLE recipes loaded the target with bare ``from_pretrained(...).to(self.device)``: every rank replicated the full target. Fine for 8B-class dense (~16GB), OOM for Qwen3-30B-A3B (~60GB bf16) on 8x80GB.

Make the ``distributed:`` YAML section opt-in for the target. When present, ``setup_distributed`` builds a ``MeshContext`` and the recipe forwards ``device_mesh`` / ``moe_mesh`` / ``distributed_config`` / ``moe_config`` / ``activation_checkpointing`` into ``NeMoAutoModelForCausalLM.from_pretrained``. The infrastructure layer then shards the target with FSDP2 (and EP when ``ep_size > 1``) and places it on the correct devices. The brute-force ``.to(self.device)`` is skipped in the sharded path because the infra has already done it.

The draft model keeps its existing ``DistributedDataParallel`` wrap; FSDP2 on the target and DDP on the draft coexist on the same world.

Recipes touched: ``train_eagle3.py`` (TTT path) and ``train_eagle1.py`` (EAGLE-1/2 alias path). When ``distributed:`` is absent the recipes fall back to the original single-GPU-per-rank behavior, so existing Llama / Phi-3 / Qwen3-dense recipes are unaffected.

Qwen3-MoE example YAMLs (EAGLE-1/2/3) ship with the minimal sharding config: ``strategy: fsdp2``, default ``dp_size = world_size``, and ``activation_checkpointing: true`` to leave headroom for the TTT-unrolled draft forwards. EP can be enabled later by setting ``ep_size``; the recipe forwards ``moe_config`` already.

Signed-off-by: khazic <khazzz1c@gmail.com>

After enabling FSDP2 sharding on the EAGLE target, the target's ``embed_tokens.weight`` is a ``DTensor`` sharded across ranks. The draft model is not FSDP-wrapped, so its ``embed_tokens.weight`` is a plain ``nn.Parameter``. Calling ``draft_weight.copy_(target_weight)`` then trips:

    RuntimeError: aten.copy_.default got mixed torch.Tensor and DTensor

Gather the target weight to a full local tensor via ``full_tensor()`` before the copy. ``hasattr`` guard keeps the unsharded path (no ``distributed:`` section in YAML) untouched.

Fixes the regression introduced by the FSDP2 target-sharding patch in the EAGLE-1/2/3 recipes; both ``LlamaEagle3DraftModel`` and ``LlamaEagleDraftModel`` (EAGLE-1/2) needed the same fix.

Signed-off-by: khazic <khazzz1c@gmail.com>

copy-pr-bot · 2026-05-26T06:54:22Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

HuiyingLi · 2026-05-26T07:00:27Z

/ok to test f95b06a

HuiyingLi · 2026-05-26T09:16:37Z

+        target_kwargs = dict(
            trust_remote_code=recipe_cfg.get("trust_remote_code", False),
            torch_dtype=self.compute_dtype,
            force_hf=True,


If force_hf=True, the custom model impl path won't be triggered.

HuiyingLi · 2026-05-26T20:51:45Z

Thank you @khazic. EP8 is validated at commit d68a275. Would you mind resolving the conflicts and changing the recipes to ep8? Thank you so much!

Resolved conflicts in the EAGLE registry and recipe docstrings between the Qwen3 dense target landing on upstream (PR NVIDIA-NeMo#2313) and the Qwen3-MoE target on this branch. Both architectures are now registered, and the recipe / class docstrings list Llama, Phi-3, Qwen3 and Qwen3-MoE.

Conflicting files:
- nemo_automodel/components/speculative/eagle/registry.py
  Kept both Qwen3ForCausalLM (upstream) and Qwen3MoeForCausalLM (ours) in _DENSE_ARCHITECTURES.
- nemo_automodel/recipes/llm/train_eagle1.py
- nemo_automodel/recipes/llm/train_eagle3.py
  Merged module docstring and class docstring to cover both dense Qwen3 and Qwen3-MoE targets.

Signed-off-by: khazic <khazzz1c@gmail.com>

HuiyingLi · 2026-05-27T02:02:57Z

/ok to test 7fbb312

Bump ``distributed.ep_size`` from 1 to 8 in the three Qwen3-MoE EAGLE perfectblend example configs (eagle1 / eagle2 / eagle3). EP8 is the prevailing standard for Qwen3-30B-A3B across the rest of examples/ (qwen3_moe_30b_*.yaml under examples/llm_finetune/qwen) -- it splits the 128 experts evenly across an 8xH100 / 8xA100 node (16 experts per rank) and leaves the remaining non-expert weights to be FSDP-sharded across the ``world_size // ep_size`` DP ranks.

Also updates the comments above the ``distributed`` block to describe the EP8 layout instead of the old ``dp_size = world_size`` claim, which no longer holds once expert parallelism is enabled.

Reviewer confirmed the EP8 configuration is validated end-to-end at d68a275.

Signed-off-by: khazic <khazzz1c@gmail.com>
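
The EP8 layout arithmetic in this commit message can be written out as a short check. This is a sketch under stated assumptions: `ep_layout` is a hypothetical helper, and the divisibility checks are included for illustration only. The numbers (8 ranks, 128 experts, 16 experts per rank, ``world_size // ep_size`` DP ranks) come from the message itself.

```python
def ep_layout(world_size: int, ep_size: int, num_experts: int) -> dict:
    """Compute experts per rank and the number of DP ranks left for non-expert weights."""
    # Assumption for this sketch: both splits must be even.
    if world_size % ep_size != 0:
        raise ValueError("world_size must be divisible by ep_size")
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    return {
        "experts_per_rank": num_experts // ep_size,
        "dp_ranks": world_size // ep_size,
    }


# Single 8-GPU node with ep_size=8 and 128 experts, as described in the commit.
layout = ep_layout(world_size=8, ep_size=8, num_experts=128)
```

Here `layout["experts_per_rank"]` is 16, matching the commit message, and `layout["dp_ranks"]` is 1, which is why the earlier ``dp_size = world_size`` comment no longer applies.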

HuiyingLi · 2026-05-27T02:12:59Z

/ok to test 1170d55

HuiyingLi

Thank you!

khazic added 3 commits May 26, 2026 13:32

khazic requested review from HuiyingLi, ZhiyuLi-Nvidia, adil-a, akoumpa, athitten, hemildesai, pthombre and zyzhou5 as code owners May 26, 2026 06:54

github-actions Bot added the community-request label May 26, 2026

HuiyingLi reviewed May 26, 2026

HuiyingLi approved these changes May 27, 2026

HuiyingLi merged 7 commits into NVIDIA-NeMo:main from khazic:khazic/feat/eagle-qwen3-moe-support

3 participants
