
feat(speculative): add Qwen3-MoE target support to EAGLE-1/2/3#2317

Merged
HuiyingLi merged 7 commits into NVIDIA-NeMo:main from khazic:khazic/feat/eagle-qwen3-moe-support
May 27, 2026
Conversation

Contributor

@khazic khazic commented May 26, 2026

What does this PR do?

Add Qwen3MoeForCausalLM (Qwen3-MoE family) as a supported target for
EAGLE-1 / EAGLE-2 / EAGLE-3 training, alongside the existing dense
Llama / Phi-3 paths. Also wires the EAGLE recipes to AutoModel's existing
FSDP2 sharding so 30B-class MoE targets actually fit on standard 80GB
multi-GPU nodes -- the previous code replicated the full target per rank
and OOMed on anything bigger than ~13B.

Changelog

Qwen3-MoE target support:

  • components/speculative/eagle/registry.py: append
    Qwen3MoeForCausalLM to _DENSE_ARCHITECTURES. The dense draft
    works as-is for MoE backbones because the EAGLE draft only consumes
    post-block hidden states emitted by register_forward_hook on each
    decoder layer -- per-expert routing internals are not part of the
    draft's input.
  • recipes/llm/train_eagle{1,3}.py: docstrings extended from
    "Llama, Phi-3" to mention Qwen3-MoE.
  • Example YAMLs: examples/speculative/eagle{1,2,3}/qwen3_moe_*_perfectblend.yaml.
  • tests/unit_tests/speculative/test_eagle_registry.py: add four
    tests covering EAGLE-1 / EAGLE-3 registry containment and dispatch
    resolution for Qwen3MoeForCausalLM -> LlamaEagle*DraftModel.
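The dispatch behavior those tests cover can be pictured with a small standalone sketch. This is not the code in registry.py: the helper `resolve_draft_class_name`, the version-to-class mapping, and the exact tuple contents are assumptions made for illustration; only `_DENSE_ARCHITECTURES`, `Qwen3MoeForCausalLM`, and the `LlamaEagle*DraftModel` names come from this PR.

```python
# Hypothetical, simplified stand-in for the registry lookup described above.
# Tuple contents are assumed from the targets this PR mentions.
_DENSE_ARCHITECTURES = (
    "LlamaForCausalLM",
    "Phi3ForCausalLM",
    "Qwen3ForCausalLM",
    "Qwen3MoeForCausalLM",  # MoE target that reuses the dense draft
)

# Assumed mapping from EAGLE version to the dense draft class name.
_DRAFT_BY_VERSION = {1: "LlamaEagleDraftModel", 3: "LlamaEagle3DraftModel"}


def resolve_draft_class_name(architectures, eagle_version):
    """Return the dense draft class name for the first supported architecture."""
    if eagle_version not in _DRAFT_BY_VERSION:
        raise ValueError(f"Unsupported EAGLE version: {eagle_version}")
    for arch in architectures:
        if arch in _DENSE_ARCHITECTURES:
            return _DRAFT_BY_VERSION[eagle_version]
    raise ValueError(f"No EAGLE draft registered for {list(architectures)}")


moe_draft = resolve_draft_class_name(["Qwen3MoeForCausalLM"], eagle_version=3)
```

An unregistered architecture raises instead of silently picking a draft, which is the failure the registry entry prevents.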

FSDP2 target sharding for the EAGLE recipes:

  • recipes/llm/train_eagle{1,3}.py: when a distributed: section
    is present in the YAML, call setup_distributed and thread
    device_mesh / moe_mesh / distributed_config / moe_config /
    activation_checkpointing into NeMoAutoModelForCausalLM.from_pretrained.
    The infrastructure layer then shards the target with FSDP2 (and EP
    when ep_size > 1); the brute-force .to(self.device) is skipped
    in the sharded path because the infra has already placed the
    parameters.
  • The draft model keeps its existing DistributedDataParallel wrap.
    FSDP2 on the target and DDP on the draft coexist on the same world,
    each using its own communication scheme.
  • When distributed: is absent the recipes fall back to the original
    single-GPU-per-rank behavior, so existing Llama / Phi-3 / Qwen3-dense
    recipes are unaffected.
  • New Qwen3-MoE example YAMLs ship with strategy: fsdp2 and
    activation_checkpointing: true to leave headroom for the
    TTT-unrolled draft forwards. ep_size can be raised in YAML for
    larger MoEs; the recipe forwards moe_config already.
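The opt-in branch described in this list can be sketched roughly as follows. The function `plan_target_load` and its return shape are hypothetical and exist only to show the decision; the keyword names are the ones listed in the PR description, and the real recipes build actual mesh objects via setup_distributed rather than returning names.

```python
# Keyword names come from the PR description; the helper itself is hypothetical.
SHARDING_KWARGS = (
    "device_mesh",
    "moe_mesh",
    "distributed_config",
    "moe_config",
    "activation_checkpointing",
)


def plan_target_load(recipe_cfg):
    """Describe how the target would be loaded for a given recipe config dict."""
    dist_cfg = recipe_cfg.get("distributed")
    if dist_cfg is None:
        # No distributed: section, so keep the original behavior:
        # load the full target on every rank and move it with .to(device).
        return {"forward_kwargs": (), "call_to_device": True, "expert_parallel": False}
    # Sharded path: the infrastructure layer places parameters itself,
    # so the explicit .to(device) call is skipped.
    return {
        "forward_kwargs": SHARDING_KWARGS,
        "call_to_device": False,
        "expert_parallel": dist_cfg.get("ep_size", 1) > 1,
    }


legacy_plan = plan_target_load({"model": "dense-target"})
sharded_plan = plan_target_load({"distributed": {"strategy": "fsdp2", "ep_size": 8}})
```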

DTensor copy fix:

  • components/speculative/eagle/draft_llama.py and
    draft_llama_v12.py: in copy_embeddings_from_target, gather a
    DTensor target weight via .full_tensor() before copying into the
    unsharded draft parameter. Without this, FSDP2-wrapped targets raise
    RuntimeError: aten.copy_.default got mixed torch.Tensor and DTensor.
    The hasattr guard makes the gather a no-op on the existing unsharded path.
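A minimal sketch of the guard pattern, assuming a helper named `copy_embedding_weight` (hypothetical; the real method is copy_embeddings_from_target). To stay runnable in a single process, the sharded case is imitated with a stand-in class that exposes `full_tensor()`; it is not a real DTensor.

```python
import torch
from torch import nn


def copy_embedding_weight(draft_embed, target_weight):
    """Copy a target embedding weight into an unsharded draft embedding."""
    # Sharded weights expose full_tensor(); plain tensors do not, so the
    # unsharded path skips the gather entirely.
    if hasattr(target_weight, "full_tensor"):
        target_weight = target_weight.full_tensor()
    with torch.no_grad():
        draft_embed.weight.copy_(target_weight)


class FakeShardedWeight:
    """Stand-in for a sharded weight (not a real DTensor)."""

    def __init__(self, full):
        self._full = full

    def full_tensor(self):
        return self._full


draft = nn.Embedding(10, 4)
target = nn.Embedding(10, 4)

# Unsharded path: plain parameter, no gather.
copy_embedding_weight(draft, target.weight)
plain_ok = torch.equal(draft.weight, target.weight)

# "Sharded" path: the guard gathers before copying.
full = torch.randn(10, 4)
copy_embedding_weight(draft, FakeShardedWeight(full))
gathered_ok = torch.equal(draft.weight, full)
```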

Verification

End-to-end smoke test on 8 x A800 80GB, Qwen3-30B-A3B-Instruct-2507
(non-thinking variant) as target, EAGLE-3 recipe, FSDP2 strategy
(dp_size=8, activation_checkpointing=true):

  • Target weights sharded across ranks (~16 GB per rank instead of full
    60 GB), no OOM.
  • Draft training enters the loop and converges over a 250-step sanity
    run: training loss trends downward from 9.07 to 4.17 (with small
    step-to-step fluctuations), draft accuracy rises from near-zero to
    ~0.31, and LR warms up to 1e-4 by step 30 before following the
    cosine schedule.
  • Each 10-step window completes in ~6 s end-to-end.

Training-loop log excerpt (step 10 -> 250, every 10 steps):

epoch=0 step=10  train_loss=9.065207 train_acc=0.006130 lr=3.548e-05
epoch=0 step=20  train_loss=7.670224 train_acc=0.062447 lr=6.774e-05
epoch=0 step=30  train_loss=6.768910 train_acc=0.068292 lr=1.000e-04
epoch=0 step=40  train_loss=6.126582 train_acc=0.109515 lr=9.995e-05
epoch=0 step=50  train_loss=5.978342 train_acc=0.113623 lr=9.977e-05
epoch=0 step=60  train_loss=5.815319 train_acc=0.135431 lr=9.947e-05
epoch=0 step=70  train_loss=5.840969 train_acc=0.145151 lr=9.905e-05
epoch=0 step=80  train_loss=5.370955 train_acc=0.180499 lr=9.850e-05
epoch=0 step=90  train_loss=5.316124 train_acc=0.202069 lr=9.783e-05
epoch=0 step=100 train_loss=5.357407 train_acc=0.205463 lr=9.704e-05
epoch=0 step=110 train_loss=5.378894 train_acc=0.200543 lr=9.613e-05
epoch=0 step=120 train_loss=4.823973 train_acc=0.251581 lr=9.511e-05
epoch=0 step=130 train_loss=4.989300 train_acc=0.237234 lr=9.397e-05
epoch=0 step=140 train_loss=4.670279 train_acc=0.273315 lr=9.273e-05
epoch=0 step=150 train_loss=4.925109 train_acc=0.246173 lr=9.138e-05
epoch=0 step=160 train_loss=4.756421 train_acc=0.264386 lr=8.993e-05
epoch=0 step=170 train_loss=4.565536 train_acc=0.270532 lr=8.838e-05
epoch=0 step=180 train_loss=4.455242 train_acc=0.286340 lr=8.674e-05
epoch=0 step=190 train_loss=4.445095 train_acc=0.273486 lr=8.500e-05
epoch=0 step=200 train_loss=4.605785 train_acc=0.260287 lr=8.319e-05
epoch=0 step=210 train_loss=4.307686 train_acc=0.307715 lr=8.130e-05
epoch=0 step=220 train_loss=4.183267 train_acc=0.310303 lr=7.933e-05
epoch=0 step=230 train_loss=4.380021 train_acc=0.307971 lr=7.729e-05
epoch=0 step=240 train_loss=4.345554 train_acc=0.283203 lr=7.520e-05
epoch=0 step=250 train_loss=4.174419 train_acc=0.306598 lr=7.304e-05
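For readers who want to summarize a log in this format, a small parser along these lines may help. The regular expression and the `parse_log` helper are assumptions based only on the excerpt above, and the sample reuses four of its lines.

```python
import re

# Pattern inferred from the log excerpt above; not taken from the recipe code.
LOG_LINE = re.compile(
    r"epoch=(\d+)\s+step=(\d+)\s+train_loss=([\d.]+)"
    r"\s+train_acc=([\d.]+)\s+lr=([\d.eE+-]+)"
)


def parse_log(text):
    """Turn matching log lines into dicts of typed values."""
    rows = []
    for match in LOG_LINE.finditer(text):
        epoch, step, loss, acc, lr = match.groups()
        rows.append(
            {"epoch": int(epoch), "step": int(step), "loss": float(loss),
             "acc": float(acc), "lr": float(lr)}
        )
    return rows


sample = """\
epoch=0 step=10  train_loss=9.065207 train_acc=0.006130 lr=3.548e-05
epoch=0 step=60  train_loss=5.815319 train_acc=0.135431 lr=9.947e-05
epoch=0 step=70  train_loss=5.840969 train_acc=0.145151 lr=9.905e-05
epoch=0 step=250 train_loss=4.174419 train_acc=0.306598 lr=7.304e-05
"""

rows = parse_log(sample)
# Steps whose loss is higher than the previous parsed row.
loss_increases = [b["step"] for a, b in zip(rows, rows[1:]) if b["loss"] > a["loss"]]
loss_drop = rows[0]["loss"] - rows[-1]["loss"]
```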

The verification path exercises every change in this PR: the registry
entry (otherwise dispatch raises), the FSDP2 sharding (otherwise OOM at
load), the DTensor copy fix (otherwise crash at
copy_embeddings_from_target before the first step).

Before your PR is "Ready for review"

Pre checks:

  • Contributor guidelines followed
  • Unit tests added (test_eagle_registry.py -- 4 new tests)
  • Example configs added under examples/speculative/eagle{1,2,3}/
  • DCO sign-off on every commit

khazic added 3 commits May 26, 2026 13:32
HF Qwen3MoeForCausalLM exposes the same post-block hidden states as a dense
causal LM via register_forward_hook -- only the per-expert routing internals
differ, and the dense LlamaEagle* drafts never look at those. Register the
architecture in _DENSE_ARCHITECTURES so resolve_eagle{1,3}_draft_spec returns
the existing dense draft for it; no changes to HFEagle3TargetModel or to any
draft module are required.
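As an illustration of why routing internals do not matter here, the sketch below captures post-block outputs with register_forward_hook on a toy stack of layers. `ToyBlock`, `ToyTarget`, and `capture_hidden_states` are hypothetical stand-ins, not the HFEagle3TargetModel implementation; whatever happens inside a block (dense MLP or expert routing), the hook only sees the block's output.

```python
import torch
from torch import nn


class ToyBlock(nn.Module):
    """Stand-in for a decoder layer; its internals are invisible to the hook."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, hidden):
        return hidden + torch.tanh(self.proj(hidden))


class ToyTarget(nn.Module):
    def __init__(self, dim=8, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList([ToyBlock(dim) for _ in range(num_layers)])

    def forward(self, hidden):
        for layer in self.layers:
            hidden = layer(hidden)
        return hidden


def capture_hidden_states(model, layer_ids):
    """Register forward hooks that store each selected layer's output."""
    captured, handles = {}, []
    for idx in layer_ids:
        def hook(module, inputs, output, idx=idx):
            # Some decoder layers return tuples; keep the hidden state only.
            captured[idx] = output[0] if isinstance(output, tuple) else output
        handles.append(model.layers[idx].register_forward_hook(hook))
    return captured, handles


torch.manual_seed(0)
target = ToyTarget()
captured, handles = capture_hidden_states(target, [0, 2])
with torch.no_grad():
    final = target(torch.randn(2, 4, 8))
for handle in handles:
    handle.remove()  # always detach hooks once the states are collected
```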

Adds:
- "Qwen3MoeForCausalLM" entry in components/speculative/eagle/registry.py
  with an updated comment confirming MoE backbones are validated, not just
  anticipated.
- examples/speculative/eagle{1,2,3}/qwen3_moe_*_perfectblend.yaml mirroring
  the dense Qwen3 recipes but pointing at Qwen/Qwen3-30B-A3B.
- Four registry unit tests covering EAGLE-1 and EAGLE-3 dispatch for
  Qwen3MoeForCausalLM.

Recipe docstrings extended from "Llama, Phi-3" to mention Qwen3-MoE.

Signed-off-by: khazic <khazzz1c@gmail.com>
The EAGLE recipes loaded the target with bare
``from_pretrained(...).to(self.device)``: every rank replicated the full
target. Fine for 8B-class dense (~16GB), OOM for Qwen3-30B-A3B (~60GB
bf16) on 8x80GB.

Make the ``distributed:`` YAML section opt-in for the target. When
present, ``setup_distributed`` builds a ``MeshContext`` and the recipe
forwards ``device_mesh`` / ``moe_mesh`` / ``distributed_config`` /
``moe_config`` / ``activation_checkpointing`` into
``NeMoAutoModelForCausalLM.from_pretrained``. The infrastructure layer
then shards the target with FSDP2 (and EP when ``ep_size > 1``) and
places it on the correct devices. The brute-force ``.to(self.device)``
is skipped in the sharded path because the infra has already done it.

The draft model keeps its existing ``DistributedDataParallel`` wrap;
FSDP2 on the target and DDP on the draft coexist on the same world.

Recipes touched: ``train_eagle3.py`` (TTT path) and ``train_eagle1.py``
(EAGLE-1/2 alias path). When ``distributed:`` is absent the recipes
fall back to the original single-GPU-per-rank behavior, so existing
Llama / Phi-3 / Qwen3-dense recipes are unaffected.

Qwen3-MoE example YAMLs (EAGLE-1/2/3) ship with the minimal sharding
config: ``strategy: fsdp2``, default ``dp_size = world_size``, and
``activation_checkpointing: true`` to leave headroom for the TTT-
unrolled draft forwards. EP can be enabled later by setting ``ep_size``;
the recipe forwards ``moe_config`` already.

Signed-off-by: khazic <khazzz1c@gmail.com>
After enabling FSDP2 sharding on the EAGLE target, the target's
``embed_tokens.weight`` is a ``DTensor`` sharded across ranks. The
draft model is not FSDP-wrapped, so its ``embed_tokens.weight`` is a
plain ``nn.Parameter``. Calling ``draft_weight.copy_(target_weight)``
then trips:

    RuntimeError: aten.copy_.default got mixed torch.Tensor and DTensor

Gather the target weight to a full local tensor via ``full_tensor()``
before the copy. ``hasattr`` guard keeps the unsharded path
(no ``distributed:`` section in YAML) untouched.

Fixes the regression introduced by the FSDP2 target-sharding patch in
the EAGLE-1/2/3 recipes; both ``LlamaEagle3DraftModel`` and
``LlamaEagleDraftModel`` (EAGLE-1/2) needed the same fix.

Signed-off-by: khazic <khazzz1c@gmail.com>

copy-pr-bot Bot commented May 26, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@HuiyingLi
Contributor

/ok to test f95b06a

target_kwargs = dict(
    trust_remote_code=recipe_cfg.get("trust_remote_code", False),
    torch_dtype=self.compute_dtype,
    force_hf=True,
Contributor

If force_hf=True, the custom model impl path won't be triggered.

@HuiyingLi
Contributor

Thank you @khazic. EP8 is validated at commit d68a275. Would you mind resolving the conflicts and changing the recipes to ep8? Thank you so much!

Resolved conflicts in the EAGLE registry and recipe docstrings between
the Qwen3 dense target landing on upstream (PR NVIDIA-NeMo#2313) and the Qwen3-MoE
target on this branch. Both architectures are now registered, and the
recipe / class docstrings list Llama, Phi-3, Qwen3 and Qwen3-MoE.

Conflicting files:
- nemo_automodel/components/speculative/eagle/registry.py
  Kept both Qwen3ForCausalLM (upstream) and Qwen3MoeForCausalLM (ours)
  in _DENSE_ARCHITECTURES.
- nemo_automodel/recipes/llm/train_eagle1.py
- nemo_automodel/recipes/llm/train_eagle3.py
  Merged module docstring and class docstring to cover both dense Qwen3
  and Qwen3-MoE targets.

Signed-off-by: khazic <khazzz1c@gmail.com>
@HuiyingLi
Contributor

/ok to test 7fbb312

Bump ``distributed.ep_size`` from 1 to 8 in the three Qwen3-MoE EAGLE
perfectblend example configs (eagle1 / eagle2 / eagle3). EP8 is the
prevailing standard for Qwen3-30B-A3B across the rest of examples/
(qwen3_moe_30b_*.yaml under examples/llm_finetune/qwen) -- it splits the
128 experts evenly across an 8xH100 / 8xA100 node (16 experts per rank)
and leaves the remaining non-expert weights to be FSDP-sharded across
the ``world_size // ep_size`` DP ranks.
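
The expert split quoted above is simple arithmetic, sketched here under the assumption of a contiguous, even assignment of experts to expert-parallel ranks. The helper `experts_for_rank` is hypothetical, and the actual placement is decided by the infrastructure layer.

```python
def experts_for_rank(num_experts, ep_size, ep_rank):
    """Return the expert indices owned by one expert-parallel rank.

    Assumes a contiguous, even split; raises if the experts do not divide evenly.
    """
    if num_experts % ep_size != 0:
        raise ValueError(f"{num_experts} experts do not split evenly over {ep_size} ranks")
    if not 0 <= ep_rank < ep_size:
        raise ValueError(f"ep_rank {ep_rank} is outside [0, {ep_size})")
    per_rank = num_experts // ep_size
    start = ep_rank * per_rank
    return range(start, start + per_rank)


# 128 experts over ep_size=8, as in the commit message above.
first_rank = experts_for_rank(128, 8, 0)
last_rank = experts_for_rank(128, 8, 7)
```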

Also updates the comments above the ``distributed`` block to describe
the EP8 layout instead of the old ``dp_size = world_size`` claim, which
no longer holds once expert parallelism is enabled. Reviewer confirmed
the EP8 configuration is validated end-to-end at d68a275.

Signed-off-by: khazic <khazzz1c@gmail.com>
@HuiyingLi
Contributor

/ok to test 1170d55

Contributor

@HuiyingLi HuiyingLi left a comment

Thank you!


3 participants