feat(speculative): add Qwen3-MoE target support to EAGLE-1/2/3 #2317
Merged
HuiyingLi merged 7 commits on May 27, 2026
HF Qwen3MoeForCausalLM exposes the same post-block hidden states as a dense
causal LM via register_forward_hook -- only the per-expert routing internals
differ, and the dense LlamaEagle* drafts never look at those. Register the
architecture in _DENSE_ARCHITECTURES so resolve_eagle{1,3}_draft_spec returns
the existing dense draft for it; no changes to HFEagle3TargetModel or to any
draft module are required.
Adds:
- "Qwen3MoeForCausalLM" entry in components/speculative/eagle/registry.py
with an updated comment confirming MoE backbones are validated, not just
anticipated.
- examples/speculative/eagle{1,2,3}/qwen3_moe_*_perfectblend.yaml mirroring
the dense Qwen3 recipes but pointing at Qwen/Qwen3-30B-A3B.
- Four registry unit tests covering EAGLE-1 and EAGLE-3 dispatch for
Qwen3MoeForCausalLM.
Recipe docstrings extended from "Llama, Phi-3" to mention Qwen3-MoE.
Signed-off-by: khazic <khazzz1c@gmail.com>
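The dispatch this commit relies on can be pictured with a small stand-alone sketch. It is not the code in registry.py: the set below is a simplified stand-in for _DENSE_ARCHITECTURES, resolve_draft_name is a hypothetical helper (the real resolve_eagle{1,3}_draft_spec signatures are not shown in this PR), and the Llama / Phi-3 class names are assumed to be the usual HF ones.

```python
# Simplified stand-in for the EAGLE registry. Only Qwen3MoeForCausalLM and the
# two draft class names come from this PR; everything else is an assumption.
_DENSE_ARCHITECTURES = frozenset(
    {
        "LlamaForCausalLM",     # assumed HF class name for Llama targets
        "Phi3ForCausalLM",      # assumed HF class name for Phi-3 targets
        "Qwen3MoeForCausalLM",  # entry added by this commit
    }
)


def resolve_draft_name(architecture: str, eagle_version: int) -> str:
    """Hypothetical helper: map a target architecture to a dense draft class name."""
    if architecture not in _DENSE_ARCHITECTURES:
        raise ValueError(f"Unsupported EAGLE target architecture: {architecture}")
    # EAGLE-3 has its own draft class; EAGLE-1 and EAGLE-2 share one.
    if eagle_version == 3:
        return "LlamaEagle3DraftModel"
    if eagle_version in (1, 2):
        return "LlamaEagleDraftModel"
    raise ValueError(f"Unknown EAGLE version: {eagle_version}")
```

Because the MoE architecture is simply another member of the same set, it resolves to the existing dense drafts, and an unregistered architecture still fails loudly at dispatch time.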
The EAGLE recipes loaded the target with bare
``from_pretrained(...).to(self.device)``: every rank replicated the full
target. Fine for 8B-class dense (~16GB), OOM for Qwen3-30B-A3B (~60GB bf16)
on 8x80GB.

Make the ``distributed:`` YAML section opt-in for the target. When present,
``setup_distributed`` builds a ``MeshContext`` and the recipe forwards
``device_mesh`` / ``moe_mesh`` / ``distributed_config`` / ``moe_config`` /
``activation_checkpointing`` into
``NeMoAutoModelForCausalLM.from_pretrained``. The infrastructure layer then
shards the target with FSDP2 (and EP when ``ep_size > 1``) and places it on
the correct devices. The brute-force ``.to(self.device)`` is skipped in the
sharded path because the infra has already done it.

The draft model keeps its existing ``DistributedDataParallel`` wrap; FSDP2
on the target and DDP on the draft coexist on the same world.

Recipes touched: ``train_eagle3.py`` (TTT path) and ``train_eagle1.py``
(EAGLE-1/2 alias path). When ``distributed:`` is absent the recipes fall
back to the original single-GPU-per-rank behavior, so existing Llama /
Phi-3 / Qwen3-dense recipes are unaffected.

Qwen3-MoE example YAMLs (EAGLE-1/2/3) ship with the minimal sharding
config: ``strategy: fsdp2``, default ``dp_size = world_size``, and
``activation_checkpointing: true`` to leave headroom for the TTT-unrolled
draft forwards. EP can be enabled later by setting ``ep_size``; the recipe
forwards ``moe_config`` already.

Signed-off-by: khazic <khazzz1c@gmail.com>
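The opt-in logic in this commit can be sketched roughly as follows. This is an illustration under stated assumptions, not the recipe code: build_target_kwargs is a hypothetical helper, and the plain `sharding` mapping stands in for whatever ``setup_distributed`` / ``MeshContext`` actually expose, which this PR page does not show. Only the forwarded key names are taken from the commit message.

```python
from typing import Any, Mapping, Optional, Tuple

# Keys the commit message says are forwarded to from_pretrained when sharding is on.
SHARDING_KEYS = (
    "device_mesh",
    "moe_mesh",
    "distributed_config",
    "moe_config",
    "activation_checkpointing",
)


def build_target_kwargs(
    recipe_cfg: Mapping[str, Any],
    sharding: Optional[Mapping[str, Any]],
) -> Tuple[dict, bool]:
    """Return (from_pretrained kwargs, whether the caller must still move the model).

    `sharding` is a hypothetical stand-in for the mesh context; None means the
    YAML had no `distributed:` section.
    """
    kwargs = {"trust_remote_code": recipe_cfg.get("trust_remote_code", False)}
    if sharding is None:
        # Legacy path: full replica per rank, caller calls .to(device) afterwards.
        return kwargs, True
    for key in SHARDING_KEYS:
        kwargs[key] = sharding.get(key)
    # Sharded path: the infrastructure has already placed the parameters.
    return kwargs, False
```

Returning the "still needs a move" flag alongside the kwargs keeps the two paths from drifting apart: a recipe that forgets to skip ``.to(self.device)`` on a sharded target would undo the placement the infrastructure just performed.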
After enabling FSDP2 sharding on the EAGLE target, the target's
``embed_tokens.weight`` is a ``DTensor`` sharded across ranks. The
draft model is not FSDP-wrapped, so its ``embed_tokens.weight`` is a
plain ``nn.Parameter``. Calling ``draft_weight.copy_(target_weight)``
then trips:
RuntimeError: aten.copy_.default got mixed torch.Tensor and DTensor
Gather the target weight to a full local tensor via ``full_tensor()``
before the copy. ``hasattr`` guard keeps the unsharded path
(no ``distributed:`` section in YAML) untouched.
Fixes the regression introduced by the FSDP2 target-sharding patch in
the EAGLE-1/2/3 recipes; both ``LlamaEagle3DraftModel`` and
``LlamaEagleDraftModel`` (EAGLE-1/2) needed the same fix.
Signed-off-by: khazic <khazzz1c@gmail.com>
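The guard described in this commit can be illustrated without a distributed setup. The sketch below is an assumption-laden stand-in: FakeShardedWeight mimics only the one ``DTensor`` method the fix relies on (``full_tensor()``), nested lists play the role of tensors, and materialize_full / copy_embeddings are hypothetical names rather than the functions in draft_llama.py.

```python
class FakeShardedWeight:
    """Stand-in for a DTensor: holds one row-shard per rank and can gather them."""

    def __init__(self, shards):
        self.shards = shards

    def full_tensor(self):
        # Concatenate the per-rank row shards into one full matrix.
        return [row for shard in self.shards for row in shard]


def materialize_full(weight):
    """Gather a sharded weight; pass an already-local weight through unchanged."""
    if hasattr(weight, "full_tensor"):
        return weight.full_tensor()
    return weight


def copy_embeddings(draft_rows, target_weight):
    """Copy target embeddings into the (unsharded) draft, in place."""
    full = materialize_full(target_weight)
    draft_rows[:] = [list(row) for row in full]


sharded = FakeShardedWeight([[[1.0, 2.0]], [[3.0, 4.0]]])  # two ranks, one row each
draft = [[0.0, 0.0], [0.0, 0.0]]
copy_embeddings(draft, sharded)
```

The same call works when the target weight is a plain local matrix, which is why the unsharded recipes see no behavior change.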
/ok to test f95b06a
HuiyingLi reviewed May 26, 2026
target_kwargs = dict(
    trust_remote_code=recipe_cfg.get("trust_remote_code", False),
    torch_dtype=self.compute_dtype,
    force_hf=True,
If force_hf=True, the custom model impl path won't be triggered.
Resolved conflicts in the EAGLE registry and recipe docstrings between the
Qwen3 dense target landing on upstream (PR NVIDIA-NeMo#2313) and the
Qwen3-MoE target on this branch. Both architectures are now registered, and
the recipe / class docstrings list Llama, Phi-3, Qwen3 and Qwen3-MoE.

Conflicting files:
- nemo_automodel/components/speculative/eagle/registry.py
  Kept both Qwen3ForCausalLM (upstream) and Qwen3MoeForCausalLM (ours) in
  _DENSE_ARCHITECTURES.
- nemo_automodel/recipes/llm/train_eagle1.py
- nemo_automodel/recipes/llm/train_eagle3.py
  Merged module docstring and class docstring to cover both dense Qwen3 and
  Qwen3-MoE targets.

Signed-off-by: khazic <khazzz1c@gmail.com>
/ok to test 7fbb312
Bump ``distributed.ep_size`` from 1 to 8 in the three Qwen3-MoE EAGLE
perfectblend example configs (eagle1 / eagle2 / eagle3). EP8 is the
prevailing standard for Qwen3-30B-A3B across the rest of examples/
(qwen3_moe_30b_*.yaml under examples/llm_finetune/qwen) -- it splits the
128 experts evenly across an 8xH100 / 8xA100 node (16 experts per rank) and
leaves the remaining non-expert weights to be FSDP-sharded across the
``world_size // ep_size`` DP ranks.

Also updates the comments above the ``distributed`` block to describe the
EP8 layout instead of the old ``dp_size = world_size`` claim, which no
longer holds once expert parallelism is enabled.

Reviewer confirmed the EP8 configuration is validated end-to-end at d68a275.

Signed-off-by: khazic <khazzz1c@gmail.com>
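The expert arithmetic in this commit can be checked with a few lines. The sketch assumes experts are assigned to EP ranks in contiguous, equally sized blocks; the actual assignment used by the infrastructure is not shown in this PR, and both function names are hypothetical.

```python
def experts_for_rank(num_experts: int, ep_size: int, ep_rank: int) -> range:
    """Expert indices owned by one EP rank, assuming contiguous equal blocks."""
    if ep_size <= 0 or num_experts % ep_size != 0:
        raise ValueError("num_experts must divide evenly by a positive ep_size")
    if not 0 <= ep_rank < ep_size:
        raise ValueError("ep_rank must be in [0, ep_size)")
    per_rank = num_experts // ep_size
    start = ep_rank * per_rank
    return range(start, start + per_rank)


def dp_ranks(world_size: int, ep_size: int) -> int:
    """Number of DP ranks left for the non-expert weights, per the commit message."""
    return world_size // ep_size


first = experts_for_rank(128, 8, 0)  # experts 0..15
last = experts_for_rank(128, 8, 7)   # experts 112..127
```

With 128 experts and ``ep_size = 8`` each rank owns 16 experts, matching the figure quoted in the commit message.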
/ok to test 1170d55
What does this PR do?
Add Qwen3MoeForCausalLM (Qwen3-MoE family) as a supported target for
EAGLE-1 / EAGLE-2 / EAGLE-3 training, alongside the existing dense
Llama / Phi-3 paths. Also wires the EAGLE recipes to AutoModel's existing
FSDP2 sharding so 30B-class MoE targets actually fit on standard 80GB
multi-GPU nodes -- the previous code replicated the full target per rank
and OOMed on anything bigger than ~13B.
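A back-of-the-envelope calculation shows why replication stops scaling. The sketch counts bf16 weights only (2 bytes per parameter) and ignores activations, the draft model, and optimizer state; the nominal 30e9 and 8e9 parameter counts and the even 8-way split are simplifying assumptions, and weight_gb is a hypothetical helper.

```python
BYTES_PER_BF16_PARAM = 2  # bf16 stores each parameter in 16 bits


def weight_gb(num_params: float, shards: int = 1) -> float:
    """Weight memory per rank in decimal GB, assuming an even split over `shards`."""
    return num_params * BYTES_PER_BF16_PARAM / shards / 1e9


replicated_30b = weight_gb(30e9)         # full copy on every rank
sharded_30b = weight_gb(30e9, shards=8)  # even 8-way split of the weights
replicated_8b = weight_gb(8e9)           # 8B-class dense target, replicated
```

Under these assumptions a replicated 30B target takes about 60 GB of an 80 GB device for weights alone, while an even 8-way shard drops that to about 7.5 GB per rank.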
Changelog
Qwen3-MoE target support:
- components/speculative/eagle/registry.py: append Qwen3MoeForCausalLM to
  _DENSE_ARCHITECTURES. The dense draft works as-is for MoE backbones
  because the EAGLE draft only consumes post-block hidden states captured
  by register_forward_hook on each decoder layer -- per-expert routing
  internals are not part of the draft's input.
- recipes/llm/train_eagle{1,3}.py: docstrings extended from "Llama, Phi-3"
  to mention Qwen3-MoE.
- examples/speculative/eagle{1,2,3}/qwen3_moe_*_perfectblend.yaml.
- tests/unit_tests/speculative/test_eagle_registry.py: add four tests
  covering EAGLE-1 / EAGLE-3 registry containment and dispatch resolution
  for Qwen3MoeForCausalLM -> LlamaEagle*DraftModel.
FSDP2 target sharding for the EAGLE recipes:
- recipes/llm/train_eagle{1,3}.py: when a distributed: section is present
  in the YAML, call setup_distributed and thread device_mesh / moe_mesh /
  distributed_config / moe_config / activation_checkpointing into
  NeMoAutoModelForCausalLM.from_pretrained. The infrastructure layer then
  shards the target with FSDP2 (and EP when ep_size > 1); the brute-force
  .to(self.device) is skipped in the sharded path because the infra has
  already placed the parameters.
- The draft model keeps its existing DistributedDataParallel wrap. FSDP2 on
  the target and DDP on the draft coexist on the same world, each using its
  own communication scheme.
- When distributed: is absent the recipes fall back to the original
  single-GPU-per-rank behavior, so existing Llama / Phi-3 / Qwen3-dense
  recipes are unaffected.
- The Qwen3-MoE example YAMLs ship with strategy: fsdp2 and
  activation_checkpointing: true to leave headroom for the TTT-unrolled
  draft forwards.
- ep_size can be raised in YAML for larger MoEs; the recipe forwards
  moe_config already.
DTensor copy fix:
- components/speculative/eagle/draft_llama.py and draft_llama_v12.py: in
  copy_embeddings_from_target, gather a DTensor target weight via
  .full_tensor() before copying into the unsharded draft parameter. Without
  this, FSDP2-wrapped targets raise
  RuntimeError: aten.copy_.default got mixed torch.Tensor and DTensor.
  The hasattr guard makes the gather a no-op on the existing unsharded
  path.
Verification
End-to-end smoke test on 8 x A800 80GB, Qwen3-30B-A3B-Instruct-2507
(non-thinking variant) as target, EAGLE-3 recipe, FSDP2 strategy
(dp_size=8, activation_checkpointing=true):
60 GB), no OOM.
run: training loss decreases monotonically, draft accuracy rises from
near-zero to ~0.31, LR follows the cosine schedule.
Training-loop log excerpt (step 10 -> 250, every 10 steps):
The verification path exercises every change in this PR: the registry
entry (otherwise dispatch raises), the FSDP2 sharding (otherwise OOM at
load), the DTensor copy fix (otherwise crash at
copy_embeddings_from_target before the first step).
Before your PR is "Ready for review"
Pre checks:
test_eagle_registry.py -- 4 new tests)
examples/speculative/eagle{1,2,3}/