feat(models): add Hy-MT2-30B-A3B SFT support #2320
Merged
HuiyingLi merged 21 commits on May 28, 2026
Add a dedicated ``HyMT2ForCausalLM`` module under
``components/models/hy_mt2`` for tencent/Hy-MT2-30B-A3B (translation MoE,
30B total / 3B activated). The on-disk checkpoint shares
``architectures: ["HYV3ForCausalLM"]`` and ``model_type: "hy_v3"`` with
Tencent's older Hy3-preview, but the two models differ substantially in
sizing (48 layers vs 80, 128 experts vs 192, GQA 32/4 vs 64/8,
hidden=2048 vs 4096, rms_norm_eps=1e-5 vs 1e-6) and in three flags that
the existing ``hy_v3`` module either hard-codes or does not handle:
``moe_router_use_sigmoid`` (made configurable here),
``enable_lm_head_fp32`` (in-model fp32 upcast fallback when the YAML
does not set ``lm_head_precision``), and ``expert_hidden_dim`` (synonym
of ``moe_intermediate_size`` preferred when both are present).
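The flag handling described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the helper name ``resolve_hy_mt2_flags``, the ``"sigmoid"``/``"softmax"`` labels, and the default for ``moe_router_use_sigmoid`` are assumptions made for the example.

```python
from types import SimpleNamespace


def resolve_hy_mt2_flags(config):
    """Hypothetical helper: read the three Hy-MT2 flags from a config object."""
    # expert_hidden_dim is preferred when both sizing fields are present.
    expert_dim = getattr(config, "expert_hidden_dim", None)
    if expert_dim is None:
        expert_dim = getattr(config, "moe_intermediate_size", None)

    # The router scoring function is driven by the flag rather than hard-coded.
    # The default (True) and the string labels are assumptions for this sketch.
    use_sigmoid = getattr(config, "moe_router_use_sigmoid", True)
    score_func = "sigmoid" if use_sigmoid else "softmax"

    # Treated as False when the config does not declare the flag.
    lm_head_fp32 = bool(getattr(config, "enable_lm_head_fp32", False))

    return {
        "expert_dim": expert_dim,
        "score_func": score_func,
        "lm_head_fp32": lm_head_fp32,
    }


# Example: both sizing fields present, so expert_hidden_dim wins.
example_cfg = SimpleNamespace(
    expert_hidden_dim=768,
    moe_intermediate_size=1024,  # illustrative value only
    moe_router_use_sigmoid=True,
    enable_lm_head_fp32=True,
)
example_flags = resolve_hy_mt2_flags(example_cfg)
```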
The new module is intentionally independent of ``components/models/hy_v3``
so the Hy3-preview recipes are unaffected. The example YAML at
``examples/llm_finetune/hy_mt2/hy_mt2_30b_a3b_sft.yaml`` instantiates the
new class via a fully-qualified ``_target_`` instead of going through the
NeMoAutoModel registry, avoiding the architecture-string collision.
Files:
nemo_automodel/components/models/hy_mt2/{__init__,config,layers,model,state_dict_adapter}.py
examples/llm_finetune/hy_mt2/hy_mt2_30b_a3b_sft.yaml
tests/unit_tests/models/hy_mt2/test_hy_mt2_{config,layers,model,state_dict_adapter}.py
EP / TP / DP / FSDP2 wire up through the standard MoE stack
(``MoEFSDPSyncMixin`` + ``components/moe``). EP must divide ``num_experts``
= 128; the example uses ``ep_size: 8`` (16 experts per rank) on an
8xH100 node.
Signed-off-by: khazic <khazzz1c@gmail.com>
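The EP constraint stated in the commit message above (``ep_size`` must divide ``num_experts``) is plain integer arithmetic. A small sketch, with a hypothetical helper name that is not part of the repository:

```python
def experts_per_rank(num_experts: int, ep_size: int) -> int:
    """Hypothetical helper: experts hosted on each expert-parallel rank."""
    if ep_size <= 0 or num_experts % ep_size != 0:
        raise ValueError(
            f"ep_size={ep_size} must be a positive divisor of num_experts={num_experts}"
        )
    return num_experts // ep_size


# The example recipe: 128 experts with ep_size 8 gives 16 experts per rank.
per_rank = experts_per_rank(128, 8)
```

Any ``ep_size`` that does not divide 128, such as 6, is rejected up front instead of producing an uneven shard.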
Two refinements on top of a7d91c3:
1. **Config-shape dispatcher** in ``_transformers/model_init.py``: when ``architectures: ["HYV3ForCausalLM"]`` is paired with the Hy-MT2-30B-A3B config fingerprint (hidden=2048, 48 layers, 128 experts, expert_hidden_dim=768, ``enable_lm_head_fp32`` present), resolve to ``HyMT2ForCausalLM`` instead of the default ``HYV3ForCausalLM``. Hy3-preview (hidden=4096, 80 layers, 192 experts) still resolves to ``HYV3ForCausalLM``. Two tests in ``test_model_init.py`` lock this dispatch in.
2. **lm_head fp32 dtype fix** in ``HyMT2ForCausalLM.forward``: when ``enable_lm_head_fp32`` is on, the upcast path was calling ``self.lm_head(hidden.float())`` which would fail because ``lm_head.weight`` stays in bf16 after ``cast_model_to_dtype``. Replace with an explicit ``F.linear(hidden.float(), self.lm_head.weight.float(), bias.float() | None)`` so both operands are fp32; the result is cast back to the original dtype.
The example YAML now uses the fully-qualified ``HyMT2ForCausalLM`` target; combined with (1) it can also be loaded via ``NeMoAutoModelForCausalLM``, which gives users both an explicit and an auto-dispatch path.
Signed-off-by: khazic <khazzz1c@gmail.com>
The header comment showed ``automodel finetune llm -c <yaml> ...`` which is
not the real CLI signature -- ``nemo_automodel/cli/app.py:76-81`` takes the
YAML path as the first positional argument, so the previous form silently
treated ``finetune`` as the config path and failed with FileNotFoundError
on ``./finetune``. Update the comment to match the actual usage:
automodel <config.yaml> --nproc-per-node 8
Signed-off-by: khazic <khazzz1c@gmail.com>
The in-model ``enable_lm_head_fp32`` path called ``F.linear`` directly with
``self.lm_head.weight.float()``. Under FSDP2 the lm_head weight is a
DTensor, and ``F.linear`` does not handle DTensor redistribution -- the
hidden state is a plain torch.Tensor, so the matmul crashes with::
RuntimeError: aten.mm.default got mixed torch.Tensor and DTensor,
need to convert all torch.Tensor to DTensor before calling
distributed operators!
Drop the explicit ``F.linear`` and rely on ``self.lm_head(...)`` instead;
``nn.Linear.forward`` is DTensor-aware and will redistribute the input as
needed. To avoid the original dtype-mismatch motivation for the manual
upcast (fp32 input vs. bf16 weight), only upcast when ``lm_head.weight``
has already been promoted to fp32 -- which is exactly what the YAML's
``distributed.moe.lm_head_precision: float32`` path does via the MoE
parallelizer's ``MixedPrecisionPolicy``. If the weight is still in the
model dtype, fall through to the standard ``self.lm_head(hidden)`` path.
Also drop the now-unused ``torch.nn.functional`` import and update the
unit tests to validate the new condition (weight promoted -> upcast
runs; weight not promoted -> fall through).
Signed-off-by: khazic <khazzz1c@gmail.com>
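The gating condition from the commit message above can be sketched with plain PyTorch. This is a single-process illustration under stated assumptions: the function name ``lm_head_logits`` is hypothetical, no DTensor or FSDP2 is involved, and casting the result back to the input dtype is carried over from the earlier commit's description rather than confirmed for the final code.

```python
import torch
from torch import nn


def lm_head_logits(
    lm_head: nn.Linear, hidden: torch.Tensor, enable_lm_head_fp32: bool
) -> torch.Tensor:
    """Hypothetical sketch of the gated fp32 upcast for the LM head."""
    # Upcast only when the weight has already been promoted to fp32
    # (for example by a mixed-precision policy); otherwise an fp32 input
    # would meet a bf16 weight and the matmul would fail on dtype mismatch.
    if enable_lm_head_fp32 and lm_head.weight.dtype == torch.float32:
        # Calling the module (not F.linear) keeps the nn.Linear forward path.
        # Casting back to the input dtype is an assumption of this sketch.
        return lm_head(hidden.float()).to(hidden.dtype)
    # Weight still in the model dtype: fall through to the standard path.
    return lm_head(hidden)


hidden_bf16 = torch.randn(2, 4, dtype=torch.bfloat16)

promoted_head = nn.Linear(4, 3, bias=False)  # weight stays in fp32
logits_promoted = lm_head_logits(promoted_head, hidden_bf16, True)

bf16_head = nn.Linear(4, 3, bias=False).to(torch.bfloat16)  # weight not promoted
logits_fallthrough = lm_head_logits(bf16_head, hidden_bf16, True)
```

Both calls succeed: the first takes the upcast branch, the second falls through because the weight is still bf16.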
The fully-qualified ``_target_: HyMT2ForCausalLM.from_pretrained`` path bypasses ``_transformers/model_init.py``, which is where the HF safetensors loader actually runs. Our class method only invokes ``AutoConfig.from_pretrained`` and ``cls.from_config(...)`` -- the returned model has the right architecture but random weights, so SFT starts at ``loss ~= ln(vocab) = 11.7`` instead of starting from the loaded pre-trained weights.
Switch the YAML back to ``NeMoAutoModelForCausalLM.from_pretrained``. The config-shape dispatcher added in a21e014 will still route this to ``HyMT2ForCausalLM`` (hidden=2048 + 48 layers + 128 experts + ``enable_lm_head_fp32``), and the standard NeMoAutoModel loader pipeline will then stream the safetensors through ``HyMT2StateDictAdapter`` into the FSDP2 / EP-sharded parameters.
Signed-off-by: khazic <khazzz1c@gmail.com>
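The ``ln(vocab)`` figure is the cross-entropy of a uniform prediction over the vocabulary, which is roughly what a randomly initialized model produces. A quick check, using the vocabulary size 120832 quoted in the PR description's verification notes:

```python
import math

# Vocabulary size as quoted in the PR description's verification notes.
vocab_size = 120832

# Cross-entropy of a uniform distribution over the vocabulary: -ln(1 / V) = ln(V).
random_init_loss = math.log(vocab_size)

print(f"random-init baseline loss: {random_init_loss:.2f}")
```

A step-0 loss near this value indicates that the weights were not loaded; a much lower value indicates the pretrained checkpoint is in place.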
/ok to test c9945eb
Moves the ``_is_hy_mt2_config`` fingerprint predicate from ``_transformers/model_init.py`` into a new ``components/models/hy_mt2/dispatch.py`` module, and migrates the matching dispatcher tests from ``tests/.../_transformers/test_model_init.py`` to ``tests/.../models/hy_mt2/test_dispatch.py``.
The auto-resolver in model_init.py now keeps only a 4-line shim that imports ``is_hy_mt2_config`` from the model package when the architecture name matches HYV3ForCausalLM, so Hy-MT2-specific knowledge (which hidden_size, layer count, expert count, etc. identify the checkpoint) lives entirely inside ``components/models/hy_mt2/`` rather than leaking into shared code.
No behavior change: same fingerprint fields, same dispatch outcome, the existing Hy3-preview path is untouched.
Signed-off-by: khazic <khazzz1c@gmail.com>
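A fingerprint predicate of the kind described in these commits might look like the sketch below. The fingerprint values come from the commit messages; the function names and the attribute name ``num_hidden_layers`` are assumptions for illustration and are not taken from the repository.

```python
from types import SimpleNamespace


def is_hy_mt2_config_sketch(config) -> bool:
    """Hypothetical predicate: does the config match the Hy-MT2-30B-A3B shape?"""
    return (
        getattr(config, "hidden_size", None) == 2048
        and getattr(config, "num_hidden_layers", None) == 48  # attribute name assumed
        and getattr(config, "num_experts", None) == 128
        and getattr(config, "expert_hidden_dim", None) == 768
        and hasattr(config, "enable_lm_head_fp32")
    )


def resolve_class_name(architecture: str, config) -> str:
    """Hypothetical shim: only the shared architecture string is re-routed."""
    if architecture == "HYV3ForCausalLM" and is_hy_mt2_config_sketch(config):
        return "HyMT2ForCausalLM"
    return architecture


hy_mt2_cfg = SimpleNamespace(
    hidden_size=2048,
    num_hidden_layers=48,
    num_experts=128,
    expert_hidden_dim=768,
    enable_lm_head_fp32=True,
)
hy3_preview_cfg = SimpleNamespace(hidden_size=4096, num_hidden_layers=80, num_experts=192)

resolved_mt2 = resolve_class_name("HYV3ForCausalLM", hy_mt2_cfg)
resolved_hy3 = resolve_class_name("HYV3ForCausalLM", hy3_preview_cfg)
```

Matching on several fields at once keeps the Hy3-preview path untouched: a config that differs in any one field falls back to the default class.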
/ok to test 4e6e6e8
HuiyingLi previously approved these changes May 27, 2026
jgerh reviewed May 27, 2026
Completed a tech pubs review of .md and .mdx files and added comments to maintain alignment with our style and writing guidelines. Also, you might want to consider adding Hy-MT2-30B-A3B (PR #2320) to the AutoModel README news section. For example, [MM/DD/2026] Hy-MT2-30B-A3B — We now support finetuning [model-id/Hy-MT2-30B-A3B]. Check out our recipe.
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Huiying <willwin.lee@gmail.com>
/ok to test d8627c6
/ok to test 460dab9
/ok to test 2a3f6f9
HuiyingLi previously approved these changes May 27, 2026
`test_enable_lm_head_fp32_default_false_without_config_flag` constructs `HyMT2ForCausalLM(_Cfg(), ...)` with a bare mock class to verify that the flag defaults to ``False`` when the config does not declare it. The bare mock skips ``PretrainedConfig.__init__``, which is what normally synthesizes ``rope_parameters`` from ``rope_theta``. As a result, ``get_rope_config`` (called during model construction) raised ``AttributeError: '_Cfg' object has no attribute 'rope_parameters'`` on GPU CI. Add the field to the mock with the same shape ``PretrainedConfig`` would produce. The CPU test suite cannot trigger this (the whole ``TestHyMT2ForCausalLM`` class is CUDA-gated), so the regression was only visible on the L0_Unit_Tests_GPU job. Signed-off-by: khazic <khazzz1c@gmail.com>
/ok to test 7b9e32c
HuiyingLi approved these changes May 28, 2026
What
Add full-finetuning support for tencent/Hy-MT2-30B-A3B, Tencent's translation MoE (30B total / 3B activated). The on-disk checkpoint shares `architectures: ["HYV3ForCausalLM"]` and `model_type: "hy_v3"` with the older Hy3-preview (295B), but the two models differ substantially, notably in the `enable_lm_head_fp32`, `moe_router_use_sigmoid`, and `expert_hidden_dim` flags.
A new dedicated `HyMT2ForCausalLM` module under `components/models/hy_mt2` is added rather than overloading `HYV3ForCausalLM`, so the existing Hy3-preview support is untouched. Dispatch between the two custom impls is done by a config-shape detector in `_transformers/model_init.py`.
Changelog
- `nemo_automodel/components/models/hy_mt2/`: new module (config, layers, model, state_dict_adapter, `__init__`)
- `nemo_automodel/_transformers/model_init.py`: add `_is_hy_mt2_config` dispatcher that routes Hy-MT2 to `HyMT2ForCausalLM` while keeping Hy3-preview on `HYV3ForCausalLM`
- `examples/llm_finetune/hy_mt2/hy_mt2_30b_a3b_sft.yaml`: SFT example, FSDP2 + EP8 + activation checkpointing + `lm_head_precision: float32`
- `tests/unit_tests/models/hy_mt2/`: unit tests (config, layers, model, state_dict_adapter); 41 pass / 30 CUDA-gated
- `tests/unit_tests/_transformers/test_model_init.py`: 2 dispatcher tests (Hy-MT2 fingerprint → HyMT2; Hy3-preview fingerprint → HYV3)
Key implementation details vs. `components/models/hy_v3`:
- `moe_router_use_sigmoid` flag drives `score_func` instead of being hard-coded
- `enable_lm_head_fp32` in-forward fp32 upcast, gated by `lm_head.weight.dtype == fp32` so it cooperates with the MoE parallelizer's `MixedPrecisionPolicy` (set via YAML `distributed.moe.lm_head_precision: float32`). Uses `self.lm_head(hidden.float())` rather than `F.linear` to stay DTensor-aware under FSDP2.
- `expert_hidden_dim` takes precedence over `moe_intermediate_size` for the expert MLP hidden dim (matches HF reference)
- `qk_norm` configurable via `config.qk_norm`
EP/TP/DP/FSDP2 ride the standard MoE stack (`MoEFSDPSyncMixin` + `components/moe`). EP must divide `num_experts` = 128; the example uses `ep_size: 8` (16 experts per rank on an 8x80GB node).
End-to-end verification
Ran the example YAML on 8xH100 with the real checkpoint. The three runs below isolate (a) weight loading, (b) task fit, (c) translation SFT, and rule out random-init drift.
Run 1: initial sanity check on HellaSwag (5 steps)
Step-0 loss of 2.7 is far below the ln(120832) ≈ 11.7 random-init baseline, so the safetensors loaded correctly through `HyMT2StateDictAdapter`. The HellaSwag plateau around 1.85 simply reflects the mismatch between the model's domain (translation) and the task (English commonsense).
Run 2: translation chat data (the actual training distribution), 33 steps
Step-0 loss of ~1.0 (vs. 2.7 on HellaSwag) is what a properly loaded, translation-specialized model should produce on its own training distribution. Loss halves within 30 steps, which confirms that end-to-end gradients, EP all-reduce, the optimizer, and the lm_head fp32 path all work.
Pre-checks
- `ruff format` + `ruff check` clean on all new files
- `tencent/Hy-MT2-30B-A3B` checkpoint