feat(models): add Hy-MT2-30B-A3B SFT support by khazic · Pull Request #2320 · NVIDIA-NeMo/Automodel

khazic · 2026-05-27T06:21:14Z

What

Add full-finetuning support for tencent/Hy-MT2-30B-A3B, Tencent's translation MoE (30B total / 3B activated). The on-disk checkpoint shares architectures: ["HYV3ForCausalLM"] and model_type: "hy_v3" with the older Hy3-preview (295B), but the two models differ substantially:

| Field | Hy3-preview (existing) | Hy-MT2-30B-A3B (new) |
| --- | --- | --- |
| layers | 80 | 48 |
| experts | 192 | 128 |
| GQA heads | 64 / 8 | 32 / 4 |
| hidden_size | 4096 | 2048 |
| dense intermediate | 1536 | 6912 |
| MoE intermediate | 1536 | 768 |
| `enable_lm_head_fp32` | not declared | true |
| `moe_router_use_sigmoid` | hard-coded sigmoid | declared in config |
| `expert_hidden_dim` | not declared | 768 |

A new dedicated HyMT2ForCausalLM module under components/models/hy_mt2 is added rather than overloading HYV3ForCausalLM, so the existing Hy3-preview support is untouched. Dispatch between the two custom impls is done by a config-shape detector in _transformers/model_init.py.
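The config-shape dispatch described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual `_is_hy_mt2_config`: the helper names `looks_like_hy_mt2` and `resolve_model_class` are hypothetical, and the attribute names `hidden_size`, `num_hidden_layers`, and `num_experts` are assumed Hugging Face-style names. Only the values come from the comparison table.

```python
from types import SimpleNamespace

# Fingerprint values taken from the comparison table; the attribute names
# are assumptions made for this illustration.
HY_MT2_FINGERPRINT = {
    "hidden_size": 2048,
    "num_hidden_layers": 48,
    "num_experts": 128,
    "expert_hidden_dim": 768,
}


def looks_like_hy_mt2(config) -> bool:
    """Return True when a config matches the Hy-MT2-30B-A3B shape."""
    # Hy3-preview does not declare enable_lm_head_fp32 at all.
    if not hasattr(config, "enable_lm_head_fp32"):
        return False
    return all(getattr(config, k, None) == v for k, v in HY_MT2_FINGERPRINT.items())


def resolve_model_class(config) -> str:
    """Pick a class name, splitting the shared HYV3ForCausalLM architecture string."""
    archs = getattr(config, "architectures", None) or []
    if "HYV3ForCausalLM" in archs and looks_like_hy_mt2(config):
        return "HyMT2ForCausalLM"
    return archs[0] if archs else "unknown"


hy_mt2 = SimpleNamespace(
    architectures=["HYV3ForCausalLM"], hidden_size=2048, num_hidden_layers=48,
    num_experts=128, expert_hidden_dim=768, enable_lm_head_fp32=True,
)
hy3_preview = SimpleNamespace(
    architectures=["HYV3ForCausalLM"], hidden_size=4096, num_hidden_layers=80,
    num_experts=192,
)
print(resolve_model_class(hy_mt2), resolve_model_class(hy3_preview))
```

Matching on several fields at once keeps a config that happens to share one value (for example, only `hidden_size`) from being routed to the wrong class.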

Changelog

nemo_automodel/components/models/hy_mt2/ — new module (config, layers, model, state_dict_adapter, init)
nemo_automodel/_transformers/model_init.py — add _is_hy_mt2_config dispatcher that routes Hy-MT2 to HyMT2ForCausalLM while keeping Hy3-preview on HYV3ForCausalLM
examples/llm_finetune/hy_mt2/hy_mt2_30b_a3b_sft.yaml — SFT example, FSDP2 + EP8 + activation checkpointing + lm_head_precision: float32
tests/unit_tests/models/hy_mt2/ — unit tests (config, layers, model, state_dict_adapter) — 41 pass / 30 CUDA-gated
tests/unit_tests/_transformers/test_model_init.py — 2 dispatcher tests (Hy-MT2 fingerprint → HyMT2; Hy3-preview fingerprint → HYV3)

Key implementation details vs. components/models/hy_v3:

moe_router_use_sigmoid flag drives score_func instead of being hard-coded
enable_lm_head_fp32 in-forward fp32 upcast, gated by lm_head.weight.dtype == fp32 so it cooperates with the MoE parallelizer's MixedPrecisionPolicy (set via YAML distributed.moe.lm_head_precision: float32). Uses self.lm_head(hidden.float()) rather than F.linear to stay DTensor-aware under FSDP2.
Prefers expert_hidden_dim over moe_intermediate_size for the expert MLP hidden dim (matches HF reference)
qk_norm configurable via config.qk_norm
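The config-driven choices in the list above might look something like this sketch. The helper names are hypothetical, and the softmax fallback when `moe_router_use_sigmoid` is absent is an assumption for illustration. The PR only states that the flag drives `score_func` and that `expert_hidden_dim` is preferred over `moe_intermediate_size`.

```python
from types import SimpleNamespace


def resolve_score_func(config) -> str:
    """Pick the router scoring function from the config flag."""
    # Assumption: fall back to softmax when the flag is missing or False.
    return "sigmoid" if getattr(config, "moe_router_use_sigmoid", False) else "softmax"


def resolve_expert_hidden_dim(config) -> int:
    """Prefer expert_hidden_dim, falling back to moe_intermediate_size."""
    expert_dim = getattr(config, "expert_hidden_dim", None)
    return expert_dim if expert_dim is not None else config.moe_intermediate_size


both_fields = SimpleNamespace(
    moe_router_use_sigmoid=True, expert_hidden_dim=768, moe_intermediate_size=768
)
fallback_only = SimpleNamespace(moe_intermediate_size=1536)

print(resolve_score_func(both_fields), resolve_expert_hidden_dim(both_fields))
print(resolve_score_func(fallback_only), resolve_expert_hidden_dim(fallback_only))
```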

EP/TP/DP/FSDP2 ride the standard MoE stack (MoEFSDPSyncMixin + components/moe). EP must divide num_experts = 128; the example uses ep_size: 8 (16 experts per rank on an 8x80GB node).
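As a quick check of the divisibility constraint, the following sketch computes the experts-per-rank count for a given expert-parallel size. The function name `experts_per_rank` is hypothetical and not part of the PR.

```python
def experts_per_rank(num_experts: int, ep_size: int) -> int:
    """Return how many experts each EP rank holds, or raise if EP does not divide evenly."""
    if ep_size <= 0:
        raise ValueError("ep_size must be positive")
    if num_experts % ep_size != 0:
        raise ValueError(f"ep_size={ep_size} does not divide num_experts={num_experts}")
    return num_experts // ep_size


NUM_EXPERTS = 128  # from the Hy-MT2-30B-A3B config

# Every EP size that evenly divides the expert count.
valid_ep_sizes = [ep for ep in range(1, NUM_EXPERTS + 1) if NUM_EXPERTS % ep == 0]

print(experts_per_rank(NUM_EXPERTS, 8))
print(valid_ep_sizes)
```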

End-to-end verification

Ran the example YAML on 8xH100 with the real checkpoint. The runs below isolate (a) weight loading, (b) task fit, and (c) translation SFT, and rule out random-init drift.

Run 1 — initial sanity check on HellaSwag (5 steps)

step 0 | loss 2.7031 | grad_norm 12.4437
step 1 | loss 2.5050 | grad_norm  8.3947
step 2 | loss 2.4764 | grad_norm  6.5502
step 3 | loss 2.3765 | grad_norm  5.2730
step 4 | loss 2.3367 | grad_norm  4.9284

Trainable parameters: 30,064,719,872
Param L2 norm: 1920.0000      # random init would be ~3464 → weights loaded

The step-0 loss of 2.7 is far below the ln(120832) ≈ 11.7 random-init baseline, so the safetensors loaded correctly through HyMT2StateDictAdapter. The HellaSwag plateau around 1.85 simply reflects the mismatch between the model's domain (translation) and the task (English commonsense).
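The random-init baseline quoted here is the cross-entropy of a uniform distribution over the vocabulary. A short sketch of that arithmetic, using the vocabulary size and step-0 loss from this description (the function name is illustrative):

```python
import math

VOCAB_SIZE = 120832        # vocabulary size quoted in the PR description
OBSERVED_STEP0 = 2.7031    # step-0 loss from Run 1


def uniform_baseline_loss(vocab_size: int) -> float:
    """Cross-entropy (in nats) of a uniform guess over the vocabulary."""
    return math.log(vocab_size)


baseline = uniform_baseline_loss(VOCAB_SIZE)
print(f"random-init baseline ~ {baseline:.2f} nats")
print(f"observed step-0 loss below baseline: {OBSERVED_STEP0 < baseline}")
```

A step-0 loss near the baseline would suggest the weights were never loaded. A value several nats lower is what pre-trained weights should give.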

Run 2 — translation chat data (the actual training distribution), 33 steps

step  0 | loss 1.0167 | grad_norm 10.6121
step  5 | loss 0.7703 | grad_norm  3.4040
step 10 | loss 0.5952 | grad_norm  1.9153
step 20 | loss 0.5504 | grad_norm  1.5625
step 30 | loss 0.5251 | grad_norm  1.5569

A step-0 loss of ~1.0 (vs. 2.7 on HellaSwag) is consistent with a properly loaded, translation-specialized model evaluated on its own training distribution. The loss roughly halves within 30 steps (1.0167 → 0.5251), which confirms that end-to-end gradients, the EP all-reduce, the optimizer, and the lm_head fp32 path all work.
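The loss trend in Run 2 can be checked directly from the log lines. The sketch below parses the `step | loss | grad_norm` format shown in the log and computes the step-30 to step-0 ratio. The regular expression and helper name are written for this illustration only.

```python
import re

LOG = """\
step  0 | loss 1.0167 | grad_norm 10.6121
step  5 | loss 0.7703 | grad_norm  3.4040
step 10 | loss 0.5952 | grad_norm  1.9153
step 20 | loss 0.5504 | grad_norm  1.5625
step 30 | loss 0.5251 | grad_norm  1.5569
"""

PATTERN = re.compile(r"step\s+(\d+)\s*\|\s*loss\s+([\d.]+)\s*\|\s*grad_norm\s+([\d.]+)")


def parse_losses(log: str) -> dict[int, float]:
    """Map step number to loss for every matching log line."""
    return {int(m.group(1)): float(m.group(2)) for m in PATTERN.finditer(log)}


losses = parse_losses(LOG)
ratio = losses[30] / losses[0]
print(f"loss ratio step 30 / step 0 = {ratio:.3f}")
```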

Pre-checks

ruff format + ruff check clean on all new files
41 unit tests pass locally + 2 dispatcher tests pass; 30 CUDA-gated tests skipped on CPU
DCO sign-off on every commit
End-to-end smoke training on 8xH100 with the real tencent/Hy-MT2-30B-A3B checkpoint

Add a dedicated ``HyMT2ForCausalLM`` module under ``components/models/hy_mt2`` for tencent/Hy-MT2-30B-A3B (translation MoE, 30B total / 3B activated).

The on-disk checkpoint shares ``architectures: ["HYV3ForCausalLM"]`` and ``model_type: "hy_v3"`` with Tencent's older Hy3-preview, but the two models differ substantially in sizing (48 layers vs 80, 128 experts vs 192, GQA 32/4 vs 64/8, hidden=2048 vs 4096, rms_norm_eps=1e-5 vs 1e-6) and in three flags that the existing ``hy_v3`` module either hard-codes or does not handle: ``moe_router_use_sigmoid`` (made configurable here), ``enable_lm_head_fp32`` (in-model fp32 upcast fallback when the YAML does not set ``lm_head_precision``), and ``expert_hidden_dim`` (synonym of ``moe_intermediate_size`` preferred when both are present).

The new module is intentionally independent of ``components/models/hy_v3`` so the Hy3-preview recipes are unaffected. The example YAML at ``examples/llm_finetune/hy_mt2/hy_mt2_30b_a3b_sft.yaml`` instantiates the new class via a fully-qualified ``_target_`` instead of going through the NeMoAutoModel registry, avoiding the architecture-string collision.

Files:
  nemo_automodel/components/models/hy_mt2/{__init__,config,layers,model,state_dict_adapter}.py
  examples/llm_finetune/hy_mt2/hy_mt2_30b_a3b_sft.yaml
  tests/unit_tests/models/hy_mt2/test_hy_mt2_{config,layers,model,state_dict_adapter}.py

EP / TP / DP / FSDP2 wire up through the standard MoE stack (``MoEFSDPSyncMixin`` + ``components/moe``). EP must divide ``num_experts`` = 128; the example uses ``ep_size: 8`` (16 experts per rank) on an 8xH100 node.

Signed-off-by: khazic <khazzz1c@gmail.com>

Two refinements on top of a7d91c3:

1. **Config-shape dispatcher** in ``_transformers/model_init.py``: when ``architectures: ["HYV3ForCausalLM"]`` is paired with the Hy-MT2-30B-A3B config fingerprint (hidden=2048, 48 layers, 128 experts, expert_hidden_dim=768, ``enable_lm_head_fp32`` present), resolve to ``HyMT2ForCausalLM`` instead of the default ``HYV3ForCausalLM``. Hy3-preview (hidden=4096, 80 layers, 192 experts) still resolves to ``HYV3ForCausalLM``. Two tests in ``test_model_init.py`` lock this dispatch in.

2. **lm_head fp32 dtype fix** in ``HyMT2ForCausalLM.forward``: when ``enable_lm_head_fp32`` is on, the upcast path was calling ``self.lm_head(hidden.float())`` which would fail because ``lm_head.weight`` stays in bf16 after ``cast_model_to_dtype``. Replace with an explicit ``F.linear(hidden.float(), self.lm_head.weight.float(), bias.float() | None)`` so both operands are fp32; the result is cast back to the original dtype.

The example YAML now uses the fully-qualified ``HyMT2ForCausalLM`` target; combined with (1) it can also be loaded via ``NeMoAutoModelForCausalLM``, which gives users both an explicit and an auto-dispatch path.

Signed-off-by: khazic <khazzz1c@gmail.com>

The header comment showed ``automodel finetune llm -c <yaml> ...`` which is not the real CLI signature -- ``nemo_automodel/cli/app.py:76-81`` takes the YAML path as the first positional argument, so the previous form silently treated ``finetune`` as the config path and failed with FileNotFoundError on ``./finetune``. Update the comment to match the actual usage:

    automodel <config.yaml> --nproc-per-node 8

Signed-off-by: khazic <khazzz1c@gmail.com>

The in-model ``enable_lm_head_fp32`` path called ``F.linear`` directly with ``self.lm_head.weight.float()``. Under FSDP2 the lm_head weight is a DTensor, and ``F.linear`` does not handle DTensor redistribution -- the hidden state is a plain torch.Tensor, so the matmul crashes with::

    RuntimeError: aten.mm.default got mixed torch.Tensor and DTensor, need to convert all torch.Tensor to DTensor before calling distributed operators!

Drop the explicit ``F.linear`` and rely on ``self.lm_head(...)`` instead; ``nn.Linear.forward`` is DTensor-aware and will redistribute the input as needed.

To avoid the original dtype-mismatch motivation for the manual upcast (fp32 input vs. bf16 weight), only upcast when ``lm_head.weight`` has already been promoted to fp32 -- which is exactly what the YAML's ``distributed.moe.lm_head_precision: float32`` path does via the MoE parallelizer's ``MixedPrecisionPolicy``. If the weight is still in the model dtype, fall through to the standard ``self.lm_head(hidden)`` path.

Also drop the now-unused ``torch.nn.functional`` import and update the unit tests to validate the new condition (weight promoted -> upcast runs; weight not promoted -> fall through).

Signed-off-by: khazic <khazzz1c@gmail.com>
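The gating condition described in this commit can be illustrated with a plain ``nn.Linear`` on CPU. This is a sketch only: it does not involve FSDP2, DTensor, or ``MixedPrecisionPolicy``, the function name ``project_logits`` is hypothetical, and the logits are left in fp32 because the commit message does not say whether they are cast back.

```python
import torch
from torch import nn


def project_logits(lm_head: nn.Linear, hidden: torch.Tensor, enable_lm_head_fp32: bool) -> torch.Tensor:
    """Apply lm_head, upcasting the input only when the weight is already fp32."""
    if enable_lm_head_fp32 and lm_head.weight.dtype == torch.float32:
        # Weight was promoted to fp32, so an fp32 input matches it and
        # the projection runs in full precision.
        return lm_head(hidden.float())
    # Weight is still in the model dtype: take the standard path.
    return lm_head(hidden)


hidden = torch.randn(2, 8, dtype=torch.bfloat16)
fp32_head = nn.Linear(8, 16, bias=False)                      # weight in fp32
bf16_head = nn.Linear(8, 16, bias=False).to(torch.bfloat16)   # weight in bf16

logits_fp32 = project_logits(fp32_head, hidden, enable_lm_head_fp32=True)
logits_bf16 = project_logits(bf16_head, hidden, enable_lm_head_fp32=True)
print(logits_fp32.dtype, logits_bf16.dtype)
```

Calling the module rather than ``F.linear`` is the point of the commit. The dtype check simply keeps the upcast from creating an fp32-input / bf16-weight mismatch.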

The fully-qualified ``_target_: HyMT2ForCausalLM.from_pretrained`` path bypasses ``_transformers/model_init.py``, which is where the HF safetensors loader actually runs. Our class method only invokes ``AutoConfig.from_pretrained`` and ``cls.from_config(...)`` -- the returned model has the right architecture but random weights, so SFT starts at ``loss ~= ln(vocab) = 11.7`` instead of starting from the pre-trained weights.

Switch the YAML back to ``NeMoAutoModelForCausalLM.from_pretrained``. The config-shape dispatcher added in a21e014 will still route this to ``HyMT2ForCausalLM`` (hidden=2048 + 48 layers + 128 experts + ``enable_lm_head_fp32``), and the standard NeMoAutoModel loader pipeline will then stream the safetensors through ``HyMT2StateDictAdapter`` into the FSDP2 / EP-sharded parameters.

Signed-off-by: khazic <khazzz1c@gmail.com>

copy-pr-bot · 2026-05-27T06:21:17Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

HuiyingLi · 2026-05-27T06:23:30Z

/ok to test c9945eb

Moves the ``_is_hy_mt2_config`` fingerprint predicate from ``_transformers/model_init.py`` into a new ``components/models/hy_mt2/dispatch.py`` module, and migrates the matching dispatcher tests from ``tests/.../_transformers/test_model_init.py`` to ``tests/.../models/hy_mt2/test_dispatch.py``.

The auto-resolver in model_init.py now keeps only a 4-line shim that imports ``is_hy_mt2_config`` from the model package when the architecture name matches HYV3ForCausalLM, so Hy-MT2-specific knowledge (which hidden_size, layer count, expert count, etc. identify the checkpoint) lives entirely inside ``components/models/hy_mt2/`` rather than leaking into shared code.

No behavior change: same fingerprint fields, same dispatch outcome, the existing Hy3-preview path is untouched.

Signed-off-by: khazic <khazzz1c@gmail.com>

HuiyingLi · 2026-05-27T11:34:15Z

/ok to test 4e6e6e8

jgerh

Completed a tech pubs review of .md and .mdx files and added comments to maintain alignment with our style and writing guidelines. Also, you might want to consider adding Hy-MT2-30B-A3B (PR #2320) to the AutoModel README news section. For example, [MM/DD/2026] Hy-MT2-30B-A3B — We now support finetuning [model-id/Hy-MT2-30B-A3B]. Check out our recipe.

Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Huiying <willwin.lee@gmail.com>

HuiyingLi · 2026-05-27T20:25:40Z

/ok to test d8627c6

HuiyingLi · 2026-05-27T20:35:43Z

/ok to test 460dab9

HuiyingLi · 2026-05-27T21:25:53Z

/ok to test 2a3f6f9

``test_enable_lm_head_fp32_default_false_without_config_flag`` constructs ``HyMT2ForCausalLM(_Cfg(), ...)`` with a bare mock class to verify that the flag defaults to ``False`` when the config does not declare it. The bare mock skips ``PretrainedConfig.__init__``, which is what normally synthesizes ``rope_parameters`` from ``rope_theta``. As a result, ``get_rope_config`` (called during model construction) raised ``AttributeError: '_Cfg' object has no attribute 'rope_parameters'`` on GPU CI.

Add the field to the mock with the same shape ``PretrainedConfig`` would produce. The CPU test suite cannot trigger this (the whole ``TestHyMT2ForCausalLM`` class is CUDA-gated), so the regression was only visible on the L0_Unit_Tests_GPU job.

Signed-off-by: khazic <khazzz1c@gmail.com>
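The failure mode described in this commit is a general one: a bare mock that skips a base class's ``__init__`` never receives the attributes that ``__init__`` derives. The sketch below reproduces it with stand-ins. ``BaseConfig``, ``get_rope_theta``, and the dict shape of ``rope_parameters`` are all assumptions for illustration, not the real ``PretrainedConfig`` or ``get_rope_config``.

```python
class BaseConfig:
    """Stand-in for a config base class whose __init__ derives extra fields."""

    def __init__(self, rope_theta: float = 10000.0):
        self.rope_theta = rope_theta
        # Derived field: only exists when __init__ actually runs.
        self.rope_parameters = {"rope_type": "default", "rope_theta": rope_theta}


def get_rope_theta(cfg) -> float:
    """Hypothetical consumer that reads the derived field during model construction."""
    return cfg.rope_parameters["rope_theta"]


class BareMock:
    """Bare mock: declares the raw field but never runs BaseConfig.__init__."""

    rope_theta = 10000.0


class FixedMock:
    """Mock with the derived field added by hand, in the assumed shape."""

    rope_theta = 10000.0
    rope_parameters = {"rope_type": "default", "rope_theta": 10000.0}


try:
    get_rope_theta(BareMock())
except AttributeError as err:
    print(f"bare mock fails: {err}")

print(get_rope_theta(FixedMock()))
print(get_rope_theta(BaseConfig()))
```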

HuiyingLi · 2026-05-28T06:17:04Z

/ok to test 7b9e32c

khazic added 5 commits May 27, 2026 11:11

khazic requested review from HuiyingLi, ZhiyuLi-Nvidia, adil-a, akoumpa, athitten, hemildesai, pthombre and zyzhou5 as code owners May 27, 2026 06:21

github-actions Bot added the community-request label May 27, 2026

HuiyingLi previously approved these changes May 27, 2026

jgerh reviewed May 27, 2026

HuiyingLi added 10 commits May 27, 2026 13:16

Update fern/versions/nightly/pages/model-coverage/llm/tencent/hy-mt2.mdx

ef92449

Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Huiying <willwin.lee@gmail.com>

Update fern/versions/nightly/pages/model-coverage/llm/tencent/hy-mt2.mdx

ef897a5

Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Huiying <willwin.lee@gmail.com>

Update fern/versions/nightly/pages/model-coverage/llm/tencent/hy-mt2.mdx

4b2e090

Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Huiying <willwin.lee@gmail.com>

Update docs/model-coverage/llm/index.md

f7576b1

Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Huiying <willwin.lee@gmail.com>

Update docs/model-coverage/llm/index.md

8990d20

Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Huiying <willwin.lee@gmail.com>

Update docs/model-coverage/llm/index.md

c1a29dc

Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Huiying <willwin.lee@gmail.com>

Update docs/model-coverage/llm/index.md

521fe22

Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Huiying <willwin.lee@gmail.com>

Update docs/model-coverage/llm/tencent/hy-mt2.md

83b053b

Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Huiying <willwin.lee@gmail.com>

Update docs/model-coverage/llm/tencent/hy-mt2.md

bb41ada

Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Huiying <willwin.lee@gmail.com>

Update docs/model-coverage/llm/tencent/hy-mt2.md

460dab9

Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Huiying <willwin.lee@gmail.com>

Merge branch 'main' into khazic/feat/hy_mt2_30b_a3b_support

2a3f6f9

HuiyingLi previously approved these changes May 27, 2026

HuiyingLi approved these changes May 28, 2026


4 participants
