feat(models): add Hy-MT2-30B-A3B SFT support #2320

Merged
HuiyingLi merged 21 commits into NVIDIA-NeMo:main from khazic:khazic/feat/hy_mt2_30b_a3b_support
May 28, 2026
@khazic khazic commented May 27, 2026

What

Add full-finetuning support for tencent/Hy-MT2-30B-A3B, Tencent's translation MoE (30B total / 3B activated). The on-disk checkpoint shares architectures: ["HYV3ForCausalLM"] and model_type: "hy_v3" with the older Hy3-preview (295B), but the two models differ substantially:

| Field | Hy3-preview (existing) | Hy-MT2-30B-A3B (new) |
| --- | --- | --- |
| layers | 80 | 48 |
| experts | 192 | 128 |
| GQA heads | 64 / 8 | 32 / 4 |
| hidden_size | 4096 | 2048 |
| dense intermediate | 1536 | 6912 |
| MoE intermediate | 1536 | 768 |
| enable_lm_head_fp32 | not declared | true |
| moe_router_use_sigmoid | hard-coded sigmoid | declared in config |
| expert_hidden_dim | not declared | 768 |

A new dedicated HyMT2ForCausalLM module under components/models/hy_mt2 is added rather than overloading HYV3ForCausalLM, so the existing Hy3-preview support is untouched. Dispatch between the two custom impls is done by a config-shape detector in _transformers/model_init.py.
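To make the config-shape dispatch concrete, here is a minimal sketch of what such a detector could look like. It is not the code from this PR: the function names `looks_like_hy_mt2` and `resolve_class_name` are hypothetical, the attribute name `num_hidden_layers` is an assumption about how the layer count is spelled in the config, and the fingerprint values are taken from the comparison table above.

```python
from types import SimpleNamespace

# Fingerprint values come from the comparison table above. The attribute
# names (for example num_hidden_layers) are assumptions, not copied from the PR.
_HY_MT2_FINGERPRINT = {
    "hidden_size": 2048,
    "num_hidden_layers": 48,
    "num_experts": 128,
    "expert_hidden_dim": 768,
}


def looks_like_hy_mt2(config) -> bool:
    """Return True when the config matches the Hy-MT2-30B-A3B shape."""
    # Hy3-preview does not declare enable_lm_head_fp32 at all.
    if not hasattr(config, "enable_lm_head_fp32"):
        return False
    return all(
        getattr(config, name, None) == value
        for name, value in _HY_MT2_FINGERPRINT.items()
    )


def resolve_class_name(config) -> str:
    """Pick an implementation name for a checkpoint that declares HYV3ForCausalLM."""
    architectures = getattr(config, "architectures", None) or []
    if "HYV3ForCausalLM" not in architectures:
        raise ValueError(f"unsupported architectures: {architectures}")
    return "HyMT2ForCausalLM" if looks_like_hy_mt2(config) else "HYV3ForCausalLM"


# Stand-in configs; real ones would come from the checkpoint's config.json.
hy_mt2 = SimpleNamespace(
    architectures=["HYV3ForCausalLM"],
    hidden_size=2048,
    num_hidden_layers=48,
    num_experts=128,
    expert_hidden_dim=768,
    enable_lm_head_fp32=True,
)
hy3_preview = SimpleNamespace(
    architectures=["HYV3ForCausalLM"],
    hidden_size=4096,
    num_hidden_layers=80,
    num_experts=192,
)

print(resolve_class_name(hy_mt2))
print(resolve_class_name(hy3_preview))
```

Matching on several fields at once, rather than on a single one, keeps an unrelated `hy_v3` checkpoint that happens to share one dimension from being routed to the wrong class.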

Changelog

  • nemo_automodel/components/models/hy_mt2/ — new module (config, layers, model, state_dict_adapter, init)
  • nemo_automodel/_transformers/model_init.py — add _is_hy_mt2_config dispatcher that routes Hy-MT2 to HyMT2ForCausalLM while keeping Hy3-preview on HYV3ForCausalLM
  • examples/llm_finetune/hy_mt2/hy_mt2_30b_a3b_sft.yaml — SFT example, FSDP2 + EP8 + activation checkpointing + lm_head_precision: float32
  • tests/unit_tests/models/hy_mt2/ — unit tests (config, layers, model, state_dict_adapter) — 41 pass / 30 CUDA-gated
  • tests/unit_tests/_transformers/test_model_init.py — 2 dispatcher tests (Hy-MT2 fingerprint → HyMT2; Hy3-preview fingerprint → HYV3)

Key implementation details vs. components/models/hy_v3:

  • moe_router_use_sigmoid flag drives score_func instead of being hard-coded
  • enable_lm_head_fp32: the forward pass upcasts the hidden state to fp32 before lm_head, but only when lm_head.weight.dtype == fp32, so it cooperates with the MoE parallelizer's MixedPrecisionPolicy (set via YAML distributed.moe.lm_head_precision: float32). It calls self.lm_head(hidden.float()) rather than F.linear to stay DTensor-aware under FSDP2.
  • Prefers expert_hidden_dim over moe_intermediate_size for the expert MLP hidden dim (matches HF reference)
  • qk_norm configurable via config.qk_norm
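The lm_head gating described in the list above can be sketched on plain tensors as follows. This is an illustration only: `TinyLMHead` is a hypothetical toy module, not the `HyMT2ForCausalLM` class, and it runs without FSDP2 or DTensor, so it only shows the dtype condition and not the sharding behavior.

```python
import torch
from torch import nn


class TinyLMHead(nn.Module):
    """Toy stand-in for the final projection; not the NeMo AutoModel class."""

    def __init__(self, hidden_size: int, vocab_size: int, enable_lm_head_fp32: bool):
        super().__init__()
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
        self.enable_lm_head_fp32 = enable_lm_head_fp32

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Upcast only when the weight has already been promoted to fp32.
        # Feeding an fp32 input to a bf16 weight would be a dtype mismatch.
        if self.enable_lm_head_fp32 and self.lm_head.weight.dtype == torch.float32:
            return self.lm_head(hidden.float())
        # Weight still in the model dtype: fall through to the standard path.
        return self.lm_head(hidden)


hidden = torch.randn(2, 8, dtype=torch.bfloat16)

promoted = TinyLMHead(8, 16, enable_lm_head_fp32=True)  # weight is fp32 by default
print(promoted(hidden).dtype)

not_promoted = TinyLMHead(8, 16, enable_lm_head_fp32=True).to(torch.bfloat16)
print(not_promoted(hidden).dtype)
```

Going through `self.lm_head(...)` instead of calling `F.linear` on the raw weight is what lets the module's own forward handle a sharded weight, which is the motivation stated in the list above.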

EP/TP/DP/FSDP2 ride the standard MoE stack (MoEFSDPSyncMixin + components/moe). EP must divide num_experts = 128; the example uses ep_size: 8 (16 experts per rank on an 8x80GB node).
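The divisibility constraint is simple arithmetic; the sketch below shows it with a hypothetical helper, `experts_per_rank`, which is not part of the PR.

```python
def experts_per_rank(num_experts: int, ep_size: int) -> int:
    """Return how many experts each expert-parallel rank holds."""
    if ep_size <= 0:
        raise ValueError("ep_size must be positive")
    if num_experts % ep_size != 0:
        raise ValueError(
            f"ep_size={ep_size} does not divide num_experts={num_experts}"
        )
    return num_experts // ep_size


# Values from the example: 128 experts spread over ep_size 8.
print(experts_per_rank(128, 8))  # 16
```

An `ep_size` such as 6 would raise here, because 128 is not a multiple of 6.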

End-to-end verification

Ran the example YAML on 8xH100 with the real checkpoint. The runs below isolate (a) weight loading, (b) task fit, and (c) translation SFT, and rule out random-init drift.

Run 1 — initial sanity check on HellaSwag (5 steps)

step 0 | loss 2.7031 | grad_norm 12.4437
step 1 | loss 2.5050 | grad_norm  8.3947
step 2 | loss 2.4764 | grad_norm  6.5502
step 3 | loss 2.3765 | grad_norm  5.2730
step 4 | loss 2.3367 | grad_norm  4.9284

Trainable parameters: 30,064,719,872
Param L2 norm: 1920.0000      # random init would be ~3464 → weights loaded

The step-0 loss of 2.7 is far below the ln(120832) ≈ 11.7 random-init baseline, so the safetensors loaded correctly through HyMT2StateDictAdapter. The HellaSwag plateau around 1.85 simply reflects the mismatch between the model's domain (translation) and the task (English commonsense).
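The random-init baseline is the cross-entropy of a uniform prediction over the vocabulary, which is ln(vocab_size). A short check, using the vocabulary size 120832 quoted above and the step-0 loss from Run 1 (the helper name `uniform_cross_entropy` is only for illustration):

```python
import math

VOCAB_SIZE = 120832  # vocabulary size quoted in the baseline above


def uniform_cross_entropy(vocab_size: int) -> float:
    """Loss of a model that assigns equal probability to every token."""
    return math.log(vocab_size)


baseline = uniform_cross_entropy(VOCAB_SIZE)
observed_step0 = 2.7031  # step-0 loss from Run 1

print(f"baseline={baseline:.2f} observed={observed_step0:.2f}")
print("weights look loaded" if observed_step0 < baseline else "looks like random init")
```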

Run 2 — translation chat data (the actual training distribution), 33 steps

step  0 | loss 1.0167 | grad_norm 10.6121
step  5 | loss 0.7703 | grad_norm  3.4040
step 10 | loss 0.5952 | grad_norm  1.9153
step 20 | loss 0.5504 | grad_norm  1.5625
step 30 | loss 0.5251 | grad_norm  1.5569

A step-0 loss of ~1.0 (vs. 2.7 on HellaSwag) is what a properly loaded, translation-specialized model should produce on its own training distribution. The loss roughly halves within 30 steps, which confirms that end-to-end gradients, the EP all-reduce, the optimizer, and the lm_head fp32 path all work.

Pre-checks

  • ruff format + ruff check clean on all new files
  • 41 unit tests pass locally + 2 dispatcher tests pass; 30 CUDA-gated tests skipped on CPU
  • DCO sign-off on every commit
  • End-to-end smoke training on 8xH100 with the real tencent/Hy-MT2-30B-A3B checkpoint

khazic added 5 commits May 27, 2026 11:11
Add a dedicated ``HyMT2ForCausalLM`` module under
``components/models/hy_mt2`` for tencent/Hy-MT2-30B-A3B (translation MoE,
30B total / 3B activated). The on-disk checkpoint shares
``architectures: ["HYV3ForCausalLM"]`` and ``model_type: "hy_v3"`` with
Tencent's older Hy3-preview, but the two models differ substantially in
sizing (48 layers vs 80, 128 experts vs 192, GQA 32/4 vs 64/8,
hidden=2048 vs 4096, rms_norm_eps=1e-5 vs 1e-6) and in three flags that
the existing ``hy_v3`` module either hard-codes or does not handle:
``moe_router_use_sigmoid`` (made configurable here),
``enable_lm_head_fp32`` (in-model fp32 upcast fallback when the YAML
does not set ``lm_head_precision``), and ``expert_hidden_dim`` (synonym
of ``moe_intermediate_size`` preferred when both are present).

The new module is intentionally independent of ``components/models/hy_v3``
so the Hy3-preview recipes are unaffected. The example YAML at
``examples/llm_finetune/hy_mt2/hy_mt2_30b_a3b_sft.yaml`` instantiates the
new class via a fully-qualified ``_target_`` instead of going through the
NeMoAutoModel registry, avoiding the architecture-string collision.

Files:
  nemo_automodel/components/models/hy_mt2/{__init__,config,layers,model,state_dict_adapter}.py
  examples/llm_finetune/hy_mt2/hy_mt2_30b_a3b_sft.yaml
  tests/unit_tests/models/hy_mt2/test_hy_mt2_{config,layers,model,state_dict_adapter}.py

EP / TP / DP / FSDP2 wire up through the standard MoE stack
(``MoEFSDPSyncMixin`` + ``components/moe``). EP must divide ``num_experts``
= 128; the example uses ``ep_size: 8`` (16 experts per rank) on an
8xH100 node.

Signed-off-by: khazic <khazzz1c@gmail.com>
Two refinements on top of a7d91c3:

1. **Config-shape dispatcher** in ``_transformers/model_init.py``: when
   ``architectures: ["HYV3ForCausalLM"]`` is paired with the Hy-MT2-30B-A3B
   config fingerprint (hidden=2048, 48 layers, 128 experts,
   expert_hidden_dim=768, ``enable_lm_head_fp32`` present), resolve to
   ``HyMT2ForCausalLM`` instead of the default ``HYV3ForCausalLM``. Hy3-preview
   (hidden=4096, 80 layers, 192 experts) still resolves to ``HYV3ForCausalLM``.
   Two tests in ``test_model_init.py`` lock this dispatch in.

2. **lm_head fp32 dtype fix** in ``HyMT2ForCausalLM.forward``: when
   ``enable_lm_head_fp32`` is on, the upcast path was calling
   ``self.lm_head(hidden.float())`` which would fail because ``lm_head.weight``
   stays in bf16 after ``cast_model_to_dtype``. Replace with an explicit
   ``F.linear(hidden.float(), self.lm_head.weight.float(), bias.float() | None)``
   so both operands are fp32; the result is cast back to the original dtype.

The example YAML now uses the fully-qualified ``HyMT2ForCausalLM`` target;
combined with (1) it can also be loaded via ``NeMoAutoModelForCausalLM``,
which gives users both an explicit and an auto-dispatch path.

Signed-off-by: khazic <khazzz1c@gmail.com>
The header comment showed ``automodel finetune llm -c <yaml> ...`` which is
not the real CLI signature -- ``nemo_automodel/cli/app.py:76-81`` takes the
YAML path as the first positional argument, so the previous form silently
treated ``finetune`` as the config path and failed with FileNotFoundError
on ``./finetune``. Update the comment to match the actual usage:

    automodel <config.yaml> --nproc-per-node 8

Signed-off-by: khazic <khazzz1c@gmail.com>
The in-model ``enable_lm_head_fp32`` path called ``F.linear`` directly with
``self.lm_head.weight.float()``. Under FSDP2 the lm_head weight is a
DTensor, and ``F.linear`` does not handle DTensor redistribution -- the
hidden state is a plain torch.Tensor, so the matmul crashes with::

    RuntimeError: aten.mm.default got mixed torch.Tensor and DTensor,
    need to convert all torch.Tensor to DTensor before calling
    distributed operators!

Drop the explicit ``F.linear`` and rely on ``self.lm_head(...)`` instead;
``nn.Linear.forward`` is DTensor-aware and will redistribute the input as
needed. To avoid the original dtype-mismatch motivation for the manual
upcast (fp32 input vs. bf16 weight), only upcast when ``lm_head.weight``
has already been promoted to fp32 -- which is exactly what the YAML's
``distributed.moe.lm_head_precision: float32`` path does via the MoE
parallelizer's ``MixedPrecisionPolicy``. If the weight is still in the
model dtype, fall through to the standard ``self.lm_head(hidden)`` path.

Also drop the now-unused ``torch.nn.functional`` import and update the
unit tests to validate the new condition (weight promoted -> upcast
runs; weight not promoted -> fall through).

Signed-off-by: khazic <khazzz1c@gmail.com>
The fully-qualified ``_target_: HyMT2ForCausalLM.from_pretrained`` path
bypasses ``_transformers/model_init.py``, which is where the HF
safetensors loader actually runs. Our class method only invokes
``AutoConfig.from_pretrained`` and ``cls.from_config(...)`` -- the
returned model has the right architecture but random weights, so SFT
starts at ``loss ~= ln(vocab) = 11.7`` instead of starting from the
pre-trained weights.

Switch the YAML back to ``NeMoAutoModelForCausalLM.from_pretrained``.
The config-shape dispatcher added in a21e014 will still route this to
``HyMT2ForCausalLM`` (hidden=2048 + 48 layers + 128 experts +
``enable_lm_head_fp32``), and the standard NeMoAutoModel loader pipeline
will then stream the safetensors through ``HyMT2StateDictAdapter`` into
the FSDP2 / EP-sharded parameters.

Signed-off-by: khazic <khazzz1c@gmail.com>
@copy-pr-bot

copy-pr-bot Bot commented May 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@HuiyingLi

/ok to test c9945eb

Moves the ``_is_hy_mt2_config`` fingerprint predicate from
``_transformers/model_init.py`` into a new
``components/models/hy_mt2/dispatch.py`` module, and migrates the
matching dispatcher tests from ``tests/.../_transformers/test_model_init.py``
to ``tests/.../models/hy_mt2/test_dispatch.py``.

The auto-resolver in model_init.py now keeps only a 4-line shim that
imports ``is_hy_mt2_config`` from the model package when the architecture
name matches HYV3ForCausalLM, so Hy-MT2-specific knowledge (which
hidden_size, layer count, expert count, etc. identify the checkpoint)
lives entirely inside ``components/models/hy_mt2/`` rather than leaking
into shared code.

No behavior change: same fingerprint fields, same dispatch outcome, the
existing Hy3-preview path is untouched.

Signed-off-by: khazic <khazzz1c@gmail.com>
@HuiyingLi

/ok to test 4e6e6e8

HuiyingLi previously approved these changes May 27, 2026
@jgerh jgerh left a comment

Completed a tech pubs review of .md and .mdx files and added comments to maintain alignment with our style and writing guidelines. Also, you might want to consider adding Hy-MT2-30B-A3B (PR #2320) to the AutoModel README news section. For example, [MM/DD/2026] Hy-MT2-30B-A3B — We now support finetuning [model-id/Hy-MT2-30B-A3B]. Check out our recipe.

Comment thread fern/versions/nightly/pages/model-coverage/llm/tencent/hy-mt2.mdx Outdated
Comment thread fern/versions/nightly/pages/model-coverage/llm/tencent/hy-mt2.mdx Outdated
Comment thread fern/versions/nightly/pages/model-coverage/llm/tencent/hy-mt2.mdx Outdated
Comment thread docs/model-coverage/llm/index.md Outdated
Comment thread docs/model-coverage/llm/index.md Outdated
Comment thread docs/model-coverage/llm/index.md Outdated
Comment thread docs/model-coverage/llm/index.md Outdated
Comment thread docs/model-coverage/llm/tencent/hy-mt2.md Outdated
Comment thread docs/model-coverage/llm/tencent/hy-mt2.md Outdated
Comment thread docs/model-coverage/llm/tencent/hy-mt2.md Outdated
HuiyingLi added 10 commits May 27, 2026 13:16
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>

Signed-off-by: Huiying <willwin.lee@gmail.com>
@HuiyingLi

/ok to test d8627c6

@HuiyingLi

/ok to test 460dab9

@HuiyingLi

/ok to test 2a3f6f9

HuiyingLi previously approved these changes May 27, 2026
`test_enable_lm_head_fp32_default_false_without_config_flag` constructs
`HyMT2ForCausalLM(_Cfg(), ...)` with a bare mock class to verify that
the flag defaults to ``False`` when the config does not declare it. The
bare mock skips ``PretrainedConfig.__init__``, which is what normally
synthesizes ``rope_parameters`` from ``rope_theta``. As a result,
``get_rope_config`` (called during model construction) raised
``AttributeError: '_Cfg' object has no attribute 'rope_parameters'``
on GPU CI.

Add the field to the mock with the same shape ``PretrainedConfig``
would produce. The CPU test suite cannot trigger this (the whole
``TestHyMT2ForCausalLM`` class is CUDA-gated), so the regression was
only visible on the L0_Unit_Tests_GPU job.
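
The failure mode is easy to reproduce without any model code. In the sketch below, `read_rope_theta` is a hypothetical stand-in for `get_rope_config`, and the dictionary shape given to `rope_parameters` is an assumption that should be checked against the installed transformers version.

```python
class BareCfg:
    """Bare mock: PretrainedConfig.__init__ never runs, so no rope_parameters."""

    rope_theta = 10000.0


class PatchedCfg(BareCfg):
    """Same mock with the missing field added by hand."""

    # Assumed shape of the synthesized field; verify it against the real config.
    rope_parameters = {"rope_type": "default", "rope_theta": 10000.0}


def read_rope_theta(config) -> float:
    """Hypothetical stand-in for get_rope_config: reads config.rope_parameters."""
    return config.rope_parameters["rope_theta"]


try:
    read_rope_theta(BareCfg())
except AttributeError as exc:
    print(f"bare mock fails: {exc}")

print(read_rope_theta(PatchedCfg()))
```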

Signed-off-by: khazic <khazzz1c@gmail.com>
@HuiyingLi

/ok to test 7b9e32c
