Merged
21 commits
a7d91c3
feat(models): add Hy-MT2-30B-A3B SFT support
khazic May 27, 2026
a21e014
feat(models): add Hy-MT2 config-shape dispatcher and fp32 lm_head fix
khazic May 27, 2026
97492ff
docs(examples): fix Hy-MT2 launch command in YAML comment
khazic May 27, 2026
3e8c3c0
fix(models): use nn.Linear (DTensor-aware) on Hy-MT2 lm_head fp32 path
khazic May 27, 2026
c9945eb
fix(examples): route Hy-MT2 example via NeMoAutoModel for weight loading
khazic May 27, 2026
4965a99
fix(transformers): register HyMT2ForCausalLM in MODEL_ARCH_MAPPING
khazic May 27, 2026
1234d1b
docs(model-coverage): add Hy-MT2 model card
khazic May 27, 2026
5772311
docs(model-coverage): wire Hy-MT2 into the LLM toctree
khazic May 27, 2026
4e6e6e8
refactor(models): concentrate Hy-MT2 dispatch logic inside hy_mt2 module
khazic May 27, 2026
ef92449
Update fern/versions/nightly/pages/model-coverage/llm/tencent/hy-mt2.mdx
HuiyingLi May 27, 2026
ef897a5
Update fern/versions/nightly/pages/model-coverage/llm/tencent/hy-mt2.mdx
HuiyingLi May 27, 2026
4b2e090
Update fern/versions/nightly/pages/model-coverage/llm/tencent/hy-mt2.mdx
HuiyingLi May 27, 2026
f7576b1
Update docs/model-coverage/llm/index.md
HuiyingLi May 27, 2026
8990d20
Update docs/model-coverage/llm/index.md
HuiyingLi May 27, 2026
c1a29dc
Update docs/model-coverage/llm/index.md
HuiyingLi May 27, 2026
521fe22
Update docs/model-coverage/llm/index.md
HuiyingLi May 27, 2026
83b053b
Update docs/model-coverage/llm/tencent/hy-mt2.md
HuiyingLi May 27, 2026
bb41ada
Update docs/model-coverage/llm/tencent/hy-mt2.md
HuiyingLi May 27, 2026
460dab9
Update docs/model-coverage/llm/tencent/hy-mt2.md
HuiyingLi May 27, 2026
2a3f6f9
Merge branch 'main' into khazic/feat/hy_mt2_30b_a3b_support
HuiyingLi May 27, 2026
7b9e32c
fix(tests): declare rope_parameters on minimal _Cfg mock in hy_mt2 test
khazic May 28, 2026
10 changes: 6 additions & 4 deletions docs/model-coverage/llm/index.md
@@ -11,7 +11,7 @@ To run LLMs with NeMo AutoModel, make sure you're using NeMo container version [
pip3 install --upgrade git+git@github.com:NVIDIA-NeMo/AutoModel.git
```

-For other installation options (e.g., uv), see the [NeMo AutoModel Installation Guide](../../guides/installation.md).
+For other installation options (for example, uv), refer to the [NeMo AutoModel Installation Guide](../../guides/installation.md).

## Supported Models

@@ -73,20 +73,21 @@ NeMo AutoModel supports the [AutoModelForCausalLM](https://huggingface.co/transf
| Stepfun AI | [Step-3.5](stepfun-ai/step-3-5.md) | `Step3p5ForCausalLM` |
| Parasail AI | [GritLM](parasail-ai/gritlm.md) | `GritLM` |
| Tencent | [Hy3-preview](tencent/hy3.md) | `HYV3ForCausalLM` |
+| Tencent | [Hy-MT2](tencent/hy-mt2.md) | `HyMT2ForCausalLM` |
| Xiaomi MiMo | [MiMo-V2-Flash](xiaomimimo/mimo-v2-flash.md) | `MiMoV2FlashForCausalLM` |
| inclusionAI | [Ling 2.0](inclusionai/ling-2.md) | `BailingMoeV2ForCausalLM` |

-## Fine-Tuning LLMs with NeMo AutoModel
+## Fine-Tune LLMs with NeMo AutoModel

The models listed above can be fine-tuned using NeMo AutoModel. We support two primary fine-tuning approaches:

1. **Parameter-Efficient Fine-Tuning (PEFT)**: Updates only a small subset of parameters (typically <1%) using techniques like Low-Rank Adaptation (LoRA).
2. **Supervised Fine-Tuning (SFT)**: Updates all or most model parameters for deeper adaptation.

-See the [Fine-Tuning Guide](../../guides/llm/finetune.md) to learn how to apply both methods to your data.
+Refer to the [Fine-Tuning Guide](../../guides/llm/finetune.md) to learn how to apply both methods to your data.

:::{tip}
-In these guides, we use the `SQuAD v1.1` dataset for demonstration purposes, but you can use your own data. Update the recipe YAML `dataset` / `validation_dataset` sections accordingly. See [LLM datasets](../../guides/llm/dataset.md) and [dataset overview](../../guides/dataset-overview.md).
+In these guides, the examples use the `SQuAD v1.1` dataset for demonstration purposes, but you can use your own data. Update the recipe YAML `dataset` / `validation_dataset` sections accordingly. Refer to [LLM datasets](../../guides/llm/dataset.md) and [dataset overview](../../guides/dataset-overview.md).
:::

```{toctree}
@@ -146,6 +147,7 @@ stabilityai/stablelm
stepfun-ai/step-3-5
parasail-ai/gritlm
tencent/hy3
+tencent/hy-mt2
xiaomimimo/mimo-v2-flash
inclusionai/ling-2
```
63 changes: 63 additions & 0 deletions docs/model-coverage/llm/tencent/hy-mt2.md
@@ -0,0 +1,63 @@
# Hy-MT2 (Hunyuan-MT2)

[Hy-MT2-30B-A3B](https://huggingface.co/tencent/Hy-MT2-30B-A3B) is Tencent's translation Mixture-of-Experts language model with 30B total parameters and 3B activated per token. It features 48 transformer layers (layer 0 dense, layers 1–47 MoE), 128 routed experts plus 1 shared expert with top-8 sigmoid routing, Grouped Query Attention (32 Q / 4 KV heads), per-head QK RMSNorm, RoPE, and an in-forward fp32 upcast on the language-model head (`enable_lm_head_fp32`). It supports a 256K context window.

:::{card}
| | |
|---|---|
| **Task** | Text Generation (MoE, translation) |
| **Architecture** | `HyMT2ForCausalLM` |
| **Parameters** | 30B total / 3B activated |
| **HF Org** | [tencent](https://huggingface.co/tencent) |
:::

## Available Models

- **Hy-MT2-30B-A3B**: 30B total, top-8 routed experts (out of 128) activated per token, plus 1 shared expert

## Architectures

- `HyMT2ForCausalLM`

## Example HF Models

| Model | HF ID |
|---|---|
| Hy-MT2-30B-A3B | [`tencent/Hy-MT2-30B-A3B`](https://huggingface.co/tencent/Hy-MT2-30B-A3B) |

## Example Recipes

| Recipe | Description |
|---|---|
| {download}`hy_mt2_30b_a3b_sft.yaml <../../../../examples/llm_finetune/hy_mt2/hy_mt2_30b_a3b_sft.yaml>` | SFT — Hy-MT2-30B-A3B with FSDP2 + EP8 + fp32 LM head |

## Try with NeMo AutoModel

**1. Install** ([NeMo AutoModel](../../../guides/installation.md)):

```bash
pip install nemo-automodel
```

**2. Clone the repo** to get the example recipes:

```bash
git clone https://github.com/NVIDIA-NeMo/Automodel.git
cd Automodel
```

**3. Run the recipe** from inside the repo:

```bash
automodel --nproc-per-node=8 examples/llm_finetune/hy_mt2/hy_mt2_30b_a3b_sft.yaml
```

Refer to the [NeMo AutoModel Installation Guide](../../../guides/installation.md) and [LLM Fine-Tuning Guide](../../../guides/llm/finetune.md).

## Fine-Tune the Model

Refer to the [LLM Fine-Tuning Guide](../../../guides/llm/finetune.md) and the [Large MoE Fine-Tuning Guide](../../../guides/llm/large-moe-finetune.md).

## Hugging Face Model Cards

- [tencent/Hy-MT2-30B-A3B](https://huggingface.co/tencent/Hy-MT2-30B-A3B)
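
The model card above mentions top-8 sigmoid routing over 128 routed experts. The following is a minimal, framework-free sketch of what such a router could compute for a single token. It is not taken from the PR: the function names are hypothetical, and two behaviors are assumptions because the card does not specify them: the bias affects only which experts are selected, and the selected sigmoid scores are renormalized to sum to 1.

```python
from __future__ import annotations

import math


def sigmoid(x: float) -> float:
    """Logistic function used to score each expert independently."""
    return 1.0 / (1.0 + math.exp(-x))


def route_token(logits: list[float], bias: list[float], top_k: int) -> dict[int, float]:
    """Return {expert_index: weight} for the top_k experts of one token.

    Assumptions (not confirmed by the PR): the bias is added only when
    ranking experts, and weights are the unbiased sigmoid scores of the
    chosen experts, renormalized to sum to 1.
    """
    scores = [sigmoid(z) for z in logits]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True)
    chosen = ranked[:top_k]
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}


# Toy input: 128 strictly increasing logits, zero bias, top-8 selection.
toy_logits = [0.1 * i for i in range(128)]
toy_weights = route_token(toy_logits, [0.0] * 128, top_k=8)
```

With these toy logits, the eight experts with the largest indices are selected. The shared expert is not part of this selection; per the card it is applied in addition to the routed experts.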
152 changes: 152 additions & 0 deletions examples/llm_finetune/hy_mt2/hy_mt2_30b_a3b_sft.yaml
@@ -0,0 +1,152 @@
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# SFT recipe for tencent/Hy-MT2-30B-A3B (translation MoE, 30B total / 3B activated).
#
# Architecture (from config.json): 48 layers (layer 0 dense), 128 experts top-8
# + 1 shared expert, sigmoid routing with bias, GQA 32/4, hidden=2048,
# moe_intermediate=expert_hidden=768, dense intermediate=6912,
# vocab=120832, 256K context, rope_theta=11158840, qk_norm.
#
# Hardware target: 8 GPUs (80 GB+ each) for full SFT with EP8 + FSDP2.
# automodel examples/llm_finetune/hy_mt2/hy_mt2_30b_a3b_sft.yaml --nproc-per-node 8
#
# EP size must divide num_experts (128). ep_size=8 -> 16 experts per rank.
# Other valid EP sizes: 1, 2, 4, 16, 32, 64, 128.
#
# Note: the on-disk checkpoint declares ``architectures: ["HYV3ForCausalLM"]``
# and ``model_type: "hy_v3"``. NeMoAutoModel's model resolver
# (``_transformers/model_init.py``) detects the Hy-MT2-30B-A3B config
# fingerprint (hidden=2048, 48 layers, 128 experts, ``enable_lm_head_fp32``)
# and dispatches to ``HyMT2ForCausalLM`` instead of the default
# ``HYV3ForCausalLM``. Going through ``NeMoAutoModelForCausalLM`` here is
# important: it runs the full HF safetensors loader, while a fully-qualified
# ``_target_: HyMT2ForCausalLM.from_pretrained`` would only construct the
# architecture with random weights.

recipe: TrainFinetuneRecipeForNextTokenPrediction

step_scheduler:
  global_batch_size: 64
  local_batch_size: 1
  ckpt_every_steps: 500
  val_every_steps: 500
  num_epochs: 1
  max_steps: 100

dist_env:
  backend: nccl
  timeout_minutes: 30

rng:
  _target_: nemo_automodel.components.training.rng.StatefulRNG
  seed: 1111
  ranked: true

model:
  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
  pretrained_model_name_or_path: tencent/Hy-MT2-30B-A3B
  torch_dtype: bfloat16
  backend:
    _target_: nemo_automodel.components.models.common.BackendConfig
    attn: te
    linear: torch
    rms_norm: torch_fp32
    experts: torch_mm
    dispatcher: torch
    fake_balanced_gate: false
    gate_precision: float32
    enable_hf_state_dict_adapter: true
    enable_fsdp_optimizations: true

checkpoint:
  enabled: true
  checkpoint_dir: /tmp/checkpoints/hy_mt2_30b_a3b/
  model_save_format: safetensors
  save_consolidated: true

distributed:
  strategy: fsdp2
  tp_size: 1
  cp_size: 1
  pp_size: 1
  # Expert parallelism: 128 experts / 8 ranks = 16 experts per rank.
  # dp_size is derived as ``world_size // (tp_size * cp_size * pp_size * ep_size)``
  # i.e. 1 with ep_size=8 on an 8-GPU node -- experts shard across the
  # full node and the remaining (non-expert) weights replicate.
  ep_size: 8

  sequence_parallel: false
  activation_checkpointing: true

  moe:
    reshard_after_forward: false
    wrap_outer_model: false
    # HF reference upcasts the lm_head to fp32 (``enable_lm_head_fp32: true``).
    # The MoE parallelizer handles this via MixedPrecisionPolicy when set
    # here; HyMT2ForCausalLM also has an in-model fp32 fallback if this is
    # left unset.
    lm_head_precision: float32

loss_fn:
  _target_: nemo_automodel.components.loss.masked_ce.MaskedCrossEntropy

dataset:
  _target_: nemo_automodel.components.datasets.llm.hellaswag.HellaSwag
  path_or_dataset: rowan/hellaswag
  split: train
  tokenizer:
    _target_: transformers.AutoTokenizer.from_pretrained
    pretrained_model_name_or_path: tencent/Hy-MT2-30B-A3B
    trust_remote_code: true

packed_sequence:
  packed_sequence_size: 0

dataloader:
  _target_: torchdata.stateful_dataloader.StatefulDataLoader
  collate_fn:
    _target_: nemo_automodel.components.datasets.utils.default_collater
    pad_seq_len_divisible: 64
  shuffle: true

validation_dataset:
  _target_: nemo_automodel.components.datasets.llm.hellaswag.HellaSwag
  path_or_dataset: rowan/hellaswag
  split: validation
  num_samples_limit: 64
  tokenizer:
    _target_: transformers.AutoTokenizer.from_pretrained
    pretrained_model_name_or_path: tencent/Hy-MT2-30B-A3B
    trust_remote_code: true

validation_dataloader:
  _target_: torchdata.stateful_dataloader.StatefulDataLoader
  collate_fn:
    _target_: nemo_automodel.components.datasets.utils.default_collater
    pad_seq_len_divisible: 64
  shuffle: false
  drop_last: true

optimizer:
  _target_: torch.optim.AdamW
  betas: [0.9, 0.95]
  eps: 1e-8
  lr: 1e-5
  weight_decay: 0.0

# Uncomment for W&B logging
# wandb:
#   project: hy_mt2-30b-a3b-sft
#   name: hy_mt2_30b_a3b_sft
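
The recipe comments state two pieces of arithmetic: `ep_size` must divide the expert count (128), and `dp_size` is derived as `world_size // (tp_size * cp_size * pp_size * ep_size)`. The sketch below reproduces that arithmetic so a candidate parallelism layout can be checked before launching. The helper names are hypothetical and are not part of NeMo AutoModel.

```python
from __future__ import annotations


def valid_ep_sizes(num_experts: int) -> list[int]:
    """All expert-parallel sizes that divide the expert count evenly."""
    return [n for n in range(1, num_experts + 1) if num_experts % n == 0]


def derive_dp_size(world_size: int, tp_size: int = 1, cp_size: int = 1,
                   pp_size: int = 1, ep_size: int = 1) -> int:
    """Mirror the formula quoted in the recipe comment.

    Raises ValueError when the product of the parallel sizes does not
    divide world_size, since the layout would then be invalid.
    """
    denom = tp_size * cp_size * pp_size * ep_size
    if world_size % denom != 0:
        raise ValueError(f"world_size={world_size} is not divisible by {denom}")
    return world_size // denom


# Values from the recipe: 128 experts, 8 GPUs, ep_size=8, tp=cp=pp=1.
experts_per_rank = 128 // 8
dp_size = derive_dp_size(world_size=8, ep_size=8)
```

For the recipe's values this gives 16 experts per rank and a derived `dp_size` of 1, matching the comments in the `distributed` section.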
2 changes: 2 additions & 0 deletions fern/versions/nightly.yml
@@ -154,6 +154,8 @@ navigation:
   path: ./nightly/pages/model-coverage/llm/parasail-ai/gritlm.mdx
 - page: "Hy3-preview"
   path: ./nightly/pages/model-coverage/llm/tencent/hy3.mdx
+- page: "Hy-MT2"
+  path: ./nightly/pages/model-coverage/llm/tencent/hy-mt2.mdx
 - page: "MiMo-V2-Flash"
   path: ./nightly/pages/model-coverage/llm/xiaomimimo/mimo-v2-flash.mdx
 - page: "Ling 2.0"
67 changes: 67 additions & 0 deletions fern/versions/nightly/pages/model-coverage/llm/tencent/hy-mt2.mdx
@@ -0,0 +1,67 @@
---
title: "Hy-MT2 (Hunyuan-MT2)"
description: ""
---
[Hy-MT2-30B-A3B](https://huggingface.co/tencent/Hy-MT2-30B-A3B) is Tencent's translation Mixture-of-Experts language model with 30B total parameters and 3B activated per token. It features 48 transformer layers (layer 0 dense, layers 1–47 MoE), 128 routed experts plus 1 shared expert with top-8 sigmoid routing, Grouped Query Attention (32 Q / 4 KV heads), per-head QK RMSNorm, RoPE, and an in-forward fp32 upcast on the language-model head (`enable_lm_head_fp32`). It supports a 256K context window.

<Info>

| | |
|---|---|
| **Task** | Text Generation (MoE, translation) |
| **Architecture** | `HyMT2ForCausalLM` |
| **Parameters** | 30B total / 3B activated |
| **HF Org** | [tencent](https://huggingface.co/tencent) |

</Info>

## Available Models

- **Hy-MT2-30B-A3B**: 30B total, top-8 routed experts (out of 128) activated per token, plus 1 shared expert

## Architectures

- `HyMT2ForCausalLM`

## Example HF Models

| Model | HF ID |
|---|---|
| Hy-MT2-30B-A3B | [`tencent/Hy-MT2-30B-A3B`](https://huggingface.co/tencent/Hy-MT2-30B-A3B) |

## Example Recipes

| Recipe | Description |
|---|---|
| [hy_mt2_30b_a3b_sft.yaml](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/llm_finetune/hy_mt2/hy_mt2_30b_a3b_sft.yaml) | SFT — Hy-MT2-30B-A3B with FSDP2 + EP8 + fp32 LM head |

## Try with NeMo AutoModel

**1. Install** ([NeMo AutoModel](/get-started/installation)):

```bash
pip install nemo-automodel
```

**2. Clone the repo** to get the example recipes:

```bash
git clone https://github.com/NVIDIA-NeMo/Automodel.git
cd Automodel
```

**3. Run the recipe** from inside the repo:

```bash
automodel --nproc-per-node=8 examples/llm_finetune/hy_mt2/hy_mt2_30b_a3b_sft.yaml
```

Refer to the [NeMo AutoModel Installation Guide](/get-started/installation) and [LLM Fine-Tuning Guide](/recipes-e2e-examples/sft-peft).

## Fine-Tune the Model

Refer to the [LLM Fine-Tuning Guide](/recipes-e2e-examples/sft-peft) and the [Large MoE Fine-Tuning Guide](/recipes-e2e-examples/large-moe-fine-tuning).

## Hugging Face Model Cards

- [tencent/Hy-MT2-30B-A3B](https://huggingface.co/tencent/Hy-MT2-30B-A3B)
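
The page above lists Grouped Query Attention with 32 query heads and 4 KV heads. As a small illustration of what that ratio implies, the sketch below maps each query head to the KV head it would share. The function name is hypothetical, and the contiguous grouping (heads 0 to 7 share KV head 0, and so on) is an assumption based on the common convention rather than something the PR states.

```python
def kv_head_for_query_head(q_head: int, num_q_heads: int = 32, num_kv_heads: int = 4) -> int:
    """Return the KV head index shared by the given query head.

    Assumes contiguous grouping: consecutive query heads share one KV head.
    """
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    if not 0 <= q_head < num_q_heads:
        raise IndexError(f"q_head {q_head} out of range")
    group_size = num_q_heads // num_kv_heads  # 8 query heads per KV head here
    return q_head // group_size


# Query heads grouped by the KV head they read from.
groups = {kv: [q for q in range(32) if kv_head_for_query_head(q) == kv] for kv in range(4)}
```

Under this assumption each KV head serves 8 query heads, which reduces the KV cache to one eighth of what 32 independent KV heads would need.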
8 changes: 8 additions & 0 deletions nemo_automodel/_transformers/model_init.py
@@ -221,6 +221,14 @@ def _resolve_custom_model_cls_for_config(config):
         return None

     arch_name = architectures[0]
+    if arch_name == "HYV3ForCausalLM":
+        from nemo_automodel.components.models.hy_mt2.dispatch import is_hy_mt2_config
+
+        if is_hy_mt2_config(config):
+            from nemo_automodel.components.models.hy_mt2.model import HyMT2ForCausalLM
+
+            return HyMT2ForCausalLM
+
     if not ModelRegistry.has_custom_model(arch_name):
         return None

4 changes: 4 additions & 0 deletions nemo_automodel/_transformers/registry.py
@@ -164,6 +164,10 @@
         "HYV3ForCausalLM",
         ("nemo_automodel.components.models.hy_v3.model", "HYV3ForCausalLM"),
     ),
+    (
+        "HyMT2ForCausalLM",
+        ("nemo_automodel.components.models.hy_mt2.model", "HyMT2ForCausalLM"),
+    ),
     (
         "Qwen2ForCausalLM",
         ("nemo_automodel.components.models.qwen2.model", "Qwen2ForCausalLM"),
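
Each registry entry in the hunk above pairs an architecture name with a `(module path, class name)` tuple. How the registry consumes these entries is not shown here. One plausible reading is a lazy import, where the module is loaded only when the architecture is requested. The sketch below illustrates that pattern with standard-library classes as stand-ins so it runs without NeMo AutoModel installed; `ARCH_MAPPING` and `resolve_class` are hypothetical names.

```python
import importlib

# Stand-in entries in the same (name, (module path, class name)) shape as the
# registry hunk, pointing at stdlib classes so the sketch is runnable anywhere.
ARCH_MAPPING = dict(
    [
        ("OrderedDict", ("collections", "OrderedDict")),
        ("Fraction", ("fractions", "Fraction")),
    ]
)


def resolve_class(arch_name: str):
    """Import the module for arch_name on demand and return the named class."""
    if arch_name not in ARCH_MAPPING:
        raise KeyError(f"no custom model registered for {arch_name!r}")
    module_path, class_name = ARCH_MAPPING[arch_name]
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


fraction_cls = resolve_class("Fraction")
```

Storing dotted paths instead of class objects keeps importing the registry cheap, because no model module is imported until its architecture is actually resolved.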
17 changes: 17 additions & 0 deletions nemo_automodel/components/models/hy_mt2/__init__.py
@@ -0,0 +1,17 @@
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from nemo_automodel.components.models.hy_mt2.model import HyMT2ForCausalLM

__all__ = ["HyMT2ForCausalLM"]