NVIDIA-NeMo · HuiyingLi · May 28, 2026 · May 27, 2026
@@ -11,7 +11,7 @@ To run LLMs with NeMo AutoModel, make sure you're using NeMo container version [
 pip3 install --upgrade git+git@github.com:NVIDIA-NeMo/AutoModel.git
 ```
 
-For other installation options (e.g., uv), see the [NeMo AutoModel Installation Guide](../../guides/installation.md).
+For other installation options (for example, uv), refer to the [NeMo AutoModel Installation Guide](../../guides/installation.md).
 
 ## Supported Models
 
@@ -73,20 +73,21 @@ NeMo AutoModel supports the [AutoModelForCausalLM](https://huggingface.co/transf
 | Stepfun AI | [Step-3.5](stepfun-ai/step-3-5.md) | `Step3p5ForCausalLM` |
 | Parasail AI | [GritLM](parasail-ai/gritlm.md) | `GritLM` |
 | Tencent | [Hy3-preview](tencent/hy3.md) | `HYV3ForCausalLM` |
+| Tencent | [Hy-MT2](tencent/hy-mt2.md) | `HyMT2ForCausalLM` |
 | Xiaomi MiMo | [MiMo-V2-Flash](xiaomimimo/mimo-v2-flash.md) | `MiMoV2FlashForCausalLM` |
 | inclusionAI | [Ling 2.0](inclusionai/ling-2.md) | `BailingMoeV2ForCausalLM` |
 
-## Fine-Tuning LLMs with NeMo AutoModel
+## Fine-Tune LLMs with NeMo AutoModel
 
 The models listed above can be fine-tuned using NeMo AutoModel. We support two primary fine-tuning approaches:
 
 1. **Parameter-Efficient Fine-Tuning (PEFT)**: Updates only a small subset of parameters (typically <1%) using techniques like Low-Rank Adaptation (LoRA).
 2. **Supervised Fine-Tuning (SFT)**: Updates all or most model parameters for deeper adaptation.
 
-See the [Fine-Tuning Guide](../../guides/llm/finetune.md) to learn how to apply both methods to your data.
+Refer to the [Fine-Tuning Guide](../../guides/llm/finetune.md) to learn how to apply both methods to your data.
 
 :::{tip}
-In these guides, we use the `SQuAD v1.1` dataset for demonstration purposes, but you can use your own data. Update the recipe YAML `dataset` / `validation_dataset` sections accordingly. See [LLM datasets](../../guides/llm/dataset.md) and [dataset overview](../../guides/dataset-overview.md).
+In these guides, the examples use the `SQuAD v1.1` dataset for demonstration purposes, but you can use your own data. Update the recipe YAML `dataset` / `validation_dataset` sections accordingly. Refer to [LLM datasets](../../guides/llm/dataset.md) and [dataset overview](../../guides/dataset-overview.md).
 :::
 
 ```{toctree}
@@ -146,6 +147,7 @@ stabilityai/stablelm
 stepfun-ai/step-3-5
 parasail-ai/gritlm
 tencent/hy3
+tencent/hy-mt2
 xiaomimimo/mimo-v2-flash
 inclusionai/ling-2
 ```
@@ -0,0 +1,63 @@
+# Hy-MT2 (Hunyuan-MT2)
+
+[Hy-MT2-30B-A3B](https://huggingface.co/tencent/Hy-MT2-30B-A3B) is Tencent's translation Mixture-of-Experts language model with 30B total parameters and 3B activated per token. It features 48 transformer layers (layer 0 dense, layers 1–47 MoE), 128 routed experts plus 1 shared expert with top-8 sigmoid routing, Grouped Query Attention (32 Q / 4 KV heads), per-head QK RMSNorm, RoPE, and an in-forward fp32 upcast on the language-model head (`enable_lm_head_fp32`). It supports a 256K context window.
+
+:::{card}
+| | |
+|---|---|
+| **Task** | Text Generation (MoE, translation) |
+| **Architecture** | `HyMT2ForCausalLM` |
+| **Parameters** | 30B total / 3B activated |
+| **HF Org** | [tencent](https://huggingface.co/tencent) |
+:::
+
+## Available Models
+
+- **Hy-MT2-30B-A3B**: 30B total, top-8 routed experts (out of 128) activated per token, plus 1 shared expert
+
+## Architectures
+
+- `HyMT2ForCausalLM`
+
+## Example HF Models
+
+| Model | HF ID |
+|---|---|
+| Hy-MT2-30B-A3B | [`tencent/Hy-MT2-30B-A3B`](https://huggingface.co/tencent/Hy-MT2-30B-A3B) |
+
+## Example Recipes
+
+| Recipe | Description |
+|---|---|
+| {download}`hy_mt2_30b_a3b_sft.yaml <../../../../examples/llm_finetune/hy_mt2/hy_mt2_30b_a3b_sft.yaml>` | SFT — Hy-MT2-30B-A3B with FSDP2 + EP8 + fp32 LM head |
+
+## Try with NeMo AutoModel
+
+**1. Install** ([NeMo AutoModel](../../../guides/installation.md)):
+
+```bash
+pip install nemo-automodel
+```
+
+**2. Clone the repo** to get the example recipes:
+
+```bash
+git clone https://github.com/NVIDIA-NeMo/Automodel.git
+cd Automodel
+```
+
+**3. Run the recipe** from inside the repo:
+
+```bash
+automodel --nproc-per-node=8 examples/llm_finetune/hy_mt2/hy_mt2_30b_a3b_sft.yaml
+```
+
+Refer to the [NeMo AutoModel Installation Guide](../../../guides/installation.md) and the [LLM Fine-Tuning Guide](../../../guides/llm/finetune.md) for setup and training details.
+
+## Fine-Tune the Model
+
+Refer to the [LLM Fine-Tuning Guide](../../../guides/llm/finetune.md) and the [Large MoE Fine-Tuning Guide](../../../guides/llm/large-moe-finetune.md).
+
+## Hugging Face Model Cards
+
+- [tencent/Hy-MT2-30B-A3B](https://huggingface.co/tencent/Hy-MT2-30B-A3B)
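
Reviewer note: the page above describes top-8 sigmoid routing over 128 routed experts. The following is a minimal, framework-free sketch of that idea, not the model's actual router. The function name `sigmoid_topk_route` is hypothetical, renormalizing the selected scores so they sum to 1 is an assumption of this sketch, and the routing bias mentioned in the recipe comments and the shared expert are not modeled.

```python
import math
from typing import List, Tuple


def _sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sigmoid_topk_route(logits: List[float], top_k: int = 8) -> List[Tuple[int, float]]:
    """Select the top_k experts by sigmoid score for a single token.

    Returns (expert_index, weight) pairs sorted by descending score.
    Weights are renormalized to sum to 1 (an assumption of this sketch).
    """
    if not 0 < top_k <= len(logits):
        raise ValueError("top_k must be between 1 and the number of experts")
    scores = [_sigmoid(x) for x in logits]
    # Highest score first; ties are broken by the lower expert index.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:top_k]
    total = sum(scores[i] for i in ranked)
    return [(i, scores[i] / total) for i in ranked]


# Illustrative router logits for one token over 128 experts (made-up values).
logits = [((i * 37) % 128) / 16.0 - 4.0 for i in range(128)]
routes = sigmoid_topk_route(logits, top_k=8)
print(len(routes), round(sum(w for _, w in routes), 6))
```

Unlike softmax routing, each sigmoid score is computed independently of the other experts, so the selected weights only sum to 1 if they are renormalized afterwards.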
@@ -0,0 +1,152 @@
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# SFT recipe for tencent/Hy-MT2-30B-A3B (translation MoE, 30B total / 3B activated).
+#
+# Architecture (from config.json): 48 layers (layer 0 dense), 128 experts top-8
+# + 1 shared expert, sigmoid routing with bias, GQA 32/4, hidden=2048,
+# moe_intermediate=expert_hidden=768, dense intermediate=6912,
+# vocab=120832, 256K context, rope_theta=11158840, qk_norm.
+#
+# Hardware target: 8 GPUs (80 GB+ each) for full SFT with EP8 + FSDP2.
+#   automodel --nproc-per-node=8 examples/llm_finetune/hy_mt2/hy_mt2_30b_a3b_sft.yaml
+#
+# EP size must divide num_experts (128). ep_size=8 -> 16 experts per rank.
+# Other valid EP sizes: 1, 2, 4, 16, 32, 64, 128.
+#
+# Note: the on-disk checkpoint declares ``architectures: ["HYV3ForCausalLM"]``
+# and ``model_type: "hy_v3"``. NeMo AutoModel's model resolver
+# (``_transformers/model_init.py``) detects the Hy-MT2-30B-A3B config
+# fingerprint (hidden=2048, 48 layers, 128 experts, ``enable_lm_head_fp32``)
+# and dispatches to ``HyMT2ForCausalLM`` instead of the default
+# ``HYV3ForCausalLM``. Loading through ``NeMoAutoModelForCausalLM`` matters
+# here: it runs the full HF safetensors loader, whereas pointing ``_target_``
+# directly at ``HyMT2ForCausalLM.from_pretrained`` would only construct the
+# architecture with random weights.
+
+recipe: TrainFinetuneRecipeForNextTokenPrediction
+
+step_scheduler:
+  global_batch_size: 64
+  local_batch_size: 1
+  ckpt_every_steps: 500
+  val_every_steps: 500
+  num_epochs: 1
+  max_steps: 100
+
+dist_env:
+  backend: nccl
+  timeout_minutes: 30
+
+rng:
+  _target_: nemo_automodel.components.training.rng.StatefulRNG
+  seed: 1111
+  ranked: true
+
+model:
+  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
+  pretrained_model_name_or_path: tencent/Hy-MT2-30B-A3B
+  torch_dtype: bfloat16
+  backend:
+    _target_: nemo_automodel.components.models.common.BackendConfig
+    attn: te
+    linear: torch
+    rms_norm: torch_fp32
+    experts: torch_mm
+    dispatcher: torch
+    fake_balanced_gate: false
+    gate_precision: float32
+    enable_hf_state_dict_adapter: true
+    enable_fsdp_optimizations: true
+
+checkpoint:
+  enabled: true
+  checkpoint_dir: /tmp/checkpoints/hy_mt2_30b_a3b/
+  model_save_format: safetensors
+  save_consolidated: true
+
+distributed:
+  strategy: fsdp2
+  tp_size: 1
+  cp_size: 1
+  pp_size: 1
+  # Expert parallelism: 128 experts / 8 ranks = 16 experts per rank.
+  # dp_size is derived as ``world_size // (tp_size * cp_size * pp_size * ep_size)``
+  # i.e. 1 with ep_size=8 on an 8-GPU node -- experts shard across the
+  # full node and the remaining (non-expert) weights replicate.
+  ep_size: 8
+
+  sequence_parallel: false
+  activation_checkpointing: true
+
+  moe:
+    reshard_after_forward: false
+    wrap_outer_model: false
+    # HF reference upcasts the lm_head to fp32 (``enable_lm_head_fp32: true``).
+    # The MoE parallelizer handles this via MixedPrecisionPolicy when set
+    # here; HyMT2ForCausalLM also has an in-model fp32 fallback if this is
+    # left unset.
+    lm_head_precision: float32
+
+loss_fn:
+  _target_: nemo_automodel.components.loss.masked_ce.MaskedCrossEntropy
+
+dataset:
+  _target_: nemo_automodel.components.datasets.llm.hellaswag.HellaSwag
+  path_or_dataset: rowan/hellaswag
+  split: train
+  tokenizer:
+    _target_: transformers.AutoTokenizer.from_pretrained
+    pretrained_model_name_or_path: tencent/Hy-MT2-30B-A3B
+    trust_remote_code: true
+
+packed_sequence:
+  packed_sequence_size: 0
+
+dataloader:
+  _target_: torchdata.stateful_dataloader.StatefulDataLoader
+  collate_fn:
+    _target_: nemo_automodel.components.datasets.utils.default_collater
+    pad_seq_len_divisible: 64
+  shuffle: true
+
+validation_dataset:
+  _target_: nemo_automodel.components.datasets.llm.hellaswag.HellaSwag
+  path_or_dataset: rowan/hellaswag
+  split: validation
+  num_samples_limit: 64
+  tokenizer:
+    _target_: transformers.AutoTokenizer.from_pretrained
+    pretrained_model_name_or_path: tencent/Hy-MT2-30B-A3B
+    trust_remote_code: true
+
+validation_dataloader:
+  _target_: torchdata.stateful_dataloader.StatefulDataLoader
+  collate_fn:
+    _target_: nemo_automodel.components.datasets.utils.default_collater
+    pad_seq_len_divisible: 64
+  shuffle: false
+  drop_last: true
+
+optimizer:
+  _target_: torch.optim.AdamW
+  betas: [0.9, 0.95]
+  eps: 1e-8
+  lr: 1e-5
+  weight_decay: 0.0
+
+# Uncomment for W&B logging
+# wandb:
+#   project: hy_mt2-30b-a3b-sft
+#   name: hy_mt2_30b_a3b_sft
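
Reviewer note: the recipe header states that the EP size must divide the number of experts (128) and that `ep_size=8` gives 16 experts per rank. A small sketch of that arithmetic follows; the helper names `valid_ep_sizes` and `experts_per_rank` are hypothetical and are not part of NeMo AutoModel.

```python
from typing import List


def valid_ep_sizes(num_experts: int) -> List[int]:
    """Return every expert-parallel size that divides num_experts evenly."""
    if num_experts <= 0:
        raise ValueError("num_experts must be positive")
    return [d for d in range(1, num_experts + 1) if num_experts % d == 0]


def experts_per_rank(num_experts: int, ep_size: int) -> int:
    """Number of experts held by each EP rank; rejects sizes that do not divide evenly."""
    if ep_size <= 0 or num_experts % ep_size != 0:
        raise ValueError(f"ep_size={ep_size} does not divide num_experts={num_experts}")
    return num_experts // ep_size


print(valid_ep_sizes(128))       # [1, 2, 4, 8, 16, 32, 64, 128]
print(experts_per_rank(128, 8))  # 16
```

A check like this can be run before launching a job on a different GPU count, since an `ep_size` that does not divide 128 cannot place the same number of experts on every rank.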
diff --git a/fern/versions/nightly.yml b/fern/versions/nightly.yml
@@ -154,6 +154,8 @@ navigation:
             path: ./nightly/pages/model-coverage/llm/parasail-ai/gritlm.mdx
           - page: "Hy3-preview"
             path: ./nightly/pages/model-coverage/llm/tencent/hy3.mdx
+          - page: "Hy-MT2"
+            path: ./nightly/pages/model-coverage/llm/tencent/hy-mt2.mdx
           - page: "MiMo-V2-Flash"
             path: ./nightly/pages/model-coverage/llm/xiaomimimo/mimo-v2-flash.mdx
           - page: "Ling 2.0"

diff --git a/fern/versions/nightly/pages/model-coverage/llm/tencent/hy-mt2.mdx b/fern/versions/nightly/pages/model-coverage/llm/tencent/hy-mt2.mdx
@@ -0,0 +1,67 @@
+---
+title: "Hy-MT2 (Hunyuan-MT2)"
+description: ""
+---
+[Hy-MT2-30B-A3B](https://huggingface.co/tencent/Hy-MT2-30B-A3B) is Tencent's translation Mixture-of-Experts language model with 30B total parameters and 3B activated per token. It features 48 transformer layers (layer 0 dense, layers 1–47 MoE), 128 routed experts plus 1 shared expert with top-8 sigmoid routing, Grouped Query Attention (32 Q / 4 KV heads), per-head QK RMSNorm, RoPE, and an in-forward fp32 upcast on the language-model head (`enable_lm_head_fp32`). It supports a 256K context window.
+
+<Info>
+
+| | |
+|---|---|
+| **Task** | Text Generation (MoE, translation) |
+| **Architecture** | `HyMT2ForCausalLM` |
+| **Parameters** | 30B total / 3B activated |
+| **HF Org** | [tencent](https://huggingface.co/tencent) |
+
+</Info>
+
+## Available Models
+
+- **Hy-MT2-30B-A3B**: 30B total, top-8 routed experts (out of 128) activated per token, plus 1 shared expert
+
+## Architectures
+
+- `HyMT2ForCausalLM`
+
+## Example HF Models
+
+| Model | HF ID |
+|---|---|
+| Hy-MT2-30B-A3B | [`tencent/Hy-MT2-30B-A3B`](https://huggingface.co/tencent/Hy-MT2-30B-A3B) |
+
+## Example Recipes
+
+| Recipe | Description |
+|---|---|
+| [hy_mt2_30b_a3b_sft.yaml](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/llm_finetune/hy_mt2/hy_mt2_30b_a3b_sft.yaml) | SFT — Hy-MT2-30B-A3B with FSDP2 + EP8 + fp32 LM head |
+
+## Try with NeMo AutoModel
+
+**1. Install** ([NeMo AutoModel](/get-started/installation)):
+
+```bash
+pip install nemo-automodel
+```
+
+**2. Clone the repo** to get the example recipes:
+
+```bash
+git clone https://github.com/NVIDIA-NeMo/Automodel.git
+cd Automodel
+```
+
+**3. Run the recipe** from inside the repo:
+
+```bash
+automodel --nproc-per-node=8 examples/llm_finetune/hy_mt2/hy_mt2_30b_a3b_sft.yaml
+```
+
+Refer to the [NeMo AutoModel Installation Guide](/get-started/installation) and the [LLM Fine-Tuning Guide](/recipes-e2e-examples/sft-peft) for setup and training details.
+
+## Fine-Tune the Model
+
+Refer to the [LLM Fine-Tuning Guide](/recipes-e2e-examples/sft-peft) and the [Large MoE Fine-Tuning Guide](/recipes-e2e-examples/large-moe-fine-tuning).
+
+## Hugging Face Model Cards
+
+- [tencent/Hy-MT2-30B-A3B](https://huggingface.co/tencent/Hy-MT2-30B-A3B)
@@ -221,6 +221,14 @@ def _resolve_custom_model_cls_for_config(config):
         return None
 
     arch_name = architectures[0]
+    if arch_name == "HYV3ForCausalLM":
+        from nemo_automodel.components.models.hy_mt2.dispatch import is_hy_mt2_config
+
+        if is_hy_mt2_config(config):
+            from nemo_automodel.components.models.hy_mt2.model import HyMT2ForCausalLM
+
+            return HyMT2ForCausalLM
+
     if not ModelRegistry.has_custom_model(arch_name):
         return None
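
Reviewer note: the body of `is_hy_mt2_config` is not part of this hunk. Based only on the fingerprint described in the recipe comments (hidden=2048, 48 layers, 128 experts, `enable_lm_head_fp32`), the dispatch could be sketched roughly as below. The attribute names `hidden_size`, `num_hidden_layers`, and `num_experts` are assumptions about the config keys, and `looks_like_hy_mt2` and `resolve_arch` are hypothetical stand-ins, not the functions added by this PR.

```python
from types import SimpleNamespace
from typing import Optional

# Hypothetical fingerprint; the real check lives in hy_mt2/dispatch.py and may differ.
_HY_MT2_FINGERPRINT = {
    "hidden_size": 2048,          # assumed attribute name
    "num_hidden_layers": 48,      # assumed attribute name
    "num_experts": 128,           # assumed attribute name
    "enable_lm_head_fp32": True,  # flag named in the recipe comments
}


def looks_like_hy_mt2(config) -> bool:
    """True when every fingerprint field is present on the config and matches."""
    return all(getattr(config, key, None) == value for key, value in _HY_MT2_FINGERPRINT.items())


def resolve_arch(config) -> Optional[str]:
    """Return the architecture name to load, overriding HYV3 when the fingerprint matches."""
    architectures = getattr(config, "architectures", None) or []
    if not architectures:
        return None
    arch_name = architectures[0]
    if arch_name == "HYV3ForCausalLM" and looks_like_hy_mt2(config):
        return "HyMT2ForCausalLM"
    return arch_name


hy_mt2_cfg = SimpleNamespace(
    architectures=["HYV3ForCausalLM"],
    hidden_size=2048,
    num_hidden_layers=48,
    num_experts=128,
    enable_lm_head_fp32=True,
)
print(resolve_arch(hy_mt2_cfg))  # HyMT2ForCausalLM
```

Matching on several fields at once keeps other checkpoints that also declare `HYV3ForCausalLM` on the default path, which is the behavior the recipe comments describe.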
 

@@ -164,6 +164,10 @@
             "HYV3ForCausalLM",
             ("nemo_automodel.components.models.hy_v3.model", "HYV3ForCausalLM"),
         ),
+        (
+            "HyMT2ForCausalLM",
+            ("nemo_automodel.components.models.hy_mt2.model", "HyMT2ForCausalLM"),
+        ),
         (
             "Qwen2ForCausalLM",
             ("nemo_automodel.components.models.qwen2.model", "Qwen2ForCausalLM"),
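
Reviewer note: the registry entries above pair an architecture name with a `(module path, class name)` tuple, which suggests the class is imported only when requested. A minimal sketch of that lookup pattern follows; `load_class` is a hypothetical helper, and the demo registry uses standard-library classes as stand-ins so the snippet runs without NeMo AutoModel installed.

```python
import importlib
from typing import Dict, Tuple

# Stand-in registry: name -> (module path, attribute name), mirroring the tuple shape above.
DEMO_REGISTRY: Dict[str, Tuple[str, str]] = {
    "OrderedDict": ("collections", "OrderedDict"),
    "Fraction": ("fractions", "Fraction"),
}


def load_class(registry: Dict[str, Tuple[str, str]], arch_name: str) -> type:
    """Import the module for arch_name on demand and return the named class."""
    try:
        module_path, class_name = registry[arch_name]
    except KeyError:
        raise KeyError(f"no custom model registered for {arch_name!r}") from None
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


cls = load_class(DEMO_REGISTRY, "Fraction")
print(cls.__module__, cls.__name__)  # fractions Fraction
```

Deferring the import this way means a model package is loaded only when its architecture is actually requested.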

@@ -0,0 +1,17 @@
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from nemo_automodel.components.models.hy_mt2.model import HyMT2ForCausalLM
+
+__all__ = ["HyMT2ForCausalLM"]