This repository was archived by the owner on Nov 19, 2025. It is now read-only.

Commit 9c1727f

arendu authored and terrykong committed
feat: nemotron-5 features
Squashed commit messages:

* wip (Signed-off-by: arendu <adithya.r@gmail.com>)
* docs: 0.5.0 documentation updates (#346) (Signed-off-by: ashors1 <ashors@nvidia.com>)
* ci: Sign-off cherry pick (#366) (Signed-off-by: Oliver Koenig <okoenig@nvidia.com>)
* docs: main readme and sft docs (#367) (Signed-off-by: Oleksii Kuchaiev <okuchaiev@nvidia.com>; Co-authored-by: Gerald Shen <119401249+gshennvm@users.noreply.github.com>)
* docs: fix code block rendering (#369) (Signed-off-by: ashors1 <ashors@nvidia.com>)
* dpo and sft (Signed-off-by: arendu <adithya.r@gmail.com>)
* dpo support (Signed-off-by: root <root@cw-dfw-h100-001-129-026.cm.cluster>)
* mamba padding (Signed-off-by: arendu <adithya.r@gmail.com>)
* convenience script to remove old format of DPO data (Signed-off-by: adithyare <adithyare@nvidia.com>)
* pad to mult 256 (Signed-off-by: arendu <adithya.r@gmail.com>)
* copy dpo style cfg overrides (Signed-off-by: arendu <adithya.r@gmail.com>)
* remove _modify_config (Signed-off-by: arendu <adithya.r@gmail.com>)
* fix config issue (Signed-off-by: Jiaqi Zeng <jiaqiz@nvidia.com>)
* fix mamba config issue (Signed-off-by: Jiaqi Zeng <jiaqiz@nvidia.com>)
* is mamba default false (Signed-off-by: arendu <adithya.r@gmail.com>)
1 parent bd590d6 commit 9c1727f

21 files changed: 483 additions & 369 deletions


.github/workflows/cherry-pick-release-commit.yml

Lines changed: 1 addition & 1 deletion
@@ -60,7 +60,7 @@ jobs:
       (
         git fetch origin $RELEASE_BRANCH:$RELEASE_BRANCH
         git switch --force-create cherry-pick-$PR_ID-$RELEASE_BRANCH $RELEASE_BRANCH
-        git cherry-pick $SHA
+        git cherry-pick --signoff $SHA
         git push -u origin --force cherry-pick-$PR_ID-$RELEASE_BRANCH
         git checkout ${CI_DEFAULT_BRANCH:-main}
       )

README.md

Lines changed: 7 additions & 0 deletions
@@ -42,6 +42,13 @@ For the latest stable release, please see the [releases page](https://github.com
 ### Requirements
 NeMo-Aligner has the same requirements as the [NeMo Toolkit Requirements](https://github.com/NVIDIA/NeMo#requirements) with the addition of [PyTriton](https://github.com/triton-inference-server/pytriton).

+### Quick start inside NeMo container
+NeMo-Aligner comes included with NeMo containers. On a machine with NVIDIA GPUs and drivers installed, run the NeMo container:
+```bash
+docker run --gpus all -it --rm --shm-size=8g --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/nemo:24.07
+```
+Once you are inside the container, NeMo-Aligner is already installed and, together with NeMo and other tools, can be found under the ```/opt/``` folder.
+
 ### Install NeMo-Aligner
 Please follow the same steps as outlined in the [NeMo Toolkit Installation Guide](https://github.com/NVIDIA/NeMo#installation). After installing NeMo, execute the following additional command:
 ```bash

docs/user-guide/dpo.rst

Lines changed: 18 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -5,25 +5,28 @@
 Model Alignment by DPO, RPO, and IPO
 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

+.. note::
+   Before starting this tutorial, be sure to review the :ref:`introduction <model-aligner-intro>` for tips on setting up your NeMo-Aligner environment.
+
 The NeMo Framework supports efficient model alignment via the NeMo-Aligner codebase.

-All algorithms in NeMo-Aligner will work with any GPT-based model that is from Megatron Core (in the config it has ``mcore_gpt=True``). For the purposes of this tutorial, we will go through the entire Direct Preference Optimization (DPO) pipeline using the newly released `2B GPT model with 4096 sequence length <https://huggingface.co/nvidia/GPT-2B-001>`__. The same tutorial also works for GPT models (such as LLaMa2) of any size.
+All algorithms in NeMo-Aligner will work with any GPT-based model that is from Megatron Core (in the config it has ``mcore_gpt=True``). For the purposes of this tutorial, we will go through the entire Direct Preference Optimization (DPO) pipeline using the newly released `2B GPT model with 4096 sequence length <https://huggingface.co/nvidia/GPT-2B-001>`__. The same tutorial also works for GPT models (such as LLaMa3) of any size.

 DPO with LoRA
 #############

 We support both full-parameter DPO training and LoRA DPO training.
-For full-parameter DPO, there exists an actor and a reference model. The actor is initialized with the reference model and is fully trainable. The reference model is frozen and used to calculate logprobs for KL-penalty loss (see `DPO paper <https://arxiv.org/pdf/2305.18290.pdf>`__).
+In full-parameter DPO, there exists an actor and a reference model. The actor is initialized with the reference model and is fully trainable. The reference model is frozen and used to calculate logprobs for KL-penalty loss (see the `DPO paper <https://arxiv.org/pdf/2305.18290.pdf>`__).
 For LoRA-based DPO, the actor is initialized by the reference model plus LoRA weights, where only the LoRA weights are trainable. Therefore, it allows us to switch between the actor/reference models by simply enabling or disabling LoRA. In addition, there is no need to store two sets of LLM weights.

 RPO and IPO Variations
 #######################

-Besides the vanilla DPO algorithm, we support other variants of DPO algorithms, including Identity preference optimization (IPO) and Reward-aware preference optimization (RPO).
+Besides the vanilla DPO algorithm, we support other variants of DPO algorithms, including Identity Preference Optimization (IPO) and Reward-aware Preference Optimization (RPO).

 The algorithm is identified with the ``dpo.preference_loss`` config variable. We support three sorts of RPO algorithms based on the distance metric: ``rpo_sq`` for squared distance, ``rpo_bwd_kl`` for Bernoulli backward KL divergence, and ``rpo_fwd_kl`` for Bernoulli forward KL divergence.

-To use the RPO algorithm, each dataset example should have chosen_reward and rejected_reward, which might come from human labelers or reward models. If chosen_reward and rejected_reward are not existent in the data, dpo.default_chosen_reward and dpo.default_rejected_reward are used.
+To use the RPO algorithm, each dataset example should have ``chosen_reward`` and ``rejected_reward``, which might come from human labelers or reward models. If ``chosen_reward`` and ``rejected_reward`` are not present in the data, ``dpo.default_chosen_reward`` and ``dpo.default_rejected_reward`` are used.
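As an aside on the three RPO distance metrics named in the added text, they can be illustrated with a small, self-contained sketch. This is an assumption-laden illustration, not NeMo-Aligner's actual loss code: the string names mirror the ``dpo.preference_loss`` options, but the exact preference probabilities the implementation compares (derived from policy log-probs and the rewards) are defined in the NeMo-Aligner source.

```python
import math

def bernoulli_kl(p, q, eps=1e-8):
    """KL(Bern(p) || Bern(q)) between two Bernoulli distributions."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def rpo_distance(model_prob, target_prob, metric):
    """Illustrative distance between the policy's implied preference
    probability and the reward-implied target probability.
    Which argument order counts as 'forward' vs 'backward' here is an
    assumption, not taken from the NeMo-Aligner implementation."""
    if metric == "rpo_sq":        # squared distance
        return (model_prob - target_prob) ** 2
    if metric == "rpo_bwd_kl":    # KL(model || target)
        return bernoulli_kl(model_prob, target_prob)
    if metric == "rpo_fwd_kl":    # KL(target || model)
        return bernoulli_kl(target_prob, model_prob)
    raise ValueError(f"unknown metric: {metric}")
```

All three distances are zero when the model's implied preference matches the target and grow as the two diverge, which is the property the config options share.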

 Obtain a Pretrained Model
 ############################
@@ -36,18 +39,18 @@ To start, we must first get a pretrained model to align. There are two models we

 #. Get the 2B checkpoint via ``wget https://huggingface.co/nvidia/GPT-2B-001/resolve/main/GPT-2B-001_bf16_tp1.nemo``.
 #. Extract the NeMo File to a folder with ``mkdir model_checkpoint && tar -xvf GPT-2B-001_bf16_tp1.nemo -C model_checkpoint``.
-#. Run the script to convert from the old NeMo checkpoint to the Megatron Core checkpoint. The script is located `here <https://github.com/NVIDIA/NeMo/blob/86b198ff93438d454f9c7f3550bcfb7d4e59feab/scripts/nlp_language_modeling/convert_nemo_gpt_to_mcore.py>`__.
+#. Run the script to convert from the old NeMo checkpoint to the Megatron Core checkpoint. The script is located `here <https://github.com/NVIDIA/NeMo/blob/0ec7e9090d3261b8ce81818b0555a204e50d814d/scripts/checkpoint_converters/convert_gpt_nemo_to_mcore.py>`__.

    .. code-block:: bash

       python convert_nemo_gpt_to_mcore.py \
          --in-folder ./model_checkpoint \
          --out-file ./mcore_gpt.nemo

-.. tab-item:: LLaMa2 7B
+.. tab-item:: LLaMa3 8B
    :sync: key2

-   #. Download the `Llama 2 7B LLM model and tokenizer <https://huggingface.co/meta-llama/Llama-2-7b>`__ into the models folder.
-   #. Convert the LLaMa2 LLM into ``.nemo`` format.
+   #. Download the `Llama 3 8B LLM model and tokenizer <https://huggingface.co/meta-llama/Meta-Llama-3-8B>`__ into the models folder.
+   #. Convert the LLaMa3 LLM into ``.nemo`` format.

    .. code-block:: bash

       python /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py \
@@ -78,7 +81,7 @@ For best DPO training performance, it is recommended that you start with a SFT m
 DPO Model Training
 #####################

-Before running the core DPO training, you must prepare your training and validation data to the format required for DPO training. DPO expects .jsonl files where each line is a JSON dict corresponding to a single, complete sample, as shown below::
+Before running the core DPO training, you must prepare your training and validation data to the format required for DPO training. DPO expects ``.jsonl`` files where each line is a JSON dict corresponding to a single, complete sample, as shown below::

    {"prompt": "Which year was the Magna Carta signed?", "chosen_response": "1215", "rejected_response": "I refuse to answer this question."}
    {"prompt": "Please give me the name of a famous medieval painter.", "chosen_response": "Hieronymus Bosch", "rejected_response": "David Hockney"}
@@ -88,12 +91,12 @@ However, please be aware that most Megatron GPT models adhere to a strict format
    {"prompt": "<extra_id_0>System\n\n<extra_id_1>User\nWhich year was the Magna Carta signed?\n<extra_id_1>Assistant\n", "chosen_response": "1215\n<extra_id_1>", "rejected_response": "I refuse to answer this question.\n<extra_id_1>"}
    {"prompt": "<extra_id_0>System\n\n<extra_id_1>User\nPlease give me the name of a famous medieval painter.\n<extra_id_1>Assistant\n", "chosen_response": "Hieronymus Bosch\n<extra_id_1>", "rejected_response": "David Hockney\n<extra_id_1>"}

-Always follow the prompt-response template format used during your SFT training for DPO, as failure to do so will produce a model which outputs garbage text. You should create one jsonl file in the format above for your training data and one jsonl for your validation data.
+Always follow the prompt-response template format used during your SFT training for DPO, as failure to do so will produce a model which outputs garbage text. You should create one ``.jsonl`` file in the format above for your training data and one ``.jsonl`` for your validation data.

 Your JSONL file must contain at least as many samples as the Global Batch Size (GBS) you plan to use during training. For example, if GBS = 64, ensure that both your training and validation files include at least 64 samples. Using a file with fewer samples than the GBS will result in a crash.

 Once your data is processed into the correct format, you are ready to begin DPO training. You must start with a pretrained or SFT trained model. For this section, we will use the SFT model trained in the previous step to train the DPO model.
-For the purposes of the following sections, we assume that your training jsonl file is located in ``/path/to/train_dpo_format.jsonl`` and your validation jsonl file is located in ``/path/to/valid_dpo_format.jsonl``.
+For the purposes of the following sections, we assume that your training ``.jsonl`` file is located in ``/path/to/train_dpo_format.jsonl`` and your validation ``.jsonl`` file is located in ``/path/to/valid_dpo_format.jsonl``.
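The format and sample-count requirements described above can be sanity-checked before launching a run. The following is a convenience sketch, not part of NeMo-Aligner; the required keys match the ``.jsonl`` format shown in the docs, and the GBS rule is the one stated above (fewer samples than GBS crashes training).

```python
import json

# Keys required by the DPO .jsonl format shown in the docs.
REQUIRED_KEYS = {"prompt", "chosen_response", "rejected_response"}

def validate_dpo_jsonl(path, global_batch_size):
    """Check every line is a JSON dict with the DPO keys, and that the
    file holds at least GBS samples. Returns the sample count."""
    num_samples = 0
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            sample = json.loads(line)
            missing = REQUIRED_KEYS - sample.keys()
            if missing:
                raise ValueError(f"line {lineno} is missing keys: {sorted(missing)}")
            num_samples += 1
    if num_samples < global_batch_size:
        raise ValueError(
            f"{path} has {num_samples} samples but GBS={global_batch_size}"
        )
    return num_samples
```

Run it on both the training and validation files with the GBS you intend to use, before starting DPO training.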

 For the following parameters, the ``model.dpo.ref_policy_kl_penalty`` corresponds to the beta parameter in the DPO paper.
@@ -196,7 +199,7 @@ All metrics will be grouped by either ``train/`` or ``val/`` in WandB, represent
 When it comes to ideal hyperparameters for DPO training, much will depend on the characteristics of your SFT or base/foundation model. Consequently, there are no one-size-fits-all parameters that will universally work in all cases.
 However, the following list is a brief overview of which hyperparameters we have perturbed for various model sizes and their effects:
-* global_batch_size: generally, we have found that, all other parameters held equal, lower GBS performs worse. GBS of 256 or 512 seems to be the sweet spot for most models we trained.
-* epochs: highly sensitive to training data size. We recommend you start with 1 epoch and then add on from there. We did not see any improvements beyond 3 epochs.
-* learning rate: we tested cosine annealing with a warmup of 10 steps, followed by a slow decay to a constant rate. That constant rate should be fairly low. We saw the best performance with 9e-7 and 5-e7.
-* ref_policy_kl_penalty: we generally saw better performance with lower values of 0.1, 0.2, 0.5, and 1.0. Occasionally, values as high as 5.0 worked too.
+* global_batch_size: Generally, we have found that, all other parameters held equal, lower GBS performs worse. GBS of 256 or 512 seems to be the sweet spot for most models we trained.
+* epochs: Highly sensitive to training data size. We recommend you start with 1 epoch and then add on from there. We did not see any improvements beyond 3 epochs.
+* learning rate: We tested cosine annealing with a warmup of 10 steps, followed by a slow decay to a constant rate. That constant rate should be fairly low. We saw the best performance with 9e-7 and 5e-7.
+* ref_policy_kl_penalty: We generally saw better performance with lower values of 0.1, 0.2, 0.5, and 1.0. Occasionally, values as high as 5.0 worked too.

docs/user-guide/draftp.rst

Lines changed: 15 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -2,18 +2,20 @@

 .. _model-aligner-draftp:

-Fine-tuning Stable Diffusion with DRaFT+
+Fine-Tuning Stable Diffusion with DRaFT+
 @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

-In this tutorial, we will go through the step-by-step guide for fine-tuning Stable Diffusion model using DRaFT+ algorithm by NVIDIA.
-DRaFT+ is an improvement over the `DRaFT <https://arxiv.org/pdf/2309.17400.pdf>`__ algorithm by alleviating the mode collapse and improving diversity through regularization.
-For more technical details on the DRaFT+ algorithm, check out our technical blog.
+.. note::
+   Before starting this tutorial, be sure to review the :ref:`introduction <model-aligner-intro>` for tips on setting up your NeMo-Aligner environment.

+In this tutorial, we will go through the step-by-step guide for fine-tuning a Stable Diffusion model using the DRaFT+ algorithm by NVIDIA.
+DRaFT+ enhances the `DRaFT <https://arxiv.org/pdf/2309.17400.pdf>`__ algorithm by mitigating mode collapse and improving diversity through regularization.
+For more technical details on the DRaFT+ algorithm, check out our technical blog.

-Data Input for running DRaFT+
+Data Input for Running DRaFT+
 #############################

-The data for running DRaFT+ should be a ``.tar`` file consisting of a plain prompt. You can generate a tarfile from a ``.txt``
+The data for running DRaFT+ should be a ``.tar`` file consisting of plain prompts. You can generate a tar file from a ``.txt``
 file containing the prompts separated by new lines, such as following format::

    prompt1
@@ -35,7 +37,7 @@ Use the following script to download and save the prompts from the `Pick a pic <

       for caption in captions:
          file.write(caption + '\n')

-You can then run the following snipet to convert it to a ``.tar`` file:
+You can then run the following snippet to convert it to a ``.tar`` file:

 .. code-block:: bash
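The bash snippet referenced in the hunk above is truncated out of this diff. As a rough stand-in (the file names here are placeholders, not the ones the docs use), the newline-separated prompt file can be packed into a ``.tar`` with Python's standard library:

```python
import tarfile

# Placeholder file names for illustration only.
prompts = ["prompt1", "prompt2", "prompt3"]

# Write the prompts separated by new lines, as the docs describe.
with open("prompts.txt", "w") as f:
    f.write("\n".join(prompts) + "\n")

# Pack the prompt file into a .tar archive for DRaFT+ data input.
with tarfile.open("prompts.tar", "w") as tar:
    tar.add("prompts.txt")
```

The actual tutorial performs this conversion with a shell command; this sketch only shows the same .txt-to-.tar step.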
@@ -64,8 +66,8 @@ you can use the `conversion script <https://github.com/NVIDIA/NeMo/blob/main/exa
 DRaFT+ Training
 ###############

-To launch reward model training, you must have checkpoints for `UNet <https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/main/unet>`__ and
-`VAE <https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/main/vae>`__ of a trained Stable Diffusion model and a checkpoint for the Reward Model.
+To start reward model training, you need checkpoints for both the `UNet <https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/main/unet>`__ and
+`VAE <https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/main/vae>`__ components of a trained Stable Diffusion model, as well as a checkpoint for the Reward Model.

 .. tab-set::
@@ -167,7 +169,7 @@ To launch reward model training, you must have checkpoints for `UNet <https://hu

 .. note::
-   For more info on DRaFT+ hyperparameters please see the model config files (for SD and SDXL respectively):
+   For more information on DRaFT+ hyperparameters, please see the model config files (for SD and SDXL respectively):

    ``NeMo-Aligner/examples/mm/stable_diffusion/conf/draftp_sd.yaml``
    ``NeMo-Aligner/examples/mm/stable_diffusion/conf/draftp_sdxl.yaml``
@@ -179,10 +181,10 @@ Once you have completed fine-tuning Stable Diffusion with DRaFT+, you can run in
 and `sd_lora_infer.py <https://github.com/NVIDIA/NeMo/blob/main/examples/multimodal/text_to_image/stable_diffusion/sd_lora_infer.py>`__ scripts from the NeMo codebase. The generated images with the fine-tuned model should have
 better prompt alignment and aesthetic quality.

-User controllable finetuning with Annealed Importance Guidance (AIG)
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+User-controllable Fine-Tuning with Annealed Importance Guidance (AIG)
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

-AIG provides the inference-time flexibility to interpolate between the base Stable Diffusion model (with low rewards and high diversity) and DRaFT-finetuned model (with high rewards and low diversity) to obtain images with high rewards and high diversity. AIG inference is easily done by specifying comma-separated `weight_type` strategies to interpolate between the base and finetuned model.
+AIG provides the inference-time flexibility to interpolate between the base Stable Diffusion model (with low rewards and high diversity) and a DRaFT+ fine-tuned model (with high rewards and low diversity) to obtain images with high rewards and high diversity. AIG inference is easily done by specifying comma-separated ``weight_type`` strategies to interpolate between the base and fine-tuned model.

 .. tab-set::
    .. tab-item:: AIG on Stable Diffusion XL
Lines changed: 32 additions & 0 deletions
@@ -1,2 +1,34 @@
+
+.. _model-aligner-intro:

 Model Alignment
 !!!!!!!!!!!!!!!
+
+Introduction
+############
+
+NeMo-Aligner is a scalable toolkit for efficient model alignment. The toolkit has support for state-of-the-art model alignment algorithms such as SteerLM, Direct Preference Optimization (DPO), and Reinforcement Learning from Human Feedback (RLHF). These algorithms enable users to align language models to be more safe, harmless, and helpful. Users can perform end-to-end model alignment on a wide range of model sizes and take advantage of all the parallelism techniques to ensure their model alignment is done in a performant and resource-efficient manner. For more technical details, please refer to our `paper <https://arxiv.org/abs/2405.01481>`__.
+
+The NeMo-Aligner toolkit is built using the `NeMo Toolkit <https://github.com/NVIDIA/NeMo>`__, which allows for scaling training up to thousands of GPUs using tensor, data, and pipeline parallelism for all components of alignment. All of our checkpoints are cross-compatible with the NeMo ecosystem, allowing for inference deployment and further customization.
+
+The toolkit is currently in its early stages. We are committed to improving the toolkit to make it easier for developers to pick and choose different alignment algorithms to build safe, helpful, and reliable models.
+
+Get Started
+###########
+
+NeMo-Aligner comes preinstalled in NVIDIA NeMo containers. NeMo containers are launched concurrently with NeMo version updates.
+
+To get access to the container, log in to the NVIDIA GPU Cloud (NGC) platform or create a free NGC account here: `NVIDIA NGC <https://ngc.nvidia.com/signin>`__. Once you have logged in, you can get the container here: `NVIDIA NGC NeMo Framework <https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo>`__.
+
+To use a pre-built container, run the following code:
+
+.. code-block:: bash
+
+   docker run -it --gpus=all --shm-size=8g --workdir /opt/NeMo-Aligner nvcr.io/nvidia/nemo:24.09
+
+Please use the latest tag in the form ``yy.mm.(patch)``.
+
+.. note::
+   Some of the subsequent tutorials require accessing gated Hugging Face models. For details on how to access these models, refer to `this document <https://docs.nvidia.com/nemo-framework/user-guide//latest/generaltips.html#working-with-hugging-face-models>`__.
0 commit comments
