You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
{{ message }}
This repository was archived by the owner on Nov 19, 2025. It is now read-only.
Copy file name to clipboardExpand all lines: README.md
+7Lines changed: 7 additions & 0 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -42,6 +42,13 @@ For the latest stable release, please see the [releases page](https://github.com
42
42
### Requirements
43
43
NeMo-Aligner has the same requirements as the [NeMo Toolkit Requirements](https://github.com/NVIDIA/NeMo#requirements) with the addition of [PyTriton](https://github.com/triton-inference-server/pytriton).
44
44
45
+
### Quick start inside NeMo container
46
+
NeMo Aligner comes included with NeMo containers. On a machine with NVIDIA GPUs and drivers installed run NeMo container:
47
+
```bash
48
+
docker run --gpus all -it --rm --shm-size=8g --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/nemo:24.07
49
+
```
50
+
Once you are inside the container, NeMo-Aligner is already installed and together with NeMo and other tools can be found under ```/opt/``` folder.
51
+
45
52
### Install NeMo-Aligner
46
53
Please follow the same steps as outlined in the [NeMo Toolkit Installation Guide](https://github.com/NVIDIA/NeMo#installation). After installing NeMo, execute the following additional command:
Copy file name to clipboardExpand all lines: docs/user-guide/dpo.rst
+18-15Lines changed: 18 additions & 15 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -5,25 +5,28 @@
5
5
Model Alignment by DPO, RPO, and IPO
6
6
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
7
7
8
+
.. note::
9
+
Before starting this tutorial, be sure to review the :ref:`introduction <model-aligner-intro>` for tips on setting up your NeMo-Aligner environment.
10
+
8
11
The NeMo Framework supports efficient model alignment via the NeMo-Aligner codebase.
9
12
10
-
All algorithms in NeMo-Aligner will work with any GPT-based model that is from Megatron Core (in the config it has ``mcore_gpt=True``). For the purposes of this tutorial, we will go through the entire Direct Preference Optimization (DPO) pipeline using the newly released `2B GPT model with 4096 sequence length <https://huggingface.co/nvidia/GPT-2B-001>`__. The same tutorial also works for GPT models (such as LLaMa2) of any size.
13
+
All algorithms in NeMo-Aligner will work with any GPT-based model that is from Megatron Core (in the config it has ``mcore_gpt=True``). For the purposes of this tutorial, we will go through the entire Direct Preference Optimization (DPO) pipeline using the newly released `2B GPT model with 4096 sequence length <https://huggingface.co/nvidia/GPT-2B-001>`__. The same tutorial also works for GPT models (such as LLaMa3) of any size.
11
14
12
15
DPO with LoRA
13
16
#############
14
17
15
18
We support both full-parameter DPO training and LoRA DPO training.
16
-
For full-parameter DPO, there exists an actor and a reference model. The actor is initialized with the reference model and is fully trainable. The reference model is frozen and used to calculate logprobs for KL-penalty loss (see `DPO paper <https://arxiv.org/pdf/2305.18290.pdf>`__).
19
+
In full-parameter DPO, there exists an actor and a reference model. The actor is initialized with the reference model and is fully trainable. The reference model is frozen and used to calculate logprobs for KL-penalty loss (see the `DPO paper <https://arxiv.org/pdf/2305.18290.pdf>`__).
17
20
For LoRA-based DPO, the actor is initialized by the reference model plus LoRA weights, where only the LoRA weights are trainable. Therefore, it allows us to switch between the actor/reference models by simply enabling or disabling LoRA. In addition, there is no need to store two sets of LLM weights.
18
21
19
22
RPO and IPO Variations
20
23
#######################
21
24
22
-
Besides the vanilla DPO algorithm, we support other variants of DPO algorithms, including Identity preference optimization (IPO) and Reward-aware preference optimization (RPO).
25
+
Besides the vanilla DPO algorithm, we support other variants of DPO algorithms, including Identity Preference Optimization (IPO) and Reward-aware Preference Optimization (RPO).
23
26
24
27
The algorithm is identified with the ``dpo.preference_loss`` config variable. We support three sorts of RPO algorithms based on the distance metric: ``rpo_sq`` for squared distance, ``rpo_bwd_kl`` for Bernoulli backward KL divergence, and ``rpo_fwd_kl`` for Bernoulli forward KL divergence.
25
28
26
-
To use the RPO algorithm, each dataset example should have chosen_reward and rejected_reward, which might come from human labelers or reward models. If chosen_reward and rejected_reward are not existent in the data, dpo.default_chosen_reward and dpo.default_rejected_reward are used.
29
+
To use the RPO algorithm, each dataset example should have ``chosen_reward`` and ``rejected_reward``, which might come from human labelers or reward models. If ``chosen_reward`` and ``rejected_reward`` are not existent in the data, ``dpo.default_chosen_reward`` and ``dpo.default_rejected_reward`` are used.
27
30
28
31
Obtain a Pretrained Model
29
32
############################
@@ -36,18 +39,18 @@ To start, we must first get a pretrained model to align. There are two models we
36
39
37
40
#. Get the 2B checkpoint via ``wget https://huggingface.co/nvidia/GPT-2B-001/resolve/main/GPT-2B-001_bf16_tp1.nemo``.
38
41
#. Extract the NeMo File to a folder with ``mkdir model_checkpoint && tar -xvf GPT-2B-001_bf16_tp1.nemo -C model_checkpoint``.
39
-
#. Run the script to convert from the old NeMo checkpoint to the Megatron Core checkpoint. The script is located `here <https://github.com/NVIDIA/NeMo/blob/86b198ff93438d454f9c7f3550bcfb7d4e59feab/scripts/nlp_language_modeling/convert_nemo_gpt_to_mcore.py>`__.
42
+
#. Run the script to convert from the old NeMo checkpoint to the Megatron Core checkpoint. The script is located `here <https://github.com/NVIDIA/NeMo/blob/0ec7e9090d3261b8ce81818b0555a204e50d814d/scripts/checkpoint_converters/convert_gpt_nemo_to_mcore.py>`__.
40
43
.. code-block:: bash
41
44
42
45
python convert_nemo_gpt_to_mcore.py \
43
46
--in-folder ./model_checkpoint \
44
47
--out-file ./mcore_gpt.nemo
45
48
46
-
.. tab-item:: LLaMa2 7B
49
+
.. tab-item:: LLaMa3 7B
47
50
:sync: key2
48
51
49
-
#. Download the `Llama 2 7B LLM model and tokenizer <https://huggingface.co/meta-llama/Llama-2-7b>`__ into the models folder.
50
-
#. Convert the LLaMa2 LLM into ``.nemo`` format.
52
+
#. Download the `Llama 3 8B LLM model and tokenizer <https://huggingface.co/meta-llama/Meta-Llama-3-8B>`__ into the models folder.
@@ -78,7 +81,7 @@ For best DPO training performance, it is recommended that you start with a SFT m
78
81
DPO Model Training
79
82
#####################
80
83
81
-
Before running the core DPO training, you must prepare your training and validation data to the format required for DPO training. DPO expects .jsonl files where each line is a JSON dict corresponding to a single, complete sample, as shown below::
84
+
Before running the core DPO training, you must prepare your training and validation data to the format required for DPO training. DPO expects ``.jsonl`` files where each line is a JSON dict corresponding to a single, complete sample, as shown below::
82
85
83
86
{"prompt": "Which year was the Magna Carta signed?", "chosen_response": "1215", "rejected_response": "I refuse to answer this question."}
84
87
{"prompt": "Please give me the name of a famous medieval painter.", "chosen_response": "Hieronymus Bosch", "rejected_response": "David Hockney"}
@@ -88,12 +91,12 @@ However, please be aware that most Megatron GPT models adhere to a strict format
88
91
{"prompt": "<extra_id_0>System\n\n<extra_id_1>User\nWhich year was the Magna Carta signed?\n<extra_id_1>Assistant\n", "chosen_response": "1215\n<extra_id_1>", "rejected_response": "I refuse to answer this question.\n<extra_id_1>"}
89
92
{"prompt": "<extra_id_0>System\n\n<extra_id_1>User\nPlease give me the name of a famous medieval painter.\n<extra_id_1>Assistant\n", "chosen_response": "Hieronymus Bosch\n<extra_id_1>", "rejected_response": "David Hockney\n<extra_id_1>"}
90
93
91
-
Always follow the prompt-response template format used during your SFT training for DPO, as failure to do so will produce a model which outputs garbage text. You should create one jsonl file in the format above for your training data and one jsonl for your validation data.
94
+
Always follow the prompt-response template format used during your SFT training for DPO, as failure to do so will produce a model which outputs garbage text. You should create one ``.jsonl`` file in the format above for your training data and one ``.jsonl`` for your validation data.
92
95
93
96
Your JSONL file must contain at least as many samples as the Global Batch Size (GBS) you plan to use during training. For example, if GBS = 64, ensure that both your training and validation files include at least 64 samples. Using a file with fewer samples than the GBS will result in a crash.
94
97
95
98
Once your data is processed into the correct format, you are ready to begin DPO training. You must start with a pretrained or SFT trained model. For this section, we will use the SFT model trained in the previous step to train the DPO model.
96
-
For the purposes of the following sections, we assume that your training jsonl file is located in ``/path/to/train_dpo_format.jsonl`` and your validation jsonl file is located in ``/path/to/valid_dpo_format.jsonl``.
99
+
For the purposes of the following sections, we assume that your training ``.jsonl`` file is located in ``/path/to/train_dpo_format.jsonl`` and your validation ``.jsonl`` file is located in ``/path/to/valid_dpo_format.jsonl``.
97
100
98
101
For the following parameters, the ``model.dpo.ref_policy_kl_penalty`` corresponds to the beta parameter in the DPO paper.
99
102
@@ -196,7 +199,7 @@ All metrics will be grouped by either ``train/`` or ``val/`` in WandB, represent
196
199
When it comes to ideal hyperparameters for DPO training, much will depend on the characteristics of your SFT or base/foundation model. Consequently, there are no one-size-fits-all parameters that will universally work in all cases.
197
200
However, the following list is a brief overview of which hyperparameters we have perturbed for various model sizes and their effects:
198
201
199
-
* global_batch_size: generally, we have found that, all other parameters held equal, lower GBS performs worse. GBS of 256 or 512 seems to be the sweet spot for most models we trained.
200
-
* epochs: highly sensitive to training data size. We recommend you start with 1 epoch and then add on from there. We did not see any improvements beyond 3 epochs.
201
-
* learning rate: we tested cosine annealing with a warmup of 10 steps, followed by a slow decay to a constant rate. That constant rate should be fairly low. We saw the best performance with 9e-7 and 5-e7.
202
-
* ref_policy_kl_penalty: we generally saw better performance with lower values of 0.1, 0.2, 0.5, and 1.0. Occasionally, values as high as 5.0 worked too.
202
+
* global_batch_size: Generally, we have found that, all other parameters held equal, lower GBS performs worse. GBS of 256 or 512 seems to be the sweet spot for most models we trained.
203
+
* epochs: Highly sensitive to training data size. We recommend you start with 1 epoch and then add on from there. We did not see any improvements beyond 3 epochs.
204
+
* learning rate: We tested cosine annealing with a warmup of 10 steps, followed by a slow decay to a constant rate. That constant rate should be fairly low. We saw the best performance with 9e-7 and 5-e7.
205
+
* ref_policy_kl_penalty: We generally saw better performance with lower values of 0.1, 0.2, 0.5, and 1.0. Occasionally, values as high as 5.0 worked too.
Copy file name to clipboardExpand all lines: docs/user-guide/draftp.rst
+15-13Lines changed: 15 additions & 13 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -2,18 +2,20 @@
2
2
3
3
.. _model-aligner-draftp:
4
4
5
-
Fine-tuning Stable Diffusion with DRaFT+
5
+
Fine-Tuning Stable Diffusion with DRaFT+
6
6
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
7
7
8
-
In this tutorial, we will go through the step-by-step guide for fine-tuning Stable Diffusion model using DRaFT+ algorithm by NVIDIA.
9
-
DRaFT+ is an improvement over the `DRaFT <https://arxiv.org/pdf/2309.17400.pdf>`__ algorithm by alleviating the mode collapse and improving diversity through regularization.
10
-
For more technical details on the DRaFT+ algorithm, check out our technical blog.
8
+
.. note::
9
+
Before starting this tutorial, be sure to review the :ref:`introduction <model-aligner-intro>` for tips on setting up your NeMo-Aligner environment.
11
10
11
+
In this tutorial, we will go through the step-by-step guide for fine-tuning a Stable Diffusion model using DRaFT+ algorithm by NVIDIA.
12
+
DRaFT+ enhances the DRaFT `DRaFT <https://arxiv.org/pdf/2309.17400.pdf>`__ algorithm by mitigating mode collapse and improving diversity through regularization.
13
+
For more technical details on the DRaFT+ algorithm, check out our technical blog.
12
14
13
-
Data Input for running DRaFT+
15
+
Data Input for Running DRaFT+
14
16
#############################
15
17
16
-
The data for running DRaFT+ should be a ``.tar`` file consisting of a plain prompt. You can generate a tarfile from a ``.txt``
18
+
The data for running DRaFT+ should be a ``.tar`` file consisting of a plain prompt. You can generate a tar file from a ``.txt``
17
19
file containing the prompts separated by new lines, such as following format::
18
20
19
21
prompt1
@@ -35,7 +37,7 @@ Use the following script to download and save the prompts from the `Pick a pic <
35
37
forcaptionin captions:
36
38
file.write(caption + '\n')
37
39
38
-
You can then run the following snipet to convert it to a ``.tar`` file:
40
+
You can then run the following snippet to convert it to a ``.tar`` file:
39
41
40
42
.. code-block:: bash
41
43
@@ -64,8 +66,8 @@ you can use the `conversion script <https://github.com/NVIDIA/NeMo/blob/main/exa
64
66
DRaFT+ Training
65
67
###############
66
68
67
-
To launch reward model training, you must have checkpoints for`UNet <https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/main/unet>`__ and
68
-
`VAE <https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/main/vae>`__ of a trained Stable Diffusion model and a checkpoint for the Reward Model.
69
+
To start reward model training, you need checkpoints forboth the `UNet <https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/main/unet>`__ and
70
+
`VAE <https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/main/vae>`__ components of a trained Stable Diffusion model, as well as a checkpoint for the Reward Model.
69
71
70
72
.. tab-set::
71
73
@@ -167,7 +169,7 @@ To launch reward model training, you must have checkpoints for `UNet <https://hu
167
169
168
170
169
171
.. note::
170
-
For more info on DRaFT+ hyperparameters please see the model config files (for SD and SDXL respectively):
172
+
For more information on DRaFT+ hyperparameters, please see the model config files (for SD and SDXL respectively):
@@ -179,10 +181,10 @@ Once you have completed fine-tuning Stable Diffusion with DRaFT+, you can run in
179
181
and `sd_lora_infer.py <https://github.com/NVIDIA/NeMo/blob/main/examples/multimodal/text_to_image/stable_diffusion/sd_lora_infer.py>`__ scripts from the NeMo codebase. The generated images with the fine-tuned model should have
180
182
better prompt alignment and aesthetic quality.
181
183
182
-
Usercontrollable finetuning with Annealed Importance Guidance (AIG)
AIG provides the inference-time flexibility to interpolate between the base Stable Diffusion model (with low rewards and high diversity) and DRaFT-finetuned model (with high rewards and low diversity) to obtain images with high rewards and high diversity. AIG inference is easily done by specifying comma-separated `weight_type` strategies to interpolate between the base and finetuned model.
187
+
AIG provides the inference-time flexibility to interpolate between the base Stable Diffusion model (with low rewards and high diversity) and a DRaFT+ fine-tuned model (with high rewards and low diversity) to obtain images with high rewards and high diversity. AIG inference is easily done by specifying comma-separated ``weight_type`` strategies to interpolate between the base and fine-tuned model.
NeMo-Aligner is a scalable toolkit for efficient model alignment. The toolkit has support for state-of-the-art model alignment algorithms such as SteerLM, Direct Preference Optimization (DPO), and Reinforcement Learning from Human Feedback (RLHF). These algorithms enable users to align language models to be more safe, harmless, and helpful. Users can perform end-to-end model alignment on a wide range of model sizes and take advantage of all the parallelism techniques to ensure their model alignment is done in a performant and resource-efficient manner. For more technical details, please refer to our `paper <https://arxiv.org/abs/2405.01481>`__.
11
+
12
+
The NeMo-Aligner toolkit is built using the `NeMo Toolkit <https://github.com/NVIDIA/NeMo>`__ which allows for scaling training up to 1000s of GPUs using tensor, data and pipeline parallelism for all components of alignment. All of our checkpoints are cross-compatible with the NeMo ecosystem, allowing for inference deployment and further customization.
13
+
14
+
The toolkit is currently in its early stages. We are committed to improving the toolkit to make it easier for developers to pick and choose different alignment algorithms to build safe, helpful, and reliable models.
15
+
16
+
Get Started
17
+
###########
18
+
19
+
NeMo-Aligner comes preinstalled in NVIDIA NeMo containers. NeMo containers are launched concurrently with NeMo version updates.
20
+
21
+
To get access to the container, log in to the NVIDIA GPU Cloud (NGC) platform or create a free NGC account here: `NVIDIA NGC <https://ngc.nvidia.com/signin>`__. Once you have logged in, you can get the container here: `NVIDIA NGC NeMo Framework <https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo>`__.
22
+
23
+
To use a pre-built container, run the following code:
24
+
25
+
.. code-block:: bash
26
+
27
+
docker run -it --gpus=all --shm-size=8g --workdir /opt/NeMo-Aligner nvcr.io/nvidia/nemo:24.09
28
+
29
+
Please use the latest tag in the form yy.mm.(patch).
30
+
31
+
.. note::
32
+
Some of the subsequent tutorials require accessing gated Hugging Face models. For details on how to access these models, refer to ``this document <https://docs.nvidia.com/nemo-framework/user-guide//latest/generaltips.html#working-with-hugging-face-models>``__.
0 commit comments