Draft
25 commits
53b077a
add support for data_dir_list in [num_samples, path] form
rnyak May 13, 2026
05c7253
fix checkpoint_dir error msg
rnyak May 18, 2026
78030f6
revert back the checkpoint_dir error msg
rnyak May 18, 2026
4672f87
use dict sampled-source form
rnyak May 18, 2026
853b404
docs(retrieval): add runnable fine-tuning guide
oliverholworthy May 22, 2026
b9c5245
docs(retrieval): expand fine-tuning workflow
oliverholworthy May 22, 2026
d8da530
docs(retrieval): polish retrieval guide references
oliverholworthy May 22, 2026
99cbce9
docs(retrieval): clarify dataset schemas
oliverholworthy May 22, 2026
5a9f27d
docs(retrieval): document runtime caveats
oliverholworthy May 22, 2026
7ee29ca
docs(retrieval): address review followups
oliverholworthy May 22, 2026
c42783e
docs(retrieval): tighten mining guidance
oliverholworthy May 22, 2026
441d64b
docs(retrieval): resolve final review gaps
oliverholworthy May 22, 2026
b6b43b3
docs(retrieval): polish final review feedback
oliverholworthy May 22, 2026
f4cdc9d
docs(retrieval): align final mining examples
oliverholworthy May 22, 2026
1901c16
docs(retrieval): address persona review gaps
oliverholworthy May 22, 2026
d38ffb3
docs(retrieval): close final review nits
oliverholworthy May 22, 2026
7fbec0b
fix(retrieval): load saved encoder exports
oliverholworthy May 22, 2026
076515c
fix(retrieval): preserve encoder metadata
oliverholworthy May 22, 2026
62a9ed3
fix(retrieval): harden mining workflow
oliverholworthy May 22, 2026
95eb041
fix(retrieval): validate mining cache reuse
oliverholworthy May 22, 2026
1066da6
fix(retrieval): harden cache identity and mining handoff
oliverholworthy May 22, 2026
c298554
fix(retrieval): recover from stale mining caches
oliverholworthy May 22, 2026
46aeba1
fix(retrieval): validate partial mining caches
oliverholworthy May 22, 2026
f983d55
fix(retrieval): close mining review gaps
oliverholworthy May 22, 2026
5a41c64
fix(retrieval): harden mining output handoff
oliverholworthy May 22, 2026
10 changes: 5 additions & 5 deletions AGENTS.md
Expand Up @@ -51,10 +51,10 @@ hand.
## Architecture Overview

```
automodel <command> <domain> -c <config.yaml>
automodel <config.yaml> [--nproc-per-node N] [--key.subkey value]
|
v
_cli/app.py -- routes command + domain to recipe scripts
cli/app.py -- loads YAML config and instantiates the configured recipe
|
v
recipes/ -- main training / eval entry points
Expand Down Expand Up @@ -86,9 +86,9 @@ _diffusers/ -- diffusion pipeline wrapper

### Entry Point

`_cli/app.py` parses `automodel <command> <domain>` and dispatches to the
matching recipe script. The `-c` flag points to a YAML config that drives all
component construction.
`cli/app.py` parses `automodel <config.yaml>`, loads the YAML config, and
instantiates the configured recipe. CLI overrides such as `--model.name value`
are applied to the config before construction.
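
The override step described above can be pictured with a small sketch. It is not the code in `cli/app.py`; `parse_overrides` and `apply_override` are hypothetical helpers, the config dict is a stand-in for a loaded YAML file, and values are kept as plain strings (the real CLI may convert types).

```python
from typing import Any


def parse_overrides(argv: list[str]) -> list[tuple[str, str]]:
    """Pair ['--model.name', 'value', ...] into [('model.name', 'value'), ...]."""
    if len(argv) % 2 != 0:
        raise ValueError("each --key must be followed by a value")
    pairs = []
    for flag, value in zip(argv[0::2], argv[1::2]):
        if not flag.startswith("--"):
            raise ValueError(f"expected --key.subkey, got {flag!r}")
        pairs.append((flag[2:], value))
    return pairs


def apply_override(config: dict[str, Any], dotted_key: str, value: Any) -> None:
    """Set the nested entry named by 'a.b.c', creating intermediate dicts as needed."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


# Stand-in for a loaded YAML config; the keys are illustrative only.
config = {"model": {"name": "from-yaml"}}
for key, value in parse_overrides(["--model.name", "value"]):
    apply_override(config, key, value)
```

Applying overrides to the plain config before any object is built keeps construction driven by a single source of truth.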

### Recipes

Expand Down
67 changes: 29 additions & 38 deletions docs/guides/dataset-overview.md
Expand Up @@ -2,7 +2,7 @@

This page summarizes the datasets supported in NeMo AutoModel for LLM, VLM, and retrieval training and shows how to plug in your own datasets using Python functions or the YAML `_target_` mechanism.

- See also: [LLM datasets](llm/dataset.md), [VLM datasets](vlm/dataset.md), and [Retrieval dataset](llm/retrieval-dataset.md) for deeper, task-specific guides.
- See also: [LLM datasets](llm/dataset.md), [VLM datasets](vlm/dataset.md), [Retrieval dataset](llm/retrieval-dataset.md), and [Retrieval fine-tuning](llm/retrieval-finetuning.md) for deeper, task-specific guides.

- If a dataset you need is missing, please open a [GitHub issue](https://github.com/NVIDIA-NeMo/Automodel/issues) with a short description and example schema so we can prioritize support.
---
Expand Down Expand Up @@ -62,27 +62,6 @@ dataset:
```
See the detailed guide, [Column-Mapped Text Instruction Dataset](llm/column-mapped-text-instruction-dataset.md), for more information.

- **ChatDataset (multi-turn conversations and tool calling)**
- Class: `nemo_automodel.components.datasets.llm.ChatDataset`
- Use case: multi-turn conversations and tool calling in OpenAI chat format
- Sources: local JSON/JSONL or Hugging Face Hub dataset ID
- Key args:
- `path_or_dataset_id`: path to local file(s) or HuggingFace dataset ID
- `tokenizer`: tokenizer instance (required. Must have chat template support)
- `split`: dataset split (e.g., "train", "validation")
- `name`: dataset configuration/subset name
- `seq_length`: maximum sequence length for padding/truncation
- `padding`: padding strategy ("do_not_pad", "max_length", etc.)
- `truncation`: truncation strategy ("do_not_truncate", "longest_first", etc.)
- `start_of_turn_token`: token marking assistant response start (for answer-only loss)
- `chat_template`: optional override for tokenizer's chat template
- `skip_invalid_samples`: if ``true``, skip malformed JSONL lines when reading local files (warnings log skip counts); default ``false`` fails fast on a bad line
- Notes:
- Requires a tokenizer with chat template support
- Supports both single-turn and multi-turn tool calling
- Tool definitions are provided in a `tools` field at the conversation level
- Tool calls appear in assistant messages via `tool_calls` field
- Tool responses use the `tool` role
### ChatDataset (Multi-Turn Conversations and Tool Calling)
- Class: `nemo_automodel.components.datasets.llm.ChatDataset`
- Use case: multi-turn conversations and tool calling in OpenAI chat format
Expand Down Expand Up @@ -237,26 +216,38 @@ dataset:
See the [Function Calling guide](llm/toolcalling.md) for an end-to-end example with FunctionGemma.
For a small reasoning-style chat SFT starting point, see [qwen2_5_0p5b_instruct_fineproofs_chat.yaml](../../examples/llm_finetune/qwen/qwen2_5_0p5b_instruct_fineproofs_chat.yaml).

### Retrieval (Embedding Fine-Tuning)
- Factory: `nemo_automodel.components.datasets.llm.make_retrieval_dataset`
- Collator: `nemo_automodel.components.datasets.llm.BiEncoderCollator`
- Use case: embedding model fine-tuning with (query, positive doc, negative docs) contrastive learning
- Supported schemas:
### Retrieval Fine-Tuning
- Factory for corpus ID JSON and `hf://` AutoModel-schema sources:
`nemo_automodel.components.datasets.llm.make_retrieval_dataset`
- Factory for inline JSONL:
`nemo_automodel.components.datasets.llm.retrieval_dataset_inline.make_retrieval_dataset`
- Collators: `nemo_automodel.components.datasets.llm.BiEncoderCollator` or
`nemo_automodel.components.datasets.llm.CrossEncoderCollator`
- Use case: bi-encoder embedding fine-tuning with contrastive query/passage groups, or cross-encoder reranking over
query/passage pairs
- Supported retrieval sources:
- Corpus-ID JSON (Merlin/NeMo-retriever style)
- `hf://` sources that already follow the AutoModel retrieval schema
- Inline-text JSONL (e.g., `{"query": "...", "pos_doc": "...", "neg_doc": ["...", "..."]}`)
- Example YAML:
- Inline JSONL example:
```yaml
dataset:
_target_: nemo_automodel.components.datasets.llm.make_retrieval_dataset
data_dir_list: /abs/path/to/train.jsonl
data_type: train
n_passages: 5
collate_fn:
_target_: nemo_automodel.components.datasets.llm.BiEncoderCollator
q_max_len: 512
p_max_len: 512
dataloader:
_target_: torchdata.stateful_dataloader.StatefulDataLoader
dataset:
_target_: nemo_automodel.components.datasets.llm.retrieval_dataset_inline.make_retrieval_dataset
model_type: bi_encoder
data_dir_list:
- /abs/path/to/train.jsonl
data_type: train
n_passages: 5
collate_fn:
_target_: nemo_automodel.components.datasets.llm.BiEncoderCollator
q_max_len: 512
p_max_len: 512
batch_size: 2
shuffle: true
```
See the detailed guide, [Retrieval dataset](llm/retrieval-dataset.md), for more information.
See [Retrieval dataset](llm/retrieval-dataset.md) for schema details and [Retrieval fine-tuning](llm/retrieval-finetuning.md) for the end-to-end workflow.

### NanoGPT Binary Shards (Pretraining)
- Class: `nemo_automodel.components.datasets.llm.nanogpt_dataset.NanogptDataset`
Expand Down
165 changes: 152 additions & 13 deletions docs/guides/llm/retrieval-dataset.md
@@ -1,21 +1,64 @@
# Retrieval Dataset (Embedding Fine-tuning)
# Retrieval Dataset for Bi-Encoders and Cross-Encoders

NeMo Automodel supports **retrieval model fine-tuning** using a retrieval-style dataset: each training example is a **query** paired with **one positive** document and **one or more negative** documents.
NeMo AutoModel supports **retrieval model fine-tuning** using a retrieval-style dataset: each training example is a
**query** paired with **one positive** document and **one or more negative** documents.

This dataset is used by the retrieval recipes (see `examples/retrieval/bi_encoder/` and `examples/retrieval/cross_encoder/`) together with the `BiEncoderCollator`.
This dataset is used by the retrieval recipes (see `examples/retrieval/bi_encoder/` and
`examples/retrieval/cross_encoder/`) together with the retrieval collators. For an end-to-end training workflow, see
[Retrieval Fine-Tuning](retrieval-finetuning.md).

## What the Bi-Encoder Consumes
## Raw Records and Runtime Schemas

The dataset factory `nemo_automodel.components.datasets.llm.make_retrieval_dataset` returns a Hugging Face `datasets.Dataset`. At runtime it transforms each raw record into the training-time schema:
The dataset factory `nemo_automodel.components.datasets.llm.make_retrieval_dataset` returns a Hugging Face
`datasets.Dataset`. At runtime it transforms each raw record into the training-time schema:

- `question`: query string
- `doc_text`: list of document texts in the order `[positive, negative_1, negative_2, ...]`
- `doc_image`: list of images (or empty strings), aligned with `doc_text`
- `query_instruction` / `passage_instruction`: optional, used when `use_dataset_instruction: true` and the corpus provides instructions via metadata
- `doc_id`: list of document IDs aligned with `doc_text` for corpus-backed and `hf://` sources. Pure inline JSONL
does not provide real document IDs for duplicate-document masking unless you add them in a custom preprocessing path.
- `query_instruction` / `passage_instruction`: optional, used when `use_dataset_instruction: true` and the corpus
provides instructions via metadata
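
As a rough illustration of the field layout listed above, the following sketch builds a training-time example from one inline-text record. `to_runtime_example` is a hypothetical helper, not the library's transform, and it assumes the first `n_passages - 1` negatives are taken in order; the real factory may sample negatives differently.

```python
def to_runtime_example(record: dict, n_passages: int) -> dict:
    """Build the training-time fields from one inline-text record."""
    if not record["pos_doc"]:
        raise ValueError("pos_doc must be non-empty")
    positive = record["pos_doc"][0]  # only the first positive is used
    # Assumption: take the first n_passages - 1 negatives; real sampling may differ.
    negatives = record["neg_doc"][: n_passages - 1]
    doc_text = [positive, *negatives]
    return {
        "question": record["question"],
        "doc_text": doc_text,  # [positive, negative_1, negative_2, ...]
        "doc_image": [""] * len(doc_text),  # text-only: empty strings aligned with doc_text
    }


example = to_runtime_example(
    {"question": "q", "pos_doc": ["p1", "p2"], "neg_doc": ["n1", "n2", "n3"]},
    n_passages=3,
)
```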

The cross-encoder recipe consumes the same raw retrieval records, but sets `model_type: cross_encoder`. Its dataset
transform flattens each query with its positive and negative passages, and `CrossEncoderCollator` serializes each
query-passage pair for reranking.
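
A minimal sketch of the flattening idea, assuming the prompt template shown in the cross-encoder YAML example and a binary label (1 for the positive, 0 for negatives). `flatten_for_reranking` and the `label` field are illustrative assumptions, not the actual dataset transform or `CrossEncoderCollator` output.

```python
PROMPT_TEMPLATE = "question:{query} \n \n passage:{passage}"


def flatten_for_reranking(example: dict) -> list[dict]:
    """Turn one query with [positive, negatives...] into one row per query-passage pair."""
    pairs = []
    for index, passage in enumerate(example["doc_text"]):
        pairs.append(
            {
                "text": PROMPT_TEMPLATE.format(query=example["question"], passage=passage),
                # Assumption: the first passage is the positive, the rest are negatives.
                "label": 1 if index == 0 else 0,
            }
        )
    return pairs


pairs = flatten_for_reranking({"question": "q", "doc_text": ["p", "n"]})
```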

Training uses exactly one positive passage per example: the first item in `pos_doc`. For datasets with multiple
relevant passages, either choose a canonical positive, expand the record into one example per positive, or add a
multi-positive loss/masking strategy before training. If expanded rows for the same query can share a batch, keep
distributed in-batch negatives disabled unless you also prevent sibling positives from becoming negatives through
qrels-aware sampling or masking. Keep the full set of known positives in your qrels or corpus metadata for evaluation
and false-negative filtering, even when each training row uses one positive.
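
One way to expand a multi-positive record into one training row per positive is sketched below. `expand_positives` is a hypothetical preprocessing helper, and the `__pos<index>` suffix is only an assumed convention for keeping `question_id` values unique.

```python
def expand_positives(record: dict) -> list[dict]:
    """Return one copy of the record per positive, each with a single-item pos_doc."""
    rows = []
    for index, positive in enumerate(record["pos_doc"]):
        row = dict(record)  # shallow copy; other fields are shared unchanged
        # Assumed naming scheme so expanded rows keep distinct IDs.
        row["question_id"] = f'{record["question_id"]}__pos{index}'
        row["pos_doc"] = [positive]
        rows.append(row)
    return rows


rows = expand_positives(
    {"question_id": "q1", "question": "q", "pos_doc": ["d1", "d2"], "neg_doc": ["d9"]}
)
```

Rows produced this way are sibling positives of each other, so the batching caveat above still applies.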

## Supported Input Formats

NeMo Automodel supports **two** input schemas:
NeMo AutoModel supports **two** input schemas across three source types. They use different dataset factories:

- Use `nemo_automodel.components.datasets.llm.make_retrieval_dataset` for corpus ID-based JSON and `hf://` sources.
- Use `nemo_automodel.components.datasets.llm.retrieval_dataset_inline.make_retrieval_dataset` for inline JSONL where
document text is stored directly in each record.

| Source | Query field | Required document fields | Best for |
|--------|-------------|--------------------------|----------|
| Corpus ID JSON | `question` | `question_id`, `corpus_id`, `pos_doc`, and `neg_doc` IDs resolved through a local corpus | Production data, hard-negative mining, same-document masking |
| `hf://` AutoModel schema | `question` | `pos_doc`, a companion HF corpus split, and `neg_doc` before training with `n_passages > 1` | Tutorial runs and shared AutoModel retrieval datasets |
| Inline JSONL | `query` or `question` | Inline text in `pos_doc` and `neg_doc` | Small custom runs when you do not need mining or document-ID masking |

Separate qrels files are not consumed directly by the training dataset factory. Convert qrels-style data into retrieval
records before training:

1. Put every passage in a corpus split with stable `id` and `text` values.
2. For each query, write one or more training records with `question_id`, `question`, `corpus_id`, `pos_doc`, and
`neg_doc`. Use unique `question_id` values within each mining file; hard-negative mining writes results back by ID.
3. For training, use the first relevant document in each record as `pos_doc[0]`; expand multi-positive queries into
multiple records if you want every positive to become a supervised positive.
4. For hard-negative mining, include all known positive document IDs for that query in the row's `pos_doc`. The miner
excludes only IDs present in the input row, not an external qrels file.
5. If you expand one query into multiple positive rows, keep those sibling-positive rows out of the same in-batch-negative
training batch or use qrels-aware masking.
6. Preserve the complete qrels separately for full-corpus evaluation and audit mined negatives against them before
reusing the output for training.
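
The conversion steps above can be sketched as follows. This assumes qrels are already loaded as a mapping from query ID to a set of relevant document IDs; `build_mining_records` and `find_false_negatives` are hypothetical helpers, and leaving `neg_doc` empty in the mining input is an assumption about how the miner is fed.

```python
def build_mining_records(queries: dict, qrels: dict, corpus_id: str) -> list[dict]:
    """Write one record per query, listing every known positive ID in pos_doc."""
    records = []
    for question_id, question in sorted(queries.items()):
        positives = sorted(qrels.get(question_id, ()))
        if not positives:
            continue  # skip queries with no known relevant passage
        records.append(
            {
                "question_id": question_id,
                "question": question,
                "corpus_id": corpus_id,
                "pos_doc": [{"id": doc_id} for doc_id in positives],
                "neg_doc": [],  # assumption: filled in later by hard-negative mining
            }
        )
    return records


def find_false_negatives(record: dict, qrels: dict) -> list:
    """Return mined negative IDs that the qrels mark as relevant for this query."""
    known = qrels.get(record["question_id"], set())
    return [doc["id"] for doc in record["neg_doc"] if doc["id"] in known]


qrels = {"q1": {"d1", "d3"}}
records = build_mining_records({"q1": "what is x?", "q2": "no positives"}, qrels, "wiki_corpus")
```

Running `find_false_negatives` over mined output is one simple form of the audit in step 6.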

### Corpus ID-Based JSON (Merlin/NeMo-Retriever Style)

Expand All @@ -42,19 +85,56 @@ This is the format used by NeMo retriever pipelines where documents live in a se

**Corpus requirements**

Each corpus directory must contain a `merlin_metadata.json` file.
Each corpus directory must contain a `merlin_metadata.json` file and a Hugging Face-loadable `train` split with at least
`id` and `text` columns. For `class: TextQADataset`, AutoModel calls `datasets.load_dataset(<corpus path>)["train"]`,
then resolves `pos_doc` and `neg_doc` IDs against that split.

Minimal example:

```json
{ "class": "TextQADataset", "corpus_id": "wiki_corpus" }
```

Minimal local layout:

```text
retrieval-data/
train.json
wiki_corpus/
merlin_metadata.json
train.parquet # or another load_dataset-compatible train split with id,text columns
```

The `corpus_id` in `merlin_metadata.json` must match the `corpus_id` in each training record. Relative corpus paths in
`train.json` are resolved relative to the JSON file.

:::{note}
- `pos_doc` and `neg_doc` can be lists of `{"id": ...}` dicts or raw IDs (they are normalized internally).
- If you set `use_dataset_instruction: true`, optional fields like `query_instruction` and `passage_instruction` in `merlin_metadata.json` are surfaced to the collator.
- Training uses `pos_doc[0]` as the positive. Additional positives are ignored unless you expand the data before
training.
- To train with corpus instructions, set `use_dataset_instruction: true` on both the dataset and the bi-encoder
collator. The dataset surfaces `query_instruction` and `passage_instruction` from `merlin_metadata.json`; the collator
prepends them before tokenization.
:::

### Hugging Face `hf://` Sources

Direct `hf://` loading expects the AutoModel retrieval schema, not arbitrary Hugging Face retrieval datasets. The URI
format is:

```text
hf://<org>/<repo>/<subset>
```

Each subset must provide:

- `<subset>/dataset_metadata.json` with `corpus_id` metadata and `ids_only: false`
- a `<subset>_corpus` train split with `id` and `text` columns
- a `<subset>` train split with `question` and `pos_doc`; `neg_doc` may be absent but must be available before training
with `n_passages > 1`

Datasets with BEIR, DPR, MS MARCO, MIRACL, or other layouts need a preprocessing step before direct `hf://` loading.
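
To make the URI layout concrete, here is a small parsing sketch. `parse_hf_uri` and `HfSource` are hypothetical names, not part of AutoModel; the derived names only restate the `<subset>/dataset_metadata.json` and `<subset>_corpus` conventions listed above.

```python
from typing import NamedTuple


class HfSource(NamedTuple):
    repo_id: str
    subset: str
    metadata_path: str
    corpus_name: str


def parse_hf_uri(uri: str) -> HfSource:
    """Split hf://<org>/<repo>/<subset> into its repository ID and subset names."""
    prefix = "hf://"
    if not uri.startswith(prefix):
        raise ValueError(f"expected an hf:// URI, got {uri!r}")
    parts = uri[len(prefix):].split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected hf://<org>/<repo>/<subset>")
    org, repo, subset = parts
    return HfSource(
        repo_id=f"{org}/{repo}",
        subset=subset,
        metadata_path=f"{subset}/dataset_metadata.json",
        corpus_name=f"{subset}_corpus",
    )


source = parse_hf_uri("hf://nvidia/embed-nemotron-dataset-v1/FEVER")
```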

### Inline-Text JSONL (No Corpus Required)

This is convenient for custom fine-tuning pipelines where the documents are included **inline**.
Expand All @@ -71,22 +151,25 @@ This is convenient for custom fine-tuning pipelines where the documents are incl
- `pos_doc` and `neg_doc` can be either:
- strings (interpreted as document text), or
- lists of strings, or
- dicts with at least `text` (optionally `image`, `nr_ocr`) for multimodal use cases.
- dicts with at least `text`.
- The current LLM retrieval collators tokenize text only. Do not rely on inline `image` or OCR fields unless you add a
custom preprocessing and collator path.
- If `corpus_id` is not provided, it defaults to `__inline__`.
- `use_dataset_instruction: true` has no effect for pure inline records (instructions come from corpus metadata).
:::
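
The accepted inline shapes can be illustrated with a normalization sketch. `normalize_docs` and `normalize_inline_record` are hypothetical helpers that mirror the notes above; accepting a single dict or a list of dicts is an assumption about how "dicts with at least `text`" is interpreted.

```python
def normalize_docs(value) -> list[str]:
    """Reduce a string, a dict with 'text', or a list of either to a list of texts."""
    if isinstance(value, (str, dict)):
        value = [value]
    texts = []
    for item in value:
        if isinstance(item, str):
            texts.append(item)
        elif isinstance(item, dict) and "text" in item:
            texts.append(item["text"])  # other keys (for example image) are ignored here
        else:
            raise ValueError(f"unsupported document entry: {item!r}")
    return texts


def normalize_inline_record(record: dict) -> dict:
    """Apply the query/question alias and the __inline__ corpus_id default."""
    question = record["query"] if "query" in record else record["question"]
    return {
        "question": question,
        "corpus_id": record.get("corpus_id", "__inline__"),
        "pos_doc": normalize_docs(record["pos_doc"]),
        "neg_doc": normalize_docs(record["neg_doc"]),
    }


normalized = normalize_inline_record(
    {"query": "q", "pos_doc": "p", "neg_doc": ["n1", {"text": "n2"}]}
)
```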

## YAML Usage (Dataset + Collator)

Use the dataset factory plus the bi-encoder collator:
Use the corpus/HF dataset factory plus the bi-encoder collator for corpus ID-based JSON or `hf://` sources:

```yaml
dataloader:
_target_: torchdata.stateful_dataloader.StatefulDataLoader
dataset:
_target_: nemo_automodel.components.datasets.llm.make_retrieval_dataset
model_type: bi_encoder
data_dir_list:
- /abs/path/to/train.jsonl # or train.json (corpus-id format)
- /abs/path/to/train.json # or hf://nvidia/embed-nemotron-dataset-v1/FEVER
data_type: train
n_passages: 5 # 1 positive + 4 negatives
do_shuffle: true
Expand All @@ -97,10 +180,66 @@ dataloader:
p_max_len: 512
query_prefix: "query:"
passage_prefix: "passage:"
use_dataset_instruction: false
pad_to_multiple_of: 8
```

For corpus ID JSON and `hf://` sources, `do_shuffle: true` takes effect only when `max_train_samples` is set: the rows are
shuffled before subsampling. Training order is controlled by the dataloader or distributed sampler. For inline JSONL,
`do_shuffle: true` shuffles the loaded rows directly.

Use the inline dataset factory for inline JSONL:

```yaml
dataloader:
_target_: torchdata.stateful_dataloader.StatefulDataLoader
dataset:
_target_: nemo_automodel.components.datasets.llm.retrieval_dataset_inline.make_retrieval_dataset
model_type: bi_encoder
data_dir_list:
- /abs/path/to/train.jsonl
data_type: train
n_passages: 5 # 1 positive + 4 negatives
do_shuffle: true
collate_fn:
_target_: nemo_automodel.components.datasets.llm.BiEncoderCollator
q_max_len: 512
p_max_len: 512
query_prefix: "query:"
passage_prefix: "passage:"
pad_to_multiple_of: 8
```

For cross-encoder training, keep the dataset factory that matches your source, set `model_type: cross_encoder`, and use
`CrossEncoderCollator` arguments. The example below uses the inline JSONL factory:

```yaml
dataloader:
_target_: torchdata.stateful_dataloader.StatefulDataLoader
dataset:
_target_: nemo_automodel.components.datasets.llm.retrieval_dataset_inline.make_retrieval_dataset
model_type: cross_encoder
data_dir_list:
- /abs/path/to/train.jsonl
data_type: train
n_passages: 5
do_shuffle: true
collate_fn:
_target_: nemo_automodel.components.datasets.llm.CrossEncoderCollator
rerank_max_length: 512
prompt_template: "question:{query} \n \n passage:{passage}"
pad_to_multiple_of: 8
```

## Requirements

- `pos_doc` must be **non-empty**.
- If training requests negatives (e.g., `n_passages > 1`), `neg_doc` must contain **at least one** document.
- `neg_doc` must be present in local JSON and JSONL training records. It may be empty only when `n_passages: 1`.
- `hf://` sources may omit `neg_doc` in the source dataset, but add negatives before training with `n_passages > 1`.
- If training requests negatives (e.g., `n_passages > 1`), `neg_doc` must contain **at least one** document.

:::{warning}
`n_passages: 1` is a schema escape hatch, not a good default training setup. The standard bi-encoder and cross-encoder
recipes need at least one negative candidate for meaningful contrastive or reranking supervision, unless you add a
custom negative strategy such as qrels-aware in-batch negatives.
:::
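
A pre-flight check for these requirements might look like the sketch below. `validate_record` is a hypothetical helper for use before training; it is not the validation AutoModel performs internally, and the error messages are illustrative.

```python
def validate_record(record: dict, n_passages: int) -> list[str]:
    """Return a list of requirement violations for one local JSON/JSONL record."""
    errors = []
    if not record.get("pos_doc"):
        errors.append("pos_doc must be non-empty")
    if "neg_doc" not in record:
        errors.append("neg_doc must be present")
    elif n_passages > 1 and not record["neg_doc"]:
        errors.append("neg_doc must contain at least one document when n_passages > 1")
    return errors


ok = validate_record({"pos_doc": ["p"], "neg_doc": ["n"]}, n_passages=5)
missing_negatives = validate_record({"pos_doc": ["p"], "neg_doc": []}, n_passages=5)
```

Collecting messages instead of raising on the first problem makes it easier to report every bad row in a file at once.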