feat: split validation statistics by task name #2019

Open

yuki-97 wants to merge 14 commits into main from yukih/validation-task-name

Conversation


@yuki-97 yuki-97 commented Feb 24, 2026

  1. Support splitting validation statistics by task name in SFT/GRPO/Distillation by using multiple validation dataloaders, as RM/DPO already do.
  2. Support setting a custom task name, which is useful for saving checkpoints based on a specific dataset's validation statistic (e.g., accuracy, val_loss).
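For illustration, here is how the metric keys compose under the new scheme (a minimal sketch; the task name "MathVal" is made up, while the "val:" prefix and the suffix rules come from this PR's docs changes):

    # Illustrative only: metric-key composition for checkpointing.
    # "MathVal" is a hypothetical task_name set in the data config.
    task_name = "MathVal"
    aggregate_metric = "val:val_loss"                 # aggregate over all tasks
    per_task_loss = f"val:val_loss_{task_name}"       # -> "val:val_loss_MathVal"
    per_task_accuracy = f"val:accuracy_{task_name}"   # -> "val:accuracy_MathVal"
    # checkpointing.metric_name may be set to any of these keys.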

Summary by CodeRabbit

  • New Features

    • Multi-task validation support with per-dataset accuracy and loss tracking for checkpointing
    • Per-dataset task naming for validation metrics (e.g., val:accuracy_<TaskName>)
  • Documentation

    • Added guidance for configuring dataset-specific validation metrics in training configs
    • Included YAML examples for multi-dataset validation setups
  • Tests

    • Updated validation metric checks to include per-task accuracy metrics

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Feb 24, 2026
@yuki-97 yuki-97 marked this pull request as ready for review February 24, 2026 15:31
@yuki-97 yuki-97 requested review from a team as code owners February 24, 2026 15:31
@yuki-97 yuki-97 added the CI:L1 Run doctests, unit tests, and functional tests label Feb 24, 2026
@yuki-97 yuki-97 requested a review from terrykong February 24, 2026 15:32

coderabbitai bot commented Feb 24, 2026

📝 Walkthrough

This PR extends validation support from single merged datasets to per-task validation dictionaries across training algorithms. It updates data setup, algorithm training loops, dataset initialization patterns, configuration examples, and checkpointing logic to handle per-dataset validation metrics and enable dataset-specific checkpoint triggers.

Changes

  • Documentation & Guidance (docs/guides/grpo.md, docs/guides/sft.md): Added guidance on using dataset-specific validation metrics for checkpointing via the task_name config and metric_name set to val:accuracy_<TaskName> or val:val_loss_<TaskName>.
  • Configuration Examples (examples/configs/distillation_math.yaml, grpo_math_1B.yaml, sft.yaml): Updated config examples with comments explaining per-dataset metric configuration and added new checkpointing guidance blocks for multi-dataset setups.
  • Example Training Script (examples/run_sft.py): Refactored validation dataset collection from list-based to dict-based (keyed by task_name); changed the setup_data return type from a single val_dataset to dict[str, AllTaskProcessedDataset].
  • Algorithm: Data Setup & Validation (nemo_rl/algorithms/sft.py, grpo.py, distillation.py): Updated setup signatures to accept val_dataset: dict[str, AllTaskProcessedDataset] and return val_dataloader: dict[str, StatefulDataLoader]; refactored validation loops to iterate per task, accumulating per-task metrics and computing totals.
  • Algorithm: Checkpoint & Metric Handling (nemo_rl/algorithms/dpo.py, rm.py): Removed the strict runtime assertion on metric_name format (train:/val: prefix); adjusted checkpointing logic to split metric_name without prior validation, with fallback handling for missing metrics.
  • Data Utilities & Dataset Base (nemo_rl/data/utils.py, nemo_rl/data/datasets/raw_dataset.py): Updated the setup_response_data return type to dict[str, AllTaskProcessedDataset] for validation; introduced a common_init method on RawDataset for centralized task initialization, removing the legacy set_processor and set_task_spec methods.
  • Dataset Registry & Loading (nemo_rl/data/datasets/preference_datasets/__init__.py, response_datasets/__init__.py): Added a skip_set_processor=True/False parameter when instantiating datasets; removed explicit set_task_spec and set_processor calls, delegating to common_init.
  • Response & Preference Dataset Classes (nemo_rl/data/datasets/response_datasets/*.py (13 files), nemo_rl/data/datasets/preference_datasets/*.py (5 files)): Replaced direct self.task_name assignment with self.common_init(default_task_name=..., **kwargs) calls; some datasets now add a task_name column post-load.
  • Checkpoint Validation (nemo_rl/utils/checkpoint.py): Added runtime validation in CheckpointManager.__init__ to enforce metric_name format (must start with "train:" or "val:" if provided).
  • Test Metric Checks (tests/functional/distillation.sh, grpo_multiple_datasets.sh, sft.sh): Extended metric validation checks to include per-dataset accuracy/loss metrics (e.g., validation/accuracy_<DatasetName>) in addition to aggregate metrics.
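The setup-signature change in the algorithms cohort is easiest to see as code. A minimal sketch with stand-in types (the real setup functions in nemo_rl/algorithms/*.py take many more arguments and build actual StatefulDataLoaders):

    from typing import Any, Iterable

    AllTaskProcessedDataset = Any  # stand-in for the real dataset type

    def make_dataloader(dataset: AllTaskProcessedDataset, batch_size: int) -> Iterable:
        """Stand-in for StatefulDataLoader construction (illustrative only)."""
        return iter(())

    def setup_val_dataloaders(
        val_dataset: dict[str, AllTaskProcessedDataset],  # keyed by task_name
        batch_size: int,
    ) -> dict[str, Iterable]:
        # One dataloader per validation task, replacing the single merged loader.
        return {name: make_dataloader(ds, batch_size) for name, ds in val_dataset.items()}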

Sequence Diagram(s)

sequenceDiagram
    participant Trainer
    participant DataLoader as Per-Task<br/>DataLoaders
    participant Validator
    participant Metrics as Per-Task<br/>Metrics
    participant Checkpoint as Checkpoint<br/>Manager

    rect rgba(0, 100, 200, 0.5)
    Note over Trainer,Checkpoint: Old Flow: Single Validation Dataset
    Trainer->>DataLoader: Load val_dataloader (merged)
    DataLoader->>Validator: Single batch stream
    Validator->>Metrics: Aggregate results
    Metrics->>Checkpoint: global_accuracy/loss
    Checkpoint->>Checkpoint: Save if metric improves
    end

    rect rgba(0, 150, 100, 0.5)
    Note over Trainer,Checkpoint: New Flow: Per-Task Validation Datasets
    Trainer->>DataLoader: Load val_dataloaders: dict[task_name]
    loop For each task in dict
        DataLoader->>Validator: Per-task batch stream
        Validator->>Metrics: Per-task results
    end
    Metrics->>Metrics: Accumulate per-task metrics
    Metrics->>Metrics: Compute total_accuracy/loss
    Checkpoint->>Checkpoint: Select metric by task_name<br/>(val:accuracy_Task1)
    Checkpoint->>Checkpoint: Save if metric improves
    end
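To make the new flow concrete, here is a minimal sketch of a per-task validation loop matching the diagram (hypothetical batch fields loss_sum and num_valid_tokens; the real loops live in nemo_rl/algorithms/sft.py, grpo.py, and distillation.py):

    def validate(val_dataloaders: dict[str, list[dict]]) -> dict[str, float]:
        metrics: dict[str, float] = {}
        total_loss, total_tokens = 0.0, 0
        for task_name, loader in val_dataloaders.items():
            task_loss, task_tokens = 0.0, 0
            for batch in loader:
                task_loss += batch["loss_sum"]
                task_tokens += batch["num_valid_tokens"]
            if task_tokens > 0:  # zero-token guard flagged later in the review
                metrics[f"val_loss_{task_name}"] = task_loss / task_tokens
            total_loss += task_loss
            total_tokens += task_tokens
        if total_tokens > 0:
            metrics["val_loss"] = total_loss / total_tokens  # aggregate metric
        return metrics

Note the guard on both the per-task and aggregate divisions; the review comments below flag exactly this divide-by-zero case.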

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • PR #1763: Continues the dataset split/refactor; both move validation from a single merged dataset to per-task dicts across setup_response_data, examples/run_sft.py, and the algorithm signatures.
  • PR #1649: Both introduce per-dataset task names and per-task validation datasets in setup functions, config structures, and checkpointing metric naming.
  • PR #1291: Both modify checkpointing metric_name parsing and handling of train:/val: prefixes in checkpoint logic.

Suggested labels

CI:L1

Suggested reviewers

  • terrykong
  • ashors1
  • yfw
🚥 Pre-merge checks | ✅ 3 passed | ❌ 1 warning

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 35.90%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The PR title 'feat: split validation statistics by task name' directly and clearly summarizes the main objective of the changeset, which is to enable per-task validation statistics tracking across multiple algorithms.
  • Test Results For Major Changes ✅ Passed: The PR includes validation accuracy metric comparison graphs and updated functional tests demonstrating that per-task validation statistics work correctly without regression.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
examples/run_sft.py (1)

104-121: ⚠️ Potential issue | 🟡 Minor

Silent overwrite when the same task_name appears in both validation sources.

If data.task_name at line 106 equals val_data.task_name at line 121 (a task that appears in both the train-split validation and the explicit validation: config), the train-split dataset is silently replaced. Add a warning:

⚠️ Proposed guard
+        if val_data.task_name in val_data_dict:
+            warnings.warn(
+                f"task_name '{val_data.task_name}' already exists in val_data_dict "
+                "(from train split). Overwriting with config-defined validation dataset."
+            )
         val_data_dict[val_data.task_name] = val_data.dataset
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/run_sft.py` around lines 104 - 121, The current logic populates
val_data_dict from two sources (the data_list loop using data.task_name and the
data_config["validation"] loop using val_data.task_name) and silently overwrites
entries when task names collide; update the code so before assigning into
val_data_dict in the second validation-source loop (after val_data =
load_response_dataset(cfg)) you check if val_data.task_name already exists in
val_data_dict and, if so, emit a warning (e.g., using logger.warning or print)
that the validation dataset for that task_name will be overwritten, including
which source is being overridden; ensure the check references val_data_dict,
val_data.task_name, and the loading path around load_response_dataset so the
warning is clear and only emitted on duplicates.
nemo_rl/algorithms/dpo.py (1)

663-682: ⚠️ Potential issue | 🟡 Minor

Same missing-colon validation as rm.py — cryptic ValueError for misconfigured metric_name.

Line 665 (prefix, metric_name = full_metric_name.split(":", 1)) is the same unguarded split as in rm.py. Apply the same fix:

🛡️ Proposed fix
     if full_metric_name is not None:
+        if ":" not in full_metric_name:
+            raise ValueError(
+                f"checkpointing.metric_name must be in '<prefix>:<metric>' format, got '{full_metric_name}'"
+            )
         prefix, metric_name = full_metric_name.split(":", 1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/algorithms/dpo.py` around lines 663 - 682, The code does an unguarded
split of full_metric_name into prefix and metric_name (prefix, metric_name =
full_metric_name.split(":", 1)) which raises a cryptic ValueError for
misconfigured strings; update the validation so you first check that
full_metric_name is a non-empty string containing a ":" (or use str.partition
and verify the separator was present) and raise a clear ValueError like
"full_metric_name must be '<prefix>:<metric>'" when missing; then proceed to set
prefix, metric_name and the rest of the logic that chooses metrics_source and
updates dpo_save_state (keep the existing warnings, deletion from
dpo_save_state, and the metric-not-found error path intact).
nemo_rl/algorithms/rm.py (1)

590-610: ⚠️ Potential issue | 🟡 Minor

Removed format assertion leaves a cryptic ValueError when metric_name lacks a colon.

With the assertion gone, a misconfigured metric_name: accuracy (no prefix: part) reaches line 592:

prefix, metric_name = full_metric_name.split(":", 1)
# → ValueError: not enough values to unpack (expected 2, got 1)

Add an explicit guard before the split:

🛡️ Proposed fix
 if full_metric_name is not None:
+    if ":" not in full_metric_name:
+        raise ValueError(
+            f"checkpointing.metric_name must be in '<prefix>:<metric>' format, got '{full_metric_name}'"
+        )
     prefix, metric_name = full_metric_name.split(":", 1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/algorithms/rm.py` around lines 590 - 610, The code currently does an
unchecked split of full_metric_name into prefix and metric_name causing a
cryptic ValueError when the colon is missing; before calling
full_metric_name.split(":", 1) in the checkpointing block that updates
rm_save_state, add an explicit guard that checks that full_metric_name contains
exactly one ':' (or at least contains ':') and if not raise a clear ValueError
(or warnings.warn then skip) that explains the expected format like
"checkpointing.metric_name must be 'prefix:metric' (e.g. 'train:accuracy')" so
callers see a helpful error instead of the unpacking exception.
nemo_rl/algorithms/distillation.py (1)

945-1078: ⚠️ Potential issue | 🟡 Minor

Remove unused rewards variable assignment.

Line 1011 assigns val_batch["total_reward"] to rewards, but the variable is never used; the next line accesses val_batch["total_reward"].tolist() directly. Delete the unused assignment.

Diff
-                rewards = val_batch["total_reward"]

                 task_rewards.extend(val_batch["total_reward"].tolist())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/algorithms/distillation.py` around lines 945 - 1078, In function
validate (in nemo_rl/algorithms/distillation.py) remove the unused local
assignment "rewards = val_batch['total_reward']" (the variable rewards is never
referenced afterward); simply delete that line so the code uses
val_batch['total_reward'].tolist() directly and avoids an unused variable in the
loop that processes val_batch within validate.
nemo_rl/data/utils.py (1)

116-140: ⚠️ Potential issue | 🟠 Major

Guard against duplicate task_name overwrites in validation data loading.

Direct dict assignment silently drops earlier validation datasets if the same task_name appears in multiple configs. The training path handles this correctly with concatenate_datasets() (line 105), but validation does not. When duplicate task_names occur, later assignments overwrite earlier ones in val_data_dict, val_task_data_processors, and val_task_to_env, causing silent data loss.

Concatenate validation datasets with matching task_names using the same pattern as training data, or assert that task_names are unique across validation configs.

💡 Suggested fix (merge duplicates like training data)
-            val_data_dict[data.task_name] = data.val_dataset
+            if data.task_name in val_data_dict:
+                val_data_dict[data.task_name] = concatenate_datasets(
+                    [val_data_dict[data.task_name], data.val_dataset]
+                )
+            else:
+                val_data_dict[data.task_name] = data.val_dataset
...
-            val_data_dict[val_data.task_name] = val_data.dataset
+            if val_data.task_name in val_data_dict:
+                val_data_dict[val_data.task_name] = concatenate_datasets(
+                    [val_data_dict[val_data.task_name], val_data.dataset]
+                )
+            else:
+                val_data_dict[val_data.task_name] = val_data.dataset
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/utils.py` around lines 116 - 140, The validation loader
currently overwrites earlier entries when multiple validation configs share the
same task_name; update the block that assigns to val_data_dict,
val_task_data_processors, and val_task_to_env to merge duplicates like the
training path: after loading val_data = load_response_dataset(cfg), check if
val_data.task_name already exists in val_data_dict and if so call
concatenate_datasets(existing_dataset, val_data.dataset) (same helper used on
line ~105) and replace the entry, otherwise set it; ensure
val_task_data_processors[val_data.task_name] and
val_task_to_env[val_data.task_name] are only set once (or validated to be
consistent) to avoid inconsistent processor/env mappings.
♻️ Duplicate comments (7)
nemo_rl/data/datasets/preference_datasets/tulu3.py (1)

27-28: Same common_init argument concern as refcoco.

Please ensure skip_set_processor is passed or has a default in common_init (see earlier note).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/preference_datasets/tulu3.py` around lines 27 - 28, The
call to self.common_init(...) in Tulu3Preference must either pass
skip_set_processor=True or rely on common_init having a default for
skip_set_processor; update the call site in tulu3.py to include
skip_set_processor=<appropriate boolean> (e.g., True) or modify the common_init
signature to provide a default value for skip_set_processor so the processor
behavior matches other datasets like refcoco; reference the common_init function
and the Tulu3Preference initializer to make the change consistently.
nemo_rl/data/datasets/response_datasets/dapo_math.py (2)

55-60: Same common_init argument concern as refcoco.

Please ensure skip_set_processor is passed or has a default in common_init (see earlier note).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/response_datasets/dapo_math.py` around lines 55 - 60,
The constructor calls self.common_init(default_task_name="DAPOMathAIME2024",
**kwargs) but does not pass skip_set_processor (and common_init may not provide
a default), so ensure skip_set_processor is explicitly handled: either pass
skip_set_processor from kwargs into common_init (e.g., include
skip_set_processor=kwargs.get("skip_set_processor", <desired default>)) or
update common_init to define a safe default for skip_set_processor; locate the
call site in the DAPOMathAIME2024 dataset class __init__ and adjust the
common_init invocation or add the default in the common_init implementation so
skip_set_processor is always defined.

26-27: Same common_init argument concern as refcoco.

Please ensure skip_set_processor is passed or has a default in common_init (see earlier note).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/response_datasets/dapo_math.py` around lines 26 - 27,
common_init is being called without the skip_set_processor argument here in the
DAPOMath17K dataset initializer; update the call in the constructor to pass
skip_set_processor (same approach used in refcoco) or ensure common_init defines
a default for skip_set_processor. Specifically, modify the call to
self.common_init(default_task_name="DAPOMath17K", skip_set_processor=...,
**kwargs) or add skip_set_processor=False/True as a default parameter in the
common_init signature so the processor behavior is explicit (refer to the
common_init function and the class initializer in dapo_math.py).
nemo_rl/data/datasets/response_datasets/nemogym_dataset.py (1)

29-35: Same common_init argument concern as refcoco.

Please ensure skip_set_processor is passed or has a default in common_init (see earlier note).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/response_datasets/nemogym_dataset.py` around lines 29 -
35, The call to self.common_init(...) in nemogym_dataset.py uses common_init
without supplying skip_set_processor, which can cause the same bug noted in
refcoco; update the invocation in the constructor where default_task_name is
computed so it explicitly passes skip_set_processor (e.g.,
skip_set_processor=kwargs.get("skip_set_processor", False)) or ensure
common_init has a default for skip_set_processor, referencing the common_init
method and the place where default_task_name is computed and passed to
self.common_init.
nemo_rl/data/datasets/response_datasets/squad.py (1)

30-31: Same common_init argument concern as refcoco.

Please ensure skip_set_processor is passed or has a default in common_init (see earlier note).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/response_datasets/squad.py` around lines 30 - 31, The
call to self.common_init in the SQuAD dataset should explicitly handle the
skip_set_processor flag: either pass skip_set_processor from the SQuAD
constructor into self.common_init (e.g.,
self.common_init(default_task_name="squad",
skip_set_processor=skip_set_processor, **kwargs)) or add a default for
skip_set_processor inside the common_init signature so callers that omit it
(like squad) behave correctly; update the SQuAD constructor to pass through the
parameter or update common_init to set a sensible default for skip_set_processor
to avoid the missing-argument issue.
nemo_rl/data/datasets/response_datasets/openmathinstruct2.py (1)

48-50: Same common_init argument concern as refcoco.

Please ensure skip_set_processor is passed or has a default in common_init (see earlier note).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/response_datasets/openmathinstruct2.py` around lines 48
- 50, The call to self.common_init in OpenMathInstruct2 doesn't pass
skip_set_processor and relies on common_init to provide a safe default; either
update the call in the OpenMathInstruct2 constructor to explicitly pass
skip_set_processor (e.g., skip_set_processor=True or False as appropriate for
this dataset) or change the common_init signature to include a default value for
skip_set_processor so callers like OpenMathInstruct2 can omit it; locate the
common_init definition and add a default (or update the OpenMathInstruct2 call)
to ensure skip_set_processor is always defined.
nemo_rl/data/datasets/preference_datasets/binary_preference_dataset.py (1)

57-63: Duplicate task-name derivation logic — same issue as oai_format_dataset.py.

The default_task_name derivation (lines 58–60) is identical to oai_format_dataset.py lines 138–140. Please extract to the shared helper described in the oai_format_dataset.py comment — this is the second occurrence that confirms the need for the refactor.

The same empty-string edge case and skip_set_processor concern flagged in oai_format_dataset.py apply here as well.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/preference_datasets/binary_preference_dataset.py`
around lines 57 - 63, The default_task_name derivation code (building
default_task_name from data_path and trimming a leading '-') is duplicated here
and should be extracted into a shared helper (e.g., a new function
get_default_task_name(data_path) used by both BinaryPreferenceDataset and
OaiFormatDataset); update
nemo_rl/data/datasets/preference_datasets/binary_preference_dataset.py to call
that helper instead of repeating the logic around default_task_name, ensure the
helper returns a non-empty string (handle the empty-string edge case by falling
back to a safe name or raising a clear error), and preserve the existing call to
self.common_init(default_task_name=..., **kwargs) while respecting the existing
skip_set_processor behavior (i.e., do not change how skip_set_processor is
passed through to common_init).
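All seven duplicate comments above converge on the same fix. A minimal sketch of a common_init with a defaulted skip_set_processor (illustrative only; the real method lives in nemo_rl/data/datasets/raw_dataset.py and does more work, including processor registry validation):

    class RawDataset:
        def common_init(
            self,
            default_task_name: str,
            skip_set_processor: bool = False,  # default avoids TypeErrors at call sites
            **kwargs,
        ) -> None:
            # Prefer an explicit task_name from config, else the dataset's default.
            self.task_name = kwargs.get("task_name") or default_task_name
            if not skip_set_processor:
                # Processor binding elided; see the raw_dataset.py comments above.
                self.processor = kwargs.get("processor")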
🧹 Nitpick comments (4)
examples/configs/sft.yaml (1)

237-249: Consider moving the checkpointing guidance comment closer to the checkpointing: block.

The guidance block at lines 237–248 explains how task_name in the data: section relates to metric_name in checkpointing:, but it sits at the end of the data: section, far from the checkpointing: block (lines 16–25). Readers are more likely to look for this guidance when editing checkpointing.metric_name. A cross-reference comment near line 20 (the metric_name entry) would be more discoverable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/configs/sft.yaml` around lines 237 - 249, Move the explanatory
comment about how data.validation.task_name maps to checkpointing.metric_name
from the end of the data: section into/near the checkpointing:
block—specifically adjacent to the checkpointing.metric_name entry—so users
editing metric_name can immediately see the guidance; reference
data.validation.task_name in the moved comment and keep the example lines (e.g.,
metric_name: "val:val_loss_dataset1" and metric_name: "val:val_loss") so the
relationship and examples remain intact.
nemo_rl/utils/checkpoint.py (1)

109-118: Prefer a ValueError over assert for config validation, and fix the message typo.
assert can be stripped in optimized runs, and the example string currently reads awkwardly ('val_reward --> 'val:reward'). Consider a regular exception with a clearer message.

♻️ Proposed refactor
-        if self.metric_name is not None:
-            assert self.metric_name.startswith("train:") or self.metric_name.startswith(
-                "val:"
-            ), (
-                f"metric_name={self.metric_name} must start with 'val:' or 'train:',\n"
-                f'followed by the corresponding name in the "val" or "train" metrics dictionary.'
-                f"  If you are using an old config, please updated checkpointing.metric_name to the new format, "
-                f" e.g. 'val_reward --> 'val:reward'"
-            )
+        if self.metric_name is not None:
+            if not (
+                self.metric_name.startswith("train:")
+                or self.metric_name.startswith("val:")
+            ):
+                raise ValueError(
+                    f"metric_name={self.metric_name} must start with 'val:' or 'train:',\n"
+                    f'followed by the corresponding name in the "val" or "train" metrics dictionary. '
+                    "If you are using an old config, please update checkpointing.metric_name to the new format, "
+                    "e.g. 'val_reward' -> 'val:reward'."
+                )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/utils/checkpoint.py` around lines 109 - 118, Replace the assert-based
config validation in the checkpoint logic with an explicit ValueError: locate
the block referencing self.metric_name (in nemo_rl/utils/checkpoint.py, around
the check in the checkpoint class where metric_name is validated) and change the
assert to an if-check that raises ValueError when metric_name doesn't start with
"train:" or "val:"; update the exception message to be clear and fix the example
typo (e.g. use "val_reward -> 'val:reward'" or "e.g. 'val:reward' for old
'val_reward'") and include guidance about updating checkpointing.metric_name to
the new format.
nemo_rl/data/datasets/preference_datasets/preference_dataset.py (1)

54-60: Same duplicated task-name derivation as response_dataset.py lines 56–59.

The body of lines 54–57 is verbatim copy of response_dataset.py lines 56–59. See the suggested task_name_from_path utility in the response_dataset.py comment above.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/preference_datasets/preference_dataset.py` around lines
54 - 60, The task-name derivation in preference_dataset.py duplicates logic from
response_dataset.py: replace the inline derivation (the default_task_name
computation just before the call to self.common_init) with a call to a shared
utility (e.g., task_name_from_path) or move that logic into a new helper
function and use it from both modules; update preference_dataset.py to compute
default_task_name by calling the shared helper and then call
self.common_init(default_task_name=default_task_name, **kwargs) so both files
use the same centralized function instead of duplicating the code.
nemo_rl/data/datasets/response_datasets/response_dataset.py (1)

56-62: Extract duplicated task-name derivation to a shared utility.

The exact same 4-line block ("-".join(...).split(".")[0] + leading-dash strip) is copy-pasted verbatim into preference_dataset.py (lines 54–57). Extraction to a helper keeps a single definition.

♻️ Suggested utility in nemo_rl/data/datasets/utils.py
+ def task_name_from_path(data_path: str) -> str:
+     """Derive a default task name from a dataset file path."""
+     name = "-".join(data_path.split("/")[-2:]).split(".")[0]
+     return name.lstrip("-")

Then in both response_dataset.py and preference_dataset.py:

- default_task_name = "-".join(data_path.split("/")[-2:]).split(".")[0]
- if default_task_name[0] == "-":
-     default_task_name = default_task_name[1:]
+ default_task_name = task_name_from_path(data_path)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/response_datasets/response_dataset.py` around lines 56
- 62, Extract the duplicated task-name derivation into a shared utility function
(e.g., task_name_from_path) and replace the 4-line block in response_dataset.py
and preference_dataset.py with a call to that helper; specifically, add a
function task_name_from_path(data_path: str) in nemo_rl/data/datasets/utils.py
that returns the derived name with leading dashes stripped, then in the
constructors where you currently compute default_task_name (the block before
calling self.common_init), call default_task_name =
task_name_from_path(data_path) and keep the subsequent
self.common_init(default_task_name=default_task_name, **kwargs).
🤖 Prompt for AI agents (inline review comments)
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_rl/algorithms/sft.py`:
- Around line 324-362: Two warning calls in the validation loop (the one inside
the per-batch check that warns "No validation metrics were collected for this
batch." and the final one that warns "No validation metrics were collected.")
need a stacklevel so callers see the true call site; update both
warnings.warn(...) invocations in the SFT validation code path (where
val_results, val_data, val_loss, total_val_loss, total_num_valid_tokens are
used) to pass stacklevel=2.
- Around line 272-342: The task-level loss calculation may divide by zero when
task_num_valid_tokens is 0; update the block that sets val_loss[task_name] so it
checks task_num_valid_tokens before dividing (e.g., if task_num_valid_tokens ==
0 assign a safe default like float("nan") or 0.0 and optionally emit a warning)
instead of unconditionally doing task_val_loss / task_num_valid_tokens; change
the assignment around the symbols task_val_loss, task_num_valid_tokens, and
val_loss[task_name] to perform this guard.

In `@nemo_rl/data/datasets/raw_dataset.py`:
- Around line 51-58: Remove the implicit fallback processor="default" in the raw
dataset binding code: if processor is None, do not set a code-level default;
instead raise a clear error (e.g., ValueError or AssertionError) instructing the
caller to supply a processor in configuration/YAML. Keep the existing validation
against PROCESSOR_REGISTRY and assignment to self.processor (use
PROCESSOR_REGISTRY[processor]) so that when a non-None processor is provided
it's still validated and bound; reference symbols: skip_set_processor,
processor, PROCESSOR_REGISTRY, self.processor.
- Around line 21-40: Add a Google-style docstring to common_init describing
parameters and behavior; silence the unused kwargs by renaming **kwargs to
**_kwargs (or explicitly using it) to avoid ARG002; and remove the hardcoded
assignment that forces processor = "default" inside common_init so the processor
remains None unless explicitly provided (let configuration/YAML be the source of
the default). Update references to processor and skip_set_processor inside
common_init to respect a None processor value and only set the processor when
skip_set_processor is False and processor is explicitly provided.

In `@nemo_rl/data/datasets/response_datasets/__init__.py`:
- Around line 66-69: The failure comes from common_init requiring
skip_set_processor with no default, causing direct instantiation of
ResponseDataset-derived classes to raise a TypeError; update the common_init
function signature in raw_dataset.py to give skip_set_processor a default (e.g.,
change common_init(..., skip_set_processor: bool = False, ...)) so constructors
like ResponseDataset, Tulu3SftMixtureDataset, OpenMathInstruct2Dataset no longer
must receive that kwarg, and ensure any internal uses still respect the default;
run the unit tests to confirm the fix.

In `@nemo_rl/data/datasets/response_datasets/geometry3k.py`:
- Around line 69-70: The call to common_init in geometry3k currently omits the
skip_set_processor flag causing a TypeError; update the initializer call in
geometry3k (where self.common_init is invoked) to explicitly pass
skip_set_processor (either forward it from kwargs or supply a default like
False) so common_init receives that parameter (e.g., use
kwargs.get('skip_set_processor', False) or pass kwargs['skip_set_processor']
when present) and avoid the runtime error.

In `@nemo_rl/data/datasets/response_datasets/helpsteer3.py`:
- Around line 30-32: The __init__ currently calls self.common_init(...) without
passing skip_set_processor which common_init expects; update the __init__
signature to accept skip_set_processor: bool = False (preserving prior default
behavior) and forward it into the common_init call
(self.common_init(default_task_name="HelpSteer3",
skip_set_processor=skip_set_processor, **kwargs)) so callers can opt out of
processor setup and avoid the runtime TypeError in HelpSteer3.__init__.

In `@nemo_rl/data/datasets/response_datasets/oai_format_dataset.py`:
- Around line 137-141: Extract the duplicate default task-name derivation into a
single helper named _task_name_from_path(data_path: str) -> str (place it in
nemo_rl/data/datasets/raw_dataset.py or a shared utils module), replace the
duplicated logic in oai_format_dataset.py and binary_preference_dataset.py to
call this helper, and add a guard in the helper that raises a ValueError when
the derived name is empty (e.g., when data_path == "/") so callers (e.g., the
code that previously used default_task_name) must pass an explicit task_name
instead of producing empty metric keys.
- Around line 142-146: The common_init call is missing the required
skip_set_processor argument (declared as a positional parameter in
raw_dataset.py), causing TypeError when these classes are instantiated without
that kwarg; fix by extracting skip_set_processor =
kwargs.pop("skip_set_processor") (or kwargs.get and validate presence) and pass
it explicitly to self.common_init (e.g.,
self.common_init(default_task_name=default_task_name,
skip_set_processor=skip_set_processor, **kwargs)) in OpenAIFormatDataset (and
likewise in OasstDataset, HelpSteer3Dataset, DeepScalerDataset, AIME2024Dataset,
BinaryPreferenceDataset) so the required parameter is supplied.

In `@nemo_rl/data/datasets/response_datasets/refcoco.py`:
- Around line 192-193: The call to common_init(...) in the class (in refcoco.py)
is missing the required skip_set_processor parameter and will raise a TypeError;
update the call to self.common_init(default_task_name="refcoco",
skip_set_processor=kwargs.get("skip_set_processor"), **kwargs) (or explicitly
pass the appropriate boolean value) so the required skip_set_processor argument
is provided when invoking common_init.


ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9148186 and 581cbd9.

📒 Files selected for processing (37)
  • docs/guides/grpo.md
  • docs/guides/sft.md
  • examples/configs/distillation_math.yaml
  • examples/configs/grpo_math_1B.yaml
  • examples/configs/sft.yaml
  • examples/run_sft.py
  • nemo_rl/algorithms/distillation.py
  • nemo_rl/algorithms/dpo.py
  • nemo_rl/algorithms/grpo.py
  • nemo_rl/algorithms/rm.py
  • nemo_rl/algorithms/sft.py
  • nemo_rl/data/datasets/preference_datasets/__init__.py
  • nemo_rl/data/datasets/preference_datasets/binary_preference_dataset.py
  • nemo_rl/data/datasets/preference_datasets/helpsteer3.py
  • nemo_rl/data/datasets/preference_datasets/preference_dataset.py
  • nemo_rl/data/datasets/preference_datasets/tulu3.py
  • nemo_rl/data/datasets/raw_dataset.py
  • nemo_rl/data/datasets/response_datasets/__init__.py
  • nemo_rl/data/datasets/response_datasets/aime24.py
  • nemo_rl/data/datasets/response_datasets/clevr.py
  • nemo_rl/data/datasets/response_datasets/dapo_math.py
  • nemo_rl/data/datasets/response_datasets/deepscaler.py
  • nemo_rl/data/datasets/response_datasets/geometry3k.py
  • nemo_rl/data/datasets/response_datasets/helpsteer3.py
  • nemo_rl/data/datasets/response_datasets/nemogym_dataset.py
  • nemo_rl/data/datasets/response_datasets/oai_format_dataset.py
  • nemo_rl/data/datasets/response_datasets/oasst.py
  • nemo_rl/data/datasets/response_datasets/openmathinstruct2.py
  • nemo_rl/data/datasets/response_datasets/refcoco.py
  • nemo_rl/data/datasets/response_datasets/response_dataset.py
  • nemo_rl/data/datasets/response_datasets/squad.py
  • nemo_rl/data/datasets/response_datasets/tulu3.py
  • nemo_rl/data/utils.py
  • nemo_rl/utils/checkpoint.py
  • tests/functional/distillation.sh
  • tests/functional/grpo_multiple_datasets.sh
  • tests/functional/sft.sh

@terrykong terrykong left a comment

mostly lgtm. small comments

if sum_num_valid_tokens > 0:
    val_metrics["val_loss"] /= sum_num_valid_tokens
# Calculate validation metrics
if total_num_valid_tokens > 0:

nit:

Suggested change:

-if total_num_valid_tokens > 0:
+if total_num_valid_tokens > 0:
+    assert "total" not in val_loss, "total is a reserved task_name since it is used in the metrics as the aggregate label"

full_metric_name = master_config["checkpointing"]["metric_name"]
if full_metric_name is not None:
    assert full_metric_name.startswith(

why remove this assert?

@yuki-97 (author) replied:

Not removed, just moved to nemo_rl/utils/checkpoint.py :) so that we can assert it once at initialization and avoid duplicating the check across the different algorithms.
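For reference, a condensed sketch of where that check now lives, shown here with the ValueError variant the nitpick above suggests (the real CheckpointManager.__init__ takes more arguments):

    class CheckpointManager:
        def __init__(self, metric_name: str | None = None) -> None:
            # Validated once here, instead of per-algorithm in dpo.py/rm.py.
            if metric_name is not None and not metric_name.startswith(("train:", "val:")):
                raise ValueError(
                    f"metric_name={metric_name!r} must start with 'train:' or 'val:', "
                    "e.g. old 'val_reward' becomes 'val:reward'."
                )
            self.metric_name = metric_name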

@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 25, 2026
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 26, 2026
@yuki-97 yuki-97 force-pushed the yukih/validation-task-name branch from 74ca707 to fb99598 February 26, 2026 07:48
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 26, 2026
@yuki-97 yuki-97 force-pushed the yukih/validation-task-name branch from fb99598 to f2d4720 February 26, 2026 10:22
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 26, 2026
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 26, 2026

Labels

  • CI:L1 Run doctests, unit tests, and functional tests
  • documentation Improvements or additions to documentation
