Conversation
📝 Walkthrough

This PR refactors the data setup pipeline by renaming `setup_data_with_envs` to `setup_response_data`.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Script as Example Script
    participant DataSetup as setup_response_data()
    participant DataLoader as NemoGymDataset
    participant Processor as nemo_gym_data_processor
    participant EnvSetup as create_env()
    participant Training as Training Loop
    Script->>DataSetup: Call with data_config, env_configs
    DataSetup->>DataLoader: Load dataset from JSONL path
    DataLoader-->>DataSetup: Return HuggingFace Dataset
    DataSetup->>Processor: Process each dataset entry
    Processor-->>DataSetup: Return DatumSpec with env_info
    alt env_configs provided
        DataSetup->>EnvSetup: create_env(env_name="nemo_gym")
        EnvSetup-->>DataSetup: Return environment interface
        DataSetup-->>Script: (train_dataset, val_dataset, task_to_env, val_task_to_env)
    else env_configs is None
        DataSetup-->>Script: (train_dataset, val_dataset)
    end
    Script->>Training: Start training with datasets and envs
    Training-->>Script: Training results
```
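The two return shapes in the diagram can be sketched with a toy stand-in (placeholder values only; the real function lives in `nemo_rl/data/utils.py` and returns actual datasets and environment interfaces):

```python
def setup_response_data_sketch(data_config, env_configs=None):
    """Toy stand-in illustrating the branching return shape shown in the diagram.

    Dataset and env values are placeholder strings, not real objects.
    """
    train_dataset, val_dataset = "train", "val"
    if env_configs is not None:
        # With env configs, environments are created and keyed per task.
        task_to_env = {name: f"env:{name}" for name in env_configs}
        val_task_to_env = dict(task_to_env)
        return train_dataset, val_dataset, task_to_env, val_task_to_env
    # Without env configs, only the two datasets come back.
    return train_dataset, val_dataset
```

Callers therefore have to unpack a different number of values depending on whether `env_configs` was passed, which is what the `@overload` nitpick below is about.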
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 checks passed
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/nemo_gym/run_grpo_nemo_gym.py (1)
162-182: ⚠️ Potential issue | 🟠 Major

Handle the case where no validation dataset is produced.

`setup_response_data(..., env_configs=None)` can return `val_dataset = None`. The current code unconditionally calls `len(val_dataset)`, which will raise. Please guard or fail fast with a clear error.

Suggested fix

```diff
- train_dataset, val_dataset = setup_response_data(
-     tokenizer, config["data"], env_configs=None
- )
+ train_dataset, val_dataset = setup_response_data(
+     tokenizer, config["data"], env_configs=None
+ )
+ if val_dataset is None:
+     raise ValueError(
+         "Validation dataset is required for NeMo-Gym runs; please configure "
+         "data.validation or split_validation_size > 0."
+     )
@@
- print(
-     f"Setting `grpo.max_val_samples` and `grpo.val_batch_size` to the length of the validation dataset, which is {len(val_dataset)}"
- )
+ print(
+     f"Setting `grpo.max_val_samples` and `grpo.val_batch_size` to the length of the validation dataset, which is {len(val_dataset)}"
+ )
```
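A runnable sketch of the suggested guard (`require_val_dataset` is a hypothetical helper name; the message mirrors the suggested fix):

```python
def require_val_dataset(val_dataset):
    # Fail fast with a clear message instead of letting len(None) raise a
    # TypeError later in the training setup.
    if val_dataset is None:
        raise ValueError(
            "Validation dataset is required for NeMo-Gym runs; please configure "
            "data.validation or split_validation_size > 0."
        )
    return val_dataset
```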
🤖 Fix all issues with AI agents
In `@examples/nemo_gym/run_grpo_nemo_gym.py`:
- Around line 211-219: The current hardcoded task_to_env {"nemo_gym": nemo_gym}
can mismatch NemoGymDataset.task_name (derived via
"-".join(data_path.split("/")[-2:]).split(".")[0]) and break environment lookups
in rollouts.py; change the binding to use the dataset's task_name (obtain from
the NemoGymDataset instance used to create the env) as the key instead of
"nemo_gym", e.g., compute key = dataset.task_name and set task_to_env = {key:
nemo_gym} and mirror that for val_task_to_env so
run_async_nemo_gym_rollout/rollouts.py environment lookups succeed.
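The derivation quoted above can be sketched directly; the env value here is a placeholder string, and `derive_task_name` is an illustrative helper, not the real API:

```python
def derive_task_name(data_path: str) -> str:
    # Mirrors the derivation quoted in the review:
    # "-".join(data_path.split("/")[-2:]).split(".")[0]
    return "-".join(data_path.split("/")[-2:]).split(".")[0]

data_path = "3rdparty/Gym-workspace/Gym/data/workplace_assistant/train.jsonl"
task_name = derive_task_name(data_path)
# Key the env mapping by the dataset's task name rather than a hardcoded
# "nemo_gym", so rollout-time lookups by task_name succeed.
task_to_env = {task_name: "nemo_gym_env_placeholder"}
```

For the path above, the derived key is `workplace_assistant-train`, which a hardcoded `{"nemo_gym": ...}` mapping would never match.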
In `@nemo_rl/data/__init__.py`:
- Line 47: The code permits max_input_seq_length (alias max_seq_length) to be
None but downstream processors perform unsafe comparisons/arithmetic with it;
either add explicit None checks in every processor function that uses
max_input_seq_length (e.g., guard patterns in token/window truncate logic before
any "if length > max_input_seq_length" or "max_input_seq_length // ..."
operations) or enforce non-None at dataset construction by validating in
AllTaskProcessedDataset.__init__ (raise/config error if max_input_seq_length is
None) so processors can assume an int. Update references to
max_input_seq_length/max_seq_length in the processor functions and
AllTaskProcessedDataset to implement the chosen approach and add a clear error
message when rejecting None.
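A minimal sketch of the "enforce non-None at construction" option (the class name and error text are illustrative; the real `AllTaskProcessedDataset.__init__` takes more arguments):

```python
class AllTaskProcessedDatasetSketch:
    """Illustrative stand-in showing the constructor-time validation option."""

    def __init__(self, max_input_seq_length):
        # Reject None up front so every processor can assume an int and skip
        # per-call None checks before comparisons or integer arithmetic.
        if max_input_seq_length is None:
            raise ValueError(
                "max_input_seq_length (alias max_seq_length) must be an int, "
                "got None; set it in the data config."
            )
        self.max_input_seq_length = max_input_seq_length
```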
In `@nemo_rl/data/datasets/response_datasets/nemogym_dataset.py`:
- Line 1: Update the copyright header year from 2025 to 2026 in the file's
top-of-file comment (the existing line containing "Copyright (c) 2025, NVIDIA
CORPORATION. All rights reserved."); replace "2025" with "2026" so the header
reads "Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved."
- Around line 23-31: Add a short docstring to the __init__ method describing
parameters (data_path: path to jsonl file, repeat: repetition count) and what
attributes are created (task_name and dataset), and silence the unused kwargs
lint by either renaming kwargs to _kwargs or explicitly consuming it (e.g., _ =
kwargs) or adding a comment like # noqa: F401 after kwargs; update the __init__
signature reference in the docstring and mention task_name and dataset so
reviewers can find the code (look for __init__, task_name, dataset, kwargs).
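One possible shape for that docstring and the kwargs handling (a sketch, not the real class; the real `NemoGymDataset` loads the JSONL into `self.dataset`):

```python
class NemoGymDatasetSketch:
    """Illustrative sketch of the suggested __init__ docstring and kwargs handling."""

    def __init__(self, data_path: str, repeat: int = 1, **kwargs):
        """Load a NemoGym JSONL dataset.

        Args:
            data_path: Path to the .jsonl file.
            repeat: How many times to repeat the dataset.

        Creates the `task_name` and `dataset` attributes.
        """
        _ = kwargs  # explicitly consume extra config keys to silence unused-arg lint
        self.task_name = "-".join(data_path.split("/")[-2:]).split(".")[0]
        self.dataset = None  # placeholder; the real class loads the JSONL here
```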
In `@nemo_rl/data/processors.py`:
- Around line 667-684: In nemo_gym_data_processor, silence the unused-argument
warnings by referencing task_data_spec, tokenizer, and max_seq_length (e.g.,
assign them to a throwaway variable or use them in a no-op) and change the fake
message_log token_ids creation from torch.tensor([]) to an empty integer tensor
(torch.tensor([], dtype=torch.long)) so token IDs use an integer dtype; update
the "message_log" entry creation in the function accordingly.
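The dtype point can be checked directly (requires torch; `torch.long` is an alias of `torch.int64`):

```python
import torch

# torch.tensor([]) defaults to float32, which is the wrong dtype for token IDs.
empty_float = torch.tensor([])
# An explicit integer dtype keeps the fake message_log consistent with real
# token ID tensors.
empty_ids = torch.tensor([], dtype=torch.long)
```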
In `@nemo_rl/data/utils.py`:
- Around line 99-100: The code assumes cfg["env_name"] exists when env_configs
is provided (see variables has_envs, envs, task_to_env and cfg), which can raise
KeyError; update the logic to validate each dataset config early: after
extracting env names from env_configs, check every cfg in the dataset loop and
if has_envs is True and "env_name" is missing raise a clear ValueError (or
KeyError with a descriptive message) indicating that env_name is required when
using env_configs and include the dataset identifier in the message;
alternatively, wrap the access to cfg["env_name"] with a check and provide the
same descriptive error before assigning into task_to_env.
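The early-validation option might look like this (`validate_dataset_configs` is a hypothetical helper name matching the described behavior):

```python
def validate_dataset_configs(dataset_configs, env_configs):
    """Raise a descriptive error for any dataset missing env_name when envs are used."""
    has_envs = env_configs is not None
    for name, cfg in dataset_configs.items():
        if has_envs and "env_name" not in cfg:
            raise ValueError(
                f"Dataset '{name}' is missing required key 'env_name'; "
                "env_name is required when env_configs is provided."
            )
```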
In `@tests/unit/experience/test_rollouts.py`:
- Around line 800-814: The temp file created with
tempfile.NamedTemporaryFile(..., delete=False) is not removed, causing
accumulation; change the test to either create the temp file with delete=True
and instantiate NemoGymDataset(data_path) inside the with block so the file is
read before it's auto-deleted, or keep delete=False but ensure explicit cleanup
(os.remove(data_path)) in a finally/teardown; locate the creation site
(tempfile.NamedTemporaryFile), the variable data_path, and where NemoGymDataset
is constructed (NemoGymDataset(data_path)) and apply one of these fixes.
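One shape for the explicit-cleanup option (the dataset-construction line is a placeholder comment; `delete=False` is kept so the file can be re-opened by name, which matters on Windows):

```python
import os
import tempfile

tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".jsonl", delete=False)
try:
    tmp.write('{"prompt": "hi"}\n')
    tmp.close()
    # ... the test would construct its dataset from tmp.name here ...
    created = os.path.exists(tmp.name)
finally:
    os.remove(tmp.name)  # guaranteed cleanup even if the test body raises
```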
🧹 Nitpick comments (3)
examples/nemo_gym/grpo_workplace_assistant_nemotron_nano_v2_9b.yaml (2)
46-48: Derive `checkpoint_dir` from `logger.log_dir` to avoid collisions.

Hardcoding a shared directory risks overlapping runs. Consider scoping checkpoints to the run log directory.

♻️ Suggested tweak

```diff
 checkpointing:
   enabled: true
-  checkpoint_dir: "results/grpo"
+  checkpoint_dir: "${logger.log_dir}/checkpoints"
```
237-242: Consider parameterizing local dataset paths.

The hardcoded `3rdparty/...` paths are brittle across machines/CI. Using env overrides makes the config portable.

♻️ Example (env-override with defaults)

```diff
 train:
-  data_path: 3rdparty/Gym-workspace/Gym/data/workplace_assistant/train.jsonl
+  data_path: ${oc.env:WORKPLACE_ASSISTANT_TRAIN_JSONL,"3rdparty/Gym-workspace/Gym/data/workplace_assistant/train.jsonl"}
 validation:
-  data_path: 3rdparty/Gym-workspace/Gym/data/workplace_assistant/validation.jsonl
+  data_path: ${oc.env:WORKPLACE_ASSISTANT_VALID_JSONL,"3rdparty/Gym-workspace/Gym/data/workplace_assistant/validation.jsonl"}
```

nemo_rl/data/utils.py (1)
34-47: Consider using `@overload` for clearer return type discrimination.

The `Union` return type makes it harder for static type checkers and callers to know which tuple shape they'll receive. Using `@typing.overload` would provide better type safety and IDE support.

♻️ Optional refactor using overload

```python
from typing import overload

@overload
def setup_response_data(
    tokenizer: AutoProcessor | AutoTokenizer,
    data_config: DataConfig,
    env_configs: None = None,
    is_vlm: bool = False,
) -> tuple[AllTaskProcessedDataset, Optional[AllTaskProcessedDataset]]: ...

@overload
def setup_response_data(
    tokenizer: AutoProcessor | AutoTokenizer,
    data_config: DataConfig,
    env_configs: dict[str, Any],
    is_vlm: bool = False,
) -> tuple[
    AllTaskProcessedDataset,
    Optional[AllTaskProcessedDataset],
    dict[str, EnvironmentInterface],
    dict[str, EnvironmentInterface],
]: ...

def setup_response_data(
    tokenizer: AutoProcessor | AutoTokenizer,
    data_config: DataConfig,
    env_configs: Optional[dict[str, Any]] = None,
    is_vlm: bool = False,
) -> Union[...]:
    ...  # implementation unchanged
```
Signed-off-by: ruit <ruit@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: ruit <ruit@nvidia.com> Signed-off-by: Yuki Huang <yukih@nvidia.com> Co-authored-by: ruit <ruit@nvidia.com> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
Signed-off-by: ruit <ruit@nvidia.com> Signed-off-by: Yuki Huang <yukih@nvidia.com> Co-authored-by: ruit <ruit@nvidia.com>
- Update `run_grpo_nemo_gym.py` to use the common util `setup_response_data`, so that it can also use the multiple datasets supported in #1691 and the multiple dataloaders which will be supported in #1698.
- Refactor `NemoGymDataset` and `nemo_gym_data_processor` to match the current NeMo-RL dataset structure.
- Rename `setup_data_with_envs` to `setup_response_data`, which also supports not creating envs.

Test Result

Summary by CodeRabbit
Release Notes
New Features
Refactor
Tests