feat: add code generation evaluation with test-case-driven sandbox environment#2056

Open
brluobt wants to merge 3 commits into NVIDIA-NeMo:main from brluobt:feat/code-generation-evaluation

Conversation

brluobt commented Mar 3, 2026

Background

As a new NeMo RL user, I found the GRPO Quick Start guide and evaluation pipeline excellent for getting started with math tasks. However, when I tried to extend my workflow to code generation — an increasingly important use case for RL-based LLM training — I realized there was no equivalent quick-start path:

  • The existing CodeEnvironment provides sandbox execution but has no test-case validation or reward signal, making it unsuitable for evaluation or GRPO training out of the box.
  • run_eval.py only works with MathEnvironment, so there is no simple command to benchmark a coding model.
  • There are no sample configs or standard benchmarks integrated for code generation, which raises the barrier for developers who want to explore this direction.

My goal with this PR is to provide the same "build, run, see results" experience that math tasks already have, but for code generation. A developer should be able to:

  1. Run a single eval command against LiveCodeBench and see a pass@k score.
  2. Launch GRPO training with code test-case rewards using a sample config.
  3. Use this as a starting point to plug in their own code benchmarks.

Relates to #858

Summary

Add end-to-end code generation evaluation to NeMo RL. The pipeline supports both evaluation (pass@k on LiveCodeBench) and GRPO training with code test-case rewards.

Quick start examples:

# Evaluate a coding model (pass@1 on LiveCodeBench)
uv run examples/run_eval.py --config examples/configs/evals/code_eval.yaml

# Train with GRPO on code generation
uv run examples/run_grpo.py --config examples/configs/grpo_code.yaml cluster.gpus_per_node=8

Validated end-to-end: Qwen2.5-Coder-1.5B-Instruct on LiveCodeBench v5 (167 problems), pass@1 = 3.0% with public test cases.
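For reference, pass@k is conventionally computed per problem with the unbiased estimator from the Codex paper (n generations sampled, c of them correct), then averaged over problems. Whether this PR's global_post_process_and_metrics() uses this exact form isn't shown here, so treat this as the standard formula rather than the PR's code:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so every size-k draw
        # contains at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With a single generation per problem, pass@1 reduces to the plain solve rate, which is how 5 passing problems out of 167 gives the 3.0% above.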

Changes by Component

1. CodeTestCaseEnvironment (core)

  • nemo_rl/environments/code_test_environment.py — New environment that extracts code from model responses, executes it against stdin/stdout test cases via subprocess, and returns binary pass/fail rewards (1.0 if all tests pass, 0.0 otherwise). Follows the MathEnvironment pattern with accuracy and pass@k metrics in global_post_process_and_metrics().
  • nemo_rl/environments/utils.py — Register "code_test" in ENV_REGISTRY
  • nemo_rl/distributed/ray_actor_environment_registry.py — Register actor with PY_EXECUTABLES.SYSTEM
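The extract-and-execute loop of such an environment can be sketched roughly as below. All names (score, the fence regex, the dict keys) are illustrative rather than the PR's actual API, and the real environment additionally parallelizes across Ray worker actors and aggregates metrics. Hardening (isolation flags, a sanitized env, returncode checks) is discussed in the review comments further down; this sketch deliberately leaves that out:

```python
import re
import subprocess
import sys

FENCE = "`" * 3  # built at runtime so this snippet can mention code fences safely
CODE_BLOCK_RE = re.compile(FENCE + r"(?:python)?\n(.*?)" + FENCE, re.DOTALL)


def extract_code(response: str) -> str:
    """Take the last fenced code block from a model response, else the raw text."""
    blocks = CODE_BLOCK_RE.findall(response)
    return blocks[-1].strip() if blocks else response.strip()


def run_single_test(code: str, test_input: str, expected: str, timeout: float = 5.0) -> bool:
    """Run the code in a fresh subprocess, feed stdin, compare stripped stdout."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            input=test_input,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0 and result.stdout.strip() == expected.strip()


def score(response: str, test_cases: list[dict]) -> float:
    """Binary reward: 1.0 only if every test case passes, else 0.0."""
    code = extract_code(response)
    ok = all(
        run_single_test(code, tc["input"], tc["expected_output"]) for tc in test_cases
    )
    return 1.0 if ok else 0.0
```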

2. Data Processor

  • nemo_rl/data/processors.py — Add code_data_processor that maps code problems with test cases into DatumSpec, passing test_cases and ground_truth through extra_env_info. Registered in PROCESSOR_REGISTRY.
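In spirit, the processor reshapes each raw problem into a chat-style record plus the side-channel info the environment needs. The names below are illustrative (the real processor builds a full DatumSpec with tokenization, truncation, and length handling, and the real prompt template lives in examples/prompts/code.txt):

```python
# Hypothetical prompt template standing in for examples/prompts/code.txt.
CODE_PROMPT = "Solve the following problem in Python:\n{problem}\n"


def code_data_processor(datum_dict: dict) -> dict:
    """Illustrative: shape one raw code problem into a datum the env can score."""
    message_log = [
        {"role": "user", "content": CODE_PROMPT.format(problem=datum_dict["problem"])},
    ]
    return {
        "message_log": message_log,
        # test_cases and ground_truth ride along untokenized for the environment.
        "extra_env_info": {
            "test_cases": datum_dict.get("test_cases", []),
            "ground_truth": datum_dict.get("ground_truth"),
        },
    }
```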

3. Generalize Eval Pipeline

  • examples/run_eval.py — Replace hardcoded MathEnvironment with create_env(env_name, ...). The env_name is read from data config and defaults to "math", so all existing math eval configs work unchanged.
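The dispatch pattern is a plain registry lookup. This sketch uses stub classes in place of the real Ray-actor environments, but the shape of create_env and the "math" default match the description above:

```python
class MathEnvironment:  # stub standing in for the real environment class
    def __init__(self, cfg: dict):
        self.cfg = cfg


class CodeTestCaseEnvironment:  # stub standing in for the new environment
    def __init__(self, cfg: dict):
        self.cfg = cfg


ENV_REGISTRY = {
    "math": MathEnvironment,
    "code_test": CodeTestCaseEnvironment,
}


def create_env(env_name: str, env_config: dict):
    if env_name not in ENV_REGISTRY:
        raise ValueError(f"Unknown env_name {env_name!r}; known: {sorted(ENV_REGISTRY)}")
    return ENV_REGISTRY[env_name](env_config)


# In run_eval.py the name comes from the data config, defaulting to "math"
# so existing math eval configs keep working unchanged:
data_config = {"env_name": "code_test"}
env = create_env(data_config.get("env_name", "math"), {"num_workers": 4})
```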

4. LiveCodeBench Dataset Integration

  • nemo_rl/data/datasets/eval_datasets/livecodebench.py — Eval dataset loader supporting release_v1 through v6.
  • nemo_rl/data/datasets/response_datasets/livecodebench.py — GRPO training dataset following the RawDataset pattern.
  • Both loaders handle datasets 3.x (script loading) and 4.x (JSONL fallback) for container compatibility.

5. Configs and Tests

  • examples/configs/evals/code_eval.yaml — Ready-to-run code evaluation config
  • examples/configs/grpo_code.yaml — Ready-to-run GRPO code training config
  • examples/prompts/code.txt — Prompt template for code generation
  • tests/unit/environments/test_code_test_environment.py — Unit tests covering code extraction, test case execution, timeout handling, batch processing, and extracted answer support.

Splitting Offer

This PR is self-contained but can be split if preferred:

  • PR A (core): Components 1–3 (CodeTestCaseEnvironment + code_data_processor + eval generalization)
  • PR B (integration): Components 4–5 (LiveCodeBench datasets + configs + tests)

Happy to split on request.

Test Plan

  • Unit tests for code extraction (extract_code) and test case execution (run_single_test)
  • Ray-based integration tests for CodeTestCaseEnvironment (pass, fail, batch, extracted answer)
  • Dataset loading verified (LiveCodeBench v5, 167 samples; JSONL fallback for datasets 4.x)
  • Backward compatibility confirmed (existing math eval configs work unchanged with default env_name="math")
  • End-to-end eval in Docker container: Qwen2.5-Coder-1.5B-Instruct on H20, pass@1 = 3.0% (5/167)
  • GRPO training convergence validation (planned follow-up)

Summary by CodeRabbit

  • New Features

    • Code generation evaluation config and GRPO training config added
    • LiveCodeBench integrated for training and evaluation (dataset + prompts)
    • New code-testing environment to run and validate Python solutions
    • Data processor and registry updates to support code evaluation workflows
    • Eval workflow updated to create environments dynamically
  • Tests

    • Added comprehensive unit tests for the code testing environment

brluobt requested review from a team as code owners on March 3, 2026 at 14:41

coderabbitai bot commented Mar 3, 2026

📝 Walkthrough

Walkthrough

Adds code-generation evaluation and training artifacts: a CodeTestCaseEnvironment for executing and scoring Python solutions, LiveCodeBench dataset loaders (eval and response), a code data processor, config examples for eval/training, and unit tests; wires environment and actor registries.

Changes

Cohort / File(s) Summary
Configs
examples/configs/evals/code_eval.yaml, examples/configs/grpo_code.yaml, examples/configs/evals/eval.yaml
New evaluation and GRPO training YAMLs; small update to eval.yaml adding env_name.
Example scripts & prompts
examples/run_eval.py, examples/prompts/code.txt
Switches eval script to create_env by name; adds a Python problem prompt template.
Eval dataset loader
nemo_rl/data/datasets/eval_datasets/.../__init__.py, nemo_rl/data/datasets/eval_datasets/livecodebench.py
Adds LiveCodeBench evaluation dataset loader, variant-to-file mapping, rekey transformation, and exports.
Response dataset (training) support
nemo_rl/data/datasets/response_datasets/.../__init__.py, nemo_rl/data/datasets/response_datasets/livecodebench.py
Adds LiveCodeBench response/training dataset loader, formatting and train/validation split, and registry entry.
Data processing
nemo_rl/data/processors.py
Adds code_data_processor and registers it in PROCESSOR_REGISTRY to produce DatumSpec with message_log, test_cases, and truncation/length handling.
Environment implementation & registry
nemo_rl/environments/code_test_environment.py, nemo_rl/environments/utils.py, nemo_rl/distributed/ray_actor_environment_registry.py
Introduces CodeTestCaseEnvironment and verify worker actor, utility functions extract_code/run_single_test, registers "code_test" env and actor mapping in registries.
Tests
tests/unit/environments/test_code_test_environment.py
Adds comprehensive unit tests for code extraction, single-test execution, timeouts/errors, and environment actor behavior.

Sequence Diagram(s)

sequenceDiagram
    participant Model as Model/Response Generator
    participant Env as CodeTestCaseEnvironment
    participant Extractor as Code Extractor
    participant WorkerPool as VerifyWorker Pool
    participant Executor as Subprocess Executor

    Model->>Env: send generated response
    Env->>Extractor: extract code block / code text
    Extractor-->>Env: return code

    Env->>Env: parse/normalize test_cases
    Env->>WorkerPool: assign (code, test_cases) batches

    par parallel
        WorkerPool->>Executor: run code with test input (isolated subprocess)
        Executor->>Executor: execute & compare stdout to expected
        Executor-->>WorkerPool: return pass/fail
    end

    WorkerPool-->>Env: aggregate pass/fail per response
    Env->>Env: compute reward (1.0 if all pass else 0.0)
    Env-->>Model: observation + reward (+ optional extracted code)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

CI:L1

Suggested reviewers

  • terrykong
  • odelalleau
🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 41.18%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (3 passed)

  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): the title clearly and concisely summarizes the main objective of the changeset, adding code generation evaluation with a test-case-driven sandbox environment, which is the central focus across all file changes.
  • Test Results For Major Changes (✅ Passed): the PR documents test results for its major changes: unit tests (206 lines, 11 methods), Ray integration tests, dataset loading validation (167 samples), backward-compatibility checks, and an end-to-end evaluation (3.0% pass@1).


coderabbitai bot left a comment


Actionable comments posted: 11

🧹 Nitpick comments (2)
tests/unit/environments/test_code_test_environment.py (1)

27-30: Rename module-level config to match global naming rule.

Use G_... upper snake_case for this global test config (currently cfg).

✏️ Proposed rename
-cfg: CodeTestEnvConfig = {
+G_CODE_TEST_ENV_CFG: CodeTestEnvConfig = {
@@
-    env_actor = CodeTestCaseEnvironment.remote(cfg)
+    env_actor = CodeTestCaseEnvironment.remote(G_CODE_TEST_ENV_CFG)

As per coding guidelines, “Use upper snake_case with G prefix for global variables, e.g., G_MY_GLOBAL.”

Also applies to: 35-35

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/environments/test_code_test_environment.py` around lines 27 - 30,
The module-level global variable cfg of type CodeTestEnvConfig should be renamed
to follow the project's global naming rule (upper snake_case with G_ prefix);
rename cfg to a descriptive constant like G_CODE_TEST_ENV_CFG, update the
declaration and all other references (including the other occurrence noted) to
use G_CODE_TEST_ENV_CFG, and preserve the value and type (num_workers and
timeout_per_test) so tests still use the same config.
nemo_rl/environments/code_test_environment.py (1)

129-133: Remove hidden timeout_per_test default from code path.

This should come from YAML config, not a non-None fallback in Python.

♻️ Suggested fix
-        self.timeout = cfg.get("timeout_per_test", 5)
+        self.timeout = cfg["timeout_per_test"]

As per coding guidelines, "YAML is the single source of truth for configuration defaults; do not set non-None defaults in code for configuration values".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/environments/code_test_environment.py` around lines 129 - 133, The
constructor (__init__) currently sets self.timeout = cfg.get("timeout_per_test",
5), which injects a hidden default; remove the hardcoded fallback so the timeout
comes strictly from the YAML-backed CodeTestEnvConfig. Change the assignment to
read the value directly (e.g., self.timeout = cfg["timeout_per_test"] or
otherwise access the config key without a non-None default) so missing values
surface as config errors rather than silently using 5.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/configs/grpo_code.yaml`:
- Around line 266-274: The config uses dataset_name "LiveCodeBench" but the
chosen processor "code_data_processor" expects a "problem" key (accessed as
datum_dict["problem"] in code_data_processor in nemo_rl/data/processors.py)
while the registered LiveCodeBench response dataset returns "messages" and
"test_cases"; fix by either switching to the eval variant of the dataset (e.g.,
the LiveCodeBench evaluation dataset name used elsewhere) so it provides
"problem", or change the processor to one that consumes the response-schema (a
processor that handles "messages"/"test_cases") so preprocessing no longer
attempts to access datum_dict["problem"].

In `@examples/run_eval.py`:
- Around line 58-59: Remove the hidden default by reading the env_name directly
from the config and ensure the YAML contains the default; replace the current
env_name = data_config.get("env_name", "math") with a direct lookup (e.g.,
data_config["env_name"]) and confirm the YAML used by data_config provides a
default value for env_name so create_env(env_name, env_configs[env_name]) still
receives a valid name.

In `@nemo_rl/data/datasets/eval_datasets/livecodebench.py`:
- Line 1: Update the NVIDIA copyright header year from 2025 to the current year
(2026) in nemo_rl/data/datasets/eval_datasets/livecodebench.py so the file
header matches project guidelines; locate the top-of-file header in
livecodebench.py and replace the year token "2025" with "2026".
- Around line 39-43: The Literal type for the parameter variant in the function
(the parameter named variant used to compute data_file from VARIANT_TO_FILE) is
missing supported keys (release_v1, release_v2, release_v6) and causes
type-check failures; update the variant annotation in the function signature to
include those missing Literal values (or replace the Literal with a broader type
such as str/Union[str, Literal[...]] that covers all keys), then keep the
data_file computation using VARIANT_TO_FILE.get(variant, ...) unchanged so valid
mapped variants are accepted by the type checker.

In `@nemo_rl/data/datasets/response_datasets/livecodebench.py`:
- Line 1: Update the copyright header in the top of livecodebench.py to use the
current year (replace "2025" with "2026"); locate the header comment at the top
of the file (in nemo_rl/data/datasets/response_datasets/livecodebench.py) and
change the year in the NVIDIA copyright line so it matches the coding guideline.
- Around line 32-33: The code is setting hidden non-None config defaults in the
code (e.g., using kwargs.get("variant", "release_v5") and similar defaults near
variables like variant_to_file); remove those in-code defaults and read
configuration keys directly (e.g., use kwargs["variant"] or the config object)
so the default values come from the YAML config instead; update any references
in this module (symbols: variant, variant_to_file and any other kwargs.get calls
around this function/class) to assume the key exists and let the YAML be the
single source of truth for defaults.
- Around line 67-79: public_test_cases normalization must be hardened: ensure
public_tests becomes a list before iterating (handle if it's a dict by wrapping
it in a list, if it's a scalar/other type set to empty list), and skip any
non-dict entries when building test_cases so tc.get won't raise; update the
block that currently assigns public_tests (and uses json.loads) and the loop
that builds test_cases to validate types and only append entries where tc is a
dict with safe .get access.
- Around line 44-49: Add a boolean config option (e.g., allow_trust_remote_code
defaulting to False) and use it to gate the call to load_dataset(...) so
trust_remote_code is set to that setting rather than True; update the load path
in livecodebench.py (the call site using load_dataset and the variable ds) to
pass trust_remote_code=allow_trust_remote_code, and add an accompanying optional
dataset_revision config (default None) to pin a dataset revision and pass it to
load_dataset when provided; also update the function/class docstring or config
comment to state that trust_remote_code must be enabled only after reviewing
remote code and that pinning a revision is recommended if enabled.

In `@nemo_rl/environments/code_test_environment.py`:
- Line 1: Update the NVIDIA copyright header year in
nemo_rl/environments/code_test_environment.py from 2025 to the current year
(2026); locate the top-of-file header comment (the copyright line) and replace
the year value only so the file header matches repo guidance requiring the
current year.
- Around line 67-79: The run_single_test function currently runs untrusted code
with full privileges and only checks stdout; modify it to require
result.returncode == 0 in addition to stdout matching, and harden the
subprocess.run invocation by adding isolation flags "-I" and "-S" to the Python
args, setting cwd to a newly created isolated temporary directory, and passing a
minimal sanitized env that disables user site packages (e.g., set
PYTHONNOUSERSITE=1) and removes dangerous variables; ensure these changes are
applied to the subprocess.run call and the return decision (in run_single_test)
so a non-zero exit code causes the test to fail.
- Line 96: The zip between pred_responses and test_cases_batch should use
strict=True to fail fast on length mismatches: update the for loop iterating
over pred_responses and test_cases_batch (the variables pred_responses and
test_cases_batch in code_test_environment.py) to use zip(..., strict=True). Do
the same for the other zip that pairs the chunks produced by
chunk_list_to_workers() (the zip around the chunked worker lists) so both zips
enforce equal lengths and raise immediately on mismatches.


ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1a32e9e and 7bb29f6.

📒 Files selected for processing (13)
  • examples/configs/evals/code_eval.yaml
  • examples/configs/grpo_code.yaml
  • examples/prompts/code.txt
  • examples/run_eval.py
  • nemo_rl/data/datasets/eval_datasets/__init__.py
  • nemo_rl/data/datasets/eval_datasets/livecodebench.py
  • nemo_rl/data/datasets/response_datasets/__init__.py
  • nemo_rl/data/datasets/response_datasets/livecodebench.py
  • nemo_rl/data/processors.py
  • nemo_rl/distributed/ray_actor_environment_registry.py
  • nemo_rl/environments/code_test_environment.py
  • nemo_rl/environments/utils.py
  • tests/unit/environments/test_code_test_environment.py

Comment on lines +44 to +49
ds = load_dataset(
    "livecodebench/code_generation_lite",
    variant,
    split="test",
    trust_remote_code=True,
)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Hugging Face datasets load_dataset trust_remote_code parameter security guidance best practices

💡 Result:

What trust_remote_code does (and why it’s risky)

  • datasets.load_dataset(..., trust_remote_code=...) controls whether the 🤗 Datasets library is allowed to execute a dataset repository’s Python loading script (custom code) on your machine. This is effectively remote code execution if you load from the Hub. The docs explicitly warn to enable it only for repos you trust and whose code you’ve reviewed. [1]

Current behavior / defaults to be aware of

  • The trust_remote_code parameter was introduced to make executing dataset scripts an explicit choice (and to support disabling it by default). [2]
  • In datasets 2.20.0, datasets that rely on a Python loading script require passing trust_remote_code=True (i.e., it’s no longer implicitly trusted by default for those cases). [3]
  • Some curated/enterprise contexts disallow models that require trust_remote_code=True unless they’re explicitly verified or from a trusted org (example: Hugging Face’s Azure collection policy). [4]

Security best practices (recommended)

  1. Prefer datasets with no custom Python script (e.g., plain Parquet/JSON/CSV hosted in the repo). If you control the dataset, publish it in a “data-only” format to avoid requiring remote code. (This is the direction the library has encouraged as part of the trust change.) [2]
  2. Keep trust_remote_code=False by default, and only enable it per-dataset when you must.
    • You can enforce a safer default via HF_DATASETS_TRUST_REMOTE_CODE=0. [2]
  3. If you must enable it:
    • Inspect the repository code (the dataset script and any referenced modules) before running it. [1]
    • Pin a specific revision (commit SHA/tag) so the code you reviewed can’t change underneath you between runs. This “pin a revision” guidance is explicitly recommended in Hugging Face security guidance around remote code. [5]
  4. Sandbox execution when feasible:
    • Run dataset loading in an isolated container/VM with least privilege (read-only filesystem where possible, no secrets in env, restricted network/egress if you don’t need downloads beyond the dataset artifacts).
  5. Use Hub security controls in your org:
    • Enable MFA, use scoped access tokens, and take advantage of Hub-side security features (e.g., malware scanning and related controls) to reduce supply-chain risk. [6]

Sources

[1] 🤗 Datasets “Loading methods” reference (trust_remote_code warning) (huggingface.co)
[2] datasets 2.16.0 release notes (adds trust_remote_code, env var HF_DATASETS_TRUST_REMOTE_CODE) (newreleases.io)
[3] datasets 2.20.0 release notes (scripts require explicit trust_remote_code=True) (newreleases.io)
[4] Hugging Face on Azure security/compliance policy (trust_remote_code not allowed unless verified/trusted) (huggingface.co)
[5] Hugging Face security guidance on remote code + pinning revisions (Transformers SECURITY.md excerpt in Hub) (huggingface.co)
[6] Hugging Face Hub security features overview (huggingface.co)



Gate trust_remote_code=True with a configuration parameter, defaulting to False.

Lines 44-49 unconditionally enable remote code execution from the livecodebench/code_generation_lite dataset without a security gate. Hugging Face documentation explicitly recommends keeping trust_remote_code=False by default and only enabling it for datasets whose code you've reviewed. Add a config parameter to control this behavior and document when it is required. Consider pinning a specific dataset revision to prevent code changes between runs if remote code is necessary.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/response_datasets/livecodebench.py` around lines 44 -
49, Add a boolean config option (e.g., allow_trust_remote_code defaulting to
False) and use it to gate the call to load_dataset(...) so trust_remote_code is
set to that setting rather than True; update the load path in livecodebench.py
(the call site using load_dataset and the variable ds) to pass
trust_remote_code=allow_trust_remote_code, and add an accompanying optional
dataset_revision config (default None) to pin a dataset revision and pass it to
load_dataset when provided; also update the function/class docstring or config
comment to state that trust_remote_code must be enabled only after reviewing
remote code and that pinning a revision is recommended if enabled.

Comment on lines +67 to +79
public_tests = data.get("public_test_cases", [])
if isinstance(public_tests, str):
    try:
        public_tests = json.loads(public_tests)
    except (json.JSONDecodeError, TypeError):
        public_tests = []

test_cases = []
for tc in public_tests:
    test_cases.append({
        "input": tc.get("input", ""),
        "expected_output": tc.get("output", ""),
    })

⚠️ Potential issue | 🟠 Major

Harden public_test_cases normalization before iterating.

Lines 75-79 assume each tc is a dict. A non-list payload, or non-dict entries within it, will raise and break dataset mapping.

🛠️ Proposed hardening
-        test_cases = []
-        for tc in public_tests:
-            test_cases.append({
-                "input": tc.get("input", ""),
-                "expected_output": tc.get("output", ""),
-            })
+        test_cases: list[dict[str, str]] = []
+        if isinstance(public_tests, list):
+            for tc in public_tests:
+                if isinstance(tc, dict):
+                    test_cases.append({
+                        "input": str(tc.get("input", "")),
+                        "expected_output": str(tc.get("output", "")),
+                    })
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/response_datasets/livecodebench.py` around lines 67 -
79, public_test_cases normalization must be hardened: ensure public_tests
becomes a list before iterating (handle if it's a dict by wrapping it in a list,
if it's a scalar/other type set to empty list), and skip any non-dict entries
when building test_cases so tc.get won't raise; update the block that currently
assigns public_tests (and uses json.loads) and the loop that builds test_cases
to validate types and only append entries where tc is a dict with safe .get
access.


brluobt commented Mar 4, 2026

@coderabbitai review

Pushed a fix commit addressing all 11 review comments (schema mismatch, subprocess hardening, zip strict, type checks, copyright, docstrings, naming conventions). Please re-review.


coderabbitai bot commented Mar 4, 2026

@brluobt Sure! I'll re-review the PR with the latest changes, focusing on the fixes you mentioned — schema mismatch, subprocess hardening, zip strict, type checks, copyright headers, docstrings, and naming conventions.

✅ Actions performed

Full review triggered.

coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (2)
nemo_rl/data/datasets/response_datasets/livecodebench.py (2)

52-54: ⚠️ Potential issue | 🟠 Major

Remove hidden config defaults from constructor inputs.

Lines 52-54 and 73-74 still use non-None in-code defaults (variant, split_validation_size, seed). Read required keys directly so defaults stay in YAML/config only.

♻️ Suggested fix
-        variant = kwargs.get("variant", "release_v5")
-        data_file = VARIANT_TO_FILE.get(variant, "test5.jsonl")
+        variant = kwargs["variant"]
+        data_file = VARIANT_TO_FILE[variant]
@@
-        split_validation_size = kwargs.get("split_validation_size", 0)
-        seed = kwargs.get("seed", 42)
+        split_validation_size = kwargs["split_validation_size"]
+        seed = kwargs["seed"]

As per coding guidelines, “YAML is the single source of truth for configuration defaults; do not set non-None defaults in code for configuration values.”

Also applies to: 73-74

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/response_datasets/livecodebench.py` around lines 52 -
54, The constructor currently uses in-code defaults for configuration keys
(e.g., variant via kwargs.get("variant", "release_v5") and other params like
split_validation_size and seed); remove those non-None defaults and read
required keys directly from kwargs (e.g., access kwargs["variant"],
kwargs["split_validation_size"], kwargs["seed"]) so the YAML/config remains the
single source of truth; update any logic that uses VARIANT_TO_FILE lookup to
handle a KeyError or validate presence and surface a clear error if the config
key is missing rather than falling back to an in-code default.

55-60: ⚠️ Potential issue | 🟠 Major

trust_remote_code should be config-gated and optionally revision-pinned.

Line 59 still hard-enables remote dataset code execution. Make this explicit via config (and support revision pinning) before loading.

🔒 Suggested hardening
+        allow_trust_remote_code = kwargs["allow_trust_remote_code"]
+        dataset_revision = kwargs.get("dataset_revision")
         try:
             ds = load_dataset(
                 "livecodebench/code_generation_lite",
                 variant,
                 split="test",
-                trust_remote_code=True,
+                trust_remote_code=allow_trust_remote_code,
+                revision=dataset_revision,
             )
Hugging Face Datasets `load_dataset` docs: security guidance for `trust_remote_code` and recommendation for pinning dataset `revision`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/response_datasets/livecodebench.py` around lines 55 -
60, The code currently calls load_dataset("livecodebench/code_generation_lite",
variant, split="test", trust_remote_code=True) unguarded; change this to read a
config flag (e.g., a new boolean like allow_trust_remote_code in your app
config) and only pass trust_remote_code=True when that flag is explicitly
enabled, and add optional revision pinning support (accept a dataset_revision
config value and pass it as the revision argument to load_dataset when
provided). Update the call site that constructs ds (the load_dataset call using
variant and split="test") to conditionally set trust_remote_code and include
revision if non-empty so remote code execution is opt-in and reproducible.
🧹 Nitpick comments (1)
nemo_rl/data/datasets/response_datasets/__init__.py (1)

88-104: Consider exporting LiveCodeBenchResponseDataset in __all__ for API consistency.

The new dataset is registered and imported, but not exposed in __all__, unlike other dataset classes in this module.

🧩 Optional cleanup
 __all__ = [
     "AIME2024Dataset",
     "CLEVRCoGenTDataset",
     "DAPOMath17KDataset",
     "DAPOMathAIME2024Dataset",
     "DeepScalerDataset",
     "Geometry3KDataset",
     "HelpSteer3Dataset",
+    "LiveCodeBenchResponseDataset",
     "NemoGymDataset",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/response_datasets/__init__.py` around lines 88 - 104,
The module __all__ is missing the recently added LiveCodeBenchResponseDataset
export, so add "LiveCodeBenchResponseDataset" to the __all__ list to make its
symbol part of the public API; update the list alongside the other dataset names
(e.g., near "OpenMathInstruct2Dataset", "RefCOCODataset", etc.) so the imported
LiveCodeBenchResponseDataset is exported consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_rl/data/datasets/eval_datasets/__init__.py`:
- Around line 87-92: The loader currently hardcodes the LiveCodeBench variant
("release_v5") when constructing LiveCodeBenchDataset; change this to read the
variant from the config instead: use data_config.get("variant") (or
data_config["variant"] if required) and pass that value into
LiveCodeBenchDataset(variant=...), leaving no hardcoded default in the code path
that handles dataset_name == "livecodebench".

In `@nemo_rl/data/datasets/eval_datasets/livecodebench.py`:
- Around line 65-70: The dataset load currently calls
load_dataset("livecodebench/code_generation_lite", variant, split="test",
trust_remote_code=True) which unconditionally enables remote code; modify the
class or function that constructs this loader to accept new parameters (e.g.,
allow_trust_remote_code: bool = False and dataset_revision: Optional[str] =
None), pass allow_trust_remote_code to the trust_remote_code argument instead of
True, and pass dataset_revision to the revision argument of load_dataset; update
any callers or constructor defaults so trust_remote_code stays False by default
and require explicit True and a pinned commit SHA (dataset_revision) to enable
remote code.

In `@nemo_rl/environments/code_test_environment.py`:
- Around line 150-153: The loop over test_cases currently assumes each tc is a
dict and calls tc.get(...), which will raise AttributeError for non-dict
entries; update the loop in the function/method that iterates test_cases (the
block calling run_single_test) to first validate that tc is an instance of dict
(e.g., isinstance(tc, dict)) before accessing .get(), and if it is not, handle
it gracefully by logging a warning or recording a failed test case result and
continue to the next item instead of calling run_single_test; ensure references
to test_input and expected_output come only from validated dicts so
run_single_test(code, test_input, expected_output, timeout) is never invoked
with invalid tc.
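The validated loop described above might look roughly like the following. This is a sketch under the assumption that `run_single_test` takes `(code, test_input, expected_output, timeout)` as in the PR; the wrapper function name is illustrative.

```python
from typing import Any, Callable


def verify_all_tests(
    code: str,
    test_cases: list[Any],
    timeout: float,
    run_single_test: Callable[[str, str, str, float], bool],
) -> bool:
    """Run code against every test case, failing gracefully on malformed entries."""
    for tc in test_cases:
        if not isinstance(tc, dict):
            # Malformed entry: record the sample as failed instead of raising
            # AttributeError on tc.get() and crashing the worker task.
            return False
        test_input = str(tc.get("input", ""))
        expected_output = str(tc.get("expected_output", tc.get("output", "")))
        if not run_single_test(code, test_input, expected_output, timeout):
            return False
    return True
```

The key point is that `.get()` is only ever called on values that have been confirmed to be dicts, so a corrupt dataset row degrades into a zero reward rather than an exception.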

---

Duplicate comments:
In `@nemo_rl/data/datasets/response_datasets/livecodebench.py`:
- Around line 52-54: The constructor currently uses in-code defaults for
configuration keys (e.g., variant via kwargs.get("variant", "release_v5") and
other params like split_validation_size and seed); remove those non-None
defaults and read required keys directly from kwargs (e.g., access
kwargs["variant"], kwargs["split_validation_size"], kwargs["seed"]) so the
YAML/config remains the single source of truth; update any logic that uses
VARIANT_TO_FILE lookup to handle a KeyError or validate presence and surface a
clear error if the config key is missing rather than falling back to an in-code
default.
- Around line 55-60: The code currently calls
load_dataset("livecodebench/code_generation_lite", variant, split="test",
trust_remote_code=True) unguarded; change this to read a config flag (e.g., a
new boolean like allow_trust_remote_code in your app config) and only pass
trust_remote_code=True when that flag is explicitly enabled, and add optional
revision pinning support (accept a dataset_revision config value and pass it as
the revision argument to load_dataset when provided). Update the call site that
constructs ds (the load_dataset call using variant and split="test") to
conditionally set trust_remote_code and include revision if non-empty so remote
code execution is opt-in and reproducible.

---

Nitpick comments:
In `@nemo_rl/data/datasets/response_datasets/__init__.py`:
- Around line 88-104: The module __all__ is missing the recently added
LiveCodeBenchResponseDataset export, so add "LiveCodeBenchResponseDataset" to
the __all__ list to make its symbol part of the public API; update the list
alongside the other dataset names (e.g., near "OpenMathInstruct2Dataset",
"RefCOCODataset", etc.) so the imported LiveCodeBenchResponseDataset is exported
consistently.

ℹ️ Review info
Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a22e1e36-9c13-4ef9-9afa-dc0f0b277b65

📥 Commits

Reviewing files that changed from the base of the PR and between 1a32e9e and 0196938.

📒 Files selected for processing (14)
  • examples/configs/evals/code_eval.yaml
  • examples/configs/evals/eval.yaml
  • examples/configs/grpo_code.yaml
  • examples/prompts/code.txt
  • examples/run_eval.py
  • nemo_rl/data/datasets/eval_datasets/__init__.py
  • nemo_rl/data/datasets/eval_datasets/livecodebench.py
  • nemo_rl/data/datasets/response_datasets/__init__.py
  • nemo_rl/data/datasets/response_datasets/livecodebench.py
  • nemo_rl/data/processors.py
  • nemo_rl/distributed/ray_actor_environment_registry.py
  • nemo_rl/environments/code_test_environment.py
  • nemo_rl/environments/utils.py
  • tests/unit/environments/test_code_test_environment.py

Comment on lines +87 to +92
elif dataset_name == "livecodebench":
base_dataset = LiveCodeBenchDataset(
variant="release_v5",
prompt_file=data_config["prompt_file"],
system_prompt_file=data_config["system_prompt_file"],
)

⚠️ Potential issue | 🟠 Major

Remove the hardcoded LiveCodeBench variant from loader logic.

Line 89 bakes in "release_v5" as a hidden default. Read the required variant from config so YAML remains the default authority.

♻️ Suggested fix
     elif dataset_name == "livecodebench":
         base_dataset = LiveCodeBenchDataset(
-            variant="release_v5",
+            variant=data_config["variant"],
             prompt_file=data_config["prompt_file"],
             system_prompt_file=data_config["system_prompt_file"],
         )

As per coding guidelines, “YAML is the single source of truth for configuration defaults; do not set non-None defaults in code for configuration values.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/eval_datasets/__init__.py` around lines 87 - 92, The
loader currently hardcodes the LiveCodeBench variant ("release_v5") when
constructing LiveCodeBenchDataset; change this to read the variant from the
config instead: use data_config.get("variant") (or data_config["variant"] if
required) and pass that value into LiveCodeBenchDataset(variant=...), leaving no
hardcoded default in the code path that handles dataset_name == "livecodebench".

Comment on lines +65 to +70
ds = load_dataset(
"livecodebench/code_generation_lite",
variant,
split="test",
trust_remote_code=True,
)

⚠️ Potential issue | 🟠 Major


🌐 Web query:

Hugging Face Datasets security recommendations for load_dataset trust_remote_code parameter best practices

💡 Result:

What trust_remote_code means (and why it’s risky)

In datasets.load_dataset(...), trust_remote_code controls whether 🤗 Datasets is allowed to execute a dataset repository’s Python loading script (“custom code”) from the Hub on your machine. Hugging Face explicitly warns to enable it only for repos you trust and whose code you’ve read, because it executes Hub-hosted code locally. (huggingface.co)

Security best practices

  1. Prefer datasets that don’t require custom code

    • If the dataset is just data files in standard formats (parquet/json/csv/webdataset, etc.), load it without remote code. (This avoids executing arbitrary Python entirely.) (huggingface.co)
  2. Keep trust_remote_code=False unless you have a review process

    • Treat enabling it like running an untrusted program. Only enable after you (or your org) has reviewed the repo’s loading script and any imported modules in the repo. (huggingface.co)
  3. Pin the exact repo version with revision

    • Don’t run “whatever is on main today”. Pin to a commit SHA (or at least a tag) so the code you reviewed can’t change silently. Datasets supports revision=... for this. (huggingface.co)
    • Hugging Face gives the same recommendation for remote code in general: verify the code and pin a revision to protect against updates. (github.com)
  4. Vendor the loader code locally for production

    • A common production pattern is:
      • download/inspect the dataset script,
      • store it in your own repo,
      • load via a local path (still code execution, but now it’s your controlled artifact), and avoid fetching executable code at runtime. (Local loading scripts are supported.) (huggingface.co)
  5. Run with least privilege / in a sandbox when you must enable it

    • Use containers/VMs, read-only filesystems where possible, restricted network/egress, and non-privileged users—especially in CI or shared servers—because the loader code can do arbitrary Python-side effects. (This follows directly from the “executes code on your machine” warning.) (huggingface.co)
  6. Use Hub transparency features as part of review

    • HF emphasizes that repos are git-versioned (auditable history) and that you can pin specific revisions; the Hub also runs security scanning. Use these as inputs to your trust decision—not as a substitute for code review when enabling remote code. (github.com)

Recommended usage pattern (example)

from datasets import load_dataset

ds = load_dataset(
    "org/dataset_with_script",
    revision="COMMIT_SHA",        # pin what you reviewed
    trust_remote_code=True,       # only after review
)

(Prefer trust_remote_code=False whenever the dataset can be loaded without a script.) (huggingface.co)


Gate trust_remote_code instead of hard-enabling it.

Line 69 unconditionally enables remote code execution during dataset load. According to Hugging Face security best practices, trust_remote_code should remain disabled by default and only be enabled after code review. Additionally, a revision parameter should be pinned to a specific commit SHA to prevent silent updates to the loading script on the Hub. Add constructor parameters to control both the remote code execution and dataset revision.

🔒 Suggested hardening
 class LiveCodeBenchDataset:
     def __init__(
         self,
         variant: VariantType = "release_v5",
         prompt_file: Optional[str] = None,
         system_prompt_file: Optional[str] = None,
+        allow_trust_remote_code: bool = False,
+        dataset_revision: Optional[str] = None,
     ):
         data_file = VARIANT_TO_FILE.get(variant, f"test{variant.replace('release_v', '')}.jsonl")
         try:
             ds = load_dataset(
                 "livecodebench/code_generation_lite",
                 variant,
                 split="test",
-                trust_remote_code=True,
+                trust_remote_code=allow_trust_remote_code,
+                revision=dataset_revision,
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/eval_datasets/livecodebench.py` around lines 65 - 70,
The dataset load currently calls
load_dataset("livecodebench/code_generation_lite", variant, split="test",
trust_remote_code=True) which unconditionally enables remote code; modify the
class or function that constructs this loader to accept new parameters (e.g.,
allow_trust_remote_code: bool = False and dataset_revision: Optional[str] =
None), pass allow_trust_remote_code to the trust_remote_code argument instead of
True, and pass dataset_revision to the revision argument of load_dataset; update
any callers or constructor defaults so trust_remote_code stays False by default
and require explicit True and a pinned commit SHA (dataset_revision) to enable
remote code.

Comment on lines +150 to +153
for tc in test_cases:
test_input = tc.get("input", "")
expected_output = tc.get("expected_output", tc.get("output", ""))
if not run_single_test(code, test_input, expected_output, timeout):

⚠️ Potential issue | 🟠 Major

Guard test_cases item types before calling .get().

At Line 151, a non-dict entry in test_cases will raise AttributeError and fail the worker task. Validate each element first and fail that sample gracefully.

🛠️ Suggested fix
         for tc in test_cases:
-            test_input = tc.get("input", "")
-            expected_output = tc.get("expected_output", tc.get("output", ""))
+            if not isinstance(tc, dict):
+                all_passed = False
+                break
+            test_input = str(tc.get("input", ""))
+            expected_output = str(tc.get("expected_output", tc.get("output", "")))
             if not run_single_test(code, test_input, expected_output, timeout):
                 all_passed = False
                 break
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/environments/code_test_environment.py` around lines 150 - 153, The
loop over test_cases currently assumes each tc is a dict and calls tc.get(...),
which will raise AttributeError for non-dict entries; update the loop in the
function/method that iterates test_cases (the block calling run_single_test) to
first validate that tc is an instance of dict (e.g., isinstance(tc, dict))
before accessing .get(), and if it is not, handle it gracefully by logging a
warning or recording a failed test case result and continue to the next item
instead of calling run_single_test; ensure references to test_input and
expected_output come only from validated dicts so run_single_test(code,
test_input, expected_output, timeout) is never invoked with invalid tc.

brluo added 2 commits March 4, 2026 13:58
…vironment

Add end-to-end code generation evaluation to NeMo RL with:

- CodeTestCaseEnvironment: new environment that executes model-generated code
  against stdin/stdout test cases and returns binary pass/fail rewards.
  Follows the MathEnvironment pattern with accuracy, pass@k metrics.
- LiveCodeBench dataset integration for both eval and GRPO training pipelines,
  compatible with datasets library v3.x and v4.x.
- code_data_processor for mapping code problems with test cases to DatumSpec.
- Generalized run_eval.py to use create_env() instead of hardcoded
  MathEnvironment, enabling evaluation with any registered environment
  while maintaining backward compatibility (defaults to "math").
- Eval config (code_eval.yaml) and GRPO training config (grpo_code.yaml).
- Unit tests for the new environment and code extraction logic.

Validated end-to-end: Qwen2.5-Coder-1.5B-Instruct on LiveCodeBench v5
(167 problems, pass@1 = 3.0% with public test cases only).

Made-with: Cursor
Signed-off-by: brluo <brluo@nvidia.com>
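The pass@k metric reported above is commonly computed with the unbiased estimator from the Codex paper (Chen et al., 2021). A minimal sketch, which may differ from the exact implementation in this PR:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    Given n samples per problem of which c are correct, estimates the
    probability that at least one of k randomly drawn samples passes:
        pass@k = 1 - C(n - c, k) / C(n, k)
    """
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For pass@1 this reduces to the fraction of correct samples, c / n, which matches the "pass@1 = 3.0%" style of result quoted in the commit message.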
- Fix GRPO schema mismatch: LiveCodeBench response dataset now outputs
  'problem' and 'test_cases' keys matching code_data_processor expectations
- Harden run_single_test: check returncode==0, add Python isolation flags
  (-I, -S), use temp directory sandbox, minimal env with PYTHONNOUSERSITE
- Add strict=True to zip() calls to catch length mismatches early
- Harden public_test_cases parsing with isinstance checks before iteration
- Move hidden defaults to YAML: env_name in eval.yaml, timeout_per_test
  read directly from config without fallback
- Update copyright year to 2026 in all new files
- Complete Literal type annotation for all supported LCB variants
- Rename test global cfg to G_CODE_TEST_ENV_CFG per naming convention
- Add comprehensive docstrings to all public functions and classes

Made-with: Cursor
Signed-off-by: brluo <brluo@nvidia.com>
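The hardened `run_single_test` described in the bullets above (returncode check, `-I`/`-S` isolation flags, temp-directory sandbox, minimal env with `PYTHONNOUSERSITE`) could be sketched roughly as follows. This is an illustration of those hardening measures, not the PR's exact code.

```python
import subprocess
import sys
import tempfile


def run_single_test(
    code: str, test_input: str, expected_output: str, timeout: float
) -> bool:
    """Execute code in an isolated interpreter and compare stdout to the expectation."""
    with tempfile.TemporaryDirectory() as tmpdir:
        try:
            proc = subprocess.run(
                # -I: isolated mode (ignores env vars and user site); -S: skip site import.
                [sys.executable, "-I", "-S", "-c", code],
                input=test_input,
                capture_output=True,
                text=True,
                timeout=timeout,
                cwd=tmpdir,  # temp-directory sandbox for any files the code writes
                env={"PYTHONNOUSERSITE": "1"},  # minimal environment
            )
        except subprocess.TimeoutExpired:
            return False
    if proc.returncode != 0:
        # A crash or nonzero exit is a failed test even if stdout happens to match.
        return False
    return proc.stdout.strip() == expected_output.strip()
```

Note that this is process-level isolation only; for untrusted model output in production, an OS-level sandbox (container, seccomp, or similar) around the worker is still advisable.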
@brluobt brluobt force-pushed the feat/code-generation-evaluation branch from 0196938 to e6a3f4b Compare March 4, 2026 05:59
- Read LiveCodeBench variant from config instead of hardcoding release_v5
- Prefer JSONL loading (no remote code execution) over trust_remote_code;
  fall back to dataset script only when JSONL is unavailable
- Remove hidden kwargs.get defaults in response dataset constructor;
  require variant, split_validation_size, seed from YAML config
- Add isinstance(tc, dict) guard in verify loop before calling tc.get()
- Add LiveCodeBenchResponseDataset to __all__ export
- Add variant field to grpo_code.yaml train config

Signed-off-by: brluo <brluo@nvidia.com>
Made-with: Cursor