feat: add code generation evaluation with test-case-driven sandbox environment #2056

brluobt wants to merge 3 commits into NVIDIA-NeMo:main
Conversation
📝 Walkthrough

Adds code-generation evaluation and training artifacts: a CodeTestCaseEnvironment for executing and scoring Python solutions, LiveCodeBench dataset loaders (eval and response), a code data processor, config examples for eval/training, and unit tests; wires environment and actor registries.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Model as Model/Response Generator
    participant Env as CodeTestCaseEnvironment
    participant Extractor as Code Extractor
    participant WorkerPool as VerifyWorker Pool
    participant Executor as Subprocess Executor
    Model->>Env: send generated response
    Env->>Extractor: extract code block / code text
    Extractor-->>Env: return code
    Env->>Env: parse/normalize test_cases
    Env->>WorkerPool: assign (code, test_cases) batches
    par parallel
        WorkerPool->>Executor: run code with test input (isolated subprocess)
        Executor->>Executor: execute & compare stdout to expected
        Executor-->>WorkerPool: return pass/fail
    end
    WorkerPool-->>Env: aggregate pass/fail per response
    Env->>Env: compute reward (1.0 if all pass else 0.0)
    Env-->>Model: observation + reward (+ optional extracted code)
```
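The flow above can be condensed into a minimal, self-contained sketch. The names (`run_single_test`, `score_response`) and the all-or-nothing reward rule follow the walkthrough, but this is an illustration, not the PR's actual implementation; it also omits the worker pool and the subprocess isolation that later review comments ask for.

```python
import subprocess
import sys

def run_single_test(code: str, test_input: str, expected_output: str, timeout: float) -> bool:
    """Run candidate code in a subprocess and compare its stdout to the expected output."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            input=test_input,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0 and result.stdout.strip() == expected_output.strip()

def score_response(code: str, test_cases: list[dict], timeout: float = 5.0) -> float:
    """Reward is 1.0 only when every test case passes, else 0.0."""
    passed = all(
        run_single_test(code, tc["input"], tc["expected_output"], timeout)
        for tc in test_cases
    )
    return 1.0 if passed else 0.0
```

In the real environment these calls would be fanned out across the VerifyWorker pool shown in the diagram.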
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
Actionable comments posted: 11
🧹 Nitpick comments (2)
tests/unit/environments/test_code_test_environment.py (1)
27-30: Rename module-level config to match global naming rule.

Use `G_...` upper snake_case for this global test config (currently `cfg`).

✏️ Proposed rename

```diff
-cfg: CodeTestEnvConfig = {
+G_CODE_TEST_ENV_CFG: CodeTestEnvConfig = {
@@
-    env_actor = CodeTestCaseEnvironment.remote(cfg)
+    env_actor = CodeTestCaseEnvironment.remote(G_CODE_TEST_ENV_CFG)
```

As per coding guidelines, "Use upper snake_case with `G_` prefix for global variables, e.g., `G_MY_GLOBAL`."

Also applies to: 35-35
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/environments/test_code_test_environment.py` around lines 27 - 30: The module-level global variable `cfg` of type `CodeTestEnvConfig` should be renamed to follow the project's global naming rule (upper snake_case with `G_` prefix); rename `cfg` to a descriptive constant like `G_CODE_TEST_ENV_CFG`, update the declaration and all other references (including the other occurrence noted) to use `G_CODE_TEST_ENV_CFG`, and preserve the value and type (`num_workers` and `timeout_per_test`) so tests still use the same config.

nemo_rl/environments/code_test_environment.py (1)

129-133: Remove hidden `timeout_per_test` default from code path.

This should come from YAML config, not a non-None fallback in Python.

♻️ Suggested fix

```diff
-        self.timeout = cfg.get("timeout_per_test", 5)
+        self.timeout = cfg["timeout_per_test"]
```

As per coding guidelines, "YAML is the single source of truth for configuration defaults; do not set non-None defaults in code for configuration values".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_rl/environments/code_test_environment.py` around lines 129 - 133, The constructor (__init__) currently sets self.timeout = cfg.get("timeout_per_test", 5), which injects a hidden default; remove the hardcoded fallback so the timeout comes strictly from the YAML-backed CodeTestEnvConfig. Change the assignment to read the value directly (e.g., self.timeout = cfg["timeout_per_test"] or otherwise access the config key without a non-None default) so missing values surface as config errors rather than silently using 5.
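The guideline reduces to a one-line difference: an indexed lookup fails fast where `.get` silently invents a default. A toy illustration (`cfg` here is a stand-in dict, not the real `CodeTestEnvConfig`):

```python
cfg = {"num_workers": 4}  # note: timeout_per_test deliberately absent

# Hidden default: silently runs with 5 even though no config ever set it.
timeout_with_fallback = cfg.get("timeout_per_test", 5)

# Fail-fast: the missing key surfaces immediately as a KeyError.
try:
    timeout = cfg["timeout_per_test"]
except KeyError as err:
    print(f"config error: missing key {err}")
```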
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/configs/grpo_code.yaml`:
- Around line 266-274: The config uses dataset_name "LiveCodeBench" but the
chosen processor "code_data_processor" expects a "problem" key (accessed as
datum_dict["problem"] in code_data_processor in nemo_rl/data/processors.py)
while the registered LiveCodeBench response dataset returns "messages" and
"test_cases"; fix by either switching to the eval variant of the dataset (e.g.,
the LiveCodeBench evaluation dataset name used elsewhere) so it provides
"problem", or change the processor to one that consumes the response-schema (a
processor that handles "messages"/"test_cases") so preprocessing no longer
attempts to access datum_dict["problem"].
In `@examples/run_eval.py`:
- Around line 58-59: Remove the hidden default by reading the env_name directly
from the config and ensure the YAML contains the default; replace the current
env_name = data_config.get("env_name", "math") with a direct lookup (e.g.,
data_config["env_name"]) and confirm the YAML used by data_config provides a
default value for env_name so create_env(env_name, env_configs[env_name]) still
receives a valid name.
In `@nemo_rl/data/datasets/eval_datasets/livecodebench.py`:
- Line 1: Update the NVIDIA copyright header year from 2025 to the current year
(2026) in nemo_rl/data/datasets/eval_datasets/livecodebench.py so the file
header matches project guidelines; locate the top-of-file header in
livecodebench.py and replace the year token "2025" with "2026".
- Around line 39-43: The Literal type for the parameter variant in the function
(the parameter named variant used to compute data_file from VARIANT_TO_FILE) is
missing supported keys (release_v1, release_v2, release_v6) and causes
type-check failures; update the variant annotation in the function signature to
include those missing Literal values (or replace the Literal with a broader type
such as str/Union[str, Literal[...]] that covers all keys), then keep the
data_file computation using VARIANT_TO_FILE.get(variant, ...) unchanged so valid
mapped variants are accepted by the type checker.
In `@nemo_rl/data/datasets/response_datasets/livecodebench.py`:
- Line 1: Update the copyright header in the top of livecodebench.py to use the
current year (replace "2025" with "2026"); locate the header comment at the top
of the file (in nemo_rl/data/datasets/response_datasets/livecodebench.py) and
change the year in the NVIDIA copyright line so it matches the coding guideline.
- Around line 32-33: The code is setting hidden non-None config defaults in the
code (e.g., using kwargs.get("variant", "release_v5") and similar defaults near
variables like variant_to_file); remove those in-code defaults and read
configuration keys directly (e.g., use kwargs["variant"] or the config object)
so the default values come from the YAML config instead; update any references
in this module (symbols: variant, variant_to_file and any other kwargs.get calls
around this function/class) to assume the key exists and let the YAML be the
single source of truth for defaults.
- Around line 67-79: public_test_cases normalization must be hardened: ensure
public_tests becomes a list before iterating (handle if it's a dict by wrapping
it in a list, if it's a scalar/other type set to empty list), and skip any
non-dict entries when building test_cases so tc.get won't raise; update the
block that currently assigns public_tests (and uses json.loads) and the loop
that builds test_cases to validate types and only append entries where tc is a
dict with safe .get access.
- Around line 44-49: Add a boolean config option (e.g., allow_trust_remote_code
defaulting to False) and use it to gate the call to load_dataset(...) so
trust_remote_code is set to that setting rather than True; update the load path
in livecodebench.py (the call site using load_dataset and the variable ds) to
pass trust_remote_code=allow_trust_remote_code, and add an accompanying optional
dataset_revision config (default None) to pin a dataset revision and pass it to
load_dataset when provided; also update the function/class docstring or config
comment to state that trust_remote_code must be enabled only after reviewing
remote code and that pinning a revision is recommended if enabled.
In `@nemo_rl/environments/code_test_environment.py`:
- Line 1: Update the NVIDIA copyright header year in
nemo_rl/environments/code_test_environment.py from 2025 to the current year
(2026); locate the top-of-file header comment (the copyright line) and replace
the year value only so the file header matches repo guidance requiring the
current year.
- Around line 67-79: The run_single_test function currently runs untrusted code
with full privileges and only checks stdout; modify it to require
result.returncode == 0 in addition to stdout matching, and harden the
subprocess.run invocation by adding isolation flags "-I" and "-S" to the Python
args, setting cwd to a newly created isolated temporary directory, and passing a
minimal sanitized env that disables user site packages (e.g., set
PYTHONNOUSERSITE=1) and removes dangerous variables; ensure these changes are
applied to the subprocess.run call and the return decision (in run_single_test)
so a non-zero exit code causes the test to fail.
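Put together, the requested hardening might look like the following sketch. It is an assumption-laden illustration, not the PR's code: `-I` runs CPython in isolated mode, `-S` skips the `site` import, the env is reduced to a minimal allowlist, and a non-zero exit code now fails the test.

```python
import os
import subprocess
import sys
import tempfile

def run_single_test(code: str, test_input: str, expected_output: str, timeout: float) -> bool:
    """Execute untrusted code with basic isolation and require a clean exit."""
    with tempfile.TemporaryDirectory() as workdir:
        env = {
            "PATH": os.defpath,        # minimal PATH; no inherited secrets or overrides
            "PYTHONNOUSERSITE": "1",   # ignore user site-packages
        }
        try:
            result = subprocess.run(
                [sys.executable, "-I", "-S", "-c", code],  # -I isolated, -S no site import
                input=test_input,
                capture_output=True,
                text=True,
                timeout=timeout,
                cwd=workdir,           # fresh scratch directory per run
                env=env,
            )
        except subprocess.TimeoutExpired:
            return False
    # Stdout must match AND the process must exit cleanly.
    return result.returncode == 0 and result.stdout.strip() == expected_output.strip()
```

A subprocess is not a security boundary on its own; resource limits and network restrictions would still be needed for truly untrusted code.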
- Line 96: The zip between pred_responses and test_cases_batch should use
strict=True to fail fast on length mismatches: update the for loop iterating
over pred_responses and test_cases_batch (the variables pred_responses and
test_cases_batch in code_test_environment.py) to use zip(..., strict=True). Do
the same for the other zip that pairs the chunks produced by
chunk_list_to_workers() (the zip around the chunked worker lists) so both zips
enforce equal lengths and raise immediately on mismatches.
---
Nitpick comments:
In `@nemo_rl/environments/code_test_environment.py`:
- Around line 129-133: The constructor (__init__) currently sets self.timeout =
cfg.get("timeout_per_test", 5), which injects a hidden default; remove the
hardcoded fallback so the timeout comes strictly from the YAML-backed
CodeTestEnvConfig. Change the assignment to read the value directly (e.g.,
self.timeout = cfg["timeout_per_test"] or otherwise access the config key
without a non-None default) so missing values surface as config errors rather
than silently using 5.
In `@tests/unit/environments/test_code_test_environment.py`:
- Around line 27-30: The module-level global variable cfg of type
CodeTestEnvConfig should be renamed to follow the project's global naming rule
(upper snake_case with G_ prefix); rename cfg to a descriptive constant like
G_CODE_TEST_ENV_CFG, update the declaration and all other references (including
the other occurrence noted) to use G_CODE_TEST_ENV_CFG, and preserve the value
and type (num_workers and timeout_per_test) so tests still use the same config.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
examples/configs/evals/code_eval.yaml
examples/configs/grpo_code.yaml
examples/prompts/code.txt
examples/run_eval.py
nemo_rl/data/datasets/eval_datasets/__init__.py
nemo_rl/data/datasets/eval_datasets/livecodebench.py
nemo_rl/data/datasets/response_datasets/__init__.py
nemo_rl/data/datasets/response_datasets/livecodebench.py
nemo_rl/data/processors.py
nemo_rl/distributed/ray_actor_environment_registry.py
nemo_rl/environments/code_test_environment.py
nemo_rl/environments/utils.py
tests/unit/environments/test_code_test_environment.py
```python
ds = load_dataset(
    "livecodebench/code_generation_lite",
    variant,
    split="test",
    trust_remote_code=True,
)
```
🧩 Analysis chain
🌐 Web query:
Hugging Face datasets load_dataset trust_remote_code parameter security guidance best practices
💡 Result:
What trust_remote_code does (and why it's risky)

`datasets.load_dataset(..., trust_remote_code=...)` controls whether the 🤗 Datasets library is allowed to execute a dataset repository's Python loading script (custom code) on your machine. This is effectively remote code execution if you load from the Hub. The docs explicitly warn to enable it only for repos you trust and whose code you've reviewed. [1]

Current behavior / defaults to be aware of

- The `trust_remote_code` parameter was introduced to make executing dataset scripts an explicit choice (and to support disabling it by default). [2]
- In `datasets` 2.20.0, datasets that rely on a Python loading script require passing `trust_remote_code=True` (i.e., it's no longer implicitly trusted by default for those cases). [3]
- Some curated/enterprise contexts disallow models that require `trust_remote_code=True` unless they're explicitly verified or from a trusted org (example: Hugging Face's Azure collection policy). [4]

Security best practices (recommended)

- Prefer datasets with no custom Python script (e.g., plain Parquet/JSON/CSV hosted in the repo). If you control the dataset, publish it in a "data-only" format to avoid requiring remote code. (This is the direction the library has encouraged as part of the trust change.) [2]
- Keep `trust_remote_code=False` by default, and only enable it per-dataset when you must. You can enforce a safer default via `HF_DATASETS_TRUST_REMOTE_CODE=0`. [2]
- If you must enable it:
  - Inspect the repository code (the dataset script and any referenced modules) before running it. [1]
  - Pin a specific revision (commit SHA/tag) so the code you reviewed can't change underneath you between runs. This "pin a revision" guidance is explicitly recommended in Hugging Face security guidance around remote code. [5]
- Sandbox execution when feasible: run dataset loading in an isolated container/VM with least privilege (read-only filesystem where possible, no secrets in env, restricted network/egress if you don't need downloads beyond the dataset artifacts).
- Use Hub security controls in your org: enable MFA, use scoped access tokens, and take advantage of Hub-side security features (e.g., malware scanning and related controls) to reduce supply-chain risk. [6]
Sources
[1] 🤗 Datasets “Loading methods” reference (trust_remote_code warning) (huggingface.co)
[2] datasets 2.16.0 release notes (adds trust_remote_code, env var HF_DATASETS_TRUST_REMOTE_CODE) (newreleases.io)
[3] datasets 2.20.0 release notes (scripts require explicit trust_remote_code=True) (newreleases.io)
[4] Hugging Face on Azure security/compliance policy (trust_remote_code not allowed unless verified/trusted) (huggingface.co)
[5] Hugging Face security guidance on remote code + pinning revisions (Transformers SECURITY.md excerpt in Hub) (huggingface.co)
[6] Hugging Face Hub security features overview (huggingface.co)
Citations:
- 1: https://huggingface.co/docs/datasets/v2.19.0/package_reference/loading_methods?utm_source=openai
- 2: https://newreleases.io/project/github/huggingface/datasets/release/2.16.0?utm_source=openai
- 3: https://newreleases.io/project/github/huggingface/datasets/release/2.20.0?utm_source=openai
- 4: https://huggingface.co/docs/microsoft-azure/security?utm_source=openai
- 5: https://huggingface.co/Student0809/interactSpeech/blob/c848ed39c9d57bf921c6a7c762349c9bd76876af/docs/transformers/SECURITY.md?utm_source=openai
- 6: https://huggingface.co/docs/hub/security?utm_source=openai
🏁 Script executed:

```shell
cd /dev/null && find . -type f -name "livecodebench.py" | head -1 | xargs wc -l
```

Repository: NVIDIA-NeMo/RL
Length of output: 107

🏁 Script executed:

```shell
cd /dev/null && find . -type f -name "livecodebench.py" -exec cat -n {} \;
```

Repository: NVIDIA-NeMo/RL
Length of output: 107

🏁 Script executed:

```shell
find . -type f -name "livecodebench.py" -path "*/nemo_rl/*"
```

Repository: NVIDIA-NeMo/RL
Length of output: 171

🏁 Script executed:

```shell
cat -n ./nemo_rl/data/datasets/response_datasets/livecodebench.py
```

Repository: NVIDIA-NeMo/RL
Length of output: 3936
Gate trust_remote_code=True with a configuration parameter, defaulting to False.
Lines 44-49 unconditionally enable remote code execution from the livecodebench/code_generation_lite dataset without a security gate. Hugging Face documentation explicitly recommends keeping trust_remote_code=False by default and only enabling it for datasets whose code you've reviewed. Add a config parameter to control this behavior and document when it is required. Consider pinning a specific dataset revision to prevent code changes between runs if remote code is necessary.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_rl/data/datasets/response_datasets/livecodebench.py` around lines 44 -
49, Add a boolean config option (e.g., allow_trust_remote_code defaulting to
False) and use it to gate the call to load_dataset(...) so trust_remote_code is
set to that setting rather than True; update the load path in livecodebench.py
(the call site using load_dataset and the variable ds) to pass
trust_remote_code=allow_trust_remote_code, and add an accompanying optional
dataset_revision config (default None) to pin a dataset revision and pass it to
load_dataset when provided; also update the function/class docstring or config
comment to state that trust_remote_code must be enabled only after reviewing
remote code and that pinning a revision is recommended if enabled.
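The gate can be expressed without touching the network as a small helper that assembles `load_dataset` keyword arguments from config. `allow_trust_remote_code` and `dataset_revision` are the hypothetical config keys this prompt proposes; they do not exist in the codebase yet.

```python
def build_load_kwargs(cfg: dict) -> dict:
    """Translate the proposed security config keys into load_dataset kwargs.

    trust_remote_code stays False unless explicitly enabled, and an optional
    revision pin is forwarded only when provided.
    """
    kwargs = {
        "split": "test",
        "trust_remote_code": cfg.get("allow_trust_remote_code", False),
    }
    revision = cfg.get("dataset_revision")
    if revision is not None:
        kwargs["revision"] = revision
    return kwargs
```

The call site would then read `ds = load_dataset("livecodebench/code_generation_lite", variant, **build_load_kwargs(cfg))`.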
```python
public_tests = data.get("public_test_cases", [])
if isinstance(public_tests, str):
    try:
        public_tests = json.loads(public_tests)
    except (json.JSONDecodeError, TypeError):
        public_tests = []

test_cases = []
for tc in public_tests:
    test_cases.append({
        "input": tc.get("input", ""),
        "expected_output": tc.get("output", ""),
    })
```
Harden public_test_cases normalization before iterating.
Lines [75]-[79] assume each tc is a dict. A non-list/non-dict JSON payload will raise and break dataset mapping.
🛠️ Proposed hardening

```diff
-    test_cases = []
-    for tc in public_tests:
-        test_cases.append({
-            "input": tc.get("input", ""),
-            "expected_output": tc.get("output", ""),
-        })
+    test_cases: list[dict[str, str]] = []
+    if isinstance(public_tests, list):
+        for tc in public_tests:
+            if isinstance(tc, dict):
+                test_cases.append({
+                    "input": str(tc.get("input", "")),
+                    "expected_output": str(tc.get("output", "")),
+                })
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_rl/data/datasets/response_datasets/livecodebench.py` around lines 67 -
79, public_test_cases normalization must be hardened: ensure public_tests
becomes a list before iterating (handle if it's a dict by wrapping it in a list,
if it's a scalar/other type set to empty list), and skip any non-dict entries
when building test_cases so tc.get won't raise; update the block that currently
assigns public_tests (and uses json.loads) and the loop that builds test_cases
to validate types and only append entries where tc is a dict with safe .get
access.
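The hardening described here fits naturally into a small, testable normalizer. This is a sketch under the prompt's assumptions (the field names `input`/`output` follow the snippet quoted above; the real loader may differ):

```python
import json

def normalize_test_cases(raw) -> list[dict[str, str]]:
    """Coerce a public_test_cases payload into [{input, expected_output}, ...].

    Tolerates a JSON string, a single dict, or malformed payloads without raising.
    """
    if isinstance(raw, str):
        try:
            raw = json.loads(raw)
        except (json.JSONDecodeError, TypeError):
            raw = []
    if isinstance(raw, dict):  # a lone test case: wrap it in a list
        raw = [raw]
    if not isinstance(raw, list):  # scalars and other junk become no tests
        raw = []
    return [
        {"input": str(tc.get("input", "")), "expected_output": str(tc.get("output", ""))}
        for tc in raw
        if isinstance(tc, dict)  # skip non-dict entries so .get never raises
    ]
```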
@coderabbitai review

Pushed a fix commit addressing all 11 review comments (schema mismatch, subprocess hardening, zip strict, type checks, copyright, docstrings, naming conventions). Please re-review.

✅ Actions performed

Full review triggered.
Actionable comments posted: 3
♻️ Duplicate comments (2)
nemo_rl/data/datasets/response_datasets/livecodebench.py (2)
52-54: ⚠️ Potential issue | 🟠 Major

Remove hidden config defaults from constructor inputs.

Lines 52-54 and 73-74 still use non-`None` in-code defaults (`variant`, `split_validation_size`, `seed`). Read required keys directly so defaults stay in YAML/config only.

♻️ Suggested fix

```diff
-        variant = kwargs.get("variant", "release_v5")
-        data_file = VARIANT_TO_FILE.get(variant, "test5.jsonl")
+        variant = kwargs["variant"]
+        data_file = VARIANT_TO_FILE[variant]
@@
-        split_validation_size = kwargs.get("split_validation_size", 0)
-        seed = kwargs.get("seed", 42)
+        split_validation_size = kwargs["split_validation_size"]
+        seed = kwargs["seed"]
```

As per coding guidelines, "YAML is the single source of truth for configuration defaults; do not set non-None defaults in code for configuration values."

Also applies to: 73-74
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_rl/data/datasets/response_datasets/livecodebench.py` around lines 52 - 54, The constructor currently uses in-code defaults for configuration keys (e.g., variant via kwargs.get("variant", "release_v5") and other params like split_validation_size and seed); remove those non-None defaults and read required keys directly from kwargs (e.g., access kwargs["variant"], kwargs["split_validation_size"], kwargs["seed"]) so the YAML/config remains the single source of truth; update any logic that uses VARIANT_TO_FILE lookup to handle a KeyError or validate presence and surface a clear error if the config key is missing rather than falling back to an in-code default.
55-60: ⚠️ Potential issue | 🟠 Major

`trust_remote_code` should be config-gated and optionally revision-pinned.

Line 59 still hard-enables remote dataset code execution. Make this explicit via config (and support revision pinning) before loading.

🔒 Suggested hardening

```diff
+        allow_trust_remote_code = kwargs["allow_trust_remote_code"]
+        dataset_revision = kwargs.get("dataset_revision")
         try:
             ds = load_dataset(
                 "livecodebench/code_generation_lite",
                 variant,
                 split="test",
-                trust_remote_code=True,
+                trust_remote_code=allow_trust_remote_code,
+                revision=dataset_revision,
             )
```

Hugging Face Datasets `load_dataset` docs: security guidance for `trust_remote_code` and recommendation for pinning dataset `revision`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_rl/data/datasets/response_datasets/livecodebench.py` around lines 55 - 60, The code currently calls load_dataset("livecodebench/code_generation_lite", variant, split="test", trust_remote_code=True) unguarded; change this to read a config flag (e.g., a new boolean like allow_trust_remote_code in your app config) and only pass trust_remote_code=True when that flag is explicitly enabled, and add optional revision pinning support (accept a dataset_revision config value and pass it as the revision argument to load_dataset when provided). Update the call site that constructs ds (the load_dataset call using variant and split="test") to conditionally set trust_remote_code and include revision if non-empty so remote code execution is opt-in and reproducible.
🧹 Nitpick comments (1)
nemo_rl/data/datasets/response_datasets/__init__.py (1)
88-104: Consider exporting `LiveCodeBenchResponseDataset` in `__all__` for API consistency.

The new dataset is registered and imported, but not exposed in `__all__`, unlike other dataset classes in this module.

🧩 Optional cleanup

```diff
 __all__ = [
     "AIME2024Dataset",
     "CLEVRCoGenTDataset",
     "DAPOMath17KDataset",
     "DAPOMathAIME2024Dataset",
     "DeepScalerDataset",
     "Geometry3KDataset",
     "HelpSteer3Dataset",
+    "LiveCodeBenchResponseDataset",
     "NemoGymDataset",
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_rl/data/datasets/response_datasets/__init__.py` around lines 88 - 104, The module __all__ is missing the recently added LiveCodeBenchResponseDataset export, so add "LiveCodeBenchResponseDataset" to the __all__ list to make its symbol part of the public API; update the list alongside the other dataset names (e.g., near "OpenMathInstruct2Dataset", "RefCOCODataset", etc.) so the imported LiveCodeBenchResponseDataset is exported consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_rl/data/datasets/eval_datasets/__init__.py`:
- Around line 87-92: The loader currently hardcodes the LiveCodeBench variant
("release_v5") when constructing LiveCodeBenchDataset; change this to read the
variant from the config instead: use data_config.get("variant") (or
data_config["variant"] if required) and pass that value into
LiveCodeBenchDataset(variant=...), leaving no hardcoded default in the code path
that handles dataset_name == "livecodebench".
In `@nemo_rl/data/datasets/eval_datasets/livecodebench.py`:
- Around line 65-70: The dataset load currently calls
load_dataset("livecodebench/code_generation_lite", variant, split="test",
trust_remote_code=True) which unconditionally enables remote code; modify the
class or function that constructs this loader to accept new parameters (e.g.,
allow_trust_remote_code: bool = False and dataset_revision: Optional[str] =
None), pass allow_trust_remote_code to the trust_remote_code argument instead of
True, and pass dataset_revision to the revision argument of load_dataset; update
any callers or constructor defaults so trust_remote_code stays False by default
and require explicit True and a pinned commit SHA (dataset_revision) to enable
remote code.
In `@nemo_rl/environments/code_test_environment.py`:
- Around line 150-153: The loop over test_cases currently assumes each tc is a
dict and calls tc.get(...), which will raise AttributeError for non-dict
entries; update the loop in the function/method that iterates test_cases (the
block calling run_single_test) to first validate that tc is an instance of dict
(e.g., isinstance(tc, dict)) before accessing .get(), and if it is not, handle
it gracefully by logging a warning or recording a failed test case result and
continue to the next item instead of calling run_single_test; ensure references
to test_input and expected_output come only from validated dicts so
run_single_test(code, test_input, expected_output, timeout) is never invoked
with invalid tc.
---
Duplicate comments:
In `@nemo_rl/data/datasets/response_datasets/livecodebench.py`:
- Around line 52-54: The constructor currently uses in-code defaults for
configuration keys (e.g., variant via kwargs.get("variant", "release_v5") and
other params like split_validation_size and seed); remove those non-None
defaults and read required keys directly from kwargs (e.g., access
kwargs["variant"], kwargs["split_validation_size"], kwargs["seed"]) so the
YAML/config remains the single source of truth; update any logic that uses
VARIANT_TO_FILE lookup to handle a KeyError or validate presence and surface a
clear error if the config key is missing rather than falling back to an in-code
default.
- Around line 55-60: The code currently calls
load_dataset("livecodebench/code_generation_lite", variant, split="test",
trust_remote_code=True) unguarded; change this to read a config flag (e.g., a
new boolean like allow_trust_remote_code in your app config) and only pass
trust_remote_code=True when that flag is explicitly enabled, and add optional
revision pinning support (accept a dataset_revision config value and pass it as
the revision argument to load_dataset when provided). Update the call site that
constructs ds (the load_dataset call using variant and split="test") to
conditionally set trust_remote_code and include revision if non-empty so remote
code execution is opt-in and reproducible.
---
Nitpick comments:
In `@nemo_rl/data/datasets/response_datasets/__init__.py`:
- Around line 88-104: The module __all__ is missing the recently added
LiveCodeBenchResponseDataset export, so add "LiveCodeBenchResponseDataset" to
the __all__ list to make its symbol part of the public API; update the list
alongside the other dataset names (e.g., near "OpenMathInstruct2Dataset",
"RefCOCODataset", etc.) so the imported LiveCodeBenchResponseDataset is exported
consistently.
ℹ️ Review info
Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a22e1e36-9c13-4ef9-9afa-dc0f0b277b65
📒 Files selected for processing (14)
examples/configs/evals/code_eval.yaml
examples/configs/evals/eval.yaml
examples/configs/grpo_code.yaml
examples/prompts/code.txt
examples/run_eval.py
nemo_rl/data/datasets/eval_datasets/__init__.py
nemo_rl/data/datasets/eval_datasets/livecodebench.py
nemo_rl/data/datasets/response_datasets/__init__.py
nemo_rl/data/datasets/response_datasets/livecodebench.py
nemo_rl/data/processors.py
nemo_rl/distributed/ray_actor_environment_registry.py
nemo_rl/environments/code_test_environment.py
nemo_rl/environments/utils.py
tests/unit/environments/test_code_test_environment.py
```python
elif dataset_name == "livecodebench":
    base_dataset = LiveCodeBenchDataset(
        variant="release_v5",
        prompt_file=data_config["prompt_file"],
        system_prompt_file=data_config["system_prompt_file"],
    )
```
Remove the hardcoded LiveCodeBench variant from loader logic.
Line 89 bakes in "release_v5" as a hidden default. Read the required variant from config so YAML remains the default authority.
♻️ Suggested fix

```diff
 elif dataset_name == "livecodebench":
     base_dataset = LiveCodeBenchDataset(
-        variant="release_v5",
+        variant=data_config["variant"],
         prompt_file=data_config["prompt_file"],
         system_prompt_file=data_config["system_prompt_file"],
     )
```

As per coding guidelines, "YAML is the single source of truth for configuration defaults; do not set non-None defaults in code for configuration values."
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
elif dataset_name == "livecodebench":
    base_dataset = LiveCodeBenchDataset(
        variant=data_config["variant"],
        prompt_file=data_config["prompt_file"],
        system_prompt_file=data_config["system_prompt_file"],
    )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_rl/data/datasets/eval_datasets/__init__.py` around lines 87 - 92, The
loader currently hardcodes the LiveCodeBench variant ("release_v5") when
constructing LiveCodeBenchDataset; change this to read the variant from the
config instead: use data_config.get("variant") (or data_config["variant"] if
required) and pass that value into LiveCodeBenchDataset(variant=...), leaving no
hardcoded default in the code path that handles dataset_name == "livecodebench".
```python
ds = load_dataset(
    "livecodebench/code_generation_lite",
    variant,
    split="test",
    trust_remote_code=True,
)
```
🧩 Analysis chain
🏁 Script executed:

```shell
# First, locate and examine the file
find . -name "livecodebench.py" -path "*/eval_datasets/*" 2>/dev/null
```

Repository: NVIDIA-NeMo/RL
Length of output: 112

🏁 Script executed:

```shell
# Read the file around the specified lines
cat -n nemo_rl/data/datasets/eval_datasets/livecodebench.py | head -120
```

Repository: NVIDIA-NeMo/RL
Length of output: 4610
🌐 Web query:
Hugging Face Datasets security recommendations for load_dataset trust_remote_code parameter best practices
💡 Result:
What trust_remote_code means (and why it’s risky)
In datasets.load_dataset(...), trust_remote_code controls whether 🤗 Datasets is allowed to execute a dataset repository’s Python loading script (“custom code”) from the Hub on your machine. Hugging Face explicitly warns to enable it only for repos you trust and whose code you’ve read, because it executes Hub-hosted code locally. (huggingface.co)
Security best practices
- Prefer datasets that don't require custom code
  - If the dataset is just data files in standard formats (parquet/json/csv/webdataset, etc.), load it without remote code. (This avoids executing arbitrary Python entirely.) (huggingface.co)
- Keep `trust_remote_code=False` unless you have a review process
  - Treat enabling it like running an untrusted program. Only enable after you (or your org) has reviewed the repo's loading script and any imported modules in the repo. (huggingface.co)
- Pin the exact repo version with `revision`
  - Don't run "whatever is on `main` today". Pin to a commit SHA (or at least a tag) so the code you reviewed can't change silently. Datasets supports `revision=...` for this. (huggingface.co)
  - Hugging Face gives the same recommendation for remote code in general: verify the code and pin a revision to protect against updates. (github.com)
- Vendor the loader code locally for production
  - A common production pattern is: download/inspect the dataset script, store it in your own repo, and load via a local path (still code execution, but now it's your controlled artifact), avoiding fetching executable code at runtime. (Local loading scripts are supported.) (huggingface.co)
- Run with least privilege / in a sandbox when you must enable it
  - Use containers/VMs, read-only filesystems where possible, restricted network/egress, and non-privileged users (especially in CI or shared servers), because the loader code can do arbitrary Python-side effects. (This follows directly from the "executes code on your machine" warning.) (huggingface.co)
- Use Hub transparency features as part of review
  - HF emphasizes that repos are git-versioned (auditable history) and that you can pin specific revisions; the Hub also runs security scanning. Use these as inputs to your trust decision, not as a substitute for code review when enabling remote code. (github.com)
Recommended usage pattern (example)
from datasets import load_dataset

ds = load_dataset(
    "org/dataset_with_script",
    revision="COMMIT_SHA",   # pin what you reviewed
    trust_remote_code=True,  # only after review
)

(Prefer `trust_remote_code=False` whenever the dataset can be loaded without a script.) (huggingface.co)
Citations:
- 1: https://huggingface.co/docs/datasets/v2.17.0/package_reference/loading_methods
- 2: https://huggingface.co/docs/datasets/v2.17.0/package_reference/loading_methods
- 3: https://huggingface.co/docs/datasets/v2.17.0/package_reference/loading_methods
- 4: https://huggingface.co/docs/datasets/v3.2.0/loading?utm_source=openai
- 5: https://github.com/huggingface/transformers/security/policy?utm_source=openai
- 6: https://huggingface.co/docs/datasets/v3.1.0/en/loading?utm_source=openai
- 7: https://huggingface.co/docs/datasets/v2.17.0/package_reference/loading_methods
- 8: https://github.com/huggingface/faq?utm_source=openai
- 9: https://huggingface.co/docs/datasets/v2.17.0/package_reference/loading_methods
Gate trust_remote_code instead of hard-enabling it.
Line 69 unconditionally enables remote code execution during dataset load. According to Hugging Face security best practices, trust_remote_code should remain disabled by default and only be enabled after code review. Additionally, a revision parameter should be pinned to a specific commit SHA to prevent silent updates to the loading script on the Hub. Add constructor parameters to control both the remote code execution and dataset revision.
🔒 Suggested hardening
class LiveCodeBenchDataset:
def __init__(
self,
variant: VariantType = "release_v5",
prompt_file: Optional[str] = None,
system_prompt_file: Optional[str] = None,
+ allow_trust_remote_code: bool = False,
+ dataset_revision: Optional[str] = None,
):
data_file = VARIANT_TO_FILE.get(variant, f"test{variant.replace('release_v', '')}.jsonl")
try:
ds = load_dataset(
"livecodebench/code_generation_lite",
variant,
split="test",
- trust_remote_code=True,
+ trust_remote_code=allow_trust_remote_code,
+ revision=dataset_revision,
            )

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_rl/data/datasets/eval_datasets/livecodebench.py` around lines 65 - 70,
The dataset load currently calls
load_dataset("livecodebench/code_generation_lite", variant, split="test",
trust_remote_code=True) which unconditionally enables remote code; modify the
class or function that constructs this loader to accept new parameters (e.g.,
allow_trust_remote_code: bool = False and dataset_revision: Optional[str] =
None), pass allow_trust_remote_code to the trust_remote_code argument instead of
True, and pass dataset_revision to the revision argument of load_dataset; update
any callers or constructor defaults so trust_remote_code stays False by default
and require explicit True and a pinned commit SHA (dataset_revision) to enable
remote code.
    for tc in test_cases:
        test_input = tc.get("input", "")
        expected_output = tc.get("expected_output", tc.get("output", ""))
        if not run_single_test(code, test_input, expected_output, timeout):
Guard test_cases item types before calling .get().
At Line 151, a non-dict entry in test_cases will raise AttributeError and fail the worker task. Validate each element first and fail that sample gracefully.
🛠️ Suggested fix
for tc in test_cases:
- test_input = tc.get("input", "")
- expected_output = tc.get("expected_output", tc.get("output", ""))
+ if not isinstance(tc, dict):
+ all_passed = False
+ break
+ test_input = str(tc.get("input", ""))
+ expected_output = str(tc.get("expected_output", tc.get("output", "")))
if not run_single_test(code, test_input, expected_output, timeout):
all_passed = False
            break

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_rl/environments/code_test_environment.py` around lines 150 - 153, The
loop over test_cases currently assumes each tc is a dict and calls tc.get(...),
which will raise AttributeError for non-dict entries; update the loop in the
function/method that iterates test_cases (the block calling run_single_test) to
first validate that tc is an instance of dict (e.g., isinstance(tc, dict))
before accessing .get(), and if it is not, handle it gracefully by logging a
warning or recording a failed test case result and continue to the next item
instead of calling run_single_test; ensure references to test_input and
expected_output come only from validated dicts so run_single_test(code,
test_input, expected_output, timeout) is never invoked with invalid tc.
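The guard suggested above can also be exercised as a runnable sketch; `check_output` here is a hypothetical stand-in for the subprocess-based `run_single_test`:

```python
def verify_all(check_output, test_cases) -> bool:
    """Run every test case, treating malformed (non-dict) entries as failures.

    `check_output(test_input, expected)` is an illustrative stand-in for the
    real subprocess-based runner; only the guarding logic is the point here.
    """
    for tc in test_cases:
        if not isinstance(tc, dict):  # guard before calling .get()
            return False
        test_input = str(tc.get("input", ""))
        expected = str(tc.get("expected_output", tc.get("output", "")))
        if not check_output(test_input, expected):
            return False
    return True
```

With this guard in place, a stray non-dict entry fails the sample instead of raising `AttributeError` inside the worker.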
…vironment

Add end-to-end code generation evaluation to NeMo RL with:

- CodeTestCaseEnvironment: new environment that executes model-generated code against stdin/stdout test cases and returns binary pass/fail rewards. Follows the MathEnvironment pattern with accuracy, pass@k metrics.
- LiveCodeBench dataset integration for both eval and GRPO training pipelines, compatible with datasets library v3.x and v4.x.
- code_data_processor for mapping code problems with test cases to DatumSpec.
- Generalized run_eval.py to use create_env() instead of hardcoded MathEnvironment, enabling evaluation with any registered environment while maintaining backward compatibility (defaults to "math").
- Eval config (code_eval.yaml) and GRPO training config (grpo_code.yaml).
- Unit tests for the new environment and code extraction logic.

Validated end-to-end: Qwen2.5-Coder-1.5B-Instruct on LiveCodeBench v5 (167 problems, pass@1 = 3.0% with public test cases only).

Made-with: Cursor
Signed-off-by: brluo <brluo@nvidia.com>
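The code-extraction step this commit describes can be sketched roughly as follows; the function name and regex are illustrative, not the PR's exact implementation:

```python
import re

def extract_code(response: str) -> str:
    """Return the last fenced code block from a model response.

    Falls back to the raw text when no fence is found.
    """
    # Match ```python ... ``` (or bare ``` ... ```) fenced blocks.
    blocks = re.findall(r"```(?:python)?\n(.*?)```", response, re.DOTALL)
    return blocks[-1].strip() if blocks else response.strip()
```

Taking the last block is one common convention, since models often emit scratch snippets before their final answer.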
- Fix GRPO schema mismatch: LiveCodeBench response dataset now outputs 'problem' and 'test_cases' keys matching code_data_processor expectations
- Harden run_single_test: check returncode==0, add Python isolation flags (-I, -S), use temp directory sandbox, minimal env with PYTHONNOUSERSITE
- Add strict=True to zip() calls to catch length mismatches early
- Harden public_test_cases parsing with isinstance checks before iteration
- Move hidden defaults to YAML: env_name in eval.yaml, timeout_per_test read directly from config without fallback
- Update copyright year to 2026 in all new files
- Complete Literal type annotation for all supported LCB variants
- Rename test global cfg to G_CODE_TEST_ENV_CFG per naming convention
- Add comprehensive docstrings to all public functions and classes

Made-with: Cursor
Signed-off-by: brluo <brluo@nvidia.com>
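A minimal sketch of the hardened runner this commit describes (non-zero exit treated as failure, isolation flags, temp-directory sandbox, stripped-down environment); the details are illustrative rather than the PR's exact code:

```python
import subprocess
import sys
import tempfile

def run_single_test(code: str, test_input: str, expected: str,
                    timeout: float = 6.0) -> bool:
    """Execute `code` in an isolated interpreter and compare stdout."""
    with tempfile.TemporaryDirectory() as sandbox:  # throwaway working dir
        try:
            result = subprocess.run(
                # -I: isolated mode (ignores PYTHONPATH and user site);
                # -S: skip the site module import.
                [sys.executable, "-I", "-S", "-c", code],
                input=test_input,
                capture_output=True,
                text=True,
                timeout=timeout,
                cwd=sandbox,
                env={"PYTHONNOUSERSITE": "1"},  # minimal environment
            )
        except subprocess.TimeoutExpired:
            return False
    # Require a clean exit *and* matching (whitespace-normalized) output.
    return result.returncode == 0 and result.stdout.strip() == expected.strip()
```

Checking `returncode == 0` matters because a solution can crash after printing the right prefix; output comparison alone would score it as a pass.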
Force-pushed from 0196938 to e6a3f4b
- Read LiveCodeBench variant from config instead of hardcoding release_v5
- Prefer JSONL loading (no remote code execution) over trust_remote_code; fall back to the dataset script only when JSONL is unavailable
- Remove hidden kwargs.get defaults in the response dataset constructor; require variant, split_validation_size, seed from YAML config
- Add isinstance(tc, dict) guard in the verify loop before calling tc.get()
- Add LiveCodeBenchResponseDataset to __all__ exports
- Add variant field to the grpo_code.yaml train config

Signed-off-by: brluo <brluo@nvidia.com>
Made-with: Cursor
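The hardened `public_test_cases` parsing mentioned in this commit can be sketched like this; the helper name is illustrative, and the field may arrive either as a JSON-encoded string or an already-decoded list depending on the loader path:

```python
import json

def parse_public_test_cases(raw) -> list[dict]:
    """Normalize public_test_cases into a list of dicts.

    Accepts a JSON-encoded string or a decoded list; anything that is
    not a dict (malformed entries, wrong top-level type) is dropped.
    """
    if isinstance(raw, str):
        try:
            raw = json.loads(raw)
        except json.JSONDecodeError:
            return []
    if not isinstance(raw, list):
        return []
    return [tc for tc in raw if isinstance(tc, dict)]
```

Normalizing here keeps the downstream verify loop simple: it only ever sees well-formed dict test cases.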
Background
As a new NeMo RL user, I found the GRPO Quick Start guide and evaluation pipeline excellent for getting started with math tasks. However, when I tried to extend my workflow to code generation — an increasingly important use case for RL-based LLM training — I realized there was no equivalent quick-start path:
- `CodeEnvironment` provides sandbox execution but has no test-case validation or reward signal, making it unsuitable for evaluation or GRPO training out of the box.
- `run_eval.py` only works with `MathEnvironment`, so there is no simple command to benchmark a coding model.

My goal with this PR is to provide the same "build, run, see results" experience that math tasks already have, but for code generation. A developer should be able to:
Relates to #858
Summary
Add end-to-end code generation evaluation to NeMo RL. The pipeline supports both evaluation (pass@k on LiveCodeBench) and GRPO training with code test-case rewards.
Quick start examples:
Validated end-to-end: Qwen2.5-Coder-1.5B-Instruct on LiveCodeBench v5 (167 problems), pass@1 = 3.0% with public test cases.
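For context on the pass@1 number above: the standard unbiased pass@k estimator (the HumanEval formulation; whether this PR computes it the same way is not shown in this excerpt) is:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from n samples per problem, c of which are correct:
    1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures to fill all k draws: guaranteed pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With a single sample per problem (n = k = 1), this reduces to the plain fraction of problems solved.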
Changes by Component
1. CodeTestCaseEnvironment (core)
- `nemo_rl/environments/code_test_environment.py` — New environment that extracts code from model responses, executes it against stdin/stdout test cases via subprocess, and returns binary pass/fail rewards (1.0 if all tests pass, 0.0 otherwise). Follows the MathEnvironment pattern with accuracy and pass@k metrics in `global_post_process_and_metrics()`.
- `nemo_rl/environments/utils.py` — Register `"code_test"` in `ENV_REGISTRY`.
- `nemo_rl/distributed/ray_actor_environment_registry.py` — Register the actor with `PY_EXECUTABLES.SYSTEM`.

2. Data Processor

- `nemo_rl/data/processors.py` — Add `code_data_processor`, which maps code problems with test cases into `DatumSpec`, passing `test_cases` and `ground_truth` through `extra_env_info`. Registered in `PROCESSOR_REGISTRY`.

3. Generalize Eval Pipeline

- `examples/run_eval.py` — Replace the hardcoded `MathEnvironment` with `create_env(env_name, ...)`. The `env_name` is read from the data config and defaults to `"math"`, so all existing math eval configs work unchanged.

4. LiveCodeBench Dataset Integration

- `nemo_rl/data/datasets/eval_datasets/livecodebench.py` — Eval dataset loader supporting release_v1 through v6.
- `nemo_rl/data/datasets/response_datasets/livecodebench.py` — GRPO training dataset following the `RawDataset` pattern.
- Compatible with `datasets` 3.x (script loading) and 4.x (JSONL fallback) for container compatibility.

5. Configs and Tests

- `examples/configs/evals/code_eval.yaml` — Ready-to-run code evaluation config.
- `examples/configs/grpo_code.yaml` — Ready-to-run GRPO code training config.
- `examples/prompts/code.txt` — Prompt template for code generation.
- `tests/unit/environments/test_code_test_environment.py` — Unit tests covering code extraction, test case execution, timeout handling, batch processing, and extracted answer support.

Splitting Offer
This PR is self-contained but can be split if preferred:
Happy to split on request.
Test Plan
- Unit tests for code extraction (`extract_code`) and test case execution (`run_single_test`)
- Unit tests for `CodeTestCaseEnvironment` (pass, fail, batch, extracted answer)
- Verified dataset loading with `datasets` 4.x
- Confirmed existing math evals still work (`env_name="math"`)

Summary by CodeRabbit
New Features
Tests