Support agent-specific num_repeats in ng_collect_rollouts by gwarmstrong · Pull Request #1356 · NVIDIA-NeMo/Gym

gwarmstrong · 2026-05-18T20:48:21Z

Agent-specific `num_repeats` for `ng_collect_rollouts`

Motivation

Currently, ng_collect_rollouts applies one global num_repeats to every row,
even when the input JSONL mixes multiple agents (different agent_ref.name).
That makes a job such as "run simple_agent for pass@32 alongside swe_agent for pass@1"
awkward: you either run separate jobs or downsample and recompute metrics after the fact.

Change

`num_repeats` on `RolloutCollectionConfig` is now `Union[int, Dict[str, int]]`
(default `1`):

- int form: unchanged behavior; applies to every row.
- dict form: keys are `agent_ref.name`. The special key `_default` is the
  fallback for agents not explicitly listed. Without `_default`, any row whose
  agent isn't a key in the dict raises a single consolidated error listing
  every unlisted agent. Dict keys that never appear in any input row emit
  a `UserWarning`.
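
The dispatch described above can be sketched roughly as follows. This is only an illustration of the stated semantics, not the code in `nemo_gym/rollout_collection.py`: the function name `expand_rows`, the error and warning wording, and the use of `ValueError` are all assumptions.

```python
import warnings
from typing import Dict, List, Union

DEFAULT_KEY = "_default"  # the special fallback key named in the description


def expand_rows(rows: List[dict], num_repeats: Union[int, Dict[str, int]] = 1) -> List[dict]:
    """Repeat each row per its agent's num_repeats (hypothetical helper)."""
    expanded: List[dict] = []
    missing = set()  # agents with no dict entry and no _default to fall back on
    seen = set()     # agent names that actually occur in the input

    for row in rows:
        agent = row["agent_ref"]["name"]
        seen.add(agent)
        if isinstance(num_repeats, int):
            count = num_repeats                    # int form: same count for every row
        elif agent in num_repeats:
            count = num_repeats[agent]             # dict form: explicit entry
        elif DEFAULT_KEY in num_repeats:
            count = num_repeats[DEFAULT_KEY]       # dict form: fallback
        else:
            missing.add(agent)                     # record now, raise once after the loop
            continue
        expanded.extend(dict(row) for _ in range(count))

    if missing:
        # One consolidated error listing every unlisted agent (wording is made up).
        raise ValueError(f"num_repeats has no entry for agents: {sorted(missing)}")

    if isinstance(num_repeats, dict):
        unused = set(num_repeats) - seen - {DEFAULT_KEY}
        if unused:
            warnings.warn(f"num_repeats keys never seen in input: {sorted(unused)}", UserWarning)

    return expanded
```

With two input rows for `agent_alpha` and `agent_beta`, `expand_rows(rows, {"agent_alpha": 4, "agent_beta": 1})` yields five rows under these assumptions, while `expand_rows(rows, {"typo": 1})` raises one error naming both agents.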

How to verify

Run end-to-end against integrate.api.nvidia.com. From `Gym/`, after
`uv sync --extra dev` and `export NVIDIA_API_KEY=...`:

# 1. Point Gym at integrate.api.nvidia.com.
cat > env.yaml <<EOF
policy_base_url: https://integrate.api.nvidia.com/v1
policy_api_key: ${NVIDIA_API_KEY:?export NVIDIA_API_KEY=nvapi-...}
policy_model_name: nvidia/nvidia-nemotron-nano-9b-v2
EOF

# 2. Wire two agent server instances backed by the shipped simple_agent.
#    Distinct instance names are what lets dict-form num_repeats target them
#    independently.
cat > /tmp/two_agents_demo.yaml <<'EOF'
agent_alpha:
  responses_api_agents:
    simple_agent:
      entrypoint: app.py
      resources_server: {type: resources_servers, name: example_single_tool_call}
      model_server:    {type: responses_api_models, name: policy_model}
agent_beta:
  responses_api_agents:
    simple_agent:
      entrypoint: app.py
      resources_server: {type: resources_servers, name: example_single_tool_call}
      model_server:    {type: responses_api_models, name: policy_model}
EOF

# 3. Generate a 2-row input — one row pinned to each agent instance.
python3 - > /tmp/mixed_input.jsonl <<'PY'
import json
prompt = [
    {"role": "developer", "content": "You are a helpful assistant."},
    {"role": "user", "content": "what is 2+2?"},
]
for name in ("agent_alpha", "agent_beta"):
    print(json.dumps({
        "responses_create_params": {"input": prompt, "tools": []},
        "agent_ref": {"name": name},
    }))
PY

# 4. Start ng_run in this terminal and leave it running until all servers
#    log "ready" (head server + 4 children).
ng_run \
  "+config_paths=[resources_servers/example_single_tool_call/configs/example_single_tool_call.yaml,responses_api_models/vllm_model/configs/vllm_model.yaml,/tmp/two_agents_demo.yaml]"

# 5. In a separate terminal (same Gym/ cwd), run rollouts with per-agent num_repeats.
ng_collect_rollouts \
  +input_jsonl_fpath=/tmp/mixed_input.jsonl \
  +output_jsonl_fpath=/tmp/mixed_rollouts.jsonl \
  +num_samples_in_parallel=4 \
  +upload_rollouts_to_wandb=false \
  '+responses_create_params={max_output_tokens: 64, temperature: 0.0}' \
  '+num_repeats={agent_alpha: 4, agent_beta: 1}'
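
As a sanity check on step 5, the number of rollouts to expect can be computed from the input file and the `num_repeats` mapping. The sketch below is only arithmetic over the input rows; `expected_rollout_count` is a hypothetical helper, and whether `/tmp/mixed_rollouts.jsonl` contains exactly one line per rollout is an assumption not stated in this PR.

```python
import json
from typing import Dict, Iterable, Union


def expected_rollout_count(jsonl_lines: Iterable[str], num_repeats: Union[int, Dict[str, int]]) -> int:
    """Sum the per-row repeat counts implied by num_repeats (hypothetical helper)."""
    total = 0
    for line in jsonl_lines:
        if not line.strip():
            continue  # skip blank lines
        agent = json.loads(line)["agent_ref"]["name"]
        if isinstance(num_repeats, int):
            total += num_repeats
        elif agent in num_repeats:
            total += num_repeats[agent]
        else:
            # Raises KeyError if the agent is unlisted and no _default is given.
            total += num_repeats["_default"]
    return total


# Same two rows as step 3, reduced to the field that matters here.
lines = [json.dumps({"agent_ref": {"name": n}}) for n in ("agent_alpha", "agent_beta")]
print(expected_rollout_count(lines, {"agent_alpha": 4, "agent_beta": 1}))  # prints 5
```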

`num_repeats` on `RolloutCollectionConfig` is now `Union[int, Dict[str, int]]` (default `1`):

- **int form**: unchanged behavior; applies to every row.
- **dict form**: keys are `agent_ref.name`. The special key `_default` is the fallback for agents not explicitly listed. Without `_default`, any row whose agent isn't a key in the dict raises a single consolidated error listing every unlisted agent. Dict keys that never appear in any input row emit a `UserWarning` (catches typos).

Validation surfaces are batched: the Pydantic validator collects every sub-1 value into one error, and the preprocess-time missing-agent check accumulates all offenders across the input before raising.

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
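
The "collect every sub-1 value into one error" behavior might look something like the plain-Python stand-in below. It deliberately avoids Pydantic so it stays self-contained; the function name `validate_num_repeats` and the message text are assumptions, and the real validator is not shown on this page.

```python
from typing import Dict, Union


def validate_num_repeats(value: Union[int, Dict[str, int]]) -> Union[int, Dict[str, int]]:
    """Reject any repeat count below 1, reporting all offenders at once."""
    if isinstance(value, int):
        if value < 1:
            raise ValueError(f"num_repeats must be >= 1, got {value}")
        return value

    # Gather every bad entry first so the user sees them all in a single error,
    # rather than fixing one key, rerunning, and hitting the next.
    bad = {key: count for key, count in value.items() if count < 1}
    if bad:
        raise ValueError(f"num_repeats entries must be >= 1, got: {bad}")
    return value
```

For example, `validate_num_repeats({"agent_alpha": 0, "agent_beta": -1, "_default": 1})` raises a single error that mentions both `agent_alpha` and `agent_beta`.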

copy-pr-bot · 2026-05-18T20:48:25Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

The previous version had an `elif agent_name is None: row_num_repeats = 0` branch inside the new num_repeats dispatch, which only ever fired in the dict-no-default + missing-agent-ref subcase. In the int and dict-with-default subcases the row would still expand wastefully before the post-loop raise.

Hoisting the missing-agent-ref check above the dispatch and `continue`-ing on miss makes `agent_name` non-None for the rest of the body, eliminates the special branch, and gives consistent (no-expansion) behavior across all forms when agent_ref is absent. The final user-visible error is unchanged. Mirrors the same record-then-continue pattern used for `agents_missing_from_num_repeats`.

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
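
A minimal sketch of the record-then-continue shape this commit describes, assuming a loop over input rows. The names `plan_repeats` and `rows_missing_agent_ref` and the error wording are hypothetical, and the handling of agents missing from the dict is left out to keep the focus on the hoisted check.

```python
from typing import Dict, List, Tuple, Union


def plan_repeats(rows: List[dict], num_repeats: Union[int, Dict[str, int]]) -> List[Tuple[int, int]]:
    """Return (row_index, repeat_count) pairs; raise once if any row lacks agent_ref."""
    plan: List[Tuple[int, int]] = []
    rows_missing_agent_ref: List[int] = []

    for idx, row in enumerate(rows):
        agent_name = (row.get("agent_ref") or {}).get("name")

        # Hoisted check: runs before any int/dict dispatch, so a row without
        # agent_ref is never expanded, whichever form num_repeats takes.
        if agent_name is None:
            rows_missing_agent_ref.append(idx)
            continue

        # From here on agent_name is a str; no special None branch is needed.
        if isinstance(num_repeats, int):
            count = num_repeats
        elif agent_name in num_repeats:
            count = num_repeats[agent_name]
        else:
            # Unlisted-agent handling is omitted in this sketch; KeyError without _default.
            count = num_repeats["_default"]
        plan.append((idx, count))

    if rows_missing_agent_ref:
        # Raised after the loop so every offending row is reported together.
        raise ValueError(f"rows missing agent_ref: {rows_missing_agent_ref}")
    return plan
```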

Adds the Artificial Analysis Intelligence Index as a Gym benchmark group (7 of 8 subs; scicode skipped per scope).

- benchmarks/aai/config.yaml chains 7 sub configs and overrides aime25 + livecodebench prompts to match Skills' eval/aai/* renderings.
- benchmarks/aai/prompts/{math,livecodebench}.yaml: character match with Skills' eval/aai/* user prompts (mmlu-pro/gpqa defaults already match).
- benchmarks/aai/merge.py: combines the 7 prepared JSONLs into one rollout-ready file with per-row prompt baking + agent_ref tagging.
- benchmarks/aai/score.py: post-hoc composite (overall_score, math_score, code_score) from per-agent aggregate metrics.
- benchmarks/livecodebench/v5_2407_2412/: new split matching Skills' test_v5_2407_2412 (Jul24-Dec24) used by AAI's livecodebench sub.

Depends on PR NVIDIA-NeMo#1356 (cherry-picked) for agent-specific num_repeats.

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>

cmunley1 · 2026-05-20T04:55:35Z

can we add to docs or at least open an issue to document it?

cmunley1 · 2026-05-20T04:59:02Z

+            "How many times to repeat each example. Either an int (applied to every row) or a "
+            "dict keyed by agent_ref.name (e.g. {simple_agent: 32, swe_agent: 1}). In dict form, "
+            "every agent that appears in the input rows must have an entry, unless a special "
+            '"_default" key is provided as a fallback. Useful for mean@k.'


I wonder if a separate field like num_repeats_default: Optional[int] would be cleaner than _default?

Can ultimately take it wherever you want; the arguments in favor of a single field, as currently implemented, are:
(1) a single field to update when you want to modify the behavior
(2) you don't have to do extra handling for the case where, e.g., num_repeats is an int and num_repeats_default is also an int. Which do you use in that case? How does a user reason about how to set it? The _default key is a bit less ambiguous.

cmunley1

looks good!

Updates two fern/versions/latest/ pages:

- reference/cli-commands.mdx: bump the num_repeats row type from Optional[int] to "int or Dict[str, int]" and describe the dict form, the _default fallback, the consolidated-error semantics, plus a per-agent example CLI block alongside the existing one.
- get-started/rollout-collection.mdx: same update on the tutorial's CLI override table, plus a new "Per-agent rollouts" section between View Rollouts and Rollout Generation Parameters.

Dict literals in prose are wrapped in backticks so MDX doesn't parse them as JSX expressions. No new pages, no nav changes.

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>

Resolves:

- nemo_gym/rollout_collection.py: kept both `import os` (upstream, for NEMO_GYM_MAX_ROLLOUT_ATTEMPTS env var) and `import warnings` (ours, for the dict-form unused-agent UserWarning). No semantic conflict on the num_repeats / preprocess logic itself; auto-merge slotted the new field next to upstream's additions cleanly.
- fern/versions/latest/pages/get-started/rollout-collection.mdx: upstream (NVIDIA-NeMo#1283) deleted this page when refactoring get-started/ into prerequisites/installation/quickstart. Accepted the deletion; the dict-form num_repeats documentation continues to live in fern/versions/latest/pages/reference/cli-commands.mdx (which auto-merged cleanly).

All 33 tests in tests/unit_tests/test_rollout_collection.py pass post-merge (20 mine + 13 from upstream's new RolloutAggregationHelper).

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>

gwarmstrong · 2026-05-20T20:29:59Z

> can we add to docs or at least open an issue to document it?

just updated docs

gwarmstrong requested review from bxyu-nvidia and cmunley1 May 18, 2026 21:04

cmunley1 reviewed May 20, 2026

cmunley1 requested review from adil-a and ananthsub May 20, 2026 04:59

gwarmstrong added 3 commits May 20, 2026 10:00

docs: trim prescriptive phrasing from per-agent rollouts section

ea4d61e

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>

gwarmstrong wants to merge 5 commits into NVIDIA-NeMo:main from gwarmstrong:georgea/gym-agent-specific-repeats

2 participants
