Support agent-specific num_repeats in ng_collect_rollouts #1356

Open: gwarmstrong wants to merge 5 commits into NVIDIA-NeMo:main from gwarmstrong:georgea/gym-agent-specific-repeats

@gwarmstrong
Contributor

Agent-specific num_repeats for ng_collect_rollouts

Motivation

Currently, ng_collect_rollouts applies one global num_repeats to every row,
even when the input JSONL mixes multiple agents (different agent_ref.name values).
That makes a job such as "run simple_agent for pass@32 alongside swe_agent for
pass@1" awkward: you either run separate jobs, or downsample and recompute metrics after the fact.

Change

num_repeats on RolloutCollectionConfig is now Union[int, Dict[str, int]]
(default 1):

  • int form — unchanged behavior; applies to every row.
  • dict form — keys are agent_ref.name. The special key _default is the
    fallback for agents not explicitly listed. Without _default, any row whose
    agent isn't a key in the dict raises a single consolidated error listing
    every unlisted agent. Dict keys that never appear in any input row emit
    a UserWarning.
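
The lookup rules above can be sketched in plain Python. This is a minimal illustration, not the code in this PR: the function name `resolve_num_repeats`, the `DEFAULT_KEY` constant, and the error and warning wording are all assumptions made for the example.

```python
import warnings
from typing import Dict, Iterable, Union

NumRepeats = Union[int, Dict[str, int]]
DEFAULT_KEY = "_default"  # special fallback key described above


def resolve_num_repeats(num_repeats: NumRepeats, agent_names: Iterable[str]) -> Dict[str, int]:
    """Map each agent name seen in the input to its repeat count (hypothetical helper)."""
    seen = list(dict.fromkeys(agent_names))  # unique names, input order kept
    if isinstance(num_repeats, int):
        # int form: the same count applies to every agent.
        return {name: num_repeats for name in seen}

    fallback = num_repeats.get(DEFAULT_KEY)
    resolved: Dict[str, int] = {}
    missing = []
    for name in seen:
        if name in num_repeats:
            resolved[name] = num_repeats[name]
        elif fallback is not None:
            resolved[name] = fallback
        else:
            missing.append(name)  # record and keep scanning
    if missing:
        # One consolidated error that lists every unlisted agent.
        raise ValueError(f"num_repeats has no entry for agents: {sorted(missing)}")

    unused = sorted(k for k in num_repeats if k != DEFAULT_KEY and k not in seen)
    if unused:
        # Keys that match no input row are most likely typos.
        warnings.warn(f"num_repeats keys never seen in input: {unused}", UserWarning)
    return resolved


print(resolve_num_repeats({"simple_agent": 32, "_default": 1}, ["simple_agent", "swe_agent"]))
# {'simple_agent': 32, 'swe_agent': 1}
```

Collecting the unlisted agents before raising means one run reports every gap in the dict, rather than one agent per attempt.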

How to verify

Run end-to-end against integrate.api.nvidia.com. From Gym/, after
uv sync --extra dev and export NVIDIA_API_KEY=...:

# 1. Point Gym at integrate.api.nvidia.com.
cat > env.yaml <<EOF
policy_base_url: https://integrate.api.nvidia.com/v1
policy_api_key: ${NVIDIA_API_KEY:?export NVIDIA_API_KEY=nvapi-...}
policy_model_name: nvidia/nvidia-nemotron-nano-9b-v2
EOF

# 2. Wire two agent server instances backed by the shipped simple_agent.
#    Distinct instance names are what lets dict-form num_repeats target them
#    independently.
cat > /tmp/two_agents_demo.yaml <<'EOF'
agent_alpha:
  responses_api_agents:
    simple_agent:
      entrypoint: app.py
      resources_server: {type: resources_servers, name: example_single_tool_call}
      model_server:    {type: responses_api_models, name: policy_model}
agent_beta:
  responses_api_agents:
    simple_agent:
      entrypoint: app.py
      resources_server: {type: resources_servers, name: example_single_tool_call}
      model_server:    {type: responses_api_models, name: policy_model}
EOF

# 3. Generate a 2-row input — one row pinned to each agent instance.
python3 - > /tmp/mixed_input.jsonl <<'PY'
import json
prompt = [
    {"role": "developer", "content": "You are a helpful assistant."},
    {"role": "user", "content": "what is 2+2?"},
]
for name in ("agent_alpha", "agent_beta"):
    print(json.dumps({
        "responses_create_params": {"input": prompt, "tools": []},
        "agent_ref": {"name": name},
    }))
PY

# 4. Start ng_run in this terminal and leave it running until all servers
#    log "ready" (head server + 4 children).
ng_run \
  "+config_paths=[resources_servers/example_single_tool_call/configs/example_single_tool_call.yaml,responses_api_models/vllm_model/configs/vllm_model.yaml,/tmp/two_agents_demo.yaml]"

# 5. In a separate terminal (same Gym/ cwd), run rollouts with per-agent num_repeats.
ng_collect_rollouts \
  +input_jsonl_fpath=/tmp/mixed_input.jsonl \
  +output_jsonl_fpath=/tmp/mixed_rollouts.jsonl \
  +num_samples_in_parallel=4 \
  +upload_rollouts_to_wandb=false \
  '+responses_create_params={max_output_tokens: 64, temperature: 0.0}' \
  '+num_repeats={agent_alpha: 4, agent_beta: 1}'
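
One way to check the result of step 5 is to count output rows per agent. The sketch below assumes each row of the output JSONL still carries its `agent_ref` object, which is not confirmed by this PR description. The helper name `count_rollouts_per_agent` is hypothetical. If the assumption holds, the counts would be expected to come out as 4 for agent_alpha and 1 for agent_beta.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Iterable


def count_rollouts_per_agent(lines: Iterable[str]) -> Counter:
    """Count JSONL rows per agent_ref.name (assumes rows keep agent_ref)."""
    counts: Counter = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        # Rows without agent_ref are counted under the key None.
        counts[(row.get("agent_ref") or {}).get("name")] += 1
    return counts


# Output path taken from step 5 above; skipped if the file does not exist.
output_path = Path("/tmp/mixed_rollouts.jsonl")
if output_path.exists():
    with output_path.open() as handle:
        print(count_rollouts_per_agent(handle))
```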

`num_repeats` on `RolloutCollectionConfig` is now `Union[int, Dict[str, int]]`
(default `1`):

- **int form** — unchanged behavior; applies to every row.
- **dict form** — keys are `agent_ref.name`. The special key `_default` is
  the fallback for agents not explicitly listed. Without `_default`, any
  row whose agent isn't a key in the dict raises a single consolidated
  error listing every unlisted agent. Dict keys that never appear in any
  input row emit a `UserWarning` (catches typos).

Validation surfaces are batched: the Pydantic validator collects every
sub-1 value into one error, and the preprocess-time missing-agent check
accumulates all offenders across the input before raising.

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
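
The batched value check described in this commit message could look roughly like the following. It is a stdlib-only sketch of the kind of function a Pydantic field validator might delegate to. The name `check_num_repeats` and the message wording are assumptions, not the validator in this PR.

```python
from typing import Dict, Union


def check_num_repeats(value: Union[int, Dict[str, int]]) -> Union[int, Dict[str, int]]:
    """Reject repeat counts below 1, reporting all bad entries together."""
    if isinstance(value, int):
        if value < 1:
            raise ValueError(f"num_repeats must be >= 1, got {value}")
        return value
    # Dict form: gather every offending entry before raising once.
    bad = {key: count for key, count in value.items() if count < 1}
    if bad:
        raise ValueError(f"num_repeats entries must be >= 1, got {bad}")
    return value
```

With `{"agent_alpha": 0, "agent_beta": -1, "_default": 1}` this raises a single error that names both agent_alpha and agent_beta.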
@copy-pr-bot
copy-pr-bot Bot commented May 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

The previous version had an `elif agent_name is None: row_num_repeats = 0`
branch inside the new num_repeats dispatch, which only ever fired in the
dict-no-default + missing-agent-ref subcase. In the int and dict-with-default
subcases the row would still expand wastefully before the post-loop raise.

Hoisting the missing-agent-ref check above the dispatch and `continue`-ing on
miss makes `agent_name` non-None for the rest of the body, eliminates the
special branch, and gives consistent (no-expansion) behavior across all
forms when agent_ref is absent. The final user-visible error is unchanged.

Mirrors the same record-then-continue pattern used for
`agents_missing_from_num_repeats`.

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
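
The record-then-continue pattern from this commit message can be sketched as below. Only `agents_missing_from_num_repeats` is a name taken from the message. `expand_rows`, `rows_missing_agent_ref`, and the error wording are assumptions for illustration, and the real preprocess step may report the two problems differently.

```python
from typing import Any, Dict, List, Union


def expand_rows(rows: List[Dict[str, Any]], num_repeats: Union[int, Dict[str, int]]) -> List[Dict[str, Any]]:
    """Repeat each row per its agent's count, deferring all errors to the end."""
    rows_missing_agent_ref: List[int] = []          # hypothetical name
    agents_missing_from_num_repeats: List[str] = []
    expanded: List[Dict[str, Any]] = []
    for index, row in enumerate(rows):
        agent_name = (row.get("agent_ref") or {}).get("name")
        if agent_name is None:
            # Hoisted check: record and skip, so the row is never expanded
            # regardless of which num_repeats form is in use.
            rows_missing_agent_ref.append(index)
            continue
        # From here on agent_name is known to be a string.
        if isinstance(num_repeats, int):
            count = num_repeats
        elif agent_name in num_repeats:
            count = num_repeats[agent_name]
        elif "_default" in num_repeats:
            count = num_repeats["_default"]
        else:
            if agent_name not in agents_missing_from_num_repeats:
                agents_missing_from_num_repeats.append(agent_name)
            continue
        expanded.extend(dict(row) for _ in range(count))
    # Post-loop raises: each error lists every offender found in the input.
    if rows_missing_agent_ref:
        raise ValueError(f"rows missing agent_ref: {rows_missing_agent_ref}")
    if agents_missing_from_num_repeats:
        raise ValueError(f"agents missing from num_repeats: {agents_missing_from_num_repeats}")
    return expanded
```

Because the missing-agent-ref check runs before the dispatch, no branch of the dispatch needs a special case for `None`.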
gwarmstrong added a commit to gwarmstrong/Gym that referenced this pull request May 18, 2026
Adds the Artificial Analysis Intelligence Index as a Gym benchmark group
(7 of 8 subs; scicode skipped per scope).

- benchmarks/aai/config.yaml chains 7 sub configs and overrides aime25 +
  livecodebench prompts to match Skills' eval/aai/* renderings.
- benchmarks/aai/prompts/{math,livecodebench}.yaml — character match with
  Skills' eval/aai/* user prompts (mmlu-pro/gpqa defaults already match).
- benchmarks/aai/merge.py — combines the 7 prepared JSONLs into one
  rollout-ready file with per-row prompt baking + agent_ref tagging.
- benchmarks/aai/score.py — post-hoc composite (overall_score, math_score,
  code_score) from per-agent aggregate metrics.
- benchmarks/livecodebench/v5_2407_2412/ — new split matching Skills'
  test_v5_2407_2412 (Jul24-Dec24) used by AAI's livecodebench sub.

Depends on PR NVIDIA-NeMo#1356 (cherry-picked) for agent-specific num_repeats.

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
@cmunley1
Contributor

can we add to docs or at least open an issue to document it?

"How many times to repeat each example. Either an int (applied to every row) or a "
"dict keyed by agent_ref.name (e.g. {simple_agent: 32, swe_agent: 1}). In dict form, "
"every agent that appears in the input rows must have an entry, unless a special "
'"_default" key is provided as a fallback. Useful for mean@k.'
Contributor

I wonder if a separate field like num_repeats_default: Optional[int] would be cleaner than _default?

Contributor Author

You can ultimately take it wherever you want. The arguments in favor of a single field, as currently implemented, are:
(1) there is a single field to update when you want to modify the behavior
(2) you don't need extra handling for the case where, e.g., num_repeats is an int and num_repeats_default is also an int. Which one do you use in that case? How does a user reason about how to set it? The _default key is a bit less ambiguous.

Contributor

@cmunley1 cmunley1 left a comment

looks good!

@cmunley1 cmunley1 requested review from adil-a and ananthsub May 20, 2026 04:59
Updates two fern/versions/latest/ pages:

- reference/cli-commands.mdx: bump the num_repeats row type from
  Optional[int] to "int or Dict[str, int]" and describe the dict
  form, the _default fallback, the consolidated-error semantics,
  plus a per-agent example CLI block alongside the existing one.
- get-started/rollout-collection.mdx: same update on the tutorial's
  CLI override table, plus a new "Per-agent rollouts" section
  between View Rollouts and Rollout Generation Parameters.

Dict literals in prose are wrapped in backticks so MDX doesn't parse
them as JSX expressions. No new pages, no nav changes.

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Resolves:
- nemo_gym/rollout_collection.py: kept both `import os` (upstream, for
  NEMO_GYM_MAX_ROLLOUT_ATTEMPTS env var) and `import warnings` (ours,
  for the dict-form unused-agent UserWarning). No semantic conflict on
  the num_repeats / preprocess logic itself — auto-merge slotted the
  new field next to upstream's additions cleanly.
- fern/versions/latest/pages/get-started/rollout-collection.mdx:
  upstream (NVIDIA-NeMo#1283) deleted this page when refactoring get-started/ into
  prerequisites/installation/quickstart. Accepted the deletion; the
  dict-form num_repeats documentation continues to live in
  fern/versions/latest/pages/reference/cli-commands.mdx (which auto-
  merged cleanly).

All 33 tests in tests/unit_tests/test_rollout_collection.py pass
post-merge (20 mine + 13 from upstream's new RolloutAggregationHelper).

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
@gwarmstrong
Contributor Author

> can we add to docs or at least open an issue to document it?

just updated docs
