Support agent-specific num_repeats in ng_collect_rollouts #1356
Open
gwarmstrong wants to merge 5 commits into
`num_repeats` on `RolloutCollectionConfig` is now `Union[int, Dict[str, int]]` (default `1`):

- **int form**: unchanged behavior; applies to every row.
- **dict form**: keys are `agent_ref.name`. The special key `_default` is the fallback for agents not explicitly listed. Without `_default`, any row whose agent isn't a key in the dict raises a single consolidated error listing every unlisted agent. Dict keys that never appear in any input row emit a `UserWarning` (catches typos).

Validation surfaces are batched: the Pydantic validator collects every sub-1 value into one error, and the preprocess-time missing-agent check accumulates all offenders across the input before raising.

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
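The two forms and the batched sub-1 check described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the PR does this inside a Pydantic validator, while the sketch uses plain functions, and the names `validate_num_repeats` and `lookup_num_repeats` as well as the error wording are assumptions made for the example.

```python
from typing import Dict, Optional, Union

DEFAULT_KEY = "_default"  # special fallback key named in the commit message

NumRepeats = Union[int, Dict[str, int]]


def validate_num_repeats(value: NumRepeats) -> NumRepeats:
    """Hypothetical stand-in for the validator: report every sub-1 value at once."""
    # Treat the int form as a one-entry mapping so both forms share one check.
    entries = {"<int form>": value} if isinstance(value, int) else value
    bad = {key: n for key, n in entries.items() if n < 1}
    if bad:
        details = ", ".join(f"{key}={n}" for key, n in sorted(bad.items()))
        raise ValueError(f"num_repeats values must be >= 1, got: {details}")
    return value


def lookup_num_repeats(num_repeats: NumRepeats, agent_name: str) -> Optional[int]:
    """Return the repeat count for one agent, or None when the dict has no entry."""
    if isinstance(num_repeats, int):
        return num_repeats  # int form: the same count for every row
    if agent_name in num_repeats:
        return num_repeats[agent_name]  # explicit per-agent entry wins
    return num_repeats.get(DEFAULT_KEY)  # fallback, or None if no "_default" key
```

With `{"simple_agent": 32, "_default": 1}`, a `simple_agent` row would repeat 32 times and any other agent once; a `None` result is the case the preprocess step would record as a missing agent.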
The previous version had an `elif agent_name is None: row_num_repeats = 0` branch inside the new num_repeats dispatch, which only ever fired in the dict-no-default + missing-agent-ref subcase. In the int and dict-with-default subcases the row would still expand wastefully before the post-loop raise.

Hoisting the missing-agent-ref check above the dispatch and `continue`-ing on miss makes `agent_name` non-None for the rest of the body, eliminates the special branch, and gives consistent (no-expansion) behavior across all forms when agent_ref is absent. The final user-visible error is unchanged. Mirrors the same record-then-continue pattern used for `agents_missing_from_num_repeats`.

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
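The record-then-continue pattern described in this commit might look roughly like the sketch below. It is an illustration under stated assumptions rather than the PR's code: rows are assumed to be plain dicts with an optional `agent_ref: {"name": ...}` entry, and the function name `expand_rows`, the variable `rows_missing_agent_ref`, and the error and warning messages are all hypothetical. Only `agents_missing_from_num_repeats` and `_default` are names taken from the commit messages.

```python
import warnings
from typing import Any, Dict, List, Union

DEFAULT_KEY = "_default"


def expand_rows(
    rows: List[Dict[str, Any]], num_repeats: Union[int, Dict[str, int]]
) -> List[Dict[str, Any]]:
    expanded: List[Dict[str, Any]] = []
    rows_missing_agent_ref: List[int] = []  # hypothetical name: offending row indices
    agents_missing_from_num_repeats = set()
    seen_agents = set()

    for index, row in enumerate(rows):
        agent_name = (row.get("agent_ref") or {}).get("name")
        # Hoisted check: record the offender and skip, so the row is never
        # expanded and agent_name is non-None for the rest of the loop body.
        if agent_name is None:
            rows_missing_agent_ref.append(index)
            continue
        seen_agents.add(agent_name)

        if isinstance(num_repeats, int):
            row_num_repeats = num_repeats
        elif agent_name in num_repeats:
            row_num_repeats = num_repeats[agent_name]
        elif DEFAULT_KEY in num_repeats:
            row_num_repeats = num_repeats[DEFAULT_KEY]
        else:
            # Same pattern: accumulate every unlisted agent, raise once later.
            agents_missing_from_num_repeats.add(agent_name)
            continue

        expanded.extend(dict(row) for _ in range(row_num_repeats))

    if rows_missing_agent_ref:
        raise ValueError(f"rows missing agent_ref: {rows_missing_agent_ref}")
    if agents_missing_from_num_repeats:
        missing = sorted(agents_missing_from_num_repeats)
        raise ValueError(f"agents without a num_repeats entry: {missing}")
    if isinstance(num_repeats, dict):
        unused = sorted(set(num_repeats) - seen_agents - {DEFAULT_KEY})
        if unused:
            warnings.warn(f"num_repeats keys never seen in input: {unused}", UserWarning)
    return expanded
```

Because both checks only record and `continue`, a single pass over the input collects every offender before anything is raised, which is what lets the error list all unlisted agents at once.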
gwarmstrong added a commit to gwarmstrong/Gym that referenced this pull request on May 18, 2026
Adds the Artificial Analysis Intelligence Index as a Gym benchmark group
(7 of 8 subs; scicode skipped per scope).
- benchmarks/aai/config.yaml chains 7 sub configs and overrides aime25 +
livecodebench prompts to match Skills' eval/aai/* renderings.
- benchmarks/aai/prompts/{math,livecodebench}.yaml — character match with
Skills' eval/aai/* user prompts (mmlu-pro/gpqa defaults already match).
- benchmarks/aai/merge.py — combines the 7 prepared JSONLs into one
rollout-ready file with per-row prompt baking + agent_ref tagging.
- benchmarks/aai/score.py — post-hoc composite (overall_score, math_score,
code_score) from per-agent aggregate metrics.
- benchmarks/livecodebench/v5_2407_2412/ — new split matching Skills'
test_v5_2407_2412 (Jul24-Dec24) used by AAI's livecodebench sub.
Depends on PR NVIDIA-NeMo#1356 (cherry-picked) for agent-specific num_repeats.
Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Contributor
Can we add this to the docs, or at least open an issue to document it?
cmunley1 reviewed on May 20, 2026
"How many times to repeat each example. Either an int (applied to every row) or a "
"dict keyed by agent_ref.name (e.g. {simple_agent: 32, swe_agent: 1}). In dict form, "
"every agent that appears in the input rows must have an entry, unless a special "
'"_default" key is provided as a fallback. Useful for mean@k.'
Contributor
I wonder if a separate field like num_repeats_default: Optional[int] would be cleaner than _default?
Contributor
Author
Can ultimately take it wherever you want. The arguments in favor of a single field, as currently implemented, are:
(1) There is a single field to update when you want to modify the behavior.
(2) You don't need extra handling for the case where, e.g., num_repeats is an int and num_repeats_default is also an int. Which do you use in that case? How does a user reason about how to set it? The _default key is a bit less ambiguous.
Updates two fern/versions/latest/ pages:

- reference/cli-commands.mdx: bump the num_repeats row type from Optional[int] to "int or Dict[str, int]" and describe the dict form, the _default fallback, the consolidated-error semantics, plus a per-agent example CLI block alongside the existing one.
- get-started/rollout-collection.mdx: same update on the tutorial's CLI override table, plus a new "Per-agent rollouts" section between View Rollouts and Rollout Generation Parameters.

Dict literals in prose are wrapped in backticks so MDX doesn't parse them as JSX expressions. No new pages, no nav changes.

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Resolves:

- nemo_gym/rollout_collection.py: kept both `import os` (upstream, for the NEMO_GYM_MAX_ROLLOUT_ATTEMPTS env var) and `import warnings` (ours, for the dict-form unused-agent UserWarning). No semantic conflict on the num_repeats / preprocess logic itself; auto-merge slotted the new field next to upstream's additions cleanly.
- fern/versions/latest/pages/get-started/rollout-collection.mdx: upstream (NVIDIA-NeMo#1283) deleted this page when refactoring get-started/ into prerequisites/installation/quickstart. Accepted the deletion; the dict-form num_repeats documentation continues to live in fern/versions/latest/pages/reference/cli-commands.mdx (which auto-merged cleanly).

All 33 tests in tests/unit_tests/test_rollout_collection.py pass post-merge (20 mine + 13 from upstream's new RolloutAggregationHelper).

Signed-off-by: gwarmstrong <gwarmstrong@users.noreply.github.com>
Contributor
Author
just updated docs
Agent-specific `num_repeats` for `ng_collect_rollouts`

Motivation

Currently `ng_collect_rollouts` applies one global `num_repeats` to every row, even when the input JSONL mixes multiple agents (different `agent_ref.name`). That makes, e.g., "run `simple_agent` for pass@32 alongside `swe_agent` for pass@1 in one job" awkward. You either run separate jobs or downsample and recompute metrics after the fact.

Change

`num_repeats` on `RolloutCollectionConfig` is now `Union[int, Dict[str, int]]` (default `1`):

- **int form**: unchanged behavior; applies to every row.
- **dict form**: keys are `agent_ref.name`. The special key `_default` is the fallback for agents not explicitly listed. Without `_default`, any row whose agent isn't a key in the dict raises a single consolidated error listing every unlisted agent. Dict keys that never appear in any input row emit a `UserWarning`.

How to verify

Run end-to-end against `integrate.api.nvidia.com`. From `Gym/`, after `uv sync --extra dev` and `export NVIDIA_API_KEY=...`: