From 429067be1a0415d10f062a3d29b627f1d20316ed Mon Sep 17 00:00:00 2001
From: Chris Wing
Date: Wed, 13 May 2026 12:42:02 -0700
Subject: [PATCH 1/2] Add GitHub issue template for environment integrations

Provides a structured template for requesting integration of existing
environments/benchmarks (e.g. OSWorld, EnterpriseOps Gym). Auto-applies
the env-integration label.

Signed-off-by: Chris Wing
---
 .../ISSUE_TEMPLATE/environment-integration.md | 65 +++++++++++++++++++
 1 file changed, 65 insertions(+)
 create mode 100644 .github/ISSUE_TEMPLATE/environment-integration.md

diff --git a/.github/ISSUE_TEMPLATE/environment-integration.md b/.github/ISSUE_TEMPLATE/environment-integration.md
new file mode 100644
index 000000000..9bcd03756
--- /dev/null
+++ b/.github/ISSUE_TEMPLATE/environment-integration.md
@@ -0,0 +1,65 @@
+---
+name: Environment Integration
+about: Propose integrating an existing environment or benchmark (e.g. OSWorld, EnterpriseOps Gym)
+title: '[Environment] '
+labels: 'env-integration'
+assignees: ''
+
+---
+
+**Environment Overview**
+
+- Name:
+- Source repo:
+- Paper/reference (if applicable):
+- License:
+- Brief description: What does this environment evaluate? (e.g. web navigation, code generation, tool use)
+
+**How does the agent interact with the environment?**
+
+Describe what a typical task looks like from the agent's perspective. For example:
+- Does the agent receive a natural language prompt and return an answer?
+- Does the model use tools (function calling, code execution, web browsing)?
+- Is it single-turn or multi-turn (does the model get feedback and retry)?
+
+**What does success look like?**
+
+Describe the reward signal — what constitutes a successful completion? Is it binary pass/fail, a score, or multiple metrics? How is correctness determined (exact match, test cases, judge model, human eval)?
+
+**External Dependencies**
+
+Does this environment require external tools, runtimes, or sandboxes (e.g. compilers, browsers, Docker, VMs)?
+If so, list them and note whether they can be auto-installed on server startup.
+
+**Data**
+
+- Dataset source (e.g. HuggingFace, custom):
+- Approximate size (number of tasks):
+- Splits available (train/validation/test):
+
+**Known Results**
+
+Are there published or known results to use as a reference? Link to leaderboards, papers, or repos with reported numbers.
+
+**Constraints & Requirements**
+
+Note anything an engineer should know about running this environment:
+- Does it need specific hardware (GPUs, large memory)?
+- Does it require network access, Docker, or a VM?
+- Are there known limitations on parallelism or throughput?
+- Any OS or platform restrictions?
+
+**Definition of Done**
+
+- [ ] Environment can be launched with `ng_run`
+- [ ] Rollouts can be collected end-to-end with `ng_collect_rollouts`
+- [ ] Reward scores reproduce known/expected results
+- [ ] Example data committed for smoke testing
+- [ ] Train/validation datasets uploaded to dataset registry
+- [ ] Tests passing
+- [ ] Documentation in environment README
+- [ ] Benchmark config defined if environment is benchmark
+
+**Additional Context**
+
+Add any other context, links, or screenshots here.

From 6e3640d3578a846b96538c142694c7511551734a Mon Sep 17 00:00:00 2001
From: Chris Wing
Date: Wed, 13 May 2026 14:23:31 -0700
Subject: [PATCH 2/2] refine environment integration issue template

Use h3 headers for better visual hierarchy and GitHub outline support.
Clarify benchmark DoD item, add implementation request section, and
update description text.

Signed-off-by: Chris Wing
---
 .../ISSUE_TEMPLATE/environment-integration.md | 28 +++++++++++--------
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/.github/ISSUE_TEMPLATE/environment-integration.md b/.github/ISSUE_TEMPLATE/environment-integration.md
index 9bcd03756..4927221e4 100644
--- a/.github/ISSUE_TEMPLATE/environment-integration.md
+++ b/.github/ISSUE_TEMPLATE/environment-integration.md
@@ -1,13 +1,13 @@
 ---
 name: Environment Integration
-about: Propose integrating an existing environment or benchmark (e.g. OSWorld, EnterpriseOps Gym)
+about: Propose integrating an existing environment or benchmark into NeMo Gym
 title: '[Environment] '
 labels: 'env-integration'
 assignees: ''
 
 ---
 
-**Environment Overview**
+### Environment Overview
 
 - Name:
 - Source repo:
@@ -15,33 +15,33 @@ assignees: ''
 - License:
 - Brief description: What does this environment evaluate? (e.g. web navigation, code generation, tool use)
 
-**How does the agent interact with the environment?**
+### How does the agent interact with the environment?
 
 Describe what a typical task looks like from the agent's perspective. For example:
 - Does the agent receive a natural language prompt and return an answer?
 - Does the model use tools (function calling, code execution, web browsing)?
 - Is it single-turn or multi-turn (does the model get feedback and retry)?
 
-**What does success look like?**
+### Verifier Shape
 
 Describe the reward signal — what constitutes a successful completion? Is it binary pass/fail, a score, or multiple metrics? How is correctness determined (exact match, test cases, judge model, human eval)?
 
-**External Dependencies**
+### External Dependencies
 
-Does this environment require external tools, runtimes, or sandboxes (e.g. compilers, browsers, Docker, VMs)?
+Does this environment require external tools, specific runtimes, or sandboxes (e.g. compilers, browsers, Docker, VMs)?
 If so, list them and note whether they can be auto-installed on server startup.
 
-**Data**
+### Data
 
 - Dataset source (e.g. HuggingFace, custom):
 - Approximate size (number of tasks):
 - Splits available (train/validation/test):
 
-**Known Results**
+### Known Results
 
 Are there published or known results to use as a reference? Link to leaderboards, papers, or repos with reported numbers.
 
-**Constraints & Requirements**
+### Constraints & Requirements
 
 Note anything an engineer should know about running this environment:
 - Does it need specific hardware (GPUs, large memory)?
@@ -49,7 +49,11 @@ Note anything an engineer should know about running this environment:
 - Are there known limitations on parallelism or throughput?
 - Any OS or platform restrictions?
 
-**Definition of Done**
+### Implementation Request
+- [ ] I plan to implement this myself
+- [ ] I'm requesting help to implement this
+
+### Definition of Done
 
 - [ ] Environment can be launched with `ng_run`
 - [ ] Rollouts can be collected end-to-end with `ng_collect_rollouts`
@@ -58,8 +62,8 @@ Note anything an engineer should know about running this environment:
 - [ ] Train/validation datasets uploaded to dataset registry
 - [ ] Tests passing
 - [ ] Documentation in environment README
-- [ ] Benchmark config defined if environment is benchmark
+- [ ] Benchmark config defined if applicable (e.g. pinned agent harness, dataset subset, num_repeats)
 
-**Additional Context**
+### Additional Context
 
 Add any other context, links, or screenshots here.