From 564c97cc36c03b59ba2c077626f82e4ee2a467f9 Mon Sep 17 00:00:00 2001
From: openhands
Date: Wed, 6 May 2026 10:40:24 +0000
Subject: [PATCH 1/2] Add programbench to run-eval benchmark choices
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Wires ProgramBench into the SDK's run-eval.yml entrypoint so it can be
dispatched from the SDK repo on a per-PR / per-release basis like the
other benchmarks (swebench, gaia, swtbench, commit0, swebenchmultimodal,
terminalbench).

The choice is forwarded to OpenHands/evaluation eval-job.yml's BENCHMARK
input verbatim — that repo's PR (#544) adds the matching dispatcher
(build-programbench job, run_programbench.sh phase runner, helm values
for k8s with DIND_ENABLED=true).

Also extend the benchmark list in the manage-evals skill / run-eval
skill docs so the agent knows programbench is dispatchable.

No code changes — choice-list / docs only.

Co-authored-by: openhands
---
 .agents/skills/manage-evals/SKILL.md                          | 1 +
 .agents/skills/manage-evals/references/eval-infrastructure.md | 2 +-
 .agents/skills/run-eval.md                                    | 2 +-
 .github/workflows/run-eval.yml                                | 1 +
 4 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/.agents/skills/manage-evals/SKILL.md b/.agents/skills/manage-evals/SKILL.md
index df6e1f45a7..a54889b599 100644
--- a/.agents/skills/manage-evals/SKILL.md
+++ b/.agents/skills/manage-evals/SKILL.md
@@ -69,6 +69,7 @@ done
 | `commit0` | Commit0 — commit generation tasks |
 | `swebenchmultimodal` | SWE-bench Multimodal — tasks with images |
 | `terminalbench` | TerminalBench — terminal interaction tasks |
+| `programbench` | ProgramBench — program-repair tasks against gold-standard test binaries |
 
 ### Trigger Options
 
diff --git a/.agents/skills/manage-evals/references/eval-infrastructure.md b/.agents/skills/manage-evals/references/eval-infrastructure.md
index 4fe90b4ca5..f7f5dbadb9 100644
--- a/.agents/skills/manage-evals/references/eval-infrastructure.md
+++ b/.agents/skills/manage-evals/references/eval-infrastructure.md
@@ -42,7 +42,7 @@ and served via CDN at `https://results.eval.all-hands.dev/`.
 {benchmark}/{model_slug}/{github_run_id}/
 ```
 
-- **benchmark**: `swebench`, `gaia`, `swtbench`, `commit0`, `swebenchmultimodal`, `terminalbench`
+- **benchmark**: `swebench`, `gaia`, `swtbench`, `commit0`, `swebenchmultimodal`, `terminalbench`, `programbench`
 - **model_slug**: Model name with `/:@.` replaced by `-`
   - Example: `litellm_proxy/claude-sonnet-4-5-20250929` → `litellm_proxy-claude-sonnet-4-5-20250929`
 - **github_run_id**: The GitHub Actions run ID from the `OpenHands/evaluation` repo
diff --git a/.agents/skills/run-eval.md b/.agents/skills/run-eval.md
index ef4e340963..3e5a35b550 100644
--- a/.agents/skills/run-eval.md
+++ b/.agents/skills/run-eval.md
@@ -31,7 +31,7 @@ curl -X POST \
 ```
 
 **Key parameters:**
-- `benchmark`: `swebench`, `swebenchmultimodal`, `gaia`, `swtbench`, `commit0`, `multiswebench`, `terminalbench`
+- `benchmark`: `swebench`, `swebenchmultimodal`, `gaia`, `swtbench`, `commit0`, `multiswebench`, `terminalbench`, `programbench`
 - `eval_limit`: Any positive integer (e.g., `1`, `10`, `50`, `200`)
 - `model_ids`: See `.github/run-eval/resolve_model_config.py` for available models
 - `benchmarks_branch`: Use feature branch from the benchmarks repo to test benchmark changes before merging
diff --git a/.github/workflows/run-eval.yml b/.github/workflows/run-eval.yml
index 3610850d3b..efcbba0e1c 100644
--- a/.github/workflows/run-eval.yml
+++ b/.github/workflows/run-eval.yml
@@ -21,6 +21,7 @@ on:
           - commit0
           - swebenchmultimodal
           - terminalbench
+          - programbench
       sdk_ref:
         description: SDK commit/ref to evaluate (must be a semantic version like v1.0.0 unless 'Allow unreleased branches' is checked)
         required: true

From 904861d34de353db2446ddd65b9c8d1ab8e78653 Mon Sep 17 00:00:00 2001
From: openhands
Date: Sun, 10 May 2026 16:09:27 +0000
Subject: [PATCH 2/2] docs: complete ProgramBench eval tooling references

Co-authored-by: openhands
---
 .agents/skills/manage-evals/SKILL.md                | 2 +-
 .agents/skills/manage-evals/scripts/manage_evals.py | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/.agents/skills/manage-evals/SKILL.md b/.agents/skills/manage-evals/SKILL.md
index a54889b599..54e05df761 100644
--- a/.agents/skills/manage-evals/SKILL.md
+++ b/.agents/skills/manage-evals/SKILL.md
@@ -130,7 +130,7 @@ Each line is a run path. Match by benchmark and model to find the run.
 ### Step 2: Identify the Run Path Components
 
 A run path has three components:
-- **benchmark**: `swebench`, `gaia`, `swtbench`, `commit0`, `swebenchmultimodal`, `terminalbench`
+- **benchmark**: `swebench`, `gaia`, `swtbench`, `commit0`, `swebenchmultimodal`, `terminalbench`, `programbench`
 - **model_slug**: Derived from model name with `/:@.` replaced by `-` (e.g., `litellm_proxy-claude-sonnet-4-5-20250929`)
 - **run_id**: The GitHub Actions workflow run ID from the `OpenHands/evaluation` repo
 
diff --git a/.agents/skills/manage-evals/scripts/manage_evals.py b/.agents/skills/manage-evals/scripts/manage_evals.py
index 5f53439db4..14c3cda10e 100755
--- a/.agents/skills/manage-evals/scripts/manage_evals.py
+++ b/.agents/skills/manage-evals/scripts/manage_evals.py
@@ -42,6 +42,7 @@
     "commit0",
     "swebenchmultimodal",
     "terminalbench",
+    "programbench",
 ]
 TOOL_PRESETS = ["default", "gemini", "gpt5", "planning"]
 AGENT_TYPES = ["default", "acp-claude", "acp-codex"]
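
Reviewer note: the docs touched by this series describe a results path of the form `{benchmark}/{model_slug}/{github_run_id}/`, where the slug is the model name with each of `/:@.` replaced by `-`. The sketch below illustrates that rule for a `programbench` run. It is not taken from `manage_evals.py`; the function names `model_slug` and `run_path` and the run ID `123456789` are hypothetical, chosen only for illustration.

```python
# Hypothetical helpers illustrating the documented run-path convention.
# They are assumptions for this note, not code from the repository.

# Map each of the four characters / : @ . to a hyphen.
_SLUG_TABLE = str.maketrans("/:@.", "----")


def model_slug(model_name: str) -> str:
    """Return the model name with '/', ':', '@' and '.' replaced by '-'."""
    return model_name.translate(_SLUG_TABLE)


def run_path(benchmark: str, model_name: str, github_run_id: int) -> str:
    """Build '{benchmark}/{model_slug}/{github_run_id}/' as described in the docs."""
    return f"{benchmark}/{model_slug(model_name)}/{github_run_id}/"


if __name__ == "__main__":
    # 123456789 is a made-up run ID used only for this example.
    print(run_path("programbench", "litellm_proxy/claude-sonnet-4-5-20250929", 123456789))
```

With the example model from the docs, the slug comes out as `litellm_proxy-claude-sonnet-4-5-20250929`, matching the value shown in `eval-infrastructure.md`.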