From 564c97cc36c03b59ba2c077626f82e4ee2a467f9 Mon Sep 17 00:00:00 2001
From: openhands <openhands@all-hands.dev>
Date: Wed, 6 May 2026 10:40:24 +0000
Subject: [PATCH 1/2] Add programbench to run-eval benchmark choices
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Wire ProgramBench into the SDK's run-eval.yml entrypoint so it can be
dispatched from the SDK repo on a per-PR / per-release basis like the
other benchmarks (swebench, gaia, swtbench, commit0, swebenchmultimodal,
terminalbench). The choice is forwarded verbatim to the BENCHMARK input
of eval-job.yml in OpenHands/evaluation. That repo's PR (#544) adds the
matching dispatcher: a build-programbench job, the run_programbench.sh
phase runner, and helm values for k8s with DIND_ENABLED=true.

Also extend the benchmark list in the manage-evals and run-eval skill
docs so the agent knows programbench is dispatchable.

No code changes: choice-list and docs only.
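
For reviewers who want to see what a dispatch with the new choice might
look like, here is a minimal sketch that only builds the request body for
a workflow_dispatch call; it does not send anything. The helper name
build_dispatch_payload, the default ref "main", and the assumption that
the choice list is exactly the seven names in this message are
illustrative, not taken from the SDK.

```python
import json

# Benchmark choices named in this commit message, plus the new entry.
# Assumption: this mirrors the choice list in run-eval.yml.
BENCHMARK_CHOICES = [
    "swebench",
    "gaia",
    "swtbench",
    "commit0",
    "swebenchmultimodal",
    "terminalbench",
    "programbench",
]


def build_dispatch_payload(benchmark, sdk_ref, ref="main", **extra_inputs):
    """Build a workflow_dispatch request body (hypothetical helper)."""
    if benchmark not in BENCHMARK_CHOICES:
        raise ValueError(f"unknown benchmark: {benchmark!r}")
    inputs = {"benchmark": benchmark, "sdk_ref": sdk_ref}
    # Stringify extra inputs such as eval_limit so the body is uniform.
    inputs.update({key: str(value) for key, value in extra_inputs.items()})
    return {"ref": ref, "inputs": inputs}


payload = build_dispatch_payload("programbench", "v1.0.0", eval_limit=1)
print(json.dumps(payload, indent=2))
```

An unknown name such as "programbench2" raises ValueError here, which
is the same guard the workflow's choice input provides in the UI.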

Co-authored-by: openhands <openhands@all-hands.dev>
---
 .agents/skills/manage-evals/SKILL.md                          | 1 +
 .agents/skills/manage-evals/references/eval-infrastructure.md | 2 +-
 .agents/skills/run-eval.md                                    | 2 +-
 .github/workflows/run-eval.yml                                | 1 +
 4 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/.agents/skills/manage-evals/SKILL.md b/.agents/skills/manage-evals/SKILL.md
index df6e1f45a7..a54889b599 100644
--- a/.agents/skills/manage-evals/SKILL.md
+++ b/.agents/skills/manage-evals/SKILL.md
@@ -69,6 +69,7 @@ done
 | `commit0` | Commit0 — commit generation tasks |
 | `swebenchmultimodal` | SWE-bench Multimodal — tasks with images |
 | `terminalbench` | TerminalBench — terminal interaction tasks |
+| `programbench` | ProgramBench — program-repair tasks against gold-standard test binaries |
 
 ### Trigger Options
 
diff --git a/.agents/skills/manage-evals/references/eval-infrastructure.md b/.agents/skills/manage-evals/references/eval-infrastructure.md
index 4fe90b4ca5..f7f5dbadb9 100644
--- a/.agents/skills/manage-evals/references/eval-infrastructure.md
+++ b/.agents/skills/manage-evals/references/eval-infrastructure.md
@@ -42,7 +42,7 @@ and served via CDN at `https://results.eval.all-hands.dev/`.
 {benchmark}/{model_slug}/{github_run_id}/
 ```
 
-- **benchmark**: `swebench`, `gaia`, `swtbench`, `commit0`, `swebenchmultimodal`, `terminalbench`
+- **benchmark**: `swebench`, `gaia`, `swtbench`, `commit0`, `swebenchmultimodal`, `terminalbench`, `programbench`
 - **model_slug**: Model name with `/:@.` replaced by `-`
   - Example: `litellm_proxy/claude-sonnet-4-5-20250929` → `litellm_proxy-claude-sonnet-4-5-20250929`
 - **github_run_id**: The GitHub Actions run ID from the `OpenHands/evaluation` repo
diff --git a/.agents/skills/run-eval.md b/.agents/skills/run-eval.md
index ef4e340963..3e5a35b550 100644
--- a/.agents/skills/run-eval.md
+++ b/.agents/skills/run-eval.md
@@ -31,7 +31,7 @@ curl -X POST \
 ```
 
 **Key parameters:**
-- `benchmark`: `swebench`, `swebenchmultimodal`, `gaia`, `swtbench`, `commit0`, `multiswebench`, `terminalbench`
+- `benchmark`: `swebench`, `swebenchmultimodal`, `gaia`, `swtbench`, `commit0`, `multiswebench`, `terminalbench`, `programbench`
 - `eval_limit`: Any positive integer (e.g., `1`, `10`, `50`, `200`)
 - `model_ids`: See `.github/run-eval/resolve_model_config.py` for available models
 - `benchmarks_branch`: Use feature branch from the benchmarks repo to test benchmark changes before merging
diff --git a/.github/workflows/run-eval.yml b/.github/workflows/run-eval.yml
index 3610850d3b..efcbba0e1c 100644
--- a/.github/workflows/run-eval.yml
+++ b/.github/workflows/run-eval.yml
@@ -21,6 +21,7 @@ on:
                     - commit0
                     - swebenchmultimodal
                     - terminalbench
+                    - programbench
             sdk_ref:
                 description: SDK commit/ref to evaluate (must be a semantic version like v1.0.0 unless 'Allow unreleased branches' is checked)
                 required: true

From 904861d34de353db2446ddd65b9c8d1ab8e78653 Mon Sep 17 00:00:00 2001
From: openhands <openhands@all-hands.dev>
Date: Sun, 10 May 2026 16:09:27 +0000
Subject: [PATCH 2/2] docs: complete ProgramBench eval tooling references

Co-authored-by: openhands <openhands@all-hands.dev>
---
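Note for reviewers: the run path layout these docs describe is
{benchmark}/{model_slug}/{github_run_id}/, with `/:@.` in the model name
replaced by `-`. A minimal sketch of that rule follows; the function
names model_slug and run_path and the run ID 123 are hypothetical and
are not claimed to match manage_evals.py.

```python
def model_slug(model_name: str) -> str:
    """Replace each of the characters / : @ . with '-' (per the skill docs)."""
    return model_name.translate(str.maketrans({ch: "-" for ch in "/:@."}))


def run_path(benchmark: str, model_name: str, run_id: int) -> str:
    """Join the three run path components (hypothetical helper)."""
    return f"{benchmark}/{model_slug(model_name)}/{run_id}/"


slug = model_slug("litellm_proxy/claude-sonnet-4-5-20250929")
print(slug)  # litellm_proxy-claude-sonnet-4-5-20250929
print(run_path("programbench", "litellm_proxy/claude-sonnet-4-5-20250929", 123))
```
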
 .agents/skills/manage-evals/SKILL.md                | 2 +-
 .agents/skills/manage-evals/scripts/manage_evals.py | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/.agents/skills/manage-evals/SKILL.md b/.agents/skills/manage-evals/SKILL.md
index a54889b599..54e05df761 100644
--- a/.agents/skills/manage-evals/SKILL.md
+++ b/.agents/skills/manage-evals/SKILL.md
@@ -130,7 +130,7 @@ Each line is a run path. Match by benchmark and model to find the run.
 ### Step 2: Identify the Run Path Components
 
 A run path has three components:
-- **benchmark**: `swebench`, `gaia`, `swtbench`, `commit0`, `swebenchmultimodal`, `terminalbench`
+- **benchmark**: `swebench`, `gaia`, `swtbench`, `commit0`, `swebenchmultimodal`, `terminalbench`, `programbench`
 - **model_slug**: Derived from model name with `/:@.` replaced by `-` (e.g., `litellm_proxy-claude-sonnet-4-5-20250929`)
 - **run_id**: The GitHub Actions workflow run ID from the `OpenHands/evaluation` repo
 
diff --git a/.agents/skills/manage-evals/scripts/manage_evals.py b/.agents/skills/manage-evals/scripts/manage_evals.py
index 5f53439db4..14c3cda10e 100755
--- a/.agents/skills/manage-evals/scripts/manage_evals.py
+++ b/.agents/skills/manage-evals/scripts/manage_evals.py
@@ -42,6 +42,7 @@
     "commit0",
     "swebenchmultimodal",
     "terminalbench",
+    "programbench",
 ]
 TOOL_PRESETS = ["default", "gemini", "gpt5", "planning"]
 AGENT_TYPES = ["default", "acp-claude", "acp-codex"]