
Add programbench to run-eval benchmark choices #3075

Open
neubig wants to merge 2 commits into fix/remote-conversation-finished-hint from feat/programbench

Conversation

Contributor

@neubig neubig commented May 6, 2026

This PR was created by an AI agent (OpenHands) on behalf of the user.

Summary

Adds programbench to the SDK's per-PR / per-release evaluation entrypoint so it can be dispatched the same way as every other benchmark (swebench, gaia, swtbench, commit0, swebenchmultimodal, terminalbench): from the SDK repo, against a chosen SDK ref/commit, with parallel inference workers running on k8s in the evaluation repo's evaluation-jobs namespace.

This PR is now stacked on top of #3191, which contains the standalone RemoteConversation FINISHED/stop-hook race fix discovered during ProgramBench testing. With that split, this PR contains only the ProgramBench workflow/docs changes.

Companion PRs

Changes

  • .github/workflows/run-eval.yml — add programbench to the workflow_dispatch.inputs.benchmark.options list.
  • .agents/skills/manage-evals/SKILL.md — add programbench to the benchmark table.
  • .agents/skills/manage-evals/references/eval-infrastructure.md — document programbench in the benchmark enumeration.
  • .agents/skills/run-eval.md — document programbench as a supported benchmark parameter.

Verification

  • uv run python - <<'PY' ... yaml.safe_load(open('.github/workflows/run-eval.yml')) ... PY: confirms run-eval.yml parses as valid YAML

End-to-end ProgramBench CI evidence from the companion evaluation workflow-dispatch runs:

After the base PR lands, programbench will be dispatchable end-to-end via:

gh workflow run run-eval.yml \
  --repo OpenHands/software-agent-sdk \
  -f sdk_ref=v1.20.1 \
  -f benchmark=programbench \
  -f eval_limit=5 \
  -f model_ids=litellm_proxy/anthropic/claude-sonnet-4-5-20250929

Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant | Architectures | Base Image | Docs / Tags
java | amd64, arm64 | eclipse-temurin:17-jdk | Link
python | amd64, arm64 | nikolaik/python-nodejs:python3.13-nodejs22-slim | Link
golang | amd64, arm64 | golang:1.21-bookworm | Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:564c97c-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-564c97c-python \
  ghcr.io/openhands/agent-server:564c97c-python

All tags pushed for this build

ghcr.io/openhands/agent-server:564c97c-golang-amd64
ghcr.io/openhands/agent-server:564c97cc36c03b59ba2c077626f82e4ee2a467f9-golang-amd64
ghcr.io/openhands/agent-server:feat-programbench-golang-amd64
ghcr.io/openhands/agent-server:564c97c-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:564c97c-golang-arm64
ghcr.io/openhands/agent-server:564c97cc36c03b59ba2c077626f82e4ee2a467f9-golang-arm64
ghcr.io/openhands/agent-server:feat-programbench-golang-arm64
ghcr.io/openhands/agent-server:564c97c-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:564c97c-java-amd64
ghcr.io/openhands/agent-server:564c97cc36c03b59ba2c077626f82e4ee2a467f9-java-amd64
ghcr.io/openhands/agent-server:feat-programbench-java-amd64
ghcr.io/openhands/agent-server:564c97c-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:564c97c-java-arm64
ghcr.io/openhands/agent-server:564c97cc36c03b59ba2c077626f82e4ee2a467f9-java-arm64
ghcr.io/openhands/agent-server:feat-programbench-java-arm64
ghcr.io/openhands/agent-server:564c97c-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:564c97c-python-amd64
ghcr.io/openhands/agent-server:564c97cc36c03b59ba2c077626f82e4ee2a467f9-python-amd64
ghcr.io/openhands/agent-server:feat-programbench-python-amd64
ghcr.io/openhands/agent-server:564c97c-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:564c97c-python-arm64
ghcr.io/openhands/agent-server:564c97cc36c03b59ba2c077626f82e4ee2a467f9-python-arm64
ghcr.io/openhands/agent-server:feat-programbench-python-arm64
ghcr.io/openhands/agent-server:564c97c-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:564c97c-golang
ghcr.io/openhands/agent-server:564c97cc36c03b59ba2c077626f82e4ee2a467f9-golang
ghcr.io/openhands/agent-server:feat-programbench-golang
ghcr.io/openhands/agent-server:564c97c-golang_tag_1.21-bookworm
ghcr.io/openhands/agent-server:564c97c-java
ghcr.io/openhands/agent-server:564c97cc36c03b59ba2c077626f82e4ee2a467f9-java
ghcr.io/openhands/agent-server:feat-programbench-java
ghcr.io/openhands/agent-server:564c97c-eclipse-temurin_tag_17-jdk
ghcr.io/openhands/agent-server:564c97c-python
ghcr.io/openhands/agent-server:564c97cc36c03b59ba2c077626f82e4ee2a467f9-python
ghcr.io/openhands/agent-server:feat-programbench-python
ghcr.io/openhands/agent-server:564c97c-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim

About Multi-Architecture Support

  • Each variant tag (e.g., 564c97c-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 564c97c-python-amd64) are also available if needed

Contributor

github-actions Bot commented May 6, 2026

Python API breakage checks — ✅ PASSED

Result: PASSED

Action log

Contributor

github-actions Bot commented May 6, 2026

REST API breakage checks (OpenAPI) — ✅ PASSED

Result: PASSED

Action log

Contributor

github-actions Bot commented May 7, 2026

Coverage

Coverage Report

File | Stmts | Miss | Cover | Missing
openhands-sdk/openhands/sdk/conversation/impl
   remote_conversation.py | 664 | 100 | 84% | 75, 77, 144, 171, 184, 186–189, 199, 221–222, 227–230, 314, 324–326, 332, 406, 553–556, 558, 584–588, 593–596, 599, 615, 774–775, 779–780, 794, 820–821, 840, 851–852, 872–875, 877–878, 902–904, 907–911, 913–914, 918, 920–928, 930, 967, 1090, 1096–1097, 1111, 1220–1221, 1225, 1230–1234, 1240–1246, 1259, 1264, 1299, 1512–1513
TOTAL | 26076 | 7484 | 71% |

@neubig neubig marked this pull request as ready for review May 10, 2026 14:34
Collaborator

@all-hands-bot all-hands-bot left a comment


Taste Rating: 🟡 Acceptable - Bug fix is solid but requires human review per repo policy

[CRITICAL ISSUES]

  • [PR Description] Misleading scope statement - says "No source / test / runtime changes — choice-list and skill-doc updates only" but includes ~150 lines of production code changes to conversation termination logic in remote_conversation.py plus test additions. The PR description should accurately reflect the full scope.

  • [Missing Eval Evidence] Per repo review policy, changes that affect agent behavior (conversation termination logic, stop hook interaction) require eval evidence before approval. This PR modifies when/how agents terminate on FINISHED status. Draft status is noted, but eval validation should be added before marking ready for review.

[IMPROVEMENT OPPORTUNITIES]

  • [PR Scope] Consider splitting the race condition bug fix from the programbench feature addition. The fix is valuable independently and could be backported/reverted separately if needed.

  • [remote_conversation.py lines 1069-1086] The 18-line comment explaining the race condition signals complex logic. Consider extracting the WS hint handling into a _process_ws_terminal_hint(ws_status) -> bool helper and moving the explanation into its docstring.

  • [remote_conversation.py line 1146] Queue draining on status flip is good defensive programming, but adds concurrency complexity. Consider adding debug logging when draining occurs to help debug future races.

  • [remote_conversation.py lines 1149-1166] _immediate_terminal_statuses() returns a constant; consider making it a class-level constant so the frozenset is not rebuilt on every call (a sketch of both refactors follows this list).
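
A minimal sketch of both suggested refactors. Only ConversationExecutionStatus, RemoteConversation, and the _process_ws_terminal_hint name come from this PR/review; the constant name and everything else are illustrative assumptions, not the SDK's actual code:

from enum import Enum


class ConversationExecutionStatus(str, Enum):  # stand-in for the SDK enum
    RUNNING = "running"
    FINISHED = "finished"
    ERROR = "error"
    STUCK = "stuck"


class RemoteConversation:
    # Class-level constant: built once instead of recreating the frozenset per call.
    IMMEDIATE_TERMINAL_STATUSES: frozenset[str] = frozenset(
        {ConversationExecutionStatus.ERROR.value, ConversationExecutionStatus.STUCK.value}
    )

    def _process_ws_terminal_hint(self, ws_status: str) -> bool:
        """Return True only if the WebSocket status should terminate the run immediately.

        FINISHED is deliberately excluded: it is just a wakeup hint and must be
        confirmed via a REST poll, because a stop hook may flip it back to RUNNING.
        """
        return ws_status in self.IMMEDIATE_TERMINAL_STATUSES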

[RISK ASSESSMENT]

⚠️ Risk Assessment: 🟡 MEDIUM

This fixes a real production race condition (programbench retry-16) with good test coverage, but changes critical conversation lifecycle semantics. The WS-hint-only pattern for FINISHED with REST confirmation is the right fix, but adds complexity to an already intricate state machine. Requires human review and benchmark validation.

Recommendation: Request eval evidence showing programbench (and ideally one other benchmark like SWE-bench) completes successfully with this change before approval.

VERDICT:

⏸️ Needs eval evidence: Core logic is sound, but repo policy requires benchmark validation for agent behavior changes.

KEY INSIGHT:

The fix correctly identifies that WebSocket terminal status events are hints, not authoritative termination signals, because stop hooks can revert FINISHED→RUNNING. The REST polling confirmation pattern is the right architectural choice.


Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:

  1. Add a .agents/skills/custom-codereview-guide.md file to your branch (or edit it if one already exists) with the /codereview trigger and the context the reviewer is missing (e.g., "Security concerns about X do not apply here because Y"). See the customization docs for the required frontmatter format.
  2. Re-request a review - the reviewer reads guidelines from the PR branch, so your changes take effect immediately.
  3. When your PR is merged, the guideline file goes through normal code review by repository maintainers.

Resolve with AI? Install the iterate skill in your agent and run /iterate to automatically drive this PR through CI, review, and QA until it's merge-ready.

Collaborator

@all-hands-bot all-hands-bot left a comment


✅ QA Report: PASS

This PR successfully adds programbench to the SDK's evaluation workflow dispatch options and includes a race condition fix for RemoteConversation discovered during programbench testing.

Does this PR achieve its stated goal?

Yes. The PR accomplishes both of its objectives:

  1. Primary goal — Add programbench as a dispatchable benchmark choice: The workflow file now includes programbench in the benchmark options list (line 24 of run-eval.yml), the YAML is syntactically valid, and all documentation is updated consistently. When a user dispatches this workflow from the GitHub Actions UI, programbench will appear as a selectable option alongside the existing benchmarks.

  2. Secondary goal — Fix RemoteConversation WebSocket race condition: The code changes correctly implement the fix by treating WS FINISHED status as a hint rather than authoritative termination. Only ERROR and STUCK statuses trigger immediate termination; FINISHED requires REST poll confirmation to avoid racing stop hooks that can revert the status back to RUNNING.
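
As a purely illustrative sketch of that hint-then-confirm pattern (function names and the REST polling helper are assumed, not the SDK's actual implementation):

import time

IMMEDIATE_TERMINAL = frozenset({"error", "stuck"})  # FINISHED is deliberately absent


def wait_until_terminal(ws_status_hints, fetch_status_via_rest, poll_interval=1.0):
    """Consume WebSocket status hints; only ERROR/STUCK terminate immediately."""
    for status in ws_status_hints:
        if status in IMMEDIATE_TERMINAL:
            return status  # trust ERROR/STUCK hints as-is
        if status == "finished":
            confirmed = fetch_status_via_rest()
            while confirmed == "running":  # stop hook denied the stop; keep waiting
                time.sleep(poll_interval)
                confirmed = fetch_status_via_rest()
            return confirmed
    return fetch_status_via_rest()  # hints exhausted; fall back to a REST check


# Example: a FINISHED hint that a stop hook reverted once before finishing for real.
rest_polls = iter(["running", "finished"])
print(wait_until_terminal(iter(["finished"]), lambda: next(rest_polls), poll_interval=0))
# -> finished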

Phase | Result
Environment Setup | ✅ Repository cloned, dependencies verified
CI Status | ✅ All checks passing (sdk-tests, tools-tests, agent-server-tests, workspace-tests, windows-tests all SUCCESS)
Functional Verification | ✅ Workflow structure validated, race condition fix verified

Functional Verification

Test 1: Workflow File Validation

Step 1 — Verify baseline (main branch):
Checked the workflow file on main branch:

$ gh workflow view "run-eval.yml" --repo OpenHands/software-agent-sdk --yaml | grep -A 7 "options:"
options:
    - gaia
    - swebench
    - swtbench
    - commit0
    - swebenchmultimodal
    - terminalbench

This shows programbench is NOT in the main branch options.

Step 2 — Apply PR changes:
Checked out feat/programbench branch (commit 6ad2915).

Step 3 — Verify with PR changes:
Validated the workflow file:

$ python3 -c "import yaml; yaml.safe_load(open('.github/workflows/run-eval.yml'))"
✓ YAML is valid

$ gh workflow view "run-eval.yml" --ref feat/programbench --yaml | grep -A 8 "options:"
options:
    - gaia
    - swebench
    - swtbench
    - commit0
    - swebenchmultimodal
    - terminalbench
    - programbench

This confirms programbench is now in the workflow options and the YAML is syntactically valid.

Step 4 — Verify the parameter is passed through:
Checked the workflow file to confirm the benchmark parameter is forwarded to downstream workflows:

$ grep "BENCHMARK:" .github/workflows/run-eval.yml
BENCHMARK: ${{ github.event.inputs.benchmark || 'swebench' }}

$ grep '"benchmark"' .github/workflows/run-eval.yml | grep arg
--arg benchmark "$BENCHMARK" \

This shows the workflow correctly accepts and forwards the benchmark parameter.

Test 2: Documentation Consistency

Verified all documentation files include programbench:

$ grep -n "programbench" .agents/skills/manage-evals/SKILL.md
72:| `programbench` | ProgramBench — program-repair tasks against gold-standard test binaries |

$ grep -n "programbench" .agents/skills/run-eval.md
34:- `benchmark`: `swebench`, `swebenchmultimodal`, `gaia`, `swtbench`, `commit0`, `multiswebench`, `terminalbench`, `programbench`

$ grep -n "programbench" .agents/skills/manage-evals/references/eval-infrastructure.md
45:- **benchmark**: `swebench`, `gaia`, `swtbench`, `commit0`, `swebenchmultimodal`, `terminalbench`, `programbench`

All documentation is updated consistently.

Test 3: Race Condition Fix Implementation

Verified the fix logic:
Imported and tested the race condition fix:

$ python3
>>> from openhands.sdk.conversation.impl.remote_conversation import RemoteConversation
>>> from openhands.sdk.conversation.state import ConversationExecutionStatus
>>> statuses = RemoteConversation._immediate_terminal_statuses()
>>> print(statuses)
frozenset({'stuck', 'error'})
>>> ConversationExecutionStatus.FINISHED.value in statuses
False
>>> ConversationExecutionStatus.ERROR.value in statuses
True
>>> ConversationExecutionStatus.STUCK.value in statuses
True

This confirms the fix is implemented correctly:

  • FINISHED is NOT in the immediate terminal statuses set
  • ERROR and STUCK ARE in the set
  • ✓ This prevents the WebSocket race with stop hooks (FINISHED requires REST confirmation)

Unable to Verify

1. End-to-End Workflow Dispatch

What was attempted:
Tried to verify a complete programbench evaluation workflow run.

Why it couldn't be verified:
The PR description states this is one of three coordinated PRs and is "ready to flip out of draft once the companion PRs land" (OpenHands/benchmarks #703 and OpenHands/evaluation #544). Without those companion changes merged, dispatching a programbench evaluation would fail at the orchestrator/executor level even though the SDK workflow accepts the parameter correctly.

What was verified instead:

  • Workflow YAML syntax is valid
  • programbench is in the options list and will appear in the GitHub Actions dispatch UI
  • The parameter is correctly forwarded to downstream workflows

Suggested AGENTS.md guidance:
For future QA of coordinated multi-repo changes, document the expected failure modes when testing PRs before companion changes merge, and what partial verification is sufficient (e.g., "verify YAML structure and parameter forwarding; full dispatch verification requires companion PRs").

2. Race Condition Scenario Reproduction

What was attempted:
Tried to trigger the specific race condition (WS FINISHED event racing stop hook evaluation) to verify the fix works in practice.

Why it couldn't be verified:
Reproducing this race requires:

  1. Setting up an agent server
  2. Configuring stop hooks that can deny stopping (return rc=2)
  3. Triggering a conversation that hits the exact timing window where agent.step() sets FINISHED, the WS event is delivered, and then the stop hook flips status back to RUNNING

This level of integration testing is beyond the scope of QA (it's covered by the new unit tests in test_remote_conversation.py).

What was verified instead:

  • The _immediate_terminal_statuses() method returns the correct set ({'error', 'stuck'} excluding 'finished')
  • The code logic correctly checks this set before short-circuiting on WS events
  • The PR description documents the empirical evidence from programbench retry-16 that triggered the bug
  • CI tests pass (including the new regression tests)

Suggested AGENTS.md guidance:
For race condition fixes in RemoteConversation, QA should verify the logic is implemented correctly (set membership checks, queue draining) and defer to unit tests for full race scenario coverage. Manual reproduction of race conditions is impractical without extensive integration test harnesses.

Issues Found

None.

Contributor Author

neubig commented May 10, 2026

I updated the PR description to address the scope/evidence feedback: it now explicitly covers the RemoteConversation production/test changes as well as the run-eval option/docs updates, and links the successful ProgramBench end-to-end CI runs from the companion evaluation workflow (25552125670, 25552168816, 25552772174). Ready for another look.

This PR comment was created by an AI agent (OpenHands) on behalf of the user.

@neubig neubig requested a review from all-hands-bot May 10, 2026 15:30
Collaborator

@all-hands-bot all-hands-bot left a comment


🟡 Acceptable - Bug fix is solid but requires human review per repo policy

[EVAL RISK POLICY]

Per the repository's code review guidelines, changes to terminal/event handling require human maintainer review before approval. While eval evidence is provided (successful GH Actions runs), the eval-risk policy exception requires a link to the eval monitor (openhands-eval-monitor.vercel.app) AND human maintainer confirmation.

This PR changes conversation termination behavior (a critical execution path). Even though it's a bug fix for a real race condition, a human maintainer should review the eval results before merging.

[RISK ASSESSMENT]

  • [Overall PR] ⚠️ Risk Assessment: 🟡 MEDIUM

Changes conversation termination logic, which is a critical code path. The fix is correct and well-tested, but the behavior change could affect evaluation metrics (agents may run slightly longer until stop hooks confirm termination). The successful programbench CI runs are encouraging, but human review is warranted given the impact area.

VERDICT:
⚠️ Needs human maintainer review: Code is solid, but eval-risk policy requires human review for termination behavior changes.

KEY INSIGHT:
The race fix correctly treats WebSocket terminal events as wakeup hints rather than authoritative termination signals, ensuring stop hooks can complete their evaluation before conversation teardown.


Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:

  1. Add a .agents/skills/custom-codereview-guide.md file to your branch (or edit it if one already exists) with the /codereview trigger and the context the reviewer is missing (e.g., "Security concerns about X do not apply here because Y"). See the customization docs for the required frontmatter format.
  2. Re-request a review - the reviewer reads guidelines from the PR branch, so your changes take effect immediately.
  3. When your PR is merged, the guideline file goes through normal code review by repository maintainers.

Resolve with AI? Install the iterate skill in your agent and run /iterate to automatically drive this PR through CI, review, and QA until it's merge-ready.

Collaborator

@all-hands-bot all-hands-bot left a comment


✅ QA Report: PASS

Verified that programbench is now dispatchable through the SDK's evaluation workflow with the same interface as other benchmarks (swebench, gaia, etc.).

Does this PR achieve its stated goal?

Yes. This PR successfully adds programbench to the SDK's evaluation workflow dispatch system. I verified that: (1) the workflow file syntax is valid and includes programbench in the benchmark choice options, (2) the workflow correctly constructs and forwards the benchmark parameter to the downstream evaluation workflow, (3) a simulated dispatch payload with benchmark=programbench passes validation and would be accepted by the GitHub API. The companion PRs in the benchmarks and evaluation repos (already merged/in-progress) complete the end-to-end integration, but this PR's changes are necessary to close the loop from the SDK repo.

Phase | Result
Environment Setup | ✅ Repository checked out on feat/programbench branch
CI Status | ✅ All checks passing (30 successful, 0 failing, 2 pending QA workflows)
Functional Verification | ✅ Workflow dispatch validated, RemoteConversation race fix verified with tests

Functional Verification

Test 1: Workflow File Validation

Step 1 — Verify programbench is absent in the base (establish baseline):

Checked the PR diff to confirm programbench was not previously in the workflow options list. The diff shows:

+                    - programbench

This confirms programbench is a new addition.

Step 2 — Verify programbench is present in the PR branch:

Extracted the benchmark options from .github/workflows/run-eval.yml:

Available benchmark options:
  - gaia
  - swebench
  - swtbench
  - commit0
  - swebenchmultimodal
  - terminalbench
  - programbench

Validated YAML syntax:

python3 -c "import yaml; yaml.safe_load(open('.github/workflows/run-eval.yml')); print('✓ Workflow YAML syntax is valid')"
# Output: ✓ Workflow YAML syntax is valid

This confirms programbench is correctly added to the workflow options with valid YAML syntax.

Test 2: Workflow Dispatch Payload Construction

Step 1 — Verify the workflow forwards the benchmark parameter:

Inspected the workflow dispatch step (lines 366-386 in run-eval.yml):

  • Line 376: --arg benchmark "$BENCHMARK" — parameter is captured from workflow input
  • Line 386: benchmark: $benchmark — parameter is included in the JSON payload sent to the downstream evaluation workflow

This confirms the workflow correctly propagates the benchmark choice.

Step 2 — Simulate a programbench dispatch payload:

Ran a simulation of the jq command that constructs the dispatch payload:

jq -n \
  --arg benchmark "programbench" \
  --arg sdk "v1.21.1" \
  --arg eval_limit "1" \
  # ... (other parameters)
  '{ref: $ref, inputs: {benchmark: $benchmark, ...}}'

Output:

{
  "ref": "main",
  "inputs": {
    "benchmark": "programbench",
    "sdk_commit": "v1.21.1",
    "eval_limit": "1",
    ...
  }
}

✓ Payload successfully constructed with benchmark=programbench
✓ This validates that a workflow dispatch with programbench would correctly format the API call to the downstream evaluation workflow.

Test 3: RemoteConversation Race Fix

Step 1 — Verify the race condition is documented:

Checked AGENTS.md for the race condition documentation:

WebSocket terminal status events are *hints*, not authoritative termination.
The server-side `LocalConversation.run` loop releases its state lock at the
end of each iteration, so a `FINISHED` status set by `agent.step()` is visible
to clients before the *next* loop iteration runs stop hooks. If a stop hook
returns rc=2 (denying the stop), status flips back to RUNNING.

This confirms the race condition is well-documented.
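
Purely as an illustration of the rc=2 convention quoted above (a hypothetical hook script; the real hook interface is not part of this PR):

#!/usr/bin/env python3
# Hypothetical stop hook: exit code 2 denies the stop, so the server flips the
# conversation status back to RUNNING; exit code 0 allows the stop to proceed.
import sys

work_remaining = True  # placeholder for whatever condition the hook actually checks
sys.exit(2 if work_remaining else 0)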

Step 2 — Verify the fix implementation:

Inspected openhands-sdk/openhands/sdk/conversation/impl/remote_conversation.py:

  • Lines 1072-1084: Detailed comment explaining WS FINISHED is now a "wakeup hint"
  • Lines 1089-1097: WS status only terminates immediately for ERROR/STUCK
  • Lines 1148-1163: New _immediate_terminal_statuses() helper excludes FINISHED
  • Lines 1165-1177: New _drain_terminal_status_queue() helper prevents stale events

This confirms the race fix is properly implemented.

Step 3 — Verify test coverage for the race fix:

Checked tests/sdk/conversation/remote/test_remote_conversation.py:

  • Lines 770-852: test_remote_conversation_run_ws_finished_is_only_a_hint_not_terminal — regression test that seeds a WS FINISHED event and verifies the client waits for REST confirmation (polling through RUNNING states before accepting the final FINISHED)
  • Lines 858-898: test_remote_conversation_run_ws_error_still_terminates_immediately — verifies ERROR/STUCK still fast-path terminate without REST polling

Both tests include detailed docstrings explaining the race scenario and expected behavior.

✓ The RemoteConversation race fix has comprehensive test coverage and clear documentation.

Test 4: Documentation Consistency

Verify programbench is documented across all files:

Searched for programbench references:

.agents/skills/manage-evals/SKILL.md: programbench listed in supported benchmarks table
.agents/skills/manage-evals/references/eval-infrastructure.md: programbench in benchmark list
.agents/skills/run-eval.md: programbench in key parameters documentation
AGENTS.md: programbench mentioned in the race fix context

✓ Documentation is consistent across all skill files and repository guidance.

Issues Found

None.


Summary: This PR successfully integrates programbench into the SDK's evaluation dispatch workflow. The workflow file is syntactically valid, correctly forwards the benchmark parameter, and would accept benchmark=programbench in workflow dispatch calls. The companion RemoteConversation race fix (discovered during ProgramBench CI) is well-implemented with comprehensive regression tests. All CI checks pass. Ready to merge.

Wires ProgramBench into the SDK's run-eval.yml entrypoint so it can be
dispatched from the SDK repo on a per-PR / per-release basis like the
other benchmarks (swebench, gaia, swtbench, commit0, swebenchmultimodal,
terminalbench). The choice is forwarded to OpenHands/evaluation
eval-job.yml's BENCHMARK input verbatim — that repo's PR (#544) adds the
matching dispatcher (build-programbench job, run_programbench.sh phase
runner, helm values for k8s with DIND_ENABLED=true).

Also extend the benchmark list in the manage-evals skill / run-eval skill
docs so the agent knows programbench is dispatchable.

No code changes — choice-list / docs only.

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig force-pushed the feat/programbench branch from 412bf5c to 564c97c Compare May 10, 2026 15:57
@neubig neubig changed the base branch from main to fix/remote-conversation-finished-hint May 10, 2026 15:57
Contributor Author

neubig commented May 10, 2026

Restacked this PR on top of the new standalone RemoteConversation race-fix PR (#3191). This PR now contains only the ProgramBench run-eval option and docs changes; the critical conversation termination behavior change is isolated for focused human review in #3191.

This PR comment was created by an AI agent (OpenHands) on behalf of the user.

@neubig neubig requested a review from all-hands-bot May 10, 2026 15:58
Collaborator

@all-hands-bot all-hands-bot left a comment


LGTM! Clean addition of programbench to the eval workflow. Follows the existing pattern perfectly and companion infrastructure PRs are already merged.

Collaborator

@all-hands-bot all-hands-bot left a comment


⚠️ QA Report: PASS WITH ISSUES

Verified that programbench has been added to the workflow dispatcher, but found two locations where the benchmark list needs to be updated.

Does this PR achieve its stated goal?

Partially. The PR aims to add programbench as a dispatchable benchmark option alongside the existing benchmarks. The GitHub Actions workflow file (.github/workflows/run-eval.yml) is correctly updated and most documentation is consistent. However, two issues prevent full achievement:

  1. .agents/skills/manage-evals/scripts/manage_evals.py (lines 38-45) is missing programbench in the BENCHMARKS list. Users cannot use python manage_evals.py trigger --benchmark programbench because the script's argument parser will reject it.

  2. .agents/skills/manage-evals/SKILL.md (line 133) in the "Run Path Components" section lists valid benchmark values but does not include programbench.

Both files need to be updated for programbench to be fully integrated into the evaluation workflow.

Phase | Result
Environment Setup | ✅ Repository checked out, dependencies available
CI Status | ✅ 21 checks passing, 7 pending (Docker builds), 0 failing
Functional Verification | ⚠️ Workflow updated correctly, but supporting tools incomplete

Functional Verification

Test 1: Baseline — Verify benchmarks before this PR

Checked base branch workflow options:

git show fix/remote-conversation-finished-hint:.github/workflows/run-eval.yml | head -30

Baseline had these benchmarks:

  • gaia
  • swebench
  • swtbench
  • commit0
  • swebenchmultimodal
  • terminalbench

(Missing programbench — this is what the PR adds)


Test 2: Verify workflow file changes

Validated YAML syntax and options:

import yaml
with open('.github/workflows/run-eval.yml') as f:
    workflow = yaml.safe_load(f)
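# Note: PyYAML (YAML 1.1) parses the bare "on:" key as boolean True, hence workflow[True]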
benchmark_options = workflow[True]['workflow_dispatch']['inputs']['benchmark']['options']
print(benchmark_options)
# Output: ['gaia', 'swebench', 'swtbench', 'commit0', 'swebenchmultimodal', 'terminalbench', 'programbench']

YAML is valid and programbench is now in the options list.


Test 3: Verify documentation consistency

Checked that programbench appears in skill documentation:

grep -A 2 "programbench" \
  .agents/skills/manage-evals/SKILL.md \
  .agents/skills/manage-evals/references/eval-infrastructure.md \
  .agents/skills/run-eval.md

Results:

  • .agents/skills/manage-evals/SKILL.md: Has programbench in benchmark table (line 72)
  • .agents/skills/manage-evals/references/eval-infrastructure.md: Lists programbench in benchmark enumeration
  • .agents/skills/run-eval.md: Documents programbench as supported parameter

Test 4: Attempt to dispatch workflow with programbench

Constructed dispatch payload and attempted to trigger:

gh api -X POST \
  /repos/OpenHands/software-agent-sdk/actions/workflows/run-eval.yml/dispatches \
  --input /tmp/dispatch_payload.json
# Payload: benchmark=programbench, sdk_ref=v1.21.1, eval_limit=1

Result:

{"message": "Resource not accessible by personal access token", "status": "403"}

⚠️ Cannot dispatch due to token permissions (read-only token). This is an environmental limitation, not a code issue. The PR description provides evidence of successful dispatches in the companion evaluation repo runs.


Test 5: Check for other places that need updating

Searched for references to other benchmarks to ensure complete coverage:

grep -r "terminalbench" --include="*.md" --include="*.yml" --include="*.py" \
  .github .agents | grep -v "programbench"

Found two locations missing programbench:

Issue 1: .agents/skills/manage-evals/scripts/manage_evals.py lines 38-45:

BENCHMARKS = [
    "swebench",
    "gaia",
    "swtbench",
    "commit0",
    "swebenchmultimodal",
    "terminalbench",
    # ❌ Missing: "programbench"
]

This prevents users from running:

python manage_evals.py trigger --sdk-ref v1.21.1 --benchmark programbench --eval-limit 5
# Will fail with: error: argument --benchmark: invalid choice: 'programbench'
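
The fix is simply appending the new entry; a sketch of the corrected list (reconstructed from the excerpt above):

BENCHMARKS = [
    "swebench",
    "gaia",
    "swtbench",
    "commit0",
    "swebenchmultimodal",
    "terminalbench",
    "programbench",  # new entry
]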

Issue 2: .agents/skills/manage-evals/SKILL.md line 133 (run path components section):

- **benchmark**: `swebench`, `gaia`, `swtbench`, `commit0`, `swebenchmultimodal`, `terminalbench`

Should include programbench in the list.


Issues Found

  • 🟠 Important: .agents/skills/manage-evals/scripts/manage_evals.py is missing programbench in the BENCHMARKS list (lines 38-45). Add "programbench", after "terminalbench", to enable the manage_evals.py script to accept programbench as a valid benchmark choice.

  • 🟠 Important: .agents/skills/manage-evals/SKILL.md line 133 in the "Run Path Components" section lists valid benchmarks but is missing programbench. Update the line to:

    - **benchmark**: `swebench`, `gaia`, `swtbench`, `commit0`, `swebenchmultimodal`, `terminalbench`, `programbench`

Recommendation: Address both issues before merging to ensure programbench is fully usable through all documented workflows.

Co-authored-by: openhands <openhands@all-hands.dev>
Contributor Author

neubig commented May 10, 2026

Addressed the QA follow-up in 904861d: added programbench to the manage_evals.py benchmark choices and the manage-evals run-path documentation, then verified the trigger help accepts --benchmark programbench and run-eval YAML still parses. Ready for another look.

This PR comment was created by an AI agent (OpenHands) on behalf of the user.

@neubig neubig requested a review from all-hands-bot May 10, 2026 16:09
Collaborator

@all-hands-bot all-hands-bot left a comment


🟢 Good taste - Clean addition following existing patterns.

[RISK ASSESSMENT]

  • [Overall PR] ⚠️ Risk Assessment: 🟢 LOW

Configuration-only change adding programbench to workflow dispatch options and documentation. No code logic changes, follows the exact pattern of existing benchmarks (swebench, gaia, etc.). Companion infrastructure PRs already merged.

Collaborator

@all-hands-bot all-hands-bot left a comment


✅ QA Report: PASS

This PR successfully adds programbench to the evaluation workflow and all supporting documentation. The changes are consistent, complete, and ready to enable programbench dispatching.

Does this PR achieve its stated goal?

Yes. The PR's goal is to "add programbench to the SDK's per-PR / per-release evaluation entrypoint so it can be dispatched the same way every other benchmark is." I verified that:

  1. programbench is now a valid option in the run-eval.yml workflow dispatch inputs
  2. All documentation references are updated consistently across 5 files
  3. The benchmark lists in the workflow YAML and manage_evals.py script match perfectly (7 benchmarks each)
  4. The example dispatch command from the PR description uses valid parameters
  5. The PR author has already successfully run end-to-end tests in the companion evaluation repository

Phase | Result
Environment Setup | ✅ Repository cloned, on PR branch (904861d)
CI Status | ✅ 10 checks passing, 3 pending (including this QA), 1 skipped
Functional Verification | ✅ All changes verified programmatically

Functional Verification

Test 1: YAML Validity and Workflow Configuration

Baseline: Before this PR, programbench was not in the workflow options.

Verification with PR changes:

Ran Python script to parse .github/workflows/run-eval.yml:

import yaml
with open('.github/workflows/run-eval.yml') as f:
    data = yaml.safe_load(f)
    benchmark_options = data[True]['workflow_dispatch']['inputs']['benchmark']['options']

Result:

Benchmark options: ['gaia', 'swebench', 'swtbench', 'commit0', 'swebenchmultimodal', 'terminalbench', 'programbench']
✓ 'programbench' in options: True
✓ Total benchmark options: 7
✓ YAML is valid and can be parsed

Interpretation: The workflow YAML is syntactically valid and programbench is now present as the 7th benchmark option, confirming the workflow can accept it as a dispatch parameter.


Test 2: Documentation Consistency

Verification: Checked that programbench was added to all required files:

✓ .github/workflows/run-eval.yml - Contains 'programbench': True
✓ .agents/skills/manage-evals/SKILL.md - Contains 'programbench': True (found 2 times)
✓ .agents/skills/manage-evals/references/eval-infrastructure.md - Contains 'programbench': True (found 1 time)
✓ .agents/skills/manage-evals/scripts/manage_evals.py - Contains 'programbench': True (found 1 time)
✓ .agents/skills/run-eval.md - Contains 'programbench': True (found 1 time)

Interpretation: All documentation and supporting files are updated consistently. Users will find programbench referenced in skills, scripts, and infrastructure docs.


Test 3: Benchmark List Consistency

Verification: Compared benchmark lists between workflow YAML and Python script to ensure they match:

Workflow benchmarks (7): ['commit0', 'gaia', 'programbench', 'swebench', 'swebenchmultimodal', 'swtbench', 'terminalbench']
Script benchmarks (7): ['commit0', 'gaia', 'programbench', 'swebench', 'swebenchmultimodal', 'swtbench', 'terminalbench']

✓ Workflow and script benchmark lists match perfectly

Interpretation: No discrepancies between the workflow definition and the manage_evals.py script. Both recognize the same 7 benchmarks including programbench.
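
A sketch of how such a cross-check can be scripted (paths taken from this PR; loading the script via importlib assumes it keeps its CLI behind an if __name__ == "__main__" guard):

import importlib.util
import yaml

# Benchmark options declared in the workflow (PyYAML parses the bare "on:" key as True)
with open(".github/workflows/run-eval.yml") as f:
    workflow_options = set(
        yaml.safe_load(f)[True]["workflow_dispatch"]["inputs"]["benchmark"]["options"]
    )

# BENCHMARKS list declared in the manage-evals helper script
spec = importlib.util.spec_from_file_location(
    "manage_evals", ".agents/skills/manage-evals/scripts/manage_evals.py"
)
manage_evals = importlib.util.module_from_spec(spec)
spec.loader.exec_module(manage_evals)
script_benchmarks = set(manage_evals.BENCHMARKS)

assert workflow_options == script_benchmarks, (workflow_options, script_benchmarks)
print(f"Lists match ({len(workflow_options)} benchmarks): {sorted(workflow_options)}")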


Test 4: Example Dispatch Command Validation

Verification: Validated the example command from the PR description would work:

gh workflow run run-eval.yml \
  --repo OpenHands/software-agent-sdk \
  -f sdk_ref=v1.20.1 \
  -f benchmark=programbench \
  -f eval_limit=5 \
  -f model_ids=litellm_proxy/anthropic/claude-sonnet-4-5-20250929

Checked each parameter against workflow inputs:

✓ Parameter 'sdk_ref' exists (Type: string, Required: True)
✓ Parameter 'benchmark' exists (Type: choice, Required: False)
  ✓ Value 'programbench' is in allowed options
✓ Parameter 'eval_limit' exists (Type: string, Required: False)
✓ Parameter 'model_ids' exists (Type: string, Required: False)

Interpretation: The example dispatch command uses only valid parameters, and programbench is accepted as a benchmark choice. The workflow can be triggered with these parameters after merge.
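
A sketch of that parameter check (the payload mirrors the example command; the validation rules are a simplified approximation of what the GitHub API enforces):

import yaml

payload = {
    "sdk_ref": "v1.20.1",
    "benchmark": "programbench",
    "eval_limit": "5",
    "model_ids": "litellm_proxy/anthropic/claude-sonnet-4-5-20250929",
}

with open(".github/workflows/run-eval.yml") as f:
    inputs = yaml.safe_load(f)[True]["workflow_dispatch"]["inputs"]

for name, value in payload.items():
    spec = inputs.get(name)
    assert spec is not None, f"unknown workflow input: {name}"
    if spec.get("type") == "choice":
        assert value in spec["options"], f"{value!r} is not a valid choice for {name}"

missing = [n for n, s in inputs.items() if s.get("required") and n not in payload]
assert not missing, f"required inputs not supplied: {missing}"
print("dispatch payload is consistent with the workflow's declared inputs")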


Test 5: End-to-End Evidence

From PR description: The author reports successful end-to-end runs in the companion evaluation repository:

Interpretation: While I cannot independently verify these external workflow runs, the author has provided specific run IDs and claims success. Combined with the verified changes in this repository, this provides reasonable confidence the integration works end-to-end.

Issues Found

None.
