Add programbench to run-eval benchmark choices #3075
Conversation
Python API breakage checks — ✅ PASSED
REST API breakage checks (OpenAPI) — ✅ PASSED
all-hands-bot
left a comment
Taste Rating: 🟡 Acceptable - Bug fix is solid but requires human review per repo policy
[CRITICAL ISSUES]
- [PR Description] Misleading scope statement — says "No source / test / runtime changes — choice-list and skill-doc updates only" but includes ~150 lines of production code changes to conversation termination logic in `remote_conversation.py`, plus test additions. The PR description should accurately reflect the full scope.
- [Missing Eval Evidence] Per repo review policy, changes that affect agent behavior (conversation termination logic, stop hook interaction) require eval evidence before approval. This PR modifies when/how agents terminate on FINISHED status. Draft status is noted, but eval validation should be added before marking ready for review.
[IMPROVEMENT OPPORTUNITIES]
- [PR Scope] Consider splitting the race condition bug fix from the programbench feature addition. The fix is valuable independently and could be backported/reverted separately if needed.
- [remote_conversation.py lines 1069-1086] The 18-line comment explaining the race condition signals complex logic. Consider extracting WS hint handling to `_process_ws_terminal_hint(ws_status) -> bool` with docs in the docstring.
- [remote_conversation.py line 1146] Queue draining on status flip is good defensive programming, but adds concurrency complexity. Consider adding debug logging when draining occurs to help debug future races.
- [remote_conversation.py lines 1149-1166] `_immediate_terminal_statuses()` returns a constant — make it a class constant to avoid recreating the frozenset on every call; a sketch of that shape follows below.
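A minimal sketch of that suggested refactor, using simplified stand-ins rather than the SDK's actual `RemoteConversation` implementation:

```python
from enum import Enum


class ConversationExecutionStatus(str, Enum):
    """Simplified stand-in for the SDK enum, reduced to the statuses discussed here."""

    RUNNING = "running"
    FINISHED = "finished"
    ERROR = "error"
    STUCK = "stuck"


class RemoteConversationSketch:
    # Built once at class-definition time instead of on every call.
    _IMMEDIATE_TERMINAL_STATUSES: frozenset[str] = frozenset(
        {
            ConversationExecutionStatus.ERROR.value,
            ConversationExecutionStatus.STUCK.value,
        }
    )

    @classmethod
    def _immediate_terminal_statuses(cls) -> frozenset[str]:
        return cls._IMMEDIATE_TERMINAL_STATUSES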
[RISK ASSESSMENT]
This fixes a real production race condition (programbench retry-16) with good test coverage, but changes critical conversation lifecycle semantics. The WS-hint-only pattern for FINISHED with REST confirmation is the right fix, but adds complexity to an already intricate state machine. Requires human review and benchmark validation.
Recommendation: Request eval evidence showing programbench (and ideally one other benchmark like SWE-bench) completes successfully with this change before approval.
VERDICT:
⏸️ Needs eval evidence: Core logic is sound, but repo policy requires benchmark validation for agent behavior changes.
KEY INSIGHT:
The fix correctly identifies that WebSocket terminal status events are hints, not authoritative termination signals, because stop hooks can revert FINISHED→RUNNING. The REST polling confirmation pattern is the right architectural choice.
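As a rough illustration only (hypothetical helper names, not the SDK's actual client code), the hint-then-confirm pattern looks roughly like this:

```python
import time
from typing import Callable

# WS statuses the client may trust immediately; FINISHED is deliberately excluded.
IMMEDIATE_TERMINAL = frozenset({"error", "stuck"})


def wait_for_termination(
    ws_status: str,
    poll_status: Callable[[], str],
    interval: float = 1.0,
) -> str:
    """Treat a WS terminal event as a wakeup hint; confirm FINISHED over REST."""
    if ws_status in IMMEDIATE_TERMINAL:
        return ws_status
    # A stop hook may still flip FINISHED back to RUNNING, so keep polling the
    # REST status endpoint until it reports a terminal status of its own.
    while True:
        status = poll_status()
        if status in IMMEDIATE_TERMINAL or status == "finished":
            return status
        time.sleep(interval)
```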
Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:
- Add a `.agents/skills/custom-codereview-guide.md` file to your branch (or edit it if one already exists) with the `/codereview` trigger and the context the reviewer is missing (e.g., "Security concerns about X do not apply here because Y"). See the customization docs for the required frontmatter format.
- Re-request a review — the reviewer reads guidelines from the PR branch, so your changes take effect immediately.
- When your PR is merged, the guideline file goes through normal code review by repository maintainers.

Resolve with AI? Install the iterate skill in your agent and run `/iterate` to automatically drive this PR through CI, review, and QA until it's merge-ready.
all-hands-bot
left a comment
✅ QA Report: PASS
This PR successfully adds programbench to the SDK's evaluation workflow dispatch options and includes a race condition fix for RemoteConversation discovered during programbench testing.
Does this PR achieve its stated goal?
Yes. The PR accomplishes both of its objectives:
- Primary goal — Add programbench as a dispatchable benchmark choice: The workflow file now includes `programbench` in the benchmark options list (line 24 of `run-eval.yml`), the YAML is syntactically valid, and all documentation is updated consistently. When a user dispatches this workflow from the GitHub Actions UI, programbench will appear as a selectable option alongside the existing benchmarks.
- Secondary goal — Fix RemoteConversation WebSocket race condition: The code changes correctly implement the fix by treating WS `FINISHED` status as a hint rather than authoritative termination. Only `ERROR` and `STUCK` statuses trigger immediate termination; `FINISHED` requires REST poll confirmation to avoid racing stop hooks that can revert the status back to `RUNNING`.
| Phase | Result |
|---|---|
| Environment Setup | ✅ Repository cloned, dependencies verified |
| CI Status | ✅ All checks passing (sdk-tests, tools-tests, agent-server-tests, workspace-tests, windows-tests all SUCCESS) |
| Functional Verification | ✅ Workflow structure validated, race condition fix verified |
Functional Verification
Test 1: Workflow File Validation
Step 1 — Verify baseline (main branch):
Checked the workflow file on main branch:
```
$ gh workflow view "run-eval.yml" --repo OpenHands/software-agent-sdk --yaml | grep -A 7 "options:"
options:
- gaia
- swebench
- swtbench
- commit0
- swebenchmultimodal
- terminalbench
```

This shows programbench is NOT in the main branch options.
Step 2 — Apply PR changes:
Checked out feat/programbench branch (commit 6ad2915).
Step 3 — Verify with PR changes:
Validated the workflow file:
```
$ python3 -c "import yaml; yaml.safe_load(open('.github/workflows/run-eval.yml'))"
✓ YAML is valid

$ gh workflow view "run-eval.yml" --ref feat/programbench --yaml | grep -A 8 "options:"
options:
- gaia
- swebench
- swtbench
- commit0
- swebenchmultimodal
- terminalbench
- programbench
```

This confirms programbench is now in the workflow options and the YAML is syntactically valid.
Step 4 — Verify the parameter is passed through:
Checked the workflow file to confirm the benchmark parameter is forwarded to downstream workflows:
```
$ grep "BENCHMARK:" .github/workflows/run-eval.yml
BENCHMARK: ${{ github.event.inputs.benchmark || 'swebench' }}

$ grep '"benchmark"' .github/workflows/run-eval.yml | grep arg
--arg benchmark "$BENCHMARK" \
```

This shows the workflow correctly accepts and forwards the benchmark parameter.
Test 2: Documentation Consistency
Verified all documentation files include programbench:
```
$ grep -n "programbench" .agents/skills/manage-evals/SKILL.md
72:| `programbench` | ProgramBench — program-repair tasks against gold-standard test binaries |

$ grep -n "programbench" .agents/skills/run-eval.md
34:- `benchmark`: `swebench`, `swebenchmultimodal`, `gaia`, `swtbench`, `commit0`, `multiswebench`, `terminalbench`, `programbench`

$ grep -n "programbench" .agents/skills/manage-evals/references/eval-infrastructure.md
45:- **benchmark**: `swebench`, `gaia`, `swtbench`, `commit0`, `swebenchmultimodal`, `terminalbench`, `programbench`
```

All documentation is updated consistently.
Test 3: Race Condition Fix Implementation
Verified the fix logic:
Imported and tested the race condition fix:
```
$ python3
>>> from openhands.sdk.conversation.impl.remote_conversation import RemoteConversation
>>> from openhands.sdk.conversation.state import ConversationExecutionStatus
>>> statuses = RemoteConversation._immediate_terminal_statuses()
>>> print(statuses)
frozenset({'stuck', 'error'})
>>> ConversationExecutionStatus.FINISHED.value in statuses
False
>>> ConversationExecutionStatus.ERROR.value in statuses
True
>>> ConversationExecutionStatus.STUCK.value in statuses
True
```

This confirms the fix is implemented correctly:
- ✓ `FINISHED` is NOT in the immediate terminal statuses set
- ✓ `ERROR` and `STUCK` ARE in the set
- ✓ This prevents the WebSocket race with stop hooks (FINISHED requires REST confirmation)
Unable to Verify
1. End-to-End Workflow Dispatch
What was attempted:
Tried to verify a complete programbench evaluation workflow run.
Why it couldn't be verified:
The PR description states this is one of three coordinated PRs and is "ready to flip out of draft once the companion PRs land" (OpenHands/benchmarks #703 and OpenHands/evaluation #544). Without those companion changes merged, dispatching a programbench evaluation would fail at the orchestrator/executor level even though the SDK workflow accepts the parameter correctly.
What was verified instead:
- Workflow YAML syntax is valid
- `programbench` is in the options list and will appear in the GitHub Actions dispatch UI
- The parameter is correctly forwarded to downstream workflows
Suggested AGENTS.md guidance:
For future QA of coordinated multi-repo changes, document the expected failure modes when testing PRs before companion changes merge, and what partial verification is sufficient (e.g., "verify YAML structure and parameter forwarding; full dispatch verification requires companion PRs").
2. Race Condition Scenario Reproduction
What was attempted:
Tried to trigger the specific race condition (WS FINISHED event racing stop hook evaluation) to verify the fix works in practice.
Why it couldn't be verified:
Reproducing this race requires:
- Setting up an agent server
- Configuring stop hooks that can deny stopping (return rc=2)
- Triggering a conversation that hits the exact timing window where `agent.step()` sets FINISHED, the WS event is delivered, and then the stop hook flips status back to RUNNING
This level of integration testing is beyond the scope of QA (it's covered by the new unit tests in test_remote_conversation.py).
What was verified instead:
- The `_immediate_terminal_statuses()` method returns the correct set (`{'error', 'stuck'}`, excluding `'finished'`)
- The code logic correctly checks this set before short-circuiting on WS events
- The PR description documents the empirical evidence from programbench retry-16 that triggered the bug
- CI tests pass (including the new regression tests)
Suggested AGENTS.md guidance:
For race condition fixes in RemoteConversation, QA should verify the logic is implemented correctly (set membership checks, queue draining) and defer to unit tests for full race scenario coverage. Manual reproduction of race conditions is impractical without extensive integration test harnesses.
Issues Found
None.
I updated the PR description to address the scope/evidence feedback: it now explicitly covers the RemoteConversation production/test changes as well as the run-eval option/docs updates, and links the successful ProgramBench end-to-end CI runs from the companion evaluation workflow (25552125670, 25552168816, 25552772174). Ready for another look.

This PR comment was created by an AI agent (OpenHands) on behalf of the user.
all-hands-bot
left a comment
🟡 Acceptable - Bug fix is solid but requires human review per repo policy
[EVAL RISK POLICY]
Per the repository's code review guidelines, changes to terminal/event handling require human maintainer review before approval. While eval evidence is provided (successful GH Actions runs), the eval-risk policy exception requires a link to the eval monitor (openhands-eval-monitor.vercel.app) AND human maintainer confirmation.
This PR changes conversation termination behavior (a critical execution path). Even though it's a bug fix for a real race condition, a human maintainer should review the eval results before merging.
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟡 MEDIUM
Changes conversation termination logic, which is a critical code path. The fix is correct and well-tested, but the behavior change could affect evaluation metrics (agents may run slightly longer until stop hooks confirm termination). The successful programbench CI runs are encouraging, but human review is warranted given the impact area.
VERDICT:
KEY INSIGHT:
The race fix correctly treats WebSocket terminal events as wakeup hints rather than authoritative termination signals, ensuring stop hooks can complete their evaluation before conversation teardown.
all-hands-bot
left a comment
✅ QA Report: PASS
Verified that programbench is now dispatchable through the SDK's evaluation workflow with the same interface as other benchmarks (swebench, gaia, etc.).
Does this PR achieve its stated goal?
Yes. This PR successfully adds programbench to the SDK's evaluation workflow dispatch system. I verified that: (1) the workflow file syntax is valid and includes programbench in the benchmark choice options, (2) the workflow correctly constructs and forwards the benchmark parameter to the downstream evaluation workflow, (3) a simulated dispatch payload with benchmark=programbench passes validation and would be accepted by the GitHub API. The companion PRs in the benchmarks and evaluation repos (already merged/in-progress) complete the end-to-end integration, but this PR's changes are necessary to close the loop from the SDK repo.
| Phase | Result |
|---|---|
| Environment Setup | ✅ Repository checked out on feat/programbench branch |
| CI Status | ✅ All checks passing (30 successful, 0 failing, 2 pending QA workflows) |
| Functional Verification | ✅ Workflow dispatch validated, RemoteConversation race fix verified with tests |
Functional Verification
Test 1: Workflow File Validation
Step 1 — Verify programbench is absent in the base (establish baseline):
Checked the PR diff to confirm programbench was not previously in the workflow options list. The diff shows:

```
+ - programbench
```

This confirms programbench is a new addition.
Step 2 — Verify programbench is present in the PR branch:
Extracted the benchmark options from .github/workflows/run-eval.yml:

```
Available benchmark options:
- gaia
- swebench
- swtbench
- commit0
- swebenchmultimodal
- terminalbench
- programbench
```

Validated YAML syntax:

```
python3 -c "import yaml; yaml.safe_load(open('.github/workflows/run-eval.yml')); print('✓ Workflow YAML syntax is valid')"
# Output: ✓ Workflow YAML syntax is valid
```

This confirms programbench is correctly added to the workflow options with valid YAML syntax.
Test 2: Workflow Dispatch Payload Construction
Step 1 — Verify the workflow forwards the benchmark parameter:
Inspected the workflow dispatch step (lines 366-386 in run-eval.yml):
- Line 376: `--arg benchmark "$BENCHMARK"` — parameter is captured from workflow input
- Line 386: `benchmark: $benchmark` — parameter is included in the JSON payload sent to the downstream evaluation workflow
This confirms the workflow correctly propagates the benchmark choice.
Step 2 — Simulate a programbench dispatch payload:
Ran a simulation of the jq command that constructs the dispatch payload:

```
jq -n \
  --arg benchmark "programbench" \
  --arg sdk "v1.21.1" \
  --arg eval_limit "1" \
  # ... (other parameters)
  '{ref: $ref, inputs: {benchmark: $benchmark, ...}}'
```

Output:

```
{
  "ref": "main",
  "inputs": {
    "benchmark": "programbench",
    "sdk_commit": "v1.21.1",
    "eval_limit": "1",
    ...
  }
}
```

✓ Payload successfully constructed with benchmark=programbench
✓ This validates that a workflow dispatch with programbench would correctly format the API call to the downstream evaluation workflow.
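For reference, a self-contained Python sketch of the same payload construction (the field values are the illustrative ones from the simulation above, not a full reproduction of the workflow's jq step):

```python
import json

# Illustrative values only; the real job fills these from workflow inputs via jq
# before POSTing to the downstream workflow-dispatch API.
payload = {
    "ref": "main",
    "inputs": {
        "benchmark": "programbench",
        "sdk_commit": "v1.21.1",
        "eval_limit": "1",
    },
}

print(json.dumps(payload, indent=2))
```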
Test 3: RemoteConversation Race Fix
Step 1 — Verify the race condition is documented:
Checked AGENTS.md for the race condition documentation:
```
WebSocket terminal status events are *hints*, not authoritative termination.
The server-side `LocalConversation.run` loop releases its state lock at the
end of each iteration, so a `FINISHED` status set by `agent.step()` is visible
to clients before the *next* loop iteration runs stop hooks. If a stop hook
returns rc=2 (denying the stop), status flips back to RUNNING.
```
This confirms the race condition is well-documented.
Step 2 — Verify the fix implementation:
Inspected openhands-sdk/openhands/sdk/conversation/impl/remote_conversation.py:
- Lines 1072-1084: Detailed comment explaining WS FINISHED is now a "wakeup hint"
- Lines 1089-1097: WS status only terminates immediately for ERROR/STUCK
- Lines 1148-1163: New `_immediate_terminal_statuses()` helper excludes FINISHED
- Lines 1165-1177: New `_drain_terminal_status_queue()` helper prevents stale events
This confirms the race fix is properly implemented.
Step 3 — Verify test coverage for the race fix:
Checked tests/sdk/conversation/remote/test_remote_conversation.py:
- Lines 770-852: `test_remote_conversation_run_ws_finished_is_only_a_hint_not_terminal` — regression test that seeds a WS FINISHED event and verifies the client waits for REST confirmation (polling through RUNNING states before accepting the final FINISHED)
- Lines 858-898: `test_remote_conversation_run_ws_error_still_terminates_immediately` — verifies ERROR/STUCK still fast-path terminate without REST polling
Both tests include detailed docstrings explaining the race scenario and expected behavior.
✓ The RemoteConversation race fix has comprehensive test coverage and clear documentation.
Test 4: Documentation Consistency
Verify programbench is documented across all files:
Searched for programbench references:
- `.agents/skills/manage-evals/SKILL.md`: programbench listed in supported benchmarks table
- `.agents/skills/manage-evals/references/eval-infrastructure.md`: programbench in benchmark list
- `.agents/skills/run-eval.md`: programbench in key parameters documentation
- `AGENTS.md`: programbench mentioned in the race fix context
✓ Documentation is consistent across all skill files and repository guidance.
Issues Found
None.
Summary: This PR successfully integrates programbench into the SDK's evaluation dispatch workflow. The workflow file is syntactically valid, correctly forwards the benchmark parameter, and would accept benchmark=programbench in workflow dispatch calls. The companion RemoteConversation race fix (discovered during ProgramBench CI) is well-implemented with comprehensive regression tests. All CI checks pass. Ready to merge.
Wires ProgramBench into the SDK's run-eval.yml entrypoint so it can be dispatched from the SDK repo on a per-PR / per-release basis like the other benchmarks (swebench, gaia, swtbench, commit0, swebenchmultimodal, terminalbench). The choice is forwarded to OpenHands/evaluation eval-job.yml's BENCHMARK input verbatim — that repo's PR (#544) adds the matching dispatcher (build-programbench job, run_programbench.sh phase runner, helm values for k8s with DIND_ENABLED=true).

Also extends the benchmark list in the manage-evals skill / run-eval skill docs so the agent knows programbench is dispatchable. No code changes — choice-list / docs only.

Co-authored-by: openhands <openhands@all-hands.dev>
Restacked this PR on top of the new standalone RemoteConversation race-fix PR (#3191). This PR now contains only the ProgramBench run-eval option and docs changes; the critical conversation termination behavior change is isolated for focused human review in #3191.

This PR comment was created by an AI agent (OpenHands) on behalf of the user.
all-hands-bot
left a comment
LGTM! Clean addition of programbench to the eval workflow. Follows the existing pattern perfectly and companion infrastructure PRs are already merged.
all-hands-bot
left a comment
⚠️ QA Report: PASS WITH ISSUES
Verified that programbench has been added to the workflow dispatcher, but found two locations where the benchmark list needs to be updated.
Does this PR achieve its stated goal?
Partially. The PR aims to add programbench as a dispatchable benchmark option alongside the existing benchmarks. The GitHub Actions workflow file (.github/workflows/run-eval.yml) is correctly updated and most documentation is consistent. However, two issues prevent full achievement:
- `.agents/skills/manage-evals/scripts/manage_evals.py` (lines 38-45) is missing `programbench` in the `BENCHMARKS` list. Users cannot use `python manage_evals.py trigger --benchmark programbench` because the script's argument parser will reject it.
- `.agents/skills/manage-evals/SKILL.md` (line 133) in the "Run Path Components" section lists valid benchmark values but does not include `programbench`.
Both files need to be updated for programbench to be fully integrated into the evaluation workflow.
| Phase | Result |
|---|---|
| Environment Setup | ✅ Repository checked out, dependencies available |
| CI Status | ✅ 21 checks passing, 7 pending (Docker builds), 0 failing |
| Functional Verification | ⚠️ Workflow and most docs verified; two files still missing programbench |
Functional Verification
Test 1: Baseline — Verify benchmarks before this PR
Checked base branch workflow options:
```
git show fix/remote-conversation-finished-hint:.github/workflows/run-eval.yml | head -30
```

Baseline had these benchmarks:
- gaia
- swebench
- swtbench
- commit0
- swebenchmultimodal
- terminalbench
(Missing programbench — this is what the PR adds)
Test 2: Verify workflow file changes
Validated YAML syntax and options:
```python
import yaml

with open('.github/workflows/run-eval.yml') as f:
    workflow = yaml.safe_load(f)

benchmark_options = workflow[True]['workflow_dispatch']['inputs']['benchmark']['options']
print(benchmark_options)
# Output: ['gaia', 'swebench', 'swtbench', 'commit0', 'swebenchmultimodal', 'terminalbench', 'programbench']
```

✅ YAML is valid and programbench is now in the options list.
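Note on the `workflow[True]` indexing above: PyYAML's `safe_load` follows YAML 1.1, which resolves a bare `on:` key to the boolean `True`, so the workflow's trigger section is keyed by `True` rather than the string `'on'`. A small standalone check:

```python
import yaml

# YAML 1.1 treats the unquoted key `on` as a boolean, so GitHub workflow files
# loaded with yaml.safe_load expose their trigger section under the key True.
doc = yaml.safe_load("on:\n  workflow_dispatch: {}\n")
print(list(doc.keys()))  # [True]
print(doc[True])         # {'workflow_dispatch': {}}
```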
Test 3: Verify documentation consistency
Checked that programbench appears in skill documentation:
```
grep -A 2 "programbench" \
  .agents/skills/manage-evals/SKILL.md \
  .agents/skills/manage-evals/references/eval-infrastructure.md \
  .agents/skills/run-eval.md
```

Results:
- ✅ `.agents/skills/manage-evals/SKILL.md`: Has programbench in benchmark table (line 72)
- ✅ `.agents/skills/manage-evals/references/eval-infrastructure.md`: Lists programbench in benchmark enumeration
- ✅ `.agents/skills/run-eval.md`: Documents programbench as supported parameter
Test 4: Attempt to dispatch workflow with programbench
Constructed dispatch payload and attempted to trigger:
```
gh api -X POST \
  /repos/OpenHands/software-agent-sdk/actions/workflows/run-eval.yml/dispatches \
  --input /tmp/dispatch_payload.json
# Payload: benchmark=programbench, sdk_ref=v1.21.1, eval_limit=1
```

Result:
```
{"message": "Resource not accessible by personal access token", "status": "403"}
```
Test 5: Check for other places that need updating
Searched for references to other benchmarks to ensure complete coverage:
```
grep -r "terminalbench" --include="*.md" --include="*.yml" --include="*.py" \
  .github .agents | grep -v "programbench"
```

Found two locations missing programbench:
Issue 1: `.agents/skills/manage-evals/scripts/manage_evals.py` lines 38-45:
```python
BENCHMARKS = [
    "swebench",
    "gaia",
    "swtbench",
    "commit0",
    "swebenchmultimodal",
    "terminalbench",
    # ❌ Missing: "programbench"
]
```

This prevents users from running:
```
python manage_evals.py trigger --sdk-ref v1.21.1 --benchmark programbench --eval-limit 5
# Will fail with: error: argument --benchmark: invalid choice: 'programbench'
```

Issue 2: `.agents/skills/manage-evals/SKILL.md` line 133 (run path components section):
```
- **benchmark**: `swebench`, `gaia`, `swtbench`, `commit0`, `swebenchmultimodal`, `terminalbench`
```

Should include programbench in the list.
Issues Found
- 🟠 Important: `.agents/skills/manage-evals/scripts/manage_evals.py` is missing `programbench` in the `BENCHMARKS` list (lines 38-45). Add `"programbench",` after `"terminalbench",` to enable the manage_evals.py script to accept programbench as a valid benchmark choice; see the sketch of the corrected list below.
- 🟠 Important: `.agents/skills/manage-evals/SKILL.md` line 133 in the "Run Path Components" section lists valid benchmarks but is missing `programbench`. Update the line to:
  - **benchmark**: `swebench`, `gaia`, `swtbench`, `commit0`, `swebenchmultimodal`, `terminalbench`, `programbench`
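A sketch of the corrected list for the first issue (the rest of the script is unchanged; only the missing entry is added):

```python
# manage_evals.py BENCHMARKS list with the missing entry appended
BENCHMARKS = [
    "swebench",
    "gaia",
    "swtbench",
    "commit0",
    "swebenchmultimodal",
    "terminalbench",
    "programbench",
]
```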
Recommendation: Address both issues before merging to ensure programbench is fully usable through all documented workflows.
Co-authored-by: openhands <openhands@all-hands.dev>
Addressed the QA follow-up in 904861d: added `programbench` to the remaining benchmark lists flagged above.
all-hands-bot
left a comment
🟢 Good taste - Clean addition following existing patterns.
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟢 LOW
Configuration-only change adding programbench to workflow dispatch options and documentation. No code logic changes, follows the exact pattern of existing benchmarks (swebench, gaia, etc.). Companion infrastructure PRs already merged.
all-hands-bot
left a comment
✅ QA Report: PASS
This PR successfully adds programbench to the evaluation workflow and all supporting documentation. The changes are consistent, complete, and ready to enable programbench dispatching.
Does this PR achieve its stated goal?
Yes. The PR's goal is to "add programbench to the SDK's per-PR / per-release evaluation entrypoint so it can be dispatched the same way every other benchmark is." I verified that:
- `programbench` is now a valid option in the `run-eval.yml` workflow dispatch inputs
- All documentation references are updated consistently across 5 files
- The benchmark lists in the workflow YAML and `manage_evals.py` script match perfectly (7 benchmarks each)
- The example dispatch command from the PR description uses valid parameters
- The PR author has already successfully run end-to-end tests in the companion evaluation repository
| Phase | Result |
|---|---|
| Environment Setup | ✅ Repository cloned, on PR branch (904861d) |
| CI Status | ✅ 10 checks passing, 3 pending (including this QA), 1 skipped |
| Functional Verification | ✅ All changes verified programmatically |
Functional Verification
Test 1: YAML Validity and Workflow Configuration
Baseline: Before this PR, programbench was not in the workflow options.
Verification with PR changes:
Ran Python script to parse .github/workflows/run-eval.yml:

```python
import yaml

with open('.github/workflows/run-eval.yml') as f:
    data = yaml.safe_load(f)

benchmark_options = data[True]['workflow_dispatch']['inputs']['benchmark']['options']
```

Result:
```
Benchmark options: ['gaia', 'swebench', 'swtbench', 'commit0', 'swebenchmultimodal', 'terminalbench', 'programbench']
✓ 'programbench' in options: True
✓ Total benchmark options: 7
✓ YAML is valid and can be parsed
```
Interpretation: The workflow YAML is syntactically valid and programbench is now present as the 7th benchmark option, confirming the workflow can accept it as a dispatch parameter.
Test 2: Documentation Consistency
Verification: Checked that programbench was added to all required files:
✓ .github/workflows/run-eval.yml - Contains 'programbench': True
✓ .agents/skills/manage-evals/SKILL.md - Contains 'programbench': True (found 2 times)
✓ .agents/skills/manage-evals/references/eval-infrastructure.md - Contains 'programbench': True (found 1 time)
✓ .agents/skills/manage-evals/scripts/manage_evals.py - Contains 'programbench': True (found 1 time)
✓ .agents/skills/run-eval.md - Contains 'programbench': True (found 1 time)
Interpretation: All documentation and supporting files are updated consistently. Users will find programbench referenced in skills, scripts, and infrastructure docs.
Test 3: Benchmark List Consistency
Verification: Compared benchmark lists between workflow YAML and Python script to ensure they match:
Workflow benchmarks (7): ['commit0', 'gaia', 'programbench', 'swebench', 'swebenchmultimodal', 'swtbench', 'terminalbench']
Script benchmarks (7): ['commit0', 'gaia', 'programbench', 'swebench', 'swebenchmultimodal', 'swtbench', 'terminalbench']
✓ Workflow and script benchmark lists match perfectly
Interpretation: No discrepancies between the workflow definition and the manage_evals.py script. Both recognize the same 7 benchmarks including programbench.
Test 4: Example Dispatch Command Validation
Verification: Validated the example command from the PR description would work:
```
gh workflow run run-eval.yml \
  --repo OpenHands/software-agent-sdk \
  -f sdk_ref=v1.20.1 \
  -f benchmark=programbench \
  -f eval_limit=5 \
  -f model_ids=litellm_proxy/anthropic/claude-sonnet-4-5-20250929
```

Checked each parameter against workflow inputs:
✓ Parameter 'sdk_ref' exists (Type: string, Required: True)
✓ Parameter 'benchmark' exists (Type: choice, Required: False)
✓ Value 'programbench' is in allowed options
✓ Parameter 'eval_limit' exists (Type: string, Required: False)
✓ Parameter 'model_ids' exists (Type: string, Required: False)
Interpretation: The example dispatch command uses only valid parameters, and programbench is accepted as a benchmark choice. The workflow can be triggered with these parameters after merge.
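A small script along these lines (hypothetical, but relying only on the workflow structure verified in Test 1) could re-run the same parameter check locally:

```python
import yaml

# The trigger section is keyed by True because YAML 1.1 parses the bare `on:` key as a boolean.
with open(".github/workflows/run-eval.yml") as f:
    inputs = yaml.safe_load(f)[True]["workflow_dispatch"]["inputs"]

# Parameters used by the example `gh workflow run` command above.
for name in ("sdk_ref", "benchmark", "eval_limit", "model_ids"):
    assert name in inputs, f"missing workflow input: {name}"

assert "programbench" in inputs["benchmark"]["options"]
print("✓ example dispatch parameters are valid")
```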
Test 5: End-to-End Evidence
From PR description: The author reports successful end-to-end runs in the companion evaluation repository:
- Parent eval job: https://github.com/OpenHands/evaluation/actions/runs/25552125670 (success)
- Inference phase: https://github.com/OpenHands/evaluation/actions/runs/25552168816 (success)
- Eval harness phase: https://github.com/OpenHands/evaluation/actions/runs/25552772174 (success)
Interpretation: While I cannot independently verify these external workflow runs, the author has provided specific run IDs and claims success. Combined with the verified changes in this repository, this provides reasonable confidence the integration works end-to-end.
Issues Found
None.
This PR was created by an AI agent (OpenHands) on behalf of the user.
Summary
Adds `programbench` to the SDK's per-PR / per-release evaluation entrypoint so it can be dispatched the same way every other benchmark (`swebench`, `gaia`, `swtbench`, `commit0`, `swebenchmultimodal`, `terminalbench`) is — from the SDK repo, against a chosen SDK ref/commit, with parallel inference workers running on k8s in the evaluation repo's `evaluation-jobs` namespace.

This PR is now stacked on top of #3191, which contains the standalone `RemoteConversation` FINISHED/stop-hook race fix discovered during ProgramBench testing. With that split, this PR contains only the ProgramBench workflow/docs changes.

Companion PRs

Changes
- `.github/workflows/run-eval.yml` — add `programbench` to the `workflow_dispatch.inputs.benchmark.options` list.
- `.agents/skills/manage-evals/SKILL.md` — add `programbench` to the benchmark table.
- `.agents/skills/manage-evals/references/eval-infrastructure.md` — document `programbench` in the benchmark enumeration.
- `.agents/skills/run-eval.md` — document `programbench` as a supported `benchmark` parameter.

Verification
- `uv run python - <<'PY' ... yaml.safe_load(open('.github/workflows/run-eval.yml')) ... PY` — `run-eval.yml` is valid YAML
- End-to-end ProgramBench CI evidence from the companion evaluation workflow-dispatch runs: `Verify ProgramBench Images`, `Run Inference`, `Wait For Infer Output`, and `Run Eval Harness` for `benchmark=programbench`, `eval_limit=1` (cmatrix).

After the base PR lands, `programbench` will be dispatchable end-to-end via:

Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
- `eclipse-temurin:17-jdk`
- `nikolaik/python-nodejs:python3.13-nodejs22-slim`
- `golang:1.21-bookworm`

Pull (multi-arch manifest)

```
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:564c97c-python
```

Run

All tags pushed for this build

About Multi-Architecture Support
- The variant tag (e.g., `564c97c-python`) is a multi-arch manifest supporting both amd64 and arm64
- Architecture-specific tags (e.g., `564c97c-python-amd64`) are also available if needed