Summary
A full SWE-Bench build embedded in eval-job.yml timed out after 6 hours on April 23, 2026 (OpenHands/evaluation#24809880634), but the original leading hypothesis about agent-type=acp-claude was wrong.
The investigation showed that the slow run was a cold build on main without the disk-space cleanup changes, while the "fast standalone" comparison was running a different benchmarks branch that already contained aggressive disk reclamation changes.
Confirmed Findings
1. agent-type was not affecting SWE-Bench image builds
benchmarks/swebench/build_images.py parsed --agent-type but did not thread it into any build step. In other words, for SWE-Bench image building, agent-type=acp-claude vs default did not change image contents, build phases, or tag selection.
This dead plumbing has now been removed in:
Those PRs also remove the same dead build-time plumbing for SWT-Bench. This does not remove inference-time agent_type, which is still real for ACP runs.
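A minimal sketch of how a parsed-but-unused flag like this can be detected. It assumes a hypothetical plan_build function standing in for whatever build_images.py derives from its arguments; BuildPlan, plan_build, the --sdk-sha flag, and the image and tag names are all illustrative and not taken from the real script. If two values of a flag produce identical plans, the flag has no build-time effect.

```python
import argparse
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildPlan:
    """Hypothetical summary of what a build would produce."""
    base_image: str
    tag: str
    phases: tuple


def parse_args(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--agent-type", default="default")
    parser.add_argument("--sdk-sha", default="ac3c350c")  # illustrative flag
    return parser.parse_args(argv)


def plan_build(args):
    # Hypothetical stand-in: args.agent_type is parsed but never read here,
    # which mirrors the dead plumbing described above.
    return BuildPlan(
        base_image="swebench-base",          # illustrative name
        tag=f"sdk-{args.sdk_sha}",           # illustrative tag scheme
        phases=("builder", "base", "assembly"),
    )


def flag_affects_plan(flag, value_a, value_b):
    """Return True if switching the flag's value changes the build plan."""
    plan_a = plan_build(parse_args([flag, value_a]))
    plan_b = plan_build(parse_args([flag, value_b]))
    return plan_a != plan_b


agent_type_matters = flag_affects_plan("--agent-type", "default", "acp-claude")
sdk_sha_matters = flag_affects_plan("--sdk-sha", "ac3c350c", "0000000")
print("agent-type changes plan:", agent_type_matters)
print("sdk-sha changes plan:", sdk_sha_matters)
```

A check of this shape could run as a unit test, so a flag that silently stops influencing the build is caught before it becomes a debugging red herring.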
2. The “fast standalone” run was not an apples-to-apples comparison
The fast run was:
- OpenHands/benchmarks#24810843867
- started at 2026-04-23T01:01:53Z
- completed at 2026-04-23T05:31:41Z
- head SHA: 72caec0 on branch openhands/fix-swebench-disk-space-oom
That branch differed from main in the parts that matter here:
- added a Free disk space workflow step before Docker setup
- added pre-assembly Buildx pruning
- added per-image cleanup during assembly (docker rmi, docker system prune, docker builder prune)
The slow eval-embedded run used benchmarks-branch=main, so it did not have those changes.
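The per-image cleanup on that branch can be pictured roughly as follows. This is a sketch only: the three Docker subcommands are the ones named above, but the -f flags, the ordering, and the function names are assumptions rather than the branch's actual code.

```python
import subprocess


def cleanup_commands(image_tag):
    """Commands to reclaim disk after one image has been assembled and pushed."""
    return [
        ["docker", "rmi", image_tag],          # drop the local copy of the image
        ["docker", "system", "prune", "-f"],   # remove stopped containers, dangling images
        ["docker", "builder", "prune", "-f"],  # remove build cache
    ]


def cleanup_after_image(image_tag, run=subprocess.run):
    """Run each cleanup command; a failed cleanup is recorded, not raised."""
    results = []
    for cmd in cleanup_commands(image_tag):
        completed = run(cmd, check=False)
        results.append((cmd, completed.returncode))
    return results


# Dry run: print the commands without invoking Docker.
for cmd in cleanup_commands("example/image:tag"):
    print(" ".join(cmd))
```

Passing the runner as a parameter keeps the helper testable on machines without Docker. Using check=False means a cleanup failure does not abort the build loop.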
3. The available log data points to build/assembly time, not GHCR push time
From the archived logs of the fast standalone run:
- builder image started around 2026-04-23T01:07:48Z
- base-image phase completed at 2026-04-23T03:05:41Z
- assembly phase ran until about 2026-04-23T05:30:03Z
Per-image assembly timings in that run were roughly:
- average total: 131.8s
- average Docker build time: 106.9s
- average push time: 5.8s
So the available evidence does not support GHCR push bandwidth as the primary explanation.
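The arithmetic behind that conclusion can be reproduced from the figures quoted above. The sketch below uses only the timestamps and averages listed for the fast standalone run; the variable names are illustrative.

```python
from datetime import datetime, timezone


def parse_utc(ts):
    """Parse a timestamp such as 2026-04-23T01:07:48Z into an aware datetime."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


builder_start = parse_utc("2026-04-23T01:07:48Z")
base_done = parse_utc("2026-04-23T03:05:41Z")
assembly_done = parse_utc("2026-04-23T05:30:03Z")

# Wall-clock time before assembly (builder + base images) and during assembly.
pre_assembly_s = (base_done - builder_start).total_seconds()
assembly_s = (assembly_done - base_done).total_seconds()

# Per-image averages during assembly, in seconds.
avg_total_s = 131.8
avg_build_s = 106.9
avg_push_s = 5.8
other_s = avg_total_s - avg_build_s - avg_push_s  # time in neither build nor push

build_share = avg_build_s / avg_total_s
push_share = avg_push_s / avg_total_s

print(f"pre-assembly: {pre_assembly_s / 60:.1f} min, assembly: {assembly_s / 60:.1f} min")
print(f"build share: {build_share:.1%}, push share: {push_share:.1%}, other: {other_s:.1f}s")
```

Push accounts for only a few percent of the per-image time, while the Docker build accounts for roughly four fifths of it.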
4. workflow_call vs workflow_dispatch was not the root cause
A later retry run through the evaluation workflow:
That proves the reusable workflow path itself was not inherently slow. This retry benefited from cache hits / already-built artifacts, so it does not disprove the cold-build disk-pressure theory.
Most Likely Root Cause
The best-supported explanation is:
- the original failing run was a cold full-500 build on main
- main did not yet include the disk-space mitigations from the branch that the faster standalone run was using
- the job eventually hit GitHub’s 6-hour cap before the post-build artifact upload steps ran
- because the build logs were only archived after the build step, the most useful artifacts were lost when the job was killed
Impact
This failure mode is especially bad because it can look like a very poor evaluation result when the real problem is partial image availability:
- only part of the image set gets built
- the remaining instances fail downstream with ImageNotFound
- the final success/resolve numbers can understate the model’s actual performance on submitted instances
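To make the distortion concrete, the sketch below compares a resolve rate computed over all instances with one computed only over instances whose image existed. The 121-of-500 image count comes from the slow run; the resolved count of 60 is invented purely for illustration and is not a measured result.

```python
def resolve_rates(total, images_built, resolved):
    """Return (rate over all instances, rate over instances that had an image)."""
    if not 0 <= resolved <= images_built <= total:
        raise ValueError("expected 0 <= resolved <= images_built <= total")
    overall = resolved / total
    runnable = resolved / images_built if images_built else 0.0
    return overall, runnable


# 121 / 500 images built (from the slow run); 60 resolved is a made-up figure.
overall, runnable = resolve_rates(total=500, images_built=121, resolved=60)
print(f"over all instances:        {overall:.1%}")
print(f"over instances with image: {runnable:.1%}")
```

Reporting both numbers, or failing the evaluation outright when images are missing, would keep an infrastructure failure from being read as a model result.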
Follow-up Work
Completed / in progress:
- agent-type plumbing: OpenHands/benchmarks#696
Still worth doing separately:
- make build telemetry durable even when the build step is killed before always() artifact upload runs
- keep the disk-space mitigations on the production main path for cold full-500 builds
- run a controlled cold-build comparison on the same SDK commit after the disk-space fixes are in place
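One possible shape for the durable-telemetry item is sketched below. It is not the project's implementation. It assumes a hypothetical build_one callable and a self-imposed time budget set below the 6-hour cap, so the build step exits on its own and later upload steps still get a chance to run. Each per-image record is written and flushed as soon as it is produced.

```python
import json
import time


def build_with_deadline(image_ids, build_one, log_path, budget_s, clock=time.monotonic):
    """Build images until the time budget is spent, logging each result as it lands."""
    start = clock()
    built, skipped = [], []
    with open(log_path, "a", encoding="utf-8") as log:
        for image_id in image_ids:
            if clock() - start >= budget_s:
                skipped.append(image_id)  # out of budget: do not start another build
                continue
            t0 = clock()
            ok = build_one(image_id)      # hypothetical per-image build callable
            record = {"image": image_id, "ok": ok, "seconds": clock() - t0}
            log.write(json.dumps(record) + "\n")
            log.flush()                   # push the record out of Python's buffer now
            built.append(image_id)
    return built, skipped
```

The budget is only checked between images, so a single very long build can still overrun it. A log file on the runner also does not survive the runner itself being torn down, so this would need to be paired with an upload step that actually runs or with streaming the records elsewhere.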
Original Slow Run
| Field | Value |
| --- | --- |
| Eval workflow run | OpenHands/evaluation#24809880634 |
| Build job | Build SWE-Bench Images / build-and-push |
| Started | 2026-04-23T00:30:24Z |
| Killed | 2026-04-23T06:35:24Z |
| Images built before timeout | about 121 / 500 |
| SDK commit | ac3c350c |
| agent_type input | acp-claude |
| benchmarks branch | main |
| Runner label | ubuntu-latest-8core |
| Build logs | not preserved; upload step never ran |
Related: OpenHands/evaluation#523