Investigate: SWE-bench full-500 image build >6h #695

## Summary

A full SWE-Bench build embedded in `eval-job.yml` timed out after 6 hours on **April 23, 2026** ([OpenHands/evaluation#24809880634](https://github.com/OpenHands/evaluation/actions/runs/24809880634)). The original leading hypothesis, which blamed `agent-type=acp-claude`, was wrong.

The investigation showed that the slow run was a **cold build on `main` without the disk-space cleanup changes**, while the "fast standalone" comparison was running a **different benchmarks branch** that already contained aggressive disk reclamation changes.

## Confirmed Findings

### 1. `agent-type` was not affecting SWE-Bench image builds
`benchmarks/swebench/build_images.py` parsed `--agent-type` but did not thread it into any build step. In other words, for SWE-Bench image building, `agent-type=acp-claude` vs `default` did **not** change image contents, build phases, or tag selection.

This dead plumbing has now been removed in:
- [OpenHands/benchmarks#696](https://github.com/OpenHands/benchmarks/pull/696)
- [OpenHands/evaluation#526](https://github.com/OpenHands/evaluation/pull/526)

Those PRs also remove the same dead build-time plumbing for SWT-Bench. This does **not** remove inference-time `agent_type`, which is still real for ACP runs.
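To make "parsed but not threaded" concrete, here is a minimal sketch of the pattern. It is not the actual code from `benchmarks/swebench/build_images.py`; the function names, the `--instances` flag, and the tag format are all hypothetical and exist only for illustration.

```python
import argparse


def parse_args(argv):
    # Hypothetical CLI: --agent-type is accepted, mirroring the dead flag.
    parser = argparse.ArgumentParser(description="hypothetical image-build CLI")
    parser.add_argument("--agent-type", default="default")
    parser.add_argument("--instances", nargs="+", default=[])
    return parser.parse_args(argv)


def plan_builds(instances):
    # The plan depends only on the instance list; there is no agent_type
    # parameter, so the flag cannot influence tags or build steps.
    # The tag format below is invented for this sketch.
    return [f"example-registry/swebench-{name}:latest" for name in instances]


def main(argv):
    args = parse_args(argv)
    # args.agent_type is parsed here but never passed on: dead plumbing.
    return plan_builds(args.instances)


default_plan = main(["--instances", "a", "b"])
acp_plan = main(["--agent-type", "acp-claude", "--instances", "a", "b"])
print(default_plan == acp_plan)
```

Because the flag never reaches `plan_builds`, both invocations yield the same plan, which is the behavior described in this finding.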

### 2. The “fast standalone” run was not an apples-to-apples comparison
The fast run was:
- [OpenHands/benchmarks#24810843867](https://github.com/OpenHands/benchmarks/actions/runs/24810843867)
- started at **2026-04-23T01:01:53Z**
- completed at **2026-04-23T05:31:41Z**
- head SHA: `72caec0` on branch `openhands/fix-swebench-disk-space-oom`

That branch differed from `main` in the parts that matter here:
- added a `Free disk space` workflow step before Docker setup
- added pre-assembly Buildx pruning
- added per-image cleanup during assembly (`docker rmi`, `docker system prune`, `docker builder prune`)

The slow eval-embedded run used `benchmarks-branch=main`, so it did **not** have those changes.
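The per-image cleanup listed above can be sketched as follows. This is an assumption-laden illustration, not the code on the `openhands/fix-swebench-disk-space-oom` branch: the function names are hypothetical, and the `--force` flags are my assumption about how the three Docker commands would be run non-interactively.

```python
import subprocess


def cleanup_commands(image_tag):
    # The three commands named in the branch description. The exact flags
    # used on the branch are not known; --force is assumed here so that
    # the prune commands do not prompt for confirmation.
    return [
        ["docker", "rmi", image_tag],
        ["docker", "system", "prune", "--force"],
        ["docker", "builder", "prune", "--force"],
    ]


def run_cleanup(image_tag, dry_run=True):
    # Returns the commands in the order they were (or would be) executed.
    executed = []
    for cmd in cleanup_commands(image_tag):
        if not dry_run:
            # check=False: a failed cleanup should not abort the whole build.
            subprocess.run(cmd, check=False)
        executed.append(cmd)
    return executed


for cmd in run_cleanup("example-registry/swebench-demo:latest"):
    print(" ".join(cmd))
```

Running cleanup after each pushed image trades some layer-cache reuse for bounded disk usage, which is the relevant trade-off for a cold 500-image build on a hosted runner.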

### 3. The available log data points to build/assembly time, not GHCR push time
From the archived logs of the fast standalone run:
- builder image started around **2026-04-23T01:07:48Z**
- base-image phase completed at **2026-04-23T03:05:41Z**
- assembly phase ran until about **2026-04-23T05:30:03Z**

Per-image assembly timings in that run were roughly:
- average total: **131.8s**
- average Docker build time: **106.9s**
- average push time: **5.8s**

So the available evidence does **not** support GHCR push bandwidth as the primary explanation.
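The arithmetic behind that conclusion can be checked with a short script. The averages and timestamps come from the figures above; the image count of 500 and the assumption that assembly began when the base-image phase ended are mine, so the implied parallelism is only a rough estimate.

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%SZ"

# Per-image averages quoted above (seconds).
avg_total_s = 131.8
avg_build_s = 106.9
avg_push_s = 5.8

push_share = avg_push_s / avg_total_s
build_share = avg_build_s / avg_total_s

# Assumption: assembly started when the base-image phase completed.
assembly_start = datetime.strptime("2026-04-23T03:05:41Z", FMT)
assembly_end = datetime.strptime("2026-04-23T05:30:03Z", FMT)
assembly_wall_s = (assembly_end - assembly_start).total_seconds()

# Assumption: the fast run assembled all 500 images.
ASSUMED_IMAGE_COUNT = 500
serial_s = ASSUMED_IMAGE_COUNT * avg_total_s
implied_parallelism = serial_s / assembly_wall_s

print(f"push share of per-image time:  {push_share:.1%}")
print(f"build share of per-image time: {build_share:.1%}")
print(f"assembly wall time:            {assembly_wall_s:.0f}s")
print(f"implied parallelism:           {implied_parallelism:.1f}")
```

Push accounts for roughly 4.4% of the average per-image time and Docker build for roughly 81%. Under the stated assumptions, the per-image totals only fit inside the assembly window if about 7 to 8 images were assembled concurrently.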

### 4. `workflow_call` vs `workflow_dispatch` was not the root cause
A later retry run through the evaluation workflow:
- [OpenHands/evaluation#24840336632](https://github.com/OpenHands/evaluation/actions/runs/24840336632)
- build job `Build SWE-Bench Images / build-and-push`
- started at **2026-04-23T14:15:39Z**
- completed at **2026-04-23T14:38:08Z**

That shows the reusable workflow path itself was not inherently slow. However, this retry benefited from cache hits and already-built artifacts, so it says nothing about cold-build behavior and does not disprove the cold-build disk-pressure theory.
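For reference, the wall-clock durations of the three runs can be derived from the start and end timestamps quoted in this issue. The script below is just that subtraction; the dictionary keys are labels chosen for this sketch.

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%SZ"

# (start, end) pairs taken from the timestamps quoted in this issue.
runs = {
    "slow eval-embedded build job": ("2026-04-23T00:30:24Z", "2026-04-23T06:35:24Z"),
    "fast standalone run": ("2026-04-23T01:01:53Z", "2026-04-23T05:31:41Z"),
    "retry build job": ("2026-04-23T14:15:39Z", "2026-04-23T14:38:08Z"),
}

durations = {
    name: datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    for name, (start, end) in runs.items()
}

for name, delta in durations.items():
    print(f"{name}: {delta}")
```

The retry finished in under 23 minutes, against roughly 4.5 hours for the fast cold run and more than 6 hours for the killed run, which is consistent with the retry mostly reusing existing images rather than rebuilding them.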

## Most Likely Root Cause

The best-supported explanation is:

1. the original failing run was a **cold full-500 build** on `main`
2. `main` did **not** yet include the disk-space mitigation branch that the faster standalone run was using
3. the job eventually hit GitHub’s 6-hour cap before the post-build artifact upload steps ran
4. because the build logs were only archived after the build step, the most useful artifacts were lost when the job was killed

## Impact

This failure mode is especially bad because it can look like a very poor evaluation result when the real problem is partial image availability:
- only part of the image set gets built
- the remaining instances fail downstream with `ImageNotFound`
- the final success/resolve numbers can understate the model’s actual performance on submitted instances
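The distortion described above can be illustrated with a small calculation. The record layout and status strings (`resolved`, `failed`, `image_not_found`) are hypothetical and do not reflect the evaluation harness's real output format; the numbers are invented for the example.

```python
def resolve_rates(results):
    # results: list of dicts with a hypothetical "status" field.
    total = len(results)
    resolved = sum(1 for r in results if r["status"] == "resolved")
    missing = sum(1 for r in results if r["status"] == "image_not_found")
    attempted = total - missing
    return {
        # Rate over every instance, counting missing images as failures.
        "raw": resolved / total if total else 0.0,
        # Rate over instances whose image actually existed.
        "attempted_only": resolved / attempted if attempted else 0.0,
        "missing_images": missing,
    }


# Invented example: half the images were never built.
example = (
    [{"status": "resolved"}] * 3
    + [{"status": "failed"}] * 2
    + [{"status": "image_not_found"}] * 5
)
rates = resolve_rates(example)
print(rates)
```

In this invented case the raw rate is 30% while the rate over instances that actually ran is 60%, so reporting the count of missing images alongside the headline number makes an infrastructure failure visible instead of letting it read as a weak model.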

## Follow-up Work

Completed / in progress:
- remove dead SWE-Bench build-time `agent-type` plumbing: [OpenHands/benchmarks#696](https://github.com/OpenHands/benchmarks/pull/696)
- remove matching evaluation caller plumbing: [OpenHands/evaluation#526](https://github.com/OpenHands/evaluation/pull/526)

Still worth doing separately:
- make build telemetry durable even when the build step is killed before `always()` artifact upload runs
- keep the disk-space mitigations on the production `main` path for cold full-500 builds
- run a controlled cold-build comparison on the same SDK commit after the disk-space fixes are in place
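One possible shape for the durable-telemetry item is to record each image's timing as soon as it finishes, rather than archiving everything after the build step. The sketch below only covers the local append-and-sync part; the function names and event fields are hypothetical, and a real fix would still need to ship the file off the runner periodically, since a local file does not survive the job being killed.

```python
import json
import os
import tempfile
import time


def append_build_event(path, event):
    # Append one JSON line per event and force it to disk immediately,
    # so a later kill of the process loses at most the in-flight event.
    record = dict(event, recorded_at=time.time())
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
        fh.flush()
        os.fsync(fh.fileno())


def read_build_events(path):
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Demo with invented events; field names are illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "build-events.jsonl")
    append_build_event(log_path, {"image": "demo-1", "build_s": 106.9, "push_s": 5.8})
    append_build_event(log_path, {"image": "demo-2", "build_s": 98.0, "push_s": 6.1})
    events = read_build_events(log_path)

print(len(events), "events recorded")
```

A line-per-event format keeps partial logs readable: whatever was written before the timeout can still be parsed and summarized.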

## Original Slow Run

| Field | Value |
|-------|-------|
| Eval workflow run | [OpenHands/evaluation#24809880634](https://github.com/OpenHands/evaluation/actions/runs/24809880634) |
| Build job | `Build SWE-Bench Images / build-and-push` |
| Started | 2026-04-23T00:30:24Z |
| Killed | 2026-04-23T06:35:24Z |
| Images built before timeout | about 121 / 500 |
| SDK commit | `ac3c350c` |
| agent_type input | `acp-claude` |
| benchmarks branch | `main` |
| Runner label | `ubuntu-latest-8core` |
| Build logs | not preserved; upload step never ran |

Related: OpenHands/evaluation#523
