Investigate: SWE-bench full-500 image build >6h #695

@simonrosenberg

Summary

A full SWE-Bench build embedded in eval-job.yml timed out after 6 hours on April 23, 2026 (OpenHands/evaluation#24809880634). The original leading hypothesis, that agent-type=acp-claude caused the slowdown, was wrong.

The investigation showed that the slow run was a cold build on main without the disk-space cleanup changes, while the "fast standalone" comparison was running a different benchmarks branch that already contained aggressive disk reclamation changes.

Confirmed Findings

1. agent-type was not affecting SWE-Bench image builds

benchmarks/swebench/build_images.py parsed --agent-type but did not thread it into any build step. In other words, for SWE-Bench image building, agent-type=acp-claude vs default did not change image contents, build phases, or tag selection.

This dead plumbing has now been removed in:

Those PRs also remove the same dead build-time plumbing for SWT-Bench. This does not remove inference-time agent_type, which is still real for ACP runs.
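
To make "parsed but not threaded" concrete, here is a minimal sketch of that pattern. It is not the code from benchmarks/swebench/build_images.py: the function names, the --dataset flag, and the shape of the plan are hypothetical and exist only for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; only --agent-type is named in this issue.
    parser = argparse.ArgumentParser()
    parser.add_argument("--agent-type", default="default")
    parser.add_argument("--dataset", default="example-dataset")  # hypothetical flag
    return parser


def plan_build(args: argparse.Namespace) -> dict:
    # Dead plumbing: args.agent_type is parsed above but never read here,
    # so it cannot influence image contents, build phases, or tags.
    return {
        "dataset": args.dataset,
        "phases": ["base", "assembly"],
        "tag_suffix": "latest",
    }


acp_plan = plan_build(build_parser().parse_args(["--agent-type", "acp-claude"]))
default_plan = plan_build(build_parser().parse_args([]))
print(acp_plan == default_plan)
```

A check like this, comparing the plan produced for two values of a flag, is one cheap way to confirm that a flag is inert before blaming it for a slowdown.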

2. The “fast standalone” run was not an apples-to-apples comparison

The fast run was:

  • OpenHands/benchmarks#24810843867
  • started at 2026-04-23T01:01:53Z
  • completed at 2026-04-23T05:31:41Z
  • head SHA: 72caec0 on branch openhands/fix-swebench-disk-space-oom

That branch differed from main in the parts that matter here:

  • added a Free disk space workflow step before Docker setup
  • added pre-assembly Buildx pruning
  • added per-image cleanup during assembly (docker rmi, docker system prune, docker builder prune)

The slow eval-embedded run used benchmarks-branch=main, so it did not have those changes.
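
As a rough sketch of what the per-image cleanup could look like when driven from Python: the three Docker subcommands are the ones listed above, but the --force flags, the function names, and the dry-run default are assumptions, not the branch's actual implementation.

```python
import subprocess
from typing import List


def cleanup_commands(image_tag: str) -> List[List[str]]:
    """Commands to reclaim disk after one image has been assembled and pushed."""
    return [
        ["docker", "rmi", "--force", image_tag],      # drop the local copy of the image
        ["docker", "system", "prune", "--force"],     # remove stopped containers, dangling images
        ["docker", "builder", "prune", "--force"],    # remove dangling BuildKit cache
    ]


def run_cleanup(image_tag: str, dry_run: bool = True) -> List[str]:
    """Run (or, by default, only describe) the cleanup for one image."""
    described = []
    for cmd in cleanup_commands(image_tag):
        described.append(" ".join(cmd))
        if not dry_run:
            # check=False: a failed prune should not abort the whole build.
            subprocess.run(cmd, check=False)
    return described


for line in run_cleanup("example/image:tag"):  # hypothetical tag
    print("would run:", line)
```

Pruning after every image trades cache reuse for disk headroom, which is the trade a cold full-500 build on a disk-constrained runner needs to make.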

3. The available log data points to build/assembly time, not GHCR push time

From the archived logs of the fast standalone run:

  • builder image started around 2026-04-23T01:07:48Z
  • base-image phase completed at 2026-04-23T03:05:41Z
  • assembly phase ran until about 2026-04-23T05:30:03Z

Per-image assembly timings in that run were roughly:

  • average total: 131.8s
  • average Docker build time: 106.9s
  • average push time: 5.8s

So the available evidence does not support GHCR push bandwidth as the primary explanation.
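
The arithmetic behind that conclusion can be checked directly from the numbers above. The timestamps are approximate ("around", "about"), so the phase durations below are estimates; the variable names are ours.

```python
from datetime import datetime, timezone


def parse_utc(ts: str) -> datetime:
    """Parse an ISO-8601 'Z' timestamp as an aware UTC datetime."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


builder_start = parse_utc("2026-04-23T01:07:48Z")
base_done = parse_utc("2026-04-23T03:05:41Z")
assembly_done = parse_utc("2026-04-23T05:30:03Z")

base_phase = base_done - builder_start        # builder + base-image phase
assembly_phase = assembly_done - base_done    # assembly phase

# Per-image averages reported for the assembly phase, in seconds.
avg_total_s = 131.8
avg_build_s = 106.9
avg_push_s = 5.8

build_share = avg_build_s / avg_total_s
push_share = avg_push_s / avg_total_s

print(f"base phase:     {base_phase}")
print(f"assembly phase: {assembly_phase}")
print(f"build share of per-image time: {build_share:.1%}")
print(f"push share of per-image time:  {push_share:.1%}")
```

Push accounts for roughly 4% of the average per-image time and Docker build for roughly 81%, so even eliminating push time entirely would barely move the total.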

4. workflow_call vs workflow_dispatch was not the root cause

A later retry run through the evaluation workflow:

That shows the reusable workflow path itself was not inherently slow. However, this retry benefited from cache hits and already-built artifacts, so it does not disprove the cold-build disk-pressure theory.

Most Likely Root Cause

The best-supported explanation is:

  1. the original failing run was a cold full-500 build on main
  2. main did not yet include the disk-space mitigation branch that the faster standalone run was using
  3. the job eventually hit GitHub’s 6-hour cap before the post-build artifact upload steps ran
  4. because the build logs were only archived after the build step, the most useful artifacts were lost when the job was killed

Impact

This failure mode is especially bad because it can look like a very poor evaluation result when the real problem is partial image availability:

  • only part of the image set gets built
  • the remaining instances fail downstream with ImageNotFound
  • the final success/resolve numbers can understate the model’s actual performance on submitted instances
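
A small worked example of the distortion, using the roughly 121 / 500 images built in the slow run. The resolved count is purely hypothetical, chosen only to show the size of the gap; it is not a measured result.

```python
total_instances = 500
images_built = 121        # "about 121 / 500" from the original slow run
resolved = 60             # HYPOTHETICAL count, for illustration only

# Instances with no image fail downstream with ImageNotFound.
missing_images = total_instances - images_built

# Rate as it would appear if every instance counts in the denominator.
reported_rate = resolved / total_instances

# Rate over the instances that could actually run.
rate_on_runnable = resolved / images_built

print(f"instances failing with ImageNotFound: {missing_images}")
print(f"reported resolve rate:    {reported_rate:.1%}")
print(f"rate on runnable subset:  {rate_on_runnable:.1%}")
```

With these illustrative numbers the headline figure would be about a quarter of the rate on the instances that actually ran, which is why partial image availability should be surfaced separately from model performance.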

Follow-up Work

Completed / in progress:

Still worth doing separately:

  • make build telemetry durable even when the build step is killed before always() artifact upload runs
  • keep the disk-space mitigations on the production main path for cold full-500 builds
  • run a controlled cold-build comparison on the same SDK commit after the disk-space fixes are in place

Original Slow Run

Field                         Value
Eval workflow run             OpenHands/evaluation#24809880634
Build job                     Build SWE-Bench Images / build-and-push
Started                       2026-04-23T00:30:24Z
Killed                        2026-04-23T06:35:24Z
Images built before timeout   about 121 / 500
SDK commit                    ac3c350c
agent_type input              acp-claude
benchmarks branch             main
Runner label                  ubuntu-latest-8core
Build logs                    not preserved; upload step never ran

Related: OpenHands/evaluation#523
