Problem
The phased SWE/SWT-bench image assembly logs include human-readable timing in lines like:
[assembly] OK ... total=109.8s build=87.2s push=6.2s
but the emitted manifest.jsonl records do not populate structured timing fields for assembly. In the full SWE-bench validation run below, fields such as duration_seconds, build_seconds, and related timing fields were null, so analyzing performance required scraping the raw GitHub Actions log.
Validation run: https://github.com/OpenHands/benchmarks/actions/runs/24810843867
That run succeeded (500/500 base images and 500/500 assembled images), but performance/debug analysis was more manual than it should be:
- Average assembly total: ~131.8s/image
- Average docker build: ~106.9s/image
- Average push: ~5.8s/image
- Implied cleanup/other post-build gap: ~19.1s/image
- docker system prune produced 170 warning-only rc=1 failures, likely prune-lock contention
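For reference, the log scraping this currently requires looks roughly like the sketch below. It assumes the timing appears as `key=<number>s` pairs on lines containing `[assembly] OK`, as in the example line above. The helper names `parse_assembly_timing` and `implied_gap` are hypothetical and not part of the repository.

```python
import re

# Matches key=value pairs such as "total=109.8s"; the trailing "s" is the unit.
_TIMING = re.compile(r"(\w+)=(\d+(?:\.\d+)?)s\b")


def parse_assembly_timing(line: str) -> dict:
    """Return {name: seconds} for an "[assembly] OK" log line, or {} otherwise."""
    # Substring match, since CI logs may prefix each line with a timestamp.
    if "[assembly] OK" not in line:
        return {}
    return {key: float(value) for key, value in _TIMING.findall(line)}


def implied_gap(timing: dict) -> float:
    """Seconds not accounted for by build and push (cleanup and other work)."""
    return round(timing["total"] - timing["build"] - timing["push"], 1)


line = "[assembly] OK ... total=109.8s build=87.2s push=6.2s"
timing = parse_assembly_timing(line)
gap = implied_gap(timing)
```

Applying the same subtraction to the run averages (131.8 - 106.9 - 5.8) gives the ~19.1s/image gap listed above.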
Proposal
Populate structured timing fields in assembly manifest records, including at least:
- total assembly duration
- docker build duration
- docker push duration
- docker rmi duration
- docker system prune duration
- docker builder prune duration
- cleanup status / return codes for rmi, system prune, and builder prune
This would ground future tuning in measured data rather than guesswork and remove the need to scrape CI logs for basic timing and cleanup behavior.
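One possible shape for this, as a minimal sketch rather than the actual implementation: wrap each subprocess step in a timer and flatten the results into the manifest record. Only `duration_seconds` and `build_seconds` are field names mentioned in this issue. `StepResult`, `run_timed`, `assembly_record`, the `image` key, and the `<step>_returncode` naming are all assumptions made for illustration.

```python
import json
import subprocess
import sys
import time
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class StepResult:
    seconds: float
    returncode: int


def run_timed(cmd: Sequence[str]) -> StepResult:
    """Run a command without raising on failure; record wall time and rc."""
    start = time.monotonic()
    proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
    elapsed = round(time.monotonic() - start, 3)
    return StepResult(seconds=elapsed, returncode=proc.returncode)


def assembly_record(image: str, total_seconds: float,
                    steps: Dict[str, StepResult]) -> dict:
    """Flatten per-step results into one JSON-serializable manifest record."""
    record = {"image": image, "duration_seconds": total_seconds}
    for name, result in steps.items():
        record[f"{name}_seconds"] = result.seconds
        record[f"{name}_returncode"] = result.returncode
    return record


# Stand-in command that exits with rc=1, mimicking a warning-only prune failure.
failing = run_timed([sys.executable, "-c", "import sys; sys.exit(1)"])
record = assembly_record("example-image", 1.0, {"system_prune": failing})
manifest_line = json.dumps(record)
```

Keeping the return code next to the duration lets a warning-only failure, such as the rc=1 prune results above, show up in the manifest without failing the assembly.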
Acceptance criteria
- manifest.jsonl for assemble_all_agent_images includes non-null timing fields for each assembled image.
- Cleanup subprocess outcomes are represented in a structured way.
- Existing summary tooling either preserves these fields or can surface aggregate build/push/cleanup timing.
- Tests cover the new timing fields for a successful assembly.
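As a rough illustration of the aggregate and non-null checks, the sketch below reads manifest lines and reports a mean and a null count for each field. The `summarize` function and the shape of its output are hypothetical; the existing summary tooling may be structured differently.

```python
import json
from statistics import mean
from typing import Dict, Iterable, List


def summarize(lines: Iterable[str], fields: List[str]) -> Dict[str, dict]:
    """Compute the mean and the count of null/missing values for each field."""
    values: Dict[str, List[float]] = {f: [] for f in fields}
    missing: Dict[str, int] = {f: 0 for f in fields}
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines in the JSONL file
        record = json.loads(raw)
        for f in fields:
            value = record.get(f)
            if value is None:
                missing[f] += 1
            else:
                values[f].append(float(value))
    return {
        f: {"mean": mean(values[f]) if values[f] else None, "missing": missing[f]}
        for f in fields
    }


sample = [
    '{"duration_seconds": 100.0, "build_seconds": 80.0}',
    '{"duration_seconds": 120.0, "build_seconds": null}',
]
summary = summarize(sample, ["duration_seconds", "build_seconds"])
```

A test for a successful assembly could then assert that `missing` is 0 for every timing field.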
Context
This came out of the SWE-bench disk/OOM follow-up to the SWT-bench cleanup work in PR #672 and SWE-bench validation PR #690.