Track structured timing for phased image assembly cleanup #692

@simonrosenberg

Problem

The phased SWE/SWT-bench image assembly logs include human-readable timing in lines like:

[assembly] OK ... total=109.8s build=87.2s push=6.2s

but the emitted manifest.jsonl records do not populate the structured timing fields for assembly. In the full SWE-bench validation run below, timing fields such as duration_seconds and build_seconds were null, so analyzing performance required scraping the raw GitHub Actions log.
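For reference, the log scraping this currently requires looks roughly like the sketch below. It assumes the `[assembly] OK ... total=… build=… push=…` line format shown above; the function name `parse_assembly_timing` is hypothetical and not part of the benchmarks repo.

```python
import re

# Matches the key=value timing pairs from the "[assembly] OK" log line shown above.
_TIMING_RE = re.compile(r"\b(total|build|push)=(\d+(?:\.\d+)?)s\b")


def parse_assembly_timing(line):
    """Return {"total": ..., "build": ..., "push": ...} in seconds, or None.

    Hypothetical helper: only lines containing "[assembly] OK" are considered.
    """
    if "[assembly] OK" not in line:
        return None
    return {key: float(value) for key, value in _TIMING_RE.findall(line)}


timing = parse_assembly_timing("[assembly] OK ... total=109.8s build=87.2s push=6.2s")
print(timing)  # {'total': 109.8, 'build': 87.2, 'push': 6.2}
```

This only recovers what the log line happens to print, which is why cleanup timings are not available this way.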

Validation run: https://github.com/OpenHands/benchmarks/actions/runs/24810843867

That run succeeded (500/500 base images and 500/500 assembled images), but performance/debug analysis was more manual than it should be:

  • Average assembly total: ~131.8s/image
  • Average docker build: ~106.9s/image
  • Average push: ~5.8s/image
  • Implied cleanup/other post-build gap: ~19.1s/image
  • docker system prune produced 170 warning-only rc=1 failures, likely prune-lock contention
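The implied gap above is simply what is left of the average total once build and push are subtracted. A minimal sketch of that arithmetic (the helper name `implied_gap` is illustrative only):

```python
def implied_gap(total_s, build_s, push_s):
    """Seconds per image not accounted for by docker build or push."""
    return round(total_s - build_s - push_s, 1)


# Averages taken from the validation run listed above.
print(implied_gap(131.8, 106.9, 5.8))  # 19.1
```

Because this is a residual, it cannot say how the time splits between rmi, system prune, and builder prune.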

Proposal

Populate structured timing fields in assembly manifest records, including at least:

  • total assembly duration
  • docker build duration
  • docker push duration
  • docker rmi duration
  • docker system prune duration
  • docker builder prune duration
  • cleanup status / return codes for rmi, system prune, and builder prune
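One possible shape for this, as a rough sketch rather than the actual implementation: wrap each subprocess in a timer and flatten the results into the manifest record. Only `duration_seconds` and `build_seconds` are field names mentioned in this issue; `StepResult`, `run_timed`, `assembly_record`, and the other `*_seconds` / `*_returncode` keys are assumptions made for illustration.

```python
import json
import subprocess
import sys
import time
from dataclasses import dataclass


@dataclass
class StepResult:
    seconds: float
    returncode: int


def run_timed(cmd):
    """Run a command, returning wall-clock duration and return code.

    Non-zero return codes are recorded, not raised, so warning-only
    cleanup failures (e.g. rc=1 from a prune) still end up in the manifest.
    """
    start = time.monotonic()
    proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return StepResult(round(time.monotonic() - start, 3), proc.returncode)


def assembly_record(image, total_seconds, steps):
    """Flatten per-step results into one JSON-serializable manifest record."""
    record = {"image": image, "duration_seconds": total_seconds}
    for name, result in steps.items():
        record[f"{name}_seconds"] = result.seconds
        record[f"{name}_returncode"] = result.returncode
    return record


if __name__ == "__main__":
    # Stand-in command so the sketch runs without Docker; a real caller
    # would pass something like ["docker", "system", "prune", "-f"].
    started = time.monotonic()
    steps = {"system_prune": run_timed([sys.executable, "-c", "raise SystemExit(1)"])}
    total = round(time.monotonic() - started, 3)
    print(json.dumps(assembly_record("example-image", total, steps)))
```

Measuring the total separately from the individual steps keeps any unaccounted gap visible instead of hiding it in a sum.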

This would let future tuning rely on recorded data instead of guesswork, and avoid scraping CI logs for basic timing and cleanup behavior.

Acceptance criteria

  • manifest.jsonl for assemble_all_agent_images includes non-null timing fields for each assembled image.
  • Cleanup subprocess outcomes are represented in a structured way.
  • Existing summary tooling either preserves these fields or can surface aggregate build/push/cleanup timing.
  • Tests cover the new timing fields for a successful assembly.
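The first and third criteria could be checked with something like the sketch below, which scans manifest.jsonl content for null timing fields and computes per-field averages. The `REQUIRED` field list mixes names from this issue (`duration_seconds`, `build_seconds`) with an assumed `push_seconds`; the helper names are hypothetical as well.

```python
import json
from statistics import mean

# Assumed field names; adjust to whatever the manifest schema ends up using.
REQUIRED = ("duration_seconds", "build_seconds", "push_seconds")


def _records(jsonl_text):
    """Yield (line_number, record) for each non-blank JSONL line."""
    for number, line in enumerate(jsonl_text.splitlines(), start=1):
        if line.strip():
            yield number, json.loads(line)


def null_timing_fields(jsonl_text, required=REQUIRED):
    """Return (line_number, field) pairs where a field is missing or null."""
    return [
        (number, field)
        for number, record in _records(jsonl_text)
        for field in required
        if record.get(field) is None
    ]


def mean_timings(jsonl_text, required=REQUIRED):
    """Average each timing field over the records that populate it."""
    result = {}
    for field in required:
        values = [r[field] for _, r in _records(jsonl_text) if r.get(field) is not None]
        if values:
            result[field] = mean(values)
    return result
```

A test for a successful assembly could then assert that `null_timing_fields` returns an empty list for the emitted manifest.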

Context

This came out of the SWE-bench disk/OOM follow-up to the SWT-bench cleanup work in PR #672 and SWE-bench validation PR #690.
