Problem
The package ghcr.io/openhands/eval-agent-server currently holds 133,445 tags with no retention policy, accumulated over ~5 months. The tag scheme {sdk_sha}-…-{bench}-{repo}-source-minimal fans out on every SDK commit, so each new sha triggers a full rebuild+push of the whole benchmark dataset. This is both an ergonomic problem (GHCR UI and tag APIs are unusable at this scale) and a concrete pipeline problem (builds now hit ENOSPC on GH-hosted runners).
The package is public, so this is not a billing/quota issue — but it is actively hurting the build pipeline.
Evidence (as of 2026-04-21)
| Metric | Value |
|---|---|
| Total tags | 133,445 |
| SDK-sha image tags | 130,602 (97.9%) |
| buildcache-* tags | 2,793 |
| Legacy v1.0.0* tags | 30 |
| Distinct SDK shas producing tags | 548 |
| SDK shas with ≥ 1,000 tags each | 63 |
| Oldest SDK sha still present | db900ea (committed 2025-11-24) |
| Median per-tag compressed size | 1.38 GB (multi-arch index) |
| Buildcache median size | 2.03 GB |
Benchmark distribution (substring match):
swebench: 130,304 tags (97.6%)
commit0: 1,435
gaia: 192
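For anyone re-checking the distribution, here is a minimal sketch of the substring count. It assumes the tags have already been dumped one per line; `count_benchmarks` and `tags.txt` are hypothetical names, not part of any existing tooling.

```shell
# Count tags per benchmark by substring match.
# stdin: one tag per line. Prints "<bench> <count>" for each benchmark.
count_benchmarks() {
  awk '
    /swebench/ { s++ }
    /commit0/  { c++ }
    /gaia/     { g++ }
    END { printf "swebench %d\ncommit0 %d\ngaia %d\n", s, c, g }
  '
}
```

Usage would be something like `count_benchmarks < tags.txt`. A tag containing more than one of the substrings is counted once per match, same as a plain substring grep.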
Downstream effect
This directly causes the build hangs we've been seeing. Example: eval run OpenHands/evaluation Actions run 24706920746 (commit0 × Gemini 3.1 Pro) has been stuck in Build Commit0 Images / build-and-push for 4h45m+ because its SDK sha aabf4072… has no cached tags in the registry → BuildKit has to build + push the full image set → GH ubuntu-24.04 runner (~14 GB free) hits ENOSPC → BuildKit hangs instead of failing fast.
The prior failed run on the same branch (24682483119) has this in its manifest.jsonl artifact:
"built", "docker.io/wentingzhao/minitorch:v0", "error": "OSError(28, 'No space left on device')"
Reproduce the numbers
Anonymous (no GH token needed; package is public):
TOKEN=$(curl -sS 'https://ghcr.io/token?service=ghcr.io&scope=repository:openhands/eval-agent-server:pull' | jq -r .token)

# total tag count (paginate via Link header; n=1000 per page)
curl -sS -H "Authorization: Bearer $TOKEN" \
  'https://ghcr.io/v2/openhands/eval-agent-server/tags/list?n=1000'

# inspect a manifest / size
curl -sS -H "Authorization: Bearer $TOKEN" \
  -H 'Accept: application/vnd.oci.image.index.v1+json, application/vnd.docker.distribution.manifest.list.v2+json, application/vnd.oci.image.manifest.v1+json, application/vnd.docker.distribution.manifest.v2+json' \
  'https://ghcr.io/v2/openhands/eval-agent-server/manifests/<TAG>'
UI: https://github.com/orgs/OpenHands/packages/container/package/eval-agent-server
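The tag-list call above only returns the first page. A rough sketch of following the Link header to collect every tag is below. It assumes the registry returns the next page as `Link: </v2/...>; rel="next"` (the OCI distribution convention); `next_page_path` and `list_all_tags` are hypothetical helper names.

```shell
# Extract the next-page path from raw HTTP response headers on stdin.
# Prints nothing when there is no rel="next" link.
next_page_path() {
  tr -d '\r' | sed -n 's/^[Ll]ink: *<\([^>]*\)>; *rel="next".*/\1/p'
}

# Print every tag of a repository, one per line.
# $1: bearer token, $2: repository path (e.g. openhands/eval-agent-server)
list_all_tags() {
  lat_token=$1
  lat_path="/v2/$2/tags/list?n=1000"
  lat_hdrs=$(mktemp)
  while [ -n "$lat_path" ]; do
    # -D dumps the response headers to a file so the Link header can be read.
    curl -sS -D "$lat_hdrs" -H "Authorization: Bearer $lat_token" \
      "https://ghcr.io$lat_path" | jq -r '.tags[]?'
    lat_path=$(next_page_path < "$lat_hdrs")
  done
  rm -f "$lat_hdrs"
}
```

With the token from above, `list_all_tags "$TOKEN" openhands/eval-agent-server | wc -l` should give the total tag count, at roughly one request per 1,000 tags.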
Proposed fix
Two changes, in order of impact:
- Retention / GC workflow (execution fix). Nightly or weekly job that deletes versions where:
  - the SDK sha is NOT one of the last N (say 30) shas on software-agent-sdk main, AND
  - no metadata entry in gs://openhands-evaluation-results/metadata/*.jsonl with a non-terminal status references that sha.

  Dry-run first. On first real pass this will likely purge the majority of the 130k tags. Buildcache tags can be pruned more aggressively (keep last 2–3 per {bench}/{repo}).
- Split the image (structural fix). The tag scheme invalidates everything on every SDK commit. Bake a stable per-repo base ({ext_sha}-{bench}-{repo}) and produce a thin SDK-only layer on top. New SDK commits then push MBs, not GBs, which also eliminates the ENOSPC hang on GH runners as a side effect.
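To make the retention rule concrete, here is a dry-run sketch of the selection step only. It assumes the SDK sha is the part of the tag before the first hyphen (per the `{sdk_sha}-…` scheme) and that a file of protected shas has been built beforehand from the last N shas on main plus any shas referenced by non-terminal metadata entries. `select_stale_tags` and `keep_shas.txt` are hypothetical names, and the shas in the file must use the same abbreviation length as the tags.

```shell
# Print the SDK-sha image tags that are candidates for deletion (dry run only).
# $1: file with one protected SDK sha per line
# stdin: one tag per line
select_stale_tags() {
  # Refuse to run without a readable keep list; otherwise every tag would look stale.
  [ -r "$1" ] || { echo "keep list not readable: $1" >&2; return 1; }
  awk -v keep="$1" '
    BEGIN { while ((getline s < keep) > 0) if (s != "") protected[s] = 1 }
    NF == 0 { next }                         # skip blank lines
    /^buildcache-/ || /^v1\.0\.0/ { next }   # handled by separate rules
    {
      sha = $0
      sub(/-.*/, "", sha)                    # keep text before the first hyphen
      if (!(sha in protected)) print
    }
  '
}
```

Something like `select_stale_tags keep_shas.txt < tags.txt > to_delete.txt` would produce the candidate list for review. Nothing is deleted here; the actual deletion and the buildcache pruning rule are left out on purpose.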
Cross-ref: this investigation also touched the OpenHands/evaluation side (register_metadata.py, kill-eval-job.yml), so any retention workflow should be coordinated with the metadata jsonl schema defined there.
cc @juanmichelini @simonrosenberg