GHCR eval-agent-server has 133k+ tags with no retention; tag scheme fans out per SDK sha #684

@VascoSch92

Problem

The package ghcr.io/openhands/eval-agent-server currently holds 133,445 tags with no retention policy, accumulated over ~5 months. The tag scheme {sdk_sha}-…-{bench}-{repo}-source-minimal fans out on every SDK commit, so each new sha triggers a full rebuild+push of the whole benchmark dataset. This is both an ergonomic problem (GHCR UI and tag APIs are unusable at this scale) and a concrete pipeline problem (builds now hit ENOSPC on GH-hosted runners).

The package is public, so this is not a billing/quota issue, but it is actively hurting the build pipeline.

Evidence (as of 2026-04-21)

| Metric | Value |
| --- | --- |
| Total tags | 133,445 |
| SDK-sha image tags | 130,602 (97.9%) |
| buildcache-* tags | 2,793 |
| Legacy v1.0.0* tags | 30 |
| Distinct SDK shas producing tags | 548 |
| SDK shas with ≥ 1,000 tags each | 63 |
| Oldest SDK sha still present | db900ea (committed 2025-11-24) |
| Median per-tag compressed size | 1.38 GB (multi-arch index) |
| Buildcache median size | 2.03 GB |

Benchmark distribution (substring match):

  • swebench: 130,304 tags (97.6%)
  • commit0: 1,435
  • gaia: 192

Downstream effect

The per-sha tag scheme directly causes the build hangs we've been seeing. Example: eval run OpenHands/evaluation Actions run 24706920746 (commit0 × Gemini 3.1 Pro) has been stuck in Build Commit0 Images / build-and-push for 4h45m+. Its SDK sha aabf4072… has no cached tags in the registry, so BuildKit has to build and push the full image set; the GH ubuntu-24.04 runner (~14 GB free) then hits ENOSPC, and BuildKit hangs instead of failing fast.

The prior failed run on the same branch (24682483119) has this in its manifest.jsonl artifact:

"built", "docker.io/wentingzhao/minitorch:v0", "error": "OSError(28, 'No space left on device')"

Reproduce the numbers

Anonymous (no GH token needed; package is public):

TOKEN=$(curl -sS 'https://ghcr.io/token?service=ghcr.io&scope=repository:openhands/eval-agent-server:pull' | jq -r .token)

# total tag count (paginate via Link header; n=1000 per page)
curl -sS -H "Authorization: Bearer $TOKEN" \
  'https://ghcr.io/v2/openhands/eval-agent-server/tags/list?n=1000'

# inspect a manifest / size
curl -sS -H "Authorization: Bearer $TOKEN" \
  -H 'Accept: application/vnd.oci.image.index.v1+json, application/vnd.docker.distribution.manifest.list.v2+json, application/vnd.oci.image.manifest.v1+json, application/vnd.docker.distribution.manifest.v2+json' \
  'https://ghcr.io/v2/openhands/eval-agent-server/manifests/<TAG>'
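The tag-list request above returns only the first page, so reproducing the total means following the Link header until it runs out. A minimal Python sketch of that loop, assuming the registry returns a relative `rel="next"` URL in the Link header, as the OCI distribution spec describes; the helper names (`next_page_url`, `http_get`, `list_all_tags`) are illustrative and not part of any existing tooling:

```python
import json
import re
import urllib.request
from urllib.parse import urljoin

REGISTRY = "https://ghcr.io"
REPO = "openhands/eval-agent-server"


def next_page_url(link_header, base=REGISTRY):
    """Return the absolute URL of the rel="next" page, or None if there is none."""
    if not link_header:
        return None
    match = re.search(r'<([^>]+)>\s*;\s*rel="?next"?', link_header)
    return urljoin(base, match.group(1)) if match else None


def http_get(url, token):
    """Fetch one page; returns (parsed JSON body, Link header or None)."""
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp), resp.headers.get("Link")


def list_all_tags(token, fetch=http_get, page_size=1000):
    """Collect every tag by following Link headers until no next page remains."""
    url = f"{REGISTRY}/v2/{REPO}/tags/list?n={page_size}"
    tags = []
    while url:
        body, link = fetch(url, token)
        tags.extend(body.get("tags") or [])  # "tags" may be null on an empty page
        url = next_page_url(link)
    return tags
```

With `TOKEN` obtained as in the curl command above, `len(list_all_tags(TOKEN))` should give the total tag count. The `fetch` parameter exists only so the loop can be exercised without network access.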

UI: https://github.com/orgs/OpenHands/packages/container/package/eval-agent-server

Proposed fix

Two changes, in order of impact:

  1. Retention / GC workflow (execution fix). Nightly or weekly job that deletes versions where:

    • the SDK sha is NOT one of the last N (say 30) shas on software-agent-sdk main, AND
    • no metadata entry in gs://openhands-evaluation-results/metadata/*.jsonl with a non-terminal status references that sha.

    Dry-run first. On first real pass this will likely purge the majority of the 130k tags. Buildcache tags can be pruned more aggressively (keep last 2–3 per {bench}/{repo}).

  2. Split the image (structural fix). The tag scheme invalidates everything on every SDK commit. Bake a stable per-repo base ({ext_sha}-{bench}-{repo}) and produce a thin SDK-only layer on top. New SDK commits then push MBs, not GBs, which also eliminates the ENOSPC hang on GH runners as a side effect.
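The selection rule in item 1 could be sketched as below, in dry-run form. This is only an illustration under stated assumptions: the SDK sha is taken to be everything before the first `-` in a tag, `recent_shas` and `active_shas` are assumed to be supplied in the same abbreviated form the tags use, and `buildcache-*` and legacy `v1.0.0*` tags are skipped because they would follow a separate policy. The function names are hypothetical.

```python
def sdk_sha_of(tag):
    """Assumption: the SDK sha is the text before the first '-' in an image tag."""
    return tag.split("-", 1)[0]


def select_for_deletion(tags, recent_shas, active_shas):
    """Return the tags whose SDK sha is neither recent nor referenced by an active run.

    recent_shas: the last N shas on software-agent-sdk main.
    active_shas: shas referenced by metadata entries with a non-terminal status.
    """
    keep = set(recent_shas) | set(active_shas)
    doomed = []
    for tag in tags:
        # buildcache and legacy tags are out of scope for this rule (assumption).
        if tag.startswith("buildcache-") or tag.startswith("v1.0.0"):
            continue
        if sdk_sha_of(tag) not in keep:
            doomed.append(tag)
    return doomed
```

A dry run would print or log the returned list and delete nothing; the actual deletion step and the lookup of recent and active shas are deliberately left out.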

Cross-ref: this investigation also touched the OpenHands/evaluation side (register_metadata.py, kill-eval-job.yml), so any retention workflow will need to coordinate with the metadata jsonl schema defined there.

cc @juanmichelini @simonrosenberg
