GHCR `eval-agent-server` has 133k+ tags with no retention; tag scheme fans out per SDK sha

## Problem

The package `ghcr.io/openhands/eval-agent-server` currently holds **133,445 tags** with no retention policy, accumulated over ~5 months. The tag scheme `{sdk_sha}-…-{bench}-{repo}-source-minimal` fans out on every SDK commit, so each new sha triggers a full rebuild and push of the image set for the whole benchmark dataset. This is both a usability problem (the GHCR UI and tag APIs are unusable at this scale) and a concrete pipeline problem (builds now hit ENOSPC on GH-hosted runners).

The package is public, so this is **not** a billing/quota issue — but it is actively hurting the build pipeline.

## Evidence (as of 2026-04-21)

| Metric | Value |
|---|---|
| Total tags | 133,445 |
| SDK-sha image tags | 130,602 (97.9%) |
| `buildcache-*` tags | 2,793 |
| Legacy `v1.0.0*` tags | 30 |
| Distinct SDK shas producing tags | 548 |
| SDK shas with ≥ 1,000 tags each | 63 |
| Oldest SDK sha still present | `db900ea` — committed 2025-11-24 |
| Median per-tag compressed size | 1.38 GB (multi-arch index) |
| Buildcache median size | 2.03 GB |

Benchmark distribution (substring match):
- `swebench`: 130,304 tags (97.6%)
- `commit0`: 1,435
- `gaia`: 192

## Downstream effect

This directly causes the build hangs we've been seeing. Example: Actions run [24706920746](https://github.com/OpenHands/evaluation/actions/runs/24706920746) in `OpenHands/evaluation` (commit0 × Gemini 3.1 Pro) has been stuck in `Build Commit0 Images / build-and-push` for 4h45m+. Its SDK sha `aabf4072…` has no cached tags in the registry, so BuildKit has to build and push the full image set. The GH `ubuntu-24.04` runner (~14 GB free) then hits ENOSPC, and BuildKit hangs instead of failing fast.

The prior failed run on the same branch (`24682483119`) has this in its `manifest.jsonl` artifact:

```
"built", "docker.io/wentingzhao/minitorch:v0", "error": "OSError(28, 'No space left on device')"
```

## Reproduce the numbers

Anonymous (no GH token needed; package is public):

```bash
TOKEN=$(curl -sS 'https://ghcr.io/token?service=ghcr.io&scope=repository:openhands/eval-agent-server:pull' | jq -r .token)

# list tags, 1000 per page; for the total count, follow the Link header through every page
curl -sS -H "Authorization: Bearer $TOKEN" \
  'https://ghcr.io/v2/openhands/eval-agent-server/tags/list?n=1000'

# inspect a manifest / size
curl -sS -H "Authorization: Bearer $TOKEN" \
  -H 'Accept: application/vnd.oci.image.index.v1+json, application/vnd.docker.distribution.manifest.list.v2+json, application/vnd.oci.image.manifest.v1+json, application/vnd.docker.distribution.manifest.v2+json' \
  'https://ghcr.io/v2/openhands/eval-agent-server/manifests/<TAG>'
```
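The pagination mentioned in the comment above can be sketched as a few shell functions. This is a minimal sketch, not the script that produced the Evidence table: the function names (`next_link`, `list_all_tags`, `summarize_tags`) are hypothetical, the `Link: <path>; rel="next"` header shape is assumed from the usual registry convention, and treating any tag that starts with 7 hex characters as an SDK-sha tag is an assumption about the tag scheme.

```bash
REGISTRY='https://ghcr.io'
REPO='openhands/eval-agent-server'

# Print the rel="next" path found in a file of raw response headers.
# Prints nothing when there is no further page.
next_link() {
  sed -n 's/^[Ll]ink: *<\([^>]*\)>; *rel="next".*/\1/p' "$1" | tr -d '\r'
}

# Walk every page of tags/list and print one tag per line.
# $1: bearer token obtained as shown above
list_all_tags() {
  local token=$1 path="/v2/${REPO}/tags/list?n=1000" hdrs body
  hdrs=$(mktemp); body=$(mktemp)
  while [ -n "$path" ]; do
    curl -sS -D "$hdrs" -o "$body" \
      -H "Authorization: Bearer $token" "${REGISTRY}${path}"
    jq -r '.tags[]?' "$body"
    path=$(next_link "$hdrs")   # empty on the last page, which ends the loop
  done
  rm -f "$hdrs" "$body"
}

# Bucket tags by prefix, roughly mirroring the rows of the Evidence table.
# $1: file with one tag per line
summarize_tags() {
  awk '
    BEGIN { hex7 = "^[0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f]" }
    /^buildcache-/ { cache++; next }
    /^v1\.0\.0/    { legacy++; next }
    $0 ~ hex7      { sdk++; next }    # assumption: SDK-sha tags start with >= 7 hex chars
    { other++ }
    END { printf "sdk=%d buildcache=%d legacy=%d other=%d total=%d\n", sdk, cache, legacy, other, NR }
  ' "$1"
}
```

With these defined, `list_all_tags "$TOKEN" > tags.txt` followed by `summarize_tags tags.txt` gives the top-level split, and `grep -c swebench tags.txt` gives a substring-match count per benchmark.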

UI: https://github.com/orgs/OpenHands/packages/container/package/eval-agent-server

## Proposed fix

Two changes, in order of impact:

1. **Retention / GC workflow** (execution fix). Nightly or weekly job that deletes versions where:
   - the SDK sha is NOT one of the last N (say 30) shas on `software-agent-sdk` main, AND
   - no metadata entry in `gs://openhands-evaluation-results/metadata/*.jsonl` with a non-terminal status references that sha.

   Dry-run first. On first real pass this will likely purge the majority of the 130k tags. Buildcache tags can be pruned more aggressively (keep last 2–3 per `{bench}/{repo}`).

2. **Split the image** (structural fix). The tag scheme invalidates everything on every SDK commit. Bake a stable per-repo base (`{ext_sha}-{bench}-{repo}`) and produce a thin SDK-only layer on top. New SDK commits then push MBs, not GBs, which also eliminates the ENOSPC hang on GH runners as a side effect.
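
The selection rule in item 1 can be sketched as a dry-run filter. This is a rough sketch under stated assumptions, not a finished workflow: `select_stale_tags` is a hypothetical name, the SDK sha is assumed to be the text before the first `-` in a tag, shas are compared on their first 7 characters, and the keep file is assumed to already contain both the last N shas on main and any shas referenced by non-terminal metadata entries (the metadata schema is not reproduced here).

```bash
# Print SDK-sha tags whose sha is NOT in the keep list. Prints only; deletes nothing.
# $1: file of tags, one per line
# $2: file of SDK shas to keep, one per line (short or full shas)
select_stale_tags() {
  awk '
    # First file (keep list): index shas by their 7-char prefix.
    # FILENAME is compared instead of NR == FNR so an empty keep file is handled.
    FILENAME == ARGV[1] { keep[substr($1, 1, 7)] = 1; next }

    # buildcache and legacy tags follow their own pruning rules.
    /^buildcache-/ || /^v1\.0\.0/ { next }

    {
      sha = $0
      sub(/-.*/, "", sha)             # assumption: sha is everything before the first "-"
      if (sha ~ /^[0-9a-f]+$/ && length(sha) >= 7 && !(substr(sha, 1, 7) in keep))
        print                         # candidate for deletion
    }
  ' "$2" "$1"
}
```

One way to seed the keep file is `git log -n 30 --format=%H origin/main` in a `software-agent-sdk` checkout, with the metadata-referenced shas appended. The output is only a candidate list; GitHub's packages REST API deletes by package version id rather than by tag, so a real job would still need to map these tags to versions before deleting anything.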

Cross-ref: this investigation also touched the `OpenHands/evaluation` side (`register_metadata.py`, `kill-eval-job.yml`), so any retention workflow will need to coordinate with the metadata jsonl schema defined there.

cc @juanmichelini @simonrosenberg 
