SWT-bench image build throughput regressed significantly. Late February runs (SDK cefaebf) built 364-382 images in 5h03m-5h29m (66-76 img/h). Mid-March runs built 314-409 images in 9h20m-9h57m (31-42 img/h). The root cause is a broken registry cache, caused by a Dockerfile ARG ordering mistake.
What happened
SDK PR #2130 (commit fd80128, Mar 3) added OPENHANDS_BUILD_GIT_SHA ARG before the apt-get install and npm install layers in base-image-minimal:
FROM ${BASE_IMAGE} AS base-image-minimal
+ ARG OPENHANDS_BUILD_GIT_SHA=unknown # changes every SDK bump
+ ENV OPENHANDS_BUILD_GIT_SHA=...
RUN apt-get install ... # cache key now includes SHA → miss
RUN npm install ... # also busted (depends on apt-get)
Why this matters
Benchmark images use GHCR registry cache (--cache-from type=registry). The cache tags are stable across SDK bumps (buildcache-{target}-{base_image_slug}, no SHA). Prior builds export layers with --cache-to type=registry,mode=max.
When the Dockerfile build graph matches, BuildKit reuses cached layers from GHCR — including the expensive apt-get and npm install. The ARG before these layers changes the ancestor chain hash, breaking layer matching even though the cache tag is found.
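The ancestor-chain effect can be illustrated with a toy model. This is only a sketch under simplifying assumptions: it is not BuildKit's actual cache key computation, and the names `layer_keys`, `reused_layers`, `ARG_FIRST` and `ARG_LAST` are hypothetical. Each layer key is a hash of its parent's key plus the resolved instruction text, so a value that changes early in the chain changes every key after it.

```python
import hashlib

# Toy model only: real BuildKit cache keys are more involved.
ARG_FIRST = [
    "ARG OPENHANDS_BUILD_GIT_SHA={sha}",
    "RUN apt-get install ...",
    "RUN npm install ...",
]
ARG_LAST = [
    "RUN apt-get install ...",
    "RUN npm install ...",
    "ARG OPENHANDS_BUILD_GIT_SHA={sha}",
]


def layer_keys(instructions, sha):
    """Return one key per instruction; each key chains on its parent's key."""
    keys = []
    parent = "base-image-digest"  # stand-in for the FROM image digest
    for instruction in instructions:
        resolved = instruction.format(sha=sha)
        parent = hashlib.sha256(f"{parent}\n{resolved}".encode()).hexdigest()
        keys.append(parent)
    return keys


def reused_layers(instructions, old_sha, new_sha):
    """Count leading layers whose keys survive an SDK bump."""
    count = 0
    for old, new in zip(layer_keys(instructions, old_sha),
                        layer_keys(instructions, new_sha)):
        if old != new:
            break
        count += 1
    return count


print(reused_layers(ARG_FIRST, "cefaebf", "fd80128"))  # 0: every layer misses
print(reused_layers(ARG_LAST, "cefaebf", "fd80128"))   # 2: apt-get and npm reused
```

In this model the cache tag plays no role at all: the lookup can find the tag and still miss every layer, which matches the behavior described above.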
Measured impact (from 414 build logs, run #23164396524)
apt-get install: 103s mean per image (rebuilt from scratch, 0/414 cached)
npm install: 38s mean per image
Combined: 141s/image of unnecessary rebuilds per SDK bump
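As a rough worked calculation, the per-image figures can be scaled to the whole run. The result is aggregate build time summed across all images, not wall-clock time; the wall-clock effect depends on worker concurrency, which is not stated for this run. The function name is hypothetical.

```python
APT_GET_SECONDS = 103  # mean per image, from the build logs above
NPM_SECONDS = 38       # mean per image, from the build logs above
IMAGES = 414           # number of build logs analyzed


def aggregate_rebuild_hours(per_image_seconds, images):
    """Total rebuild time summed over all images, in hours (not wall clock)."""
    return per_image_seconds * images / 3600


per_image = APT_GET_SECONDS + NPM_SECONDS
print(per_image)                                             # 141
print(round(aggregate_rebuild_hours(per_image, IMAGES), 1))  # 16.2
```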
Secondary factor
SDK PR #2465 (commit d129025, Mar 16) added npm install -g @zed-industries/claude-agent-acp @zed-industries/codex-acp to every benchmark image. This is a new ~38s/image step that didn't exist in the Feb baseline. Not a bug per se, but compounds the cache invalidation problem.
Fix
SDK PR #2522: Move the ARG after the expensive layers. Registry cache layer matching is restored.
Benchmarks PR #547: Set SWT-bench cache-mode default back to max (was changed to off in #541 because cache was broken — it works again with the ARG fix).
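How a cache-mode setting might translate into buildx flags can be sketched as follows. This is an assumption-laden illustration, not the benchmarks repo's actual code: the function names and the example repository are hypothetical, and it assumes that cache-mode=off still reads the registry cache but skips the export (an interpretation of the workflow notes in this issue). Only the tag scheme and the `--cache-from type=registry` / `--cache-to type=registry,mode=max` flags come from the text above.

```python
def cache_ref(repo, target, base_image_slug):
    """Stable cache tag: no SDK SHA, so it survives SDK bumps."""
    return f"{repo}:buildcache-{target}-{base_image_slug}"


def buildx_cache_flags(ref, cache_mode):
    """Return buildx cache flags for the two modes discussed in this issue."""
    if cache_mode not in ("max", "off"):
        raise ValueError(f"unsupported cache mode in this sketch: {cache_mode}")
    # Assumption: both modes import from the registry cache.
    flags = ["--cache-from", f"type=registry,ref={ref}"]
    if cache_mode == "max":
        # Export all intermediate layers so later builds can reuse them.
        flags += ["--cache-to", f"type=registry,ref={ref},mode=max"]
    return flags


# Hypothetical repository, target, and slug, for illustration only.
ref = cache_ref("ghcr.io/example-org/agent-server", "minimal", "django-slug")
print(" ".join(["docker", "buildx", "build", *buildx_cache_flags(ref, "max"), "."]))
```

Because the reference contains no SHA, the same cache location is consulted after every SDK bump; whether any layers actually match is then decided by the Dockerfile build graph.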
Validated
Small-scale A/B test (4 django images)
A/B test on CI (details in #544): per-image buildx p50 dropped from 322s to 154s (4 images only; see full-scale validation below).
Full-scale validation (433 images, run #23382357696)
Result: 9h04m, 47.8 img/h (vs 35.5 img/h pre-fix with cache-mode=max)
Cache behavior confirmed: cache_import_miss_count=1 for 432/433 images, cached_step_count 12-13 for 94% of images
Remaining gap: The cache fix restored layer caching, but throughput is still below the Feb baseline (66-76 img/h). Profiling attributed 42.2% of per-image wall-clock time to I/O overhead (image export, cache export, push), which is unrelated to the ARG fix. See comment 5 on #530 (SWT-bench image build throughput tracker, historical source of truth) for the full profiling breakdown.
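The throughput figures can be checked with a short calculation. The inputs are the numbers quoted in this section; the function name is hypothetical.

```python
def images_per_hour(images, hours, minutes):
    """Throughput for a run that built `images` in hours:minutes wall clock."""
    return images / (hours + minutes / 60)


post_fix = images_per_hour(433, 9, 4)   # full-scale validation run
pre_fix = 35.5                          # img/h, pre-fix with cache-mode=max
feb_low = 66                            # img/h, low end of the Feb baseline

gain_percent = (post_fix / pre_fix - 1) * 100
share_of_feb_low = post_fix / feb_low * 100

print(round(post_fix, 1))           # 47.8
print(round(gain_percent))          # 35 (percent faster than pre-fix)
print(round(share_of_feb_low))      # 72 (percent of the Feb low end)
```

So the fix recovers roughly a third more throughput than the pre-fix run, while still reaching only about 72% of the low end of the February baseline.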
Previous investigation
The original analysis in this issue correctly identified Layer 1 (SDK build path changes) as the dominant problem but attributed it to Python 3.12/boto3 changes. Those were already present in the fast Feb baseline (SDK cefaebf). The actual culprit was narrowed down in #544 to the ARG ordering in commit fd80128.
Layer 2 (registry cache export contention, PR #541) and Layer 3 (BuildKit instability at high concurrency) from the original analysis remain valid but secondary.
Further Optimizations
Beyond the ARG fix, several improvements could reduce build times further:
Make npm install ACP optional for benchmarks
The ACP servers (@zed-industries/claude-agent-acp, @zed-industries/codex-acp) add ~38s/image (measured). Benchmarks that don't use ACPAgent could skip this step via a build arg like INSTALL_ACP=false.
Reduce apt-get overhead
Pre-bake common packages into a shared base layer pushed to GHCR, so base-image-minimal inherits them instead of installing per image
Pin apt package versions to improve cache stability
Safe concurrency limits
max-workers at 4 for cold builds (16 workers caused 9-19 BuildKit resets in the A/B test in #531, "SWT-Bench image building slowness: root cause analysis and fix plan")
Cache seeding on Dockerfile changes
cache-mode=max build to pre-populate registry cache
Controlled cold-build workflow
Instead of cold-building all 433 images at max parallelism:
max-workers=2, cache-mode=max: populate registry cache without contention
max-workers=4, cache-mode=off: read seeded cache, don't write
Prevention
Related Issues