Why the previous fixes were only partial
#486 addressed fork/spawn behavior, but not the cache-footprint regression.
#487 addressed SDK logger initialization behavior, but not the cache-footprint regression.
#499 addressed repeated uv build --sdist, but the evidence suggests this removes only a secondary cost.
#503 is the correct short-term operational rollback, but it does not isolate the root cause.
Plan
Phase 0: unblock builds
Merge #503 to restore known-good behavior while the regression is investigated.
Phase 1: add lightweight instrumentation
Add temporary env-gated timing and disk-usage logging around the hot path so we stop inferring from noisy historical runs alone:
SDK _make_build_context() duration
SDK docker buildx build duration
prune duration and pre/post docker buildx du in the benchmarks batch loop
This should go into the exact historical branches / SHAs under test so the numbers are comparable.
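As a rough illustration of the env-gated instrumentation described above, here is a minimal sketch. The `BUILD_PROFILE` variable, the `timed` helper, and `buildx_disk_usage` are hypothetical names, not existing benchmarks or SDK code; only the `docker buildx du` command itself comes from the plan.

```python
import os
import subprocess
import time
from contextlib import contextmanager

# Hypothetical switch; not an existing benchmarks or SDK setting.
PROFILE_ENABLED = os.environ.get("BUILD_PROFILE", "") == "1"
TIMINGS: list[tuple[str, float]] = []


@contextmanager
def timed(label: str, enabled: bool = PROFILE_ENABLED):
    """Record wall-clock seconds for the wrapped block when profiling is on."""
    if not enabled:
        yield
        return
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        TIMINGS.append((label, elapsed))
        print(f"[profile] {label}: {elapsed:.2f}s", flush=True)


def buildx_disk_usage() -> str:
    """Return the raw text output of `docker buildx du` (requires Docker)."""
    result = subprocess.run(
        ["docker", "buildx", "du"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Wrapping the three hot-path calls in `with timed("..."):` blocks and logging `buildx_disk_usage()` before and after each prune would give comparable numbers across the refs under test, while leaving normal runs unaffected when the variable is unset.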
Phase 2: run a reduced isolation matrix
Use isolated image namespaces for every run so cache-from / cache-to state does not bleed between experiments.
Run these cells:
A. pre-2bfcc6c benchmarks + old SDK b498a699 (cold only)
B. 2bfcc6c code path + old SDK b498a699 (cold only)
C. pre-2bfcc6c benchmarks + new SDK bde715c1 (cold + warm)
D. full 2bfcc6c + new SDK bde715c1 (cold + warm)
Why this trimmed matrix:
historical evidence already shows the old SDK can warm successfully, so we should not spend two extra 500-image runs proving A/B warm again
the key open question is whether the new SDK can warm successfully at all, or whether prune pressure keeps it effectively cold between batches
These first runs are independent and should be dispatched in parallel once isolated namespaces are in place.
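The matrix above can be written down as data so that each cell gets its own namespace. This is only a sketch: the `Cell` type, the `namespace` naming scheme, and the `regression-isolation` prefix are hypothetical, and it assumes the warm pass of a cell reuses the namespace seeded by that cell's cold pass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    name: str
    benchmarks: str            # benchmarks code under test
    sdk_ref: str               # SDK commit under test
    modes: tuple[str, ...]     # which cache states to run


MATRIX = [
    Cell("A", "pre-2bfcc6c", "b498a699", ("cold",)),
    Cell("B", "2bfcc6c", "b498a699", ("cold",)),
    Cell("C", "pre-2bfcc6c", "bde715c1", ("cold", "warm")),
    Cell("D", "2bfcc6c", "bde715c1", ("cold", "warm")),
]


def namespace(cell: Cell, prefix: str = "regression-isolation") -> str:
    """Hypothetical naming scheme: one image namespace per cell."""
    return f"{prefix}/cell-{cell.name.lower()}-sdk-{cell.sdk_ref}"


def runs(matrix):
    """Expand cells into (namespace, mode) pairs, cold before warm."""
    return [(namespace(c), mode) for c in matrix for mode in c.modes]
```

With this layout the four cells expand to six runs, and no two cells share cache-from / cache-to state.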
Phase 3: bisect the SDK hot path with real historical checkpoints
If Phase 2 points to the SDK bump, test these exact SDK refs against the same benchmarks commit and workflow inputs:
b498a699 (baseline)
35d75e3a (cache-tag truncation change in _base_slug())
97731fe5 (Python 3.12 change)
5d65f389 (--extra boto3)
bde715c1 (final bundled SDK ref)
Why include 35d75e3a:
_base_slug() changed from splitting on the first _tag_ to the last _tag_
for some long SWE-bench image names, that changes the actual buildcache-* tag values
this can create partial cache misses independently of the Python 3.12 / boto3 changes
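To see why the split position matters, consider a name that contains `_tag_` more than once. The functions below are not the SDK's `_base_slug()`; they are a hypothetical reduction of the first-versus-last split described above, and the image name is made up for illustration.

```python
def slug_split_first(name: str, sep: str = "_tag_") -> str:
    """Keep everything before the first occurrence of sep."""
    head, _, _ = name.partition(sep)
    return head


def slug_split_last(name: str, sep: str = "_tag_") -> str:
    """Keep everything before the last occurrence of sep."""
    head, found, _ = name.rpartition(sep)
    return head if found else name


name = "repo_tag_v1_tag_latest"  # hypothetical image name
print(slug_split_first(name))    # repo
print(slug_split_last(name))     # repo_tag_v1
```

Names with a single `_tag_` produce the same slug either way, which is consistent with the change affecting only a subset of images.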
This phase should tell us whether the main jump comes from:
cache tag mismatch for a subset of images
Python 3.12 base-image / layer changes
--extra boto3 layer growth
or the combination
Phase 4: fix the real regression and the correctness bug
Reintroduce the benchmarks helper refactor separately from any SDK bump.
Replace early from openhands.sdk import get_logger imports in benchmarks utility modules with stdlib logging so workers do not inherit SDK/Rich logger state before ProcessPoolExecutor forks.
If the SDK Dockerfile / dependency profile is confirmed as the main regression, fix that in the SDK build path rather than only tuning prune settings. Likely directions:
make benchmark-oriented source-minimal builds use a lighter dependency profile
avoid installing extras such as boto3 unless they are required for that target
scope Python-version workarounds to the targets that actually need them
Keep the useful sdist-reuse optimization from #499, but treat it as a follow-up optimization after the primary regression is fixed.
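The logging change in this phase could look roughly like the sketch below. `build_one` and `build_all` are hypothetical stand-ins for benchmarks utility code; the point is only that the module obtains its logger from the standard library, so importing it does not initialize the SDK logger before `ProcessPoolExecutor` creates workers.

```python
import logging
from concurrent.futures import ProcessPoolExecutor

# Module-level logger from the standard library. Importing this module
# does not import or configure the SDK logger.
logger = logging.getLogger(__name__)


def build_one(image: str) -> str:
    """Placeholder for per-image work; logs through the stdlib logger."""
    logger.info("building %s", image)
    return image


def build_all(images: list[str], max_workers: int = 2) -> list[str]:
    """Fan the work out to worker processes and collect results in order."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(build_one, images))
```

Any SDK-specific log formatting can then be configured once in the entry point, after the pool's requirements are known, rather than as a side effect of importing a utility module.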
Phase 5: make cold-cache runs survivable
Cold-cache events will happen again whenever the SDK Dockerfile changes materially. Even if the SDK build profile is improved, the infrastructure should tolerate those transitions.
Treat this as co-equal with reducing SDK layer footprint, not as a minor operational footnote.
Candidate changes:
use a larger-disk runner or higher BUILDKIT_PRUNE_KEEP_GB (for example 150-200) for the first seed run after major SDK Dockerfile changes
reduce max-workers for intentionally cold-cache runs to limit concurrent disk pressure
optionally run a smaller warm-up build before the full 500-image build after an SDK Dockerfile change
add timeout-minutes and explicit prune timing logs so pathological runs fail faster and are easier to diagnose
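One way to make the first two candidates explicit is to derive seed-run settings from the normal ones. This is a sketch under assumptions: `RunSettings` and `settings_for` are hypothetical, halving the worker count is an arbitrary choice, and the 150 default is simply the low end of the range mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSettings:
    prune_keep_gb: int   # value intended for BUILDKIT_PRUNE_KEEP_GB
    max_workers: int     # concurrent image builds


def settings_for(
    base: RunSettings, cold_seed: bool, seed_keep_gb: int = 150
) -> RunSettings:
    """Return more conservative settings for an intentionally cold seed run."""
    if not cold_seed:
        return base
    return RunSettings(
        prune_keep_gb=max(base.prune_keep_gb, seed_keep_gb),
        max_workers=max(1, base.max_workers // 2),
    )
```

For example, with hypothetical normal settings of `RunSettings(60, 8)`, a cold seed run would use `RunSettings(150, 4)`, and ordinary warm runs would be left unchanged.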
Suggested execution order
Confirm #503 restores build health.
Dispatch the reduced Phase 2 matrix with isolated namespaces.
If the SDK bump is implicated, run the five-checkpoint SDK bisection.
Land the low-risk correctness fix for stdlib logging in benchmarks utilities.
Land the helper refactor without the SDK bump.
Land the SDK-side fix for the benchmark build profile and/or cold-cache seed strategy.
Exit criteria
We should consider this fixed when we can show, with pinned refs and isolated namespaces, that:
a 500-image cold-cache run completes without pathological prune/rebuild behavior
a corresponding warm-cache run stays fast and within a stable disk envelope
the benchmarks refactor can be carried without reintroducing the regression
SDK logger initialization is no longer happening in utility modules before process fork
Summary
Issue #502 established that full 500-image builds regressed badly after 2bfcc6c / PR #456. That change bundled two things:
a benchmarks refactor for auto-detecting local Docker images
an SDK bump from b498a699 to bde715c1
The observed regression is severe:
Run 20806796288 built 500 images in about 52 minutes and ended around 55 GiB BuildKit usage
Run 22694639473 reached about 599 GiB BuildKit usage by batch 19 and ran for ~5h49m before cancellation
Run 22929182421 reused a prebuilt sdist but was still slow and still grew past 550 GiB, so per-image sdist creation is a real cost but not the dominant one
Current working hypothesis:
the regression comes mainly from the SDK bump, not from the benchmarks helper refactor itself.
This issue is to track a clean isolation plan and the fix strategy.