Track root cause isolation and fix plan for the 500-image build regression after #456

## Summary

Issue #502 established that full 500-image builds regressed badly after `2bfcc6c` / PR #456. That change bundled two things:

- a useful `benchmarks` refactor for auto-detecting local Docker images
- an SDK submodule bump from `b498a699` to `bde715c1`

The observed regression is severe:

- historical fast run: `20806796288` built 500 images in about 52 minutes and ended around 55 GiB BuildKit usage
- historical slow run: `22694639473` reached about 599 GiB BuildKit usage by batch 19 and ran for ~5h49m before cancellation
- PR #499 (`22929182421`) reused a prebuilt sdist but was still slow and still grew past 550 GiB, so per-image sdist creation is real but not the dominant issue

Current working hypothesis:

1. The primary regression is in the SDK build path introduced by the submodule bump, not in the `benchmarks` helper refactor itself.
2. The slowdown is amplified by BuildKit prune/rebuild cycles once cold-cache runs blow past the runner disk budget.
3. The Rich/fork logger issue is a real correctness bug, but it does not match the observed symptom of continuous, very slow progress.

This issue is to track a clean isolation plan and the fix strategy.

Related:

- #502
- #503

## Why the previous fixes were only partial

- `#486` addressed fork/spawn behavior, but not the cache-footprint regression.
- `#487` addressed SDK logger initialization behavior, but not the cache-footprint regression.
- `#499` addressed repeated `uv build --sdist`, but evidence suggests that only removes a secondary cost.
- `#503` is the correct short-term operational rollback, but it does not isolate the root cause.

## Plan

### Phase 0: unblock builds

1. Merge `#503` to restore known-good behavior while the regression is investigated.

### Phase 1: add lightweight instrumentation

Add temporary env-gated timing and disk-usage logging around the hot path so we stop inferring from noisy historical runs alone:

- SDK `_make_build_context()` duration
- SDK `docker buildx build` duration
- prune duration and pre/post `docker buildx du` in the `benchmarks` batch loop

This should go into the exact historical branches / SHAs under test so the numbers are comparable.

### Phase 2: run a reduced isolation matrix

Use isolated image namespaces for every run so `cache-from` / `cache-to` state does not bleed between experiments.

Run these cells:

- A. pre-`2bfcc6c` `benchmarks` + old SDK `b498a699` (cold only)
- B. `2bfcc6c` code path + old SDK `b498a699` (cold only)
- C. pre-`2bfcc6c` `benchmarks` + new SDK `bde715c1` (cold + warm)
- D. full `2bfcc6c` + new SDK `bde715c1` (cold + warm)

Why this trimmed matrix:

- historical evidence already shows the old SDK can warm successfully, so we should not spend two extra 500-image runs proving A/B warm again
- the key open question is whether the new SDK can warm successfully at all, or whether prune pressure keeps it effectively cold between batches

These first runs are independent and should be dispatched in parallel once isolated namespaces are in place.

### Phase 3: bisect the SDK hot path with real historical checkpoints

If Phase 2 points to the SDK bump, test these exact SDK refs against the same `benchmarks` commit and workflow inputs:

1. `b498a699` (baseline)
2. `35d75e3a` (cache-tag truncation change in `_base_slug()`)
3. `97731fe5` (Python 3.12 change)
4. `5d65f389` (`--extra boto3`)
5. `bde715c1` (final bundled SDK ref)

Why include `35d75e3a`:

- `_base_slug()` changed from splitting on the first `_tag_` to the last `_tag_`
- for some long SWE-bench image names, that changes the actual `buildcache-*` tag values
- this can create partial cache misses independently of the Python 3.12 / `boto3` changes

This phase should tell us whether the main jump comes from:

- cache tag mismatch for a subset of images
- Python 3.12 base-image / layer changes
- `--extra boto3` layer growth
- or the combination

### Phase 4: fix the real regression and the correctness bug

1. Reintroduce the `benchmarks` helper refactor separately from any SDK bump.
2. Replace early `from openhands.sdk import get_logger` imports in `benchmarks` utility modules with stdlib logging so workers do not inherit SDK/Rich logger state before `ProcessPoolExecutor` forks.
3. If the SDK Dockerfile / dependency profile is confirmed as the main regression, fix that in the SDK build path rather than only tuning prune settings. Likely directions:

- make benchmark-oriented `source-minimal` builds use a lighter dependency profile
- avoid installing extras such as `boto3` unless they are required for that target
- scope Python-version workarounds to the targets that actually need them

4. Keep the useful sdist-reuse optimization from `#499`, but treat it as a follow-up optimization after the primary regression is fixed.

### Phase 5: make cold-cache runs survivable

Cold-cache events will happen again whenever the SDK Dockerfile changes materially. Even if the SDK build profile is improved, the infrastructure should tolerate those transitions.

Treat this as co-equal with reducing SDK layer footprint, not as a minor operational footnote.

Candidate changes:

- use a larger-disk runner or higher `BUILDKIT_PRUNE_KEEP_GB` (for example 150-200) for the first seed run after major SDK Dockerfile changes
- reduce `max-workers` for intentionally cold-cache runs to limit concurrent disk pressure
- optionally run a smaller warm-up build before the full 500-image build after an SDK Dockerfile change
- add `timeout-minutes` and explicit prune timing logs so pathological runs fail faster and are easier to diagnose

## Suggested execution order

1. Confirm `#503` restores build health.
2. Dispatch the reduced Phase 2 matrix with isolated namespaces.
3. If the SDK bump is implicated, run the five-checkpoint SDK bisection.
4. Land the low-risk correctness fix for stdlib logging in `benchmarks` utilities.
5. Land the helper refactor without the SDK bump.
6. Land the SDK-side fix for the benchmark build profile and/or cold-cache seed strategy.

## Exit criteria

We should consider this fixed when we can show, with pinned refs and isolated namespaces, that:

- a 500-image cold-cache run completes without pathological prune/rebuild behavior
- a corresponding warm-cache run stays fast and within a stable disk envelope
- the `benchmarks` refactor can be carried without reintroducing the regression
- SDK logger initialization is no longer happening in utility modules before process fork

## Sub-issues

- [x] #510



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Track root cause isolation and fix plan for the 500-image build regression after #456 #504

Summary

Why the previous fixes were only partial

Plan

Phase 0: unblock builds

Phase 1: add lightweight instrumentation

Phase 2: run a reduced isolation matrix

Phase 3: bisect the SDK hot path with real historical checkpoints

Phase 4: fix the real regression and the correctness bug

Phase 5: make cold-cache runs survivable

Suggested execution order

Exit criteria

Sub-issues

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Track root cause isolation and fix plan for the 500-image build regression after #456 #504

Description

Summary

Why the previous fixes were only partial

Plan

Phase 0: unblock builds

Phase 1: add lightweight instrumentation

Phase 2: run a reduced isolation matrix

Phase 3: bisect the SDK hot path with real historical checkpoints

Phase 4: fix the real regression and the correctness bug

Phase 5: make cold-cache runs survivable

Suggested execution order

Exit criteria

Sub-issues

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions