## Problem
The research recipe can produce experiments that consume extreme system resources without any pre-flight check or enforcement. These experiments typically run in the background while the user works on other things, so resource blowups are unacceptable.
## Incident details (2026-04-09)
The `2026-04-09-subsampled-tw-tradeoff` experiment planned compute-exact baselines at n=50,000 using `sklearn.manifold.trustworthiness()`, which materializes a full n×n float64 distance matrix:

- 50,000 × 50,000 × 8 bytes = 20 GB for the distance matrix alone
- `np.argsort(dist_X, axis=1)` creates another ~20 GB int64 index array
- Peak memory for a single `sklearn_tw()` call at n=50K: ~40 GB
The plan stated "Peak memory: ~2 GB" (analyzing only Approach A at m=5K). The exact baseline computation — the O(n²) operation the experiment was designed to avoid — was never analyzed for resource cost. The plan's `## Estimated Resource Requirements` section is free-text prose with no schema, no validation, and no downstream consumption.
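The arithmetic behind these figures is easy to sanity-check; a back-of-envelope sketch (pure arithmetic, no allocation):

```python
# Back-of-envelope check of the incident's memory figures.
n = 50_000

dist_matrix_gb = n * n * 8 / 1e9   # n×n float64 pairwise distance matrix
argsort_gb = n * n * 8 / 1e9       # int64 index array from np.argsort(dist_X, axis=1)
peak_gb = dist_matrix_gb + argsort_gb  # both arrays are live at the same time

print(f"distance matrix: {dist_matrix_gb:.0f} GB")          # 20 GB
print(f"argsort indices: {argsort_gb:.0f} GB")              # 20 GB
print(f"peak for one sklearn_tw() call: ~{peak_gb:.0f} GB") # ~40 GB
```

Twenty times the plan's stated ~2 GB, from four lines of arithmetic the planner never ran.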
Root cause chain (from post-incident deep-dive of all 3 plan versions, 3 review cycles, and implemented code):
- `scope` mentioned "O(n²) memory" in passing but never computed concrete GB figures
- `plan-experiment` analyzed the focal algorithm's memory but not the baseline's — the planner never identified the single most expensive operation
- `review-design`: `resource_proportionality` was silenced in all 3 review cycles (L-weight → foothold-dropped → never evaluated)
- `run-experiment` pre-flight checks data manifest availability but not host memory/disk capacity
Relevant session IDs:

- `f3b29126` — implement_phase (Group C Part B)
- `0ab52b1c` — plan_phase (Group D)
- `7d2abdc2` — implement_phase (Group D)

The experiment run was interrupted by the user due to resource exhaustion.
## Current state

| What exists | Where | Gap |
|---|---|---|
| `## Estimated Resource Requirements` section | `plan-experiment` SKILL.md output template (line 287) | Unstructured prose placeholder; no frontmatter field, no validation rule (V1–V9 don't cover it), no prose-to-frontmatter mapping |
| `linux_tracing` proc monitoring | `execution/linux_tracing.py` | Collects snapshots but doesn't enforce limits or feed back into recipe routing |
| `stale_threshold` overrides | `research.yaml` (2400s for heavy steps) | Time-only; no memory/disk gates |
| `resource_proportionality` dimension | `review-design` SKILL.md weight matrix | L-weight for ALL experiment types; no subagent prompt, evaluation criteria, or scoring rubric defined; silenced by foothold validation when no resource prose present |
## Proposed solution

### 1. Structured resource estimates in `plan-experiment` (PRIMARY — everything else depends on this)

Evolve `## Estimated Resource Requirements` from free-text to a machine-readable frontmatter block (coordinates with #590):
```yaml
resource_estimate:
  peak_memory_gb: <float>    # Worst-case single-process RSS
  disk_gb: <float>           # Total disk footprint (data + outputs)
  wall_time_minutes: <int>   # Estimated wall-clock time for full run
  justification: <string>    # Which operation dominates and why
```
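For the incident experiment, a correctly filled-in block might have looked like this (the memory figure follows from the incident analysis above; `disk_gb` and `wall_time_minutes` are purely illustrative):

```yaml
resource_estimate:
  peak_memory_gb: 40.0      # dominated by the exact baseline, not Approach A
  disk_gb: 5.0              # illustrative value
  wall_time_minutes: 90     # illustrative value
  justification: >
    trustworthiness() at n=50,000 materializes an n×n float64 distance
    matrix (~20 GB) plus an int64 argsort index array (~20 GB), dwarfing
    Approach A's ~2 GB at m=5K.
```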
Skill changes:

- Add a mandatory derivation step in `plan-experiment`'s instructions requiring the planner to identify the single largest data structure across ALL experiment operations — including baselines, comparisons, and reference computations, not just the focal algorithm. This is the key instruction that would have caught the incident: the planner analyzed Approach A's memory correctly but never analyzed `compute_exact.py`'s sklearn call.
- Add validation rule V10: `resource_estimate` must have `peak_memory_gb > 0` for non-exploratory experiments. ERROR if absent.
- Add the `resource_estimate` → `## Estimated Resource Requirements` row to the prose-to-frontmatter mapping table.
- Capture output token: `resource_estimate = {path}` from the `plan_experiment` step.
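A minimal sketch of what V10 could check, assuming the validator sees plan frontmatter as a parsed dict (the `experiment_type` field name and the error-list return shape are assumptions, not the recipe's real validator API):

```python
def check_v10(frontmatter: dict) -> list[str]:
    """Hypothetical V10: non-exploratory plans need a positive peak-memory estimate."""
    if frontmatter.get("experiment_type") == "exploratory":
        return []  # exploratory experiments are exempt
    est = frontmatter.get("resource_estimate") or {}
    if est.get("peak_memory_gb", 0) > 0:
        return []
    return ["V10 ERROR: resource_estimate.peak_memory_gb must be > 0 "
            "for non-exploratory experiments"]
```

The incident plan would have failed this rule immediately: it had no `resource_estimate` block at all, only prose.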
Implementation options (pick one):
- Option A: Mandatory derivation step — add instructions to the main planner flow requiring explicit peak-memory computation before writing the resource section. Cheaper (no extra subagent).
- Option B: Subagent D — Resource Assessment — dedicated subagent that analyzes algorithms, dataset sizes, and data types. More thorough but adds token cost per plan run. plan-experiment already runs 3 subagents (A: prior art, B: data feasibility, C: environment).
### 2. Pre-flight resource gate in `run-experiment`
Add a resource feasibility check to run-experiment Step 2 (Pre-flight), after data manifest verification:
- Read `resource_estimate.peak_memory_gb` from the experiment plan frontmatter
- Query host capacity: available RAM, available disk
- If `peak_memory_gb > available_ram * 0.85` → emit a clear failure message explaining what exceeded and by how much, set Status: FAILED
- If `peak_memory_gb > available_ram * 0.6` → log a warning, proceed
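A sketch of the memory half of the gate, assuming a Linux host and stdlib-only capacity probing (the function name and return convention are hypothetical; `psutil.virtual_memory()` would be a portable alternative):

```python
import os

def preflight_memory_gate(peak_memory_gb: float) -> tuple[str, str]:
    """Hypothetical pre-flight check: compare the plan's estimate to available RAM."""
    # Available RAM via POSIX sysconf (Linux-specific sysconf names).
    avail_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_AVPHYS_PAGES") / 1e9
    if peak_memory_gb > avail_gb * 0.85:
        return ("FAILED",
                f"estimated peak {peak_memory_gb:.1f} GB exceeds 85% of "
                f"available RAM ({avail_gb:.1f} GB available)")
    if peak_memory_gb > avail_gb * 0.6:
        return ("WARNING",
                f"estimated peak {peak_memory_gb:.1f} GB is over 60% of available RAM")
    return ("OK", "")
```

On a typical 32 GB host, the incident's 40 GB estimate would have tripped the FAILED branch before any compute started.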
No recipe routing changes needed. The existing failure path already handles this gracefully:

```
run_experiment FAILS (resource pre-flight)
  → adjust_experiment (--adjust reads failure, can't fix capacity → fails)
  → ensure_results (creates INCONCLUSIVE placeholder)
  → write_report_inconclusive
```
The pipeline terminates with a clear report explaining why it couldn't run and what resources it needed. The human decides whether to reduce experiment scale or get a bigger machine.
### 3. Conditional `resource_proportionality` elevation in `review-design`

Extend the existing `+high_cost` secondary modifier to cover memory and wall-time, not just GPU-hours:

- Keep L-weight as default for `resource_proportionality`
- Add conditional elevation: if `resource_estimate.peak_memory_gb` exceeds a configurable threshold OR `resource_estimate.wall_time_minutes` exceeds a configurable threshold → elevate to M-weight for that review
- If the plan has a `resource_estimate` block, evaluate the dimension. If absent, emit a single WARNING: "Plan contains no structured resource estimate."
- Define evaluation criteria (currently none exist): Is `peak_memory_gb` plausible for the declared data scale? Does the justification account for all compute-heavy steps, including baselines?

No hardcoded thresholds. Elevation thresholds should come from config, allowing projects with different hardware profiles to set appropriate values.
## Priority

1. `plan-experiment` structured estimates — highest impact, creates the data everything else needs
2. `run-experiment` pre-flight gate — defense in depth, catches wrong estimates at runtime
3. `review-design` conditional elevation — catches missing/implausible estimates during review
## Related issues
## Acceptance criteria
- `plan-experiment` produces machine-readable resource estimates (`peak_memory_gb`, `disk_gb`, `wall_time_minutes`) in YAML frontmatter
- `run-experiment` pre-flight compares estimates against host capacity and fails early if exceeded
- `review-design` `resource_proportionality` is conditionally elevated based on configurable thresholds (not hardcoded)
- `review-design` `resource_proportionality` has defined evaluation criteria and a subagent prompt