
Research recipe: add resource budget assessment and enforcement gates #689

@Trecek

Description

Problem

The research recipe can produce experiments that consume extreme system resources without any pre-flight check or enforcement. These experiments typically run in the background while the user works on other things, so resource blowups are unacceptable.

Incident details (2026-04-09)

The 2026-04-09-subsampled-tw-tradeoff experiment planned compute-exact baselines at n=50,000 using sklearn.metrics.trustworthiness(), which materializes a full n×n float64 distance matrix:

  • 50,000 × 50,000 × 8 bytes = 20 GB for the distance matrix alone
  • np.argsort(dist_X, axis=1) creates another ~20 GB int64 index array
  • Peak memory for a single sklearn_tw() call at n=50K: ~40 GB
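The figures above can be reproduced with a quick back-of-envelope check (standalone arithmetic, not project code):

```python
# Back-of-envelope check of the incident's memory figures.
# The exact trustworthiness baseline at n=50,000 materializes an n x n float64
# distance matrix; np.argsort(dist_X, axis=1) then allocates an equal-size
# int64 index array.
n = 50_000

dist_matrix_gb = n * n * 8 / 1e9   # float64: 8 bytes per entry
argsort_gb = n * n * 8 / 1e9       # int64: also 8 bytes per entry

print(f"distance matrix: {dist_matrix_gb:.0f} GB")                 # 20 GB
print(f"argsort indices: {argsort_gb:.0f} GB")                     # 20 GB
print(f"combined peak:   {dist_matrix_gb + argsort_gb:.0f} GB")    # 40 GB
```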

The plan stated "Peak memory: ~2 GB" (analyzing only Approach A at m=5K). The exact baseline computation — the O(n²) operation the experiment was designed to avoid — was never analyzed for resource cost. The plan's ## Estimated Resource Requirements section is free-text prose with no schema, no validation, and no downstream consumption.

Root cause chain (from post-incident deep-dive of all 3 plan versions, 3 review cycles, and implemented code):

  1. scope mentioned "O(n²) memory" in passing but never computed concrete GB figures
  2. plan-experiment analyzed the focal algorithm's memory but not the baseline's — the planner never identified the single most expensive operation
  3. review-design resource_proportionality was silenced in all 3 review cycles (L-weight → foothold-dropped → never evaluated)
  4. run-experiment pre-flight checks data manifest availability but not host memory/disk capacity

Relevant session IDs

  • f3b29126 — implement_phase (Group C Part B)
  • 0ab52b1c — plan_phase (Group D)
  • 7d2abdc2 — implement_phase (Group D)
  • Run experiment interrupted by user due to resource exhaustion.

Current state

| What exists | Where | Gap |
| --- | --- | --- |
| `## Estimated Resource Requirements` section | plan-experiment SKILL.md output template (line 287) | Unstructured prose placeholder; no frontmatter field, no validation rule (V1-V9 don't cover it), no prose-to-frontmatter mapping |
| linux_tracing proc monitoring | execution/linux_tracing.py | Collects snapshots but doesn't enforce limits or feed back into recipe routing |
| stale_threshold overrides | research.yaml (2400s for heavy steps) | Time-only; no memory/disk gates |
| resource_proportionality dimension | review-design SKILL.md weight matrix | L-weight for ALL experiment types; no subagent prompt, evaluation criteria, or scoring rubric defined; silenced by foothold validation when no resource prose present |

Proposed solution

1. Structured resource estimates in plan-experiment (PRIMARY — everything else depends on this)

Evolve ## Estimated Resource Requirements from free-text to a machine-readable frontmatter block (coordinates with #590):

```yaml
resource_estimate:
  peak_memory_gb: <float>       # Worst-case single-process RSS
  disk_gb: <float>              # Total disk footprint (data + outputs)
  wall_time_minutes: <int>      # Estimated wall-clock time for full run
  justification: <string>       # Which operation dominates and why
```
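As a concrete illustration, a filled-in block for the incident experiment might have read as follows (the peak figure comes from the incident analysis above; the disk and wall-time values are invented placeholders):

```yaml
resource_estimate:
  peak_memory_gb: 40.0          # n=50K exact baseline: 20 GB distance matrix + 20 GB argsort
  disk_gb: 5.0                  # placeholder (not stated in the incident writeup)
  wall_time_minutes: 120        # placeholder (not stated in the incident writeup)
  justification: >
    The exact trustworthiness baseline at n=50,000 materializes a full n×n
    float64 distance matrix; np.argsort adds an equal-size int64 index array.
```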

Skill changes:

  • Add a mandatory derivation step in plan-experiment's instructions requiring the planner to identify the single largest data structure across ALL experiment operations — including baselines, comparisons, and reference computations, not just the focal algorithm. This is the key instruction that would have caught the incident: the planner analyzed Approach A's memory correctly but never analyzed compute_exact.py's sklearn call.
  • Add validation rule V10: resource_estimate must have peak_memory_gb > 0 for non-exploratory experiments. ERROR if absent.
  • Add a `resource_estimate` ↔ `## Estimated Resource Requirements` row to the prose-to-frontmatter mapping table.
  • Capture output token: resource_estimate = {path} from plan_experiment step.
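A minimal sketch of how V10 could be implemented, assuming the plan frontmatter is already parsed into a dict (the function name and experiment-type values are hypothetical):

```python
# Hypothetical validator for proposed rule V10 (V1-V9 exist in the skill today).
def validate_v10(frontmatter: dict, experiment_type: str) -> list[str]:
    """ERROR if a non-exploratory plan lacks a positive peak_memory_gb."""
    estimate = frontmatter.get("resource_estimate") or {}
    peak = estimate.get("peak_memory_gb")
    if experiment_type != "exploratory" and (peak is None or peak <= 0):
        return ["V10 ERROR: resource_estimate.peak_memory_gb must be > 0 "
                "for non-exploratory experiments"]
    return []
```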

Implementation options (pick one):

  • Option A: Mandatory derivation step — add instructions to the main planner flow requiring explicit peak-memory computation before writing the resource section. Cheaper (no extra subagent).
  • Option B: Subagent D — Resource Assessment — dedicated subagent that analyzes algorithms, dataset sizes, and data types. More thorough but adds token cost per plan run. plan-experiment already runs 3 subagents (A: prior art, B: data feasibility, C: environment).

2. Pre-flight resource gate in run-experiment

Add a resource feasibility check to run-experiment Step 2 (Pre-flight), after data manifest verification:

  • Read resource_estimate.peak_memory_gb from the experiment plan frontmatter
  • Query host capacity: available RAM, available disk
  • If peak_memory_gb > available_ram * 0.85 → emit clear failure message explaining what exceeded and by how much, set Status: FAILED
  • If peak_memory_gb > available_ram * 0.6 → log warning, proceed
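A sketch of what this gate could look like on a Linux host (reads MemAvailable from /proc/meminfo; function names are hypothetical, the 0.85/0.6 factors are the thresholds above):

```python
# Hypothetical pre-flight resource gate for run-experiment Step 2.

def read_available_ram_gb() -> float:
    """Read MemAvailable (reported in kB) from /proc/meminfo; Linux-only."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) / 1e6  # kB -> GB
    raise RuntimeError("MemAvailable not found in /proc/meminfo")

def preflight_resource_gate(peak_memory_gb: float, available_ram_gb: float):
    """Return (status, message) per the thresholds above."""
    if peak_memory_gb > available_ram_gb * 0.85:
        return ("FAILED",
                f"estimated peak {peak_memory_gb:.1f} GB exceeds 85% of "
                f"available RAM ({available_ram_gb:.1f} GB)")
    if peak_memory_gb > available_ram_gb * 0.6:
        return ("WARN",
                f"estimated peak {peak_memory_gb:.1f} GB is above 60% of "
                f"available RAM ({available_ram_gb:.1f} GB); proceeding")
    return ("OK", "")
```

With 32 GB available, for example, the incident's 40 GB estimate would return a FAILED status before any computation starts.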

No recipe routing changes needed. The existing failure path already handles this gracefully:

run_experiment FAILS (resource pre-flight)
  → adjust_experiment (--adjust reads failure, can't fix capacity → fails)
    → ensure_results (creates INCONCLUSIVE placeholder)
      → write_report_inconclusive

The pipeline terminates with a clear report explaining why it couldn't run and what resources it needed. The human decides whether to reduce experiment scale or get a bigger machine.

3. Conditional resource_proportionality elevation in review-design

Extend the existing +high_cost secondary modifier to cover memory and wall-time, not just GPU-hours:

  • Keep L-weight as default for resource_proportionality
  • Add conditional elevation: if resource_estimate.peak_memory_gb exceeds a configurable threshold OR resource_estimate.wall_time_minutes exceeds a configurable threshold → elevate to M-weight for that review
  • If the plan has a resource_estimate block, evaluate the dimension. If absent, emit a single WARNING: "Plan contains no structured resource estimate."
  • Define evaluation criteria (currently none exist): Is peak_memory_gb plausible for the declared data scale? Does the justification account for all compute-heavy steps including baselines?

No hardcoded thresholds. Elevation thresholds should come from config, allowing projects with different hardware profiles to set appropriate values.

Priority

  1. plan-experiment structured estimates — highest impact, creates the data everything else needs
  2. run-experiment pre-flight gate — defense in depth, catches wrong estimates at runtime
  3. review-design conditional elevation — catches missing/implausible estimates during review

Related issues

  • #590 (the structured resource_estimate frontmatter proposed above coordinates with this issue)

Acceptance criteria

  • plan-experiment produces machine-readable resource estimates (peak_memory_gb, disk_gb, wall_time_minutes) in YAML frontmatter
  • The mandatory derivation step requires analyzing ALL compute-heavy operations including baselines
  • run-experiment pre-flight compares estimates against host capacity and fails early if exceeded
  • review-design resource_proportionality is conditionally elevated based on configurable thresholds (not hardcoded)
  • review-design resource_proportionality has defined evaluation criteria and a subagent prompt
