
Research recipe: add resource budget assessment and enforcement gates #689

@Trecek

Description

Problem

The research recipe can produce experiments that consume extreme system resources without any pre-flight check or enforcement. These experiments typically run in the background while the user works on other things, so resource blowups are unacceptable.

Incident details (2026-04-09)

The 2026-04-09-subsampled-tw-tradeoff experiment planned compute-exact baselines at n=50,000 using sklearn.metrics.trustworthiness(), which materializes a full n×n float64 distance matrix:

  • 50,000 × 50,000 × 8 bytes = 20 GB for the distance matrix alone
  • np.argsort(dist_X, axis=1) creates another ~20 GB int64 index array
  • Peak memory for a single sklearn_tw() call at n=50K: ~40 GB
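The figures above can be reproduced with a quick back-of-envelope check (standalone arithmetic, not project code):

```python
# Back-of-envelope check of the incident's memory figures.
# The exact trustworthiness baseline at n=50,000 materializes an n x n float64
# distance matrix; np.argsort(dist_X, axis=1) then allocates an equal-size
# int64 index array.
n = 50_000

dist_matrix_gb = n * n * 8 / 1e9   # float64: 8 bytes per entry
argsort_gb = n * n * 8 / 1e9       # int64: also 8 bytes per entry

print(f"distance matrix: {dist_matrix_gb:.0f} GB")                 # 20 GB
print(f"argsort indices: {argsort_gb:.0f} GB")                     # 20 GB
print(f"combined peak:   {dist_matrix_gb + argsort_gb:.0f} GB")    # 40 GB
```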

The plan stated "Peak memory: ~2 GB" (analyzing only Approach A at m=5K). The exact baseline computation — the O(n²) operation the experiment was designed to avoid — was never analyzed for resource cost. The plan's ## Estimated Resource Requirements section is free-text prose with no schema, no validation, and no downstream consumption.

Root cause chain (from post-incident deep-dive of all 3 plan versions, 3 review cycles, and implemented code):

  1. scope mentioned "O(n²) memory" in passing but never computed concrete GB figures
  2. plan-experiment analyzed the focal algorithm's memory but not the baseline's — the planner never identified the single most expensive operation
  3. review-design resource_proportionality was silenced in all 3 review cycles (L-weight → foothold-dropped → never evaluated)
  4. run-experiment pre-flight checks data manifest availability but not host memory/disk capacity

Relevant session IDs

  • f3b29126 — implement_phase (Group C Part B)
  • 0ab52b1c — plan_phase (Group D)
  • 7d2abdc2 — implement_phase (Group D)
  • Run experiment interrupted by user due to resource exhaustion.

Current state

| What exists | Where | Gap |
| --- | --- | --- |
| `## Estimated Resource Requirements` section | plan-experiment SKILL.md output template (line 287) | Unstructured prose placeholder; no frontmatter field, no validation rule (V1-V9 don't cover it), no prose-to-frontmatter mapping |
| linux_tracing proc monitoring | execution/linux_tracing.py | Collects snapshots but doesn't enforce limits or feed back into recipe routing |
| stale_threshold overrides | research.yaml (2400s for heavy steps) | Time-only; no memory/disk gates |
| resource_proportionality dimension | review-design SKILL.md weight matrix | L-weight for ALL experiment types; no subagent prompt, evaluation criteria, or scoring rubric defined; silenced by foothold validation when no resource prose present |

Proposed solution

1. Structured resource estimates in plan-experiment (PRIMARY — everything else depends on this)

Evolve ## Estimated Resource Requirements from free-text to a machine-readable frontmatter block (coordinates with #590):

```yaml
resource_estimate:
  peak_memory_gb: <float>       # Worst-case single-process RSS
  disk_gb: <float>              # Total disk footprint (data + outputs)
  wall_time_minutes: <int>      # Estimated wall-clock time for full run
  justification: <string>       # Which operation dominates and why
```
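As a concrete illustration, a filled-in block for the incident experiment might have read as follows (the peak figure comes from the incident analysis above; the disk and wall-time values are invented placeholders):

```yaml
resource_estimate:
  peak_memory_gb: 40.0          # n=50K exact baseline: 20 GB distance matrix + 20 GB argsort
  disk_gb: 5.0                  # placeholder (not stated in the incident writeup)
  wall_time_minutes: 120        # placeholder (not stated in the incident writeup)
  justification: >
    The exact trustworthiness baseline at n=50,000 materializes a full n×n
    float64 distance matrix; np.argsort adds an equal-size int64 index array.
```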

Skill changes:

  • Add a mandatory derivation step in plan-experiment's instructions requiring the planner to identify the single largest data structure across ALL experiment operations — including baselines, comparisons, and reference computations, not just the focal algorithm. This is the key instruction that would have caught the incident: the planner analyzed Approach A's memory correctly but never analyzed compute_exact.py's sklearn call.
  • Add validation rule V10: resource_estimate must have peak_memory_gb > 0 for non-exploratory experiments. ERROR if absent.
  • Add a `resource_estimate` ↔ `## Estimated Resource Requirements` row to the prose-to-frontmatter mapping table.
  • Capture output token: resource_estimate = {path} from plan_experiment step.
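A minimal sketch of how V10 could be implemented, assuming the plan frontmatter is already parsed into a dict (the function name and experiment-type values are hypothetical):

```python
# Hypothetical validator for proposed rule V10 (V1-V9 exist in the skill today).
def validate_v10(frontmatter: dict, experiment_type: str) -> list[str]:
    """ERROR if a non-exploratory plan lacks a positive peak_memory_gb."""
    estimate = frontmatter.get("resource_estimate") or {}
    peak = estimate.get("peak_memory_gb")
    if experiment_type != "exploratory" and (peak is None or peak <= 0):
        return ["V10 ERROR: resource_estimate.peak_memory_gb must be > 0 "
                "for non-exploratory experiments"]
    return []
```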

Implementation options (pick one):

  • Option A: Mandatory derivation step — add instructions to the main planner flow requiring explicit peak-memory computation before writing the resource section. Cheaper (no extra subagent).
  • Option B: Subagent D — Resource Assessment — dedicated subagent that analyzes algorithms, dataset sizes, and data types. More thorough but adds token cost per plan run. plan-experiment already runs 3 subagents (A: prior art, B: data feasibility, C: environment).

2. Pre-flight resource gate in run-experiment

Add a resource feasibility check to run-experiment Step 2 (Pre-flight), after data manifest verification:

  • Read resource_estimate.peak_memory_gb from the experiment plan frontmatter
  • Query host capacity: available RAM, available disk
  • If peak_memory_gb > available_ram * 0.85 → emit clear failure message explaining what exceeded and by how much, set Status: FAILED
  • If peak_memory_gb > available_ram * 0.6 → log warning, proceed
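A sketch of what this gate could look like on a Linux host (reads MemAvailable from /proc/meminfo; function names are hypothetical, the 0.85/0.6 factors are the thresholds above):

```python
# Hypothetical pre-flight resource gate for run-experiment Step 2.

def read_available_ram_gb() -> float:
    """Read MemAvailable (reported in kB) from /proc/meminfo; Linux-only."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) / 1e6  # kB -> GB
    raise RuntimeError("MemAvailable not found in /proc/meminfo")

def preflight_resource_gate(peak_memory_gb: float, available_ram_gb: float):
    """Return (status, message) per the thresholds above."""
    if peak_memory_gb > available_ram_gb * 0.85:
        return ("FAILED",
                f"estimated peak {peak_memory_gb:.1f} GB exceeds 85% of "
                f"available RAM ({available_ram_gb:.1f} GB)")
    if peak_memory_gb > available_ram_gb * 0.6:
        return ("WARN",
                f"estimated peak {peak_memory_gb:.1f} GB is above 60% of "
                f"available RAM ({available_ram_gb:.1f} GB); proceeding")
    return ("OK", "")
```

With 32 GB available, for example, the incident's 40 GB estimate would return a FAILED status before any computation starts.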

No recipe routing changes needed. The existing failure path already handles this gracefully:

run_experiment FAILS (resource pre-flight)
  → adjust_experiment (--adjust reads failure, can't fix capacity → fails)
    → ensure_results (creates INCONCLUSIVE placeholder)
      → write_report_inconclusive

The pipeline terminates with a clear report explaining why it couldn't run and what resources it needed. The human decides whether to reduce experiment scale or get a bigger machine.

3. Conditional resource_proportionality elevation in review-design

Extend the existing +high_cost secondary modifier to cover memory and wall-time, not just GPU-hours:

  • Keep L-weight as default for resource_proportionality
  • Add conditional elevation: if resource_estimate.peak_memory_gb exceeds a configurable threshold OR resource_estimate.wall_time_minutes exceeds a configurable threshold → elevate to M-weight for that review
  • If the plan has a resource_estimate block, evaluate the dimension. If absent, emit a single WARNING: "Plan contains no structured resource estimate."
  • Define evaluation criteria (currently none exist): Is peak_memory_gb plausible for the declared data scale? Does the justification account for all compute-heavy steps including baselines?

No hardcoded thresholds. Elevation thresholds should come from config, allowing projects with different hardware profiles to set appropriate values.

Priority

  1. plan-experiment structured estimates — highest impact, creates the data everything else needs
  2. run-experiment pre-flight gate — defense in depth, catches wrong estimates at runtime
  3. review-design conditional elevation — catches missing/implausible estimates during review

Related issues

  • #590 (the structured resource_estimate frontmatter proposed above coordinates with this issue)

Acceptance criteria

  • plan-experiment produces machine-readable resource estimates (peak_memory_gb, disk_gb, wall_time_minutes) in YAML frontmatter
  • The mandatory derivation step requires analyzing ALL compute-heavy operations including baselines
  • run-experiment pre-flight compares estimates against host capacity and fails early if exceeded
  • review-design resource_proportionality is conditionally elevated based on configurable thresholds (not hardcoded)
  • review-design resource_proportionality has defined evaluation criteria and a subagent prompt
