feat(agent-comparison): add autoresearch optimization review flow#205

Open
notque wants to merge 13 commits into main from feat/agent-comparison-autoresearch-clean
Conversation


@notque notque commented Mar 29, 2026

Summary

  • add an autoresearch optimization loop for agent-comparison with variant generation, scoring, and iteration artifacts
  • add optimization-report support to the eval viewer, including optimization-only presentation polish and snapshot export/review flow
  • add regression tests for optimizer safety and result loading, and keep the comprehensive-review skill description under the Codex length limit

Validation

  • pytest -q scripts/tests/test_agent_comparison_optimize_loop.py scripts/tests/test_eval_compare_optimization.py
  • python3 -m py_compile skills/skill-creator/scripts/eval_compare.py skills/agent-comparison/scripts/generate_variant.py skills/agent-comparison/scripts/optimize_loop.py
  • git diff --check
  • browser validation against a generated optimization report fixture with desktop/mobile screenshots and no runtime console errors

notque added 7 commits March 28, 2026 19:17
PR #204 was merged to main while this branch was being developed.
All conflicts resolved in favor of the clean rework versions (ours):
- SKILL.md: review/export approach over cherry-pick
- optimization-guide.md: snapshot review terminology
- eval_viewer.html: radio selection, setActivePage helper, optimization-only mode
- eval_compare.py: standalone is_optimization_data() validator
…view issues

- Migrate generate_variant.py and improve_description.py from Anthropic SDK
  to claude -p subprocess invocation
- Add beam search optimization with configurable width, candidates per parent,
  and frontier retention to optimize_loop.py
- Add beam search parameters display and empty-state UX in eval_viewer.html
- Update SKILL.md and optimization-guide.md for beam search documentation
- Migrate skill-eval run_loop and rules-distill to use claude -p
- Add test coverage for beam search, model flag omission, and claude -p flow

Fixes from review:
- Fix misplaced test_writes_pending_json_in_live_mode (back in TestFullPipeline)
- Remove dead round_keeps variable from optimize_loop.py
- Fix timeout mismatch (120s outer vs 300s inner → 360s outer)
- Clarify --max-iterations help text (rounds, not individual iterations)

Critical fixes:
- Temp file collision in beam search: embed iteration_counter in filename
- rules-distill.py: log errors on claude -p failure and JSONDecodeError
- _run_trigger_rate: always print subprocess errors, not just under --verbose
- _generate_variant_output: add cwd and env (strip CLAUDECODE)

Important fixes:
- _find_project_root: warn on silent cwd fallback in generate_variant and improve_description
- improve_description: warn when <new_description> tags not found
- search_strategy: emit "hill_climb" for single-path runs (beam_width=1, candidates=1)
- rules-distill: log exception in broad except clause
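The temp-file collision fix above can be sketched as follows. This is a minimal illustration, not the code from optimize_loop.py; the function name, glob-style filename, and directory layout are assumptions. The point is that with beam_width > 1, several candidates can be generated within the same second, so a timestamp alone can collide, while a monotonically increasing iteration counter makes each path unique.

```python
import time
from pathlib import Path

def variant_temp_path(tmp_dir: Path, iteration_counter: int) -> Path:
    """Collision-free temp filename for beam-search candidates.

    A timestamp alone is not unique when multiple candidates are created
    within one second; embedding the global iteration counter is.
    (Illustrative sketch; names are hypothetical.)
    """
    return tmp_dir / f"variant-{int(time.time())}-{iteration_counter:04d}.md"
```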
…x task-file leak

Critical fixes:
- Wrap json.loads in _run_trigger_rate with try/except JSONDecodeError
  (exits-0-but-invalid-JSON no longer crashes the entire optimization run)
- Move task_file assignment before json.dump so finally block can always
  clean up the temp file on disk

Also: document _run_claude_code soft-fail contract in rules-distill.py
…anup guard

- Add subprocess.TimeoutExpired to caught exceptions in variant generation
  loop (prevents unhandled crash when claude -p hits 360s timeout)
- Move temp_target.write_text() inside try/finally block so partial writes
  are cleaned up on disk-full or permission errors
- Fix import block ordering in test_eval_compare_optimization.py (ruff I001)
- Fix formatting in test_skill_eval_claude_code.py and eval_compare.py (ruff format)

@notque notque force-pushed the feat/agent-comparison-autoresearch-clean branch from aa853df to 926bedf on March 29, 2026 16:01
notque added 6 commits March 29, 2026 09:54
Add _run_behavioral_eval() to optimize_loop.py that runs
`claude -p "/do {query}"` and checks for ADR artifact creation,
enabling direct testing of /do's creation protocol compliance.

Trigger-rate optimization was proven inapplicable for /do (scored
0.0 across all 32 tasks) because /do is slash-invoked, not
description-discovered. Behavioral eval via headless /do is the
correct approach — confirmed that `claude -p "/do create..."` works
but does NOT produce ADRs, validating the compliance gap.

Changes:
- Add _run_behavioral_eval() with artifact snapshot/diff detection
- Add _is_behavioral_task() for eval_mode detection
- Update _validate_task_set() for behavioral task format
- Wire behavioral path into assess_target()
- Add DO NOT OPTIMIZE markers to /do SKILL.md (Phase 2-5 protected)
- Create 32-task benchmark set (16 positive, 16 negative, 60/40 split)
Add explicit Creation Request Detection block to Phase 1 CLASSIFY,
immediately before the Gate line. The block scans for creation verbs,
domain object targets, and implicit creation patterns, then flags the
request as [CREATION REQUEST DETECTED] so Phase 4 Step 0 is acknowledged
before routing decisions consume model attention.

This is ADR-133 Prong 2, Option A. Moving detection to Phase 1 addresses
the root cause: the creation protocol was buried in Phase 4 where it
competed with agent dispatch instructions and was frequently skipped.

Soft-warns when an Agent dispatch appears to be for a creation task but
no recent .adr-session.json is present (stale = >900s or missing).
Exit 0 only — never blocks. Prong 2 / Option B of ADR-133.
Three agents (kotlin-general-engineer, php-general-engineer,
swift-general-engineer) existed on disk but were missing from
agents/INDEX.json, making them invisible to the routing system.

Added all three entries with triggers, pairs_with, complexity, and
category sourced directly from each agent's frontmatter. Also fixes
the pre-existing golang-general-engineer-compact ordering bug as a
side effect of re-sorting the index alphabetically.

…meoutExpired

Two fixes to _run_behavioral_eval():
1. Default timeout 120s -> 240s: headless /do creation sessions frequently
   exceed 120s when they dispatch agents that write files, create plans, etc.
2. Check artifact glob after TimeoutExpired: the subprocess may have written
   artifacts before the timeout fired. The old code set triggered=False on
   any timeout, causing false FAIL for tasks that completed their artifact
   writes but ran over time.

E2E baseline results (6-task subset, 240s timeout):
  - Creation recall: 1/3 (33%) — implicit-create-rails passed (ADR-135 created)
  - Non-creation precision: 3/3 (100%)
  - build-agent-rust: genuine compliance gap (completed, no ADR)

1. behavioral eval: always print claude exit code (not only in verbose mode)
   — silent failures would produce phantom 50% accuracy, corrupting optimization
2. behavioral eval: clean up created artifacts between tasks to prevent
   stale before-snapshots in multi-round optimization runs
3. creation-protocol-enforcer: expand keyword set to match SKILL.md vocabulary
   — 'build a', 'add new', 'new feature', 'i need a/an', 'we need a/an'
   previously covered <50% of the benchmark creation queries
4. SKILL.md Phase 1: move [CREATION REQUEST DETECTED] output to the Gate
   condition so LLM cannot proceed to Phase 2 without acknowledging the flag