feat(agent-comparison): add autoresearch optimization review flow#205

Open
notque wants to merge 13 commits into main from feat/agent-comparison-autoresearch-clean
Conversation


@notque notque commented Mar 29, 2026

Summary

  • add an autoresearch optimization loop for agent-comparison with variant generation, scoring, and iteration artifacts
  • add optimization-report support to the eval viewer, including optimization-only presentation polish and snapshot export/review flow
  • add regression tests for optimizer safety and result loading, and keep the comprehensive-review skill description under the Codex length limit

Validation

  • pytest -q scripts/tests/test_agent_comparison_optimize_loop.py scripts/tests/test_eval_compare_optimization.py
  • python3 -m py_compile skills/skill-creator/scripts/eval_compare.py skills/agent-comparison/scripts/generate_variant.py skills/agent-comparison/scripts/optimize_loop.py
  • git diff --check
  • browser validation against a generated optimization report fixture with desktop/mobile screenshots and no runtime console errors

notque added 7 commits March 28, 2026 19:17
PR #204 was merged to main while this branch was being developed.
All conflicts resolved in favor of the clean rework versions (ours):
- SKILL.md: review/export approach over cherry-pick
- optimization-guide.md: snapshot review terminology
- eval_viewer.html: radio selection, setActivePage helper, optimization-only mode
- eval_compare.py: standalone is_optimization_data() validator
…view issues

- Migrate generate_variant.py and improve_description.py from Anthropic SDK
  to claude -p subprocess invocation
- Add beam search optimization with configurable width, candidates per parent,
  and frontier retention to optimize_loop.py
- Add beam search parameters display and empty-state UX in eval_viewer.html
- Update SKILL.md and optimization-guide.md for beam search documentation
- Migrate skill-eval run_loop and rules-distill to use claude -p
- Add test coverage for beam search, model flag omission, and claude -p flow

Fixes from review:
- Fix misplaced test_writes_pending_json_in_live_mode (back in TestFullPipeline)
- Remove dead round_keeps variable from optimize_loop.py
- Fix timeout mismatch (120s outer vs 300s inner → 360s outer)
- Clarify --max-iterations help text (rounds, not individual iterations)

Critical fixes:
- Temp file collision in beam search: embed iteration_counter in filename
- rules-distill.py: log errors on claude -p failure and JSONDecodeError
- _run_trigger_rate: always print subprocess errors, not just under --verbose
- _generate_variant_output: add cwd and env (strip CLAUDECODE)

Important fixes:
- _find_project_root: warn on silent cwd fallback in generate_variant and improve_description
- improve_description: warn when <new_description> tags not found
- search_strategy: emit "hill_climb" for single-path runs (beam_width=1, candidates=1)
- rules-distill: log exception in broad except clause
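The temp-file collision fix above can be sketched as follows. This is a minimal illustration, not the code from optimize_loop.py; the function name, glob-style filename, and directory layout are assumptions. The point is that with beam_width > 1, several candidates can be generated within the same second, so a timestamp alone can collide, while a monotonically increasing iteration counter makes each path unique.

```python
import time
from pathlib import Path

def variant_temp_path(tmp_dir: Path, iteration_counter: int) -> Path:
    """Collision-free temp filename for beam-search candidates.

    A timestamp alone is not unique when multiple candidates are created
    within one second; embedding the global iteration counter is.
    (Illustrative sketch; names are hypothetical.)
    """
    return tmp_dir / f"variant-{int(time.time())}-{iteration_counter:04d}.md"
```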
…x task-file leak

Critical fixes:
- Wrap json.loads in _run_trigger_rate with try/except JSONDecodeError
  (exits-0-but-invalid-JSON no longer crashes the entire optimization run)
- Move task_file assignment before json.dump so finally block can always
  clean up the temp file on disk

Also: document _run_claude_code soft-fail contract in rules-distill.py
…anup guard

- Add subprocess.TimeoutExpired to caught exceptions in variant generation
  loop (prevents unhandled crash when claude -p hits 360s timeout)
- Move temp_target.write_text() inside try/finally block so partial writes
  are cleaned up on disk-full or permission errors
- Fix import block ordering in test_eval_compare_optimization.py (ruff I001)
- Fix formatting in test_skill_eval_claude_code.py and eval_compare.py (ruff format)

@notque notque force-pushed the feat/agent-comparison-autoresearch-clean branch from aa853df to 926bedf on March 29, 2026 16:01
notque added 6 commits March 29, 2026 09:54
Add _run_behavioral_eval() to optimize_loop.py that runs
`claude -p "/do {query}"` and checks for ADR artifact creation,
enabling direct testing of /do's creation protocol compliance.

Trigger-rate optimization was proven inapplicable for /do (scored
0.0 across all 32 tasks) because /do is slash-invoked, not
description-discovered. Behavioral eval via headless /do is the
correct approach — confirmed that `claude -p "/do create..."` works
but does NOT produce ADRs, validating the compliance gap.

Changes:
- Add _run_behavioral_eval() with artifact snapshot/diff detection
- Add _is_behavioral_task() for eval_mode detection
- Update _validate_task_set() for behavioral task format
- Wire behavioral path into assess_target()
- Add DO NOT OPTIMIZE markers to /do SKILL.md (Phase 2-5 protected)
- Create 32-task benchmark set (16 positive, 16 negative, 60/40 split)
Add explicit Creation Request Detection block to Phase 1 CLASSIFY,
immediately before the Gate line. The block scans for creation verbs,
domain object targets, and implicit creation patterns, then flags the
request as [CREATION REQUEST DETECTED] so Phase 4 Step 0 is acknowledged
before routing decisions consume model attention.

This is ADR-133 Prong 2, Option A. Moving detection to Phase 1 addresses
the root cause: the creation protocol was buried in Phase 4 where it
competed with agent dispatch instructions and was frequently skipped.

Soft-warns when an Agent dispatch appears to be for a creation task but
no recent .adr-session.json is present (stale = >900s or missing).
Exit 0 only — never blocks. Prong 2 / Option B of ADR-133.
Three agents (kotlin-general-engineer, php-general-engineer,
swift-general-engineer) existed on disk but were missing from
agents/INDEX.json, making them invisible to the routing system.

Added all three entries with triggers, pairs_with, complexity, and
category sourced directly from each agent's frontmatter. Also fixes
the pre-existing golang-general-engineer-compact ordering bug as a
side effect of re-sorting the index alphabetically.

…meoutExpired

Two fixes to _run_behavioral_eval():
1. Default timeout 120s -> 240s: headless /do creation sessions frequently
   exceed 120s when they dispatch agents that write files, create plans, etc.
2. Check artifact glob after TimeoutExpired: the subprocess may have written
   artifacts before the timeout fired. The old code set triggered=False on
   any timeout, causing false FAIL for tasks that completed their artifact
   writes but ran over time.

E2E baseline results (6-task subset, 240s timeout):
  - Creation recall: 1/3 (33%) — implicit-create-rails passed (ADR-135 created)
  - Non-creation precision: 3/3 (100%)
  - build-agent-rust: genuine compliance gap (completed, no ADR)

1. behavioral eval: always print claude exit code (not only in verbose mode)
   — silent failures would produce phantom 50% accuracy, corrupting optimization
2. behavioral eval: clean up created artifacts between tasks to prevent
   stale before-snapshots in multi-round optimization runs
3. creation-protocol-enforcer: expand keyword set to match SKILL.md vocabulary
   — 'build a', 'add new', 'new feature', 'i need a/an', 'we need a/an'
   previously covered <50% of the benchmark creation queries
4. SKILL.md Phase 1: move [CREATION REQUEST DETECTED] output to the Gate
   condition so LLM cannot proceed to Phase 2 without acknowledging the flag