Evaluation framework for AI coding agents (Claude Code, Codex, OpenCode). Runs Markdown test scenarios, judges the results automatically, and compares tool integration methods (MCP vs CLI vs mcpc).
Built on Apify — runs as a serverless Actor in Docker.
- Agent Evals Runner — runs one scenario with one agent, returns structured verdicts + metrics + trajectory. See its README for quickstart, scenario format, checkpoint syntax, and output reference.
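
To make "structured verdicts + metrics + trajectory" more concrete, here is a minimal TypeScript sketch of what such a result could look like and how a caller might summarize it. Every name in it (`Verdict`, `RunResult`, `passRate`, the field names, and the sample values) is hypothetical and chosen for illustration only; the runner's actual output schema is the one described in its README.

```typescript
// Hypothetical shapes for illustration only; NOT the runner's actual schema.
type Verdict = {
  checkpoint: string; // what was checked
  passed: boolean;    // judge's decision
  reason: string;     // short justification from the judge
};

type RunResult = {
  scenario: string;   // scenario identifier (made-up example value below)
  agent: string;      // which coding agent ran it
  verdicts: Verdict[];
  metrics: { durationMs: number; toolCalls: number };
  trajectory: string[]; // ordered log of agent steps
};

// Fraction of checkpoints that passed; 0 when there are no verdicts.
function passRate(result: RunResult): number {
  if (result.verdicts.length === 0) return 0;
  const passed = result.verdicts.filter((v) => v.passed).length;
  return passed / result.verdicts.length;
}

// Made-up sample data, not real eval output.
const example: RunResult = {
  scenario: "example-scenario",
  agent: "example-agent",
  verdicts: [
    { checkpoint: "called the expected tool", passed: true, reason: "tool call found" },
    { checkpoint: "produced the expected file", passed: false, reason: "file missing" },
  ],
  metrics: { durationMs: 1200, toolCalls: 3 },
  trajectory: ["read task", "call tool", "write answer"],
};

console.log(passRate(example)); // 0.5
```

Keeping verdicts as a flat list of per-checkpoint records, as assumed here, makes it straightforward to aggregate results across agents or integration methods.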
shared/src/       Shared library (types, parsers, judge, agent adapters, OTel)
actors/runner/    Apify Actor (the eval runner)
scenarios/        21 ready-to-use test scenarios
examples/         9 example input JSON files
docs/             Architecture decisions, research, build log
- Runner README — quickstart, scenario format, checkpoint syntax, output reference
- How we built it — 7-day build timeline
- Architecture decisions — why standalone Docker, TypeScript, markdown scenarios
- Implementation plan — original phased plan with user stories