DaveHanns/apify-evals

 
 

Agent Evals

Evaluation framework for AI coding agents (Claude Code, Codex, OpenCode). It runs Markdown test scenarios, judges the results automatically, and compares tool integration methods (MCP vs. CLI vs. mcpc).

Built on Apify; runs as a serverless Actor in Docker.

Actors

  • Agent Evals Runner: runs one scenario with one agent and returns structured verdicts, metrics, and the agent's trajectory. See its README for the quickstart, scenario format, checkpoint syntax, and output reference.
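
As a rough illustration of how per-run results could be aggregated to compare integration methods, here is a minimal TypeScript sketch. Every type and field name in it (`RunResult`, `Verdict`, `method`, `durationMs`, `passRate`, `compareMethods`) is a hypothetical assumption made for this example; the Runner's actual output schema is documented in its own README and may look quite different.

```typescript
// Hypothetical shapes: these field names are illustrative assumptions,
// not the Runner's documented output schema.
type Method = "mcp" | "cli" | "mcpc";

interface Verdict {
  checkpoint: string; // name of the checkpoint being judged
  passed: boolean;    // judge's decision for that checkpoint
}

interface RunResult {
  scenario: string;
  agent: string;
  method: Method;
  verdicts: Verdict[];
  metrics: { durationMs: number };
  trajectory: string[]; // ordered agent steps, simplified to strings here
}

// Fraction of checkpoints that passed in a single run (0 when there are none).
function passRate(run: RunResult): number {
  if (run.verdicts.length === 0) return 0;
  const passed = run.verdicts.filter((v) => v.passed).length;
  return passed / run.verdicts.length;
}

// Mean pass rate per integration method across many runs.
function compareMethods(runs: RunResult[]): Map<Method, number> {
  const sums = new Map<Method, { total: number; count: number }>();
  for (const run of runs) {
    const entry = sums.get(run.method) ?? { total: 0, count: 0 };
    entry.total += passRate(run);
    entry.count += 1;
    sums.set(run.method, entry);
  }
  const means = new Map<Method, number>();
  sums.forEach(({ total, count }, method) => {
    means.set(method, total / count);
  });
  return means;
}
```

Averaging per-run pass rates (rather than pooling all checkpoints) gives every scenario equal weight regardless of how many checkpoints it defines; either choice would be reasonable depending on what the comparison is meant to show.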

Project structure

shared/src/          Shared library (types, parsers, judge, agent adapters, OTel)
actors/runner/       Apify Actor — the eval runner
scenarios/           21 ready-to-use test scenarios
examples/            9 example input JSON files
docs/                Architecture decisions, research, build log

Docs

About

Agent Evals for MCP, CLI, and in between

Languages

  • TypeScript 96.8%
  • Shell 2.0%
  • Other 1.2%