Important
We are refactoring this benchmark. The new version will be available at https://github.com/sys-intelligence/aebench.
ArtEvalBench is a benchmark for evaluating AI agents on Artifact Evaluation (AE) tasks (why artifact evaluation?). We believe that, despite the complexity of the AE process, AI agents can be successfully trained to automatically evaluate artifacts that accompany research papers.
To train and improve AE agents in a principled way, we introduce ArtEvalBench, a curated collection of artifacts accompanying peer-reviewed papers. To ensure a fair comparison, we include artifacts that have already been evaluated in an official AE process and awarded all three badges by the committee. Each entry includes the original artifact (instructions, code, scripts, datasets/benchmarks, etc.), the original paper, and a collection of "oracle" scripts that define objective checkpoints at four canonical stages: environment setup, build/install, benchmark preparation, and experiment execution.
ArtEvalBench is designed to evaluate agents on capability (which stages they complete), efficiency (wall-clock time and intervention count), and fidelity (how closely reproduced results match those reported).
To check those capabilities, each artifact includes four oracle scripts that encode minimal, verifiable success criteria for each of the four stages. The oracles are invoked non-interactively and must be idempotent. Conceptually, these four stages correspond to:
- Environment setup. Verifies the presence and versions of required tools, libraries, and other dependencies; confirms hardware availability when applicable; and checks that configurations are portable rather than hardcoded or tied to a specific machine.
- Build (and install). Confirms a complete build (or install) from a specified version, with the expected binaries/modules present; runs tests, when available, or simple validation commands such as invoking `--help` or equivalent.
- Benchmark preparation. Asserts that datasets/benchmarks are present and checksums match; verifies that necessary third-party tools compile and that the artifact's instrumentation/monitoring hooks are enabled, if applicable.
- Experiment runs. Executes each experiment according to the authors' guidelines; checks that the artifact produces the expected metrics, logs, files, figures, etc.; provides an initial assessment relative to specified tolerance bounds.
Adding to the benchmark requires users to add a new entry to the ArtEvalBench schema file, where:

- `artifact_id` is a unique identifier for the artifact;
- `artifact_dir` is the artifact directory within `data/benchmark/`;
- `artifact_readme` is the path to the artifact's README file containing the step-by-step guide for preparing, installing, and running experiments;
- `artifact_url` is the URL of the original artifact;
- `evaluator` is the path to the evaluator's `main.py` entrypoint;
- `expected_score` is the total expected score for this artifact, which defaults to 4 since the agent is evaluated on successfully completing the four canonical AE stages (note: we encourage users not to change this value unless they opt for another universal metric for artifact evaluation);
- `docker_env` (optional) points to a Docker image on Docker Hub.
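For illustration, a schema entry with the required fields might look like the sketch below; all values are placeholders, not real artifacts:

```json
{"artifact_id": "my-artifact", "artifact_dir": "my-artifact", "artifact_readme": "README.md", "artifact_url": "https://github.com/example/my-artifact", "evaluator": "python _agent_eval/main.py", "expected_score": 4}
```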
It also requires users to extend the artifact they plan to add with a self-contained evaluator in an _agent_eval/ directory. This evaluator encodes minimal, objective success criteria for the four canonical AE stages and is what the benchmark actually calls.
Using WASABI's agent evaluator as a template, users will therefore need to extend the artifact with:

- An `_agent_eval/` package, which contains all benchmark-specific code and does not modify your original artifact logic.
- One oracle module per stage. In this benchmark, each stage is typically implemented as a derived oracle class that overrides `requirements()` and returns an ordered list of programmatic checks (requirements). The base oracle handles running the requirements, producing a structured report, printing a PASS/FAIL summary, and returning `True`/`False` from `run(verbose=...)`.
A typical _agent_eval/ layout looks like:
```
_agent_eval/
├── main.py
├── oracle_env_setup.py
├── oracle_build_install.py
├── oracle_prep_benchmark.py
├── oracle_run_experiments.py
└── refs/
    ├── datasets.ref.json
    └── results.ref.json
```
The refs/ directory stores machine-checkable ground truth (e.g., dataset manifests/checksums, expected metric tables, or summaries of deterministic outputs) used by benchmark-prep and experiment-runs checks.
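The exact shape of these reference files is up to the artifact author. As one illustrative (not mandated) sketch, a dataset manifest could pair each file with its expected checksum:

```json
{
  "datasets": [
    {"path": "data/inputs.tar.gz", "sha256": "<expected-checksum>"},
    {"path": "data/traces/run1.log", "sha256": "<expected-checksum>"}
  ]
}
```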
Here is a simplified environment setup oracle (one dependency/version requirement):
```python
# _agent_eval/oracle_env_setup.py
import sys
from collections.abc import Sequence

from evaluator.oracle_env_setup_primitives import (
    DependencyVersionRequirement,
    OracleEnvSetupBase,
    VersionCompare,
)


class OracleEnvSetup(OracleEnvSetupBase):
    def __init__(self, *, config, logger):
        super().__init__(logger=logger)
        self._config = config

    def requirements(self) -> Sequence[DependencyVersionRequirement]:
        return (
            DependencyVersionRequirement(
                name="python_version",
                cmd=(sys.executable, "--version"),
                required_version=(3, 10, 0),
                compare=VersionCompare.GEQ,
                timeout_seconds=5.0,
            ),
        )
```

Also, note that each oracle should be:
- Non-interactive, meaning not expecting input or prompt interactions.
- Idempotent, meaning safe to run multiple times without side-effects.
- Time-bounded, meaning every command has a timeout.
- Binary, meaning it returns pass/fail (as `True`/`False`) for the stage.
For more details, check out this how-to guide.
- A single `main.py` orchestrator, the entrypoint used by ArtEvalBench, which constructs an `EntryConfig`, invokes the four oracles in order, and returns an overall score (an integer between 0 and 4):

  ```python
  # _agent_eval/main.py
  import os
  from pathlib import Path

  from evaluator.utils import EntryConfig, LoggerConfig, get_logger, record_result
  from oracle_env_setup import OracleEnvSetup
  from oracle_build_install import OracleBuildInstall
  from oracle_prep_benchmark import OraclePrepBenchmark
  from oracle_run_experiments import OracleRunExperiments

  CONFIG = EntryConfig(
      name="my-artifact",
      home_dir=Path.home() / "artevalbench" / "my-artifact",
      repository_paths={
          "my-artifact": Path.home() / "artevalbench" / "my-artifact" / "repo",
      },
      results_paths={
          "results": Path.home() / "artevalbench" / "my-artifact" / "repo" / "outputs" / "results.json",
      },
      ground_truth_paths={
          "datasets": Path.home() / "artevalbench" / "my-artifact" / "_agent_eval" / "refs" / "datasets.ref.json",
          "results": Path.home() / "artevalbench" / "my-artifact" / "_agent_eval" / "refs" / "results.ref.json",
      },
      similarity_ratio=0.75,
  )


  def main(argv: list[str]) -> int:
      verbose = "--verbose" in argv
      logger = get_logger(
          LoggerConfig(root_name=os.environ.get("EVAL_LOGGER_NAME", "ARTEVAL-EVAL"))
      )

      results: dict[str, int] = {}
      score = 0
      score += record_result(
          results,
          "env_setup",
          OracleEnvSetup(config=CONFIG, logger=logger).run(verbose=verbose),
      )
      score += record_result(
          results,
          "build_install",
          OracleBuildInstall(config=CONFIG, logger=logger).run(verbose=verbose),
      )
      score += record_result(
          results,
          "prep_benchmark",
          OraclePrepBenchmark(config=CONFIG, logger=logger).run(verbose=verbose),
      )
      score += record_result(
          results,
          "run_experiments",
          OracleRunExperiments(config=CONFIG, logger=logger).run(verbose=verbose),
      )

      logger.info("Stage scores: %s", results)
      logger.info("FINAL_SCORE %d/4", score)
      return score


  if __name__ == "__main__":
      raise SystemExit(main([]))
  ```
Note that the ArtEvalBench framework will invoke `main.py` to run the oracles in order, compute the agent's score for this particular artifact, and store it in a JSON file that aggregates these outcomes for the entire benchmark.
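For intuition, `record_result` (imported from `evaluator.utils` in the orchestrator) can be thought of as recording a boolean stage outcome and returning its 0/1 score contribution. The sketch below is an assumption about its behavior, not the shipped implementation:

```python
def record_result(results: dict[str, int], stage: str, passed: bool) -> int:
    """Store a stage's pass/fail outcome and return its score contribution (1 or 0)."""
    results[stage] = int(bool(passed))
    return results[stage]
```

Summing these contributions across the four stages yields the final 0-4 score.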
To run the benchmark:

- Execute the `run.sh` script with your model:

  ```
  ./run.sh <model_name>
  # Example: ./run.sh claude-sonnet-4-5-20250929
  ```

- Configure your LLM endpoint in `env.toml`:
  - For Azure/OpenAI models: set `AZURE_API_KEY`, `AZURE_API_BASE`, `AZURE_API_VERSION`
  - For Anthropic models: set `ANTHROPIC_API_KEY`
  - For self-hosted models: configure `OPENAI_API_TYPE` and `OPENAI_BASE_URL`
- Results will be saved to `outputs/` with timestamp and model information.
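An `env.toml` for an Azure endpoint might then look like the sketch below. Only the key names come from the list above; the flat file layout and all values are assumptions:

```toml
# Placeholder values; replace with your own endpoint details
AZURE_API_KEY = "<your-api-key>"
AZURE_API_BASE = "https://<your-resource>.openai.azure.com"
AZURE_API_VERSION = "<api-version>"
```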
The benchmark supports multiple AI agents:
- Claude Code: Anthropic's code assistant
- Mini SWE Agent: A compact version of the SWE-agent assistant
- OpenHands: Open-source coding agent
- ae_agent: Claude Agent SDK–based agent (same logic as the standalone artifact-agent repo), with full support for host/Docker, interactive mode, Skill, Sub-agent, per-task timeout, GPU, and optional container sync/commit/stop.
To add your own agent to the benchmark, see add_agents.md.
When using the ae_agent (`-a ae_agent` or `-a ae-agent`), you can pass the following options via the command line and/or the task JSONL.
Command-line arguments
| Argument | Description |
|---|---|
| `-i, --input_file` | Input JSONL file with tasks (default: `./data/benchmark/arteval_tasks.jsonl`). |
| `-o, --save_path` | Directory for results (default: `./outputs/ae_<model>_ae-agent_<timestamp>`). |
| `-a, --agent` | Agent name; use `ae_agent` or `ae-agent` for this agent. |
| `-m, --model_name` | Model name (e.g. `claude-sonnet-4-5-20250929`). |
| `--interactive` | After the task completes, keep a session open so you can give more instructions (requires a TTY). In Docker mode the runner is executed in the foreground via `docker exec -it`. |
| `--enable-skill` | Enable Claude Agent SDK Skill (load from `~/.claude/skills/` and `.claude/skills/`). |
| `--enable-subagent` | Enable Claude Agent SDK Sub-agent (Task tool). |
JSONL task fields (per line)
| Field | Description |
|---|---|
| `artifact_id` | Unique task identifier. |
| `artifact_dir` | Artifact directory name (relative to the JSONL file's directory). |
| `artifact_readme` | Path to the README or task description file (relative to the artifact root). |
| `artifact_url` | Optional. Git clone URL; used when `artifact_dir` is missing or the path does not exist. |
| `env` | `"local"` for host; Docker image name (e.g. `bastoica/ae-agent-ubuntu24.04:latest`) for Docker. |
| `evaluator` | Command to run after the agent (e.g. `python _agent_eval/main.py`). |
| `expected_score` | Expected score for this artifact (default 4). |
| `timeout` | Optional. Per-task timeout in seconds or milliseconds (see utils: values < 86400 are seconds, else milliseconds). |
| `gpu` | Optional. When `true`, pass `--gpus all` to Docker (Docker mode only). |
| `interactive` | Optional. When `true`, enable interactive mode for this task (overrides the CLI default). |
| `enable_skill` | Optional. When `true`, enable Skill for this task. |
| `enable_subagent` | Optional. When `true`, enable Sub-agent for this task. |
| `keep_container` | Optional. When `false` (the default for ae_agent), after the run the workspace is synced from the container to the host, the container is committed as an image, and the container is stopped. When `true`, the container is left running for inspection. |
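The `timeout` convention (values below 86400 are interpreted as seconds, larger values as milliseconds) can be sketched as a small normalizer; the helper name below is hypothetical, and the authoritative logic lives in the repository's utils:

```python
SECONDS_THRESHOLD = 86400  # one day; values at or above this are treated as milliseconds


def normalize_timeout(value: float) -> float:
    """Return the per-task timeout in seconds under the documented convention."""
    return float(value) if value < SECONDS_THRESHOLD else value / 1000.0
```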
Examples
```
# Host mode, default options
python src/main.py -i ./data/benchmark/arteval_tasks.jsonl -a ae_agent -o ./outputs/run1

# With interactive mode (TTY required for Docker)
python src/main.py --interactive -i ./data/benchmark/arteval_tasks.jsonl -a ae_agent -o ./outputs/run2

# Enable Skill and Sub-agent
python src/main.py --enable-skill --enable-subagent -i ./data/benchmark/arteval_tasks.jsonl -a ae_agent -o ./outputs/run3
```

Outputs (when using ae_agent)
Results are written under the given `save_path`:

- `result.jsonl` — One JSON object per task (task_id, status, score, agent_run_results, etc.).
- `avg_score.json` — Benchmark summary (final_score, total_tasks).
- `ae_report_<artifact_id>.md` — Per-task report (status, project path, log file, agent summary, and optional Docker image instructions).
- `summary.json` — Total and successful task counts and success rate (same format as the standalone artifact-agent).
- When running via the benchmark entry, log paths and agent summary are filled from available data; standalone `python -m ae_agent.main` also produces `ae_log_<artifact_id>.log`.
Docker + interactive
For Docker tasks with interactive: true (or --interactive), the benchmark runs the agent in the foreground via docker exec -it so you can interact in the same terminal. This requires a real TTY (e.g. running python src/main.py ... in a terminal, not under CI or with redirected stdin). If stdin is not a TTY, the run falls back to non-interactive (background runner) and a warning is logged.
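The TTY fallback described above can be sketched as follows (an illustrative helper, not the benchmark's actual code):

```python
import sys


def resolve_interactive(requested: bool) -> bool:
    """Grant interactive mode only when stdin is a real TTY; otherwise warn and fall back."""
    if requested and not sys.stdin.isatty():
        print("warning: stdin is not a TTY; falling back to non-interactive mode")
        return False
    return requested
```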