ArtEvalBench (Archived)

Important

We are refactoring this benchmark. The new one will be available at https://github.com/sys-intelligence/aebench.

ArtEvalBench is a benchmark for evaluating AI agents against Artifact Evaluation (AE) tasks (why artifact evaluation?). We believe that, despite the complexity of the AE process, AI agents can be successfully trained to automatically evaluate artifacts that accompany research papers.

Contributor's guide

» Overview and high-level structure

To train and improve AE agents in a principled way, we introduce ArtEvalBench, a curated collection of artifacts accompanying peer-reviewed papers. To ensure a fair comparison, we include artifacts that have already been evaluated in an official AE process and awarded all three badges by the committee. Each entry includes the original artifact (instructions, code, scripts, datasets/benchmarks, etc.), the original paper, and a collection of "oracle" scripts that define objective checkpoints at four canonical stages: environment setup, build/install, benchmark preparation, and experiment execution.

ArtEvalBench is designed to evaluate agents on capability (which stages they complete), efficiency (wall-clock time and intervention count), and fidelity (how closely reproduced results match those reported).

To check those capabilities, each artifact includes four oracle scripts that encode minimal, verifiable success criteria for each of the four stages. The oracles are invoked non-interactively and must be idempotent. Conceptually, these four stages correspond to:

  1. Environment setup: verifies the presence and versions of required tools, libraries, and other dependencies; confirms hardware availability when applicable; and checks that configurations are portable rather than hardcoded or tied to a specific machine.
  2. Build (and install) the artifact: confirms a complete build (or install) from a specified version, with the expected binaries/modules present; runs tests, when available, or simple validation commands such as invoking --help or equivalent.
  3. Benchmark preparation: asserts that datasets/benchmarks are present and their checksums match; verifies that necessary third-party tools compile and that the artifact's instrumentation/monitoring hooks are enabled, if applicable.
  4. Experiment runs: executes each experiment according to the authors' guidelines; checks that the artifact produces the expected metrics, logs, files, figures, etc.; and provides an initial assessment relative to specified tolerance bounds (a minimal illustration follows this list).
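
For illustration, the experiment-runs stage might compare a reproduced metric against the reported value within a tolerance bound. The sketch below is framework-free and hypothetical: the metric name, file contents, and the interpretation of the tolerance ratio are assumptions, not the benchmark's actual primitives.

# Hypothetical sketch of a stage-4 fidelity check (not the benchmark's primitives).
import json
from pathlib import Path

def within_tolerance(reproduced: float, reported: float, ratio: float = 0.75) -> bool:
    # Pass if the reproduced value deviates from the reported one by at most (1 - ratio).
    if reported == 0:
        return reproduced == 0
    return abs(reproduced - reported) / abs(reported) <= (1.0 - ratio)

def check_throughput(results_path: Path, reference_path: Path) -> bool:
    # Both files are assumed to contain a JSON object with a "throughput_ops" number.
    reproduced = json.loads(results_path.read_text())["throughput_ops"]
    reported = json.loads(reference_path.read_text())["throughput_ops"]
    return within_tolerance(reproduced, reported)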

» Adding a new artifact

Adding an artifact to the benchmark requires including a new entry in the ArtEvalBench schema file (a hypothetical example follows this list), where:

  • artifact_id is a unique identifier for the artifact;
  • artifact_dir is the artifact directory within data/benchmark/;
  • artifact_readme is the path to the artifact's README file that contains the step-by-step guide for preparing, installing, and running experiments;
  • artifact_url is the URL to the original artifact;
  • evaluator is the path to the evaluator's main.py entrypoint;
  • expected_score is the total expected score for this artifact, which defaults to 4 because the agent is evaluated on successfully completing the four canonical AE stages (note: we encourage users not to change this value unless they opt for another universal metric for artifact evaluation);
  • docker_env (optional) points to a Docker image on Docker Hub.
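
A hypothetical entry could look like the following (values are placeholders; check the existing schema file for the exact format, e.g. whether each entry is a JSON object on a single JSONL line):

{
  "artifact_id": "my-artifact",
  "artifact_dir": "my-artifact",
  "artifact_readme": "README.md",
  "artifact_url": "https://github.com/example/my-artifact",
  "evaluator": "_agent_eval/main.py",
  "expected_score": 4,
  "docker_env": "bastoica/ae-agent-ubuntu24.04:latest"
}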

Users also need to extend the artifact they plan to add with a self-contained evaluator in an _agent_eval/ directory. This evaluator encodes minimal, objective success criteria for the four canonical AE stages and is what the benchmark actually calls.

Using WASABI's agent evaluator as a template, users will therefore need to extend the artifact with:

  1. An _agent_eval/ package which contains all benchmark-specific code and does not modify your original artifact logic.

  2. One oracle module per stage. In this benchmark, each stage is typically implemented as a derived oracle class that overrides requirements() and returns an ordered list of programmatic checks (requirements). The base oracle handles running requirements, producing a structured report, printing a PASS/FAIL summary, and returning True/False from run(verbose=...).

A typical _agent_eval/ layout looks like:

_agent_eval/
├── main.py
├── oracle_env_setup.py
├── oracle_build_install.py
├── oracle_prep_benchmark.py
├── oracle_run_experiments.py
└── refs/
    ├── datasets.ref.json
    └── results.ref.json

The refs/ directory stores machine-checkable ground truth (e.g., dataset manifests/checksums, expected metric tables, or summaries of deterministic outputs) used by benchmark-prep and experiment-runs checks.
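
As an illustration, a benchmark-preparation check could verify dataset files against a manifest in refs/datasets.ref.json. The sketch below is hypothetical and assumes the manifest maps relative file paths to SHA-256 digests; your artifact's actual ref format may differ.

# Hypothetical dataset checksum check.
# Assumed manifest format: {"relative/path/to/file": "<sha256 hex digest>"}.
import hashlib
import json
from pathlib import Path

def datasets_match_manifest(artifact_root: Path, manifest_path: Path) -> bool:
    manifest = json.loads(manifest_path.read_text())
    for rel_path, expected_sha256 in manifest.items():
        data_file = artifact_root / rel_path
        if not data_file.is_file():
            return False
        if hashlib.sha256(data_file.read_bytes()).hexdigest() != expected_sha256:
            return False
    return True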

Here is a simplified environment setup oracle (one dependency/version requirement):

# _agent_eval/oracle_env_setup.py
import sys
from collections.abc import Sequence

from evaluator.oracle_env_setup_primitives import (
    DependencyVersionRequirement,
    OracleEnvSetupBase,
    VersionCompare,
)

class OracleEnvSetup(OracleEnvSetupBase):
    def __init__(self, *, config, logger):
        super().__init__(logger=logger)
        self._config = config

    def requirements(self) -> Sequence[DependencyVersionRequirement]:
        # Single requirement: the Python interpreter must be version 3.10.0 or newer,
        # checked by running `python --version` with a 5-second timeout.
        return (
            DependencyVersionRequirement(
                name="python_version",
                cmd=(sys.executable, "--version"),
                required_version=(3, 10, 0),
                compare=VersionCompare.GEQ,
                timeout_seconds=5.0,
            ),
        )

Also, note that each oracle should be:

  • Non-interactive, meaning not expecting input or prompt interactions.
  • Idempotent, meaning safe to run multiple times without side-effects.
  • Time-bounded, meaning every command has a timeout.
  • Binary, meaning it returns pass/fail (as True/False) for the stage.

For more details, check out this how-to guide.

  3. A single main.py orchestrator, the entrypoint used by ArtEvalBench, which constructs an EntryConfig, invokes the four oracles in order, and returns an overall score (an integer between 0 and 4):

    # _agent_eval/main.py
    import os
    import sys
    from pathlib import Path
    
    from evaluator.utils import EntryConfig, LoggerConfig, get_logger, record_result
    
    from oracle_env_setup import OracleEnvSetup
    from oracle_build_install import OracleBuildInstall
    from oracle_prep_benchmark import OraclePrepBenchmark
    from oracle_run_experiments import OracleRunExperiments
    
    CONFIG = EntryConfig(
        name="my-artifact",
        home_dir=Path.home() / "artevalbench" / "my-artifact",
        repository_paths={
            "my-artifact": Path.home() / "artevalbench" / "my-artifact" / "repo",
        },
        results_paths={
            "results": Path.home() / "artevalbench" / "my-artifact" / "repo" / "outputs" / "results.json",
        },
        ground_truth_paths={
            "datasets": Path.home() / "artevalbench" / "my-artifact" / "_agent_eval" / "refs" / "datasets.ref.json",
            "results": Path.home() / "artevalbench" / "my-artifact" / "_agent_eval" / "refs" / "results.ref.json",
        },
        similarity_ratio=0.75,
    )
    
    def main(argv: list[str]) -> int:
        verbose = "--verbose" in argv
        logger = get_logger(
            LoggerConfig(root_name=os.environ.get("EVAL_LOGGER_NAME", "ARTEVAL-EVAL"))
        )
    
        results: dict[str, int] = {}
        score = 0
    
        score += record_result(
            results, "env_setup",
            OracleEnvSetup(config=CONFIG, logger=logger).run(verbose=verbose),
        )
        score += record_result(
            results, "build_install",
            OracleBuildInstall(config=CONFIG, logger=logger).run(verbose=verbose),
        )
        score += record_result(
            results, "prep_benchmark",
            OraclePrepBenchmark(config=CONFIG, logger=logger).run(verbose=verbose),
        )
        score += record_result(
            results, "run_experiments",
            OracleRunExperiments(config=CONFIG, logger=logger).run(verbose=verbose),
        )
    
        logger.info("Stage scores: %s", results)
        logger.info("FINAL_SCORE %d/4", score)
        return score
    
    if __name__ == "__main__":
        raise SystemExit(main(sys.argv[1:]))

    Note that the ArtEvalBench framework will invoke main.py to run the oracles in order, compute the agent's score for this particular artifact, and store it into a JSON file that aggregates these outcomes for the entire benchmark.

Benchmark Setup

» Run the benchmark

To run the benchmark:

  1. Execute the run.sh script with your model:
./run.sh <model_name>
# Example: ./run.sh claude-sonnet-4-5-20250929
  2. Configure your LLM endpoint in env.toml (a minimal sketch follows this list):
  • For Azure/OpenAI models: Set AZURE_API_KEY, AZURE_API_BASE, AZURE_API_VERSION
  • For Anthropic models: Set ANTHROPIC_API_KEY
  • For self-hosted models: Configure OPENAI_API_TYPE and OPENAI_BASE_URL
  3. Results will be saved to outputs/ with timestamp and model information
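
A minimal env.toml sketch, assuming flat top-level keys (the exact layout may differ, so follow the env.toml template shipped with the repository) and placeholder values:

# Set only the variables that apply to your provider.
ANTHROPIC_API_KEY = "<your-anthropic-key>"      # Anthropic models
# AZURE_API_KEY = "<your-azure-key>"            # Azure/OpenAI models
# AZURE_API_BASE = "<your-azure-endpoint>"
# AZURE_API_VERSION = "<api-version>"
# OPENAI_API_TYPE = "<api-type>"                # self-hosted models
# OPENAI_BASE_URL = "<your-endpoint-url>"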

» Supported Agents

The benchmark supports multiple AI agents:

  • Claude Code: Anthropic's code assistant
  • Mini SWE Agent: A compact version of the SWE-agent assistant
  • OpenHands: Open-source coding agent
  • ae_agent: Claude Agent SDK–based agent (same logic as the standalone artifact-agent repo), with full support for host/Docker, interactive mode, Skill, Sub-agent, per-task timeout, GPU, and optional container sync/commit/stop.

To add your own agent to the benchmark, see add_agents.md.

» ae_agent usage and options

When using the ae_agent (-a ae_agent or -a ae-agent), you can pass the following from the command line and/or the task JSONL.

Command-line arguments

  • -i, --input_file: Input JSONL file with tasks (default: ./data/benchmark/arteval_tasks.jsonl).
  • -o, --save_path: Directory for results (default: ./outputs/ae_<model>_ae-agent_<timestamp>).
  • -a, --agent: Agent name; use ae_agent or ae-agent for this agent.
  • -m, --model_name: Model name (e.g. claude-sonnet-4-5-20250929).
  • --interactive: After the task completes, keep a session open so you can give more instructions (requires a TTY). In Docker mode the runner is executed in the foreground via docker exec -it.
  • --enable-skill: Enable Claude Agent SDK Skill (load from ~/.claude/skills/ and .claude/skills/).
  • --enable-subagent: Enable Claude Agent SDK Sub-agent (Task tool).

JSONL task fields (per line)

  • artifact_id: Unique task identifier.
  • artifact_dir: Artifact directory name (relative to the JSONL file's directory).
  • artifact_readme: Path to the README or task description file (relative to the artifact root).
  • artifact_url: Optional. Git clone URL; used when artifact_dir is missing or the path does not exist.
  • env: "local" for host; Docker image name (e.g. bastoica/ae-agent-ubuntu24.04:latest) for Docker.
  • evaluator: Command to run after the agent (e.g. python _agent_eval/main.py).
  • expected_score: Expected score for this artifact (default 4).
  • timeout: Optional. Per-task timeout in seconds or milliseconds (see utils: values < 86400 are seconds, else milliseconds).
  • gpu: Optional. When true, pass --gpus all to Docker (Docker mode only).
  • interactive: Optional. When true, enable interactive mode for this task (overrides CLI default).
  • enable_skill: Optional. When true, enable Skill for this task.
  • enable_subagent: Optional. When true, enable Sub-agent for this task.
  • keep_container: Optional. When false (default for ae_agent), after the run the workspace is synced from the container to the host, the container is committed as an image, and the container is stopped. When true, the container is left running for inspection.
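
A hypothetical task entry (shown pretty-printed for readability; in the JSONL file each task is a single line, and values are placeholders):

{
  "artifact_id": "my-artifact",
  "artifact_dir": "my-artifact",
  "artifact_readme": "README.md",
  "env": "bastoica/ae-agent-ubuntu24.04:latest",
  "evaluator": "python _agent_eval/main.py",
  "expected_score": 4,
  "timeout": 7200,
  "gpu": true
}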

Examples

# Host mode, default options
python src/main.py -i ./data/benchmark/arteval_tasks.jsonl -a ae_agent -o ./outputs/run1

# With interactive mode (TTY required for Docker)
python src/main.py --interactive -i ./data/benchmark/arteval_tasks.jsonl -a ae_agent -o ./outputs/run2

# Enable Skill and Sub-agent
python src/main.py --enable-skill --enable-subagent -i ./data/benchmark/arteval_tasks.jsonl -a ae_agent -o ./outputs/run3

Outputs (when using ae_agent)

Results are written under the given save_path:

  • result.jsonl — One JSON object per task (task_id, status, score, agent_run_results, etc.).
  • avg_score.json — Benchmark summary (final_score, total_tasks).
  • ae_report_<artifact_id>.md — Per-task report (status, project path, log file, agent summary, and optional Docker image instructions).
  • summary.json — Total and successful task counts and success rate (same format as standalone artifact-agent).
  • When running via the benchmark entry, log paths and agent summary are filled from available data; standalone python -m ae_agent.main also produces ae_log_<artifact_id>.log.

Docker + interactive

For Docker tasks with interactive: true (or --interactive), the benchmark runs the agent in the foreground via docker exec -it so you can interact in the same terminal. This requires a real TTY (e.g. running python src/main.py ... in a terminal, not under CI or with redirected stdin). If stdin is not a TTY, the run falls back to non-interactive (background runner) and a warning is logged.