This guide shows you how to run batch evaluations using `mini-extra run-batch`.

Minimal command to run a batch evaluation:

```
mini-extra run-batch \
  --instances-path instances.json \
  --output-dir results \
  --model gpt-4
```

Full-featured command with all common options:

```
mini-extra run-batch \
  --config config/default.yaml \
  --output-dir sweagent_results/test/claude-sonnet-4-5 \
  --num-workers 50 \
  --random-delay-multiplier 1 \
  --source file \
  --instances-path sweagent_wrapper_configs/instances_test_file.yaml \
  --no-shuffle \
  --deployment-type modal \
  --deployment-install-pipx \
  --deployment-startup-timeout 900 \
  --per-instance-call-limit 250 \
  --per-instance-cost-limit 0 \
  --total-cost-limit 0 \
  --model anthropic/claude-haiku-4-5 \
  --model-api-base https://litellm.ml-serving-internal.scale.com/v1 \
  --model-api-key $OPENAI_API_KEY \
  --model-temperature 0.0
```

Instance Loading:

- `--instances-path PATH` - Path to instances file (JSON/JSONL/YAML)
- `--source {file,swebench,huggingface}` - Instance source type
- `--subset {lite,verified,full,multimodal,multilingual}` - SWE-bench subset
- `--split {dev,test}` - Dataset split
- `--dataset-name NAME` - HuggingFace dataset name

Instance Filtering:

- `--filter REGEX` - Filter instance IDs by regex (default: `".*"`)
- `--slice SLICE` - Slice specification (e.g., `"0:10"` or `"::2"`)
- `--shuffle/--no-shuffle` - Enable/disable shuffling (default: no shuffle)

Basic Options:

- `-o, --output, --output-dir DIR` - Output directory
- `-w, --workers, --num-workers N` - Number of parallel workers
- `--config PATH` - Agent configuration file

Model Options:

- `-m, --model NAME` - Model name
- `--model-class CLASS` - Model class
- `--model-api-base URL` - Model API base URL
- `--model-api-key KEY` - Model API key
- `--model-temperature FLOAT` - Sampling temperature
- `--model-top-p FLOAT` - Top-p sampling

Model Limits:

- `--per-instance-call-limit N` - Max API calls per instance (0 = unlimited)
- `--per-instance-cost-limit FLOAT` - Max cost per instance (0 = unlimited)
- `--total-cost-limit FLOAT` - Max total cost (0 = unlimited)

Environment Options:

- `--environment-class {docker,singularity,local,modal}` - Environment type
- `--deployment-type {modal}` - Deployment type (sets environment to modal)
- `--deployment-install-pipx/--no-deployment-install-pipx` - Install pipx in the Modal deployment
- `--deployment-startup-timeout SECONDS` - Modal deployment startup timeout (default: 600s)

Advanced Options:

- `--redo-existing/--no-redo-existing` - Re-run existing trajectories
- `--raise-exceptions/--no-raise-exceptions` - Stop on first error
- `--random-delay-multiplier FLOAT` - Startup delay multiplier
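The `--slice` examples above (`"0:10"`, `"::2"`) suggest Python slice semantics. A minimal sketch of how such a spec would select instances, assuming that interpretation (`parse_slice` is illustrative, not part of the tool):

```python
def parse_slice(spec: str) -> slice:
    """Turn a spec like "0:10" or "::2" into a Python slice object."""
    parts = [int(p) if p else None for p in spec.split(":")]
    return slice(*parts)

# Hypothetical instance IDs standing in for a loaded dataset.
instance_ids = [f"instance_{i}" for i in range(20)]

first_ten = instance_ids[parse_slice("0:10")]   # instance_0 .. instance_9
every_other = instance_ids[parse_slice("::2")]  # instance_0, instance_2, ...

print(len(first_ten), len(every_other))  # 10 10
```

Note that `--filter` narrows by instance ID first, so a slice applies to whatever subset survives the regex.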
Basic run with a local instances file:

```
mini-extra run-batch \
  --instances-path instances.json \
  --output-dir results \
  --model gpt-4
```

Run the first 10 instances of the SWE-bench Lite dev split:

```
mini-extra run-batch \
  --source swebench \
  --subset lite \
  --split dev \
  --slice 0:10 \
  --output-dir results \
  --workers 4 \
  --model anthropic/claude-3-5-sonnet-20241022
```

Custom config file with a Modal deployment:

```
mini-extra run-batch \
  --config config/default.yaml \
  --instances-path my_instances.yaml \
  --output-dir results/experiment \
  --workers 10 \
  --model anthropic/claude-haiku-4-5 \
  --model-api-base https://api.example.com/v1 \
  --model-api-key $API_KEY \
  --deployment-type modal \
  --deployment-startup-timeout 900
```

Run with call and cost limits:

```
mini-extra run-batch \
  --instances-path instances.json \
  --output-dir results \
  --model anthropic/claude-haiku-4-5 \
  --model-temperature 0.5 \
  --per-instance-call-limit 100 \
  --per-instance-cost-limit 1.0 \
  --total-cost-limit 50.0
```

Before running a full batch, test with a small slice:
```
# Test with just the first instance
mini-extra run-batch \
  --config config/default.yaml \
  --output-dir test_results \
  --source file \
  --instances-path sweagent_wrapper_configs/instances_test_file.yaml \
  --slice 0:1 \
  --model anthropic/claude-haiku-4-5 \
  --model-api-base https://litellm.ml-serving-internal.scale.com/v1 \
  --model-api-key $OPENAI_API_KEY
```

Use `--flag` to enable, `--no-flag` to disable (NOT `--flag False`):
```
# Enable shuffle
--shuffle

# Disable shuffle (default)
--no-shuffle

# Enable redo existing
--redo-existing

# Disable redo existing (default)
--no-redo-existing
```

- Use dashes, not underscores: `--output-dir`, not `--output_dir`
- All options are flat (no dots): `--model`, not `--agent.model.name`
- Many options have short aliases: `-o`, `-w`, `-m`
Use environment variables for sensitive data like API keys:

```
--model-api-key $OPENAI_API_KEY
--model-api-key $ANTHROPIC_API_KEY
```

View all available options:

```
mini-extra run-batch --help
```

The help is organized into these sections:
- Instance Loading - Load instances from files or datasets
- Instance Filtering - Filter, slice, and shuffle instances
- Basic Options - Core settings (output dir, workers, config)
- Model Options - Model selection and API configuration
- Model Limits - Call and cost limits
- Environment Options - Docker, Singularity, Local, or Modal
- Advanced Options - Retry behavior and timing
| Category | Common Options |
|---|---|
| Required | `--instances-path` or `--source`, `--output-dir`, `--model` |
| Instance Sources | `--source {file,swebench,huggingface}` |
| Filtering | `--filter`, `--slice`, `--shuffle` |
| Parallelization | `--num-workers` (or `-w`) |
| Model Config | `--model-api-base`, `--model-api-key`, `--model-temperature` |
| Limits | `--per-instance-call-limit`, `--total-cost-limit` |
| Deployment | `--deployment-type modal`, `--deployment-startup-timeout` |
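As a sanity check before launching a large batch, the interplay of the per-instance and total cost limits can be worked through with simple arithmetic (illustrative numbers only; the actual cost accounting is done by the tool):

```python
# Hypothetical batch: 300 instances with the limits from the example above.
n_instances = 300
per_instance_cost_limit = 1.0   # --per-instance-cost-limit 1.0
total_cost_limit = 50.0         # --total-cost-limit 50.0

# Worst case if only the per-instance limit applied:
worst_case = n_instances * per_instance_cost_limit  # 300.0

# The total limit caps overall spend, so the batch would stop early:
effective_budget = min(worst_case, total_cost_limit)  # 50.0

# At $1 per instance in the worst case, roughly this many instances
# can complete before the total limit is hit:
instances_within_budget = int(total_cost_limit // per_instance_cost_limit)

print(effective_budget, instances_within_budget)  # 50.0 50
```

Setting a limit to `0` disables it, so `--total-cost-limit 0` removes the overall cap entirely.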