A faster, parallelized evaluation infrastructure for TheAgentCompany benchmarks.
3.1 hours instead of 17.5 hours. 4 GB instead of 700 GB. Mock mode for 30-second infra testing.
Uses the same OpenHands config and task format as upstream — just swap the runner.
Quick test with no external services needed:
make single TASK=ds-sql-exerciseSee docs/DESIGN.md for architecture and docs/SETUP.md for full setup instructions including service deployment.
git clone git@github.com:illinoisdata/TheAgentCompany-lite.git && cd TheAgentCompany-lite
make setup-full # submodule + uv deps (with openhands) + docker base image
make mock # mock benchmark (no LLM, no services needed)
make dry-run # see execution plan
make single TASK=ds-sql-exercise # quick real task (no external services needed)Note: The first run of any task takes 10-20 minutes to build the OpenHands runtime image. Subsequent runs start in seconds.
Tasks that depend on GitLab, RocketChat, ownCloud, or Plane need those services running first. See docs/SETUP.md for service deployment. The harness starts services on-demand via ensure_services() when you run a task that needs them.
LLM config uses the same config.toml format as upstream OpenHands. Create it in the project root (after cloning and make setup-full):
[llm.agent]
model = "gpt-4o-mini"
base_url = "https://api.openai.com/v1"
api_key = "sk-..."
[llm.env]
model = "gpt-4o-mini"
base_url = "https://api.openai.com/v1"
api_key = "sk-..."For Azure OpenAI:
[llm.agent]
model = "azure/<your-deployment-name>"
base_url = "https://<your-resource>.cognitiveservices.azure.com"
api_key = "<your-api-key>"
api_version = "2024-12-01-preview"
[llm.env]
model = "azure/<your-deployment-name>"
base_url = "https://<your-resource>.cognitiveservices.azure.com"
api_key = "<your-api-key>"
api_version = "2024-12-01-preview"Important: The
modelfield must start withazure/prefix so litellm routes to Azure OpenAI.base_urlshould be just the endpoint, not the full deployment path.
--agent-llm-config agent maps to [llm.agent], --env-llm-config env maps to [llm.env]. For mock mode, the config file is not needed.
| Requirement | Linux (Ubuntu/Debian) | macOS |
|---|---|---|
| Docker 24+ | sudo apt-get install -y docker.io |
Docker Desktop (includes BuildX + Compose) |
| Docker BuildX | sudo apt-get install -y docker-buildx-plugin |
Included in Docker Desktop |
| Docker Compose v2 | sudo apt-get install -y docker-compose-v2 |
Included in Docker Desktop |
| Python 3.12+ | System package or pyenv | brew install python@3.12 |
| uv | curl -LsSf https://astral.sh/uv/install.sh | sh |
brew install uv or same curl command |
Quick install (Linux):
sudo apt-get install -y docker.io docker-buildx-plugin docker-compose-v2Quick install (macOS): Install Docker Desktop +
brew install uv
make setup-full # full: submodule + openhands deps + docker base image
# or:
make setup # base only: mock mode + schedulerOr manually:
git submodule update --init --recursive
uv sync --extra openhands # requires Python >=3.12
docker pull ghcr.io/haochengxia/theagentcompany-lite-base:latest || make build-baseNote: If
docker pullfails (private registry or no access),make build-basebuilds the image locally from source (~5 min first time).
# Mock benchmark (no LLM, no services needed)
make mock
# Quick real task (no external services needed)
make single TASK=ds-sql-exercise
make single TASK=sde-install-go
make single TASK=sde-install-openjdk
# Tasks that need GitLab/RocketChat/ownCloud/Plane
# (services must be running first, see SETUP.md)
make single TASK=sde-add-wiki-page
make single TASK=gitlab-create-repo-1
make single TASK=admin-arrange-meeting-rooms
# Multiple tasks
uv run python evaluation_lite/scheduler.py \
--agent-llm-config agent --env-llm-config env \
--tasks "admin-arrange-meeting-rooms,pm-update-project-milestones" \
--mock --mock-duration 5,8 --outputs-path ./outputs_mock
# Full benchmark (1 instance)
uv run python evaluation_lite/scheduler.py \
--agent-llm-config agent --env-llm-config env \
--server-hostname localhost --outputs-path ./outputs
# Full benchmark (6 instances)
uv run python evaluation_lite/scheduler.py \
--agent-llm-config agent --env-llm-config env \
--server-hostname tac_test \
--num-instances 6 --full-stack-ids 0,4,5 \
--outputs-path ./outputsevaluation_lite/
scheduler.py # Round-based parallel scheduler
service_manager.py # Multi-instance port mapping and locking
harness.py # Docker and OpenHands harness interfaces
run_eval.py # Single task execution pipeline
run_eval_mock.py # Mock executor for infra testing
browsing.py # Browser automation for pre-login
.github/workflows/
e2e-test.yml # CI: E2E test with on-demand service startup
| Target | Description |
|---|---|
make setup |
Init submodule, install base deps, pull/build docker image |
make setup-full |
Same + openhands (Python >=3.12) |
make build-base |
Build base image locally (~5 min) |
make mock |
Run mock benchmark |
make dry-run |
Print execution plan |
make single TASK=name |
Run one task |
make clean |
Remove outputs and venv |