Skip to content

illinoisdata/TheAgentCompany-lite

Repository files navigation

atc-lite-logo

A faster, parallelized evaluation infrastructure for TheAgentCompany benchmarks.

3.1 hours instead of 17.5 hours. 4 GB instead of 700 GB. Mock mode for 30-second infra testing.

Uses the same OpenHands config and task format as upstream — just swap the runner.

Quick test with no external services needed:

make single TASK=ds-sql-exercise

See docs/DESIGN.md for architecture and docs/SETUP.md for full setup instructions including service deployment.

Quick Start

git clone git@github.com:illinoisdata/TheAgentCompany-lite.git && cd TheAgentCompany-lite
make setup-full  # submodule + uv deps (with openhands) + docker base image
make mock        # mock benchmark (no LLM, no services needed)
make dry-run     # see execution plan
make single TASK=ds-sql-exercise   # quick real task (no external services needed)

Note: The first run of any task takes 10-20 minutes to build the OpenHands runtime image. Subsequent runs start in seconds.

Tasks that depend on GitLab, RocketChat, ownCloud, or Plane need those services running first. See docs/SETUP.md for service deployment. The harness starts services on-demand via ensure_services() when you run a task that needs them.

Configuration

LLM config uses the same config.toml format as upstream OpenHands. Create it in the project root (after cloning and make setup-full):

[llm.agent]
model = "gpt-4o-mini"
base_url = "https://api.openai.com/v1"
api_key = "sk-..."

[llm.env]
model = "gpt-4o-mini"
base_url = "https://api.openai.com/v1"
api_key = "sk-..."

For Azure OpenAI:

[llm.agent]
model = "azure/<your-deployment-name>"
base_url = "https://<your-resource>.cognitiveservices.azure.com"
api_key = "<your-api-key>"
api_version = "2024-12-01-preview"

[llm.env]
model = "azure/<your-deployment-name>"
base_url = "https://<your-resource>.cognitiveservices.azure.com"
api_key = "<your-api-key>"
api_version = "2024-12-01-preview"

Important: The model field must start with azure/ prefix so litellm routes to Azure OpenAI. base_url should be just the endpoint, not the full deployment path.

--agent-llm-config agent maps to [llm.agent], --env-llm-config env maps to [llm.env]. For mock mode, the config file is not needed.

Prerequisites

Requirement Linux (Ubuntu/Debian) macOS
Docker 24+ sudo apt-get install -y docker.io Docker Desktop (includes BuildX + Compose)
Docker BuildX sudo apt-get install -y docker-buildx-plugin Included in Docker Desktop
Docker Compose v2 sudo apt-get install -y docker-compose-v2 Included in Docker Desktop
Python 3.12+ System package or pyenv brew install python@3.12
uv curl -LsSf https://astral.sh/uv/install.sh | sh brew install uv or same curl command

Quick install (Linux): sudo apt-get install -y docker.io docker-buildx-plugin docker-compose-v2

Quick install (macOS): Install Docker Desktop + brew install uv

Install

make setup-full          # full: submodule + openhands deps + docker base image
# or:
make setup               # base only: mock mode + scheduler

Or manually:

git submodule update --init --recursive
uv sync --extra openhands  # requires Python >=3.12
docker pull ghcr.io/haochengxia/theagentcompany-lite-base:latest || make build-base

Note: If docker pull fails (private registry or no access), make build-base builds the image locally from source (~5 min first time).

Usage

# Mock benchmark (no LLM, no services needed)
make mock

# Quick real task (no external services needed)
make single TASK=ds-sql-exercise
make single TASK=sde-install-go
make single TASK=sde-install-openjdk

# Tasks that need GitLab/RocketChat/ownCloud/Plane
# (services must be running first, see SETUP.md)
make single TASK=sde-add-wiki-page
make single TASK=gitlab-create-repo-1
make single TASK=admin-arrange-meeting-rooms

# Multiple tasks
uv run python evaluation_lite/scheduler.py \
  --agent-llm-config agent --env-llm-config env \
  --tasks "admin-arrange-meeting-rooms,pm-update-project-milestones" \
  --mock --mock-duration 5,8 --outputs-path ./outputs_mock

# Full benchmark (1 instance)
uv run python evaluation_lite/scheduler.py \
  --agent-llm-config agent --env-llm-config env \
  --server-hostname localhost --outputs-path ./outputs

# Full benchmark (6 instances)
uv run python evaluation_lite/scheduler.py \
  --agent-llm-config agent --env-llm-config env \
  --server-hostname tac_test \
  --num-instances 6 --full-stack-ids 0,4,5 \
  --outputs-path ./outputs

Files

evaluation_lite/
  scheduler.py          # Round-based parallel scheduler
  service_manager.py    # Multi-instance port mapping and locking
  harness.py            # Docker and OpenHands harness interfaces
  run_eval.py           # Single task execution pipeline
  run_eval_mock.py      # Mock executor for infra testing
  browsing.py           # Browser automation for pre-login

.github/workflows/
  e2e-test.yml          # CI: E2E test with on-demand service startup

Makefile Targets

Target Description
make setup Init submodule, install base deps, pull/build docker image
make setup-full Same + openhands (Python >=3.12)
make build-base Build base image locally (~5 min)
make mock Run mock benchmark
make dry-run Print execution plan
make single TASK=name Run one task
make clean Remove outputs and venv

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors