TheAgentCompany-lite

A faster, parallelized evaluation infrastructure for TheAgentCompany benchmarks.

3.1 hours instead of 17.5 hours. 4 GB instead of 700 GB. Mock mode for 30-second infra testing.

It uses the same OpenHands config and task format as upstream; only the runner changes.

Quick test with no external services needed:

make single TASK=ds-sql-exercise

See docs/DESIGN.md for architecture and docs/SETUP.md for full setup instructions including service deployment.

Quick Start

git clone git@github.com:illinoisdata/TheAgentCompany-lite.git && cd TheAgentCompany-lite
make setup-full  # submodule + uv deps (with openhands) + docker base image
make mock        # mock benchmark (no LLM, no services needed)
make dry-run     # see execution plan
make single TASK=ds-sql-exercise   # quick real task (no external services needed)

Note: The first run of any task takes 10-20 minutes to build the OpenHands runtime image. Subsequent runs start in seconds.

Tasks that depend on GitLab, RocketChat, ownCloud, or Plane need those services running first. See docs/SETUP.md for service deployment. The harness starts services on demand via ensure_services() when you run a task that needs them.

Configuration

LLM config uses the same config.toml format as upstream OpenHands. Create the file in the project root (after cloning and running make setup-full):

[llm.agent]
model = "gpt-4o-mini"
base_url = "https://api.openai.com/v1"
api_key = "sk-..."

[llm.env]
model = "gpt-4o-mini"
base_url = "https://api.openai.com/v1"
api_key = "sk-..."

For Azure OpenAI:

[llm.agent]
model = "azure/<your-deployment-name>"
base_url = "https://<your-resource>.cognitiveservices.azure.com"
api_key = "<your-api-key>"
api_version = "2024-12-01-preview"

[llm.env]
model = "azure/<your-deployment-name>"
base_url = "https://<your-resource>.cognitiveservices.azure.com"
api_key = "<your-api-key>"
api_version = "2024-12-01-preview"

Important: The model field must start with the azure/ prefix so that litellm routes the request to Azure OpenAI. base_url should be just the endpoint, not the full deployment path.

--agent-llm-config agent maps to [llm.agent], --env-llm-config env maps to [llm.env]. For mock mode, the config file is not needed.

Prerequisites

| Requirement | Linux (Ubuntu/Debian) | macOS |
| --- | --- | --- |
| Docker 24+ | `sudo apt-get install -y docker.io` | Docker Desktop (includes BuildX + Compose) |
| Docker BuildX | `sudo apt-get install -y docker-buildx-plugin` | Included in Docker Desktop |
| Docker Compose v2 | `sudo apt-get install -y docker-compose-v2` | Included in Docker Desktop |
| Python 3.12+ | System package or pyenv | `brew install python@3.12` |
| uv | `curl -LsSf https://astral.sh/uv/install.sh \| sh` | `brew install uv` or the same curl command |

Quick install (Linux): sudo apt-get install -y docker.io docker-buildx-plugin docker-compose-v2

Quick install (macOS): Install Docker Desktop + brew install uv
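Before running setup, the prerequisites above can be spot-checked with a small script. The sketch below is an illustration, not part of the repository: `check_prerequisites` is a hypothetical helper that only verifies that `docker` and `uv` are on the PATH and that the interpreter is Python 3.12 or newer. It does not check the Docker version, BuildX, or Compose.

```python
import shutil
import sys

REQUIRED_TOOLS = ("docker", "uv")  # executables expected on PATH
MIN_PYTHON = (3, 12)               # matches the Python 3.12+ requirement


def check_prerequisites(which=shutil.which, python_version=sys.version_info[:2]):
    """Return a dict mapping each requirement to True (found) or False (missing).

    `which` and `python_version` are parameters so the check can be
    exercised without touching the real environment.
    """
    report = {tool: which(tool) is not None for tool in REQUIRED_TOOLS}
    report["python>=3.12"] = tuple(python_version) >= MIN_PYTHON
    return report


for requirement, ok in check_prerequisites().items():
    print(f"{requirement}: {'ok' if ok else 'MISSING'}")
```

Checking the Docker plugins would additionally require invoking `docker buildx version` and `docker compose version`, which is left out here to keep the sketch free of side effects.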

Install

make setup-full          # full: submodule + openhands deps + docker base image
# or:
make setup               # base only: mock mode + scheduler

Or manually:

git submodule update --init --recursive
uv sync --extra openhands  # requires Python >=3.12
docker pull ghcr.io/haochengxia/theagentcompany-lite-base:latest || make build-base

Note: If docker pull fails (private registry or no access), make build-base builds the image locally from source (~5 min first time).

Usage

# Mock benchmark (no LLM, no services needed)
make mock

# Quick real task (no external services needed)
make single TASK=ds-sql-exercise
make single TASK=sde-install-go
make single TASK=sde-install-openjdk

# Tasks that need GitLab/RocketChat/ownCloud/Plane
# (services must be running first, see SETUP.md)
make single TASK=sde-add-wiki-page
make single TASK=gitlab-create-repo-1
make single TASK=admin-arrange-meeting-rooms

# Multiple tasks
uv run python evaluation_lite/scheduler.py \
  --agent-llm-config agent --env-llm-config env \
  --tasks "admin-arrange-meeting-rooms,pm-update-project-milestones" \
  --mock --mock-duration 5,8 --outputs-path ./outputs_mock

# Full benchmark (1 instance)
uv run python evaluation_lite/scheduler.py \
  --agent-llm-config agent --env-llm-config env \
  --server-hostname localhost --outputs-path ./outputs

# Full benchmark (6 instances)
uv run python evaluation_lite/scheduler.py \
  --agent-llm-config agent --env-llm-config env \
  --server-hostname tac_test \
  --num-instances 6 --full-stack-ids 0,4,5 \
  --outputs-path ./outputs
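
When the scheduler is driven from another script, the invocations above can be assembled programmatically. The following Python sketch is an assumption-laden illustration: `scheduler_command` is a hypothetical helper, and it uses only the flags that appear in the examples above, without claiming anything about their defaults or about other options the scheduler may accept.

```python
import shlex

SCHEDULER = "evaluation_lite/scheduler.py"


def scheduler_command(tasks=None, mock=False, mock_duration=None,
                      server_hostname=None, num_instances=None,
                      full_stack_ids=None, outputs_path="./outputs"):
    """Build the argv list for one scheduler run (hypothetical helper)."""
    cmd = ["uv", "run", "python", SCHEDULER,
           "--agent-llm-config", "agent", "--env-llm-config", "env"]
    if tasks:
        cmd += ["--tasks", ",".join(tasks)]
    if mock:
        cmd.append("--mock")
    if mock_duration:
        cmd += ["--mock-duration", ",".join(str(d) for d in mock_duration)]
    if server_hostname:
        cmd += ["--server-hostname", server_hostname]
    if num_instances is not None:
        cmd += ["--num-instances", str(num_instances)]
    if full_stack_ids:
        cmd += ["--full-stack-ids", ",".join(str(i) for i in full_stack_ids)]
    cmd += ["--outputs-path", outputs_path]
    return cmd


# Mirrors the "Full benchmark (6 instances)" example above.
six_instances = scheduler_command(server_hostname="tac_test",
                                  num_instances=6,
                                  full_stack_ids=[0, 4, 5])
print(shlex.join(six_instances))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the project root; building a list rather than a shell string avoids quoting problems in task names.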

Files

evaluation_lite/
  scheduler.py          # Round-based parallel scheduler
  service_manager.py    # Multi-instance port mapping and locking
  harness.py            # Docker and OpenHands harness interfaces
  run_eval.py           # Single task execution pipeline
  run_eval_mock.py      # Mock executor for infra testing
  browsing.py           # Browser automation for pre-login

.github/workflows/
  e2e-test.yml          # CI: E2E test with on-demand service startup

Makefile Targets

| Target | Description |
| --- | --- |
| `make setup` | Init submodule, install base deps, pull/build docker image |
| `make setup-full` | Same + openhands (Python >=3.12) |
| `make build-base` | Build base image locally (~5 min) |
| `make mock` | Run mock benchmark |
| `make dry-run` | Print execution plan |
| `make single TASK=name` | Run one task |
| `make clean` | Remove outputs and venv |

