An open-source evolutionary coding agent that uses LLMs to discover and optimize algorithms through iterative evolution, for scientific and algorithmic discovery.
KaiEvolve is an evolutionary coding agent that uses Large Language Models to automatically optimize and discover algorithms through iterative improvement. Starting from the AlphaEvolve research, it incorporates advanced features for reproducibility, multi-language support, sophisticated evaluation pipelines, and integration with cutting-edge LLM optimization techniques. It serves as both a research platform for evolutionary AI and a practical tool for automated code optimization.
A faithful AlphaEvolve loop gets you mutation, evaluation, and a quality-diversity archive. KaiEvolve adds the machinery that makes long runs actually compound — the search remembers what it learned and steers itself instead of re-deriving the problem every iteration:
- Compounding memory (HMRD + running literature review) — every successful program carries a 4-section Hypothesis / Method / Result / Discussion summary. At each checkpoint, KaiEvolve distills only the new programs into a bounded "lab notebook" that is injected into every prompt and persists to disk, so the LLM builds on accumulated knowledge rather than a rotating window of neighbors — and a fresh run can pick up where the last one left off. See Compounding memory.
- Adaptive model selection (Thompson sampling) — hand the ensemble several models with default weights and a Beta–Bernoulli bandit learns which one to spend on from the rewards each earns, instead of a fixed split. See Adaptive model selection.
- Multi-file evolution: coordinate EVOLVE-BLOCKs across a whole directory of files by shared block ID, not just a single program. See Multi-file evolution.
- Self-seeding runs: kai run --init-from <run_dir> starts a new run from a previous run's best program. See Seeding a run.
- Human + automated steering: redirect a live run with a markdown steering brief, or let an opt-in research director meta-agent write strategic directives. See Watching and steering.
- Noise-aware fitness: for stochastic evaluators, re-evaluate and average so selection isn't fooled by a single lucky draw (see examples/noisy_optimization/).
- Two-phase warmup: optional prompt optimization (pairwise feedback descent) and hyperparameter tuning before the main run (see configs/twophase_warmup.yaml).
Underneath those, KaiEvolve is a complete evolutionary coding system:
- Evolutionary coding agent: LLM-guided evolution of whole code files, not just functions
- MAP-Elites + island-based evolution: a quality-diversity archive across multiple populations with periodic migration, balancing exploration and exploitation
- Inspiration vs. performance prompting: top performers and diverse inspirations are sampled separately, with multi-strategy (elite / diverse / exploratory) selection
- Adaptive feature dimensions: default complexity & diversity axes, extensible to any metric your evaluator returns
- Cascade evaluation + artifacts side-channel: multi-stage testing plus a feedback channel that hands build errors, tracebacks, and profiling data back to the LLM
- LLM ensemble over any OpenAI-compatible API: OpenAI, Anthropic, Google, or local models, with optillm for test-time compute (MoA, reflection) and plugins
- Multi-objective optimization with a combined_score convention for fitness
- Checkpoint / resume of full evolution state, with robust loading
- Scientific reproducibility: deterministic, per-component seeding (seed=42 by default)
- Language agnostic: Python, Rust, R, Metal shaders, and more
- Process-based parallelism: true parallel evaluation past Python's GIL, with memory/timeout limits
- kai CLI + zero-build web viewer: live monitoring, side-by-side comparison, CSV/JSON export, and interactive solution visualizations (see Exploring runs)
Each iteration runs the same loop, in a fresh process working from a database snapshot:
- Prompt sampler builds a context-rich prompt: top-performing programs (for optimization guidance), diverse inspirations (for creative exploration), execution artifacts and error feedback, the running literature review, any active steering brief, and, via optillm plugins, dynamically fetched documentation.
- LLM ensemble generates the mutation. With default model weights, a Thompson sampler picks which model to call from the rewards each has earned; with explicit weights it falls back to a fixed split. optillm adds test-time compute (MoA, reflection); model selection is deterministic under a fixed seed.
- Evaluator pool runs multi-stage cascade evaluation in parallel under memory/timeout limits, collecting artifacts and optional LLM-based quality feedback. Stochastic evaluators can re-evaluate and average for noise-aware fitness.
- Program database maps the result into the MAP-Elites archive across islands, tracks lineage and metadata, and, at checkpoints, distills new programs' HMRD summaries into the literature review for the next iterations.
To install natively, use:
git clone https://github.com/firstbatchxyz/kai-evolve.git
cd kai-evolve
pip install -e .
Optional extras:
pip install -e ".[viewer]" # the `kai viewer` web UI (FastAPI + Jinja)
pip install -e ".[embeddings]" # local embedding models for novelty detection /
# strategy clustering (pulls in PyTorch; the
# default "api" embedding backend needs no extra)
KaiEvolve uses the OpenAI SDK, which means it works with any LLM provider that supports an OpenAI-compatible API:
- Set the API key: export the OPENAI_API_KEY environment variable:
  export OPENAI_API_KEY=your-api-key-here
- Using alternative LLM providers: for providers other than OpenAI (e.g., Anthropic, Cohere, local models), update api_base in your config.yaml:
  llm:
    api_base: "https://your-provider-endpoint.com/v1"
- Maximum flexibility with optillm:
  - For advanced routing, rate limiting, or using multiple providers, we recommend optillm
  - optillm acts as a proxy that can route requests to different LLMs based on your rules
  - Simply point api_base to your optillm instance:
    llm:
      api_base: "http://localhost:8000/v1"
This setup ensures KaiEvolve can work with any LLM provider - OpenAI, Anthropic, Google, Cohere, local models via Ollama/vLLM, or any OpenAI-compatible endpoint.
import asyncio
import os
from kaievolve import KaiEvolve
# Ensure API key is set
if not os.environ.get("OPENAI_API_KEY"):
    raise ValueError("Please set OPENAI_API_KEY environment variable")
# Initialize the system
evolve = KaiEvolve(
    initial_program_path="path/to/initial_program.py",
    evaluation_file="path/to/evaluator.py",
    config_path="path/to/config.yaml"
)
# Run the evolution
best_program = asyncio.run(evolve.run(iterations=1000))
print("Best program metrics:")
for name, value in best_program.metrics.items():
    print(f"  {name}: {value:.4f}")
KaiEvolve can also be run from the command line:
python kaievolve-run.py path/to/initial_program.py path/to/evaluator.py --config path/to/config.yaml --iterations 1000
For a concrete first run, try the bundled circle packing example:
pip install -r examples/circle_packing/requirements.txt # scipy + matplotlib
python kaievolve-run.py examples/circle_packing/initial_program.py \
examples/circle_packing/evaluator.py \
--config examples/circle_packing/config_phase_1.yaml \
--iterations 50
Note: if you omit --config, KaiEvolve warns and falls back to built-in defaults: models gpt-4o-mini / gpt-4o, with the API base taken from the OPENAI_API_BASE environment variable (default: https://openrouter.ai/api/v1). Those model names only resolve if your endpoint actually serves them, so passing an explicit config is recommended.
KaiEvolve automatically saves checkpoints at intervals specified by the checkpoint_interval config parameter (default is 10 iterations). You can resume an evolution run from a saved checkpoint:
python kaievolve-run.py path/to/initial_program.py path/to/evaluator.py \
--config path/to/config.yaml \
--checkpoint path/to/checkpoint_directory \
--iterations 50
When resuming from a checkpoint:
- The system loads all previously evolved programs and their metrics
- Checkpoint numbering continues from where it left off (e.g., if loaded from checkpoint_50, the next checkpoint will be checkpoint_60)
- All evolution state is preserved (best programs, feature maps, archives, etc.)
- Each checkpoint directory contains a copy of the best program at that point in time
Example workflow with checkpoints:
# Run for 50 iterations (creates checkpoints at iterations 10, 20, 30, 40, 50)
python kaievolve-run.py examples/circle_packing/initial_program.py \
examples/circle_packing/evaluator.py \
--iterations 50
# Resume from checkpoint 50 for another 50 iterations (creates checkpoints at 60, 70, 80, 90, 100)
python kaievolve-run.py examples/circle_packing/initial_program.py \
examples/circle_packing/evaluator.py \
--checkpoint examples/circle_packing/kaievolve_output/checkpoints/checkpoint_50 \
--iterations 50
Resuming continues an existing run's full state. To instead start a fresh run from where another one finished (with a new config, a different model roster, or a clean archive seeded by the best solution so far), use --init-from with a run directory:
kai run --init-from bench_results/<run>/run_0 \
examples/circle_packing/evaluator.py \
--config examples/circle_packing/config_phase_2.yaml \
--iterations 100
KaiEvolve locates that run's best program (<run_dir>/best/best_program.py, or the latest checkpoints/checkpoint_*/best_program.py) and uses it as the initial program, so you only need to pass the evaluator. This pairs naturally with the persistent literature review: point both runs at the same review file and knowledge compounds across them.
Each checkpoint directory contains the best program found up to that point, making it easy to compare solutions over time:
checkpoints/
  checkpoint_10/
    best_program.py         # Best program at iteration 10
    best_program_info.json  # Metrics and details
    programs/               # All programs evaluated so far
    metadata.json           # Database state
  checkpoint_20/
    best_program.py         # Best program at iteration 20
    ...
You can compare the evolution of solutions by examining the best programs at different checkpoints:
# Compare best programs at different checkpoints
diff -u checkpoints/checkpoint_10/best_program.py checkpoints/checkpoint_20/best_program.py
# Compare metrics
cat checkpoints/checkpoint_*/best_program_info.json | grep -A 10 metrics
You can also install and execute via Docker:
docker build -t kaievolve .
docker run --rm -v "$(pwd)":/app --network="host" kaievolve examples/circle_packing/initial_program.py examples/circle_packing/evaluator.py --config examples/circle_packing/config_phase_1.yaml --iterations 1000
Evolution produces a lot of programs; the kai command makes them legible. Run kai with no arguments for a guided menu, or pick a subcommand:
| Command | What it does |
|---|---|
| kai monitor | Live dashboard, watch runs progress in real time |
| kai runs | One-shot table of every run and its score |
| kai show | Open one run and step through what the AI tried |
| kai best | Print (or save) the best solution a run found |
| kai compare | Compare setups side by side on a shared scale |
| kai export | Dump results to CSV/JSON for your own analysis |
| kai steer | Set a live steering brief a running job reads (human-in-the-loop) |
| kai viewer | Open the richer web view in your browser |
| kai glossary | Plain-language explanation of the terms |
| kai run | Run KaiEvolve on your own code (the optimizer itself) |

Point kai at your results with --root (or set KAI_ROOT); --configs tells it where each setup's config lives.
kai viewer is a local, dependency-light web view of a results tree, with no
database and no build step. It gives an at-a-glance overview, a per-run dashboard,
and a click-to-inspect drawer that couples each step's reasoning, code diff, and
the solution it produced.
Overview. Every setup per task, ranked by score, with cost and a progress sparkline:
Run and step drawer. A run is a score trajectory beside a summary panel, then a full-width table of every step the AI tried. Click any row and a drawer opens with that program's notes, its diff against its parent, and an interactive visualization of the actual solution (here, the circle packing it produced; hover to inspect, scroll to zoom):
The interactive solution view is driven by an optional per-task visualize.py
that sits next to the task's evaluator.py and exposes a single function:
def render(program_path: str) -> str:
    """Return a self-contained HTML fragment visualizing the program's output."""
The viewer runs it lazily in a sandboxed subprocess and caches the result, so it
works retroactively on existing runs. Tasks without a visualize.py simply omit
the solution view. See skills/visualization/SKILL.md
and the two reference visualizers under examples/alphaevolve/ (circle packing and
the autocorrelation step function).
Active runs append a per-iteration progress.jsonl feed to their output
directory; kai monitor and the web viewer read it, so both update in real
time while a run is going (and keep working on finished runs).
Two steering channels can redirect a run while it is running:
- Human steering brief: point prompt.steering_brief_path at a markdown file and edit it mid-run. Its contents are re-read every iteration and injected into generation prompts, so directives like "focus on the inner loop" or "avoid approach X" take effect without restarting or touching config. kai steer sets, appends to, or shows that file from the terminal.
- Research director (opt-in): an automated meta-agent that periodically reads the population and writes a strategic directive into the same steering channel. Enable it with prompt.research_director_enabled: true; optionally set research_director_interval (defaults to the migration interval) and research_director_model for a dedicated reasoning model (defaults to the run's model roster). See skills/research-director/SKILL.md.
Both channels compose: the human brief and the director's directive are injected together.
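How the two channels might compose can be sketched as follows. This is only an illustration of "re-read every iteration, inject together": the function name `build_steering_section` and the section titles are hypothetical, not KaiEvolve's actual prompt layout.

```python
import tempfile
from pathlib import Path
from typing import Optional


def build_steering_section(brief_path: str, director_directive: Optional[str] = None) -> str:
    """Re-read the human brief from disk and combine it with the director's directive.

    Hypothetical helper: the section titles below are assumptions for illustration.
    """
    parts = []
    path = Path(brief_path)
    if path.is_file():
        brief = path.read_text(encoding="utf-8").strip()
        if brief:
            parts.append("## Steering brief\n" + brief)
    if director_directive and director_directive.strip():
        parts.append("## Research director directive\n" + director_directive.strip())
    return "\n\n".join(parts)


with tempfile.TemporaryDirectory() as tmp:
    brief_file = Path(tmp) / "steering.md"
    brief_file.write_text("Focus on the inner loop.\n", encoding="utf-8")
    first = build_steering_section(str(brief_file))
    # Editing the file mid-run changes what the next iteration sees.
    brief_file.write_text("Avoid approach X.\n", encoding="utf-8")
    second = build_steering_section(str(brief_file), "Explore hexagonal layouts.")
```

Because the file is read on every call, no restart or config change is needed for an edit to take effect.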
KaiEvolve is highly configurable with advanced options:
# Example configuration showcasing advanced features
max_iterations: 1000
random_seed: 42  # Full reproducibility by default
llm:
  # Advanced ensemble configuration
  models:
    - name: "gemini-2.0-flash-lite"
      weight: 0.7
    - name: "moa&readurls-gemini-2.0-flash"  # optillm test-time compute
      weight: 0.3
  temperature: 0.7
database:
  # MAP-Elites configuration
  population_size: 500
  num_islands: 5  # Island-based evolution
  migration_interval: 20
  feature_dimensions: ["complexity", "diversity"]  # Default quality-diversity features
evaluator:
  # Advanced evaluation features
  enable_artifacts: true  # Capture execution feedback
  cascade_evaluation: true  # Multi-stage testing
  use_llm_feedback: true  # AI-based code quality assessment
prompt:
  # Sophisticated prompt engineering
  num_top_programs: 3  # Performance examples
  num_diverse_programs: 2  # Creative inspiration
  include_artifacts: true  # Execution feedback
  # Template customization
  template_dir: null  # Directory for custom prompt templates
  use_template_stochasticity: true  # Enable random variations in prompts
  template_variations: {}  # Define variation placeholders
Sample configuration files are available in the configs/ directory:
- default_config.yaml: Comprehensive configuration with all available options
- island_config_example.yaml: Advanced island-based evolution setup
KaiEvolve supports advanced prompt template customization to increase diversity in code evolution:
You can override the default prompt templates by providing custom ones:
prompt:
  template_dir: "path/to/your/templates"
Create .txt files in your template directory with these names:
- diff_user.txt: Template for diff-based evolution
- full_rewrite_user.txt: Template for full code rewrites
- evolution_history.txt: Format for presenting evolution history
- top_program.txt: Format for top-performing programs
- previous_attempt.txt: Format for previous attempts
To add randomness to your prompts and prevent getting stuck in local optima:
- Enable stochasticity in your config:
  prompt:
    use_template_stochasticity: true
    template_variations:
      greeting:
        - "Let's improve this code."
        - "Time to enhance this program."
        - "Here's how we can optimize:"
      analysis_intro:
        - "Current metrics show"
        - "Performance analysis indicates"
        - "The evaluation reveals"
- Use variation placeholders in your custom templates:
  # custom_template.txt
  {greeting}
  {analysis_intro} the following results:
  {metrics}
The system will randomly select one variation for each placeholder during prompt generation, creating diverse prompts that can lead to more creative code evolutions.
Note: The default templates don't include variation placeholders, so you'll need to create custom templates to use this feature effectively.
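The substitution step can be pictured with a small sketch. It assumes placeholders are plain `{name}` tokens and that one option is drawn per placeholder; `fill_variations` is a hypothetical helper, not a KaiEvolve function.

```python
import random


def fill_variations(template: str, variations: dict, rng: random.Random) -> str:
    """Replace each {placeholder} that has variations with one randomly chosen option.

    Placeholders without an entry in `variations` (such as {metrics}) are left
    untouched for a later formatting step.
    """
    result = template
    for name, options in variations.items():
        placeholder = "{" + name + "}"
        if options and placeholder in result:
            result = result.replace(placeholder, rng.choice(options))
    return result


variations = {
    "greeting": ["Let's improve this code.", "Time to enhance this program."],
    "analysis_intro": ["Current metrics show", "Performance analysis indicates"],
}
template = "{greeting}\n{analysis_intro} the following results:\n{metrics}"
prompt = fill_variations(template, variations, random.Random(42))
```

Passing a seeded `random.Random` keeps the chosen wording reproducible, in the same spirit as the seeded runs described elsewhere in this README.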
Feature dimensions control how programs are organized in the MAP-Elites quality-diversity grid:
Default Features: If feature_dimensions is NOT specified in your config, KaiEvolve uses ["complexity", "diversity"] as defaults.
Built-in Features (always computed internally by KaiEvolve):
- complexity: Code length (recommended default)
- diversity: Code structure diversity compared to other programs (recommended default)
Only complexity and diversity are used as defaults because they work well across all program types.
Custom Features: You can mix built-in features with metrics from your evaluator:
database:
  feature_dimensions: ["complexity", "performance", "correctness"]  # Mix of built-in and custom
  # Per-dimension bin configuration (optional)
  feature_bins:
    complexity: 10   # 10 bins for complexity
    performance: 20  # 20 bins for performance (from YOUR evaluator)
    correctness: 15  # 15 bins for correctness (from YOUR evaluator)
Important: KaiEvolve will raise an error if a specified feature is not found in the evaluator's metrics. This ensures your configuration is correct. The error message will show available metrics to help you fix the configuration.
See the Configuration Guide for a full list of options.
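To make the role of feature dimensions and bins concrete, here is a minimal MAP-Elites-style grid. It is a sketch under stated assumptions: fixed value ranges per feature, equal-width bins, and one elite per cell. The names `bin_index` and `MapElitesGrid` are hypothetical, and KaiEvolve's own scaling and replacement rules may differ.

```python
def bin_index(value: float, low: float, high: float, bins: int) -> int:
    """Map a feature value onto one of `bins` equal-width cells, clamping out-of-range values."""
    if high <= low:
        return 0
    fraction = (value - low) / (high - low)
    return min(bins - 1, max(0, int(fraction * bins)))


class MapElitesGrid:
    """Keeps the best-scoring program per cell of the feature grid."""

    def __init__(self, feature_bins: dict, feature_ranges: dict):
        self.feature_bins = feature_bins      # e.g. {"complexity": 10}
        self.feature_ranges = feature_ranges  # assumed fixed (low, high) per feature
        self.cells = {}                       # cell coordinates -> (program_id, score)

    def cell_for(self, metrics: dict) -> tuple:
        missing = [name for name in self.feature_bins if name not in metrics]
        if missing:
            # Mirrors the documented behavior: fail loudly and list what is available.
            raise ValueError(f"features {missing} not in metrics; available: {sorted(metrics)}")
        return tuple(
            bin_index(metrics[name], *self.feature_ranges[name], self.feature_bins[name])
            for name in self.feature_bins
        )

    def add(self, program_id: str, metrics: dict, score: float) -> bool:
        """Insert the program if its cell is empty or it beats the incumbent."""
        cell = self.cell_for(metrics)
        incumbent = self.cells.get(cell)
        if incumbent is None or score > incumbent[1]:
            self.cells[cell] = (program_id, score)
            return True
        return False


grid = MapElitesGrid(
    feature_bins={"complexity": 10, "performance": 20},
    feature_ranges={"complexity": (0, 1000), "performance": (0.0, 1.0)},
)
grid.add("a", {"complexity": 120, "performance": 0.5}, score=0.5)
```

A program only displaces another when both land in the same cell, which is what lets structurally different solutions survive next to higher-scoring ones.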
When comparing and selecting programs, KaiEvolve uses the following priority:
- combined_score: If your evaluator returns a combined_score metric, it will be used as the primary fitness measure
- Average of all metrics: If no combined_score is provided, KaiEvolve calculates the average of all numeric metrics returned by your evaluator
This ensures programs can always be compared even without explicit fitness definitions. For best results, consider having your evaluator return a combined_score that represents overall program fitness.
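The priority rule above amounts to a few lines. This is a sketch, not KaiEvolve's code: the function name `fitness` is hypothetical, and returning 0.0 when there are no numeric metrics is an assumption.

```python
from numbers import Real


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, Real) and not isinstance(value, bool)


def fitness(metrics: dict) -> float:
    """Prefer combined_score; otherwise average every numeric metric."""
    combined = metrics.get("combined_score")
    if _is_number(combined):
        return float(combined)
    numeric = [float(v) for v in metrics.values() if _is_number(v)]
    # Assumption: with nothing numeric to average, treat fitness as 0.0.
    return sum(numeric) / len(numeric) if numeric else 0.0
```

For example, `fitness({"combined_score": 0.8, "speed": 0.1})` uses the explicit score, while `fitness({"accuracy": 1.0, "speed": 0.5, "note": "ok"})` averages the two numeric values and ignores the string.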
KaiEvolve includes an artifacts side-channel that allows evaluators to capture build errors, profiling results, etc. to provide better feedback to the LLM in subsequent generations. This feature enhances the evolution process by giving the LLM context about what went wrong and how to fix it.
The artifacts channel operates alongside the traditional fitness metrics.
from kaievolve.evaluation_result import EvaluationResult

def evaluate(program_path):
    # On failure, return zeroed metrics plus artifacts describing what went wrong
    return EvaluationResult(
        metrics={"compile_ok": 0.0, "score": 0.0},
        artifacts={
            "stderr": "SyntaxError: invalid syntax (line 15)",
            "traceback": "...",
            "failure_stage": "compilation"
        }
    )
The next generation prompt will include:
## Last Execution Output
### Stderr
SyntaxError: invalid syntax (line 15)
### Traceback
...
An example of an LLM artifact side-channel is part of the default evaluation template, which ends with:
Return your evaluation as a JSON object with the following format:
{{
  "readability": [score],
  "maintainability": [score],
  "efficiency": [score],
  "reasoning": "[brief explanation of scores]"
}}
The non-float values, in this case the "reasoning" key of the JSON response that the evaluator LLM generates, will be available within the next generation prompt.
Artifacts can be controlled via configuration and environment variables:
# config.yaml
evaluator:
  enable_artifacts: true
prompt:
  include_artifacts: true
  max_artifact_bytes: 4096  # 4KB limit in prompts
  artifact_security_filter: true
# Environment variable to disable artifacts
export ENABLE_ARTIFACTS=false
- Faster convergence - LLMs can see what went wrong and fix it directly
- Better error handling - Compilation and runtime failures become learning opportunities
- Rich debugging context - Full stack traces and error messages guide improvements
- Zero overhead - When disabled, no performance impact on evaluation
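As a rough picture of how artifacts could become the "Last Execution Output" prompt section under a byte limit, consider the sketch below. The helper `render_artifacts`, the per-artifact (rather than total) limit, and the truncation marker are assumptions; the security filter is not modeled.

```python
def render_artifacts(artifacts: dict, max_bytes: int = 4096) -> str:
    """Format artifacts as a markdown section, truncating each one to max_bytes."""
    if not artifacts:
        return ""
    sections = ["## Last Execution Output"]
    for key, value in artifacts.items():
        raw = str(value).encode("utf-8")
        if len(raw) > max_bytes:
            # Cut on a byte budget; drop any partial multi-byte character at the cut.
            text = raw[:max_bytes].decode("utf-8", errors="ignore") + "\n... (truncated)"
        else:
            text = str(value)
        title = key.replace("_", " ").capitalize()
        sections.append(f"### {title}\n{text}")
    return "\n".join(sections)


section = render_artifacts(
    {"stderr": "SyntaxError: invalid syntax (line 15)", "traceback": "x" * 10000},
    max_bytes=4096,
)
```

Bounding each artifact keeps one enormous traceback from crowding the rest of the prompt.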
A plain evolutionary loop is amnesiac: each prompt sees a handful of neighboring programs and re-derives the problem from scratch. KaiEvolve gives the search a memory in two layers.
HMRD summaries. Every successful program is asked to emit a tagged
Hypothesis / Method / Result / Discussion prefix, which workers parse and attach
to the program. This turns the population from a pile of code into a record of
what was tried and why it did or didn't work. HMRD is on by default
(prompt.require_hmrd) and also feeds the optional strategy clustering
(prompt.strategy_clustering_enabled) and the approaches synthesis in the web
viewer.
Running literature review. Opt in with prompt.literature_review_enabled and
KaiEvolve maintains a bounded "lab notebook" distilled from those HMRD summaries.
At each checkpoint it distills only the programs added since the last checkpoint
and merges them into the review, then injects the review into every generation
prompt. The cost stays flat regardless of population size, and because the review
persists to a file it compounds across runs — a new run loads the accumulated
notebook and builds on it.
prompt:
  require_hmrd: true                 # 4-section summaries (default on)
  literature_review_enabled: true    # maintain + inject the running review
  literature_review_path: null       # null => <output_dir>/literature_review.md;
                                     # set a stable path to accumulate across runs
  literature_review_max_chars: 4000  # bounded (~1000 tokens)
  literature_review_interval: null   # distill every N iters; null => checkpoint_interval
Set literature_review_path to a shared file and chain runs with
--init-from to carry knowledge forward.
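The incremental, bounded, persistent nature of the review can be sketched as below. In KaiEvolve the distillation is presumably done by an LLM; here `naive_distill` is a deliberately simple stand-in that drops the oldest notes once the size limit is reached, and both function names are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def naive_distill(current: str, new_summaries: List[str], max_chars: int) -> str:
    """Stand-in for the LLM distillation step: append notes, evict the oldest when over budget."""
    lines = [line for line in current.splitlines() if line.strip()]
    lines += [f"- {summary.strip()}" for summary in new_summaries]
    while lines and len("\n".join(lines)) > max_chars:
        lines.pop(0)
    return "\n".join(lines)


def update_review(review_path: Path, new_summaries: List[str],
                  distill: Callable[[str, List[str], int], str],
                  max_chars: int = 4000) -> str:
    """Merge only the summaries added since the last checkpoint into the persisted review."""
    current = review_path.read_text(encoding="utf-8") if review_path.exists() else ""
    if not new_summaries:
        return current
    merged = distill(current, new_summaries, max_chars)[:max_chars]  # hard cap as a safety net
    review_path.write_text(merged, encoding="utf-8")
    return merged


with tempfile.TemporaryDirectory() as tmp:
    review_file = Path(tmp) / "literature_review.md"
    first = update_review(review_file, ["H: tile the loop. R: +3%"], naive_distill, max_chars=80)
    # A later checkpoint (or a later run pointed at the same file) builds on the saved notes.
    second = update_review(review_file, ["H: hex grid. R: +5%"], naive_distill, max_chars=80)
    third = update_review(review_file, ["x" * 60], naive_distill, max_chars=80)
    persisted = review_file.read_text(encoding="utf-8")
```

Only the new summaries are passed in at each step, so the work per checkpoint does not grow with the size of the population.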
Picking a single model per run is a bet; static ensemble weights are a fixed bet.
KaiEvolve can instead learn the split. When you list more than one model and
leave the weights at their default (1.0), the ensemble switches on a
Beta–Bernoulli Thompson sampler: it round-robins until each model has a few
samples, then routes each call to the model whose recent reward distribution looks
most promising — automatically shifting spend toward whatever is actually working
on this task. Set explicit weights to opt out and use a fixed weighted split.
llm:
  models:
    - name: "google/gemini-2.5-flash-lite"  # no `weight:` => Thompson sampling
    - name: "google/gemini-2.5-flash"
  thompson_window_size: 50         # how many recent samples inform each model's estimate
  thompson_min_samples: 3          # round-robin until each model has this many samples
  thompson_exploration_bonus: 0.0  # >0 nudges toward less-used models
The reward a model earns can be the child's absolute score or its improvement over
the parent (llm.reward_mode: "absolute" | "improvement", optionally cost-aware);
selection is deterministic under a fixed seed.
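A minimal Beta-Bernoulli Thompson sampler with a sliding window and a round-robin warm-up looks roughly like this. It is a sketch, not KaiEvolve's implementation: the class name is hypothetical, the reward is simplified to a success/failure bit (for instance "the child improved on its parent"), and the exploration bonus and cost-aware rewards are left out.

```python
import random
from collections import deque


class ThompsonModelSelector:
    """Pick a model by sampling from a Beta posterior over each model's recent success rate."""

    def __init__(self, models, window_size=50, min_samples=3, seed=42):
        self.models = list(models)
        self.min_samples = min_samples
        self.rng = random.Random(seed)  # fixed seed => reproducible selection
        self.history = {m: deque(maxlen=window_size) for m in self.models}

    def select(self) -> str:
        # Warm-up: serve the least-sampled model until all have min_samples outcomes.
        under = [m for m in self.models if len(self.history[m]) < self.min_samples]
        if under:
            return min(under, key=lambda m: len(self.history[m]))
        draws = {}
        for model in self.models:
            wins = sum(self.history[model])
            losses = len(self.history[model]) - wins
            # Beta(1 + wins, 1 + losses): posterior under a uniform prior.
            draws[model] = self.rng.betavariate(1 + wins, 1 + losses)
        return max(draws, key=draws.get)

    def update(self, model: str, success: bool) -> None:
        self.history[model].append(1 if success else 0)


def simulate(seed: int, rounds: int = 300) -> list:
    """Toy environment with made-up success rates, to show spend shifting to the better model."""
    selector = ThompsonModelSelector(["small", "large"], seed=seed)
    environment = random.Random(seed + 1)
    success_rate = {"small": 0.2, "large": 0.8}  # hypothetical, for the demo only
    picks = []
    for _ in range(rounds):
        model = selector.select()
        selector.update(model, environment.random() < success_rate[model])
        picks.append(model)
    return picks


picks = simulate(seed=42)
```

The window means old outcomes eventually fall out of a model's estimate, so a model that starts helping later in a run can win spend back.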
Some problems don't live in one file — a kernel plus its launcher, a model plus
its training loop. KaiEvolve can evolve EVOLVE-BLOCKs that span a directory,
coordinated by a shared block ID. Mark the corresponding blocks in each file with
the same ID, point kai run at the directory, and name the block to evolve:
# file_a.py
# EVOLVE-BLOCK-START id="kernel"
...
# EVOLVE-BLOCK-END
# file_b.py
# EVOLVE-BLOCK-START id="kernel"
...
# EVOLVE-BLOCK-END
kai run --directory path/to/project --block-id kernel \
  path/to/evaluator.py --config path/to/config.yaml --iterations 200
KaiEvolve presents the matching blocks to the LLM together, applies the evolved result back across the files, and evaluates the project as a whole. The best solution is written back to the original files at the end.
See the examples/ directory for complete examples of using KaiEvolve on various problems:
Our implementation of the circle packing problem. For the n=26 case, we achieve state-of-the-art results matching published benchmarks.
Below is the optimal packing found by KaiEvolve after 800 iterations:
Five open mathematical problems from the AlphaEvolve paper (autocorrelation_C1, packing_circles_max_sum_of_radii, no_isosceles_triangles, happy_ending, unit_distances), each packaged as an initial program plus evaluator. Run them with the suite config tuned for these tasks, configs/bench_alphaevolve.yaml.
Evolving CUDA/GPU kernels for PyTorch operators (e.g. 23_softmax) against a
reference implementation — the platform-optimization case, where fitness is raw
throughput on real hardware.
A minimal benchmark for noise-aware fitness: the
evaluator reports a noisy estimate of each candidate's quality, so a single draw
can mislead selection. Raising evaluator.re_evaluations above 1 averages repeats
to shrink the variance — a worked example of evolving under a stochastic objective.
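Why averaging helps can be shown with a toy evaluator. The sketch below is not the bundled example: the noise model (Gaussian, standard deviation 0.2), the true quality of 0.6, and the helper `averaged_score` are all made up for illustration.

```python
import random
import statistics
from typing import Callable


def averaged_score(evaluate_once: Callable[[], float], re_evaluations: int = 1) -> float:
    """Average `re_evaluations` independent draws of a stochastic score."""
    if re_evaluations < 1:
        raise ValueError("re_evaluations must be at least 1")
    return statistics.fmean(evaluate_once() for _ in range(re_evaluations))


rng = random.Random(0)
TRUE_QUALITY = 0.6  # made-up ground truth for the demo


def noisy_evaluate() -> float:
    # Hypothetical evaluator: the true quality plus Gaussian noise.
    return TRUE_QUALITY + rng.gauss(0.0, 0.2)


single = [averaged_score(noisy_evaluate, 1) for _ in range(500)]
averaged = [averaged_score(noisy_evaluate, 8) for _ in range(500)]
spread_single = statistics.pstdev(single)
spread_averaged = statistics.pstdev(averaged)
```

With independent noise, averaging n draws shrinks the standard deviation by about a factor of the square root of n, at the price of n times the evaluation cost.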
To use KaiEvolve for your own problems:
- Mark code sections to evolve with # EVOLVE-BLOCK-START and # EVOLVE-BLOCK-END comments
- Create an evaluation function that returns a dictionary of metrics
- Configure KaiEvolve with appropriate parameters
- Run the evolution process
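The first two steps might look like the sketch below: a tiny initial program with an evolve block, and an evaluator that loads a candidate file and returns a metrics dictionary. The `evaluate(program_path)` entry point follows the convention of upstream OpenEvolve and is assumed to carry over; the sorting task and the `sort_numbers` name are invented for the example.

```python
import importlib.util
import tempfile
from pathlib import Path

# Contents of a hypothetical initial_program.py
INITIAL_PROGRAM = '''\
# EVOLVE-BLOCK-START
def sort_numbers(values):
    """Baseline the search is free to rewrite."""
    return sorted(values)
# EVOLVE-BLOCK-END
'''


def evaluate(program_path: str) -> dict:
    """Load the candidate program, run fixed test cases, and return metrics."""
    try:
        spec = importlib.util.spec_from_file_location("candidate", program_path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        cases = [[3, 1, 2], [], [5, 5, 1]]
        passed = sum(module.sort_numbers(list(case)) == sorted(case) for case in cases)
        correctness = passed / len(cases)
    except Exception:
        # A candidate that fails to import or crashes scores zero instead of aborting the run.
        correctness = 0.0
    return {"correctness": correctness, "combined_score": correctness}


with tempfile.TemporaryDirectory() as tmp:
    program_file = Path(tmp) / "initial_program.py"
    program_file.write_text(INITIAL_PROGRAM, encoding="utf-8")
    metrics = evaluate(str(program_file))
```

Returning a combined_score alongside the individual metrics gives the search an explicit fitness to rank candidates by.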
Driving KaiEvolve from an LLM agent? skills/using-kaievolve/SKILL.md is an agent-facing reference for the full workflow (task setup, the evaluator contract, running, and reading results).
KaiEvolve is a fork of OpenEvolve by
Asankhaya Sharma (@codelion), used under the
Apache-2.0 license. OpenEvolve is itself an open-source implementation of
Google DeepMind's AlphaEvolve. KaiEvolve builds on that foundation; the original
copyright and license are preserved (see LICENSE and NOTICE).
If you use KaiEvolve in your research, please cite both KaiEvolve and the upstream OpenEvolve work it is built on:
@software{kaievolve,
title = {KaiEvolve: an evolutionary coding agent for scientific and algorithmic discovery},
author = {Dria},
year = {2025},
publisher = {GitHub},
url = {https://github.com/firstbatchxyz/kai-evolve}
}
@software{openevolve,
title = {OpenEvolve: an open-source evolutionary coding agent},
author = {Asankhaya Sharma},
year = {2025},
publisher = {GitHub},
url = {https://github.com/codelion/openevolve}
}




