DriftBench

DriftBench is a toolkit for generating and replaying data drift and workload drift with DriftSpec.

Who uses DriftBench:

Researcher — design reproducible drift experiments and ablations.
Database Vendor / Performance Team — run drift regression checks across targets before release.
New User — start from validated examples and get first outputs quickly.

Version history: CHANGELOG · Production site: driftbench.com

Install

pip install -U driftbench-db

Or from source:

git clone https://github.com/Liuguanli/DriftBench.git
cd DriftBench
pip install -e .

Verify:

driftbench --help

Benchmark Adapters (`driftbench.data`)

Nine adapters generate real data files and SQL query workloads with no external dependencies to install by hand. TPC-H with mode="generate" downloads and builds dbgen automatically on first use.

Adapter	Workload type	Data format	Tables	Queries
`tpch`	OLAP	`.tbl` (pipe-delimited)	8	22 SQL via qgen
`tpcds`	OLAP / Decision support	`.dat` (pipe-delimited)	5 synthetic	99 query IDs
`tpcc`	OLTP	`.csv`	9	5 transaction types
`tpcc_skew`	OLTP + hotspot	`.csv` + weight manifest	9	5 transaction types
`job`	OLAP / join-order	`.csv`	11 (IMDB-like)	20 SQL templates
`ycsb`	Key-value	`.csv`	1	6 workload mixes (A–F)
`dsb`	Decision support	`.csv`	3 star-schema	3 SQL templates
`pgbench`	TPC-B (OLTP)	`.csv`	4	3 workloads
`benchbase`	Multi-benchmark	XML + shell script	via live DB	10 benchmarks

Generate data and queries

from pathlib import Path
from driftbench.data.tpch import data as tpch_data, queries as tpch_queries
from driftbench.data.tpcds import data as tpcds_data, queries as tpcds_queries
from driftbench.data.tpcc import data as tpcc_data, queries as tpcc_queries
from driftbench.data.tpcc_skew import data as tpcc_skew_data, queries as tpcc_skew_queries
from driftbench.data.job import data as job_data, queries as job_queries
from driftbench.data.ycsb import data as ycsb_data, queries as ycsb_queries
from driftbench.data.dsb import data as dsb_data, queries as dsb_queries
from driftbench.data.pgbench import data as pgbench_data, queries as pgbench_queries
from driftbench.data.benchbase import data as bb_data, queries as bb_queries

out = Path("./artifacts")

# TPC-H — auto-builds dbgen on first use; converts .tbl to .csv with .as_csv()
tpch_data(scale_factor=1, mode="generate").generate(output_dir=out)
tpch_queries(query_ids=[1, 3, 5], queries_per_template=2).generate(output_dir=out)

# TPC-DS — synthetic .dat files; converts to .csv with .as_csv()
tpcds_data(scale_factor=10).generate(output_dir=out)
tpcds_queries().generate(output_dir=out)

# TPC-C — scale_factor = number of warehouses
tpcc_data(scale_factor=4).generate(output_dir=out)
tpcc_queries().generate(output_dir=out)

# TPC-C Skew — Zipf hot-warehouse access distribution
tpcc_skew_data(scale_factor=10, hot_warehouse_fraction=0.2, skew_factor=0.99).generate(output_dir=out)
tpcc_skew_queries(scale_factor=10, hot_warehouse_fraction=0.2).generate(output_dir=out)

# JOB, YCSB, DSB, pgbench
job_data(scale_factor=1).generate(output_dir=out)
ycsb_data(scale_factor=1).generate(output_dir=out)
ycsb_queries(workload="B").generate(output_dir=out)
dsb_data(scale_factor=10).generate(output_dir=out)
pgbench_data(scale_factor=1).generate(output_dir=out)
pgbench_queries(workload="tpcb").generate(output_dir=out)

# BenchBase — generates XML configs + shell scripts for a live database
bb_data(benchmark="tpcc", scale_factor=10).generate(output_dir=out)
bb_queries(benchmark="tpcc", terminals=8, duration=120).generate(output_dir=out)

Output layout

artifacts/
  tpch/data/sf_1/tables/   tpch/queries/
  tpcds/data/              tpcds/queries/
  tpcc/data/               tpcc/queries/
  tpcc_skew/data/          tpcc_skew/queries/
  job/data/                job/queries/
  ycsb/data/               ycsb/queries/
  dsb/data/                dsb/queries/
  pgbench/data/            pgbench/queries/
  benchbase/tpcc/data/     benchbase/tpcc/queries/

Each folder contains a *_manifest.json listing the generated files.
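To check what a run produced, the manifests can be collected with the standard library alone. The sketch below relies only on what is stated above: files matching *_manifest.json somewhere under the output directory. The helper name load_manifests is hypothetical, and nothing is assumed about the keys inside a manifest.

```python
import json
from pathlib import Path


def load_manifests(root):
    """Return {relative manifest path: parsed JSON} for every *_manifest.json under root."""
    root = Path(root)
    manifests = {}
    for path in sorted(root.rglob("*_manifest.json")):
        with path.open(encoding="utf-8") as fh:
            # The manifest schema is not assumed; the parsed JSON is stored as-is.
            manifests[path.relative_to(root).as_posix()] = json.load(fh)
    return manifests
```

Calling load_manifests("./artifacts") after the generation script above should return one entry per manifest, keyed by its path relative to artifacts/.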

GenerationResult

generate() returns a GenerationResult:

result = tpch_data(scale_factor=1, mode="generate").generate(output_dir=out)
result.files      # list of generated file paths
result.metadata   # path to the manifest JSON

# Convert pipe-delimited .tbl / .dat to standard CSV (both kept on disk).
# Known TPC-H (8 tables) and TPC-DS (5 synthetic tables) get a proper
# header row, so the CSV is self-describing and usable directly by .drift().
csv_result = result.as_csv()

# Lightweight JSON-serializable summary for logs / dashboards / quick asserts.
result.summary()
# {'benchmark': 'tpch', 'artifact_type': 'data',
#  'output_dir': '/tmp/...', 'file_count': 8,
#  'tables': ['customer', 'lineitem', 'nation', ...]}

A second call reuses the existing files automatically. Pass force=True to regenerate them.
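For readers who want to see what the pipe-delimited to CSV conversion involves, here is a minimal standard-library sketch. It is not DriftBench's .as_csv() implementation. The function name tbl_to_csv is hypothetical, and the header has to be supplied by the caller. It relies on the fact that dbgen ends every .tbl line with a trailing "|".

```python
import csv
from pathlib import Path


def tbl_to_csv(tbl_path, csv_path, header):
    """Convert a pipe-delimited file to comma-separated CSV with a header row."""
    with open(tbl_path, newline="", encoding="utf-8") as src, \
         open(csv_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src, delimiter="|", quoting=csv.QUOTE_NONE)
        writer = csv.writer(dst)
        writer.writerow(header)
        for row in reader:
            if not row:
                continue  # skip blank lines
            # The trailing "|" on each line yields one empty extra field; drop it.
            if len(row) == len(header) + 1 and row[-1] == "":
                row = row[:-1]
            if len(row) != len(header):
                raise ValueError(f"expected {len(header)} fields, got {len(row)}: {row!r}")
            writer.writerow(row)  # fields containing commas are quoted by csv.writer
    return Path(csv_path)
```

For the TPC-H nation table, the header would be ["n_nationkey", "n_name", "n_regionkey", "n_comment"]. The source file is left untouched, mirroring the "both kept on disk" behavior described above.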

Applying drift to benchmark data

GenerationResult exposes .drift() and .drift_multi() to apply data drift directly — no manual schema extraction or generator setup needed.

Single-table drift:

from driftbench.data.tpch import TPCHData

result = TPCHData(scale_factor=1, source_dir="path/to/tbls").generate().as_csv()

# Inject outliers into lineitem.l_quantity
drifted = result.drift("lineitem", "outlier_injection", column="l_quantity", n=500)

# Skew the price/discount distribution
drifted = result.drift("lineitem", "value_skew",
                       columns=["l_extendedprice", "l_discount"], skewness=2)

drift() writes the drifted CSV to <output_dir>/<table>_<drift_type>.csv by default; pass output_path= to override. It returns a new GenerationResult pointing at the drifted file.

Every .drift() call also emits a reproducible DriftSpec YAML (<output_stem>.driftspec.yaml) next to the CSV — kept out of result.files but recorded under the manifest's driftspec key. Running that YAML through driftbench.spec.core.run_all regenerates byte-identical output, so a Python-generated drift can be shared or automated as a spec without rework. The function-call path (fast, imperative) and the spec path (declarative, version-controllable, reproducible) are the same engine and produce identical results for the same seed and parameters.
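The reproducibility claim rests on seeding: the same seed and parameters must select the same rows and apply the same change. The toy sketch below illustrates that property with a private random.Random instance. It is not DriftBench's outlier_injection. The function inject_outliers, its factor parameter, and the "multiply by a constant" rule are all assumptions made for illustration.

```python
import random


def inject_outliers(values, n, seed, factor=10.0):
    """Return a copy of values with n randomly chosen entries scaled by factor."""
    rng = random.Random(seed)  # private RNG, independent of global random state
    picked = rng.sample(range(len(values)), n)
    drifted = list(values)
    for i in picked:
        drifted[i] = drifted[i] * factor
    return drifted


quantities = [float(q) for q in range(1, 51)]
run_a = inject_outliers(quantities, n=5, seed=42)
run_b = inject_outliers(quantities, n=5, seed=42)

print(run_a == run_b)                                  # True: same seed, same parameters
print(sum(a != q for a, q in zip(run_a, quantities)))  # 5: exactly n values changed
```

Keeping the generator local to the call is what makes a recorded seed sufficient to replay the drift later, regardless of what else the process has done with the global random module.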

Multi-table drift:

# FK relationships for tpch / job are wired automatically
drifted = result.drift_multi([
    {"op": "skew_column", "target": "lineitem", "column": "l_quantity",
     "fraction": 0.2, "skewness": 2},
    {"op": "delete_keys", "target": "orders", "key_column": "o_orderkey",
     "fraction": 0.05,
     "propagate": [{"relationship": "lineitem_orders", "policy": "drop"}]},
])

Pass relationships=[] or a custom list to override the built-in FK maps. Supported benchmarks with auto-wiring: tpch, job. tpcc and tpcc_skew require explicit relationship definitions because their joins use composite keys.
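To make the "delete keys, then propagate with policy drop" idea concrete, here is a small in-memory sketch over lists of dicts. It only illustrates the referential-integrity logic and is not how drift_multi is implemented. The function delete_keys_with_drop and its parameters are hypothetical.

```python
import random


def delete_keys_with_drop(parent_rows, child_rows, key_column, child_fk, fraction, seed):
    """Delete a fraction of parent keys and drop the child rows that reference them."""
    rng = random.Random(seed)
    keys = sorted({row[key_column] for row in parent_rows})
    n_delete = round(len(keys) * fraction)
    deleted = set(rng.sample(keys, n_delete))
    kept_parents = [r for r in parent_rows if r[key_column] not in deleted]
    # "drop" policy: remove child rows whose foreign key points at a deleted parent.
    kept_children = [r for r in child_rows if r[child_fk] not in deleted]
    return kept_parents, kept_children, deleted


orders = [{"o_orderkey": k} for k in range(1, 21)]
lineitem = [{"l_orderkey": k, "l_linenumber": n} for k in range(1, 21) for n in (1, 2)]

orders_kept, lineitem_kept, deleted = delete_keys_with_drop(
    orders, lineitem, "o_orderkey", "l_orderkey", fraction=0.05, seed=7
)

print(len(orders_kept), len(lineitem_kept))  # 19 38
remaining = {r["o_orderkey"] for r in orders_kept}
print(all(r["l_orderkey"] in remaining for r in lineitem_kept))  # True: no orphaned rows
```

A single-column foreign key keeps the membership test simple. With composite keys, as in tpcc, the key would be a tuple of columns, which is one reason explicit relationship definitions are needed there.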

Ready-to-run DriftSpec YAML examples for five of the adapters are in driftspec/examples/:

tpch_lineitem_drift.yaml
tpcc_drift.yaml
job_drift.yaml
ycsb_drift.yaml
pgbench_drift.yaml

CLI Quickstart

# Validate a DriftSpec
python -m driftbench.cli validate-spec driftspec/examples/demo_data_single.yaml --json

# Dry-run (preview execution plan)
python -m driftbench.cli dry-run driftspec/examples/demo_data_single.yaml --json

# Execute
python -m driftbench.cli run-yaml driftspec/examples/demo_data_single.yaml
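The three commands above can be chained from a script so that a spec is executed only after it validates and dry-runs cleanly. The sketch below uses only the subcommands and flags shown above. It assumes, without confirmation from the source, that the CLI signals failure with a non-zero exit code. The helper names build_commands and run_pipeline are hypothetical.

```python
import subprocess
import sys


def build_commands(spec_path):
    """Return the validate, dry-run, and run command lines for one spec."""
    base = [sys.executable, "-m", "driftbench.cli"]
    return [
        base + ["validate-spec", spec_path, "--json"],
        base + ["dry-run", spec_path, "--json"],
        base + ["run-yaml", spec_path],
    ]


def run_pipeline(spec_path, runner=subprocess.run):
    """Run the steps in order and stop at the first non-zero exit code."""
    for cmd in build_commands(spec_path):
        completed = runner(cmd)
        # Assumption: a failing step reports a non-zero return code.
        if completed.returncode != 0:
            return completed.returncode
    return 0


# Usage (requires driftbench to be installed):
# exit_code = run_pipeline("driftspec/examples/demo_data_single.yaml")
```

The runner parameter exists so the sequencing can be tested without invoking the real CLI.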

Python API

from driftbench import run_spec, trace_to_spec

run_spec("driftspec/examples/demo_data_single.yaml")
trace_to_spec("driftspec/trace_inputs/trace_data_mock.csv", "driftspec/generated/from_trace.yaml")

MCP Server

python3 -m driftbench_mcp.server

Core workflow via MCP: trace_to_spec → validate_spec → run_spec → list_outputs

Testing

python -m unittest discover -s test -p 'test_*.py' -v

License

MIT — see LICENSE.
