AidanHT/Velox
Velox

A high-throughput, zero-allocation AI inference gateway written in Rust. Velox sits between API consumers and upstream LLM providers (e.g. OpenAI), routing requests through a lock-free ingress pipeline with SIMD-accelerated JSON parsing, per-worker memory arenas, and multiplexed HTTP/2 egress with pre-warmed connection pools.

Architecture

                         ┌─────────────────────────────────────┐
                         │          Velox Gateway               │
                         │                                     │
  Producers ──►  ┌───────────────┐   ┌──────────────────┐     │
                 │  Crossbeam    │   │  Tokio Runtime    │     │
                 │  Bounded      │──►│  (pinned cores)   │     │
                 │  Channel      │   │                   │     │
                 │  (16,384)     │   │  ┌─────────────┐  │     │
                 └───────────────┘   │  │ Worker 0    │  │     │
                                     │  │ ┌─────────┐ │  │     │
                                     │  │ │ Bumpalo  │ │  │     │    ┌──────────────┐
                                     │  │ │ Arena    │ │  │     │    │  Upstream     │
                                     │  │ └─────────┘ │  │     │    │  (HTTP/2 +    │
                                     │  │ ┌─────────┐ │  │────►│───►│   TLS 1.3)   │
                                     │  │ │simd-json│ │  │     │    │              │
                                     │  │ └─────────┘ │  │     │    │  Pre-warmed  │
                                     │  └─────────────┘  │     │    │  Pool        │
                                     │  ┌─────────────┐  │     │    └──────────────┘
                                     │  │ Worker N    │  │     │
                                     │  │  (same)     │  │     │
                                     │  └─────────────┘  │     │
                                     │                   │     │
                                     │  ┌─────────────┐  │     │
                                     │  │ TokenBucket │  │     │
                                      │  │ (atomic RMW)│  │     │
                                     │  └─────────────┘  │     │
                                     └──────────────────┘     │
                         │                                     │
                         │  ┌─────────────┐                    │
                         │  │ pprof       │ :6060              │
                         │  │ (OS thread) │                    │
                         │  └─────────────┘                    │
                         └─────────────────────────────────────┘

Core Components

Ingress Pipeline (main.rs) -- Incoming request payloads are submitted into a lock-free bounded crossbeam channel (depth 16,384). One Tokio worker task per logical CPU core drains the channel in micro-batches of up to 64 payloads, parsing each through the arena-backed SIMD-JSON pipeline.

Request Arena (arena.rs) -- Each worker owns a bumpalo bump allocator pre-allocated at 256 KB. Network payloads are copied into the arena with 32 bytes of SIMD zero-padding, then parsed in-place by simd-json. Deserialized structs borrow &str directly from the arena buffer (zero-copy). The arena resets after each batch -- zero syscalls, zero OS allocator calls.

SIMD-JSON Parsing -- simd-json performs in-place vectorised parsing (AVX2/AVX-512 on x86_64, NEON on aarch64). Payloads over 1 MB are automatically routed to tokio::task::spawn_blocking so that long parse runs cannot stall the Tokio I/O reactor.

Rate Limiter (rate_limit.rs) -- A lock-free token bucket using a single fetch_sub atomic RMW instruction on the fast path (no load+CAS loop). Integer-only microsecond-precision refill eliminates all floating-point arithmetic. Batch acquire mode acquires N tokens in one atomic operation, reducing contention by up to 64x per batch. No Mutex, no RwLock, no thread parking. When the bucket is empty, workers suspend via tokio::time::sleep, causing natural backpressure through the bounded ingress channel.

Egress (egress.rs) -- Multiplexed HTTP/2 connections over TLS 1.3 (rustls + webpki). Connection pools are sharded per upstream with round-robin dispatch. Connections are pre-warmed at startup via HEAD requests. Tunable HTTP/2 parameters: 8 MB stream windows, 64 MB connection windows, 512 max concurrent streams, adaptive flow control.

Profiling (profiling.rs) -- A dedicated OS thread (fully isolated from the Tokio reactor) serves a /debug/pprof/profile endpoint on port 6060. On Linux, it captures CPU flamegraphs via pprof with configurable duration (?seconds=N).

Runtime Topology -- Tokio worker threads are pinned 1:1 to logical CPU cores via core_affinity. The global allocator is mimalloc for reduced fragmentation under concurrent allocation patterns.

Design Principles

  • Zero allocations on the hot path: No .clone(), .to_owned(), String::from(), Vec::new(), or Box::new() in the ingress-parse-route critical path. All transient data lives in the bump arena.
  • Completely lock-free: No std::sync::Mutex or std::sync::RwLock anywhere in the codebase. Shared state uses atomic operations. Channels are lock-free crossbeam bounded queues.
  • Single-instruction rate limiting: The token bucket fast path uses one fetch_sub instead of a load+compare+CAS loop. Batch mode acquires an entire batch's tokens in a single atomic operation.
  • Reactor safety: Large payloads (>1 MB) are offloaded to spawn_blocking so SIMD parsing cannot starve the async I/O event loop. Small payloads stay inline for minimum latency.
  • Arena lifecycle safety: arena.reset() is called strictly after all borrowed &str references from the parse phase are dropped, preventing use-after-free while avoiding memory leaks.

Benchmark Results

Benchmarks run with Criterion on an optimised release build (opt-level = 3, thin LTO, single codegen unit, target-cpu=native, overflow checks disabled).
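Those settings correspond to a Cargo.toml release profile along these lines (inferred from the description above, not copied from the repository):

```toml
# Release profile implied by the benchmark settings (assumed layout).
[profile.release]
opt-level = 3
lto = "thin"
codegen-units = 1
overflow-checks = false
# target-cpu=native is passed via RUSTFLAGS, not the profile (see Building).
```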

Arena Allocation + SIMD-JSON Parse (Full Hot Path)

| Payload Size | Latency (median) | Throughput |
|---|---|---|
| 48 B (small) | 796 ns | 56.3 MiB/s |
| 4 KB (medium) | 1.57 µs | 2.47 GiB/s |
| 64 KB (large) | 21.4 µs | 2.91 GiB/s |

Arena Allocation Only (Bump Allocator Overhead)

| Buffer Size | Latency (median) | Throughput |
|---|---|---|
| 64 B | 6.3 ns | 9.4 GiB/s |
| 512 B | 9.2 ns | 51.8 GiB/s |
| 4 KB | 35.6 ns | 107.2 GiB/s |
| 64 KB | 2.17 µs | 28.8 GiB/s |

SIMD-JSON Parse Only (4 KB Payload)

| Benchmark | Latency (median) | Throughput |
|---|---|---|
| simd-json parse 4 KB | 1.52 µs | 2.53 GiB/s |

Rate Limiter (Uncontended Single-Instruction fetch_sub)

| Benchmark | Latency |
|---|---|
| 100 acquires (full bucket) | 1.40 µs |
| Per-acquire | ~14.0 ns |
| Batch acquire 64 tokens | 291 ns |
| Per-token (batched) | ~4.5 ns |

Crossbeam Channel (Lock-Free Queue)

| Queue Depth | 1,000 send+recv | Per-operation |
|---|---|---|
| 1,024 | 34.4 µs | ~17 ns |
| 16,384 | 34.2 µs | ~17 ns |

End-to-End Ingress Task (Arena Alloc + Parse + Route Resolve)

| Benchmark | Latency (median) |
|---|---|
| Full ingress pipeline | 549 ns |

Velox vs Standard Approaches

Head-to-head benchmarks measuring the same logical operation via Velox's optimised path versus standard Rust equivalents. Both paths run under identical conditions, so the ratios transfer across machines better than the absolute numbers do.

Memory Allocation: Bumpalo arena vs Vec::from heap

| Buffer Size | Bumpalo Arena | Vec Heap | Speedup |
|---|---|---|---|
| 64 B | 4.2 ns | 37.8 ns | 9.0x faster |
| 512 B | 8.4 ns | 50.4 ns | 6.0x faster |
| 4 KB | 29.2 ns | 64.5 ns | 2.2x faster |
| 64 KB | 1.38 µs | 1.68 µs | 1.2x faster |

JSON Parse: simd-json zero-copy (arena) vs serde_json owned (heap) — realistic multi-message API payloads

| Payload | simd-json + Arena | serde_json + Heap | Speedup |
|---|---|---|---|
| 8 messages, 4 KB | 1.68 µs | 2.62 µs | 1.56x faster |
| 32 messages, 64 KB | 17.5 µs | 17.9 µs | 1.02x faster |

Rate Limiter: Lock-free atomic fetch_sub vs Mutex&lt;i64&gt; (10,000 acquires)

| Approach | Latency | Speedup |
|---|---|---|
| Atomic fetch_sub (Velox) | 74.8 µs | 1.59x faster |
| Mutex guard | 118.8 µs | baseline |

Ingress Channel: Crossbeam bounded MPMC vs std::sync::mpsc (8 producers, 1,000 messages)

| Approach | Latency | Speedup |
|---|---|---|
| Crossbeam bounded (Velox) | 635 µs | 1.16x faster |
| std::sync::mpsc | 734 µs | baseline |

End-to-End Ingress: Arena + simd-json + route resolve vs Heap + serde_json (4-message chat payload)

| Approach | Latency | Speedup |
|---|---|---|
| Velox (arena + simd-json) | 1.19 µs | 1.23x faster |
| Standard (heap + serde_json) | 1.47 µs | baseline |

Key Takeaways

  • 549 ns end-to-end for a complete ingress task (alloc + parse + route) -- translates to a theoretical ceiling of ~1.82 million requests/sec per core.
  • Arena overhead is negligible: 4.2 ns for a 64-byte allocation (9.0x faster than heap Vec::from). The bump pointer is effectively free compared to parse cost.
  • simd-json + arena is 1.56x faster than serde_json + heap for realistic 4 KB multi-message API payloads, with zero-copy &str borrowing eliminating all string allocation on the hot path.
  • Batch rate limiting acquires 64 tokens in 291 ns (~4.5 ns/token) -- one fetch_sub acquires the entire batch, replacing 64 individual atomic RMW operations and cutting atomic traffic per batch by 64x.
  • Rate limiter fast path is 1.59x faster than Mutex-guarded counter — one atomic RMW instruction, no lock, no retry. Integer-only refill math eliminates all floating-point from the rate limiter.
  • Crossbeam MPMC is 1.16x faster than std::sync::mpsc under contention (8 producers) while supporting multi-consumer patterns that mpsc cannot.
  • Arena allocation is 1.2--9x faster than heap depending on buffer size, with the advantage growing for smaller allocations where malloc overhead dominates.

Building

Native CPU Tuning

build.rs warns when target-cpu=native is missing because Cargo build scripts cannot directly force that codegen flag for the final crate. For release builds, pass the flag explicitly so rustc can use host SIMD features, including AVX-512 when the local CPU supports it.

PowerShell:

$env:RUSTFLAGS="-C target-cpu=native"
cargo build --release
Remove-Item Env:RUSTFLAGS

Bash/Zsh:

RUSTFLAGS="-C target-cpu=native" cargo build --release

Running Tests

cargo test

Running Benchmarks

cargo bench

HTML reports are generated in target/criterion/.

Configuration

All configuration is via environment variables:

| Variable | Default | Description |
|---|---|---|
| VELOX_UPSTREAM_ORIGIN | https://api.openai.com | Upstream LLM provider origin |
| VELOX_UPSTREAM_REQUEST_PATH | /v1/responses | API request path |
| VELOX_UPSTREAM_WARMUP_PATH | /v1/models | Connection warmup path |
| VELOX_EGRESS_POOL_CONNECTIONS | (num logical cores) | HTTP/2 connection pool size |
| VELOX_RATE_LIMIT_RPM | 10000 | Requests per minute limit |
| VELOX_RATE_LIMIT_BURST | 500 | Token bucket burst capacity |
| OPENAI_API_KEY | (none) | Bearer token for upstream auth |
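A typical launch might look like this (values are the documented defaults; the API key is a placeholder):

```shell
export VELOX_UPSTREAM_ORIGIN="https://api.openai.com"
export VELOX_RATE_LIMIT_RPM=10000
export VELOX_RATE_LIMIT_BURST=500
export OPENAI_API_KEY="<your key>"
RUSTFLAGS="-C target-cpu=native" cargo run --release
```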

Profiling

When running, a profiling endpoint is available at:

http://127.0.0.1:6060/debug/pprof/profile?seconds=10

On Linux, this returns an SVG CPU flamegraph captured over the specified duration. The profiling server runs on a dedicated OS thread, completely isolated from the Tokio async runtime.

Dependencies

| Crate | Purpose |
|---|---|
| tokio | Async runtime with core-pinned workers |
| crossbeam | Lock-free bounded MPMC channel |
| simd-json | SIMD-accelerated in-place JSON parser |
| bumpalo | Bump allocator for per-request arenas |
| hyper + hyper-util | HTTP/2 client with connection pooling |
| rustls + tokio-rustls | TLS 1.3 with no OpenSSL dependency |
| mimalloc | Global allocator with low fragmentation |
| core_affinity | CPU core pinning for worker threads |
| bytes | Reference-counted byte buffers |
| pprof | CPU profiling with flamegraph output (Linux) |
