A high-throughput, zero-allocation AI inference gateway written in Rust. Velox sits between API consumers and upstream LLM providers (e.g. OpenAI), routing requests through a lock-free ingress pipeline with SIMD-accelerated JSON parsing, per-worker memory arenas, and multiplexed HTTP/2 egress with pre-warmed connection pools.
```
              ┌──────────────────────────────────────────────────┐
              │                  Velox Gateway                   │
              │                                                  │
Producers ──► │ ┌───────────────┐    ┌──────────────────┐        │
              │ │  Crossbeam    │    │  Tokio Runtime   │        │
              │ │  Bounded      │───►│  (pinned cores)  │        │
              │ │  Channel      │    │                  │        │
              │ │  (16,384)     │    │ ┌──────────────┐ │        │
              │ └───────────────┘    │ │   Worker 0   │ │        │
              │                      │ │  ┌─────────┐ │ │        │
              │                      │ │  │ Bumpalo │ │ │        │   ┌──────────────┐
              │                      │ │  │  Arena  │ │ │        │   │   Upstream   │
              │                      │ │  └─────────┘ │ │        │   │  (HTTP/2 +   │
              │                      │ │  ┌─────────┐ │ ├────────┼──►│  TLS 1.3)    │
              │                      │ │  │simd-json│ │ │        │   │              │
              │                      │ │  └─────────┘ │ │        │   │  Pre-warmed  │
              │                      │ └──────────────┘ │        │   │     Pool     │
              │                      │ ┌──────────────┐ │        │   └──────────────┘
              │                      │ │   Worker N   │ │        │
              │                      │ │    (same)    │ │        │
              │                      │ └──────────────┘ │        │
              │                      │ ┌──────────────┐ │        │
              │                      │ │ TokenBucket  │ │        │
              │                      │ │ (atomic CAS) │ │        │
              │                      │ └──────────────┘ │        │
              │                      └──────────────────┘        │
              │                                                  │
              │                      ┌──────────────┐            │
              │                      │    pprof     │  :6060     │
              │                      │ (OS thread)  │            │
              │                      └──────────────┘            │
              └──────────────────────────────────────────────────┘
```
Ingress Pipeline (main.rs) -- Incoming request payloads are submitted into a lock-free bounded crossbeam channel (depth 16,384). One Tokio worker task per logical CPU core drains the channel in micro-batches of up to 64 tasks, parsing each through the arena-backed SIMD-JSON pipeline.
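The micro-batch drain can be sketched with a bounded `std::sync::mpsc` channel standing in for crossbeam's bounded queue (the batching shape is the same; the function and names here are illustrative, not Velox's actual code):

```rust
use std::sync::mpsc::{sync_channel, Receiver};

/// Drain up to `max` queued tasks without blocking -- the shape of
/// one worker's micro-batch step (Velox uses a crossbeam bounded
/// channel and a batch size of 64).
fn drain_batch(rx: &Receiver<String>, max: usize) -> Vec<String> {
    let mut batch = Vec::with_capacity(max);
    while batch.len() < max {
        match rx.try_recv() {
            Ok(task) => batch.push(task),
            Err(_) => break, // channel empty (or closed): process what we have
        }
    }
    batch
}

fn main() {
    // Bounded channel, depth 16,384 as in the ingress pipeline.
    let (tx, rx) = sync_channel::<String>(16_384);
    for i in 0..6 {
        tx.send(format!("payload-{i}")).unwrap();
    }
    let batch = drain_batch(&rx, 64);
    println!("drained {} tasks in one micro-batch", batch.len());
}
```

Batching amortises channel synchronisation and arena resets across up to 64 tasks per wakeup.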
Request Arena (arena.rs) -- Each worker owns a bumpalo bump allocator pre-allocated at 256 KB. Network payloads are copied into the arena with 32 bytes of SIMD zero-padding, then parsed in-place by simd-json. Deserialized structs borrow &str directly from the arena buffer (zero-copy). The arena resets after each batch -- zero syscalls, zero OS allocator calls.
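The copy-in / borrow / reset lifecycle can be modelled with std alone (Velox uses `bumpalo`; this sketch omits the SIMD padding and real bump-pointer internals, and the type is illustrative):

```rust
/// Minimal std-only model of a per-worker arena.
struct Arena {
    buf: Vec<u8>,
}

impl Arena {
    fn with_capacity(cap: usize) -> Self {
        Arena { buf: Vec::with_capacity(cap) }
    }

    /// Copy a network payload into the arena and borrow it back as
    /// &str -- no new heap allocation as long as capacity suffices.
    fn alloc_str<'a>(&'a mut self, payload: &str) -> &'a str {
        let start = self.buf.len();
        self.buf.extend_from_slice(payload.as_bytes());
        std::str::from_utf8(&self.buf[start..]).unwrap()
    }

    /// Reset after the batch: O(1), keeps capacity, no syscalls.
    fn reset(&mut self) {
        self.buf.clear();
    }
}

fn main() {
    let mut arena = Arena::with_capacity(256 * 1024);
    {
        // Parse phase: deserialized fields borrow from the arena.
        let body = arena.alloc_str(r#"{"model":"gpt"}"#);
        println!("borrowed {} bytes from arena", body.len());
    } // all borrowed &str dropped here
    arena.reset(); // safe: borrows are gone; the 256 KB stays reserved
}
```

The borrow checker enforces the lifecycle rule for free: `reset()` takes `&mut self`, so it cannot be called while any `&str` from the parse phase is still alive.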
SIMD-JSON Parsing -- simd-json performs in-place vectorised parsing (AVX2/AVX-512 on x86_64, NEON on aarch64). Payloads over 1 MB are automatically routed to tokio::task::spawn_blocking to prevent SIMD vectorisation passes from stalling the Tokio I/O reactor.
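The size-based routing decision might look like the following, with a plain OS thread standing in for `tokio::task::spawn_blocking` and a stub in place of the simd-json parse (a sketch, not Velox's code):

```rust
use std::thread;

const OFFLOAD_THRESHOLD: usize = 1 << 20; // 1 MB

// Stand-in for the parse step (simd-json in Velox): here it just
// reports the payload size.
fn parse(payload: &[u8]) -> usize {
    payload.len()
}

/// Parse small payloads inline for minimum latency; push large ones
/// off the reactor thread so a long vectorised parse cannot stall
/// async I/O.
fn route(payload: Vec<u8>) -> usize {
    if payload.len() > OFFLOAD_THRESHOLD {
        thread::spawn(move || parse(&payload)).join().unwrap()
    } else {
        parse(&payload)
    }
}

fn main() {
    println!("small inline parse: {} bytes", route(vec![0u8; 48]));
    println!("large offloaded parse: {} bytes", route(vec![0u8; 2 << 20]));
}
```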
Rate Limiter (rate_limit.rs) -- A lock-free token bucket using a single fetch_sub atomic RMW instruction on the fast path (no load+CAS loop). Integer-only microsecond-precision refill eliminates all floating-point arithmetic. Batch acquire mode acquires N tokens in one atomic operation, reducing contention by up to 64x per batch. No Mutex, no RwLock, no thread parking. When the bucket is empty, workers suspend via tokio::time::sleep, causing natural backpressure through the bounded ingress channel.
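A minimal sketch of such a `fetch_sub`-based bucket using only std atomics. Field and method names are assumptions, and the refill shown here is simplified to a load/store (a production refill must itself be an atomic RMW or CAS to be safe under concurrent refillers):

```rust
use std::sync::atomic::{AtomicI64, Ordering};

struct TokenBucket {
    tokens: AtomicI64, // may dip briefly negative under contention
    capacity: i64,
}

impl TokenBucket {
    fn new(capacity: i64) -> Self {
        Self { tokens: AtomicI64::new(capacity), capacity }
    }

    /// Fast path: a single atomic RMW per acquire -- batch mode just
    /// passes n > 1. On overdraw, give the tokens back and fail.
    fn try_acquire(&self, n: i64) -> bool {
        let prev = self.tokens.fetch_sub(n, Ordering::AcqRel);
        if prev >= n {
            true
        } else {
            self.tokens.fetch_add(n, Ordering::AcqRel); // undo overdraw
            false
        }
    }

    /// Integer-only refill: microseconds elapsed times tokens/us,
    /// clamped to capacity. No floating point anywhere.
    fn refill(&self, elapsed_us: i64, tokens_per_us: i64) {
        let cur = self.tokens.load(Ordering::Acquire);
        let new = (cur + elapsed_us * tokens_per_us).min(self.capacity);
        self.tokens.store(new, Ordering::Release); // simplified, see note above
    }
}

fn main() {
    let bucket = TokenBucket::new(100);
    assert!(bucket.try_acquire(64));  // batch acquire: one fetch_sub for 64 tokens
    assert!(bucket.try_acquire(36));
    assert!(!bucket.try_acquire(1));  // empty: caller sleeps, creating backpressure
    bucket.refill(1_000, 1);          // 1 token/us for 1 ms, clamped to capacity
    assert!(bucket.try_acquire(100));
    println!("token bucket sketch ok");
}
```

The undo-on-failure pattern is what lets the success path be a single instruction: contention cost is only paid when the bucket is actually empty.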
Egress (egress.rs) -- Multiplexed HTTP/2 connections over TLS 1.3 (rustls + webpki). Connection pools are sharded per upstream with round-robin dispatch. Connections are pre-warmed at startup via HEAD requests. Tunable HTTP/2 parameters: 8 MB stream windows, 64 MB connection windows, 512 max concurrent streams, adaptive flow control.
Profiling (profiling.rs) -- A dedicated OS thread (fully isolated from the Tokio reactor) serves a /debug/pprof/profile endpoint on port 6060. On Linux, it captures CPU flamegraphs via pprof with configurable duration (?seconds=N).
Runtime Topology -- Tokio worker threads are pinned 1:1 to logical CPU cores via core_affinity. The global allocator is mimalloc for reduced fragmentation under concurrent allocation patterns.
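In code, the allocator swap and core pinning amount to roughly this configuration fragment (a sketch against the public `mimalloc` and `core_affinity` APIs; the worker-indexing scheme and helper name are assumptions, not Velox's actual setup):

```rust
use mimalloc::MiMalloc;

// Route every heap allocation in the binary through mimalloc.
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;

/// Pin the calling worker thread to one logical core
/// (illustrative helper).
fn pin_worker(worker_index: usize) {
    if let Some(cores) = core_affinity::get_core_ids() {
        let core = cores[worker_index % cores.len()];
        core_affinity::set_for_current(core);
    }
}
```

Since this fragment depends on external crates, treat it as configuration to adapt rather than drop-in code.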
- **Zero allocations on the hot path:** No `.clone()`, `.to_owned()`, `String::from()`, `Vec::new()`, or `Box::new()` in the ingress-parse-route critical path. All transient data lives in the bump arena.
- **Completely lock-free:** No `std::sync::Mutex` or `std::sync::RwLock` anywhere in the codebase. Shared state uses atomic operations. Channels are lock-free crossbeam bounded queues.
- **Single-instruction rate limiting:** The token bucket fast path uses one `fetch_sub` instead of a load+compare+CAS loop. Batch mode acquires an entire batch's tokens in a single atomic operation.
- **Reactor safety:** Large payloads (>1 MB) are offloaded to `spawn_blocking` so SIMD parsing cannot starve the async I/O event loop. Small payloads stay inline for minimum latency.
- **Arena lifecycle safety:** `arena.reset()` is called strictly after all borrowed `&str` references from the parse phase are dropped, preventing use-after-free while avoiding memory leaks.
Benchmarks run with Criterion on an optimised release build (opt-level = 3, thin LTO, single codegen unit, target-cpu=native, overflow checks disabled).
| Payload Size | Latency (median) | Throughput |
|---|---|---|
| 48 B (small) | 796 ns | 56.3 MiB/s |
| 4 KB (medium) | 1.57 us | 2.47 GiB/s |
| 64 KB (large) | 21.4 us | 2.91 GiB/s |
| Buffer Size | Latency (median) | Throughput |
|---|---|---|
| 64 B | 6.3 ns | 9.4 GiB/s |
| 512 B | 9.2 ns | 51.8 GiB/s |
| 4 KB | 35.6 ns | 107.2 GiB/s |
| 64 KB | 2.17 us | 28.8 GiB/s |
| Benchmark | Latency (median) | Throughput |
|---|---|---|
| simd-json parse 4 KB | 1.52 us | 2.53 GiB/s |
| Benchmark | Latency |
|---|---|
| 100 acquires (full bucket) | 1.40 us |
| Per-acquire | ~14.0 ns |
| Batch acquire 64 tokens | 291 ns |
| Per-token (batched) | ~4.5 ns |
| Queue Depth | 1,000 send+recv | Per-operation |
|---|---|---|
| 1,024 | 34.4 us | ~17 ns |
| 16,384 | 34.2 us | ~17 ns |
| Benchmark | Latency (median) |
|---|---|
| Full ingress pipeline | 549 ns |
Head-to-head benchmarks measuring the same logical operation via Velox's optimised path versus standard Rust equivalents. Because both paths run under identical conditions, the ratios are far less hardware-dependent than the absolute numbers.
Memory Allocation: Bumpalo arena vs Vec::from heap
| Buffer Size | Bumpalo Arena | Vec Heap | Speedup |
|---|---|---|---|
| 64 B | 4.2 ns | 37.8 ns | 9.0x faster |
| 512 B | 8.4 ns | 50.4 ns | 6.0x faster |
| 4 KB | 29.2 ns | 64.5 ns | 2.2x faster |
| 64 KB | 1.38 us | 1.68 us | 1.2x faster |
JSON Parse: simd-json zero-copy (arena) vs serde_json owned (heap) — realistic multi-message API payloads
| Payload | simd-json + Arena | serde_json + Heap | Speedup |
|---|---|---|---|
| 8 messages, 4 KB | 1.68 us | 2.62 us | 1.56x faster |
| 32 messages, 64 KB | 17.5 us | 17.9 us | 1.02x faster |
Rate Limiter: Lock-free atomic fetch_sub vs Mutex<i64> (10,000 acquires)
| Approach | Latency | Speedup |
|---|---|---|
| Atomic fetch_sub (Velox) | 74.8 us | 1.59x faster |
| Mutex guard | 118.8 us | baseline |
Ingress Channel: Crossbeam bounded MPMC vs std::sync::mpsc (8 producers, 1,000 messages)
| Approach | Latency | Speedup |
|---|---|---|
| Crossbeam bounded (Velox) | 635 us | 1.16x faster |
| std::sync::mpsc | 734 us | baseline |
End-to-End Ingress: Arena + simd-json + route resolve vs Heap + serde_json (4-message chat payload)
| Approach | Latency | Speedup |
|---|---|---|
| Velox (arena + simd-json) | 1.19 us | 1.23x faster |
| Standard (heap + serde_json) | 1.47 us | baseline |
- 549 ns end-to-end for a complete ingress task (alloc + parse + route) -- translates to a theoretical ceiling of ~1.82 million requests/sec per core.
- Arena overhead is negligible: 4.2 ns for a 64-byte allocation (9.0x faster than heap `Vec::from`). The bump pointer is effectively free compared to parse cost.
- simd-json + arena is 1.56x faster than serde_json + heap for realistic 4 KB multi-message API payloads, with zero-copy `&str` borrowing eliminating all string allocation on the hot path.
- Batch rate limiting acquires 64 tokens in 291 ns (~4.5 ns/token) -- a single `fetch_sub` instruction replaces 64 individual CAS loops, reducing atomic operations per batch by 64x.
- Rate limiter fast path is 1.59x faster than a Mutex-guarded counter -- one atomic RMW instruction, no lock, no retry. Integer-only refill math eliminates all floating-point from the rate limiter.
- Crossbeam MPMC is 1.16x faster than `std::sync::mpsc` under contention (8 producers) while supporting multi-consumer patterns that mpsc cannot.
- Arena allocation is 1.2--9x faster than heap depending on buffer size, with the advantage growing for smaller allocations where malloc overhead dominates.
build.rs warns when target-cpu=native is missing because Cargo build scripts cannot directly force that codegen flag for the final crate. For release builds, pass the flag explicitly so rustc can use host SIMD features, including AVX-512 when the local CPU supports it.
PowerShell:

```powershell
$env:RUSTFLAGS="-C target-cpu=native"
cargo build --release
Remove-Item Env:RUSTFLAGS
```

Bash/Zsh:

```bash
RUSTFLAGS="-C target-cpu=native" cargo build --release
```

Run `cargo test` for the test suite and `cargo bench` for the Criterion benchmarks; HTML reports are generated in `target/criterion/`.
All configuration is via environment variables:
| Variable | Default | Description |
|---|---|---|
| `VELOX_UPSTREAM_ORIGIN` | `https://api.openai.com` | Upstream LLM provider origin |
| `VELOX_UPSTREAM_REQUEST_PATH` | `/v1/responses` | API request path |
| `VELOX_UPSTREAM_WARMUP_PATH` | `/v1/models` | Connection warmup path |
| `VELOX_EGRESS_POOL_CONNECTIONS` | (num logical cores) | HTTP/2 connection pool size |
| `VELOX_RATE_LIMIT_RPM` | `10000` | Requests per minute limit |
| `VELOX_RATE_LIMIT_BURST` | `500` | Token bucket burst capacity |
| `OPENAI_API_KEY` | (none) | Bearer token for upstream auth |
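Environment-driven settings of this kind are typically read once at startup with a parse-or-default helper, along these lines (the helper name is illustrative, not Velox's actual code):

```rust
use std::env;

/// Read a numeric environment setting, falling back to a default
/// when the variable is unset or unparsable.
fn env_u64(key: &str, default: u64) -> u64 {
    env::var(key).ok().and_then(|v| v.parse().ok()).unwrap_or(default)
}

fn main() {
    let rpm = env_u64("VELOX_RATE_LIMIT_RPM", 10_000);
    let burst = env_u64("VELOX_RATE_LIMIT_BURST", 500);
    println!("rate limit: {rpm} rpm, burst {burst}");
}
```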
When running, a profiling endpoint is available at:
http://127.0.0.1:6060/debug/pprof/profile?seconds=10
On Linux, this returns an SVG CPU flamegraph captured over the specified duration. The profiling server runs on a dedicated OS thread, completely isolated from the Tokio async runtime.
| Crate | Purpose |
|---|---|
| `tokio` | Async runtime with core-pinned workers |
| `crossbeam` | Lock-free bounded MPMC channel |
| `simd-json` | SIMD-accelerated in-place JSON parser |
| `bumpalo` | Bump allocator for per-request arenas |
| `hyper` + `hyper-util` | HTTP/2 client with connection pooling |
| `rustls` + `tokio-rustls` | TLS 1.3 with no OpenSSL dependency |
| `mimalloc` | Global allocator with low fragmentation |
| `core_affinity` | CPU core pinning for worker threads |
| `bytes` | Reference-counted byte buffers |
| `pprof` | CPU profiling with flamegraph output (Linux) |