AidanHT/Velox
Velox

A high-throughput, zero-allocation AI inference gateway written in Rust. Velox sits between API consumers and upstream LLM providers (e.g. OpenAI), routing requests through a lock-free ingress pipeline with SIMD-accelerated JSON parsing, per-worker memory arenas, and multiplexed HTTP/2 egress with pre-warmed connection pools.

Architecture

                         ┌─────────────────────────────────────┐
                         │          Velox Gateway               │
                         │                                     │
  Producers ──►  ┌───────────────┐   ┌──────────────────┐     │
                 │  Crossbeam    │   │  Tokio Runtime    │     │
                 │  Bounded      │──►│  (pinned cores)   │     │
                 │  Channel      │   │                   │     │
                 │  (16,384)     │   │  ┌─────────────┐  │     │
                 └───────────────┘   │  │ Worker 0    │  │     │
                                     │  │ ┌─────────┐ │  │     │
                                     │  │ │ Bumpalo  │ │  │     │    ┌──────────────┐
                                     │  │ │ Arena    │ │  │     │    │  Upstream     │
                                     │  │ └─────────┘ │  │     │    │  (HTTP/2 +    │
                                     │  │ ┌─────────┐ │  │────►│───►│   TLS 1.3)   │
                                     │  │ │simd-json│ │  │     │    │              │
                                     │  │ └─────────┘ │  │     │    │  Pre-warmed  │
                                     │  └─────────────┘  │     │    │  Pool        │
                                     │  ┌─────────────┐  │     │    └──────────────┘
                                     │  │ Worker N    │  │     │
                                     │  │  (same)     │  │     │
                                     │  └─────────────┘  │     │
                                     │                   │     │
                                     │  ┌─────────────┐  │     │
                                     │  │ TokenBucket │  │     │
                                      │  │ (atomic RMW)│  │     │
                                     │  └─────────────┘  │     │
                                     └──────────────────┘     │
                         │                                     │
                         │  ┌─────────────┐                    │
                         │  │ pprof       │ :6060              │
                         │  │ (OS thread) │                    │
                         │  └─────────────┘                    │
                         └─────────────────────────────────────┘

Core Components

Ingress Pipeline (main.rs) -- Incoming request payloads are submitted into a lock-free bounded crossbeam channel (depth 16,384). One Tokio worker task per logical CPU core drains the channel in micro-batches of up to 64 payloads, parsing each through the arena-backed SIMD-JSON pipeline.

Request Arena (arena.rs) -- Each worker owns a bumpalo bump allocator pre-allocated at 256 KB. Network payloads are copied into the arena with 32 bytes of SIMD zero-padding, then parsed in-place by simd-json. Deserialized structs borrow &str directly from the arena buffer (zero-copy). The arena resets after each batch -- zero syscalls, zero OS allocator calls.

SIMD-JSON Parsing -- simd-json performs in-place vectorised parsing (AVX2/AVX-512 on x86_64, NEON on aarch64). Payloads over 1 MB are automatically routed to tokio::task::spawn_blocking so that long parse runs cannot stall the Tokio I/O reactor.

Rate Limiter (rate_limit.rs) -- A lock-free token bucket using a single fetch_sub atomic RMW instruction on the fast path (no load+CAS loop). Integer-only microsecond-precision refill eliminates all floating-point arithmetic. Batch acquire mode acquires N tokens in one atomic operation, reducing contention by up to 64x per batch. No Mutex, no RwLock, no thread parking. When the bucket is empty, workers suspend via tokio::time::sleep, causing natural backpressure through the bounded ingress channel.

Egress (egress.rs) -- Multiplexed HTTP/2 connections over TLS 1.3 (rustls + webpki). Connection pools are sharded per upstream with round-robin dispatch. Connections are pre-warmed at startup via HEAD requests. Tunable HTTP/2 parameters: 8 MB stream windows, 64 MB connection windows, 512 max concurrent streams, adaptive flow control.

Profiling (profiling.rs) -- A dedicated OS thread (fully isolated from the Tokio reactor) serves a /debug/pprof/profile endpoint on port 6060. On Linux, it captures CPU flamegraphs via pprof with configurable duration (?seconds=N).

Runtime Topology -- Tokio worker threads are pinned 1:1 to logical CPU cores via core_affinity. The global allocator is mimalloc for reduced fragmentation under concurrent allocation patterns.

Design Principles

  • Zero allocations on the hot path: No .clone(), .to_owned(), String::from(), Vec::new(), or Box::new() in the ingress-parse-route critical path. All transient data lives in the bump arena.
  • Completely lock-free: No std::sync::Mutex or std::sync::RwLock anywhere in the codebase. Shared state uses atomic operations. Channels are lock-free crossbeam bounded queues.
  • Single-instruction rate limiting: The token bucket fast path uses one fetch_sub instead of a load+compare+CAS loop. Batch mode acquires an entire batch's tokens in a single atomic operation.
  • Reactor safety: Large payloads (>1 MB) are offloaded to spawn_blocking so SIMD parsing cannot starve the async I/O event loop. Small payloads stay inline for minimum latency.
  • Arena lifecycle safety: arena.reset() is called strictly after all borrowed &str references from the parse phase are dropped, preventing use-after-free while avoiding memory leaks.

Benchmark Results

Benchmarks run with Criterion on an optimised release build (opt-level = 3, thin LTO, single codegen unit, target-cpu=native, overflow checks disabled).
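Those settings correspond to a Cargo.toml release profile along these lines (inferred from the description above, not copied from the repository):

```toml
# Release profile implied by the benchmark settings (assumed layout).
[profile.release]
opt-level = 3
lto = "thin"
codegen-units = 1
overflow-checks = false
# target-cpu=native is passed via RUSTFLAGS, not the profile (see Building).
```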

Arena Allocation + SIMD-JSON Parse (Full Hot Path)

| Payload Size | Latency (median) | Throughput |
|---|---|---|
| 48 B (small) | 796 ns | 56.3 MiB/s |
| 4 KB (medium) | 1.57 µs | 2.47 GiB/s |
| 64 KB (large) | 21.4 µs | 2.91 GiB/s |

Arena Allocation Only (Bump Allocator Overhead)

| Buffer Size | Latency (median) | Throughput |
|---|---|---|
| 64 B | 6.3 ns | 9.4 GiB/s |
| 512 B | 9.2 ns | 51.8 GiB/s |
| 4 KB | 35.6 ns | 107.2 GiB/s |
| 64 KB | 2.17 µs | 28.8 GiB/s |

SIMD-JSON Parse Only (4 KB Payload)

| Benchmark | Latency (median) | Throughput |
|---|---|---|
| simd-json parse 4 KB | 1.52 µs | 2.53 GiB/s |

Rate Limiter (Uncontended Single-Instruction fetch_sub)

| Benchmark | Latency |
|---|---|
| 100 acquires (full bucket) | 1.40 µs |
| Per-acquire | ~14.0 ns |
| Batch acquire 64 tokens | 291 ns |
| Per-token (batched) | ~4.5 ns |

Crossbeam Channel (Lock-Free Queue)

| Queue Depth | 1,000 send+recv | Per-operation |
|---|---|---|
| 1,024 | 34.4 µs | ~17 ns |
| 16,384 | 34.2 µs | ~17 ns |

End-to-End Ingress Task (Arena Alloc + Parse + Route Resolve)

| Benchmark | Latency (median) |
|---|---|
| Full ingress pipeline | 549 ns |

Velox vs Standard Approaches

Head-to-head benchmarks measuring the same logical operation via Velox's optimised path versus standard Rust equivalents. Both paths run under identical conditions, so the ratios transfer across machines better than the absolute numbers do.

Memory Allocation: Bumpalo arena vs Vec::from heap

| Buffer Size | Bumpalo Arena | Vec Heap | Speedup |
|---|---|---|---|
| 64 B | 4.2 ns | 37.8 ns | 9.0x faster |
| 512 B | 8.4 ns | 50.4 ns | 6.0x faster |
| 4 KB | 29.2 ns | 64.5 ns | 2.2x faster |
| 64 KB | 1.38 µs | 1.68 µs | 1.2x faster |

JSON Parse: simd-json zero-copy (arena) vs serde_json owned (heap) — realistic multi-message API payloads

| Payload | simd-json + Arena | serde_json + Heap | Speedup |
|---|---|---|---|
| 8 messages, 4 KB | 1.68 µs | 2.62 µs | 1.56x faster |
| 32 messages, 64 KB | 17.5 µs | 17.9 µs | 1.02x faster |

Rate Limiter: Lock-free atomic fetch_sub vs Mutex&lt;i64&gt; (10,000 acquires)

| Approach | Latency | Speedup |
|---|---|---|
| Atomic fetch_sub (Velox) | 74.8 µs | 1.59x faster |
| Mutex guard | 118.8 µs | baseline |

Ingress Channel: Crossbeam bounded MPMC vs std::sync::mpsc (8 producers, 1,000 messages)

| Approach | Latency | Speedup |
|---|---|---|
| Crossbeam bounded (Velox) | 635 µs | 1.16x faster |
| std::sync::mpsc | 734 µs | baseline |

End-to-End Ingress: Arena + simd-json + route resolve vs Heap + serde_json (4-message chat payload)

| Approach | Latency | Speedup |
|---|---|---|
| Velox (arena + simd-json) | 1.19 µs | 1.23x faster |
| Standard (heap + serde_json) | 1.47 µs | baseline |

Key Takeaways

  • 549 ns end-to-end for a complete ingress task (alloc + parse + route) -- translates to a theoretical ceiling of ~1.82 million requests/sec per core.
  • Arena overhead is negligible: 4.2 ns for a 64-byte allocation (9.0x faster than heap Vec::from). The bump pointer is effectively free compared to parse cost.
  • simd-json + arena is 1.56x faster than serde_json + heap for realistic 4 KB multi-message API payloads, with zero-copy &str borrowing eliminating all string allocation on the hot path.
  • Batch rate limiting acquires 64 tokens in 291 ns (~4.5 ns/token) -- one fetch_sub acquires the entire batch, replacing 64 individual atomic RMW operations and cutting atomic traffic per batch by 64x.
  • Rate limiter fast path is 1.59x faster than Mutex-guarded counter — one atomic RMW instruction, no lock, no retry. Integer-only refill math eliminates all floating-point from the rate limiter.
  • Crossbeam MPMC is 1.16x faster than std::sync::mpsc under contention (8 producers) while supporting multi-consumer patterns that mpsc cannot.
  • Arena allocation is 1.2--9x faster than heap depending on buffer size, with the advantage growing for smaller allocations where malloc overhead dominates.

Building

Native CPU Tuning

build.rs warns when target-cpu=native is missing because Cargo build scripts cannot directly force that codegen flag for the final crate. For release builds, pass the flag explicitly so rustc can use host SIMD features, including AVX-512 when the local CPU supports it.

PowerShell:

$env:RUSTFLAGS="-C target-cpu=native"
cargo build --release
Remove-Item Env:RUSTFLAGS

Bash/Zsh:

RUSTFLAGS="-C target-cpu=native" cargo build --release

Running Tests

cargo test

Running Benchmarks

cargo bench

HTML reports are generated in target/criterion/.

Configuration

All configuration is via environment variables:

| Variable | Default | Description |
|---|---|---|
| VELOX_UPSTREAM_ORIGIN | https://api.openai.com | Upstream LLM provider origin |
| VELOX_UPSTREAM_REQUEST_PATH | /v1/responses | API request path |
| VELOX_UPSTREAM_WARMUP_PATH | /v1/models | Connection warmup path |
| VELOX_EGRESS_POOL_CONNECTIONS | (num logical cores) | HTTP/2 connection pool size |
| VELOX_RATE_LIMIT_RPM | 10000 | Requests per minute limit |
| VELOX_RATE_LIMIT_BURST | 500 | Token bucket burst capacity |
| OPENAI_API_KEY | (none) | Bearer token for upstream auth |
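A typical launch might look like this (values are the documented defaults; the API key is a placeholder):

```shell
export VELOX_UPSTREAM_ORIGIN="https://api.openai.com"
export VELOX_RATE_LIMIT_RPM=10000
export VELOX_RATE_LIMIT_BURST=500
export OPENAI_API_KEY="<your key>"
RUSTFLAGS="-C target-cpu=native" cargo run --release
```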

Profiling

When running, a profiling endpoint is available at:

http://127.0.0.1:6060/debug/pprof/profile?seconds=10

On Linux, this returns an SVG CPU flamegraph captured over the specified duration. The profiling server runs on a dedicated OS thread, completely isolated from the Tokio async runtime.

Dependencies

| Crate | Purpose |
|---|---|
| tokio | Async runtime with core-pinned workers |
| crossbeam | Lock-free bounded MPMC channel |
| simd-json | SIMD-accelerated in-place JSON parser |
| bumpalo | Bump allocator for per-request arenas |
| hyper + hyper-util | HTTP/2 client with connection pooling |
| rustls + tokio-rustls | TLS 1.3 with no OpenSSL dependency |
| mimalloc | Global allocator with low fragmentation |
| core_affinity | CPU core pinning for worker threads |
| bytes | Reference-counted byte buffers |
| pprof | CPU profiling with flamegraph output (Linux) |
