llm-inference-bench

LLM inference decode throughput benchmark with a Rich TUI dashboard.

Measures token generation speed across a matrix of concurrency levels and context lengths, giving you a full picture of how your serving engine scales under load.

Supports SGLang and vLLM engines (auto-detected).

Requires Python 3.10+.


Features

  • Throughput matrix — benchmarks every combination of concurrency (1, 2, 4, 8, ...) and context length (0K, 16K, 32K, 64K, 128K)
  • Server-side metrics — scrapes Prometheus /metrics endpoint for accurate gen_throughput (tok/s) reported by the engine
  • Live TUI dashboard — real-time progress, per-cell results, and aggregate stats via Rich
  • Prefill measurement — separate TTFT measurement for large context prefill
  • JSON output — structured results saved to benchmark_results.json for further analysis
  • Smart test skipping — reads KV cache budget and max_running_requests from the server at startup, automatically skips combinations where concurrency × (context + max_tokens) would exceed available KV cache — these would just queue anyway, so there's no point measuring them
  • Engine auto-detection — automatically detects SGLang vs vLLM and adapts metric scraping

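The smart-skip rule above can be sketched as a standalone check (function and parameter names here are illustrative, not the script's actual API):

```python
def would_exceed_kv_cache(concurrency: int, context_len: int, max_tokens: int,
                          kv_cache_budget_tokens: int,
                          max_running_requests: int) -> bool:
    """True when a test cell cannot run fully concurrent and should be skipped."""
    if concurrency > max_running_requests:
        return True  # the extra requests would just sit in the queue
    # Worst case: every request holds its full context plus all generated tokens.
    needed = concurrency * (context_len + max_tokens)
    return needed > kv_cache_budget_tokens
```

For example, 128 requests at 128K context with 8K generation need 128 × (131072 + 8192) ≈ 17.8M KV-cache token slots, far beyond most single-node budgets.
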
Installation

pip install httpx rich

Usage

# Default: localhost:30000, tests concurrency 1-128, contexts 0K-128K
python3 llm_decode_bench.py

# Custom port and parameters
python3 llm_decode_bench.py --port 5199 --concurrency 1,2,4 --contexts 0,16384

# Custom max tokens and test duration
python3 llm_decode_bench.py --port 5001 --max-tokens 4096 --duration 60

Arguments

Argument       Default                      Description
--host         localhost                    Server hostname
--port         30000                        Server port
--concurrency  1,2,4,8,16,32,64,128         Comma-separated concurrency levels
--contexts     0,16384,32768,65536,131072   Comma-separated context lengths (tokens)
--max-tokens   8192                         Max tokens to generate per request
--duration     30                           Duration per test cell (seconds)
--output       benchmark_results.json       Output file path

Output

Results are saved as JSON with metadata and per-cell throughput data:

{
  "metadata": {
    "engine": "vllm",
    "model": "Qwen3_5-397B-A17B-NVFP4",
    "timestamp": "2026-03-13T00:30:53",
    "concurrency_levels": [1, 2, 4, 8, 16, 32, 64, 128],
    "context_lengths": [0, 16384, 32768, 65536, 131072]
  },
  "results": { ... }
}
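
The JSON file can be post-processed with a few lines of Python. A sketch under one assumption: each per-cell entry carries a "gen_throughput" field, which may differ from the actual schema, so adapt the key to your output:

```python
import json

def load_results(path: str = "benchmark_results.json") -> dict:
    """Load the JSON file the benchmark writes."""
    with open(path) as f:
        return json.load(f)

def summarize(results: dict) -> list[tuple[str, float]]:
    """Sort benchmark cells by generated-token throughput, highest first."""
    return sorted(
        ((name, cell["gen_throughput"]) for name, cell in results.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
```

For example, `summarize(load_results()["results"])` yields (cell, tok/s) pairs sorted best-first.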

License

MIT
