GitHub - EricLBuehler/mistral.rs: Fast, flexible LLM inference

Fast, flexible LLM inference.

Latest

DiffusionGemma: block-diffusion text generation. Fully integrated: paged attention, prefix caching, ISQ, multimodal, and tool calling. Guide
Anthropic Messages API: mistralrs serve now exposes Anthropic-compatible /v1/messages and /v1/messages/count_tokens endpoints alongside the OpenAI-compatible /v1 API. Guide
v0.8.2 CUDA performance: CUDA graphs, FlashInfer paged kernels, and MoE optimizations deliver strong results on GB10, B200, and H100 SXM. Benchmarks
Agentic runtime: web search, local Python code execution with model feedback, session management, and custom tool hooks. Guide
Gemma 4: full multimodal: text, image, video, and audio input. Supported models | Video setup
MXFP4 ISQ quantization: MXFP4 with optimized decode kernels for faster, smaller models. Quantization docs

Benchmarks

v0.8.2 CUDA benchmarks

Mean tokens per second across prompt lengths and decode depths from 128 to 16384 tokens. Decode uses 256 generated tokens. See the full v0.8.2 report for commands, model revisions, host metadata, and appendix tables.

Q8 prefill TPS: mistral.rs UQFF q8 vs llama.cpp GGUF Q8_0

Model	Hardware	mistral.rs	llama.cpp
Gemma 4 E4B	GB10	7395.7	3973.7
Gemma 4 E4B	B200	27705.6	11992.4
Gemma 4 E4B	H100 SXM	26220.6	11702.1
Gemma 4 26B-A4B	GB10	2947.0	2178.5
Gemma 4 26B-A4B	B200	12725.3	8503.4
Gemma 4 26B-A4B	H100 SXM	12362.3	8055.1

Q8 decode TPS: mistral.rs UQFF q8 vs llama.cpp GGUF Q8_0

Model	Hardware	mistral.rs	llama.cpp
Gemma 4 E4B	GB10	44.1	40.5
Gemma 4 E4B	B200	241.4	194.4
Gemma 4 E4B	H100 SXM	223.1	183.0
Gemma 4 26B-A4B	GB10	46.8	46.4
Gemma 4 26B-A4B	B200	210.9	192.2
Gemma 4 26B-A4B	H100 SXM	199.8	183.9

BF16 prefill TPS: mistral.rs BF16 vs vLLM BF16

Model	Hardware	mistral.rs	vLLM
Gemma 4 E4B	GB10	5838.9	5812.9
Gemma 4 E4B	B200	43547.8	39431.2
Gemma 4 E4B	H100 SXM	35852.2	39293.7
Gemma 4 26B-A4B	GB10	592.2	3878.6
Gemma 4 26B-A4B	B200	3467.3	28532.8
Gemma 4 26B-A4B	H100 SXM	2766.0	26295.9

BF16 decode TPS: mistral.rs BF16 vs vLLM BF16

Model	Hardware	mistral.rs	vLLM
Gemma 4 E4B	GB10	25.1	18.8
Gemma 4 E4B	B200	202.6	196.2
Gemma 4 E4B	H100 SXM	174.4	153.0
Gemma 4 26B-A4B	GB10	26.9	23.2
Gemma 4 26B-A4B	B200	159.6	220.2
Gemma 4 26B-A4B	H100 SXM	138.7	148.0

Why mistral.rs?

Any Hugging Face model, zero config: Just mistralrs run -m user/model. Architecture, quantization format, and chat template are auto-detected.
True multimodality: Text, vision, video, and audio, speech generation, image generation, and embeddings in one engine.
Smart quantization: --quant automatically selects the best quantization format at that level: using a prebuilt UQFF if one is published, otherwise applying ISQ. Docs
OpenAI + Anthropic compatible serving: The same mistralrs serve process exposes OpenAI-compatible /v1 endpoints and Anthropic-compatible Messages endpoints.
Prometheus metrics: mistralrs serve exposes a /metrics endpoint in Prometheus format, recording per-request counts and latency labeled by method, route, and status. Docs
Built-in web UI: Served at /ui by default. Shows reasoning, code execution, plots, and files inline. Edit any message and the new branch runs with its own Python state. Pass --no-ui to disable.
Hardware-aware: mistralrs tune recommends quantization and device mapping from the model config and your detected hardware.
Flexible SDKs: Python package and Rust crate to build your projects.
Native agentic support: built-in agentic loop with web search, local Python code execution with model feedback, session management, and custom tool hooks.

Quick Start

Install

Linux/macOS:

curl --proto '=https' --tlsv1.2 -sSf https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.sh | sh

Windows (PowerShell):

irm https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.ps1 | iex

Downloads a self-contained prebuilt binary for your platform (Metal on Apple Silicon; per-GPU CUDA or CPU on Linux; CPU on Windows), falling back to a source build if none matches. No Rust or CUDA toolkit needed for the prebuilt path.

Manual installation, accelerator details & other platforms

Run Your First Model

# Interactive chat
mistralrs run -m Qwen/Qwen3-4B

# One-shot prompt (no interactive session)
mistralrs run -m Qwen/Qwen3-4B -i "What is the capital of France?"

# One-shot with an image
mistralrs run -m google/gemma-4-E4B-it --image photo.jpg -i "Describe this image"

# Agentic REPL: search + code execution from the terminal
mistralrs run --agent -m Qwen/Qwen3-4B

# Start an API server with the built-in web UI
mistralrs serve -m google/gemma-4-E4B-it

For the server command, visit http://localhost:1234/ui for the web chat interface. OpenAI-compatible clients use http://localhost:1234/v1; Anthropic-compatible clients use http://localhost:1234.

The `mistralrs` CLI

The CLI is designed to be zero-config: just point it at a model and go.

Auto-detection: Automatically detects model architecture, quantization format, and chat template
All-in-one: Single binary for chat, server, benchmarks, and web UI (run, serve, bench)
Hardware-aware tuning: mistralrs tune recommends quantization and device mapping for your model and hardware
Format-agnostic: Works with Hugging Face models, GGUF files, and UQFF quantizations seamlessly

# Recommend settings for your hardware and emit a config file
mistralrs tune -m Qwen/Qwen3-4B --emit-config config.toml

# Run using the generated config
mistralrs from-config -f config.toml

# Diagnose system issues (CUDA, Metal, HuggingFace connectivity)
mistralrs doctor

Full CLI documentation

UI Demo

What Makes It Fast

Performance

Continuous batching support by default on all devices.
CUDA with FlashAttention V2/V3, Metal, and multi-GPU/distributed inference
PagedAttention for high throughput continuous batching on CUDA or Apple Silicon, prefix caching (including multimodal)

Quantization (full docs)

In-situ quantization (ISQ) of any Hugging Face model
GGUF (2-8 bit), GPTQ, AWQ, HQQ, FP8, BNB support
⭐ Per-layer topology: Fine-tune quantization per layer for optimal quality/speed
⭐ Auto-select fastest quant method for your hardware

Flexibility

LoRA & X-LoRA with weight merging
AnyMoE: Create mixture-of-experts on any base model
Multiple models: Load/unload at runtime

Agentic Features

Integrated tool calling with grammar enforcement and strict schema mode
⭐ Server-side agentic loop: auto-execute tools and feed results back
⭐ Python code execution: persistent Jupyter-like sessions with matplotlib capture and multimodal feedback
⭐ Web search integration with embedding-based ranking
⭐ Tool dispatch URL: POST tool calls to your own endpoint
⭐ MCP client: Connect to external tools via Process, HTTP, or WebSocket
Python/Rust tool callbacks for custom execution

Full feature documentation

Supported Models

40+ model families: text (Llama, Qwen 3, GLM, DeepSeek, GPT-OSS, Granite, and more), multimodal (Gemma 4, Qwen 3-VL, Llama 4, Phi 4 multimodal, and more), speech (Voxtral ASR, Dia), image generation (FLUX), and embeddings (Embedding Gemma, Qwen 3 Embedding).

Full compatibility tables | Request a new model

Python SDK

pip install mistralrs

In-process inference from Python: load a model with Runner and send OpenAI-shaped requests, no server required. Accelerator-specific wheels (CUDA, Metal, MKL, Accelerate) are listed in the getting-started guide.

Get started | API reference | Examples

Rust SDK

cargo add mistralrs

Embed the engine in a Rust application with the high-level mistralrs crate.

Get started | docs.rs | Crate | Examples

Docker

Prebuilt CPU and CUDA images are published to GHCR. Pull commands, tags, and Kubernetes notes are in the Docker guide.

Documentation

For complete documentation, see the Documentation.

Quick Links:

Quickstart - Install, first run, first serve
CLI Reference - All commands and options
Anthropic Messages API - Anthropic-compatible Messages, streaming, tool use, and token counting
HTTP API - OpenAI-compatible and Anthropic-compatible endpoints
Quantization - ISQ, GGUF, GPTQ, and more
Multi-GPU and Distributed - NCCL TP, P2P layer mapping, multi-node, and ring
MCP Integration - MCP integration documentation
Troubleshooting - Common issues and solutions
Environment variables - Environment variables for configuration

Contributing

Contributions welcome! Please open an issue to discuss new features or report bugs. If you want to add a new model, please contact us via an issue and we can coordinate.

Credits

This project would not be possible without the excellent work at Candle. Thank you to all contributors!

mistral.rs is not affiliated with Mistral AI.

Back to Top

Name		Name	Last commit message	Last commit date
Latest commit History 3,297 Commits
.cargo		.cargo
.github		.github
calibration_data		calibration_data
chat_templates		chat_templates
docs		docs
examples		examples
game_of_life_plots		game_of_life_plots
matformer_configs		matformer_configs
mistralrs-audio		mistralrs-audio
mistralrs-bench		mistralrs-bench
mistralrs-cli		mistralrs-cli
mistralrs-code-exec		mistralrs-code-exec
mistralrs-core		mistralrs-core
mistralrs-flash-attn		mistralrs-flash-attn
mistralrs-macros		mistralrs-macros
mistralrs-mcp		mistralrs-mcp
mistralrs-paged-attn		mistralrs-paged-attn
mistralrs-pyo3		mistralrs-pyo3
mistralrs-quant		mistralrs-quant
mistralrs-sandbox		mistralrs-sandbox
mistralrs-server-core		mistralrs-server-core
mistralrs-server		mistralrs-server
mistralrs-vision		mistralrs-vision
mistralrs		mistralrs
orderings		orderings
releases/v0.8.2		releases/v0.8.2
res		res
ring_configs		ring_configs
scripts		scripts
toml-selectors		toml-selectors
topologies		topologies
.dockerignore		.dockerignore
.gitattributes		.gitattributes
.gitignore		.gitignore
.typos.toml		.typos.toml
AGENTS.md		AGENTS.md
CLAUDE.md		CLAUDE.md
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
Dockerfile		Dockerfile
Dockerfile.cuda-13.0-ubi9		Dockerfile.cuda-13.0-ubi9
Dockerfile.cuda-all		Dockerfile.cuda-all
Dockerfile.manylinux		Dockerfile.manylinux
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
install.ps1		install.ps1
install.sh		install.sh
sample_speech.wav		sample_speech.wav
speculative.toml		speculative.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Fast, flexible LLM inference.

Latest

Benchmarks

Why mistral.rs?

Quick Start

Install

Run Your First Model

The `mistralrs` CLI

What Makes It Fast

Supported Models

Python SDK

Rust SDK

Docker

Documentation

Contributing

Credits

About

Uh oh!

Releases 49

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Fast, flexible LLM inference.

Latest

Benchmarks

Why mistral.rs?

Quick Start

Install

Run Your First Model

The mistralrs CLI

What Makes It Fast

Supported Models

Python SDK

Rust SDK

Docker

Documentation

Contributing

Credits

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 49

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

The `mistralrs` CLI

Packages