Community-sourced knowledge base for running large language models (Qwen3.5-397B, MiniMax M2.5, Kimi-K2.5, GLM-5) on NVIDIA RTX PRO 6000 Blackwell (SM120) GPUs in 2×, 4×, and 8× PCIe configurations without NVLink.
Synthesized from ~5,000 Discord messages, 300+ screenshots, and months of community experimentation.
| Model | Params | Active | Min GPUs | Best Decode |
|---|---|---|---|---|
| Qwen3.5-397B | 397B MoE | 17B | 4× | 350 tok/s (8×, SGLang) |
| Qwen3.5-27B/122B | 27B–122B | — | 1× | — |
| MiniMax M2.5 | 456B MoE | — | 2× | 85–89 tok/s (NVFP4) |
| Kimi-K2.5 | 530B MoE | — | 8× | 101 tok/s (PCIe switch) |
| GLM-5 | 744B MoE | 40B | 8× | 105 tok/s (MTP) |
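As a rough feasibility check against the 96 GB-per-GPU budget, weight footprint can be estimated from parameter count and quantization width. The sketch below is illustrative only: the bytes-per-parameter values and the 80% usable-VRAM headroom are assumptions, not measured numbers, and KV cache and activations are excluded.

```python
import math

# Approximate bytes per parameter (assumed, not measured):
# NVFP4 ~0.56 (4-bit weights + scale factors), FP8 ~1.0, BF16 ~2.0.
BYTES_PER_PARAM = {"nvfp4": 0.56, "fp8": 1.0, "bf16": 2.0}
GPU_VRAM_GB = 96  # RTX PRO 6000 Blackwell

def weights_gb(total_params_b: float, fmt: str) -> float:
    """Approximate weight memory (GB) for a model of `total_params_b` billion params."""
    return total_params_b * BYTES_PER_PARAM[fmt]

def min_gpus_for_weights(total_params_b: float, fmt: str, headroom: float = 0.80) -> int:
    """Smallest GPU count whose usable VRAM holds the weights alone
    (assuming `headroom` of each card is usable for weights)."""
    return math.ceil(weights_gb(total_params_b, fmt) / (GPU_VRAM_GB * headroom))

# Example: Kimi-K2.5 (530B total) in NVFP4
print(round(weights_gb(530, "nvfp4")))       # ~297 GB of weights
print(min_gpus_for_weights(530, "nvfp4"))    # 4 GPUs for weights alone
```

Note the gap between this lower bound and the table's "Min GPUs" column: the 8× requirement for Kimi-K2.5 leaves room for KV cache at long context, expert routing buffers, and CUDA graphs, which this sketch ignores.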
- PCIe Topology — Switches, Turin vs Genoa, NUMA
- PCIe Bandwidth — P2P measurements, BAR1, latency
- GPU Configurations — 4×/8× builds, VRAM, power, rigs
- vLLM — Config, MTP, model-specific commands
- SGLang — Config, DCP, MOE backends
- FlashInfer — CUTLASS, SM120, bug fixes
- NCCL Tuning — Env vars, P2P levels, graph XML fix
- NVFP4 Quantization — Setup, calibration, models
- Speculative Decoding — MTP configs, EAGLE
- Docker Images — Images, compose, custom builds
- Benchmark Results — Consolidated tables across all models
- KLD Evaluation — Quantization quality (KL divergence vs FP8 reference)
- Common Issues — Errors + fixes
- MTP=2 is the sweet spot — +51-72% throughput across all models, MTP>3 unstable
- NCCL graph XML fix is critical on AMD Turin — 1.5-1.9× speedup by correcting hardcoded 16 GB/s bandwidth
- PCIe switches dramatically help single-batch latency — 101 vs 60 tok/s for Kimi K2.5
- BF16 KV cache mandatory on SM120 for GLM-5 — FP8 produces garbled output
- SGLang is the only option for GLM-5 — vLLM lacks SM120-compatible MLA+sparse attention backend
- NVFP4 is native to SM120 — 2× decode speedup over FP8 for supported models
- DCP is essential for Kimi K2.5 long context — Without it, 200K context drops to <10 tok/s
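The NCCL-related findings above boil down to environment tuning at launch time. The snippet below is a hedged sketch of the kind of environment a PCIe-only (no NVLink) box typically gets; the variable values and the corrected-topology XML path are placeholders, and the NCCL Tuning page has the community-validated settings for Turin/Genoa.

```python
import os

# Sketch of an NCCL environment for PCIe-only multi-GPU inference.
# Values and the topology-file path are placeholders, not validated settings.
nccl_env = {
    "NCCL_P2P_LEVEL": "SYS",   # permit P2P across the root complex (no NVLink present)
    "NCCL_DEBUG": "INFO",      # log topology detection to verify the graph XML fix took effect
    "NCCL_TOPO_FILE": "/etc/nccl-topo-fixed.xml",  # corrected graph XML (placeholder path)
}
env = {**os.environ, **nccl_env}
# subprocess.run(["vllm", "serve", ...], env=env)  # launch the server with the tuned env
print(sorted(nccl_env))
```

Setting `NCCL_DEBUG=INFO` first is the cheap sanity check: the startup log shows which topology and bandwidth NCCL actually detected, so you can confirm the hardcoded 16 GB/s figure is gone before re-running benchmarks.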
All results are on NVIDIA RTX PRO 6000 (Blackwell GB202, SM120):
- 96 GB GDDR7 per GPU (768 GB total for 8×)
- PCIe 5.0 x16 (~64 GB/s per direction)
- No NVLink — all inter-GPU communication via PCIe
- Typical configs: AMD EPYC Turin/Genoa, 4× or 8× GPUs
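The ~64 GB/s figure follows directly from PCIe 5.0 link arithmetic: 32 GT/s per lane, 16 lanes, 128b/130b line coding. A quick check:

```python
# PCIe 5.0 x16 per-direction bandwidth from first principles.
gt_per_s = 32           # PCIe 5.0 raw signaling rate per lane (GT/s)
lanes = 16
encoding = 128 / 130    # 128b/130b line-coding efficiency
bytes_per_gt = 1 / 8    # one transfer moves one bit per lane

gb_per_s = gt_per_s * lanes * encoding * bytes_per_gt
print(f"{gb_per_s:.1f} GB/s per direction")  # 63.0 GB/s per direction
```

Achievable P2P throughput is lower still once TLP header and flow-control overhead are paid, which is why the measured numbers on the PCIe Bandwidth page sit below this theoretical ceiling.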
This wiki is synthesized from Discord discussions. If you have corrections, additional benchmarks, or new configurations, please open an issue or PR.
Generated March 2026. Data sourced from a community Discord server.