Community-sourced knowledge base for running large language models (Qwen3.5-397B, MiniMax M2.5, Kimi-K2.5, GLM-5) on NVIDIA RTX PRO 6000 Blackwell (SM120) GPUs in 2×, 4×, and 8× PCIe configurations without NVLink.
Synthesized from ~5,000 Discord messages, 300+ screenshots, and months of community experimentation.
| Model | Params | Active | Min GPUs | Best Decode |
|---|---|---|---|---|
| Qwen3.5-397B | 397B MoE | 17B | 4× | 350 tok/s (8×, SGLang) |
| Qwen3.5-27B/122B | 27B–122B | — | 1× | — |
| MiniMax M2.5 | 456B MoE | — | 2× | 85–89 tok/s (NVFP4) |
| Kimi-K2.5 | 530B MoE | — | 8× | 101 tok/s (PCIe switch) |
| GLM-5 | 744B MoE | 40B | 8× | 105 tok/s (MTP) |
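As a rough feasibility check against the 96 GB-per-GPU budget, weight footprint can be estimated from parameter count and quantization width. The sketch below is illustrative only: the bytes-per-parameter values and the 80% usable-VRAM headroom are assumptions, not measured numbers, and KV cache and activations are excluded.

```python
import math

# Approximate bytes per parameter (assumed, not measured):
# NVFP4 ~0.56 (4-bit weights + scale factors), FP8 ~1.0, BF16 ~2.0.
BYTES_PER_PARAM = {"nvfp4": 0.56, "fp8": 1.0, "bf16": 2.0}
GPU_VRAM_GB = 96  # RTX PRO 6000 Blackwell

def weights_gb(total_params_b: float, fmt: str) -> float:
    """Approximate weight memory (GB) for a model of `total_params_b` billion params."""
    return total_params_b * BYTES_PER_PARAM[fmt]

def min_gpus_for_weights(total_params_b: float, fmt: str, headroom: float = 0.80) -> int:
    """Smallest GPU count whose usable VRAM holds the weights alone
    (assuming `headroom` of each card is usable for weights)."""
    return math.ceil(weights_gb(total_params_b, fmt) / (GPU_VRAM_GB * headroom))

# Example: Kimi-K2.5 (530B total) in NVFP4
print(round(weights_gb(530, "nvfp4")))       # ~297 GB of weights
print(min_gpus_for_weights(530, "nvfp4"))    # 4 GPUs for weights alone
```

Note the gap between this lower bound and the table's "Min GPUs" column: the 8× requirement for Kimi-K2.5 leaves room for KV cache at long context, expert routing buffers, and CUDA graphs, which this sketch ignores.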
- PCIe Topology — Switches, Turin vs Genoa, NUMA
- PCIe Bandwidth — P2P measurements, BAR1, latency
- GPU Configurations — 4×/8× builds, VRAM, power, rigs
- vLLM — Config, MTP, model-specific commands
- SGLang — Config, DCP, MOE backends
- FlashInfer — CUTLASS, SM120, bug fixes
- NCCL Tuning — Env vars, P2P levels, graph XML fix
- NVFP4 Quantization — Setup, calibration, models
- Speculative Decoding — MTP configs, EAGLE
- Docker Images — Images, compose, custom builds
- Benchmark Results — Consolidated tables across all models
- KLD Evaluation — Quantization quality (KL divergence vs FP8 reference)
- Common Issues — Errors + fixes
- MTP=2 is the sweet spot — +51-72% throughput across all models, MTP>3 unstable
- NCCL graph XML fix is critical on AMD Turin — 1.5-1.9× speedup by correcting hardcoded 16 GB/s bandwidth
- PCIe switches dramatically help single-batch latency — 101 vs 60 tok/s for Kimi K2.5
- BF16 KV cache mandatory on SM120 for GLM-5 — FP8 produces garbled output
- SGLang is the only option for GLM-5 — vLLM lacks SM120-compatible MLA+sparse attention backend
- NVFP4 is native to SM120 — 2× decode speedup over FP8 for supported models
- DCP is essential for Kimi K2.5 long context — Without it, 200K context drops to <10 tok/s
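The NCCL-related findings above boil down to environment tuning at launch time. The snippet below is a hedged sketch of the kind of environment a PCIe-only (no NVLink) box typically gets; the variable values and the corrected-topology XML path are placeholders, and the NCCL Tuning page has the community-validated settings for Turin/Genoa.

```python
import os

# Sketch of an NCCL environment for PCIe-only multi-GPU inference.
# Values and the topology-file path are placeholders, not validated settings.
nccl_env = {
    "NCCL_P2P_LEVEL": "SYS",   # permit P2P across the root complex (no NVLink present)
    "NCCL_DEBUG": "INFO",      # log topology detection to verify the graph XML fix took effect
    "NCCL_TOPO_FILE": "/etc/nccl-topo-fixed.xml",  # corrected graph XML (placeholder path)
}
env = {**os.environ, **nccl_env}
# subprocess.run(["vllm", "serve", ...], env=env)  # launch the server with the tuned env
print(sorted(nccl_env))
```

Setting `NCCL_DEBUG=INFO` first is the cheap sanity check: the startup log shows which topology and bandwidth NCCL actually detected, so you can confirm the hardcoded 16 GB/s figure is gone before re-running benchmarks.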
All results are on NVIDIA RTX PRO 6000 (Blackwell GB202, SM120):
- 96 GB GDDR7 per GPU (768 GB total for 8×)
- PCIe 5.0 x16 (~64 GB/s per direction)
- No NVLink — all inter-GPU communication via PCIe
- Typical configs: AMD EPYC Turin/Genoa, 4× or 8× GPUs
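The ~64 GB/s figure follows directly from PCIe 5.0 link arithmetic: 32 GT/s per lane, 16 lanes, 128b/130b line coding. A quick check:

```python
# PCIe 5.0 x16 per-direction bandwidth from first principles.
gt_per_s = 32           # PCIe 5.0 raw signaling rate per lane (GT/s)
lanes = 16
encoding = 128 / 130    # 128b/130b line-coding efficiency
bytes_per_gt = 1 / 8    # one transfer moves one bit per lane

gb_per_s = gt_per_s * lanes * encoding * bytes_per_gt
print(f"{gb_per_s:.1f} GB/s per direction")  # 63.0 GB/s per direction
```

Achievable P2P throughput is lower still once TLP header and flow-control overhead are paid, which is why the measured numbers on the PCIe Bandwidth page sit below this theoretical ceiling.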
This wiki is synthesized from Discord discussions. If you have corrections, additional benchmarks, or new configurations, please open an issue or PR.
Generated March 2026. Data sourced from a community Discord server.