voipmonitor/rtx6kpro


RTX 6000 Pro Wiki — Running Large LLMs on PCIe GPUs

Community-sourced knowledge base for running large language models (Qwen3.5-397B, MiniMax M2.5, Kimi-K2.5, GLM-5) on NVIDIA RTX 6000 Pro (Blackwell, SM120) GPUs in 2×, 4×, and 8× PCIe configurations without NVLink.

Synthesized from ~5,000 Discord messages, 300+ screenshots, and months of community experimentation.

Quick Links

Models

| Model | Params | Active | Min GPUs | Best Decode | Page |
|---|---|---|---|---|---|
| Qwen3.5-397B | 397B MoE | 17B | | 350 tok/s (8×, SGLang) | |
| Qwen3.5-27B/122B | 27B–122B | | | | |
| MiniMax M2.5 | 456B MoE | | | 85–89 tok/s (NVFP4) | |
| Kimi-K2.5 | 530B MoE | | | 101 tok/s (PCIe switch) | |
| GLM-5 | 744B MoE | 40B | | 105 tok/s (MTP) | |
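For a rough sense of why models this size need multiple 96 GB GPUs, a back-of-envelope VRAM check can help. This is a sketch only: the bytes-per-parameter figures and the 15% headroom allowance are illustrative assumptions, not measured values.

```python
def weights_fit(params_b, bytes_per_param, n_gpus, gpu_gb=96, headroom=0.15):
    """Rough check: do the model weights fit in aggregate VRAM,
    reserving `headroom` for KV cache, activations, and buffers?"""
    weights_gb = params_b * bytes_per_param        # 1e9 params * B/param -> GB
    usable_gb = n_gpus * gpu_gb * (1 - headroom)
    return weights_gb <= usable_gb

# GLM-5 (744B) at NVFP4 (~0.5 B/param): ~372 GB vs ~653 GB usable on 8x 96 GB
print(weights_fit(744, 0.5, 8))   # -> True
# The same model at BF16 (2 B/param) needs ~1.5 TB of weights alone
print(weights_fit(744, 2.0, 8))   # -> False
```

This ignores KV cache growth with context length, which is why long-context runs can still run out of memory even when the weights fit.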

Hardware & Topology

Inference Engines

  • vLLM — Config, MTP, model-specific commands
  • SGLang — Config, DCP, MoE backends
  • FlashInfer — CUTLASS, SM120, bug fixes

Optimization

Results & Troubleshooting

Key Findings

  1. MTP=2 is the sweet spot — +51–72% throughput across all models; MTP>3 is unstable
  2. NCCL graph XML fix is critical on AMD Turin — 1.5-1.9× speedup by correcting hardcoded 16 GB/s bandwidth
  3. PCIe switches dramatically help single-batch latency — 101 vs 60 tok/s for Kimi K2.5
  4. BF16 KV cache mandatory on SM120 for GLM-5 — FP8 produces garbled output
  5. SGLang is the only option for GLM-5 — vLLM lacks SM120-compatible MLA+sparse attention backend
  6. NVFP4 is native to SM120 — 2× decode speedup over FP8 for supported models
  7. DCP is essential for Kimi K2.5 long context — Without it, 200K context drops to <10 tok/s
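Finding 1 follows the usual speculative-decoding arithmetic: each extra draft token gives geometrically diminishing returns while verification cost keeps growing. A minimal sketch of the expected-length model (the 0.7 acceptance rate and the independence assumption are illustrative, not measured; real gains such as the +51–72% above are lower than this raw ratio because drafting itself costs compute):

```python
def expected_tokens_per_step(k, accept_rate):
    """Expected tokens emitted per target-model forward pass with k
    speculative (MTP-style) draft tokens, assuming each draft token
    is accepted independently with probability `accept_rate`:
    1 + a + a^2 + ... + a^k (geometric series)."""
    return sum(accept_rate ** i for i in range(k + 1))

# Marginal gain shrinks fast, which is consistent with MTP=2 being the sweet spot
for k in (1, 2, 3, 4):
    print(k, round(expected_tokens_per_step(k, 0.7), 2))
```

Going from k=1 to k=2 adds roughly half a token per step; k=3 and beyond add progressively less while increasing the chance of instability noted above.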

Hardware Overview

All results are on NVIDIA RTX PRO 6000 (Blackwell GB202, SM120):

  • 96 GB GDDR7 per GPU (768 GB total for 8×)
  • PCIe 5.0 x16 (~64 GB/s per direction)
  • No NVLink — all inter-GPU communication via PCIe
  • Typical configs: AMD EPYC Turin/Genoa, 4× or 8× GPUs
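With no NVLink, every tensor-parallel all-reduce rides on those ~64 GB/s PCIe links, which is also why NCCL mis-modeling the link bandwidth (finding 2 above) is so costly. A bandwidth-only lower bound from the standard ring all-reduce cost model gives a feel for the per-token overhead; this sketch ignores latency terms, protocol overhead, and NCCL's actual algorithm choice:

```python
def ring_allreduce_ms(buffer_mb, n_gpus, link_gb_s=64.0):
    """Bandwidth-only lower bound for a ring all-reduce: each GPU
    sends (and receives) 2*(N-1)/N of the buffer over its link."""
    bytes_per_gpu = 2 * (n_gpus - 1) / n_gpus * buffer_mb * 1e6
    return bytes_per_gpu / (link_gb_s * 1e9) * 1e3

# A 64 MB reduction across 8 GPUs takes at least ~1.75 ms on PCIe 5.0 x16,
# so per-layer communication adds up quickly during decode.
print(round(ring_allreduce_ms(64, 8), 2))
```

The same reduction modeled at 16 GB/s takes 4× longer, which is one way to see why a topology file that understates link bandwidth can steer NCCL into much slower schedules.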

Contributing

This wiki is synthesized from Discord discussions. If you have corrections, additional benchmarks, or new configurations, please open an issue or PR.


Generated March 2026. Data sourced from community Discord server.
