[MedPsy CUDA] Close native Qwen3 GPU parity gap against llama.cpp runner #985

@AtlantisPleb

Goal

Close the MedPsy CUDA throughput gap by replacing the current Candle quantized_qwen3 execution path with a native Psionic CUDA Qwen3 decode path for the MedPsy 1.7B Q4_K_M GGUF row.

Current Evidence

The current retained matrix is:

fixtures/medpsy/benchmarks/medpsy_comparator_matrix_20260511_local.json

The artifact is:

repo: qvac/MedPsy-1.7B-GGUF
file: medpsy-1.7b-q4_k_m-imat.gguf
sha256: 41ee947d9cce72ec657577219fd1798fabeabf0d832217fe23c9d6d3d18d5880

Tailnet hardware:

host: archlinux
gpu: NVIDIA GeForce RTX 4080 16GB
driver: 595.58.03

Rows measured on the same artifact and host:

Psionic Rust Candle Qwen3 CUDA:
  execution_engine: rust_candle_qwen3_cuda
  prompt: token [151644]
  max_new_tokens: 512
  repeats: 3
  mean decode throughput: ~239 tok/s

Ollama llama.cpp runner:
  model: medpsy-17b-q4-local
  same prompt and 512-token cap
  repeats: 3
  mean decode throughput: ~369 tok/s
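
To make the gap concrete, the two approximate means above can be converted into per-token latency and a required speedup. This is only arithmetic on the rounded figures reported in this issue; the function names are illustrative and not part of any Psionic crate.

```rust
/// Convert decode throughput (tokens per second) into per-token latency in milliseconds.
fn ms_per_token(tok_per_s: f64) -> f64 {
    1000.0 / tok_per_s
}

/// Speedup factor the current engine needs in order to match the target engine.
fn required_speedup(current_tok_s: f64, target_tok_s: f64) -> f64 {
    target_tok_s / current_tok_s
}

fn main() {
    // Approximate means taken from the rows above (rounded in the issue).
    let candle_tok_s = 239.0;
    let ollama_tok_s = 369.0;

    println!("candle: {:.2} ms/token", ms_per_token(candle_tok_s));
    println!("ollama: {:.2} ms/token", ms_per_token(ollama_tok_s));
    println!(
        "required speedup: {:.2}x",
        required_speedup(candle_tok_s, ollama_tok_s)
    );
}
```

On these figures the Candle path spends roughly 4.18 ms per token against roughly 2.71 ms for the llama.cpp runner, so the native path has to remove about 1.47 ms from each decode step, a speedup of about 1.54x.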

Negative trials already tested:

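  • CANDLE_DEQUANTIZE_ALL_F16=1: ~127 tok/s, worse than the ~239 tok/s baseline.
  • CANDLE_DEQUANTIZE_ALL=1: ~78 tok/s, worse than the ~239 tok/s baseline.
  • PSIONIC_MEDPSY_FORCE_DMMV=1: ~250 tok/s, a small gain that is still well short of the ~369 tok/s comparator.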
  • Device-side argmax removed full-logit host readback but did not close the gap.
  • Reusing the loaded model across repeats removed reload overhead but did not close the warm decode gap.

Required Implementation Direction

Do not switch to QVAC SDK, llama.cpp, Ollama, vLLM, Python, or a sidecar runtime.

Implement a native Psionic CUDA Qwen3 path for the dense MedPsy architecture.

Likely code areas:

  • crates/psionic-models/src/lib.rs
    • Qwen3 tensor layout should be explicit and not flattened into Qwen2/Qwen35.
    • Qwen3 admission already exists, but the tensor layout must also expose what native CUDA decode needs.
  • crates/psionic-serve/src/qwen35.rs
    • Reuse carefully where possible, but do not pretend Qwen3 is Qwen35.
    • Qwen35 full-attention assumes the query/gate projection has 2 * query_width rows. Qwen3 has exactly query_width query rows and no attention gate. That mismatch is one reason the existing native path cannot be reused unchanged.
  • crates/psionic-backend-cuda/src/lib.rs and CUDA kernels
    • Reuse existing quantized matvec, Q8_1 input quantization, top-k/argmax, RMSNorm, RoPE, and output-head paths where applicable.
    • Add Qwen3-specific fused or grouped kernels only where the generic pieces do not reach parity.
  • crates/psionic-models/examples/medpsy_bench.rs
    • Keep this benchmark as the direct runtime gate.
  • scripts/release/check-psionic-medpsy-pilot.sh
    • Extend after the native CUDA path is green.
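
The query-projection mismatch described under qwen35.rs can be checked at model admission time. The following is a minimal sketch, assuming a hypothetical `classify_q_proj` helper and `QProjLayout` enum that do not exist in the Psionic crates; it only encodes the two row-count conventions stated above.

```rust
/// Which attention-projection convention a checkpoint follows (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum QProjLayout {
    /// Qwen3 dense: projection rows == query_width, no attention gate.
    Qwen3Dense,
    /// Qwen35 full-attention: projection rows == 2 * query_width (query half + gate half).
    Qwen35Gated,
}

/// Classify a query projection by its row count, rejecting anything else
/// instead of silently treating Qwen3 as Qwen35 (hypothetical helper).
fn classify_q_proj(rows: usize, num_heads: usize, head_dim: usize) -> Result<QProjLayout, String> {
    let query_width = num_heads * head_dim;
    if rows == query_width {
        Ok(QProjLayout::Qwen3Dense)
    } else if rows == 2 * query_width {
        Ok(QProjLayout::Qwen35Gated)
    } else {
        Err(format!(
            "unexpected q projection rows {rows}: expected {query_width} or {}",
            2 * query_width
        ))
    }
}

fn main() {
    // 16 attention heads with head_dim 128 give query_width = 2048.
    println!("{:?}", classify_q_proj(2048, 16, 128));
    println!("{:?}", classify_q_proj(4096, 16, 128));
    println!("{:?}", classify_q_proj(3000, 16, 128));
}
```

Keeping the two layouts as distinct variants, rather than a boolean flag on a shared struct, is one way to stop Qwen3 from being flattened into the Qwen35 path by accident.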

Native Qwen3 Runtime Requirements

The MedPsy 1.7B Qwen3 row requires:

  • Qwen3 dense full-attention block, not Qwen35 hybrid SSM.
  • Q projection without Qwen35 gate half.
  • K/V projection with GQA: 16 attention heads, 8 KV heads, head_dim 128.
  • Per-head q/k RMSNorm.
  • RoPE with theta 1_000_000 for 1.7B.
  • SwiGLU MLP: gate/up/down.
  • Tied embeddings by default.
  • GGUF Q4_K_M quantized projection support.
  • CUDA KV cache and decode loop.
  • Device-side output selection without full-logit host readback.
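
A small CPU reference for the per-head pieces in this list can serve as a numerical oracle while the CUDA kernels are brought up. The sketch below is not the Psionic implementation. It assumes the conventional GQA mapping in which consecutive query heads share one KV head, a half-split RoPE pairing of element i with element i + head_dim / 2, and a caller-supplied RMSNorm epsilon; all three should be confirmed against the GGUF metadata and the reference runtime before being used as a parity check.

```rust
// Dimensions stated in the requirements list for the MedPsy 1.7B row.
const NUM_HEADS: usize = 16;
const NUM_KV_HEADS: usize = 8;
const ROPE_THETA: f32 = 1_000_000.0;

/// GQA mapping (assumed convention): consecutive query heads share one KV head.
fn kv_head_for(q_head: usize) -> usize {
    q_head / (NUM_HEADS / NUM_KV_HEADS)
}

/// RMSNorm over one head vector, applied to q and k before RoPE.
/// `eps` is supplied by the caller; its real value comes from model metadata.
fn rms_norm_head(x: &mut [f32], weight: &[f32], eps: f32) {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let scale = 1.0 / (mean_sq + eps).sqrt();
    for (v, w) in x.iter_mut().zip(weight) {
        *v = *v * scale * *w;
    }
}

/// RoPE on one head vector at position `pos`, using half-split pairing
/// (element i rotates with element i + len/2). The pairing is an assumption.
fn rope_head(x: &mut [f32], pos: usize) {
    let half = x.len() / 2;
    for i in 0..half {
        let freq = ROPE_THETA.powf(-(2.0 * i as f32) / x.len() as f32);
        let (sin, cos) = (pos as f32 * freq).sin_cos();
        let (a, b) = (x[i], x[i + half]);
        x[i] = a * cos - b * sin;
        x[i + half] = a * sin + b * cos;
    }
}

/// SwiGLU activation: silu(gate) * up, elementwise. The down projection follows separately.
fn swiglu(gate: &[f32], up: &[f32]) -> Vec<f32> {
    gate.iter()
        .zip(up)
        .map(|(g, u)| (g / (1.0 + (-g).exp())) * u)
        .collect()
}

fn main() {
    let mut q = vec![1.0_f32; 128];
    let weight = vec![1.0_f32; 128];
    rms_norm_head(&mut q, &weight, 1e-6);
    rope_head(&mut q, 3);
    println!("query head 5 reads KV head {}", kv_head_for(5));
    println!("q[0] after norm + rope = {}", q[0]);
    println!("swiglu = {:?}", swiglu(&[0.0, 1.0], &[3.0, 2.0]));
}
```

Because each rotation acts on a pair of elements, RoPE leaves the vector norm unchanged, which gives a cheap invariant to assert against device output in addition to elementwise comparison.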

Acceptance Criteria

  • A native Psionic CUDA Qwen3 path runs medpsy-1.7b-q4_k_m-imat.gguf without Candle quantized_qwen3 in the hot decode path.
  • The MedPsy benchmark can select the native path explicitly and records execution_engine = psionic_qwen3_cuda or another non-Candle native label.
  • Retained benchmark matrix shows Psionic at parity or better against the Ollama llama.cpp runner on the same archlinux RTX 4080 host.
  • Parity threshold: Psionic mean warm decode tok/s is at least the Ollama mean warm decode tok/s for the 512-token row, or the issue records exactly why parity cannot be claimed.
  • The implementation preserves MedPsy medical policy and quantization refusal posture.
  • No sidecar/runtime dependency is introduced.
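
The parity threshold above reduces to a comparison of two means. The sketch below shows one way to encode that gate; `ParityVerdict` and `parity_verdict` are hypothetical names, and the per-repeat samples are placeholders because this issue reports only approximate means.

```rust
/// Outcome of the warm-decode parity check (hypothetical type).
#[derive(Debug, PartialEq)]
enum ParityVerdict {
    /// Psionic mean is at least the comparator mean.
    Parity { margin_tok_s: f64 },
    /// Psionic mean is below the comparator mean; parity cannot be claimed.
    Gap { shortfall_tok_s: f64 },
}

/// Arithmetic mean of per-repeat throughput samples, or None if there are none.
fn mean(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        None
    } else {
        Some(samples.iter().sum::<f64>() / samples.len() as f64)
    }
}

/// Apply the threshold: Psionic mean warm decode tok/s >= comparator mean.
fn parity_verdict(psionic: &[f64], comparator: &[f64]) -> Option<ParityVerdict> {
    let p = mean(psionic)?;
    let c = mean(comparator)?;
    Some(if p >= c {
        ParityVerdict::Parity { margin_tok_s: p - c }
    } else {
        ParityVerdict::Gap { shortfall_tok_s: c - p }
    })
}

fn main() {
    // Placeholder samples: three repeats set to the approximate means in this issue.
    let psionic = [239.0, 239.0, 239.0];
    let ollama = [369.0, 369.0, 369.0];
    println!("{:?}", parity_verdict(&psionic, &ollama));
}
```

Returning the shortfall alongside the verdict keeps the evidence needed for the fallback clause, where the issue must record why parity cannot be claimed.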

Validation

Run on the Tailnet archlinux RTX 4080 host from a clean worktree:

nvidia-smi --query-compute-apps=pid,process_name,used_gpu_memory --format=csv,noheader,nounits

cargo run --release -p psionic-models --features medpsy-cuda --example medpsy_bench -- \
  --model-path /tmp/medpsy-1.7b-q4_k_m-imat.gguf \
  --artifact-kind gguf \
  --model-size 1.7b \
  --backend cuda \
  --prompt-token-ids 151644 \
  --max-new-tokens 512 \
  --repeats 3 \
  --json-out fixtures/medpsy/benchmarks/manual/medpsy_17b_q4_cuda_native.json

Then rerun the Ollama comparator with the same artifact and prompt/token cap, and update:

fixtures/medpsy/benchmarks/medpsy_comparator_matrix_20260511_local.json
docs/MEDPSY_BENCHMARK.md
docs/NON_GPT_OSS_MEDPSY_QWEN_PILOT.md

Claim Boundary

Until this issue closes, Psionic has a Rust CUDA MedPsy path but cannot claim GPU parity with the Tether-recommended llama.cpp-class runtime.

Labels: backend (Backend work), enhancement (New feature or request), eval (Evaluation work), qa (Quality and validation work)
