Goal
Close the MedPsy CUDA throughput gap by replacing the current Candle quantized_qwen3 execution path with a native Psionic CUDA Qwen3 decode path for the MedPsy 1.7B Q4_K_M GGUF row.
Current Evidence
The current retained matrix is:
fixtures/medpsy/benchmarks/medpsy_comparator_matrix_20260511_local.json
The artifact is:
repo: qvac/MedPsy-1.7B-GGUF
file: medpsy-1.7b-q4_k_m-imat.gguf
sha256: 41ee947d9cce72ec657577219fd1798fabeabf0d832217fe23c9d6d3d18d5880
Tailnet hardware:
host: archlinux
gpu: NVIDIA GeForce RTX 4080 16GB
driver: 595.58.03
Rows measured on the same artifact and host:
Psionic Rust Candle Qwen3 CUDA:
execution_engine: rust_candle_qwen3_cuda
prompt: token [151644]
max_new_tokens: 512
repeats: 3
mean decode throughput: ~239 tok/s
Ollama llama.cpp runner:
model: medpsy-17b-q4-local
same prompt and 512-token cap
repeats: 3
mean decode throughput: ~369 tok/s
Negative trials already tested:
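- CANDLE_DEQUANTIZE_ALL_F16=1: ~127 tok/s, worse than the ~239 tok/s baseline.
- CANDLE_DEQUANTIZE_ALL=1: ~78 tok/s, worse than the baseline.
- PSIONIC_MEDPSY_FORCE_DMMV=1: ~250 tok/s, a small gain that still leaves the row well short of the ~369 tok/s Ollama result.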
- Device-side argmax removed full-logit host readback but did not close the gap.
- Reusing the loaded model across repeats removed reload overhead but did not close the warm decode gap.
Required Implementation Direction
Do not switch to QVAC SDK, llama.cpp, Ollama, vLLM, Python, or a sidecar runtime.
Implement a native Psionic CUDA Qwen3 path for the dense MedPsy architecture.
Likely code areas:
crates/psionic-models/src/lib.rs
- Qwen3 tensor layout should be explicit and not flattened into Qwen2/Qwen35.
- Current Qwen3 admission exists, but tensor layout should support native CUDA decode needs.
crates/psionic-serve/src/qwen35.rs
- Reuse carefully where possible, but do not pretend Qwen3 is Qwen35.
- Qwen35 full-attention assumes query/gate projection rows equal 2 * query_width. Qwen3 has query rows equal to query_width and no attention gate. That mismatch is one reason the existing native path cannot be reused unchanged.
crates/psionic-backend-cuda/src/lib.rs and CUDA kernels
- Reuse existing quantized matvec, Q8_1 input quantization, top-k/argmax, RMSNorm, RoPE, and output-head paths where applicable.
- Add Qwen3-specific fused or grouped kernels only where the generic pieces do not reach parity.
crates/psionic-models/examples/medpsy_bench.rs
- Keep this benchmark as the direct runtime gate.
scripts/release/check-psionic-medpsy-pilot.sh
- Extend after the native CUDA path is green.
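The row-count mismatch noted under crates/psionic-serve/src/qwen35.rs can be turned into an explicit admission check. The following is a minimal sketch, not existing Psionic code: the `QueryLayout` enum and `classify_query_rows` function are hypothetical names introduced here for illustration, and the only facts taken from this issue are that Qwen35 full-attention expects 2 * query_width query/gate rows while Qwen3 expects query_width rows with no gate.

```rust
/// How a full-attention block lays out its query projection rows.
/// Hypothetical type for illustration; not an existing Psionic API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum QueryLayout {
    /// Qwen3: plain query projection, rows == query_width, no attention gate.
    PlainQuery,
    /// Qwen35 full-attention: query and gate halves, rows == 2 * query_width.
    QueryWithGate,
}

/// Classify a query projection tensor by its row count.
///
/// Returns `None` when the row count matches neither layout, so the caller
/// can refuse the tensor instead of silently treating Qwen3 as Qwen35.
pub fn classify_query_rows(q_rows: usize, query_width: usize) -> Option<QueryLayout> {
    if query_width == 0 {
        return None;
    }
    if q_rows == query_width {
        Some(QueryLayout::PlainQuery)
    } else if Some(q_rows) == query_width.checked_mul(2) {
        Some(QueryLayout::QueryWithGate)
    } else {
        None
    }
}
```

For the MedPsy 1.7B row, query_width is 16 heads times head_dim 128, so a 2048-row query projection would classify as `PlainQuery` and a 4096-row one as `QueryWithGate`. Returning `None` for anything else keeps the tensor layout explicit rather than flattened into another family.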
Native Qwen3 Runtime Requirements
The MedPsy 1.7B Qwen3 row requires:
- Qwen3 dense full-attention block, not Qwen35 hybrid SSM.
- Q projection without Qwen35 gate half.
- K/V projection with GQA: 16 attention heads, 8 KV heads, head_dim 128.
- Per-head q/k RMSNorm.
- RoPE with theta 1_000_000 for 1.7B.
- SwiGLU MLP: gate/up/down.
- Tied embeddings by default.
- GGUF Q4_K_M quantized projection support.
- CUDA KV cache and decode loop.
- Device-side output selection without full-logit host readback.
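The attention geometry in the list above implies concrete projection widths and a query-to-KV head mapping. The sketch below derives them under stated assumptions: `AttnGeometry` and its methods are hypothetical names, the GQA mapping assumes consecutive query heads share one KV head (the common convention, not confirmed here for Psionic), and the RoPE frequencies use the standard form theta^(-2i/head_dim) without committing to a particular pairing of rotated dimensions.

```rust
/// Attention geometry for one dense Qwen3 block.
/// Hypothetical struct for illustration; not an existing Psionic type.
#[derive(Debug, Clone, Copy)]
pub struct AttnGeometry {
    pub n_heads: usize,
    pub n_kv_heads: usize,
    pub head_dim: usize,
    pub rope_theta: f64,
}

impl AttnGeometry {
    /// Values listed in this issue for the MedPsy 1.7B row.
    pub fn medpsy_1_7b() -> Self {
        Self { n_heads: 16, n_kv_heads: 8, head_dim: 128, rope_theta: 1_000_000.0 }
    }

    /// Rows of the Q projection (no Qwen35 gate half).
    pub fn query_width(&self) -> usize {
        self.n_heads * self.head_dim
    }

    /// Rows of each of the K and V projections.
    pub fn kv_width(&self) -> usize {
        self.n_kv_heads * self.head_dim
    }

    /// KV head read by a given query head.
    /// Assumes n_heads is a multiple of n_kv_heads and that consecutive
    /// query heads share a KV head.
    pub fn kv_head_for(&self, q_head: usize) -> usize {
        q_head / (self.n_heads / self.n_kv_heads)
    }

    /// Standard RoPE inverse frequencies: theta^(-2i / head_dim)
    /// for i in 0..head_dim / 2.
    pub fn rope_inv_freq(&self) -> Vec<f64> {
        (0..self.head_dim / 2)
            .map(|i| self.rope_theta.powf(-((2 * i) as f64) / self.head_dim as f64))
            .collect()
    }
}
```

With these values the Q projection has 2048 rows, K and V have 1024 rows each, and every KV head serves two query heads, which is the grouping a fused or grouped decode kernel would need to respect.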
Acceptance Criteria
- A native Psionic CUDA Qwen3 path runs medpsy-1.7b-q4_k_m-imat.gguf without Candle quantized_qwen3 in the hot decode path.
- The MedPsy benchmark can select the native path explicitly and records execution_engine = psionic_qwen3_cuda or another non-Candle native label.
- Retained benchmark matrix shows Psionic at parity or better against the Ollama llama.cpp runner on the same archlinux RTX 4080 host.
- Parity threshold: Psionic mean warm decode tok/s is at least the Ollama mean warm decode tok/s for the 512-token row, or the issue records exactly why parity cannot be claimed.
- The implementation preserves MedPsy medical policy and quantization refusal posture.
- No sidecar/runtime dependency is introduced.
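The parity threshold in these criteria reduces to a comparison of two means. The following is a minimal sketch of that gate, assuming per-repeat warm decode throughputs are available as `f64` samples; the function names `mean`, `parity_ratio`, and `meets_parity` are hypothetical and do not describe the benchmark's actual output schema.

```rust
/// Arithmetic mean of tok/s samples; `None` for an empty slice.
pub fn mean(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    Some(samples.iter().sum::<f64>() / samples.len() as f64)
}

/// Ratio of Psionic mean warm decode tok/s to the comparator mean.
/// `None` when either side has no samples or the comparator mean is not positive.
pub fn parity_ratio(psionic: &[f64], comparator: &[f64]) -> Option<f64> {
    let p = mean(psionic)?;
    let c = mean(comparator)?;
    if c > 0.0 {
        Some(p / c)
    } else {
        None
    }
}

/// Parity holds only when the Psionic mean is at least the comparator mean.
/// Missing data counts as "not at parity" rather than as a pass.
pub fn meets_parity(psionic: &[f64], comparator: &[f64]) -> bool {
    matches!(parity_ratio(psionic, comparator), Some(r) if r >= 1.0)
}
```

Feeding in the approximate means recorded in this issue (~239 tok/s for the Candle path, ~369 tok/s for Ollama) gives a ratio of roughly 0.65, so the current row fails the gate. Treating missing samples as a failure keeps an incomplete run from being read as parity.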
Validation
Run on the Tailnet archlinux RTX 4080 host from a clean worktree:
nvidia-smi --query-compute-apps=pid,process_name,used_gpu_memory --format=csv,noheader,nounits
cargo run --release -p psionic-models --features medpsy-cuda --example medpsy_bench -- \
--model-path /tmp/medpsy-1.7b-q4_k_m-imat.gguf \
--artifact-kind gguf \
--model-size 1.7b \
--backend cuda \
--prompt-token-ids 151644 \
--max-new-tokens 512 \
--repeats 3 \
--json-out fixtures/medpsy/benchmarks/manual/medpsy_17b_q4_cuda_native.json
Then rerun the Ollama comparator with the same artifact and prompt/token cap, and update:
fixtures/medpsy/benchmarks/medpsy_comparator_matrix_20260511_local.json
docs/MEDPSY_BENCHMARK.md
docs/NON_GPT_OSS_MEDPSY_QWEN_PILOT.md
Claim Boundary
Until this issue closes, Psionic has a Rust CUDA MedPsy path but cannot claim GPU parity with the Tether-recommended llama.cpp-class runtime.