[MedPsy CUDA] Close native Qwen3 GPU parity gap against llama.cpp runner #985

## Goal

Close the MedPsy CUDA throughput gap by replacing the current Candle `quantized_qwen3` execution path with a native Psionic CUDA Qwen3 decode path for the MedPsy 1.7B Q4_K_M GGUF row.

## Current Evidence

The current retained matrix is:

```text
fixtures/medpsy/benchmarks/medpsy_comparator_matrix_20260511_local.json
```

The artifact is:

```text
repo: qvac/MedPsy-1.7B-GGUF
file: medpsy-1.7b-q4_k_m-imat.gguf
sha256: 41ee947d9cce72ec657577219fd1798fabeabf0d832217fe23c9d6d3d18d5880
```

Tailnet hardware:

```text
host: archlinux
gpu: NVIDIA GeForce RTX 4080 16GB
driver: 595.58.03
```

Rows measured on the same artifact and host:

```text
Psionic Rust Candle Qwen3 CUDA:
  execution_engine: rust_candle_qwen3_cuda
  prompt: token [151644]
  max_new_tokens: 512
  repeats: 3
  mean decode throughput: ~239 tok/s

Ollama llama.cpp runner:
  model: medpsy-17b-q4-local
  same prompt and 512-token cap
  repeats: 3
  mean decode throughput: ~369 tok/s
```

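Negative trials already run against the ~239 tok/s Candle baseline:

- `CANDLE_DEQUANTIZE_ALL_F16=1`: ~127 tok/s, slower than baseline.
- `CANDLE_DEQUANTIZE_ALL=1`: ~78 tok/s, slower than baseline.
- `PSIONIC_MEDPSY_FORCE_DMMV=1`: ~250 tok/s, marginally faster but still well short of the ~369 tok/s comparator.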
- Device-side argmax removed full-logit host readback but did not close the gap.
- Reusing the loaded model across repeats removed reload overhead but did not close the warm decode gap.

## Required Implementation Direction

Do not switch to QVAC SDK, llama.cpp, Ollama, vLLM, Python, or a sidecar runtime.

Implement a native Psionic CUDA Qwen3 path for the dense MedPsy architecture.

Likely code areas:

- `crates/psionic-models/src/lib.rs`
  - Qwen3 tensor layout should be explicit and not flattened into Qwen2/Qwen35.
  - Qwen3 admission already exists, but the tensor layout must also expose what native CUDA decode needs.
- `crates/psionic-serve/src/qwen35.rs`
  - Reuse carefully where possible, but do not pretend Qwen3 is Qwen35.
  - Qwen35 full-attention assumes the query/gate projection has `2 * query_width` rows. The Qwen3 query projection has `query_width` rows and no attention gate. That mismatch is one reason the existing native path cannot be reused unchanged.
- `crates/psionic-backend-cuda/src/lib.rs` and CUDA kernels
  - Reuse existing quantized matvec, Q8_1 input quantization, top-k/argmax, RMSNorm, RoPE, and output-head paths where applicable.
  - Add Qwen3-specific fused or grouped kernels only where the generic pieces do not reach parity.
- `crates/psionic-models/examples/medpsy_bench.rs`
  - Keep this benchmark as the direct runtime gate.
- `scripts/release/check-psionic-medpsy-pilot.sh`
  - Extend after the native CUDA path is green.
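
The query-projection mismatch described for `qwen35.rs` can be made explicit as a load-time shape check. The following is a minimal sketch, not existing Psionic code: `AttnLayout`, `expected_q_rows`, and `check_q_proj_rows` are hypothetical names, and `query_width` is assumed to mean `num_heads * head_dim`.

```rust
/// Hypothetical attention layout tag; not an existing Psionic type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AttnLayout {
    /// Qwen3 dense attention: plain query projection, no attention gate.
    Qwen3Dense,
    /// Qwen35 full attention: one query/gate projection with twice the query rows.
    Qwen35Gated,
}

/// Expected row count of the query projection for a given layout.
/// Assumption: query_width = num_heads * head_dim.
pub fn expected_q_rows(layout: AttnLayout, num_heads: usize, head_dim: usize) -> usize {
    let query_width = num_heads * head_dim;
    match layout {
        AttnLayout::Qwen3Dense => query_width,
        AttnLayout::Qwen35Gated => 2 * query_width,
    }
}

/// Reject a query projection whose row count does not match the declared layout,
/// so a Qwen3 tensor is never silently fed through the Qwen35 path.
pub fn check_q_proj_rows(
    layout: AttnLayout,
    num_heads: usize,
    head_dim: usize,
    actual_rows: usize,
) -> Result<(), String> {
    let expected = expected_q_rows(layout, num_heads, head_dim);
    if actual_rows == expected {
        Ok(())
    } else {
        Err(format!(
            "{layout:?}: query projection has {actual_rows} rows, expected {expected}"
        ))
    }
}
```

With the MedPsy 1.7B shape (16 heads, head_dim 128), a 2048-row query projection passes as `Qwen3Dense` and fails as `Qwen35Gated`, which is the failure mode the note above warns about.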

## Native Qwen3 Runtime Requirements

The MedPsy 1.7B Qwen3 row requires:

- Qwen3 dense full-attention block, not Qwen35 hybrid SSM.
- Q projection without Qwen35 gate half.
- K/V projection with GQA: 16 attention heads, 8 KV heads, head_dim 128.
- Per-head q/k RMSNorm.
- RoPE with theta `1_000_000` for 1.7B.
- SwiGLU MLP: gate/up/down.
- Tied embeddings by default.
- GGUF Q4_K_M quantized projection support.
- CUDA KV cache and decode loop.
- Device-side output selection without full-logit host readback.
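
The attention numbers in this list imply a few derived sizes that kernels and the KV cache have to agree on. The sketch below is a CPU-side reference under stated assumptions: `Qwen3AttnShape` and its methods are hypothetical names, the GQA mapping assumes consecutive query heads share a KV head, the RoPE frequency uses the standard `theta^(-2i/head_dim)` form, and the RMSNorm epsilon is left as a parameter because this issue does not state it.

```rust
/// Hypothetical attention shape description; not an existing Psionic type.
#[derive(Debug, Clone, Copy)]
pub struct Qwen3AttnShape {
    pub num_heads: usize,
    pub num_kv_heads: usize,
    pub head_dim: usize,
    pub rope_theta: f64,
}

impl Qwen3AttnShape {
    /// Values taken from the requirements list for the MedPsy 1.7B row.
    pub const fn medpsy_1_7b() -> Self {
        Self { num_heads: 16, num_kv_heads: 8, head_dim: 128, rope_theta: 1_000_000.0 }
    }

    /// Output rows of the Q projection (no gate half).
    pub fn q_width(&self) -> usize {
        self.num_heads * self.head_dim
    }

    /// Output rows of each of the K and V projections.
    pub fn kv_width(&self) -> usize {
        self.num_kv_heads * self.head_dim
    }

    /// Number of query heads that share one KV head.
    pub fn group_size(&self) -> usize {
        self.num_heads / self.num_kv_heads
    }

    /// KV head used by a query head (assumes consecutive query heads are grouped).
    pub fn kv_head_for(&self, q_head: usize) -> usize {
        q_head / self.group_size()
    }

    /// KV cache bytes per token for one layer: one K row plus one V row.
    pub fn kv_bytes_per_token_per_layer(&self, elem_bytes: usize) -> usize {
        2 * self.kv_width() * elem_bytes
    }

    /// RoPE inverse frequency for rotary pair `pair` in 0..head_dim/2.
    pub fn rope_inv_freq(&self, pair: usize) -> f64 {
        let exponent = -2.0 * (pair as f64) / (self.head_dim as f64);
        self.rope_theta.powf(exponent)
    }
}

/// Reference per-head RMSNorm over one head's vector, applied in place.
/// `eps` is a caller-supplied assumption; the issue does not specify it.
pub fn rms_norm_head(x: &mut [f32], weight: &[f32], eps: f32) {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let scale = 1.0 / (mean_sq + eps).sqrt();
    for (v, w) in x.iter_mut().zip(weight) {
        *v = *v * scale * *w;
    }
}
```

For the 1.7B row this gives a Q width of 2048, a K/V width of 1024, and two query heads per KV head. A reference like this is mainly useful for checking CUDA kernel outputs against a known-simple implementation; it is not meant for the hot path.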

## Acceptance Criteria

- A native Psionic CUDA Qwen3 path runs `medpsy-1.7b-q4_k_m-imat.gguf` without Candle `quantized_qwen3` in the hot decode path.
- The MedPsy benchmark can select the native path explicitly and records `execution_engine = psionic_qwen3_cuda` or another non-Candle native label.
- Retained benchmark matrix shows Psionic at parity or better against the Ollama llama.cpp runner on the same `archlinux` RTX 4080 host.
- Parity threshold: Psionic mean warm decode tok/s is at least the Ollama mean warm decode tok/s for the 512-token row, or the issue records exactly why parity cannot be claimed.
- The implementation preserves MedPsy medical policy and quantization refusal posture.
- No sidecar/runtime dependency is introduced.
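
The parity threshold above reduces to a comparison of two means. The following is a small sketch of that check; `mean_tok_s`, `ParityVerdict`, and `parity_verdict` are hypothetical names for illustration and are not part of `medpsy_bench`.

```rust
/// Mean of warm decode throughput samples in tok/s; `None` for an empty slice.
pub fn mean_tok_s(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        None
    } else {
        Some(samples.iter().sum::<f64>() / samples.len() as f64)
    }
}

/// Outcome of comparing Psionic against the comparator on the same row.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ParityVerdict {
    /// Psionic mean is at least the comparator mean.
    Parity { ratio: f64 },
    /// Psionic mean is below the comparator mean.
    Gap { ratio: f64, shortfall_tok_s: f64 },
}

/// Apply the threshold: Psionic mean warm decode tok/s must be at least
/// the comparator mean. Precondition: `comparator_mean` is positive.
pub fn parity_verdict(psionic_mean: f64, comparator_mean: f64) -> ParityVerdict {
    let ratio = psionic_mean / comparator_mean;
    if psionic_mean >= comparator_mean {
        ParityVerdict::Parity { ratio }
    } else {
        ParityVerdict::Gap {
            ratio,
            shortfall_tok_s: comparator_mean - psionic_mean,
        }
    }
}
```

With the approximate means from Current Evidence (~239 tok/s against ~369 tok/s), this yields a gap with a ratio of about 0.65 and a shortfall of roughly 130 tok/s.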

## Validation

Run on the Tailnet `archlinux` RTX 4080 host from a clean worktree:

```bash
# List compute processes currently on the GPU (ideally none, so the benchmark runs uncontended).
nvidia-smi --query-compute-apps=pid,process_name,used_gpu_memory --format=csv,noheader,nounits

cargo run --release -p psionic-models --features medpsy-cuda --example medpsy_bench -- \
  --model-path /tmp/medpsy-1.7b-q4_k_m-imat.gguf \
  --artifact-kind gguf \
  --model-size 1.7b \
  --backend cuda \
  --prompt-token-ids 151644 \
  --max-new-tokens 512 \
  --repeats 3 \
  --json-out fixtures/medpsy/benchmarks/manual/medpsy_17b_q4_cuda_native.json
```

Then rerun the Ollama comparator with the same artifact and prompt/token cap, and update:

```text
fixtures/medpsy/benchmarks/medpsy_comparator_matrix_20260511_local.json
docs/MEDPSY_BENCHMARK.md
docs/NON_GPT_OSS_MEDPSY_QWEN_PILOT.md
```

## Claim Boundary

Until this issue closes, Psionic has a Rust CUDA MedPsy path but cannot claim GPU parity with the Tether-recommended llama.cpp-class runtime.

