Skip to content

Remove cuDNN/MIOpen activation broadcast overloads (fixes #504, #509)#686

Open
CarloLucibello wants to merge 2 commits into
masterfrom
cl/piracy
Open

Remove cuDNN/MIOpen activation broadcast overloads (fixes #504, #509)#686
CarloLucibello wants to merge 2 commits into
masterfrom
cl/piracy

Conversation

@CarloLucibello

@CarloLucibello CarloLucibello commented Jun 7, 2026

Copy link
Copy Markdown
Member

The activation broadcasts (relu, σ, elu, tanh, ...) on CuArray/ROCArray were routed through cuDNN's cudnnActivationForward! / MIOpen by pirating Base.materialize and Base.materialize!. This had two problems:

As suggested by maintainers in both issues, for these memory-bandwidth-bound elementwise ops the native GPU broadcast is correct and (expected to be) just as fast, so the custom overloads are removed entirely. cuDNN/MIOpen are still used where they actually help — conv, pooling, batchnorm — which are untouched.

Changes

  • ext/NNlibCUDACUDNNExt/activations.jl — remove the materialize/materialize! loop and its now-unused cuDNN activation imports.
  • ext/NNlibAMDGPUExt/activations.jl — remove the materialize loop.
  • Keep the identity broadcast shortcut in both (avoids an allocation in the common no-activation case; not implicated in either issue).
  • Add NaN-propagation regression tests for both backends.

TODO before merge

  • Benchmark native GPU broadcast vs. the removed cuDNN/MIOpen activation path, to confirm there's no meaningful performance regression (the benchmarking requested in relu propagates NaN on CPU but not on GPU #509 that was never done). If a gap shows up, we can revisit a NaN-safe, non-pirating reimplementation.
  • Run the GPU test suites (CUDA + AMDGPU) on real hardware — the new NaN tests were authored but not executed locally.
  • Version bump in Project.toml.

Fixes #504
Fixes #509
Fixed #512

🤖 Generated with Claude Code

The activation broadcasts (relu, σ, elu, tanh, ...) on CuArray/ROCArray were
routed through cuDNN's `cudnnActivationForward!` / MIOpen by pirating
`Base.materialize` and `Base.materialize!`. This had two problems:

- Type piracy of the central `materialize` methods caused method invalidations
  and load-time latency (#504).
- cuDNN/MIOpen do not propagate NaNs by default, so e.g. `relu.(cu([NaN]))`
  returned `0` instead of `NaN`, silently diverging from the CPU (#509).

For these memory-bandwidth-bound elementwise ops the native GPU broadcast is
correct and just as fast, so the custom overloads are removed entirely. The
`identity` broadcast shortcut is kept (it avoids an allocation in the common
no-activation case and is not implicated in either issue). Adds NaN-propagation
regression tests for both backends.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@CarloLucibello

CarloLucibello commented Jun 8, 2026

Copy link
Copy Markdown
Member Author

CUDA activation broadcast benchmarks (pre vs post)

I added GPU benchmarks for this change under benchmark/cuda/ and ran them on an RTX 5090 / CUDA 13.2 / cuDNN 9.2 / Julia 1.12.

The script measures the paths in the same process, so it's apples-to-apples on one machine:

column path corresponds to
native f.(x) / broadcast!(f, dst, x) this PR (post)
cudnn cudnnActivationForward!(dst, x; mode) pre-PR
fast tanh_fast.(x) / sigmoid_fast.(x) native fast approx (never cuDNN-routed)

ratio = native / cudnn>1 means the post-PR native path is slower. Times are the min of 1000 GPU-synced samples, in µs (out-of-place).

native vs cuDNN (nat/cu)

act eltype 1024² 224×224×3×32 56×56×64×64
relu Float16 1.44× 3.44× 5.62×
relu Float32 1.34× 2.82× 1.87×
relu Float64 1.18× 1.06× 1.00×
sigmoid Float16 1.81× 3.55× 5.46×
sigmoid Float32 1.47× 3.10× 2.36×
sigmoid Float64 1.11× 1.12× 1.16×
elu Float16 1.69× 3.62× 5.64×
elu Float32 1.55× 2.91× 2.22×
elu Float64 0.78× 0.71× 0.76×
tanh Float16 1.45× 3.38× 5.26×
tanh Float32 1.51× 2.94× 2.29×
tanh Float64 1.00× 1.03× 1.05×
  • Float64, large memory-bound tensors — native is on par (tanh/relu within a few %) or faster (elu ~0.7×, cuDNN's ELU does extra work). ✅ "just as fast" holds.
  • Small tensors — cuDNN has lower CPU-side launch overhead, so native is a few µs slower in absolute terms.
  • Float16 — native is markedly slower (up to ~5–6×). The Float16 native time is essentially identical to the Float32 native time, i.e. CUDA.jl's broadcast isn't vectorizing Float16 (half2); cuDNN's Float16 kernel does scale. E.g. relu 56×56×64×64: 108 µs native vs 19 µs cuDNN.

fast variants — do tanh_fast / sigmoid_fast help on GPU?

tanh_fast and sigmoid_fast are NNlib's faster approximations (relu/elu have none). They were never cuDNN-routed, so this PR doesn't change them — the question is just whether they're a useful native alternative on GPU. fst/cu = fast / cudnn:

act eltype native/cu fast/cu
sigmoid Float16 (56³×) 5.46× 5.47×
sigmoid Float32 (56³×) 2.36× 2.37×
sigmoid Float64 (1024²) 1.11× 1.04×
tanh Float16 (56³×) 5.26× 5.26×
tanh Float32 (56³×) 2.29× 2.38×
tanh Float64 (1024²) 1.00× 0.86×
tanh Float64 (224×224×3×32) 1.03× 0.88×
  • For Float16/Float32 the fast variants are no faster than plain native — they're essentially identical. The fast approximations were designed to avoid expensive CPU exp/tanh; GPUs have hardware-fast transcendentals, so there's nothing to save. (For Float16, sigmoid_fast literally calls sigmoid and tanh_fast falls back to tanh.) So they do not close the Float16 gap to cuDNN.
  • The one real win is Float64 tanh_fast, ~10–14% faster than native tanh — enough to beat cuDNN (0.86–0.94×). sigmoid_fast Float64 also trims native slightly (1.04× vs 1.11×).

Takeaway

The correctness (NaN propagation, #509) and latency (invalidations, #504) arguments stand on their own. On raw throughput native broadcast is competitive for Float32/Float64, and the fast variants don't change that picture — Float16 elementwise throughput is still left on the table. The right fix is Float16 (half2) vectorization in CUDA.jl's broadcast, not re-introducing the cuDNN piracy.

How to run / full results
julia --project=benchmark/cuda -e 'using Pkg; Pkg.develop(path="."); Pkg.instantiate()'
julia --project=benchmark/cuda benchmark/cuda/activations.jl

Full out-of-place + in-place table (incl. fast columns) is in benchmark/cuda/results.txt.

Benchmarks comparing the native CUDA.jl broadcast (post-PR) against the
cuDNN-routed `cudnnActivationForward!` (pre-PR) for relu/sigmoid/elu/tanh,
plus the `tanh_fast`/`sigmoid_fast` native variants, across Float16/Float32/
Float64 and several tensor shapes. Includes a captured run (results.txt) and
a README summarizing the methodology and findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@CarloLucibello

CarloLucibello commented Jun 9, 2026

Copy link
Copy Markdown
Member Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Do we need src/cudnn/activations.jl? relu propagates NaN on CPU but not on GPU type-piracy of materialize

1 participant