Remove cuDNN/MIOpen activation broadcast overloads (fixes #504, #509) by CarloLucibello · Pull Request #686 · FluxML/NNlib.jl

CarloLucibello · 2026-06-07T19:28:15Z

The activation broadcasts (relu, σ, elu, tanh, ...) on CuArray/ROCArray were routed through cuDNN's cudnnActivationForward! / MIOpen by pirating Base.materialize and Base.materialize!. This had two problems:

- Type piracy of the central materialize methods caused method invalidations and load-time latency (issue #504, "type-piracy of materialize").
- NaN swallowing: cuDNN/MIOpen do not propagate NaNs by default, so e.g. relu.(cu([NaN])) returned 0 instead of NaN, silently diverging from the CPU (issue #509, "relu propagates NaN on CPU but not on GPU").
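To make the NaN issue concrete, here is a minimal CPU-only sketch of how two plausible relu formulations differ on NaN input. Both function names are hypothetical and written only for illustration: `relu_ref` mirrors the NaN-preserving CPU behavior described above, while `relu_swallow` is a stand-in for any implementation that maps NaN to 0. It is not a reproduction of cuDNN's or MIOpen's actual kernels.

```julia
# NaN-preserving form: `NaN < 0` is false, so the input is returned unchanged.
relu_ref(x) = ifelse(x < 0, zero(x), x)

# Hypothetical NaN-swallowing form: `NaN > 0` is also false,
# so NaN falls into the "return zero" branch.
relu_swallow(x) = x > 0 ? x : zero(x)

xs = Float32[NaN, -1, 0, 2]

ys_ref     = relu_ref.(xs)      # first element stays NaN
ys_swallow = relu_swallow.(xs)  # first element becomes 0

println(ys_ref)
println(ys_swallow)
```

The two versions agree on every finite input, which is why the divergence is easy to miss without a dedicated NaN test.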

As suggested by maintainers in both issues, for these memory-bandwidth-bound elementwise ops the native GPU broadcast is correct and (expected to be) just as fast, so the custom overloads are removed entirely. cuDNN/MIOpen are still used where they actually help — conv, pooling, batchnorm — which are untouched.

Changes

- ext/NNlibCUDACUDNNExt/activations.jl: remove the materialize/materialize! loop and its now-unused cuDNN activation imports.
- ext/NNlibAMDGPUExt/activations.jl: remove the materialize loop.
- Keep the identity broadcast shortcut in both (avoids an allocation in the common no-activation case; not implicated in either issue).
- Add NaN-propagation regression tests for both backends.
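A NaN-propagation regression test of the kind listed above could look roughly like the sketch below. The helper `test_nan_propagation`, its `to_device` argument, and the local `relu_ref` are hypothetical names introduced here, not the tests added in this PR. The sketch runs on the CPU with `identity`; on a GPU one would presumably pass `cu` (CUDA.jl) or `ROCArray` (AMDGPU.jl) together with the NNlib activations.

```julia
using Test

# Hypothetical helper: `to_device` moves a host array to the backend under test;
# `identity` keeps it on the CPU so the sketch runs without a GPU.
function test_nan_propagation(to_device, activations; T=Float32)
    x = to_device(T[NaN, -1, 0, 2])
    @testset "NaN propagation: $f" for f in activations
        y = Array(f.(x))               # bring the result back to host memory
        @test isnan(y[1])              # NaN in must give NaN out
        @test !any(isnan, y[2:end])    # finite inputs must stay finite
    end
end

# Local stand-in for a NaN-preserving relu, so no extra packages are needed.
relu_ref(x) = ifelse(x < 0, zero(x), x)

test_nan_propagation(identity, (relu_ref, tanh))
```

Parameterizing over the device constructor keeps one test body for both backends, so the CUDA and AMDGPU suites cannot drift apart.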

TODO before merge

- Benchmark native GPU broadcast against the removed cuDNN/MIOpen activation path to confirm there is no meaningful performance regression; this is the benchmarking requested in #509 ("relu propagates NaN on CPU but not on GPU") that was never done. If a gap shows up, we can revisit a NaN-safe, non-pirating reimplementation.
- Run the GPU test suites (CUDA + AMDGPU) on real hardware: the new NaN tests were authored but not executed locally.
- Version bump in Project.toml.

Fixes #504
Fixes #509
Fixed #512

🤖 Generated with Claude Code

The activation broadcasts (relu, σ, elu, tanh, ...) on CuArray/ROCArray were routed through cuDNN's `cudnnActivationForward!` / MIOpen by pirating `Base.materialize` and `Base.materialize!`. This had two problems:

- Type piracy of the central `materialize` methods caused method invalidations and load-time latency (#504).
- cuDNN/MIOpen do not propagate NaNs by default, so e.g. `relu.(cu([NaN]))` returned `0` instead of `NaN`, silently diverging from the CPU (#509).

For these memory-bandwidth-bound elementwise ops the native GPU broadcast is correct and just as fast, so the custom overloads are removed entirely. The `identity` broadcast shortcut is kept (it avoids an allocation in the common no-activation case and is not implicated in either issue). Adds NaN-propagation regression tests for both backends.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

CarloLucibello · 2026-06-08T13:18:21Z

CUDA activation broadcast benchmarks (pre vs post)

I added GPU benchmarks for this change under benchmark/cuda/ and ran them on an RTX 5090 / CUDA 13.2 / cuDNN 9.2 / Julia 1.12.

The script measures the paths in the same process, so it's apples-to-apples on one machine:

| column | path | corresponds to |
| --- | --- | --- |
| `native` | `f.(x)` / `broadcast!(f, dst, x)` | this PR (post) |
| `cudnn` | `cudnnActivationForward!(dst, x; mode)` | pre-PR |
| `fast` | `tanh_fast.(x)` / `sigmoid_fast.(x)` | native fast approx (never cuDNN-routed) |

ratio = native / cudnn, so a value above 1 means the post-PR native path is slower. Times are the minimum of 1000 GPU-synchronized samples, in µs (out-of-place).
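For readers who want to reproduce the "minimum of N synchronized samples" measurement without the benchmark script, the idea can be sketched as follows. This is an illustration under assumptions, not the code in benchmark/cuda/activations.jl: `min_time_us` and `ratio` are hypothetical helpers, and the `sync` argument defaults to a no-op so the sketch runs on the CPU. On a GPU one would pass something like `CUDA.synchronize` so the clock stops only after the kernel has finished.

```julia
# Minimum wall-clock time of `f()` over `samples` runs, in microseconds.
# `sync` is called after every run so asynchronous work completes before timing ends.
function min_time_us(f; samples=1000, sync=() -> nothing)
    f(); sync()                          # warm-up run, keeps compilation out of the timing
    best = typemax(UInt64)
    for _ in 1:samples
        t0 = time_ns()
        f(); sync()
        best = min(best, time_ns() - t0)
    end
    return best / 1e3                    # nanoseconds -> microseconds
end

# Ratio as defined above: values greater than 1 mean the native path is slower.
ratio(native_us, cudnn_us) = native_us / cudnn_us

# CPU-only usage example with an out-of-place-style broadcast into a preallocated buffer.
x = rand(Float32, 256, 256)
dst = similar(x)
t_native = min_time_us(() -> broadcast!(tanh, dst, x); samples=100)
println("tanh broadcast: ", t_native, " µs")
```

Taking the minimum rather than the mean filters out scheduler noise and one-off stalls, which matters when individual samples last only tens of microseconds.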

native vs cuDNN (`nat/cu`)

| act | eltype | 1024² | 224×224×3×32 | 56×56×64×64 |
| --- | --- | --- | --- | --- |
| relu | Float16 | 1.44× | 3.44× | 5.62× |
| relu | Float32 | 1.34× | 2.82× | 1.87× |
| relu | Float64 | 1.18× | 1.06× | 1.00× |
| sigmoid | Float16 | 1.81× | 3.55× | 5.46× |
| sigmoid | Float32 | 1.47× | 3.10× | 2.36× |
| sigmoid | Float64 | 1.11× | 1.12× | 1.16× |
| elu | Float16 | 1.69× | 3.62× | 5.64× |
| elu | Float32 | 1.55× | 2.91× | 2.22× |
| elu | Float64 | 0.78× | 0.71× | 0.76× |
| tanh | Float16 | 1.45× | 3.38× | 5.26× |
| tanh | Float32 | 1.51× | 2.94× | 2.29× |
| tanh | Float64 | 1.00× | 1.03× | 1.05× |

- Float64, large memory-bound tensors: native is on par (tanh/relu within a few %) or faster (elu ~0.7×, cuDNN's ELU does extra work). ✅ "just as fast" holds.
- Small tensors: cuDNN has lower CPU-side launch overhead, so native is a few µs slower in absolute terms.
- Float16: native is markedly slower (up to ~5–6×). The Float16 native time is essentially identical to the Float32 native time, i.e. CUDA.jl's broadcast isn't vectorizing Float16 (half2); cuDNN's Float16 kernel does scale. E.g. relu 56×56×64×64: 108 µs native vs 19 µs cuDNN.

fast variants — do `tanh_fast` / `sigmoid_fast` help on GPU?

tanh_fast and sigmoid_fast are NNlib's faster approximations (relu/elu have none). They were never cuDNN-routed, so this PR doesn't change them; the question is just whether they're a useful native alternative on GPU. Here fast/cu = fast / cudnn:

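| act | eltype | shape | native/cu | fast/cu |
| --- | --- | --- | --- | --- |
| sigmoid | Float16 | 56×56×64×64 | 5.46× | 5.47× |
| sigmoid | Float32 | 56×56×64×64 | 2.36× | 2.37× |
| sigmoid | Float64 | 1024² | 1.11× | 1.04× |
| tanh | Float16 | 56×56×64×64 | 5.26× | 5.26× |
| tanh | Float32 | 56×56×64×64 | 2.29× | 2.38× |
| tanh | Float64 | 1024² | 1.00× | 0.86× |
| tanh | Float64 | 224×224×3×32 | 1.03× | 0.88× |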

- For Float16/Float32 the fast variants are no faster than plain native: they're essentially identical. The fast approximations were designed to avoid expensive CPU exp/tanh; GPUs have hardware-fast transcendentals, so there's nothing to save. (For Float16, sigmoid_fast literally calls sigmoid and tanh_fast falls back to tanh.) So they do not close the Float16 gap to cuDNN.
- The one real win is Float64 tanh_fast, ~10–14% faster than native tanh, which is enough to beat cuDNN (0.86–0.94×). sigmoid_fast Float64 also trims native slightly (1.04× vs 1.11×).

Takeaway

The correctness (NaN propagation, #509) and latency (invalidations, #504) arguments stand on their own. On raw throughput, native broadcast is competitive for Float64 and roughly 1.3–3.1× slower than cuDNN for Float32 in the table above; the fast variants don't change that picture, and Float16 elementwise throughput is still left on the table. The right fix is Float16 (half2) vectorization in CUDA.jl's broadcast, not re-introducing the cuDNN piracy.

How to run / full results

julia --project=benchmark/cuda -e 'using Pkg; Pkg.develop(path="."); Pkg.instantiate()'
julia --project=benchmark/cuda benchmark/cuda/activations.jl

Full out-of-place + in-place table (incl. fast columns) is in benchmark/cuda/results.txt.

Benchmarks comparing the native CUDA.jl broadcast (post-PR) against the cuDNN-routed `cudnnActivationForward!` (pre-PR) for relu/sigmoid/elu/tanh, plus the `tanh_fast`/`sigmoid_fast` native variants, across Float16/Float32/Float64 and several tensor shapes. Includes a captured run (results.txt) and a README summarizing the methodology and findings.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

CarloLucibello · 2026-06-09T14:56:26Z

CUDA buildkite CI is failing due to missing imports
https://buildkite.com/julialang/nnlib-dot-jl/builds/1829/canvas?jid=019ea7da-06b5-4277-ae4d-4eaa26dfe58d&tab=output#L753

CarloLucibello wants to merge 2 commits into master from cl/piracy