Remove cuDNN/MIOpen activation broadcast overloads (fixes #504, #509)#686
Remove cuDNN/MIOpen activation broadcast overloads (fixes #504, #509)#686CarloLucibello wants to merge 2 commits into
Conversation
The activation broadcasts (relu, σ, elu, tanh, ...) on CuArray/ROCArray were routed through cuDNN's `cudnnActivationForward!` / MIOpen by pirating `Base.materialize` and `Base.materialize!`. This had two problems: - Type piracy of the central `materialize` methods caused method invalidations and load-time latency (#504). - cuDNN/MIOpen do not propagate NaNs by default, so e.g. `relu.(cu([NaN]))` returned `0` instead of `NaN`, silently diverging from the CPU (#509). For these memory-bandwidth-bound elementwise ops the native GPU broadcast is correct and just as fast, so the custom overloads are removed entirely. The `identity` broadcast shortcut is kept (it avoids an allocation in the common no-activation case and is not implicated in either issue). Adds NaN-propagation regression tests for both backends. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
CUDA activation broadcast benchmarks (pre vs post)I added GPU benchmarks for this change under The script measures the paths in the same process, so it's apples-to-apples on one machine:
native vs cuDNN (
|
| act | eltype | 1024² | 224×224×3×32 | 56×56×64×64 |
|---|---|---|---|---|
| relu | Float16 | 1.44× | 3.44× | 5.62× |
| relu | Float32 | 1.34× | 2.82× | 1.87× |
| relu | Float64 | 1.18× | 1.06× | 1.00× |
| sigmoid | Float16 | 1.81× | 3.55× | 5.46× |
| sigmoid | Float32 | 1.47× | 3.10× | 2.36× |
| sigmoid | Float64 | 1.11× | 1.12× | 1.16× |
| elu | Float16 | 1.69× | 3.62× | 5.64× |
| elu | Float32 | 1.55× | 2.91× | 2.22× |
| elu | Float64 | 0.78× | 0.71× | 0.76× |
| tanh | Float16 | 1.45× | 3.38× | 5.26× |
| tanh | Float32 | 1.51× | 2.94× | 2.29× |
| tanh | Float64 | 1.00× | 1.03× | 1.05× |
- Float64, large memory-bound tensors — native is on par (
tanh/reluwithin a few %) or faster (elu~0.7×, cuDNN's ELU does extra work). ✅ "just as fast" holds. - Small tensors — cuDNN has lower CPU-side launch overhead, so native is a few µs slower in absolute terms.
- Float16 — native is markedly slower (up to ~5–6×). The Float16 native time is essentially identical to the Float32 native time, i.e. CUDA.jl's broadcast isn't vectorizing Float16 (
half2); cuDNN's Float16 kernel does scale. E.g.relu56×56×64×64: 108 µs native vs 19 µs cuDNN.
fast variants — do tanh_fast / sigmoid_fast help on GPU?
tanh_fast and sigmoid_fast are NNlib's faster approximations (relu/elu have none). They were never cuDNN-routed, so this PR doesn't change them — the question is just whether they're a useful native alternative on GPU. fst/cu = fast / cudnn:
| act | eltype | native/cu | fast/cu |
|---|---|---|---|
| sigmoid | Float16 (56³×) | 5.46× | 5.47× |
| sigmoid | Float32 (56³×) | 2.36× | 2.37× |
| sigmoid | Float64 (1024²) | 1.11× | 1.04× |
| tanh | Float16 (56³×) | 5.26× | 5.26× |
| tanh | Float32 (56³×) | 2.29× | 2.38× |
| tanh | Float64 (1024²) | 1.00× | 0.86× |
| tanh | Float64 (224×224×3×32) | 1.03× | 0.88× |
- For Float16/Float32 the fast variants are no faster than plain native — they're essentially identical. The fast approximations were designed to avoid expensive CPU
exp/tanh; GPUs have hardware-fast transcendentals, so there's nothing to save. (For Float16,sigmoid_fastliterally callssigmoidandtanh_fastfalls back totanh.) So they do not close the Float16 gap to cuDNN. - The one real win is Float64
tanh_fast, ~10–14% faster than nativetanh— enough to beat cuDNN (0.86–0.94×).sigmoid_fastFloat64 also trims native slightly (1.04× vs 1.11×).
Takeaway
The correctness (NaN propagation, #509) and latency (invalidations, #504) arguments stand on their own. On raw throughput native broadcast is competitive for Float32/Float64, and the fast variants don't change that picture — Float16 elementwise throughput is still left on the table. The right fix is Float16 (half2) vectorization in CUDA.jl's broadcast, not re-introducing the cuDNN piracy.
How to run / full results
julia --project=benchmark/cuda -e 'using Pkg; Pkg.develop(path="."); Pkg.instantiate()'
julia --project=benchmark/cuda benchmark/cuda/activations.jl
Full out-of-place + in-place table (incl. fast columns) is in benchmark/cuda/results.txt.
Benchmarks comparing the native CUDA.jl broadcast (post-PR) against the cuDNN-routed `cudnnActivationForward!` (pre-PR) for relu/sigmoid/elu/tanh, plus the `tanh_fast`/`sigmoid_fast` native variants, across Float16/Float32/ Float64 and several tensor shapes. Includes a captured run (results.txt) and a README summarizing the methodology and findings. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
CUDA buildkite CI is failing due to missing imports |
The activation broadcasts (
relu,σ,elu,tanh, ...) onCuArray/ROCArraywere routed through cuDNN'scudnnActivationForward!/ MIOpen by piratingBase.materializeandBase.materialize!. This had two problems:materializemethods caused method invalidations and load-time latency (type-piracy of materialize #504).relu.(cu([NaN]))returned0instead ofNaN, silently diverging from the CPU (relupropagates NaN on CPU but not on GPU #509).As suggested by maintainers in both issues, for these memory-bandwidth-bound elementwise ops the native GPU broadcast is correct and (expected to be) just as fast, so the custom overloads are removed entirely. cuDNN/MIOpen are still used where they actually help — conv, pooling, batchnorm — which are untouched.
Changes
ext/NNlibCUDACUDNNExt/activations.jl— remove thematerialize/materialize!loop and its now-unused cuDNN activation imports.ext/NNlibAMDGPUExt/activations.jl— remove thematerializeloop.identitybroadcast shortcut in both (avoids an allocation in the common no-activation case; not implicated in either issue).TODO before merge
relupropagates NaN on CPU but not on GPU #509 that was never done). If a gap shows up, we can revisit a NaN-safe, non-pirating reimplementation.Project.toml.Fixes #504
Fixes #509
Fixed #512
🤖 Generated with Claude Code