Add narrow-precision floats to cudaDataType by AntonOresten · Pull Request #3180 · JuliaGPU/CUDA.jl

AntonOresten · 2026-06-21T12:49:30Z

Adds narrow-precision floats to the cudaDataType enum based on library_types.h in CUDA 13:

    CUDA_R_8F_E4M3 = 28, /* real as a nv_fp8_e4m3 */
    CUDA_R_8F_UE4M3 = CUDA_R_8F_E4M3, /* real as an unsigned nv_fp8_e4m3 */
    CUDA_R_8F_E5M2 = 29, /* real as a nv_fp8_e5m2 */
    CUDA_R_8F_UE8M0 = 30,  /* real as an exponent-only unsigned nv_fp8_e8m0 */
    CUDA_R_6F_E2M3  = 31,  /* real as a nv_fp6_e2m3 */
    CUDA_R_6F_E3M2  = 32,  /* real as a nv_fp6_e3m2 */
    CUDA_R_4F_E2M1  = 33,  /* real as a nv_fp4_e2m1 */

R_8F_UE4M3 is an alias of R_8F_E4M3 with the same value (28). The difference is purely semantic: when E4M3 is used as the scale type in the NVFP4 block-scaling format, the scale's sign is meaningless, since every element already carries its own sign.

Packages such as DLFP8Types.jl and Microfloats.jl could define, for example, `Base.convert(::Type{CUDACore.cudaDataType}, ::Type{Float8_E5M2}) = CUDACore.R_8F_E5M2` in an extension. Alternatively, there could be a dedicated `jltype_to_cudaDataType` function.

The reverse mapping doesn't work, though: each cudaDataType can map to only one Julia type, while several packages may define their own type for the same format. I'm not sure whether this is a problem.
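To make the asymmetry concrete, here is a minimal sketch and not the PR's implementation. It assumes a stand-in enum in place of the real CUDA.jl definition, two hypothetical 8-bit float types standing in for types from separate packages, and uses `jltype_to_cudaDataType` as the hypothetical helper name floated above.

```julia
# Stand-in for the enum touched by this PR (assumption: the real definition
# lives in CUDA.jl). Values are copied from the library_types.h excerpt.
@enum cudaDataType::UInt32 begin
    R_8F_E4M3 = 28
    R_8F_E5M2 = 29
    R_8F_UE8M0 = 30
    R_6F_E2M3 = 31
    R_6F_E3M2 = 32
    R_4F_E2M1 = 33
end

# In this sketch the alias is modeled as a plain constant bound to the same member.
const R_8F_UE4M3 = R_8F_E4M3

# Hypothetical E5M2 types, as two unrelated packages might define them.
primitive type PkgA_Float8_E5M2 8 end
primitive type PkgB_Float8_E5M2 8 end

# Forward mapping: each package adds its own method (hypothetical helper name).
jltype_to_cudaDataType(::Type{PkgA_Float8_E5M2}) = R_8F_E5M2
jltype_to_cudaDataType(::Type{PkgB_Float8_E5M2}) = R_8F_E5M2

# Both Julia types resolve to the same enum member, so a reverse lookup from
# R_8F_E5M2 would have to pick one of them arbitrarily.
same_target = jltype_to_cudaDataType(PkgA_Float8_E5M2) ===
              jltype_to_cudaDataType(PkgB_Float8_E5M2)
```

The forward direction extends cleanly because dispatch is keyed on the Julia type, which each package owns. The reverse direction is keyed on the enum value, which no single package owns.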

github-actions

CUDA.jl Benchmarks

Benchmark suite	Current: `f41f373`	Previous: `bc81d40`	Ratio
`array/accumulate/Float32/1d`	`98915` ns	`98667` ns	`1.00`
`array/accumulate/Float32/dims=1`	`74908` ns	`74079` ns	`1.01`
`array/accumulate/Float32/dims=1L`	`1599681` ns	`1599572` ns	`1.00`
`array/accumulate/Float32/dims=2`	`140661` ns	`139645` ns	`1.01`
`array/accumulate/Float32/dims=2L`	`660907` ns	`660578` ns	`1.00`
`array/accumulate/Int64/1d`	`118396` ns	`118201` ns	`1.00`
`array/accumulate/Int64/dims=1`	`79173` ns	`79258` ns	`1.00`
`array/accumulate/Int64/dims=1L`	`1716336` ns	`1715309` ns	`1.00`
`array/accumulate/Int64/dims=2`	`153488` ns	`153091` ns	`1.00`
`array/accumulate/Int64/dims=2L`	`987716` ns	`987286` ns	`1.00`
`array/broadcast`	`18273` ns	`18223` ns	`1.00`
`array/construct`	`1078.1` ns	`1098.7` ns	`0.98`
`array/copy`	`16657` ns	`16605` ns	`1.00`
`array/copyto!/cpu_to_gpu`	`208688` ns	`206529` ns	`1.01`
`array/copyto!/gpu_to_cpu`	`240551` ns	`239820` ns	`1.00`
`array/copyto!/gpu_to_gpu`	`8901` ns	`8819` ns	`1.01`
`array/iteration/findall/bool`	`132729` ns	`133125` ns	`1.00`
`array/iteration/findall/int`	`146300` ns	`146351` ns	`1.00`
`array/iteration/findfirst/bool`	`68470` ns	`69131` ns	`0.99`
`array/iteration/findfirst/int`	`69955` ns	`70202` ns	`1.00`
`array/iteration/findmin/1d`	`65436` ns	`62891` ns	`1.04`
`array/iteration/findmin/2d`	`99866` ns	`99773` ns	`1.00`
`array/iteration/logical`	`188593` ns	`188295` ns	`1.00`
`array/iteration/scalar`	`63264` ns	`63035` ns	`1.00`
`array/permutedims/2d`	`49404` ns	`48726` ns	`1.01`
`array/permutedims/3d`	`51027` ns	`50552` ns	`1.01`
`array/permutedims/4d`	`50367` ns	`50019` ns	`1.01`
`array/random/rand/Float32`	`11843` ns	`11718` ns	`1.01`
`array/random/rand/Int64`	`23589` ns	`21972` ns	`1.07`
`array/random/rand!/Float32`	`7898.5` ns	`7684.75` ns	`1.03`
`array/random/rand!/Int64`	`19613` ns	`17879` ns	`1.10`
`array/random/randn/Float32`	`33933` ns	`33563` ns	`1.01`
`array/random/randn!/Float32`	`23887` ns	`24207` ns	`0.99`
`array/reductions/mapreduce/Float32/1d`	`31916` ns	`31970` ns	`1.00`
`array/reductions/mapreduce/Float32/dims=1`	`37777` ns	`37560` ns	`1.01`
`array/reductions/mapreduce/Float32/dims=1L`	`50858` ns	`50748` ns	`1.00`
`array/reductions/mapreduce/Float32/dims=2`	`54915` ns	`54859` ns	`1.00`
`array/reductions/mapreduce/Float32/dims=2L`	`66733` ns	`66258` ns	`1.01`
`array/reductions/mapreduce/Int64/1d`	`38705` ns	`38667` ns	`1.00`
`array/reductions/mapreduce/Int64/dims=1`	`40340` ns	`40569` ns	`0.99`
`array/reductions/mapreduce/Int64/dims=1L`	`88805` ns	`88767` ns	`1.00`
`array/reductions/mapreduce/Int64/dims=2`	`57540` ns	`57046` ns	`1.01`
`array/reductions/mapreduce/Int64/dims=2L`	`83176` ns	`83249` ns	`1.00`
`array/reductions/reduce/Float32/1d`	`31908` ns	`31908` ns	`1`
`array/reductions/reduce/Float32/dims=1`	`37850` ns	`37834` ns	`1.00`
`array/reductions/reduce/Float32/dims=1L`	`50960` ns	`50899` ns	`1.00`
`array/reductions/reduce/Float32/dims=2`	`55366` ns	`54907` ns	`1.01`
`array/reductions/reduce/Float32/dims=2L`	`68197` ns	`68257` ns	`1.00`
`array/reductions/reduce/Int64/1d`	`39430` ns	`38884` ns	`1.01`
`array/reductions/reduce/Int64/dims=1`	`40342` ns	`40427` ns	`1.00`
`array/reductions/reduce/Int64/dims=1L`	`88629` ns	`88694` ns	`1.00`
`array/reductions/reduce/Int64/dims=2`	`57334` ns	`57232` ns	`1.00`
`array/reductions/reduce/Int64/dims=2L`	`82733` ns	`82964` ns	`1.00`
`array/reverse/1d`	`16912` ns	`17058` ns	`0.99`
`array/reverse/1dL`	`69640` ns	`69721` ns	`1.00`
`array/reverse/1dL_inplace`	`67005` ns	`66952` ns	`1.00`
`array/reverse/1d_inplace`	`9601.666666666666` ns	`9893.666666666666` ns	`0.97`
`array/reverse/2d`	`19775` ns	`19541` ns	`1.01`
`array/reverse/2dL`	`73114` ns	`72830` ns	`1.00`
`array/reverse/2dL_inplace`	`66691` ns	`66851` ns	`1.00`
`array/reverse/2d_inplace`	`9972` ns	`10080` ns	`0.99`
`array/sorting/1d`	`2655411` ns	`2667782` ns	`1.00`
`array/sorting/2d`	`1039053` ns	`1038338` ns	`1.00`
`array/sorting/by`	`3193248` ns	`3193204` ns	`1.00`
`cuda/synchronization/context/auto`	`1042.9` ns	`1031.5` ns	`1.01`
`cuda/synchronization/context/blocking`	`813.9222222222222` ns	`791.2282608695652` ns	`1.03`
`cuda/synchronization/context/nonblocking`	`5797.833333333333` ns	`5839.5` ns	`0.99`
`cuda/synchronization/stream/auto`	`886` ns	`880.5576923076923` ns	`1.01`
`cuda/synchronization/stream/blocking`	`685.4183006535948` ns	`677.7908496732026` ns	`1.01`
`cuda/synchronization/stream/nonblocking`	`5644.166666666667` ns	`5561.333333333333` ns	`1.01`
`integration/byval/reference`	`147575` ns	`147498` ns	`1.00`
`integration/byval/slices=1`	`149611` ns	`149705` ns	`1.00`
`integration/byval/slices=2`	`292499` ns	`292393` ns	`1.00`
`integration/byval/slices=3`	`435006` ns	`435188` ns	`1.00`
`integration/cudadevrt`	`104601` ns	`104517` ns	`1.00`
`integration/volumerhs`	`9305642` ns	`9306993` ns	`1.00`
`kernel/indexing`	`12363` ns	`12494` ns	`0.99`
`kernel/indexing_checked`	`13190` ns	`13141` ns	`1.00`
`kernel/launch`	`2010` ns	`2103.777777777778` ns	`0.96`
`kernel/occupancy`	`642.6686746987951` ns	`683.375` ns	`0.94`
`kernel/rand`	`14890` ns	`13705` ns	`1.09`
`latency/import`	`3889951490` ns	`3893015807` ns	`1.00`
`latency/precompile`	`4658193847` ns	`4656329210` ns	`1.00`
`latency/ttfp`	`5403708128` ns	`5439496872` ns	`0.99`

This comment was automatically generated by workflow using github-action-benchmark.

AntonOresten added 2 commits June 21, 2026 14:21

Add narrow-precision floats to cudaDataType [only julia] [skip benchmarks]

f4bf399

Add R_8F_UE4M3 alias of R_8F_E4M3 [only julia] [skip benchmarks]

f41f373

AntonOresten force-pushed the append-narrow branch from bc16c4c to f41f373 Compare June 21, 2026 13:20

github-actions Bot reviewed Jun 21, 2026

AntonOresten wants to merge 2 commits into JuliaGPU:main from AntonOresten:append-narrow
