Add narrow-precision floats to cudaDataType#3180

Open: AntonOresten wants to merge 2 commits into JuliaGPU:main from AntonOresten:append-narrow

@AntonOresten

Contributor

Adds narrow-precision floats to the cudaDataType enum based on library_types.h in CUDA 13:

    CUDA_R_8F_E4M3 = 28, /* real as a nv_fp8_e4m3 */
    CUDA_R_8F_UE4M3 = CUDA_R_8F_E4M3, /* real as an unsigned nv_fp8_e4m3 */
    CUDA_R_8F_E5M2 = 29, /* real as a nv_fp8_e5m2 */
    CUDA_R_8F_UE8M0 = 30,  /* real as an exponent-only unsigned nv_fp8_e8m0 */
    CUDA_R_6F_E2M3  = 31,  /* real as a nv_fp6_e2m3 */
    CUDA_R_6F_E3M2  = 32,  /* real as a nv_fp6_e3m2 */
    CUDA_R_4F_E2M1  = 33,  /* real as a nv_fp4_e2m1 */

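R_8F_UE4M3 is an alias of R_8F_E4M3 (both have the value 28). The two differ only semantically: the unsigned variant is used for scale factors in the NVFP4 block-scaling format, where the sign of the scale is meaningless because every element carries its own sign anyway.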
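To make the alias concrete, here is a minimal C sketch. It uses a local copy of the enum values quoted above rather than the real library_types.h, and the names narrow_type_t, narrow_type_bits and narrow_type_name are hypothetical helpers introduced only for illustration.

```c
/* Local stand-in for the quoted subset of cudaDataType.
 * This is NOT the real header; values are copied from the listing above. */
typedef enum {
    CUDA_R_8F_E4M3  = 28,
    CUDA_R_8F_UE4M3 = CUDA_R_8F_E4M3, /* alias: same value as E4M3 */
    CUDA_R_8F_E5M2  = 29,
    CUDA_R_8F_UE8M0 = 30,
    CUDA_R_6F_E2M3  = 31,
    CUDA_R_6F_E3M2  = 32,
    CUDA_R_4F_E2M1  = 33
} narrow_type_t;

/* Hypothetical helper: nominal element width in bits, read off the names. */
static int narrow_type_bits(narrow_type_t t) {
    switch (t) {
    /* CUDA_R_8F_UE4M3 cannot get its own case label: it has the same
     * value as CUDA_R_8F_E4M3, so a second label would be a duplicate. */
    case CUDA_R_8F_E4M3:
    case CUDA_R_8F_E5M2:
    case CUDA_R_8F_UE8M0:
        return 8;
    case CUDA_R_6F_E2M3:
    case CUDA_R_6F_E3M2:
        return 6;
    case CUDA_R_4F_E2M1:
        return 4;
    }
    return -1; /* value outside the quoted subset */
}

/* Hypothetical reverse lookup: a value can yield only one name,
 * so the signed spelling is returned for both E4M3 and UE4M3. */
static const char *narrow_type_name(narrow_type_t t) {
    switch (t) {
    case CUDA_R_8F_E4M3:  return "CUDA_R_8F_E4M3";
    case CUDA_R_8F_E5M2:  return "CUDA_R_8F_E5M2";
    case CUDA_R_8F_UE8M0: return "CUDA_R_8F_UE8M0";
    case CUDA_R_6F_E2M3:  return "CUDA_R_6F_E2M3";
    case CUDA_R_6F_E3M2:  return "CUDA_R_6F_E3M2";
    case CUDA_R_4F_E2M1:  return "CUDA_R_4F_E2M1";
    }
    return "unknown";
}
```

Because the two enumerators share one value, code that receives the value alone cannot tell which spelling the caller used; any signed/unsigned distinction has to come from context.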

Packages that implement these formats, like DLFP8Types.jl and Microfloats.jl, could define e.g. Base.convert(::Type{CUDACore.cudaDataType}, ::Type{Float8_E5M2}) = CUDACore.R_8F_E5M2 in an extension. Alternatively, there could be a dedicated function such as jltype_to_cudaDataType.

The reverse direction doesn't work, though: each cudaDataType can map to only one Julia type, while more than one package may define a type for the same format. I'm not sure whether this is a problem.
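The asymmetry between the two directions can be sketched in C with a small lookup table. Everything here is an assumption made for illustration: the registry, the function names, and the type names "PkgA.Float8_E5M2" and "PkgB.Float8_E5M2" are made up and do not refer to the actual type names in DLFP8Types.jl or Microfloats.jl.

```c
#include <stddef.h>
#include <string.h>

/* Local stand-in for two of the quoted enum values (not the real header). */
enum { CUDA_R_8F_E4M3 = 28, CUDA_R_8F_E5M2 = 29 };

/* Hypothetical registry entry: a host-language type name and its enum value. */
struct type_entry {
    const char *type_name;
    int cuda_type;
};

/* Two made-up packages both provide an E5M2 type. */
static const struct type_entry registry[] = {
    { "PkgA.Float8_E5M2", CUDA_R_8F_E5M2 },
    { "PkgB.Float8_E5M2", CUDA_R_8F_E5M2 },
    { "PkgA.Float8_E4M3", CUDA_R_8F_E4M3 },
};
static const size_t registry_len = sizeof registry / sizeof registry[0];

/* Forward direction (type -> enum value): each name has exactly one
 * answer. Returns -1 for names that are not registered. */
static int to_cuda_type(const char *type_name) {
    for (size_t i = 0; i < registry_len; i++) {
        if (strcmp(registry[i].type_name, type_name) == 0)
            return registry[i].cuda_type;
    }
    return -1;
}

/* Reverse direction (enum value -> type): count how many registered
 * types claim the value. More than one means the lookup is ambiguous. */
static size_t count_candidates(int cuda_type) {
    size_t n = 0;
    for (size_t i = 0; i < registry_len; i++) {
        if (registry[i].cuda_type == cuda_type)
            n++;
    }
    return n;
}
```

In this sketch the forward lookup stays well defined however many packages register a type, while the reverse lookup for the E5M2 value finds two candidates and would need some tie-breaking rule.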

@github-actions github-actions Bot left a comment

Contributor

CUDA.jl Benchmarks

| Benchmark suite | Current: f41f373 | Previous: bc81d40 | Ratio |
| --- | --- | --- | --- |
| array/accumulate/Float32/1d | 98915 ns | 98667 ns | 1.00 |
| array/accumulate/Float32/dims=1 | 74908 ns | 74079 ns | 1.01 |
| array/accumulate/Float32/dims=1L | 1599681 ns | 1599572 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 140661 ns | 139645 ns | 1.01 |
| array/accumulate/Float32/dims=2L | 660907 ns | 660578 ns | 1.00 |
| array/accumulate/Int64/1d | 118396 ns | 118201 ns | 1.00 |
| array/accumulate/Int64/dims=1 | 79173 ns | 79258 ns | 1.00 |
| array/accumulate/Int64/dims=1L | 1716336 ns | 1715309 ns | 1.00 |
| array/accumulate/Int64/dims=2 | 153488 ns | 153091 ns | 1.00 |
| array/accumulate/Int64/dims=2L | 987716 ns | 987286 ns | 1.00 |
| array/broadcast | 18273 ns | 18223 ns | 1.00 |
| array/construct | 1078.1 ns | 1098.7 ns | 0.98 |
| array/copy | 16657 ns | 16605 ns | 1.00 |
| array/copyto!/cpu_to_gpu | 208688 ns | 206529 ns | 1.01 |
| array/copyto!/gpu_to_cpu | 240551 ns | 239820 ns | 1.00 |
| array/copyto!/gpu_to_gpu | 8901 ns | 8819 ns | 1.01 |
| array/iteration/findall/bool | 132729 ns | 133125 ns | 1.00 |
| array/iteration/findall/int | 146300 ns | 146351 ns | 1.00 |
| array/iteration/findfirst/bool | 68470 ns | 69131 ns | 0.99 |
| array/iteration/findfirst/int | 69955 ns | 70202 ns | 1.00 |
| array/iteration/findmin/1d | 65436 ns | 62891 ns | 1.04 |
| array/iteration/findmin/2d | 99866 ns | 99773 ns | 1.00 |
| array/iteration/logical | 188593 ns | 188295 ns | 1.00 |
| array/iteration/scalar | 63264 ns | 63035 ns | 1.00 |
| array/permutedims/2d | 49404 ns | 48726 ns | 1.01 |
| array/permutedims/3d | 51027 ns | 50552 ns | 1.01 |
| array/permutedims/4d | 50367 ns | 50019 ns | 1.01 |
| array/random/rand/Float32 | 11843 ns | 11718 ns | 1.01 |
| array/random/rand/Int64 | 23589 ns | 21972 ns | 1.07 |
| array/random/rand!/Float32 | 7898.5 ns | 7684.75 ns | 1.03 |
| array/random/rand!/Int64 | 19613 ns | 17879 ns | 1.10 |
| array/random/randn/Float32 | 33933 ns | 33563 ns | 1.01 |
| array/random/randn!/Float32 | 23887 ns | 24207 ns | 0.99 |
| array/reductions/mapreduce/Float32/1d | 31916 ns | 31970 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=1 | 37777 ns | 37560 ns | 1.01 |
| array/reductions/mapreduce/Float32/dims=1L | 50858 ns | 50748 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2 | 54915 ns | 54859 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2L | 66733 ns | 66258 ns | 1.01 |
| array/reductions/mapreduce/Int64/1d | 38705 ns | 38667 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=1 | 40340 ns | 40569 ns | 0.99 |
| array/reductions/mapreduce/Int64/dims=1L | 88805 ns | 88767 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2 | 57540 ns | 57046 ns | 1.01 |
| array/reductions/mapreduce/Int64/dims=2L | 83176 ns | 83249 ns | 1.00 |
| array/reductions/reduce/Float32/1d | 31908 ns | 31908 ns | 1 |
| array/reductions/reduce/Float32/dims=1 | 37850 ns | 37834 ns | 1.00 |
| array/reductions/reduce/Float32/dims=1L | 50960 ns | 50899 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2 | 55366 ns | 54907 ns | 1.01 |
| array/reductions/reduce/Float32/dims=2L | 68197 ns | 68257 ns | 1.00 |
| array/reductions/reduce/Int64/1d | 39430 ns | 38884 ns | 1.01 |
| array/reductions/reduce/Int64/dims=1 | 40342 ns | 40427 ns | 1.00 |
| array/reductions/reduce/Int64/dims=1L | 88629 ns | 88694 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2 | 57334 ns | 57232 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2L | 82733 ns | 82964 ns | 1.00 |
| array/reverse/1d | 16912 ns | 17058 ns | 0.99 |
| array/reverse/1dL | 69640 ns | 69721 ns | 1.00 |
| array/reverse/1dL_inplace | 67005 ns | 66952 ns | 1.00 |
| array/reverse/1d_inplace | 9601.666666666666 ns | 9893.666666666666 ns | 0.97 |
| array/reverse/2d | 19775 ns | 19541 ns | 1.01 |
| array/reverse/2dL | 73114 ns | 72830 ns | 1.00 |
| array/reverse/2dL_inplace | 66691 ns | 66851 ns | 1.00 |
| array/reverse/2d_inplace | 9972 ns | 10080 ns | 0.99 |
| array/sorting/1d | 2655411 ns | 2667782 ns | 1.00 |
| array/sorting/2d | 1039053 ns | 1038338 ns | 1.00 |
| array/sorting/by | 3193248 ns | 3193204 ns | 1.00 |
| cuda/synchronization/context/auto | 1042.9 ns | 1031.5 ns | 1.01 |
| cuda/synchronization/context/blocking | 813.9222222222222 ns | 791.2282608695652 ns | 1.03 |
| cuda/synchronization/context/nonblocking | 5797.833333333333 ns | 5839.5 ns | 0.99 |
| cuda/synchronization/stream/auto | 886 ns | 880.5576923076923 ns | 1.01 |
| cuda/synchronization/stream/blocking | 685.4183006535948 ns | 677.7908496732026 ns | 1.01 |
| cuda/synchronization/stream/nonblocking | 5644.166666666667 ns | 5561.333333333333 ns | 1.01 |
| integration/byval/reference | 147575 ns | 147498 ns | 1.00 |
| integration/byval/slices=1 | 149611 ns | 149705 ns | 1.00 |
| integration/byval/slices=2 | 292499 ns | 292393 ns | 1.00 |
| integration/byval/slices=3 | 435006 ns | 435188 ns | 1.00 |
| integration/cudadevrt | 104601 ns | 104517 ns | 1.00 |
| integration/volumerhs | 9305642 ns | 9306993 ns | 1.00 |
| kernel/indexing | 12363 ns | 12494 ns | 0.99 |
| kernel/indexing_checked | 13190 ns | 13141 ns | 1.00 |
| kernel/launch | 2010 ns | 2103.777777777778 ns | 0.96 |
| kernel/occupancy | 642.6686746987951 ns | 683.375 ns | 0.94 |
| kernel/rand | 14890 ns | 13705 ns | 1.09 |
| latency/import | 3889951490 ns | 3893015807 ns | 1.00 |
| latency/precompile | 4658193847 ns | 4656329210 ns | 1.00 |
| latency/ttfp | 5403708128 ns | 5439496872 ns | 0.99 |

This comment was automatically generated by workflow using github-action-benchmark.
