Add narrow-precision floats to cudaDataType#3180
Open
AntonOresten wants to merge 2 commits into
CUDA.jl Benchmarks
| Benchmark suite | Current: f41f373 | Previous: bc81d40 | Ratio |
|---|---|---|---|
| array/accumulate/Float32/1d | 98915 ns | 98667 ns | 1.00 |
| array/accumulate/Float32/dims=1 | 74908 ns | 74079 ns | 1.01 |
| array/accumulate/Float32/dims=1L | 1599681 ns | 1599572 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 140661 ns | 139645 ns | 1.01 |
| array/accumulate/Float32/dims=2L | 660907 ns | 660578 ns | 1.00 |
| array/accumulate/Int64/1d | 118396 ns | 118201 ns | 1.00 |
| array/accumulate/Int64/dims=1 | 79173 ns | 79258 ns | 1.00 |
| array/accumulate/Int64/dims=1L | 1716336 ns | 1715309 ns | 1.00 |
| array/accumulate/Int64/dims=2 | 153488 ns | 153091 ns | 1.00 |
| array/accumulate/Int64/dims=2L | 987716 ns | 987286 ns | 1.00 |
| array/broadcast | 18273 ns | 18223 ns | 1.00 |
| array/construct | 1078.1 ns | 1098.7 ns | 0.98 |
| array/copy | 16657 ns | 16605 ns | 1.00 |
| array/copyto!/cpu_to_gpu | 208688 ns | 206529 ns | 1.01 |
| array/copyto!/gpu_to_cpu | 240551 ns | 239820 ns | 1.00 |
| array/copyto!/gpu_to_gpu | 8901 ns | 8819 ns | 1.01 |
| array/iteration/findall/bool | 132729 ns | 133125 ns | 1.00 |
| array/iteration/findall/int | 146300 ns | 146351 ns | 1.00 |
| array/iteration/findfirst/bool | 68470 ns | 69131 ns | 0.99 |
| array/iteration/findfirst/int | 69955 ns | 70202 ns | 1.00 |
| array/iteration/findmin/1d | 65436 ns | 62891 ns | 1.04 |
| array/iteration/findmin/2d | 99866 ns | 99773 ns | 1.00 |
| array/iteration/logical | 188593 ns | 188295 ns | 1.00 |
| array/iteration/scalar | 63264 ns | 63035 ns | 1.00 |
| array/permutedims/2d | 49404 ns | 48726 ns | 1.01 |
| array/permutedims/3d | 51027 ns | 50552 ns | 1.01 |
| array/permutedims/4d | 50367 ns | 50019 ns | 1.01 |
| array/random/rand/Float32 | 11843 ns | 11718 ns | 1.01 |
| array/random/rand/Int64 | 23589 ns | 21972 ns | 1.07 |
| array/random/rand!/Float32 | 7898.5 ns | 7684.75 ns | 1.03 |
| array/random/rand!/Int64 | 19613 ns | 17879 ns | 1.10 |
| array/random/randn/Float32 | 33933 ns | 33563 ns | 1.01 |
| array/random/randn!/Float32 | 23887 ns | 24207 ns | 0.99 |
| array/reductions/mapreduce/Float32/1d | 31916 ns | 31970 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=1 | 37777 ns | 37560 ns | 1.01 |
| array/reductions/mapreduce/Float32/dims=1L | 50858 ns | 50748 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2 | 54915 ns | 54859 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2L | 66733 ns | 66258 ns | 1.01 |
| array/reductions/mapreduce/Int64/1d | 38705 ns | 38667 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=1 | 40340 ns | 40569 ns | 0.99 |
| array/reductions/mapreduce/Int64/dims=1L | 88805 ns | 88767 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2 | 57540 ns | 57046 ns | 1.01 |
| array/reductions/mapreduce/Int64/dims=2L | 83176 ns | 83249 ns | 1.00 |
| array/reductions/reduce/Float32/1d | 31908 ns | 31908 ns | 1 |
| array/reductions/reduce/Float32/dims=1 | 37850 ns | 37834 ns | 1.00 |
| array/reductions/reduce/Float32/dims=1L | 50960 ns | 50899 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2 | 55366 ns | 54907 ns | 1.01 |
| array/reductions/reduce/Float32/dims=2L | 68197 ns | 68257 ns | 1.00 |
| array/reductions/reduce/Int64/1d | 39430 ns | 38884 ns | 1.01 |
| array/reductions/reduce/Int64/dims=1 | 40342 ns | 40427 ns | 1.00 |
| array/reductions/reduce/Int64/dims=1L | 88629 ns | 88694 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2 | 57334 ns | 57232 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2L | 82733 ns | 82964 ns | 1.00 |
| array/reverse/1d | 16912 ns | 17058 ns | 0.99 |
| array/reverse/1dL | 69640 ns | 69721 ns | 1.00 |
| array/reverse/1dL_inplace | 67005 ns | 66952 ns | 1.00 |
| array/reverse/1d_inplace | 9601.666666666666 ns | 9893.666666666666 ns | 0.97 |
| array/reverse/2d | 19775 ns | 19541 ns | 1.01 |
| array/reverse/2dL | 73114 ns | 72830 ns | 1.00 |
| array/reverse/2dL_inplace | 66691 ns | 66851 ns | 1.00 |
| array/reverse/2d_inplace | 9972 ns | 10080 ns | 0.99 |
| array/sorting/1d | 2655411 ns | 2667782 ns | 1.00 |
| array/sorting/2d | 1039053 ns | 1038338 ns | 1.00 |
| array/sorting/by | 3193248 ns | 3193204 ns | 1.00 |
| cuda/synchronization/context/auto | 1042.9 ns | 1031.5 ns | 1.01 |
| cuda/synchronization/context/blocking | 813.9222222222222 ns | 791.2282608695652 ns | 1.03 |
| cuda/synchronization/context/nonblocking | 5797.833333333333 ns | 5839.5 ns | 0.99 |
| cuda/synchronization/stream/auto | 886 ns | 880.5576923076923 ns | 1.01 |
| cuda/synchronization/stream/blocking | 685.4183006535948 ns | 677.7908496732026 ns | 1.01 |
| cuda/synchronization/stream/nonblocking | 5644.166666666667 ns | 5561.333333333333 ns | 1.01 |
| integration/byval/reference | 147575 ns | 147498 ns | 1.00 |
| integration/byval/slices=1 | 149611 ns | 149705 ns | 1.00 |
| integration/byval/slices=2 | 292499 ns | 292393 ns | 1.00 |
| integration/byval/slices=3 | 435006 ns | 435188 ns | 1.00 |
| integration/cudadevrt | 104601 ns | 104517 ns | 1.00 |
| integration/volumerhs | 9305642 ns | 9306993 ns | 1.00 |
| kernel/indexing | 12363 ns | 12494 ns | 0.99 |
| kernel/indexing_checked | 13190 ns | 13141 ns | 1.00 |
| kernel/launch | 2010 ns | 2103.777777777778 ns | 0.96 |
| kernel/occupancy | 642.6686746987951 ns | 683.375 ns | 0.94 |
| kernel/rand | 14890 ns | 13705 ns | 1.09 |
| latency/import | 3889951490 ns | 3893015807 ns | 1.00 |
| latency/precompile | 4658193847 ns | 4656329210 ns | 1.00 |
| latency/ttfp | 5403708128 ns | 5439496872 ns | 0.99 |
This comment was automatically generated by workflow using github-action-benchmark.
Adds narrow-precision floats to the `cudaDataType` enum, based on `library_types.h` in CUDA 13.

`R_8F_UE4M3` is an alias of `R_8F_E4M3`. The two differ only semantically: the sign is meaningless in the NVFP4 block-scaling format, where every element has a sign anyway.

Implementations like DLFP8Types.jl and Microfloats.jl could define e.g. `Base.convert(::Type{CUDACore.cudaDataType}, ::Type{Float8_E5M2}) = CUDACore.R_8F_E5M2` in an extension, or there could be some `jltype_to_cudaDataType` function.

It doesn't work the other way though, since each `cudaDataType` can only map to one Julia type. I'm not sure whether this is a problem.
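
To make the mapping question concrete, here is a minimal, self-contained sketch that does not load CUDA.jl. Everything in it is a stand-in: `MockCudaDataType` replaces `CUDACore.cudaDataType` (with placeholder numeric values, not the ones from `library_types.h`), the three `primitive type` definitions play the role of types a package such as DLFP8Types.jl or Microfloats.jl might provide, and `mock_jltype` is a hypothetical reverse lookup. It only illustrates that several Julia types can convert to one enum value, while the reverse lookup has to pick a single type.

```julia
# Stand-in for CUDACore.cudaDataType. The numeric values are placeholders.
@enum MockCudaDataType::UInt32 begin
    R_8F_E4M3 = 0
    R_8F_E5M2 = 1
end

# One plain-Julia way to express an alias (assumption: not necessarily how the PR does it).
const R_8F_UE4M3 = R_8F_E4M3

# Hypothetical 8-bit float types, as two different packages might define them.
primitive type Float8_E4M3 <: AbstractFloat 8 end
primitive type Float8_E5M2 <: AbstractFloat 8 end
primitive type OtherFloat8_E5M2 <: AbstractFloat 8 end

# Forward direction: each package adds its own method, so many Julia types
# can map to the same enum value without conflict.
Base.convert(::Type{MockCudaDataType}, ::Type{Float8_E4M3}) = R_8F_E4M3
Base.convert(::Type{MockCudaDataType}, ::Type{Float8_E5M2}) = R_8F_E5M2
Base.convert(::Type{MockCudaDataType}, ::Type{OtherFloat8_E5M2}) = R_8F_E5M2

# Reverse direction (hypothetical helper): one enum value can return only one
# Julia type, so OtherFloat8_E5M2 cannot be recovered from R_8F_E5M2.
function mock_jltype(dt::MockCudaDataType)
    dt == R_8F_E4M3 && return Float8_E4M3
    dt == R_8F_E5M2 && return Float8_E5M2
    throw(ArgumentError("no Julia type registered for $dt"))
end
```

With these definitions, `convert(MockCudaDataType, Float8_E5M2)` and `convert(MockCudaDataType, OtherFloat8_E5M2)` both give `R_8F_E5M2`, whereas `mock_jltype(R_8F_E5M2)` can only ever return one of the two types.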