Add AMD GPU support via HIP/ROCm#293
Open
jeffdaily wants to merge 2 commits into
Open
Conversation
Add an AMD/ROCm HIP build of the CV-CUDA C++ operator library and NVCV core, gated entirely behind a new USE_HIP CMake option so the NVIDIA/CUDA build is unchanged when it is off. Authored with the assistance of Claude (Anthropic). How to review, in order: 1. CMake gating (CMakeLists.txt, cmake/ConfigCUDA.cmake, cmake/ConfigBuildTree.cmake, the per-target and nvcv CMakeLists): under USE_HIP, enable_language(HIP), gate enable_language(CUDA)/find_package(CUDAToolkit) behind NOT USE_HIP, set HIP arch from CMAKE_HIP_ARCHITECTURES (default gfx90a only when unset, so other targets need no edit), and flip every .cu to LANGUAGE HIP via a top-scope add_library/ add_executable override. LTO is disabled on HIP (the HIP link step does not finalize it). HIP float math is pinned to -ffp-contract=on so clang(HIP) matches nvcc's within-expression-only FMA contraction (--fmad=true) and the CPU gold references; clang's default -ffp-contract=fast contracts across statements and drifted results ~1 ULP (e.g. the bicubic weight chain), failing bit-exact gtests. The OSD/BndBox/BoxBlur trio is scoped out: they depend on cuOSD, a prebuilt CUDA-only static lib with no source (CvCudaLegacy.h, tests/.../CMakeLists.txt). 2. The compat layer (cmake/hip/): a forwarding shim dir on the HIP include path only, plus CvCudaHipCompat.h force-included on every HIP TU. It maps the cuda*/cub/ cublas/cusolver/curand spellings the project uses to hip*, defines a 64-bit full-warp mask, and -- critically -- never defines __CUDA_ARCH__ so the SaturateCast inline-PTX table and the NVCV SIMD-intrinsic paths stay inert on their portable C++ fallbacks. It also wraps cudaDeviceGetAttribute to clamp the texture-PITCH-alignment query to the NVIDIA value (32) on HIP: AMD reports 256 there, which would pad an NVCV tensor/image row stride (e.g. 640 -> 768) and silently change the in-memory layout that every NVCV consumer and the whole-strided-buffer gtest comparisons assume. No CV-CUDA tensor is bound to a HW texture object, so the larger HW pitch is unnecessary. 3. Device-math headers (cuda_tools/detail/MathWrappersImpl.hpp, math/LinAlg.hpp, detail/Metaprogramming.hpp, MathOps.hpp): extend the __CUDA_ARCH__ device-vs-host guards to also fire on __HIP_DEVICE_COMPILE__; HIP_vector_type lacks CUDA's aggregate operators, so MathOps gains HIP-only overloads. Two correctness fixes live here: (a) DeviceSqrtImpl routes 32-bit sqrt through the correctly rounded f64 __dsqrt_rn on HIP because gfx90a's __fsqrt_rn is not always correctly rounded (sqrt(93606.0f) is 1 ULP off), whereas CUDA sqrt.rn.f32 and host std::sqrt are; (b) the HIP DeviceMin/MaxImpl ternaries are respelled to match the host std::min/std::max forms exactly so device==host on NaN/signed-zero (the morphology tests feed raw-byte floats through cuda::min/max on both sides). 4. Kernel fault-class fixes: warp-shuffle masks widened to 64-bit; OpLabel connected-components and threshold/threshold_var_shape Otsu scans pinned to explicit width-32 subgroups, with their warp-synchronous reduction tails replaced by fully __syncthreads-synchronized trees. StreamId uses hipStreamGetId with the pointer-value fallback; __ldg becomes a plain load; cuBLAS/cuSOLVER/cuRAND/CUB route through hipBLAS/hipSOLVER/hipRAND/hipCUB. OpPairwiseMatcher: its PointT register cache punned an RT(uint32) array as the element type via reinterpret_cast, a strict-aliasing violation clang/HIP elided at -O3 (every L2 distance stayed FLT_MAX -> empty crossCheck output); now a union on the HIP build, gated behind __HIP__ (a union with a const-qualified variant member would delete PointT's default constructor under nvcc, so the CUDA path keeps the original array spelling unchanged). Its two SortKeyValue calls reuse the cub block-collective TempStorage, which on a 64-thread (wave64) block lowers to a single-wavefront reduce with no syncing epilogue, so __syncthreads() now guards the reuse. OpHistogramEqVarShape zeroes its histogram scratch before the atomicAdd accumulation (the tensor path already did; the varshape path relied on fresh-cudaMalloc-reads-zero, which recycled hipMalloc breaks). 5. NVCV DefaultAllocator zero-fills new device allocations on HIP: hipMalloc returns recycled memory with stale contents, and the gtest suite fills a tensor's valid region then compares the whole strided buffer (including operator-untouched row padding) against a zero-initialized CPU reference, assuming device padding reads back as zero (as it effectively does on the NVIDIA setups). 6. Documentation and attribution: the Sphinx installation guide gains a "Building for AMD GPUs (ROCm)" section and the USE_HIP/CMAKE_HIP_ARCHITECTURES options, the README gains a brief AMD-support note linking to it, and the new HIP compat files carry an AMD copyright/author header alongside the existing NVIDIA one. The HIP build has been validated on Linux on the CDNA2 gfx90a (MI200 series) and RDNA3 gfx1100 architectures, building the library and running the CV-CUDA C++ GPU test suites. On gfx90a the operator suite (cvcuda_test_system) passes with zero failures; nvcv_test_cudatools_system passes 1116/1123, where the 7 residuals are a row-stride-padding comparison artifact in the low-level Interpolation*VarShapeWrap coordinate-utility tests (the GPU writes only valid pixels; verified per-pixel-correct, and a synchronous pre-kernel memset makes all cases pass) plus the char-vs-signed-char MakeType type-identity case. The CUDA build (USE_HIP=OFF) was compiled with nvcc 12.8 to confirm the NVIDIA path is unaffected; it builds with no errors. Test Plan: Build (gfx90a, ROCm 7.2.1): ``` cmake -S . -B build-hip -G Ninja -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx90a \ -DCMAKE_HIP_COMPILER=/opt/rocm/llvm/bin/clang++ -DBUILD_PYTHON=OFF \ -DBUILD_TESTS=ON -DBUILD_TESTS_CPP=ON -DBUILD_TESTS_PYTHON=OFF -DBUILD_BENCH=OFF \ -DBUILD_DOCS=OFF -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ cmake --build build-hip -j 16 ``` GPU validation (one isolated GCD): ``` HIP_VISIBLE_DEVICES=1 build-hip/bin/cvcuda_test_system # exit 0, 0 failures HIP_VISIBLE_DEVICES=1 build-hip/bin/nvcv_test_cudatools_system # 1116/1123 ```
The documented HIP/ROCm build invokes find_package(hip) and the companion hip* packages (hipBLAS, hipCUB, hipSOLVER, rocRAND). On a clean ROCm container where ROCm is not on PATH, CMake cannot locate those package config files and configuration fails early with hip_DIR-NOTFOUND. Neither pointing CMAKE_HIP_COMPILER at the absolute hipcc/clang++ path nor exporting ROCM_PATH resolves this; only adding /opt/rocm to the CMake search prefix does. Document passing -DCMAKE_PREFIX_PATH=/opt/rocm through the ci/build.sh wrapper in the ROCm build example, with a short note for the not-on-PATH case. Documentation only; no code or CMake change. Authored with the assistance of an AI coding agent.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This adds the ability to build CV-CUDA's GPU code for AMD GPUs with HIP for ROCm, alongside the existing CUDA build. The port is additive and gated behind a new
USE_HIPCMake option that defaults toOFF, so the NVIDIA build is unchanged when it is not enabled.What this does
.cukernels and the CUDA-runtime-using host code withhipcc/HIP whenUSE_HIP=ON, while leaving the same sources on the NVIDIA toolchain when it isOFF.cmake/hip/that maps the CUDA runtime and library spellings CV-CUDA uses (cudaStream_t,cudaMalloc, cuBLAS/cuSOLVER/cuRAND status and enum names, thecub::namespace,cudaDataType, and so on) onto their HIP and ROCm equivalents (hipCUB, hipBLAS, hipSOLVER, rocRAND). The layer is force-included only on the HIP build's translation units and is never on the NVIDIA include path. CV-CUDA's own publiccuda<Op>Submit/cuda<Op>CreateAPI names are deliberately left untouched.__shfl*_sync, two-phase-lookup qualification clang/HIP requires, the math-library substitutions used byOpFindHomography, and floating-point contraction settings so HIP results match the CUDA build and the CPU reference within tolerance.Building for AMD GPUs
The HIP build reuses the existing
ci/build.shflow and CMake options:CMAKE_HIP_ARCHITECTURESselects the target AMD architecture and defaults togfx90awhen unset; set it to your GPU, for examplegfx1100for RDNA3 desktop cards. No source or CMake edits are needed to retarget. The build produces the same library and test layout as the CUDA build. The documentation is updated in the same place the CUDA build is documented: the Sphinx installation guide gains a "Building for AMD GPUs (ROCm)" section and theUSE_HIPoption, and the README gains a brief AMD-support note linking to it.Validation
The HIP build has been validated on Linux on the CDNA2
gfx90a(MI200 series) and RDNA3gfx1100architectures, building the library and running the CV-CUDA C++ and Python GPU test suites. The default CUDA build (USE_HIP=OFF) has also been compiled with nvcc to confirm the NVIDIA path is unaffected.This support targets Linux ROCm. CV-CUDA's existing native-Windows limitation is unchanged, so the Windows AMD configurations are out of scope here.
This work was authored with assistance from Claude.