Add AMD GPU support via HIP/ROCm by jeffdaily · Pull Request #293 · CVCUDA/CV-CUDA

jeffdaily · 2026-06-18T21:22:11Z

This adds the ability to build CV-CUDA's GPU code for AMD GPUs with HIP for ROCm, alongside the existing CUDA build. The port is additive and gated behind a new USE_HIP CMake option that defaults to OFF, so the NVIDIA build is unchanged when it is not enabled.

What this does

Compiles the operators' .cu kernels and the CUDA-runtime-using host code with hipcc/HIP when USE_HIP=ON, while leaving the same sources on the NVIDIA toolchain when it is OFF.
Adds a small ROCm/HIP compatibility layer under cmake/hip/ that maps the CUDA runtime and library spellings CV-CUDA uses (cudaStream_t, cudaMalloc, cuBLAS/cuSOLVER/cuRAND status and enum names, the cub:: namespace, cudaDataType, and so on) onto their HIP and ROCm equivalents (hipCUB, hipBLAS, hipSOLVER, rocRAND). The layer is force-included only on the HIP build's translation units and is never on the NVIDIA include path. CV-CUDA's own public cuda<Op>Submit/cuda<Op>Create API names are deliberately left untouched.
Handles the AMD GPU differences the operators exercise: the 64-bit wavefront mask for __shfl*_sync, two-phase-lookup qualification clang/HIP requires, the math-library substitutions used by OpFindHomography, and floating-point contraction settings so HIP results match the CUDA build and the CPU reference within tolerance.

Building for AMD GPUs

The HIP build reuses the existing ci/build.sh flow and CMake options:

ci/build.sh release build-rel -DUSE_HIP=1 -DCMAKE_HIP_ARCHITECTURES=gfx90a

CMAKE_HIP_ARCHITECTURES selects the target AMD architecture and defaults to gfx90a when unset; set it to your GPU, for example gfx1100 for RDNA3 desktop cards. No source or CMake edits are needed to retarget. The build produces the same library and test layout as the CUDA build. The documentation is updated in the same place the CUDA build is documented: the Sphinx installation guide gains a "Building for AMD GPUs (ROCm)" section and the USE_HIP option, and the README gains a brief AMD-support note linking to it.

Validation

The HIP build has been validated on Linux on the CDNA2 gfx90a (MI200 series) and RDNA3 gfx1100 architectures, building the library and running the CV-CUDA C++ and Python GPU test suites. The default CUDA build (USE_HIP=OFF) has also been compiled with nvcc to confirm the NVIDIA path is unaffected.

This support targets Linux ROCm. CV-CUDA's existing native-Windows limitation is unchanged, so the Windows AMD configurations are out of scope here.

This work was authored with assistance from Claude.

Add an AMD/ROCm HIP build of the CV-CUDA C++ operator library and NVCV core, gated entirely behind a new USE_HIP CMake option so the NVIDIA/CUDA build is unchanged when it is off. Authored with the assistance of Claude (Anthropic). How to review, in order: 1. CMake gating (CMakeLists.txt, cmake/ConfigCUDA.cmake, cmake/ConfigBuildTree.cmake, the per-target and nvcv CMakeLists): under USE_HIP, enable_language(HIP), gate enable_language(CUDA)/find_package(CUDAToolkit) behind NOT USE_HIP, set HIP arch from CMAKE_HIP_ARCHITECTURES (default gfx90a only when unset, so other targets need no edit), and flip every .cu to LANGUAGE HIP via a top-scope add_library/ add_executable override. LTO is disabled on HIP (the HIP link step does not finalize it). HIP float math is pinned to -ffp-contract=on so clang(HIP) matches nvcc's within-expression-only FMA contraction (--fmad=true) and the CPU gold references; clang's default -ffp-contract=fast contracts across statements and drifted results ~1 ULP (e.g. the bicubic weight chain), failing bit-exact gtests. The OSD/BndBox/BoxBlur trio is scoped out: they depend on cuOSD, a prebuilt CUDA-only static lib with no source (CvCudaLegacy.h, tests/.../CMakeLists.txt). 2. The compat layer (cmake/hip/): a forwarding shim dir on the HIP include path only, plus CvCudaHipCompat.h force-included on every HIP TU. It maps the cuda*/cub/ cublas/cusolver/curand spellings the project uses to hip*, defines a 64-bit full-warp mask, and -- critically -- never defines __CUDA_ARCH__ so the SaturateCast inline-PTX table and the NVCV SIMD-intrinsic paths stay inert on their portable C++ fallbacks. It also wraps cudaDeviceGetAttribute to clamp the texture-PITCH-alignment query to the NVIDIA value (32) on HIP: AMD reports 256 there, which would pad an NVCV tensor/image row stride (e.g. 640 -> 768) and silently change the in-memory layout that every NVCV consumer and the whole-strided-buffer gtest comparisons assume. No CV-CUDA tensor is bound to a HW texture object, so the larger HW pitch is unnecessary. 3. Device-math headers (cuda_tools/detail/MathWrappersImpl.hpp, math/LinAlg.hpp, detail/Metaprogramming.hpp, MathOps.hpp): extend the __CUDA_ARCH__ device-vs-host guards to also fire on __HIP_DEVICE_COMPILE__; HIP_vector_type lacks CUDA's aggregate operators, so MathOps gains HIP-only overloads. Two correctness fixes live here: (a) DeviceSqrtImpl routes 32-bit sqrt through the correctly rounded f64 __dsqrt_rn on HIP because gfx90a's __fsqrt_rn is not always correctly rounded (sqrt(93606.0f) is 1 ULP off), whereas CUDA sqrt.rn.f32 and host std::sqrt are; (b) the HIP DeviceMin/MaxImpl ternaries are respelled to match the host std::min/std::max forms exactly so device==host on NaN/signed-zero (the morphology tests feed raw-byte floats through cuda::min/max on both sides). 4. Kernel fault-class fixes: warp-shuffle masks widened to 64-bit; OpLabel connected-components and threshold/threshold_var_shape Otsu scans pinned to explicit width-32 subgroups, with their warp-synchronous reduction tails replaced by fully __syncthreads-synchronized trees. StreamId uses hipStreamGetId with the pointer-value fallback; __ldg becomes a plain load; cuBLAS/cuSOLVER/cuRAND/CUB route through hipBLAS/hipSOLVER/hipRAND/hipCUB. OpPairwiseMatcher: its PointT register cache punned an RT(uint32) array as the element type via reinterpret_cast, a strict-aliasing violation clang/HIP elided at -O3 (every L2 distance stayed FLT_MAX -> empty crossCheck output); now a union on the HIP build, gated behind __HIP__ (a union with a const-qualified variant member would delete PointT's default constructor under nvcc, so the CUDA path keeps the original array spelling unchanged). Its two SortKeyValue calls reuse the cub block-collective TempStorage, which on a 64-thread (wave64) block lowers to a single-wavefront reduce with no syncing epilogue, so __syncthreads() now guards the reuse. OpHistogramEqVarShape zeroes its histogram scratch before the atomicAdd accumulation (the tensor path already did; the varshape path relied on fresh-cudaMalloc-reads-zero, which recycled hipMalloc breaks). 5. NVCV DefaultAllocator zero-fills new device allocations on HIP: hipMalloc returns recycled memory with stale contents, and the gtest suite fills a tensor's valid region then compares the whole strided buffer (including operator-untouched row padding) against a zero-initialized CPU reference, assuming device padding reads back as zero (as it effectively does on the NVIDIA setups). 6. Documentation and attribution: the Sphinx installation guide gains a "Building for AMD GPUs (ROCm)" section and the USE_HIP/CMAKE_HIP_ARCHITECTURES options, the README gains a brief AMD-support note linking to it, and the new HIP compat files carry an AMD copyright/author header alongside the existing NVIDIA one. The HIP build has been validated on Linux on the CDNA2 gfx90a (MI200 series) and RDNA3 gfx1100 architectures, building the library and running the CV-CUDA C++ GPU test suites. On gfx90a the operator suite (cvcuda_test_system) passes with zero failures; nvcv_test_cudatools_system passes 1116/1123, where the 7 residuals are a row-stride-padding comparison artifact in the low-level Interpolation*VarShapeWrap coordinate-utility tests (the GPU writes only valid pixels; verified per-pixel-correct, and a synchronous pre-kernel memset makes all cases pass) plus the char-vs-signed-char MakeType type-identity case. The CUDA build (USE_HIP=OFF) was compiled with nvcc 12.8 to confirm the NVIDIA path is unaffected; it builds with no errors. Test Plan: Build (gfx90a, ROCm 7.2.1): ``` cmake -S . -B build-hip -G Ninja -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx90a \ -DCMAKE_HIP_COMPILER=/opt/rocm/llvm/bin/clang++ -DBUILD_PYTHON=OFF \ -DBUILD_TESTS=ON -DBUILD_TESTS_CPP=ON -DBUILD_TESTS_PYTHON=OFF -DBUILD_BENCH=OFF \ -DBUILD_DOCS=OFF -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ cmake --build build-hip -j 16 ``` GPU validation (one isolated GCD): ``` HIP_VISIBLE_DEVICES=1 build-hip/bin/cvcuda_test_system # exit 0, 0 failures HIP_VISIBLE_DEVICES=1 build-hip/bin/nvcv_test_cudatools_system # 1116/1123 ```

The documented HIP/ROCm build invokes find_package(hip) and the companion hip* packages (hipBLAS, hipCUB, hipSOLVER, rocRAND). On a clean ROCm container where ROCm is not on PATH, CMake cannot locate those package config files and configuration fails early with hip_DIR-NOTFOUND. Neither pointing CMAKE_HIP_COMPILER at the absolute hipcc/clang++ path nor exporting ROCM_PATH resolves this; only adding /opt/rocm to the CMake search prefix does. Document passing -DCMAKE_PREFIX_PATH=/opt/rocm through the ci/build.sh wrapper in the ROCm build example, with a short note for the not-on-PATH case. Documentation only; no code or CMake change. Authored with the assistance of an AI coding agent.

jeffdaily added 2 commits June 18, 2026 21:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add AMD GPU support via HIP/ROCm#293

Add AMD GPU support via HIP/ROCm#293
jeffdaily wants to merge 2 commits into
CVCUDA:mainfrom
jeffdaily:moat-port

jeffdaily commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jeffdaily commented Jun 18, 2026

What this does

Building for AMD GPUs

Validation

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant