Skip to content

Add AMD GPU support via HIP/ROCm#293

Open
jeffdaily wants to merge 2 commits into
CVCUDA:mainfrom
jeffdaily:moat-port
Open

Add AMD GPU support via HIP/ROCm#293
jeffdaily wants to merge 2 commits into
CVCUDA:mainfrom
jeffdaily:moat-port

Conversation

@jeffdaily

Copy link
Copy Markdown

This adds the ability to build CV-CUDA's GPU code for AMD GPUs with HIP for ROCm, alongside the existing CUDA build. The port is additive and gated behind a new USE_HIP CMake option that defaults to OFF, so the NVIDIA build is unchanged when it is not enabled.

What this does

  • Compiles the operators' .cu kernels and the CUDA-runtime-using host code with hipcc/HIP when USE_HIP=ON, while leaving the same sources on the NVIDIA toolchain when it is OFF.
  • Adds a small ROCm/HIP compatibility layer under cmake/hip/ that maps the CUDA runtime and library spellings CV-CUDA uses (cudaStream_t, cudaMalloc, cuBLAS/cuSOLVER/cuRAND status and enum names, the cub:: namespace, cudaDataType, and so on) onto their HIP and ROCm equivalents (hipCUB, hipBLAS, hipSOLVER, rocRAND). The layer is force-included only on the HIP build's translation units and is never on the NVIDIA include path. CV-CUDA's own public cuda<Op>Submit/cuda<Op>Create API names are deliberately left untouched.
  • Handles the AMD GPU differences the operators exercise: the 64-bit wavefront mask for __shfl*_sync, two-phase-lookup qualification clang/HIP requires, the math-library substitutions used by OpFindHomography, and floating-point contraction settings so HIP results match the CUDA build and the CPU reference within tolerance.

Building for AMD GPUs

The HIP build reuses the existing ci/build.sh flow and CMake options:

ci/build.sh release build-rel -DUSE_HIP=1 -DCMAKE_HIP_ARCHITECTURES=gfx90a

CMAKE_HIP_ARCHITECTURES selects the target AMD architecture and defaults to gfx90a when unset; set it to your GPU, for example gfx1100 for RDNA3 desktop cards. No source or CMake edits are needed to retarget. The build produces the same library and test layout as the CUDA build. The documentation is updated in the same place the CUDA build is documented: the Sphinx installation guide gains a "Building for AMD GPUs (ROCm)" section and the USE_HIP option, and the README gains a brief AMD-support note linking to it.

Validation

The HIP build has been validated on Linux on the CDNA2 gfx90a (MI200 series) and RDNA3 gfx1100 architectures, building the library and running the CV-CUDA C++ and Python GPU test suites. The default CUDA build (USE_HIP=OFF) has also been compiled with nvcc to confirm the NVIDIA path is unaffected.

This support targets Linux ROCm. CV-CUDA's existing native-Windows limitation is unchanged, so the Windows AMD configurations are out of scope here.

This work was authored with assistance from Claude.

Add an AMD/ROCm HIP build of the CV-CUDA C++ operator library and NVCV core,
gated entirely behind a new USE_HIP CMake option so the NVIDIA/CUDA build is
unchanged when it is off. Authored with the assistance of Claude (Anthropic).

How to review, in order:
1. CMake gating (CMakeLists.txt, cmake/ConfigCUDA.cmake, cmake/ConfigBuildTree.cmake,
   the per-target and nvcv CMakeLists): under USE_HIP, enable_language(HIP), gate
   enable_language(CUDA)/find_package(CUDAToolkit) behind NOT USE_HIP, set HIP arch
   from CMAKE_HIP_ARCHITECTURES (default gfx90a only when unset, so other targets
   need no edit), and flip every .cu to LANGUAGE HIP via a top-scope add_library/
   add_executable override. LTO is disabled on HIP (the HIP link step does not
   finalize it). HIP float math is pinned to -ffp-contract=on so clang(HIP) matches
   nvcc's within-expression-only FMA contraction (--fmad=true) and the CPU gold
   references; clang's default -ffp-contract=fast contracts across statements and
   drifted results ~1 ULP (e.g. the bicubic weight chain), failing bit-exact gtests.
   The OSD/BndBox/BoxBlur trio is scoped out: they depend on cuOSD, a prebuilt
   CUDA-only static lib with no source (CvCudaLegacy.h, tests/.../CMakeLists.txt).
2. The compat layer (cmake/hip/): a forwarding shim dir on the HIP include path only,
   plus CvCudaHipCompat.h force-included on every HIP TU. It maps the cuda*/cub/
   cublas/cusolver/curand spellings the project uses to hip*, defines a 64-bit
   full-warp mask, and -- critically -- never defines __CUDA_ARCH__ so the
   SaturateCast inline-PTX table and the NVCV SIMD-intrinsic paths stay inert on
   their portable C++ fallbacks. It also wraps cudaDeviceGetAttribute to clamp the
   texture-PITCH-alignment query to the NVIDIA value (32) on HIP: AMD reports 256
   there, which would pad an NVCV tensor/image row stride (e.g. 640 -> 768) and
   silently change the in-memory layout that every NVCV consumer and the
   whole-strided-buffer gtest comparisons assume. No CV-CUDA tensor is bound to a HW
   texture object, so the larger HW pitch is unnecessary.
3. Device-math headers (cuda_tools/detail/MathWrappersImpl.hpp, math/LinAlg.hpp,
   detail/Metaprogramming.hpp, MathOps.hpp): extend the __CUDA_ARCH__ device-vs-host
   guards to also fire on __HIP_DEVICE_COMPILE__; HIP_vector_type lacks CUDA's
   aggregate operators, so MathOps gains HIP-only overloads. Two correctness fixes
   live here: (a) DeviceSqrtImpl routes 32-bit sqrt through the correctly rounded f64
   __dsqrt_rn on HIP because gfx90a's __fsqrt_rn is not always correctly rounded
   (sqrt(93606.0f) is 1 ULP off), whereas CUDA sqrt.rn.f32 and host std::sqrt are;
   (b) the HIP DeviceMin/MaxImpl ternaries are respelled to match the host
   std::min/std::max forms exactly so device==host on NaN/signed-zero (the morphology
   tests feed raw-byte floats through cuda::min/max on both sides).
4. Kernel fault-class fixes: warp-shuffle masks widened to 64-bit; OpLabel
   connected-components and threshold/threshold_var_shape Otsu scans pinned to
   explicit width-32 subgroups, with their warp-synchronous reduction tails replaced
   by fully __syncthreads-synchronized trees. StreamId uses hipStreamGetId with the
   pointer-value fallback; __ldg becomes a plain load; cuBLAS/cuSOLVER/cuRAND/CUB
   route through hipBLAS/hipSOLVER/hipRAND/hipCUB. OpPairwiseMatcher: its PointT
   register cache punned an RT(uint32) array as the element type via reinterpret_cast,
   a strict-aliasing violation clang/HIP elided at -O3 (every L2 distance stayed
   FLT_MAX -> empty crossCheck output); now a union on the HIP build, gated behind
   __HIP__ (a union with a const-qualified variant member would delete PointT's
   default constructor under nvcc, so the CUDA path keeps the original array
   spelling unchanged). Its two SortKeyValue calls reuse
   the cub block-collective TempStorage, which on a 64-thread (wave64) block lowers to
   a single-wavefront reduce with no syncing epilogue, so __syncthreads() now guards
   the reuse. OpHistogramEqVarShape zeroes its histogram scratch before the atomicAdd
   accumulation (the tensor path already did; the varshape path relied on
   fresh-cudaMalloc-reads-zero, which recycled hipMalloc breaks).
5. NVCV DefaultAllocator zero-fills new device allocations on HIP: hipMalloc returns
   recycled memory with stale contents, and the gtest suite fills a tensor's valid
   region then compares the whole strided buffer (including operator-untouched row
   padding) against a zero-initialized CPU reference, assuming device padding reads
   back as zero (as it effectively does on the NVIDIA setups).
6. Documentation and attribution: the Sphinx installation guide gains a "Building
   for AMD GPUs (ROCm)" section and the USE_HIP/CMAKE_HIP_ARCHITECTURES options, the
   README gains a brief AMD-support note linking to it, and the new HIP compat files
   carry an AMD copyright/author header alongside the existing NVIDIA one.

The HIP build has been validated on Linux on the CDNA2 gfx90a (MI200 series) and
RDNA3 gfx1100 architectures, building the library and running the CV-CUDA C++ GPU
test suites. On gfx90a the operator suite (cvcuda_test_system) passes with zero
failures; nvcv_test_cudatools_system passes 1116/1123, where the 7 residuals are a
row-stride-padding comparison artifact in the low-level Interpolation*VarShapeWrap
coordinate-utility tests (the GPU writes only valid pixels; verified
per-pixel-correct, and a synchronous pre-kernel memset makes all cases pass) plus
the char-vs-signed-char MakeType type-identity case.

The CUDA build (USE_HIP=OFF) was compiled with nvcc 12.8 to confirm the NVIDIA
path is unaffected; it builds with no errors.

Test Plan:
Build (gfx90a, ROCm 7.2.1):
```
cmake -S . -B build-hip -G Ninja -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx90a \
  -DCMAKE_HIP_COMPILER=/opt/rocm/llvm/bin/clang++ -DBUILD_PYTHON=OFF \
  -DBUILD_TESTS=ON -DBUILD_TESTS_CPP=ON -DBUILD_TESTS_PYTHON=OFF -DBUILD_BENCH=OFF \
  -DBUILD_DOCS=OFF -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++
cmake --build build-hip -j 16
```
GPU validation (one isolated GCD):
```
HIP_VISIBLE_DEVICES=1 build-hip/bin/cvcuda_test_system          # exit 0, 0 failures
HIP_VISIBLE_DEVICES=1 build-hip/bin/nvcv_test_cudatools_system  # 1116/1123
```
The documented HIP/ROCm build invokes find_package(hip) and the
companion hip* packages (hipBLAS, hipCUB, hipSOLVER, rocRAND). On a
clean ROCm container where ROCm is not on PATH, CMake cannot locate
those package config files and configuration fails early with
hip_DIR-NOTFOUND.

Neither pointing CMAKE_HIP_COMPILER at the absolute hipcc/clang++ path
nor exporting ROCM_PATH resolves this; only adding /opt/rocm to the
CMake search prefix does. Document passing
-DCMAKE_PREFIX_PATH=/opt/rocm through the ci/build.sh wrapper in the
ROCm build example, with a short note for the not-on-PATH case.

Documentation only; no code or CMake change.

Authored with the assistance of an AI coding agent.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant