30 commits
1f2ae46
Add ROCm backend
Looong01 Jul 28, 2025
b455530
Fix bugs
Looong01 Jul 28, 2025
8b30cb9
Update
Looong01 Jul 31, 2025
570ced0
Fix bugs
Looong01 Aug 1, 2025
abb6124
Fix bugs
Looong01 Aug 1, 2025
bfb292e
All bug fixed
Looong01 Aug 1, 2025
4606424
Update
Looong01 Aug 1, 2025
1e8ea78
test new method
Looong01 Aug 1, 2025
c1a09cf
Update
Looong01 Aug 2, 2025
0957b88
Test finished
Looong01 Aug 2, 2025
c70d841
Update docks
Looong01 Aug 2, 2025
1d05ca8
Update gitignore
Looong01 Aug 2, 2025
9d4662b
Update new method
Looong01 Aug 2, 2025
d40bd50
Optimize performance
Looong01 Aug 2, 2025
158d24d
Update new Convlayer method
Looong01 Aug 13, 2025
ec32eb1
Merge branch 'master' of https://github.com/Looong01/KataGo-ROCm
Looong01 Aug 13, 2025
0bfe0a1
Add new compile target
Looong01 Oct 4, 2025
f5fbb33
Merge branch 'lightvector:master' into master
Looong01 Nov 8, 2025
26d8c5b
Add ROCm for Windows support
Looong01 Nov 8, 2025
555d2f1
Merge branch 'lightvector:master' into master
Looong01 Dec 1, 2025
dbc7cfa
Merge branch 'lightvector:master' into master
Looong01 Feb 22, 2026
c511c33
Add MIGraphX support
Looong01 Feb 27, 2026
d91e9f5
Merge pull request #3 from lightvector/master
Looong01 Mar 16, 2026
337b575
Merge branch 'lightvector:master' into MIGraphX
Looong01 Apr 19, 2026
00cb688
Fix bugs
Looong01 Apr 19, 2026
b1da0e0
Optimize performance
Looong01 Apr 19, 2026
8a133a0
Add MIGraphX support
Looong01 Apr 19, 2026
8c4eaf2
Merge pull request #4 from Looong01/MIGraphX
Looong01 Apr 19, 2026
e480af5
Add Windows ROCm support
Looong01 Apr 19, 2026
a36189e
Update cpp/README.md
Looong01 Apr 20, 2026
64 changes: 63 additions & 1 deletion Compiling.md
@@ -33,6 +33,8 @@ As also mentioned in the instructions below but repeated here for visibility, if
* If using the OpenCL backend, a modern GPU that supports OpenCL 1.2 or greater, or else something like [this](https://software.intel.com/en-us/opencl-sdk) for CPU. But if using CPU, Eigen should be better.
* If using the CUDA backend, CUDA 11 or later and a compatible version of CUDNN based on your CUDA version (https://developer.nvidia.com/cuda-toolkit) (https://developer.nvidia.com/cudnn) and a GPU capable of supporting them.
* If using the TensorRT backend, in addition to a compatible CUDA Toolkit (https://developer.nvidia.com/cuda-toolkit), you also need TensorRT (https://developer.nvidia.com/tensorrt) that is at least version 8.5.
* If using the ROCm backend, ROCm 6.4 or later and a GPU capable of supporting it. See the installation guide (https://rocm.docs.amd.com/projects/install-on-linux/en/latest/), and install all of the available ROCm developer packages rather than just the ROCm runtime packages.
* If using the MIGraphX backend, ROCm 7.0 or later with MIGraphX library installed (e.g. `sudo apt install migraphx` via the ROCm package repo).
* If using the Eigen backend, Eigen3. With Debian packages, (i.e. apt or apt-get), this should be `libeigen3-dev`.
* zlib, libzip. With Debian packages (i.e. apt or apt-get), these should be `zlib1g-dev`, `libzip-dev`.
* If you want to do self-play training and research, probably Google perftools `libgoogle-perftools-dev` for TCMalloc or some other better malloc implementation. For unknown reasons, the allocation pattern in self-play with large numbers of threads and parallel games causes a lot of memory fragmentation under glibc malloc that will eventually run your machine out of memory, but better mallocs handle it fine.
@@ -41,7 +43,7 @@ As also mentioned in the instructions below but repeated here for visibility, if
* `git clone https://github.com/lightvector/KataGo.git`
* Compile using CMake and make in the cpp directory:
* `cd KataGo/cpp`
* `cmake . -DUSE_BACKEND=OPENCL` or `cmake . -DUSE_BACKEND=CUDA` or `cmake . -DUSE_BACKEND=TENSORRT` or `cmake . -DUSE_BACKEND=EIGEN` depending on which backend you want.
* `cmake . -DUSE_BACKEND=OPENCL` or `cmake . -DUSE_BACKEND=CUDA` or `cmake . -DUSE_BACKEND=TENSORRT` or `cmake . -DUSE_BACKEND=EIGEN` or `cmake . -DUSE_BACKEND=ROCM` or `cmake . -DUSE_BACKEND=MIGRAPHX` depending on which backend you want.
* Specify also `-DUSE_TCMALLOC=1` if using TCMalloc.
* Compiling will also call git commands to embed the git hash into the compiled executable, specify also `-DNO_GIT_REVISION=1` to disable it if this is causing issues for you.
* Specify `-DUSE_AVX2=1` to also compile Eigen with AVX2 and FMA support, which will make it incompatible with old CPUs but much faster. (If you want to go further, you can also add `-DCMAKE_CXX_FLAGS='-march=native'` which will specialize to precisely your machine's CPU, but the exe might not run on other machines at all).
@@ -54,6 +56,30 @@ As also mentioned in the instructions below but repeated here for visibility, if
* You will probably want to edit `configs/gtp_example.cfg` (see "Tuning for Performance" above).
* If using OpenCL, you will want to verify that KataGo is picking up the correct device when you run it (e.g. some systems may have both an Intel CPU OpenCL and GPU OpenCL, if KataGo appears to pick the wrong one, you can correct this by specifying `openclGpuToUse` in `configs/gtp_example.cfg`).

* **ROCm backend (Linux) — additional notes:**
* Install ROCm following the [official guide](https://rocm.docs.amd.com/en/7.12.0-preview/install/rocm.html). Install the full developer stack (not just runtime): `sudo apt install rocm-dev miopen-hip rocblas hipblas`.
* Build:
```
cd KataGo/cpp
mkdir build && cd build
cmake .. -DUSE_BACKEND=ROCM -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
```
* GPU architecture is auto-detected via `amdgpu-arch`. If auto-detection fails, specify manually: `-DCMAKE_HIP_ARCHITECTURES=gfx1100` (replace with your GPU's gfx target).
* On first run, MIOpen will search for optimal convolution algorithms for your specific GPU and network size. This may take up to a minute and results are cached in `~/.config/miopen/` for subsequent runs.
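* If you script the build, the auto-detection fallback described above can be sketched as a tiny helper (an illustrative sketch only, not part of KataGo; `amdgpu-arch` ships with ROCm's LLVM tools, and `gfx1100` here is just an example target):
```shell
# Sketch: pick the gfx target for -DCMAKE_HIP_ARCHITECTURES, using
# amdgpu-arch when available and falling back to a caller-supplied
# default when detection fails or the tool is not installed.
pick_gfx_arch() {
  local detected
  detected="$(amdgpu-arch 2>/dev/null | head -n 1)"
  # Empty detection result means amdgpu-arch failed; use the fallback.
  echo "${detected:-$1}"
}
# Usage:
#   cmake .. -DUSE_BACKEND=ROCM -DCMAKE_HIP_ARCHITECTURES="$(pick_gfx_arch gfx1100)"
```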

* **MIGraphX backend (Linux) — additional notes:**
* Requires ROCm 7.0+ with MIGraphX installed. Install via: `sudo apt install migraphx`.
* Build:
```
cd KataGo/cpp
mkdir build && cd build
cmake .. -DUSE_BACKEND=MIGRAPHX -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
```
* On first launch, MIGraphX compiles and caches GPU programs for each batch size (4, 8, 16, 24, 32, 40, 64 up to `maxBatchSize`) in `~/.katago/migraphxcache/`. This initial compilation may take several minutes but subsequent launches load from cache instantly.
* MIGraphX may offer better GPU utilization and throughput than the ROCm/MIOpen backend on some workloads due to whole-graph operator fusion.
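* The batch-size bucketing described above implies that a request is served by the smallest precompiled size that covers it. As a sketch of that rounding rule (an assumption about the behavior for illustration, not KataGo's actual code):
```shell
# Sketch (assumption, not KataGo's actual code): round a requested batch
# size up to the smallest precompiled MIGraphX batch size that can serve it.
cached_batch_for() {
  local want="$1" b
  for b in 4 8 16 24 32 40 64; do
    if [ "$b" -ge "$want" ]; then
      echo "$b"
      return 0
    fi
  done
  # Requests above 64 are assumed to run at the maximum compiled size.
  echo 64
}
```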

## Windows
* TLDR:
* Building from source on Windows is actually a bit tricky; depending on what version you're building, there's not necessarily a super-fast way.
@@ -117,6 +143,42 @@ As also mentioned in the instructions below but repeated here for visibility, if
* You will probably want to edit `configs/gtp_example.cfg` (see "Tuning for Performance" above).
* If using OpenCL, you will want to verify that KataGo is picking up the correct device (e.g. some systems may have both an Intel CPU OpenCL and GPU OpenCL, if KataGo appears to pick the wrong one, you can correct this by specifying `openclGpuToUse` in `configs/gtp_example.cfg`).

* **ROCm backend (Windows) — building via AMD TheRock:**
* The ROCm (MIOpen) backend supports Windows via [AMD TheRock](https://github.com/ROCm/TheRock) (tested with TheRock 7.12.0 / ROCm 7.2.0, RX 7900 XTX / gfx1100).
* **Prerequisites:**
* Install ROCm following the [official guide](https://rocm.docs.amd.com/en/7.12.0-preview/install/rocm.html). For Windows, download [AMD TheRock](https://github.com/ROCm/TheRock) and extract to e.g. `C:\TheRock\build`.
* Install **Visual Studio 2026 Build Tools** or **Visual Studio 2026 Community** with the "Desktop development with C++" workload. This provides the MSVC toolchain and Windows SDK required by the HIP compiler.
* Install [Ninja](https://ninja-build.org) build tool: `winget install Ninja-build.Ninja`.
* Set the following **system environment variables** (via System Properties → Advanced → Environment Variables):
```
HIP_PATH=C:/TheRock/build
HIP_PLATFORM=amd
HIP_DEVICE_LIB_PATH=C:/TheRock/build/lib/llvm/amdgcn/bitcode
LLVM_PATH=C:/TheRock/build/lib/llvm
```
* Add to system `PATH`:
```
C:\TheRock\build\bin
C:\TheRock\build\lib\llvm\bin
```
* Reboot after setting environment variables so they take effect system-wide.
* **Build** (from a terminal with the above env vars active):
```
cd KataGo/cpp
mkdir build
cd build
cmake .. -G Ninja -DUSE_BACKEND=ROCM -DCMAKE_BUILD_TYPE=Release
ninja -j $env:NUMBER_OF_PROCESSORS
```
No additional `-D` flags are needed — `CMakeLists.txt` automatically detects the HIP/clang compiler, GPU architecture (via `amdgpu-arch.exe`), Windows SDK include paths, and zlib from `HIP_PATH`.
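* To fail fast on a misconfigured environment before running cmake, a small helper can verify that each variable above is set and points at an existing directory (an illustrative sketch, not part of the build; the variable names are the ones listed above):
```shell
# Sketch (illustrative helper, not part of the build): verify that each
# named environment variable is set and points at an existing directory.
check_env_dirs() {
  local v dir missing=""
  for v in "$@"; do
    eval "dir=\${$v}"
    # Record the variable name if it is unset or not a directory.
    [ -n "$dir" ] && [ -d "$dir" ] || missing="${missing:+$missing }$v"
  done
  echo "${missing:-ok}"
}
# Usage: check_env_dirs HIP_PATH HIP_DEVICE_LIB_PATH LLVM_PATH
```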
* **Runtime DLL setup** — copy the following next to `katago.exe`:
* `amdhip64_7.dll` — **required**: must be copied from `C:\TheRock\build\bin\` to override the incompatible version that AMD GPU drivers install into `C:\Windows\System32\`.
* All other ROCm DLLs (`MIOpen.dll`, `hipblas.dll`, `rocblas.dll`, `hiprtc0702.dll`, `amd_comgr0702.dll`, `libhipblaslt.dll`, `amdocl64.dll`) are found automatically from `C:\TheRock\build\bin\` via `PATH` — no need to copy them.
* If `rocblas.dll` is copied, also copy the `rocblas\library\` directory alongside it (rocBLAS looks for its kernel files relative to its own DLL location).
* MSVC runtime DLLs (`msvcp140.dll`, `vcruntime140.dll`, etc.) are in `C:\Windows\System32\` on any machine with the Visual C++ Redistributable installed.
* **First-run note:** MIOpen will search for optimal convolution algorithms on the first run. This may take 45+ seconds per network configuration and results are cached in `%USERPROFILE%\.miopen\` for subsequent runs. Do not terminate the process during this initial tuning.
* **Performance note:** GPU utilization on Windows may be somewhat lower than on Linux due to the Windows Driver Model (WDDM) adding overhead to GPU kernel submissions. This is a known limitation of ROCm on Windows.
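* The DLL checklist above reduces to two rules: `amdhip64_7.dll` must always sit next to `katago.exe`, and `rocblas.dll`, if copied, needs its `rocblas\library\` directory alongside it. A sketch of that check (an illustrative helper, not part of KataGo; POSIX-shell flavored, with `rocblas/library` standing in for the directory):
```shell
# Sketch (illustrative helper, not part of KataGo): given the names placed
# next to katago.exe, report what the checklist above says is still missing.
check_dll_layout() {
  local copied=" $* " missing=""
  # amdhip64_7.dll must always be copied to override the driver's version.
  case "$copied" in
    *" amdhip64_7.dll "*) ;;
    *) missing="amdhip64_7.dll" ;;
  esac
  # rocblas.dll is only usable with its kernel library directory alongside.
  case "$copied" in
    *" rocblas.dll "*)
      case "$copied" in
        *" rocblas/library "*) ;;
        *) missing="${missing:+$missing }rocblas/library" ;;
      esac ;;
  esac
  echo "${missing:-ok}"
}
```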

## MacOS
* TLDR:
57 changes: 33 additions & 24 deletions README.md
@@ -1,27 +1,30 @@
# KataGo

* [Overview](#overview)
* [Training History and Research](#training-history-and-research)
* [Where To Download Stuff](#where-to-download-stuff)
* [Setting Up and Running KataGo](#setting-up-and-running-katago)
* [GUIs](#guis)
* [Windows and Linux](#windows-and-linux)
* [MacOS](#macos)
* [OpenCL vs CUDA vs TensorRT vs Eigen](#opencl-vs-cuda-vs-tensorrt-vs-eigen)
* [How To Use](#how-to-use)
* [Tuning for Performance](#tuning-for-performance)
* [Common Questions and Issues](#common-questions-and-issues)
* [Issues with specific GPUs or GPU drivers](#issues-with-specific-gpus-or-gpu-drivers)
* [Common Problems](#common-problems)
* [Other Questions](#other-questions)
* [Features for Developers](#features-for-developers)
* [GTP Extensions](#gtp-extensions)
* [Analysis Engine](#analysis-engine)
* [Compiling KataGo](#compiling-katago)
* [Source Code Overview](#source-code-overview)
* [Selfplay Training](#selfplay-training)
* [Contributors](#contributors)
* [License](#license)
- [KataGo](#katago)
- [Overview](#overview)
- [Training History and Research and Docs](#training-history-and-research-and-docs)
- [Where To Download Stuff](#where-to-download-stuff)
- [Setting Up and Running KataGo](#setting-up-and-running-katago)
- [GUIs](#guis)
- [Windows and Linux](#windows-and-linux)
- [MacOS](#macos)
- [OpenCL vs CUDA vs TensorRT vs ROCm vs MIGraphX vs Eigen](#opencl-vs-cuda-vs-tensorrt-vs-rocm-vs-migraphx-vs-eigen)
- [How To Use](#how-to-use)
- [Human-style Play and Analysis](#human-style-play-and-analysis)
- [Other Commands:](#other-commands)
- [Tuning for Performance](#tuning-for-performance)
- [Common Questions and Issues](#common-questions-and-issues)
- [Issues with specific GPUs or GPU drivers](#issues-with-specific-gpus-or-gpu-drivers)
- [Common Problems](#common-problems)
- [Other Questions](#other-questions)
- [Features for Developers](#features-for-developers)
- [GTP Extensions:](#gtp-extensions)
- [Analysis Engine:](#analysis-engine)
- [Compiling KataGo](#compiling-katago)
- [Source Code Overview:](#source-code-overview)
- [Selfplay Training:](#selfplay-training)
- [Contributors](#contributors)
- [License](#license)

## Overview

@@ -84,20 +87,24 @@ The community also provides KataGo packages for [Homebrew](https://brew.sh) on M

Use `brew install katago`. The latest config files and networks are installed in KataGo's `share` directory. Find them via `brew list --verbose katago`. A basic way to run katago will be `katago gtp -config $(brew list --verbose katago | grep 'gtp.*\.cfg') -model $(brew list --verbose katago | grep .gz | head -1)`. You should choose the Network according to the release notes here and customize the provided example config as with every other way of installing KataGo.

### OpenCL vs CUDA vs TensorRT vs Eigen
KataGo has four backends, OpenCL (GPU), CUDA (GPU), TensorRT (GPU), and Eigen (CPU).
### OpenCL vs CUDA vs TensorRT vs ROCm vs MIGraphX vs Eigen
KataGo has six backends: OpenCL (GPU), CUDA (GPU), TensorRT (GPU), ROCm (GPU), MIGraphX (GPU), and Eigen (CPU).

The quick summary is:
* **To easily get something working, try OpenCL if you have any good or decent GPU.**
* **For often much better performance on NVIDIA GPUs, try TensorRT**, but you may need to install TensorRT from Nvidia.
* Use Eigen with AVX2 if you don't have a GPU or if your GPU is too old/weak to work with OpenCL, and you just want a plain CPU KataGo.
* Use Eigen without AVX2 if your CPU is old or on a low-end device that doesn't support AVX2.
* The CUDA backend can work for NVIDIA GPUs with CUDA+CUDNN installed but is likely worse than TensorRT.
* The ROCm backend can work for AMD GPUs with ROCm+MIOpen installed.
* The MIGraphX backend is an alternative AMD GPU backend using MIGraphX instead of MIOpen.
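The rules of thumb above amount to a small decision function. As a rough illustration only (real choices also depend on drivers, installed toolkits, and the caveats detailed below):

```shell
# Sketch (illustrative only): choose a KataGo backend from the rules of
# thumb above, given a GPU vendor ("nvidia", "amd", "none") and whether
# the CPU supports AVX2.
pick_backend() {
  case "$1" in
    nvidia) echo TENSORRT ;;  # usually fastest on NVIDIA when TensorRT is installed
    amd)    echo OPENCL ;;    # easiest to get working; ROCM/MIGRAPHX may be faster
    none)   if [ "$2" = "avx2" ]; then echo "EIGEN (AVX2)"; else echo EIGEN; fi ;;
    *)      echo OPENCL ;;    # most general fallback for other GPUs
  esac
}
```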

More in detail:
* OpenCL is a general GPU backend that should be able to run with any GPUs or accelerators that support [OpenCL](https://en.wikipedia.org/wiki/OpenCL), including NVIDIA GPUs, AMD GPUs, as well as CPU-based OpenCL implementations or things like Intel Integrated Graphics. This is the most general GPU version of KataGo and doesn't require a complicated install like CUDA does, so it is most likely to work out of the box as long as you have a fairly modern GPU. **However, it also needs some time to tune itself when run for the very first time.** For many systems, this will take 5-30 seconds, but on a few older/slower systems, it may take many minutes or longer. Also, the quality of OpenCL implementations is sometimes inconsistent, particularly for Intel Integrated Graphics and for AMD GPUs more than several years old, as well as for specific buggy newer AMD GPUs, so it might not work on very old machines; see also [Issues with specific GPUs or GPU drivers](#issues-with-specific-gpus-or-gpu-drivers).
* CUDA is a GPU backend specific to NVIDIA GPUs (it will not work with AMD or Intel or any other GPUs) and requires installing [CUDA](https://developer.nvidia.com/cuda-zone) and [CUDNN](https://developer.nvidia.com/cudnn) and a modern NVIDIA GPU. On most GPUs, the OpenCL implementation will actually beat NVIDIA's own CUDA/CUDNN at performance. The exception is for top-end NVIDIA GPUs that support FP16 and tensor cores, in which case sometimes one is better and sometimes the other is better.
* TensorRT is similar to CUDA, but only uses NVIDIA's TensorRT framework to run the neural network with more optimized kernels. For modern NVIDIA GPUs, it should work whenever CUDA does and will usually be faster than CUDA or any other backend.
* ROCm is a GPU backend specific to AMD GPUs (it will not work with NVIDIA or Intel or any other GPUs) and requires installing [ROCm](https://rocm.docs.amd.com) and [MIOpen](https://rocm.docs.amd.com/projects/MIOpen) and a modern AMD GPU. It supports both **Linux** (via official ROCm packages, ROCm 6.4+) and **Windows** (via [AMD TheRock](https://github.com/ROCm/TheRock) builds). On most GPUs, the OpenCL implementation will actually beat AMD's own ROCm/MIOpen at performance. The exception is top-end AMD GPUs with fast FP16 support, where sometimes one is better and sometimes the other.
* MIGraphX is an alternative GPU backend for AMD GPUs using AMD's MIGraphX graph-compiler framework instead of MIOpen. It compiles the entire neural network into a single fused GPU program, which can offer better throughput than ROCm/MIOpen on some workloads. Requires ROCm 7.0+ with MIGraphX installed. Currently supports Linux only.
* Eigen is a *CPU* backend that should work widely *without* needing a GPU or fancy drivers. Use this if you don't have a good GPU or really any GPU at all. It will be quite significantly slower than OpenCL or CUDA, but on a good CPU can still often get 10 to 20 playouts per second if using the smaller (15 or 20) block neural nets. Eigen can also be compiled with AVX2 and FMA support, which can provide a big performance boost for Intel and AMD CPUs from the last few years. However, it will not run at all on older CPUs (and possibly even some recent but low-power modern CPUs) that don't support these fancy vector instructions.

For **any** implementation, it's recommended that you also tune the number of threads used if you care about optimal performance, as it can make a factor of 2-3 difference in the speed. See "Tuning for Performance" below. However, if you mostly just want to get it working, then the default untuned settings should also be still reasonable.
Expand Down Expand Up @@ -175,6 +182,8 @@ This section summarizes a number of common questions and issues when running Kat
#### Issues with specific GPUs or GPU drivers
If you are observing any crashes in KataGo while attempting to run the benchmark or the program itself, and you have one of the below GPUs, then this is likely the reason.

* **AMD GPUs** - If you choose to use the ROCm backend, you need a GPU on the official [System requirements list](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html) (at least AMD Radeon RX 7700 XT). The ROCm backend supports both Linux (via official ROCm packages) and Windows (via [AMD TheRock](https://github.com/ROCm/TheRock) builds). On Linux, install the full ROCm developer stack. On Windows, see the ROCm Windows build instructions in [Compiling.md](Compiling.md). The MIGraphX backend also requires ROCm 7.0+ with MIGraphX installed and currently supports Linux only.

* **AMD Radeon RX 5700** - AMD's drivers for OpenCL for this GPU have been buggy ever since this GPU was released, and as of May 2020 AMD has still never released a fix. If you are using this GPU, you will just not be able to run KataGo (Leela Zero and other Go engines will probably fail too) and will probably also obtain incorrect calculations or crash if doing anything else scientific or mathematical that uses OpenCL. See for example these reddit threads: [[1]](https://www.reddit.com/r/Amd/comments/ebso1x/its_not_just_setihome_any_mathematic_or/) or [[2]](https://www.reddit.com/r/BOINC/comments/ebiz18/psa_please_remove_your_amd_rx5700xt_from_setihome/) or this [L19 thread](https://lifein19x19.com/viewtopic.php?f=18&t=17093).
* **OpenCL Mesa** - These drivers for OpenCL are buggy. Particularly if on startup before crashing you see KataGo printing something like
`Found OpenCL Platform 0: ... (Mesa) (OpenCL 1.1 Mesa ...) ...`