Hi, thanks for maintaining this repo.
int bs_idx = blockIdx.z;
int c_idx = blockIdx.y;
int pt_idx = blockIdx.x * blockDim.x + threadIdx.x;
So it looks like the expected grid layout should be:
`dim3(num_point_blocks, c, b)`
where:
- `blockIdx.x` indexes chunks of sampled points
- `blockIdx.y` indexes channels
- `blockIdx.z` indexes batches
However, in `gather_points_kernel_wrapper` , the kernel is launched with:
`gather_points_kernel<scalar_t>
<<<dim3(b, c, 1), opt_n_threads(npoints), 0, stream>>>(...);
`
This means `blockIdx.z` is always `0`, so bs_idx seems to always be `0`. If I understand correctly, this would only process batch `0`, while batches` 1..b-1` would remain unwritten.
The same pattern also appears in `gather_points_grad_kernel_wrapper`.
Should the launch configuration instead be something like:
`unsigned int n_threads = opt_n_threads(npoints);
unsigned int n_blocks = (npoints + n_threads - 1) / n_threads;
gather_points_kernel<scalar_t>
<<<dim3(n_blocks, c, b), n_threads, 0, stream>>>(...);
`
and similarly for `gather_points_grad_kernel`?
I may be missing something about the intended indexing scheme, so I wanted to ask before opening a PR.
Thanks!
Hi, thanks for maintaining this repo.
I have a question about the CUDA launch configuration in
sampling_gpu.cu.In
gather_points_kernel, the kernel uses: