batch rays together, decimate allocations and speedup ~3x #4
AK dep missing in Project.toml!
| verts::Vector{SVector{3,Float64}} | ||
| verts::Vector{NTuple{3,Float64}} | ||
| tris::Vector{NTuple{3,Int32}} | ||
| kinds::Vector{SurfaceKind} |
We can make this a BitVector, seeing as there are only 2 kinds? Maybe make it into an is_sink var
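A minimal sketch of that idea, assuming a two-valued `SurfaceKind` enum. The member names `WALL` and `SINK` are made up for illustration, since the diff doesn't show them:

```julia
# Hypothetical two-valued enum; the real member names are not shown in the diff.
@enum SurfaceKind WALL SINK

kinds = [WALL, SINK, WALL, WALL]

# Broadcasting == over a Vector yields a BitVector (one bit per surface).
is_sink = kinds .== SINK
```

One thing to weigh: indexing a BitVector costs a shift and a mask, and concurrent writes to a BitVector are not thread-safe, so a plain `Vector{Bool}` may be the safer choice if anything ever mutates it from several threads.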
| tcur = wall_t[ray_idx] | ||
| idxcur = wall_idx[ray_idx] | ||
| if (t < tcur) || ((t == tcur) && (idxcur == 0 || leaf_idx < idxcur)) | ||
| n = triangle_unit_normal(v0, v1, v2) |
I think we might get a speedup by precomputing the normals and storing in SurfaceBVH?
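Roughly what the precompute could look like. This is only a sketch: `unit_normal` and `precompute_normals` are stand-in names, and how the result would be stored in `SurfaceBVH` is not something the diff shows:

```julia
const Vec3 = NTuple{3,Float64}

# Stand-in for triangle_unit_normal: normalised cross product of two edges.
function unit_normal(v0::Vec3, v1::Vec3, v2::Vec3)
    e1 = (v1[1] - v0[1], v1[2] - v0[2], v1[3] - v0[3])
    e2 = (v2[1] - v0[1], v2[2] - v0[2], v2[3] - v0[3])
    n = (e1[2] * e2[3] - e1[3] * e2[2],
         e1[3] * e2[1] - e1[1] * e2[3],
         e1[1] * e2[2] - e1[2] * e2[1])
    inv_len = 1.0 / sqrt(n[1]^2 + n[2]^2 + n[3]^2)  # NaN/Inf for degenerate triangles
    return (n[1] * inv_len, n[2] * inv_len, n[3] * inv_len)
end

# Computed once at build time, then looked up by triangle index in the hot loop.
precompute_normals(verts::Vector{Vec3}, tris::Vector{NTuple{3,Int32}}) =
    [unit_normal(verts[i], verts[j], verts[k]) for (i, j, k) in tris]

verts = Vec3[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
tris = [(Int32(1), Int32(2), Int32(3))]
normals = precompute_normals(verts, tris)
```

The trade-off is 24 extra bytes per triangle against one cross product and one square root per accepted hit, so whether it pays off probably depends on how often a hit actually gets accepted.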
| hit_found = true | ||
| tcur = sphere_t[ray_idx] | ||
| idxcur = sphere_idx[ray_idx] | ||
| if (t < tcur) || ((t == tcur) && (idxcur == 0 || Int(leaf_idx) < idxcur)) |
Not sure how much performance this gives but... if we change the ray_triangle_intersect negative case to Inf, we 1. get a Float64 (instead of a Union) and 2. get to use a boolean op here. In my head the gains from the boolean op add up over time?
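A sketch of the Inf-sentinel version, written as a standard Möller-Trumbore test. `ray_triangle_t`, `is_better` and the `tol` keyword are hypothetical names, not the PR's actual code:

```julia
@inline sub3(a, b) = (a[1] - b[1], a[2] - b[2], a[3] - b[3])
@inline dot3(a, b) = a[1] * b[1] + a[2] * b[2] + a[3] * b[3]
@inline cross3(a, b) = (a[2] * b[3] - a[3] * b[2],
                        a[3] * b[1] - a[1] * b[3],
                        a[1] * b[2] - a[2] * b[1])

# Always returns a Float64: the hit distance, or Inf on a miss.
function ray_triangle_t(p, d, v0, v1, v2; tol = 1e-12)
    e1 = sub3(v1, v0)
    e2 = sub3(v2, v0)
    h = cross3(d, e2)
    a = dot3(e1, h)
    abs(a) < tol && return Inf            # ray parallel to the triangle plane
    f = 1.0 / a
    s = sub3(p, v0)
    u = f * dot3(s, h)
    (u < 0.0 || u > 1.0) && return Inf
    q = cross3(s, e1)
    v = f * dot3(d, q)
    (v < 0.0 || u + v > 1.0) && return Inf
    t = f * dot3(e2, q)
    return t > tol ? t : Inf              # hits behind the origin count as misses
end

# Non-short-circuiting update test. & and | bind tighter than comparisons in
# Julia, so every comparison needs its own parentheses.
is_better(t, idx, tcur, idxcur) =
    (t < tcur) | ((t == tcur) & isfinite(t) & ((idxcur == 0) | (idx < idxcur)))
```

Note the `isfinite(t)` guard: if the per-ray best distance starts at Inf with index 0, a miss (`t == Inf`) would otherwise satisfy the tie-break branch and get recorded as a hit.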
| @inline add3(a::NTuple{3,Float64}, b::NTuple{3,Float64}) = (a[1] + b[1], a[2] + b[2], a[3] + b[3]) | ||
| @inline sub3(a::NTuple{3,Float64}, b::NTuple{3,Float64}) = (a[1] - b[1], a[2] - b[2], a[3] - b[3]) | ||
| @inline mul3(a::NTuple{3,Float64}, s::Float64) = (a[1] * s, a[2] * s, a[3] * s) | ||
| @inline madd3(a::NTuple{3,Float64}, s::Float64, b::NTuple{3,Float64}) = (a[1] + s * b[1], a[2] + s * b[2], a[3] + s * b[3]) |
Yk there's a builtin func called muladd (found out by accident when i was showing someone what mullah means in arabic)
I reckon we can use it here and in dot3 and cross3
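For reference, `muladd(x, y, z)` computes `x*y + z` and lets the compiler fuse it into an FMA where the hardware supports it. A sketch of how `madd3` and `dot3` might look with it (the untyped signatures here are a simplification of the ones in the diff):

```julia
# a + s * b, one muladd per component
@inline madd3(a, s, b) = (muladd(s, b[1], a[1]),
                          muladd(s, b[2], a[2]),
                          muladd(s, b[3], a[3]))

# a ⋅ b as a chain of two muladds around one plain multiply
@inline dot3(a, b) = muladd(a[1], b[1], muladd(a[2], b[2], a[3] * b[3]))
```

A fused multiply-add rounds once instead of twice, so results can differ in the last bit from the unfused version. That could matter for the exact `t == tcur` tie-break, so it is probably worth rerunning any determinism tests after switching.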
| tmin = -Inf | ||
| tmax = Inf | ||
| @inbounds for k in 1:3 | ||
| dk = d[k] |
Lowkey I think we're overthinking this function? We can make it like
    invd = 1.0 / d[k]
    t1 = (mins[k] - p[k]) * invd
    t2 = (maxs[k] - p[k]) * invd
    tmin = max(tmin, min(t1, t2))
    tmax = min(tmax, max(t1, t2))
    if tmax < max(tmin, eps)
        return nothing # or Inf if you like my previous idea
    end
    return tmin > eps ? tmin : tmax
cus if the ray is exactly parallel to the axis, invd will be Inf ygm?
Hopefully this means it will parallelise better?
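Putting the suggestion above into a complete function, as a sketch. `ray_aabb_t` is a made-up name, `tol` stands in for the `eps` variable (which would otherwise shadow `Base.eps`), and misses return Inf rather than `nothing`:

```julia
# Branch-free slab test: entry distance, exit distance if the origin is inside,
# or Inf on a miss.
function ray_aabb_t(p, d, mins, maxs, tol)
    tmin = -Inf
    tmax = Inf
    @inbounds for k in 1:3
        invd = 1.0 / d[k]                 # ±Inf when the ray is parallel to axis k
        t1 = (mins[k] - p[k]) * invd
        t2 = (maxs[k] - p[k]) * invd
        tmin = max(tmin, min(t1, t2))
        tmax = min(tmax, max(t1, t2))
    end
    tmax < max(tmin, tol) && return Inf
    return tmin > tol ? tmin : tmax
end

mins = (0.0, 0.0, 0.0)
maxs = (1.0, 1.0, 1.0)
hit = ray_aabb_t((-1.0, 0.5, 0.5), (1.0, 0.0, 0.0), mins, maxs, 1e-12)
miss = ray_aabb_t((-1.0, 2.0, 0.5), (1.0, 0.0, 0.0), mins, maxs, 1e-12)
inside = ray_aabb_t((0.5, 0.5, 0.5), (1.0, 0.0, 0.0), mins, maxs, 1e-12)
# Edge case: parallel ray whose origin lies exactly on a slab plane.
edge = ray_aabb_t((-1.0, 0.0, 0.5), (1.0, 0.0, 0.0), mins, maxs, 1e-12)
```

The parallel case mostly works out through the infinities, but there is one edge case: when `d[k] == 0` and the origin sits exactly on a slab plane, `0 * Inf` is NaN, and Julia's `min`/`max` propagate NaN, so the function returns NaN. Whether that needs an explicit guard depends on how callers treat NaN distances.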
rather large effort to batch rays together for passing to ImplicitBVH. overloads the LVT traversal algorithm to only trace 'active' rays, i.e. those that have not been terminated (hit a sink, bbox, max bounces, max length etc)
this aims to effectively eliminate allocations in the ray tracing loop: traversal caches are now properly reused, and direction + position matrices no longer need to be created per traverse_rays call. we simply mutate the RayBatchBuffer instead
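To illustrate the mutate-in-place idea, here is a toy buffer. This is not the PR's `RayBatchBuffer`: the struct, its fields and `retire!` are all hypothetical, and it makes no calls into ImplicitBVH:

```julia
# Hypothetical stand-in for a ray batch buffer, allocated once up front.
struct SketchRayBuffer
    points::Matrix{Float64}       # 3×N ray origins, overwritten every bounce
    directions::Matrix{Float64}   # 3×N ray directions, overwritten every bounce
    active::Vector{Int}           # indices of rays that are still being traced
end

SketchRayBuffer(n::Integer) = SketchRayBuffer(zeros(3, n), zeros(3, n), collect(1:n))

# Drop terminated rays from the active list in place; no new arrays are created.
function retire!(buf::SketchRayBuffer, terminated::AbstractVector{Bool})
    filter!(i -> !terminated[i], buf.active)
    return buf
end

buf = SketchRayBuffer(4)
retire!(buf, [false, true, false, true])
```

The point is just that the two matrices and the index list live for the whole trace, and each bounce only overwrites columns and shrinks `active`.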
strong scaling is decent but not amazing: 1000 rays -> 11.3 s on 1 thread, 3.6 s on 4 threads for a ~3.1x speedup