Improve performance of panama int4{DotProduct,SquareDistance}SinglePacked #15736
mccullocht wants to merge 1 commit into apache:main
Conversation
From javadocs of
I would've expected that the shape of vectors does not change? It does improve JMH benchmark performance on my AWS Graviton3 machine:

(JMH results: Baseline vs. Candidate)
Good to see that this improves throughput on graviton3!
I'll try to run luceneutil this afternoon to see if this changes things in the macro benchmark. It's probably also worth running the microbenchmark in branch_10x with jdk21 to make sure this should target 10.5 instead of 11.0.
Vector<Integer> intAcc0 = accShort.convert(ZERO_EXTEND_S2I, 0);
Vector<Integer> intAcc1 = accShort.convert(ZERO_EXTEND_S2I, 1);
sum += intAcc0.add(intAcc1).reinterpretAsInts().reduceLanes(ADD);
}
I'm seeing slightly better performance if we avoid the convert call entirely, like:

IntVector intAcc0 = acc0.reinterpretAsInts();
IntVector intAcc1 = acc1.reinterpretAsInts();
sum +=
    intAcc0
        .and(0x0000FFFF)
        .add(intAcc0.lanewise(LSHR, 16))
        .add(intAcc1.and(0x0000FFFF))
        .add(intAcc1.lanewise(LSHR, 16))
        .reduceLanes(ADD);

JMH says:
Benchmark (size) Mode Cnt Score Error Units
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedVector 1024 thrpt 15 16.288 ± 0.023 ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedVector 1024 thrpt 15 16.739 ± 0.036 ops/us
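For anyone following along, a minimal scalar analogue of the mask-and-shift trick above may help: each int lane holds two packed unsigned 16-bit accumulators, and masking plus an unsigned shift sums them without any lane-widening convert. (The class and method names below are hypothetical, for illustration only.)

```java
public class MaskShiftDemo {
  // Scalar analogue of the vector trick: one int holds two packed
  // unsigned 16-bit accumulators (hi lane in the upper 16 bits, lo lane
  // in the lower 16 bits). Mask and unsigned-shift recover and sum the
  // lanes without any widening conversion.
  static int sumPackedShorts(int packed) {
    return (packed & 0x0000FFFF) + (packed >>> 16);
  }

  public static void main(String[] args) {
    int packed = (51234 << 16) | 4000; // hi lane 51234, lo lane 4000
    System.out.println(sumPackedShorts(packed)); // 55234
  }
}
```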
@mccullocht In fact, I went one step further and removed all convert operations, especially shape-changing ones -- so we load and operate on the same number of bytes (see this commit), and I get much better performance on my machine (~25% bump over this PR):
Benchmark (size) Mode Cnt Score Error Units
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedScalar 1024 thrpt 15 2.450 ± 0.006 ops/us
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedVector 1024 thrpt 15 19.166 ± 0.028 ops/us
Edit: I only made changes for int4DotProductSinglePackedBody + when the preferred bit width is 256 -- you may need equivalent changes for other functions + bit widths to reproduce
It's the Hacker's Delight popcount trick! You can generalize this down to 1 bit. I may look into this and see if it does better on x86 too. We convert a lot to upcast for accumulation, so there may be other places we can apply both of these techniques.
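The Hacker's Delight trick referenced here, taken down to 1 bit, is the classic SWAR popcount: accumulate in progressively wider sub-lanes of a single register instead of widening the register. A hedged scalar sketch (not code from this PR):

```java
public class PopcountSketch {
  // Hacker's Delight-style SWAR popcount: sum bits pairwise into 2-bit
  // lanes, then 4-bit, then 8-bit lanes, all within one 32-bit value,
  // and finally fold the four byte-lane counts together.
  static int popcount(int x) {
    x = x - ((x >>> 1) & 0x55555555);                // 2-bit partial sums
    x = (x & 0x33333333) + ((x >>> 2) & 0x33333333); // 4-bit partial sums
    x = (x + (x >>> 4)) & 0x0F0F0F0F;                // 8-bit partial sums
    return (x * 0x01010101) >>> 24;                  // fold byte lanes
  }
}
```

This mirrors the strategy in the comment above: delay widening as long as the narrower lanes cannot overflow.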
When I profiled, the call stack under convert was invoking a ShortShuffle128 class. aarch64 doesn't have any native short shuffle instructions. This could be represented as a byte shuffle if you know the target endianness (clang will definitely generate this code in some cases). slice() followed by convert(..., 0) might also do well, but I struggle to reason about performance with the vector incubator package in a way I don't with raw intrinsics.
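To make the byte-shuffle point concrete, here is a hypothetical scalar illustration (class and helper names are mine, not from the PR): assuming little-endian lane layout, the 16-bit lane shuffle {1, 0} is exactly the byte shuffle {2, 3, 0, 1}.

```java
public class ByteShuffleDemo {
  // Apply a byte-index permutation, the scalar analogue of a vector
  // byte-shuffle (tbl) instruction.
  static byte[] shuffleBytes(byte[] src, int[] idx) {
    byte[] dst = new byte[idx.length];
    for (int i = 0; i < idx.length; i++) {
      dst[i] = src[idx[i]];
    }
    return dst;
  }

  public static void main(String[] args) {
    byte[] src = {1, 0, 2, 0}; // shorts {1, 2} in little-endian bytes
    // Short-lane shuffle {1, 0} expressed as byte shuffle {2, 3, 0, 1}:
    byte[] dst = shuffleBytes(src, new int[] {2, 3, 0, 1});
    // dst now encodes shorts {2, 1}
  }
}
```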
If it improves the byte -> short conversion: that one is a 64-bit -> 128-bit register conversion and really should be fast in all cases. There might be more throughput lurking here, because if this widening operation is really that slow you may also want to load 128 bits at a time with the reinterpret-cast summing workaround.
I opened #15742 to demonstrate!
Edit: Would be great if you could replicate benchmarks for an x86 machine (and possibly a different aarch64 machine)
The conversion operations to int were observed to be very slow in the async profiler. These operations widen vectors
that are already at the maximum native length, so it's preferable to do fewer of them by summing the two short
accumulators before widening. Summing before widening could potentially overflow, but it is sufficient to switch
the integer widening to zero extend since these dot products are implicitly unsigned.
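A minimal scalar sketch of why zero extension is the right widening here (names are illustrative, not from the patch): the two short accumulators hold unsigned partial sums, so their 16-bit sum can exceed Short.MAX_VALUE while still fitting in 16 unsigned bits. Sign-extending that bit pattern produces a negative value; zero-extending recovers the correct unsigned sum.

```java
public class ZeroExtendDemo {
  // Sum two unsigned 16-bit accumulators inside 16 bits; the result may
  // exceed Short.MAX_VALUE (i.e. look negative as a signed short) while
  // still fitting in the unsigned 16-bit range.
  static short sum16(short a, short b) {
    return (short) (a + b);
  }

  static int widenSigned(short s) {
    return s; // sign extension: wrong for unsigned accumulators
  }

  static int widenUnsigned(short s) {
    return Short.toUnsignedInt(s); // zero extension: correct here
  }
}
```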
This may fix #15697
(Benchmark results attached: M4 before/after, AMD Ryzen AI 395 (AVX 512) before/after)