Improve performance of panama int4{DotProduct,SquareDistance}SinglePacked #15736
mccullocht wants to merge 1 commit into apache:main
Conversation
From javadocs of
I would've expected that the shape of vectors does not change? It does improve JMH benchmark performance on my AWS Graviton3 machine:

(JMH results: Baseline vs. Candidate)
Good to see that this improves throughput on graviton3!
I'll try to run luceneutil this afternoon to see if this changes things in the macro benchmark. It's probably also worth running the microbenchmark in branch_10x with jdk21 to make sure this should target 10.5 instead of 11.0.
Vector<Integer> intAcc0 = accShort.convert(ZERO_EXTEND_S2I, 0);
Vector<Integer> intAcc1 = accShort.convert(ZERO_EXTEND_S2I, 1);
sum += intAcc0.add(intAcc1).reinterpretAsInts().reduceLanes(ADD);
}
I'm seeing slightly better performance if we avoid the convert call entirely, like:

IntVector intAcc0 = acc0.reinterpretAsInts();
IntVector intAcc1 = acc1.reinterpretAsInts();
sum +=
    intAcc0
        .and(0x0000FFFF)
        .add(intAcc0.lanewise(LSHR, 16))
        .add(intAcc1.and(0x0000FFFF))
        .add(intAcc1.lanewise(LSHR, 16))
        .reduceLanes(ADD);

JMH says:
Benchmark (size) Mode Cnt Score Error Units
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedVector 1024 thrpt 15 16.288 ± 0.023 ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedVector 1024 thrpt 15 16.739 ± 0.036 ops/us
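For anyone following along, a minimal scalar analogue of the mask-and-shift trick above may help: each int lane holds two packed unsigned 16-bit accumulators, and masking plus an unsigned shift sums them without any lane-widening convert. (The class and method names below are hypothetical, for illustration only.)

```java
public class MaskShiftDemo {
  // Scalar analogue of the vector trick: one int holds two packed
  // unsigned 16-bit accumulators (hi lane in the upper 16 bits, lo lane
  // in the lower 16 bits). Mask and unsigned-shift recover and sum the
  // lanes without any widening conversion.
  static int sumPackedShorts(int packed) {
    return (packed & 0x0000FFFF) + (packed >>> 16);
  }

  public static void main(String[] args) {
    int packed = (51234 << 16) | 4000; // hi lane 51234, lo lane 4000
    System.out.println(sumPackedShorts(packed)); // 55234
  }
}
```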
@mccullocht In fact, I went one step further and removed all convert operations, especially shape-changing ones -- so we load and operate on the same number of bytes (see this commit), and I get much better performance on my machine (~25% bump over this PR):
Benchmark (size) Mode Cnt Score Error Units
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedScalar 1024 thrpt 15 2.450 ± 0.006 ops/us
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedVector 1024 thrpt 15 19.166 ± 0.028 ops/us
Edit: I only made changes for int4DotProductSinglePackedBody + when the preferred bit width is 256 -- you may need equivalent changes for other functions + bit widths to reproduce
It's the Hacker's Delight popcount trick! You can generalize this down to 1 bit. I may look into this and see if it does better on x86 too. We convert a lot to upcast for accumulation, so there may be other places we can apply both of these techniques.
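The Hacker's Delight trick referenced here, taken down to 1 bit, is the classic SWAR popcount: accumulate in progressively wider sub-lanes of a single register instead of widening the register. A hedged scalar sketch (not code from this PR):

```java
public class PopcountSketch {
  // Hacker's Delight-style SWAR popcount: sum bits pairwise into 2-bit
  // lanes, then 4-bit, then 8-bit lanes, all within one 32-bit value,
  // and finally fold the four byte-lane counts together.
  static int popcount(int x) {
    x = x - ((x >>> 1) & 0x55555555);                // 2-bit partial sums
    x = (x & 0x33333333) + ((x >>> 2) & 0x33333333); // 4-bit partial sums
    x = (x + (x >>> 4)) & 0x0F0F0F0F;                // 8-bit partial sums
    return (x * 0x01010101) >>> 24;                  // fold byte lanes
  }
}
```

This mirrors the strategy in the comment above: delay widening as long as the narrower lanes cannot overflow.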
When I profiled, the call stack under convert was invoking a ShortShuffle128 class. aarch64 doesn't have any native short shuffle instructions. This could be represented as a byte shuffle if you know the target endianness (clang will definitely generate this code in some cases). slice() followed by convert(..., 0) might also do well, but I struggle to reason about performance with the vector incubator package in a way I don't with raw intrinsics.
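To make the byte-shuffle point concrete, here is a hypothetical scalar illustration (class and helper names are mine, not from the PR): assuming little-endian lane layout, the 16-bit lane shuffle {1, 0} is exactly the byte shuffle {2, 3, 0, 1}.

```java
public class ByteShuffleDemo {
  // Apply a byte-index permutation, the scalar analogue of a vector
  // byte-shuffle (tbl) instruction.
  static byte[] shuffleBytes(byte[] src, int[] idx) {
    byte[] dst = new byte[idx.length];
    for (int i = 0; i < idx.length; i++) {
      dst[i] = src[idx[i]];
    }
    return dst;
  }

  public static void main(String[] args) {
    byte[] src = {1, 0, 2, 0}; // shorts {1, 2} in little-endian bytes
    // Short-lane shuffle {1, 0} expressed as byte shuffle {2, 3, 0, 1}:
    byte[] dst = shuffleBytes(src, new int[] {2, 3, 0, 1});
    // dst now encodes shorts {2, 1}
  }
}
```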
If it improves the byte -> short conversion: that one is a 64-bit -> 128-bit register conversion and really should be fast in all cases. There might be more throughput lurking here, because if this widening operation is really that slow you may also want to load 128 bits at a time with the reinterpret-cast summing workaround.
I opened #15742 to demonstrate!
Edit: Would be great if you could replicate benchmarks for an x86 machine (and possibly a different aarch64 machine)
The conversion operations to int were observed to be very slow in the async profiler. These operations widen vectors
that are already at the maximum native length, so it's preferable to do fewer of them by summing the two short
accumulators before widening. Summing before widening could potentially overflow, but it is sufficient to switch
the integer widening to zero extend since these dot products are implicitly unsigned.
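A minimal scalar sketch of why zero extension is the right widening here (names are illustrative, not from the patch): the two short accumulators hold unsigned partial sums, so their 16-bit sum can exceed Short.MAX_VALUE while still fitting in 16 unsigned bits. Sign-extending that bit pattern produces a negative value; zero-extending recovers the correct unsigned sum.

```java
public class ZeroExtendDemo {
  // Sum two unsigned 16-bit accumulators inside 16 bits; the result may
  // exceed Short.MAX_VALUE (i.e. look negative as a signed short) while
  // still fitting in the unsigned 16-bit range.
  static short sum16(short a, short b) {
    return (short) (a + b);
  }

  static int widenSigned(short s) {
    return s; // sign extension: wrong for unsigned accumulators
  }

  static int widenUnsigned(short s) {
    return Short.toUnsignedInt(s); // zero extension: correct here
  }
}
```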
This may fix #15697
(Benchmark results attached: M4 before/after, AMD Ryzen AI 395 (AVX 512) before/after)