
wip: Optimize BitBuffer methods across the board #7375

Open
joseph-isaacs wants to merge 4 commits into develop from claude/optimize-bit-buffer-NeeMo

Conversation

@joseph-isaacs
Contributor


Key optimizations:
- `Not for &BitBuffer`: allocate directly instead of clone+try_into_mut
  which always failed (clone shares Arc, so 2 refs = failure)
  → 27-58% faster bitwise NOT on references
- `iter_bits()`: process u64 chunks instead of byte-by-byte
  → 13% faster iteration
- `PartialEq`: fast path using memcmp for byte-aligned buffers
- `append_buffer()`: memcpy fast path when both sides are byte-aligned
  → 20-52% faster buffer appends
- `append_false()`: remove unnecessary branch (new bytes are zero-init)
  → 65% faster single-bit appends
- `from_indices()`: use set_bit_unchecked directly on the byte slice
- `FromIterator` tail: batch remaining items in u64 words
  → 13% faster from_iter
- `sliced()`: use bitwise_unary_op_copy to avoid clone+fail path,
  fix byte range bug in aligned path
- `filter_bitbuffer_by_indices`: detect contiguous runs for bulk copy
- `filter_bitbuffer_by_slices`: use slice()+append_buffer() instead
  of per-bit get+append
- Add #[inline] to hot methods: set, set_to, append, true_count, etc.

Signed-off-by: Claude <noreply@anthropic.com>

https://claude.ai/code/session_0163XH8LLYAkU2qNQGbmYjhB
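The `iter_bits()` bullet describes a standard word-at-a-time pattern: load the backing bytes eight at a time as little-endian `u64` words and shift individual bits out of each word, falling back to byte indexing for the tail. A self-contained sketch of that idea (illustrative only — `BitBuffer`'s real layout, bit offsets, and iterator machinery are not shown):

```rust
/// Collect the first `num_bits` bits of `bytes` (LSB-first within each byte),
/// reading full u64 words where possible instead of going byte-by-byte.
fn iter_bits(bytes: &[u8], num_bits: usize) -> Vec<bool> {
    let mut out = Vec::with_capacity(num_bits);
    let mut chunks = bytes.chunks_exact(8);
    for chunk in &mut chunks {
        if out.len() + 64 > num_bits {
            break; // remaining bits don't fill a whole word
        }
        // One 8-byte load, then 64 cheap shift-and-mask steps.
        let word = u64::from_le_bytes(chunk.try_into().unwrap());
        for i in 0..64 {
            out.push((word >> i) & 1 == 1);
        }
    }
    // Tail: fall back to byte indexing for whatever is left.
    for bit in out.len()..num_bits {
        out.push((bytes[bit / 8] >> (bit % 8)) & 1 == 1);
    }
    out
}
```

The win comes from replacing eight byte loads (and their bounds checks) with one word load per 64 bits.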
@joseph-isaacs joseph-isaacs changed the title Optimize BitBuffer methods across the board wip: Optimize BitBuffer methods across the board Apr 9, 2026
@codspeed-hq

codspeed-hq bot commented Apr 9, 2026

Merging this PR will improve performance by ×2.4

⚡ 26 improved benchmarks
✅ 1096 untouched benchmarks
⏩ 1455 skipped benchmarks¹

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|------|-----------|------|------|------------|
| Simulation | chunked_bool_canonical_into[(100, 100)] | 114.9 µs | 103 µs | +11.56% |
| Simulation | chunked_bool_canonical_into[(1000, 10)] | 60.2 µs | 48 µs | +25.49% |
| Simulation | chunked_opt_bool_canonical_into[(1000, 10)] | 61.3 µs | 37.9 µs | +61.59% |
| Simulation | chunked_opt_bool_canonical_into[(100, 100)] | 240.2 µs | 213.2 µs | +12.64% |
| Simulation | chunked_opt_bool_into_canonical[(1000, 10)] | 69 µs | 45.2 µs | +52.74% |
| Simulation | chunked_opt_bool_into_canonical[(100, 100)] | 265 µs | 238 µs | +11.36% |
| Simulation | density_sweep_single_slice[0.9] | 138.7 µs | 58.4 µs | ×2.4 |
| Simulation | filter_one_false[100000] | 174.2 µs | 125 µs | +39.4% |
| Simulation | filter_one_false[250000] | 407.2 µs | 296.6 µs | +37.27% |
| Simulation | filter_one_false[10000] | 34.1 µs | 29 µs | +17.79% |
| Simulation | varbinview_zip_block_mask | 3.7 ms | 3.4 ms | +10.27% |
| Simulation | append_vortex_buffer[65536] | 923.3 µs | 468.2 µs | +97.21% |
| Simulation | append_vortex_buffer[2048] | 29.1 µs | 14.9 µs | +95.91% |
| Simulation | bitwise_not_vortex_buffer[128] | 4.5 µs | 3.9 µs | +17.47% |
| Simulation | bitwise_not_vortex_buffer[1024] | 3.7 µs | 3 µs | +24.42% |
| Simulation | bitwise_not_vortex_buffer[16384] | 6.6 µs | 5.9 µs | +11.93% |
| Simulation | append_buffer_vortex_buffer[1024] | 18.3 µs | 14.1 µs | +29.5% |
| Simulation | append_buffer_vortex_buffer[128] | 14.3 µs | 12.3 µs | +15.96% |
| Simulation | append_buffer_vortex_buffer[65536] | 111.6 µs | 100.9 µs | +10.59% |
| Simulation | append_buffer_vortex_buffer[2048] | 20.2 µs | 11.8 µs | +71.68% |
| ... | ... | ... | ... | ... |

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.


Comparing claude/optimize-bit-buffer-NeeMo (a87bb17) with develop (256a029)

Open in CodSpeed

Footnotes

  ¹ 1455 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them in the CodSpeed app to remove them from the performance reports.

claude added 3 commits April 9, 2026 18:44
BitBuffers almost always have shared backing storage (from slicing,
array construction, etc.), so try_into_mut nearly always fails. The
owned `Not for BitBuffer` now uses the same direct-copy path as the
reference version, and the dead `bitwise_unary_op` function is removed.

For true in-place mutation, use `BitBufferMut` directly.

Signed-off-by: Claude <noreply@anthropic.com>

https://claude.ai/code/session_0163XH8LLYAkU2qNQGbmYjhB
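The failure mode this commit describes is a property of `Arc` itself: cloning shares the allocation and bumps the refcount, so any exclusivity check made after a clone must fail. A small demonstration with a plain `Arc<Vec<u8>>` (`has_exclusive_access` is a hypothetical helper standing in for `try_into_mut`'s exclusivity check):

```rust
use std::sync::Arc;

/// True if the storage is exclusively owned and can be mutated in place,
/// mirroring what a `try_into_mut`-style call on Arc-backed bytes checks.
fn has_exclusive_access(buf: &mut Arc<Vec<u8>>) -> bool {
    Arc::get_mut(buf).is_some()
}

fn main() {
    let original: Arc<Vec<u8>> = Arc::new(vec![0u8; 16]);

    // `clone` on an Arc bumps the refcount; it does not copy the bytes.
    let mut cloned = original.clone();
    assert_eq!(Arc::strong_count(&original), 2);

    // Two live references: exclusive access is impossible, so a
    // clone-then-try_into_mut sequence fails on every call.
    assert!(!has_exclusive_access(&mut cloned));

    // Only after the other reference is dropped can mutation happen in place.
    drop(original);
    assert!(has_exclusive_access(&mut cloned));
}
```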
Keep the in-place mutation fast path for owned BitBuffer when the
backing storage has exclusive access (Arc refcount == 1). When
try_into_mut fails (shared storage), delegate to bitwise_unary_op_copy
instead of duplicating the copy logic.

Signed-off-by: Claude <noreply@anthropic.com>

https://claude.ai/code/session_0163XH8LLYAkU2qNQGbmYjhB
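The try-in-place-else-copy shape this commit describes can be sketched with a plain `Arc<Vec<u8>>` stand-in (the real code operates on `BitBuffer`'s backing storage; `not_owned` is a hypothetical name for illustration):

```rust
use std::sync::Arc;

/// Bitwise NOT on an owned, Arc-backed byte buffer: mutate in place when
/// the storage is exclusively owned, otherwise copy while negating.
fn not_owned(buf: Arc<Vec<u8>>) -> Arc<Vec<u8>> {
    match Arc::try_unwrap(buf) {
        // Refcount == 1: flip the bytes in place, zero allocation.
        Ok(mut bytes) => {
            for b in bytes.iter_mut() {
                *b = !*b;
            }
            Arc::new(bytes)
        }
        // Shared storage: allocate a fresh buffer on the copy path.
        Err(shared) => Arc::new(shared.iter().map(|b| !b).collect()),
    }
}
```

Both arms return the same bytes; the difference is only whether an allocation and copy were needed.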
The scan loop pattern `mask = mask.bitand(&conjunct_mask)` previously
always allocated a new buffer because all binary ops took references.
Now owned-left operands try try_into_mut for zero-allocation in-place
mutation when the backing storage has exclusive access.

Changes:
- Add bitwise_binary_op_lhs_owned in ops.rs: tries in-place on owned
  left, falls back to allocating bitwise_binary_op
- Wire BitAnd/BitOr/BitXor owned-left impls to use in-place path
- Add BitBuffer::into_bitand_not for owned variant
- Add BitAnd<&Mask>/BitOr<&Mask> for Mask: extracts owned BitBuffer
  for in-place binary ops in the scan loop
- Update Mask::bitand_not to use owned BitBuffer path
- Fix flat/zoned readers to capture density before consuming mask

Signed-off-by: Claude <noreply@anthropic.com>

https://claude.ai/code/session_0163XH8LLYAkU2qNQGbmYjhB
```rust
/// Tries `try_into_mut` on the left operand. If the backing storage has exclusive access,
/// the operation is performed in-place (zero allocation). Otherwise, falls back to
/// [`bitwise_binary_op`] which allocates a new buffer.
pub(super) fn bitwise_binary_op_lhs_owned<F: FnMut(u64, u64) -> u64>(
```
Contributor


maybe this should just be a method on the bitbuffer? Not sure why I made these separate functions
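For context in this thread, the try-in-place-else-allocate shape the doc comment describes can be sketched over plain `u64` words (names and signature here are illustrative, not the crate's actual API):

```rust
use std::sync::Arc;

/// Binary op with an owned left operand: reuse the left buffer when it is
/// exclusively owned, otherwise allocate like the reference-based op.
fn binary_op_lhs_owned<F: FnMut(u64, u64) -> u64>(
    lhs: Arc<Vec<u64>>,
    rhs: &[u64],
    mut f: F,
) -> Arc<Vec<u64>> {
    match Arc::try_unwrap(lhs) {
        // Exclusive access: combine into the left words in place,
        // the zero-allocation fast path for scan-loop patterns like
        // `mask = mask.bitand(&conjunct_mask)`.
        Ok(mut words) => {
            for (l, r) in words.iter_mut().zip(rhs) {
                *l = f(*l, *r);
            }
            Arc::new(words)
        }
        // Shared storage: fall back to allocating a new buffer.
        Err(shared) => Arc::new(shared.iter().zip(rhs).map(|(l, r)| f(*l, *r)).collect()),
    }
}
```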

```rust
let dst_bit_offset = start_bit_pos % 8;
let src_bit_offset = buffer.offset();

if dst_bit_offset == 0 && src_bit_offset == 0 {
```
Contributor


I don't think src offset matters here, you care that src end is byte aligned
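The alignment question in this thread can be made concrete with a small sketch (names and layout hypothetical): a bulk-copy fast path needs the destination write position on a byte boundary; whether the source's own bit offset must also be zero is exactly the point the review raises.

```rust
/// Hypothetical fast path: append `src_bits` bits of `src` onto a
/// byte-backed destination. Returns true if the whole append was handled
/// by a bulk byte copy; false means the caller must take the bit-by-bit
/// slow path instead.
fn append_aligned(dst: &mut Vec<u8>, dst_len_bits: usize, src: &[u8], src_bits: usize) -> bool {
    // The destination must be byte-aligned, or every copied byte would
    // have to be shifted across the boundary.
    if dst_len_bits % 8 != 0 {
        return false;
    }
    // Copy all full source bytes with one memcpy-like extend.
    let full_bytes = src_bits / 8;
    dst.extend_from_slice(&src[..full_bytes]);
    // A trailing partial source byte still needs masked handling (omitted
    // here); report whether the bulk path covered everything.
    src_bits % 8 == 0
}
```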
