
wip: Optimize BitBuffer methods across the board #7375

Open
joseph-isaacs wants to merge 4 commits into develop from claude/optimize-bit-buffer-NeeMo

Conversation

@joseph-isaacs
Contributor


Key optimizations:
- `Not for &BitBuffer`: allocate directly instead of clone+try_into_mut
  which always failed (clone shares Arc, so 2 refs = failure)
  → 27-58% faster bitwise NOT on references
- `iter_bits()`: process u64 chunks instead of byte-by-byte
  → 13% faster iteration
- `PartialEq`: fast path using memcmp for byte-aligned buffers
- `append_buffer()`: memcpy fast path when both sides are byte-aligned
  → 20-52% faster buffer appends
- `append_false()`: remove unnecessary branch (new bytes are zero-init)
  → 65% faster single-bit appends
- `from_indices()`: use set_bit_unchecked directly on the byte slice
- `FromIterator` tail: batch remaining items in u64 words
  → 13% faster from_iter
- `sliced()`: use bitwise_unary_op_copy to avoid clone+fail path,
  fix byte range bug in aligned path
- `filter_bitbuffer_by_indices`: detect contiguous runs for bulk copy
- `filter_bitbuffer_by_slices`: use slice()+append_buffer() instead
  of per-bit get+append
- Add #[inline] to hot methods: set, set_to, append, true_count, etc.

Signed-off-by: Claude <noreply@anthropic.com>

https://claude.ai/code/session_0163XH8LLYAkU2qNQGbmYjhB
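The `iter_bits()` bullet describes a standard word-at-a-time pattern: load the backing bytes eight at a time as little-endian `u64` words and shift individual bits out of each word, falling back to byte indexing for the tail. A self-contained sketch of that idea (illustrative only — `BitBuffer`'s real layout, bit offsets, and iterator machinery are not shown):

```rust
/// Collect the first `num_bits` bits of `bytes` (LSB-first within each byte),
/// reading full u64 words where possible instead of going byte-by-byte.
fn iter_bits(bytes: &[u8], num_bits: usize) -> Vec<bool> {
    let mut out = Vec::with_capacity(num_bits);
    let mut chunks = bytes.chunks_exact(8);
    for chunk in &mut chunks {
        if out.len() + 64 > num_bits {
            break; // remaining bits don't fill a whole word
        }
        // One 8-byte load, then 64 cheap shift-and-mask steps.
        let word = u64::from_le_bytes(chunk.try_into().unwrap());
        for i in 0..64 {
            out.push((word >> i) & 1 == 1);
        }
    }
    // Tail: fall back to byte indexing for whatever is left.
    for bit in out.len()..num_bits {
        out.push((bytes[bit / 8] >> (bit % 8)) & 1 == 1);
    }
    out
}
```

The win comes from replacing eight byte loads (and their bounds checks) with one word load per 64 bits.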
@joseph-isaacs joseph-isaacs changed the title Optimize BitBuffer methods across the board wip: Optimize BitBuffer methods across the board Apr 9, 2026
@codspeed-hq

codspeed-hq bot commented Apr 9, 2026

Merging this PR will improve performance by ×2.4

⚡ 26 improved benchmarks
✅ 1096 untouched benchmarks
⏩ 1455 skipped benchmarks¹

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|------|-----------|------|------|------------|
| Simulation | chunked_bool_canonical_into[(100, 100)] | 114.9 µs | 103 µs | +11.56% |
| Simulation | chunked_bool_canonical_into[(1000, 10)] | 60.2 µs | 48 µs | +25.49% |
| Simulation | chunked_opt_bool_canonical_into[(1000, 10)] | 61.3 µs | 37.9 µs | +61.59% |
| Simulation | chunked_opt_bool_canonical_into[(100, 100)] | 240.2 µs | 213.2 µs | +12.64% |
| Simulation | chunked_opt_bool_into_canonical[(1000, 10)] | 69 µs | 45.2 µs | +52.74% |
| Simulation | chunked_opt_bool_into_canonical[(100, 100)] | 265 µs | 238 µs | +11.36% |
| Simulation | density_sweep_single_slice[0.9] | 138.7 µs | 58.4 µs | ×2.4 |
| Simulation | filter_one_false[100000] | 174.2 µs | 125 µs | +39.4% |
| Simulation | filter_one_false[250000] | 407.2 µs | 296.6 µs | +37.27% |
| Simulation | filter_one_false[10000] | 34.1 µs | 29 µs | +17.79% |
| Simulation | varbinview_zip_block_mask | 3.7 ms | 3.4 ms | +10.27% |
| Simulation | append_vortex_buffer[65536] | 923.3 µs | 468.2 µs | +97.21% |
| Simulation | append_vortex_buffer[2048] | 29.1 µs | 14.9 µs | +95.91% |
| Simulation | bitwise_not_vortex_buffer[128] | 4.5 µs | 3.9 µs | +17.47% |
| Simulation | bitwise_not_vortex_buffer[1024] | 3.7 µs | 3 µs | +24.42% |
| Simulation | bitwise_not_vortex_buffer[16384] | 6.6 µs | 5.9 µs | +11.93% |
| Simulation | append_buffer_vortex_buffer[1024] | 18.3 µs | 14.1 µs | +29.5% |
| Simulation | append_buffer_vortex_buffer[128] | 14.3 µs | 12.3 µs | +15.96% |
| Simulation | append_buffer_vortex_buffer[65536] | 111.6 µs | 100.9 µs | +10.59% |
| Simulation | append_buffer_vortex_buffer[2048] | 20.2 µs | 11.8 µs | +71.68% |
| ... | ... | ... | ... | ... |

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.


Comparing claude/optimize-bit-buffer-NeeMo (a87bb17) with develop (256a029)

Open in CodSpeed

Footnotes

  ¹ 1455 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them in the CodSpeed app to remove them from the performance reports.

claude added 3 commits April 9, 2026 18:44
BitBuffers almost always have shared backing storage (from slicing,
array construction, etc.), so try_into_mut nearly always fails. The
owned `Not for BitBuffer` now uses the same direct-copy path as the
reference version, and the dead `bitwise_unary_op` function is removed.

For true in-place mutation, use `BitBufferMut` directly.

Signed-off-by: Claude <noreply@anthropic.com>

https://claude.ai/code/session_0163XH8LLYAkU2qNQGbmYjhB
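The failure mode this commit describes is a property of `Arc` itself: cloning shares the allocation and bumps the refcount, so any exclusivity check made after a clone must fail. A small demonstration with a plain `Arc<Vec<u8>>` (`has_exclusive_access` is a hypothetical helper standing in for `try_into_mut`'s exclusivity check):

```rust
use std::sync::Arc;

/// True if the storage is exclusively owned and can be mutated in place,
/// mirroring what a `try_into_mut`-style call on Arc-backed bytes checks.
fn has_exclusive_access(buf: &mut Arc<Vec<u8>>) -> bool {
    Arc::get_mut(buf).is_some()
}

fn main() {
    let original: Arc<Vec<u8>> = Arc::new(vec![0u8; 16]);

    // `clone` on an Arc bumps the refcount; it does not copy the bytes.
    let mut cloned = original.clone();
    assert_eq!(Arc::strong_count(&original), 2);

    // Two live references: exclusive access is impossible, so a
    // clone-then-try_into_mut sequence fails on every call.
    assert!(!has_exclusive_access(&mut cloned));

    // Only after the other reference is dropped can mutation happen in place.
    drop(original);
    assert!(has_exclusive_access(&mut cloned));
}
```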
Keep the in-place mutation fast path for owned BitBuffer when the
backing storage has exclusive access (Arc refcount == 1). When
try_into_mut fails (shared storage), delegate to bitwise_unary_op_copy
instead of duplicating the copy logic.

Signed-off-by: Claude <noreply@anthropic.com>

https://claude.ai/code/session_0163XH8LLYAkU2qNQGbmYjhB
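The try-in-place-else-copy shape this commit describes can be sketched with a plain `Arc<Vec<u8>>` stand-in (the real code operates on `BitBuffer`'s backing storage; `not_owned` is a hypothetical name for illustration):

```rust
use std::sync::Arc;

/// Bitwise NOT on an owned, Arc-backed byte buffer: mutate in place when
/// the storage is exclusively owned, otherwise copy while negating.
fn not_owned(buf: Arc<Vec<u8>>) -> Arc<Vec<u8>> {
    match Arc::try_unwrap(buf) {
        // Refcount == 1: flip the bytes in place, zero allocation.
        Ok(mut bytes) => {
            for b in bytes.iter_mut() {
                *b = !*b;
            }
            Arc::new(bytes)
        }
        // Shared storage: allocate a fresh buffer on the copy path.
        Err(shared) => Arc::new(shared.iter().map(|b| !b).collect()),
    }
}
```

Both arms return the same bytes; the difference is only whether an allocation and copy were needed.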
The scan loop pattern `mask = mask.bitand(&conjunct_mask)` previously
always allocated a new buffer because all binary ops took references.
Now owned-left operands try try_into_mut for zero-allocation in-place
mutation when the backing storage has exclusive access.

Changes:
- Add bitwise_binary_op_lhs_owned in ops.rs: tries in-place on owned
  left, falls back to allocating bitwise_binary_op
- Wire BitAnd/BitOr/BitXor owned-left impls to use in-place path
- Add BitBuffer::into_bitand_not for owned variant
- Add BitAnd<&Mask>/BitOr<&Mask> for Mask: extracts owned BitBuffer
  for in-place binary ops in the scan loop
- Update Mask::bitand_not to use owned BitBuffer path
- Fix flat/zoned readers to capture density before consuming mask

Signed-off-by: Claude <noreply@anthropic.com>

https://claude.ai/code/session_0163XH8LLYAkU2qNQGbmYjhB
```rust
/// Tries `try_into_mut` on the left operand. If the backing storage has exclusive access,
/// the operation is performed in-place (zero allocation). Otherwise, falls back to
/// [`bitwise_binary_op`] which allocates a new buffer.
pub(super) fn bitwise_binary_op_lhs_owned<F: FnMut(u64, u64) -> u64>(
```
Contributor


maybe this should just be a method on the bitbuffer? Not sure why I made these separate functions
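For context in this thread, the try-in-place-else-allocate shape the doc comment describes can be sketched over plain `u64` words (names and signature here are illustrative, not the crate's actual API):

```rust
use std::sync::Arc;

/// Binary op with an owned left operand: reuse the left buffer when it is
/// exclusively owned, otherwise allocate like the reference-based op.
fn binary_op_lhs_owned<F: FnMut(u64, u64) -> u64>(
    lhs: Arc<Vec<u64>>,
    rhs: &[u64],
    mut f: F,
) -> Arc<Vec<u64>> {
    match Arc::try_unwrap(lhs) {
        // Exclusive access: combine into the left words in place,
        // the zero-allocation fast path for scan-loop patterns like
        // `mask = mask.bitand(&conjunct_mask)`.
        Ok(mut words) => {
            for (l, r) in words.iter_mut().zip(rhs) {
                *l = f(*l, *r);
            }
            Arc::new(words)
        }
        // Shared storage: fall back to allocating a new buffer.
        Err(shared) => Arc::new(shared.iter().zip(rhs).map(|(l, r)| f(*l, *r)).collect()),
    }
}
```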

```rust
let dst_bit_offset = start_bit_pos % 8;
let src_bit_offset = buffer.offset();

if dst_bit_offset == 0 && src_bit_offset == 0 {
```
Contributor


I don't think src offset matters here, you care that src end is byte aligned
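The alignment question in this thread can be made concrete with a small sketch (names and layout hypothetical): a bulk-copy fast path needs the destination write position on a byte boundary; whether the source's own bit offset must also be zero is exactly the point the review raises.

```rust
/// Hypothetical fast path: append `src_bits` bits of `src` onto a
/// byte-backed destination. Returns true if the whole append was handled
/// by a bulk byte copy; false means the caller must take the bit-by-bit
/// slow path instead.
fn append_aligned(dst: &mut Vec<u8>, dst_len_bits: usize, src: &[u8], src_bits: usize) -> bool {
    // The destination must be byte-aligned, or every copied byte would
    // have to be shifted across the boundary.
    if dst_len_bits % 8 != 0 {
        return false;
    }
    // Copy all full source bytes with one memcpy-like extend.
    let full_bytes = src_bits / 8;
    dst.extend_from_slice(&src[..full_bytes]);
    // A trailing partial source byte still needs masked handling (omitted
    // here); report whether the bulk path covered everything.
    src_bits % 8 == 0
}
```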
