Add bloom filter folding to automatically size SBBF filters #9628

Open

adriangb wants to merge 32 commits into apache:main from pydantic:bloom-filter-folding

Conversation

@adriangb
Contributor

@adriangb adriangb commented Mar 30, 2026

Summary

Bloom filters now support folding mode: allocate a conservatively large filter (sized for worst-case NDV), insert all values during writing, then fold down at flush time to meet a target FPP. This eliminates the need to guess NDV upfront and produces optimally-sized filters automatically.

Changes

  • BloomFilterProperties.ndv changed from u64 to Option<u64> — when None (new default), the filter is sized based on max_row_group_row_count; when Some(n), the explicit NDV is used
  • DEFAULT_BLOOM_FILTER_NDV redefined to DEFAULT_MAX_ROW_GROUP_ROW_COUNT as u64 (was hardcoded 1_000_000)
  • Added Sbbf::fold_to_target_fpp() and supporting methods (num_folds_for_target_fpp, fold_n, num_blocks) with comprehensive documentation
  • flush_bloom_filter() in both ColumnValueEncoderImpl and ByteArrayEncoder now folds the filter before returning it
  • New create_bloom_filter() helper in encoder.rs centralizes bloom filter construction logic

How folding works

The SBBF fold operation merges adjacent block pairs (block[2i] | block[2i+1]) via bitwise OR, halving the filter size. This differs from standard Bloom filter folding (which merges halves at distance m/2) because SBBF uses multiplicative hashing for block selection:

block_index = ((hash >> 32) * num_blocks) >> 32

When num_blocks is halved, the new index becomes floor(original_index / 2), so adjacent blocks map to the same position.
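To make the index-halving claim concrete, here is a small standalone sketch (`block_index` is an illustrative stand-in for the crate's internal hashing, not its actual API):

```rust
/// Illustrative stand-in for SBBF block selection via multiplicative hashing.
fn block_index(hash: u64, num_blocks: u64) -> u64 {
    ((hash >> 32) * num_blocks) >> 32
}

fn main() {
    let num_blocks: u64 = 1 << 10;
    // a cheap deterministic pseudo-random sequence standing in for real hashes
    let mut h: u64 = 0x9E37_79B9_7F4A_7C15;
    for _ in 0..10_000 {
        h = h.wrapping_mul(0xBF58_476D_1CE4_E5B9).rotate_left(31);
        // halving the block count maps blocks 2i and 2i+1 both to block i
        assert_eq!(block_index(h, num_blocks / 2), block_index(h, num_blocks) / 2);
    }
    println!("property holds");
}
```

The property follows from nested flooring: `floor(h * (m/2) / 2^32) = floor(floor(h * m / 2^32) / 2)` when `m` is even.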

The number of safe folds is determined analytically from the average per-block fill rate: after k folds, expected fill is 1 - (1-f)^(2^k), giving FPP = fill^8. This requires only a single popcount scan over the blocks (no scratch allocation), then O(log N) floating-point ops to find the optimal fold count. The actual fold is then performed in a single pass.
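A hedged sketch of that search (hypothetical names, not the PR's exact code): given the average per-block fill rate `f` from the single popcount pass, find the largest fold count whose estimated FPP stays under the target:

```rust
/// Given the average per-block fill rate (from one popcount pass over the
/// blocks) and the current block count, find how many folds keep the
/// estimated FPP at or under `target_fpp`.
fn max_safe_folds(avg_fill: f64, num_blocks: usize, target_fpp: f64) -> u32 {
    let mut folds = 0u32;
    while (num_blocks >> (folds + 1)) >= 1 {
        // after k folds each block is the OR of 2^k original blocks,
        // so the expected fill is 1 - (1 - f)^(2^k)
        let folded_fill = 1.0 - (1.0 - avg_fill).powi(1 << (folds + 1));
        // SBBF sets one bit in each of 8 words, so FPP ≈ fill^8
        if folded_fill.powi(8) > target_fpp {
            break;
        }
        folds += 1;
    }
    folds
}

fn main() {
    // a filter at 1% average fill can fold 6 times before exceeding 1% FPP
    assert_eq!(max_safe_folds(0.01, 1 << 15, 0.01), 6);
    // a half-full filter cannot fold at all at a 1% target FPP
    assert_eq!(max_safe_folds(0.5, 1 << 15, 0.01), 0);
    println!("ok");
}
```

Only the popcount pass touches the filter data; the loop above is the O(log N) floating-point part.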

Benchmarks

Filter sized for 1M NDV, varying actual distinct values inserted. Measured on Apple M3 Pro.

Fold overhead (fold_to_target_fpp only):

| Actual NDV | Time    | Throughput   |
|-----------:|--------:|-------------:|
| 1,000      | 39.1 µs | 838 Melem/s  |
| 10,000     | 34.2 µs | 960 Melem/s  |
| 100,000    | 32.5 µs | 1.01 Gelem/s |

End-to-end (insert + fold) vs insert-only:

| Actual NDV | Insert only | Insert + fold | Fold overhead |
|-----------:|------------:|--------------:|--------------:|
| 1,000      | 14.7 µs     | 49.1 µs       | 34.4 µs (70%) |
| 10,000     | 30.7 µs     | 58.1 µs       | 27.4 µs (47%) |
| 100,000    | 162.5 µs    | 189.8 µs      | 27.3 µs (14%) |

The fold cost is dominated by the popcount scan over the initial (large) filter. For the common case (100K values into a 1M-NDV filter), folding adds only ~14% overhead to the total insert+fold time.

References

Sailhan & Stehr, "Folding and Unfolding Bloom Filters", IEEE iThings 2012.

Liang, "Blocked Bloom Filters: Speeding Up Point Lookups in Tiger Postgres' Native Columnstore"

Breaking changes

There are no breaking API changes.

However, when bloom filters are enabled without specifying the number of distinct values, the bloom filters are automatically sized. Previously they would be sized using the default value of DEFAULT_BLOOM_FILTER_NDV.

Test plan

  • All existing bloom filter unit tests pass
  • All existing integration tests (sync + async reader roundtrips) pass
  • New unit tests: fold correctness, no false negatives after folding, FPP target respected, minimum size guard
  • New unit tests: folded filter is bit-identical to a fresh filter of the same size (proves correctness via two lemmas about SBBF hashing)
  • New unit tests: multi-step folding, folded FPP matches fresh FPP empirically, fold size matches optimal fixed-size filter
  • New integration test: i32_column_bloom_filter_fixed_ndv — roundtrip with both overestimated and underestimated NDV
  • Full cargo test -p parquet passes

🤖 Generated with Claude Code

@github-actions github-actions bot added the parquet Changes to the parquet crate label Mar 30, 2026
@etseidl
Contributor

etseidl commented Mar 30, 2026

Hey @adriangb, cool idea. What motivated this if you don't mind me asking? Are any other Parquet implementations doing this?

@adriangb
Contributor Author

adriangb commented Mar 30, 2026

@viirya or @jimexist since you've worked on our bloom filters before, any interest in reviewing?

@adriangb
Contributor Author

Hey @adriangb, cool idea. What motivated this if you don't mind me asking? Are any other Parquet implementations doing this?

My motivation was that looking at our data this is a consistent problem: we have high cardinality data (trace ids) that when packed into 1M row row groups saturate the bloom filters (making them useless) but also waste a ton of space in small files. In looking for a solution I came across this neat trick.

I don't know if other Parquet implementations use this, but TimescaleDB does (linked above).

assert!(
    len >= 2,
    "Cannot fold a bloom filter with fewer than 2 blocks"
);
Member

assert!(len % 2 == 0)?

I think fold_once can only work if len is not odd.

Contributor

@emkornfield emkornfield Mar 30, 2026

I think it should work fine with odd values, as long as we are sure that the last value doesn't cause an out-of-bounds index (i.e. the last block is not modified in the odd case). But I think we probably truncate too much for odd values.

    let block_fill = set_bits as f64 / 256.0;
    total_fpp += block_fill.powi(8);
}
total_fpp / half as f64
Contributor

Why is the cast needed here? Can this be avoided by setting the type explicitly on total_fpp?

Contributor Author

///
/// ## Why adjacent pairs (not halves)?
///
/// Standard Bloom filter folding merges the two halves (`B[i] | B[i + m/2]`) because
Contributor

nit: as an explanation it might pay to reverse this. I'm not sure readers would commonly be aware of bloom filter folding, so it might be better to explain the half-based approach first and then indicate why this differs from the linked paper.

.bloom_filter_properties(descr.path())
.map(|props| Sbbf::new_with_ndv_fpp(props.ndv, props.fpp))
.transpose()?;
let (bloom_filter, bloom_filter_target_fpp) =
Member

Is the bloom filter creation logic the same as encoder.rs? Maybe we can extract fn create_bloom_filter?

Member

@viirya viirya left a comment

Do we have e2e tests that cover this folding mode behavior already?

@adriangb
Contributor Author

Do we have e2e tests that cover this folding mode behavior already?

I can add them. Where would you recommend? I'm not all that familiar with the test structure here.

@viirya
Member

viirya commented Mar 31, 2026

Do we have e2e tests that cover this folding mode behavior already?

I can add them. Where would you recommend? I'm not all that familiar with the test structure here.

Okay. Actually, after this PR the existing integration roundtrip tests will cover the folding path automatically because they don't set NDV. So the old fixed-size mode will no longer be covered by these roundtrip tests. It seems we should add roundtrip tests for fixed-size mode.

Arrow writer roundtrip tests are in parquet/src/arrow/arrow_writer/mod.rs, like i32_column_bloom_filter, i32_column_bloom_filter_at_end, etc.

Arrow reader roundtrip tests, like test_get_row_group_column_bloom_filter_with_length in parquet/tests/arrow_reader/bloom_filter/sync.rs, only call set_bloom_filter_enabled(true).

@adriangb
Contributor Author

I added a test for the legacy path. Should we deprecate it? I think the intent is better captured by the new path. One may want to create exact size bloom filters, but I don't think setting the NDV and FPP is the right way to do that (a setting for specifying the size directly would be better).

@viirya
Member

viirya commented Mar 31, 2026

I added a test for the legacy path. Should we deprecate it? I think the intent is better captured by the new path. One may want to create exact size bloom filters, but I don't think setting the NDV and FPP is the right way to do that (a setting for specifying the size directly would be better).

Yea, I think we can deprecate the old behavior and maybe remove it after few releases.

@adriangb
Contributor Author

Yea, I think we can deprecate the old behavior and maybe remove it after few releases.

Do you want to do that in this PR or in a followup (maybe once this is out in the wild and known to be working well)?

@viirya
Member

viirya commented Mar 31, 2026

Yea, I think we can deprecate the old behavior and maybe remove it after few releases.

Do you want to do that in this PR or in a followup (maybe once this is out in the wild and known to be working well)?

I think we can do it in this PR.

@wgtmac
Member

wgtmac commented Mar 31, 2026

Coming from the dev list. The parquet-java implementation tried to optimize the disk size by creating multiple bloom filter writers with different NDVs and choosing the best in the end. The approach in this PR looks more elegant and worth porting to other implementations.

@adriangb
Contributor Author

Yea, I think we can deprecate the old behavior and maybe remove it after few releases.

Do you want to do that in this PR or in a followup (maybe once this is out in the wild and known to be working well)?

I think we can do it in this PR.

If we want to deprecate the existing NDV I think we're better off re-interpreting it to mean "maximum ndv" or "initial ndv". That way existing users who are setting the ndv also benefit from folding. This means there will be no way to disable folding, but I also don't see any reason anyone would want to do that beyond requiring a fixed-size bloom filter. In that case, relying on a combination of fpp + ndv to give you a fixed size was probably a bad choice to begin with (I don't think we made any such API promise), and they should open an issue requesting an explicit API for this.

Thus the only changes vs. main now are:

  1. Adding the folding on write.
  2. Default max ndv is derived from the max rows per row group instead of being hardcoded.

@adriangb adriangb requested review from emkornfield and viirya March 31, 2026 20:43
(SMALL_SIZE as i32 + 1..SMALL_SIZE as i32 + 10).collect(),
);

// NDV smaller than actual distinct values — tests the underestimate path
Member

The array has only 7 distinct values, so "NDV smaller than actual distinct values" seems incorrect?

@alamb
Contributor

alamb commented Apr 1, 2026

Checking this one out

Contributor

@alamb alamb left a comment

This is very cool @adriangb

I reviewed this one carefully, not algorithmically but empirically with testing and I have convinced myself that this folding algorithm is correct -- specifically that it produces bit for bit identical filters.

I made a fuzz test here

My two concerns:

  1. Performance (my testing shows this folding operation is very expensive) -- if we want to enable this behavior by default we should probably optimize the calculation and folding
  2. API change -- sadly this looks like a public API change so will have to wait to the next major release

/// If you do set this value explicitly it is probably best to set it for each column
/// individually via [`WriterPropertiesBuilder::set_column_bloom_filter_ndv`] rather than globally,
/// since different columns may have different numbers of distinct values.
pub ndv: Option<u64>,
Contributor

since this is a public struct, I think this is a breaking API change: https://docs.rs/parquet/58.0.0/parquet/file/properties/struct.BloomFilterProperties.html

/// Setting to a very small number diminishes the value of the filter itself, as the bitset size is
/// even larger than just storing the whole value. You are also expected to set `ndv` if it can
/// be known in advance to greatly reduce space usage.
/// This value also serves as the target FPP for bloom filter folding: after all values
Contributor

This is a really cool feature -- to make bloom filter creation dynamically adjust the bloom filter size based on the data. I really like that there is a way to go back to the old explicit behavior too -- and solves a problem for using bloom filters in Parquet.

As mentioned below this would be a breaking API change so would have to wait for the next release

Given this is a pub struct with pub fields, I am not sure there is any way to avoid a breaking API change 🤔

One thing that could potentially improve the API would be to make a builder

struct BloomFilterPropertiesBuilder { 
...
}

And change the fields of BloomFilterProperties to not be public 🤔

(we could do this as a follow on PR)

Contributor Author

I kind of went down this path and then backed out to minimize the diff / change. I would propose we merge this and then refactor into BloomFilterPropertiesBuilder to ease future changes.

Contributor

Makes sense to me

let half = len / 2;
for i in 0..half {
    for j in 0..8 {
        self.0[i].0[j] = self.0[2 * i].0[j] | self.0[2 * i + 1].0[j];
Contributor

FYI this code seems very expensive in profiling

[profiler screenshot]

It also consumes a huge amount of time even in release builds
[profiler screenshot]

I wonder if there is some way to make this faster (unchecked access / rust iterators, etc)
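For example (a hedged sketch, not the PR's code: `Block` here is a stand-in for the crate's 256-bit block type), one shape that keeps the inner OR loop simple for the optimizer is building each merged block with `array::from_fn` and truncating in place:

```rust
use std::array;

/// Stand-in for the crate's 256-bit block (eight 32-bit words).
#[derive(Clone, Copy, Default, PartialEq, Debug)]
struct Block([u32; 8]);

/// Merge adjacent block pairs in place, halving the filter.
fn fold_once(blocks: &mut Vec<Block>) {
    let half = blocks.len() / 2;
    for i in 0..half {
        // read blocks 2i and 2i+1 before writing block i (2i >= i, so the
        // sources have not been overwritten yet)
        let merged = array::from_fn(|j| blocks[2 * i].0[j] | blocks[2 * i + 1].0[j]);
        blocks[i] = Block(merged);
    }
    blocks.truncate(half);
}

fn main() {
    let mut v = vec![
        Block([1, 0, 0, 0, 0, 0, 0, 0]),
        Block([2, 0, 0, 0, 0, 0, 0, 0]),
        Block([4, 0, 0, 0, 0, 0, 0, 0]),
        Block([8, 0, 0, 0, 0, 0, 0, 0]),
    ];
    fold_once(&mut v);
    assert_eq!(v.len(), 2);
    assert_eq!(v[0].0[0], 3);
    assert_eq!(v[1].0[0], 12);
    println!("ok");
}
```

Whether this actually beats the indexed loop would need to be confirmed in the crate's benchmarks.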

Contributor Author

I will see what I can do

Contributor Author

I've gotten Claude to do some benchmark driven optimization. It should be ~ 35% faster now.

@alamb alamb added api-change Changes to the arrow API next-major-release the PR has API changes and it waiting on the next major version labels Apr 1, 2026
    "Cannot fold a bloom filter with fewer than 2 blocks"
);
assert!(
    len % 2 == 0,
Contributor

It isn't clear to me that this will be the general case (I would need to reread the surrounding code and formulation), but are we sure this could never be hit in practice? It doesn't seem expensive to handle the odd case.

Contributor Author

Folding is only applied on the write path and we never generate odd block counts, so I think it's worth the defensive assertion instead of handling via a dead code path.

Contributor

Makes sense. If block counts are never odd, I would expect an assert someplace above this function (e.g. on initialization) that the block count is never odd as well (and we can comment there that this is required for folding).

Contributor

The number of bytes used is always a power of 2 and a multiple of block size, so it follows that the number of blocks will also be a power of 2. But it looks like this code is no longer used.

Contributor

My main point is that it's not obvious from this patch that this is the case. I'm not sure we made use of the power-of-2 sizing before, and if we start doing so, we should ideally document that with an assertion or unit test where we set the initial number of blocks, so it doesn't regress in the future.

adriangb and others added 8 commits April 4, 2026 08:58
…move half

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of alternating between FPP estimation and folding at each level
(reading data twice per level), determine the number of safe folds upfront
via a tree reduction on a scratch buffer, then fold all levels in one pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ld_n

Remove fold_once and estimated_fpp_after_fold in favor of two composable
private methods that are independently testable. Tests now use fold_n(1)
directly instead of a test-only helper.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tch buffer

Replace the tree-reduction approach (O(N/2) scratch allocation + simulated
merges) with an analytical estimate based on average per-block fill rate.
After k folds, expected fill is 1-(1-f)^(2^k), giving FPP = fill^8.

This is ~2x faster for the fold-only benchmark: the only data pass is a
vectorizable popcount scan, followed by O(log N) floating-point ops.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of changing ndv from u64 to Option<u64> (which breaks downstream
users), resolve the dynamic default at WriterPropertiesBuilder::build()
time. Columns with bloom filters enabled but no explicit NDV get their
ndv set to max_row_group_row_count, so the filter is never undersized.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@adriangb adriangb force-pushed the bloom-filter-folding branch from f18709d to bb6c3a9 Compare April 4, 2026 14:10
/// but without an explicit NDV will have their NDV resolved at build time to
/// [`WriterProperties::max_row_group_row_count`], which may differ from this constant
/// if the user configured a custom row group size.
pub const DEFAULT_BLOOM_FILTER_NDV: u64 = DEFAULT_MAX_ROW_GROUP_ROW_COUNT as u64;
Contributor Author

Note that this is also a slightly different number (DEFAULT_MAX_ROW_GROUP_ROW_COUNT is 1_048_576 vs. the old 1_000_000; the difference is negligible, and I think the old constant would have been rounded up to this anyway since the sizes need to be powers of 2).
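(A quick check of the rounding claim, assuming the power-of-two size would be computed with something like next_power_of_two:)

```rust
fn main() {
    // the old hardcoded default of 1_000_000 rounds up to exactly the new default
    assert_eq!(1_000_000u64.next_power_of_two(), 1_048_576);
    assert_eq!(1_048_576u64, 1 << 20);
    println!("ok");
}
```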

@adriangb
Contributor Author

adriangb commented Apr 4, 2026

@alamb I was able to refactor this into a non-breaking change, there are no changes to the public API, just the behavioral changes which are positive and backwards compatible. So I think we can merge this immediately and queue up a PR to make any API improvements we want (e.g. #9628 (comment)) as a followup to be released w/ next major version.

Comment on lines +1257 to 1258
/// since different columns may have different numbers of distinct values.
pub ndv: u64,
Member

@viirya viirya Apr 4, 2026

If we plan to remove it, do we need to add a deprecation? Or do we actually want to keep the explicit ndv setting alongside auto folding?

Contributor Author

@adriangb adriangb Apr 4, 2026

So the situation now is that the explicit ndv setting is useful if you know the ndv << row group size to reduce write memory usage, but we still always do the folding. So the ndv only controls the initially allocated size of the bloom filters. Thus I think it's useful to keep the ability to explicitly set the ndv.

Member

I see. Makes sense.

/// A split block Bloom filter (SBBF).
///
/// An SBBF partitions its bit space into fixed-size 256-bit (32-byte) blocks, each fitting in a
/// single CPU cache line. Each block contains eight 32-bit words, aligned with SIMD lanes for
Contributor

nit: do we need the detail about the cache line? On most modern CPUs I think two blocks actually fit per cache line, so it is a little confusing?


// step 3 (lemma 2): mask is the same
for w in 0..8 {
assert_ne!(original.0[orig_idx].0[w] & mask.0[w], 0,);
Contributor

Is it more straightforward to assert original.0[orig_idx].0[w] == mask.0[w]? Isn't the current assertion just saying there is at least one overlapping bit between mask and original?

Contributor

@emkornfield emkornfield left a comment

Seems reasonable to me. I added a few comments, but other than some comment cleanup I don't think there is anything blocking here.

@adriangb
Contributor Author

adriangb commented Apr 5, 2026

Seems reasonable to me. I added a few comments, but other than some comment cleanup I don't think there is anything blocking here.

Great to hear! I won't be able to merge (not a committer) so I will leave it up to someone else.

@alamb
Contributor

alamb commented Apr 6, 2026

Giving it a review

@alamb alamb removed the next-major-release the PR has API changes and it waiting on the next major version label Apr 6, 2026
Contributor

@alamb alamb left a comment

Really nice work @adriangb

I went through this PR again carefully and the TLDR is I think this is good to go: even though it is technically a behavior change, it isn't an API change.

I also re-ran my fuzz testing on it and it all passed.

I did change the description from

- `BloomFilterProperties.ndv`: `u64` → `Option<u64>` (direct struct construction must be updated)
+ However, when bloom filters are enabled without specifying the number of distinct values, the bloom filters are automatically sized. Previously they would be sized using the default value of `DEFAULT_BLOOM_FILTER_NDV`

I also think we should follow up with a follow on PR to make the BloomFilterPropertiesBuilder (as discussed here) and make the bloom filter configuration more explicit. I filed a ticket to do so

@adriangb
Contributor Author

adriangb commented Apr 6, 2026

Awesome! Thanks for filing the follow up ticket. This is good to merge from my perspective, but I can’t hit the button just FYI.

@alamb
Contributor

alamb commented Apr 6, 2026

Ok, I'll plan to merge it tomorrow unless there are any objections
