Add bloom_filter_agg and might_contain SparkSql function #3342
Closed
jinchengchenghh
started this conversation in
Ideas
Replies: 1 comment 1 reply
@jinchengchenghh FYI, there is an effort to document Spark functions: #3890. It would be nice to add these new functions to the documentation.
As apache/spark#35789 describes, these functions improve performance.
So I will try to implement them in Velox.
Spark uses xxhash64(col), whose return type is int64_t, while the Velox BloomFilter only supports a uint64_t hash code as hashInput: https://github.com/facebookincubator/velox/blob/main/velox/common/base/BloomFilter.h#L65
hashInput false: hashcode = folly::hasher(xxhash64(col))
hashInput true: hashcode = xxhash64(col)
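To illustrate the type mismatch, here is a minimal sketch of moving a signed 64-bit hash (the type xxhash64 returns in Spark) into the unsigned 64-bit form and back. The helper names toUnsignedHash and toSignedHash are hypothetical and are not part of Velox or Spark; this only shows that the conversion can keep the bit pattern intact, not how Velox should do it.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: view a signed 64-bit hash as the unsigned 64-bit
// value a uint64_t-keyed filter expects. The conversion is modular
// (mod 2^64), so on two's complement targets the bit pattern is unchanged.
inline uint64_t toUnsignedHash(int64_t signedHash) {
  return static_cast<uint64_t>(signedHash);
}

// Hypothetical inverse helper. memcpy copies the bits directly and avoids
// the implementation-defined unsigned-to-signed conversion of pre-C++20.
inline int64_t toSignedHash(uint64_t unsignedHash) {
  int64_t result;
  std::memcpy(&result, &unsignedHash, sizeof(result));
  return result;
}
```

Because the round trip is lossless, the same xxhash64 value can be used on both sides as long as both agree on which of the two hashInput modes above is in effect.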
We need to implement a stronger BloomFilter, like Spark's: https://github.com/apache/spark/blob/master/common/sketch/src/main/java/org/apache/spark/util/sketch/BloomFilterImpl.java
The new BloomFilter should adapt the number of hash functions to estimatedNumItems (rowCount) and accept more column types, at least uint64_t and int64_t.
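As a rough sketch of what "adapt the number of hash functions to estimatedNumItems" could look like, the textbook Bloom filter sizing formulas are m = -n ln(p) / (ln 2)^2 bits and k = max(1, round(m / n * ln 2)) hash functions, for n expected items and target false-positive rate p. The function names below are hypothetical, and the false-positive rate parameter is an assumption that does not appear in the proposal above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper: number of bits m for n expected items and a target
// false-positive probability p, using m = -n * ln(p) / (ln 2)^2.
inline int64_t optimalNumBits(int64_t expectedItems, double fpp) {
  assert(expectedItems > 0 && fpp > 0.0 && fpp < 1.0);
  const double ln2 = std::log(2.0);
  const double bits =
      -static_cast<double>(expectedItems) * std::log(fpp) / (ln2 * ln2);
  return static_cast<int64_t>(bits);
}

// Hypothetical helper: number of hash functions k for n expected items and
// m bits, using k = max(1, round(m / n * ln 2)).
inline int32_t optimalNumHashFunctions(int64_t expectedItems, int64_t numBits) {
  assert(expectedItems > 0 && numBits > 0);
  const double k = static_cast<double>(numBits) /
      static_cast<double>(expectedItems) * std::log(2.0);
  return std::max<int32_t>(1, static_cast<int32_t>(std::lround(k)));
}
```

For example, with one million expected rows and a 3% false-positive target, this yields roughly 7.3 million bits and 5 hash functions; a filter that is very small relative to the row count falls back to a single hash function.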