[native]: Implement Sketch Theta aggregate and scalar functions. #26745

@nmahadevuni

Description

We need to implement the Theta sketch aggregate and scalar functions required by the new Iceberg statistics for number of distinct values (NDV) computation, which were introduced in Presto Java.

Expected Behavior or Use Case

Sketches are data structures that can approximately answer particular questions about a dataset when full accuracy is not required. Approximate answers are often faster and cheaper to compute than exact ones.

Theta sketches enable distinct value counting on datasets and also support set operations. For more information on Theta sketches, please see the Apache DataSketches Theta sketch documentation.
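To make the idea concrete, here is a minimal toy sketch in Python of how a theta-style sketch can count distinct values and be merged with another sketch. It is only an illustration under stated assumptions: the class name `ToyThetaSketch`, the helper `hash64`, and the use of SHA-256 are all hypothetical choices made for this example. It is not the Apache DataSketches implementation and does not produce its binary format.

```python
import hashlib

MAX_HASH = 2**64  # hashes are treated as integers in [0, 2**64)


def hash64(value):
    """Map a value to a 64-bit integer (SHA-256 is an illustrative choice)."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ToyThetaSketch:
    """Keeps at most k distinct hashes, all strictly below a threshold theta."""

    def __init__(self, k=1024):
        self.k = k
        self.theta = MAX_HASH  # integer threshold; 2**64 corresponds to 1.0
        self.hashes = set()    # retained hashes

    def update(self, value):
        h = hash64(value)
        if h < self.theta:
            self.hashes.add(h)
            self._trim()

    def _trim(self):
        # Once more than k hashes are retained, lower theta to the (k+1)-th
        # smallest hash and keep only the k hashes below it.
        if len(self.hashes) > self.k:
            ordered = sorted(self.hashes)
            self.theta = ordered[self.k]
            self.hashes = set(ordered[: self.k])

    def estimate(self):
        # Retained entries divided by the fraction of hash space sampled.
        return len(self.hashes) / (self.theta / MAX_HASH)

    def union(self, other):
        # Set operation: keep hashes from both inputs below the smaller theta.
        result = ToyThetaSketch(min(self.k, other.k))
        result.theta = min(self.theta, other.theta)
        merged = self.hashes | other.hashes
        result.hashes = {h for h in merged if h < result.theta}
        result._trim()
        return result


sketch = ToyThetaSketch(k=1024)
for user_id in ["a", "b", "a", "c"]:
    sketch.update(user_id)
print(sketch.estimate())  # 3.0, because fewer than k distinct values were seen
```

While fewer than k distinct values have been seen, theta stays at 1.0 and the count is exact. Beyond that point the sketch keeps a fixed-size sample and the result becomes an estimate.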

The Presto PR which introduced these changes is #20993. A brief introduction to these functions follows.

New Sketch Functions
Iceberg's Puffin spec defines the format that NDVs must be written in. Currently, the only available format is a binary blob representing an Apache Datasketches Theta Sketch, so we implemented three basic functions which expose the sketch so that Iceberg can eventually consume it when writing statistics.

sketch_theta(<column>) -> varbinary: An aggregation function which accepts a column and generates a binary representation of the org.apache.datasketches.theta.CompactSketch. Applications can easily consume this raw binary format to gain access to a CompactSketch instance and associated methods.

sketch_theta_estimate(<varbinary sketch>) -> double: A scalar function which consumes a raw binary sketch and produces the estimate. This is effectively the same as calling CompactSketch::getEstimate. I've exposed this as a convenience for checking the sketch's output.

sketch_theta_summary(<varbinary sketch>) -> row(estimate double, theta double, upper_bound_std1 double, lower_bound_std1 double, retained_entries int): Another scalar function, which returns a row type containing more human-readable information about the sketch, such as the theta parameter and the upper and lower bounds for 1 standard deviation from the estimate.
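As a rough illustration of how the fields in the summary row relate to one another, the Python sketch below derives an estimate and approximate one-standard-deviation bounds from a retained entry count and theta. The names `SketchSummary` and `summarize` are hypothetical. The bounds use a simple normal approximation that assumes each distinct value is retained independently with probability theta; this is an assumption made for the example, not the bounds algorithm used by Apache DataSketches, so the numbers will not match the real function exactly.

```python
import math
from typing import NamedTuple


class SketchSummary(NamedTuple):
    # Field names mirror the row type described above.
    estimate: float
    theta: float
    upper_bound_std1: float
    lower_bound_std1: float
    retained_entries: int


def summarize(retained_entries, theta):
    """Build a summary row from a retained entry count and theta in (0, 1]."""
    if not 0.0 < theta <= 1.0:
        raise ValueError("theta must be in (0, 1]")
    estimate = retained_entries / theta
    # Assumption: the retained count is binomial with success probability
    # theta, which gives this approximate standard deviation of the estimate.
    std_dev = math.sqrt(retained_entries * (1.0 - theta)) / theta
    # The true distinct count can never be below the entries actually retained.
    lower = max(float(retained_entries), estimate - std_dev)
    upper = estimate + std_dev
    return SketchSummary(estimate, theta, upper, lower, retained_entries)


row = summarize(retained_entries=1000, theta=0.25)
print(row.estimate)  # 4000.0
```

When theta is 1.0 the sketch has seen every distinct value, so the estimate and both bounds collapse to the retained entry count.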

Presto Component, Service, or Connector

Functions
