Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Very sparse columns (high null ratio) are currently just as expensive to write as dense, high-cardinality ones, even though the underlying encoding (RLE) compresses long runs of identical values into a single entry.
The cost of writing should reflect the cost of encoding: writing the same value a million times should be roughly as cheap as writing it once.
Describe the solution you'd like
The writer should perform per-run work instead of per-value work wherever possible. When long runs of identical definition/repetition levels are detected (as is typical for sparse columns), counting, histogram updates, and RLE encoding should all be amortized over the entire run in O(1) rather than O(n). Entirely-null columns should be an especially cheap special case.
Describe alternatives you've considered
N/A
Additional context
N/A