Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Very sparse columns (high null ratio) are currently just as expensive to write as dense, high-cardinality ones, even though the underlying encoding (RLE) compresses long runs of identical values into a single entry.
The cost of writing should reflect the cost of encoding: writing the same value a million times should be roughly as cheap as writing it once.
Describe the solution you'd like
The writer should perform per-run work instead of per-value work wherever possible. When long runs of identical definition/repetition levels are detected (as is typical for sparse columns), counting, histogram updates, and RLE encoding should all be amortized over the entire run in O(1) rather than O(n). Entirely-null columns should be an especially cheap special case.
Describe alternatives you've considered
N/A
Additional context
N/A