feat(parquet): add wide-schema writer overhead benchmark #9723

Merged
alamb merged 1 commit into apache:main from HippoBaro:wide_schema_writer_bench on Apr 15, 2026
@HippoBaro
Contributor

Which issue does this PR close?

Rationale for this change

Existing writer benchmarks use narrow schemas (5–10 columns) and primarily measure data encoding throughput. They don't capture the per-column structural overhead, such as allocation and metadata assembly, that dominates at high column counts (thousands to hundreds of thousands of columns).

What changes are included in this PR?

This commit adds benchmarks to fill that gap by writing a single-row batch through `ArrowWriter` with 1k/5k/10k flat `Float32` columns and per-column `WriterProperties` entries, isolating the cost of the writer infrastructure itself.

Baseline results (Apple M1 Max):

  writer_overhead/1000_cols/per_column_props      3.72 ms
  writer_overhead/5000_cols/per_column_props     54.96 ms
  writer_overhead/10000_cols/per_column_props   220.73 ms
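
As a quick reading aid for the numbers above, the following sketch derives the per-column cost and an empirical scaling exponent from the three reported timings. The helper names are hypothetical, and the exponent is only a two-point estimate per interval, so it is an observation about these figures rather than a claim about where the time goes.

```rust
/// (column count, reported time in ms) from the baseline results above.
const BASELINE: [(f64, f64); 3] = [
    (1000.0, 3.72),
    (5000.0, 54.96),
    (10000.0, 220.73),
];

/// Average cost per column in microseconds for one measurement.
fn per_column_us(cols: f64, total_ms: f64) -> f64 {
    total_ms * 1000.0 / cols
}

/// Empirical exponent k, assuming time is proportional to cols^k
/// between two measurements `a` and `b`.
fn scaling_exponent(a: (f64, f64), b: (f64, f64)) -> f64 {
    (b.1 / a.1).ln() / (b.0 / a.0).ln()
}

/// One summary line per measurement, plus one line per adjacent pair.
fn summary() -> Vec<String> {
    let mut lines = Vec::new();
    for (cols, ms) in BASELINE {
        lines.push(format!("{cols} cols: {:.2} us per column", per_column_us(cols, ms)));
    }
    for pair in BASELINE.windows(2) {
        lines.push(format!(
            "{} -> {} cols: exponent {:.2}",
            pair[0].0,
            pair[1].0,
            scaling_exponent(pair[0], pair[1])
        ));
    }
    lines
}
```

On these figures the per-column cost rises from about 3.7 µs at 1k columns to about 11.0 µs at 5k and about 22.1 µs at 10k, and the exponent is roughly 1.67 between 1k and 5k and roughly 2.0 between 5k and 10k. The write cost therefore grows faster than linearly in the number of columns.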

Are these changes tested?

N/A

Are there any user-facing changes?

N/A


Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
Contributor

@etseidl left a comment

Looks good. The metadata bench already covers wide tables (10k columns), but it only measures decoding the footer. It's nice to have something similar on the write side.

@alamb merged commit 06c3bd0 into apache:main on Apr 15, 2026
17 checks passed
@alamb
Contributor

alamb commented Apr 15, 2026

Thank you @HippoBaro, and thank you @etseidl for the review.

Labels

parquet Changes to the parquet crate
