Skip to content

[GLUTEN-12010][CH] Pass the correct values to ParquetInputFormat#12011

Merged
zzcclp merged 3 commits into
apache:mainfrom
zzcclp:gluten-12010
Apr 30, 2026
Merged

[GLUTEN-12010][CH] Pass the correct values to ParquetInputFormat#12011
zzcclp merged 3 commits into
apache:mainfrom
zzcclp:gluten-12010

Conversation

@zzcclp
Copy link
Copy Markdown
Contributor

@zzcclp zzcclp commented Apr 29, 2026

Updates the ClickHouse backend’s Parquet read path to use the configured Parquet input settings (rather than hardcoded defaults) when constructing/feeding the Parquet input formats.

Changes:

  • Thread format_settings.parquet.max_block_size into the local ParquetInputFormat wrapper and use it for row-index-only batch generation.
  • Select min_bytes_for_seek based on whether the underlying read is remote vs local, and pass it into the native Parquet input formats.
  • Remove the default ClickHouse setting local_engine.settings.log_processors_profiles = true from backend initialization.

What changes are proposed in this pull request?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Related issue: #12010

Pass the correct values to ParquetInputFormat
@zzcclp zzcclp requested a review from Copilot April 29, 2026 10:23
@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

import org.apache.gluten.backendsapi.clickhouse.CHConfig._
conf.setCHConfig(
"timezone" -> conf.get("spark.sql.session.timeZone", TimeZone.getDefault.getID),
"local_engine.settings.log_processors_profiles" -> "true")
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default value of the log_processors_profiles is true, don't need to set again.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better leave this here.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates the ClickHouse backend’s Parquet read path to use the configured Parquet input settings (rather than hardcoded defaults) when constructing/feeding the Parquet input formats.

Changes:

  • Thread format_settings.parquet.max_block_size into the local ParquetInputFormat wrapper and use it for row-index-only batch generation.
  • Select min_bytes_for_seek based on whether the underlying read is remote vs local, and pass it into the native Parquet input formats.
  • Remove the default ClickHouse setting local_engine.settings.log_processors_profiles = true from backend initialization.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
cpp-ch/local-engine/Storages/SubstraitSource/ParquetFormatFile.cpp Uses configured Parquet max_block_size for row-index-only reads and uses remote vs local min_bytes_for_seek when creating Parquet input formats.
backends-clickhouse/src/main/scala/org/apache/gluten/backendsapi/clickhouse/CHListenerApi.scala Changes default ClickHouse backend initialization config by removing log_processors_profiles.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 111 to 115
assert(outputHeader.columns());
assert(row_index_reader);
// TODO: rebase-25.12, format_settings_.parquet.max_block_size
Columns cols{row_index_reader->readBatch(8192)};
Columns cols{row_index_reader->readBatch(max_block_size)};
size_t rows = cols[0]->size();
return Chunk(std::move(cols), rows);
Copy link

Copilot AI Apr 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change alters the batch size used when only meta columns are requested (readHeader.columns()==0) to depend on format_settings.parquet.max_block_size. There doesn’t appear to be a unit/integration test covering that row-index-only path with a non-default max_block_size; adding one would help prevent regressions (e.g., ensuring chunk sizes follow the setting rather than a hardcoded constant).

Copilot uses AI. Check for mistakes.
Comment on lines 90 to 94
// Add configs
import org.apache.gluten.backendsapi.clickhouse.CHConfig._
conf.setCHConfig(
"timezone" -> conf.get("spark.sql.session.timeZone", TimeZone.getDefault.getID),
"local_engine.settings.log_processors_profiles" -> "true")
"timezone" -> conf.get("spark.sql.session.timeZone", TimeZone.getDefault.getID))
conf.setCHSettings("spark_version", SPARK_VERSION)
Copy link

Copilot AI Apr 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description/title is about passing correct values to ParquetInputFormat, but this hunk also removes the default CH setting local_engine.settings.log_processors_profiles = true. If this behavior change is intended, it should be called out in the PR description (or moved to a separate PR) since it changes runtime logging/profiling defaults for the ClickHouse backend.

Copilot uses AI. Check for mistakes.
@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

@zzcclp zzcclp requested a review from lgbo-ustc April 30, 2026 02:10
Copy link
Copy Markdown
Contributor

@zhanglistar zhanglistar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@zzcclp zzcclp merged commit 93de3c8 into apache:main Apr 30, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants