[CH]support complex type in parquetv3 #12079

Open
zhanglistar wants to merge 5 commits into apache:main from zhanglistar:feat/support-complex-type-in-parquetv3

@zhanglistar zhanglistar commented May 12, 2026

What changes are proposed in this pull request?

Depends on Kyligence/ClickHouse#523.

This PR fixes several ClickHouse backend failures around nullable complex types, especially arrays, structs, and nested null values.

The changes include:

  • Preserve nullable semantics when parsing nested Substrait LIST types, so nested arrays can correctly represent values such as Array(Nullable(Array(...))).
  • Avoid inserting Field::Null into non-nullable ClickHouse complex columns during Spark row conversion and constant column creation.
  • Preserve source nullability when casting nullable columns and when aligning the final output schema.
  • Fix Parquet reading for nullable complex columns such as Nullable(Tuple(...)), including reconstructing struct-level nullability from nullable child fields.

These fixes prevent errors such as "Bad get: has Null, requested Array" and "Cannot convert NULL value to non-Nullable type", and they restore correct semantics for array lambda operations over nullable structs.
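To make the second bullet concrete, here is a minimal sketch of normalizing a value before it is inserted into a non-nullable complex column. It does not use the real ClickHouse `DB::Field` or `IDataType` classes: `Type`, `Value`, and `normalizeForType` are hypothetical stand-ins, and replacing NULL with the type's default value is an assumption about the strategy, not something this description confirms.

```cpp
#include <memory>
#include <utility>
#include <vector>

// Hypothetical toy model, not the ClickHouse API.
enum class Kind { Int64, Array };

struct Type
{
    Kind kind;
    bool nullable;
    std::shared_ptr<Type> element; // set only for Kind::Array
};

struct Value
{
    bool is_null = false;
    long long scalar = 0;          // payload for Kind::Int64
    std::vector<Value> elements;   // payload for Kind::Array
};

// Recursively replace NULLs that the target type cannot represent.
// Assumption: a NULL in a non-nullable position becomes the type default
// (0 for Int64, an empty array for Array).
Value normalizeForType(Value value, const Type & type)
{
    if (value.is_null)
    {
        if (type.nullable)
            return value;          // NULL is representable, keep it
        return Value{};            // non-nullable target: use the default
    }
    if (type.kind == Kind::Array)
    {
        for (auto & element : value.elements)
            element = normalizeForType(std::move(element), *type.element);
    }
    return value;
}
```

With a target type of Array(Int64) in this toy model, a row holding `[NULL, 7]` comes out as `[0, 7]`, while the same NULL passed against a nullable Int64 type is left untouched.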

Related to #2340.

How was this patch tested?

Tested with affected ClickHouse backend function validation cases, including:

Also verified C++ compilation for the touched objects, including parser and Parquet reader components.

Was this patch authored or co-authored using generative AI tooling?

Co-authored with GPT-5.5.

@github-actions
Run Gluten Clickhouse CI on x86

@zhanglistar zhanglistar force-pushed the feat/support-complex-type-in-parquetv3 branch from 057bcda to d28e587 Compare May 20, 2026 01:15
@zhanglistar zhanglistar force-pushed the feat/support-complex-type-in-parquetv3 branch from 28b4930 to 6758a11 Compare May 22, 2026 01:19

Copilot AI left a comment

Pull request overview

This PR improves ClickHouse backend correctness around nullable complex types (arrays, maps, tuples, and struct-like types) across Substrait type parsing, Spark-row-to-CH conversion, CAST and constant-literal handling, output schema alignment, and join column alignment. It also enables the Parquet native reader v3 for complex types.

Changes:

  • Preserve nested LIST nullability semantics in TypeParser to correctly represent types like Array(Nullable(Array(...))), while intentionally dropping outermost LIST nullability.
  • Prevent inserting Field::Null into non-nullable complex columns by normalizing Spark row Fields and by making null literals/casts preserve nullability when required.
  • Enable Parquet reader v3 for non-flat schemas (when row-index virtual columns aren’t requested) and add regression tests for the affected behaviors.
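The LIST nullability rule in the first bullet can be sketched with a toy type tree. This is only an illustration under stated assumptions: `ToyType` and `toClickHouseName` are hypothetical names, the real `TypeParser::parseType` works on Substrait protobuf types, and the `is_outermost` parameter merely stands in for the control flag mentioned in the per-file summary.

```cpp
#include <memory>
#include <string>

// Hypothetical, simplified stand-in for a Substrait type: either a scalar
// or a LIST with one element type, each carrying its own nullability flag.
struct ToyType
{
    std::string scalar_name;           // used when element is null
    bool nullable = false;
    std::shared_ptr<ToyType> element;  // non-null means this is a LIST
};

// Render a ClickHouse-style type name. Nested LISTs keep their Nullable
// wrapper; only the outermost LIST has its nullability dropped.
std::string toClickHouseName(const ToyType & type, bool is_outermost = true)
{
    std::string name;
    bool wrap = type.nullable;
    if (type.element)
    {
        name = "Array(" + toClickHouseName(*type.element, false) + ")";
        if (is_outermost)
            wrap = false;              // outermost LIST nullability is dropped
    }
    else
    {
        name = type.scalar_name;
    }
    return wrap ? "Nullable(" + name + ")" : name;
}
```

For a nullable LIST of nullable LISTs of nullable strings, this sketch renders `Array(Nullable(Array(Nullable(String))))`: the inner list keeps its `Nullable` wrapper and the outer one does not.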

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| cpp-ch/local-engine/tests/gtest_spark_row.cpp | Adds regression test for nested null handling into non-nullable complex CH types. |
| cpp-ch/local-engine/tests/gtest_local_engine.cpp | Adds tests covering convertColumnAsNecessary behavior when dropping nullability safely. |
| cpp-ch/local-engine/Storages/SubstraitSource/ParquetFormatFile.cpp | Enables Parquet reader v3 beyond flat schemas; keeps row-index path on legacy reader. |
| cpp-ch/local-engine/Parser/TypeParser.h | Extends parseType API with a control flag for LIST nullability. |
| cpp-ch/local-engine/Parser/TypeParser.cpp | Implements nested LIST nullability preservation (and drops outer LIST nullability by default). |
| cpp-ch/local-engine/Parser/SparkRowToCHColumn.cpp | Normalizes nested Fields to avoid NULL insertion into non-nullable complex columns. |
| cpp-ch/local-engine/Parser/SerializedPlanParser.cpp | Preserves origin nullability when aligning final output schema types. |
| cpp-ch/local-engine/Parser/RelParsers/ReadRelParser.cpp | Updates v3 reader labeling decision to match broadened v3 enablement. |
| cpp-ch/local-engine/Parser/ExpressionParser.cpp | Fixes null literal const-column creation and preserves nullability across casts when needed. |
| cpp-ch/local-engine/Common/CHUtil.cpp | Allows dropping Nullable -> non-nullable when there are no NULL values; otherwise throws. |
| cpp-ch/clickhouse.version | Bumps ClickHouse branch/commit to a complextype-enabled fork. |
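The CHUtil.cpp behavior summarized above (drop Nullable only when no NULL is present, otherwise throw) can be sketched as follows. The sketch models a nullable column as a vector of `std::optional` values; `dropNullability` is a hypothetical name, not the actual `convertColumnAsNecessary` implementation, and the exception type is an assumption.

```cpp
#include <optional>
#include <stdexcept>
#include <vector>

// Convert a "nullable column" (vector of optionals) into a plain column.
// The conversion succeeds only if every cell holds a value; the first NULL
// aborts it with an exception.
template <typename T>
std::vector<T> dropNullability(const std::vector<std::optional<T>> & column)
{
    std::vector<T> result;
    result.reserve(column.size());
    for (const auto & cell : column)
    {
        if (!cell)
            throw std::runtime_error("Cannot convert NULL value to non-Nullable type");
        result.push_back(*cell);
    }
    return result;
}
```

Checking the data rather than only the declared type is what makes the conversion safe: a column that is nullable by type but contains no NULLs loses nothing when the wrapper is removed.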

Comment on lines +204 to 205
if (format_settings.parquet.use_native_reader_v3 && !readRowIndex)
    source_step->setStepDescription("ParquetReaderV3");
Comment on lines 222 to +224
  // TODO: rebase-25.12, support complex types when there is a nullable type
  // for example: parquet type is Array, requested type is Nullable(Array(Nullable(String)))
- if (format_settings.parquet.use_native_reader_v3 && !readRowIndex && onlyFlatType)
+ if (format_settings.parquet.use_native_reader_v3 && !readRowIndex)
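The effect of removing the `onlyFlatType` condition can be summarized in a small standalone sketch. `ToyParquetSettings`, `ReaderKind`, and `chooseReader` are hypothetical names introduced for illustration; only the two conditions themselves come from the diff.

```cpp
// Hypothetical stand-in for the relevant part of the format settings.
struct ToyParquetSettings
{
    bool use_native_reader_v3 = false;
};

enum class ReaderKind { Legacy, NativeV3 };

// After this change the choice depends only on the setting and on whether
// row-index virtual columns are requested; schema flatness is no longer
// part of the condition.
ReaderKind chooseReader(const ToyParquetSettings & settings, bool read_row_index)
{
    if (settings.use_native_reader_v3 && !read_row_index)
        return ReaderKind::NativeV3;
    return ReaderKind::Legacy;
}
```

In this sketch a request that reads the row index always falls back to the legacy reader, which matches the per-file summary for ParquetFormatFile.cpp.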
Comment on lines +131 to +134
{
DB::Field field = spark_row_reader.getField(i);
columns[i]->insert(normalizeFieldForType(std::move(field), spark_row_reader.getFieldTypes()[i]));
}