[CH] support complex type in parquetv3 #12079
Run Gluten Clickhouse CI on x86
057bcda to d28e587
28b4930 to 6758a11
Pull request overview
This PR improves ClickHouse backend correctness around nullable complex types (arrays/maps/tuples/struct-like) across Substrait type parsing, Spark-row-to-CH conversion, CAST/const literal handling, output schema alignment, join column alignment, and enabling Parquet native reader v3 for complex types.
Changes:
- Preserve nested LIST nullability semantics in `TypeParser` to correctly represent types like `Array(Nullable(Array(...)))`, while intentionally dropping outermost LIST nullability.
- Prevent inserting `Field::Null` into non-nullable complex columns by normalizing Spark row Fields and by making null literals/casts preserve nullability when required.
- Enable Parquet reader v3 for non-flat schemas (when row-index virtual columns aren't requested) and add regression tests for the affected behaviors.
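The first bullet can be illustrated with a toy model. The sketch below is not the `TypeParser` code: `ToyType`, `leaf`, `list`, `toClickHouseName`, and the `keep_list_nullability` flag are hypothetical names, and the leaf type is hard-coded to `String`. It only shows the stated rule, assuming nested lists keep their nullability while the outermost list drops it.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical, simplified stand-in for a Substrait type: either a String
// leaf or a LIST with a single element type. Not the real substrait::Type.
struct ToyType
{
    bool is_list = false;
    bool nullable = false;
    std::shared_ptr<ToyType> element; // set only when is_list is true
};

inline ToyType leaf(bool nullable)
{
    return ToyType{false, nullable, nullptr};
}

inline ToyType list(ToyType element, bool nullable)
{
    return ToyType{true, nullable, std::make_shared<ToyType>(std::move(element))};
}

// Render a ClickHouse-style type name. keep_list_nullability is a
// hypothetical flag: false at the top level (outermost LIST nullability is
// dropped), true for every nested element (nested nullability is preserved).
std::string toClickHouseName(const ToyType & type, bool keep_list_nullability = false)
{
    std::string name;
    if (type.is_list)
        name = "Array(" + toClickHouseName(*type.element, /*keep_list_nullability=*/true) + ")";
    else
        name = "String";

    const bool wrap = type.nullable && (!type.is_list || keep_list_nullability);
    return wrap ? "Nullable(" + name + ")" : name;
}
```

Under these assumptions, a nullable list of nullable lists of nullable strings renders with the inner `Nullable(Array(...))` intact and no `Nullable` wrapper on the outer array.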
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| cpp-ch/local-engine/tests/gtest_spark_row.cpp | Adds regression test for nested null handling into non-nullable complex CH types. |
| cpp-ch/local-engine/tests/gtest_local_engine.cpp | Adds tests covering convertColumnAsNecessary behavior when dropping nullability safely. |
| cpp-ch/local-engine/Storages/SubstraitSource/ParquetFormatFile.cpp | Enables Parquet reader v3 beyond flat schemas; keeps row-index path on legacy reader. |
| cpp-ch/local-engine/Parser/TypeParser.h | Extends parseType API with a control flag for LIST nullability. |
| cpp-ch/local-engine/Parser/TypeParser.cpp | Implements nested LIST nullability preservation (and drops outer LIST nullability by default). |
| cpp-ch/local-engine/Parser/SparkRowToCHColumn.cpp | Normalizes nested Fields to avoid NULL insertion into non-nullable complex columns. |
| cpp-ch/local-engine/Parser/SerializedPlanParser.cpp | Preserves origin nullability when aligning final output schema types. |
| cpp-ch/local-engine/Parser/RelParsers/ReadRelParser.cpp | Updates v3 reader labeling decision to match broadened v3 enablement. |
| cpp-ch/local-engine/Parser/ExpressionParser.cpp | Fixes null literal const-column creation and preserves nullability across casts when needed. |
| cpp-ch/local-engine/Common/CHUtil.cpp | Allows dropping Nullable -> non-nullable when there are no NULL values; otherwise throws. |
| cpp-ch/clickhouse.version | Bumps ClickHouse branch/commit to a complextype-enabled fork. |
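The `CHUtil.cpp` row describes a conversion that succeeds only when no NULL values are present. A minimal sketch of that rule follows, using `std::optional` as a stand-in for a nullable column. `dropNullability` is a hypothetical name, and the exception type is an assumption; the real code operates on ClickHouse columns and throws its own exception.

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <vector>

// Toy model: a "nullable column" is a vector of optionals, and a
// "non-nullable column" is a plain vector. Hypothetical helper, not the
// ClickHouse API.
template <typename T>
std::vector<T> dropNullability(const std::vector<std::optional<T>> & column)
{
    std::vector<T> result;
    result.reserve(column.size());
    for (const auto & value : column)
    {
        // A single NULL makes the conversion impossible, so fail loudly
        // instead of silently inventing a value.
        if (!value.has_value())
            throw std::invalid_argument("Cannot convert NULL value to non-Nullable type");
        result.push_back(*value);
    }
    return result;
}
```

Failing on the first NULL keeps the conversion lossless: callers either get exactly the same values without the nullable wrapper, or an error.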

    if (format_settings.parquet.use_native_reader_v3 && !readRowIndex)
        source_step->setStepDescription("ParquetReaderV3");

    // TODO: rebase-25.12, support complex types when there is a nullable type
    // for example: parquet type is Array, requested type is Nullable(Array(Nullable(String)))
    if (format_settings.parquet.use_native_reader_v3 && !readRowIndex && onlyFlatType)
    if (format_settings.parquet.use_native_reader_v3 && !readRowIndex)
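The reader choice in these snippets reduces to one predicate. The sketch below restates it in isolation, assuming (per the review summary) that the flat-schema requirement is gone and only the row-index path stays on the legacy reader. `ToyParquetSettings`, `ParquetReaderKind`, and `chooseReader` are hypothetical names, not part of the code base.

```cpp
#include <cassert>

// Hypothetical stand-in for the relevant part of the format settings.
struct ToyParquetSettings
{
    bool use_native_reader_v3 = false;
};

enum class ParquetReaderKind
{
    Legacy,
    NativeV3
};

// v3 is chosen whenever it is enabled and no row-index virtual column is
// requested; schema shape (flat or nested) no longer affects the decision.
ParquetReaderKind chooseReader(const ToyParquetSettings & settings, bool read_row_index)
{
    if (settings.use_native_reader_v3 && !read_row_index)
        return ParquetReaderKind::NativeV3;
    return ParquetReaderKind::Legacy;
}
```

Keeping the predicate in one place is one way to ensure the step label and the reader actually used cannot drift apart.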

    {
        DB::Field field = spark_row_reader.getField(i);
        columns[i]->insert(normalizeFieldForType(std::move(field), spark_row_reader.getFieldTypes()[i]));
    }
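The body of `normalizeFieldForType` is not shown in this conversation. One plausible reading, sketched below with toy types, is that a NULL arriving for a non-nullable type is replaced by that type's default value, recursively through nested arrays. `ToyField`, `ToyColumnType`, and `normalizeField` are hypothetical, and the default-value substitution is an assumption rather than the confirmed behavior.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>
#include <variant>
#include <vector>

// Toy value: NULL (monostate), an integer, or an array of values.
struct ToyField;
using ToyArray = std::vector<ToyField>;
struct ToyField
{
    std::variant<std::monostate, std::int64_t, ToyArray> value;
    bool isNull() const { return std::holds_alternative<std::monostate>(value); }
};

// Toy type descriptor: Int64 or Array(element), each optionally nullable.
struct ToyColumnType
{
    enum class Kind { Int64, Array };
    Kind kind = Kind::Int64;
    bool nullable = false;
    std::shared_ptr<ToyColumnType> element; // set only for Kind::Array
};

ToyField normalizeField(ToyField field, const ToyColumnType & type)
{
    if (field.isNull())
    {
        if (type.nullable)
            return field; // NULL is representable, keep it
        // Assumption: substitute the type's default for an unrepresentable NULL.
        if (type.kind == ToyColumnType::Kind::Array)
            return ToyField{ToyArray{}};
        return ToyField{std::int64_t{0}};
    }
    // Recurse so that NULLs nested inside arrays are normalized as well.
    if (type.kind == ToyColumnType::Kind::Array)
        if (auto * items = std::get_if<ToyArray>(&field.value))
            for (auto & item : *items)
                item = normalizeField(std::move(item), *type.element);
    return field;
}
```

With a non-nullable `Array(Array(Int64))` type, a NULL inner array becomes an empty array in this sketch, which avoids the insert path ever seeing a NULL it cannot store.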
What changes are proposed in this pull request?
Depends on Kyligence/ClickHouse#523.
This PR fixes several ClickHouse backend failures around nullable complex types, especially arrays, structs, and nested null values.
The changes include:
- [...] `Array(Nullable(Array(...)))`.
- [...] `Field::Null` into non-nullable ClickHouse complex columns during Spark row conversion and constant column creation.
- [...] `Nullable(Tuple(...))`, including reconstructing struct-level nullability from nullable child fields.

These fixes prevent errors such as `Bad get: has Null, requested Array` and `Cannot convert NULL value to non-Nullable type`, and also restore correct semantics for array lambda operations over nullable structs.

Related to #2340.
How was this patch tested?
Tested with affected ClickHouse backend function validation cases, including:
- `last_day`
- `approx_count_distinct`

Also verified C++ compilation for the touched objects, including parser and Parquet reader components.
Was this patch authored or co-authored using generative AI tooling?
Cowork with GPT-5.5