Skip to content

Conversation

@andygrove
Copy link
Member

@andygrove andygrove commented Dec 13, 2025

Which issue does this PR close?

Closes #.

Rationale for this change

Example schema (partial):

root
 |-- c_1: long (nullable = true)
 |-- c_2: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- c_4: struct (nullable = true)
 |    |-- c_5: decimal(10,2) (nullable = true)
 |    |-- c_6: string (nullable = true)
 |    |-- c_7: long (nullable = true)
 |    |-- c_8: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- c_10: array (nullable = true)
 |    |    |-- element: decimal(10,2) (containsNull = true)
 |    |-- c_12: string (nullable = true)
 |-- c_13: struct (nullable = true)
 |    |-- c_14: array (nullable = true)
 |    |    |-- element: long (containsNull = true)
 |    |-- c_16: struct (nullable = true)
 |    |    |-- c_17: long (nullable = true)
 |    |    |-- c_18: long (nullable = true)
 |    |    |-- c_19: string (nullable = true)
 |    |    |-- c_20: string (nullable = true)
 |    |    |-- c_21: decimal(10,2) (nullable = true)
 |    |    |-- c_22: long (nullable = true)
 |    |    |-- c_23: decimal(10,2) (nullable = true)
 |    |    |-- c_24: decimal(10,2) (nullable = true)
 |    |    |-- c_25: long (nullable = true)
...
OpenJDK 64-Bit Server VM 17.0.17+10-Ubuntu-122.04 on Linux 6.8.0-87-generic
AMD Ryzen 9 7950X3D 16-Core Processor
SQL Deeply Nested (depth=2) Shuffle (5 Partition):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------
SQL Parquet - Spark                                          426            462          24          0.0       42607.4       1.0X
SQL Parquet - Comet (Spark Shuffle)                          389            431          28          0.0       38927.6       1.1X
SQL Parquet - Comet (JVM Shuffle)                            548            597          48          0.0       54806.4       0.8X
SQL Parquet - Comet (Native Shuffle)                         377            385           8          0.0       37723.2       1.1X

OpenJDK 64-Bit Server VM 17.0.17+10-Ubuntu-122.04 on Linux 6.8.0-87-generic
AMD Ryzen 9 7950X3D 16-Core Processor
SQL Deeply Nested (depth=2) Shuffle (201 Partition):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-----------------------------------------------------------------------------------------------------------------------------------
SQL Parquet - Spark                                            451            508          48          0.0       45103.4       1.0X
SQL Parquet - Comet (Spark Shuffle)                            398            406          10          0.0       39764.0       1.1X
SQL Parquet - Comet (JVM Shuffle)                             2158           2181          32          0.0      215831.0       0.2X
SQL Parquet - Comet (Native Shuffle)                           402            435          46          0.0       40230.3       1.1X

What changes are included in this PR?

How are these changes tested?

@andygrove andygrove marked this pull request as ready for review December 13, 2025 18:57
@codecov-commenter
Copy link

codecov-commenter commented Dec 13, 2025

Codecov Report

❌ Patch coverage is 74.19355% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.44%. Comparing base (f09f8af) to head (6bd7cc6).
⚠️ Report is 775 commits behind head on main.

Files with missing lines Patch % Lines
...a/org/apache/comet/testing/FuzzDataGenerator.scala 74.19% 0 Missing and 8 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2902      +/-   ##
============================================
+ Coverage     56.12%   59.44%   +3.31%     
- Complexity      976     1379     +403     
============================================
  Files           119      167      +48     
  Lines         11743    15384    +3641     
  Branches       2251     2557     +306     
============================================
+ Hits           6591     9145    +2554     
- Misses         4012     4945     +933     
- Partials       1140     1294     +154     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@andygrove
Copy link
Member Author

@comphead the addition to the fuzz generator for generating deeply nested schema could be useful to try and reproduce the reported issue about shuffle metrics being inaccurate

--conf spark.plugins=org.apache.spark.CometPlugin \
--conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
--conf spark.comet.exec.replaceSortMergeJoin=true \
--conf spark.comet.exec.shuffle.writeBufferSize=32000000 \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we also benchmark with shuffle batch size and record batch size?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was an accidental check in. I have reverted it since it is unrelated to nested schemas.

@comphead
Copy link
Contributor

maps are not in the scope because Comet dont support grouping by maps yet?

@comphead
Copy link
Contributor

Wondering if nested depth can be configured, I can see now 2 level max, in real world examples typically can be up to 5-6.

More than 5-6 is quite rare, but prob would be still nice to benchmark it

@andygrove
Copy link
Member Author

Wondering if nested depth can be configured, I can see now 2 level max, in real world examples typically can be up to 5-6.

More than 5-6 is quite rare, but prob would be still nice to benchmark it

The benchmarks currently run with max depth 2 and 4:

for (maxDepth <- Seq(2, 4)) {

However, due to the random schema generation approach, there is no guarantee that the schema will reach these depths. I will see how I can improve this.

@andygrove andygrove marked this pull request as draft December 16, 2025 15:33
@andygrove andygrove marked this pull request as ready for review December 16, 2025 16:04
@andygrove
Copy link
Member Author

maps are not in the scope because Comet dont support grouping by maps yet?

I was being lazy. I have added map support now.

@andygrove
Copy link
Member Author

Wondering if nested depth can be configured, I can see now 2 level max, in real world examples typically can be up to 5-6.
More than 5-6 is quite rare, but prob would be still nice to benchmark it

The benchmarks currently run with max depth 2 and 4:

for (maxDepth <- Seq(2, 4)) {

However, due to the random schema generation approach, there is no guarantee that the schema will reach these depths. I will see how I can improve this.

@comphead I have now added minDepth as well.

generators += (() => generateStruct(depth + 1, name))
}
if (options.generateMap && depth < maxDepth) {
generators += (() => generateMap(depth, name))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just wondering why depth is not + 1 here like for arrays and structs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that's incorrect ... I will fix

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


// nested type shuffle
val numRows = 1000
for (generateArray <- Seq(true, false)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should map be here as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this now

@andygrove andygrove requested a review from mbutrovich December 18, 2025 14:19
Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @andygrove it it looks good to me the way it is

Copy link
Contributor

@mbutrovich mbutrovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @andygrove!

@andygrove andygrove merged commit 60c0f1e into apache:main Dec 19, 2025
134 of 137 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants