-
Notifications
You must be signed in to change notification settings - Fork 267
chore: Add shuffle benchmark for deeply nested schemas #2902
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2902 +/- ##
============================================
+ Coverage 56.12% 59.44% +3.31%
- Complexity 976 1379 +403
============================================
Files 119 167 +48
Lines 11743 15384 +3641
Branches 2251 2557 +306
============================================
+ Hits 6591 9145 +2554
- Misses 4012 4945 +933
- Partials 1140 1294 +154 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
spark/src/test/scala/org/apache/spark/sql/benchmark/CometShuffleBenchmark.scala
Outdated
Show resolved
Hide resolved
|
@comphead the addition to the fuzz generator for generating deeply nested schema could be useful to try and reproduce the reported issue about shuffle metrics being inaccurate |
dev/benchmarks/comet-tpch.sh
Outdated
| --conf spark.plugins=org.apache.spark.CometPlugin \ | ||
| --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \ | ||
| --conf spark.comet.exec.replaceSortMergeJoin=true \ | ||
| --conf spark.comet.exec.shuffle.writeBufferSize=32000000 \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we also benchmark with shuffle batch size and record batch size?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was an accidental check in. I have reverted it since it is unrelated to nested schemas.
|
maps are not in the scope because Comet dont support grouping by maps yet? |
|
Wondering if nested depth can be configured, I can see now 2 level max, in real world examples typically can be up to 5-6. More than 5-6 is quite rare, but prob would be still nice to benchmark it |
The benchmarks currently run with max depth 2 and 4: However, due to the random schema generation approach, there is no guarantee that the schema will reach these depths. I will see how I can improve this. |
I was being lazy. I have added map support now. |
@comphead I have now added |
| generators += (() => generateStruct(depth + 1, name)) | ||
| } | ||
| if (options.generateMap && depth < maxDepth) { | ||
| generators += (() => generateMap(depth, name)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just wondering why depth is not + 1 here like for arrays and structs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, that's incorrect ... I will fix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
|
|
||
| // nested type shuffle | ||
| val numRows = 1000 | ||
| for (generateArray <- Seq(true, false)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should map be here as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this now
spark/src/main/scala/org/apache/comet/testing/FuzzDataGenerator.scala
Outdated
Show resolved
Hide resolved
comphead
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @andygrove it it looks good to me the way it is
mbutrovich
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @andygrove!
Which issue does this PR close?
Closes #.
Rationale for this change
Example schema (partial):
What changes are included in this PR?
How are these changes tested?