-
Notifications
You must be signed in to change notification settings - Fork 196
#755 - Support column stats for paimon #767
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| internalSchema.getAllFields().stream() | ||
| .collect(Collectors.toMap(InternalField::getPath, f -> f)); | ||
|
|
||
| // stats for all columns are present in valueStats, we can safely ignore file.keyStats() - TODO: validate this assumption |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q to Paimon experts: Is this assumption valid?
| List<String> colNames = file.valueStatsCols(); | ||
| // log.info("valueStatsCols: {}", colNames); | ||
| if (colNames == null || colNames.isEmpty()) { | ||
| // if column names are not present, we assume all columns in the schema are present in the same order as the schema - TODO: validate this assumption |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q to Paimon experts: Is this assumption valid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mao-liu are you in contact with anyone on the Paimon side to get these questions answered?
xtable-core/src/main/java/org/apache/xtable/paimon/PaimonDataFileExtractor.java
Outdated
Show resolved
Hide resolved
| long tsMillis = timestamp.toEpochMilli(); | ||
|
|
||
| // according to docs for org.apache.xtable.model.stat.Range, timestamp is stored as millis | ||
| // or micros - even if precision is higher than micros, return micros |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q: to xtable maintainers - is this assumption valid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this assumption is correct. Right now there is no support for other precisions so I would expect that an error is thrown if some other precision is found.
c0cabf3 to
2757735
Compare
2757735 to
6e927e2
Compare
| // TODO: Implement logic to extract column stats from the file meta | ||
| // https://github.com/apache/incubator-xtable/issues/755 | ||
| return Collections.emptyList(); | ||
| private List<ColumnStat> toColumnStats(DataFileMeta file, InternalSchema internalSchema) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move the column stats conversion to its own class? In the future if we add Paimon as a target then we will also need to convert to the Paimon representation and it would be nice to have all this stats logic in its own class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do, thanks for the early review @the-other-tim-brown !
I do wonder though, if it is even possible to have Paimon as a target... Paimon has a pretty unique file layout, and might not be as easily "tricked" as other formats.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know enough about Paimon to say. Hudi also has a unique native layout structure to allow for update heavy workloads though and we were able to make this work.
Mainly we do this separation to keep the logic isolated though. As not necessarily relevant to Paimon, but if a table format changes how they represent stats in a new version, we can plug in the appropriate converter based on the version.
| long tsMillis = timestamp.toEpochMilli(); | ||
|
|
||
| // according to docs for org.apache.xtable.model.stat.Range, timestamp is stored as millis | ||
| // or micros - even if precision is higher than micros, return micros |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this assumption is correct. Right now there is no support for other precisions so I would expect that an error is thrown if some other precision is found.
WORK IN PROGRESS
Important Read
Closes #755
What is the purpose of the pull request
Adds support for Paimon metadata stats
Brief change log
Verify this pull request
This change added tests and can be verified as follows: