Skip to content

Conversation

@mao-liu
Copy link

@mao-liu mao-liu commented Dec 10, 2025

WORK IN PROGRESS

  • build and test this out against a real table
  • open draft PR against upstream
  • complete test cases
  • confirm assumptions with Paimon user group
  • remove logging

Important Read

  • Please ensure the GitHub issue is mentioned at the beginning of the PR

Closes #755

What is the purpose of the pull request

Adds support for Paimon metadata stats

Brief change log

  • Adds support for Paimon file-level stats metadata
  • Adds unit tests for Paimon file stats

Verify this pull request

This change added tests and can be verified as follows:

  • Extends TestPaimonDataFileExtractorto verify the change.
  • Manually verified the change by running a job locally.

@mao-liu mao-liu changed the title Support column stats for paimon #755 - Support column stats for paimon Dec 10, 2025
internalSchema.getAllFields().stream()
.collect(Collectors.toMap(InternalField::getPath, f -> f));

// stats for all columns are present in valueStats, we can safely ignore file.keyStats() - TODO: validate this assumption
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q to Paimon experts: Is this assumption valid?

List<String> colNames = file.valueStatsCols();
// log.info("valueStatsCols: {}", colNames);
if (colNames == null || colNames.isEmpty()) {
// if column names are not present, we assume all columns in the schema are present in the same order as the schema - TODO: validate this assumption
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q to Paimon experts: Is this assumption valid?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mao-liu are you in contact with anyone on the Paimon side to get these questions answered?

long tsMillis = timestamp.toEpochMilli();

// according to docs for org.apache.xtable.model.stat.Range, timestamp is stored as millis
// or micros - even if precision is higher than micros, return micros
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: to xtable maintainers - is this assumption valid?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this assumption is correct. Right now there is no support for other precisions so I would expect that an error is thrown if some other precision is found.

@mao-liu mao-liu force-pushed the feat/paimon-column-stats branch 3 times, most recently from c0cabf3 to 2757735 Compare December 10, 2025 12:38
@mao-liu mao-liu force-pushed the feat/paimon-column-stats branch from 2757735 to 6e927e2 Compare December 10, 2025 13:03
// TODO: Implement logic to extract column stats from the file meta
// https://github.com/apache/incubator-xtable/issues/755
return Collections.emptyList();
private List<ColumnStat> toColumnStats(DataFileMeta file, InternalSchema internalSchema) {
Copy link
Contributor

@the-other-tim-brown the-other-tim-brown Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move the column stats conversion to its own class? In the future if we add Paimon as a target then we will also need to convert to the Paimon representation and it would be nice to have all this stats logic in its own class.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do, thanks for the early review @the-other-tim-brown !

I do wonder though, if it is even possible to have Paimon as a target... Paimon has a pretty unique file layout, and might not be as easily "tricked" as other formats.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know enough about Paimon to say. Hudi also has a unique native layout structure to allow for update heavy workloads though and we were able to make this work.

Mainly we do this separation to keep the logic isolated though. As not necessarily relevant to Paimon, but if a table format changes how they represent stats in a new version, we can plug in the appropriate converter based on the version.

long tsMillis = timestamp.toEpochMilli();

// according to docs for org.apache.xtable.model.stat.Range, timestamp is stored as millis
// or micros - even if precision is higher than micros, return micros
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this assumption is correct. Right now there is no support for other precisions so I would expect that an error is thrown if some other precision is found.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Apache Paimon] Support for extracting column-level statistics on the source

2 participants