Skip to content

Conversation

@aswinshakil
Copy link
Member

What changes were proposed in this pull request?

SnapshotDeletingService incorrectly checks the Ratis Buffer Limit when we invoke submitSnapshotMoveDeletedKeys , it adds all the keys to the builder and then checks the size of the message. Later, it again adds more deletedKeys, deletedDirs and renamedKeys on it, The message size is already over the ratis buffer limit, instead of creating a new builder we re-use the same builder. Ideally we should stop iterating through the snapshot-related tables when we hit the buffer limit even before going to submitSnapshotMoveDeletedKeys

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14432

How was this patch tested?

Existing Tests

@aswinshakil aswinshakil requested a review from smengcl January 21, 2026 17:46
@aswinshakil aswinshakil self-assigned this Jan 21, 2026
@aswinshakil aswinshakil added the snapshot https://issues.apache.org/jira/browse/HDDS-6517 label Jan 21, 2026
@smengcl smengcl requested a review from Copilot January 21, 2026 21:37
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes incorrect Ratis buffer limit checking in SnapshotDeletingService. The old implementation would add all snapshot-related entries to a request builder before checking if the message size exceeded the buffer limit, then attempted to recover by truncating the lists. This approach was flawed as it could create oversized messages.

Changes:

  • Replaced the flawed size-checking logic with progressive batch building that checks size limits before adding each entry
  • Introduced proper batching across deletedKeys, renamedKeys, and deletedDirs with early termination on failures
  • Improved error handling by returning the count of successfully submitted entries for accurate retry tracking

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +314 to +339
long batchBytes = 0;
int totalSubmitted = 0;
int batchCount = 0;

for (SnapshotMoveKeyInfos key : deletedKeys) {
int keySize = key.getSerializedSize();

// If adding this key would exceed the limit, flush the current batch first
if (batchBytes + keySize > ratisByteLimit && !currentDeletedKeys.isEmpty()) {
batchCount++;
LOG.debug("Submitting batch {} for snapshot {} with {} deletedKeys, {} renamedKeys, {} deletedDirs, " +
"size: {} bytes", batchCount, snapInfo.getTableKey(), currentDeletedKeys.size(),
currentRenamedKeys.size(), currentDeletedDirs.size(), batchBytes);

if (!submitSingleSnapshotMoveBatch(snapInfo, currentDeletedKeys, currentRenamedKeys, currentDeletedDirs)) {
return totalSubmitted;
}

totalSubmitted += currentDeletedKeys.size();
currentDeletedKeys.clear();
batchBytes = 0;
}

currentDeletedKeys.add(key);
batchBytes += keySize;
}
Copy link

Copilot AI Jan 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The batch size calculation sums the serialized sizes of individual entries but doesn't account for the protocol buffer message overhead from SnapshotMoveTableKeysRequest and OMRequest wrappers. While the 10% safety margin in ratisByteLimit (line 113) may provide some buffer, consider explicitly accounting for base message overhead or documenting this limitation to ensure batches reliably stay under the limit even with many small entries.

Copilot uses AI. Check for mistakes.
}

totalSubmitted += currentDeletedKeys.size();
currentDeletedKeys.clear();
Copy link

Copilot AI Jan 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After flushing a batch during deletedKeys processing, only currentDeletedKeys is cleared (line 333), while currentRenamedKeys and currentDeletedDirs are not explicitly cleared. Although they should be empty at this point, for consistency and defensive programming, consider clearing all three lists after every flush to maintain a consistent state. This pattern is followed correctly in the later flush points (lines 357-359 and 382-385).

Suggested change
currentDeletedKeys.clear();
currentDeletedKeys.clear();
currentRenamedKeys.clear();
currentDeletedDirs.clear();

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

snapshot https://issues.apache.org/jira/browse/HDDS-6517

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant