Make citus_create_restore_point MX-safe by blocking 2PC commit decisions #8352
DESCRIPTION: Make citus_create_restore_point MX-safe by blocking 2PC commit decisions
Problem:
In coordinator-only mode, citus_create_restore_point() creates consistent restore points by blocking distributed writes at the coordinator level, which is safe because all distributed transactions are coordinated through the coordinator.
However, in MX mode (multi-writer), any worker with metadata can initiate distributed transactions. The existing implementation only blocks writes at the coordinator, so metadata workers can continue making 2PC commit decisions. This can leave the cluster in an inconsistent state, where restore points on different nodes reflect different sets of committed transactions.
Solution:
Block distributed transaction commit decisions cluster-wide by acquiring ExclusiveLock on pg_dist_transaction on all metadata nodes (coordinator and MX workers). Additionally, on the coordinator only, lock pg_dist_node and pg_dist_partition to prevent topology and schema changes.
This selective locking strategy is based on the MX mode architecture:
The implementation:
Key Insight - Why No Transaction Drainage Is Needed:
The commit decision in Citus 2PC occurs when LogTransactionRecord() writes to pg_dist_transaction (using RowExclusiveLock for the insert), which happens BEFORE the writer's local commit (in the PRE_COMMIT callback).
By holding ExclusiveLock on pg_dist_transaction:
This creates a clean cut point for consistency without requiring us to drain in-flight transactions. The restore point captures the exact state of committed transactions across the cluster.
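The cut-point argument can be illustrated with a small toy model. This is only a sketch, not Citus code: the `MetadataNode` class, its methods, and the transaction names are hypothetical, and the conflict table covers just the three lock modes discussed here (following the standard PostgreSQL table-level lock conflict rules).

```python
# Toy model of one metadata node's pg_dist_transaction table.
# Only the lock modes relevant to this PR are modeled; the conflict sets
# follow PostgreSQL's documented table-level lock conflict matrix.
CONFLICTS = {
    "AccessShareLock": {"AccessExclusiveLock"},
    "RowExclusiveLock": {
        "ShareLock", "ShareRowExclusiveLock",
        "ExclusiveLock", "AccessExclusiveLock",
    },
    "ExclusiveLock": {
        "RowShareLock", "RowExclusiveLock", "ShareUpdateExclusiveLock",
        "ShareLock", "ShareRowExclusiveLock",
        "ExclusiveLock", "AccessExclusiveLock",
    },
}


def can_acquire(requested, held):
    """Return True if `requested` conflicts with none of the held modes."""
    return all(mode not in CONFLICTS[requested] for mode in held)


class MetadataNode:
    """Hypothetical stand-in for a coordinator or MX worker."""

    def __init__(self):
        self.held = []        # lock modes currently held on pg_dist_transaction
        self.records = set()  # commit decisions already logged

    def lock_for_restore_point(self):
        # What the restore point session would take on this node.
        self.held.append("ExclusiveLock")

    def unlock(self):
        self.held.remove("ExclusiveLock")

    def log_transaction_record(self, gid):
        # The insert needs RowExclusiveLock; in this model a blocked writer
        # simply reports False instead of waiting.
        if not can_acquire("RowExclusiveLock", self.held):
            return False
        self.records.add(gid)
        return True

    def can_read(self):
        # Plain reads take AccessShareLock, which ExclusiveLock allows.
        return can_acquire("AccessShareLock", self.held)


node = MetadataNode()
node.log_transaction_record("txn_before")           # decided before the cut
node.lock_for_restore_point()
blocked = not node.log_transaction_record("txn_after")  # cannot decide yet
print(sorted(node.records), blocked, node.can_read())
```

In the model, a transaction that logged its record before the lock was taken is part of the restore point, and one that arrives afterwards cannot record a commit decision until the lock is released, while readers of pg_dist_transaction are unaffected.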
Recovery Correctness:
The maintenance daemon's recovery logic relies on the presence of pg_dist_transaction records to determine whether to COMMIT PREPARED or ROLLBACK PREPARED. Our blocking ensures that:
Since we create restore points while holding these locks, all nodes capture the same set of commit decisions, ensuring cluster-wide consistency.
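The recovery rule described above can be sketched as a small decision function. This is a simplified illustration under stated assumptions, not the maintenance daemon's actual code: the function names and transaction identifiers are hypothetical, and real recovery has to handle additional cases that this sketch ignores.

```python
def recovery_action(prepared_gid, commit_records):
    """Decide the fate of one prepared transaction found on a node.

    commit_records stands in for the set of transaction identifiers
    present in pg_dist_transaction on the initiating node.
    """
    if prepared_gid in commit_records:
        return "COMMIT PREPARED"
    return "ROLLBACK PREPARED"


def recover(prepared_gids, commit_records):
    """Map every prepared transaction to its recovery action."""
    return {gid: recovery_action(gid, commit_records) for gid in prepared_gids}


# Hypothetical state captured at the restore point: txn_1 logged its commit
# decision before the lock was taken, txn_2 was still waiting to log it.
commit_records = {"txn_1"}
prepared_on_worker = ["txn_1", "txn_2"]

for gid, action in recover(prepared_on_worker, commit_records).items():
    print(gid, action)
```

Because no new record can be written while the locks are held, the set passed as `commit_records` is the same regardless of which node's restore point is used, so every prepared transaction resolves the same way on every node.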
Backward Compatibility:
Regression Test cases:
TODO