Conversation

@codeforall (Contributor)
DESCRIPTION: Make citus_create_restore_point MX-safe by blocking 2PC commit decisions

Problem:

In coordinator-only mode, citus_create_restore_point() creates consistent restore points by blocking distributed writes at the coordinator level, which is safe because all distributed transactions are coordinated through the coordinator.

However, in MX mode (multi-writer), any worker with metadata can initiate distributed transactions. The existing implementation only blocks writes at the coordinator, allowing metadata workers to continue making 2PC commit decisions. This can result in an inconsistent cluster state where restore points on different nodes represent different transaction visibility.

Solution:

Block distributed transaction commit decisions cluster-wide by acquiring ExclusiveLock on pg_dist_transaction on all metadata nodes (coordinator and MX workers). Additionally, on the coordinator only, lock pg_dist_node and pg_dist_partition to prevent topology and schema changes.

This selective locking strategy is based on the MX mode architecture:

  • DDL operations (topology changes, table creation) can ONLY be executed through the coordinator node, even in MX mode
  • MX workers can only initiate distributed DML transactions (INSERT/UPDATE/DELETE) that use 2PC
  • Therefore, locking pg_dist_transaction on remote metadata nodes is sufficient to block all distributed writes they can perform, while the coordinator's additional locks on pg_dist_node and pg_dist_partition provide cluster-wide protection against DDL changes

The implementation:

  1. Opens connections to all nodes (metadata and non-metadata workers)
  2. Begins coordinated transactions on all remote connections
  3. Acquires ExclusiveLock on pg_dist_node, pg_dist_partition, and pg_dist_transaction locally on the coordinator via LockRelationOid()
  4. Acquires ExclusiveLock on pg_dist_transaction on all remote metadata nodes via SQL LOCK TABLE command (executed in parallel)
  5. Creates restore points on all nodes in parallel (both metadata and non-metadata nodes need WAL restore points)
  6. Closes remote connections, which releases locks via implicit ROLLBACK

Key Insight - Why No Transaction Drainage Is Needed:

The commit decision in Citus 2PC occurs when LogTransactionRecord() writes to pg_dist_transaction (using RowExclusiveLock for the insert), which happens BEFORE the writer's local commit (in the PRE_COMMIT callback).

By holding ExclusiveLock on pg_dist_transaction:

  • Transactions that have already recorded their commit decision (already inserted their row) will complete normally
  • Transactions that haven't recorded their commit decision yet will block on the ExclusiveLock (which conflicts with the RowExclusiveLock needed for inserts), preventing them from proceeding

This creates a clean cut point for consistency without requiring us to drain in-flight transactions. The restore point captures the exact state of committed transactions across the cluster.

Recovery Correctness:

The maintenance daemon's recovery logic relies on the presence of pg_dist_transaction records to determine whether to COMMIT PREPARED or ROLLBACK PREPARED. Our blocking ensures that:

  • Prepared transactions WITH commit records will be committed on recovery
  • Prepared transactions WITHOUT commit records will be rolled back on recovery

Since we create restore points while holding these locks, all nodes capture the same set of commit decisions, ensuring cluster-wide consistency.

Backward Compatibility:

  • Return type unchanged: still returns coordinator LSN (pg_lsn)
  • Coordinator-only mode: unchanged behavior
  • MX mode: automatic detection and enhanced safety (transparent)
  • No SQL function signature changes required

Regression Test cases:

TODO

codecov bot commented Nov 20, 2025

Codecov Report

❌ Patch coverage is 69.23077% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.94%. Comparing base (662b724) to head (0d32050).

❌ Your patch check has failed because the patch coverage (69.23%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8352      +/-   ##
==========================================
- Coverage   88.95%   88.94%   -0.02%     
==========================================
  Files         287      287              
  Lines       63151    63186      +35     
  Branches     7942     7944       +2     
==========================================
+ Hits        56176    56199      +23     
- Misses       4667     4672       +5     
- Partials     2308     2315       +7     
