Fix read-only open during checkpoint from reading inconsistent state#615
Fix read-only open during checkpoint from reading inconsistent state#615ericyuanhui wants to merge 1 commit into
Conversation
Signed-off-by: ericyuanhui <285521263@qq.com>
| tableEpochWatermarksByManager; | ||
| std::unordered_map<catalog::Catalog*, uint64_t> catalogVersionAtCheckpointByCatalog; | ||
| std::unordered_map<StorageManager*, uint64_t> pageManagerVersionAtCheckpointByManager; | ||
| std::unique_ptr<common::FileInfo> checkpointIntentLockFile; |
There was a problem hiding this comment.
Could you describe the thought process here? Why not lock the main database file instead of creating new intent/apply files?
There was a problem hiding this comment.
the core reason why we cannot rely solely on the primary database file lock is straightforward: this fix targets the consistency window during checkpoints, rather than simply controlling which processes can open the primary data file.
Let’s break down the problem point by point.
- Primary database locks cannot represent the phase semantics of a checkpoint
A primary file lock can only indicate whether a file is occupied. However, this bug requires us to distinguish three distinct states:
Normal readable state
Dangerous intermediate state while a checkpoint is in progress
Stable state after the checkpoint completes
This change introduces a two-phase locking mechanism: intent lock and apply lock. It explicitly marks the "checkpoint-in-progress" state instead of conflating it with generic file occupancy status. The relevant code is defined in storage_utils.h, and the locking logic is implemented in checkpointer.cpp. - Sole reliance on the primary lock leads to excessive blocking with overly coarse-grained semantics
If read-only open operations contend for the primary file lock to avoid race conditions, we end up with two undesirable outcomes:
All read-only openings fail whenever a writer process exists, resulting in overly conservative behavior.
Read-only access is permitted to proceed, yet we lack a way to precisely block only during the risky checkpoint window.
The goal of this patch is to block only unsafe time windows while avoiding unnecessary interruptions in other scenarios. This is why we introduced dedicated checkpoint coordination locks and status checks, rather than lumping all logic into the primary file lock. - The issue involves multi-file consistency, which a single-file lock cannot enforce
As highlighted in your summary, read-only recovery may encounter intermediate artifacts such as shadow files and checkpoint WAL entries.
The inconsistency window spans the main data file, shadow files and checkpoint WAL files. Locking only the primary file cannot verify whether the entire set of files is in an atomically recoverable state.
The corresponding safeguards are implemented in WALReplayer:
First perform checkpoint lock detection for read-only mode (implemented in wal_replayer.cpp).
Immediately trigger fail-fast errors for intermediate states involving shadow files or frozen WALs (also in wal_replayer.cpp). - Primary file locks become ineffective after process crashes
OS file locks are bound to the process lifecycle. Locks are automatically released when a process crashes, yet partial intermediate files may remain on disk.
To address this, the patch adds checks based on the existence of state files plus validation for metadata page ranges. Any leftover partial state after a crash will result in a controlled failure instead of blind reads that trigger segmentation faults.
Defensive validation for page ranges is added in the Checkpointer read path, converting invalid memory access crashes into managed exceptions. The code resides in checkpointer.cpp. - The commit essentially converts race conditions into protocol-defined failure modes
The final paragraph of your summary already captures the essence:
The fix does not aim to eliminate failures entirely. Instead, it ensures the system fails explicitly whenever it hits the unsafe window, eliminating undefined behavior.
This objective requires four components:
Explicit checkpoint coordination with intent and apply locks
Pre-emptive rejection logic for recovery (fail-fast on read-only opens)
Defensive validation of checkpoint metadata
Resource cleanup on lock acquisition failures to prevent file descriptor and handle leaks (implemented in local_file_system.cpp)
Conclusion:
The primary database file lock is a necessary but insufficient condition. It only enforces mutual exclusion, yet cannot implement the phase protocol, multi-file consistency boundaries, or detection of leftover intermediate states after crashes that this change requires.
|
Note that DuckDB and SQLite take different approaches here: https://claude.ai/share/9ce5d5a8-bf34-46cb-87e7-9f1636491c4f |
ladybug is same like SQlite right? |
|
Do you use a filesystem that supports snapshots? From a maintainability and complexity perspective, I prefer a solution like this: https://dev.to/shiviyer/step-by-step-implementation-filesystem-snapshots-in-postgresql-39ci The SQLite method works even if the reader doesn't have write permissions to the dir where the database resides. Also our internals are different from SQLite and the development methodology is not as tightly controlled. That requires us to operate with simpler abstractions. |
Summary
This change fixes a bug where opening the database in read-only mode could race with an in-progress checkpoint and end up reading inconsistent on-disk state. In that situation, recovery could observe intermediate checkpoint artifacts and continue loading from partially updated metadata, which could lead to crashes such as segmentation faults.
To address that, the fix makes checkpoint progress explicit and blocks read-only recovery from entering that unsafe window. It adds coordination around checkpoint execution so read-only opens can detect that a checkpoint is underway and fail fast instead of trying to recover from an intermediate state. It also adds defensive validation when reading checkpoint metadata so invalid or out-of-range page information is rejected with a controlled error rather than being used blindly. In addition, it cleans up a related resource-handling issue in file locking so failed lock attempts do not leak OS handles.
In short, the change turns a potentially unsafe concurrent recovery path into a safe, well-defined failure mode, and adds extra validation to prevent corrupted or partial checkpoint state from causing low-level crashes.
What Changed
Added two checkpoint lock files:*.checkpoint.intent.lock
*.checkpoint.apply.lock
Added checkpoint lock acquisition/release in Checkpointeracquire at checkpoint start
release on cleanup and rollback
Added eager creation of checkpoint lock files for non-read-only persistent databases
Added read-only recovery protection in WALReplayerif a checkpoint is in progress, opening in read-only mode now throws a RuntimeException
read-only recovery no longer proceeds through shadow file / checkpoint WAL intermediate states
Added defensive validation when reading checkpoint headersvalidate catalog page range
validate metadata page range
validate dataFileNumPages against actual file size
Fixed file descriptor / handle leaks when file locking fails in LocalFileSystem
Updated WAL testsReadOnlyRecoveryWithShadowFile now expects failure in read-only mode
added ReadOnlyRecoveryWithCheckpointWAL