Skip to content

Commit c1ee687

Browse files
committed
Fix initialization of the WAL buffer at startup
There were two bugs: 1. We initialized the page headers on the allocated temporary WAL page even if the end-of-log was precisely at page boundary. That means we wrote beyond the end of the allocation. That readily gives an assertion failure on debug-enabled builds. Add test case and fix. 2. We initialize the WAL buffer by copying the contents of the WAL read buffer. The idea is that when we stop reading WAL, the last buffer in the reader becomes the new WAL buffer we'll write to. That's how PostgreSQL does it too, see code around comment "Tricky point here" in StartupXLOG(). However, that doesn't work with Neon. The startup procedure is a little different: we don't do normal WAL recovery and we don't read any WAL at startup, except when promoting a read replica. Vanilla PostgreSQL always reads WAL: it reads the last checkpoint record from the WAL if nothing else, but in Neon we don't necessarily read even that. In that case, the xlogreader's read buffer is still uninitialized by the time that we copy it. That's relatively harmless, the only consequence is that the initial WAL segment on local disk can contain garbage before the first WAL record that we write. That's why we haven't noticed until now. Furthermore, it seems that the uninitialized memory just happens to be all-zeros. However, it now caused the test_pg_waldump.py test to fail with the new communicator implementation. That was very coincidental - the new communicator process isn't even running yet when the WAL buffer is initialized. It seems to have changed the memory allocation just so that the uninitialized memory is no longer all-zeros. That's normally harmless too, but it makes the pg_waldump test to fail: pg_waldump, with the --ignore option, starts reading the WAL from the first non-zero bytes, so when the uninitialized portion was filled with garbage rather than zeros, it fails. This little patch to poison the allocated buffer with garbage was helpful while debugging, to make the test fail in a repeatable fashion with or without the new communicator: ``` diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index 988be3f..2f4844c2b86 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -97,6 +97,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir, */ state->readBuf = (char *) palloc_extended(XLOG_BLCKSZ, MCXT_ALLOC_NO_OOM); + memset(state->readBuf, 0x7e, XLOG_BLCKSZ); if (!state->readBuf) { pfree(state); ```
1 parent f1da947 commit c1ee687

File tree

1 file changed

+55
-48
lines changed
  • src/backend/access/transam

1 file changed

+55
-48
lines changed

src/backend/access/transam/xlog.c

Lines changed: 55 additions & 48 deletions
Original file line numberDiff line numberDiff line change
@@ -8001,6 +8001,16 @@ StartupXLOG(void)
80018001
Assert(!WalRcvStreaming());
80028002
StandbyMode = false;
80038003

8004+
/*
8005+
* We cannot start generating new WAL if we don't have a valid prev-LSN
8006+
* to use for the first new WAL record. (Shouldn't happen.)
8007+
*/
8008+
if (NeonRecoveryRequested &&!neonWriteOk)
8009+
ereport(ERROR,
8010+
(errmsg("cannot start in read-write mode from this base backup")));
8011+
8012+
// FIXME: should we unlink neon.signal?
8013+
80048014
/*
80058015
* Determine where to start writing WAL next.
80068016
*
@@ -8009,62 +8019,58 @@ StartupXLOG(void)
80098019
* valid or last applied record, so we can identify the exact endpoint of
80108020
* what we consider the valid portion of WAL.
80118021
*
8012-
* When starting from a neon base backup, we don't have WAL. Initialize
8013-
* the WAL page where we will start writing new records from scratch,
8014-
* instead.
8022+
* With neon, it's possible that we start without having read any WAL
8023+
* whatsoever. In that case, initialize the WAL page where we will
8024+
* start writing new records from scratch, instead.
80158025
*/
8016-
if (NeonRecoveryRequested)
8026+
if (NeonRecoveryRequested && EndRecPtr == RedoStartLSN)
80178027
{
8018-
if (!neonWriteOk)
8019-
{
8020-
/*
8021-
* We cannot start generating new WAL if we don't have a valid prev-LSN
8022-
* to use for the first new WAL record. (Shouldn't happen.)
8023-
*/
8024-
ereport(ERROR,
8025-
(errmsg("cannot start in read-write mode from this base backup")));
8026-
}
8027-
else
8028+
XLogRecPtr endOfLog = EndRecPtr;
8029+
char *page;
8030+
int len;
8031+
XLogRecPtr pageBeginPtr;
8032+
8033+
pageBeginPtr = endOfLog - (endOfLog % XLOG_BLCKSZ);
8034+
8035+
len = endOfLog % XLOG_BLCKSZ;
8036+
page = xlogreader->readBuf;
8037+
8038+
if (len > 0)
80288039
{
8029-
int offs = EndRecPtr % XLOG_BLCKSZ;
8030-
XLogRecPtr lastPage = EndRecPtr - offs;
8031-
bool isLongHeader = (lastPage % wal_segment_size) == 0;
8040+
bool isLongHeader = (pageBeginPtr % wal_segment_size) == 0;
80328041
int lastPageSize = isLongHeader ? SizeOfXLogLongPHD : SizeOfXLogShortPHD;
8033-
int idx = XLogRecPtrToBufIdx(lastPage);
8034-
char *page = XLogCtl->pages + idx * XLOG_BLCKSZ;
80358042
XLogPageHeader xlogPageHdr = (XLogPageHeader) page;
80368043

8037-
memcpy(page, xlogreader->readBuf, offs);
8038-
if (xlogPageHdr->xlp_magic != XLOG_PAGE_MAGIC)
8039-
{
8040-
xlogPageHdr->xlp_pageaddr = lastPage;
8041-
xlogPageHdr->xlp_magic = XLOG_PAGE_MAGIC;
8042-
xlogPageHdr->xlp_tli = ThisTimeLineID;
8043-
xlogPageHdr->xlp_info = 0;
8044-
/*
8045-
* If we start writing with offset from page beginning, pretend in
8046-
* page header there is a record ending where actual data will
8047-
* start.
8048-
*/
8049-
xlogPageHdr->xlp_rem_len = offs - lastPageSize;
8050-
if (xlogPageHdr->xlp_rem_len > 0)
8051-
xlogPageHdr->xlp_info |= XLP_FIRST_IS_CONTRECORD;
8052-
readOff = XLogSegmentOffset(lastPage, wal_segment_size);
8044+
Assert(len >= lastPageSize);
80538045

8054-
if (isLongHeader)
8055-
{
8056-
XLogLongPageHeader longHdr = (XLogLongPageHeader) page;
8046+
xlogPageHdr->xlp_pageaddr = pageBeginPtr;
8047+
xlogPageHdr->xlp_magic = XLOG_PAGE_MAGIC;
8048+
xlogPageHdr->xlp_tli = recoveryTargetTLI;
8049+
xlogPageHdr->xlp_info = 0;
8050+
/*
8051+
* If we start writing with offset from page beginning, pretend in
8052+
* page header there is a record ending where actual data will
8053+
* start.
8054+
*/
8055+
xlogPageHdr->xlp_rem_len = len - lastPageSize;
8056+
if (xlogPageHdr->xlp_rem_len > 0)
8057+
xlogPageHdr->xlp_info |= XLP_FIRST_IS_CONTRECORD;
8058+
readOff = XLogSegmentOffset(pageBeginPtr, wal_segment_size);
80578059

8058-
longHdr->xlp_sysid = GetSystemIdentifier();
8059-
longHdr->xlp_seg_size = wal_segment_size;
8060-
longHdr->xlp_xlog_blcksz = XLOG_BLCKSZ;
8060+
if (isLongHeader)
8061+
{
8062+
XLogLongPageHeader longHdr = (XLogLongPageHeader) page;
80618063

8062-
xlogPageHdr->xlp_info |= XLP_LONG_HEADER;
8063-
}
8064-
}
8065-
elog(LOG, "Continue writing WAL at %X/%X", LSN_FORMAT_ARGS(EndRecPtr));
8064+
longHdr->xlp_sysid = GetSystemIdentifier();
8065+
longHdr->xlp_seg_size = wal_segment_size;
8066+
longHdr->xlp_xlog_blcksz = XLOG_BLCKSZ;
80668067

8067-
// FIXME: should we unlink neon.signal?
8068+
xlogPageHdr->xlp_info |= XLP_LONG_HEADER;
8069+
}
8070+
}
8071+
else
8072+
{
8073+
Assert(readOff == XLogSegmentOffset(pageBeginPtr, wal_segment_size));
80688074
}
80698075
}
80708076
else
@@ -8085,6 +8091,8 @@ StartupXLOG(void)
80858091
*/
80868092
EndOfLogTLI = xlogreader->seg.ws_tli;
80878093

8094+
elog(LOG, "Continue writing WAL at %X/%X", LSN_FORMAT_ARGS(EndOfLog));
8095+
80888096
/*
80898097
* Complain if we did not roll forward far enough to render the backup
80908098
* dump consistent. Note: it is indeed okay to look at the local variable
@@ -8276,8 +8284,7 @@ StartupXLOG(void)
82768284
/* Copy the valid part of the last block, and zero the rest */
82778285
page = &XLogCtl->pages[firstIdx * XLOG_BLCKSZ];
82788286
len = EndOfLog % XLOG_BLCKSZ;
8279-
if (!NeonRecoveryRequested)
8280-
memcpy(page, xlogreader->readBuf, len);
8287+
memcpy(page, xlogreader->readBuf, len);
82818288
memset(page + len, 0, XLOG_BLCKSZ - len);
82828289

82838290
XLogCtl->xlblocks[firstIdx] = pageBeginPtr + XLOG_BLCKSZ;

0 commit comments

Comments
 (0)