Thursday, October 19, 2023

Postgres WAL Files and Sequence Numbers

 

The Postgres Write Ahead Log (WAL) is a functional component to the database. WAL makes a lot of key functionality possible, like Point-in-Time-Recovery backups, recovering from an event, streaming replication, and more. From time to time, those deep inside the database will need to work directly with WAL files to diagnose or recover.

Recently in working with one of Crunchy Data's customers, I came across a situation where understanding the names and sequence numbers was important. In working with several of my colleauges that commit to the Postgres project, I collected notes on some of the details inside WAL. The goal today to look at the LSN and naming convention for WAL to help users understand WAL files a little better.

Log Sequence Number

Transactions in PostgreSQL create WAL records which are ultimately appended to the WAL log (file). The position where the insert occurs is known as the Log Sequence Number (LSN). The values of LSN (of type pg_lsn) can be compared to determine the amount of WAL generated between two different offsets (in bytes). When using in this manner, it is important to know the calculation assumes the full WAL segment was used (16MB) if multiple WAL logs are used. A similar calculation to the one used here is often used to determine latency of a replica.

The LSN is a 64-bit integer, representing a position in the write-ahead log stream. This 64-bit integer is split into two segments (high 32 bits and low 32 bits). It is printed as two hexadecimal numbers separated by a slash (XXXXXXXX/YYZZZZZZ). The 'X' represents the high 32-bits of the LSN and ‘Y’ is the high 8 bits of the lower 32-bits section. The 'Z' represents the offset position in the file. Each element is a hexadecimal number. The 'X' and 'Y' values are used in the second part of the WAL file on a default PostgreSQL deployment.

WAL File

The WAL file name is in the format TTTTTTTTXXXXXXXXYYYYYYYY. Here 'T' is the timeline, 'X' is the high 32-bits from the LSN, and 'Y' is the low 32-bits of the LSN.

Start by looking at the current WAL LSN and insert LSN. The pg_current_wal_lsn is the location of the last write. The pg_current_wal_insert_lsn is the logical location and reflects data in the buffer that has not been written to disk. There is also a flush value that shows what has been written to durable storage.

[postgres] # select pg_current_wal_lsn(), pg_current_wal_insert_lsn();
 pg_current_wal_lsn | pg_current_wal_insert_lsn
--------------------+---------------------------
 76/7D000000        | 76/7D000028
(1 row)

Although you can guess the name of the WAL file based on the above output, it is best to use the pg_walfile_name function.

[postgres] # select pg_walfile_name('76/7D000028');
     pg_walfile_name
--------------------------
 00000001000000760000007D
(1 row)

Looking at the file system we see that indeed segment 00000001000000760000007D is the latest modified file. Note that if the database is idle then 00000001000000760000007D could have been the oldest file (based on O/S last modified date). The reason that is possible is due to PostgreSQL reusing older WAL segments. During the pg_wal_switch, the older file was renamed. The O/S date/time modified will only change when the file is written to. Using -i in the ls command shows the inode (first column) for the file. When files are reused, this number does not change as the file is just renamed.

$ ls -larti 00*
169034323 -rw-------   1 bpace  staff       376 Jan 30 09:13 0000000100000075000000A0.00000060.backup
169564733 -rw-------   1 bpace  staff  16777216 Feb 13 15:49 00000001000000760000007E
167120667 -rw-------   1 bpace  staff  16777216 Feb 13 15:50 00000001000000760000007F
167120673 -rw-------   1 bpace  staff  16777216 Feb 13 16:00 00000001000000760000007B
167120686 -rw-------   1 bpace  staff  16777216 Feb 13 16:18 00000001000000760000007C
169564722 -rw-------   1 bpace  staff  16777216 Feb 13 16:18 00000001000000760000007D

Let's create a small table and perform a WAL switch.

[postgres] # create table test (a char(1));
CREATE TABLE
Time: 23.770 ms

[postgres] # select pg_switch_wal();

 pg_switch_wal
--------------
76/7D018FD8
(1 row)

Looking again at the files we now see that 00000001000000760000007D file has been updated (date/time changed from O/S perspective). With some other items occurring in the background the next segment 00000001000000760000007E has also received some writes after the pg_switch_wal.

$ ls -larti 00*
169034323 -rw-------  1 bpace  staff       376 Jan 30 09:13 0000000100000075000000A0.00000060.backup
167120667 -rw-------  1 bpace  staff  16777216 Feb 13 15:50 00000001000000760000007F
167120673 -rw-------  1 bpace  staff  16777216 Feb 13 16:00 000000010000007600000080
167120686 -rw-------  1 bpace  staff  16777216 Feb 13 16:18 000000010000007600000081
169564722 -rw-------  1 bpace  staff  16777216 Feb 13 16:24 00000001000000760000007D
169564733 -rw-------  1 bpace  staff  16777216 Feb 13 16:24 00000001000000760000007E

In the newly created table, insert one record. Ensure the database is quiet and grab the current WAL LSN before and after the change. Note that we are inserting 1 byte ('a') into a single table with a single column.

[postgres] # select pg_current_wal_lsn(), pg_current_wal_insert_lsn();
 pg_current_wal_lsn | pg_current_wal_insert_lsn
--------------------+---------------------------
 76/7E000060        | 76/7E000060
(1 row)

[postgres] # insert into test (a) values ('a');
INSERT 0 1

[postgres] # select pg_current_wal_lsn(), pg_current_wal_insert_lsn();
 pg_current_wal_lsn | pg_current_wal_insert_lsn
--------------------+---------------------------
 76/7E000108        | 76/7E000108
(1 row)

Using the two LSN positions we can calculate the amount of WAL that was generated by the INSERT (168 bytes in this case).

[postgres] # select '76/7E000108'::pg_lsn - '76/7E000060'::pg_lsn size_bytes;
 size_bytes
------------
        168
(1 row)

Grab the current WAL LSN and WAL insert LSN and perform one more switch. Then list out the files.

[postgres] # select pg_switch_wal();
 pg_switch_wal
---------------
 76/7E0001D0
(1 row)


$ ls -larti 00*
169034323 -rw-------  1 bpace  staff       376 Jan 30 09:13 0000000100000075000000A0.00000060.backup
167120673 -rw-------  1 bpace  staff  16777216 Feb 13 16:00 000000010000007600000080
167120686 -rw-------  1 bpace  staff  16777216 Feb 13 16:18 000000010000007600000081
169564722 -rw-------  1 bpace  staff  16777216 Feb 13 16:24 000000010000007600000082
169564733 -rw-------  1 bpace  staff  16777216 Feb 13 16:26 00000001000000760000007E
167120667 -rw-------  1 bpace  staff  16777216 Feb 13 16:26 00000001000000760000007F

With the information we captured in the previous steps, use pg_waldump to get a human readable summary of the WAL segment contents. In the following command the starting position (-s) and ending position (-e) are specified along with the WAL file name (00000001000000760000007E). The start position was the current_wal_lsn before our INSERT and the ending position was the current_wal_lsn after our insert. Previous, just using LSNs, we determined there were 168 bytes written to WAL from our transaction. Looking at the waldump reveals that the INSERT took 103 bytes (57 for the INSERT, 46 for the COMMIT).

sh-4.4$ pg_waldump -s 76/7E000060 -e 76/7E000108 00000001000000760000007E
rmgr: Heap        len (rec/tot):     57/    57, tx:   59555584, lsn: 76/7E000060, prev 76/7E000028, desc: INSERT+INIT off 1 flags 0x08, blkref #0: rel 1663/5/53434 blk 0
rmgr: Transaction len (rec/tot):     46/    46, tx:   59555584, lsn: 76/7E0000A0, prev 76/7E000060, desc: COMMIT 2023-02-13 16:25:19.441483 EST
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 76/7E0000D0, prev 76/7E0000A0, desc: RUNNING_XACTS nextXid 59555585 latestCompletedXid 59555584 oldestRunningXid 59555585

Looking at the entire WAL file from the point before our INSERT, the INSERT itself can be seen, then the COMMIT. Finally, the Checkpoint is logged and the switch from the pg_wal_switch() that was performed. The RUNNING_XACTS entries are added if wal_level is set to replica or higher. The RUNNING_XACTS entries capture details of the current snapshot (active transactions). The last entry, SWITCH, is the pg_switch_wal that was performed.

sh-4.4$ pg_waldump 00000001000000760000007E
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 76/7E000028, prev 76/7D018FC0, desc: RUNNING_XACTS nextXid 59555584 latestCompletedXid 59555583 oldestRunningXid 59555584
rmgr: Heap        len (rec/tot):     57/    57, tx:   59555584, lsn: 76/7E000060, prev 76/7E000028, desc: INSERT+INIT off 1 flags 0x08, blkref #0: rel 1663/5/53434 blk 0
rmgr: Transaction len (rec/tot):     46/    46, tx:   59555584, lsn: 76/7E0000A0, prev 76/7E000060, desc: COMMIT 2023-02-13 16:25:19.441483 EST
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 76/7E0000D0, prev 76/7E0000A0, desc: RUNNING_XACTS nextXid 59555585 latestCompletedXid 59555584 oldestRunningXid 59555585
rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 76/7E000108, prev 76/7E0000D0, desc: RUNNING_XACTS nextXid 59555585 latestCompletedXid 59555584 oldestRunningXid 59555585
rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn: 76/7E000140, prev 76/7E000108, desc: CHECKPOINT_ONLINE redo 76/7E000108; tli 1; prev tli 1; fpw true; xid 0:59555585; oid 61620; multi 799; offset 1657; oldest xid 716 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 59555585; online
rmgr: XLOG        len (rec/tot):     24/    24, tx:          0, lsn: 76/7E0001B8, prev 76/7E000140, desc: SWITCH

Wrapping Up

I hope you don’t need to dig into WAL too often and may your pg_wal_switch events be few and far between. As a reminder, various initdb options and compile time options can change the outcome of the calculations and assumptions made. WAL is a complicated topic and any element related to WAL should be treated with great care.

No comments:

Post a Comment