Postgresql 中文操作指南
30.6. WAL Internals #
WAL 会自动启用;管理员无需采取任何操作,只须确保满足 WAL 文件的磁盘空间要求,并完成任何必要的调整(请参见 Section 30.5)。
WAL is automatically enabled; no action is required from the administrator except ensuring that the disk-space requirements for the WAL files are met, and that any necessary tuning is done (see Section 30.5).
WAL 记录在写入每条新记录时附加到 WAL 文件。插入位置由 WAL 中的字节偏移量表示,其对数序列号 (LSN) 单调增加。LSN 值作为数据类型 pg_lsn 返回。可以比较值以计算分隔它们的 WAL 数据体积,因此其用于度量复制和恢复进度。
WAL records are appended to the WAL files as each new record is written. The insert position is described by a Log Sequence Number (LSN) that is a byte offset into the WAL, increasing monotonically with each new record. LSN values are returned as the datatype pg_lsn. Values can be compared to calculate the volume of WAL data that separates them, so they are used to measure the progress of replication and recovery.
WAL 文件存储在主数据目录下的 pg_wal 目录中,作为一个段文件集,每个通常为 16 MB 大小(但可以通过更改 —wal-segsize initdb 选项来更改大小)。每个段都分成页,每个通常为 8 kB(可以通过 —with-wal-blocksize configure 选项更改此大小)。WAL 记录头在 access/xlogrecord.h 中进行了说明;记录内容取决于正在记录的事件类型。段文件被赋予不断增加的数字作为名称,从 000000010000000000000001 开始。数字不换行,但是用尽可用的数字库存需要很长很长时间。
WAL files are stored in the directory pg_wal under the data directory, as a set of segment files, normally each 16 MB in size (but the size can be changed by altering the —wal-segsize initdb option). Each segment is divided into pages, normally 8 kB each (this size can be changed via the —with-wal-blocksize configure option). The WAL record headers are described in access/xlogrecord.h; the record content is dependent on the type of event that is being logged. Segment files are given ever-increasing numbers as names, starting at 000000010000000000000001. The numbers do not wrap, but it will take a very, very long time to exhaust the available stock of numbers.
如果 WAL 位于与主数据库文件不同的磁盘上,则是有利的。可以通过将 pg_wal 目录移动到另一个位置来实现这一点(当然,在服务器关闭时),并从主数据目录中的原始位置创建一个符号链接到新位置。
It is advantageous if the WAL is located on a different disk from the main database files. This can be achieved by moving the pg_wal directory to another location (while the server is shut down, of course) and creating a symbolic link from the original location in the main data directory to the new location.
WAL 的目标是确保在数据库记录被更改之前写入日志,但这可能被虚假地向内核报告成功写入的磁盘驱动器破坏,实际上它们只会缓存数据而尚未存储到磁盘。在这种情况下,如果发生电源故障,可能会导致无法恢复的数据损坏。管理员应尽量确保容纳 PostgreSQL WAL 文件的磁盘不会产生成此类虚假报告。(请参见 Section 30.1。)
The aim of WAL is to ensure that the log is written before database records are altered, but this can be subverted by disk drives that falsely report a successful write to the kernel, when in fact they have only cached the data and not yet stored it on the disk. A power failure in such a situation might lead to irrecoverable data corruption. Administrators should try to ensure that disks holding PostgreSQL’s WAL files do not make such false reports. (See Section 30.1.)
生成检查点并刷新 WAL 后,检查点的位置会被保存在文件 pg_control_中。因此,在恢复开始时,服务器会首先读取 _pg_control,然后读取检查点记录;然后通过从检查点记录中指示的 WAL 位置开始前向扫描,对 REDO 操作执行扫描。因为在检查点后修改的第一个页之后,整个数据页内容会被保存到 WAL 中(假设 full_page_writes未禁用),所以自检查点以来更改的所有页都会被恢复到一致的状态。
After a checkpoint has been made and the WAL flushed, the checkpoint’s position is saved in the file pg_control. Therefore, at the start of recovery, the server first reads pg_control and then the checkpoint record; then it performs the REDO operation by scanning forward from the WAL location indicated in the checkpoint record. Because the entire content of data pages is saved in the WAL on the first page modification after a checkpoint (assuming full_page_writes is not disabled), all pages changed since the checkpoint will be restored to a consistent state.
为了处理 pg_control 损坏的情况,我们应该支持按逆序(从最新到最旧)扫描现有 WAL 段的可能性,以便找到最新的检查点。此功能尚未实现。pg_control 足够小(小于一个磁盘页面),因此不会受到部分写入问题的影响,并且截至撰写本文时,尚未收到因无法读取 pg_control 而导致的数据库故障报告。因此,尽管从理论上讲它是一个薄弱环节,但 pg_control 在实践中似乎并不成问题。
To deal with the case where pg_control is corrupt, we should support the possibility of scanning existing WAL segments in reverse order — newest to oldest — in order to find the latest checkpoint. This has not been implemented yet. pg_control is small enough (less than one disk page) that it is not subject to partial-write problems, and as of this writing there have been no reports of database failures due solely to the inability to read pg_control itself. So while it is theoretically a weak spot, pg_control does not seem to be a problem in practice.