PostgreSQL Chinese Operation Guide

30.5. WAL Configuration

There are several WAL-related configuration parameters that affect database performance. This section explains their use. Consult Chapter 20 for general information about setting server configuration parameters.

Checkpoints are points in the sequence of transactions at which it is guaranteed that the heap and index data files have been updated with all information written before that checkpoint. At checkpoint time, all dirty data pages are flushed to disk and a special checkpoint record is written to the WAL file. (The change records were previously flushed to the WAL files.) In the event of a crash, the crash recovery procedure looks at the latest checkpoint record to determine the point in the WAL (known as the redo record) from which it should start the REDO operation. Any changes made to data files before that point are guaranteed to be already on disk. Hence, after a checkpoint, WAL segments preceding the one containing the redo record are no longer needed and can be recycled or removed. (When WAL archiving is being done, the WAL segments must be archived before being recycled or removed.)

The checkpoint requirement of flushing all dirty data pages to disk can cause a significant I/O load. For this reason, checkpoint activity is throttled so that I/O begins at checkpoint start and completes before the next checkpoint is due to start; this minimizes performance degradation during checkpoints.

The server’s checkpointer process automatically performs a checkpoint every so often. A checkpoint is begun every checkpoint_timeout seconds, or if max_wal_size is about to be exceeded, whichever comes first. The default settings are 5 minutes and 1 GB, respectively. If no WAL has been written since the previous checkpoint, new checkpoints will be skipped even if checkpoint_timeout has passed. (If WAL archiving is being used and you want to put a lower limit on how often files are archived in order to bound potential data loss, you should adjust the archive_timeout parameter rather than the checkpoint parameters.) It is also possible to force a checkpoint by using the SQL command CHECKPOINT.
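
As a minimal sketch, these settings can be inspected and adjusted from psql (the values below are illustrative, not recommendations; both parameters take effect on a configuration reload):

    -- show the current checkpoint scheduling parameters
    SHOW checkpoint_timeout;
    SHOW max_wal_size;

    -- hypothetical adjustment: checkpoint at most every 15 minutes or 4 GB of WAL
    ALTER SYSTEM SET checkpoint_timeout = '15min';
    ALTER SYSTEM SET max_wal_size = '4GB';
    SELECT pg_reload_conf();

    -- force an immediate checkpoint (requires superuser or, in recent
    -- releases, the pg_checkpoint role)
    CHECKPOINT;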

Reducing checkpoint_timeout and/or max_wal_size causes checkpoints to occur more often. This allows faster after-crash recovery, since less work will need to be redone. However, one must balance this against the increased cost of flushing dirty data pages more often. If full_page_writes is set (as is the default), there is another factor to consider. To ensure data page consistency, the first modification of a data page after each checkpoint results in logging the entire page content. In that case, a smaller checkpoint interval increases the volume of output to the WAL, partially negating the goal of using a smaller interval, and in any case causing more disk I/O.
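
One way to observe this effect, assuming a version where the pg_stat_wal view is available (it was added in PostgreSQL 14), is to watch the full-page-image counter while varying the checkpoint interval:

    -- wal_fpi counts full-page images written to WAL; a large share of
    -- wal_bytes shortly after each checkpoint reflects the effect above
    SELECT wal_records, wal_fpi, wal_bytes FROM pg_stat_wal;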

Checkpoints are fairly expensive, first because they require writing out all currently dirty buffers, and second because they result in extra subsequent WAL traffic as discussed above. It is therefore wise to set the checkpointing parameters high enough so that checkpoints don’t happen too often. As a simple sanity check on your checkpointing parameters, you can set the checkpoint_warning parameter. If checkpoints happen closer together than checkpoint_warning seconds, a message will be output to the server log recommending increasing max_wal_size. Occasional appearance of such a message is not cause for alarm, but if it appears often then the checkpoint control parameters should be increased. Bulk operations such as large COPY transfers might cause a number of such warnings to appear if you have not set max_wal_size high enough.
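
A hedged sketch of what this looks like in practice (the exact log wording may vary by version):

    ALTER SYSTEM SET checkpoint_warning = '30s';  -- 30s is also the default
    SELECT pg_reload_conf();
    -- if checkpoints then occur closer together than 30 seconds, the server
    -- log contains messages roughly like:
    --   LOG:  checkpoints are occurring too frequently (9 seconds apart)
    --   HINT: Consider increasing the configuration parameter "max_wal_size".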

To avoid flooding the I/O system with a burst of page writes, writing dirty buffers during a checkpoint is spread over a period of time. That period is controlled by checkpoint_completion_target, which is given as a fraction of the checkpoint interval (configured by using checkpoint_timeout). The I/O rate is adjusted so that the checkpoint finishes when the given fraction of checkpoint_timeout seconds have elapsed, or before max_wal_size is exceeded, whichever is sooner. With the default value of 0.9, PostgreSQL can be expected to complete each checkpoint a bit before the next scheduled checkpoint (at around 90% of the last checkpoint’s duration). This spreads out the I/O as much as possible so that the checkpoint I/O load is consistent throughout the checkpoint interval. The disadvantage of this is that prolonging checkpoints affects recovery time, because more WAL segments will need to be kept around for possible use in recovery. A user concerned about the amount of time required to recover might wish to reduce checkpoint_timeout so that checkpoints occur more frequently but still spread the I/O across the checkpoint interval. Alternatively, checkpoint_completion_target could be reduced, but this would result in times of more intense I/O (during the checkpoint) and times of less I/O (after the checkpoint completed but before the next scheduled checkpoint) and therefore is not recommended. Although checkpoint_completion_target could be set as high as 1.0, it is typically recommended to set it to no higher than 0.9 (the default) since checkpoints include some other activities besides writing dirty buffers. A setting of 1.0 is quite likely to result in checkpoints not being completed on time, which would result in performance loss due to unexpected variation in the number of WAL segments needed.
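
A quick worked example with the defaults (checkpoint_timeout = 5min, checkpoint_completion_target = 0.9):

    SHOW checkpoint_completion_target;  -- 0.9 by default
    -- the checkpointer paces its writes so that a timed checkpoint finishes
    -- after about 0.9 * 300 s = 270 s, leaving roughly 30 s of slack before
    -- the next checkpoint is due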

On Linux and POSIX platforms, checkpoint_flush_after allows forcing the OS to flush pages written by the checkpoint to disk after a configurable number of bytes. Otherwise, these pages may be kept in the OS’s page cache, inducing a stall when fsync is issued at the end of a checkpoint. This setting will often help to reduce transaction latency, but it can also have an adverse effect on performance, particularly for workloads that are bigger than shared_buffers but smaller than the OS’s page cache.
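
For reference, a sketch of inspecting and changing this setting (the value shown is illustrative; the default is 256kB on Linux and 0, i.e. disabled, elsewhere):

    SHOW checkpoint_flush_after;
    ALTER SYSTEM SET checkpoint_flush_after = '512kB';
    SELECT pg_reload_conf();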

The number of WAL segment files in the pg_wal directory depends on min_wal_size, max_wal_size and the amount of WAL generated in previous checkpoint cycles. When old WAL segment files are no longer needed, they are removed or recycled (that is, renamed to become future segments in the numbered sequence). If, due to a short-term peak of WAL output rate, max_wal_size is exceeded, the unneeded segment files will be removed until the system gets back under this limit. Below that limit, the system recycles enough WAL files to cover the estimated need until the next checkpoint, and removes the rest. The estimate is based on a moving average of the number of WAL files used in previous checkpoint cycles. The moving average is increased immediately if the actual usage exceeds the estimate, so it accommodates peak usage rather than average usage to some extent. min_wal_size puts a minimum on the number of WAL files recycled for future usage; that much WAL is always recycled for future use, even if the system is idle and the WAL usage estimate suggests that little WAL is needed.
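
To see how many segment files are currently in pg_wal, one can query the pg_ls_waldir() function (available since PostgreSQL 10) instead of inspecting the directory by hand:

    -- count and total size of WAL segment files in pg_wal
    SELECT count(*) AS segments,
           pg_size_pretty(sum(size)) AS total_size
    FROM pg_ls_waldir();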

Independently of max_wal_size, the most recent wal_keep_size megabytes of WAL files plus one additional WAL file are kept at all times. Also, if WAL archiving is used, old segments cannot be removed or recycled until they are archived. If WAL archiving cannot keep up with the pace that WAL is generated, or if archive_command or archive_library fails repeatedly, old WAL files will accumulate in pg_wal until the situation is resolved. A slow or failed standby server that uses a replication slot will have the same effect (see Section 27.2.6).
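
A sketch of checking for this on a primary: the pg_replication_slots view shows each slot’s restart_lsn, from which the amount of retained WAL can be computed:

    -- how much WAL each replication slot is currently holding back
    SELECT slot_name, active,
           pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn))
               AS retained_wal
    FROM pg_replication_slots;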

In archive recovery or standby mode, the server periodically performs restartpoints, which are similar to checkpoints in normal operation: the server forces all its state to disk, updates the pg_control file to indicate that the already-processed WAL data need not be scanned again, and then recycles any old WAL segment files in the pg_wal directory. Restartpoints can’t be performed more frequently than checkpoints on the primary because restartpoints can only be performed at checkpoint records. A restartpoint is triggered when a checkpoint record is reached if at least checkpoint_timeout seconds have passed since the last restartpoint, or if WAL size is about to exceed max_wal_size. However, because of limitations on when a restartpoint can be performed, max_wal_size is often exceeded during recovery, by up to one checkpoint cycle’s worth of WAL. (max_wal_size is never a hard limit anyway, so you should always leave plenty of headroom to avoid running out of disk space.)
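
The last completed checkpoint (or restartpoint) recorded in pg_control can be inspected without leaving SQL, via the pg_control_checkpoint() function:

    -- where recovery would start, and when the checkpoint was taken
    SELECT checkpoint_lsn, redo_lsn, checkpoint_time
    FROM pg_control_checkpoint();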

There are two commonly used internal WAL functions: XLogInsertRecord and XLogFlush. XLogInsertRecord is used to place a new record into the WAL buffers in shared memory. If there is no space for the new record, XLogInsertRecord will have to write (move to kernel cache) a few filled WAL buffers. This is undesirable because XLogInsertRecord is used on every database low level modification (for example, row insertion) at a time when an exclusive lock is held on affected data pages, so the operation needs to be as fast as possible. What is worse, writing WAL buffers might also force the creation of a new WAL segment, which takes even more time. Normally, WAL buffers should be written and flushed by an XLogFlush request, which is made, for the most part, at transaction commit time to ensure that transaction records are flushed to permanent storage. On systems with high WAL output, XLogFlush requests might not occur often enough to prevent XLogInsertRecord from having to do writes. On such systems one should increase the number of WAL buffers by modifying the wal_buffers parameter. When full_page_writes is set and the system is very busy, setting wal_buffers higher will help smooth response times during the period immediately following each checkpoint.
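
On PostgreSQL 14 and later, the pg_stat_wal view exposes a counter for exactly this situation, which makes it a reasonable guide for sizing wal_buffers (the value below is illustrative; changing wal_buffers requires a restart):

    -- wal_buffers_full counts the times WAL data was written out because the
    -- buffers filled up, i.e. XLogInsertRecord had to do the write itself
    SELECT wal_buffers_full FROM pg_stat_wal;

    SHOW wal_buffers;  -- -1 means 1/32 of shared_buffers, chosen at startup
    ALTER SYSTEM SET wal_buffers = '64MB';  -- takes effect after a restart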

The commit_delay parameter defines for how many microseconds a group commit leader process will sleep after acquiring a lock within XLogFlush, while group commit followers queue up behind the leader. This delay allows other server processes to add their commit records to the WAL buffers so that all of them will be flushed by the leader’s eventual sync operation. No sleep will occur if fsync is not enabled, or if fewer than commit_siblings other sessions are currently in active transactions; this avoids sleeping when it’s unlikely that any other session will commit soon. Note that on some platforms, the resolution of a sleep request is ten milliseconds, so that any nonzero commit_delay setting between 1 and 10000 microseconds would have the same effect. Note also that on some platforms, sleep operations may take slightly longer than requested by the parameter.
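
A minimal sketch of enabling group commit with these parameters (illustrative values; the defaults are 0 and 5, respectively):

    ALTER SYSTEM SET commit_delay = 1000;  -- microseconds; 0 disables the sleep
    ALTER SYSTEM SET commit_siblings = 5;  -- concurrent transactions required
    SELECT pg_reload_conf();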

Since the purpose of commit_delay is to allow the cost of each flush operation to be amortized across concurrently committing transactions (potentially at the expense of transaction latency), it is necessary to quantify that cost before the setting can be chosen intelligently. The higher that cost is, the more effective commit_delay is expected to be in increasing transaction throughput, up to a point. The pg_test_fsync program can be used to measure the average time in microseconds that a single WAL flush operation takes. A value of half of the average time the program reports it takes to flush after a single 8kB write operation is often the most effective setting for commit_delay, so this value is recommended as the starting point to use when optimizing for a particular workload. While tuning commit_delay is particularly useful when the WAL is stored on high-latency rotating disks, benefits can be significant even on storage media with very fast sync times, such as solid-state drives or RAID arrays with a battery-backed write cache; but this should definitely be tested against a representative workload. Higher values of commit_siblings should be used in such cases, whereas smaller commit_siblings values are often helpful on higher latency media. Note that it is quite possible that a setting of commit_delay that is too high can increase transaction latency by so much that total transaction throughput suffers.
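
For example, if pg_test_fsync reports that a single 8kB write followed by a flush takes about 2,500 microseconds on the WAL device (a hypothetical figure), the suggested starting point would be half of that:

    -- starting point: half of the measured single-flush time
    ALTER SYSTEM SET commit_delay = 1250;  -- microseconds
    SELECT pg_reload_conf();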

commit_delay 设置为零(默认值)时,仍可能会出现某种形式的群组提交,但是每个群组将只包含在先前的刷新操作(如果有)发生的窗口期间到达需要刷新其提交记录的点的会话。在客户端数量较高时,往往会出现“门道效应”,即使 commit_delay 为零,群组提交的效果也会变得显着,因此显式设置 commit_delay 往往帮助较小。只有在(1)有一些并发提交的事务,并且(2)吞吐量在一定程度上受提交速率限制时,设置 commit_delay 才有帮助;但在旋转延迟较高的情况下,此设置可以有效地增加只有两个客户端(即一个提交客户端和一个同级事务)的事务吞吐量。

When commit_delay is set to zero (the default), it is still possible for a form of group commit to occur, but each group will consist only of sessions that reach the point where they need to flush their commit records during the window in which the previous flush operation (if any) is occurring. At higher client counts a “gangway effect” tends to occur, so that the effects of group commit become significant even when commit_delay is zero, and thus explicitly setting commit_delay tends to help less. Setting commit_delay can only help when (1) there are some concurrently committing transactions, and (2) throughput is limited to some degree by commit rate; but with high rotational latency this setting can be effective in increasing transaction throughput with as few as two clients (that is, a single committing client with one sibling transaction).

The wal_sync_method parameter determines how PostgreSQL will ask the kernel to force WAL updates out to disk. All the options should be the same in terms of reliability, with the exception of fsync_writethrough, which can sometimes force a flush of the disk cache even when other options do not do so. However, it’s quite platform-specific which one will be the fastest. You can test the speeds of different options using the pg_test_fsync program. Note that this parameter is irrelevant if fsync has been turned off.
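
A short sketch of inspecting and changing the method (which value is fastest, and even which values are available, depends on the platform):

    SHOW wal_sync_method;  -- commonly fdatasync on Linux
    -- possible values: open_datasync, fdatasync, fsync, fsync_writethrough, open_sync
    ALTER SYSTEM SET wal_sync_method = 'fdatasync';
    SELECT pg_reload_conf();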

Enabling the wal_debug configuration parameter (provided that PostgreSQL has been compiled with support for it) will result in each XLogInsertRecord and XLogFlush WAL call being logged to the server log. This option might be replaced by a more general mechanism in the future.
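
For completeness, a sketch of turning it on in a superuser session (this has no effect unless the server binaries were built with the WAL_DEBUG macro defined):

    SET wal_debug = on;  -- superuser-only developer option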

There are two internal functions to write WAL data to disk: XLogWrite and issue_xlog_fsync. When track_wal_io_timing is enabled, the total amounts of time XLogWrite writes and issue_xlog_fsync syncs WAL data to disk are counted as wal_write_time and wal_sync_time in pg_stat_wal, respectively. XLogWrite is normally called by XLogInsertRecord (when there is no space for the new record in WAL buffers), XLogFlush and the WAL writer, to write WAL buffers to disk and call issue_xlog_fsync. issue_xlog_fsync is normally called by XLogWrite to sync WAL files to disk. If wal_sync_method is either open_datasync or open_sync, a write operation in XLogWrite guarantees to sync written WAL data to disk and issue_xlog_fsync does nothing. If wal_sync_method is either fdatasync, fsync, or fsync_writethrough, the write operation moves WAL buffers to kernel cache and issue_xlog_fsync syncs them to disk. Regardless of the setting of track_wal_io_timing, the number of times XLogWrite writes and issue_xlog_fsync syncs WAL data to disk are also counted as wal_write and wal_sync in pg_stat_wal, respectively.
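
On versions where pg_stat_wal exposes these columns (they were added in PostgreSQL 14), the counters can be read directly:

    -- wal_write/wal_sync are always collected; wal_write_time/wal_sync_time
    -- remain zero unless track_wal_io_timing is enabled
    SELECT wal_write, wal_sync, wal_write_time, wal_sync_time
    FROM pg_stat_wal;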

The recovery_prefetch parameter can be used to reduce I/O wait times during recovery by instructing the kernel to initiate reads of disk blocks that will soon be needed but are not currently in PostgreSQL’s buffer pool. The maintenance_io_concurrency and wal_decode_buffer_size settings limit prefetching concurrency and distance, respectively. By default, recovery_prefetch is set to try, which enables the feature on systems where posix_fadvise is available.
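
A sketch of checking the relevant settings, and of watching prefetching at work during recovery on PostgreSQL 15 and later:

    SHOW recovery_prefetch;            -- 'try' by default
    SHOW maintenance_io_concurrency;   -- limits prefetch concurrency
    SHOW wal_decode_buffer_size;       -- limits prefetch distance

    -- on a server performing recovery:
    SELECT * FROM pg_stat_recovery_prefetch;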