Postgresql 中文操作指南
30.1. Reliability #
可靠性是任何重要数据库系统的特性,PostgreSQL 做了一切可能的努力来保证可靠运行。可靠运行的一个方面是提交事务记录的所有数据都应存储在不会受到断电、操作系统故障和硬件故障(当然不包括非易失性区域本身故障)影响的非易失性区域中。通常,成功写入计算机永久存储(磁盘驱动器或同等设备)可以满足此要求。事实上,即使计算机遭到致命损坏,如果磁盘驱动器幸存下来,它们可以被移动到具有相似硬件的另一台计算机,并且所有提交的事务都将保持完整。
Reliability is an important property of any serious database system, and PostgreSQL does everything possible to guarantee reliable operation. One aspect of reliable operation is that all data recorded by a committed transaction should be stored in a nonvolatile area that is safe from power loss, operating system failure, and hardware failure (except failure of the nonvolatile area itself, of course). Successfully writing the data to the computer’s permanent storage (disk drive or equivalent) ordinarily meets this requirement. In fact, even if a computer is fatally damaged, if the disk drives survive they can be moved to another computer with similar hardware and all committed transactions will remain intact.
虽然定期将数据强制写入磁盘盘片看似一个简单的操作,但事实并非如此。由于磁盘驱动器的速度明显低于主内存和 CPU,因此计算机主内存和磁盘盘片之间存在多层缓存。首先,操作系统具有缓冲区高速缓存,它可以高速缓存请求频繁的磁盘块并合并磁盘写入内容。幸运的是,所有操作系统都会为应用程序提供一种方法来强制写入缓冲区高速缓存至磁盘,PostgreSQL 使用这些功能。(请参阅 wal_sync_method 参数以调整执行此操作的方式。)
While forcing data to the disk platters periodically might seem like a simple operation, it is not. Because disk drives are dramatically slower than main memory and CPUs, several layers of caching exist between the computer’s main memory and the disk platters. First, there is the operating system’s buffer cache, which caches frequently requested disk blocks and combines disk writes. Fortunately, all operating systems give applications a way to force writes from the buffer cache to disk, and PostgreSQL uses those features. (See the wal_sync_method parameter to adjust how this is done.)
其次,磁盘驱动器控制器中可能还有缓存,在 RAID 控制器卡中尤为常见。其中一些缓存是 write-through,这意味着写入会立即发送到驱动器。其他一些是 write-back,意味着数据之后会发送到驱动器。这类缓存可能存在可靠性隐患,因为磁盘控制器缓存中的内存易失, 断电会导致内存内容丢失。更好的控制器卡具有 battery-backup units(BBU),意味着该卡具有电池,可在系统断电时为缓存提供电力。来电后,数据将写入磁盘驱动器。
Next, there might be a cache in the disk drive controller; this is particularly common on RAID controller cards. Some of these caches are write-through, meaning writes are sent to the drive as soon as they arrive. Others are write-back, meaning data is sent to the drive at some later time. Such caches can be a reliability hazard because the memory in the disk controller cache is volatile, and will lose its contents in a power failure. Better controller cards have battery-backup units (BBUs), meaning the card has a battery that maintains power to the cache in case of system power loss. After power is restored the data will be written to the disk drives.
最后,大多数磁盘驱动器具有缓存。一些是直写缓存,一些则是回写缓存,回写驱动器缓存也存在与磁盘控制器缓存相同的数据丢失问题。消费级 IDE 和 SATA 驱动器尤其可能具有断电后不可用的回写缓存。许多固态驱动器 (SSD) 也具有易失的回写缓存。
And finally, most disk drives have caches. Some are write-through while some are write-back, and the same concerns about data loss exist for write-back drive caches as for disk controller caches. Consumer-grade IDE and SATA drives are particularly likely to have write-back caches that will not survive a power failure. Many solid-state drives (SSD) also have volatile write-back caches.
这些缓存通常可以禁用;但是,执行此操作的方法因操作系统和驱动器类型而异:
These caches can typically be disabled; however, the method for doing this varies by operating system and drive type:
最近的 SATA 驱动器(遵循 ATAPI-6 或更高版本)提供了驱动器缓存刷新命令 ( FLUSH CACHE EXT ),而 SCSI 驱动器长期以来一直支持类似的命令 SYNCHRONIZE CACHE 。这些命令不能直接访问 PostgreSQL,但某些文件系统(例如 ZFS、ext4)可以使用它们来将数据刷新到回写启用驱动器上的磁盘盘片中。不幸的是,当与电池备份单元 (BBU) 磁盘控制器结合使用时,此类文件系统的行为次优。在这样的设置中,synchronize 命令会将控制器缓存中的所有数据强制到磁盘中,从而消除了 BBU 的大部分好处。你可以运行 pg_test_fsync 程序以了解是否受到影响。如果受到影响,可以通过关闭文件系统中的写屏障或重新配置磁盘控制器(如果这是选项)来重新获得 BBU 的性能优势。如果关闭写屏障,请确保电池保持正常运行;有故障的电池可能会导致数据丢失。希望文件系统和磁盘控制器设计人员最终能解决这种次优行为。
Recent SATA drives (those following ATAPI-6 or later) offer a drive cache flush command (FLUSH CACHE EXT), while SCSI drives have long supported a similar command SYNCHRONIZE CACHE. These commands are not directly accessible to PostgreSQL, but some file systems (e.g., ZFS, ext4) can use them to flush data to the platters on write-back-enabled drives. Unfortunately, such file systems behave suboptimally when combined with battery-backup unit (BBU) disk controllers. In such setups, the synchronize command forces all data from the controller cache to the disks, eliminating much of the benefit of the BBU. You can run the pg_test_fsync program to see if you are affected. If you are affected, the performance benefits of the BBU can be regained by turning off write barriers in the file system or reconfiguring the disk controller, if that is an option. If write barriers are turned off, make sure the battery remains functional; a faulty battery can potentially lead to data loss. Hopefully file system and disk controller designers will eventually address this suboptimal behavior.
当操作系统将写请求发送到存储硬件时,它能做的很少,以确保数据已到达真正非易失性存储区域。相反,管理员有责任确保所有存储组件都确保数据和文件系统元数据的完整性。避免使用具有非电池备份写缓存的磁盘控制器。在驱动器级别,如果驱动器无法保证在关闭之前写入数据,请禁用回写缓存。如果你使用 SSD,请注意,其中许多 SSD 默认情况下不响应缓存刷新命令。你可以使用 diskchecker.pl 测试可靠的 I/O 子系统行为。
When the operating system sends a write request to the storage hardware, there is little it can do to make sure the data has arrived at a truly non-volatile storage area. Rather, it is the administrator’s responsibility to make certain that all storage components ensure integrity for both data and file-system metadata. Avoid disk controllers that have non-battery-backed write caches. At the drive level, disable write-back caching if the drive cannot guarantee the data will be written before shutdown. If you use SSDs, be aware that many of these do not honor cache flush commands by default. You can test for reliable I/O subsystem behavior using diskchecker.pl.
数据丢失的另一个风险是由磁盘盘片写入操作本身造成的。磁盘盘片分为几个扇区,每个扇区通常为 512 个字节。每个物理读或写操作都处理一个完整的扇区。当写入请求到达磁盘驱动器时,它可能是一些 512 字节的倍数(PostgreSQL 通常一次写入 8192 个字节,或 16 个扇区),并且写入过程可能随时因断电而失败,这意味着一些 512 字节的扇区已写入,而另一些扇区则没有。为防止此类故障,PostgreSQL 会定期将完整的页面图像写入到永久 WAL 存储 before 中,并修改磁盘上的实际页面。这样,在崩溃恢复期间,PostgreSQL 可以从 WAL 恢复部分写入的页面。如果您有防止部分页面写入的文件系统软件(例如,ZFS),您可以通过关闭 full_page_writes 参数来关闭此页面映像。带电池备份单元 (BBU) 的磁盘控制器不能阻止部分页面写入,除非它们保证数据以完整(8kB)页面的形式写入 BBU。
Another risk of data loss is posed by the disk platter write operations themselves. Disk platters are divided into sectors, commonly 512 bytes each. Every physical read or write operation processes a whole sector. When a write request arrives at the drive, it might be for some multiple of 512 bytes (PostgreSQL typically writes 8192 bytes, or 16 sectors, at a time), and the process of writing could fail due to power loss at any time, meaning some of the 512-byte sectors were written while others were not. To guard against such failures, PostgreSQL periodically writes full page images to permanent WAL storage before modifying the actual page on disk. By doing this, during crash recovery PostgreSQL can restore partially-written pages from WAL. If you have file-system software that prevents partial page writes (e.g., ZFS), you can turn off this page imaging by turning off the full_page_writes parameter. Battery-Backed Unit (BBU) disk controllers do not prevent partial page writes unless they guarantee that data is written to the BBU as full (8kB) pages.
PostgreSQL 还可以防止存储设备上出现某些类型的数据损坏,这些损坏可能是由于硬件错误或媒体随着时间的推移发生故障(例如,读写垃圾数据)造成的。
PostgreSQL also protects against some kinds of data corruption on storage devices that may occur because of hardware errors or media failure over time, such as reading/writing garbage data.
PostgreSQL 不防止可纠正的内存错误,并且假定您将使用采用行业标准纠错码 (ECC) 或更好保护的 RAM 运行。
PostgreSQL does not protect against correctable memory errors and it is assumed you will operate using RAM that uses industry standard Error Correcting Codes (ECC) or better protection.