PostgreSQL Chinese Operation Guide

27.3. Failover

If the primary server fails, the standby server should begin failover procedures.

If the standby server fails, no failover needs to take place. If the standby server can be restarted, even some time later, the recovery process can also be restarted immediately, taking advantage of restartable recovery. If the standby server cannot be restarted, a completely new standby server instance should be created.

If the primary server fails and the standby server becomes the new primary, and then the old primary restarts, you must have a mechanism for informing the old primary that it is no longer the primary. This is sometimes known as STONITH (Shoot The Other Node In The Head). It is necessary to avoid a situation where both systems think they are the primary, which leads to confusion and ultimately data loss.
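
One way to notice the two-primary situation described above is to ask each node whether it is in recovery: pg_is_in_recovery() returns false on a server that is operating as a primary. The following is a minimal sketch, assuming the results of that query have already been collected into a dictionary. The node names and both helper functions are hypothetical and are not part of PostgreSQL. The sketch only detects the condition; fencing the old primary is still left to external tooling.

```python
from typing import Dict, List


def nodes_claiming_primary(in_recovery: Dict[str, bool]) -> List[str]:
    """Return the names of nodes that report they are NOT in recovery.

    `in_recovery` maps a node name to the result of running
    `SELECT pg_is_in_recovery();` on that node (collected elsewhere).
    """
    return sorted(name for name, recovering in in_recovery.items() if not recovering)


def is_split_brain(in_recovery: Dict[str, bool]) -> bool:
    """True when more than one node believes it is the primary."""
    return len(nodes_claiming_primary(in_recovery)) > 1


if __name__ == "__main__":
    # Hypothetical state after a failover where the old primary came back up.
    state = {"node_a": False, "node_b": False}
    print(nodes_claiming_primary(state))
    print(is_split_brain(state))
```

A check like this is only as reliable as the connectivity used to collect the answers, so it complements a fencing mechanism and does not replace one.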

Many failover systems use just two systems, the primary and the standby, connected by some kind of heartbeat mechanism to continually verify the connectivity between the two and the viability of the primary. It is also possible to use a third system (called a witness server) to prevent some cases of inappropriate failover, but the additional complexity might not be worthwhile unless it is set up with sufficient care and rigorous testing.
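
The heartbeat idea can be illustrated with a small failure detector. This is a sketch under stated assumptions, not how any particular failover tool works: the class name, the probe callback, and the threshold of three missed heartbeats are all hypothetical. The probe is expected to return True when the primary answered; how it checks (a TCP connection, a trivial query, and so on) is left open.

```python
from typing import Callable


class HeartbeatMonitor:
    """Declare the primary failed only after several consecutive missed heartbeats."""

    def __init__(self, probe: Callable[[], bool], max_missed: int = 3) -> None:
        if max_missed < 1:
            raise ValueError("max_missed must be at least 1")
        self._probe = probe            # returns True when the primary answered
        self._max_missed = max_missed  # hypothetical threshold, tune per site
        self._missed = 0

    def tick(self) -> bool:
        """Run one heartbeat check; return True if the primary is considered failed."""
        if self._probe():
            self._missed = 0           # any success resets the counter
        else:
            self._missed += 1
        return self._missed >= self._max_missed


def should_fail_over(standby_sees_failure: bool, witness_sees_failure: bool) -> bool:
    """With a witness, fail over only when both observers agree the primary is down."""
    return standby_sees_failure and witness_sees_failure
```

Requiring consecutive misses avoids failing over on a single dropped packet, and requiring agreement from the witness avoids failing over when only the link between primary and standby is broken.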

PostgreSQL does not provide the system software required to identify a failure on the primary and notify the standby database server. Many such tools exist and are well integrated with the operating system facilities required for successful failover, such as IP address migration.

Once failover to the standby occurs, there is only a single server in operation. This is known as a degenerate state. The former standby is now the primary, but the former primary is down and might stay down. To return to normal operation, a standby server must be recreated, either on the former primary system when it comes up, or on a third, possibly new, system. The pg_rewind utility can be used to speed up this process on large clusters. Once complete, the primary and standby can be considered to have switched roles. Some people choose to use a third server to provide backup for the new primary until the new standby server is recreated, though clearly this complicates the system configuration and operational processes.
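
As an illustration of the pg_rewind step, the sketch below assembles a command line that resynchronizes the former primary's data directory from the new primary. The helper function, the data directory path, and the connection string are placeholders chosen for this example. It assumes the former primary has been shut down cleanly and that the cluster meets pg_rewind's prerequisites (wal_log_hints enabled or data checksums in use). The command defaults to a dry run so that nothing is modified until the output has been reviewed.

```python
import shlex
from typing import List


def build_pg_rewind_command(
    target_pgdata: str,
    source_conninfo: str,
    dry_run: bool = True,
) -> List[str]:
    """Build an argument list for pg_rewind (hypothetical helper).

    target_pgdata   -- data directory of the former primary (placeholder path)
    source_conninfo -- libpq connection string for the new primary
    dry_run         -- when True, add --dry-run so no files are changed
    """
    cmd = [
        "pg_rewind",
        f"--target-pgdata={target_pgdata}",
        f"--source-server={source_conninfo}",
        "--progress",
    ]
    if dry_run:
        cmd.append("--dry-run")
    return cmd


if __name__ == "__main__":
    # Placeholder values; substitute the real directory and connection string.
    command = build_pg_rewind_command(
        "/var/lib/postgresql/data",
        "host=new-primary.example dbname=postgres",
    )
    print(shlex.join(command))
```

After a successful rewind, the former primary still has to be configured as a standby of the new primary before it is started.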

So, switching from primary to standby server can be fast, but it takes some time to re-prepare the failover cluster. Regular switching from primary to standby is useful, since it allows regular downtime on each system for maintenance. This also serves as a test of the failover mechanism to ensure that it will really work when you need it. Written administration procedures are advised.

To trigger failover of a log-shipping standby server, run pg_ctl promote or call pg_promote(). If you’re setting up reporting servers that are only used to offload read-only queries from the primary, not for high availability purposes, you don’t need to promote.
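
The two promotion routes can be sketched as follows. The helper functions and the data directory path are hypothetical; only pg_ctl promote, its -D and -t options, and the SQL function pg_promote() come from PostgreSQL itself. The sketch assumes pg_ctl is on the PATH and is run as the operating system user that owns the standby's data directory.

```python
import subprocess
from typing import List

# SQL alternative: run this on the standby through any client connection.
PROMOTE_SQL = "SELECT pg_promote();"


def build_promote_command(pgdata: str, wait_seconds: int = 60) -> List[str]:
    """Build the argument list for `pg_ctl promote` (hypothetical helper).

    pgdata       -- data directory of the standby (placeholder path)
    wait_seconds -- how long pg_ctl waits for promotion to complete (-t)
    """
    return ["pg_ctl", "promote", "-D", pgdata, "-t", str(wait_seconds)]


def promote_standby(pgdata: str, wait_seconds: int = 60) -> None:
    """Run `pg_ctl promote`; raises CalledProcessError if pg_ctl reports failure."""
    subprocess.run(build_promote_command(pgdata, wait_seconds), check=True)


if __name__ == "__main__":
    # Print the command instead of running it; the path is a placeholder.
    print(" ".join(build_promote_command("/var/lib/postgresql/data")))
    print(PROMOTE_SQL)
```

Separating the command builder from the function that executes it makes it easy to log or review the exact command in a written failover procedure before it is ever run.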