14.4. Populating a Database
One might need to insert a large amount of data when first populating a database. This section contains some suggestions on how to make this process as efficient as possible.
14.4.1. Disable Autocommit
When using multiple INSERTs, turn off autocommit and just do one commit at the end. (In plain SQL, this means issuing BEGIN at the start and COMMIT at the end. Some client libraries might do this behind your back, in which case you need to make sure the library does it when you want it done.) If you allow each insertion to be committed separately, PostgreSQL is doing a lot of work for each row that is added. An additional benefit of doing all insertions in one transaction is that if the insertion of one row were to fail then the insertion of all rows inserted up to that point would be rolled back, so you won't be stuck with partially loaded data.
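As a quick sketch (the items table and its columns are hypothetical), batching the inserts in a single transaction looks like this:

```sql
-- One transaction for the whole batch: a single commit at the end,
-- and a failure rolls back all rows inserted so far.
BEGIN;
INSERT INTO items (id, name) VALUES (1, 'first');
INSERT INTO items (id, name) VALUES (2, 'second');
-- ... many more INSERTs ...
COMMIT;
```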
14.4.2. Use COPY
Use COPY to load all the rows in one command, instead of using a series of INSERT commands. The COPY command is optimized for loading large numbers of rows; it is less flexible than INSERT, but incurs significantly less overhead for large data loads. Since COPY is a single command, there is no need to disable autocommit if you use this method to populate a table.
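For example, assuming a hypothetical items table and a CSV file readable by the server, a bulk load might look like:

```sql
-- Server-side COPY reads a file on the database server;
-- from psql, the \copy command reads a client-side file instead.
COPY items (id, name) FROM '/path/to/items.csv' WITH (FORMAT csv);
```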
If you cannot use COPY, it might help to use PREPARE to create a prepared INSERT statement, and then use EXECUTE as many times as required. This avoids some of the overhead of repeatedly parsing and planning INSERT. Different interfaces provide this facility in different ways; look for “prepared statements” in the interface documentation.
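In plain SQL, the pattern looks like this (names are illustrative):

```sql
-- Parse and plan the INSERT once...
PREPARE ins (integer, text) AS
    INSERT INTO items (id, name) VALUES ($1, $2);

-- ...then execute it as many times as needed with different values.
EXECUTE ins(1, 'first');
EXECUTE ins(2, 'second');
```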
Note that loading a large number of rows using COPY is almost always faster than using INSERT, even if PREPARE is used and multiple insertions are batched into a single transaction.
COPY is fastest when used within the same transaction as an earlier CREATE TABLE or TRUNCATE command. In such cases no WAL needs to be written, because in case of an error, the files containing the newly loaded data will be removed anyway. However, this consideration only applies when wal_level is minimal, as all commands must write WAL otherwise.
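A sketch of the pattern, again with hypothetical names:

```sql
-- With wal_level = minimal, a COPY into a table created or truncated
-- in the same transaction can skip writing WAL entirely.
BEGIN;
TRUNCATE items;
COPY items FROM '/path/to/items.csv' WITH (FORMAT csv);
COMMIT;
```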
14.4.3. Remove Indexes
If you are loading a freshly created table, the fastest method is to create the table, bulk load the table’s data using COPY, then create any indexes needed for the table. Creating an index on pre-existing data is quicker than updating it incrementally as each row is loaded.
If you are adding large amounts of data to an existing table, it might be a win to drop the indexes, load the table, and then recreate the indexes. Of course, the database performance for other users might suffer during the time the indexes are missing. One should also think twice before dropping a unique index, since the error checking afforded by the unique constraint will be lost while the index is missing.
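The drop-and-recreate pattern might look like this (index and table names are hypothetical):

```sql
-- Drop the index, bulk load, then rebuild the index in one pass
-- over the complete data.
DROP INDEX items_name_idx;
COPY items FROM '/path/to/items.csv' WITH (FORMAT csv);
CREATE INDEX items_name_idx ON items (name);
```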
14.4.4. Remove Foreign Key Constraints
Just as with indexes, a foreign key constraint can be checked “in bulk” more efficiently than row-by-row. So it might be useful to drop foreign key constraints, load data, and re-create the constraints. Again, there is a trade-off between data load speed and loss of error checking while the constraint is missing.
What’s more, when you load data into a table with existing foreign key constraints, each new row requires an entry in the server’s list of pending trigger events (since it is the firing of a trigger that checks the row’s foreign key constraint). Loading many millions of rows can cause the trigger event queue to overflow available memory, leading to intolerable swapping or even outright failure of the command. Therefore it may be necessary, not just desirable, to drop and re-apply foreign keys when loading large amounts of data. If temporarily removing the constraint isn’t acceptable, the only other recourse may be to split up the load operation into smaller transactions.
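As a sketch, with hypothetical orders and items tables:

```sql
-- Drop the constraint, load the data, then re-validate all rows in bulk.
ALTER TABLE orders DROP CONSTRAINT orders_item_id_fkey;
COPY orders FROM '/path/to/orders.csv' WITH (FORMAT csv);
ALTER TABLE orders
    ADD CONSTRAINT orders_item_id_fkey
    FOREIGN KEY (item_id) REFERENCES items (id);
```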
14.4.5. Increase maintenance_work_mem
Temporarily increasing the maintenance_work_mem configuration variable when loading large amounts of data can lead to improved performance. This will help to speed up CREATE INDEX commands and ALTER TABLE ADD FOREIGN KEY commands. It won’t do much for COPY itself, so this advice is only useful when you are using one or both of the above techniques.
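The setting can be raised for just the current session, for example (the 1GB value is illustrative; size it to your available RAM):

```sql
-- Applies only to this session; index builds and FK validation benefit.
SET maintenance_work_mem = '1GB';
CREATE INDEX items_name_idx ON items (name);
RESET maintenance_work_mem;
```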
14.4.6. Increase max_wal_size
Temporarily increasing the max_wal_size configuration variable can also make large data loads faster. This is because loading a large amount of data into PostgreSQL will cause checkpoints to occur more often than the normal checkpoint frequency (specified by the checkpoint_timeout configuration variable). Whenever a checkpoint occurs, all dirty pages must be flushed to disk. By increasing max_wal_size temporarily during bulk data loads, the number of checkpoints that are required can be reduced.
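Since max_wal_size can be changed with a configuration reload rather than a restart, one way to adjust it temporarily is (the 16GB value is illustrative):

```sql
-- Requires superuser privileges; takes effect on reload.
ALTER SYSTEM SET max_wal_size = '16GB';
SELECT pg_reload_conf();

-- ...perform the bulk load, then restore the normal setting:
ALTER SYSTEM RESET max_wal_size;
SELECT pg_reload_conf();
```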
14.4.7. Disable WAL Archival and Streaming Replication
When loading large amounts of data into an installation that uses WAL archiving or streaming replication, it might be faster to take a new base backup after the load has completed than to process a large amount of incremental WAL data. To prevent incremental WAL logging while loading, disable archiving and streaming replication by setting wal_level to minimal, archive_mode to off, and max_wal_senders to zero. But note that changing these settings requires a server restart, and makes any base backups taken beforehand unusable for archive recovery and standby servers, which may lead to data loss.
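These settings are often edited directly in postgresql.conf; as a sketch, they can also be changed with ALTER SYSTEM, remembering that each of them only takes effect after a server restart:

```sql
-- All three changes require a server restart to take effect.
ALTER SYSTEM SET wal_level = 'minimal';
ALTER SYSTEM SET archive_mode = 'off';
ALTER SYSTEM SET max_wal_senders = 0;
-- Restart the server, perform the load, then revert and restart again.
```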
Aside from avoiding the time for the archiver or WAL sender to process the WAL data, doing this will actually make certain commands faster, because they do not write WAL at all if wal_level is minimal and the current subtransaction (or top-level transaction) created or truncated the table or index they change. (They can guarantee crash safety more cheaply by doing an fsync at the end than by writing WAL.)
14.4.8. Run ANALYZE Afterwards
Whenever you have significantly altered the distribution of data within a table, running ANALYZE is strongly recommended. This includes bulk loading large amounts of data into the table. Running ANALYZE (or VACUUM ANALYZE) ensures that the planner has up-to-date statistics about the table. With no statistics or obsolete statistics, the planner might make poor decisions during query planning, leading to poor performance on any tables with inaccurate or nonexistent statistics. Note that if the autovacuum daemon is enabled, it might run ANALYZE automatically; see Section 25.1.3 and Section 25.1.6 for more information.
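For example, after loading a hypothetical items table:

```sql
-- Refresh planner statistics for the freshly loaded table.
ANALYZE items;

-- Or reclaim space and update statistics in one step:
VACUUM ANALYZE items;
```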
14.4.9. Some Notes about pg_dump
Dump scripts generated by pg_dump automatically apply several, but not all, of the above guidelines. To restore a pg_dump dump as quickly as possible, you need to do a few extra things manually. (Note that these points apply while restoring a dump, not while creating it. The same points apply whether loading a text dump with psql or using pg_restore to load from a pg_dump archive file.)
By default, pg_dump uses COPY, and when it is generating a complete schema-and-data dump, it is careful to load data before creating indexes and foreign keys. So in this case several guidelines are handled automatically. What is left for you to do is to:

- Set appropriate (i.e., larger than normal) values for maintenance_work_mem and max_wal_size.
- If using WAL archiving or streaming replication, consider disabling them during the restore, as described above.
- Run ANALYZE afterwards.
A data-only dump will still use COPY, but it does not drop or recreate indexes, and it does not normally touch foreign keys. [14] So when loading a data-only dump, it is up to you to drop and recreate indexes and foreign keys if you wish to use those techniques. It’s still useful to increase max_wal_size while loading the data, but don’t bother increasing maintenance_work_mem; rather, you’d do that while manually recreating indexes and foreign keys afterwards. And don’t forget to ANALYZE when you’re done; see Section 25.1.3 and Section 25.1.6 for more information.
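Putting those pieces together, the post-restore steps for a hypothetical items/orders schema might look like:

```sql
-- Give the rebuild plenty of memory, recreate indexes and foreign
-- keys, then refresh planner statistics.
SET maintenance_work_mem = '1GB';
CREATE INDEX items_name_idx ON items (name);
ALTER TABLE orders
    ADD CONSTRAINT orders_item_id_fkey
    FOREIGN KEY (item_id) REFERENCES items (id);
ANALYZE items;
```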
[14] You can get the effect of disabling foreign keys by using the --disable-triggers option, but realize that this eliminates, rather than just postpones, foreign key validation, and so it is possible to insert bad data if you use it.