Postgresql 中文操作指南

67.4. Implementation #

本节涵盖 B 树索引实现详细信息，高级用户可能会用到。有关 B 树实现更详细、更专注于内部的描述，请参阅源发行版中的 src/backend/access/nbtree/README。

This section covers B-Tree index implementation details that may be of use to advanced users. See src/backend/access/nbtree/README in the source distribution for a much more detailed, internals-focused description of the B-Tree implementation.

67.4.1. B-Tree Structure #

PostgreSQL B 树索引是多级树结构，其中树的每个级别都可以用作页面的双向链接列表。一个元页存储在索引的第一个段文件开头的固定位置。所有其他页面都是叶页面或内部页面。叶页面是树中最低级别的页面。所有其他级别包含内部页面。每个叶页面都包含指向表行的元组。每个内部页面都包含指向树中下一级别的元组。通常，超过 99% 的页面都是叶页面。内部页面和叶页面都使用 Section 73.6 中描述的标准页面格式。

PostgreSQL B-Tree indexes are multi-level tree structures, where each level of the tree can be used as a doubly-linked list of pages. A single metapage is stored in a fixed position at the start of the first segment file of the index. All other pages are either leaf pages or internal pages. Leaf pages are the pages on the lowest level of the tree. All other levels consist of internal pages. Each leaf page contains tuples that point to table rows. Each internal page contains tuples that point to the next level down in the tree. Typically, over 99% of all pages are leaf pages. Both internal pages and leaf pages use the standard page format described in Section 73.6.

当现有叶页无法容纳传入元组时，将新叶页添加到 B 树索引。一个 page split 操作通过将部分项移动到新页来为原先属于溢出页的项腾出空间。页拆分还必须将一个 downlink 新插入到父页中的新页，这可能会导致父也拆分。页拆分以递归方式“向上级联”。当根页最终无法容纳新的下行链路时，将执行 root page split 操作。这通过创建一个高于原始根页一级的新根页，为树结构添加一个新层。

New leaf pages are added to a B-Tree index when an existing leaf page cannot fit an incoming tuple. A page split operation makes room for items that originally belonged on the overflowing page by moving a portion of the items to a new page. Page splits must also insert a new downlink to the new page in the parent page, which may cause the parent to split in turn. Page splits “cascade upwards” in a recursive fashion. When the root page finally cannot fit a new downlink, a root page split operation takes place. This adds a new level to the tree structure by creating a new root page that is one level above the original root page.

67.4.2. Bottom-up Index Deletion #

B-Tree 索引并不会直接得知在 MVCC 下可能存在同一逻辑表行的多个现有版本；对于索引，每个元组都是一个需要它自己的索引项的独立对象。有时，“版本混乱”元组可能会累积并对查询延迟和吞吐量产生不利影响。这通常发生在 UPDATE 繁重的工作负载中，其中大多数单独的更新无法应用 HOT optimization. 在 UPDATE always 期间只更改由一个索引覆盖的一列的值需要一组新的索引元组—一组用于表上的 each and every 索引。特别需要注意的是，这包括未被 UPDATE “逻辑修改”的索引。所有索引都将需要一个后继物理索引元组，该元组指向表中的最新版本。每个索引中的每个新元组通常需要与原始“更新的”元组共存一小段时间（通常持续到 UPDATE 事务提交后不久）。

B-Tree indexes are not directly aware that under MVCC, there might be multiple extant versions of the same logical table row; to an index, each tuple is an independent object that needs its own index entry. “Version churn” tuples may sometimes accumulate and adversely affect query latency and throughput. This typically occurs with UPDATE-heavy workloads where most individual updates cannot apply the HOT optimization. Changing the value of only one column covered by one index during an UPDATE always necessitates a new set of index tuples — one for each and every index on the table. Note in particular that this includes indexes that were not “logically modified” by the UPDATE. All indexes will need a successor physical index tuple that points to the latest version in the table. Each new tuple within each index will generally need to coexist with the original “updated” tuple for a short period of time (typically until shortly after the UPDATE transaction commits).

B 树索引通过执行 bottom-up index deletion_传递来增量删除版本变更索引元组。每次删除传递都是对预期的“版本变更页面拆分”作出反应而触发的。这仅发生在未被 _UPDATE 语句逻辑修改的索引中，否则会在特定页面中集中累积旧版本。拆分页面通常可以避免，但某些实现级别启发法可能会无法识别甚至删除一个垃圾索引元组（在这种情况下，页面拆分或重复数据删除传递会解决传入新元组不适合某个叶页面的问题）。任何索引扫描必须遍历的最坏情况下版本数量（对于任何单个逻辑行）是影响整体系统响应能力和吞吐量的一个重要因素。自下而上的索引删除传递根据涉及逻辑行和版本的 qualitative 区别，针对单个叶页面中的可疑垃圾元组。这与由自动清理工作线程执行的“自上而下”索引清理形成对比，当超过某些 quantitative 表级别阈值时，会触发此清理（请参见 Section 25.1.6）。

B-Tree indexes incrementally delete version churn index tuples by performing bottom-up index deletion passes. Each deletion pass is triggered in reaction to an anticipated “version churn page split”. This only happens with indexes that are not logically modified by UPDATE statements, where concentrated build up of obsolete versions in particular pages would occur otherwise. A page split will usually be avoided, though it’s possible that certain implementation-level heuristics will fail to identify and delete even one garbage index tuple (in which case a page split or deduplication pass resolves the issue of an incoming new tuple not fitting on a leaf page). The worst-case number of versions that any index scan must traverse (for any single logical row) is an important contributor to overall system responsiveness and throughput. A bottom-up index deletion pass targets suspected garbage tuples in a single leaf page based on qualitative distinctions involving logical rows and versions. This contrasts with the “top-down” index cleanup performed by autovacuum workers, which is triggered when certain quantitative table-level thresholds are exceeded (see Section 25.1.6).

Note

并非在 B 树索引中执行的所有删除操作都是自底向上的删除操作。有一种截然不同的索引元组删除类别：simple index tuple deletion。这是一个延迟维护操作，它删除已知是安全的删除的索引元组（其项目标识符 LP_DEAD 位已设置）。与自底向上索引删除一样，当预见到页面拆分为避免拆分时，简单的索引删除将发生。

Not all deletion operations that are performed within B-Tree indexes are bottom-up deletion operations. There is a distinct category of index tuple deletion: simple index tuple deletion. This is a deferred maintenance operation that deletes index tuples that are known to be safe to delete (those whose item identifier’s LP_DEAD bit is already set). Like bottom-up index deletion, simple index deletion takes place at the point that a page split is anticipated as a way of avoiding the split.

简单删除在某种意义上是不失时的，因为它只能在最近的索引扫描设置受影响项的 LP_DEAD 位时进行。在 PostgreSQL 14 之前，B 树删除的唯一类别是简单删除。它与自下而上删除的主要区别在于，只有前者由传递索引扫描的活动机会性驱动，而后者专门针对_UPDATE_s_中未对索引列进行逻辑修改的版本转换。

Simple deletion is opportunistic in the sense that it can only take place when recent index scans set the LP_DEAD bits of affected items in passing. Prior to PostgreSQL 14, the only category of B-Tree deletion was simple deletion. The main differences between it and bottom-up deletion are that only the former is opportunistically driven by the activity of passing index scans, while only the latter specifically targets version churn from _UPDATE_s that do not logically modify indexed columns.

自下而上的索引删除对具有特定工作负载的特定索引执行绝大多数所有垃圾索引元组清理。对于受到 UPDATE_s that rarely or never logically modify the columns that the index covers. The average and worst-case number of versions per logical row can be kept low purely through targeted incremental deletion passes. It’s quite possible that the on-disk size of certain indexes will never increase by even one single page/block despite _constant 版本转换中 UPDATE_s. Even then, an exhaustive “clean sweep” by a _VACUUM 操作版本转换的任何 B 树索引，都有可能实现这一点（通常在自动清理工作程序进程中运行），最终需要作为 collective 清理表及其每个索引的一部分。

Bottom-up index deletion performs the vast majority of all garbage index tuple cleanup for particular indexes with certain workloads. This is expected with any B-Tree index that is subject to significant version churn from UPDATE_s that rarely or never logically modify the columns that the index covers. The average and worst-case number of versions per logical row can be kept low purely through targeted incremental deletion passes. It’s quite possible that the on-disk size of certain indexes will never increase by even one single page/block despite _constant version churn from UPDATE_s. Even then, an exhaustive “clean sweep” by a _VACUUM operation (typically run in an autovacuum worker process) will eventually be required as a part of collective cleanup of the table and each of its indexes.

与 VACUUM 不同，自下而上的索引删除并不提供有关最旧的垃圾索引元组将有多久历史的任何可靠保证。不允许任何索引保留在表及其所有索引共同共享的保守截止点之前变为无效的“浮动垃圾”索引元组。此基本表级别不变量可以安全回收表 TID。这就是随着时间的推移，不同的逻辑行可以重复使用相同的表 TID 的原因（但对于其生命周期跨越同一 VACUUM 周期的两行逻辑来说，这种情况永远不会发生）。

Unlike VACUUM, bottom-up index deletion does not provide any strong guarantees about how old the oldest garbage index tuple may be. No index can be permitted to retain “floating garbage” index tuples that became dead prior to a conservative cutoff point shared by the table and all of its indexes collectively. This fundamental table-level invariant makes it safe to recycle table TIDs. This is how it is possible for distinct logical rows to reuse the same table TID over time (though this can never happen with two logical rows whose lifetimes span the same VACUUM cycle).

67.4.3. Deduplication #

副本是一个叶页面元组（指向表行的元组）其中 all 索引键列具有与同一索引中至少一个其他叶页面元组的对应列值匹配的值。副本元组在实践中非常常见。当启用可选技术时，B 树索引可以使用特殊的、节省空间的副本表示方式：deduplication。

A duplicate is a leaf page tuple (a tuple that points to a table row) where all indexed key columns have values that match corresponding column values from at least one other leaf page tuple in the same index. Duplicate tuples are quite common in practice. B-Tree indexes can use a special, space-efficient representation for duplicates when an optional technique is enabled: deduplication.

去重通过定期合并副本元组组来实现，为每个组形成一个 posting list 元组。列键值仅在此表示方式中出现一次。其后是指向表中行的 TID 的已排序数组。这显著减小了每个值（或列值的每个不同组合）平均出现多次的索引的存储大小。可以显著降低查询的延迟。总体查询吞吐量可以显著增加。例程索引回收的开销也可以显著降低。

Deduplication works by periodically merging groups of duplicate tuples together, forming a single posting list tuple for each group. The column key value(s) only appear once in this representation. This is followed by a sorted array of TIDs that point to rows in the table. This significantly reduces the storage size of indexes where each value (or each distinct combination of column values) appears several times on average. The latency of queries can be reduced significantly. Overall query throughput may increase significantly. The overhead of routine index vacuuming may also be reduced significantly.

Note

B 树去重与包含 NULL 值的“重复项”一样有效，即使根据任何 B 树运算符类的 = 成员，NULL 值永远不等于彼此。就实现中理解磁盘上 B 树结构的任何部分而言，NULL 只是索引值域中的另一个值。

B-Tree deduplication is just as effective with “duplicates” that contain a NULL value, even though NULL values are never equal to each other according to the = member of any B-Tree operator class. As far as any part of the implementation that understands the on-disk B-Tree structure is concerned, NULL is just another value from the domain of indexed values.

去重过程发生在惰性回收中，当新项目被插入现有页无法容纳时，但也仅在索引元组删除无法为新项目释放足够空间时（通常简要考虑删除，然后跳过）。与 GIN 定位列表元组不同，B 树定位列表元组在每次插入新副本时不需要扩展；它们只是页的原始逻辑内容的另一种物理表示。这种设计优先考虑与混合读写工作负载一致的性能。大多数客户端应用程序至少会看到使用去重带来的中等性能优势。默认启用去重。

The deduplication process occurs lazily, when a new item is inserted that cannot fit on an existing leaf page, though only when index tuple deletion could not free sufficient space for the new item (typically deletion is briefly considered and then skipped over). Unlike GIN posting list tuples, B-Tree posting list tuples do not need to expand every time a new duplicate is inserted; they are merely an alternative physical representation of the original logical contents of the leaf page. This design prioritizes consistent performance with mixed read-write workloads. Most client applications will at least see a moderate performance benefit from using deduplication. Deduplication is enabled by default.

CREATE INDEX 和 REINDEX 将去重应用于创建定位列表元组，尽管它们使用的策略略有不同。从表中取得的已排序输入中遇到的每组重复常规元组都被合并到要添加到当前待定叶页的定位列表元组 before 中。各个定位列表元组使用尽可能多的 TID 进行打包。以通常的方式写出叶页面，没有任何单独的去重传递。这种策略非常适合 CREATE INDEX 和 REINDEX，因为它们是单次批处理操作。

CREATE INDEX and REINDEX apply deduplication to create posting list tuples, though the strategy they use is slightly different. Each group of duplicate ordinary tuples encountered in the sorted input taken from the table is merged into a posting list tuple before being added to the current pending leaf page. Individual posting list tuples are packed with as many TIDs as possible. Leaf pages are written out in the usual way, without any separate deduplication pass. This strategy is well-suited to CREATE INDEX and REINDEX because they are once-off batch operations.

由于索引中几乎没有或没有重复值，因此从去重中受益不大的大量写入工作负载将产生小的固定性能损失（除非明确禁用去重）。deduplicate_items 存储参数可用于在各个索引中禁用去重。在只读工作负载中永远没有性能损失，因为读取定位列表元组至少与读取标准元组表示一样高效。禁用去重通常没有帮助。

Write-heavy workloads that don’t benefit from deduplication due to having few or no duplicate values in indexes will incur a small, fixed performance penalty (unless deduplication is explicitly disabled). The deduplicate_items storage parameter can be used to disable deduplication within individual indexes. There is never any performance penalty with read-only workloads, since reading posting list tuples is at least as efficient as reading the standard tuple representation. Disabling deduplication isn’t usually helpful.

有时唯一索引（以及唯一约束）可以使用去重。这允许叶页面临时“吸收”额外的版本变更重复。唯一索引中的去重扩充了自下而上的索引删除，特别是在长时间运行的事务持有阻止垃圾回收的快照的情况下。目标是为了给自下而上的索引删除策略争取时间，使之重新生效。延迟页面拆分直到单个长期运行的事务自然消失，可以允许自下而上删除传递在早期删除传递失败的地方成功。

It is sometimes possible for unique indexes (as well as unique constraints) to use deduplication. This allows leaf pages to temporarily “absorb” extra version churn duplicates. Deduplication in unique indexes augments bottom-up index deletion, especially in cases where a long-running transaction holds a snapshot that blocks garbage collection. The goal is to buy time for the bottom-up index deletion strategy to become effective again. Delaying page splits until a single long-running transaction naturally goes away can allow a bottom-up deletion pass to succeed where an earlier deletion pass failed.

Tip

应用特殊启发式方法来确定是否在唯一索引中进行去重传递。它通常会直接跳到拆分叶页面，避免浪费周期在无益的去重传递上造成的性能损失。如果您担心去重的开销，请考虑选择性地设置 deduplicate_items = off。在唯一索引中启用去重几乎没有缺点。

A special heuristic is applied to determine whether a deduplication pass in a unique index should take place. It can often skip straight to splitting a leaf page, avoiding a performance penalty from wasting cycles on unhelpful deduplication passes. If you’re concerned about the overhead of deduplication, consider setting deduplicate_items = off selectively. Leaving deduplication enabled in unique indexes has little downside.

由于实现级别限制，并非所有情况都可以使用去重。CREATE INDEX 或 REINDEX 运行时，确定去重安全性。

Deduplication cannot be used in all cases due to implementation-level restrictions. Deduplication safety is determined when CREATE INDEX or REINDEX is run.

请注意，去重被视为不安全且无法用于涉及相等数据之间语义差异的以下情况：

Note that deduplication is deemed unsafe and cannot be used in the following cases involving semantically significant differences among equal datums:

在未来的 PostgreSQL 版本中，可能会取消一项进一步的实现级别限制：

There is one further implementation-level restriction that may be lifted in a future version of PostgreSQL:

无论使用的操作员类或对照关系如何，都适用一项进一步的实现级限制：

There is one further implementation-level restriction that applies regardless of the operator class or collation used: