Postgresql 中文操作指南

11.9. Index-Only Scans and Covering Indexes #

PostgreSQL 中的所有索引都是 secondary 索引,这意味着每个索引都独立于表的主数据区域(在 PostgreSQL 术语中称为表的 heap)存储。这意味着在普通的索引扫描中,每次行检索都需要从索引和堆中提取数据。此外,尽管匹配给定的可索引 WHERE 条件的索引条目通常在索引中彼此紧密相连,但它们引用的表行可能在堆中的任何地方。因此,索引扫描的堆访问部分涉及大量的堆中随机访问,这可能是缓慢的,尤其是在传统的旋转介质上。(如 Section 11.5 中所述,位图扫描尝试通过以排序顺序进行堆访问来减轻此成本,但这只会达到一定程度。)

All indexes in PostgreSQL are secondary indexes, meaning that each index is stored separately from the table’s main data area (which is called the table’s heap in PostgreSQL terminology). This means that in an ordinary index scan, each row retrieval requires fetching data from both the index and the heap. Furthermore, while the index entries that match a given indexable WHERE condition are usually close together in the index, the table rows they reference might be anywhere in the heap. The heap-access portion of an index scan thus involves a lot of random access into the heap, which can be slow, particularly on traditional rotating media. (As described in Section 11.5, bitmap scans try to alleviate this cost by doing the heap accesses in sorted order, but that only goes so far.)

为了解决此性能问题,PostgreSQL 支持 index-only scans,它可以仅从索引中回答查询,而无需进行堆访问。其基本思想是直接从每个索引条目中返回值,而不是查阅关联的堆条目。何时可以使用此方法有两个基本限制:

To solve this performance problem, PostgreSQL supports index-only scans, which can answer queries from an index alone without any heap access. The basic idea is to return values directly out of each index entry instead of consulting the associated heap entry. There are two fundamental restrictions on when this method can be used:

如果满足这两个基本要求,则查询所需的所有数据值都可从索引中获得,因此索引只读扫描在物理上是可行的。但是,PostgreSQL 中的任何表扫描都有一个附加要求:它必须验证每个检索的行对查询的 MVCC 快照“可见”,如 Chapter 13 中所讨论的。可见性信息不存储在索引条目中,只存储在堆条目中;因此,乍一看,似乎无论如何每次行检索都需要进行堆访问。事实上,如果表行近来被修改了,情况就是如此。但是,对于很少更改的数据,有一种方法可以解决这个问题。PostgreSQL 对表堆中的每一页进行跟踪,即该页中存储的所有行是否足够旧,以至于对所有当前和未来的事务均可见。此信息存储在表 visibility map 中的一个位中。索引只读扫描在找到候选索引条目后,将检查对应堆页的可视性映射位。如果已设置,则已知行是可见的,因此可以返回数据而不用做进一步的工作。如果未设置,则必须访问堆条目以找出它是否可见,因此与标准索引扫描相比没有获得性能优势。即使在这种成功的情况下,这种方法也会用可见性映射访问来交换堆访问;但由于可见性映射比它所描述的堆小四个数量级,所以需要更少的物理 I/O 来访问它。在大多数情况下,可见性映射一直缓存在内存中。

If these two fundamental requirements are met, then all the data values required by the query are available from the index, so an index-only scan is physically possible. But there is an additional requirement for any table scan in PostgreSQL: it must verify that each retrieved row be “visible” to the query’s MVCC snapshot, as discussed in Chapter 13. Visibility information is not stored in index entries, only in heap entries; so at first glance it would seem that every row retrieval would require a heap access anyway. And this is indeed the case, if the table row has been modified recently. However, for seldom-changing data there is a way around this problem. PostgreSQL tracks, for each page in a table’s heap, whether all rows stored in that page are old enough to be visible to all current and future transactions. This information is stored in a bit in the table’s visibility map. An index-only scan, after finding a candidate index entry, checks the visibility map bit for the corresponding heap page. If it’s set, the row is known visible and so the data can be returned with no further work. If it’s not set, the heap entry must be visited to find out whether it’s visible, so no performance advantage is gained over a standard index scan. Even in the successful case, this approach trades visibility map accesses for heap accesses; but since the visibility map is four orders of magnitude smaller than the heap it describes, far less physical I/O is needed to access it. In most situations the visibility map remains cached in memory all the time.

简而言之,尽管在两个基本要求下可以进行索引仅扫描,但只有表的堆页的很大一部分具有其 all-visible 映射位集时,这才是胜利。但在足够大一部分行保持不变的表中,这种类型的扫描非常有用。

In short, while an index-only scan is possible given the two fundamental requirements, it will be a win only if a significant fraction of the table’s heap pages have their all-visible map bits set. But tables in which a large fraction of the rows are unchanging are common enough to make this type of scan very useful in practice.

若要有效利用索引仅扫描功能,不妨创建专门用于包含某个类型查询所需的列的 covering index 。鉴于查询通常需要检索的列不只是用于搜索的列,PostgreSQL 允许您创建一个索引,其中一些列只是“有效负载”,且不属于搜索键。可通过添加一个 INCLUDE 子句来列出这些其他列,从而完成此操作。例如,如果您经常运行类似以下的查询:

To make effective use of the index-only scan feature, you might choose to create a covering index, which is an index specifically designed to include the columns needed by a particular type of query that you run frequently. Since queries typically need to retrieve more columns than just the ones they search on, PostgreSQL allows you to create an index in which some columns are just “payload” and are not part of the search key. This is done by adding an INCLUDE clause listing the extra columns. For example, if you commonly run queries like

SELECT y FROM tab WHERE x = 'key';

加速此类查询的传统方法是在 x 上创建一个索引。然而,定义为

the traditional approach to speeding up such queries would be to create an index on x only. However, an index defined as

CREATE INDEX tab_x_y ON tab(x) INCLUDE (y);

的索引可以将这些查询处理为索引仅扫描,因为 y 可以从索引中获得,而无需访问堆。

could handle these queries as index-only scans, because y can be obtained from the index without visiting the heap.

因为列 y 不是索引搜索键的一部分,所以不必是索引可以处理的数据类型;它只是存储在索引中,并且不被索引机制解释。此外,如果索引是唯一索引,则

Because column y is not part of the index’s search key, it does not have to be of a data type that the index can handle; it’s merely stored in the index and is not interpreted by the index machinery. Also, if the index is a unique index, that is

CREATE UNIQUE INDEX tab_x_y ON tab(x) INCLUDE (y);

唯一性条件仅适用于列 x,而不适用于 xy 的组合。(INCLUDE 子句也可以用 UNIQUEPRIMARY KEY 约束编写,为设置此类索引提供替代语法。)

the uniqueness condition applies to just column x, not to the combination of x and y. (An INCLUDE clause can also be written in UNIQUE and PRIMARY KEY constraints, providing alternative syntax for setting up an index like this.)

为索引添加非键有效负载列时,明智的做法是保守的,尤其是有较宽列时。如果索引元组超过索引类型允许的最大大小,数据插入将失败。在任何情况下,非键列都会复制索引表的中的数据并增加索引的大小,从而可能减慢搜索。并且请记住,除非表更改得足够慢以至于索引仅扫描可能不需要访问堆,否则在索引中包含有效负载列几乎没有意义。如果无论如何都必须访问堆元组,则从那里获取列的值不会花费更多。其他限制是,目前不支持使用表达式作为包含的列,并且目前只有 B 树、GiST 和 SP-GiST 索引支持包含的列。

It’s wise to be conservative about adding non-key payload columns to an index, especially wide columns. If an index tuple exceeds the maximum size allowed for the index type, data insertion will fail. In any case, non-key columns duplicate data from the index’s table and bloat the size of the index, thus potentially slowing searches. And remember that there is little point in including payload columns in an index unless the table changes slowly enough that an index-only scan is likely to not need to access the heap. If the heap tuple must be visited anyway, it costs nothing more to get the column’s value from there. Other restrictions are that expressions are not currently supported as included columns, and that only B-tree, GiST and SP-GiST indexes currently support included columns.

在 PostgreSQL 具备 INCLUDE 特性之前,人们有时通过将有效负载列写为普通索引列来制作覆盖索引,即编写,

Before PostgreSQL had the INCLUDE feature, people sometimes made covering indexes by writing the payload columns as ordinary index columns, that is writing

CREATE INDEX tab_x_y ON tab(x, y);

尽管他们无意将 y 作为 WHERE 子句的一部分,但他们依然可以通过查询中的 WHERE 子句来访问此列。(这通常称为“隐藏列”。)只要额外的列是尾随列,这种方法就可以正常工作;出于 Section 11.3 中解释的原因,将它们作为前导列是不明智的。但是,此方法不支持您希望索引对键列强制唯一性的情况。

even though they had no intention of ever using y as part of a WHERE clause. This works fine as long as the extra columns are trailing columns; making them be leading columns is unwise for the reasons explained in Section 11.3. However, this method doesn’t support the case where you want the index to enforce uniqueness on the key column(s).

Suffix truncation 总会从较高的 B 树级别删除非键列。作为有效负载列,它们永远不会用于指导索引扫描。当键列的前缀足以描述最低 B 树级别上的元组时,截断过程还会删除一个或多个拖尾键列。在实践中,没有 INCLUDE 子句的覆盖索引通常避免在更高级别中存储事实上为有效负载的列。但是,明确将有效负载列定义为非键列 reliably 使较高层级的元组保持较小。

Suffix truncation always removes non-key columns from upper B-Tree levels. As payload columns, they are never used to guide index scans. The truncation process also removes one or more trailing key column(s) when the remaining prefix of key column(s) happens to be sufficient to describe tuples on the lowest B-Tree level. In practice, covering indexes without an INCLUDE clause often avoid storing columns that are effectively payload in the upper levels. However, explicitly defining payload columns as non-key columns reliably keeps the tuples in upper levels small.

原则上,索引仅扫描可以与表达式索引一起使用。例如,给定 f(x) 上的索引,其中 x 是表列,应该可以执行

In principle, index-only scans can be used with expression indexes. For example, given an index on f(x) where x is a table column, it should be possible to execute

SELECT f(x) FROM tab WHERE f(x) < 1;

作为索引仅扫描;如果 f() 是一个计算成本很高的函数,这非常有吸引力。但是,PostgreSQL 计划器目前不是针对此类案例非常智能。只有当查询所需的所有 columns 都可从索引中获得时,它才会考虑通过索引仅扫描潜在地执行查询。在此示例中,x 除了在上下文 f(x) 中是不需要的,但计划器没有注意到这一点,并得出结论,即索引仅扫描是不可能的。如果索引仅扫描似乎非常有价值,可以通过添加 x 作为包含的列来解决此问题,例如:

as an index-only scan; and this is very attractive if f() is an expensive-to-compute function. However, PostgreSQL’s planner is currently not very smart about such cases. It considers a query to be potentially executable by index-only scan only when all columns needed by the query are available from the index. In this example, x is not needed except in the context f(x), but the planner does not notice that and concludes that an index-only scan is not possible. If an index-only scan seems sufficiently worthwhile, this can be worked around by adding x as an included column, for example

CREATE INDEX tab_f_x ON tab (f(x)) INCLUDE (x);

如果目标是避免重新计算 f(x),另一个需要注意的是,计划器不一定匹配 f(x) 用法,该用法不在索引 WHERE 子句内用于索引列。在上面显示的简单查询中,通常会得到正确的结果,但在涉及联接的查询中则不会。这些缺陷可能会在 PostgreSQL 的未来版本中得到修复。

An additional caveat, if the goal is to avoid recalculating f(x), is that the planner won’t necessarily match uses of f(x) that aren’t in indexable WHERE clauses to the index column. It will usually get this right in simple queries such as shown above, but not in queries that involve joins. These deficiencies may be remedied in future versions of PostgreSQL.

部分索引也与索引只读扫描具有有趣的交互。考虑 Example 11.3 中所示的部分索引:

Partial indexes also have interesting interactions with index-only scans. Consider the partial index shown in Example 11.3:

CREATE UNIQUE INDEX tests_success_constraint ON tests (subject, target)
    WHERE success;

理论上,我们可以在此索引上执行仅索引扫描以满足如下查询:

In principle, we could do an index-only scan on this index to satisfy a query like

SELECT target FROM tests WHERE subject = 'some-subject' AND success;

但有一个问题:WHERE 子句引用了 success,而后者不可用作索引的结果列。尽管如此,仅索引扫描仍然可行,因为计划不需要在运行时重新检查 WHERE 子句的那一部分:在索引中找到的所有条目必然具有 success = true,因此不需要在计划中明确检查这一点。PostgreSQL 9.6 及更高版本将识别此类情况并允许生成仅索引扫描,但较早版本则不然。

But there’s a problem: the WHERE clause refers to success which is not available as a result column of the index. Nonetheless, an index-only scan is possible because the plan does not need to recheck that part of the WHERE clause at run time: all entries found in the index necessarily have success = true so this need not be explicitly checked in the plan. PostgreSQL versions 9.6 and later will recognize such cases and allow index-only scans to be generated, but older versions will not.