Postgresql 中文操作指南
11.8. Partial Indexes #
partial index 部分索引是针对表的一部分子集创建的索引;此子集由条件表达式(称为部分索引的 predicate)定义。索引仅包含满足谓词的那些表行项。部分索引是一项专门的功能,但在若干情况下它们很有用。
A partial index is an index built over a subset of a table; the subset is defined by a conditional expression (called the predicate of the partial index). The index contains entries only for those table rows that satisfy the predicate. Partial indexes are a specialized feature, but there are several situations in which they are useful.
使用部分索引的一个主要原因是避免对通用值进行索引。由于搜索通用值(占整个表行的百分比不少于几个百分点)的查询无论如何都不会使用索引,所以根本没有必要将这些行保留在索引中。这将减小索引的大小,这将加速确实使用索引的那些查询。这也将加速许多表更新操作,因为在所有情况下都不需要更新索引。 Example 11.1展示了这个想法的可能应用。
One major reason for using a partial index is to avoid indexing common values. Since a query searching for a common value (one that accounts for more than a few percent of all the table rows) will not use the index anyway, there is no point in keeping those rows in the index at all. This reduces the size of the index, which will speed up those queries that do use the index. It will also speed up many table update operations because the index does not need to be updated in all cases. Example 11.1 shows a possible application of this idea.
Example 11.1. Setting up a Partial Index to Exclude Common Values
假设你正在存储数据库中的 Web 服务器访问日志。大多数访问都源自你的组织的 IP 地址范围,但部分来自其他地方(例如拨号连接上的员工)。如果按 IP 进行的搜索主要是针对外部访问,则你可能不必将对应于你的组织子网的 IP 范围编入索引。
Suppose you are storing web server access logs in a database. Most accesses originate from the IP address range of your organization but some are from elsewhere (say, employees on dial-up connections). If your searches by IP are primarily for outside accesses, you probably do not need to index the IP range that corresponds to your organization’s subnet.
假设有一张这样的表:
Assume a table like this:
CREATE TABLE access_log (
url varchar,
client_ip inet,
...
);
要创建适合我们的示例的部分索引,可使用以下命令:
To create a partial index that suits our example, use a command such as this:
CREATE INDEX access_log_client_ip_ix ON access_log (client_ip)
WHERE NOT (client_ip > inet '192.168.100.0' AND
client_ip < inet '192.168.100.255');
可以使用此索引的典型查询如下:
A typical query that can use this index would be:
SELECT *
FROM access_log
WHERE url = '/index.html' AND client_ip = inet '212.78.10.32';
在此处,查询的 IP 地址被部分索引覆盖。以下查询无法使用部分索引,因为它使用了从索引中排除的 IP 地址:
Here the query’s IP address is covered by the partial index. The following query cannot use the partial index, as it uses an IP address that is excluded from the index:
SELECT *
FROM access_log
WHERE url = '/index.html' AND client_ip = inet '192.168.100.23';
请注意,这种类型的部分索引要求预先确定常见值,因此这些部分索引最适合用于数据分布不会发生变化的数据。可以重新创建此类索引以调整新的数据分布,但这就增加了维护工作量。
Observe that this kind of partial index requires that the common values be predetermined, so such partial indexes are best used for data distributions that do not change. Such indexes can be recreated occasionally to adjust for new data distributions, but this adds maintenance effort.
部分索引的另一个可能用途是从典型查询负载不感兴趣的值中排除索引;这在 Example 11.2中显示。这会产生与上面列出的相同优点,但它阻止“无趣”的值通过该索引进行访问,即使在这种情况下索引扫描可能获益。显然,为此类场景设置部分索引需要大量的细心和实验。
Another possible use for a partial index is to exclude values from the index that the typical query workload is not interested in; this is shown in Example 11.2. This results in the same advantages as listed above, but it prevents the “uninteresting” values from being accessed via that index, even if an index scan might be profitable in that case. Obviously, setting up partial indexes for this kind of scenario will require a lot of care and experimentation.
Example 11.2. Setting up a Partial Index to Exclude Uninteresting Values
如果有一个表同时包含已开单和未开单的订单,其中未开单的订单占表的总量很少,但它们是访问次数最多的行,则可以通过仅针对未开单行创建一个索引来提高性能。创建索引的命令如下所示:
If you have a table that contains both billed and unbilled orders, where the unbilled orders take up a small fraction of the total table and yet those are the most-accessed rows, you can improve performance by creating an index on just the unbilled rows. The command to create the index would look like this:
CREATE INDEX orders_unbilled_index ON orders (order_nr)
WHERE billed is not true;
可以使用此索引的一个可能的查询如下:
A possible query to use this index would be:
SELECT * FROM orders WHERE billed is not true AND order_nr < 10000;
但是,索引也可用于根本不涉及 order_nr 的查询,例如:
However, the index can also be used in queries that do not involve order_nr at all, e.g.:
SELECT * FROM orders WHERE billed is not true AND amount > 5000.00;
这不如 amount 列中的部分索引那么高效,因为系统必须扫描整个索引。然而,如果未开单订单相对较少,仅仅使用此部分索引来查找未开单订单可能就足够了。
This is not as efficient as a partial index on the amount column would be, since the system has to scan the entire index. Yet, if there are relatively few unbilled orders, using this partial index just to find the unbilled orders could be a win.
请注意,此查询无法使用此索引:
Note that this query cannot use this index:
SELECT * FROM orders WHERE order_nr = 3501;
订单 3501 可能属于已开单或未开单的订单。
The order 3501 might be among the billed or unbilled orders.
Example 11.2还说明了索引列和谓词中使用的列不必匹配。PostgreSQL 支持具有任意谓词的部分索引,只要涉及正在索引的表的列即可。但是,请记住谓词必须与应从该索引中受益的查询中使用的条件匹配。确切地说,只有当系统能够识别查询的 _WHERE_条件在数学上意味着索引的谓词时,才能在查询中使用部分索引。PostgreSQL 并没有一个复杂的定理证明器,它可以识别以不同形式编写的在数学上等效的表达式。(不仅这样通用的定理证明器极难创建,而且它可能太慢而无法实际使用。)该系统可以识别简单的不等式蕴涵,例如“x < 1”意味着“x < 2”;否则,谓词条件必须与查询 _WHERE_条件的一部分完全匹配,否则该索引将不被识别为可用。匹配发生在查询计划时间,而不是在运行时。因此,参数化查询子句不适用于部分索引。例如,带参数的准备查询可能指定“x < ?”,而这对于参数的所有可能值永远不会意味着“x < 2”。
Example 11.2 also illustrates that the indexed column and the column used in the predicate do not need to match. PostgreSQL supports partial indexes with arbitrary predicates, so long as only columns of the table being indexed are involved. However, keep in mind that the predicate must match the conditions used in the queries that are supposed to benefit from the index. To be precise, a partial index can be used in a query only if the system can recognize that the WHERE condition of the query mathematically implies the predicate of the index. PostgreSQL does not have a sophisticated theorem prover that can recognize mathematically equivalent expressions that are written in different forms. (Not only is such a general theorem prover extremely difficult to create, it would probably be too slow to be of any real use.) The system can recognize simple inequality implications, for example “x < 1” implies “x < 2”; otherwise the predicate condition must exactly match part of the query’s WHERE condition or the index will not be recognized as usable. Matching takes place at query planning time, not at run time. As a result, parameterized query clauses do not work with a partial index. For example a prepared query with a parameter might specify “x < ?” which will never imply “x < 2” for all possible values of the parameter.
部分索引的第三个可能用途根本不需要在查询中使用索引。此处的想法是在表的子集上创建唯一索引,如 Example 11.3所示。这强制了满足索引谓词的行之间的唯一性,而不约束不满足谓词的行。
A third possible use for partial indexes does not require the index to be used in queries at all. The idea here is to create a unique index over a subset of a table, as in Example 11.3. This enforces uniqueness among the rows that satisfy the index predicate, without constraining those that do not.
Example 11.3. Setting up a Partial Unique Index
假设我们有一张描述测试结果的表。我们希望确保给定的主题和目标组合只有一个“成功”条目,但可能有多个“不成功”条目。实现这一目标的方法之一如下:
Suppose that we have a table describing test outcomes. We wish to ensure that there is only one “successful” entry for a given subject and target combination, but there might be any number of “unsuccessful” entries. Here is one way to do it:
CREATE TABLE tests (
subject text,
target text,
success boolean,
...
);
CREATE UNIQUE INDEX tests_success_constraint ON tests (subject, target)
WHERE success;
当成功测试较少且不成功测试较多时,这一方法特别有效。还可以通过创建一个具有 IS NULL 限制的唯一部分索引,在列中只允许一个空值。
This is a particularly efficient approach when there are few successful tests and many unsuccessful ones. It is also possible to allow only one null in a column by creating a unique partial index with an IS NULL restriction.
最后,部分索引还可以用于覆盖系统的查询计划选择。此外,分布奇特的数据集可能导致系统在实际不应使用索引时使用索引。在这种情况下,可以设置索引,使其无法用于冒犯性查询。通常,PostgreSQL 会对索引使用做出合理的决策(例如,在检索通用值时避免使用索引,因此前面的示例实际上只是节省了索引大小,避免索引使用并非必要),而极其不正确的计划选择是提交问题报告的原因。
Finally, a partial index can also be used to override the system’s query plan choices. Also, data sets with peculiar distributions might cause the system to use an index when it really should not. In that case the index can be set up so that it is not available for the offending query. Normally, PostgreSQL makes reasonable choices about index usage (e.g., it avoids them when retrieving common values, so the earlier example really only saves index size, it is not required to avoid index usage), and grossly incorrect plan choices are cause for a bug report.
请记住,设置部分索引表示您至少了解与查询计划器一样多的知识,尤其是当索引可能获利时。形成这种知识需要经验和对 PostgreSQL 中索引工作方式的理解。在大多数情况下,部分索引相对于常规索引的优势将是最小的。在某些情况下它们非常有害,例如在 Example 11.4中。
Keep in mind that setting up a partial index indicates that you know at least as much as the query planner knows, in particular you know when an index might be profitable. Forming this knowledge requires experience and understanding of how indexes in PostgreSQL work. In most cases, the advantage of a partial index over a regular index will be minimal. There are cases where they are quite counterproductive, as in Example 11.4.
Example 11.4. Do Not Use Partial Indexes as a Substitute for Partitioning
你可能想创建一组大型非重叠部分索引,例如
You might be tempted to create a large set of non-overlapping partial indexes, for example
CREATE INDEX mytable_cat_1 ON mytable (data) WHERE category = 1;
CREATE INDEX mytable_cat_2 ON mytable (data) WHERE category = 2;
CREATE INDEX mytable_cat_3 ON mytable (data) WHERE category = 3;
...
CREATE INDEX mytable_cat_N ON mytable (data) WHERE category = N;
这是一个坏主意!你很可能创建一个单独的非部分索引会更好,声明如下
This is a bad idea! Almost certainly, you’ll be better off with a single non-partial index, declared like
CREATE INDEX mytable_cat_data ON mytable (category, data);
(出于 Section 11.3中描述的原因,将类别列放在最前。) 在这个较大的索引中进行搜索可能需要经过比在较小索引中搜索多几个树级,但这几乎肯定比选择适当部分索引所需的计划工作更便宜。问题的核心在于系统不理解部分索引之间的关系,它会费力地测试每个索引以查看它是否适用于当前查询。
(Put the category column first, for the reasons described in Section 11.3.) While a search in this larger index might have to descend through a couple more tree levels than a search in a smaller index, that’s almost certainly going to be cheaper than the planner effort needed to select the appropriate one of the partial indexes. The core of the problem is that the system does not understand the relationship among the partial indexes, and will laboriously test each one to see if it’s applicable to the current query.
如果你的表足够大,以至于单个索引确实是一个糟糕的主意,你应该考虑使用分区代替(请参阅 Section 5.11)。借助该机制,系统确实理解表和索引是不重叠的,因此可以实现更好的性能。
If your table is large enough that a single index really is a bad idea, you should look into using partitioning instead (see Section 5.11). With that mechanism, the system does understand that the tables and indexes are non-overlapping, so far better performance is possible.
有关部分索引的更多信息,请参阅 [id="ston89b",role="bare"]biblio.html#STON89B[id="ston89b"]、[id="olson93",role="bare"]biblio.html#OLSON93[id="olson93"]和 [id="seshadri95",role="bare"]biblio.html#SESHADRI95[id="seshadri95"]。
More information about partial indexes can be found in biblio.html#STON89B, biblio.html#OLSON93, and biblio.html#SESHADRI95.