Postgresql 中文操作指南
14.2. Statistics Used by the Planner #
14.2.1. Single-Column Statistics #
正如我们在前一节中看到的,查询计划程序需要估计查询检索的行数,以便对查询计划做出良好的选择。本节快速了解系统用于这些估计的统计数据。
As we saw in the previous section, the query planner needs to estimate the number of rows retrieved by a query in order to make good choices of query plans. This section provides a quick look at the statistics that the system uses for these estimates.
统计数据的一个组成部分是每张表和索引的条目总数,以及每张表和索引占用的磁盘块数。此信息保存在表 pg_class 的列 reltuples 和 relpages 中。我们可以使用类似以下查询来查看它:
One component of the statistics is the total number of entries in each table and index, as well as the number of disk blocks occupied by each table and index. This information is kept in the table pg_class, in the columns reltuples and relpages. We can look at it with queries similar to this one:
SELECT relname, relkind, reltuples, relpages
FROM pg_class
WHERE relname LIKE 'tenk1%';
relname | relkind | reltuples | relpages
----------------------+---------+-----------+----------
tenk1 | r | 10000 | 358
tenk1_hundred | i | 10000 | 30
tenk1_thous_tenthous | i | 10000 | 30
tenk1_unique1 | i | 10000 | 30
tenk1_unique2 | i | 10000 | 30
(5 rows)
在这里,我们可以看到 tenk1 和其索引都包含 10000 行,但索引明显比表小得多。
Here we can see that tenk1 contains 10000 rows, as do its indexes, but the indexes are (unsurprisingly) much smaller than the table.
出于效率原因,reltuples 和 relpages 不会动态更新,因此它们通常包含一些过时值。它们由 VACUUM、ANALYZE 和 CREATE INDEX 等一些 DDL 命令更新。不扫描整个表(通常是这种情况)的 VACUUM 或 ANALYZE 操作将根据所扫描表的那个部分对 reltuples 计数进行增量更新,从而得到近似值。无论如何,计划程序将调整它在 pg_class 中找到的值以匹配当前物理表大小,从而获得更近似的值。
For efficiency reasons, reltuples and relpages are not updated on-the-fly, and so they usually contain somewhat out-of-date values. They are updated by VACUUM, ANALYZE, and a few DDL commands such as CREATE INDEX. A VACUUM or ANALYZE operation that does not scan the entire table (which is commonly the case) will incrementally update the reltuples count on the basis of the part of the table it did scan, resulting in an approximate value. In any case, the planner will scale the values it finds in pg_class to match the current physical table size, thus obtaining a closer approximation.
大多数查询仅取回表中的一小部分行,因为 WHERE 子句限制了要检查的行。因此,规划器需要估算 WHERE 子句的 selectivity ,即 WHERE 子句中每个条件匹配的行数。用于此任务的信息存储在 pg_statistic 系统目录中。 pg_statistic 中的条目由 ANALYZE 和 VACUUM ANALYZE 命令更新,即使是刚更新的,也始终是近似的。
Most queries retrieve only a fraction of the rows in a table, due to WHERE clauses that restrict the rows to be examined. The planner thus needs to make an estimate of the selectivity of WHERE clauses, that is, the fraction of rows that match each condition in the WHERE clause. The information used for this task is stored in the pg_statistic system catalog. Entries in pg_statistic are updated by the ANALYZE and VACUUM ANALYZE commands, and are always approximate even when freshly updated.
与其直接查看 pg_statistic ,不如在手动检查统计数据时查看其视图 pg_stats 。 pg_stats 旨在更易于阅读。此外, pg_stats 对所有人都可读,而 pg_statistic 仅对超级用户可读。(这可以防止未授权的用户从统计数据中学到有关其他人员的表内容的信息。 pg_stats 视图仅限于显示当前用户可以读取的表有关的行。)例如,我们可以执行:
Rather than look at pg_statistic directly, it’s better to look at its view pg_stats when examining the statistics manually. pg_stats is designed to be more easily readable. Furthermore, pg_stats is readable by all, whereas pg_statistic is only readable by a superuser. (This prevents unprivileged users from learning something about the contents of other people’s tables from the statistics. The pg_stats view is restricted to show only rows about tables that the current user can read.) For example, we might do:
SELECT attname, inherited, n_distinct,
array_to_string(most_common_vals, E'\n') as most_common_vals
FROM pg_stats
WHERE tablename = 'road';
attname | inherited | n_distinct | most_common_vals
---------+-----------+------------+------------------------------------
name | f | -0.363388 | I- 580 Ramp+
| | | I- 880 Ramp+
| | | Sp Railroad +
| | | I- 580 +
| | | I- 680 Ramp
name | t | -0.284859 | I- 880 Ramp+
| | | I- 580 Ramp+
| | | I- 680 Ramp+
| | | I- 580 +
| | | State Hwy 13 Ramp
(2 rows)
请注意,为同一列显示了两行,一行对应于从 road 表开始的完整继承层次结构 (inherited=t),另一行只包括 road 表本身 (inherited=f)。
Note that two rows are displayed for the same column, one corresponding to the complete inheritance hierarchy starting at the road table (inherited=t), and another one including only the road table itself (inherited=f).
特别是 pg_statistic 中 ANALYZE 存储的信息的量(即 most_common_vals 和 histogram_bounds 数组中每列的最大条目数)可以使用 ALTER TABLE SET STATISTICS 命令逐列设置,或通过设置 default_statistics_target 配置变量来全局设置。目前的默认限制是 100 个条目。提高限制可能会进行更精确的计划程序估计,特别是对于数据分布不规则的列,代价是占用 pg_statistic 中更多空间,并且计算估计的时间稍长。相反,较低的限制可能足以满足数据分布简单的列。
The amount of information stored in pg_statistic by ANALYZE, in particular the maximum number of entries in the most_common_vals and histogram_bounds arrays for each column, can be set on a column-by-column basis using the ALTER TABLE SET STATISTICS command, or globally by setting the default_statistics_target configuration variable. The default limit is presently 100 entries. Raising the limit might allow more accurate planner estimates to be made, particularly for columns with irregular data distributions, at the price of consuming more space in pg_statistic and slightly more time to compute the estimates. Conversely, a lower limit might be sufficient for columns with simple data distributions.
有关计划程序使用统计信息的更多详细信息,请参阅 Chapter 76。
Further details about the planner’s use of statistics can be found in Chapter 76.
14.2.2. Extended Statistics #
通常会看到运行错误执行计划的缓慢查询,这是因为查询子句中使用的多个列具有相关性。当列值具有相关性时,计划程序通常假设多个条件是彼此独立的,而这并不正确。常规统计数据由于其针对每个单列的性质,无法获取有关列间相关的任何信息。但是,PostgreSQL 具有计算 multivariate statistics 的能力,可以获取此类信息。
It is common to see slow queries running bad execution plans because multiple columns used in the query clauses are correlated. The planner normally assumes that multiple conditions are independent of each other, an assumption that does not hold when column values are correlated. Regular statistics, because of their per-individual-column nature, cannot capture any knowledge about cross-column correlation. However, PostgreSQL has the ability to compute multivariate statistics, which can capture such information.
由于列组合的可能数量非常大,自动计算多元统计数据是不切实际的。相反,extended statistics objects(更经常称为 statistics objects)可用于指示服务器跨有趣的列集获取统计数据。
Because the number of possible column combinations is very large, it’s impractical to compute multivariate statistics automatically. Instead, extended statistics objects, more often called just statistics objects, can be created to instruct the server to obtain statistics across interesting sets of columns.
使用 CREATE STATISTICS 命令创建统计数据对象。创建这样的对象只是创建表示对统计数据感兴趣的目录条目。实际数据收集由 ANALYZE 执行(手动命令或后台自动分析)。可以在 pg_statistic_ext_data 目录中检查收集的值。
Statistics objects are created using the CREATE STATISTICS command. Creation of such an object merely creates a catalog entry expressing interest in the statistics. Actual data collection is performed by ANALYZE (either a manual command, or background auto-analyze). The collected values can be examined in the pg_statistic_ext_data catalog.
ANALYZE 根据用于计算常规单列统计信息的同一表格行样本,计算扩展统计信息。由于通过增加表或其任何列的统计信息目标(如前一节中所述)来增加样本量,因此更大的统计信息目标通常会导致更准确的扩展统计信息,以及更长的时间来计算它们。
ANALYZE computes extended statistics based on the same sample of table rows that it takes for computing regular single-column statistics. Since the sample size is increased by increasing the statistics target for the table or any of its columns (as described in the previous section), a larger statistics target will normally result in more accurate extended statistics, as well as more time spent calculating them.
以下小节介绍了当前支持的扩展统计信息类型。
The following subsections describe the kinds of extended statistics that are currently supported.
14.2.2.1. Functional Dependencies #
最简单的扩展统计信息类型跟踪 functional dependencies,这是数据库范式定义中使用的一个概念。我们说列 b 在功能上依赖于列 a,如果已知 a 的值就足够确定 b 的值,也就是说,不存在两个具有相同 a 值但具有不同 b 值的行。在完全归一化的数据库中,函数依赖关系应该只存在于主键和超键上。然而,在实践中,由于各种原因,许多数据集并未完全归一化;为了性能原因而有意地进行非规范化是一个常见的例子。即使在完全归一化的数据库中,一些列之间也可能有部分相关性,这可以表示为部分函数依赖关系。
The simplest kind of extended statistics tracks functional dependencies, a concept used in definitions of database normal forms. We say that column b is functionally dependent on column a if knowledge of the value of a is sufficient to determine the value of b, that is there are no two rows having the same value of a but different values of b. In a fully normalized database, functional dependencies should exist only on primary keys and superkeys. However, in practice many data sets are not fully normalized for various reasons; intentional denormalization for performance reasons is a common example. Even in a fully normalized database, there may be partial correlation between some columns, which can be expressed as partial functional dependency.
函数依赖关系的存在直接影响某些查询中的估计准确性。如果查询同时在独立列和从属列上包含条件,则从属列上的条件不会进一步减少结果大小;但是,如果不了解函数依赖关系,查询规划器将假定这些条件是独立的,从而导致对结果大小估计不足。
The existence of functional dependencies directly affects the accuracy of estimates in certain queries. If a query contains conditions on both the independent and the dependent column(s), the conditions on the dependent columns do not further reduce the result size; but without knowledge of the functional dependency, the query planner will assume that the conditions are independent, resulting in underestimating the result size.
为了向规划器告知函数依赖关系,ANALYZE 可以收集跨列依赖性的度量。评估所有列集之间的依赖性程度将极其昂贵,因此数据收集仅限于那些在使用 dependencies 选项定义的统计信息对象中同时出现的列组。建议仅对强相关性的列组创建 dependencies 统计信息,以避免在 ANALYZE 和后续查询规划中产生不必要的开销。
To inform the planner about functional dependencies, ANALYZE can collect measurements of cross-column dependency. Assessing the degree of dependency between all sets of columns would be prohibitively expensive, so data collection is limited to those groups of columns appearing together in a statistics object defined with the dependencies option. It is advisable to create dependencies statistics only for column groups that are strongly correlated, to avoid unnecessary overhead in both ANALYZE and later query planning.
以下是如何收集函数依赖关系统计信息的一个示例:
Here is an example of collecting functional-dependency statistics:
CREATE STATISTICS stts (dependencies) ON city, zip FROM zipcodes;
ANALYZE zipcodes;
SELECT stxname, stxkeys, stxddependencies
FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
WHERE stxname = 'stts';
stxname | stxkeys | stxddependencies
---------+---------+------------------------------------------
stts | 1 5 | {"1 => 5": 1.000000, "5 => 1": 0.423130}
(1 row)
在这里可以看到,列 1(邮政编码)完全决定了列 5(城市),因此系数为 1.0,而城市只有约 42% 的时间决定邮政编码,这意味着有许多城市(58%)由单个邮政编码表示。
Here it can be seen that column 1 (zip code) fully determines column 5 (city) so the coefficient is 1.0, while city only determines zip code about 42% of the time, meaning that there are many cities (58%) that are represented by more than a single ZIP code.
在计算涉及函数相关列的查询的选择性时,规划器使用依赖性系数调整按条件选择性估计值,以免产生低估。
When computing the selectivity for a query involving functionally dependent columns, the planner adjusts the per-condition selectivity estimates using the dependency coefficients so as not to produce an underestimate.
目前仅在考虑将列与常量值进行比较的简单相等条件和具有常量值的 IN 子句时应用函数依赖关系。它们不用于改善用于比较两列或将列与表达式进行比较的相等条件的估计值,也不用于范围子句、LIKE 或任何其他类型的条件。
Functional dependencies are currently only applied when considering simple equality conditions that compare columns to constant values, and IN clauses with constant values. They are not used to improve estimates for equality conditions comparing two columns or comparing a column to an expression, nor for range clauses, LIKE or any other type of condition.
在使用函数依赖关系进行估计时,规划器假定包含列中的条件是兼容的,因此是冗余的。如果它们不兼容,则正确的估计值为零行,但没有考虑这种可能性。例如,给定如下查询:
When estimating with functional dependencies, the planner assumes that conditions on the involved columns are compatible and hence redundant. If they are incompatible, the correct estimate would be zero rows, but that possibility is not considered. For example, given a query like
SELECT * FROM zipcodes WHERE city = 'San Francisco' AND zip = '94105';
规划器将忽略 city 子句,因为它不会改变选择性,这是正确的。但是,它会对以下内容做出相同的假设:
the planner will disregard the city clause as not changing the selectivity, which is correct. However, it will make the same assumption about
SELECT * FROM zipcodes WHERE city = 'San Francisco' AND zip = '90210';
即使此次查询真正满足零行也是如此。然而,函数依赖关系统计信息并不能提供足够的的信息来得出这一结论。
even though there will really be zero rows satisfying this query. Functional dependency statistics do not provide enough information to conclude that, however.
在许多实际情况下,通常满足此假设;例如,应用程序中可能有一个 GUI,它只允许选择要用于查询中的兼容城市和邮政编码值。但是,如果不是这样,函数依赖关系可能不是一个可行的选项。
In many practical situations, this assumption is usually satisfied; for example, there might be a GUI in the application that only allows selecting compatible city and ZIP code values to use in a query. But if that’s not the case, functional dependencies may not be a viable option.
14.2.2.2. Multivariate N-Distinct Counts #
单列统计信息存储每列中的唯一值数。当组合多列时(例如,对于 GROUP BY a, b)的唯一值数估计通常是错误的,当规划器仅具有单列统计数据时,会导致选择错误的计划。
Single-column statistics store the number of distinct values in each column. Estimates of the number of distinct values when combining more than one column (for example, for GROUP BY a, b) are frequently wrong when the planner only has single-column statistical data, causing it to select bad plans.
为了改善此类估计,ANALYZE 可以收集列组的 n-distinct 统计数据。与之前一样,对每一种可能的列分组执行此操作是不切实际的,因此仅为出现在使用 ndistinct 选项定义的统计信息对象中同时出现的那些列组收集数据。将为从一组列出的列中两个或更多列的每一种可能组合收集数据。
To improve such estimates, ANALYZE can collect n-distinct statistics for groups of columns. As before, it’s impractical to do this for every possible column grouping, so data is collected only for those groups of columns appearing together in a statistics object defined with the ndistinct option. Data will be collected for each possible combination of two or more columns from the set of listed columns.
继续之前的示例,邮政编码表中的 n-distinct 计数可能如下所示:
Continuing the previous example, the n-distinct counts in a table of ZIP codes might look like the following:
CREATE STATISTICS stts2 (ndistinct) ON city, state, zip FROM zipcodes;
ANALYZE zipcodes;
SELECT stxkeys AS k, stxdndistinct AS nd
FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
WHERE stxname = 'stts2';
-[ RECORD 1 ]--------------------------------------------------------
k | 1 2 5
nd | {"1, 2": 33178, "1, 5": 33178, "2, 5": 27435, "1, 2, 5": 33178}
(1 row)
这表明有三种列组合具有 33178 个唯一值:邮政编码和州;邮政编码和城市;以及邮政编码、城市和州(鉴于该表中仅邮政编码是唯一的,因此它们全部相等是可以预料的)。另一方面,城市和州的组合仅有 27435 个唯一值。
This indicates that there are three combinations of columns that have 33178 distinct values: ZIP code and state; ZIP code and city; and ZIP code, city and state (the fact that they are all equal is expected given that ZIP code alone is unique in this table). On the other hand, the combination of city and state has only 27435 distinct values.
建议仅对实际上用于分组的列组合创建 ndistinct 统计信息对象,以及因错误估计组的数量而导致计划不佳的情况。否则,ANALYZE 周期只是浪费。
It’s advisable to create ndistinct statistics objects only on combinations of columns that are actually used for grouping, and for which misestimation of the number of groups is resulting in bad plans. Otherwise, the ANALYZE cycles are just wasted.
14.2.2.3. Multivariate MCV Lists #
为每列存储的另一种类型的统计信息是最常见值列表。这允许对单个列进行非常准确的估计,但可能会导致对具有多列条件的查询进行重大错误估计。
Another type of statistic stored for each column are most-common value lists. This allows very accurate estimates for individual columns, but may result in significant misestimates for queries with conditions on multiple columns.
为了提高此类估计值,ANALYZE 可以对列组合收集 MCV 列表。与函数依赖性和 n 个不同系数类似,对每个可能的列分组执行此操作也不切实际。在此情况下更是如此,因为 MCV 列表(不同于函数依赖性和 n 个不同系数)会存储公共列值。因此,仅为同时出现在使用 mcv 选项定义的统计对象中的那些列组收集数据。
To improve such estimates, ANALYZE can collect MCV lists on combinations of columns. Similarly to functional dependencies and n-distinct coefficients, it’s impractical to do this for every possible column grouping. Even more so in this case, as the MCV list (unlike functional dependencies and n-distinct coefficients) does store the common column values. So data is collected only for those groups of columns appearing together in a statistics object defined with the mcv option.
接续前面的示例,邮政编码表中的 MCV 列表可能如下所示(与更简单的统计类型不同,函数是检查 MCV 内容所必需的):
Continuing the previous example, the MCV list for a table of ZIP codes might look like the following (unlike for simpler types of statistics, a function is required for inspection of MCV contents):
CREATE STATISTICS stts3 (mcv) ON city, state FROM zipcodes;
ANALYZE zipcodes;
SELECT m.* FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid),
pg_mcv_list_items(stxdmcv) m WHERE stxname = 'stts3';
index | values | nulls | frequency | base_frequency
-------+------------------------+-------+-----------+----------------
0 | {Washington, DC} | {f,f} | 0.003467 | 2.7e-05
1 | {Apo, AE} | {f,f} | 0.003067 | 1.9e-05
2 | {Houston, TX} | {f,f} | 0.002167 | 0.000133
3 | {El Paso, TX} | {f,f} | 0.002 | 0.000113
4 | {New York, NY} | {f,f} | 0.001967 | 0.000114
5 | {Atlanta, GA} | {f,f} | 0.001633 | 3.3e-05
6 | {Sacramento, CA} | {f,f} | 0.001433 | 7.8e-05
7 | {Miami, FL} | {f,f} | 0.0014 | 6e-05
8 | {Dallas, TX} | {f,f} | 0.001367 | 8.8e-05
9 | {Chicago, IL} | {f,f} | 0.001333 | 5.1e-05
...
(99 rows)
这表明城市和州最常见的组合是华盛顿特区,实际频率(在样本中)约为 0.35%。组合的基本频率(根据简单的每列频率计算)仅为 0.0027%,导致低估了两个数量级。
This indicates that the most common combination of city and state is Washington in DC, with actual frequency (in the sample) about 0.35%. The base frequency of the combination (as computed from the simple per-column frequencies) is only 0.0027%, resulting in two orders of magnitude under-estimates.
建议仅在实际一起用于条件并且其组数估计错误会导致错误计划的列组合上创建 MCV 统计对象。否则,ANALYZE 和规划周期只是浪费。
It’s advisable to create MCV statistics objects only on combinations of columns that are actually used in conditions together, and for which misestimation of the number of groups is resulting in bad plans. Otherwise, the ANALYZE and planning cycles are just wasted.