Postgresql 中文操作指南

76.2. Multivariate Statistics Examples #

76.2.1. Functional Dependencies #

多元相关性可以使用一个非常简单的数据集来演示——一个包含两列的表,两列都包含相同的值:

Multivariate correlation can be demonstrated with a very simple data set — a table with two columns, both containing the same values:

CREATE TABLE t (a INT, b INT);
INSERT INTO t SELECT i % 100, i % 100 FROM generate_series(1, 10000) s(i);
ANALYZE t;

Section 14.2中所述,计划程序可以使用从_pg_class_获取的页数和行数来确定_t_基数:

As explained in Section 14.2, the planner can determine cardinality of t using the number of pages and rows obtained from pg_class:

SELECT relpages, reltuples FROM pg_class WHERE relname = 't';

 relpages | reltuples
----------+-----------
       45 |     10000

数据分布非常简单,每列只有 100 个不同的值,分布均匀。

The data distribution is very simple; there are only 100 distinct values in each column, uniformly distributed.

以下示例显示了对 a 栏进行 WHERE 条件估算的结果:

The following example shows the result of estimating a WHERE condition on the a column:

EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1;
                                 QUERY PLAN
-------------------------------------------------------------------​------------
 Seq Scan on t  (cost=0.00..170.00 rows=100 width=8) (actual rows=100 loops=1)
   Filter: (a = 1)
   Rows Removed by Filter: 9900

计划程序检查条件并确定此子句的选择性为 1%。通过比较此估计值和实际行数,我们可以看到估计值非常准确(实际上是完全准确的,因为表非常小)。将 WHERE 条件更改为使用 b 栏,会生成一个相同的计划。但观察一下如果我们将相同的条件应用于两列,并将其与 AND 结合起来会发生什么:

The planner examines the condition and determines the selectivity of this clause to be 1%. By comparing this estimate and the actual number of rows, we see that the estimate is very accurate (in fact exact, as the table is very small). Changing the WHERE condition to use the b column, an identical plan is generated. But observe what happens if we apply the same condition on both columns, combining them with AND:

EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
                                 QUERY PLAN
-------------------------------------------------------------------​----------
 Seq Scan on t  (cost=0.00..195.00 rows=1 width=8) (actual rows=100 loops=1)
   Filter: ((a = 1) AND (b = 1))
   Rows Removed by Filter: 9900

计划程序分别对每个条件估算选择性,得出与上述相同的 1% 估计值。然后它假设这些条件是独立的,因此它将它们的选择性相乘,产生仅 0.01% 的最终选择性估计值。这是一个重大的低估,因为满足条件的实际行数(100)高出两个数量级。

The planner estimates the selectivity for each condition individually, arriving at the same 1% estimates as above. Then it assumes that the conditions are independent, and so it multiplies their selectivities, producing a final selectivity estimate of just 0.01%. This is a significant underestimate, as the actual number of rows matching the conditions (100) is two orders of magnitude higher.

通过创建一个统计对象可以解决此问题,该对象将_ANALYZE_引导至计算两列上的函数式依赖多元化统计数据:

This problem can be fixed by creating a statistics object that directs ANALYZE to calculate functional-dependency multivariate statistics on the two columns:

CREATE STATISTICS stts (dependencies) ON a, b FROM t;
ANALYZE t;
EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
                                  QUERY PLAN
-------------------------------------------------------------------​------------
 Seq Scan on t  (cost=0.00..195.00 rows=100 width=8) (actual rows=100 loops=1)
   Filter: ((a = 1) AND (b = 1))
   Rows Removed by Filter: 9900

76.2.2. Multivariate N-Distinct Counts #

对多列(如_GROUP BY_子句将生成的组号)集的基数估算也会出现类似问题。当_GROUP BY_列出单列时,n-distinct估算(可用作HashAggregate节点返回的行数的估算值)非常精确:

A similar problem occurs with estimation of the cardinality of sets of multiple columns, such as the number of groups that would be generated by a GROUP BY clause. When GROUP BY lists a single column, the n-distinct estimate (which is visible as the estimated number of rows returned by the HashAggregate node) is very accurate:

EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a;
                                       QUERY PLAN
-------------------------------------------------------------------​----------------------
 HashAggregate  (cost=195.00..196.00 rows=100 width=12) (actual rows=100 loops=1)
   Group Key: a
   ->  Seq Scan on t  (cost=0.00..145.00 rows=10000 width=4) (actual rows=10000 loops=1)

但是,不使用多元化统计数据,像以下示例中在_GROUP BY_中两列的组号估算就会大一个数量级:

But without multivariate statistics, the estimate for the number of groups in a query with two columns in GROUP BY, as in the following example, is off by an order of magnitude:

EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
                                       QUERY PLAN
-------------------------------------------------------------------​-------------------------
 HashAggregate  (cost=220.00..230.00 rows=1000 width=16) (actual rows=100 loops=1)
   Group Key: a, b
   ->  Seq Scan on t  (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000 loops=1)

通过将统计对象重新定义为包含两列的n-distinct计数,估算会得到很大提高:

By redefining the statistics object to include n-distinct counts for the two columns, the estimate is much improved:

DROP STATISTICS stts;
CREATE STATISTICS stts (dependencies, ndistinct) ON a, b FROM t;
ANALYZE t;
EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
                                       QUERY PLAN
-------------------------------------------------------------------​-------------------------
 HashAggregate  (cost=220.00..221.00 rows=100 width=16) (actual rows=100 loops=1)
   Group Key: a, b
   ->  Seq Scan on t  (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000 loops=1)

76.2.3. MCV Lists #

Section 76.2.1中所述,函数依赖性是一种非常经济且高效的统计类型,但其主要限制在于其全局性(仅跟踪列级别上的依赖性,而不是在各个列值之间)。

As explained in Section 76.2.1, functional dependencies are very cheap and efficient type of statistics, but their main limitation is their global nature (only tracking dependencies at the column level, not between individual column values).

本节介绍了 MCV(最常见值)列表的多变量变体,是对 Section 76.1中所述的按列统计的直接扩展。这些统计信息通过存储各个值来解决此限制,但在构建_ANALYZE_中的统计信息、存储和计划时间方面自然更加昂贵。

This section introduces multivariate variant of MCV (most-common values) lists, a straightforward extension of the per-column statistics described in Section 76.1. These statistics address the limitation by storing individual values, but it is naturally more expensive, both in terms of building the statistics in ANALYZE, storage and planning time.

让我们再查看 Section 76.2.1中的查询,但这次是在同一组列上创建了 MCV 列表(请务必删除函数依赖性,以确保计划程序使用新创建的统计信息)。

Let’s look at the query from Section 76.2.1 again, but this time with a MCV list created on the same set of columns (be sure to drop the functional dependencies, to make sure the planner uses the newly created statistics).

DROP STATISTICS stts;
CREATE STATISTICS stts2 (mcv) ON a, b FROM t;
ANALYZE t;
EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
                                   QUERY PLAN
-------------------------------------------------------------------​------------
 Seq Scan on t  (cost=0.00..195.00 rows=100 width=8) (actual rows=100 loops=1)
   Filter: ((a = 1) AND (b = 1))
   Rows Removed by Filter: 9900

该估算与函数式依赖的估算一样精确,这主要是因为表相当小,且具有值明显不同的简单分布。在查看函数式依赖不够好的第二个查询之前,我们先来检查一下MCV列表。

The estimate is as accurate as with the functional dependencies, mostly thanks to the table being fairly small and having a simple distribution with a low number of distinct values. Before looking at the second query, which was not handled by functional dependencies particularly well, let’s inspect the MCV list a bit.

可以使用_pg_mcv_list_items_集合返回函数检查MCV列表。

Inspecting the MCV list is possible using pg_mcv_list_items set-returning function.

SELECT m.* FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid),
                pg_mcv_list_items(stxdmcv) m WHERE stxname = 'stts2';
 index |  values  | nulls | frequency | base_frequency
-------+----------+-------+-----------+----------------
     0 | {0, 0}   | {f,f} |      0.01 |         0.0001
     1 | {1, 1}   | {f,f} |      0.01 |         0.0001
   ...
    49 | {49, 49} | {f,f} |      0.01 |         0.0001
    50 | {50, 50} | {f,f} |      0.01 |         0.0001
   ...
    97 | {97, 97} | {f,f} |      0.01 |         0.0001
    98 | {98, 98} | {f,f} |      0.01 |         0.0001
    99 | {99, 99} | {f,f} |      0.01 |         0.0001
(100 rows)

这确认了两个列中有100个不同的组合,且这些组合的可能性都差不多(每个列的频次都是1%)。基本频次是以按列统计数据计算的频次,就像没有多列统计数据一样。如果任何一列中存在空值,就会在_nulls_列中标识出来。

This confirms there are 100 distinct combinations in the two columns, and all of them are about equally likely (1% frequency for each one). The base frequency is the frequency computed from per-column statistics, as if there were no multi-column statistics. Had there been any null values in either of the columns, this would be identified in the nulls column.

估算选择性时,规划器会将所有条件应用到MCV列表中的项目上,然后对匹配项的频次求和。有关详细信息,请参阅_src/backend/statistics/mcv.c_中的_mcv_clauselist_selectivity_。

When estimating the selectivity, the planner applies all the conditions on items in the MCV list, and then sums the frequencies of the matching ones. See mcv_clauselist_selectivity in src/backend/statistics/mcv.c for details.

与函数式依赖相比,MCV列表有两个主要优点。首先,此列表存储实际值,从而可以确定哪些组合是兼容的。

Compared to functional dependencies, MCV lists have two major advantages. Firstly, the list stores actual values, making it possible to decide which combinations are compatible.

EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 10;
                                 QUERY PLAN
-------------------------------------------------------------------​--------
 Seq Scan on t  (cost=0.00..195.00 rows=1 width=8) (actual rows=0 loops=1)
   Filter: ((a = 1) AND (b = 10))
   Rows Removed by Filter: 10000

其次,MCV列表处理更多子句类型,不仅仅是函数式依赖中的相等子句。例如,考虑针对同一表的以下范围查询:

Secondly, MCV lists handle a wider range of clause types, not just equality clauses like functional dependencies. For example, consider the following range query for the same table:

EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a <= 49 AND b > 49;
                                QUERY PLAN
-------------------------------------------------------------------​--------
 Seq Scan on t  (cost=0.00..195.00 rows=1 width=8) (actual rows=0 loops=1)
   Filter: ((a <= 49) AND (b > 49))
   Rows Removed by Filter: 10000