Postgresql 中文操作指南

14.3. Controlling the Planner with Explicit JOIN Clauses #

通过使用显式 JOIN 语法,可以在一定程度上控制查询计划器。为了了解原因,我们需要先了解一些背景信息。

It is possible to control the query planner to some extent by using the explicit JOIN syntax. To see why this matters, we first need some background.

在简单连接查询(例如:)中,

In a simple join query, such as:

SELECT * FROM a, b, c WHERE a.id = b.id AND b.ref = c.id;

计划器可以按任何顺序连接给定的表。例如,它可以生成连接 A 和 B 的查询计划(使用 WHERE 条件 a.id = b.id),然后使用另一个 WHERE 条件将 C 连接到该连接表。或者,它可以将 B 连接到 C,然后将 A 连接到该结果。或者,它可以将 A 连接到 C,然后将它们连接到 B - 但这效率低下,因为必须形成 A 和 C 的完全笛卡儿积,WHERE 子句中没有适用的条件来允许连接优化。(PostgreSQL 执行器中的所有连接都发生在两个输入表之间,因此有必要以其中一种方式建立结果。)重要的一点是,这些不同的连接可能性会产生语义等效的结果,但执行开销可能会有很大差异。因此,计划器会浏览所有这些可能性,以尝试找到最有效的查询计划。

the planner is free to join the given tables in any order. For example, it could generate a query plan that joins A to B, using the WHERE condition a.id = b.id, and then joins C to this joined table, using the other WHERE condition. Or it could join B to C and then join A to that result. Or it could join A to C and then join them with B — but that would be inefficient, since the full Cartesian product of A and C would have to be formed, there being no applicable condition in the WHERE clause to allow optimization of the join. (All joins in the PostgreSQL executor happen between two input tables, so it’s necessary to build up the result in one or another of these fashions.) The important point is that these different join possibilities give semantically equivalent results but might have hugely different execution costs. Therefore, the planner will explore all of them to try to find the most efficient query plan.

当查询仅涉及两到三张表时,不必担心连接顺序。但随着表数的增加,可能的连接顺序将呈指数级增长。对于十张或更多的输入表,不再可行对所有可能性进行穷举式搜索,甚至对于六到七张表,规划可能会花费很长时间。当有太多的输入表时,PostgreSQL 规划器将从穷举式搜索切换到对有限数量的可能性进行 genetic 搜索(切换阈值由 geqo_threshold 运行时参数设置)。遗传搜索耗时较少,但不会找到最优的执行方案。

When a query only involves two or three tables, there aren’t many join orders to worry about. But the number of possible join orders grows exponentially as the number of tables expands. Beyond ten or so input tables it’s no longer practical to do an exhaustive search of all the possibilities, and even for six or seven tables planning might take an annoyingly long time. When there are too many input tables, the PostgreSQL planner will switch from exhaustive search to a genetic probabilistic search through a limited number of possibilities. (The switch-over threshold is set by the geqo_threshold run-time parameter.) The genetic search takes less time, but it won’t necessarily find the best possible plan.

当查询涉及外部连接时,计划器的自由度低于普通(内部)连接。例如,考虑:

When the query involves outer joins, the planner has less freedom than it does for plain (inner) joins. For example, consider:

SELECT * FROM a LEFT JOIN (b JOIN c ON (b.ref = c.id)) ON (a.id = b.id);

尽管此查询的限制与上一个示例表面上相似,但语义不同,因为对于 A 中没有匹配 B 和 C 连接的行,必须为其发出行。因此,计划器此处没有连接顺序可选:它必须将 B 连接到 C,然后将 A 连接到该结果。因此,此查询比上一个查询的计划花费的时间更少。在其他情况下,计划器可能能够确定多个连接顺序是安全的。例如,给定:

Although this query’s restrictions are superficially similar to the previous example, the semantics are different because a row must be emitted for each row of A that has no matching row in the join of B and C. Therefore the planner has no choice of join order here: it must join B to C and then join A to that result. Accordingly, this query takes less time to plan than the previous query. In other cases, the planner might be able to determine that more than one join order is safe. For example, given:

SELECT * FROM a LEFT JOIN b ON (a.bid = b.id) LEFT JOIN c ON (a.cid = c.id);

首先将 A 连接到 B 或 C 是有效的。目前,只有 FULL JOIN 完全限制连接顺序。涉及 LEFT JOINRIGHT JOIN 的大多数实际案例都可以重新排列到某种程度。

it is valid to join A to either B or C first. Currently, only FULL JOIN completely constrains the join order. Most practical cases involving LEFT JOIN or RIGHT JOIN can be rearranged to some extent.

显式内部连接语法 (INNER JOINCROSS JOIN 或未修饰的 JOIN) 与在 FROM 中列出输入关系在语义上相同,因此它不会限制连接顺序。

Explicit inner join syntax (INNER JOIN, CROSS JOIN, or unadorned JOIN) is semantically the same as listing the input relations in FROM, so it does not constrain the join order.

即使大多数类型的 JOIN 不会完全限制连接顺序,也可以指示 PostgreSQL 查询计划器将所有 JOIN 子句都视为限制连接顺序。例如,以下三个查询在逻辑上是等效的:

Even though most kinds of JOIN don’t completely constrain the join order, it is possible to instruct the PostgreSQL query planner to treat all JOIN clauses as constraining the join order anyway. For example, these three queries are logically equivalent:

SELECT * FROM a, b, c WHERE a.id = b.id AND b.ref = c.id;
SELECT * FROM a CROSS JOIN b CROSS JOIN c WHERE a.id = b.id AND b.ref = c.id;
SELECT * FROM a JOIN (b JOIN c ON (b.ref = c.id)) ON (a.id = b.id);

但如果我们告诉计划器遵循 JOIN 顺序,则第二个和第三个计划花费的时间少于第一个。对于只有三个表,这种影响不值得担心,但是对于许多表,它可以救命。

But if we tell the planner to honor the JOIN order, the second and third take less time to plan than the first. This effect is not worth worrying about for only three tables, but it can be a lifesaver with many tables.

若要强制规划器遵循通过显式 JOIN 布局的连接顺序,将 join_collapse_limit 运行时参数设置为 1。(其他可能的值在下面讨论。)

To force the planner to follow the join order laid out by explicit _JOIN_s, set the join_collapse_limit run-time parameter to 1. (Other possible values are discussed below.)

您不必完全限制连接顺序以减少搜索时间,因为在普通 FROM 列表的项目中使用 JOIN 运算符是可以的。例如,考虑:

You do not need to constrain the join order completely in order to cut search time, because it’s OK to use JOIN operators within items of a plain FROM list. For example, consider:

SELECT * FROM a CROSS JOIN b, c, d, e WHERE ...;

join_collapse_limit = 1 时,这将强制计划器在将 A 连接到 B 之前先将它们连接到其他表,但不会限制它的其他选择。在这个示例中,可能的连接顺序的数量减少了 5 倍。

With join_collapse_limit = 1, this forces the planner to join A to B before joining them to other tables, but doesn’t constrain its choices otherwise. In this example, the number of possible join orders is reduced by a factor of 5.

以这种方式限制计划器的搜索既可以减少规划时间,又可以指导计划器制定一个好的查询计划,这是一个有用的技巧。如果计划器默认选择了一个错误的连接顺序,您可以强制它通过 JOIN 语法选择一个更好的顺序——假设您知道一个更好的顺序。建议进行试验。

Constraining the planner’s search in this way is a useful technique both for reducing planning time and for directing the planner to a good query plan. If the planner chooses a bad join order by default, you can force it to choose a better order via JOIN syntax — assuming that you know of a better order, that is. Experimentation is recommended.

影响规划时间的一个密切相关的问题是将子查询合并到其父级查询中。例如,考虑:

A closely related issue that affects planning time is collapsing of subqueries into their parent query. For example, consider:

SELECT *
FROM x, y,
    (SELECT * FROM a, b, c WHERE something) AS ss
WHERE somethingelse;

这种情况可能源于使用包含连接的视图;视图的 SELECT 规则将插入到视图引用中,生成类似于上述查询。通常,计划器将尝试将子查询合并到父级中,生成:

This situation might arise from use of a view that contains a join; the view’s SELECT rule will be inserted in place of the view reference, yielding a query much like the above. Normally, the planner will try to collapse the subquery into the parent, yielding:

SELECT * FROM x, y, a, b, c WHERE something AND somethingelse;

这通常会导致比单独规划子查询更好的计划。(例如,外部 WHERE 条件可能是首先将 X 连接到 A 消除了许多 A 行,从而避免了形成子查询的完整逻辑输出的需要。)但同时,我们增加了规划时间;在这里,我们有一个五路连接问题,代替了两个独立的三路连接问题。由于可能性的指数增长,这将产生很大的差异。如果超过 from_collapse_limit FROM 项会导致父查询,则计划器将尝试避免陷入巨大的联合搜索问题,方法是不合并子查询。您可以通过向上或向下调整此运行时参数,以在规划时间和计划质量之间进行权衡。

This usually results in a better plan than planning the subquery separately. (For example, the outer WHERE conditions might be such that joining X to A first eliminates many rows of A, thus avoiding the need to form the full logical output of the subquery.) But at the same time, we have increased the planning time; here, we have a five-way join problem replacing two separate three-way join problems. Because of the exponential growth of the number of possibilities, this makes a big difference. The planner tries to avoid getting stuck in huge join search problems by not collapsing a subquery if more than from_collapse_limit FROM items would result in the parent query. You can trade off planning time against quality of plan by adjusting this run-time parameter up or down.

from_collapse_limitjoin_collapse_limit 的名称相似,因为它们执行几乎相同的功能:一个控制规划器在何时“扁平化”子查询,另一个控制规划器在何时扁平化显式连接。通常,您要么将 join_collapse_limit 设置为等于 from_collapse_limit(以便显式连接和子查询以相似的方式起作用),要么将 join_collapse_limit 设置为 1(如果您想要通过显式连接控制连接顺序)。但是,如果您尝试微调计划时间和运行时间之间的权衡,则可以对它们进行不同的设置。

from_collapse_limit and join_collapse_limit are similarly named because they do almost the same thing: one controls when the planner will “flatten out” subqueries, and the other controls when it will flatten out explicit joins. Typically you would either set join_collapse_limit equal to from_collapse_limit (so that explicit joins and subqueries act similarly) or set join_collapse_limit to 1 (if you want to control join order with explicit joins). But you might set them differently if you are trying to fine-tune the trade-off between planning time and run time.