Postgresql 中文操作指南
38.15. Operator Optimization Information #
PostgreSQL 运算符定义可以包含多个可选子句,这些子句向系统传达有关运算符行为的有用信息。只要适当,就应该提供这些子句,因为它们可以显著加快使用该运算符查询的执行速度。但是,如果您提供它们,您必须确保它们是正确的!不正确地使用优化子句会导致查询速度慢、输出微妙错误或出现其他不良情况。如果您不确定某优化子句,您始终可以将其省略;唯一的结果是查询可能会比需要的速度慢。
A PostgreSQL operator definition can include several optional clauses that tell the system useful things about how the operator behaves. These clauses should be provided whenever appropriate, because they can make for considerable speedups in execution of queries that use the operator. But if you provide them, you must be sure that they are right! Incorrect use of an optimization clause can result in slow queries, subtly wrong output, or other Bad Things. You can always leave out an optimization clause if you are not sure about it; the only consequence is that queries might run slower than they need to.
将来版本的 PostgreSQL 中可能会添加其他优化子句。此处描述的是 16.3 版本理解的所有子句。
Additional optimization clauses might be added in future versions of PostgreSQL. The ones described here are all the ones that release 16.3 understands.
还可以将计划支持函数附加到运算符下方的函数,这提供了另一种向系统告知运算符行为的方法。有关详细信息,请参阅 Section 38.11。
It is also possible to attach a planner support function to the function that underlies an operator, providing another way of telling the system about the behavior of the operator. See Section 38.11 for more information.
38.15.1. COMMUTATOR #
如果提供了 COMMUTATOR 子句,则会为正在定义的运算符命名一个交换运算符。我们说运算符 A 是运算符 B 的交换运算符,如果对于所有可能的输入值 x,y,(x A y) 等于 (y B x)。注意,B 也是 A 的交换运算符。例如,特定数据类型的运算符 < 和 > 通常是彼此的交换运算符,运算符 + 通常与自身交换。但是运算符 - 通常与任何东西都不交换。
The COMMUTATOR clause, if provided, names an operator that is the commutator of the operator being defined. We say that operator A is the commutator of operator B if (x A y) equals (y B x) for all possible input values x, y. Notice that B is also the commutator of A. For example, operators < and > for a particular data type are usually each others' commutators, and operator + is usually commutative with itself. But operator - is usually not commutative with anything.
可交换运算符的左操作数类型与其交换运算符的右操作数类型相同,反之亦然。因此,PostgreSQL 只需要交换运算符的名称来查找交换运算符,而 COMMUTATOR 子句中只需要提供这些名称即可。
The left operand type of a commutable operator is the same as the right operand type of its commutator, and vice versa. So the name of the commutator operator is all that PostgreSQL needs to be given to look up the commutator, and that’s all that needs to be provided in the COMMUTATOR clause.
为将用于索引和联接子句中的运算符提供交换运算符信息至关重要,因为这允许查询优化器将此类子句“反转”到不同计划类型所需的格式。例如,考虑具有 tab1.x = tab2.y 之类 WHERE 子句的查询,其中 tab1.x 和 tab2.y 是用户定义的类型,并且假设 tab2.y 已编入索引。除非优化器能够确定如何将子句反转到 tab2.y = tab1.x,否则优化器无法生成索引扫描,因为索引扫描机制期望在给定的运算符的左边看到索引列。PostgreSQL 将 not 简单地假设这是一个有效的转换—— = 运算符的创建者必须通过为运算符标记交换运算符信息来指定它有效。
It’s critical to provide commutator information for operators that will be used in indexes and join clauses, because this allows the query optimizer to “flip around” such a clause to the forms needed for different plan types. For example, consider a query with a WHERE clause like tab1.x = tab2.y, where tab1.x and tab2.y are of a user-defined type, and suppose that tab2.y is indexed. The optimizer cannot generate an index scan unless it can determine how to flip the clause around to tab2.y = tab1.x, because the index-scan machinery expects to see the indexed column on the left of the operator it is given. PostgreSQL will not simply assume that this is a valid transformation — the creator of the = operator must specify that it is valid, by marking the operator with commutator information.
在定义自交换运算符时,只需执行此操作即可。在定义一对交换运算符时,事情会变得有点棘手:第一个定义的运算符如何引用尚未定义的另一个运算符?这个问题有两个解决方案:
When you are defining a self-commutative operator, you just do it. When you are defining a pair of commutative operators, things are a little trickier: how can the first one to be defined refer to the other one, which you haven’t defined yet? There are two solutions to this problem:
38.15.2. NEGATOR #
如果提供了 NEGATOR 子句,则会为正在定义的运算符命名一个否定运算符。我们说运算符 A 是运算符 B 的否定运算符,如果它们都返回布尔结果,并且对于所有可能的输入 x,y,(x A y) 等于 NOT (x B y)。注意 B 也是 A 的否定运算符。例如,< 和 >= 是大多数数据类型的否定对。一个运算符永远不能有效地成为其自身的否定运算符。
The NEGATOR clause, if provided, names an operator that is the negator of the operator being defined. We say that operator A is the negator of operator B if both return Boolean results and (x A y) equals NOT (x B y) for all possible inputs x, y. Notice that B is also the negator of A. For example, < and >= are a negator pair for most data types. An operator can never validly be its own negator.
与交换运算符不同,一对一元运算符可以有效地标记为彼此的否定运算符;这意味着对于所有 x,(A x) 等于 NOT (B x)。
Unlike commutators, a pair of unary operators could validly be marked as each other’s negators; that would mean (A x) equals NOT (B x) for all x.
运算符的否定运算符必须与要定义的运算符具有相同的左和/或右操作数类型,因此,就像 COMMUTATOR 一样,NEGATOR 子句中只需要提供运算符名称即可。
An operator’s negator must have the same left and/or right operand types as the operator to be defined, so just as with COMMUTATOR, only the operator name need be given in the NEGATOR clause.
提供否定运算符对查询优化器非常有帮助,因为它允许将 NOT (x = y) 等表达式简化为 x <> y。这比您想象的要常见,因为 NOT 运算可以作为其他重新排列的结果而插入。
Providing a negator is very helpful to the query optimizer since it allows expressions like NOT (x = y) to be simplified into x <> y. This comes up more often than you might think, because NOT operations can be inserted as a consequence of other rearrangements.
否定运算符对可以使用上面为交换运算符对解释的相同方法定义。
Pairs of negator operators can be defined using the same methods explained above for commutator pairs.
38.15.3. RESTRICT #
如果提供了 RESTRICT 子句,则会为运算符命名一个限制选择性估计函数。(请注意,这是一个函数名称,而不是运算符名称。)RESTRICT 子句仅适用于返回 boolean 的二元运算符。限制选择性估算器的思想是猜测表中哪一部分行将满足形式为 WHERE 子句条件:
The RESTRICT clause, if provided, names a restriction selectivity estimation function for the operator. (Note that this is a function name, not an operator name.) RESTRICT clauses only make sense for binary operators that return boolean. The idea behind a restriction selectivity estimator is to guess what fraction of the rows in a table will satisfy a WHERE-clause condition of the form:
column OP constant
对于当前运算符和特定常量值。通过向优化器传达有关将通过此类形式的 WHERE 子句消除多少行的一些想法,这可以帮助优化器。(您可能想知道如果常量在左边会发生什么?嗯,那是 COMMUTATOR 的作用之一……)
for the current operator and a particular constant value. This assists the optimizer by giving it some idea of how many rows will be eliminated by WHERE clauses that have this form. (What happens if the constant is on the left, you might be wondering? Well, that’s one of the things that COMMUTATOR is for…)
编写新的限制选择性估计函数远远超出了本章的范围,但幸运的是,您通常可以仅对许多自己的运算符使用系统的一个标准估算器。这些是标准限制估算器:
Writing new restriction selectivity estimation functions is far beyond the scope of this chapter, but fortunately you can usually just use one of the system’s standard estimators for many of your own operators. These are the standard restriction estimators:
即使不是真正的等式或不等式,您也可以经常使用 eqsel 或 neqsel 来处理具有非常高或非常低选择性的运算符。例如,近似相等几何运算符使用 eqsel,假设它们通常只会匹配表中的一小部分条目。
You can frequently get away with using either eqsel or neqsel for operators that have very high or very low selectivity, even if they aren’t really equality or inequality. For example, the approximate-equality geometric operators use eqsel on the assumption that they’ll usually only match a small fraction of the entries in a table.
您可以对数据类型使用 scalarltsel、scalarlesel、scalargtsel 和 scalargesel 进行比较,这些数据类型有一些合理的方法可以转换为用于范围比较的数字标量。如果可能,将数据类型添加到 src/backend/utils/adt/selfuncs.c 中的 pg_type 系统目录的一列中,由函数 convert_to_scalar() 理解的数据类型中。(最终,此函数应该被通过 pg_type 系统目录一列识别的每个数据类型函数替换;但这尚未发生。)如果您不这样做,事情仍然会奏效,但优化器的估计值不会像实际情况那么好。
You can use scalarltsel, scalarlesel, scalargtsel and scalargesel for comparisons on data types that have some sensible means of being converted into numeric scalars for range comparisons. If possible, add the data type to those understood by the function convert_to_scalar() in src/backend/utils/adt/selfuncs.c. (Eventually, this function should be replaced by per-data-type functions identified through a column of the pg_type system catalog; but that hasn’t happened yet.) If you do not do this, things will still work, but the optimizer’s estimates won’t be as good as they could be.
matchingsel 是另一个有用的内置选择性估计函数,如果为输入数据类型收集了标准的 MCV 和/或直方图统计信息,它将适用于几乎所有二元运算符。它的默认估计值设置为 eqsel 中使用的默认估计值的两倍,这使其最适合严格程度略低于相等性的比较运算符。(或者你可以调用底层 generic_restriction_selectivity 函数,提供不同的默认估计值。)
Another useful built-in selectivity estimation function is matchingsel, which will work for almost any binary operator, if standard MCV and/or histogram statistics are collected for the input data type(s). Its default estimate is set to twice the default estimate used in eqsel, making it most suitable for comparison operators that are somewhat less strict than equality. (Or you could call the underlying generic_restriction_selectivity function, providing a different default estimate.)
src/backend/utils/adt/geo_selfuncs.c 中为几何运算符设计了其他选择性估计函数:areasel、positionsel 和 contsel。在撰写本文时,这些只是存根,但你可能希望使用它们(或甚至改进它们)。
There are additional selectivity estimation functions designed for geometric operators in src/backend/utils/adt/geo_selfuncs.c: areasel, positionsel, and contsel. At this writing these are just stubs, but you might want to use them (or even better, improve them) anyway.
38.15.4. JOIN #
如果提供了 JOIN 子句,它将为运算符命名连接选择性估计函数。(请注意,这是一个函数名,而不是运算符名。)JOIN 子句只对返回 boolean 的二元运算符有意义。连接选择性估计器的思路是猜测一对表中哪一部分行将满足以下形式的 WHERE 子句条件:
The JOIN clause, if provided, names a join selectivity estimation function for the operator. (Note that this is a function name, not an operator name.) JOIN clauses only make sense for binary operators that return boolean. The idea behind a join selectivity estimator is to guess what fraction of the rows in a pair of tables will satisfy a WHERE-clause condition of the form:
table1.column1 OP table2.column2
对于当前运算符。与 RESTRICT 子句一样,这极大地帮助了优化器,因为它可以找出多个可能的连接序列中哪个可能需要最少的工作。
for the current operator. As with the RESTRICT clause, this helps the optimizer very substantially by letting it figure out which of several possible join sequences is likely to take the least work.
与之前一样,本章将不会尝试解释如何编写连接选择性估计器函数,而是建议在适用时使用标准估计器之一:
As before, this chapter will make no attempt to explain how to write a join selectivity estimator function, but will just suggest that you use one of the standard estimators if one is applicable:
38.15.5. HASHES #
如果存在 HASHES 子句,它将告诉系统允许对基于此运算符的连接使用哈希连接方法。HASHES 仅对返回 boolean 的二元运算符有意义,实际上运算符必须表示某种数据类型或一对数据类型的相等性。
The HASHES clause, if present, tells the system that it is permissible to use the hash join method for a join based on this operator. HASHES only makes sense for a binary operator that returns boolean, and in practice the operator must represent equality for some data type or pair of data types.
哈希连接的基本假设是,连接运算符只能对哈希到同一哈希代码的左右值对返回真。如果将两个值放入不同的哈希存储桶,连接将根本不会比较它们,隐式地假设连接运算符的结果必须为假。因此,对于不表示某种相等性形式的运算符指定 HASHES 是毫无意义的。在大多数情况下,支持哈希仅适用于两侧采用相同数据类型的运算符是切实可行的。但是,有时有可能为两个或更多数据类型设计兼容的哈希函数;也就是说,即使值有不同的表示形式,也会为“相等”值生成相同的哈希代码的函数。例如,在对不同宽度整数进行哈希处理时,安排此属性相当简单。
The assumption underlying hash join is that the join operator can only return true for pairs of left and right values that hash to the same hash code. If two values get put in different hash buckets, the join will never compare them at all, implicitly assuming that the result of the join operator must be false. So it never makes sense to specify HASHES for operators that do not represent some form of equality. In most cases it is only practical to support hashing for operators that take the same data type on both sides. However, sometimes it is possible to design compatible hash functions for two or more data types; that is, functions that will generate the same hash codes for “equal” values, even though the values have different representations. For example, it’s fairly simple to arrange this property when hashing integers of different widths.
为了被标记为 HASHES,连接运算符必须出现在哈希索引运算符系列中。此项在创建运算符时不受强制,因为引用运算符系列当然还不能存在。但如果没有这样的运算符系列,则在哈希连接中使用运算符的尝试将在运行时失败。系统需要运算符系列来查找对应运算符的输入数据类型的特定于数据类型哈希函数。当然,你还必须在创建运算符系列之前创建合适的哈希函数。
To be marked HASHES, the join operator must appear in a hash index operator family. This is not enforced when you create the operator, since of course the referencing operator family couldn’t exist yet. But attempts to use the operator in hash joins will fail at run time if no such operator family exists. The system needs the operator family to find the data-type-specific hash function(s) for the operator’s input data type(s). Of course, you must also create suitable hash functions before you can create the operator family.
准备哈希函数时应小心谨慎,因为存在机器相关的方式,可能会导致它无法执行正确的操作。例如,如果你的数据类型是一个可能存在无意义填充位的结构,你不能简单地将整个结构传递给 hash_any。(除非你编写其他运算符和函数以确保未使用的位始终为零,这是推荐的策略。)另一个示例是,在符合 IEEE 浮点标准的机器上,负零和正零是不同的值(不同的位模式),但它们被定义为比较相等。如果浮点值可能包含负零,则需要额外的步骤来确保它生成与正零相同哈希值。
Care should be exercised when preparing a hash function, because there are machine-dependent ways in which it might fail to do the right thing. For example, if your data type is a structure in which there might be uninteresting pad bits, you cannot simply pass the whole structure to hash_any. (Unless you write your other operators and functions to ensure that the unused bits are always zero, which is the recommended strategy.) Another example is that on machines that meet the IEEE floating-point standard, negative zero and positive zero are different values (different bit patterns) but they are defined to compare equal. If a float value might contain negative zero then extra steps are needed to ensure it generates the same hash value as positive zero.
一个可哈希连接的运算符必须具有共同作者(如果两个操作数数据类型相同,则自己;如果不同,则相关的相等性运算符),并出现在相同的运算符系列中。如果并非如此,则在使用运算符时可能会出现规划器错误。此外,建议为支持多种数据类型的哈希运算符系列(但不是严格要求的)为每种数据类型组合提供相等性运算符;这允许更好的优化。
A hash-joinable operator must have a commutator (itself if the two operand data types are the same, or a related equality operator if they are different) that appears in the same operator family. If this is not the case, planner errors might occur when the operator is used. Also, it is a good idea (but not strictly required) for a hash operator family that supports multiple data types to provide equality operators for every combination of the data types; this allows better optimization.
Note
哈希可连接运算符的底层函数必须标记为不可变或稳定。如果它是易变的,则系统将永远不会尝试将该运算符用于哈希连接。
The function underlying a hash-joinable operator must be marked immutable or stable. If it is volatile, the system will never attempt to use the operator for a hash join.
Note
如果哈希可连接运算符具有标记为严格的底层函数,则该函数还必须完成:即,对于任何两个非空输入,它应返回真或假,而永远不返回空值。如果未遵循此规则,则 IN 操作的哈希优化可能会生成错误的结果。(具体来说,IN 可能会在正确答案应为空值的标准中返回 false,或者可能产生错误抱怨没有为 null 结果做好准备。)
If a hash-joinable operator has an underlying function that is marked strict, the function must also be complete: that is, it should return true or false, never null, for any two nonnull inputs. If this rule is not followed, hash-optimization of IN operations might generate wrong results. (Specifically, IN might return false where the correct answer according to the standard would be null; or it might yield an error complaining that it wasn’t prepared for a null result.)
38.15.6. MERGES #
如果存在 MERGES 子句,它将告诉系统允许对基于此运算符的连接使用合并连接方法。MERGES 仅对返回 boolean 的二元运算符有意义,实际上运算符必须表示某种数据类型或一对数据类型的相等性。
The MERGES clause, if present, tells the system that it is permissible to use the merge-join method for a join based on this operator. MERGES only makes sense for a binary operator that returns boolean, and in practice the operator must represent equality for some data type or pair of data types.
合并连接基于按顺序对左右表进行排序,然后并行扫描它们的思路。因此,两种数据类型都必须能够被完全排序,连接运算符必须只能对排序中“相同位置”处的值对执行成功。实际上,这意味着连接运算符的行为必须类似于相等性。但是,只要它们在逻辑上兼容,就可以合并连接两个不同的数据类型。例如,smallint-versus-integer 相等性运算符是可合并连接的。我们只需要将两种数据类型带入逻辑兼容序列的排序运算符。
Merge join is based on the idea of sorting the left- and right-hand tables into order and then scanning them in parallel. So, both data types must be capable of being fully ordered, and the join operator must be one that can only succeed for pairs of values that fall at the “same place” in the sort order. In practice this means that the join operator must behave like equality. But it is possible to merge-join two distinct data types so long as they are logically compatible. For example, the smallint-versus-integer equality operator is merge-joinable. We only need sorting operators that will bring both data types into a logically compatible sequence.
为了被标记为 MERGES,连接运算符必须作为 btree 索引运算符系列的相等性成员出现。此项在创建运算符时不受强制,因为引用运算符系列当然还不能存在。但是,除非可以找到匹配的运算符系列,否则该运算符实际上不会用于合并连接。因此,MERGES 标志作为暗示作用于规划器,值得寻找匹配的运算符系列。
To be marked MERGES, the join operator must appear as an equality member of a btree index operator family. This is not enforced when you create the operator, since of course the referencing operator family couldn’t exist yet. But the operator will not actually be used for merge joins unless a matching operator family can be found. The MERGES flag thus acts as a hint to the planner that it’s worth looking for a matching operator family.
一个可合并连接的运算符必须具有共同作者(如果两个操作数数据类型相同,则自己;如果不同,则相关的相等性运算符),并出现在相同的运算符系列中。如果并非如此,则在使用运算符时可能会出现规划器错误。此外,建议为支持多种数据类型的 btree 运算符系列(但不是严格要求的)为每种数据类型组合提供相等性运算符;这允许更好的优化。
A merge-joinable operator must have a commutator (itself if the two operand data types are the same, or a related equality operator if they are different) that appears in the same operator family. If this is not the case, planner errors might occur when the operator is used. Also, it is a good idea (but not strictly required) for a btree operator family that supports multiple data types to provide equality operators for every combination of the data types; this allows better optimization.