Postgresql 中文操作指南

62.3. Genetic Query Optimization (GEQO) in PostgreSQL #

GEQO 模块将查询优化问题当作众所周知的旅行商问题 (TSP) 来处理。可能的查询计划被编码为整数字符串。每个字符串表示从一个查询关系到下一个关系的连接顺序。例如,连接树

The GEQO module approaches the query optimization problem as though it were the well-known traveling salesman problem (TSP). Possible query plans are encoded as integer strings. Each string represents the join order from one relation of the query to the next. For example, the join tree

   /\
  /\ 2
 /\ 3
4  1

由整数字符串“4-1-3-2”编码,这意味着,首先连接关系“4”和“1”,然后“3”以及“2”,其中 1、2、3、4 是 PostgreSQL 优化器内的关系 ID。

is encoded by the integer string '4-1-3-2', which means, first join relation '4' and '1', then '3', and then '2', where 1, 2, 3, 4 are relation IDs within the PostgreSQL optimizer.

GEQO 在 PostgreSQL 中实现的具体特性:

Specific characteristics of the GEQO implementation in PostgreSQL are:

GEQO 模块的某些部分从 D. Whitley 的 Genitor 算法中改编而来。

Parts of the GEQO module are adapted from D. Whitley’s Genitor algorithm.

GEQO 模块允许 PostgreSQL 查询优化器有效地通过非穷举搜索支持大量连接查询。

The GEQO module allows the PostgreSQL query optimizer to support large join queries effectively through non-exhaustive search.

62.3.1. Generating Possible Plans with GEQO #

GEQO 规划进程使用标准规划代码为单个关系的扫描生成计划。然后,使用遗传方法来制定连接计划。如上所示,每个候选连接计划都由一个序列表示,按此序列对基础关系进行连接。在初始阶段,GEQO 代码只是随机生成一些可能的连接序列。对于每个考虑的连接序列,将调用标准计划代码,以估计使用该连接序列执行该查询的成本。(对于连接序列的每个步骤,将考虑所有三种可能的连接策略;并且最初确定的所有关系扫描计划都是可用的。估计的成本是这些可能性中最低的。)估计成本较低的连接序列被认为“更合适”,而估计成本较高的则被认为“较不合适”。遗传算法会丢弃最不合适的候选。然后通过组合更合适候选的基因来生成新的候选——也就是说,通过使用已知低成本连接序列的随机选择部分来创建新的候选序列以供考虑。此过程将重复,直到考虑了预设数量的连接序列;然后使用在搜索过程中任何时间找到的最佳序列来生成最终计划。

The GEQO planning process uses the standard planner code to generate plans for scans of individual relations. Then join plans are developed using the genetic approach. As shown above, each candidate join plan is represented by a sequence in which to join the base relations. In the initial stage, the GEQO code simply generates some possible join sequences at random. For each join sequence considered, the standard planner code is invoked to estimate the cost of performing the query using that join sequence. (For each step of the join sequence, all three possible join strategies are considered; and all the initially-determined relation scan plans are available. The estimated cost is the cheapest of these possibilities.) Join sequences with lower estimated cost are considered “more fit” than those with higher cost. The genetic algorithm discards the least fit candidates. Then new candidates are generated by combining genes of more-fit candidates — that is, by using randomly-chosen portions of known low-cost join sequences to create new sequences for consideration. This process is repeated until a preset number of join sequences have been considered; then the best one found at any time during the search is used to generate the finished plan.

此过程本质上是非确定性的,原因是初始群体选择和最佳候选者的后续“突变”中做出的随机选择。为了避免选定计划出现令人惊讶的更改,每次 GEQO 算法运行都会使用当前 geqo_seed 参数设置重新启动其随机数生成器。只要 geqo_seed 和其他 GEQO 参数保持固定,就会针对给定的查询(以及其他计划输入,如统计信息)生成相同的计划。要试验不同的搜索路径,可以尝试更改 geqo_seed

This process is inherently nondeterministic, because of the randomized choices made during both the initial population selection and subsequent “mutation” of the best candidates. To avoid surprising changes of the selected plan, each run of the GEQO algorithm restarts its random number generator with the current geqo_seed parameter setting. As long as geqo_seed and the other GEQO parameters are kept fixed, the same plan will be generated for a given query (and other planner inputs such as statistics). To experiment with different search paths, try changing geqo_seed.

62.3.2. Future Implementation Tasks for PostgreSQL GEQO #

仍需要对遗传算法参数设置进行改进。在文件 src/backend/optimizer/geqo/geqo_main.c 中,例程 gimme_pool_sizegimme_number_generations 中,我们必须对参数设置找到一个折衷方案,以满足两个相互竞争的要求:

Work is still needed to improve the genetic algorithm parameter settings. In file src/backend/optimizer/geqo/geqo_main.c, routines gimme_pool_size and gimme_number_generations, we have to find a compromise for the parameter settings to satisfy two competing demands:

在当前的实现中,每个候选连接序列的适应性是通过从头开始运行标准计划程序的连接选择和成本估算代码来估算的。在不同的候选使用类似的连接子序列的范围内,将会重复大量的操作。通过保留子连接的成本估算,可以使此过程显著地加快。问题是如何避免在保留该状态上占用不合理的内存量。

In the current implementation, the fitness of each candidate join sequence is estimated by running the standard planner’s join selection and cost estimation code from scratch. To the extent that different candidates use similar sub-sequences of joins, a great deal of work will be repeated. This could be made significantly faster by retaining cost estimates for sub-joins. The problem is to avoid expending unreasonable amounts of memory on retaining that state.

在一个更基本的水准上,用为 TSP 设计的 GA 算法来解决查询优化是否合适尚不清楚。在 TSP 案例中,与任何子字符串(部分行程)相关联的成本独立于行程的其余部分,但这对于查询优化而言肯定不是真的。因此,边缘重组交叉是否是最高效的变异过程还有待商榷。

At a more basic level, it is not clear that solving query optimization with a GA algorithm designed for TSP is appropriate. In the TSP case, the cost associated with any substring (partial tour) is independent of the rest of the tour, but this is certainly not true for query optimization. Thus it is questionable whether edge recombination crossover is the most effective mutation procedure.