Postgresql 中文操作指南
70.4. Implementation #
在内部,GIN 索引包含在键上构造的 B 树索引,其中每个键是索引项的一个或多个元素(例如,数组的成员),而叶页中的每一个元组包含指向堆指针的 B 树的指针(“发布树”)或一个简单的堆指针列表(“发布列表”),当列表足够小到可以与键值一起放入一个索引元组中时。 Figure 70.1说明了 GIN 索引的这些组件。
Internally, a GIN index contains a B-tree index constructed over keys, where each key is an element of one or more indexed items (a member of an array, for example) and where each tuple in a leaf page contains either a pointer to a B-tree of heap pointers (a “posting tree”), or a simple list of heap pointers (a “posting list”) when the list is small enough to fit into a single index tuple along with the key value. Figure 70.1 illustrates these components of a GIN index.
从 PostgreSQL 9.1 开始,可以在索引中包含空键值。此外,根据_extractValue_,对于空或不包含任何键的索引项,在索引中包含占位符空值。这允许搜索找到空项。
As of PostgreSQL 9.1, null key values can be included in the index. Also, placeholder nulls are included in the index for indexed items that are null or contain no keys according to extractValue. This allows searches that should find empty items to do so.
通过针对复合值(列编号、键值)构建一棵 B 树来实现多列 GIN 索引。不同列的键值可以具有不同类型。
Multicolumn GIN indexes are implemented by building a single B-tree over composite values (column number, key value). The key values for different columns can be of different types.
Figure 70.1. GIN Internals
70.4.1. GIN Fast Update Technique #
更新 GIN 索引往往很慢,这是因为倒排索引的本质:插入或更新一个堆行会导致对索引的许多插入(每个从被索引项中提取的键一个)。GIN 能够通过将新元组插入到一个临时的、未排序的待处理条目列表中来推迟大部分工作。当对表进行 VACUUM 或自动分析时,或者当 _gin_clean_pending_list_函数被调用时,或者如果待处理列表变大到 gin_pending_list_limit时,条目将使用在初始索引创建期间所使用的相同批量插入技术移到主 GIN 数据结构中。即使计算额外的 VACUUM 开销在内,这也极大地提高了 GIN 索引的更新速度。此外,开销的工作可以由后台进程完成,而不是在前台查询处理中完成。
Updating a GIN index tends to be slow because of the intrinsic nature of inverted indexes: inserting or updating one heap row can cause many inserts into the index (one for each key extracted from the indexed item). GIN is capable of postponing much of this work by inserting new tuples into a temporary, unsorted list of pending entries. When the table is vacuumed or autoanalyzed, or when gin_clean_pending_list function is called, or if the pending list becomes larger than gin_pending_list_limit, the entries are moved to the main GIN data structure using the same bulk insert techniques used during initial index creation. This greatly improves GIN index update speed, even counting the additional vacuum overhead. Moreover the overhead work can be done by a background process instead of in foreground query processing.
此方法的主要缺点是,除了搜索常规索引,搜索还必须扫描待定条目列表,所以大量的待定条目将显著减慢搜索速度。另一个缺点是,虽然大多数更新很快,但导致待定列表变得“太大”的更新将引起立即清理周期,因此比其他更新要慢得多。正确使用自动 VACUUM 可以最大程度地减少这两个问题。
The main disadvantage of this approach is that searches must scan the list of pending entries in addition to searching the regular index, and so a large list of pending entries will slow searches significantly. Another disadvantage is that, while most updates are fast, an update that causes the pending list to become “too large” will incur an immediate cleanup cycle and thus be much slower than other updates. Proper use of autovacuum can minimize both of these problems.
如果一致的响应时间比更新速度更重要,可以通过关闭 GIN 索引的 fastupdate 存储参数来禁用保留项的使用。有关详细信息,请参见 CREATE INDEX 。
If consistent response time is more important than update speed, use of pending entries can be disabled by turning off the fastupdate storage parameter for a GIN index. See CREATE INDEX for details.
70.4.2. Partial Match Algorithm #
GIN 可以支持“部分匹配”查询,其中查询不会为一个或多个键确定明确匹配,而是可能匹配项落在合理范围的键值内(在由_compare_支持方法确定的键排序顺序中)。extractQuery 方法不会返回要精确匹配的键值,而是返回作为要搜索范围的下限的键值,并将_pmatch_标志设置为 true。然后使用_comparePartial_方法扫描键范围。comparePartial 必须对匹配的索引键返回零,对仍在要搜索的范围内的不匹配项返回小于零,或者如果索引键超过可能匹配的范围,则返回大于零。
GIN can support “partial match” queries, in which the query does not determine an exact match for one or more keys, but the possible matches fall within a reasonably narrow range of key values (within the key sorting order determined by the compare support method). The extractQuery method, instead of returning a key value to be matched exactly, returns a key value that is the lower bound of the range to be searched, and sets the pmatch flag true. The key range is then scanned using the comparePartial method. comparePartial must return zero for a matching index key, less than zero for a non-match that is still within the range to be searched, or greater than zero if the index key is past the range that could match.