Postgresql 中文操作指南
12.9. Preferred Index Types for Text Search #
There are two kinds of indexes that can be used to speed up full text searches: GIN and GiST. Note that indexes are not mandatory for full text searching, but in cases where a column is searched on a regular basis, an index is usually desirable.
如要创建这样的索引,执行以下操作之一:
To create such an index, do one of:
-
CREATE INDEX _name ON table USING GIN (column);_
-
Creates a GIN (Generalized Inverted Index)-based index. The column must be of tsvector type.
-
-
CREATE INDEX _name ON table USING GIST (column [ { DEFAULT | tsvector_ops } (siglen = number) ] );_
-
Creates a GiST (Generalized Search Tree)-based index. The column can be of tsvector or tsquery type. Optional integer parameter siglen determines signature length in bytes (see below for details).
-
GIN 索引是优先选择的文本搜索索引类型。作为倒排索引,它们为每个单词(语素)包含一个索引项,并以压缩列表形式列出匹配的位置。多词搜索可以找到首个匹配项,然后使用索引删除缺少其他单词的行。GIN 索引仅存储 tsvector 值的单词(语素),而不存储其权重标签。因此,在使用涉及权重的查询时需要重新检查表行。
GIN indexes are the preferred text search index type. As inverted indexes, they contain an index entry for each word (lexeme), with a compressed list of matching locations. Multi-word searches can find the first match, then use the index to remove rows that are lacking additional words. GIN indexes store only the words (lexemes) of tsvector values, and not their weight labels. Thus a table row recheck is needed when using a query that involves weights.
GiST 索引是 lossy,这意味着该索引可能会产生错误匹配项,因此有必要检查实际表行以消除这种错误匹配项。(PostgreSQL 在需要时会自动执行此操作。)GiST 索引是有损索引,因为每个文档都在索引中由固定长度的签名表示。签名长度(以字节为单位)由可选整型参数 siglen 的值决定。默认签名长度(在未指定 siglen 时)为 124 字节,最大签名长度为 2024 字节。签名是通过将每个单词哈希到 n 位字符串中的单个位,然后对所有这些位进行或运算而生成的 n 位文档签名。当两个单词哈希到相同位位置时,就会出现错误匹配项。如果查询中的所有单词都有匹配项(真实或错误),则必须检索表行以查看匹配项是否正确。更长的签名会导致更精确的搜索(扫描索引和堆页页的较小部分),但代价是索引更大。
A GiST index is lossy, meaning that the index might produce false matches, and it is necessary to check the actual table row to eliminate such false matches. (PostgreSQL does this automatically when needed.) GiST indexes are lossy because each document is represented in the index by a fixed-length signature. The signature length in bytes is determined by the value of the optional integer parameter siglen. The default signature length (when siglen is not specified) is 124 bytes, the maximum signature length is 2024 bytes. The signature is generated by hashing each word into a single bit in an n-bit string, with all these bits OR-ed together to produce an n-bit document signature. When two words hash to the same bit position there will be a false match. If all words in the query have matches (real or false) then the table row must be retrieved to see if the match is correct. Longer signatures lead to a more precise search (scanning a smaller fraction of the index and fewer heap pages), at the cost of a larger index.
GiST 索引可以覆盖,即使用 INCLUDE 子句。包含的列可以具有没有任何 GiST 运算符类的任何数据类型。包含的属性将以非压缩格式存储。
A GiST index can be covering, i.e., use the INCLUDE clause. Included columns can have data types without any GiST operator class. Included attributes will be stored uncompressed.
有损性会导致性能下降,原因是由于错误匹配项引起的不必要的表记录提取。由于随机访问表记录速度很慢,因此会限制 GiST 索引的效用。错误匹配项的可能性取决于多个因素,尤其是唯一单词的数量,因此建议使用词典来减少此数量。
Lossiness causes performance degradation due to unnecessary fetches of table records that turn out to be false matches. Since random access to table records is slow, this limits the usefulness of GiST indexes. The likelihood of false matches depends on several factors, in particular the number of unique words, so using dictionaries to reduce this number is recommended.
请注意,通过增加 maintenance_work_mem,通常可以改善 GIN 索引的构建时间,而 GiST 索引构建时间对此参数不敏感。
Note that GIN index build time can often be improved by increasing maintenance_work_mem, while GiST index build time is not sensitive to that parameter.
对大集合进行分区以及正确使用 GIN 和 GiST 索引,可以实现借助在线更新执行的极快速搜索。分区可以在数据库级别使用表继承来完成,或通过在服务器上分发文档并收集外部搜索结果来完成,例如通过 Foreign Data访问。后者之所以可行,是因为排名函数仅使用本地信息。
Partitioning of big collections and the proper use of GIN and GiST indexes allows the implementation of very fast searches with online update. Partitioning can be done at the database level using table inheritance, or by distributing documents over servers and collecting external search results, e.g., via Foreign Data access. The latter is possible because ranking functions use only local information.