PostgreSQL Chinese Operation Guide
12.2. Tables and Indexes
The examples in the previous section illustrated full text matching using simple constant strings. This section shows how to search table data, optionally using indexes.
12.2.1. Searching a Table
It is possible to do a full text search without an index. A simple query to print the title of each row that contains the word friend in its body field is:
SELECT title
FROM pgweb
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'friend');
This will also find related words such as friends and friendly, since all these are reduced to the same normalized lexeme.
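The normalization can be inspected directly. The following is a minimal sketch that needs no table; the sample strings are illustrative and not taken from pgweb. If the three words reduce to one lexeme as described, the first statement lists a single lexeme with three positions and the second evaluates to true.

```sql
-- Show the lexemes the english configuration produces for related words.
SELECT to_tsvector('english', 'friend friends friendly');

-- Check whether a document containing only "friends" matches the query "friend".
SELECT to_tsvector('english', 'friends') @@ to_tsquery('english', 'friend');
```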
The query above specifies that the english configuration is to be used to parse and normalize the strings. Alternatively we could omit the configuration parameters:
SELECT title
FROM pgweb
WHERE to_tsvector(body) @@ to_tsquery('friend');
This query will use the configuration set by default_text_search_config.
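To see which configuration such a query would pick up, the setting can be checked and changed per session. This is a small sketch; choosing pg_catalog.english here is only an example value.

```sql
-- Display the configuration currently used by the one-argument functions.
SHOW default_text_search_config;

-- Change it for the current session only (example value).
SET default_text_search_config = 'pg_catalog.english';

-- Confirm the active configuration.
SELECT get_current_ts_config();
```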
A more complex example is to select the ten most recent documents that contain create and table in the title or body:
SELECT title
FROM pgweb
WHERE to_tsvector(title || ' ' || body) @@ to_tsquery('create & table')
ORDER BY last_mod_date DESC
LIMIT 10;
For clarity we omitted the coalesce function calls which would be needed to find rows that contain NULL in one of the two fields.
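For reference, a NULL-safe variant of the query might look like the sketch below. Without coalesce, a NULL in either column makes the whole concatenation NULL, so the row can never match. The empty-string fallback is one reasonable choice, not the only one.

```sql
SELECT title
FROM pgweb
-- Replace NULL with an empty string so the other field is still searched.
WHERE to_tsvector(coalesce(title, '') || ' ' || coalesce(body, ''))
      @@ to_tsquery('create & table')
ORDER BY last_mod_date DESC
LIMIT 10;
```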
Although these queries will work without an index, most applications will find this approach too slow, except perhaps for occasional ad-hoc searches. Practical use of text searching usually requires creating an index.
12.2.2. Creating Indexes
We can create a GIN index (Section 12.9) to speed up text searches:
CREATE INDEX pgweb_idx ON pgweb USING GIN (to_tsvector('english', body));
Notice that the 2-argument version of to_tsvector is used. Only text search functions that specify a configuration name can be used in expression indexes (Section 11.7). This is because the index contents must be unaffected by default_text_search_config. If they were affected, the index contents might be inconsistent because different entries could contain tsvectors that were created with different text search configurations, and there would be no way to guess which was which. It would be impossible to dump and restore such an index correctly.
Because the two-argument version of to_tsvector was used in the index above, only a query reference that uses the 2-argument version of to_tsvector with the same configuration name will use that index. That is, WHERE to_tsvector('english', body) @@ 'a & b' can use the index, but WHERE to_tsvector(body) @@ 'a & b' cannot. This ensures that an index will be used only with the same configuration used to create the index entries.
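One way to check this rule on a given installation is to compare query plans. The sketch below assumes the index on to_tsvector('english', body) shown earlier exists. On a very small table the planner may prefer a sequential scan in both cases, so the difference is only visible with enough rows.

```sql
-- Expression matches the index definition: an index scan is possible.
EXPLAIN
SELECT title
FROM pgweb
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'friend');

-- One-argument form does not match the indexed expression: no index scan.
EXPLAIN
SELECT title
FROM pgweb
WHERE to_tsvector(body) @@ to_tsquery('friend');
```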
It is possible to set up more complex expression indexes wherein the configuration name is specified by another column, e.g.:
CREATE INDEX pgweb_idx ON pgweb USING GIN (to_tsvector(config_name, body));
where config_name is a column in the pgweb table. This allows mixed configurations in the same index while recording which configuration was used for each index entry. This would be useful, for example, if the document collection contained documents in different languages. Again, queries that are meant to use the index must be phrased to match, e.g., WHERE to_tsvector(config_name, body) @@ 'a & b'.
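A self-contained sketch of the per-row configuration idea follows. The table docs_multilang, its columns, and the sample rows are hypothetical and used only for illustration. The configuration column is declared as regconfig so that it can be passed directly to to_tsvector.

```sql
-- Hypothetical table: each row records the configuration for its own text.
CREATE TABLE docs_multilang (
    id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    config_name regconfig NOT NULL,
    body        text
);

-- The index expression reads the configuration from the row itself.
CREATE INDEX docs_multilang_idx ON docs_multilang
    USING GIN (to_tsvector(config_name, body));

INSERT INTO docs_multilang (config_name, body) VALUES
    ('english', 'The friends were friendly'),
    ('french',  'Les amis sont arrivés');

-- The WHERE clause repeats the indexed expression so the index can be used.
SELECT id
FROM docs_multilang
WHERE to_tsvector(config_name, body) @@ to_tsquery('english', 'friend');
```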
Indexes can even concatenate columns:
CREATE INDEX pgweb_idx ON pgweb USING GIN (to_tsvector('english', title || ' ' || body));
Another approach is to create a separate tsvector column to hold the output of to_tsvector. To keep this column automatically up to date with its source data, use a stored generated column. This example is a concatenation of title and body, using coalesce to ensure that one field will still be indexed when the other is NULL:
ALTER TABLE pgweb
ADD COLUMN textsearchable_index_col tsvector
GENERATED ALWAYS AS (to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))) STORED;
Then we create a GIN index to speed up the search:
CREATE INDEX textsearch_idx ON pgweb USING GIN (textsearchable_index_col);
Now we are ready to perform a fast full text search:
SELECT title
FROM pgweb
WHERE textsearchable_index_col @@ to_tsquery('create & table')
ORDER BY last_mod_date DESC
LIMIT 10;
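To see the generated column follow its source data, the behavior can be tried on a scratch table. The sketch below uses a hypothetical table pgweb_demo so that it does not depend on the real pgweb schema. It assumes a PostgreSQL version that supports stored generated columns.

```sql
-- Hypothetical scratch table with the same generated-column definition.
CREATE TABLE pgweb_demo (
    title text,
    body  text,
    textsearchable_index_col tsvector
        GENERATED ALWAYS AS (
            to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))
        ) STORED
);

-- body is NULL, but the title is still indexed thanks to coalesce.
INSERT INTO pgweb_demo (title, body) VALUES ('Create table', NULL);

-- Updating a source column recomputes the tsvector automatically.
UPDATE pgweb_demo SET body = 'How to create a table in SQL';

SELECT title, textsearchable_index_col
FROM pgweb_demo
WHERE textsearchable_index_col @@ to_tsquery('english', 'create & table');
```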
One advantage of the separate-column approach over an expression index is that it is not necessary to explicitly specify the text search configuration in queries in order to make use of the index. As shown in the example above, the query can depend on default_text_search_config. Another advantage is that searches will be faster, since it will not be necessary to redo the to_tsvector calls to verify index matches. (This is more important when using a GiST index than a GIN index; see Section 12.9.) The expression-index approach is simpler to set up, however, and it requires less disk space since the tsvector representation is not stored explicitly.
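The disk-space side of this trade-off can be measured rather than guessed. The following sketch assumes the textsearch_idx index and the textsearchable_index_col column from the examples above exist. Actual numbers depend entirely on the data.

```sql
-- Size of the GIN index and of the table including its indexes and TOAST data.
SELECT pg_size_pretty(pg_relation_size('textsearch_idx'))  AS gin_index_size,
       pg_size_pretty(pg_total_relation_size('pgweb'))     AS table_total_size;

-- Approximate space taken by the stored tsvector values themselves.
SELECT pg_size_pretty(sum(pg_column_size(textsearchable_index_col))) AS tsvector_size
FROM pgweb;
```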