Postgresql 中文操作指南

12.4. Additional Features #

本节描述了与文本搜索相关的其他函数和运算符。

This section describes additional functions and operators that are useful in connection with text search.

12.4.1. Manipulating Documents #

Section 12.3.1 显示了如何将原始文本文档转换为 tsvector 值。PostgreSQL 还提供可用于处理已呈 tsvector 形式的文档的函数和运算符。

Section 12.3.1 showed how raw textual documents can be converted into tsvector values. PostgreSQL also provides functions and operators that can be used to manipulate documents that are already in tsvector form.

tsvector || tsvector
- The tsvector concatenation operator returns a vector which combines the lexemes and positional information of the two vectors given as arguments. Positions and weight labels are retained during the concatenation. Positions appearing in the right-hand vector are offset by the largest position mentioned in the left-hand vector, so that the result is nearly equivalent to the result of performing to_tsvector on the concatenation of the two original document strings. (The equivalence is not exact, because any stop-words removed from the end of the left-hand argument will not affect the result, whereas they would have affected the positions of the lexemes in the right-hand argument if textual concatenation were used.)
- One advantage of using concatenation in the vector form, rather than concatenating text before applying to_tsvector, is that you can use different configurations to parse different sections of the document. Also, because the setweight function marks all lexemes of the given vector the same way, it is necessary to parse the text and do setweight before concatenating if you want to label different parts of the document with different weights.
setweight(_vector tsvector, weight "char") returns tsvector_
- setweight returns a copy of the input vector in which every position has been labeled with the given weight, either A, B, C, or D. (D is the default for new vectors and as such is not displayed on output.) These labels are retained when vectors are concatenated, allowing words from different parts of a document to be weighted differently by ranking functions.
- Note that weight labels apply to positions, not lexemes. If the input vector has been stripped of positions then setweight does nothing.
length(_vector tsvector) returns integer_
- Returns the number of lexemes stored in the vector.
strip(_vector tsvector) returns tsvector_
- Returns a vector that lists the same lexemes as the given vector, but lacks any position or weight information. The result is usually much smaller than an unstripped vector, but it is also less useful. Relevance ranking does not work as well on stripped vectors as unstripped ones. Also, the <→ (FOLLOWED BY) tsquery operator will never match stripped input, since it cannot determine the distance between lexeme occurrences.

可以在 Table 9.43 中找到 tsvector 相关函数的完整列表。

A full list of tsvector-related functions is available in Table 9.43.

12.4.2. Manipulating Queries #

Section 12.3.2 显示了如何将原始文本查询转换为 tsquery 值。PostgreSQL 还提供可用于处理已呈 tsquery 形式的查询的函数和运算符。

Section 12.3.2 showed how raw textual queries can be converted into tsquery values. PostgreSQL also provides functions and operators that can be used to manipulate queries that are already in tsquery form.

tsquery && tsquery
- Returns the AND-combination of the two given queries.
tsquery || tsquery
- Returns the OR-combination of the two given queries.
!! _tsquery_
- Returns the negation (NOT) of the given query.
tsquery <→ tsquery
- Returns a query that searches for a match to the first given query immediately followed by a match to the second given query, using the <→ (FOLLOWED BY) tsquery operator. For example:

SELECT to_tsquery('fat') <-> to_tsquery('cat | rat');
          ?column?
----------------------------
 'fat' <-> ( 'cat' | 'rat' )

tsquery_phrase(_query1 tsquery, query2 tsquery [, distance integer ]) returns tsquery_
- Returns a query that searches for a match to the first given query followed by a match to the second given query at a distance of exactly distance lexemes, using the <_N>_ tsquery operator. For example:

SELECT tsquery_phrase(to_tsquery('fat'), to_tsquery('cat'), 10);
  tsquery_phrase
------------------
 'fat' <10> 'cat'

numnode(_query tsquery) returns integer_
- Returns the number of nodes (lexemes plus operators) in a tsquery. This function is useful to determine if the query is meaningful (returns > 0), or contains only stop words (returns 0). Examples:

SELECT numnode(plainto_tsquery('the any'));
NOTICE:  query contains only stopword(s) or doesn't contain lexeme(s), ignored
 numnode
---------
       0

SELECT numnode('foo & bar'::tsquery);
 numnode
---------
       3

querytree(_query tsquery) returns text_
- Returns the portion of a tsquery that can be used for searching an index. This function is useful for detecting unindexable queries, for example those containing only stop words or only negated terms. For example:

SELECT querytree(to_tsquery('defined'));
 querytree
-----------
 'defin'

SELECT querytree(to_tsquery('!defined'));
 querytree
-----------
 T

12.4.2.1. Query Rewriting #

ts_rewrite 函数系列搜索给定的 tsquery 中是否出现目标子查询，并使用替代子查询替换每次出现。从本质上讲，此操作是子串替换的 tsquery 专用版本。可以将目标和替代组合视为 query rewrite rule。此类重写规则的集合可能是一个强大的搜索辅助工具。例如，你可以使用同义词（例如，new york、big apple、nyc、gotham）来扩展搜索，或缩小搜索范围以将用户引导至某些热门话题。此功能与同义词库 ( Section 12.6.4) 之间有一些功能重叠。但是，你可以即时修改一组重写规则而无需重新索引，而更新同义词库则需要重新索引才能生效。

The ts_rewrite family of functions search a given tsquery for occurrences of a target subquery, and replace each occurrence with a substitute subquery. In essence this operation is a tsquery-specific version of substring replacement. A target and substitute combination can be thought of as a query rewrite rule. A collection of such rewrite rules can be a powerful search aid. For example, you can expand the search using synonyms (e.g., new york, big apple, nyc, gotham) or narrow the search to direct the user to some hot topic. There is some overlap in functionality between this feature and thesaurus dictionaries (Section 12.6.4). However, you can modify a set of rewrite rules on-the-fly without reindexing, whereas updating a thesaurus requires reindexing to be effective.

ts_rewrite (_query tsquery, target tsquery, substitute tsquery) returns tsquery_
- This form of ts_rewrite simply applies a single rewrite rule: target is replaced by substitute wherever it appears in query. For example:

SELECT ts_rewrite('a & b'::tsquery, 'a'::tsquery, 'c'::tsquery);
 ts_rewrite
------------
 'b' & 'c'

ts_rewrite (_query tsquery, select text) returns tsquery_
- This form of ts_rewrite accepts a starting query and an SQL select command, which is given as a text string. The select must yield two columns of tsquery type. For each row of the select result, occurrences of the first column value (the target) are replaced by the second column value (the substitute) within the current query value. For example:

CREATE TABLE aliases (t tsquery PRIMARY KEY, s tsquery);
INSERT INTO aliases VALUES('a', 'c');

SELECT ts_rewrite('a & b'::tsquery, 'SELECT t,s FROM aliases');
 ts_rewrite
------------
 'b' & 'c'

Note that when multiple rewrite rules are applied in this way, the order of application can be important; so in practice you will want the source query to ORDER BY some ordering key.

让我们考虑一个真实的天文学示例。我们将使用表驱动重写规则扩展查询 supernovae：

Let’s consider a real-life astronomical example. We’ll expand query supernovae using table-driven rewriting rules:

CREATE TABLE aliases (t tsquery primary key, s tsquery);
INSERT INTO aliases VALUES(to_tsquery('supernovae'), to_tsquery('supernovae|sn'));

SELECT ts_rewrite(to_tsquery('supernovae & crab'), 'SELECT * FROM aliases');
           ts_rewrite
---------------------------------
 'crab' & ( 'supernova' | 'sn' )

我们只需更新表格即可更改重写规则：

We can change the rewriting rules just by updating the table:

UPDATE aliases
SET s = to_tsquery('supernovae|sn & !nebulae')
WHERE t = to_tsquery('supernovae');

SELECT ts_rewrite(to_tsquery('supernovae & crab'), 'SELECT * FROM aliases');
                 ts_rewrite
---------------------------------------------
 'crab' & ( 'supernova' | 'sn' & !'nebula' )

当有许多重写规则时，重写会很慢，因为它要检查每条规则是否可能匹配。为了筛选出明显的非候选规则，我们可以使用 tsquery 类型的包含运算符。在下面的示例中，我们只选择那些可能与原始查询匹配的规则：

Rewriting can be slow when there are many rewriting rules, since it checks every rule for a possible match. To filter out obvious non-candidate rules we can use the containment operators for the tsquery type. In the example below, we select only those rules which might match the original query:

SELECT ts_rewrite('a & b'::tsquery,
                  'SELECT t,s FROM aliases WHERE ''a & b''::tsquery @> t');
 ts_rewrite
------------
 'b' & 'c'

12.4.3. Triggers for Automatic Updates #

Note

本节中描述的方法已因使用存储的生成列而失效，如 Section 12.2.2所述。

The method described in this section has been obsoleted by the use of stored generated columns, as described in Section 12.2.2.

当使用一个单独的列来存储文档的 tsvector 表示时，需要创建一个触发器，以便在文档内容列更改时更新 tsvector 列。为此提供了两个内置触发器函数，或者你可以自己编写。

When using a separate column to store the tsvector representation of your documents, it is necessary to create a trigger to update the tsvector column when the document content columns change. Two built-in trigger functions are available for this, or you can write your own.

tsvector_update_trigger(tsvector_column_name, config_name, text_column_name [, ... ])
tsvector_update_trigger_column(tsvector_column_name, config_column_name, text_column_name [, ... ])

这些触发器函数根据 CREATE TRIGGER 命令中指定的参数，自动从一个或多个文本列计算出一个 tsvector 列。以下是一个使用方法的示例：

These trigger functions automatically compute a tsvector column from one or more textual columns, under the control of parameters specified in the CREATE TRIGGER command. An example of their use is:

CREATE TABLE messages (
    title       text,
    body        text,
    tsv         tsvector
);

CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
ON messages FOR EACH ROW EXECUTE FUNCTION
tsvector_update_trigger(tsv, 'pg_catalog.english', title, body);

INSERT INTO messages VALUES('title here', 'the body text is here');

SELECT * FROM messages;
   title    |         body          |            tsv
------------+-----------------------+----------------------------
 title here | the body text is here | 'bodi':4 'text':5 'titl':1

SELECT title, body FROM messages WHERE tsv @@ to_tsquery('title & body');
   title    |         body
------------+-----------------------
 title here | the body text is here

创建此触发器后，title 或 body 中的任何更改都将自动反映到 tsv 中，而应用程序不必担心这一点。

Having created this trigger, any change in title or body will automatically be reflected into tsv, without the application having to worry about it.

第一个触发器参数必须是要更新的 tsvector 列的名称。第二个参数指定要用于执行转换的文本搜索配置。对于 tsvector_update_trigger，配置名称只作为第二个触发器参数给出。它必须如上文所示经过模式限定，以便触发器行为不会因 search_path 的更改而更改。对于 tsvector_update_trigger_column，第二个触发器参数是另一张表列的名称，该列必须是 regconfig 类型。这允许进行每行配置选择。其余参数是文本列的名称（text、varchar 或 char 类型）。这些将按照所给顺序包括在文档中。NULL 值将被跳过（但仍会对其他列进行索引）。

The first trigger argument must be the name of the tsvector column to be updated. The second argument specifies the text search configuration to be used to perform the conversion. For tsvector_update_trigger, the configuration name is simply given as the second trigger argument. It must be schema-qualified as shown above, so that the trigger behavior will not change with changes in search_path. For tsvector_update_trigger_column, the second trigger argument is the name of another table column, which must be of type regconfig. This allows a per-row selection of configuration to be made. The remaining argument(s) are the names of textual columns (of type text, varchar, or char). These will be included in the document in the order given. NULL values will be skipped (but the other columns will still be indexed).

这些内置触发器的限制在于它们将所有输入列都视为相同。要以不同的方式处理列，例如对标题的权重与正文不同，需要编写一个自定义触发器。这里有一个使用 PL/pgSQL 作为触发器语言的示例：

A limitation of these built-in triggers is that they treat all the input columns alike. To process columns differently — for example, to weight title differently from body — it is necessary to write a custom trigger. Here is an example using PL/pgSQL as the trigger language:

CREATE FUNCTION messages_trigger() RETURNS trigger AS $$
begin
  new.tsv :=
     setweight(to_tsvector('pg_catalog.english', coalesce(new.title,'')), 'A') ||
     setweight(to_tsvector('pg_catalog.english', coalesce(new.body,'')), 'D');
  return new;
end
$$ LANGUAGE plpgsql;

CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
    ON messages FOR EACH ROW EXECUTE FUNCTION messages_trigger();

请记住，在触发器中创建 tsvector 值时，显式指定配置名称非常重要，以便列的内容不会受 default_text_search_config 中更改的影响。如果不这样做，很可能会导致问题，例如，在转储和还原后搜索结果会更改。

Keep in mind that it is important to specify the configuration name explicitly when creating tsvector values inside triggers, so that the column’s contents will not be affected by changes to default_text_search_config. Failure to do this is likely to lead to problems such as search results changing after a dump and restore.

12.4.4. Gathering Document Statistics #

函数 ts_stat 对于检查配置以及查找停用词候选非常有用。

The function ts_stat is useful for checking your configuration and for finding stop-word candidates.

ts_stat(sqlquery text, [ weights text, ]
        OUT word text, OUT ndoc integer,
        OUT nentry integer) returns setof record

sqlquery 是包含 SQL 查询的文本值，该查询必须返回单个 tsvector 列。ts_stat 执行查询，并返回 tsvector 数据中包含的每个不同词素（单词）的统计信息。返回的列是

sqlquery is a text value containing an SQL query which must return a single tsvector column. ts_stat executes the query and returns statistics about each distinct lexeme (word) contained in the tsvector data. The columns returned are

如果提供了 weights，则只计入具有其中一种权重的出现。

If weights is supplied, only occurrences having one of those weights are counted.

例如，要查找文档集合中最频繁出现的十个单词：

For example, to find the ten most frequent words in a document collection:

SELECT * FROM ts_stat('SELECT vector FROM apod')
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;

同样，但仅计算带权重的单词出现次数 A 或 B:

The same, but counting only word occurrences with weight A or B:

SELECT * FROM ts_stat('SELECT vector FROM apod', 'ab')
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;