Postgresql 中文操作指南
12.1. Introduction #
全文搜索(或简称 text search)提供了识别满足 query 的自然语言 documents,并根据查询相关性对它们进行排序的功能。最常见的搜索类型是查找包含给定 query terms 的所有文档,并按照它们对查询的 similarity 顺序返回它们。query 和 similarity 的概念非常灵活,具体取决于应用程序。最简单的搜索将 query 视为一组单词,将 similarity 视为文档中查询单词的频率。
Full Text Searching (or just text search) provides the capability to identify natural-language documents that satisfy a query, and optionally to sort them by relevance to the query. The most common type of search is to find all documents containing given query terms and return them in order of their similarity to the query. Notions of query and similarity are very flexible and depend on the specific application. The simplest search considers query as a set of words and similarity as the frequency of query words in the document.
数据库中文本搜索操作符已经存在多年。PostgreSQL 具有 、*、LIKE 和 ILIKE 运算符,用于文本数据类型,但它们缺少现代信息系统所需许多必需的属性:
Textual search operators have existed in databases for years. PostgreSQL has ~, ~*, LIKE, and ILIKE operators for textual data types, but they lack many essential properties required by modern information systems:
全文索引允许对文档进行 preprocessed,并将索引保存下来,以便以后快速搜索。预处理包括:
Full text indexing allows documents to be preprocessed and an index saved for later rapid searching. Preprocessing includes:
词典允许对令牌如何规范化进行细粒度控制。如果使用适当的词典,可以:
Dictionaries allow fine-grained control over how tokens are normalized. With appropriate dictionaries, you can:
提供数据类型 tsvector 用于存储预处理文档,以及类型 tsquery 用于表示已处理的查询 ( Section 8.11)。这些数据类型可使用许多函数和操作符 ( Section 9.13),其中最重要的操作符是匹配操作符 @@,我们将在 Section 12.1.2 中介绍该操作符。可以使用索引 ( Section 12.9) 加快全文搜索。
A data type tsvector is provided for storing preprocessed documents, along with a type tsquery for representing processed queries (Section 8.11). There are many functions and operators available for these data types (Section 9.13), the most important of which is the match operator @@, which we introduce in Section 12.1.2. Full text searches can be accelerated using indexes (Section 12.9).
12.1.1. What Is a Document? #
document 是全文搜索系统中搜索的单元;例如,杂志文章或电子邮件消息。文本搜索引擎必须能够解析文档并将词素(关键词)的关联与其父文档一起存储。之后,这些关联用于搜索包含查询词的文档。
A document is the unit of searching in a full text search system; for example, a magazine article or email message. The text search engine must be able to parse documents and store associations of lexemes (key words) with their parent document. Later, these associations are used to search for documents that contain query words.
在 PostgreSQL 中搜索时,文档通常是数据库表中的行内的文本字段,或这些字段的组合(连接),可能存储在多个表中或动态获取。换句话说,文档可以由不同的部分构成以进行索引,而它可能不会以整体形式存储在任何地方。例如:
For searches within PostgreSQL, a document is normally a textual field within a row of a database table, or possibly a combination (concatenation) of such fields, perhaps stored in several tables or obtained dynamically. In other words, a document can be constructed from different parts for indexing and it might not be stored anywhere as a whole. For example:
SELECT title || ' ' || author || ' ' || abstract || ' ' || body AS document
FROM messages
WHERE mid = 12;
SELECT m.title || ' ' || m.author || ' ' || m.abstract || ' ' || d.body AS document
FROM messages m, docs d
WHERE m.mid = d.did AND m.mid = 12;
Note
实际上,在这些示例查询中,应使用 coalesce 以防止单个 NULL 属性导致整个文档的 NULL 结果。
Actually, in these example queries, coalesce should be used to prevent a single NULL attribute from causing a NULL result for the whole document.
另一种可能性是将文档作为简单文本文件存储在文件系统中。在这种情况下,数据库可用于存储全文索引并执行搜索,并且可以利用一些唯一标识符从文件系统中检索文档。但是,从数据库外部检索文件需要超级用户权限或特殊功能支持,因此这通常不如将所有数据保存在 PostgreSQL 中方便。而且,将所有内容保存在数据库中可以很容易地访问文档元数据,以协助索引和显示。
Another possibility is to store the documents as simple text files in the file system. In this case, the database can be used to store the full text index and to execute searches, and some unique identifier can be used to retrieve the document from the file system. However, retrieving files from outside the database requires superuser permissions or special function support, so this is usually less convenient than keeping all the data inside PostgreSQL. Also, keeping everything inside the database allows easy access to document metadata to assist in indexing and display.
出于文本搜索目的,每个文档都必须被简化为预处理 tsvector 格式。对文档的 tsvector 表示进行搜索和排序——仅在文档被选中要显示给用户时才需要检索原始文本。因此,我们经常把 tsvector 称作文档,但当然它只是完整文档的紧凑表示。
For text search purposes, each document must be reduced to the preprocessed tsvector format. Searching and ranking are performed entirely on the tsvector representation of a document — the original text need only be retrieved when the document has been selected for display to a user. We therefore often speak of the tsvector as being the document, but of course it is only a compact representation of the full document.
12.1.2. Basic Text Matching #
PostgreSQL 中的全文搜索基于 match 运算符 @@,如果 tsvector(文档)与 tsquery(查询)匹配,则返回 true。按哪种数据类型首先编写无关紧要:
Full text searching in PostgreSQL is based on the match operator @@, which returns true if a tsvector (document) matches a tsquery (query). It doesn’t matter which data type is written first:
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'cat & rat'::tsquery;
?column?
----------
t
SELECT 'fat & cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
?column?
----------
f
如上例所示,tsquery 不仅仅是原始文本,与 tsvector 不仅是原始文本没什么两样。tsquery 包含搜索词,这些词必须已经是标准化词素,并且可以使用 AND、OR、NOT 和 FOLLOWED BY 操作符组合多个词。(有关语法详细信息,请参阅 Section 8.11.2。)有函数 to_tsquery、plainto_tsquery 和 phraseto_tsquery 有助于将用户书写的文本转换为适当的 tsquery,主要是通过标准化文本中出现的单词。类似地,to_tsvector 用于解析和标准化文档字符串。所以在实践中,文本搜索匹配看起来更像是这样:
As the above example suggests, a tsquery is not just raw text, any more than a tsvector is. A tsquery contains search terms, which must be already-normalized lexemes, and may combine multiple terms using AND, OR, NOT, and FOLLOWED BY operators. (For syntax details see Section 8.11.2.) There are functions to_tsquery, plainto_tsquery, and phraseto_tsquery that are helpful in converting user-written text into a proper tsquery, primarily by normalizing words appearing in the text. Similarly, to_tsvector is used to parse and normalize a document string. So in practice a text search match would look more like this:
SELECT to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat & rat');
?column?
----------
t
请注意,如果写成以下形式,此匹配不会成功
Observe that this match would not succeed if written as
SELECT 'fat cats ate fat rats'::tsvector @@ to_tsquery('fat & rat');
?column?
----------
f
因为此处不会对单词 rats 进行规范化。tsvector 的元素是词素,假定已经规范化,因此 rats 与 rat 不匹配。
since here no normalization of the word rats will occur. The elements of a tsvector are lexemes, which are assumed already normalized, so rats does not match rat.
@@ 运算符还支持 text 输入,允许在简单情况下跳过将文本字符串显式转换为 tsvector 或 tsquery 的过程。可用的变体有:
The @@ operator also supports text input, allowing explicit conversion of a text string to tsvector or tsquery to be skipped in simple cases. The variants available are:
tsvector @@ tsquery
tsquery @@ tsvector
text @@ tsquery
text @@ text
我们已经看到了前两个变体。形式 text @@ tsquery 等于 to_tsvector(x) @@ y。形式 text @@ text 等于 to_tsvector(x) @@ plainto_tsquery(y)。
The first two of these we saw already. The form text @@ tsquery is equivalent to to_tsvector(x) @@ y. The form text @@ text is equivalent to to_tsvector(x) @@ plainto_tsquery(y).
在 tsquery 内,&(AND)运算符指定匹配项必须出现在文档中才能匹配。类似地,|(OR)运算符指定至少一个参数必须出现,而 !(NOT)运算符指定参数必须 not 出现才能匹配。例如,查询 fat & ! rat 匹配包含 fat 但不包含 rat 的文档。
Within a tsquery, the & (AND) operator specifies that both its arguments must appear in the document to have a match. Similarly, the | (OR) operator specifies that at least one of its arguments must appear, while the ! (NOT) operator specifies that its argument must not appear in order to have a match. For example, the query fat & ! rat matches documents that contain fat but not rat.
可以在 <→(FOLLOWED BY)tsquery 运算符的帮助下搜索短语,只有当运算符的参数相邻且按指定顺序匹配时才能匹配。例如:
Searching for phrases is possible with the help of the <→ (FOLLOWED BY) tsquery operator, which matches only if its arguments have matches that are adjacent and in the given order. For example:
SELECT to_tsvector('fatal error') @@ to_tsquery('fatal <-> error');
?column?
----------
t
SELECT to_tsvector('error is not fatal') @@ to_tsquery('fatal <-> error');
?column?
----------
f
FOLLOW BY 运算符有一个更一般的版本,形式为 <_N>_,其中 N 是一个整数,表示匹配词素位置之间的差值。<1> 等于 <→,而 <2> 允许在匹配项之间准确出现一个其他词素,依此类推。phraseto_tsquery 函数利用此运算符构建一个 tsquery,当某些单词是停用词时,该运算符可以匹配多词短语。例如:
There is a more general version of the FOLLOWED BY operator having the form <_N>_, where N is an integer standing for the difference between the positions of the matching lexemes. <1> is the same as <→, while <2> allows exactly one other lexeme to appear between the matches, and so on. The phraseto_tsquery function makes use of this operator to construct a tsquery that can match a multi-word phrase when some of the words are stop words. For example:
SELECT phraseto_tsquery('cats ate rats');
phraseto_tsquery
-------------------------------
'cat' <-> 'ate' <-> 'rat'
SELECT phraseto_tsquery('the cats ate the rats');
phraseto_tsquery
-------------------------------
'cat' <-> 'ate' <2> 'rat'
有时有用的一个特殊情况是 <0> 可用于要求两个模式与同一个单词匹配。
A special case that’s sometimes useful is that <0> can be used to require that two patterns match the same word.
可以利用圆括号控制 tsquery 运算符的嵌套。在没有圆括号的情况下,| 结合最松散,然后是 &,然后是 <→,最后是 ! 最紧密。
Parentheses can be used to control nesting of the tsquery operators. Without parentheses, | binds least tightly, then &, then <→, and ! most tightly.
值得注意的是,当 AND/OR/NOT 运算符是 FOLLOWED BY 运算符的运算参数时,其含义与单独使用时略有不同,因为在 FOLLOWED BY 中,匹配的确切位置很重要。例如,!x 通常仅匹配不包含 x 的文档。但如果 y 不是紧跟在 x 之后,则 !x <→ y 与 y 匹配;文档中其他位置的 x 不会妨碍匹配。另一个示例是,x & y 通常只需要 x 和 y 同时出现在文档中的某个位置,但 (x & y) <→ z 要求 x 和 y 匹配在同一位置,也就是 z 之前。因此,此查询的行为与 x <→ z & y <→ z 不同,后者将匹配包含两个独立序列 x z 和 y z 的文档。(按原样编写,此特定查询无用,因为 x 和 y 无法在同一地方匹配;但是在前缀匹配模式等更复杂的情况下,此类查询可能会有用。)
It’s worth noticing that the AND/OR/NOT operators mean something subtly different when they are within the arguments of a FOLLOWED BY operator than when they are not, because within FOLLOWED BY the exact position of the match is significant. For example, normally !x matches only documents that do not contain x anywhere. But !x <→ y matches y if it is not immediately after an x; an occurrence of x elsewhere in the document does not prevent a match. Another example is that x & y normally only requires that x and y both appear somewhere in the document, but (x & y) <→ z requires x and y to match at the same place, immediately before a z. Thus this query behaves differently from x <→ z & y <→ z, which will match a document containing two separate sequences x z and y z. (This specific query is useless as written, since x and y could not match at the same place; but with more complex situations such as prefix-match patterns, a query of this form could be useful.)
12.1.3. Configurations #
以上都是简单的文本搜索示例。如前所述,全文搜索功能包括执行更多操作的能力:跳过索引某些单词(停止词)、处理同义词以及使用复杂分析,例如,不仅根据空格进行分析。此功能由 text search configurations 控制。PostgreSQL 附带了许多语言的预定义配置,并且您可以轻松创建自己的配置。(psql 的 \dF 命令会显示所有可用的配置。)
The above are all simple text search examples. As mentioned before, full text search functionality includes the ability to do many more things: skip indexing certain words (stop words), process synonyms, and use sophisticated parsing, e.g., parse based on more than just white space. This functionality is controlled by text search configurations. PostgreSQL comes with predefined configurations for many languages, and you can easily create your own configurations. (psql’s \dF command shows all available configurations.)
在安装过程中,将选择一个合适的配置并将 postgresql.conf 中的 default_text_search_config 适当地设置为该配置。如果您对整个集群使用相同的文本搜索配置,则可以使用 postgresql.conf 中的值。要在整个集群中使用不同的配置,但在任何一个数据库中使用相同的配置,请使用 ALTER DATABASE … SET。否则,您可以在每个会话中设置 default_text_search_config。
During installation an appropriate configuration is selected and default_text_search_config is set accordingly in postgresql.conf. If you are using the same text search configuration for the entire cluster you can use the value in postgresql.conf. To use different configurations throughout the cluster but the same configuration within any one database, use ALTER DATABASE … SET. Otherwise, you can set default_text_search_config in each session.
每个依赖于配置的文本搜索函数都具有一个可选的 regconfig 参数,以便可以明确指定要使用的配置。仅当省略此参数后才使用 default_text_search_config。
Each text search function that depends on a configuration has an optional regconfig argument, so that the configuration to use can be specified explicitly. default_text_search_config is used only when this argument is omitted.
为了更轻松地构建自定义文本搜索配置,可根据更简单的数据库对象构建配置。PostgreSQL 的文本搜索工具提供了四种类型的与配置相关的数据库对象:
To make it easier to build custom text search configurations, a configuration is built up from simpler database objects. PostgreSQL’s text search facility provides four types of configuration-related database objects:
文本搜索分析器和模板由低级 C 函数构建;因此,必须具备编程 C 的能力,才能开发新函数,并且必须具有超级用户权限才能将新函数安装到数据库中。(PostgreSQL 发行版中的 contrib/ 区域中有一些附加分析器和模板的示例。)由于词典和配置只对一些底层分析器和模板进行参数化和连接,因此不需要特殊权限即可创建新词典或配置。有关创建自定义词典和配置的示例,请参见本章后面的内容。
Text search parsers and templates are built from low-level C functions; therefore it requires C programming ability to develop new ones, and superuser privileges to install one into a database. (There are examples of add-on parsers and templates in the contrib/ area of the PostgreSQL distribution.) Since dictionaries and configurations just parameterize and connect together some underlying parsers and templates, no special privilege is needed to create a new dictionary or configuration. Examples of creating custom dictionaries and configurations appear later in this chapter.