Postgresql 中文操作指南

12.3. Controlling Text Search #

要实现全文搜索,必须有一个从文档创建 tsvector 和从用户查询创建 tsquery 的函数。此外,我们需要按有用顺序返回结果,因此我们需要一个将文档与其与查询的相关性进行比较的函数。很好地显示结果也很重要。PostgreSQL 为所有这些函数提供支持。

To implement full text searching there must be a function to create a tsvector from a document and a tsquery from a user query. Also, we need to return results in a useful order, so we need a function that compares documents with respect to their relevance to the query. It’s also important to be able to display the results nicely. PostgreSQL provides support for all of these functions.

12.3.1. Parsing Documents #

PostgreSQL 提供了 to_tsvector 函数,用于将文档转换为 tsvector 数据类型。

PostgreSQL provides the function to_tsvector for converting a document to the tsvector data type.

to_tsvector([ config regconfig, ] document text) returns tsvector

to_tsvector 将文本文档解析为标记,将标记简化为词素,并返回 tsvector,其中列出了词素及其在文档中的位置。文档将根据指定或默认的文本搜索配置进行处理。以下是一个简单的示例:

to_tsvector parses a textual document into tokens, reduces the tokens to lexemes, and returns a tsvector which lists the lexemes together with their positions in the document. The document is processed according to the specified or default text search configuration. Here is a simple example:

SELECT to_tsvector('english', 'a fat  cat sat on a mat - it ate a fat rats');
                  to_tsvector
-----------------------------------------------------
 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4

在上述示例中,我们可以看到结果 tsvector 不包含单词 aonit,单词 rats 变成 rat,而标点符号 - 被忽略了。

In the example above we see that the resulting tsvector does not contain the words a, on, or it, the word rats became rat, and the punctuation sign - was ignored.

to_tsvector 函数在内部调用一个解析器,该解析器将文档文本分隔成标记并为每个标记分配一个类型。对于每个标记,将查阅词典列表( Section 12.6),其中该列表可以根据标记类型而有所不同。recognizes 标记的第一个词典会发出一个或多个规范化 lexemes 来表示该标记。例如,rats 变成 rat,因为其中一个词典识别出单词 rats 是单词 rat 的复数形式。有些单词被识别为 stop wordsSection 12.6.1),这会导致它们被忽略,因为它们出现的频率太高而无法用于搜索。在本示例中,这些单词包括 aonit。如果列表中的任何词典都无法识别标记,则也会忽略该标记。在本示例中,这种情况发生在标点符号 -,因为实际上没有为其标记类型分配任何词典(Space symbols),这意味着空格标记将永远不会被索引。解析器、词典和要索引的标记类型的选择由选定的文本搜索配置( Section 12.7)决定。在同一个数据库中可以有许多不同的配置,并且有适用于各种语言的预定义配置。在本示例中,我们使用了英语的默认配置 english

The to_tsvector function internally calls a parser which breaks the document text into tokens and assigns a type to each token. For each token, a list of dictionaries (Section 12.6) is consulted, where the list can vary depending on the token type. The first dictionary that recognizes the token emits one or more normalized lexemes to represent the token. For example, rats became rat because one of the dictionaries recognized that the word rats is a plural form of rat. Some words are recognized as stop words (Section 12.6.1), which causes them to be ignored since they occur too frequently to be useful in searching. In our example these are a, on, and it. If no dictionary in the list recognizes the token then it is also ignored. In this example that happened to the punctuation sign - because there are in fact no dictionaries assigned for its token type (Space symbols), meaning space tokens will never be indexed. The choices of parser, dictionaries and which types of tokens to index are determined by the selected text search configuration (Section 12.7). It is possible to have many different configurations in the same database, and predefined configurations are available for various languages. In our example we used the default configuration english for the English language.

函数 setweight 可用于使用给定的 weight 标记 tsvector 的条目,其中权重是 ABCD 中的一个字母。通常使用此函数来标记来自文档不同部分的条目,例如标题与正文。稍后,可以使用此信息对搜索结果进行排名。

The function setweight can be used to label the entries of a tsvector with a given weight, where a weight is one of the letters A, B, C, or D. This is typically used to mark entries coming from different parts of a document, such as title versus body. Later, this information can be used for ranking of search results.

由于 to_tsvector(NULL) 将返回 NULL,因此建议在字段可能为 null 时使用 coalesce。以下是使用结构化文档创建 tsvector 的推荐方法:

Because to_tsvector(NULL) will return NULL, it is recommended to use coalesce whenever a field might be null. Here is the recommended method for creating a tsvector from a structured document:

UPDATE tt SET ti =
    setweight(to_tsvector(coalesce(title,'')), 'A')    ||
    setweight(to_tsvector(coalesce(keyword,'')), 'B')  ||
    setweight(to_tsvector(coalesce(abstract,'')), 'C') ||
    setweight(to_tsvector(coalesce(body,'')), 'D');

在此,我们使用 setweight 在完成的 tsvector 中标记每个词素的来源,然后使用 tsvector 连接运算符 || 合并标记的 tsvector 值。( Section 12.4.1 提供有关这些操作的详细信息。)

Here we have used setweight to label the source of each lexeme in the finished tsvector, and then merged the labeled tsvector values using the tsvector concatenation operator ||. (Section 12.4.1 gives details about these operations.)

12.3.2. Parsing Queries #

PostgreSQL 提供函数 to_tsqueryplainto_tsqueryphraseto_tsquerywebsearch_to_tsquery,用于将查询转换为 tsquery 数据类型。to_tsquery 提供了比 plainto_tsqueryphraseto_tsquery 更多的特性,但对输入不太宽容。websearch_to_tsqueryto_tsquery 的简化版本,采用替代语法,类似于网络搜索引擎使用的语法。

PostgreSQL provides the functions to_tsquery, plainto_tsquery, phraseto_tsquery and websearch_to_tsquery for converting a query to the tsquery data type. to_tsquery offers access to more features than either plainto_tsquery or phraseto_tsquery, but it is less forgiving about its input. websearch_to_tsquery is a simplified version of to_tsquery with an alternative syntax, similar to the one used by web search engines.

to_tsquery([ config regconfig, ] querytext text) returns tsquery

to_tsquery 使用 querytext 创建一个 tsquery 值,该值必须由通过 tsquery 运算符 &(AND)、|(OR)、!(NOT)和 <→(FOLLOWED BY)分隔的单个标记组成,可能使用括号进行分组。换句话说,to_tsquery 的输入必须已遵循 tsquery 输入的常规规则,如 Section 8.11.2 中所述。不同之处在于,虽然基本的 tsquery 输入按字面值提取标记,但 to_tsquery 使用指定或默认配置将每个标记归一化为词素,并根据该配置丢弃任何是停用词的标记。例如:

to_tsquery creates a tsquery value from querytext, which must consist of single tokens separated by the tsquery operators & (AND), | (OR), ! (NOT), and <→ (FOLLOWED BY), possibly grouped using parentheses. In other words, the input to to_tsquery must already follow the general rules for tsquery input, as described in Section 8.11.2. The difference is that while basic tsquery input takes the tokens at face value, to_tsquery normalizes each token into a lexeme using the specified or default configuration, and discards any tokens that are stop words according to the configuration. For example:

SELECT to_tsquery('english', 'The & Fat & Rats');
  to_tsquery
---------------
 'fat' & 'rat'

与基本 tsquery 输入类似,可以将权重附加到每个词素,以将其限制为仅匹配具有上述权重的 tsvector 词素。例如:

As in basic tsquery input, weight(s) can be attached to each lexeme to restrict it to match only tsvector lexemes of those weight(s). For example:

SELECT to_tsquery('english', 'Fat | Rats:AB');
    to_tsquery
------------------
 'fat' | 'rat':AB

此外,可以将 * 附加到词素以指定前缀匹配:

Also, * can be attached to a lexeme to specify prefix matching:

SELECT to_tsquery('supern:*A & star:A*B');
        to_tsquery
--------------------------
 'supern':*A & 'star':*AB

此类词素将匹配 tsvector 中以给定字符串开头的任何单词。

Such a lexeme will match any word in a tsvector that begins with the given string.

to_tsquery 也可接受单引号短语。如果配置中包含可能触发此类短语的同义词词典,则此功能尤为有用。在以下示例中,同义词词典包含 supernovae stars : sn 规则:

to_tsquery can also accept single-quoted phrases. This is primarily useful when the configuration includes a thesaurus dictionary that may trigger on such phrases. In the example below, a thesaurus contains the rule supernovae stars : sn:

SELECT to_tsquery('''supernovae stars'' & !crab');
  to_tsquery
---------------
 'sn' & !'crab'

如果没有引号,to_tsquery 将对未通过 AND、OR 或 FOLLOWED BY 运算符分隔的标记生成语法错误。

Without quotes, to_tsquery will generate a syntax error for tokens that are not separated by an AND, OR, or FOLLOWED BY operator.

plainto_tsquery([ config regconfig, ] querytext text) returns tsquery

plainto_tsquery 将未格式化的文本 querytext 转换为 tsquery 值。对文本的解析和规范化与 to_tsvector 的处理方式非常相似,之后在保留的单词之间插入 & (AND) tsquery 运算符。

plainto_tsquery transforms the unformatted text querytext to a tsquery value. The text is parsed and normalized much as for to_tsvector, then the & (AND) tsquery operator is inserted between surviving words.

示例:

Example:

SELECT plainto_tsquery('english', 'The Fat Rats');
 plainto_tsquery
-----------------
 'fat' & 'rat'

请注意,plainto_tsquery 不会在其输入中识别 tsquery 运算符、权重标签或前缀匹配标签:

Note that plainto_tsquery will not recognize tsquery operators, weight labels, or prefix-match labels in its input:

SELECT plainto_tsquery('english', 'The Fat & Rats:C');
   plainto_tsquery
---------------------
 'fat' & 'rat' & 'c'

此处,所有输入标点符号都被舍弃了。

Here, all the input punctuation was discarded.

phraseto_tsquery([ config regconfig, ] querytext text) returns tsquery

phraseto_tsquery 的行为与 plainto_tsquery 类似,但它在保留的单词之间插入 <→ (FOLLOWED BY) 运算符,而不是 & (AND) 运算符。此外,停用词不会被简单地舍弃,而是通过插入 <_N>_ 运算符(而不是 <→ 运算符)来解释。在搜索确切的词素序列时,此函数非常有用,因为 FOLLOWED BY 运算符检查词素顺序,而不仅仅是所有词素的存在。

phraseto_tsquery behaves much like plainto_tsquery, except that it inserts the <→ (FOLLOWED BY) operator between surviving words instead of the & (AND) operator. Also, stop words are not simply discarded, but are accounted for by inserting <_N>_ operators rather than <→ operators. This function is useful when searching for exact lexeme sequences, since the FOLLOWED BY operators check lexeme order not just the presence of all the lexemes.

示例:

Example:

SELECT phraseto_tsquery('english', 'The Fat Rats');
 phraseto_tsquery
------------------
 'fat' <-> 'rat'

plainto_tsquery 一样,phraseto_tsquery 函数不会在其输入中识别 tsquery 运算符、权重标签或前缀匹配标签:

Like plainto_tsquery, the phraseto_tsquery function will not recognize tsquery operators, weight labels, or prefix-match labels in its input:

SELECT phraseto_tsquery('english', 'The Fat & Rats:C');
      phraseto_tsquery
-----------------------------
 'fat' <-> 'rat' <-> 'c'
websearch_to_tsquery([ config regconfig, ] querytext text) returns tsquery

websearch_to_tsquery 使用 querytext 中的替代语法创建 tsquery 值,其中简单的未格式化文本是有效的查询。与 plainto_tsqueryphraseto_tsquery 不同,它还识别某些运算符。此外,此函数永远不会引发语法错误,这使得可以将原始的用户提供输入用于搜索。支持以下语法:

websearch_to_tsquery creates a tsquery value from querytext using an alternative syntax in which simple unformatted text is a valid query. Unlike plainto_tsquery and phraseto_tsquery, it also recognizes certain operators. Moreover, this function will never raise syntax errors, which makes it possible to use raw user-supplied input for search. The following syntax is supported:

其他标点符号将被忽略。因此,与 plainto_tsqueryphraseto_tsquery 一样,websearch_to_tsquery 函数不会在其输入中识别 tsquery 运算符、权重标签或前缀匹配标签。

Other punctuation is ignored. So like plainto_tsquery and phraseto_tsquery, the websearch_to_tsquery function will not recognize tsquery operators, weight labels, or prefix-match labels in its input.

示例:

Examples:

SELECT websearch_to_tsquery('english', 'The fat rats');
 websearch_to_tsquery
----------------------
 'fat' & 'rat'
(1 row)

SELECT websearch_to_tsquery('english', '"supernovae stars" -crab');
       websearch_to_tsquery
----------------------------------
 'supernova' <-> 'star' & !'crab'
(1 row)

SELECT websearch_to_tsquery('english', '"sad cat" or "fat rat"');
       websearch_to_tsquery
-----------------------------------
 'sad' <-> 'cat' | 'fat' <-> 'rat'
(1 row)

SELECT websearch_to_tsquery('english', 'signal -"segmentation fault"');
         websearch_to_tsquery
---------------------------------------
 'signal' & !( 'segment' <-> 'fault' )
(1 row)

SELECT websearch_to_tsquery('english', '""" )( dummy \\ query <->');
 websearch_to_tsquery
----------------------
 'dummi' & 'queri'
(1 row)

12.3.3. Ranking Search Results #

排名试图衡量文档与特定查询的相关性,以便在有许多匹配项时可以优先显示最相关的内容。PostgreSQL 提供了两个预定义的排名函数,它们考虑词法、邻近和结构信息;也就是说,它们考虑查询条件在文档中出现的频率、条件在文档中的接近程度以及它们出现的文档部分的重要程度。但是,相关性的概念是模糊且非常特定于应用程序的。不同的应用程序可能需要额外的信息进行排名,例如文档修改时间。内置排名函数只是示例。您可以编写自己的排名函数和/或将它们的结果与其他因素相结合,以满足您的特定需求。

Ranking attempts to measure how relevant documents are to a particular query, so that when there are many matches the most relevant ones can be shown first. PostgreSQL provides two predefined ranking functions, which take into account lexical, proximity, and structural information; that is, they consider how often the query terms appear in the document, how close together the terms are in the document, and how important is the part of the document where they occur. However, the concept of relevancy is vague and very application-specific. Different applications might require additional information for ranking, e.g., document modification time. The built-in ranking functions are only examples. You can write your own ranking functions and/or combine their results with additional factors to fit your specific needs.

目前可用的两个排名函数为:

The two ranking functions currently available are:

  • ts_rank([ _weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4_

    • Ranks vectors based on the frequency of their matching lexemes.

  • ts_rank_cd([ _weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4_

    • This function computes the cover density ranking for the given document vector and query, as described in Clarke, Cormack, and Tudhope’s "Relevance Ranking for One to Three Term Queries" in the journal "Information Processing and Management", 1999. Cover density is similar to ts_rank ranking except that the proximity of matching lexemes to each other is taken into consideration.

    • This function requires lexeme positional information to perform its calculation. Therefore, it ignores any “stripped” lexemes in the tsvector. If there are no unstripped lexemes in the input, the result will be zero. (See Section 12.4.1 for more information about the strip function and positional information in _tsvector_s.)

对于这两个函数而言,可选的 weights 参数提供了根据单词的标记方式多或少地加权单词实例的能力。权重数组指定按以下顺序给每个类别单词加多少权重:

For both these functions, the optional weights argument offers the ability to weigh word instances more or less heavily depending on how they are labeled. The weight arrays specify how heavily to weigh each category of word, in the order:

{D-weight, C-weight, B-weight, A-weight}

如果未提供 weights,则使用以下默认值:

If no weights are provided, then these defaults are used:

{0.1, 0.2, 0.4, 1.0}

通常,权重用于标记文档特定区域的单词,例如标题或初始摘要,以便可以比文档正文中的单词给予更多或更少的权重。

Typically weights are used to mark words from special areas of the document, like the title or an initial abstract, so they can be treated with more or less importance than words in the document body.

由于较长的文档包含查询条件的可能性更大,因此有理由考虑文档大小,例如,包含五个搜索词实例的百字文档可能比包含五个实例的千字文档更相关。这两个排名函数都采用整数 normalization 选项,它指定是否以及如何使文档的长度影响其排名。整数选项控制多种行为,因此它是一个位掩码:您可以使用 | 指定一种或多种行为(例如,2|4)。

Since a longer document has a greater chance of containing a query term it is reasonable to take into account document size, e.g., a hundred-word document with five instances of a search word is probably more relevant than a thousand-word document with five instances. Both ranking functions take an integer normalization option that specifies whether and how a document’s length should impact its rank. The integer option controls several behaviors, so it is a bit mask: you can specify one or more behaviors using | (for example, 2|4).

如果指定了多个标志位,则按所列顺序应用转换。

If more than one flag bit is specified, the transformations are applied in the order listed.

需要注意的是,排名函数不使用任何全局信息,因此不可能按有时所需的比例生成到 1% 或 100% 的合理归一化。归一化选项 32 (rank/(rank+1)) 可用于将所有排名缩放到零到一之间的范围,但当然这只是一个外观上的更改;它不会影响搜索结果的排序。

It is important to note that the ranking functions do not use any global information, so it is impossible to produce a fair normalization to 1% or 100% as sometimes desired. Normalization option 32 (rank/(rank+1)) can be applied to scale all ranks into the range zero to one, but of course this is just a cosmetic change; it will not affect the ordering of the search results.

以下是一个仅选择排名最高的前十个匹配项的示例:

Here is an example that selects only the ten highest-ranked matches:

SELECT title, ts_rank_cd(textsearch, query) AS rank
FROM apod, to_tsquery('neutrino|(dark & matter)') query
WHERE query @@ textsearch
ORDER BY rank DESC
LIMIT 10;
                     title                     |   rank
-----------------------------------------------+----------
 Neutrinos in the Sun                          |      3.1
 The Sudbury Neutrino Detector                 |      2.4
 A MACHO View of Galactic Dark Matter          |  2.01317
 Hot Gas and Dark Matter                       |  1.91171
 The Virgo Cluster: Hot Plasma and Dark Matter |  1.90953
 Rafting for Solar Neutrinos                   |      1.9
 NGC 4650A: Strange Galaxy and Dark Matter     |  1.85774
 Hot Gas and Dark Matter                       |   1.6123
 Ice Fishing for Cosmic Neutrinos              |      1.6
 Weak Lensing Distorts the Universe            | 0.818218

这是使用归一化排名的相同示例:

This is the same example using normalized ranking:

SELECT title, ts_rank_cd(textsearch, query, 32 /* rank/(rank+1) */ ) AS rank
FROM apod, to_tsquery('neutrino|(dark & matter)') query
WHERE  query @@ textsearch
ORDER BY rank DESC
LIMIT 10;
                     title                     |        rank
-----------------------------------------------+-------------------
 Neutrinos in the Sun                          | 0.756097569485493
 The Sudbury Neutrino Detector                 | 0.705882361190954
 A MACHO View of Galactic Dark Matter          | 0.668123210574724
 Hot Gas and Dark Matter                       |  0.65655958650282
 The Virgo Cluster: Hot Plasma and Dark Matter | 0.656301290640973
 Rafting for Solar Neutrinos                   | 0.655172410958162
 NGC 4650A: Strange Galaxy and Dark Matter     | 0.650072921219637
 Hot Gas and Dark Matter                       | 0.617195790024749
 Ice Fishing for Cosmic Neutrinos              | 0.615384618911517
 Weak Lensing Distorts the Universe            | 0.450010798361481

排名可能是昂贵的,因为它需要查阅每个匹配文档的 tsvector,而这可能会受到 I/O 限制,因此速度较慢。不幸的是,几乎不可能避免,因为实际查询通常会导致大量的匹配项。

Ranking can be expensive since it requires consulting the tsvector of each matching document, which can be I/O bound and therefore slow. Unfortunately, it is almost impossible to avoid since practical queries often result in large numbers of matches.

12.3.4. Highlighting Results #

为了呈现搜索结果,最好显示每个文档的部分内容以及它与查询的关系。通常,搜索引擎会显示带有标记搜索条件的文档片段。PostgreSQL 提供了一个实现此功能的 ts_headline 函数。

To present search results it is ideal to show a part of each document and how it is related to the query. Usually, search engines show fragments of the document with marked search terms. PostgreSQL provides a function ts_headline that implements this functionality.

ts_headline([ config regconfig, ] document text, query tsquery [, options text ]) returns text

ts_headline 接受文档和查询,并返回一个文档摘录,其中查询中的条件被突出显示。具体来说,该函数将使用查询选择相关的文本片段,然后突出显示出现在查询中的所有单词,即使这些单词位置与查询的限制不匹配。用于解析文档的配置可由 config 指定;如果省略 config,则使用 default_text_search_config 配置。

ts_headline accepts a document along with a query, and returns an excerpt from the document in which terms from the query are highlighted. Specifically, the function will use the query to select relevant text fragments, and then highlight all words that appear in the query, even if those word positions do not match the query’s restrictions. The configuration to be used to parse the document can be specified by config; if config is omitted, the default_text_search_config configuration is used.

如果指定了一个 options 字符串,它必须包含一个由一个或多个 option=value 对组成的用逗号分隔的列表。可用的选项包括:

If an options string is specified it must consist of a comma-separated list of one or more option=value pairs. The available options are:

这些选项名称不区分大小写。如果字符串值包含空格或逗号,则必须使用双引号将其引起来。

These option names are recognized case-insensitively. You must double-quote string values if they contain spaces or commas.

在基于非片段的标题生成中,ts_headline 找到给定 query 的匹配项,并选择一个匹配项进行显示,优先选择在允许的标题长度内具有更多查询单词的匹配项。在基于片段的标题生成中,ts_headline 定位查询匹配项,并将每个匹配项拆分为不超过 MaxWords 个单词的“片段”,优先选择具有更多查询单词的片段,并在可能的情况下“延伸”片段以包括周围的单词。因此,当查询匹配项跨越文档的大部分时,或者当需要显示多个匹配项时,基于片段的模式更有用。在任何模式中,如果无法识别任何查询匹配项,那么将显示文档中前 MinWords 个单词的单个片段。

In non-fragment-based headline generation, ts_headline locates matches for the given query and chooses a single one to display, preferring matches that have more query words within the allowed headline length. In fragment-based headline generation, ts_headline locates the query matches and splits each match into “fragments” of no more than MaxWords words each, preferring fragments with more query words, and when possible “stretching” fragments to include surrounding words. The fragment-based mode is thus more useful when the query matches span large sections of the document, or when it’s desirable to display multiple matches. In either mode, if no query matches can be identified, then a single fragment of the first MinWords words in the document will be displayed.

例如:

For example:

SELECT ts_headline('english',
  'The most common type of search
is to find all documents containing given query terms
and return them in order of their similarity to the
query.',
  to_tsquery('english', 'query & similarity'));
                        ts_headline
------------------------------------------------------------
 containing given <b>query</b> terms                       +
 and return them in order of their <b>similarity</b> to the+
 <b>query</b>.

SELECT ts_headline('english',
  'Search terms may occur
many times in a document,
requiring ranking of the search matches to decide which
occurrences to display in the result.',
  to_tsquery('english', 'search & term'),
  'MaxFragments=10, MaxWords=7, MinWords=3, StartSel=<<, StopSel=>>');
                        ts_headline
------------------------------------------------------------
 <<Search>> <<terms>> may occur                            +
 many times ... ranking of the <<search>> matches to decide

ts_headline 使用原始文档,而不是 tsvector 摘要,所以它可能会很慢,并且应谨慎使用。

ts_headline uses the original document, not a tsvector summary, so it can be slow and should be used with care.