8.11. Text Search Types
PostgreSQL provides two data types that are designed to support full text search, which is the activity of searching through a collection of natural-language documents to locate those that best match a query. The tsvector type represents a document in a form optimized for text search; the tsquery type similarly represents a text query. Chapter 12 provides a detailed explanation of this facility, and Section 9.13 summarizes the related functions and operators.
8.11.1. tsvector
A tsvector value is a sorted list of distinct lexemes, which are words that have been normalized to merge different variants of the same word (see Chapter 12 for details). Sorting and duplicate-elimination are done automatically during input, as shown in this example:
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;
tsvector
----------------------------------------------------
'a' 'and' 'ate' 'cat' 'fat' 'mat' 'on' 'rat' 'sat'
To represent lexemes containing whitespace or punctuation, surround them with quotes:
SELECT $$the lexeme ' ' contains spaces$$::tsvector;
tsvector
-------------------------------------------
' ' 'contains' 'lexeme' 'spaces' 'the'
(We use dollar-quoted string literals in this example and the next one to avoid the confusion of having to double quote marks within the literals.) Embedded quotes and backslashes must be doubled:
SELECT $$the lexeme 'Joe''s' contains a quote$$::tsvector;
tsvector
------------------------------------------------
'Joe''s' 'a' 'contains' 'lexeme' 'quote' 'the'
Optionally, integer positions can be attached to lexemes:
SELECT 'a:1 fat:2 cat:3 sat:4 on:5 a:6 mat:7 and:8 ate:9 a:10 fat:11 rat:12'::tsvector;
tsvector
-------------------------------------------------------------------------------
'a':1,6,10 'and':8 'ate':9 'cat':3 'fat':2,11 'mat':7 'on':5 'rat':12 'sat':4
A position normally indicates the source word’s location in the document. Positional information can be used for proximity ranking. Position values can range from 1 to 16383; larger numbers are silently set to 16383. Duplicate positions for the same lexeme are discarded.
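As a small illustration of the position rules above, the following sketch casts a literal that contains an out-of-range position and a repeated position. The lexeme and the numbers are made up for the example.

```sql
-- Position 20000 exceeds the maximum and should be stored as 16383;
-- the repeated position 2 should be kept only once.
SELECT 'fat:20000 fat:2 fat:2'::tsvector;
```

The result should contain a single entry for fat carrying the positions 2 and 16383.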
Lexemes that have positions can further be labeled with a weight, which can be A, B, C, or D. D is the default and hence is not shown on output:
SELECT 'a:1A fat:2B,4C cat:5D'::tsvector;
tsvector
----------------------------
'a':1A 'cat':5 'fat':2B,4C
Weights are typically used to reflect document structure, for example by marking title words differently from body words. Text search ranking functions can assign different priorities to the different weight markers.
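One possible way to apply this, sketched with the built-in setweight and ts_rank functions: the title and body strings are invented for illustration, and the weight array is just one plausible choice (its elements are given in the order D, C, B, A).

```sql
-- Label title lexemes with weight A and body lexemes with weight D,
-- then concatenate both parts into a single document vector.
SELECT setweight(to_tsvector('english', 'Fat Rats'), 'A') ||
       setweight(to_tsvector('english', 'The rats ate the cat'), 'D') AS document;

-- Rank the same query against a title-weighted and a body-weighted vector.
-- With this weight array, a match on an A-labeled lexeme counts for more
-- than a match on a D-labeled one.
SELECT ts_rank('{0.1, 0.2, 0.4, 1.0}',
               setweight(to_tsvector('english', 'Fat Rats'), 'A'),
               to_tsquery('english', 'rat')) AS rank_in_title,
       ts_rank('{0.1, 0.2, 0.4, 1.0}',
               setweight(to_tsvector('english', 'Fat Rats'), 'D'),
               to_tsquery('english', 'rat')) AS rank_in_body;
```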
It is important to understand that the tsvector type itself does not perform any word normalization; it assumes the words it is given are normalized appropriately for the application. For example,
SELECT 'The Fat Rats'::tsvector;
tsvector
--------------------
'Fat' 'Rats' 'The'
For most English-text-searching applications the above words would be considered non-normalized, but tsvector doesn’t care. Raw document text should usually be passed through to_tsvector to normalize the words appropriately for searching:
SELECT to_tsvector('english', 'The Fat Rats');
to_tsvector
-----------------
'fat':2 'rat':3
Again, see Chapter 12 for more detail.
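How much normalization to_tsvector performs depends on the text search configuration passed to it. As a rough sketch, the following compares the built-in simple and english configurations on the same input:

```sql
-- 'simple' only lowercases the words; 'english' additionally drops
-- stop words such as "the" and stems "rats" to "rat".
SELECT to_tsvector('simple',  'The Fat Rats') AS simple_config,
       to_tsvector('english', 'The Fat Rats') AS english_config;
```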
8.11.2. tsquery
A tsquery value stores lexemes that are to be searched for, and can combine them using the Boolean operators & (AND), | (OR), and ! (NOT), as well as the phrase search operator <-> (FOLLOWED BY). There is also a variant <N> of the FOLLOWED BY operator, where N is an integer constant that specifies the distance between the two lexemes being searched for. <-> is equivalent to <1>.
Parentheses can be used to enforce grouping of these operators. In the absence of parentheses, ! (NOT) binds most tightly, <-> (FOLLOWED BY) next most tightly, then & (AND), with | (OR) binding the least tightly.
Here are some examples:
SELECT 'fat & rat'::tsquery;
tsquery
---------------
'fat' & 'rat'
SELECT 'fat & (rat | cat)'::tsquery;
tsquery
---------------------------
'fat' & ( 'rat' | 'cat' )
SELECT 'fat & rat & ! cat'::tsquery;
tsquery
------------------------
'fat' & 'rat' & !'cat'
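The precedence and distance rules can be checked with a couple of hand-written values. This is only a sketch; the lexemes and positions are chosen arbitrarily.

```sql
-- Without parentheses, & binds more tightly than |,
-- so both inputs should describe the same query.
SELECT 'fat | rat & cat'::tsquery   AS implicit_grouping,
       'fat | (rat & cat)'::tsquery AS explicit_grouping;

-- <2> requires rat to appear exactly two positions after fat,
-- while <-> requires it to follow immediately.
SELECT 'fat:1 rat:3'::tsvector @@ 'fat <2> rat'::tsquery  AS distance_two,  -- expected: true
       'fat:1 rat:3'::tsvector @@ 'fat <-> rat'::tsquery  AS adjacent;      -- expected: false
```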
Optionally, lexemes in a tsquery can be labeled with one or more weight letters, which restricts them to match only tsvector lexemes with one of those weights:
SELECT 'fat:ab & cat'::tsquery;
tsquery
------------------
'fat':AB & 'cat'
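To see the restriction take effect, the same query can be matched against two hand-written tsvector values that differ only in the weight attached to fat. The values are illustrative.

```sql
-- fat:ab matches only occurrences of fat labeled A or B;
-- cat carries no weight restriction and matches any weight.
SELECT 'fat:1A cat:2'::tsvector @@ 'fat:ab & cat'::tsquery AS fat_has_weight_a,  -- expected: true
       'fat:1C cat:2'::tsvector @@ 'fat:ab & cat'::tsquery AS fat_has_weight_c;  -- expected: false
```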
Also, lexemes in a tsquery can be labeled with * to specify prefix matching:
SELECT 'super:*'::tsquery;
tsquery
-----------
'super':*
This query will match any word in a tsvector that begins with “super”.
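A quick check of prefix matching, using made-up tsvector values rather than normalized document text:

```sql
-- supernova begins with "super", so the first match succeeds;
-- the second vector has no lexeme with that prefix.
SELECT 'supernova:1 star:2'::tsvector @@ 'super:*'::tsquery AS prefix_hit,   -- expected: true
       'nova:1 star:2'::tsvector @@ 'super:*'::tsquery      AS prefix_miss;  -- expected: false
```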
Quoting rules for lexemes are the same as described previously for lexemes in tsvector; and, as with tsvector, any required normalization of words must be done before converting to the tsquery type. The to_tsquery function is convenient for performing such normalization:
SELECT to_tsquery('Fat:ab & Cats');
to_tsquery
------------------
'fat':AB & 'cat'
Note that to_tsquery will process prefixes in the same way as other words, which means this comparison returns true:
SELECT to_tsvector( 'postgraduate' ) @@ to_tsquery( 'postgres:*' );
?column?
----------
t
because postgres gets stemmed to postgr:
SELECT to_tsvector( 'postgraduate' ), to_tsquery( 'postgres:*' );
to_tsvector | to_tsquery
---------------+------------
'postgradu':1 | 'postgr':*
which will match the stemmed form of postgraduate.