PostgreSQL Chinese Operations Guide
12.6. Dictionaries #
Dictionaries are used to eliminate words that should not be considered in a search (stop words), and to normalize words so that different derived forms of the same word will match. A successfully normalized word is called a lexeme. Aside from improving search quality, normalization and removal of stop words reduce the size of the tsvector representation of a document, thereby improving performance. Normalization does not always have linguistic meaning and usually depends on application semantics.
Some examples of normalization:

- Linguistic — Ispell dictionaries try to reduce input words to a normalized form; stemmer dictionaries remove word endings.
- Identical URLs can be canonicalized so that equivalent URLs match, for example http://www.pgsql.ru/db/mw/index.html and http://www.pgsql.ru/db/mw/.
- Color names can be replaced by their hexadecimal values, e.g., red, green, blue, magenta -> FF0000, 00FF00, 0000FF, FF00FF.
- If indexing numbers, we can remove some fractional digits to reduce the range of possible numbers, so that, for example, 3.14159265359, 3.1415926, and 3.14 will be the same after normalization if only two digits are kept after the decimal point.
A dictionary is a program that accepts a token as input and returns:

- an array of lexemes if the input token is known to the dictionary (notice that a token can produce more than one lexeme)
- a single lexeme with the TSL_FILTER flag set, to replace the original token with a new token to be passed to subsequent dictionaries (a dictionary that does this is called a filtering dictionary)
- an empty array if the dictionary knows the token, but it is a stop word
- NULL, if the dictionary does not recognize the input token
PostgreSQL provides predefined dictionaries for many languages. There are also several predefined templates that can be used to create new dictionaries with custom parameters. Each predefined dictionary template is described below. If no existing template is suitable, it is possible to create new ones; see the contrib/ area of the PostgreSQL distribution for examples.
A text search configuration binds a parser together with a set of dictionaries to process the parser’s output tokens. For each token type that the parser can return, a separate list of dictionaries is specified by the configuration. When a token of that type is found by the parser, each dictionary in the list is consulted in turn, until some dictionary recognizes it as a known word. If it is identified as a stop word, or if no dictionary recognizes the token, it will be discarded and not indexed or searched for. Normally, the first dictionary that returns a non-NULL output determines the result, and any remaining dictionaries are not consulted; but a filtering dictionary can replace the given word with a modified word, which is then passed to subsequent dictionaries.
The general rule for configuring a list of dictionaries is to place first the most narrow, most specific dictionary, then the more general dictionaries, finishing with a very general dictionary, like a Snowball stemmer or simple, which recognizes everything. For example, for an astronomy-specific search (astro_en configuration) one could bind token type asciiword (ASCII word) to a synonym dictionary of astronomical terms, a general English dictionary and a Snowball English stemmer:
ALTER TEXT SEARCH CONFIGURATION astro_en
ADD MAPPING FOR asciiword WITH astrosyn, english_ispell, english_stem;
A filtering dictionary can be placed anywhere in the list, except at the end where it’d be useless. Filtering dictionaries are useful to partially normalize words to simplify the task of later dictionaries. For example, a filtering dictionary could be used to remove accents from accented letters, as is done by the unaccent module.
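For instance, the unaccent module's documentation shows a configuration along these lines (the fr configuration name is arbitrary, and the unaccent extension must be installed first):

```sql
-- Requires: CREATE EXTENSION unaccent;
-- Place the filtering unaccent dictionary before the stemmer, so accents
-- are stripped before French stemming is attempted.
CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french );
ALTER TEXT SEARCH CONFIGURATION fr
  ALTER MAPPING FOR hword, hword_part, word
  WITH unaccent, french_stem;
```

With this configuration, Hôtel and Hotel normalize to the same lexeme.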
12.6.1. Stop Words #
Stop words are words that are very common, appear in almost every document, and have no discrimination value. Therefore, they can be ignored in the context of full text searching. For example, every English text contains words like a and the, so it is useless to store them in an index. However, stop words do affect the positions in tsvector, which in turn affect ranking:
SELECT to_tsvector('english', 'in the list of stop words');
to_tsvector
----------------------------
'list':3 'stop':5 'word':6
The missing positions 1,2,4 are because of stop words. Ranks calculated for documents with and without stop words are quite different:
SELECT ts_rank_cd (to_tsvector('english', 'in the list of stop words'), to_tsquery('list & stop'));
ts_rank_cd
------------
0.05
SELECT ts_rank_cd (to_tsvector('english', 'list stop words'), to_tsquery('list & stop'));
ts_rank_cd
------------
0.1
It is up to the specific dictionary how it treats stop words. For example, ispell dictionaries first normalize words and then look at the list of stop words, while Snowball stemmers first check the list of stop words. The reason for the different behavior is an attempt to decrease noise.
12.6.2. Simple Dictionary #
The simple dictionary template operates by converting the input token to lower case and checking it against a file of stop words. If it is found in the file then an empty array is returned, causing the token to be discarded. If not, the lower-cased form of the word is returned as the normalized lexeme. Alternatively, the dictionary can be configured to report non-stop-words as unrecognized, allowing them to be passed on to the next dictionary in the list.
Here is an example of a dictionary definition using the simple template:
CREATE TEXT SEARCH DICTIONARY public.simple_dict (
TEMPLATE = pg_catalog.simple,
STOPWORDS = english
);
Here, english is the base name of a file of stop words. The file’s full name will be $SHAREDIR/tsearch_data/english.stop, where $SHAREDIR means the PostgreSQL installation’s shared-data directory, often /usr/local/share/postgresql (use pg_config --sharedir to determine it if you’re not sure). The file format is simply a list of words, one per line. Blank lines and trailing spaces are ignored, and upper case is folded to lower case, but no other processing is done on the file contents.
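As an illustration, a custom stop word file (say, $SHAREDIR/tsearch_data/mystops.stop — a hypothetical name) might begin like this, one word per line:

```text
i
me
my
we
the
```

Words are matched case-insensitively, since the file contents are folded to lower case when read.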
Now we can test our dictionary:
SELECT ts_lexize('public.simple_dict', 'YeS');
ts_lexize
-----------
{yes}
SELECT ts_lexize('public.simple_dict', 'The');
ts_lexize
-----------
{}
We can also choose to return NULL, instead of the lower-cased word, if it is not found in the stop words file. This behavior is selected by setting the dictionary’s Accept parameter to false. Continuing the example:
ALTER TEXT SEARCH DICTIONARY public.simple_dict ( Accept = false );
SELECT ts_lexize('public.simple_dict', 'YeS');
ts_lexize
-----------
SELECT ts_lexize('public.simple_dict', 'The');
ts_lexize
-----------
{}
With the default setting of Accept = true, it is only useful to place a simple dictionary at the end of a list of dictionaries, since it will never pass on any token to a following dictionary. Conversely, Accept = false is only useful when there is at least one following dictionary.
Caution
Most types of dictionaries rely on configuration files, such as files of stop words. These files must be stored in UTF-8 encoding. They will be translated to the actual database encoding, if that is different, when they are read into the server.
Caution
Normally, a database session will read a dictionary configuration file only once, when it is first used within the session. If you modify a configuration file and want to force existing sessions to pick up the new contents, issue an ALTER TEXT SEARCH DICTIONARY command on the dictionary. This can be a “dummy” update that doesn’t actually change any parameter values.
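Continuing the simple_dict example above, such a dummy update could look like this — it re-specifies the current parameter value, forcing sessions to re-read the stop word file on next use:

```sql
-- No parameter actually changes, but existing sessions will reload
-- $SHAREDIR/tsearch_data/english.stop the next time the dictionary is used.
ALTER TEXT SEARCH DICTIONARY public.simple_dict ( StopWords = english );
```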
12.6.3. Synonym Dictionary #
This dictionary template is used to create dictionaries that replace a word with a synonym. Phrases are not supported (use the thesaurus template (Section 12.6.4) for that). A synonym dictionary can be used to overcome linguistic problems, for example, to prevent an English stemmer dictionary from reducing the word “Paris” to “pari”. It is enough to have a Paris paris line in the synonym dictionary and put it before the english_stem dictionary. For example:
SELECT * FROM ts_debug('english', 'Paris');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+-------+----------------+--------------+---------
asciiword | Word, all ASCII | Paris | {english_stem} | english_stem | {pari}
CREATE TEXT SEARCH DICTIONARY my_synonym (
TEMPLATE = synonym,
SYNONYMS = my_synonyms
);
ALTER TEXT SEARCH CONFIGURATION english
ALTER MAPPING FOR asciiword
WITH my_synonym, english_stem;
SELECT * FROM ts_debug('english', 'Paris');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+-------+---------------------------+------------+---------
asciiword | Word, all ASCII | Paris | {my_synonym,english_stem} | my_synonym | {paris}
The only parameter required by the synonym template is SYNONYMS, which is the base name of its configuration file — my_synonyms in the above example. The file’s full name will be $SHAREDIR/tsearch_data/my_synonyms.syn (where $SHAREDIR means the PostgreSQL installation’s shared-data directory). The file format is just one line per word to be substituted, with the word followed by its synonym, separated by white space. Blank lines and trailing spaces are ignored.
The synonym template also has an optional parameter CaseSensitive, which defaults to false. When CaseSensitive is false, words in the synonym file are folded to lower case, as are input tokens. When it is true, words and tokens are not folded to lower case, but are compared as-is.
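A case-sensitive variant could be defined as follows (the dictionary name my_synonym_cs is illustrative; it reuses the my_synonyms file from the example above):

```sql
-- With CaseSensitive = true, neither the file entries nor the input
-- tokens are folded to lower case; "Paris" and "paris" are distinct.
CREATE TEXT SEARCH DICTIONARY my_synonym_cs (
    TEMPLATE = synonym,
    SYNONYMS = my_synonyms,
    CaseSensitive = true
);
```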
An asterisk (*) can be placed at the end of a synonym in the configuration file. This indicates that the synonym is a prefix. The asterisk is ignored when the entry is used in to_tsvector(), but when it is used in to_tsquery(), the result will be a query item with the prefix match marker (see Section 12.3.2). For example, suppose we have these entries in $SHAREDIR/tsearch_data/synonym_sample.syn:
postgres pgsql
postgresql pgsql
postgre pgsql
gogle googl
indices index*
Then we will get these results:
mydb=# CREATE TEXT SEARCH DICTIONARY syn (template=synonym, synonyms='synonym_sample');
mydb=# SELECT ts_lexize('syn', 'indices');
ts_lexize
-----------
{index}
(1 row)
mydb=# CREATE TEXT SEARCH CONFIGURATION tst (copy=simple);
mydb=# ALTER TEXT SEARCH CONFIGURATION tst ALTER MAPPING FOR asciiword WITH syn;
mydb=# SELECT to_tsvector('tst', 'indices');
to_tsvector
-------------
'index':1
(1 row)
mydb=# SELECT to_tsquery('tst', 'indices');
to_tsquery
------------
'index':*
(1 row)
mydb=# SELECT 'indexes are very useful'::tsvector;
tsvector
---------------------------------
'are' 'indexes' 'useful' 'very'
(1 row)
mydb=# SELECT 'indexes are very useful'::tsvector @@ to_tsquery('tst', 'indices');
?column?
----------
t
(1 row)
12.6.4. Thesaurus Dictionary #
A thesaurus dictionary (sometimes abbreviated as TZ) is a collection of words that includes information about the relationships of words and phrases, i.e., broader terms (BT), narrower terms (NT), preferred terms, non-preferred terms, related terms, etc.
Basically a thesaurus dictionary replaces all non-preferred terms by one preferred term and, optionally, preserves the original terms for indexing as well. PostgreSQL’s current implementation of the thesaurus dictionary is an extension of the synonym dictionary with added phrase support. A thesaurus dictionary requires a configuration file of the following format:
# this is a comment
sample word(s) : indexed word(s)
more sample word(s) : more indexed word(s)
...
where the colon (:) symbol acts as a delimiter between a phrase and its replacement.
A thesaurus dictionary uses a subdictionary (which is specified in the dictionary’s configuration) to normalize the input text before checking for phrase matches. It is only possible to select one subdictionary. An error is reported if the subdictionary fails to recognize a word. In that case, you should remove the use of the word or teach the subdictionary about it. You can place an asterisk (*) at the beginning of an indexed word to skip applying the subdictionary to it, but all sample words must be known to the subdictionary.
The thesaurus dictionary chooses the longest match if there are multiple phrases matching the input, and ties are broken by using the last definition.
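To illustrate with hypothetical entries, given a thesaurus file containing:

```text
supernovae : sn
supernovae stars : sn-star
```

the input phrase supernovae stars matches the second, longer entry and is replaced by sn-star; and if the same sample phrase appeared twice with different replacements, the later definition would be used.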
Specific stop words recognized by the subdictionary cannot be specified; instead use ? to mark the location where any stop word can appear. For example, assuming that a and the are stop words according to the subdictionary:
? one ? two : swsw
matches a one the two and the one a two; both would be replaced by swsw.
Since a thesaurus dictionary has the capability to recognize phrases it must remember its state and interact with the parser. A thesaurus dictionary uses these assignments to check if it should handle the next word or stop accumulation. The thesaurus dictionary must be configured carefully. For example, if the thesaurus dictionary is assigned to handle only the asciiword token, then a thesaurus dictionary definition like one 7 will not work since token type uint is not assigned to the thesaurus dictionary.
Caution
Thesauruses are used during indexing, so any change in the thesaurus dictionary's parameters requires reindexing. For most other dictionary types, small changes such as adding or removing stop words do not force reindexing.
12.6.4.1. Thesaurus Configuration #
To define a new thesaurus dictionary, use the thesaurus template. For example:
CREATE TEXT SEARCH DICTIONARY thesaurus_simple (
TEMPLATE = thesaurus,
DictFile = mythesaurus,
Dictionary = pg_catalog.english_stem
);
Here:

- thesaurus_simple is the new dictionary's name
- mythesaurus is the base name of the thesaurus configuration file. (Its full name will be $SHAREDIR/tsearch_data/mythesaurus.ths, where $SHAREDIR means the installation shared-data directory.)
- pg_catalog.english_stem is the subdictionary (here, a Snowball English stemmer) to use for thesaurus normalization. Notice that the subdictionary will have its own configuration (for example, stop words), which is not shown here.
Now it is possible to bind the thesaurus dictionary thesaurus_simple to the desired token types in a configuration, for example:
ALTER TEXT SEARCH CONFIGURATION russian
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
WITH thesaurus_simple;
12.6.4.2. Thesaurus Example #
Consider a simple astronomical thesaurus thesaurus_astro, which contains some astronomical word combinations:
supernovae stars : sn
crab nebulae : crab
Below we create a dictionary and bind some token types to an astronomical thesaurus and English stemmer:
CREATE TEXT SEARCH DICTIONARY thesaurus_astro (
TEMPLATE = thesaurus,
DictFile = thesaurus_astro,
Dictionary = english_stem
);
ALTER TEXT SEARCH CONFIGURATION russian
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
WITH thesaurus_astro, english_stem;
Now we can see how it works. ts_lexize is not very useful for testing a thesaurus, because it treats its input as a single token. Instead we can use plainto_tsquery and to_tsvector which will break their input strings into multiple tokens:
SELECT plainto_tsquery('supernova star');
plainto_tsquery
-----------------
'sn'
SELECT to_tsvector('supernova star');
to_tsvector
-------------
'sn':1
In principle, you can use to_tsquery if you quote the argument:
SELECT to_tsquery('''supernova star''');
to_tsquery
------------
'sn'
Notice that supernova star matches supernovae stars in thesaurus_astro because we specified the english_stem stemmer in the thesaurus definition. The stemmer removed the e and s.
To index the original phrase as well as the substitute, just include it in the right-hand part of the definition:
supernovae stars : sn supernovae stars
SELECT plainto_tsquery('supernova star');
plainto_tsquery
-----------------------------
'sn' & 'supernova' & 'star'
12.6.5. Ispell Dictionary #
The Ispell dictionary template supports morphological dictionaries, which can normalize many different linguistic forms of a word into the same lexeme. For example, an English Ispell dictionary can match all declensions and conjugations of the search term bank, e.g., banking, banked, banks, banks', and bank’s.
The standard PostgreSQL distribution does not include any Ispell configuration files. Dictionaries for a large number of languages are available from Ispell. Also, some more modern dictionary file formats are supported — MySpell (OO < 2.0.1) and Hunspell (OO >= 2.0.2). A large list of dictionaries is available on the OpenOffice Wiki.
To create an Ispell dictionary perform these steps:

- download dictionary configuration files. OpenOffice extension files have the .oxt extension. It is necessary to extract the .aff and .dic files and change the extensions to .affix and .dict. For some dictionary files it is also necessary to convert the characters to UTF-8 encoding, for example (for a Norwegian language dictionary):

iconv -o nn_no.dict -f ISO_8859-1 -t UTF-8 nn_NO.dic
iconv -o nn_no.affix -f ISO_8859-1 -t UTF-8 nn_NO.aff

- copy the files to the $SHAREDIR/tsearch_data directory
- load the files into PostgreSQL with a command such as:

CREATE TEXT SEARCH DICTIONARY english_hunspell (
    TEMPLATE = ispell,
    DictFile = en_us,
    AffFile = en_us,
    Stopwords = english
);
Here, DictFile, AffFile, and StopWords specify the base names of the dictionary, affixes, and stop-words files. The stop-words file has the same format explained above for the simple dictionary type. The format of the other files is not specified here but is available from the above-mentioned web sites.
Ispell dictionaries usually recognize a limited set of words, so they should be followed by another broader dictionary; for example, a Snowball dictionary, which recognizes everything.
The .affix file of Ispell has the following structure:
prefixes
flag *A:
. > RE # As in enter > reenter
suffixes
flag T:
E > ST # As in late > latest
[^AEIOU]Y > -Y,IEST # As in dirty > dirtiest
[AEIOU]Y > EST # As in gray > grayest
[^EY] > EST # As in small > smallest
And the .dict file has the following structure:
lapse/ADGRS
lard/DGRS
large/PRTY
lark/MRS
Format of the .dict file is:
basic_form/affix_class_name
In the .affix file every affix flag is described in the following format:
condition > [-stripping_letters,] adding_affix
Here, condition has a format similar to the format of regular expressions. It can use groupings […] and [^…]. For example, [AEIOU]Y means that the last letter of the word is "y" and the penultimate letter is "a", "e", "i", "o" or "u". [^EY] means that the last letter is neither "e" nor "y".
Ispell dictionaries support splitting compound words, a useful feature. Notice that the affix file should specify a special flag using the compoundwords controlled statement that marks dictionary words that can participate in compound formation:
compoundwords controlled z
Here are some examples for the Norwegian language:
SELECT ts_lexize('norwegian_ispell', 'overbuljongterningpakkmesterassistent');
{over,buljong,terning,pakk,mester,assistent}
SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
{sjokoladefabrikk,sjokolade,fabrikk}
MySpell format is a subset of Hunspell. The .affix file of Hunspell has the following structure:
PFX A Y 1
PFX A 0 re .
SFX T N 4
SFX T 0 st e
SFX T y iest [^aeiou]y
SFX T 0 est [aeiou]y
SFX T 0 est [^ey]
The first line of an affix class is the header. The fields of an affix rule are listed after the header:

- parameter name (PFX or SFX)
- flag (name of the affix class)
- stripping characters from the beginning (for a prefix) or end (for a suffix) of the word
- adding affix
- condition that has a format similar to the format of regular expressions
The .dict file looks like the .dict file of Ispell:
larder/M
lardy/RT
large/RSPMYT
largehearted
Note
MySpell does not support compound words. Hunspell has sophisticated support for compound words. At present, PostgreSQL implements only the basic compound word operations of Hunspell.
12.6.6. Snowball Dictionary #
The Snowball dictionary template is based on a project by Martin Porter, inventor of the popular Porter’s stemming algorithm for the English language. Snowball now provides stemming algorithms for many languages (see the Snowball site for more information). Each algorithm understands how to reduce common variant forms of words to a base, or stem, spelling within its language. A Snowball dictionary requires a language parameter to identify which stemmer to use, and optionally can specify a stopword file name that gives a list of words to eliminate. (PostgreSQL’s standard stopword lists are also provided by the Snowball project.) For example, there is a built-in definition equivalent to
CREATE TEXT SEARCH DICTIONARY english_stem (
TEMPLATE = snowball,
Language = english,
StopWords = english
);
The stopword file format is the same as already explained.
A Snowball dictionary recognizes everything, whether or not it is able to simplify the word, so it should be placed at the end of the dictionary list. It is useless to have it before any other dictionary because a token will never pass through it to the next dictionary.
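A Snowball dictionary for another language is defined the same way; for example, a Russian stemmer could be created like this (the dictionary name russian_stem_custom is illustrative; PostgreSQL ships stopword lists for the Snowball-supported languages):

```sql
-- Snowball stemmer for Russian, with the shipped Russian stop word list.
CREATE TEXT SEARCH DICTIONARY russian_stem_custom (
    TEMPLATE = snowball,
    Language = russian,
    StopWords = russian
);
```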