MySQL Concise Tutorial

MySQL - ngram Full-Text Parser

In full-text searching, the built-in MySQL full-text parser normally treats the white space between words as a delimiter, which determines where each word begins and ends. This works only for languages that separate words with spaces.

Ideographic languages such as Chinese, Japanese, and Korean do not use word delimiters. To support full-text searches in these languages, MySQL provides an ngram parser, which is supported by both the InnoDB and MyISAM storage engines.

The ngram Full-Text Parser

An ngram is a contiguous sequence of n characters from a given sequence of text. The ngram parser tokenizes a sequence of text into contiguous sequences of n characters.

For example, consider the text 'Tutorial' and observe how the ngram parser tokenizes it for different values of n −

n=1: 'T', 'u', 't', 'o', 'r', 'i', 'a', 'l'
n=2: 'Tu', 'ut', 'to', 'or', 'ri', 'ia', 'al'
n=3: 'Tut', 'uto', 'tor', 'ori', 'ria', 'ial'
n=4: 'Tuto', 'utor', 'tori', 'oria', 'rial'
n=5: 'Tutor', 'utori', 'toria', 'orial'
n=6: 'Tutori', 'utoria', 'torial'
n=7: 'Tutoria', 'utorial'
n=8: 'Tutorial'
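
The tokenization above can be reproduced with a short sliding-window function. This is a minimal Python sketch for illustration only; the helper name `ngrams` is hypothetical and is not part of MySQL.

```python
def ngrams(text, n):
    """Return every contiguous n-character slice of text, in order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A window of width n can start at positions 0 .. len(text) - n.
    # If text is shorter than n, the range is empty and no token is produced.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


if __name__ == "__main__":
    for n in (1, 2, 3, 8):
        print(n, ngrams("Tutorial", n))
```

A text of length L yields L - n + 1 tokens, which is why 'Tutorial' (8 characters) gives 7 tokens for n=2 and a single token for n=8.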

The ngram full-text parser is a built-in server plugin. As with other built-in server plugins, it is loaded automatically when the server starts.

Configuring ngram Token Size

The default token size is 2. To change it, use the ngram_token_size configuration option, which accepts values from 1 to 10. Typically, ngram_token_size is set to the size of the largest token you want to search for. A smaller token size produces a smaller full-text search index and faster searches.

Because ngram_token_size is a read-only variable, it cannot be changed while the server is running. Its value can be set in only two ways:

Setting --ngram_token_size in the startup string:

mysqld --ngram_token_size=1

Setting ngram_token_size in the configuration file 'my.cnf':

[mysqld]
ngram_token_size=1

Creating FULLTEXT Index Using ngram Parser

A FULLTEXT index is created on table columns with the FULLTEXT keyword in a CREATE TABLE, ALTER TABLE, or CREATE INDEX statement. To use the ngram parser, add the 'WITH PARSER ngram' clause. Following is the syntax for CREATE TABLE −

CREATE TABLE table_name (
   column_name1 datatype,
   column_name2 datatype,
   column_name3 datatype,
   ...
   FULLTEXT (column_name(s)) WITH PARSER NGRAM
) ENGINE=INNODB CHARACTER SET UTF8mb4;

Example

In this example, we create a FULLTEXT index using the CREATE TABLE statement as follows −

CREATE TABLE blog (
   ID INT AUTO_INCREMENT NOT NULL,
   TITLE VARCHAR(255),
   DESCRIPTION TEXT,
   FULLTEXT ( TITLE, DESCRIPTION ) WITH PARSER NGRAM,
   PRIMARY KEY(ID)
) ENGINE=INNODB CHARACTER SET UTF8MB4;

SET NAMES UTF8MB4;

Now, insert data (in any ideographic language) into the newly created table −

INSERT INTO blog VALUES
(NULL, '教程', '教程是对一个概念的冗长研究'),
(NULL, '文章', '文章是关于一个概念的基于事实的小信息');

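To check how the text is tokenized, set innodb_ft_aux_table to the table name in database_name/table_name form (this example assumes the table is in a database named customers) and query the full-text index cache −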

SET GLOBAL innodb_ft_aux_table = "customers/blog";

SELECT * FROM
INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE
ORDER BY doc_id, position;

ngram Parser Space Handling

The ngram parser eliminates whitespace characters when parsing. For instance, with a token size of 2 −

  1. "ab cd" is parsed to "ab", "cd"

  2. "a bc" is parsed to "bc"
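
The two cases above can be modelled in a few lines. This is a minimal Python sketch of the documented behaviour, not MySQL's internal implementation; the function name `ngram_tokens` is hypothetical.

```python
def ngram_tokens(text, n=2):
    """Split text on whitespace, then cut each piece into n-character tokens."""
    tokens = []
    # str.split() with no argument splits on any run of whitespace.
    for piece in text.split():
        # A piece shorter than n characters produces no token at all.
        tokens.extend(piece[i:i + n] for i in range(len(piece) - n + 1))
    return tokens


if __name__ == "__main__":
    print(ngram_tokens("ab cd"))
    print(ngram_tokens("a bc"))
```

In the second case the lone "a" is shorter than the token size, so only "bc" is indexed.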

ngram Parser Stopword Handling

Apart from whitespace, MySQL maintains a stopword list. The built-in full-text parser excludes a word from the index when the word is equal to an entry in that list. The ngram parser handles stopwords differently: it excludes any token that contains a stopword. For example, assuming ngram_token_size=2, the text "a,b" is parsed to "a," and ",b". If the comma is defined as a stopword, both tokens are excluded from the index because each contains a comma.
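
For the ngram parser specifically, MySQL excludes tokens that contain a stopword rather than tokens that are equal to one. Here is a minimal Python sketch of that rule, assuming a token size of 2. The function name `index_tokens` and the one-entry stopword set are illustrative only, and the code is a model rather than the server's implementation.

```python
def index_tokens(text, stopwords, n=2):
    """Return the n-character tokens of text that contain no stopword."""
    tokens = [text[i:i + n] for i in range(len(text) - n + 1)]
    # A token is dropped if any stopword occurs anywhere inside it.
    return [t for t in tokens if not any(s in t for s in stopwords)]


if __name__ == "__main__":
    # With "," as a stopword, both "a," and ",b" are excluded.
    print(index_tokens("a,b", {","}))
```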

ngram Parser Phrase Search

Phrase searches are converted to ngram phrase searches. For example, assuming ngram_token_size=2, the search phrase "abc" is converted to "ab bc", which returns documents containing "abc" and "ab bc". The search phrase "abc def" is converted to "ab bc de ef", which returns documents containing "abc def" and "ab bc de ef". A document that contains "abcdef" is not returned.
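
The conversion can be sketched in Python as follows. This is a simplified model, assuming a token size of 2 and treating a phrase match as "the query tokens appear as one consecutive run in the document tokens". The function names are hypothetical, and the matching rule is an approximation of the server's behaviour.

```python
def ngram_tokens(text, n=2):
    """Whitespace-separated pieces of text, each cut into n-character tokens."""
    tokens = []
    for piece in text.split():
        tokens.extend(piece[i:i + n] for i in range(len(piece) - n + 1))
    return tokens


def to_ngram_phrase(phrase, n=2):
    """Rewrite a search phrase as a space-separated ngram phrase."""
    return " ".join(ngram_tokens(phrase, n))


def phrase_matches(document, phrase, n=2):
    """True if the phrase's tokens occur consecutively in the document's tokens."""
    doc, query = ngram_tokens(document, n), ngram_tokens(phrase, n)
    if not query:
        return False
    width = len(query)
    return any(doc[i:i + width] == query for i in range(len(doc) - width + 1))
```

In this model "abcdef" tokenizes to ab, bc, cd, de, ef, so the run ab, bc, de, ef is interrupted by cd and the phrase "abc def" does not match.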

ngram Parser Term Search

For natural language mode search, the search term is converted to a union of ngram terms. For example, the string "abc" (assuming ngram_token_size=2) is converted to "ab bc". Given two documents, one containing "ab" and the other containing "abc", the search term "ab bc" matches both documents.

For boolean mode search, the search term is converted to an ngram phrase search. For example, the string 'abc' (assuming ngram_token_size=2) is converted to '"ab bc"'. Given two documents, one containing 'ab' and the other containing 'abc', the search phrase '"ab bc"' matches only the document containing 'abc'.
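
The difference between the two modes can be modelled with set intersection (union semantics) versus a consecutive-run check (phrase semantics). This is a minimal Python sketch, assuming a token size of 2. The function names are hypothetical, and relevance ranking is ignored.

```python
def ngram_tokens(text, n=2):
    """Whitespace-separated pieces of text, each cut into n-character tokens."""
    tokens = []
    for piece in text.split():
        tokens.extend(piece[i:i + n] for i in range(len(piece) - n + 1))
    return tokens


def natural_language_match(document, term, n=2):
    """Union semantics: sharing any one token with the term is enough."""
    return bool(set(ngram_tokens(document, n)) & set(ngram_tokens(term, n)))


def boolean_match(document, term, n=2):
    """Phrase semantics: the term's tokens must appear consecutively."""
    doc, query = ngram_tokens(document, n), ngram_tokens(term, n)
    width = len(query)
    if width == 0:
        return False
    return any(doc[i:i + width] == query for i in range(len(doc) - width + 1))


if __name__ == "__main__":
    for doc in ("ab", "abc"):
        print(doc, natural_language_match(doc, "abc"), boolean_match(doc, "abc"))
```

The document "ab" shares the token "ab" with the term, so it matches in natural language mode, but it lacks "bc" and therefore fails the phrase check used for boolean mode.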

ngram Parser Wildcard Search

Because an ngram FULLTEXT index contains only ngrams, and does not contain information about the beginning of terms, wildcard searches may return unexpected results. The following behaviors apply to wildcard searches using ngram FULLTEXT search indexes:

  1. If the prefix term of a wildcard search is shorter than the ngram token size, the query returns all indexed rows that contain ngram tokens starting with the prefix term. For example, assuming ngram_token_size=2, a search on "a*" returns all rows that contain a token starting with "a".

  2. If the prefix term of a wildcard search is longer than ngram token size, the prefix term is converted to an ngram phrase and the wildcard operator is ignored. For example, assuming ngram_token_size=2, an "abc*" wildcard search is converted to "ab bc".
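
The two rules can be summarised as a small decision function. This is a minimal Python sketch. The function name `plan_wildcard_search` and the returned labels are hypothetical, and the handling of a prefix whose length equals the token size is an assumption, since the rules above do not cover that case.

```python
def plan_wildcard_search(query, n=2):
    """Describe how a trailing-wildcard query such as "abc*" is handled."""
    if not query.endswith("*"):
        raise ValueError("expected a prefix term followed by *")
    prefix = query[:-1]
    if len(prefix) < n:
        # Rule 1: match any indexed token that starts with the prefix.
        return ("token_prefix", prefix)
    # Rule 2: the wildcard is ignored and the prefix becomes an ngram phrase.
    # Assumption: a prefix exactly n characters long is treated the same way,
    # which yields a single token.
    tokens = [prefix[i:i + n] for i in range(len(prefix) - n + 1)]
    return ("phrase", " ".join(tokens))
```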

ngram Full-Text Parser Using a Client Program

The ngram full-text parser can also be used from a client program.
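
As one possible illustration, the sketch below queries the blog table from Python through a DB-API 2.0 connection. It assumes a driver that uses the %s parameter style (for example MySQL Connector/Python or PyMySQL) and the blog table with its FULLTEXT index on (TITLE, DESCRIPTION) created earlier. The function names are hypothetical, and the caller must supply the open connection.

```python
def build_search_sql(boolean_mode=False):
    """Build a parameterized full-text query against the blog table."""
    mode = "IN BOOLEAN MODE" if boolean_mode else "IN NATURAL LANGUAGE MODE"
    # The MATCH column list must be the same as the FULLTEXT index columns.
    return (
        "SELECT ID, TITLE, DESCRIPTION FROM blog "
        f"WHERE MATCH (TITLE, DESCRIPTION) AGAINST (%s {mode})"
    )


def search_blog(conn, term, boolean_mode=False):
    """Run the query on an open DB-API connection and return all rows."""
    cursor = conn.cursor()
    try:
        # The search term is passed as a parameter, never formatted into the SQL.
        cursor.execute(build_search_sql(boolean_mode), (term,))
        return cursor.fetchall()
    finally:
        cursor.close()
```

With MySQL Connector/Python, for instance, a connection opened with mysql.connector.connect() using your own host, user, password, and database values could be passed in as search_blog(conn, '教程').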
