Mysql 简明教程
MySQL - Natural Language Fulltext Search
在我们完全了解自然语言全文搜索的概念之前,让我们尝试了解它的上下文。如今,用于搜索的关键字可能并不总是与用户期望的结果相匹配。因此,搜索引擎旨在专注于提高搜索相关性,以减少搜索查询和搜索结果之间的准确性差距。因此,显示的结果与搜索关键字的相关性最高。
Before we fully get into the concept of Natural Language Full-text Search, let us try to understand the context of it. Nowadays, the keywords used for searches might not always match the results that users expect. So search engines are designed to focus on increasing search relevance to reduce the accuracy gap between search queries and search results. Thus, results are displayed in order of most relevance to the search keyword.
类似地,在像 MySQL 那样的关系型数据库中,全文搜索是一种用于检索不一定与搜索关键字完全匹配的结果集的技术。全文搜索使用三种类型的搜索模式:
Similarly, in relational databases like MySQL, full-text search is a technique used to retrieve result-sets that might not perfectly match the search keyword. There are three types of search modes used with full-text search −
-
Natural Language Mode
-
Query Expansion Mode
-
Boolean Mode
Natural Language Full-text Search
自然语言全文搜索在 IN NATURAL LANGUAGE 模式下执行常规全文搜索。当以这种模式执行全文搜索时,搜索结果将按与其相关性的顺序显示(针对该搜索执行的关键字)。这是全文搜索的默认模式。
The Natural Language Full-text search performs the usual Full-text search in the IN NATURAL LANGUAGE mode. When a Full-text search is performed in this mode, the search results are displayed in the order of their relevance to the keyword (against which this search is performed). This is the default mode for the Full-text search.
由于这是全文搜索,因此必须对基于文本的列(如 CHAR、VARCHAR、TEXT 数据类型列)应用 FULLTEXT 索引。FULLTEXT 索引是一种特殊类型的索引,用于搜索文本值中的关键字,而不是尝试将关键字与这些列值进行比较。
Since this is a Full-text search, the FULLTEXT indexes must be applied on text-based columns (like CHAR, VARCHAR, TEXT datatype columns). The FULLTEXT index is a special type of index that is used to search for the keywords in the text values instead of trying to compare the keyword with these column values.
Syntax
以下是执行自然语言全文搜索的基本语法:
Following is the basic syntax to perform the Natural Language Full-text Search −
SELECT * FROM table_name
WHERE MATCH(column_name(s))
AGAINST ('keyword_name' IN NATURAL LANGUAGE MODE);
Example
让我们通过以下示例了解如何在数据库表上执行自然语言全文搜索。
Let us understand how to perform Natural Language Full-text Search on a database table in the following example.
为此,我们首先创建一个名为 ARTICLES 的表,其中包含文章的标题和描述。正如下所示,已对文本列 article_title 和 descriptions 应用 FULLTEXT 索引:
For that, we will first create a table named ARTICLES containing the title and description of an article. The FULLTEXT index is applied on text columns article_title and descriptions as shown below −
CREATE TABLE ARTICLES (
ID INT AUTO_INCREMENT NOT NULL PRIMARY KEY,
ARTICLE_TITLE VARCHAR(100),
DESCRIPTION TEXT,
FULLTEXT (ARTICLE_TITLE, DESCRIPTION)
) ENGINE = InnoDB;
现在,让我们使用以下查询将文章详细信息(如标题和 DESCRIPTION)插入此表:
Now, let us insert details about articles, like their titles and DESCRIPTION, into this table using the following queries −
INSERT INTO ARTICLES (ARTICLE_TITLE, DESCRIPTION) VALUES
('MySQL Tutorial', 'MySQL is a relational database system that uses SQL to structure data stored'),
('Java Tutorial', 'Java is an object-oriented and platform-independent programming language'),
('Hadoop Tutorial', 'Hadoop is framework that is used to process large sets of data'),
('Big Data Tutorial', 'Big Data refers to data that has wider variety of data sets in larger numbers'),
('JDBC Tutorial', 'JDBC is a Java based technology used for database connectivity');
表创建如下:
The table is created is as follows −
在全文搜索中使用自然语言模式,使用关键字“data set”搜索与数据相关的文章记录。
Using the Natural Language Mode in Full-text search, search for records of articles relevant to data, with the keyword 'data set'.
SELECT * FROM ARTICLES
WHERE MATCH(ARTICLE_TITLE, DESCRIPTION)
AGAINST ('data set' IN NATURAL LANGUAGE MODE);
Output
输出如下:
Following is the output −
正如我们在上面看到的,在表中存在的所有文章中,获得了三个与术语“data set”相关且按相关性排序的搜索结果。但请注意,“MySQL Tutorial”文章记录中的关键字“data set”并不是完美匹配,但它仍然被检索到,因为 MySQL 也能处理数据集。
As we see above, among all the articles present in the table, three search results are obtained which are relevant to the term 'data set' and are arranged in the order of their relevance. But note how keyword 'data set' is not a perfect match in the 'MySQL Tutorial' article record and its still retrieved because MySQL deals with data sets as well.
Stop Words in a Search
自然语言全文搜索使用 tf-idf 算法,其中“tf”表示词频,“idf”表示逆向文档频率。该搜索是指单词在单个文档中的频率以及单词存在的文档数。但是,有些单词通常会被搜索忽略,例如字数少于特定数量的单词。InnoDB 忽略字数少于 3 个字符的单词,而 MyISAM 忽略字数少于 4 个字符的单词。此类单词被称为停用词(the、a、an、are 等)。
The Natural Language Full-text Search uses tf-idf algorithm, where 'tf' refers to term frequency and 'idf' is inverse document frequency. The search refers to the frequency of a word in a single document, and the number of documents the word is present in. However, there are some words that the search usually ignores, like words having less than certain characters. InnoDB ignores words with less than 3 characters while MyISAM ignores words less than 4 characters. Such words are known as Stopwords (the, a, an, are etc.).
Example
在以下示例中,我们对上面创建的 ARTICLES 表执行简单的自然语言全文搜索。让我们通过针对两个关键字(“Big Tutorial”和“is Tutorial”)执行搜索,了解停用词如何影响全文搜索。
In the following example, we are performing a simple Natural Language Full-text Search on the ARTICLES Table created above. Let us see how stop words impact the Full-text search by performing it against two keywords: 'Big Tutorial' and 'is Tutorial'.
Searching 'Big Tutorial':
Searching 'Big Tutorial':
以下查询针对“Big Tutorial”关键字在自然语言模式下执行全文搜索:
Following query performs the full-text search in Natural Language Mode against 'Big Tutorial' keyword −
SELECT ARTICLE_TITLE, DESCRIPTION FROM ARTICLES
WHERE MATCH(ARTICLE_TITLE, DESCRIPTION)
AGAINST ('Big Tutorial' IN NATURAL LANGUAGE MODE);
Output:
Output:
获得的输出为 −
The output is obtained as −
Searching 'is Tutorial':
Searching 'is Tutorial':
以下查询针对“is Tutorial”关键字在自然语言模式下执行全文搜索:
Following query performs the full-text search in Natural Language Mode against 'is Tutorial' keyword −
SELECT ARTICLE_TITLE, DESCRIPTION FROM Articles
WHERE MATCH(ARTICLE_TITLE, DESCRIPTION)
AGAINST ('is Tutorial' IN NATURAL LANGUAGE MODE);
Output:
Output:
获得的输出为 −
The output is obtained as −
正如我们在上面的示例中看到的,由于单词“Tutorial”存在于表的全部记录中,因此在两种情况下均检索到了这些记录。但是,相关性的顺序由指定的关键字的第二个单词确定。
As we see in the example above, since the word 'Tutorial' is present in all the records of the table, all of them are retrieved in both cases. However, the order of relevance is determined by the second word of the keyword specified.
在第一种情况下,由于单词“Big”存在于“Big Data Tutorial”中,因此该记录首先被检索。在第二种情况下,结果集中的记录顺序与原始表相同,因为单词“is”是一个停用词,因此被忽略了。
In the first case, as the word 'Big' is present in 'Big Data Tutorial', that record is retrieved first. In the second case, the order of records in the result-set are the same as that of original table since the word 'is' is a stop word, so it is ignored.