Elasticsearch Tutorial

Elasticsearch - Analysis

When a query is processed during a search operation, the content in any index is analyzed by the analysis module. This module consists of analyzers, tokenizers, token filters and character filters. If no analyzer is defined, the built-in analyzers, tokenizers, token filters and character filters are registered with the analysis module by default.

In the following example, we use the standard analyzer, which is used when no other analyzer is specified. It analyzes the sentence based on grammar and produces the words used in the sentence.

POST _analyze
{
   "analyzer": "standard",
   "text": "Today's weather is beautiful"
}

On running the above code, we get the response as shown below −

{
   "tokens" : [
      {
         "token" : "today's",
         "start_offset" : 0,
         "end_offset" : 7,
         "type" : "",
         "position" : 0
      },
      {
         "token" : "weather",
         "start_offset" : 8,
         "end_offset" : 15,
         "type" : "",
         "position" : 1
      },
      {
         "token" : "is",
         "start_offset" : 16,
         "end_offset" : 18,
         "type" : "",
         "position" : 2
      },
      {
         "token" : "beautiful",
         "start_offset" : 19,
         "end_offset" : 28,
         "type" : "",
         "position" : 3
      }
   ]
}

Configuring the Standard analyzer

We can configure the standard analyzer with various parameters to meet our custom requirements.

In the following example, we configure the standard analyzer to have a max_token_length of 5.

For this, we first create an index with an analyzer that has the max_token_length parameter.

PUT index_4_analysis
{
   "settings": {
      "analysis": {
         "analyzer": {
            "my_english_analyzer": {
               "type": "standard",
               "max_token_length": 5,
               "stopwords": "_english_"
            }
         }
      }
   }
}

Next, we apply the analyzer to a text as shown below. Note that the token "is" does not appear in the output: it is an English stopword and is removed because the analyzer is configured with stopwords set to _english_ (this is also why position 4 is skipped). Note also that, since max_token_length is 5, longer words are broken into pieces of at most five characters, so "weather" becomes "weath" and "er", and "beautiful" becomes "beaut" and "iful".

POST index_4_analysis/_analyze
{
   "analyzer": "my_english_analyzer",
   "text": "Today's weather is beautiful"
}

On running the above code, we get the response as shown below −

{
   "tokens" : [
      {
         "token" : "today",
         "start_offset" : 0,
         "end_offset" : 5,
         "type" : "",
         "position" : 0
      },
      {
         "token" : "s",
         "start_offset" : 6,
         "end_offset" : 7,
         "type" : "",
         "position" : 1
      },
      {
         "token" : "weath",
         "start_offset" : 8,
         "end_offset" : 13,
         "type" : "",
         "position" : 2
      },
      {
         "token" : "er",
         "start_offset" : 13,
         "end_offset" : 15,
         "type" : "",
         "position" : 3
      },
      {
         "token" : "beaut",
         "start_offset" : 19,
         "end_offset" : 24,
         "type" : "",
         "position" : 5
      },
      {
         "token" : "iful",
         "start_offset" : 24,
         "end_offset" : 28,
         "type" : "",
         "position" : 6
      }
   ]
}
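
The custom analyzer defined above can also be applied at index time by referencing it in a field mapping. The sketch below is only an illustration and is not part of the original example set; the field name description is hypothetical −

PUT index_4_analysis/_mapping
{
   "properties": {
      "description": {
         "type": "text",
         "analyzer": "my_english_analyzer"
      }
   }
}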

The list of various analyzers and their descriptions is given in the table shown below −

1. Standard analyzer (standard)
   stopwords and max_token_length can be set for this analyzer. By default, the stopwords list is empty and max_token_length is 255.

2. Simple analyzer (simple)
   This analyzer is composed of a lowercase tokenizer.

3. Whitespace analyzer (whitespace)
   This analyzer is composed of a whitespace tokenizer.

4. Stop analyzer (stop)
   stopwords and stopwords_path can be configured. By default, stopwords is initialized to the English stop words, and stopwords_path contains the path to a text file with stop words.
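
As a quick illustration (this request is not part of the original example set), any of these built-in analyzers can be tried directly with the _analyze API. For instance, the whitespace analyzer splits only on whitespace and does not lowercase terms or remove stop words −

POST _analyze
{
   "analyzer": "whitespace",
   "text": "Today's weather is beautiful"
}

This would return the tokens Today's, weather, is and beautiful with their original casing preserved.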

Tokenizers

Tokenizers are used for generating tokens from a text in Elasticsearch. Text can be broken down into tokens by taking whitespace or other punctuation into account. Elasticsearch has plenty of built-in tokenizers, which can be used in custom analyzers.

An example of a tokenizer that breaks text into terms whenever it encounters a character which is not a letter, and also lowercases all terms, is shown below −

POST _analyze
{
   "tokenizer": "lowercase",
   "text": "It Was a Beautiful Weather 5 Days ago."
}

On running the above code, we get the response as shown below −

{
   "tokens" : [
      {
         "token" : "it",
         "start_offset" : 0,
         "end_offset" : 2,
         "type" : "word",
         "position" : 0
      },
      {
         "token" : "was",
         "start_offset" : 3,
         "end_offset" : 6,
         "type" : "word",
         "position" : 1
      },
      {
         "token" : "a",
         "start_offset" : 7,
         "end_offset" : 8,
         "type" : "word",
         "position" : 2
      },
      {
         "token" : "beautiful",
         "start_offset" : 9,
         "end_offset" : 18,
         "type" : "word",
         "position" : 3
      },
      {
         "token" : "weather",
         "start_offset" : 19,
         "end_offset" : 26,
         "type" : "word",
         "position" : 4
      },
      {
         "token" : "days",
         "start_offset" : 29,
         "end_offset" : 33,
         "type" : "word",
         "position" : 5
      },
      {
         "token" : "ago",
         "start_offset" : 34,
         "end_offset" : 37,
         "type" : "word",
         "position" : 6
      }
   ]
}
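
The _analyze API can also combine a tokenizer with one or more token filters directly in the request. The following sketch (not part of the original text) runs the standard tokenizer followed by the lowercase token filter on the same sentence −

POST _analyze
{
   "tokenizer": "standard",
   "filter": [ "lowercase" ],
   "text": "It Was a Beautiful Weather 5 Days ago."
}

Unlike the lowercase tokenizer above, the standard tokenizer keeps digits, so this request would also return the token 5 in addition to the lowercased words.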

A list of tokenizers and their descriptions is shown in the table given below −

1. Standard tokenizer (standard)
   This is built on a grammar-based tokenizer, and max_token_length can be configured for this tokenizer.

2. Edge NGram tokenizer (edgeNGram)
   Settings like min_gram, max_gram and token_chars can be set for this tokenizer.

3. Keyword tokenizer (keyword)
   This generates the entire input as a single output token, and buffer_size can be set for it.

4. Letter tokenizer (letter)
   This captures the whole word until a non-letter character is encountered.
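
As a further sketch (the index and analyzer names below are only illustrative, and the newer edge_ngram type name is used in place of edgeNGram), a tokenizer from this table can be wired into a custom analyzer when an index is created. Here the edge NGram tokenizer is configured with min_gram, max_gram and token_chars −

PUT index_5_analysis
{
   "settings": {
      "analysis": {
         "tokenizer": {
            "my_edge_ngram_tokenizer": {
               "type": "edge_ngram",
               "min_gram": 2,
               "max_gram": 5,
               "token_chars": [ "letter", "digit" ]
            }
         },
         "analyzer": {
            "my_edge_ngram_analyzer": {
               "type": "custom",
               "tokenizer": "my_edge_ngram_tokenizer"
            }
         }
      }
   }
}

Analyzing a word such as "weather" with my_edge_ngram_analyzer would then produce the edge n-grams we, wea, weat and weath.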