Lucene Concise Tutorial

Lucene - Analysis

In a previous chapter, we saw that IndexWriter uses an Analyzer to analyze documents and then creates, opens, or edits indexes as required. In this chapter, we discuss the various types of Analyzer objects and the other related objects used during the analysis process. Understanding the analysis process and how analyzers work gives you insight into how Lucene indexes documents.
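To make the idea of analysis concrete, here is a minimal sketch in plain Java of turning text into tokens that carry a position and character offsets. It does not use any Lucene classes: `TokenSketch`, `SimpleToken`, and `tokenize` are hypothetical names invented for this illustration, and the whitespace-only splitting rule is an assumption chosen for simplicity.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenSketch {

    // Hypothetical stand-in for the metadata a token carries; NOT a Lucene class.
    public static final class SimpleToken {
        public final String text;
        public final int position;     // ordinal position of the token in the text
        public final int startOffset;  // index of the token's first character
        public final int endOffset;    // index one past the token's last character

        public SimpleToken(String text, int position, int startOffset, int endOffset) {
            this.text = text;
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }

        @Override
        public String toString() {
            return text + " (pos=" + position + ", start=" + startOffset + ", end=" + endOffset + ")";
        }
    }

    // Splits on whitespace and records each token's position and offsets.
    public static List<SimpleToken> tokenize(String text) {
        List<SimpleToken> tokens = new ArrayList<>();
        int position = 0;
        int i = 0;
        while (i < text.length()) {
            // Skip any run of whitespace before the next token.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume characters until the next whitespace or the end of the text.
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                tokens.add(new SimpleToken(text.substring(start, i), position++, start, i));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (SimpleToken token : tokenize("Lucene indexes documents")) {
            System.out.println(token);
        }
    }
}
```

Each printed line shows one token with its position and offsets, which is the kind of per-token detail an index can later use for phrase matching or highlighting.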

The following is the list of objects that we will discuss in due course.

S.No.  Class & Description

1  Token: A Token represents a piece of text or a word in a document, together with its metadata (position, start offset, end offset, token type, and position increment).

2  TokenStream: A TokenStream is the output of the analysis process and comprises a series of tokens. It is an abstract class.

3  Analyzer: This is the abstract base class for every type of Analyzer.

4  WhitespaceAnalyzer: This analyzer splits the text in a document on whitespace.

5  SimpleAnalyzer: This analyzer splits the text in a document on non-letter characters and converts the text to lowercase.

6  StopAnalyzer: This analyzer works like the SimpleAnalyzer and additionally removes common words such as 'a', 'an', and 'the'.

7  StandardAnalyzer: This is the most sophisticated analyzer and is capable of handling names, email addresses, etc. It lowercases each token and removes common words and punctuation, if any.
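
The difference between the whitespace, simple, and stop analyzers is easiest to see on one input. The following is a rough sketch in plain Java that imitates the behavior described above. It does not call Lucene, and the class `AnalyzerSketch`, its method names, and the three-word stop list (taken from the examples in the list above) are assumptions made for illustration, not Lucene's actual implementation or its default stop-word set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class AnalyzerSketch {

    // Assumed stop list: only the three words named in the text, not Lucene's default set.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the"));

    // Imitates a whitespace analyzer: split on runs of whitespace, keep case and punctuation.
    public static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Imitates a simple analyzer: split on non-letter characters, then lowercase.
    public static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Imitates a stop analyzer: the simple analyzer's output minus the stop words.
    public static List<String> stop(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : simple(text)) {
            if (!STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The Quick-Brown fox, a test!";
        System.out.println("whitespace: " + whitespace(text));
        System.out.println("simple:     " + simple(text));
        System.out.println("stop:       " + stop(text));
    }
}
```

For the sample sentence, the whitespace variant keeps "Quick-Brown" and "fox," intact, the simple variant yields the, quick, brown, fox, a, test, and the stop variant drops "the" and "a" from that result.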