Lucene: A Concise Tutorial
Lucene - Overview
Lucene is a simple yet powerful Java-based search library. It can be used in any application to add search capability. Lucene is open source, scalable, and high-performance, and it can index and search virtually any kind of text. The library provides the two core operations that every search application needs: indexing and searching.
How Does a Search Application Work?
A search application performs all or some of the following operations:
Step | Title | Description
1 | Acquire Raw Content | The first step of any search application is to collect the target content on which the search is to be conducted.
2 | Build the Document | The next step is to build document(s) from the raw content in a form that the search application can understand and interpret easily.
3 | Analyze the Document | Before indexing starts, the document is analyzed to decide which parts of the text are candidates for indexing.
4 | Index the Document | Once documents are built and analyzed, the next step is to index them so that they can be retrieved by certain keys instead of by their entire content. Indexing is similar to the index at the end of a book, where common words are listed with their page numbers so that they can be located quickly without reading the whole book.
5 | User Interface for Search | Once the index database is ready, the application can run searches. To let a user search, the application must provide a user interface where the user can enter text and start the search process.
6 | Build Query | Once a user submits a search request, the application should prepare a Query object from that text, which can be used to query the index database for the relevant details.
7 | Search Query | Using the Query object, the index database is checked to get the relevant details and the matching documents.
8 | Render Results | Once the results are received, the application should decide how to show them to the user in the user interface, for example how much information to show at first glance.
Apart from these basic operations, a search application can also provide an administration user interface that helps administrators control the level of search based on user profiles. Analytics of search results is another important, more advanced aspect of a search application.
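The book-index analogy in step 4 can be sketched with a toy inverted index in plain Java. This is only an illustration of the idea, assuming a naive split on non-word characters; the class name InvertedIndexSketch and its methods are hypothetical and are not part of Lucene.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

public class InvertedIndexSketch {
    // term -> sorted set of ids of the documents that contain the term
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // "Analyze" the text very naively (lowercase, split on non-word characters)
    // and record which document each term came from.
    public void addDocument(int docId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
    }

    // Look a term up by key instead of scanning every document.
    public SortedSet<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(),
                Collections.<Integer>emptySortedSet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument(1, "Mohan studies physics");
        index.addDocument(2, "Sita studies chemistry");
        System.out.println(index.search("studies")); // prints [1, 2]
        System.out.println(index.search("mohan"));   // prints [1]
    }
}
```

A lookup touches only the entry for the requested term, which is why a search does not need to read every document again.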
Lucene’s Role in Search Application
Lucene plays a role in steps 2 to 7 above and provides classes for the required operations. In a nutshell, Lucene is the heart of a search application and provides the vital operations of indexing and searching. Acquiring content and displaying results are left to the application.
In the following chapters, we will build a simple search application using the Lucene search library.
Lucene - Environment Setup
This chapter will guide you through preparing a development environment for working with Lucene. It also covers how to set up the JDK and Eclipse on your machine before you set up the Lucene libraries:
Step 1 - Java Development Kit (JDK) Setup
You can download the latest version of the JDK from Oracle’s Java site: Java SE Downloads. The downloaded files include installation instructions; follow them to install and configure the setup. Finally, set the PATH and JAVA_HOME environment variables to refer to the directories that contain java and javac, typically java_install_dir/bin and java_install_dir respectively.
If you are running Windows and installed the JDK in C:\jdk1.6.0_15, you would put the following lines in your C:\autoexec.bat file.
set PATH=C:\jdk1.6.0_15\bin;%PATH%
set JAVA_HOME=C:\jdk1.6.0_15
Alternatively, on Windows NT/2000/XP, you can right-click My Computer, select Properties, then Advanced, then Environment Variables. Then update the PATH value and press the OK button.
On Unix (Solaris, Linux, etc.), if the JDK is installed in /usr/local/jdk1.6.0_15 and you use the C shell, you would put the following into your .cshrc file.
setenv PATH /usr/local/jdk1.6.0_15/bin:$PATH
setenv JAVA_HOME /usr/local/jdk1.6.0_15
Alternatively, if you use an Integrated Development Environment (IDE) such as Borland JBuilder, Eclipse, IntelliJ IDEA, or Sun ONE Studio, compile and run a simple program to confirm that the IDE knows where Java is installed; otherwise, configure it as described in the IDE’s documentation.
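A simple program for this check could look like the following sketch. It only prints two standard JVM system properties; the class name JavaInstallCheck is a hypothetical choice for illustration.

```java
public class JavaInstallCheck {
    // Build a one-line summary of the JVM that is actually running this code.
    static String describeRuntime() {
        return "java.version=" + System.getProperty("java.version")
                + ", java.home=" + System.getProperty("java.home");
    }

    public static void main(String[] args) {
        // If this compiles and runs, the IDE (or your PATH) has found a JDK.
        System.out.println(describeRuntime());
    }
}
```

If the reported java.home is not the installation you expected, the PATH or the IDE's JDK setting probably points at a different Java installation.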
Step 2 - Eclipse IDE Setup
All the examples in this tutorial were written using the Eclipse IDE, so it is recommended that you have the latest version of Eclipse installed on your machine.
To install the Eclipse IDE, download the latest Eclipse binaries from https://www.eclipse.org/downloads/. Once you have downloaded the installation, unpack the binary distribution into a convenient location, for example C:\eclipse on Windows or /usr/local/eclipse on Linux/Unix, and finally set the PATH variable appropriately.
On Windows, Eclipse can be started by executing the following command, or by double-clicking eclipse.exe:
C:\eclipse\eclipse.exe
On Unix (Solaris, Linux, etc.), Eclipse can be started by executing the following command:
$ /usr/local/eclipse/eclipse
After a successful startup, the Eclipse workbench window should appear.
Step 3 - Setup Lucene Framework Libraries
If the startup is successful, you can proceed to set up the Lucene libraries. The following simple steps download and install them on your machine.
- Decide whether you want to install Lucene on Windows or Unix, then proceed to the next step to download the .zip file for Windows or the .tgz file for Unix.
- Download the suitable version of the Lucene binaries from https://archive.apache.org/dist/lucene/java/.
- At the time of writing this tutorial, I downloaded lucene-3.6.2.zip on my Windows machine; unzipping it produces the directory C:\lucene-3.6.2.
You will find all the Lucene libraries in the directory C:\lucene-3.6.2. Make sure your CLASSPATH variable includes the JARs from this directory; otherwise, you will face problems when running your application. If you are using Eclipse, you do not need to set CLASSPATH, because the build path is configured through Eclipse.
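One way to confirm that the Lucene JAR is actually visible to your program is to try loading one of its classes by name. The sketch below uses only the JDK; the class name ClasspathCheck and the helper isOnClasspath are hypothetical, and org.apache.lucene.index.IndexWriter is the class that this tutorial's Indexer imports later.

```java
public class ClasspathCheck {
    // Returns true if the named class can be loaded from the current classpath.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Either the class is missing or one of its dependencies failed to load.
            return false;
        }
    }

    public static void main(String[] args) {
        String probe = "org.apache.lucene.index.IndexWriter";
        System.out.println(probe + (isOnClasspath(probe)
                ? " found"
                : " NOT found; check CLASSPATH or the Eclipse build path"));
    }
}
```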
Once you are done with this last step, you are ready for your first Lucene example, which you will see in the next chapter.
Lucene - First Application
In this chapter, we will start programming with Lucene. Before you write your first example, make sure that you have set up your Lucene environment properly, as explained in the Lucene - Environment Setup chapter. A working knowledge of the Eclipse IDE is recommended.
Let us now write a simple search application that prints the number of search results found. We will also see the list of index files created during this process.
Step 1 - Create Java Project
The first step is to create a simple Java project using the Eclipse IDE. Choose File → New → Project and select the Java Project wizard from the wizard list. Then name your project LuceneFirstApplication in the wizard window.
Once your project is created successfully, it appears in your Project Explorer.
Step 2 - Add Required Libraries
Let us now add the Lucene core library to our project. To do this, right-click your project name LuceneFirstApplication and choose the following option from the context menu: Build Path → Configure Build Path. This opens the Java Build Path window.
Now use the Add External JARs button under the Libraries tab to add the following core JAR from the Lucene installation directory:
- lucene-core-3.6.2
Step 3 - Create Source Files
Let us now create the actual source files under the LuceneFirstApplication project. First we need to create a package called com.tutorialspoint.lucene. To do this, right-click src in the Package Explorer and choose New → Package.
Next we will create LuceneTester.java and the other Java classes under the com.tutorialspoint.lucene package.
LuceneConstants.java
This class is used to provide various constants to be used across the sample application.
package com.tutorialspoint.lucene;
public class LuceneConstants {
public static final String CONTENTS = "contents";
public static final String FILE_NAME = "filename";
public static final String FILE_PATH = "filepath";
public static final int MAX_SEARCH = 10;
}
TextFileFilter.java
This class is used as a .txt file filter.
package com.tutorialspoint.lucene;
import java.io.File;
import java.io.FileFilter;
public class TextFileFilter implements FileFilter {
@Override
public boolean accept(File pathname) {
return pathname.getName().toLowerCase().endsWith(".txt");
}
}
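As a quick check of the filter rule, the sketch below applies the same test to a few file names. It repeats the rule as a lambda so that it compiles on its own; the class name TextFileFilterDemo and the file names are made up for illustration.

```java
import java.io.File;
import java.io.FileFilter;

public class TextFileFilterDemo {
    // Same rule as TextFileFilter: accept names ending in ".txt", ignoring case.
    static final FileFilter TXT_ONLY =
            pathname -> pathname.getName().toLowerCase().endsWith(".txt");

    public static void main(String[] args) {
        // accept() only inspects the name here, so the files need not exist.
        System.out.println(TXT_ONLY.accept(new File("record1.txt"))); // prints true
        System.out.println(TXT_ONLY.accept(new File("RECORD2.TXT"))); // prints true
        System.out.println(TXT_ONLY.accept(new File("notes.pdf")));   // prints false
    }
}
```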
Indexer.java
This class indexes the raw data so that it can be searched using the Lucene library.
package com.tutorialspoint.lucene;
import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
public class Indexer {
private IndexWriter writer;
public Indexer(String indexDirectoryPath) throws IOException {
//this directory will contain the indexes
Directory indexDirectory =
FSDirectory.open(new File(indexDirectoryPath));
//create the indexer
writer = new IndexWriter(indexDirectory,
new StandardAnalyzer(Version.LUCENE_36),true,
IndexWriter.MaxFieldLength.UNLIMITED);
}
public void close() throws CorruptIndexException, IOException {
writer.close();
}
private Document getDocument(File file) throws IOException {
Document document = new Document();
//index file contents
Field contentField = new Field(LuceneConstants.CONTENTS, new FileReader(file));
//index file name
Field fileNameField = new Field(LuceneConstants.FILE_NAME,
file.getName(),Field.Store.YES,Field.Index.NOT_ANALYZED);
//index file path
Field filePathField = new Field(LuceneConstants.FILE_PATH,
file.getCanonicalPath(),Field.Store.YES,Field.Index.NOT_ANALYZED);
document.add(contentField);
document.add(fileNameField);
document.add(filePathField);
return document;
}
private void indexFile(File file) throws IOException {
System.out.println("Indexing "+file.getCanonicalPath());
Document document = getDocument(file);
writer.addDocument(document);
}
public int createIndex(String dataDirPath, FileFilter filter)
throws IOException {
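//get all files in the data directory
File[] files = new File(dataDirPath).listFiles();
//listFiles() returns null if the path is not a readable directory
if (files == null) {
throw new IOException("Cannot list data directory: " + dataDirPath);
}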
for (File file : files) {
if(!file.isDirectory()
&& !file.isHidden()
&& file.exists()
&& file.canRead()
&& filter.accept(file)
){
indexFile(file);
}
}
return writer.numDocs();
}
}
Searcher.java
This class searches the indexes created by the Indexer for the requested content.
package com.tutorialspoint.lucene;
import java.io.File;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
public class Searcher {
IndexSearcher indexSearcher;
QueryParser queryParser;
Query query;
public Searcher(String indexDirectoryPath)
throws IOException {
Directory indexDirectory =
FSDirectory.open(new File(indexDirectoryPath));
indexSearcher = new IndexSearcher(indexDirectory);
queryParser = new QueryParser(Version.LUCENE_36,
LuceneConstants.CONTENTS,
new StandardAnalyzer(Version.LUCENE_36));
}
public TopDocs search( String searchQuery)
throws IOException, ParseException {
query = queryParser.parse(searchQuery);
return indexSearcher.search(query, LuceneConstants.MAX_SEARCH);
}
public Document getDocument(ScoreDoc scoreDoc)
throws CorruptIndexException, IOException {
return indexSearcher.doc(scoreDoc.doc);
}
public void close() throws IOException {
indexSearcher.close();
}
}
LuceneTester.java
This class is used to test the indexing and search capability of the Lucene library.
package com.tutorialspoint.lucene;
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
public class LuceneTester {
String indexDir = "E:\\Lucene\\Index";
String dataDir = "E:\\Lucene\\Data";
Indexer indexer;
Searcher searcher;
public static void main(String[] args) {
LuceneTester tester;
try {
tester = new LuceneTester();
tester.createIndex();
tester.search("Mohan");
} catch (IOException e) {
e.printStackTrace();
} catch (ParseException e) {
e.printStackTrace();
}
}
private void createIndex() throws IOException {
indexer = new Indexer(indexDir);
int numIndexed;
long startTime = System.currentTimeMillis();
numIndexed = indexer.createIndex(dataDir, new TextFileFilter());
long endTime = System.currentTimeMillis();
indexer.close();
System.out.println(numIndexed+" File indexed, time taken: "
+(endTime-startTime)+" ms");
}
private void search(String searchQuery) throws IOException, ParseException {
searcher = new Searcher(indexDir);
long startTime = System.currentTimeMillis();
TopDocs hits = searcher.search(searchQuery);
long endTime = System.currentTimeMillis();
System.out.println(hits.totalHits +
" documents found. Time :" + (endTime - startTime));
for(ScoreDoc scoreDoc : hits.scoreDocs) {
Document doc = searcher.getDocument(scoreDoc);
System.out.println("File: "
+ doc.get(LuceneConstants.FILE_PATH));
}
searcher.close();
}
}
Step 4 - Data & Index directory creation
We have used 10 text files, record1.txt to record10.txt, containing names and other details of students, and put them in the directory E:\Lucene\Data. The index directory should be created as E:\Lucene\Index. After running this program, you can see the list of index files created in that folder.
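If you do not have test data at hand, the two directories and a set of placeholder files can be created with a short program such as the sketch below. The file contents here are invented placeholders, not the tutorial's actual records, so a search for "Mohan" will only match if you put that name into one of the files; the class name SampleDataSetup is hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class SampleDataSetup {
    // Creates <root>/Data with record1.txt..record<count>.txt and an empty <root>/Index.
    // Returns the number of files found in the Data directory afterwards.
    static int createSampleData(Path root, int count) throws IOException {
        Path dataDir = Files.createDirectories(root.resolve("Data"));
        Files.createDirectories(root.resolve("Index"));
        for (int i = 1; i <= count; i++) {
            // Placeholder content; replace with real student details.
            String body = "Name: Student " + i + "\nRoll number: " + i + "\n";
            Files.write(dataDir.resolve("record" + i + ".txt"),
                    body.getBytes(StandardCharsets.UTF_8));
        }
        try (Stream<Path> files = Files.list(dataDir)) {
            return (int) files.count();
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the tutorial's E:\Lucene; pass another root as the first argument.
        Path root = Paths.get(args.length > 0 ? args[0] : "E:\\Lucene");
        System.out.println(createSampleData(root, 10) + " files in " + root.resolve("Data"));
    }
}
```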
Step 5 - Running the program
Once you have created the source files, the raw data, the data directory, and the index directory, you are ready to compile and run your program. To do this, keep the LuceneTester.java file tab active and use either the Run option in the Eclipse IDE or Ctrl + F11 to compile and run your LuceneTester application. If the application runs successfully, it prints the following messages in the Eclipse IDE’s console:
Indexing E:\Lucene\Data\record1.txt
Indexing E:\Lucene\Data\record10.txt
Indexing E:\Lucene\Data\record2.txt
Indexing E:\Lucene\Data\record3.txt
Indexing E:\Lucene\Data\record4.txt
Indexing E:\Lucene\Data\record5.txt
Indexing E:\Lucene\Data\record6.txt
Indexing E:\Lucene\Data\record7.txt
Indexing E:\Lucene\Data\record8.txt
Indexing E:\Lucene\Data\record9.txt
10 File indexed, time taken: 109 ms
1 documents found. Time :0
File: E:\Lucene\Data\record4.txt
Once you have run the program successfully, the index directory will contain the index files that Lucene created.
Lucene - Indexing Classes
Indexing is one of the core functionalities provided by Lucene. IndexWriter is the most important, core component of the indexing process.

We add Document(s) containing Field(s) to the IndexWriter, which analyzes the Document(s) using the Analyzer and then creates, opens, or edits indexes as required and stores or updates them in a Directory. IndexWriter is used to create or update indexes. It is not used to read them.
Indexing Classes
The following classes are commonly used during the indexing process.
S.No. | Class | Description
1 | IndexWriter | This class acts as the core component that creates and updates indexes during the indexing process.
2 | Directory | This class represents the storage location of the indexes.
3 | Analyzer | This class is responsible for analyzing a document and extracting the tokens/words from the text to be indexed. Without analysis, IndexWriter cannot create an index.
4 | Document | This class represents a virtual document made of Fields, where a Field is an object that can contain the physical document’s contents, its metadata, and so on. The Analyzer can understand only a Document.
5 | Field | This is the lowest unit, or the starting point, of the indexing process. It represents a key-value pair in which the key identifies the value to be indexed. For example, a field used to represent the contents of a document has the key "contents", and its value may contain part or all of the text or numeric content of the document. Lucene can index only text or numeric content.
Lucene - Searching Classes
Searching is another of the core functionalities provided by Lucene. Its flow is similar to that of the indexing process. A basic Lucene search uses the following classes, which can also be regarded as the foundation classes for all search-related operations.
Searching Classes
The following classes are commonly used during the searching process.
S.No. | Class | Description
1 | IndexSearcher | This class acts as the core component that reads and searches the indexes created by the indexing process. It takes a Directory instance pointing to the location that contains the indexes.
2 | Term | This class is the lowest unit of searching. It is similar to Field in the indexing process.
3 | Query | Query is an abstract class that contains various utility methods and is the parent of all types of queries that Lucene uses during the search process.
4 | TermQuery | TermQuery is the most commonly used query object and is the foundation of many complex queries that Lucene can make use of.
5 | TopDocs | TopDocs points to the top N search results that match the search criteria. It is a simple container of pointers to the documents that are the output of a search.
Lucene - Indexing Process
Indexing is one of the core functionalities provided by Lucene. IndexWriter is the most important, core component of the indexing process.

We add Document(s) containing Field(s) to the IndexWriter, which analyzes the Document(s) using the Analyzer and then creates, opens, or edits indexes as required and stores or updates them in a Directory. IndexWriter is used to create or update indexes. It is not used to read them.
We will now walk through the indexing process step by step, using a basic example.
Create a Document
- Create a method to get a Lucene document from a text file.
- Create various types of fields, which are key-value pairs with names as keys and the contents to be indexed as values.
- Set each field to be analyzed or not. In our case, only the contents field is analyzed, because it can contain words such as a, am, are, and an that are not required in search operations.
- Add the newly created fields to the document object and return it to the calling method.
private Document getDocument(File file) throws IOException {
Document document = new Document();
//index file contents
Field contentField = new Field(LuceneConstants.CONTENTS,
new FileReader(file));
//index file name
Field fileNameField = new Field(LuceneConstants.FILE_NAME,
file.getName(),
Field.Store.YES,Field.Index.NOT_ANALYZED);
//index file path
Field filePathField = new Field(LuceneConstants.FILE_PATH,
file.getCanonicalPath(),
Field.Store.YES,Field.Index.NOT_ANALYZED);
document.add(contentField);
document.add(fileNameField);
document.add(filePathField);
return document;
}
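To see why only the contents field is analyzed while the file name and path are kept whole, the following plain-Java sketch imitates the idea of analysis: lowercase the text, split it into tokens, and drop a few stop words. It is a simplified stand-in under stated assumptions, not StandardAnalyzer; the stop list is just the four words named above, and the class name StopWordSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopWordSketch {
    // A tiny hand-picked stop list for illustration; not Lucene's actual list.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "am", "are", "an"));

    // Lowercase, split on non-word characters, and drop stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Free text benefits from analysis: the filler words disappear.
        System.out.println(analyze("I am a student")); // prints [i, student]
        // An identifier such as a file name would be broken apart by this naive
        // tokenizer, which is one reason to index it as a single unanalyzed value.
        System.out.println(analyze("record1.txt"));    // prints [record1, txt]
    }
}
```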
Create an IndexWriter
The IndexWriter class acts as the core component that creates and updates indexes during the indexing process. Follow these steps to create an IndexWriter:
Step 1 − Declare an IndexWriter object.
Step 2 − Create a Lucene Directory that points to the location where the indexes are to be stored.
Step 3 − Initialize the IndexWriter with the index directory, a StandardAnalyzer carrying version information, and the other required or optional parameters.
private IndexWriter writer;
public Indexer(String indexDirectoryPath) throws IOException {
//this directory will contain the indexes
Directory indexDirectory =
FSDirectory.open(new File(indexDirectoryPath));
//create the indexer
writer = new IndexWriter(indexDirectory,
new StandardAnalyzer(Version.LUCENE_36),true,
IndexWriter.MaxFieldLength.UNLIMITED);
}
Start Indexing Process
The following method shows how to index a single file:
private void indexFile(File file) throws IOException {
System.out.println("Indexing "+file.getCanonicalPath());
Document document = getDocument(file);
writer.addDocument(document);
}
Example Application
To test the indexing process, we need to create a Lucene test application.
Step | Description
1 | Create a project named LuceneFirstApplication with a package com.tutorialspoint.lucene, as explained in the Lucene - First Application chapter. You can also reuse the project created in that chapter as is to understand the indexing process.
2 | Create LuceneConstants.java, TextFileFilter.java, and Indexer.java as explained in the Lucene - First Application chapter. Keep the rest of the files unchanged.
3 | Create LuceneTester.java as shown below.
4 | Clean and build the application to make sure the business logic works as required.
LuceneConstants.java
This class is used to provide various constants to be used across the sample application.
package com.tutorialspoint.lucene;
public class LuceneConstants {
public static final String CONTENTS = "contents";
public static final String FILE_NAME = "filename";
public static final String FILE_PATH = "filepath";
public static final int MAX_SEARCH = 10;
}
TextFileFilter.java
This class is used as a .txt file filter.
package com.tutorialspoint.lucene;
import java.io.File;
import java.io.FileFilter;
public class TextFileFilter implements FileFilter {
@Override
public boolean accept(File pathname) {
return pathname.getName().toLowerCase().endsWith(".txt");
}
}
Indexer.java
This class indexes the raw data so that it can be searched using the Lucene library.
package com.tutorialspoint.lucene;
import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
public class Indexer {
private IndexWriter writer;
public Indexer(String indexDirectoryPath) throws IOException {
//this directory will contain the indexes
Directory indexDirectory =
FSDirectory.open(new File(indexDirectoryPath));
//create the indexer
writer = new IndexWriter(indexDirectory,
new StandardAnalyzer(Version.LUCENE_36),true,
IndexWriter.MaxFieldLength.UNLIMITED);
}
public void close() throws CorruptIndexException, IOException {
writer.close();
}
private Document getDocument(File file) throws IOException {
Document document = new Document();
//index file contents
Field contentField = new Field(LuceneConstants.CONTENTS,
new FileReader(file));
//index file name
Field fileNameField = new Field(LuceneConstants.FILE_NAME,
file.getName(),
Field.Store.YES,Field.Index.NOT_ANALYZED);
//index file path
Field filePathField = new Field(LuceneConstants.FILE_PATH,
file.getCanonicalPath(),
Field.Store.YES,Field.Index.NOT_ANALYZED);
document.add(contentField);
document.add(fileNameField);
document.add(filePathField);
return document;
}
private void indexFile(File file) throws IOException {
System.out.println("Indexing "+file.getCanonicalPath());
Document document = getDocument(file);
writer.addDocument(document);
}
public int createIndex(String dataDirPath, FileFilter filter)
throws IOException {
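//get all files in the data directory
File[] files = new File(dataDirPath).listFiles();
//listFiles() returns null if the path is not a readable directory
if (files == null) {
throw new IOException("Cannot list data directory: " + dataDirPath);
}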
for (File file : files) {
if(!file.isDirectory()
&& !file.isHidden()
&& file.exists()
&& file.canRead()
&& filter.accept(file)
){
indexFile(file);
}
}
return writer.numDocs();
}
}
LuceneTester.java
This class is used to test the indexing capability of the Lucene library.
package com.tutorialspoint.lucene;
import java.io.IOException;
public class LuceneTester {
String indexDir = "E:\\Lucene\\Index";
String dataDir = "E:\\Lucene\\Data";
Indexer indexer;
public static void main(String[] args) {
LuceneTester tester;
try {
tester = new LuceneTester();
tester.createIndex();
} catch (IOException e) {
e.printStackTrace();
}
}
private void createIndex() throws IOException {
indexer = new Indexer(indexDir);
int numIndexed;
long startTime = System.currentTimeMillis();
numIndexed = indexer.createIndex(dataDir, new TextFileFilter());
long endTime = System.currentTimeMillis();
indexer.close();
System.out.println(numIndexed+" File indexed, time taken: "
+(endTime-startTime)+" ms");
}
}
Data & Index Directory Creation
We have used 10 text files, record1.txt to record10.txt, containing names and other details of students, and put them in the directory E:\Lucene\Data. The index directory should be created as E:\Lucene\Index. After running this program, you can see the list of index files created in that folder.
Running the Program
Once you have created the source files, the raw data, the data directory, and the index directory, you can compile and run your program. To do this, keep the LuceneTester.java file tab active and use either the Run option in the Eclipse IDE or Ctrl + F11 to compile and run your LuceneTester application. If your application runs successfully, it prints the following messages in the Eclipse IDE’s console:
Indexing E:\Lucene\Data\record1.txt
Indexing E:\Lucene\Data\record10.txt
Indexing E:\Lucene\Data\record2.txt
Indexing E:\Lucene\Data\record3.txt
Indexing E:\Lucene\Data\record4.txt
Indexing E:\Lucene\Data\record5.txt
Indexing E:\Lucene\Data\record6.txt
Indexing E:\Lucene\Data\record7.txt
Indexing E:\Lucene\Data\record8.txt
Indexing E:\Lucene\Data\record9.txt
10 File indexed, time taken: 109 ms
Once you have run the program successfully, the index directory will contain the index files that Lucene created.
Lucene - Indexing Operations
In this chapter, we will discuss the four major operations of indexing. These operations are useful at different times and are used throughout a search application.
Indexing Operations
The following operations are commonly used during the indexing process.
S.No. | Operation | Description
1 | Add Document | This operation is used in the initial stage of the indexing process to create the indexes on newly available content.
2 | Update Document | This operation is used to update indexes to reflect changes in the updated content. It is similar to recreating the index.
3 | Delete Document | This operation is used to update indexes to exclude documents which no longer need to be indexed or searched.
4 | Field Options | Field options control the ways in which the contents of a field are made searchable.
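The add, update and delete operations above can be pictured with a toy inverted index. The sketch below is plain Java and does not use Lucene; the ToyIndex class, its method names and the sample texts are hypothetical, and modeling an update as a delete followed by an add is a simplifying assumption made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index in plain Java (hypothetical class, not Lucene's IndexWriter).
class ToyIndex {
    // term -> IDs of the documents that contain it
    private final Map<String, Set<String>> postings = new HashMap<>();
    // document ID -> its terms, kept so the document can be removed later
    private final Map<String, String[]> stored = new HashMap<>();

    // Add Document: split the text into lowercase terms and record each one.
    void addDocument(String id, String text) {
        String[] terms = text.toLowerCase().split("\\s+");
        stored.put(id, terms);
        for (String term : terms) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
        }
    }

    // Delete Document: remove the ID from every posting list it appears in.
    void deleteDocument(String id) {
        String[] terms = stored.remove(id);
        if (terms == null) {
            return; // unknown document, nothing to do
        }
        for (String term : terms) {
            Set<String> ids = postings.get(term);
            if (ids != null) {
                ids.remove(id);
                if (ids.isEmpty()) {
                    postings.remove(term);
                }
            }
        }
    }

    // Update Document: modeled here as a delete followed by an add.
    void updateDocument(String id, String newText) {
        deleteDocument(id);
        addDocument(id, newText);
    }

    // Return the IDs of the documents that contain the term.
    Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), new TreeSet<>());
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addDocument("doc1", "lucene indexing basics");
        index.addDocument("doc2", "lucene search basics");
        System.out.println(index.search("lucene"));   // both documents
        index.updateDocument("doc2", "query parsing");
        System.out.println(index.search("lucene"));   // only doc1 remains
        index.deleteDocument("doc1");
        System.out.println(index.search("lucene"));   // no documents left
    }
}
```

Keeping each document's terms alongside the posting lists is what makes deletion cheap in this sketch; a real index uses far more compact structures.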
Lucene - Search Operation
Searching is one of the core functionalities provided by Lucene. The following diagram illustrates the process and its use. IndexSearcher is one of the core components of the searching process.

We first create a Directory containing the index and pass it to IndexSearcher, which opens the Directory using an IndexReader. Then we create a Query with a Term and run the search by passing the Query to the IndexSearcher. IndexSearcher returns a TopDocs object, which contains the search details along with the document ID(s) of the Document(s) that matched the search.
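To make the shape of the result concrete, here is a toy version of that flow in plain Java. It does not use Lucene: the ToySearch, Hit and Result classes are hypothetical stand-ins, the score is simply the number of times the term occurs, and the sample documents are made up.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for the search flow (hypothetical classes, not Lucene's API).
class ToySearch {
    // One matching document: its ID and a score.
    static class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    // Result container in the spirit of TopDocs: total match count plus the best hits.
    static class Result {
        final int totalHits;
        final List<Hit> hits;
        Result(int totalHits, List<Hit> hits) { this.totalHits = totalHits; this.hits = hits; }
    }

    // Score each document by how often the term occurs and keep at most maxHits.
    static Result search(String[] docs, String term, int maxHits) {
        String wanted = term.toLowerCase();
        List<Hit> all = new ArrayList<>();
        for (int doc = 0; doc < docs.length; doc++) {
            int count = 0;
            for (String word : docs[doc].toLowerCase().split("\\s+")) {
                if (word.equals(wanted)) count++;
            }
            if (count > 0) all.add(new Hit(doc, count));
        }
        // Highest score first; ties fall back to the lower document ID.
        all.sort((a, b) -> {
            int byScore = Float.compare(b.score, a.score);
            return byScore != 0 ? byScore : Integer.compare(a.doc, b.doc);
        });
        List<Hit> top = new ArrayList<>(all.subList(0, Math.min(maxHits, all.size())));
        return new Result(all.size(), top);
    }

    public static void main(String[] args) {
        String[] docs = { "red apple", "green apple apple", "yellow banana" };
        Result result = search(docs, "apple", 10);
        System.out.println(result.totalHits + " documents found");
        for (Hit hit : result.hits) {
            System.out.println("doc=" + hit.doc + " score=" + hit.score);
        }
    }
}
```

Note that the total number of matches and the number of returned hits are tracked separately, so a caller can ask for only the best few hits and still report how many documents matched.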
We will now walk through the searching process step by step using a basic example.
Create a QueryParser
The QueryParser class parses user input into a query in a format that Lucene understands. Follow these steps to create a QueryParser −
Step 1 − Create a QueryParser object.
Step 2 − Initialize the QueryParser object with the Lucene version, the name of the default field that queries are to be run against, and a standard analyzer created with the same version.
QueryParser queryParser;

public Searcher(String indexDirectoryPath) throws IOException {
   queryParser = new QueryParser(Version.LUCENE_36,
      LuceneConstants.CONTENTS,
      new StandardAnalyzer(Version.LUCENE_36));
}
Create an IndexSearcher
The IndexSearcher class is the core component that searches the indexes created during the indexing process. Follow these steps to create an IndexSearcher −
Step 1 − Create an IndexSearcher object.
Step 2 − Create a Lucene Directory that points to the location where the indexes are stored.
Step 3 − Initialize the IndexSearcher object with the index directory.
IndexSearcher indexSearcher;

public Searcher(String indexDirectoryPath) throws IOException {
   Directory indexDirectory =
      FSDirectory.open(new File(indexDirectoryPath));
   indexSearcher = new IndexSearcher(indexDirectory);
}
Make a search
Follow these steps to make a search −
Step 1 − Create a Query object by parsing the search expression with the QueryParser.
Step 2 − Run the search by calling the IndexSearcher.search() method.
Query query;

public TopDocs search(String searchQuery) throws IOException, ParseException {
   query = queryParser.parse(searchQuery);
   return indexSearcher.search(query, LuceneConstants.MAX_SEARCH);
}
Get the Document
The following method shows how to retrieve a matching document from its ScoreDoc.
public Document getDocument(ScoreDoc scoreDoc)
   throws CorruptIndexException, IOException {
   return indexSearcher.doc(scoreDoc.doc);
}
Close IndexSearcher
The following method shows how to close the IndexSearcher.
public void close() throws IOException {
   indexSearcher.close();
}
Example Application
Let us create a test Lucene application to test the searching process.
Step | Description
1 | Create a project named LuceneFirstApplication under the package com.tutorialspoint.lucene, as explained in the Lucene - First Application chapter. You can also reuse the project created in that chapter as is to understand the searching process.
2 | Create LuceneConstants.java, TextFileFilter.java and Searcher.java as explained in the Lucene - First Application chapter. Keep the rest of the files unchanged.
3 | Create LuceneTester.java as mentioned below.
4 | Clean and build the application to make sure the business logic works as per the requirements.
LuceneConstants.java
This class provides the constants used across the sample application.
package com.tutorialspoint.lucene;

public class LuceneConstants {
   public static final String CONTENTS = "contents";
   public static final String FILE_NAME = "filename";
   public static final String FILE_PATH = "filepath";
   public static final int MAX_SEARCH = 10;
}
TextFileFilter.java
This class is used as a .txt file filter.
package com.tutorialspoint.lucene;

import java.io.File;
import java.io.FileFilter;

public class TextFileFilter implements FileFilter {

   @Override
   public boolean accept(File pathname) {
      return pathname.getName().toLowerCase().endsWith(".txt");
   }
}
Searcher.java
This class reads the indexes built from the raw data and searches them using the Lucene library.
package com.tutorialspoint.lucene;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Searcher {

   IndexSearcher indexSearcher;
   QueryParser queryParser;
   Query query;

   public Searcher(String indexDirectoryPath) throws IOException {
      Directory indexDirectory =
         FSDirectory.open(new File(indexDirectoryPath));
      indexSearcher = new IndexSearcher(indexDirectory);
      queryParser = new QueryParser(Version.LUCENE_36,
         LuceneConstants.CONTENTS,
         new StandardAnalyzer(Version.LUCENE_36));
   }

   public TopDocs search(String searchQuery)
      throws IOException, ParseException {
      query = queryParser.parse(searchQuery);
      return indexSearcher.search(query, LuceneConstants.MAX_SEARCH);
   }

   public Document getDocument(ScoreDoc scoreDoc)
      throws CorruptIndexException, IOException {
      return indexSearcher.doc(scoreDoc.doc);
   }

   public void close() throws IOException {
      indexSearcher.close();
   }
}
LuceneTester.java
This class is used to test the searching capability of the Lucene library.
package com.tutorialspoint.lucene;

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class LuceneTester {

   String indexDir = "E:\\Lucene\\Index";
   String dataDir = "E:\\Lucene\\Data";
   Searcher searcher;

   public static void main(String[] args) {
      LuceneTester tester;
      try {
         tester = new LuceneTester();
         tester.search("Mohan");
      } catch (IOException e) {
         e.printStackTrace();
      } catch (ParseException e) {
         e.printStackTrace();
      }
   }

   private void search(String searchQuery) throws IOException, ParseException {
      searcher = new Searcher(indexDir);
      long startTime = System.currentTimeMillis();
      TopDocs hits = searcher.search(searchQuery);
      long endTime = System.currentTimeMillis();

      System.out.println(hits.totalHits +
         " documents found. Time :" + (endTime - startTime) +" ms");
      for(ScoreDoc scoreDoc : hits.scoreDocs) {
         Document doc = searcher.getDocument(scoreDoc);
         System.out.println("File: "+ doc.get(LuceneConstants.FILE_PATH));
      }
      searcher.close();
   }
}
Data & Index Directory Creation
We have used 10 text files, named record1.txt to record10.txt, containing names and other details of students, and put them in the directory E:\Lucene\Data. The index directory should be created at E:\Lucene\Index. After running the indexing program from the chapter Lucene - Indexing Process, you can see the list of index files created in that folder.
Running the Program
Once you have created the source files, the raw data, the data directory, the index directory and the indexes, you can compile and run your program. To do this, keep the LuceneTester.java file tab active and use either the Run option available in the Eclipse IDE or Ctrl + F11 to compile and run your LuceneTester application. If the application runs successfully, it prints the following message in the Eclipse IDE's console −
1 documents found. Time :29 ms
File: E:\Lucene\Data\record4.txt
Lucene - Query Programming
As we saw in the previous chapter, Lucene - Search Operation, Lucene uses IndexSearcher to make searches, and IndexSearcher takes the Query object created by QueryParser as its input. In this chapter, we discuss the various types of Query objects and the different ways to create them programmatically. Creating different types of Query objects gives you control over the kind of search to be made.
Consider the Advanced Search feature offered by many applications, where users are given multiple options to narrow the search results. With Query programming, we can achieve the same very easily.
Following is the list of Query types that we'll discuss in due course.
S.No. | Class | Description
1 | TermQuery | TermQuery matches documents that contain a specific term. It is the most commonly used query object and the building block of many more complex queries.
2 | TermRangeQuery | TermRangeQuery is used when a range of textual terms is to be searched.
3 | PrefixQuery | PrefixQuery is used to match documents containing terms that start with a specified string.
4 | BooleanQuery | BooleanQuery is used to search documents by combining multiple queries with AND, OR or NOT operators.
5 | PhraseQuery | PhraseQuery is used to search documents which contain a particular sequence of terms.
6 | WildcardQuery | WildcardQuery is used to search documents using wildcards: '*' matches any character sequence and '?' matches a single character.
7 | FuzzyQuery | FuzzyQuery is used to search documents using a fuzzy implementation, that is, an approximate search based on the edit distance algorithm.
8 | MatchAllDocsQuery | MatchAllDocsQuery, as the name suggests, matches all the documents.
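The matching rules behind several of these query types can be sketched in a few lines of plain Java. The helpers below are hypothetical and do not use Lucene; they only illustrate the ideas of prefix, range, wildcard and edit-distance matching on single terms, and they leave out BooleanQuery and PhraseQuery. Treating both range bounds as inclusive is an assumption of this sketch.

```java
// Toy term matchers in plain Java (hypothetical helpers, not Lucene's query classes).
class ToyTermMatchers {
    // PrefixQuery idea: the term starts with the given prefix.
    static boolean prefix(String term, String prefix) {
        return term.startsWith(prefix);
    }

    // TermRangeQuery idea: the term sorts between lower and upper (both inclusive here).
    static boolean range(String term, String lower, String upper) {
        return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
    }

    // WildcardQuery idea: '*' matches any character sequence, '?' exactly one character.
    static boolean wildcard(String term, String pattern) {
        return wildcard(term, 0, pattern, 0);
    }

    private static boolean wildcard(String t, int i, String p, int j) {
        if (j == p.length()) {
            return i == t.length();
        }
        char c = p.charAt(j);
        if (c == '*') {
            // Let '*' absorb zero or more characters of the term.
            for (int k = i; k <= t.length(); k++) {
                if (wildcard(t, k, p, j + 1)) return true;
            }
            return false;
        }
        return i < t.length()
            && (c == '?' || c == t.charAt(i))
            && wildcard(t, i + 1, p, j + 1);
    }

    // FuzzyQuery idea: Levenshtein edit distance between two terms.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitute = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                cur[j] = Math.min(substitute, Math.min(prev[j] + 1, cur[j - 1] + 1));
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        String[] terms = { "record1.txt", "record10.txt", "record3.txt" };
        for (String term : terms) {
            System.out.println(term
                + " prefix(record1)=" + prefix(term, "record1")
                + " wildcard(record?.txt)=" + wildcard(term, "record?.txt")
                + " distance(cord3.txt)=" + editDistance("cord3.txt", term));
        }
    }
}
```

A smaller edit distance means a closer match, which is the intuition behind fuzzy search; how Lucene turns that distance into a score is not modeled here.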
Lucene - Analysis
In one of our previous chapters, we saw that Lucene's IndexWriter analyzes the Document(s) using an Analyzer and then creates, opens or edits indexes as required. In this chapter, we discuss the various types of Analyzer objects and other relevant objects used during the analysis process. Understanding the analysis process and how analyzers work will give you great insight into how Lucene indexes documents.
Following is the list of objects that we'll discuss in due course.
S.No. | Class | Description
1 | Token | Token represents a piece of text or a word in a document, together with metadata such as its position, start offset, end offset, token type and position increment.
2 | TokenStream | TokenStream is the output of the analysis process and consists of a series of tokens. It is an abstract class.
3 | Analyzer | Analyzer is the abstract base class for every type of analyzer.
4 | WhitespaceAnalyzer | This analyzer splits the text in a document based on whitespace.
5 | SimpleAnalyzer | This analyzer splits the text in a document based on non-letter characters and puts the text in lowercase.
6 | StopAnalyzer | This analyzer works just like the SimpleAnalyzer and also removes common words like 'a', 'an', 'the', etc.
7 | StandardAnalyzer | This is the most sophisticated analyzer and is capable of handling names, email addresses, etc. It lowercases each token and removes common words and punctuation, if any.
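The differences between the first three analyzers are easiest to see on a sample sentence. The sketch below imitates their described behavior in plain Java; it does not use Lucene, the ToyAnalyzers helpers are hypothetical, and the three-word stop list is an assumption taken from the examples in the table rather than Lucene's actual list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy tokenizers in plain Java (hypothetical helpers, not Lucene's Analyzer classes).
class ToyAnalyzers {
    // Tiny illustrative stop-word list; an assumption, not Lucene's actual list.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "an", "the"));

    // WhitespaceAnalyzer idea: split on whitespace only, keep case and punctuation.
    static List<String> whitespace(String text) {
        return nonEmpty(text.split("\\s+"));
    }

    // SimpleAnalyzer idea: split on non-letter characters and lowercase each token.
    static List<String> simple(String text) {
        return nonEmpty(text.toLowerCase().split("[^\\p{L}]+"));
    }

    // StopAnalyzer idea: same as simple(), then drop the stop words.
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeAll(STOP_WORDS);
        return tokens;
    }

    // Drop the empty strings that String.split can produce at the start.
    private static List<String> nonEmpty(String[] parts) {
        List<String> tokens = new ArrayList<>();
        for (String part : parts) {
            if (!part.isEmpty()) tokens.add(part);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The quick Brown-Fox sent an e-mail";
        System.out.println("whitespace: " + whitespace(text));
        System.out.println("simple:     " + simple(text));
        System.out.println("stop:       " + stop(text));
    }
}
```

For the sample sentence, the whitespace variant keeps "Brown-Fox" and "e-mail" intact, the simple variant splits them at the hyphen and lowercases everything, and the stop variant additionally drops "the" and "an".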
Lucene - Sorting
In this chapter, we look at the order in which Lucene returns search results by default and how that order can be changed as required.
Sorting by Relevance
This is the default sorting mode used by Lucene. Lucene returns the results with the most relevant hit at the top.
private void sortUsingRelevance(String searchQuery)
   throws IOException, ParseException {
   searcher = new Searcher(indexDir);
   long startTime = System.currentTimeMillis();

   //create a term to search file name
   Term term = new Term(LuceneConstants.FILE_NAME, searchQuery);
   //create the fuzzy query object
   Query query = new FuzzyQuery(term);
   //track scores so that each hit reports its score even when a Sort is used
   searcher.setDefaultFieldSortScoring(true, false);
   //do the search, sorted by relevance
   TopDocs hits = searcher.search(query,Sort.RELEVANCE);
   long endTime = System.currentTimeMillis();

   System.out.println(hits.totalHits +
      " documents found. Time :" + (endTime - startTime) + "ms");
   for(ScoreDoc scoreDoc : hits.scoreDocs) {
      Document doc = searcher.getDocument(scoreDoc);
      System.out.print("Score: "+ scoreDoc.score + " ");
      System.out.println("File: "+ doc.get(LuceneConstants.FILE_PATH));
   }
   searcher.close();
}
Sorting by IndexOrder
Lucene also supports sorting by index order. In this mode, the document that was indexed first is shown first in the search results.
private void sortUsingIndex(String searchQuery)
   throws IOException, ParseException {
   searcher = new Searcher(indexDir);
   long startTime = System.currentTimeMillis();

   //create a term to search file name
   Term term = new Term(LuceneConstants.FILE_NAME, searchQuery);
   //create the fuzzy query object
   Query query = new FuzzyQuery(term);
   //track scores so that each hit reports its score even when a Sort is used
   searcher.setDefaultFieldSortScoring(true, false);
   //do the search, sorted by index order
   TopDocs hits = searcher.search(query,Sort.INDEXORDER);
   long endTime = System.currentTimeMillis();

   System.out.println(hits.totalHits +
      " documents found. Time :" + (endTime - startTime) + "ms");
   for(ScoreDoc scoreDoc : hits.scoreDocs) {
      Document doc = searcher.getDocument(scoreDoc);
      System.out.print("Score: "+ scoreDoc.score + " ");
      System.out.println("File: "+ doc.get(LuceneConstants.FILE_PATH));
   }
   searcher.close();
}
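The difference between the two sort modes comes down to the comparison applied to the hits. The plain-Java sketch below does not use Lucene; the ToySortOrders class and its comparators are hypothetical. The file names and scores are taken from this chapter's sample run, while the document IDs and the tie-break on document ID are assumptions inferred from the order of that output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Toy illustration of the two sort orders (hypothetical classes, not Lucene's Sort).
class ToySortOrders {
    static class Hit {
        final int doc;       // assumed position in which the document was indexed
        final String file;
        final float score;
        Hit(int doc, String file, float score) {
            this.doc = doc;
            this.file = file;
            this.score = score;
        }
    }

    // Relevance: highest score first, ties broken by the lower document ID.
    static final Comparator<Hit> RELEVANCE = (a, b) -> {
        int byScore = Float.compare(b.score, a.score);
        return byScore != 0 ? byScore : Integer.compare(a.doc, b.doc);
    };

    // Index order: ascending document ID, regardless of score.
    static final Comparator<Hit> INDEX_ORDER = (a, b) -> Integer.compare(a.doc, b.doc);

    // Return the file names of the hits in the requested order.
    static List<String> files(List<Hit> hits, Comparator<Hit> order) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(order);
        List<String> names = new ArrayList<>();
        for (Hit hit : sorted) {
            names.add(hit.file);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
            new Hit(0, "record1.txt", 0.790779f),
            new Hit(1, "record10.txt", 0.2635932f),
            new Hit(2, "record2.txt", 0.790779f),
            new Hit(3, "record3.txt", 1.3179655f));
        System.out.println("Relevance:   " + files(hits, RELEVANCE));
        System.out.println("Index order: " + files(hits, INDEX_ORDER));
    }
}
```

With these four hits, the relevance order starts with record3.txt and ends with record10.txt, while the index order simply follows the document IDs.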
Example Application
Let us create a test Lucene application to test the sorting process.
Step | Description
1 | Create a project named LuceneFirstApplication under the package com.tutorialspoint.lucene, as explained in the Lucene - First Application chapter. You can also reuse the project created in that chapter as is to understand the sorting process.
2 | Create LuceneConstants.java and Searcher.java as listed below; this Searcher.java adds sort-aware search methods to the version from the Lucene - First Application chapter. Keep the rest of the files unchanged.
3 | Create LuceneTester.java as mentioned below.
4 | Clean and build the application to make sure the business logic works as per the requirements.
LuceneConstants.java
This class provides the constants used across the sample application.
package com.tutorialspoint.lucene;

public class LuceneConstants {
   public static final String CONTENTS = "contents";
   public static final String FILE_NAME = "filename";
   public static final String FILE_PATH = "filepath";
   public static final int MAX_SEARCH = 10;
}
Searcher.java
This class reads the indexes built from the raw data and searches them using the Lucene library.
package com.tutorialspoint.lucene;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Searcher {

   IndexSearcher indexSearcher;
   QueryParser queryParser;
   Query query;

   public Searcher(String indexDirectoryPath) throws IOException {
      Directory indexDirectory
         = FSDirectory.open(new File(indexDirectoryPath));
      indexSearcher = new IndexSearcher(indexDirectory);
      queryParser = new QueryParser(Version.LUCENE_36,
         LuceneConstants.CONTENTS,
         new StandardAnalyzer(Version.LUCENE_36));
   }

   public TopDocs search(String searchQuery)
      throws IOException, ParseException {
      query = queryParser.parse(searchQuery);
      return indexSearcher.search(query, LuceneConstants.MAX_SEARCH);
   }

   public TopDocs search(Query query)
      throws IOException, ParseException {
      return indexSearcher.search(query, LuceneConstants.MAX_SEARCH);
   }

   public TopDocs search(Query query,Sort sort)
      throws IOException, ParseException {
      return indexSearcher.search(query,
         LuceneConstants.MAX_SEARCH,sort);
   }

   public void setDefaultFieldSortScoring(boolean doTrackScores,
      boolean doMaxScores) {
      indexSearcher.setDefaultFieldSortScoring(
         doTrackScores,doMaxScores);
   }

   public Document getDocument(ScoreDoc scoreDoc)
      throws CorruptIndexException, IOException {
      return indexSearcher.doc(scoreDoc.doc);
   }

   public void close() throws IOException {
      indexSearcher.close();
   }
}
LuceneTester.java
This class is used to test the sorting capability of the Lucene library.
package com.tutorialspoint.lucene;

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TopDocs;

public class LuceneTester {

   String indexDir = "E:\\Lucene\\Index";
   String dataDir = "E:\\Lucene\\Data";
   Indexer indexer;
   Searcher searcher;

   public static void main(String[] args) {
      LuceneTester tester;
      try {
         tester = new LuceneTester();
         tester.sortUsingRelevance("cord3.txt");
         tester.sortUsingIndex("cord3.txt");
      } catch (IOException e) {
         e.printStackTrace();
      } catch (ParseException e) {
         e.printStackTrace();
      }
   }

   private void sortUsingRelevance(String searchQuery)
      throws IOException, ParseException {
      searcher = new Searcher(indexDir);
      long startTime = System.currentTimeMillis();

      //create a term to search file name
      Term term = new Term(LuceneConstants.FILE_NAME, searchQuery);
      //create the fuzzy query object
      Query query = new FuzzyQuery(term);
      searcher.setDefaultFieldSortScoring(true, false);
      //do the search, sorted by relevance
      TopDocs hits = searcher.search(query,Sort.RELEVANCE);
      long endTime = System.currentTimeMillis();

      System.out.println(hits.totalHits +
         " documents found. Time :" + (endTime - startTime) + "ms");
      for(ScoreDoc scoreDoc : hits.scoreDocs) {
         Document doc = searcher.getDocument(scoreDoc);
         System.out.print("Score: "+ scoreDoc.score + " ");
         System.out.println("File: "+ doc.get(LuceneConstants.FILE_PATH));
      }
      searcher.close();
   }

   private void sortUsingIndex(String searchQuery)
      throws IOException, ParseException {
      searcher = new Searcher(indexDir);
      long startTime = System.currentTimeMillis();

      //create a term to search file name
      Term term = new Term(LuceneConstants.FILE_NAME, searchQuery);
      //create the fuzzy query object
      Query query = new FuzzyQuery(term);
      searcher.setDefaultFieldSortScoring(true, false);
      //do the search, sorted by index order
      TopDocs hits = searcher.search(query,Sort.INDEXORDER);
      long endTime = System.currentTimeMillis();

      System.out.println(hits.totalHits +
         " documents found. Time :" + (endTime - startTime) + "ms");
      for(ScoreDoc scoreDoc : hits.scoreDocs) {
         Document doc = searcher.getDocument(scoreDoc);
         System.out.print("Score: "+ scoreDoc.score + " ");
         System.out.println("File: "+ doc.get(LuceneConstants.FILE_PATH));
      }
      searcher.close();
   }
}
Data & Index Directory Creation
We have used 10 text files, record1.txt to record10.txt, containing names and other details of students, and put them in the directory E:\Lucene\Data. The index directory should be created at E:\Lucene\Index. After running the indexing program from the chapter Lucene - Indexing Process, you can see the list of index files created in that folder.
Running the Program
Once you have created the source files, the raw data, the data directory, the index directory and the indexes, you can compile and run your program. To do this, keep the LuceneTester.java file tab active and use either the Run option available in the Eclipse IDE or Ctrl + F11 to compile and run your LuceneTester application. If the application runs successfully, it prints the following messages in the Eclipse IDE's console, first for the relevance sort and then for the index-order sort −
10 documents found. Time :31ms
Score: 1.3179655 File: E:\Lucene\Data\record3.txt
Score: 0.790779 File: E:\Lucene\Data\record1.txt
Score: 0.790779 File: E:\Lucene\Data\record2.txt
Score: 0.790779 File: E:\Lucene\Data\record4.txt
Score: 0.790779 File: E:\Lucene\Data\record5.txt
Score: 0.790779 File: E:\Lucene\Data\record6.txt
Score: 0.790779 File: E:\Lucene\Data\record7.txt
Score: 0.790779 File: E:\Lucene\Data\record8.txt
Score: 0.790779 File: E:\Lucene\Data\record9.txt
Score: 0.2635932 File: E:\Lucene\Data\record10.txt
10 documents found. Time :0ms
Score: 0.790779 File: E:\Lucene\Data\record1.txt
Score: 0.2635932 File: E:\Lucene\Data\record10.txt
Score: 0.790779 File: E:\Lucene\Data\record2.txt
Score: 1.3179655 File: E:\Lucene\Data\record3.txt
Score: 0.790779 File: E:\Lucene\Data\record4.txt
Score: 0.790779 File: E:\Lucene\Data\record5.txt
Score: 0.790779 File: E:\Lucene\Data\record6.txt
Score: 0.790779 File: E:\Lucene\Data\record7.txt
Score: 0.790779 File: E:\Lucene\Data\record8.txt
Score: 0.790779 File: E:\Lucene\Data\record9.txt