Tika Concise Tutorial
TIKA - Environment
This chapter takes you through the process of setting up Apache Tika on Windows and Linux. Administrative access is needed while installing Apache Tika.
System Requirements
Requirement | Value
JDK | Java SE 2 JDK 1.6 or above
Memory | 1 GB RAM (recommended)
Disk Space | No minimum requirement
Operating System Version | Windows XP or above, Linux
Step 1: Verifying Java Installation
To verify the Java installation, open the console and execute the following java command.
OS | Task | Command
Windows | Open command console | > java -version
Linux | Open command terminal | $ java -version
If Java has been installed properly on your system, you should get one of the following outputs, depending on the platform you are working on.
OS | Output
Windows | java version "1.7.0_60" Java(TM) SE Runtime Environment (build 1.7.0_60-b19) Java HotSpot(TM) 64-Bit Server VM (build 24.60-b09, mixed mode)
Linux | java version "1.7.0_25" OpenJDK Runtime Environment (rhel-2.3.10.4.el6_4-x86_64) OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)
- We assume that readers have Java 1.7.0_60 installed on their system before proceeding with this tutorial.
- If you do not have the Java SDK, download its current version from https://www.oracle.com/technetwork/java/javase/downloads/index.html and install it.
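The same check can also be made from inside a Java program. The following is a minimal sketch, not part of Tika: it reads the standard java.version system property and compares it with the JDK 1.6 minimum from the requirements table. The class name JavaVersionCheck and the helper majorVersion are hypothetical names chosen for illustration.

```java
public class JavaVersionCheck {

    /**
     * Extracts the major Java version from a version string.
     * Old-style strings such as "1.7.0_60" yield 7; new-style strings
     * such as "17.0.2" yield 17.
     */
    static int majorVersion(String version) {
        String[] parts = version.split("[._\\-+]");
        int first = Integer.parseInt(parts[0]);
        // Before Java 9 the scheme was "1.<major>"; later releases
        // put the major version first.
        if (first == 1 && parts.length > 1) {
            return Integer.parseInt(parts[1]);
        }
        return first;
    }

    public static void main(String[] args) {
        String version = System.getProperty("java.version");
        int major = majorVersion(version);
        System.out.println("Running on Java " + version + " (major " + major + ")");
        if (major < 6) {
            System.out.println("This is older than the JDK 1.6 minimum listed in the requirements.");
        }
    }
}
```

Compile it with javac JavaVersionCheck.java and run it with java JavaVersionCheck. If either command is not found, the JDK is probably not on the system path yet.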
Step 2: Setting Java Environment
Set the JAVA_HOME environment variable to point to the base directory where Java is installed on your machine. For example,
OS | Action
Windows | Set the environment variable JAVA_HOME to C:\Program Files\Java\jdk1.7.0_60
Linux | export JAVA_HOME=/usr/local/java-current
Append the full path of the Java compiler location to the system path.
OS | Action
Windows | Append the string ;C:\Program Files\Java\jdk1.7.0_60\bin to the end of the system variable PATH.
Linux | export PATH=$PATH:$JAVA_HOME/bin/
Verify the setup by running the command java -version from the command prompt, as explained above.
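To double-check the two variables from Java itself, a small program can read JAVA_HOME and PATH and report whether the JDK bin directory appears on the path. This is only a sketch under stated assumptions: the class JavaEnvCheck and the method pathContainsJdkBin are hypothetical helpers, and the comparison is purely textual (it does not resolve symbolic links).

```java
import java.io.File;
import java.util.regex.Pattern;

public class JavaEnvCheck {

    /**
     * Returns true if one entry of the PATH-style string equals the
     * "bin" directory under javaHome. The separator is ":" on Linux
     * and ";" on Windows.
     */
    static boolean pathContainsJdkBin(String path, String javaHome, String separator) {
        if (path == null || javaHome == null) {
            return false;
        }
        File expected = new File(javaHome, "bin");
        for (String entry : path.split(Pattern.quote(separator))) {
            if (entry.isEmpty()) {
                continue;
            }
            // File normalizes trailing separators, so "bin/" matches "bin".
            if (new File(entry).equals(expected)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String javaHome = System.getenv("JAVA_HOME");
        String path = System.getenv("PATH");
        System.out.println("JAVA_HOME = " + javaHome);
        System.out.println("JDK bin on PATH: "
                + pathContainsJdkBin(path, javaHome, File.pathSeparator));
    }
}
```

A null JAVA_HOME in the output means the variable is not visible to the current shell; on Linux this usually means the export line was not added to a startup file or the terminal was not reopened.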
Step 3: Setting up Apache Tika Environment
Programmers can integrate Apache Tika in their environment by using
- Command line,
- Tika API,
- Command line interface (CLI) of Tika,
- Graphical User Interface (GUI) of Tika, or
- the source code.
For any of these approaches, you first have to download the source code of Tika.
You will find the source code of Tika at https://Tika.apache.org/download.html, which offers two links:
- apache-tika-1.6-src.zip: contains the source code of Tika, and
- tika-app-1.6.jar: a jar file that contains the Tika application.
Download these two files. A snapshot of the official website of Tika is shown below.
After downloading the files, set the classpath for the jar file tika-app-1.6.jar. Add the complete path of the jar file as shown in the table below.
OS | Action
Windows | Append the string "C:\jars\tika-app-1.6.jar" to the user environment variable CLASSPATH
Linux | export CLASSPATH=$CLASSPATH:/usr/share/jars/tika-app-1.6.jar
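One way to confirm that the classpath entry took effect is to ask the JVM whether it can find a Tika class. The sketch below assumes the jar provides org.apache.tika.Tika, the facade class of tika-core that is bundled in tika-app. The class TikaClasspathCheck and the method isOnClasspath are hypothetical helpers, not part of Tika.

```java
public class TikaClasspathCheck {

    /**
     * Returns true if the named class can be located by this class's
     * class loader. The class is looked up without being initialized.
     */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, TikaClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: the Tika jar contains the facade class org.apache.tika.Tika.
        String tikaFacade = "org.apache.tika.Tika";
        if (isOnClasspath(tikaFacade)) {
            System.out.println("Tika found on the classpath.");
        } else {
            System.out.println("Tika not found. Check the CLASSPATH entry for the Tika jar.");
        }
    }
}
```

Because the lookup is by name, this program compiles and runs even when Tika is absent, which makes it usable as a diagnostic before any Tika code is written.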
Apache provides the Tika application, a Graphical User Interface (GUI) application, which you can set up using Eclipse.
Tika-Maven Build using Eclipse
- Open Eclipse and create a new project.
- If you do not have Maven in your Eclipse, set it up by following the given steps. Open the link https://wiki.eclipse.org/M2E_updatesite_and_gittags. There you will find the m2e plugin releases in a tabular format.
- Pick the latest version and copy the URL from the p2 url column.
- Return to Eclipse. In the menu bar, click Help and choose Install New Software from the dropdown menu.
- Click the Add button and type any name you like, as the name is optional. Then paste the saved URL in the Location field.
- A new plugin appears under the name you chose in the previous step. Check the checkbox in front of it and click Next.
- Proceed with the installation. Once it has completed, restart Eclipse.
- Right-click the project and, under the Configure option, select Convert to Maven Project.
- A wizard for creating a new POM appears. Enter the Group Id as org.apache.tika, enter the latest version of Tika, select the packaging as jar, and click Finish.
Maven is now set up, and your project has been converted into a Maven project. Next, you have to configure the pom.xml file.
Configure the XML File
Get the Tika Maven dependency from https://mvnrepository.com/artifact/org.apache.tika
Shown below are the complete Maven dependencies of Apache Tika.
<dependencies>
   <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-core</artifactId>
      <version>1.6</version>
   </dependency>
   <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-parsers</artifactId>
      <version>1.6</version>
   </dependency>
   <dependency>
      <!-- Assumption: this aggregator artifact is published as a POM, not a jar -->
      <groupId>org.apache.tika</groupId>
      <artifactId>tika</artifactId>
      <version>1.6</version>
      <type>pom</type>
   </dependency>
   <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-serialization</artifactId>
      <version>1.6</version>
   </dependency>
   <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-app</artifactId>
      <version>1.6</version>
   </dependency>
   <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-bundle</artifactId>
      <version>1.6</version>
   </dependency>
</dependencies>