Tika Concise Tutorial
TIKA - Overview
What is Apache Tika?
- Apache Tika is a library used for document type detection and content extraction from various file formats.
- Internally, Tika uses various existing document parsers and document type detection techniques to detect and extract data.
- Using Tika, one can develop a universal type detector and content extractor that extracts both structured text and metadata from different types of documents such as spreadsheets, text documents, images, PDFs and, to a certain extent, even multimedia formats.
- Tika provides a single generic API for parsing different file formats. It uses existing specialized parser libraries for each document type.
- All these parser libraries are encapsulated under a single interface called the Parser interface.
Why Tika?
According to filext.com, there are about 15k to 51k content types, and this number is growing day by day. Data is stored in various formats such as text documents, Excel spreadsheets, PDFs, images, and multimedia files, to name a few. Therefore, applications such as search engines and content management systems need additional support to extract data easily from these document types. Apache Tika serves this purpose by providing a generic API to locate and extract data from multiple file formats.
Apache Tika Applications
There are various applications that make use of Apache Tika. Here we will discuss a few prominent applications that depend heavily on Apache Tika.
Search Engines
Tika is widely used while developing search engines to index the text contents of digital documents.
- Search engines are information processing systems designed to search information and indexed documents from the Web.
- The crawler is an important component of a search engine. It crawls through the Web to fetch the documents that are to be indexed, and then transfers these documents to an extraction component.
- The duty of the extraction component is to extract the text and metadata from the document. Such extracted content and metadata are very useful for a search engine. This extraction component contains Tika.
- The extracted content is then passed to the indexer of the search engine, which uses it to build a search index. Apart from this, the search engine uses the extracted content in many other ways as well.
Document Analysis
- In the field of artificial intelligence, there are certain tools that analyze documents automatically at the semantic level and extract all kinds of data from them.
- In such applications, the documents are classified based on the prominent terms in the extracted content of the document.
- These tools make use of Tika for content extraction to analyze documents varying from plain text to digital documents.
Digital Asset Management
- Some organizations manage their digital assets such as photographs, ebooks, drawings, music and video using a special application known as digital asset management (DAM).
- Such applications rely on document type detectors and metadata extractors to classify the various documents.
Content Analysis
- Websites like Amazon recommend newly released content to individual users according to their interests. To do so, these websites use machine learning techniques, or take the help of social media websites like Facebook to extract required information such as the likes and interests of the users. This gathered information will be in the form of HTML tags or other formats that require further content type detection and extraction.
- For content analysis of a document, we have technologies that implement machine learning techniques, such as UIMA and Mahout. These technologies are useful in clustering and analyzing the data in the documents.
- Apache Mahout is a framework that provides ML algorithms on Apache Hadoop, a distributed computing platform. Mahout provides an architecture built around certain clustering and filtering techniques. By following this architecture, programmers can write their own ML algorithms to produce recommendations from various text and metadata combinations. To provide inputs to these algorithms, recent versions of Mahout use Tika to extract text and metadata from binary content.
- Apache UIMA analyzes and processes unstructured content and produces UIMA annotations. Internally it uses the Tika Annotator to extract document text and metadata.
History
Year | Development |
2006 | The idea of Tika was proposed to the Lucene Project Management Committee. |
2006 | The concept of Tika and its usefulness in the Jackrabbit project was discussed. |
2007 | Tika entered the Apache Incubator. |
2008 | Versions 0.1 and 0.2 were released and Tika graduated from the incubator to become a Lucene sub-project. |
2009 | Versions 0.3, 0.4, and 0.5 were released. |
2010 | Versions 0.6 and 0.7 were released and Tika graduated to a top-level Apache project. |
2011 | Tika 1.0 was released, and the book "Tika in Action" was published in the same year. |
TIKA - Architecture
Application-Level Architecture of Tika
Application programmers can easily integrate Tika in their applications. Tika provides a Command Line Interface and a GUI to make it user friendly.
In this chapter, we will discuss the four important modules that constitute the Tika architecture. The following illustration shows the architecture of Tika along with its four modules −
- Language detection mechanism.
- MIME detection mechanism.
- Parser interface.
- Tika Facade class.
Language Detection Mechanism
Whenever a text document is passed to Tika, it detects the language in which the document was written. It accepts documents without language annotation, detects the language, and adds that information to the metadata of the document.
To support language identification, Tika has a class called LanguageIdentifier in the package org.apache.tika.language, and a language identification repository that contains algorithms for detecting the language of a given text. Tika internally uses an N-gram algorithm for language detection.
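As a rough illustration of the N-gram idea, the sketch below builds character-trigram frequency profiles and picks the closest reference profile. It uses only the JDK. The class name NGramSketch, the tiny training sentences, and the cosine-similarity scoring are assumptions made for illustration; this is not Tika's actual implementation.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not Tika code: guesses a language from character trigrams.
class NGramSketch {

    // Counts overlapping three-character sequences in the lower-cased text.
    static Map<String, Integer> profile(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String s = text.toLowerCase();
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Cosine similarity between two trigram profiles (0 means nothing in common).
    static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    // Returns the key of the reference profile most similar to the input text.
    static String detect(String text, Map<String, Map<String, Integer>> references) {
        Map<String, Integer> target = profile(text);
        String best = "unknown";
        double bestScore = 0;
        for (Map.Entry<String, Map<String, Integer>> e : references.entrySet()) {
            double score = similarity(target, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    // Toy reference profiles; real profiles are trained on large corpora.
    static Map<String, Map<String, Integer>> sampleReferences() {
        Map<String, Map<String, Integer>> refs = new LinkedHashMap<>();
        refs.put("en", profile("the quick brown fox jumps over the lazy dog and then the cat"));
        refs.put("fr", profile("le renard brun saute par dessus le chien paresseux et le chat"));
        return refs;
    }

    public static void main(String[] args) {
        System.out.println(detect("the dog and the fox", sampleReferences()));
    }
}
```

With profiles this small the result is only meaningful for text that closely resembles the training sentences; the point is the shape of the technique, not its accuracy.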
MIME Detection Mechanism
Tika can detect the document type according to the MIME standards. Default MIME type detection in Tika is done using org.apache.tika.mime.MimeTypes. It uses the org.apache.tika.detect.Detector interface for most of the content type detection.
Internally, Tika uses several techniques such as file globs, content-type hints, magic bytes, and character encodings, among others.
Parser Interface
The Parser interface of org.apache.tika.parser is the key interface for parsing documents in Tika. This interface extracts the text and the metadata from a document and presents them in a uniform way, which also serves external developers who want to write parser plugins.
Using different concrete parser classes, each specific to an individual document type, Tika supports a lot of document formats. These format-specific classes provide support for different document formats, either by directly implementing the parser logic or by using external parser libraries.
Tika Facade Class
Using the Tika facade class is the simplest and most direct way of calling Tika from Java, and it follows the facade design pattern. You can find the Tika facade class in the org.apache.tika package of the Tika API.
By implementing basic use cases, the Tika facade acts as a single entry point to the library. It abstracts the underlying complexity of the Tika library, such as the MIME detection mechanism, the parser interface, and the language detection mechanism, and provides users with a simple interface.
Features of Tika
- Unified parser interface − Tika encapsulates all the third-party parser libraries within a single parser interface. Due to this feature, the user is spared the burden of selecting a suitable parser library and using it according to the file type encountered.
- Low memory usage − Tika consumes few memory resources, so it is easy to embed in Java applications. We can also use Tika within applications that run on platforms with limited resources, such as mobile PDAs.
- Fast processing − Quick content detection and extraction can be expected.
- Flexible metadata − Tika understands all the metadata models that are used to describe files.
- Parser integration − Tika can use the various parser libraries available for each document type in a single application.
- MIME type detection − Tika can detect and extract content from all the media types included in the MIME standards.
- Language detection − Tika includes a language identification feature, so it can be used to sort documents by language on multilingual websites.
Functionalities of Tika
Tika supports various functionalities −
- Document type detection
- Content extraction
- Metadata extraction
- Language detection
Document Type Detection
Tika uses various detection techniques to detect the type of the document given to it.
Content Extraction
Tika has a parser library that can parse the content of various document formats and extract it. After detecting the type of the document, it selects the appropriate parser from the parser repository and passes the document to it. Different classes of Tika have methods to parse different document formats.
Metadata Extraction
Along with the content, Tika extracts the metadata of the document with the same procedure as in content extraction. For some document types, Tika has classes to extract metadata.
Language Detection
Internally, Tika follows algorithms like n-gram to detect the language of the content in a given document. Tika depends on classes like LanguageIdentifier and Profiler for language identification.
TIKA - Environment
This chapter takes you through the process of setting up Apache Tika on Windows and Linux. Administrative privileges are needed while installing Apache Tika.
System Requirements
JDK | Java SE 2 JDK 1.6 or above |
Memory | 1 GB RAM (recommended) |
Disk Space | No minimum requirement |
Operating System Version | Windows XP or above, Linux |
Step 1: Verifying Java Installation
To verify Java installation, open the console and execute the following java command.
OS | Task | Command |
Windows | Open command console | >java -version |
Linux | Open command terminal | $java -version |
If Java has been installed properly on your system, then you should get one of the following outputs, depending on the platform you are working on.
OS | Output |
Windows | java version "1.7.0_60" Java(TM) SE Runtime Environment (build 1.7.0_60-b19) Java HotSpot(TM) 64-Bit Server VM (build 24.60-b09, mixed mode) |
Linux | java version "1.7.0_25" OpenJDK Runtime Environment (rhel-2.3.10.4.el6_4-x86_64) OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode) |
- We assume the readers of this tutorial have Java 1.7.0_60 installed on their system before proceeding with this tutorial.
- In case you do not have the Java SDK, download its current version from https://www.oracle.com/technetwork/java/javase/downloads/index.html and install it.
Step 2: Setting Java Environment
Set the JAVA_HOME environment variable to point to the base directory location where Java is installed on your machine. For example,
OS | Task |
Windows | Set the environment variable JAVA_HOME to C:\Program Files\Java\jdk1.7.0_60 |
Linux | export JAVA_HOME=/usr/local/java-current |
Append the full path of the Java compiler location to the System Path.
OS | Task |
Windows | Append the string ;C:\Program Files\Java\jdk1.7.0_60\bin to the end of the system variable PATH. |
Linux | export PATH=$PATH:$JAVA_HOME/bin/ |
Verify the command java -version from the command prompt as explained above.
Step 3: Setting up Apache Tika Environment
Programmers can integrate Apache Tika in their environment by using
- Command line,
- Tika API,
- Command line interface (CLI) of Tika,
- Graphical User interface (GUI) of Tika, or
- the source code.
For any of these approaches, first of all, you have to download the source code of Tika.
You will find the source code of Tika at https://Tika.apache.org/download.html, where you will find two links −
- apache-tika-1.6-src.zip − It contains the source code of Tika, and
- tika-app-1.6.jar − It is a jar file that contains the Tika application.
Download these two files. A snapshot of the official website of Tika is shown below.
After downloading the files, set the classpath for the jar file tika-app-1.6.jar. Add the complete path of the jar file as shown in the table below.
OS | Task |
Windows | Append the string "C:\jars\tika-app-1.6.jar" to the user environment variable CLASSPATH |
Linux | export CLASSPATH=$CLASSPATH:/usr/share/jars/tika-app-1.6.jar |
Apache provides the Tika application, a Graphical User Interface (GUI) application using Eclipse.
Tika-Maven Build using Eclipse
- Open Eclipse and create a new project.
- If you do not have Maven in your Eclipse, set it up by following the given steps. Open the link https://wiki.eclipse.org/M2E_updatesite_and_gittags. There you will find the m2e plugin releases in a tabular format.
- Pick the latest version and save the URL given in the p2 url column.
- Now return to Eclipse. In the menu bar, click Help, and choose Install New Software from the dropdown menu.
- Click the Add button and type any desired name, as it is optional. Now paste the saved URL in the Location field.
- A new plugin will be added with the name you have chosen in the previous step. Check the checkbox in front of it, and click Next.
- Proceed with the installation. Once completed, restart Eclipse.
- Now right-click on the project, and under the Configure option, select Convert to Maven Project.
- A new wizard for creating a new POM appears. Enter the Group Id as org.apache.tika, enter the latest version of Tika, select the packaging as jar, and click Finish.
Maven is now set up, and your project is converted into a Maven project. Now you have to configure the pom.xml file.
Configure the XML File
Get the Tika Maven dependencies from https://mvnrepository.com/artifact/org.apache.tika
Shown below are the complete Maven dependencies of Apache Tika.
<dependencies>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>1.6</version>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>1.6</version>
    </dependency>
    <dependency>
        <!-- Aggregate project artifact; type pom is an assumption, since it has no jar -->
        <groupId>org.apache.tika</groupId>
        <artifactId>tika</artifactId>
        <version>1.6</version>
        <type>pom</type>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-serialization</artifactId>
        <version>1.6</version>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-app</artifactId>
        <version>1.6</version>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-bundle</artifactId>
        <version>1.6</version>
    </dependency>
</dependencies>
TIKA - Referenced API
Users can embed Tika in their applications using the Tika facade class. It has methods to explore all the functionalities of Tika. Since it is a facade class, Tika abstracts the complexity behind its functions. In addition to this, users can also use the various classes of Tika in their applications.
Tika Class (facade)
This is the most prominent class of the Tika library and follows the facade design pattern. Therefore, it abstracts all the internal implementations and provides simple methods to access the Tika functionalities. The following table lists the constructors of this class along with their descriptions.
package − org.apache.tika
class − Tika
Sr.No. | Constructor & Description |
1 | Tika () Uses default configuration and constructs the Tika class. |
2 | Tika (Detector detector) Creates a Tika facade by accepting the detector instance as parameter. |
3 | Tika (Detector detector, Parser parser) Creates a Tika facade by accepting the detector and parser instances as parameters. |
4 | Tika (Detector detector, Parser parser, Translator translator) Creates a Tika facade by accepting the detector, the parser, and the translator instance as parameters. |
5 | Tika (TikaConfig config) Creates a Tika facade by accepting the object of the TikaConfig class as parameter. |
Methods and Description
The following are the important methods of the Tika facade class −
Sr.No. | Methods & Description |
1 | String parseToString (File file) This method and all its variants parse the file passed as parameter and return the extracted text content as a String. By default, the length of the returned string is limited. |
2 | int getMaxStringLength () Returns the maximum length of strings returned by the parseToString methods. |
3 | void setMaxStringLength (int maxStringLength) Sets the maximum length of strings returned by the parseToString methods. |
4 | Reader parse (File file) This method and all its variants parse the file passed as parameter and return the extracted text content in the form of a java.io.Reader object. |
5 | String detect (InputStream stream, Metadata metadata) This method and all its variants accept an InputStream object and a Metadata object as parameters, detect the type of the given document, and return the document type name as a String object. This method abstracts the detection mechanisms used by Tika. |
6 | String translate (InputStream text, String targetLanguage) This method and all its variants accept an InputStream object and a String naming the language into which the text should be translated, and translate the given text to that language, attempting to auto-detect the source language. |
Parser Interface
This is the interface that is implemented by all the parser classes of the Tika package.
package − org.apache.tika.parser
Interface − Parser
Methods and Description
The following is the important method of the Tika Parser interface −
Sr.No. | Methods & Description |
1 | parse (InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) This method parses the given document into a sequence of XHTML SAX events. After parsing, it places the extracted document content in the ContentHandler object and the metadata in the Metadata object. |
Metadata Class
This class implements various interfaces such as CreativeCommons, Geographic, HttpHeaders, Message, MSOffice, ClimateForcast, TIFF, TikaMetadataKeys, TikaMimeKeys, and Serializable to support various data models. The following tables list the constructors and methods of this class along with their descriptions.
package − org.apache.tika.metadata
class − Metadata
Sr.No. | Constructor & Description |
1 | Metadata() Constructs a new, empty metadata. |
Sr.No. | Methods & Description |
1 | add (Property property, String value) Adds a metadata property/value mapping to a given document. Using this method, we can set the value of a property. |
2 | add (String name, String value) Adds a metadata name/value mapping to a given document. Using this method, we can add a new name and value to the existing metadata of a document. |
3 | String get (Property property) Returns the value (if any) of the metadata property given. |
4 | String get (String name) Returns the value (if any) of the metadata name given. |
5 | Date getDate (Property property) Returns the value of a Date metadata property. |
6 | String[] getValues (Property property) Returns all the values of a metadata property. |
7 | String[] getValues (String name) Returns all the values of a given metadata name. |
8 | String[] names() Returns all the names of metadata elements in a metadata object. |
9 | set (Property property, Date date) Sets the date value of the given metadata property. |
10 | set (Property property, String[] values) Sets multiple values to a metadata property. |
Language Identifier Class
This class identifies the language of the given content. The following tables list the constructors and methods of this class along with their descriptions.
package − org.apache.tika.language
class − LanguageIdentifier
Sr.No. | Constructor & Description |
1 | LanguageIdentifier (LanguageProfile profile) Instantiates the language identifier. Here you have to pass a LanguageProfile object as parameter. |
2 | LanguageIdentifier (String content) Instantiates a language identifier from a String of text content. |
Sr.No. | Methods & Description |
1 | String getLanguage () Returns the language identified by the current LanguageIdentifier object. |
TIKA - File Formats
File Formats Supported by Tika
The following table shows the file formats Tika supports.
File format | Package Library | Class in Tika |
XML | org.apache.tika.parser.xml | XMLParser |
HTML | org.apache.tika.parser.html (uses the TagSoup library) | HtmlParser |
MS-Office compound document (OLE2 until 2007, OOXML from 2007 onwards) | org.apache.tika.parser.microsoft and org.apache.tika.parser.microsoft.ooxml (use the Apache POI library) | OfficeParser (OLE2), OOXMLParser (OOXML) |
OpenDocument Format (OpenOffice) | org.apache.tika.parser.odf | OpenDocumentParser |
Portable Document Format (PDF) | org.apache.tika.parser.pdf (uses the Apache PDFBox library) | PDFParser |
Electronic Publication Format (digital books) | org.apache.tika.parser.epub | EpubParser |
Rich Text Format | org.apache.tika.parser.rtf | RTFParser |
Compression and packaging formats | org.apache.tika.parser.pkg (uses the Apache Commons Compress library) | PackageParser, CompressorParser and its sub-classes |
Text format | org.apache.tika.parser.txt | TXTParser |
Feed and syndication formats | org.apache.tika.parser.feed | FeedParser |
Audio formats | org.apache.tika.parser.audio and org.apache.tika.parser.mp3 | AudioParser, MidiParser, Mp3Parser (for MP3) |
Image formats | org.apache.tika.parser.jpeg | JpegParser (for JPEG images) |
Video formats | org.apache.tika.parser.mp4 and org.apache.tika.parser.video (the latter internally uses a simple algorithm to parse Flash video formats) | MP4Parser, FLVParser |
Java class files and jar files | org.apache.tika.parser.asm | ClassParser, CompressorParser |
Mbox format (email messages) | org.apache.tika.parser.mbox | MboxParser |
CAD formats | org.apache.tika.parser.dwg | DWGParser |
Font formats | org.apache.tika.parser.font | TrueTypeParser |
Executable programs and libraries | org.apache.tika.parser.executable | ExecutableParser |
TIKA - Document Type Detection
MIME Standards
Multipurpose Internet Mail Extensions (MIME) standards are the best available standards for identifying document types. Knowledge of these standards helps the browser decide how to handle the content it receives.
Whenever the browser encounters a media file, it chooses compatible software available to it to display the contents. In case it does not have any suitable application to run a particular media file, it recommends that the user get a suitable plugin for it.
Type Detection in Tika
Tika supports all the Internet media document types provided in MIME. Whenever a file is passed through Tika, it detects the document type of the file. To detect media types, Tika internally uses the following mechanisms.
File Extensions
Checking the file extension is the simplest and most widely used method to detect the format of a file. Many applications and operating systems provide support for these extensions. Shown below are the extensions of a few known file types.
File type | Extension |
image | .jpg |
audio | .mp3 |
java archive file | .jar |
java class file | .class |
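A minimal sketch of extension-based detection is given below, using only the JDK. The class name ExtensionSketch and the small lookup table are assumptions made for illustration; Tika's own detector works from a much larger registry of glob patterns.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch, not Tika code: maps a file name extension to a media type.
class ExtensionSketch {

    // Illustrative table covering only the four extensions listed above.
    private static final Map<String, String> TYPES = new HashMap<>();
    static {
        TYPES.put("jpg", "image/jpeg");
        TYPES.put("mp3", "audio/mpeg");
        TYPES.put("jar", "application/java-archive");
        TYPES.put("class", "application/java-vm");
    }

    // Falls back to the generic binary type when the extension is missing or unknown.
    static String detect(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "application/octet-stream";
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return TYPES.getOrDefault(ext, "application/octet-stream");
    }

    public static void main(String[] args) {
        System.out.println(detect("example.mp3"));
    }
}
```

Because a file can be renamed freely, an extension is only a hint, which is why it is combined with the other mechanisms described in this chapter.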
Content-type Hints
Whenever you retrieve a file from a database or attach it to another document, you may lose the file's name or extension. In such cases, the metadata supplied with the file is used to detect the file type.
Magic Byte
Observing the raw bytes of a file, you can find some unique character patterns for each file. Some files have special byte prefixes called magic bytes that are specially made and included in a file for the purpose of identifying the file type.
For example, you can find CA FE BA BE (hexadecimal format) in a Java class file and %PDF (ASCII format) in a PDF file. Tika uses this information to identify the media type of a file.
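The prefix check can be sketched in a few lines of plain Java. The class name MagicByteSketch and the returned media type strings are assumptions for illustration; only the two signatures mentioned in the text are covered.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not Tika code: checks the leading bytes of a file.
class MagicByteSketch {

    private static final byte[] CLASS_MAGIC = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE};
    private static final byte[] PDF_MAGIC = "%PDF".getBytes(StandardCharsets.US_ASCII);

    // True when data begins with every byte of prefix.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Returns a media type for a known prefix, or the generic binary type.
    static String detect(byte[] header) {
        if (startsWith(header, CLASS_MAGIC)) {
            return "application/java-vm";
        }
        if (startsWith(header, PDF_MAGIC)) {
            return "application/pdf";
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] pdfHeader = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdfHeader));
    }
}
```

In practice only the first few bytes of a stream need to be read for this check, so it stays cheap even for large files.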
Character Encodings
Files with plain text are encoded using different types of character encoding. The main challenge here is to identify the type of character encoding used in the files. Tika follows character encoding techniques like BOM (byte order mark) markers and byte frequencies to identify the encoding system used by the plain text content.
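The byte order mark check is the simpler of the two techniques and can be sketched as follows. The class name BomSketch is an assumption for illustration, and the sketch deliberately ignores UTF-32 marks and the byte-frequency analysis needed when no mark is present.

```java
// Hypothetical sketch, not Tika code: recognizes a few byte order marks.
class BomSketch {

    // Returns the charset name implied by a leading BOM, or null when none is found.
    static String detect(byte[] b) {
        if (b.length >= 3
                && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] utf8WithBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(detect(utf8WithBom));
    }
}
```

Many plain text files carry no mark at all, so a null result here means the caller has to fall back to statistical methods such as byte frequencies.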
Type Detection using Facade Class
The detect() method of the facade class is used to detect the document type. This method accepts a file as input. Shown below is an example program for document type detection with the Tika facade class.
import java.io.File;
import org.apache.tika.Tika;

public class TypeDetection {
    public static void main(String[] args) throws Exception {
        // Assume example.mp3 is in your current directory
        File file = new File("example.mp3");

        // Instantiating the Tika facade class
        Tika tika = new Tika();

        // Detecting the file type using the detect() method
        String filetype = tika.detect(file);
        System.out.println(filetype);
    }
}
Save the above code as TypeDetection.java and run it from the command prompt using the following commands −
javac TypeDetection.java
java TypeDetection
It gives you the following output −
audio/mpeg
TIKA - Content Extraction
Tika uses various parser libraries to extract content from given documents. It chooses the right parser for the given document type.
For parsing documents, the parseToString() method of the Tika facade class is generally used. Shown below are the steps involved in the parsing process, which are abstracted by the parseToString() method.
Abstracting the parsing process −
- Initially, when we pass a document to Tika, it uses a suitable type detection mechanism to detect the document type.
- Once the document type is known, it chooses a suitable parser from its parser repository. The parser repository contains classes that make use of external libraries.
- Then the document is passed to the chosen parser, which parses the content, extracts the text, and throws exceptions for unreadable formats.
Content Extraction using Tika
Given below is the program for extracting text from a file using the Tika facade class −
import java.io.File;
import java.io.IOException;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class TikaExtraction {
    public static void main(final String[] args) throws IOException, TikaException {
        // Assume sample.txt is in your current directory
        File file = new File("sample.txt");

        // Instantiating the Tika facade class
        Tika tika = new Tika();

        // Parsing the file and returning its text content as a String
        String filecontent = tika.parseToString(file);
        System.out.println("Extracted Content: " + filecontent);
    }
}
Save the above code as TikaExtraction.java and run it from the command prompt −
javac TikaExtraction.java
java TikaExtraction
Given below is the content of sample.txt.
Hi students welcome to tutorialspoint
It gives you the following output −
Extracted Content: Hi students welcome to tutorialspoint
Content Extraction using Parser Interface
The parser package of Tika provides several interfaces and classes with which we can parse a text document. Given below is the block diagram of the org.apache.tika.parser package.
There are several parser classes available, e.g., PDFParser, Mp3Parser, OfficeParser, etc., to parse the respective documents individually. All these classes implement the Parser interface.
CompositeParser
The given diagram shows Tika's general-purpose parser classes: CompositeParser and AutoDetectParser. Since the CompositeParser class follows the composite design pattern, you can use a group of parser instances as a single parser. The CompositeParser class also allows access to all the classes that implement the Parser interface.
AutoDetectParser
This is a subclass of CompositeParser and it provides automatic type detection. Using this functionality, the AutoDetectParser automatically sends incoming documents to the appropriate parser classes using the composite methodology.
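The relationship between a composite parser and an auto-detecting subclass can be sketched without Tika. Everything in the block below (SimpleParser, Composite, AutoDetect, the crude "<?xml" check, and the tag-stripping XML handler) is a hypothetical stand-in chosen for brevity; it shows the shape of the pattern, not how Tika's classes are written.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not Tika code: one parser that delegates to many.
class CompositeSketch {

    // Simplified stand-in for a parser: turns raw bytes into text.
    interface SimpleParser {
        String parse(String mediaType, byte[] content);
    }

    // Composite: implements the same interface and forwards to a registered child.
    static class Composite implements SimpleParser {
        private final Map<String, SimpleParser> children = new HashMap<>();

        void register(String mediaType, SimpleParser child) {
            children.put(mediaType, child);
        }

        @Override
        public String parse(String mediaType, byte[] content) {
            SimpleParser child = children.get(mediaType);
            if (child == null) {
                throw new IllegalArgumentException("No parser registered for " + mediaType);
            }
            return child.parse(mediaType, content);
        }
    }

    // Adds type detection on top of the composite, so callers pass only the bytes.
    static class AutoDetect extends Composite {
        String parse(byte[] content) {
            String text = new String(content, StandardCharsets.UTF_8);
            String mediaType = text.startsWith("<?xml") ? "application/xml" : "text/plain";
            return parse(mediaType, content);
        }
    }

    // Wires up two toy child parsers: plain text and tag-stripped XML.
    static AutoDetect defaultParser() {
        AutoDetect parser = new AutoDetect();
        parser.register("text/plain", (type, bytes) -> new String(bytes, StandardCharsets.UTF_8));
        parser.register("application/xml",
                (type, bytes) -> new String(bytes, StandardCharsets.UTF_8).replaceAll("<[^>]*>", "").trim());
        return parser;
    }

    public static void main(String[] args) {
        byte[] xml = "<?xml version=\"1.0\"?><note>hello</note>".getBytes(StandardCharsets.UTF_8);
        System.out.println(defaultParser().parse(xml));
    }
}
```

The caller only ever sees one parser object; adding support for a new format means registering one more child, without touching calling code.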
parse() method
Along with parseToString(), you can also use the parse() method of the Parser interface. The prototype of this method is shown below.
void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException
The following table lists the four objects it accepts as parameters.
Sr.No. | Object & Description
1 | InputStream stream: any InputStream object that contains the content of the file.
2 | ContentHandler handler: Tika passes the document as XHTML content to this handler; the document is then processed using the SAX API, which allows efficient post-processing of the contents.
3 | Metadata metadata: the Metadata object is used both as a source and as a target of document metadata.
4 | ParseContext context: this object is used when the client application wants to customize the parsing process.
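To make the role of the ContentHandler more concrete, here is a minimal sketch that uses only the SAX classes shipped with the JDK. It feeds a hand-written XHTML string (an assumption standing in for what a parser would emit) to a handler that keeps only the text inside the body element. BodyTextCollector is a hypothetical class written for this illustration; it is not Tika's BodyContentHandler.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Collects only the character data found between <body> and </body>.
class BodyTextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private boolean inBody = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("body".equals(qName)) {
            inBody = true;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("body".equals(qName)) {
            inBody = false;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inBody) {
            text.append(ch, start, length);
        }
    }

    @Override
    public String toString() {
        return text.toString().trim();
    }

    // Runs the JDK SAX parser over an XHTML string and returns the body text.
    static String extract(String xhtml) {
        BodyTextCollector handler = new BodyTextCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the XHTML input", e);
        }
        return handler.toString();
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>sample</title></head>"
                + "<body><p>Hi students welcome to tutorialspoint</p></body></html>";
        System.out.println("File content : " + extract(xhtml));
    }
}
```

The handler never sees the original file format; it only receives SAX events, which is why the same handler can be reused for any document type.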
Example
Given below is an example that shows how the parse() method is used.
Step 1 −
To use the parse() method of the Parser interface, instantiate any of the classes that implement this interface.
There are individual parser classes such as PDFParser, OfficeParser, XMLParser, etc. You can use any of these individual document parsers. Alternatively, you can use either CompositeParser or AutoDetectParser, which use the individual parser classes internally and extract the contents of a document with a suitable parser.
Parser parser = new AutoDetectParser();
// or
Parser parser = new CompositeParser();
// or an instance of any individual parser from the Tika library, e.g. new PDFParser()
Step 2 −
Create a handler class object. Given below are three of the content handlers:
Sr.No. | Class & Description
1 | BodyContentHandler: this class picks the body part of the XHTML output and writes that content to an output writer or output stream. It then redirects the XHTML content to another content handler instance.
2 | LinkContentHandler: this class detects and collects all the links (href attributes of anchor tags) in the XHTML document and forwards them for use by tools such as web crawlers.
3 | TeeContentHandler: this class forwards the content to several handlers at once, which helps in using multiple tools simultaneously.
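The ideas behind the link handler and the tee handler can be sketched with the JDK's SAX classes alone. LinkCollector, ParagraphCounter and TeeHandler below are hypothetical classes written for this illustration (they are not the Tika classes from the table), and the XHTML input is a hand-written assumption.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Collects the href attribute of every <a> element.
class LinkCollector extends DefaultHandler {
    final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String href = atts.getValue("href");
        if ("a".equals(qName) && href != null) {
            links.add(href);
        }
    }
}

// A second, unrelated "tool": counts <p> elements.
class ParagraphCounter extends DefaultHandler {
    int count = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("p".equals(qName)) {
            count++;
        }
    }
}

// Forwards every start-element event to all registered handlers.
class TeeHandler extends DefaultHandler {
    private final List<DefaultHandler> targets;

    TeeHandler(DefaultHandler... targets) {
        this.targets = Arrays.asList(targets);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        for (DefaultHandler target : targets) {
            target.startElement(uri, localName, qName, atts);
        }
    }

    // Parses an XHTML string once and sends the events to the given handler.
    static void run(String xhtml, DefaultHandler handler) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the XHTML input", e);
        }
    }

    public static void main(String[] args) {
        LinkCollector links = new LinkCollector();
        ParagraphCounter paragraphs = new ParagraphCounter();
        String xhtml = "<html><body><p><a href=\"https://tika.apache.org\">Tika</a></p>"
                + "<p>no link here</p></body></html>";
        run(xhtml, new TeeHandler(links, paragraphs));
        System.out.println("links: " + links.links);
        System.out.println("paragraphs: " + paragraphs.count);
    }
}
```

The document is parsed only once, yet both tools receive the events; that is the benefit of a tee-style handler.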
Since our target is to extract the text contents from a document, instantiate BodyContentHandler as shown below:
BodyContentHandler handler = new BodyContentHandler();
Step 3 −
Create the Metadata object as shown below −
Metadata metadata = new Metadata();
Step 4 −
Create any of the input stream objects, and pass it the file whose content should be extracted.
FileInputStream
Instantiate a File object by passing the file path as a parameter, and pass this object to the FileInputStream class constructor.
Note − The path passed to the file object should not contain spaces.
The problem with plain input stream classes such as this one is that they don’t support random access reads, which are required to process some file formats efficiently. To resolve this problem, Tika provides TikaInputStream.
File file = new File(filepath);
FileInputStream inputstream = new FileInputStream(file);
// or
InputStream stream = TikaInputStream.get(new File(filename));
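A related limitation can be observed with the JDK alone: a bare FileInputStream cannot even be rewound with mark() and reset(), so a reader that peeks at the first bytes of a file cannot hand the untouched stream on to the next reader. The sketch below is only an analogy for the random access problem described above, not a description of how TikaInputStream works; RewindDemo is a hypothetical class, and readNBytes(int) assumes Java 11 or later.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class RewindDemo {
    // Writes data to a temporary file, peeks at the first n bytes,
    // rewinds, then reads the whole stream. Returns "head|full".
    static String peekThenReadAll(byte[] data, int n) {
        try {
            Path tmp = Files.createTempFile("sample", ".txt");
            try {
                Files.write(tmp, data);
                try (InputStream raw = new FileInputStream(tmp.toFile());
                     InputStream in = new BufferedInputStream(raw)) {
                    in.mark(n);                       // remember the current position
                    String head = new String(in.readNBytes(n), StandardCharsets.UTF_8);
                    in.reset();                       // go back to the remembered position
                    String full = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                    return head + "|" + full;
                }
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("sample", ".txt");
        try (InputStream raw = new FileInputStream(tmp.toFile());
             InputStream buffered = new BufferedInputStream(new FileInputStream(tmp.toFile()))) {
            System.out.println("FileInputStream supports mark/reset: " + raw.markSupported());
            System.out.println("BufferedInputStream supports mark/reset: " + buffered.markSupported());
        } finally {
            Files.deleteIfExists(tmp);
        }
        byte[] data = "Hi students welcome to tutorialspoint".getBytes(StandardCharsets.UTF_8);
        System.out.println(peekThenReadAll(data, 2));
    }
}
```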
Step 5 −
Create a parse context object as shown below −
ParseContext context = new ParseContext();
Step 6 −
Invoke the parse() method on the parser object created in Step 1 and pass all the required objects, as shown below:
parser.parse(inputstream, handler, metadata, context);
Given below is the program for content extraction using the parser interface −
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;
public class ParserExtraction {
public static void main(final String[] args) throws IOException,SAXException, TikaException {
//Assume sample.txt is in your current directory
File file = new File("sample.txt");
//parse method parameters
Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(file);
ParseContext context = new ParseContext();
//parsing the file
parser.parse(inputstream, handler, metadata, context);
System.out.println("File content : " + handler.toString());
}
}
Save the above code as ParserExtraction.java and run it from the command prompt −
javac ParserExtraction.java
java ParserExtraction
Given below is the content of sample.txt
Hi students welcome to tutorialspoint
If you execute the above program, it will give you the following output −
File content : Hi students welcome to tutorialspoint
TIKA - Metadata Extraction
Besides content, Tika also extracts the metadata from a file. Metadata is the additional information supplied with a file. For an audio file, for example, the artist name, album name, and title are metadata.
XMP Standards
The Extensible Metadata Platform (XMP) is a standard for processing and storing information related to the content of a file. It was created by Adobe Systems Inc. XMP provides standards for defining, creating, and processing metadata. XMP metadata can be embedded into several file formats such as PDF, JPEG, GIF, and HTML.
Property Class
Tika uses the Property class to follow the XMP property definition. It provides the PropertyType and ValueType enums to capture the name and value of a metadata property.
Metadata Class
This class implements various interfaces such as ClimateForcast, CreativeCommons, Geographic, TIFF, etc., to provide support for various metadata models. In addition, this class provides various methods to read and modify the metadata of a file.
Metadata Names
We can extract the list of all metadata names of a file from its metadata object using the names() method. It returns all the names as a string array. Given a metadata name, the get() method returns the value associated with it.
String[] metadataNames = metadata.names();
String value = metadata.get(name);
Extracting Metadata using Parse Method
Whenever we parse a file using parse(), we pass an empty metadata object as one of the parameters. This method extracts the metadata of the given file (if the file contains any) and places it in the metadata object. Therefore, after parsing the file using parse(), we can read the metadata from that object.
Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata(); //empty metadata object
FileInputStream inputstream = new FileInputStream(file);
ParseContext context = new ParseContext();
parser.parse(inputstream, handler, metadata, context);
// now this metadata object contains the extracted metadata of the given file.
String[] metadataNames = metadata.names();
Given below is the complete program to extract metadata from an image file.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;
public class GetMetadata {
public static void main(final String[] args) throws IOException, SAXException, TikaException {
//Assume that boy.jpg is in your current directory
File file = new File("boy.jpg");
//Parser method parameters
Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(file);
ParseContext context = new ParseContext();
parser.parse(inputstream, handler, metadata, context);
System.out.println(handler.toString());
//getting the list of all meta data elements
String[] metadataNames = metadata.names();
for(String name : metadataNames) {
System.out.println(name + ": " + metadata.get(name));
}
}
}
Save the above code as GetMetadata.java and run it from the command prompt using the following commands:
javac GetMetadata.java
java GetMetadata
Given below is the snapshot of boy.jpg.
If you execute the above program, it will give you the following output −
X-Parsed-By: org.apache.tika.parser.DefaultParser
Resolution Units: inch
Compression Type: Baseline
Data Precision: 8 bits
Number of Components: 3
tiff:ImageLength: 3000
Component 2: Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert
Component 1: Y component: Quantization table 0, Sampling factors 2 horiz/2 vert
Image Height: 3000 pixels
X Resolution: 300 dots
Original Transmission Reference: 53616c7465645f5f2368da84ca932841b336ac1a49edb1a93fae938b8db2cb3ec9cc4dc28d7383f1
Image Width: 4000 pixels
IPTC-NAA record: 92 bytes binary data
Component 3: Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert
tiff:BitsPerSample: 8
Application Record Version: 4
tiff:ImageWidth: 4000
Content-Type: image/jpeg
Y Resolution: 300 dots
We can also retrieve only the metadata values we need by passing their names to get().
Adding New Metadata Values
We can add new metadata values using the add() method of the Metadata class. Given below is the syntax of this method. Here we are adding the author name.
metadata.add("author", "Tutorials point");
The Metadata class has predefined properties, including the properties inherited from interfaces like ClimateForcast, CreativeCommons, Geographic, etc., to support various data models. Shown below is the usage of the SOFTWARE property inherited from the TIFF interface, which Tika implements to follow XMP metadata standards for TIFF image formats.
metadata.add(Metadata.SOFTWARE,"ms paint");
Given below is the complete program that demonstrates how to add metadata values after parsing a given file. Note that this changes only the in-memory Metadata object; the file on disk is not modified. The list of metadata elements is displayed in the output so that you can observe the change in the list after adding new values.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Arrays;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;
public class AddMetadata {
public static void main(final String[] args) throws IOException, SAXException, TikaException {
//create a file object and assume Example.txt is in your current directory
File file = new File("Example.txt");
//Parser method parameters
Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(file);
ParseContext context = new ParseContext();
//parsing the document
parser.parse(inputstream, handler, metadata, context);
//list of meta data elements before adding new elements
System.out.println( " metadata elements :" +Arrays.toString(metadata.names()));
//adding new meta data name value pair
metadata.add("Author","Tutorials Point");
System.out.println(" metadata name value pair is successfully added");
//printing all the meta data elements after adding new elements
System.out.println("Here is the list of all the metadata elements after adding new elements");
System.out.println( Arrays.toString(metadata.names()));
}
}
Save the above code as AddMetadata.java and run it from the command prompt:
javac AddMetadata.java
java AddMetadata
Given below is the content of Example.txt
Hi students welcome to tutorialspoint
If you execute the above program, it will give you the following output −
 metadata elements :[Content-Encoding, Content-Type]
 metadata name value pair is successfully added
Here is the list of all the metadata elements after adding new elements
[Content-Encoding, Author, Content-Type]
Setting Values to Existing Metadata Elements
You can set values for existing metadata elements using the set() method. The syntax for setting the date property using the set() method is as follows:
metadata.set(Metadata.DATE, new Date());
You can also store several names in one property with set(). Note that set() replaces any existing value, so the names below are stored as a single comma-separated string. The syntax for setting the Author property this way is as follows:
metadata.set(Metadata.AUTHOR, "ram ,raheem ,robin ");
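The difference between add() and set() is easy to miss, so here is a minimal plain-Java sketch of a multi-valued name/value store. MiniMetadata is a hypothetical class written for this illustration, not Tika's Metadata class; it assumes the semantics that set() replaces the stored values, add() appends one more value, and get() returns the first value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical multi-valued metadata store (illustration only).
class MiniMetadata {
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // set(): replaces whatever was stored under the name with one value.
    void set(String name, String value) {
        List<String> list = new ArrayList<>();
        list.add(value);
        values.put(name, list);
    }

    // add(): appends another value under the same name.
    void add(String name, String value) {
        values.computeIfAbsent(name, key -> new ArrayList<>()).add(value);
    }

    // get(): returns the first value, or null if the name is unknown.
    String get(String name) {
        List<String> list = values.get(name);
        return list == null ? null : list.get(0);
    }

    List<String> getValues(String name) {
        return values.getOrDefault(name, List.of());
    }

    String[] names() {
        return values.keySet().toArray(new String[0]);
    }

    public static void main(String[] args) {
        MiniMetadata metadata = new MiniMetadata();
        metadata.set("Author", "ram ,raheem ,robin ");
        System.out.println("after set: " + metadata.getValues("Author").size() + " value(s)");

        metadata.set("Author", "ram");
        metadata.add("Author", "raheem");
        metadata.add("Author", "robin");
        System.out.println("after add: " + metadata.getValues("Author").size() + " value(s)");
        System.out.println("first value: " + metadata.get("Author"));
    }
}
```

With set(), a comma-separated string is still one value; separate values require one add() call each.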
Given below is the complete program demonstrating the set() method.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Date;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;
public class SetMetadata {
public static void main(final String[] args) throws IOException,SAXException, TikaException {
//Create a file object and assume example.txt is in your current directory
File file = new File("example.txt");
//parameters of parse() method
Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(file);
ParseContext context = new ParseContext();
//Parsing the given file
parser.parse(inputstream, handler, metadata, context);
//list of metadata elements before setting new values
System.out.println( " metadata elements and values of the given file :");
String[] metadataNamesb4 = metadata.names();
for(String name : metadataNamesb4) {
System.out.println(name + ": " + metadata.get(name));
}
//setting date meta data
metadata.set(Metadata.DATE, new Date());
//setting multiple values to author property
metadata.set(Metadata.AUTHOR, "ram ,raheem ,robin ");
//printing all the meta data elements with new elements
System.out.println("List of all the metadata elements after adding new elements ");
String[] metadataNamesafter = metadata.names();
for(String name : metadataNamesafter) {
System.out.println(name + ": " + metadata.get(name));
}
}
}
Save the above code as SetMetadata.java and run it from the command prompt −
javac SetMetadata.java
java SetMetadata
Given below is the content of example.txt.
Hi students welcome to tutorialspoint
If you execute the above program, it will give you the following output. In the output, you can observe the newly set metadata elements.
metadata elements and values of the given file :
Content-Encoding: ISO-8859-1
Content-Type: text/plain; charset = ISO-8859-1
List of all the metadata elements after adding new elements 
date: 2014-09-24T07:01:32Z
Content-Encoding: ISO-8859-1
Author: ram ,raheem ,robin 
Content-Type: text/plain; charset = ISO-8859-1
TIKA - Language Detection
Need for Language Detection
To classify documents by the language in which they are written, for example on a multilingual website, a language detection tool is needed. This tool should accept documents without language annotation (metadata), detect the language, and add that information to the metadata of the document.
Algorithms for Profiling Corpus
What is Corpus?
To detect the language of a document, a language profile is constructed and compared with the profiles of known languages. The text set of these known languages is known as a corpus.
A corpus is a collection of texts of a written language that shows how the language is used in real situations.
The corpus is developed from books, transcripts, and other data resources like the Internet. The accuracy of the corpus depends upon the profiling algorithm we use to frame the corpus.
What are Profiling Algorithms?
The common way of detecting languages is by using dictionaries. The words used in a given piece of text are matched against those in the dictionaries.
A list of the common words used in a language is the simplest and most effective corpus for detecting that language, for example, the articles a, an, and the in English.
Using Word Sets as Corpus
Using word sets, a simple algorithm is framed to find the distance between two corpora, which is equal to the sum of the differences between the frequencies of matching words.
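The word-set distance just described can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not Tika's implementation: WordSetDistance is a hypothetical class, frequencies are relative (count divided by the total number of words), and words are split on non-word characters.

```java
import java.util.HashMap;
import java.util.Map;

class WordSetDistance {
    // Builds a profile: each word mapped to its relative frequency in the text.
    static Map<String, Double> profile(String text) {
        Map<String, Double> freq = new HashMap<>();
        int total = 0;
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) {
                continue;
            }
            freq.merge(word, 1.0, Double::sum);
            total++;
        }
        final int n = total;
        freq.replaceAll((word, count) -> count / n);
        return freq;
    }

    // Sum of |f1 - f2| over the words that occur in BOTH profiles.
    // Words found in only one profile are ignored, as in the description above.
    static double distance(Map<String, Double> a, Map<String, Double> b) {
        double sum = 0;
        for (Map.Entry<String, Double> entry : a.entrySet()) {
            Double other = b.get(entry.getKey());
            if (other != null) {
                sum += Math.abs(entry.getValue() - other);
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Double> first = profile("the the the cat");
        Map<String, Double> second = profile("the cat");
        System.out.println("distance = " + distance(first, second));
    }
}
```

Because only matching words contribute, two short texts with no words in common get a distance of zero, which hints at why this approach needs a lot of text to be reliable.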
Such algorithms suffer from the following problems −
-
Since the frequency of matching words is very low, the algorithm cannot work efficiently with small texts having few sentences. It needs a lot of text for an accurate match.
-
It cannot detect word boundaries for languages having compound sentences, and those having no word dividers like spaces or punctuation marks.
Due to these difficulties in using word sets as a corpus, individual characters or character groups are considered instead.
Using Character Sets as Corpus
Since the characters that are commonly used in a language are finite in number, it is easy to apply an algorithm based on character frequencies rather than word frequencies. This algorithm works even better for character sets that are used in only one or very few languages.
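As a small illustration of the character-based idea, the sketch below counts the letters of a text per Unicode script and reports the most frequent script. ScriptProfile is a hypothetical class and the approach is a simplification chosen for this example (it is not how Tika detects languages). The sample strings are written with Unicode escapes so the source compiles regardless of file encoding.

```java
import java.util.EnumMap;
import java.util.Map;

class ScriptProfile {
    // Counts letters per Unicode script and returns the most frequent script.
    static Character.UnicodeScript dominantScript(String text) {
        Map<Character.UnicodeScript, Integer> counts =
                new EnumMap<>(Character.UnicodeScript.class);
        text.codePoints()
                .filter(Character::isLetter)
                .forEach(cp -> counts.merge(Character.UnicodeScript.of(cp), 1, Integer::sum));

        Character.UnicodeScript best = Character.UnicodeScript.UNKNOWN;
        int bestCount = 0;
        for (Map.Entry<Character.UnicodeScript, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                bestCount = entry.getValue();
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String greek = "\u039a\u03b1\u03bb\u03b7\u03bc\u03ad\u03c1\u03b1";   // a Greek greeting
        String russian = "\u041f\u0440\u0438\u0432\u0435\u0442";             // a Russian greeting
        System.out.println(dominantScript(greek));
        System.out.println(dominantScript(russian));
        // English and German both use Latin letters, so the script alone cannot separate them.
        System.out.println(dominantScript("hello") + " / " + dominantScript("hallo"));
    }
}
```

A script used by a single language identifies that language almost immediately, while languages sharing a script remain indistinguishable at this level.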
This algorithm suffers from the following drawbacks −
-
It is difficult to differentiate two languages having similar character frequencies.
-
There is no specific tool or algorithm to identify a language using, as a corpus, a character set that is shared by multiple languages.
N-gram Algorithm
The drawbacks stated above gave rise to a new approach: using character sequences of a given length for profiling a corpus. Such sequences of characters are generally called N-grams, where N represents the length of the character sequence.
-
The N-gram algorithm is an effective approach for language detection, especially in the case of European languages like English.
-
This algorithm works fine with short texts.
-
Though there are more advanced language profiling algorithms, with more attractive features, that can detect multiple languages in a multilingual document, Tika uses the 3-grams algorithm, as it is suitable in most practical situations.
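The 3-gram approach can be sketched as follows. This is a toy illustration under stated assumptions, not Tika's implementation: TrigramProfile is a hypothetical class, the two "corpora" are single sentences invented for this example, word boundaries are marked with an underscore, and the distance is the sum of absolute frequency differences over all trigrams seen in either profile.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class TrigramProfile {
    // Maps every 3-character sequence of the text to its relative frequency.
    static Map<String, Double> profile(String text) {
        String s = "_" + text.toLowerCase().replaceAll("[^\\p{L}]+", "_") + "_";
        Map<String, Double> counts = new HashMap<>();
        int total = 0;
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1.0, Double::sum);
            total++;
        }
        final int n = total;
        counts.replaceAll((gram, count) -> count / n);
        return counts;
    }

    // Sum of |f1 - f2| over all trigrams that occur in either profile.
    static double distance(Map<String, Double> a, Map<String, Double> b) {
        Set<String> grams = new HashSet<>(a.keySet());
        grams.addAll(b.keySet());
        double sum = 0;
        for (String gram : grams) {
            sum += Math.abs(a.getOrDefault(gram, 0.0) - b.getOrDefault(gram, 0.0));
        }
        return sum;
    }

    // Returns the code of the known profile closest to the text.
    static String detect(String text, Map<String, Map<String, Double>> known) {
        Map<String, Double> unknown = profile(text);
        String best = null;
        double bestDistance = Double.MAX_VALUE;
        for (Map.Entry<String, Map<String, Double>> entry : known.entrySet()) {
            double d = distance(unknown, entry.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = entry.getKey();
            }
        }
        return best;
    }

    // Tiny sample corpora (assumptions for this sketch; real profiles need far more text).
    static Map<String, Map<String, Double>> sampleProfiles() {
        Map<String, Map<String, Double>> known = new LinkedHashMap<>();
        known.put("en", profile("the quick brown fox jumps over the lazy dog and the cat"));
        known.put("de", profile("der schnelle braune fuchs springt ueber den faulen hund und die katze"));
        return known;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> known = sampleProfiles();
        System.out.println(detect("the dog and the fox", known));
        System.out.println(detect("der hund und die katze", known));
    }
}
```

Unlike the word-set approach, trigrams need no word boundaries and still produce many matches on a short input.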
Language Detection in Tika
Among the 184 languages standardized by ISO 639-1, Tika can detect 18 languages. Language detection in Tika is done using the getLanguage() method of the LanguageIdentifier class. This method returns the code of the language as a String. Given below is the list of language-code pairs detected by Tika:
da - Danish
de - German
et - Estonian
el - Greek
en - English
es - Spanish
fi - Finnish
fr - French
hu - Hungarian
is - Icelandic
it - Italian
nl - Dutch
no - Norwegian
pl - Polish
pt - Portuguese
ru - Russian
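Since the detector returns only a two-letter code, it can be handy to turn that code back into a readable name. The sketch below does this with the JDK's Locale class; LanguageNames is a hypothetical helper written for this example and is not part of Tika.

```java
import java.util.Locale;

class LanguageNames {
    // Turns an ISO 639-1 code such as "da" into its English display name.
    static String displayName(String code) {
        return Locale.forLanguageTag(code).getDisplayLanguage(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        for (String code : new String[] {"da", "de", "en"}) {
            System.out.println(code + " - " + displayName(code));
        }
    }
}
```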
While instantiating the LanguageIdentifier class, you should pass either the content whose language is to be detected, as a String, or a LanguageProfile object.
LanguageIdentifier object = new LanguageIdentifier("this is english");
Given below is the example program for Language detection in Tika.
import java.io.IOException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.language.LanguageIdentifier;
import org.xml.sax.SAXException;
public class LanguageDetection {
public static void main(String args[])throws IOException, SAXException, TikaException {
LanguageIdentifier identifier = new LanguageIdentifier("this is english ");
String language = identifier.getLanguage();
System.out.println("Language of the given content is : " + language);
}
}
Save the above code as LanguageDetection.java and run it from the command prompt using the following commands −
javac LanguageDetection.java
java LanguageDetection
If you execute the above program, it gives the following output:
Language of the given content is : en
Language Detection of a Document
To detect the language of a given document, you first have to parse it using the parse() method. The parse() method parses the content and stores it in the handler object, which is passed to it as one of the arguments. Then pass the String form of the handler object to the constructor of the LanguageIdentifier class, as shown below:
parser.parse(inputstream, handler, metadata, context);
LanguageIdentifier object = new LanguageIdentifier(handler.toString());
Given below is the complete program that demonstrates how to detect the language of a given document −
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.language.*;
import org.xml.sax.SAXException;
public class DocumentLanguageDetection {
public static void main(final String[] args) throws IOException, SAXException, TikaException {
//Instantiating a file object
File file = new File("Example.txt");
//Parser method parameters
Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream content = new FileInputStream(file);
//Parsing the given document
parser.parse(content, handler, metadata, new ParseContext());
LanguageIdentifier object = new LanguageIdentifier(handler.toString());
System.out.println("Language name :" + object.getLanguage());
}
}
Save the above code as DocumentLanguageDetection.java and run it from the command prompt:
javac DocumentLanguageDetection.java
java DocumentLanguageDetection
Given below is the content of Example.txt.
Hi students welcome to tutorialspoint
If you execute the above program, it will give you the following output −
Language name :en
Along with the Tika jar, Tika provides a Graphical User Interface (GUI) application and a Command Line Interface (CLI) application. You can execute the Tika application from the command prompt like any other Java application.
TIKA - GUI
Graphical User Interface (GUI)
-
Tika provides a jar file along with its source code at the following link: https://tika.apache.org/download.html.
-
Download both files and set the classpath for the jar file.
-
Extract the source code zip folder and open the tika-app folder.
-
In the extracted folder at “tika-1.6\tika-app\src\main\java\org\apache\tika\gui” you will see two source files: ParsingTransferHandler.java and TikaGUI.java.
-
Compile both source files and run the TikaGUI class; it opens the following window.
Let us now see how to make use of the Tika GUI.
On the GUI, click Open, browse and select the file to be extracted, or drag it onto the empty area of the window.
Tika extracts the content of the file and displays it in five different formats, viz. metadata, formatted text, plain text, main content, and structured text. You can choose any of the formats you want.
In the same way, you will also find the CLI class in the “tika-1.6\tika-app\src\main\java\org\apache\tika\cli” folder.
The following illustration shows what Tika can do. When we drop an image on the GUI, Tika extracts and displays its metadata.
TIKA - Extracting PDF
Given below is the program to extract content and metadata from a PDF.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;
public class PdfParse {
public static void main(final String[] args) throws IOException, SAXException, TikaException {
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(new File("Example.pdf"));
ParseContext pcontext = new ParseContext();
//parsing the document using PDF parser
PDFParser pdfparser = new PDFParser();
pdfparser.parse(inputstream, handler, metadata,pcontext);
//getting the content of the document
System.out.println("Contents of the PDF :" + handler.toString());
//getting metadata of the document
System.out.println("Metadata of the PDF:");
String[] metadataNames = metadata.names();
for(String name : metadataNames) {
System.out.println(name+ " : " + metadata.get(name));
}
}
}
Save the above code as PdfParse.java, then compile and run it from the command prompt by using the following commands:
javac PdfParse.java
java PdfParse
Given below is the snapshot of Example.pdf.
The PDF we are passing has the following properties −
After running the program, you will get the output shown below.
Output −
Contents of the PDF:
Apache Tika is a framework for content type detection and content extraction
which was designed by Apache software foundation. It detects and extracts metadata
and structured text content from different types of documents such as spreadsheets,
text documents, images or PDFs including audio or video input formats to certain extent.
Metadata of the PDF:
dcterms:modified : 2014-09-28T12:31:16Z
meta:creation-date : 2014-09-28T12:31:16Z
meta:save-date : 2014-09-28T12:31:16Z
dc:creator : Krishna Kasyap
pdf:PDFVersion : 1.5
Last-Modified : 2014-09-28T12:31:16Z
Author : Krishna Kasyap
dcterms:created : 2014-09-28T12:31:16Z
date : 2014-09-28T12:31:16Z
modified : 2014-09-28T12:31:16Z
creator : Krishna Kasyap
xmpTPg:NPages : 1
Creation-Date : 2014-09-28T12:31:16Z
pdf:encrypted : false
meta:author : Krishna Kasyap
created : Sun Sep 28 05:31:16 PDT 2014
dc:format : application/pdf; version = 1.5
producer : Microsoft® Word 2013
Content-Type : application/pdf
xmp:CreatorTool : Microsoft® Word 2013
Last-Save-Date : 2014-09-28T12:31:16Z
TIKA - Extracting ODF
Given below is the program to extract content and metadata from an Open Document Format (ODF) file.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.odf.OpenDocumentParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;
public class OpenDocumentParse {
public static void main(final String[] args) throws IOException,SAXException, TikaException {
//parse() method parameters
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(new File("example_open_document_presentation.odp"));
ParseContext pcontext = new ParseContext();
//Open Document Parser
OpenDocumentParser openofficeparser = new OpenDocumentParser ();
openofficeparser.parse(inputstream, handler, metadata,pcontext);
System.out.println("Contents of the document:" + handler.toString());
System.out.println("Metadata of the document:");
String[] metadataNames = metadata.names();
for(String name : metadataNames) {
System.out.println(name + " : " + metadata.get(name));
}
}
}
Save the above code as OpenDocumentParse.java, then compile and run it from the command prompt by using the following commands:
javac OpenDocumentParse.java
java OpenDocumentParse
Given below is the snapshot of the example_open_document_presentation.odp file.
This document has the following properties −
After running the program, you will get the following output.
Output −
Contents of the document:
Apache Tika
Apache Tika is a framework for content type detection and content extraction which was designed
by Apache software foundation. It detects and extracts metadata and structured text content from
different types of documents such as spreadsheets, text documents, images or PDFs including audio
or video input formats to certain extent.
Metadata of the document:
editing-cycles: 4
meta:creation-date: 2009-04-16T11:32:32.86
dcterms:modified: 2014-09-28T07:46:13.03
meta:save-date: 2014-09-28T07:46:13.03
Last-Modified: 2014-09-28T07:46:13.03
dcterms:created: 2009-04-16T11:32:32.86
date: 2014-09-28T07:46:13.03
modified: 2014-09-28T07:46:13.03
nbObject: 36
Edit-Time: PT32M6S
Creation-Date: 2009-04-16T11:32:32.86
Object-Count: 36
meta:object-count: 36
generator: OpenOffice/4.1.0$Win32 OpenOffice.org_project/410m18$Build-9764
Content-Type: application/vnd.oasis.opendocument.presentation
Last-Save-Date: 2014-09-28T07:46:13.03
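The Edit-Time value in the output above is an ISO 8601 duration, and the creation and modification dates are zone-less ISO 8601 timestamps. As a minimal sketch, assuming you want to work with these strings as typed values, the JDK's java.time classes can parse them directly. The class name OdfMetadataValues and its helper methods are made up for illustration and are not part of Tika.

```java
import java.time.Duration;
import java.time.LocalDateTime;

// Hypothetical helper class for illustration; not part of Tika.
final class OdfMetadataValues {

    // Parses an ISO 8601 duration such as the Edit-Time value "PT32M6S".
    static Duration parseEditTime(String value) {
        return Duration.parse(value);
    }

    // Parses a zone-less timestamp such as "2009-04-16T11:32:32.86".
    static LocalDateTime parseTimestamp(String value) {
        return LocalDateTime.parse(value);
    }

    public static void main(String[] args) {
        Duration editTime = parseEditTime("PT32M6S");
        System.out.println("Edit time in seconds: " + editTime.getSeconds());

        LocalDateTime created = parseTimestamp("2009-04-16T11:32:32.86");
        System.out.println("Created in year: " + created.getYear());
    }
}
```

The timestamps carry no time zone, so LocalDateTime is used here rather than Instant.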
TIKA - Extracting MS-Office Files
Given below is the program to extract content and metadata from a Microsoft Office document.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;
public class MSExcelParse {
public static void main(final String[] args) throws IOException, SAXException, TikaException {
// Set up the content handler, metadata holder, and input stream
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(new File("example_msExcel.xlsx"));
ParseContext pcontext = new ParseContext();
//OOXml parser
OOXMLParser msofficeparser = new OOXMLParser ();
msofficeparser.parse(inputstream, handler, metadata,pcontext);
System.out.println("Contents of the document:" + handler.toString());
System.out.println("Metadata of the document:");
String[] metadataNames = metadata.names();
for(String name : metadataNames) {
System.out.println(name + ": " + metadata.get(name));
}
}
}
Save the above code as MSExcelParse.java, then compile and run it from the command prompt using the following commands:
javac MSExcelParse.java
java MSExcelParse
Here we are passing the following sample Excel file.
The given Excel file has the following properties:
After executing the above program, you will get the following output.
Output:
Contents of the document:
Sheet1
Name Age Designation Salary
Ramu 50 Manager 50,000
Raheem 40 Assistant manager 40,000
Robert 30 Superviser 30,000
sita 25 Clerk 25,000
sameer 25 Section in-charge 20,000
Metadata of the document:
meta:creation-date: 2006-09-16T00:00:00Z
dcterms:modified: 2014-09-28T15:18:41Z
meta:save-date: 2014-09-28T15:18:41Z
Application-Name: Microsoft Excel
extended-properties:Company:
dcterms:created: 2006-09-16T00:00:00Z
Last-Modified: 2014-09-28T15:18:41Z
Application-Version: 15.0300
date: 2014-09-28T15:18:41Z
publisher:
modified: 2014-09-28T15:18:41Z
Creation-Date: 2006-09-16T00:00:00Z
extended-properties:AppVersion: 15.0300
protected: false
dc:publisher:
extended-properties:Application: Microsoft Excel
Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Last-Save-Date: 2014-09-28T15:18:41Z
TIKA - Extracting Text Document
Given below is the program to extract content and metadata from a text document:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.parser.txt.TXTParser;
import org.xml.sax.SAXException;
public class TextParser {
public static void main(final String[] args) throws IOException, SAXException, TikaException {
// Set up the content handler, metadata holder, and input stream
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(new File("example.txt"));
ParseContext pcontext=new ParseContext();
//Text document parser
TXTParser textParser = new TXTParser();
textParser.parse(inputstream, handler, metadata, pcontext);
System.out.println("Contents of the document:" + handler.toString());
System.out.println("Metadata of the document:");
String[] metadataNames = metadata.names();
for(String name : metadataNames) {
System.out.println(name + " : " + metadata.get(name));
}
}
}
Save the above code as TextParser.java, then compile and run it from the command prompt using the following commands:
javac TextParser.java
java TextParser
Given below is a snapshot of the example.txt file:
The text document has the following properties:
If you execute the above program, it will give you the following output.
Output:
Contents of the document:
At tutorialspoint.com, we strive hard to provide quality tutorials for self-learning
purpose in the domains of Academics, Information Technology, Management and Computer
Programming Languages.
The endeavour started by Mohtashim, an AMU alumni, who is the founder and the managing
director of Tutorials Point (I) Pvt. Ltd. He came up with the website tutorialspoint.com
in year 2006 with the help of handpicked freelancers, with an array of tutorials for
computer programming languages.
Metadata of the document:
Content-Encoding: windows-1252
Content-Type: text/plain; charset = windows-1252
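The Content-Type reported above carries the character set that was detected for the file. The following sketch, which uses only the JDK, shows one way to pull the charset parameter out of such a value and use it to decode raw bytes. CharsetFromContentType and charsetOf are hypothetical names, and the UTF-8 fallback is an assumption of this sketch, not Tika behaviour.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of Tika.
final class CharsetFromContentType {

    // Returns the charset named in a Content-Type value,
    // or UTF-8 (an assumption of this sketch) if none is present.
    static Charset charsetOf(String contentType) {
        for (String part : contentType.split(";")) {
            String[] keyValue = part.split("=", 2);
            if (keyValue.length == 2 && keyValue[0].trim().equalsIgnoreCase("charset")) {
                return Charset.forName(keyValue[1].trim());
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        Charset charset = charsetOf("text/plain; charset = windows-1252");
        // The bytes for "caf" followed by 0xE9, which is e-acute in windows-1252
        byte[] raw = {(byte) 0x63, (byte) 0x61, (byte) 0x66, (byte) 0xE9};
        System.out.println("Decoded with " + charset.name() + ": " + new String(raw, charset));
    }
}
```

Decoding with the reported charset matters for bytes above 0x7F, which map to different characters in different encodings.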
TIKA - Extracting HTML Document
Given below is the program to extract content and metadata from an HTML document.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;
public class HtmlParse {
public static void main(final String[] args) throws IOException, SAXException, TikaException {
// Set up the content handler, metadata holder, and input stream
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(new File("example.html"));
ParseContext pcontext = new ParseContext();
//Html parser
HtmlParser htmlparser = new HtmlParser();
htmlparser.parse(inputstream, handler, metadata,pcontext);
System.out.println("Contents of the document:" + handler.toString());
System.out.println("Metadata of the document:");
String[] metadataNames = metadata.names();
for(String name : metadataNames) {
System.out.println(name + ": " + metadata.get(name));
}
}
}
Save the above code as HtmlParse.java, then compile and run it from the command prompt using the following commands:
javac HtmlParse.java
java HtmlParse
Given below is a snapshot of the example.html file.
The HTML document has the following properties:
If you execute the above program, it will give you the following output.
Output:
Contents of the document:
Name Salary age
Ramesh Raman 50000 20
Shabbir Hussein 70000 25
Umesh Raman 50000 30
Somesh 50000 35
Metadata of the document:
title: HTML Table Header
Content-Encoding: windows-1252
Content-Type: text/html; charset = windows-1252
dc:title: HTML Table Header
TIKA - Extracting XML Document
Given below is the program to extract content and metadata from an XML document:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.xml.XMLParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;
public class XmlParse {
public static void main(final String[] args) throws IOException, SAXException, TikaException {
// Set up the content handler, metadata holder, and input stream
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(new File("pom.xml"));
ParseContext pcontext = new ParseContext();
//Xml parser
XMLParser xmlparser = new XMLParser();
xmlparser.parse(inputstream, handler, metadata, pcontext);
System.out.println("Contents of the document:" + handler.toString());
System.out.println("Metadata of the document:");
String[] metadataNames = metadata.names();
for(String name : metadataNames) {
System.out.println(name + ": " + metadata.get(name));
}
}
}
Save the above code as XmlParse.java, then compile and run it from the command prompt using the following commands:
javac XmlParse.java
java XmlParse
Given below is a snapshot of the pom.xml file:
This document has the following properties:
If you execute the above program, it will give you the following output.
Output:
Contents of the document:
4.0.0
org.apache.tika
tika
1.6
org.apache.tika
tika-core
1.6
org.apache.tika
tika-parsers
1.6
src
maven-compiler-plugin
3.1
1.7
1.7
Metadata of the document:
Content-Type: application/xml
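The content printed above is the character data of the XML elements with the tags removed. To illustrate the idea without Tika, here is a rough JDK-only sketch that collects element text through a SAX handler. It is not how XMLParser or BodyContentHandler are implemented; XmlTextSketch and textOf are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class for illustration; not part of Tika.
final class XmlTextSketch {

    // Returns the non-blank text chunks of an XML string, in document order.
    static List<String> textOf(String xml) {
        final List<String> chunks = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder buffer = new StringBuilder();

            @Override
            public void characters(char[] ch, int start, int length) {
                // SAX may deliver one text node in several pieces, so buffer them
                buffer.append(ch, start, length);
            }

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                flush();
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                flush();
            }

            // Stores the buffered text if it is not just whitespace
            private void flush() {
                String text = buffer.toString().trim();
                if (!text.isEmpty()) {
                    chunks.add(text);
                }
                buffer.setLength(0);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return chunks;
    }

    public static void main(String[] args) {
        String xml = "<project><groupId>org.apache.tika</groupId>"
                + "<artifactId>tika</artifactId><version>1.6</version></project>";
        for (String chunk : textOf(xml)) {
            System.out.println(chunk);
        }
    }
}
```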
TIKA - Extracting .class File
Given below is the program to extract content and metadata from a .class file.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.asm.ClassParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;
public class JavaClassParse {
public static void main(final String[] args) throws IOException, SAXException, TikaException {
// Set up the content handler, metadata holder, and input stream
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(new File("Example.class"));
ParseContext pcontext = new ParseContext();
// Class file parser
ClassParser classParser = new ClassParser();
classParser.parse(inputstream, handler, metadata, pcontext);
System.out.println("Contents of the document:" + handler.toString());
System.out.println("Metadata of the document:");
String[] metadataNames = metadata.names();
for(String name : metadataNames) {
System.out.println(name + " : " + metadata.get(name));
}
}
}
Save the above code as JavaClassParse.java, then compile and run it from the command prompt using the following commands:
javac JavaClassParse.java
java JavaClassParse
Given below is a snapshot of Example.java, which generates Example.class after compilation.
The Example.class file has the following properties:
After executing the above program, you will get the following output.
Output:
Contents of the document:
package tutorialspoint.tika.examples;
public synchronized class Example {
public void Example();
public static void main(String[]);
}
Metadata of the document:
title: Example
resourceName: Example.class
dc:title: Example
TIKA - Extracting JAR File
Given below is the program to extract content and metadata from a Java Archive (jar) file:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.parser.pkg.PackageParser;
import org.xml.sax.SAXException;
public class PackageParse {
public static void main(final String[] args) throws IOException, SAXException, TikaException {
// Set up the content handler, metadata holder, and input stream
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(new File("Example.jar"));
ParseContext pcontext = new ParseContext();
//Package parser
PackageParser packageparser = new PackageParser();
packageparser.parse(inputstream, handler, metadata,pcontext);
System.out.println("Contents of the document: " + handler.toString());
System.out.println("Metadata of the document:");
String[] metadataNames = metadata.names();
for(String name : metadataNames) {
System.out.println(name + ": " + metadata.get(name));
}
}
}
Save the above code as PackageParse.java, then compile and run it from the command prompt using the following commands:
javac PackageParse.java
java PackageParse
Given below is a snapshot of Example.java, which resides inside the package.
The jar file has the following properties:
After executing the above program, you will get the following output.
Output:
Contents of the document:
META-INF/MANIFEST.MF
tutorialspoint/tika/examples/Example.class
Metadata of the document:
Content-Type: application/zip
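The content listed above is the set of entry names stored in the archive. As an illustration only, the sketch below builds a small jar in a temporary file and lists its entries with the JDK's java.util.zip classes. It does not use Tika; JarListingSketch, writeSampleJar, and entryNames are hypothetical names, and the class entry holds placeholder bytes rather than a compiled class.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.Attributes;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper class for illustration; not part of Tika.
final class JarListingSketch {

    // Writes a tiny jar with a manifest and one placeholder entry; returns its path.
    static Path writeSampleJar() {
        try {
            Path jar = Files.createTempFile("Example", ".jar");
            Manifest manifest = new Manifest();
            manifest.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");
            try (JarOutputStream out = new JarOutputStream(Files.newOutputStream(jar), manifest)) {
                out.putNextEntry(new JarEntry("tutorialspoint/tika/examples/Example.class"));
                // Placeholder bytes only; this is not a valid class file
                out.write(new byte[] {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE});
                out.closeEntry();
            }
            return jar;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Lists the entry names of a jar (a jar is a zip archive) in stored order.
    static List<String> entryNames(Path jar) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream in = new ZipInputStream(Files.newInputStream(jar))) {
            for (ZipEntry entry = in.getNextEntry(); entry != null; entry = in.getNextEntry()) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        Path jar = writeSampleJar();
        for (String name : entryNames(jar)) {
            System.out.println(name);
        }
        Files.deleteIfExists(jar);
    }
}
```

Because a jar is a zip archive with a manifest, a Content-Type of application/zip in the metadata above is consistent with this view.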
TIKA - Extracting Image File
Given below is the program to extract content and metadata from a JPEG image.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.jpeg.JpegParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;
public class JpegParse {
public static void main(final String[] args) throws IOException, SAXException, TikaException {
// Set up the content handler, metadata holder, and input stream
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(new File("boy.jpg"));
ParseContext pcontext = new ParseContext();
// JPEG parser
JpegParser jpegParser = new JpegParser();
jpegParser.parse(inputstream, handler, metadata, pcontext);
System.out.println("Contents of the document:" + handler.toString());
System.out.println("Metadata of the document:");
String[] metadataNames = metadata.names();
for(String name : metadataNames) {
System.out.println(name + ": " + metadata.get(name));
}
}
}
Save the above code as JpegParse.java, then compile and run it from the command prompt using the following commands:
javac JpegParse.java
java JpegParse
Given below is a snapshot of boy.jpg:
The JPEG file has the following properties:
After executing the program, you will get the following output.
Output:
Contents of the document:
Metadata of the document:
Resolution Units: inch
Compression Type: Baseline
Data Precision: 8 bits
Number of Components: 3
tiff:ImageLength: 3000
Component 2: Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert
Component 1: Y component: Quantization table 0, Sampling factors 2 horiz/2 vert
Image Height: 3000 pixels
X Resolution: 300 dots
Original Transmission Reference: 53616c7465645f5f2368da84ca932841b336ac1a49edb1a93fae938b8db2cb3ec9cc4dc28d7383f1
Image Width: 4000 pixels
IPTC-NAA record: 92 bytes binary data
Component 3: Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert
tiff:BitsPerSample: 8
Application Record Version: 4
tiff:ImageWidth: 4000
Y Resolution: 300 dots
TIKA - Extracting mp4 Files
Given below is the program to extract content and metadata from mp4 files:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.mp4.MP4Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;
public class Mp4Parse {
public static void main(final String[] args) throws IOException, SAXException, TikaException {
// Set up the content handler, metadata holder, and input stream
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(new File("example.mp4"));
ParseContext pcontext = new ParseContext();
// MP4 parser
MP4Parser mp4Parser = new MP4Parser();
mp4Parser.parse(inputstream, handler, metadata, pcontext);
System.out.println("Contents of the document:" + handler.toString());
System.out.println("Metadata of the document:");
String[] metadataNames = metadata.names();
for(String name : metadataNames) {
System.out.println(name + ": " + metadata.get(name));
}
}
}
Save the above code as Mp4Parse.java, then compile and run it from the command prompt using the following commands:
javac Mp4Parse.java
java Mp4Parse
Given below is a snapshot of the properties of the example.mp4 file.
After executing the above program, you will get the following output.
Output:
Contents of the document:
Metadata of the document:
dcterms:modified: 2014-01-06T12:10:27Z
meta:creation-date: 1904-01-01T00:00:00Z
meta:save-date: 2014-01-06T12:10:27Z
Last-Modified: 2014-01-06T12:10:27Z
dcterms:created: 1904-01-01T00:00:00Z
date: 2014-01-06T12:10:27Z
tiff:ImageLength: 360
modified: 2014-01-06T12:10:27Z
Creation-Date: 1904-01-01T00:00:00Z
tiff:ImageWidth: 640
Content-Type: video/mp4
Last-Save-Date: 2014-01-06T12:10:27Z
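The 1904-01-01T00:00:00Z creation date in the output above is probably not a real timestamp. The MP4 container format counts time in seconds from midnight on 1 January 1904 (UTC), so a creation field that was left at zero shows up as that date. The sketch below, using only the JDK, converts such a count into a java.time.Instant; Mp4EpochSketch and toInstant are hypothetical names.

```java
import java.time.Instant;

// Hypothetical helper class for illustration; not part of Tika.
final class Mp4EpochSketch {

    // MP4 timestamps count seconds from this moment rather than from 1970.
    static final Instant MP4_EPOCH = Instant.parse("1904-01-01T00:00:00Z");

    // Converts a raw MP4 time value (seconds since 1904) to an Instant.
    static Instant toInstant(long secondsSince1904) {
        return MP4_EPOCH.plusSeconds(secondsSince1904);
    }

    public static void main(String[] args) {
        // A zero field maps to the MP4 epoch itself
        System.out.println(toInstant(0L));
        // The Unix epoch is 2,082,844,800 seconds after the MP4 epoch
        System.out.println(toInstant(2_082_844_800L));
    }
}
```

An application that displays these dates may want to treat the MP4 epoch value as "unknown" rather than as a real creation time.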
TIKA - Extracting mp3 Files
Given below is the program to extract content and metadata from mp3 files:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.mp3.LyricsHandler;
import org.apache.tika.parser.mp3.Mp3Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;
public class Mp3Parse {
public static void main(final String[] args) throws IOException, SAXException, TikaException {
// Set up the content handler, metadata holder, and input stream
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(new File("example.mp3"));
ParseContext pcontext = new ParseContext();
//Mp3 parser
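Mp3Parser mp3Parser = new Mp3Parser();
mp3Parser.parse(inputstream, handler, metadata, pcontext);
// The parser has already read inputstream, so open the file again for the lyrics tag
LyricsHandler lyrics = new LyricsHandler(new FileInputStream(new File("example.mp3")), handler);
// Check once: a while loop here would never end when lyrics are present
if (lyrics.hasLyrics()) {
System.out.println(lyrics.toString());
}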
System.out.println("Contents of the document:" + handler.toString());
System.out.println("Metadata of the document:");
String[] metadataNames = metadata.names();
for(String name : metadataNames) {
System.out.println(name + ": " + metadata.get(name));
}
}
}
Save the above code as Mp3Parse.java, then compile and run it from the command prompt using the following commands:
javac Mp3Parse.java
java Mp3Parse
The example.mp3 file has the following properties:
You will get the following output after executing the program. If the given file has any lyrics, the application captures and displays them along with the output.
Output:
Contents of the document:
Kanulanu Thaake
Arijit Singh
Manam (2014), track 01/06
2014
Soundtrack
30171.65
eng -
DRGM
Arijit Singh
Manam (2014), track 01/06
2014
Soundtrack
30171.65
eng -
DRGM
Metadata of the document:
xmpDM:releaseDate: 2014
xmpDM:duration: 30171.650390625
xmpDM:audioChannelType: Stereo
dc:creator: Arijit Singh
xmpDM:album: Manam (2014)
Author: Arijit Singh
xmpDM:artist: Arijit Singh
channels: 2
xmpDM:audioSampleRate: 44100
xmpDM:logComment: eng -
DRGM
xmpDM:trackNumber: 01/06
version: MPEG 3 Layer III Version 1
creator: Arijit Singh
xmpDM:composer: Music : Anoop Rubens | Lyrics : Vanamali
xmpDM:audioCompressor: MP3
title: Kanulanu Thaake
samplerate: 44100
meta:author: Arijit Singh
xmpDM:genre: Soundtrack
Content-Type: audio/mpeg
xmpDM:albumArtist: Manam (2014)
dc:title: Kanulanu Thaake