Tika Concise Tutorial

TIKA - Referenced API

Users can embed Tika in their applications through the Tika facade class, which exposes methods for exploring Tika's functionality. Because it is a facade, it hides the complexity behind those functions. Users can also work directly with the other Tika classes in their applications.

[Figure: user application]

Tika Class (facade)

This is the most prominent class of the Tika library and follows the facade design pattern: it hides the internal implementation and provides simple methods for accessing Tika's functionality. The following table lists the constructors of this class along with their descriptions.

package − org.apache.tika

class − Tika

| Sr.No. | Constructor | Description |
|---|---|---|
| 1 | Tika() | Constructs a Tika facade using the default configuration. |
| 2 | Tika(Detector detector) | Creates a Tika facade using the given detector instance. |
| 3 | Tika(Detector detector, Parser parser) | Creates a Tika facade using the given detector and parser instances. |
| 4 | Tika(Detector detector, Parser parser, Translator translator) | Creates a Tika facade using the given detector, parser, and translator instances. |
| 5 | Tika(TikaConfig config) | Creates a Tika facade using the given TikaConfig object. |

Methods and Description

The following are the important methods of the Tika facade class −

| Sr.No. | Method | Description |
|---|---|---|
| 1 | String parseToString(File file) | This method and its variants parse the given file and return the extracted text content as a String. By default, the length of the returned string is limited. |
| 2 | int getMaxStringLength() | Returns the maximum length of the strings returned by the parseToString methods. |
| 3 | void setMaxStringLength(int maxStringLength) | Sets the maximum length of the strings returned by the parseToString methods. |
| 4 | Reader parse(File file) | This method and its variants parse the given file and return the extracted text content as a java.io.Reader object. |
| 5 | String detect(InputStream stream, Metadata metadata) | This method and its variants accept an InputStream and a Metadata object, detect the type of the given document, and return the document type name as a String. It abstracts the detection mechanisms used by Tika. |
| 6 | String translate(InputStream text, String targetLanguage) | This method and its variants accept an InputStream and a String naming the target language, and translate the given text into that language, attempting to auto-detect the source language. |
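
To make the facade idea concrete, here is a minimal sketch that uses only the Java standard library. It is not Tika code: TypeDetector, TextParser, and MiniFacade are hypothetical names invented for this illustration. The two constructors loosely mirror the shape of Tika() and Tika(Detector detector, Parser parser), and the two methods loosely mirror detect and parseToString.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Stdlib-only illustration of the facade pattern; these are NOT Tika classes.
class FacadeSketch {

    /** Hypothetical stand-in for a detector: maps a file name to a media type. */
    interface TypeDetector {
        String detect(String fileName);
    }

    /** Hypothetical stand-in for a parser: turns raw bytes into text. */
    interface TextParser {
        String parse(byte[] content);
    }

    /** The facade: callers see two simple methods, not the collaborators behind them. */
    static final class MiniFacade {
        private final TypeDetector detector;
        private final TextParser parser;

        /** Default wiring, analogous in spirit to a no-argument constructor. */
        MiniFacade() {
            this(FacadeSketch::detectByExtension,
                 bytes -> new String(bytes, StandardCharsets.UTF_8));
        }

        /** Custom wiring, analogous in spirit to passing a detector and a parser. */
        MiniFacade(TypeDetector detector, TextParser parser) {
            this.detector = detector;
            this.parser = parser;
        }

        String detect(String fileName) {
            return detector.detect(fileName);
        }

        String parseToString(byte[] content) {
            return parser.parse(content);
        }
    }

    /** Toy detection by file extension only (an assumption made for brevity). */
    static String detectByExtension(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".txt")) return "text/plain";
        if (lower.endsWith(".pdf")) return "application/pdf";
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        MiniFacade facade = new MiniFacade();
        System.out.println(facade.detect("notes.TXT"));   // prints "text/plain"
        System.out.println(facade.parseToString("hello".getBytes(StandardCharsets.UTF_8)));
    }
}
```

The calling code depends only on MiniFacade, so the detector or parser can be swapped through the second constructor without touching any caller. That is the property the facade pattern is meant to provide.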

Parser Interface

This is the interface implemented by all the parser classes of the Tika package.

package − org.apache.tika.parser

Interface − Parser

Methods and Description

The following is the important method of the Tika Parser interface −

| Sr.No. | Method | Description |
|---|---|---|
| 1 | void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) | Parses the given document into a sequence of XHTML SAX events. The extracted document content is delivered to the given ContentHandler, and the document metadata is placed in the given Metadata object. |
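
The idea of delivering content as XHTML SAX events can be sketched with the SAX types that ship with the JDK. The code below is not a Tika parser. SaxParseSketch, TextCollector, and extractText are hypothetical names, a plain Map stands in for the Metadata object, and the ParseContext argument is left out for brevity. It assumes UTF-8 plain text as input and emits one p element per line.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Stdlib-only illustration of the parse contract; this is NOT a Tika parser.
class SaxParseSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Reports UTF-8 text to the handler as XHTML SAX events, one <p> per line. */
    static void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
            throws IOException, SAXException {
        // Metadata goes into the caller-supplied object (key name is an assumption).
        metadata.put("Content-Type", "text/plain");
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
        AttributesImpl noAttributes = new AttributesImpl();

        handler.startDocument();
        handler.startElement(XHTML, "html", "html", noAttributes);
        handler.startElement(XHTML, "body", "body", noAttributes);
        String line;
        while ((line = reader.readLine()) != null) {
            char[] chars = line.toCharArray();
            handler.startElement(XHTML, "p", "p", noAttributes);
            handler.characters(chars, 0, chars.length);
            handler.endElement(XHTML, "p", "p");
        }
        handler.endElement(XHTML, "body", "body");
        handler.endElement(XHTML, "html", "html");
        handler.endDocument();
    }

    /** Handler that keeps only character data, ending each paragraph with a newline. */
    static final class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("p".equals(localName)) {
                text.append('\n');
            }
        }
    }

    /** Convenience wrapper: parses a string and returns the collected text. */
    static String extractText(String input, Map<String, String> metadata) {
        TextCollector collector = new TextCollector();
        try {
            parse(new ByteArrayInputStream(input.getBytes(StandardCharsets.UTF_8)),
                  collector, metadata);
        } catch (IOException | SAXException e) {
            throw new IllegalStateException(e);
        }
        return collector.text.toString();
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        System.out.print(extractText("first\nsecond", metadata));
        System.out.println(metadata);
    }
}
```

The parser never builds a document in memory. It only calls the handler, so the caller decides what to keep by choosing the handler. Here TextCollector keeps the plain text and ignores the markup.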

Metadata Class

This class implements various interfaces, such as CreativeCommons, Geographic, HttpHeaders, Message, MSOffice, ClimateForcast, TIFF, TikaMetadataKeys, TikaMimeKeys, and Serializable, to support various data models. The following tables list the constructors and methods of this class along with their descriptions.

package − org.apache.tika.metadata

class − Metadata

| Sr.No. | Constructor | Description |
|---|---|---|
| 1 | Metadata() | Constructs a new, empty metadata object. |

| Sr.No. | Method | Description |
|---|---|---|
| 1 | void add(Property property, String value) | Adds a metadata property/value mapping to a given document. Use it to add a value to a property. |
| 2 | void add(String name, String value) | Adds a metadata name/value mapping to a given document. Use it to add a new name/value pair to the existing metadata of a document. |
| 3 | String get(Property property) | Returns the value (if any) of the given metadata property. |
| 4 | String get(String name) | Returns the value (if any) of the given metadata name. |
| 5 | Date getDate(Property property) | Returns the value of the given Date metadata property. |
| 6 | String[] getValues(Property property) | Returns all the values of the given metadata property. |
| 7 | String[] getValues(String name) | Returns all the values of the given metadata name. |
| 8 | String[] names() | Returns the names of all metadata elements in the metadata object. |
| 9 | void set(Property property, Date date) | Sets the date value of the given metadata property. |
| 10 | void set(Property property, String[] values) | Sets multiple values for the given metadata property. |
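
The difference between adding a value, setting values, and reading one value versus all values is easier to see in a small model. The sketch below uses only the Java standard library. MiniMetadata is a hypothetical class, not Tika's Metadata, and it uses plain String names where Tika also accepts Property objects. The behavior shown (add appends, set replaces, get returns the first value) is this sketch's reading of the table above, so confirm the details against the Tika Javadoc before relying on them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stdlib-only model of multi-valued metadata; this is NOT Tika's Metadata class.
class MetadataSketch {

    static final class MiniMetadata {
        // Insertion-ordered map from a metadata name to its list of values.
        private final Map<String, List<String>> values = new LinkedHashMap<>();

        /** Appends a value, keeping any values already stored under the name. */
        void add(String name, String value) {
            values.computeIfAbsent(name, key -> new ArrayList<>()).add(value);
        }

        /** Replaces whatever was stored under the name with the given values. */
        void set(String name, String... newValues) {
            values.put(name, new ArrayList<>(Arrays.asList(newValues)));
        }

        /** Returns the first value, or null when the name has no values. */
        String get(String name) {
            List<String> list = values.get(name);
            return (list == null || list.isEmpty()) ? null : list.get(0);
        }

        /** Returns every value stored under the name (empty array if none). */
        String[] getValues(String name) {
            List<String> list = values.get(name);
            return list == null ? new String[0] : list.toArray(new String[0]);
        }

        /** Returns all names that currently have an entry. */
        String[] names() {
            return values.keySet().toArray(new String[0]);
        }
    }

    public static void main(String[] args) {
        MiniMetadata metadata = new MiniMetadata();
        metadata.add("author", "Ann");
        metadata.add("author", "Bob");
        System.out.println(metadata.get("author"));                        // prints "Ann"
        System.out.println(Arrays.toString(metadata.getValues("author"))); // prints "[Ann, Bob]"
        metadata.set("author", "Cy");
        System.out.println(Arrays.toString(metadata.getValues("author"))); // prints "[Cy]"
    }
}
```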

Language Identifier Class

This class identifies the language of the given content. The following tables list the constructors and methods of this class along with their descriptions.

package − org.apache.tika.language

class − LanguageIdentifier

| Sr.No. | Constructor | Description |
|---|---|---|
| 1 | LanguageIdentifier(LanguageProfile profile) | Instantiates the language identifier from the LanguageProfile object passed as a parameter. |
| 2 | LanguageIdentifier(String content) | Instantiates the language identifier from a String of text content. |

| Sr.No. | Method | Description |
|---|---|---|
| 1 | String getLanguage() | Returns the language identified by the current LanguageIdentifier object. |