Tika Concise Tutorial

TIKA - Content Extraction

Tika uses various parser libraries to extract content from the given documents. It chooses the right parser for the given document type.

For parsing documents, the parseToString() method of the Tika facade class is generally used. Shown below are the steps involved in the parsing process, which the parseToString() method abstracts away.

[Figure: parsing process]

The abstracted parsing process consists of the following steps −

  1. Initially when we pass a document to Tika, it uses a suitable type detection mechanism available with it and detects the document type.

  2. Once the document type is known, it chooses a suitable parser from its parser repository. The parser repository contains classes that make use of external libraries.

  3. Then the document is passed to the chosen parser, which parses the content, extracts the text, and throws exceptions for unreadable formats.
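
The three steps above can be sketched in plain Java. This is a toy model and not Tika's implementation: the ToyParser interface, the extension-based detectType() method, and the PARSERS map are hypothetical names invented here to illustrate type detection, parser lookup, and delegation.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Toy model of the detect -> select -> parse flow. All names are hypothetical;
// real type detection can also inspect the file content, not just the name.
class ParsingFlowSketch {

   // Hypothetical stand-in for a parser: turns raw bytes into text
   interface ToyParser {
      String parse(byte[] content);
   }

   // Step 2: the "parser repository", keyed by media type
   static final Map<String, ToyParser> PARSERS = new HashMap<>();
   static {
      PARSERS.put("text/plain", content -> new String(content, StandardCharsets.UTF_8));
      PARSERS.put("text/csv", content -> new String(content, StandardCharsets.UTF_8).replace(',', ' '));
   }

   // Step 1: naive type detection based on the file extension only
   static String detectType(String fileName) {
      if (fileName.endsWith(".txt")) return "text/plain";
      if (fileName.endsWith(".csv")) return "text/csv";
      return "application/octet-stream";
   }

   // Step 3: hand the document to the chosen parser, or fail for unknown formats
   static String extract(String fileName, byte[] content) {
      String type = detectType(fileName);
      ToyParser parser = PARSERS.get(type);
      if (parser == null) {
         throw new IllegalArgumentException("No parser for type: " + type);
      }
      return parser.parse(content);
   }

   public static void main(String[] args) {
      byte[] data = "Hi students welcome to tutorialspoint".getBytes(StandardCharsets.UTF_8);
      System.out.println("Extracted Content: " + extract("sample.txt", data));
   }
}
```

Keeping the lookup in a map means that supporting a new format only requires registering another entry; the calling code in extract() stays unchanged.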

Content Extraction using Tika

Given below is the program for extracting text from a file using Tika facade class −

import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class TikaExtraction {

   public static void main(final String[] args) throws IOException, TikaException {

      //Assume sample.txt is in your current directory
      File file = new File("sample.txt");

      //Instantiating Tika facade class
      Tika tika = new Tika();
      String filecontent = tika.parseToString(file);
      System.out.println("Extracted Content: " + filecontent);
   }
}

Save the above code as TikaExtraction.java and, with the Tika JAR on your classpath, compile and run it from the command prompt −

javac TikaExtraction.java
java TikaExtraction

Given below is the content of sample.txt.

Hi students welcome to tutorialspoint

It gives you the following output −

Extracted Content: Hi students welcome to tutorialspoint

Content Extraction using Parser Interface

The parser package of Tika provides several interfaces and classes using which we can parse a text document. Given below is the block diagram of the org.apache.tika.parser package.

[Figure: block diagram of the parser interface]

There are several parser classes available, e.g., PDFParser, Mp3Parser, OfficeParser, etc., each of which parses its own document type. All these classes implement the Parser interface.

CompositeParser

The given diagram shows Tika's general-purpose parser classes: CompositeParser and AutoDetectParser. Since the CompositeParser class follows the composite design pattern, you can use a group of parser instances as a single parser. The CompositeParser class also allows access to all the classes that implement the Parser interface.

AutoDetectParser

This is a subclass of CompositeParser and it provides automatic type detection. Using this functionality, the AutoDetectParser automatically sends the incoming documents to the appropriate parser classes using the composite methodology.

parse() method

Along with parseToString(), you can also use the parse() method of the Parser interface. The prototype of this method is shown below.

parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)

The following table lists the four objects it accepts as parameters.

Sr.No. | Object & Description
1 | InputStream stream − Any InputStream object that contains the content of the file.
2 | ContentHandler handler − Tika passes the document as XHTML content to this handler; the document is then processed using the SAX API. This allows efficient postprocessing of the contents of a document.
3 | Metadata metadata − The metadata object is used both as a source and a target of document metadata.
4 | ParseContext context − This object is used in cases where the client application wants to customize the parsing process.
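
To make the ContentHandler parameter more concrete, the sketch below feeds a small hand-written XHTML string through the JDK's own SAX parser and collects the text with a handler. It does not use Tika; the TextCollector class and the sample markup are assumptions made for illustration. A content handler passed to parse() receives SAX events in the same callback style.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxHandlerSketch {

   // Hypothetical handler that accumulates every text node it receives
   static class TextCollector extends DefaultHandler {
      private final StringBuilder text = new StringBuilder();

      @Override
      public void characters(char[] ch, int start, int length) {
         text.append(ch, start, length);
      }

      @Override
      public String toString() {
         return text.toString();
      }
   }

   // Parses the XHTML string with the JDK SAX parser and returns the collected text
   static String extractText(String xhtml) throws Exception {
      TextCollector handler = new TextCollector();
      SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xhtml)), handler);
      return handler.toString();
   }

   public static void main(String[] args) throws Exception {
      String xhtml = "<html><body><p>Hi students welcome to tutorialspoint</p></body></html>";
      System.out.println("File content : " + extractText(xhtml));
   }
}
```

Because the handler only reacts to events as they arrive, the whole document never has to be held in memory as a tree.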

Example

Given below is an example that shows how the parse() method is used.

Step 1

To use the parse() method of the parser interface, instantiate any of the classes providing the implementation for this interface.

There are individual parser classes such as PDFParser, OfficeParser, XMLParser, etc. You can use any of these individual document parsers. Alternatively, you can use either CompositeParser or AutoDetectParser, which use the individual parser classes internally and extract the contents of a document using a suitable parser.

Parser parser = new AutoDetectParser();
   (or)
Parser parser = new CompositeParser();
   (or)
object of any individual parsers given in Tika Library

Step 2

Create a handler class object. Given below are the three content handlers −

Sr.No. | Class & Description
1 | BodyContentHandler − This class picks the body part of the XHTML output and writes that content to an output writer or output stream. It can also forward that body content to another content handler instance.
2 | LinkContentHandler − This class detects and collects all the links (href attributes) in the XHTML document so that tools such as web crawlers can use them.
3 | TeeContentHandler − This class forwards the incoming SAX events to multiple content handlers, which helps in using multiple tools simultaneously.
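
The idea behind a link-collecting handler can be sketched with the JDK's SAX classes alone. The HrefCollector class below is a hypothetical, simplified stand-in and not Tika's LinkContentHandler: it only records the href attribute of "a" elements found in a small hand-written XHTML string.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class LinkHandlerSketch {

   // Hypothetical handler that remembers the href value of every <a> element
   static class HrefCollector extends DefaultHandler {
      final List<String> links = new ArrayList<>();

      @Override
      public void startElement(String uri, String localName, String qName, Attributes attributes) {
         if ("a".equals(qName)) {
            String href = attributes.getValue("href");
            if (href != null) {
               links.add(href);
            }
         }
      }
   }

   // Runs the JDK SAX parser over the XHTML string and returns the collected links
   static List<String> collectLinks(String xhtml) throws Exception {
      HrefCollector handler = new HrefCollector();
      SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xhtml)), handler);
      return handler.links;
   }

   public static void main(String[] args) throws Exception {
      String xhtml = "<html><body><a href=\"https://example.com/\">Home</a>"
            + "<p>Some text</p><a href=\"page.html\">Next</a></body></html>";
      System.out.println("Links : " + collectLinks(xhtml));
   }
}
```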

Since our target is to extract the text contents from a document, instantiate BodyContentHandler as shown below −

BodyContentHandler handler = new BodyContentHandler();

Step 3

Create the Metadata object as shown below −

Metadata metadata = new Metadata();

Step 4

Create an input stream object and pass it the file whose content should be extracted.

FileInputStream

Instantiate a file object by passing the file path as parameter and pass this object to the FileInputStream class constructor.

Note − The path passed to the file object should not contain spaces.

The problem with these input stream classes is that they don’t support random access reads, which is required to process some file formats efficiently. To resolve this problem, Tika provides TikaInputStream.

File file = new File(filepath);
FileInputStream inputstream = new FileInputStream(file);
   (or)
InputStream stream = TikaInputStream.get(new File(filepath));
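
The random-access limitation mentioned above can be observed with the JDK alone. In the sketch below (the class and method names are hypothetical), a plain FileInputStream reports that it cannot rewind via mark/reset, whereas a RandomAccessFile can jump straight to an arbitrary offset. How TikaInputStream addresses this internally is not shown here.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

class RandomAccessSketch {

   // Reads the single byte at the given offset by seeking directly to it
   static char charAt(File file, long offset) throws IOException {
      try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
         raf.seek(offset);
         return (char) raf.read();
      }
   }

   // Reports whether a plain FileInputStream can rewind via mark/reset
   static boolean plainStreamCanRewind(File file) throws IOException {
      try (FileInputStream in = new FileInputStream(file)) {
         return in.markSupported();
      }
   }

   public static void main(String[] args) throws IOException {
      // Temporary ASCII file so the example does not depend on existing files
      File file = File.createTempFile("sample", ".txt");
      file.deleteOnExit();
      Files.write(file.toPath(), "Hi students".getBytes(StandardCharsets.US_ASCII));

      System.out.println("mark/reset supported: " + plainStreamCanRewind(file));
      System.out.println("byte at offset 3: " + charAt(file, 3));
   }
}
```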

Step 5

Create a parse context object as shown below −

ParseContext context = new ParseContext();

Step 6

Invoke the parse() method on the parser object and pass all the required objects, as shown below −

parser.parse(inputstream, handler, metadata, context);

Given below is the program for content extraction using the parser interface −

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

import org.xml.sax.SAXException;

public class ParserExtraction {

   public static void main(final String[] args) throws IOException, SAXException, TikaException {

      //Assume sample.txt is in your current directory
      File file = new File("sample.txt");

      //parse method parameters
      Parser parser = new AutoDetectParser();
      BodyContentHandler handler = new BodyContentHandler();
      Metadata metadata = new Metadata();
      ParseContext context = new ParseContext();

      //parsing the file; try-with-resources closes the stream afterwards
      try (FileInputStream inputstream = new FileInputStream(file)) {
         parser.parse(inputstream, handler, metadata, context);
      }
      System.out.println("File content : " + handler.toString());
   }
}

Save the above code as ParserExtraction.java and run it from the command prompt −

javac ParserExtraction.java
java ParserExtraction

Given below is the content of sample.txt.

Hi students welcome to tutorialspoint

If you execute the above program, it will give you the following output −

File content : Hi students welcome to tutorialspoint