Apache POI Word: A Concise Tutorial
Apache POI Word - Text Extraction
This chapter explains how to extract simple text data from a Word document using Java. If you need to extract metadata from a Word document, use Apache Tika instead.
For .docx files, we use the class org.apache.poi.xwpf.extractor.XWPFWordExtractor, which extracts and returns the plain text of a Word file. Similar approaches exist for extracting headings, footnotes, table data, and so on from a Word file.
The following code shows how to extract simple text from a Word file −
import java.io.FileInputStream;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class WordExtractor {
   public static void main(String[] args) throws Exception {
      // Open the .docx file; try-with-resources closes the stream and document on exit
      try (FileInputStream in = new FileInputStream("createparagraph.docx");
           XWPFDocument docx = new XWPFDocument(in)) {
         // XWPFWordExtractor returns the document's text as a single String
         XWPFWordExtractor we = new XWPFWordExtractor(docx);
         System.out.println(we.getText());
      }
   }
}
Save the above code as WordExtractor.java, then compile and run it from the command prompt as follows, with the Apache POI JARs on your CLASSPATH −
$ javac WordExtractor.java
$ java WordExtractor
It will generate the following output −
At tutorialspoint.com, we strive hard to provide quality tutorials for self-learning purpose
in the domains of Academics, Information Technology, Management and Computer Programming Languages.
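The chapter mentions that headings, footnotes, and table data can be extracted in a similar way, but does not show how. The following is a minimal sketch of one possible approach, not a definitive implementation. It assumes a recent Apache POI release (4.x or later), reads "headings" as page headers, and uses the XWPFDocument accessors getHeaderList(), getFootnotes(), and getTables(). The class name StructuredTextSketch and the helper method tablesAsText are hypothetical names chosen for this illustration.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.StringJoiner;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFFootnote;
import org.apache.poi.xwpf.usermodel.XWPFHeader;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFTable;
import org.apache.poi.xwpf.usermodel.XWPFTableCell;
import org.apache.poi.xwpf.usermodel.XWPFTableRow;

// Hypothetical class name, used only for this sketch.
public class StructuredTextSketch {

   // Joins the cells of each table row with tabs, one row per line.
   // Only tables placed directly in the document body are visited.
   static String tablesAsText(XWPFDocument docx) {
      StringBuilder out = new StringBuilder();
      for (XWPFTable table : docx.getTables()) {
         for (XWPFTableRow row : table.getRows()) {
            StringJoiner cells = new StringJoiner("\t");
            for (XWPFTableCell cell : row.getTableCells()) {
               cells.add(cell.getText());
            }
            out.append(cells).append('\n');
         }
      }
      return out.toString();
   }

   public static void main(String[] args) throws IOException {
      // Assumption: the same sample file as in the example above
      String path = args.length > 0 ? args[0] : "createparagraph.docx";
      try (FileInputStream in = new FileInputStream(path);
           XWPFDocument docx = new XWPFDocument(in)) {
         // Page headers, if the document defines any
         for (XWPFHeader header : docx.getHeaderList()) {
            System.out.println("Header: " + header.getText());
         }
         // Footnotes, printed one paragraph at a time
         for (XWPFFootnote footnote : docx.getFootnotes()) {
            for (XWPFParagraph p : footnote.getParagraphs()) {
               System.out.println("Footnote: " + p.getText());
            }
         }
         // Table data as tab-separated rows
         System.out.print(tablesAsText(docx));
      }
   }
}
```

Keeping the table logic in its own method makes it possible to try it against a document built in memory, without needing a file on disk. A document without headers, footnotes, or tables should simply print nothing for that part.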