Tika Concise Tutorial
TIKA - Language Detection
Need for Language Detection
To classify the documents of a multilingual website by the language they are written in, a language detection tool is needed. Such a tool should accept documents that carry no language annotation (metadata), detect their language, and add that information to the document's metadata.
Algorithms for Profiling Corpus
What is Corpus?
To detect the language of a document, a language profile is constructed from it and compared with the profiles of the known languages. The set of texts that represents these known languages is called a corpus.
A corpus is a collection of texts in a written language that shows how the language is used in real situations.
A corpus is compiled from books, transcripts, and other data resources such as the Internet. The accuracy of the corpus depends on the profiling algorithm used to frame it.
What are Profiling Algorithms?
A common way of detecting languages is to use dictionaries: the words in a given piece of text are matched against the words in the dictionaries.
A list of the common words of a language is the simplest and most effective corpus for detecting that language, for example the articles a, an, and the in English.
Using Word Sets as Corpus
With word sets, a simple algorithm can be framed to find the distance between two corpora: the distance equals the sum of the differences between the frequencies of the matching words.
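The distance described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not code taken from Tika: the class name WordSetDistance and its methods are hypothetical, words are taken to be runs of letters, and frequencies are relative (count divided by total words).

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class WordSetDistance {

    // Relative frequency of each lower-cased word (a run of letters) in the text.
    static Map<String, Double> profile(String text) {
        Map<String, Double> freq = new HashMap<>();
        int total = 0;
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (word.isEmpty()) {
                continue; // split() yields an empty first token when the text starts with a non-letter
            }
            freq.merge(word, 1.0, Double::sum);
            total++;
        }
        final double n = total;
        freq.replaceAll((word, count) -> count / n);
        return freq;
    }

    // Sum of absolute frequency differences, taken over the words that occur in both profiles.
    static double distance(Map<String, Double> a, Map<String, Double> b) {
        double sum = 0.0;
        for (Map.Entry<String, Double> entry : a.entrySet()) {
            Double other = b.get(entry.getKey());
            if (other != null) {
                sum += Math.abs(entry.getValue() - other);
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Double> reference = profile("the cat and the dog");
        Map<String, Double> sample = profile("the bird");
        System.out.println(distance(reference, sample));
    }
}
```

Because only matching words contribute, two texts that share no words also get a distance of 0, so a practical version would have to track how many words matched as well.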
Such algorithms suffer from the following problems:
- Because few words match, the algorithm does not work well with short texts of only a few sentences. It needs a lot of text for an accurate match.
- It cannot detect word boundaries in languages that form compound words, or in languages that have no word dividers such as spaces or punctuation marks.
Because of these difficulties in using word sets as a corpus, individual characters or character groups are considered instead.
Using Character Sets as Corpus
Since the characters commonly used in a language are finite in number, it is easy to apply an algorithm based on character frequencies rather than word frequencies. This algorithm works even better when a character set is used by only one or very few languages.
This algorithm suffers from the following drawbacks:
- It is difficult to differentiate two languages that have similar character frequencies.
- When several languages share the same character set, no tool or algorithm can single out one of them using the character set alone as the corpus.
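Both the strength and the second drawback of the character-based approach can be seen in a small Java sketch. It is an illustration only, not Tika's implementation: the class name ScriptTally is hypothetical, and it uses the standard Character.UnicodeScript API to count the letters of a text per script. A text in Greek script points to very few languages, while a text in Latin script could be any of a large number of them.

```java
import java.util.EnumMap;
import java.util.Map;

public class ScriptTally {

    // Counts letters per Unicode script, ignoring digits, spaces and punctuation.
    static Map<Character.UnicodeScript, Integer> tally(String text) {
        Map<Character.UnicodeScript, Integer> counts = new EnumMap<>(Character.UnicodeScript.class);
        text.codePoints()
            .filter(Character::isLetter)
            .forEach(cp -> counts.merge(Character.UnicodeScript.of(cp), 1, Integer::sum));
        return counts;
    }

    // Returns the script with the most letters, or null if the text contains no letters.
    static Character.UnicodeScript dominantScript(String text) {
        Character.UnicodeScript best = null;
        int bestCount = 0;
        for (Map.Entry<Character.UnicodeScript, Integer> entry : tally(text).entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // A Greek word written with Unicode escapes, so the source file encoding does not matter
        System.out.println(dominantScript("\u039a\u03b1\u03bb\u03b7\u03bc\u03ad\u03c1\u03b1"));
        // English, but French, German or Dutch text would report the same script
        System.out.println(dominantScript("Hello world"));
    }
}
```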
N-gram Algorithm
The drawbacks stated above gave rise to a new approach: profiling the corpus with character sequences of a given length. Such character sequences are generally called N-grams, where N is the length of the sequence.
- The N-gram algorithm is an effective approach to language detection, especially for European languages such as English.
- This algorithm works well with short texts.
- Although more advanced profiling algorithms exist, with attractive features such as detecting several languages within one multilingual document, Tika uses the 3-gram algorithm because it is suitable in most practical situations.
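The 3-gram idea can be illustrated with a short, self-contained Java sketch. It is not Tika's implementation, and everything in it is an assumption made for illustration: the class name TrigramProfile, the normalization (lower-casing and replacing runs of non-letters with an underscore so that word boundaries become part of the trigrams), the distance measure (sum of absolute frequency differences, with a missing trigram counted as frequency 0), and the tiny reference texts, which stand in for a real corpus.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class TrigramProfile {

    // Relative frequency of each 3-character window in the normalized text.
    static Map<String, Double> profile(String text) {
        String words = text.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}]+", " ")
                .trim()
                .replace(' ', '_');
        String s = "_" + words + "_";
        Map<String, Double> freq = new HashMap<>();
        int total = s.length() - 2; // number of trigrams; 0 for an empty text
        for (int i = 0; i < total; i++) {
            freq.merge(s.substring(i, i + 3), 1.0, Double::sum);
        }
        final double n = total;
        freq.replaceAll((gram, count) -> count / n);
        return freq;
    }

    // Sum of absolute frequency differences over all trigrams of both profiles.
    static double distance(Map<String, Double> a, Map<String, Double> b) {
        double sum = 0.0;
        for (Map.Entry<String, Double> entry : a.entrySet()) {
            sum += Math.abs(entry.getValue() - b.getOrDefault(entry.getKey(), 0.0));
        }
        for (Map.Entry<String, Double> entry : b.entrySet()) {
            if (!a.containsKey(entry.getKey())) {
                sum += entry.getValue();
            }
        }
        return sum;
    }

    // Returns the key of the reference profile closest to the text, or null if there are none.
    static String closest(String text, Map<String, Map<String, Double>> references) {
        Map<String, Double> p = profile(text);
        String best = null;
        double bestDistance = Double.MAX_VALUE;
        for (Map.Entry<String, Map<String, Double>> entry : references.entrySet()) {
            double d = distance(p, entry.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy reference texts; a real corpus would be far larger
        Map<String, Map<String, Double>> references = new LinkedHashMap<>();
        references.put("en", profile("the quick brown fox jumps over the lazy dog and then the cat"));
        references.put("de", profile("der hund und die katze sind in dem haus mit der frau"));
        System.out.println(closest("the dog and the fox", references));
        System.out.println(closest("das ist der hund", references));
    }
}
```

Unlike the word-set distance, this measure also penalizes trigrams that appear in only one of the two profiles, so two texts with nothing in common end up at the maximum distance of 2 rather than at 0.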
Language Detection in Tika
Of the 184 languages standardized by ISO 639-1, Tika can detect 18. Language detection in Tika is done using the getLanguage() method of the LanguageIdentifier class. This method returns the language code as a String. Given below is the list of the 18 language-code pairs detected by Tika:
da - Danish
de - German
et - Estonian
el - Greek
en - English
es - Spanish
fi - Finnish
fr - French
hu - Hungarian
is - Icelandic
it - Italian
nl - Dutch
no - Norwegian
pl - Polish
pt - Portuguese
ru - Russian
sv - Swedish
th - Thai
When instantiating the LanguageIdentifier class, pass either the content whose language is to be detected, as a String, or a LanguageProfile object.
LanguageIdentifier object = new LanguageIdentifier("this is english");
Given below is an example program for language detection in Tika.
import org.apache.tika.language.LanguageIdentifier;

public class LanguageDetection {

   public static void main(String[] args) {
      // Build an identifier from the text whose language is to be detected
      LanguageIdentifier identifier = new LanguageIdentifier("this is english ");

      // getLanguage() returns the code of the best-matching language
      String language = identifier.getLanguage();
      System.out.println("Language of the given content is : " + language);
   }
}
Save the above code as LanguageDetection.java, then compile and run it from the command prompt using the following commands:
javac LanguageDetection.java
java LanguageDetection
If you execute the above program, it gives the following output:
Language of the given content is : en
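The value printed above is a two-letter language code. If a readable name is wanted instead, the standard java.util.Locale class can translate such a code; the sketch below is independent of Tika, and the class name LanguageCodeName and its displayName helper are hypothetical names chosen for illustration.

```java
import java.util.Locale;

public class LanguageCodeName {

    // Returns the English display name for a two-letter language code such as "en".
    static String displayName(String code) {
        return Locale.forLanguageTag(code).getDisplayLanguage(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        // In a real program the code would come from identifier.getLanguage()
        System.out.println("Language of the given content is : " + displayName("en"));
    }
}
```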
Language Detection of a Document
To detect the language of a given document, you first have to parse it using the parse() method. The parse() method parses the content and stores it in the handler object that is passed to it as one of the arguments. Then pass the String form of the handler object to the constructor of the LanguageIdentifier class, as shown below:
parser.parse(inputstream, handler, metadata, context);
LanguageIdentifier object = new LanguageIdentifier(handler.toString());
Given below is the complete program that demonstrates how to detect the language of a given document:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.language.LanguageIdentifier;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class DocumentLanguageDetection {

   public static void main(final String[] args) throws IOException, SAXException, TikaException {

      // Instantiating a file object
      File file = new File("Example.txt");

      // Parser method parameters
      Parser parser = new AutoDetectParser();
      BodyContentHandler handler = new BodyContentHandler();
      Metadata metadata = new Metadata();

      // Parsing the given document; try-with-resources closes the stream afterwards
      try (InputStream content = new FileInputStream(file)) {
         parser.parse(content, handler, metadata, new ParseContext());
      }

      // Detecting the language of the extracted text
      LanguageIdentifier object = new LanguageIdentifier(handler.toString());
      System.out.println("Language name :" + object.getLanguage());
   }
}
Save the above code as DocumentLanguageDetection.java, then compile and run it from the command prompt:
javac DocumentLanguageDetection.java
java DocumentLanguageDetection
Given below is the content of Example.txt.
Hi students welcome to tutorialspoint
If you execute the above program, it gives the following output:
Language name :en
Along with the Tika jar, Tika provides a Graphical User Interface (GUI) application and a Command Line Interface (CLI) application. Like other Java applications, a Tika application can also be executed from the command prompt.