Natural Language Processing: A Concise Tutorial
NLP - Linguistic Resources
In this chapter, we will learn about the linguistic resources used in Natural Language Processing.
Corpus
A corpus (plural: corpora) is a large, structured set of machine-readable texts produced in a natural communicative setting. Corpora can be derived in different ways, for example from text that was originally electronic, from transcripts of spoken language, or from optical character recognition (OCR).
Elements of Corpus Design
Language is infinite, but a corpus has to be finite in size. To keep the corpus finite while still ensuring a good design, we need to sample a wide range of text types and include them proportionally.
Let us now look at some important elements of corpus design −
Corpus Representativeness
Representativeness is a defining feature of corpus design. The following definitions from two prominent researchers, Leech and Biber, help clarify what corpus representativeness means −
- According to Leech (1991), “A corpus is thought to be representative of the language variety it is supposed to represent if the findings based on its contents can be generalized to the said language variety”.
- According to Biber (1993), “Representativeness refers to the extent to which a sample includes the full range of variability in a population”.
From these definitions, we can conclude that the representativeness of a corpus is determined by the following two factors −
- Balance − The range of genres included in a corpus.
- Sampling − How the chunks for each genre are selected.
Corpus Balance
Another very important element of corpus design is corpus balance, that is, the range of genres included in a corpus. As noted above, the representativeness of a general corpus depends on how balanced it is. A balanced corpus covers a wide range of text categories that are taken to be representative of the language. There is no reliable scientific measure of balance; in practice it rests on best estimates and intuition. In other words, what counts as an acceptable balance is determined only by the intended uses of the corpus.
Sampling
Another important element of corpus design is sampling. Corpus representativeness and balance are very closely associated with sampling, which is why sampling is inescapable in corpus building.
- According to Biber (1993), “Some of the first considerations in constructing a corpus concern the overall design: for example, the kinds of texts included, the number of texts, the selection of particular texts, the selection of text samples from within texts, and the length of text samples. Each of these involves a sampling decision, either conscious or not.”
While obtaining a representative sample, we need to consider the following −
- Sampling unit − The unit that is to be sampled. For written text, a sampling unit may be a newspaper, a journal or a book.
- Sampling frame − The list of all sampling units.
- Population − The assembly of all sampling units. It is defined in terms of language production, language reception or language as a product.
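The relationship between sampling units, the sampling frame and proportional inclusion can be illustrated with a minimal sketch. It assumes a hypothetical sampling frame of (unit, genre) pairs and uses proportional stratified sampling, which is only one of several possible sampling schemes; the names `proportional_sample` and `frame` are invented for this illustration.

```python
import random
from collections import defaultdict


def proportional_sample(sampling_frame, sample_size, seed=0):
    """Draw a sample whose genre proportions mirror those of the sampling frame."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable

    # Group the sampling units by genre.
    by_genre = defaultdict(list)
    for unit_id, genre in sampling_frame:
        by_genre[genre].append(unit_id)

    total = len(sampling_frame)
    sample = {}
    for genre, units in sorted(by_genre.items()):
        # Allocate slots in proportion to the genre's share of the frame,
        # keeping at least one unit per genre. Because of rounding, the
        # quotas may not always add up exactly to sample_size.
        quota = max(1, round(sample_size * len(units) / total))
        sample[genre] = sorted(rng.sample(units, min(quota, len(units))))
    return sample


# Hypothetical sampling frame: 100 sampling units across three genres.
frame = (
    [(f"news_{i}", "newspaper") for i in range(60)]
    + [(f"journal_{i}", "journal") for i in range(30)]
    + [(f"book_{i}", "book") for i in range(10)]
)

sample = proportional_sample(frame, sample_size=20)
for genre, units in sample.items():
    print(genre, len(units))
```

With this frame, a sample of 20 units keeps the 60/30/10 split of the population: 12 newspaper units, 6 journal units and 2 book units.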
Corpus Size
Another important element of corpus design is its size. How large should a corpus be? There is no specific answer to this question. The size of a corpus depends on the purpose for which it is intended, as well as on practical considerations such as the following −
- The kind of query anticipated from users.
- The methodology users apply to study the data.
- The availability of source data.
Corpus sizes have grown as technology has advanced. The following comparison table shows how typical corpus sizes have increased over time −
Year | Name of the Corpus | Size (in words)
1960s - 70s | Brown and LOB | 1 million
1980s | The Birmingham corpora | 20 million
1990s | The British National Corpus | 100 million
Early 21st century | The Bank of English corpus | 650 million
In the following sections, we will look at a few examples of corpora.
TreeBank Corpus
A Treebank may be defined as a linguistically parsed text corpus that annotates syntactic or semantic sentence structure. Geoffrey Leech coined the term ‘treebank’, reflecting the fact that the most common way of representing a grammatical analysis is a tree structure. Treebanks are generally created on top of a corpus that has already been annotated with part-of-speech tags.
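To make the idea of a tree-structured annotation concrete, here is a minimal sketch that reads one parsed sentence written in a bracketed notation and recovers its words and part-of-speech tags. The bracketed format, the example sentence and the helper names (`parse_bracketed`, `leaves`, `pos_tags`) are assumptions made for illustration; real Treebanks define their own formats and tag sets, and NLTK's `Tree.fromstring` offers a fuller implementation of the same idea.

```python
def parse_bracketed(text):
    """Parse a string like "(S (NP ...) (VP ...))" into (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "expected '('"
        pos += 1
        label = tokens[pos]  # syntactic category or part-of-speech tag
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())  # nested constituent
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return read_node()


def leaves(node):
    """Return the words of the sentence in order."""
    _, children = node
    words = []
    for child in children:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


def pos_tags(node):
    """Return (word, tag) pairs from nodes whose only child is a word."""
    label, children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if isinstance(child, tuple):
            pairs.extend(pos_tags(child))
    return pairs


tree = parse_bracketed(
    "(S (NP (DT the) (NN dog)) (VP (VBD chased) (NP (DT a) (NN cat))))"
)
print(leaves(tree))
print(pos_tags(tree))
```

The part-of-speech layer sits at the bottom of the tree, directly above the words, while the syntactic constituents (NP, VP, S) are built over it.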
Types of TreeBank Corpus
Semantic and syntactic Treebanks are the two most common types of Treebanks in linguistics. Let us now learn more about each type −
Semantic Treebanks
These Treebanks use a formal representation of a sentence's semantic structure, and they vary in the depth of that representation. Examples of semantic Treebanks include the Robot Commands Treebank, Geoquery, the Groningen Meaning Bank and the RoboCup Corpus.
Syntactic Treebanks
In contrast to semantic Treebanks, syntactic Treebanks annotate the syntactic structure of sentences. Systems built on such Treebanks take as input formal-language expressions obtained by converting parsed Treebank data, and they output meaning representations based on predicate logic. Syntactic Treebanks have been created for many languages. For example, the Penn Arabic Treebank and the Columbia Arabic Treebank are syntactic Treebanks for Arabic; the Sinica Treebank is a syntactic Treebank for Chinese; and Lucy, SUSANNE and BLLIP WSJ are syntactic corpora for English.
Applications of TreeBank Corpus
The following are some of the applications of Treebanks −
In Computational Linguistics
In computational linguistics, the main use of Treebanks is to engineer state-of-the-art natural language processing systems such as part-of-speech taggers, parsers, semantic analyzers and machine translation systems.
PropBank Corpus
PropBank, short for “Proposition Bank”, is a corpus annotated with verbal propositions and their arguments. It is a verb-oriented resource, and its annotations are closely tied to the syntactic level. It was developed by Martha Palmer et al. at the Department of Linguistics, University of Colorado Boulder. The term PropBank can also be used as a common noun referring to any corpus that has been annotated with propositions and their arguments.
The PropBank project has played a very significant role in Natural Language Processing (NLP), particularly as a resource for semantic role labeling.
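As a rough illustration of what "a verbal proposition and its arguments" looks like as data, the sketch below models a single annotated predicate. The `Proposition` class and its field names are hypothetical and do not reflect PropBank's actual file format. The numbered-argument labels follow the PropBank style, but the roleset and role glosses shown here are for illustration only and should be checked against the PropBank frame files.

```python
from dataclasses import dataclass, field


@dataclass
class Proposition:
    """One annotated predicate together with its labeled arguments."""
    predicate: str   # the verb as it appears in the sentence
    roleset: str     # verb sense identifier (illustrative value below)
    arguments: dict = field(default_factory=dict)  # role label -> text span

    def role_of(self, span):
        """Return the role label attached to a text span, or None."""
        for label, text in self.arguments.items():
            if text == span:
                return label
        return None


# Example sentence: "Mary gave John a book yesterday."
prop = Proposition(
    predicate="gave",
    roleset="give.01",
    arguments={
        "ARG0": "Mary",           # the giver
        "ARG1": "a book",         # the thing given
        "ARG2": "John",           # the entity given to
        "ARGM-TMP": "yesterday",  # temporal modifier
    },
)

print(prop.role_of("John"))
print(prop.role_of("the table"))
```

A semantic role labeling system aims to produce this kind of mapping automatically: given a sentence and a predicate, it assigns each argument span its role label.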
VerbNet(VN)
VerbNet (VN) is a hierarchical, domain-independent lexical resource, and the largest of its kind for English, that incorporates both semantic and syntactic information about its contents. VN is a broad-coverage verb lexicon with mappings to other lexical resources such as WordNet, XTAG and FrameNet. It is organized into verb classes that extend Levin's classes through refinement and the addition of subclasses, in order to achieve syntactic and semantic coherence among class members.
Each VerbNet (VN) class contains −
- A set of syntactic descriptions, or syntactic frames − These depict the possible surface realizations of the argument structure for constructions such as transitive, intransitive, prepositional phrases, resultatives, and a large set of diathesis alternations.
- A set of semantic descriptions such as animate, human, organization − These constrain the types of thematic roles allowed by the arguments. Further restrictions may be imposed to indicate the syntactic nature of the constituent likely to be associated with the thematic role.
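The two components of a verb class can be sketched as a small data structure. Everything here is hypothetical and heavily simplified: the `VerbClass` type, the class name, the member verbs, the frame strings and the restriction sets are invented for illustration and are not copied from the VerbNet release.

```python
from dataclasses import dataclass, field


@dataclass
class VerbClass:
    """A simplified, hypothetical stand-in for a VerbNet-style verb class."""
    name: str
    members: set
    frames: list  # syntactic frames (surface realizations)
    restrictions: dict = field(default_factory=dict)  # role -> acceptable features

    def accepts(self, role, features):
        """Check whether a filler with the given features may take the role."""
        allowed = self.restrictions.get(role)
        if allowed is None:
            return True  # no restriction recorded for this role
        return bool(allowed & set(features))


give_class = VerbClass(
    name="give-like (hypothetical)",
    members={"give", "lend", "pass"},
    frames=[
        "NP V NP PP.recipient",  # "Mary gave a book to John"
        "NP V NP.recipient NP",  # "Mary gave John a book"
    ],
    restrictions={
        "Agent": {"animate", "organization"},
        "Recipient": {"animate", "organization"},
    },
)

print(give_class.accepts("Agent", {"human", "animate"}))
print(give_class.accepts("Agent", {"inanimate"}))
print(give_class.accepts("Theme", {"inanimate"}))
```

The two frames in the example correspond to one diathesis alternation: the same verb and the same thematic roles, realized in two different surface orders.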
WordNet
WordNet, created at Princeton University, is a lexical database of the English language. It is distributed as one of the NLTK corpora. In WordNet, nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms called synsets. The synsets are linked by conceptual-semantic and lexical relations. This structure makes WordNet very useful for natural language processing (NLP).
In information systems, WordNet is used for purposes such as word-sense disambiguation, information retrieval, automatic text classification and machine translation. One of its most important uses is measuring the similarity between words. Various algorithms for this task have been implemented in packages such as WordNet::Similarity in Perl, NLTK in Python and ADW in Java.
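One simple family of similarity measures scores two concepts by how close they are in the hypernym ("is-a") hierarchy. The sketch below computes a path-based similarity, 1 / (1 + shortest path length), over a tiny hand-made taxonomy. The taxonomy and function names are invented for illustration and are not WordNet data. NLTK exposes a comparable measure as `path_similarity` on its synset objects, but using it requires the WordNet data to be downloaded first.

```python
from collections import deque

# Toy hypernym links (child -> parent); not taken from WordNet.
HYPERNYMS = {
    "dog": "canine",
    "wolf": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "animal",
}


def neighbours(node):
    """Return the concepts directly linked to a node, up or down the hierarchy."""
    linked = set()
    if node in HYPERNYMS:
        linked.add(HYPERNYMS[node])
    linked.update(child for child, parent in HYPERNYMS.items() if parent == node)
    return linked


def shortest_path_length(a, b):
    """Breadth-first search for the number of edges between two concepts."""
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbours(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two concepts are not connected


def path_similarity(a, b):
    """Score in (0, 1]: 1.0 for identical concepts, lower for distant ones."""
    dist = shortest_path_length(a, b)
    return None if dist is None else 1 / (1 + dist)


print(path_similarity("dog", "wolf"))
print(path_similarity("dog", "cat"))
```

In this toy hierarchy, "dog" and "wolf" share the immediate parent "canine" and are two edges apart, while "dog" and "cat" only meet at "carnivore" and are four edges apart, so the first pair receives the higher score.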