Natural Language Processing - Quick Guide

Natural Language Processing - Introduction

Language is a method of communication with the help of which we can speak, read and write. For example, we think, we make decisions, plans and more in natural language; precisely, in words. However, the big question that confronts us in this AI era is whether we can communicate in a similar manner with computers. In other words, can human beings communicate with computers in their natural language? It is a challenge for us to develop NLP applications because computers need structured data, but human speech is unstructured and often ambiguous in nature.

In this sense, we can say that Natural Language Processing (NLP) is the sub-field of Computer Science, especially Artificial Intelligence (AI), that is concerned with enabling computers to understand and process human language. Technically, the main task of NLP is to program computers to analyze and process huge amounts of natural language data.

History of NLP

We have divided the history of NLP into four phases. The phases have distinctive concerns and styles.

First Phase (Machine Translation Phase) - Late 1940s to late 1960s

The work done in this phase focused mainly on machine translation (MT). This phase was a period of enthusiasm and optimism.

Let us now see all that the first phase had in it −

  1. The research on NLP started in the early 1950s, after Booth & Richens’ investigation and Weaver’s memorandum on machine translation in 1949.

  2. 1954 was the year when a limited experiment on automatic translation from Russian to English was demonstrated in the Georgetown-IBM experiment.

  3. In the same year, the publication of the journal MT (Machine Translation) started.

  4. The first international conference on Machine Translation (MT) was held in 1952 and the second in 1956.

  5. In 1961, the work presented at the Teddington International Conference on Machine Translation of Languages and Applied Language Analysis was the high point of this phase.

Second Phase (AI Influenced Phase) – Late 1960s to late 1970s

In this phase, the work done was mainly related to world knowledge and its role in the construction and manipulation of meaning representations. That is why this phase is also called the AI-flavored phase.

This phase included the following −

  1. In early 1961, work began on the problems of addressing and constructing data or knowledge bases. This work was influenced by AI.

  2. In the same year, a BASEBALL question-answering system was also developed. The input to this system was restricted and the language processing involved was a simple one.

  3. A much more advanced system was described in Minsky (1968). Compared to the BASEBALL question-answering system, this system recognized and provided for the need for inference on the knowledge base in interpreting and responding to language input.

Third Phase (Grammatico-logical Phase) – Late 1970s to late 1980s

This phase can be described as the grammatico-logical phase. Due to the failure of practical system building in the last phase, researchers moved towards the use of logic for knowledge representation and reasoning in AI.

The third phase had the following in it −

  1. The grammatico-logical approach, towards the end of the decade, helped us with powerful general-purpose sentence processors like SRI’s Core Language Engine and Discourse Representation Theory, which offered a means of tackling more extended discourse.

  2. In this phase we got some practical resources & tools like parsers, e.g. the Alvey Natural Language Tools, along with more operational and commercial systems, e.g. for database query.

  3. The work on the lexicon in the 1980s also pointed in the direction of the grammatico-logical approach.

Fourth Phase (Lexical & Corpus Phase) – The 1990s

We can describe this as the lexical & corpus phase. The phase had a lexicalized approach to grammar that appeared in the late 1980s and became an increasing influence. There was a revolution in natural language processing in this decade with the introduction of machine learning algorithms for language processing.

Study of Human Languages

Language is a crucial component of human life and the most fundamental aspect of our behavior. We can experience it in mainly two forms − written and spoken. In the written form, it is a way to pass our knowledge from one generation to the next. In the spoken form, it is the primary medium for human beings to coordinate with each other in their day-to-day behavior. Language is studied in various academic disciplines. Each discipline comes with its own set of problems and a set of solutions to address them.

Consider the following table to understand this −

| Discipline | Problems | Tools |
| --- | --- | --- |
| Linguists | How can phrases and sentences be formed with words? What constrains the possible meanings of a sentence? | Intuitions about well-formedness and meaning; mathematical models of structure, e.g., model-theoretic semantics, formal language theory. |
| Psycholinguists | How do human beings identify the structure of sentences? How can the meaning of words be identified? When does understanding take place? | Experimental techniques, mainly for measuring the performance of human beings; statistical analysis of observations. |
| Philosophers | How do words and sentences acquire meaning? How are objects identified by words? What is meaning? | Natural language argumentation using intuition; mathematical models like logic and model theory. |
| Computational Linguists | How can we identify the structure of a sentence? How can knowledge and reasoning be modeled? How can we use language to accomplish specific tasks? | Algorithms; data structures; formal models of representation and reasoning; AI techniques like search and representation methods. |

Ambiguity and Uncertainty in Language

Ambiguity, as the term is used in natural language processing, refers to the capability of being understood in more than one way. Natural language is very ambiguous. NLP has to deal with the following types of ambiguity −

Lexical Ambiguity

The ambiguity of a single word is called lexical ambiguity. For example, treating the word silver as a noun, an adjective, or a verb.

Syntactic Ambiguity

This kind of ambiguity occurs when a sentence can be parsed in different ways. For example, consider the sentence “The man saw the girl with the telescope”. It is ambiguous whether the man saw the girl carrying a telescope or whether he saw her through his telescope.

Semantic Ambiguity

This kind of ambiguity occurs when the meaning of the words themselves can be misinterpreted. In other words, semantic ambiguity happens when a sentence contains an ambiguous word or phrase. For example, the sentence “The car hit the pole while it was moving” has semantic ambiguity because the interpretations can be “The car, while moving, hit the pole” and “The car hit the pole while the pole was moving”.

Anaphoric Ambiguity

This kind of ambiguity arises due to the use of anaphoric expressions in discourse. For example, consider the discourse “The horse ran up the hill. It was very steep. It soon got tired.” Here, the anaphoric reference of “it” in the two situations causes ambiguity.

Pragmatic ambiguity

Such kind of ambiguity refers to the situation where the context of a phrase gives it multiple interpretations. In simple words, we can say that pragmatic ambiguity arises when the statement is not specific. For example, the sentence “I like you too” can have multiple interpretations, such as I like you (just like you like me) and I like you (just like someone else does).

NLP Phases

The following diagram shows the phases, or logical steps, in natural language processing −

[Figure: the phases or logical steps of NLP]

Morphological Processing

It is the first phase of NLP. The purpose of this phase is to break chunks of language input into sets of tokens corresponding to paragraphs, sentences and words. For example, a word like “uneasy” can be broken into two sub-word tokens, “un” and “easy”.
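
A minimal sketch of the sentence- and word-level tokenization in this phase, using NLTK (this assumes the punkt tokenizer models have already been downloaded):

```python
import nltk

# nltk.download('punkt')  # one-time download of the sentence tokenizer models

text = "The quick brown fox jumped over the lazy dog. It was uneasy."

# Break the input chunk into sentence tokens, then into word tokens.
sentences = nltk.sent_tokenize(text)
words = [nltk.word_tokenize(sentence) for sentence in sentences]

print(sentences)  # ['The quick brown fox jumped over the lazy dog.', 'It was uneasy.']
print(words[1])   # ['It', 'was', 'uneasy', '.']
```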

Syntax Analysis

It is the second phase of NLP. The purpose of this phase is twofold: to check whether a sentence is well formed, and to break it up into a structure that shows the syntactic relationships between the different words. For example, a sentence like “The school goes to the boy” would be rejected by the syntax analyzer or parser.

Semantic Analysis

It is the third phase of NLP. The purpose of this phase is to draw the exact meaning, or you can say the dictionary meaning, from the text. The text is checked for meaningfulness. For example, the semantic analyzer would reject a sentence like “Hot ice-cream”.

Pragmatic Analysis

It is the fourth phase of NLP. Pragmatic analysis fits the actual objects/events that exist in a given context with the object references obtained during the last phase (semantic analysis). For example, the sentence “Put the banana in the basket on the shelf” can have two semantic interpretations, and the pragmatic analyzer will choose between these two possibilities.

NLP - Linguistic Resources

In this chapter, we will learn about the linguistic resources in Natural Language Processing.

Corpus

A corpus is a large and structured set of machine-readable texts that have been produced in a natural communicative setting. Its plural is corpora. Corpora can be derived in different ways, such as from text that was originally electronic, transcripts of spoken language, optical character recognition, etc.

Elements of Corpus Design

Language is infinite but a corpus has to be finite in size. For the corpus to be finite in size, we need to sample and proportionally include a wide range of text types to ensure a good corpus design.

Let us now learn about some important elements for corpus design −

Corpus Representativeness

Representativeness is a defining feature of corpus design. The following definitions from two great researchers, Leech and Biber, will help us understand corpus representativeness −

  1. According to Leech (1991), “A corpus is thought to be representative of the language variety it is supposed to represent if the findings based on its contents can be generalized to the said language variety”.

  2. According to Biber (1993), “Representativeness refers to the extent to which a sample includes the full range of variability in a population”.

In this way, we can conclude that the representativeness of a corpus is determined by the following two factors −

  1. Balance − The range of genres included in a corpus.

  2. Sampling − How the chunks for each genre are selected.

Corpus Balance

Another very important element of corpus design is corpus balance − the range of genres included in a corpus. We have already studied that the representativeness of a general corpus depends upon how balanced the corpus is. A balanced corpus covers a wide range of text categories that are supposed to be representative of the language. We do not have any reliable scientific measure of balance; the best estimation and intuition work in this concern. In other words, we can say that the accepted balance is determined by its intended uses only.

Sampling

Another important element of corpus design is sampling. Corpus representativeness and balance are very closely associated with sampling. That is why we can say that sampling is inescapable in corpus building.

  1. According to Biber(1993), “Some of the first considerations in constructing a corpus concern the overall design: for example, the kinds of texts included, the number of texts, the selection of particular texts, the selection of text samples from within texts, and the length of text samples. Each of these involves a sampling decision, either conscious or not.”

While obtaining a representative sample, we need to consider the following −

  1. Sampling unit − It refers to the unit which is sampled. For example, for written text, a sampling unit may be a newspaper, a journal or a book.

  2. Sampling frame − The list of all sampling units is called a sampling frame.

  3. Population − It may be referred to as the assembly of all sampling units. It is defined in terms of language production, language reception or language as a product.

Corpus Size

Another important element of corpus design is its size. How large should the corpus be? There is no specific answer to this question. The size of the corpus depends upon the purpose for which it is intended, as well as on some practical considerations, as follows −

  1. Kind of query anticipated from the user.

  2. The methodology used by the users to study the data.

  3. Availability of the source of data.

With the advancement of technology, corpus size has also increased. The following table of comparison will help you understand how corpus size has grown −

| Year | Name of the Corpus | Size (in words) |
| --- | --- | --- |
| 1960s - 70s | Brown and LOB | 1 million words |
| 1980s | The Birmingham corpora | 20 million words |
| 1990s | The British National Corpus | 100 million words |
| Early 21st century | The Bank of English corpus | 650 million words |

In our subsequent sections, we will look at a few examples of corpora.

TreeBank Corpus

It may be defined as a linguistically parsed text corpus that annotates syntactic or semantic sentence structure. Geoffrey Leech coined the term ‘treebank’, which reflects that the most common way of representing the grammatical analysis is by means of a tree structure. Generally, treebanks are created on top of a corpus that has already been annotated with part-of-speech tags.

Types of TreeBank Corpus

Semantic and Syntactic Treebanks are the two most common types of Treebanks in linguistics. Let us now learn more about these types −

Semantic Treebanks

These treebanks use a formal representation of a sentence’s semantic structure. They vary in the depth of their semantic representation. The Robot Commands Treebank, Geoquery, the Groningen Meaning Bank and the RoboCup Corpus are some examples of semantic treebanks.

Syntactic Treebanks

Opposite to semantic treebanks, inputs to syntactic treebank systems are expressions of the formal language obtained from the conversion of parsed treebank data. The outputs of such systems are predicate-logic-based meaning representations. Various syntactic treebanks in different languages have been created so far. For example, the Penn Arabic Treebank and the Columbia Arabic Treebank are syntactic treebanks created for the Arabic language, Sininca is a syntactic treebank created for Chinese, and Lucy, Susane and BLLIP WSJ are syntactic corpora created for English.

Applications of TreeBank Corpus

Following are some of the applications of treebanks −

In Computational Linguistics

If we talk about computational linguistics, then the best use of treebanks is to engineer state-of-the-art natural language processing systems such as part-of-speech taggers, parsers, semantic analyzers and machine translation systems.

In Corpus Linguistics

In corpus linguistics, the best use of treebanks is to study syntactic phenomena.

In Theoretical Linguistics and Psycholinguistics

The best use of treebanks in theoretical linguistics and psycholinguistics is as interaction evidence.

PropBank Corpus

PropBank, more specifically called “Proposition Bank”, is a corpus that is annotated with verbal propositions and their arguments. The corpus is a verb-oriented resource; the annotations here are more closely related to the syntactic level. Martha Palmer et al., Department of Linguistics, University of Colorado Boulder, developed it. We can use the term propbank as a common noun referring to any corpus that has been annotated with propositions and their arguments.

In Natural Language Processing (NLP), the PropBank project has played a very significant role. It helps in semantic role labeling.
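
NLTK also bundles a sample of PropBank; a brief sketch of browsing it (assuming the propbank and treebank corpus data have been downloaded):

```python
from nltk.corpus import propbank

# Each PropBank instance annotates one verb in the Penn Treebank with a
# roleset (a coarse verb sense) and its labeled arguments.
inst = propbank.instances()[0]
print(inst.roleset)                  # the verb sense identifier
for argloc, argid in inst.arguments:
    print(argid, argloc)             # semantic role label and tree position
```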

VerbNet(VN)

VerbNet (VN) is the hierarchical, domain-independent and largest lexical resource in English that incorporates both semantic and syntactic information about its contents. VN is a broad-coverage verb lexicon with mappings to other lexical resources such as WordNet, Xtag and FrameNet. It is organized into verb classes that extend Levin’s classes by refinement and the addition of subclasses to achieve syntactic and semantic coherence among class members.

Each VerbNet (VN) class contains −

A set of syntactic descriptions or syntactic frames

For depicting the possible surface realizations of the argument structure for constructions such as transitive, intransitive, prepositional phrases, resultatives, and a large set of diathesis alternations.

A set of semantic descriptions such as animate, human, organization

For constraining the types of thematic roles allowed by the arguments; further restrictions may be imposed. This helps indicate the syntactic nature of the constituent likely to be associated with the thematic role.

WordNet

WordNet, created at Princeton, is a lexical database for the English language. It is part of the NLTK corpus collection. In WordNet, nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms called synsets. All the synsets are linked with the help of conceptual-semantic and lexical relations. Its structure makes it very useful for natural language processing (NLP).

In information systems, WordNet is used for various purposes, such as word-sense disambiguation, information retrieval, automatic text classification and machine translation. One of the most important uses of WordNet is to find the similarity among words. For this task, various algorithms have been implemented in various packages, like Similarity in Perl, NLTK in Python and ADW in Java.
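
For instance, a minimal sketch with NLTK’s WordNet interface (assuming the wordnet corpus data has been downloaded):

```python
from nltk.corpus import wordnet as wn

# Every synset (set of cognitive synonyms) containing the word 'car'.
for synset in wn.synsets('car'):
    print(synset.name(), '-', synset.definition())

# Word similarity: Wu-Palmer score between two noun senses.
dog, cat = wn.synset('dog.n.01'), wn.synset('cat.n.01')
print(dog.wup_similarity(cat))   # closer to 1.0 for closely related senses
```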

NLP - Word Level Analysis

In this chapter, we will understand word level analysis in Natural Language Processing.

Regular Expressions

A regular expression (RE) is a language for specifying text search strings. An RE helps us to match or find other strings or sets of strings, using a specialized syntax held in a pattern. Regular expressions are used to search texts in UNIX as well as in MS WORD in an identical way. Various search engines also use a number of RE features.

Properties of Regular Expressions

Following are some of the important properties of REs −

  1. American Mathematician Stephen Cole Kleene formalized the Regular Expression language.

  2. RE is a formula in a special language, which can be used for specifying simple classes of strings, a sequence of symbols. In other words, we can say that RE is an algebraic notation for characterizing a set of strings.

  3. A regular expression requires two things: one is the pattern that we wish to search for, and the other is a corpus of text to search in, as the sketch after this list illustrates.
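
A minimal sketch of these two inputs with Python’s re module (the pattern and corpus here are illustrative assumptions):

```python
import re

pattern = r'woodchucks?'   # the pattern to search for; 's?' makes the plural optional
corpus = "How much wood would a woodchuck chuck? Woodchucks would chuck wood."

print(re.findall(pattern, corpus))                 # ['woodchuck']
print(re.findall(pattern, corpus, re.IGNORECASE))  # ['woodchuck', 'Woodchucks']
```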

Mathematically, a regular expression can be defined as follows −

  1. ε is a Regular Expression, which indicates that the language contains the empty string.

  2. φ is a Regular Expression, which denotes the empty language.

  3. If X and Y are Regular Expressions, then the following are also Regular Expressions −

     X.Y (concatenation of X and Y)

     X + Y (union of X and Y)

     X*, Y* (Kleene closure of X and Y)

  4. If a string is derived from the above rules, then that would also be a regular expression.

Examples of Regular Expressions

The following table shows a few examples of Regular Expressions −

| Regular Expression | Regular Set |
| --- | --- |
| (0 + 10*) | {0, 1, 10, 100, 1000, 10000, …} |
| (0*10*) | {1, 01, 10, 010, 0010, …} |
| (0 + ε)(1 + ε) | {ε, 0, 1, 01} |
| (a+b)* | The set of strings of a’s and b’s of any length, which also includes the null string, i.e. {ε, a, b, aa, ab, bb, ba, aaa, …} |
| (a+b)*abb | The set of strings of a’s and b’s ending with the string abb, i.e. {abb, aabb, babb, aaabb, ababb, …} |
| (11)* | The set consisting of an even number of 1’s, which also includes the empty string, i.e. {ε, 11, 1111, 111111, …} |
| (aa)*(bb)*b | The set of strings consisting of an even number of a’s followed by an odd number of b’s, i.e. {b, aab, aabbb, aabbbbb, aaaab, aaaabbb, …} |
| (aa + ab + ba + bb)* | The set of strings of a’s and b’s of even length obtained by concatenating any combination of aa, ab, ba and bb, including the null string, i.e. {ε, aa, ab, ba, bb, aaab, aaba, …} |

Regular Sets & Their Properties

It may be defined as the set that represents the value of a regular expression and possesses specific properties.

Properties of regular sets

  1. If we do the union of two regular sets then the resulting set would also be regular.

  2. If we do the intersection of two regular sets then the resulting set would also be regular.

  3. If we do the complement of regular sets, then the resulting set would also be regular.

  4. If we do the difference of two regular sets, then the resulting set would also be regular.

  5. If we do the reversal of regular sets, then the resulting set would also be regular.

  6. If we take the closure of regular sets, then the resulting set would also be regular.

  7. If we do the concatenation of two regular sets, then the resulting set would also be regular.

Finite State Automata

The term automata, derived from the Greek word "αὐτόματα" meaning "self-acting", is the plural of automaton which may be defined as an abstract self-propelled computing device that follows a predetermined sequence of operations automatically.

An automaton having a finite number of states is called a Finite Automaton (FA) or Finite State automata (FSA).

Mathematically, an automaton can be represented by a 5-tuple (Q, Σ, δ, q0, F), where −

  1. Q is a finite set of states.

  2. Σ is a finite set of symbols, called the alphabet of the automaton.

  3. δ is the transition function

  4. q0 is the initial state from where any input is processed (q0 ∈ Q).

  5. F is a set of final state/states of Q (F ⊆ Q).

Relation between Finite Automata, Regular Grammars and Regular Expressions

The following points will give us a clear view of the relationship between finite automata, regular grammars and regular expressions −

  1. As we know, finite state automata are the theoretical foundation of computational work, and regular expressions are one way of describing them.

  2. We can say that any regular expression can be implemented as an FSA and any FSA can be described with a regular expression.

  3. On the other hand, a regular expression is a way to characterize a kind of language called a regular language. Hence, we can say that a regular language can be described with the help of both FSA and regular expressions.

  4. Regular grammar, a formal grammar that can be right-regular or left-regular, is another way to characterize regular languages.

The following diagram shows that finite automata, regular expressions and regular grammars are equivalent ways of describing regular languages.

[Figure: equivalence of finite automata, regular expressions and regular grammars]

Types of Finite State Automata (FSA)

Finite state automata are of two types. Let us see what the types are.

Deterministic Finite Automaton (DFA)

It may be defined as the type of finite automaton wherein, for every input symbol, we can determine the state to which the machine will move. It has a finite number of states; that is why the machine is called a Deterministic Finite Automaton (DFA).

Mathematically, a DFA can be represented by a 5-tuple (Q, Σ, δ, q0, F), where −

  1. Q is a finite set of states.

  2. Σ is a finite set of symbols, called the alphabet of the automaton.

  3. δ is the transition function where δ: Q × Σ → Q .

  4. q0 is the initial state from where any input is processed (q0 ∈ Q).

  5. F is a set of final state/states of Q (F ⊆ Q).

Graphically, a DFA can be represented by digraphs called state diagrams, where −

  1. The states are represented by vertices.

  2. The transitions are shown by labeled arcs.

  3. The initial state is represented by an empty incoming arc.

  4. The final state is represented by a double circle.

Example of DFA

Suppose a DFA is defined as follows −

  1. Q = {a, b, c},

  2. Σ = {0, 1},

  3. q0 = {a},

  4. F = {c},

  5. Transition function δ is shown in the table as follows −

| Current State | Next State for Input 0 | Next State for Input 1 |
| --- | --- | --- |
| a | a | b |
| b | b | a |
| c | c | c |

The graphical representation of this DFA would be as follows −

[Figure: state diagram of the DFA]
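
A minimal sketch of simulating this DFA in Python, with the transition table above encoded as a dictionary:

```python
# States {a, b, c}, alphabet {0, 1}, start state a, final state c.
delta = {
    ('a', '0'): 'a', ('a', '1'): 'b',
    ('b', '0'): 'b', ('b', '1'): 'a',
    ('c', '0'): 'c', ('c', '1'): 'c',
}
start, finals = 'a', {'c'}

def accepts(string):
    """Deterministically follow exactly one transition per input symbol."""
    state = start
    for symbol in string:
        state = delta[(state, symbol)]
    return state in finals

print(accepts('0110'))  # False; with the table as given, c is unreachable from a
```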

Non-deterministic Finite Automaton (NDFA)

It may be defined as the type of finite automaton where, for every input symbol, we cannot determine the state to which the machine will move, i.e., the machine can move to any combination of states. It has a finite number of states; that is why the machine is called a Non-deterministic Finite Automaton (NDFA).

Mathematically, an NDFA can be represented by a 5-tuple (Q, Σ, δ, q0, F), where −

  1. Q is a finite set of states.

  2. Σ is a finite set of symbols, called the alphabet of the automaton.

  3. δ is the transition function where δ: Q × Σ → 2^Q.

  4. q0 is the initial state from where any input is processed (q0 ∈ Q).

  5. F is a set of final state/states of Q (F ⊆ Q).

Graphically (same as for a DFA), an NDFA can be represented by digraphs called state diagrams, where −

  1. The states are represented by vertices.

  2. The transitions are shown by labeled arcs.

  3. The initial state is represented by an empty incoming arc.

  4. The final state is represented by a double circle.

Example of NDFA

Suppose an NDFA is defined as follows −

  1. Q = {a, b, c},

  2. Σ = {0, 1},

  3. q0 = {a},

  4. F = {c},

  5. Transition function δ is shown in the table as follows −

| Current State | Next State for Input 0 | Next State for Input 1 |
| --- | --- | --- |
| a | a, b | b |
| b | c | a, c |
| c | b, c | c |

The graphical representation of this NDFA would be as follows −

[Figure: state diagram of the NDFA]
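
A matching sketch for this NDFA: the transition function now returns a set of states, so the simulation tracks every state the machine could be in:

```python
# delta maps (state, symbol) to a SET of possible next states.
delta = {
    ('a', '0'): {'a', 'b'}, ('a', '1'): {'b'},
    ('b', '0'): {'c'},      ('b', '1'): {'a', 'c'},
    ('c', '0'): {'b', 'c'}, ('c', '1'): {'c'},
}
start, finals = 'a', {'c'}

def accepts(string):
    """Accept if at least one possible run ends in a final state."""
    states = {start}
    for symbol in string:
        states = set().union(*(delta[(s, symbol)] for s in states))
    return bool(states & finals)

print(accepts('01'))  # True: one possible run is a -0-> b -1-> c
```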

Morphological Parsing

The term morphological parsing is related to the parsing of morphemes. We can define morphological parsing as the problem of recognizing that a word breaks down into smaller meaningful units called morphemes, producing some sort of linguistic structure for it. For example, we can break the word foxes into two morphemes, fox and -es.

In another sense, we can say that morphology is the study of −

  1. The formation of words.

  2. The origin of the words.

  3. Grammatical forms of the words.

  4. Use of prefixes and suffixes in the formation of words.

  5. How parts-of-speech (PoS) of a language are formed.

Types of Morphemes

Morphemes, the smallest meaning-bearing units, can be divided into two types −

  1. Stems

  2. Word Order

Stems

It is the core meaningful unit of a word. We can also say that it is the root of the word. For example, in the word foxes, the stem is fox.

  1. Affixes − As the name suggests, they add some additional meaning and grammatical functions to words. For example, in the word foxes, the affix is -es.

Further, affixes can also be divided into the following four types −

  1. Prefixes − attached before the stem, e.g., un- in unbuckle.

  2. Suffixes − attached after the stem, e.g., -s in cats.

  3. Infixes − inserted inside the stem.

  4. Circumfixes − attached on both sides of the stem.

Word Order

The order of the words would be decided by morphological parsing. Let us now see the requirements for building a morphological parser −

Lexicon

The very first requirement for building a morphological parser is a lexicon, which includes the list of stems and affixes along with basic information about them − for example, information like whether the stem is a noun stem or a verb stem, etc.

Morphotactics

It is basically the model of morpheme ordering − the model explaining which classes of morphemes can follow other classes of morphemes inside a word. For example, a morphotactic fact is that the English plural morpheme always follows the noun rather than preceding it.

Orthographic rules

These spelling rules are used to model the changes occurring in a word. For example, the rule of converting y to ie in words like city+s = cities, not citys.

Natural Language Processing - Syntactic Analysis

Syntactic analysis, or parsing, or syntax analysis, is the phase of NLP that follows word level (morphological) analysis. The purpose of this phase is to check that a sentence is well formed and to recover its grammatical structure by comparing it to the rules of formal grammar. Meaningfulness is checked later: for example, a sentence like “hot ice-cream” would be rejected by the semantic analyzer rather than by the parser.

In this sense, syntactic analysis or parsing may be defined as the process of analyzing strings of symbols in natural language for conformity to the rules of formal grammar. The word ‘parsing’ comes from the Latin word ‘pars’, which means ‘part’.

Concept of Parser

It is used to implement the task of parsing. It may be defined as the software component designed to take input data (text) and give a structural representation of the input after checking for correct syntax as per formal grammar. It also builds a data structure, generally in the form of a parse tree, an abstract syntax tree or another hierarchical structure.

[Figure: parser architecture with symbol table]

The main roles of the parser include −

  1. To report any syntax error.

  2. To recover from commonly occurring errors so that the processing of the remainder of the program can be continued.

  3. To create parse tree.

  4. To create symbol table.

  5. To produce intermediate representations (IR).

Types of Parsing

Derivation divides parsing into the following two types −

  1. Top-down Parsing

  2. Bottom-up Parsing

Top-down Parsing

In this kind of parsing, the parser starts constructing the parse tree from the start symbol and then tries to transform the start symbol to the input. The most common form of top-down parsing uses recursive procedures to process the input. The main disadvantage of recursive descent parsing is backtracking.

Bottom-up Parsing

In this kind of parsing, the parser starts with the input symbols and tries to construct the parse tree up to the start symbol.

Concept of Derivation

In order to get the input string, we need a sequence of production rules. A derivation is a set of production rules. During parsing, we need to decide which non-terminal is to be replaced, along with the production rule by means of which it will be replaced.

Types of Derivation

In this section, we will learn about the two types of derivations, which can be used to decide which non-terminal is to be replaced with a production rule −

Left-most Derivation

In the left-most derivation, the sentential form of an input is scanned and replaced from the left to the right. The sentential form in this case is called the left-sentential form.

Right-most Derivation

In the right-most derivation, the sentential form of an input is scanned and replaced from right to left. The sentential form in this case is called the right-sentential form.

Concept of Parse Tree

It may be defined as the graphical depiction of a derivation. The start symbol of the derivation serves as the root of the parse tree. In every parse tree, the leaf nodes are terminals and the interior nodes are non-terminals. A property of parse trees is that in-order traversal will produce the original input string.

Concept of Grammar

Grammars are essential for describing the syntactic structure of well-formed programs. In the literary sense, they denote syntactical rules for conversation in natural languages. Linguists have attempted to define grammars since the inception of natural languages like English, Hindi, etc.

The theory of formal languages is also applicable in the fields of Computer Science mainly in programming languages and data structure. For example, in ‘C’ language, the precise grammar rules state how functions are made from lists and statements.

A mathematical model of grammar was given by Noam Chomsky in 1956, which is effective for writing computer languages.

Mathematically, a grammar G can be formally written as a 4-tuple (N, T, S, P) where −

  1. N or VN = set of non-terminal symbols, i.e., variables.

  2. T or ∑ = set of terminal symbols.

  3. S = Start symbol where S ∈ N

  4. P denotes the production rules for terminals as well as non-terminals. It has the form α → β, where α and β are strings over VN ∪ ∑, and at least one symbol of α belongs to VN.

Phrase Structure or Constituency Grammar

Phrase structure grammar, introduced by Noam Chomsky, is based on the constituency relation. That is why it is also called constituency grammar. It is opposite to dependency grammar.

Example

Before giving an example of constituency grammar, we need to know the fundamental points about constituency grammar and constituency relation.

  1. All the related frameworks view the sentence structure in terms of constituency relation.

  2. The constituency relation is derived from the subject-predicate division of Latin as well as Greek grammar.

  3. The basic clause structure is understood in terms of noun phrase NP and verb phrase VP.

We can write the sentence “This tree is illustrating the constituency relation” as follows −

[Figure: constituency-based parse tree of the sentence]

Dependency Grammar

It is opposite to constituency grammar and based on the dependency relation. It was introduced by Lucien Tesniere. Dependency grammar (DG) is opposite to constituency grammar because it lacks phrasal nodes.

Example

Before giving an example of Dependency grammar, we need to know the fundamental points about Dependency grammar and Dependency relation.

  1. In DG, the linguistic units, i.e., words are connected to each other by directed links.

  2. The verb becomes the center of the clause structure.

  3. Every other syntactic unit is connected to the verb in terms of a directed link. These syntactic units are called dependencies.

We can write the sentence “This tree is illustrating the dependency relation” as follows −

[Figure: dependency-based parse tree of the sentence]

A parse tree that uses constituency grammar is called a constituency-based parse tree, and parse trees that use dependency grammar are called dependency-based parse trees.

Context Free Grammar

Context free grammar, also called CFG, is a notation for describing languages and a superset of regular grammar. It can be seen in the following diagram −

[Figure: context free grammar as a superset of regular grammar]

Definition of CFG

A CFG consists of a finite set of grammar rules with the following four components −

Set of Non-terminals

It is denoted by V. The non-terminals are syntactic variables that denote sets of strings, which further help define the language generated by the grammar.

Set of Terminals

Terminals are also called tokens, and the set is denoted by Σ. Strings are formed from the basic symbols of terminals.

Set of Productions

It is denoted by P. The set defines how the terminals and non-terminals can be combined. Every production (P) consists of non-terminals, an arrow, and terminals (a sequence of terminals). Non-terminals are called the left side of the production and terminals are called the right side of the production.

Start Symbol

The derivation begins from the start symbol, which is denoted by S. A non-terminal symbol is always designated as the start symbol.

Natural Language Processing - Semantic Analysis

The purpose of semantic analysis is to draw the exact meaning, or you can say the dictionary meaning, from the text. The work of a semantic analyzer is to check the text for meaningfulness.

We already know that lexical analysis also deals with the meaning of words, so how is semantic analysis different from lexical analysis? Lexical analysis is based on smaller tokens, while semantic analysis focuses on larger chunks. That is why semantic analysis can be divided into the following two parts −

Studying meaning of individual word

It is the first part of semantic analysis, in which the study of the meaning of individual words is performed. This part is called lexical semantics.

Studying the combination of individual words

In the second part, the individual words will be combined to provide meaning in sentences.

The most important task of semantic analysis is to get the proper meaning of the sentence. For example, analyze the sentence “Ram is great.” In this sentence, the speaker is talking either about Lord Ram or about a person whose name is Ram. That is why the job of the semantic analyzer − to get the proper meaning of the sentence − is important.

Elements of Semantic Analysis

Following are some important elements of semantic analysis −

Hyponymy

It may be defined as the relationship between a generic term and instances of that generic term. Here the generic term is called a hypernym and its instances are called hyponyms. For example, the word color is a hypernym, and the colors blue, yellow, etc. are hyponyms.

Homonymy

It may be defined as words having the same spelling or the same form but different and unrelated meanings. For example, the word “bat” is a homonym, because a bat can be an implement used to hit a ball, and a bat is also a nocturnal flying mammal.

Polysemy

Polysemy is a Greek word meaning “many signs”. A polysemous word or phrase has different but related senses. In other words, we can say that a polysemous word has the same spelling but different and related meanings. For example, the word “bank” is a polysemous word having the following meanings −

  1. A financial institution.

  2. The building in which such an institution is located.

  3. A synonym for “to rely on”.

Difference between Polysemy and Homonymy

Both polysemous and homonymous words have the same syntax or spelling. The main difference between them is that in polysemy the meanings of the words are related, but in homonymy the meanings of the words are not related. For example, if we talk about the same word “bank”, we can write the meaning ‘a financial institution’ or ‘a river bank’. In that case it would be an example of homonymy, because the meanings are unrelated to each other.

Synonymy

It is the relation between two lexical items having different forms but expressing the same or a close meaning. Examples are ‘author/writer’, ‘fate/destiny’.

Antonymy

It is the relation between two lexical items having symmetry between their semantic components relative to an axis. The scope of antonymy is as follows −

  1. Application of property or not − Example is ‘life/death’, ‘certitude/incertitude’

  2. Application of scalable property − Example is ‘rich/poor’, ‘hot/cold’

  3. Application of a usage − Example is ‘father/son’, ‘moon/sun’.

Meaning Representation

Semantic analysis creates a representation of the meaning of a sentence. But before getting into the concept and approaches related to meaning representation, we need to understand the building blocks of semantic system.

Building Blocks of Semantic System

In word representation or representation of the meaning of the words, the following building blocks play an important role −

  1. Entities − These represent individuals such as a particular person, location, etc. For example, Haryana, India and Ram are all entities.

  2. Concepts − These represent general categories of individuals, such as a person, a city, etc.

  3. Relations − These represent relationships between entities and concepts. For example, Ram is a person.

  4. Predicates − These represent verb structures. For example, semantic roles and case grammar are examples of predicates.

Now we can understand that meaning representation shows how to put together the building blocks of semantic systems. In other words, it shows how to put together entities, concepts, relations and predicates to describe a situation. It also enables reasoning about the semantic world.

Approaches to Meaning Representations

Semantic analysis uses the following approaches for the representation of meaning −

  1. First order predicate logic (FOPL) − illustrated in the sketch after this list

  2. Semantic Nets

  3. Frames

  4. Conceptual dependency (CD)

  5. Rule-based architecture

  6. Case Grammar

  7. Conceptual Graphs
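
As a brief sketch of the first approach, NLTK’s nltk.sem package can parse first order predicate logic expressions (the facts encoded here are illustrative assumptions):

```python
from nltk.sem import Expression

read_expr = Expression.fromstring

# 'Ram is a person' and 'every person is mortal' in first order logic.
fact = read_expr('person(ram)')
rule = read_expr('all x.(person(x) -> mortal(x))')

print(fact)          # person(ram)
print(rule.free())   # set(): the quantifier binds x, so nothing is free
```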

Need of Meaning Representations

A question that arises here is: why do we need meaning representation? Following are the reasons −

Linking of linguistic elements to non-linguistic elements

The very first reason is that with the help of meaning representation the linking of linguistic elements to the non-linguistic elements can be done.

Representing variety at lexical level

With the help of meaning representation, unambiguous, canonical forms can be represented at the lexical level.

Can be used for reasoning

Meaning representation can be used for reasoning − for verifying what is true in the world, as well as for inferring knowledge from the semantic representation.

Lexical Semantics

The first part of semantic analysis, studying the meaning of individual words, is called lexical semantics. It includes words, sub-words, affixes (sub-units), compound words and phrases. All the words, sub-words, etc. are collectively called lexical items. In other words, we can say that lexical semantics is the relationship between lexical items, the meaning of sentences and the syntax of sentences.

Following are the steps involved in lexical semantics −

  1. Classification of lexical items like words, sub-words, affixes, etc. is performed in lexical semantics.

  2. Decomposition of lexical items like words, sub-words, affixes, etc. is performed in lexical semantics.

  3. Differences as well as similarities between various lexical semantic structures are also analyzed.

NLP - Word Sense Disambiguation

We understand that words have different meanings based on the context of their usage in a sentence. Human languages are ambiguous because many words can be interpreted in multiple ways depending upon the context of their occurrence.

Word sense disambiguation, in natural language processing (NLP), may be defined as the ability to determine which meaning of a word is activated by its use in a particular context. Lexical ambiguity, syntactic or semantic, is one of the very first problems that any NLP system faces. Part-of-speech (POS) taggers with a high level of accuracy can resolve a word’s syntactic ambiguity. On the other hand, the problem of resolving semantic ambiguity is called WSD (word sense disambiguation). Resolving semantic ambiguity is harder than resolving syntactic ambiguity.

For example, consider these two examples of the distinct senses that exist for the word “bass” −

  1. I can hear bass sound.

  2. He likes to eat grilled bass.

The occurrences of the word bass clearly denote distinct meanings. In the first sentence, it means frequency, and in the second, it means fish. Hence, if it were disambiguated by WSD, the correct meanings could be assigned to the above sentences as follows −

  1. I can hear bass/frequency sound.

  2. He likes to eat grilled bass/fish.

Evaluation of WSD

The evaluation of WSD requires the following two inputs −

A Dictionary

The very first input for the evaluation of WSD is a dictionary, which is used to specify the senses to be disambiguated.

Test Corpus

Another input required by WSD is a sense-annotated test corpus that has the target or correct senses. Test corpora can be of two types −

  1. Lexical sample − This kind of corpora is used in the system, where it is required to disambiguate a small sample of words.

  2. All-words − This kind of corpora is used in the system, where it is expected to disambiguate all the words in a piece of running text.

Approaches and Methods to Word Sense Disambiguation (WSD)

Approaches and methods to WSD are classified according to the source of knowledge used in word disambiguation.

Let us now see the four conventional methods to WSD −

Dictionary-based or Knowledge-based Methods

As the name suggests, for disambiguation these methods primarily rely on dictionaries, thesauri and lexical knowledge bases. They do not use corpus evidence for disambiguation. The Lesk method is the seminal dictionary-based method, introduced by Michael Lesk in 1986. The Lesk definition, on which the Lesk algorithm is based, is to “measure overlap between sense definitions for all words in context”. However, in 2000, Kilgarriff and Rosensweig gave the simplified Lesk definition as “measure overlap between sense definitions of word and current context”, which further means identifying the correct sense for one word at a time. Here the current context is the set of words in the surrounding sentence or paragraph.
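
NLTK provides this simplified Lesk algorithm directly; a sketch on the “bass” example from the previous chapter (assuming the wordnet and punkt data have been downloaded; the returned sense is Lesk’s best guess, not always the intuitive one):

```python
from nltk import word_tokenize
from nltk.wsd import lesk

# Pick the WordNet sense of 'bass' whose gloss overlaps most
# with the words of the surrounding sentence.
context = word_tokenize('He likes to eat grilled bass')
sense = lesk(context, 'bass')
print(sense, '-', sense.definition())
```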

Supervised Methods

For disambiguation, machine learning methods make use of sense-annotated corpora for training. These methods assume that the context can provide enough evidence on its own to disambiguate the sense, so world knowledge and reasoning are deemed unnecessary. The context is represented as a set of “features” of the words, including information about the surrounding words. Support vector machines and memory-based learning are the most successful supervised learning approaches to WSD. These methods rely on a substantial amount of manually sense-tagged corpora, which are very expensive to create.

Semi-supervised Methods

Due to the lack of training corpora, most word sense disambiguation algorithms use semi-supervised learning methods, which use both labelled and unlabelled data. These methods require a very small amount of annotated text and a large amount of plain unannotated text. The technique used by semi-supervised methods is bootstrapping from seed data.

Unsupervised Methods

These methods assume that similar senses occur in similar contexts. That is why senses can be induced from text by clustering word occurrences using some measure of contextual similarity. This task is called word sense induction or discrimination. Unsupervised methods have great potential to overcome the knowledge acquisition bottleneck because they do not depend on manual effort.

Applications of Word Sense Disambiguation (WSD)

Word sense disambiguation (WSD) is applied in almost every application of language technology.

Let us now see the scope of WSD −

Machine Translation

Machine translation, or MT, is the most obvious application of WSD. In MT, the lexical choice for words that have distinct translations for different senses is done by WSD. The senses in MT are represented as words in the target language. Most machine translation systems do not use an explicit WSD module.

Information Retrieval (IR)

Information retrieval (IR) may be defined as a software program that deals with the organization, storage, retrieval and evaluation of information from document repositories, particularly textual information. The system basically assists users in finding the information they require, but it does not explicitly return the answers to their questions. WSD is used to resolve the ambiguities of the queries provided to an IR system. As with MT, current IR systems do not explicitly use a WSD module; they rely on the concept that the user will type enough context in the query to retrieve only relevant documents.

Text Mining and Information Extraction (IE)

In most applications, WSD is necessary for accurate analysis of text. For example, WSD helps an intelligent gathering system flag the correct words: a medical intelligence system might need to flag “illegal drugs” rather than “medical drugs”.

Lexicography

WSD and lexicography can work together in a loop, because modern lexicography is corpus-based. With lexicography, WSD provides rough empirical sense groupings as well as statistically significant contextual indicators of sense.

Difficulties in Word Sense Disambiguation (WSD)

Following are some difficulties faced by word sense disambiguation (WSD) −

Differences between dictionaries

The major problem of WSD is deciding the sense of a word, because different senses can be very closely related. Even different dictionaries and thesauruses can provide different divisions of words into senses.

Different algorithms for different applications

Another problem of WSD is that completely different algorithms might be needed for different applications. For example, in machine translation it takes the form of target word selection, while in information retrieval a sense inventory is not required.

Inter-judge variance

Another problem of WSD is that WSD systems are generally tested by comparing their results on a task against those of human beings. Since even human judges often disagree on the correct sense, this is called the problem of inter-judge variance.

Word-sense discreteness

Another difficulty in WSD is that words cannot easily be divided into discrete sub-meanings.

Natural Language Discourse Processing

The most difficult problem of AI is to process natural language by computers; in other words, natural language processing is the most difficult problem of artificial intelligence. One of the major problems in NLP is discourse processing − building theories and models of how utterances stick together to form coherent discourse. Actually, language always consists of collocated, structured and coherent groups of sentences rather than isolated and unrelated ones. These coherent groups of sentences are referred to as discourse.

Concept of Coherence

Coherence and discourse structure are interconnected in many ways. Coherence, a property of good text, is used to evaluate the output quality of natural language generation systems. The question that arises here is: what does it mean for a text to be coherent? Suppose we collected one sentence from every page of a newspaper − would the result be a discourse? Of course not, because these sentences do not exhibit coherence. A coherent discourse must possess the following properties −

Coherence relation between utterances

A discourse is coherent if it has meaningful connections between its utterances; this property is called a coherence relation. For example, some sort of explanation must exist to justify the connection between utterances.

Relationship between entities

Another property that makes a discourse coherent is that there must be a certain kind of relationship between the entities it mentions. Such coherence is called entity-based coherence.

Discourse structure

An important question regarding discourse is what kind of structure a discourse must have. The answer depends on the segmentation we apply to the discourse. Discourse segmentation may be defined as determining the types of structures in a large discourse. It is quite difficult to implement, but it is very important for applications such as information retrieval, text summarization and information extraction.

Algorithms for Discourse Segmentation

In this section, we will learn about the algorithms for discourse segmentation. The algorithms are described below −

Unsupervised Discourse Segmentation

The class of unsupervised discourse segmentation is often represented as linear segmentation. We can understand the task of linear segmentation with the help of an example: segmenting a text into multi-paragraph units, where the units represent passages of the original text. These algorithms depend on cohesion, which may be defined as the use of certain linguistic devices to tie the textual units together. In particular, lexical cohesion is the cohesion indicated by relationships between two or more words in two units, such as the use of synonyms.
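
A minimal sketch of cohesion-based linear segmentation, in the spirit of TextTiling, is given below: a boundary is proposed wherever lexical overlap between adjacent sentences drops. The sentences and the 0.1 threshold are illustrative choices, not fixed parts of any algorithm.

```python
# Cohesion-based linear segmentation sketch: low lexical similarity between
# adjacent sentences suggests a topic boundary. Toy sentences.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The central bank cut interest rates again.",
    "Lower rates should boost bank lending.",
    "Meanwhile the recipe needs flour and yeast.",
    "Mix the flour with yeast and warm water.",
]

X = CountVectorizer(stop_words="english").fit_transform(sentences)

for i in range(len(sentences) - 1):
    sim = cosine_similarity(X[i], X[i + 1])[0, 0]
    marker = "<-- candidate boundary" if sim < 0.1 else ""
    print(f"gap {i}-{i + 1}: cohesion = {sim:.2f} {marker}")
```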

Supervised Discourse Segmentation

The earlier method does not use any hand-labeled segment boundaries. Supervised discourse segmentation, on the other hand, needs boundary-labeled training data, which is easy to acquire. In supervised discourse segmentation, discourse markers or cue words play an important role: a discourse marker or cue word is a word or phrase that functions to signal discourse structure. These discourse markers are domain-specific.

Text Coherence

Lexical repetition is a way to find structure in a discourse, but it does not satisfy the requirement of coherent discourse. To achieve coherent discourse, we must focus on coherence relations in particular. As we know, a coherence relation defines the possible connection between utterances in a discourse. Hobbs proposed the following kinds of relations −

We take two terms, S0 and S1, to represent the meanings of the two related sentences −

Result

It infers that the state asserted by term S0 could cause the state asserted by S1. For example, the following two statements show the relation Result: Ram was caught in the fire. His skin burned.

Explanation

It infers that the state asserted by S1 could cause the state asserted by S0. For example, the following two statements show the relation Explanation: Ram fought with Shyam’s friend. He was drunk.

Parallel

It infers p(a1, a2, …) from the assertion of S0 and p(b1, b2, …) from the assertion of S1, where ai and bi are similar for all i. For example, the following two statements are parallel: Ram wanted a car. Shyam wanted money.

Elaboration

It infers the same proposition P from both assertions, S0 and S1. For example, the following two statements show the relation Elaboration: Ram was from Chandigarh. Shyam was from Kerala.

Occasion

It happens when a change of state can be inferred from the assertion of S0 and its final state can be inferred from S1, and vice versa. For example, the following two statements show the relation Occasion: Ram picked up the book. He gave it to Shyam.

Building Hierarchical Discourse Structure

The coherence of an entire discourse can also be considered in terms of a hierarchical structure between coherence relations. For example, the following passage can be represented as a hierarchical structure −

  1. S1 − Ram went to the bank to deposit money.

  2. S2 − He then took a train to Shyam’s cloth shop.

  3. S3 − He wanted to buy some clothes.

  4. S4 − He did not have new clothes for the party.

  5. S5 − He also wanted to talk to Shyam regarding his health.

[Figure: the hierarchical discourse structure built over S1-S5]

Reference Resolution

Interpreting the sentences of any discourse is another important task, and to achieve it we need to know who or what entity is being talked about. Here, interpreting reference is the key element. Reference may be defined as a linguistic expression used to denote an entity or individual. For example, in the passage Ram, the manager of ABC bank, saw his friend Shyam at a shop. He went to meet him, the linguistic expressions Ram, His and He are references.

On the same note, reference resolution may be defined as the task of determining which entities are referred to by which linguistic expressions.

Terminology Used in Reference Resolution

We use the following terminologies in reference resolution −

  1. Referring expression − The natural language expression that is used to perform reference is called a referring expression. For example, the expressions Ram and He in the passage above are referring expressions.

  2. Referent − It is the entity that is referred. For example, in the last given example Ram is a referent.

  3. Corefer − When two expressions are used to refer to the same entity, they are called corefers. For example, Ram and he are corefers.

  4. Antecedent − The term that licenses the use of another term. For example, Ram is the antecedent of the reference he.

  5. Anaphora & Anaphoric − It may be defined as the reference to an entity that has been previously introduced into the sentence. And, the referring expression is called anaphoric.

  6. Discourse model − The model that contains the representations of the entities that have been referred to in the discourse and the relationship they are engaged in.

Types of Referring Expressions

Let us now see the different types of referring expressions. The five types of referring expressions are described below −

Indefinite Noun Phrases

Such a reference introduces entities that are new to the hearer into the discourse context. For example, in the sentence Ram had gone around one day to bring him some food, some food is an indefinite reference.

Definite Noun Phrases

Opposite to the above, such a reference represents entities that are not new to the hearer and are identifiable in the discourse context. For example, in the sentence I used to read The Times of India, The Times of India is a definite reference.

Pronouns

It is a form of definite reference. For example, in Ram laughed as loud as he could, the word he is a pronominal referring expression.

Demonstratives

Demonstratives point to entities and behave differently from simple definite references. For example, this and that are demonstrative pronouns.

Names

It is the simplest type of referring expression. It can be the name of a person, an organization or a location. For example, in the examples above, Ram is a name referring expression.

Reference Resolution Tasks

The two reference resolution tasks are described below.

Coreference Resolution

It is the task of finding referring expressions in a text that refer to the same entity; in simple words, it is the task of finding coreferring expressions. A set of coreferring expressions is called a coreference chain. For example, He, Chief Manager and His are referring expressions in the first passage given as an example.

Constraint on Coreference Resolution

In English, the main problem for coreference resolution is the pronoun it, because it has many uses. For example, it can refer to entities much as he and she do, but it can also refer to things that are not specific entities, as in: It’s raining. It is really good.

Pronominal Anaphora Resolution

Unlike coreference resolution, pronominal anaphora resolution may be defined as the task of finding the antecedent of a single pronoun. For example, if the pronoun is his, the task of pronominal anaphora resolution is to find the word Ram, because Ram is the antecedent.

Part of Speech (PoS) Tagging

Tagging is a kind of classification that may be defined as the automatic assignment of descriptors to tokens. Here the descriptor is called a tag, which may represent part-of-speech information, semantic information and so on.

Now, if we talk about Part-of-Speech (PoS) tagging, it may be defined as the process of assigning one of the parts of speech to a given word; it is generally called POS tagging. In simple words, POS tagging is the task of labelling each word in a sentence with its appropriate part of speech. We already know that parts of speech include nouns, verbs, adverbs, adjectives, pronouns, conjunctions and their sub-categories.

Most POS tagging approaches fall under rule-based POS tagging, stochastic POS tagging or transformation-based tagging.

Rule-based POS Tagging

One of the oldest techniques of tagging is rule-based POS tagging. Rule-based taggers use a dictionary or lexicon to obtain the possible tags for each word. If a word has more than one possible tag, rule-based taggers use hand-written rules to identify the correct one. Disambiguation can also be performed in rule-based tagging by analyzing the linguistic features of a word along with its preceding and following words. For example, if the preceding word of a word is an article, then the word must be a noun.

As the name suggests, all such information in rule-based POS tagging is coded in the form of rules. These rules may be either −

  1. Context-pattern rules

  2. Regular expressions compiled into finite-state automata, intersected with a lexically ambiguous sentence representation.

We can also understand rule-based POS tagging through its two-stage architecture; a minimal sketch follows the list below −

  1. First stage − In the first stage, it uses a dictionary to assign each word a list of potential parts-of-speech.

  2. Second stage − In the second stage, it uses large lists of hand-written disambiguation rules to sort down the list to a single part-of-speech for each word.
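
The toy tagger below illustrates this two-stage architecture on invented data: a tiny lexicon supplies candidate tags, and one hand-written rule (after a determiner, prefer a noun) disambiguates.

```python
# A toy two-stage rule-based tagger: lexicon lookup, then a hand-written rule.
LEXICON = {
    "the": ["DET"],
    "fish": ["NOUN", "VERB"],   # ambiguous word
    "swims": ["VERB"],
}

def tag(sentence):
    words = sentence.lower().split()
    tags = []
    for i, word in enumerate(words):
        # Stage 1: dictionary lookup for candidate tags.
        candidates = LEXICON.get(word, ["NOUN"])  # default unknowns to NOUN
        if len(candidates) == 1:
            tags.append(candidates[0])
            continue
        # Stage 2: hand-written disambiguation rule − after a determiner,
        # an ambiguous word is tagged as a noun.
        if i > 0 and tags[i - 1] == "DET" and "NOUN" in candidates:
            tags.append("NOUN")
        else:
            tags.append(candidates[0])
    return list(zip(words, tags))

print(tag("the fish swims"))  # [('the', 'DET'), ('fish', 'NOUN'), ('swims', 'VERB')]
```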

Properties of Rule-Based POS Tagging

Rule-based POS taggers possess the following properties −

  1. These taggers are knowledge-driven taggers.

  2. The rules in Rule-based POS tagging are built manually.

  3. The information is coded in the form of rules.

  4. These taggers use a limited number of rules, approximately around 1000.

  5. Smoothing and language modeling are defined explicitly in rule-based taggers.

Stochastic POS Tagging

Another technique of tagging is stochastic POS tagging. Now, the question that arises here is which models can be called stochastic. Any model that includes frequency or probability (statistics) may be called stochastic, and any number of different approaches to the problem of part-of-speech tagging can be referred to as stochastic tagging.

The simplest stochastic taggers apply the following approaches to POS tagging −

Word Frequency Approach

In this approach, the stochastic taggers disambiguate words based on the probability that a word occurs with a particular tag. In other words, the tag encountered most frequently with a word in the training set is the one assigned to ambiguous instances of that word. The main issue with this approach is that it may yield an inadmissible sequence of tags.
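
The sketch below implements this word-frequency baseline on a hand-made toy corpus; note that each word is tagged in isolation, which is exactly why inadmissible tag sequences can result.

```python
# Word-frequency tagging: assign each word its most frequent training tag.
from collections import Counter, defaultdict

tagged_corpus = [
    ("the", "DET"), ("dog", "NOUN"), ("runs", "VERB"),
    ("the", "DET"), ("run", "NOUN"), ("was", "VERB"),
    ("dogs", "NOUN"), ("run", "VERB"), ("run", "VERB"),
]

counts = defaultdict(Counter)
for word, tag in tagged_corpus:
    counts[word][tag] += 1

def most_frequent_tag(word):
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return "NOUN"  # naive fallback for unseen words

print([(w, most_frequent_tag(w)) for w in "the dogs run".split()])
# [('the', 'DET'), ('dogs', 'NOUN'), ('run', 'VERB')]
```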

Tag Sequence Probabilities

This is another approach to stochastic tagging, in which the tagger calculates the probability of a given sequence of tags occurring. It is also called the n-gram approach, so called because the best tag for a given word is determined by the probability of its occurring with the n previous tags.

Properties of Stochastic POS Tagging

Stochastic POS taggers possess the following properties −

  1. This POS tagging is based on the probability of a tag occurring.

  2. It requires a training corpus.

  3. There would be no probability for words that do not exist in the training corpus.

  4. It uses a testing corpus different from the training corpus.

  5. It is the simplest form of POS tagging because it simply chooses the most frequent tag associated with a word in the training corpus.

Transformation-based Tagging

Transformation-based tagging is also called Brill tagging. It is an instance of transformation-based learning (TBL), a rule-based algorithm for the automatic POS tagging of a given text. TBL allows us to have linguistic knowledge in a readable form; it transforms one state into another by using transformation rules.

It draws inspiration from both of the previously explained taggers − rule-based and stochastic. Like rule-based tagging, it is based on rules that specify which tags need to be assigned to which words. Like stochastic tagging, it is a machine learning technique in which the rules are automatically induced from data.

Working of Transformation-Based Learning (TBL)

In order to understand the working and concept of transformation-based taggers, we need to understand the working of transformation-based learning. Consider the following steps to understand the working of TBL −

  1. Start with the solution − The TBL usually starts with some solution to the problem and works in cycles.

  2. Most beneficial transformation chosen − In each cycle, TBL will choose the most beneficial transformation.

  3. Apply to the problem − The transformation chosen in the last step will be applied to the problem.

The algorithm stops when the transformation selected in step 2 no longer adds value, or when there are no more transformations to select. Such learning is best suited to classification tasks. A minimal sketch of this loop is given below.
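
The following toy implementation makes the TBL cycle concrete: it starts from a most-frequent-tag baseline and greedily learns rules of the form "change tag A to B when the previous tag is C". The corpus, tagset and rule template are all invented for illustration.

```python
# Minimal TBL sketch: greedily learn "change A to B after C" rules.
from itertools import product

train = [[("the", "DET"), ("run", "NOUN"), ("ended", "VERB")],
         [("dogs", "NOUN"), ("run", "VERB"), ("fast", "ADV")]]
baseline = {"the": "DET", "run": "VERB", "ended": "VERB",
            "dogs": "NOUN", "fast": "ADV"}   # most-frequent tags
TAGS = ["DET", "NOUN", "VERB", "ADV"]

def apply_rules(tags, rules):
    tags = list(tags)
    for frm, to, prev in rules:
        for i in range(1, len(tags)):
            if tags[i] == frm and tags[i - 1] == prev:
                tags[i] = to
    return tags

def score(rules):
    correct = 0
    for sent in train:
        pred = apply_rules([baseline[w] for w, _ in sent], rules)
        correct += sum(p == g for p, (_, g) in zip(pred, sent))
    return correct

rules = []
while True:
    # Step 2: choose the most beneficial transformation.
    best = max(product(TAGS, TAGS, TAGS), key=lambda r: score(rules + [r]))
    if score(rules + [best]) <= score(rules):
        break  # the chosen transformation adds no value: stop
    rules.append(best)  # Step 3: apply it to the problem

print(rules)  # e.g. [('VERB', 'NOUN', 'DET')]
```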

Advantages of Transformation-based Learning (TBL)

The advantages of TBL are as follows −

  1. We learn a small set of simple rules, and these rules are enough for tagging.

  2. Development as well as debugging is very easy in TBL because the learned rules are easy to understand.

  3. Complexity in tagging is reduced because in TBL there is an interlacing of machine-learned and human-generated rules.

  4. Transformation-based tagger is much faster than Markov-model tagger.

Disadvantages of Transformation-based Learning (TBL)

The disadvantages of TBL are as follows −

  1. Transformation-based learning (TBL) does not provide tag probabilities.

  2. In TBL, the training time is very long especially on large corpora.

Hidden Markov Model (HMM) POS Tagging

Before digging deep into HMM POS tagging, we must understand the concept of the Hidden Markov Model (HMM).

Hidden Markov Model

An HMM may be defined as a doubly-embedded stochastic model in which the underlying stochastic process is hidden. This hidden stochastic process can only be observed through another set of stochastic processes that produces the sequence of observations.

Example

For example, suppose a sequence of hidden coin-tossing experiments is performed and we see only the observation sequence consisting of heads and tails. The actual details of the process − how many coins are used and the order in which they are selected − are hidden from us. By observing this sequence of heads and tails, we can build several HMMs to explain it. The following is one form of Hidden Markov Model for this problem −

[Figure: a two-state Hidden Markov Model for the hidden coin-tossing experiment]

We assume that there are two states in the HMM, each corresponding to the selection of a different biased coin. The following matrix gives the state transition probabilities −

A = \begin{bmatrix}a_{11} & a_{12} \\ a_{21} & a_{22}\end{bmatrix}

Here,

  1. aij = the probability of transition from state i to state j.

  2. a11 + a12 = 1 and a21 + a22 = 1

  3. P1 = the probability of heads for the first coin, i.e., the bias of the first coin.

  4. P2 = the probability of heads for the second coin, i.e., the bias of the second coin.

We can also create an HMM by assuming that there are three or more coins.

This way, we can characterize an HMM by the following elements −

  1. N, the number of states in the model (in the above example N = 2, only two states).

  2. M, the number of distinct observations that can appear with each state (in the above example M = 2, i.e., H or T).

  3. A, the state transition probability distribution − the matrix A in the above example.

  4. P, the probability distribution of the observable symbols in each state (in our example P1 and P2).

  5. I, the initial state distribution.

Use of HMM for POS Tagging

The POS tagging process is the process of finding the sequence of tags which is most likely to have generated a given word sequence. We can model this process by using a Hidden Markov Model (HMM), where the tags are the hidden states that produce the observable output, i.e., the words.

Mathematically, in POS tagging we are always interested in finding a tag sequence (C) which maximizes −

P (C|W)

Where,

C = C1, C2, C3, …, CT

W = W1, W2, W3, …, WT

On the other side of the coin, the fact is that we need a lot of statistical data to reasonably estimate such sequences. However, to simplify the problem, we can apply some mathematical transformations along with some assumptions.

The use of an HMM to do POS tagging is a special case of Bayesian inference. Hence, we will start by restating the problem using Bayes’ rule, which says that the above-mentioned conditional probability is equal to −

(PROB (C1,…, CT) * PROB (W1,…, WT | C1,…, CT)) / PROB (W1,…, WT)

We can eliminate the denominator in all these cases because we are interested in finding the sequence C which maximizes the above value, and the denominator does not affect our answer. Now, our problem reduces to finding the sequence C that maximizes −

PROB (C1,…, CT) * PROB (W1,…, WT | C1,…, CT) (1)

Even after reducing the problem to the above expression, it would require a large amount of data. We can make reasonable independence assumptions about the two probabilities in the above expression to overcome this problem.

First Assumption

The probability of a tag depends on the previous tag (bigram model), the previous two tags (trigram model) or the previous n tags (n-gram model), which, mathematically, can be explained as follows −

PROB (C1,…, CT) = Πi=1..T PROB (Ci|Ci-n+1…Ci-1) (n-gram model)

PROB (C1,…, CT) = Πi=1..T PROB (Ci|Ci-1) (bigram model)

The beginning of a sentence can be accounted for by assuming an initial probability for each tag.

PROB (C1|C0) = PROB initial (C1)

Second Assumption

The second probability in equation (1) above can be approximated by assuming that a word appears in a category independently of the words in the preceding or succeeding categories, which can be explained mathematically as follows −

PROB (W1,…, WT | C1,…, CT) = Πi=1..T PROB (Wi|Ci)

Now, on the basis of the above two assumptions, our goal reduces to finding a sequence C which maximizes −

Πi=1..T PROB(Ci|Ci-1) * PROB(Wi|Ci)

Now the question that arises here is whether converting the problem to the above form has really helped us. The answer is yes, it has. If we have a large tagged corpus, then the two probabilities in the above formula can be calculated as −

PROB (Ci=VERB|Ci-1=NOUN) = (# of instances where Verb follows Noun) / (# of instances where Noun appears) (2)

PROB (Wi|Ci) = (# of instances where Wi appears in Ci) / (# of instances where Ci appears) (3)
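
The sketch below puts these pieces together on toy data: it estimates the bigram and lexical probabilities from a tiny hand-tagged corpus using formulas (2) and (3), then finds the maximizing tag sequence with the Viterbi algorithm. Smoothing for unseen events is omitted for brevity.

```python
# HMM POS tagging sketch: estimate PROB(Ci|Ci-1) and PROB(Wi|Ci) from counts,
# then decode the best tag sequence with Viterbi. Toy corpus; no smoothing.
from collections import Counter

corpus = [[("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
          [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
          [("dogs", "NOUN"), ("bark", "VERB")]]

trans, emit, start, tag_count = Counter(), Counter(), Counter(), Counter()
for sent in corpus:
    prev = None
    for word, tag in sent:
        tag_count[tag] += 1
        emit[(tag, word)] += 1
        if prev is None:
            start[tag] += 1          # initial probability for each tag
        else:
            trans[(prev, tag)] += 1
        prev = tag

TAGS = list(tag_count)

def p_trans(prev, tag):   # formula (2)
    return trans[(prev, tag)] / tag_count[prev]

def p_emit(tag, word):    # formula (3)
    return emit[(tag, word)] / tag_count[tag]

def viterbi(words):
    # best[i][t] = (probability of the best path ending in tag t, backpointer)
    best = [{t: (start[t] / len(corpus) * p_emit(t, words[0]), None) for t in TAGS}]
    for i in range(1, len(words)):
        layer = {}
        for t in TAGS:
            layer[t] = max(((best[i - 1][pt][0] * p_trans(pt, t) * p_emit(t, words[i]), pt)
                            for pt in TAGS), key=lambda x: x[0])
        best.append(layer)
    tag = max(best[-1], key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):   # trace the backpointers
        tag = best[i][tag][1]
        path.append(tag)
    return list(reversed(path))

print(viterbi(["the", "cat", "barks"]))  # expected: ['DET', 'NOUN', 'VERB']
```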

Natural Language Processing - Inception

In this chapter, we will discuss the inception of natural language in Natural Language Processing. To begin with, let us first understand what natural language grammar is.

Natural Language Grammar

For linguistics, language is a group of arbitrary vocal signs. We may say that language is creative, governed by rules, and both innate and universal, while at the same time being distinctly human. The nature of language is perceived differently by different people, and there are many misconceptions about it. That is why it is very important to understand the meaning of the ambiguous term ‘grammar’. In linguistics, the term grammar may be defined as the rules or principles by which a language works. In a broad sense, we can divide grammar into two categories −

Descriptive Grammar

The set of rules by which linguists and grammarians formulate the speaker’s grammar is called descriptive grammar.

Prescriptive Grammar

This is a very different sense of grammar, one which attempts to maintain a standard of correctness in the language. This category has little to do with the actual workings of the language.

Components of Language

The language of study is divided into interrelated components, which are conventional as well as arbitrary divisions of linguistic investigation. These components are explained below −

Phonology

The very first component of language is phonology, the study of the speech sounds of a particular language. The origin of the word can be traced to the Greek word ‘phone’, meaning sound or voice. Phonetics, a subdivision of phonology, is the study of the speech sounds of human language from the perspective of their production, perception or physical properties. The IPA (International Phonetic Alphabet) is a tool for representing human sounds in a regular way while studying phonology. In the IPA, every written symbol represents one and only one speech sound, and vice versa.

Phonemes

A phoneme may be defined as a unit of sound that differentiates one word from another in a language. In linguistics, phonemes are written between slashes. For example, the phoneme /k/ occurs in words such as kit and skit.

Morphology

It is the second component of language: the study of the structure and classification of the words in a particular language. The word originates from the Greek ‘morphe’, meaning ‘form’. Morphology considers the principles of word formation in a language − in other words, how sounds combine into meaningful units like prefixes, suffixes and roots. It also considers how words can be grouped into parts of speech.

Lexeme

In linguistics, the abstract unit of morphological analysis that corresponds to the set of forms taken by a single word is called a lexeme. The way in which a lexeme is used in a sentence is determined by its grammatical category. A lexeme can be an individual word or a multiword expression. For example, the word talk is an individual-word lexeme, which may have many grammatical variants like talks, talked and talking. A multiword lexeme can be made up of more than one orthographic word; for example, speak up and pull through are multiword lexemes.

Syntax

It is the third component of language: the study of the order and arrangement of words into larger units. The word can be traced to the Greek suntassein, meaning ‘to put in order’. Syntax studies the types of sentences and their structure, as well as clauses and phrases.

Semantics

It is the fourth component of language: the study of how meaning is conveyed. The meaning can be related to the outside world or to the grammar of the sentence. The word can be traced to the Greek semainein, meaning ‘to signify’, ‘show’ or ‘signal’.

Pragmatics

It is the fifth component of language: the study of the functions of language and its use in context. The origin of the word can be traced to the Greek ‘pragma’, meaning ‘deed’ or ‘affair’.

Grammatical Categories

A grammatical category may be defined as a class of units or features within the grammar of a language. These units are the building blocks of language and share a common set of characteristics. Grammatical categories are also called grammatical features.

The inventory of grammatical categories is described below −

Number

It is the simplest grammatical category. We have two terms related to this category − singular and plural. Singular is the concept of ‘one’ whereas plural is the concept of ‘more than one’. For example, dog/dogs and this/these.

Gender

In English, grammatical gender is expressed by variation in the 3rd person singular personal pronouns − he, she, it. The first and second person forms − I, we and you − and the 3rd person plural form they are of common or neuter gender.

Person

Another simple grammatical category is person. Under this, the following three terms are recognized −

  1. 1st person − The person who is speaking is recognized as 1st person.

  2. 2nd person − The person who is the hearer or the person spoken to is recognized as 2nd person.

  3. 3rd person − The person or thing about whom we are speaking is recognized as 3rd person.

Case

It is one of the most difficult grammatical categories. It may be defined as an indication of the function of a noun phrase (NP), or of the relationship of a noun phrase to a verb or to the other noun phrases in the sentence. We have the following three cases expressed in personal and interrogative pronouns −

  1. Nominative case − It is the function of subject. For example, I, we, you, he, she, it, they and who are nominative.

  2. Genitive case − It is the function of possessor. For example, my/mine, our/ours, his, her/hers, its, their/theirs, whose are genitive.

  3. Objective case − It is the function of object. For example, me, us, you, him, her, them, whom are objective.

Degree

This grammatical category is related to adjectives and adverbs. It has the following three terms −

  1. Positive degree − It expresses a quality. For example, big, fast, beautiful are positive degrees.

  2. Comparative degree − It expresses greater degree or intensity of the quality in one of two items. For example, bigger, faster, more beautiful are comparative degrees.

  3. Superlative degree − It expresses greatest degree or intensity of the quality in one of three or more items. For example, biggest, fastest, most beautiful are superlative degrees.

Definiteness and Indefiniteness

Both of these concepts are very simple. Definiteness, as we know, represents a referent that is known, familiar or identifiable by the speaker or hearer. On the other hand, indefiniteness represents a referent that is not known or is unfamiliar. The concept can be understood in the co-occurrence of an article with a noun −

  1. definite article − the

  2. indefinite article − a/an

Tense

This grammatical category is related to verbs and can be defined as the linguistic indication of the time of an action. A tense establishes a relation because it indicates the time of an event with respect to the moment of speaking. Broadly, it is of the following three types −

  1. Present tense − Represents the occurrence of an action in the present moment. For example, Ram works hard.

  2. Past tense − Represents the occurrence of an action before the present moment. For example, it rained.

  3. Future tense − Represents the occurrence of an action after the present moment. For example, it will rain.

Aspect

This grammatical category may be defined as the view taken of an event. It can be of the following types −

  1. Perfective aspect − The event is viewed as whole and complete. For example, the simple past tense, as in yesterday I met my friend, is perfective in aspect in English because it views the event as complete and whole.

  2. Imperfective aspect − The event is viewed as ongoing and incomplete. For example, the present progressive, as in I am working on this problem, is imperfective in aspect in English because it views the event as incomplete and ongoing.

Mood

This grammatical category is a bit difficult to define, but it can simply be stated as the indication of the speaker’s attitude towards what he or she is talking about. It is also a grammatical feature of verbs, distinct from grammatical tense and grammatical aspect. Examples of moods are the indicative, interrogative, imperative, injunctive, subjunctive, potential and optative, along with gerunds and participles.

Agreement

It is also called concord. It happens when a word changes form depending on the other words to which it relates. In other words, it involves making the value of some grammatical category agree between different words or parts of speech. The following are agreements based on other grammatical categories −

  1. Agreement based on Person − It is the agreement between subject and the verb. For example, we always use “I am” and “He is” but never “He am” and “I is”.

  2. Agreement based on Number − This agreement is between the subject and the verb. In this case, there are specific verb forms for the first person singular, the first person plural and so on. For example, 1st person singular: I really am; 1st person plural: We really are; 3rd person singular: The boy sings; 3rd person plural: The boys sing.

  3. Agreement based on Gender − In English, there is agreement in gender between pronouns and antecedents. For example, He reached his destination. The ship reached her destination.

  4. Agreement based on Case − This kind of agreement is not a significant feature of English. For example, who came first − he or his sister?

Spoken Language Syntax

Written English grammar and spoken English grammar have many common features, but they also differ in a number of aspects. The following features distinguish spoken from written English grammar −

Disfluencies and Repair

This striking feature makes spoken and written English grammar differ from each other. Individually these are known as phenomena of disfluency and collectively as phenomena of repair. Disfluencies include the use of the following −

  1. Filler words − Sometimes in the middle of a sentence we use filler words, also called filled pauses. Examples of such words are uh and um.

  2. Reparandum and repair − The repeated segment of words within the sentence is called the reparandum; the changed wording is called the repair. Consider the following example to understand this −

Does ABC airlines offer any one-way flights uh one-way fares for 5000 rupees?

In the above sentence, one-way flights is the reparandum and one-way fares is the repair.

Restarts

Restarts occur after the filler pause. For example, in the above sentence, a restart occurs when the speaker starts asking about one-way flights, then stops, corrects himself with a filler pause, and then restarts by asking about one-way fares.

Word Fragments

Sometimes we speak sentences with smaller fragments of words. For example, w-wha-what is the time? Here the words w-wha are word fragments.

NLP - Information Retrieval

Information retrieval (IR) may be defined as a software program that deals with the organization, storage, retrieval and evaluation of information from document repositories, particularly textual information. The system assists users in finding the information they require, but it does not explicitly return the answers to their questions; it informs them of the existence and location of documents that might contain the required information. The documents that satisfy the user’s requirement are called relevant documents. A perfect IR system will retrieve only relevant documents.

With the help of the following diagram, we can understand the process of information retrieval (IR) −

[Figure: the information retrieval process − a user's query goes to the IR system, which returns relevant output in the form of documents]

It is clear from the above diagram that a user who needs information will have to formulate a request in the form of a query in natural language. The IR system will then respond by retrieving the relevant output, in the form of documents, about the required information.

Classical Problem in Information Retrieval (IR) System

The main goal of IR research is to develop a model for retrieving information from repositories of documents. Here, we are going to discuss a classical problem related to IR systems, named the ad-hoc retrieval problem.

In ad-hoc retrieval, the user must enter a query in natural language that describes the required information, and the IR system returns the documents related to it. For example, suppose we are searching for something on the Internet: the search provides some pages that are exactly relevant to our requirement, but there can be some non-relevant pages too. This is due to the ad-hoc retrieval problem.

Aspects of Ad-hoc Retrieval

Following are some aspects of ad-hoc retrieval that are addressed in IR research −

  1. How can users, with the help of relevance feedback, improve the original formulation of a query?

  2. How can database merging be implemented, i.e., how can results from different text databases be merged into one result set?

  3. How should partly corrupted data be handled? Which models are appropriate for this?

Information Retrieval (IR) Model

Mathematically, models are used in many scientific areas with the objective of understanding some phenomenon in the real world. A model of information retrieval predicts and explains what a user will find relevant to a given query. An IR model is basically a pattern that defines the above-mentioned aspects of the retrieval procedure and consists of the following −

  1. A model for documents.

  2. A model for queries.

  3. A matching function that compares queries to documents.

Mathematically, a retrieval model consists of −

D − Representation for documents.

Q − Representation for queries.

F − The modeling framework for D and Q, along with the relationship between them.

R(q, di) − A similarity function which orders the documents with respect to the query. It is also called ranking.

Types of Information Retrieval (IR) Model

An information retrieval (IR) model can be classified into the following three types −

Classical IR Model

It is the simplest and easiest IR model to implement. This model is based on mathematical knowledge that is easily recognized and understood. Boolean, Vector and Probabilistic are the three classical IR models.

Non-Classical IR Model

It is completely opposite to the classical IR model. Such IR models are based on principles other than similarity, probability and Boolean operations. Information logic models, situation theory models and interaction models are examples of non-classical IR models.

Alternative IR Model

It is an enhancement of the classical IR model that makes use of specific techniques from other fields. Cluster models, fuzzy models and latent semantic indexing (LSI) models are examples of alternative IR models.

Design features of Information retrieval (IR) systems

Let us now learn about the design features of IR systems −

Inverted Index

The primary data structure of most IR systems is the inverted index. We can define an inverted index as a data structure that lists, for every word, all documents that contain it and the frequency of its occurrences in each document. This makes it easy to search for ‘hits’ of a query word.
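
A minimal sketch of such an index, over three invented documents, is shown below; it maps each word to the documents containing it together with the occurrence counts.

```python
# A minimal inverted index: word -> {document id: occurrence count}.
from collections import defaultdict

documents = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}

index = defaultdict(lambda: defaultdict(int))
for doc_id, text in documents.items():
    for word in text.split():
        index[word][doc_id] += 1

# All documents containing a query word, with per-document frequencies.
print(dict(index["sales"]))  # {1: 1, 2: 1, 3: 1}
print(dict(index["july"]))   # {2: 1, 3: 1}
print(dict(index["in"]))     # {2: 1, 3: 2}
```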

Stop Word Elimination

Stop words are high-frequency words that are deemed unlikely to be useful for searching; they carry little semantic weight. All such words are kept in a list called a stop list. For example, the articles “a”, “an” and “the” and prepositions like “in”, “of”, “for” and “at” are examples of stop words. A stop list can significantly reduce the size of the inverted index: as per Zipf’s law, a stop list covering a few dozen words reduces the size of the inverted index by almost half. On the other hand, eliminating stop words can sometimes eliminate a term that is useful for searching. For example, if we eliminate “A” from “Vitamin A”, the remaining term loses its significance.

Stemming

Stemming, a simplified form of morphological analysis, is the heuristic process of extracting the base form of words by chopping off their endings. For example, the words laughing, laughs and laughed would all be stemmed to the root word laugh.
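
For instance, NLTK's Porter stemmer (one common implementation of this heuristic) reproduces exactly that behaviour:

```python
# Stemming with NLTK's Porter stemmer (requires: pip install nltk).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["laughing", "laughs", "laughed"]:
    print(word, "->", stemmer.stem(word))
# laughing -> laugh, laughs -> laugh, laughed -> laugh
```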

In the subsequent sections, we will discuss some important and useful IR models.

The Boolean Model

It is the oldest information retrieval (IR) model. The model is based on set theory and Boolean algebra, where documents are sets of terms and queries are Boolean expressions over terms. The Boolean model can be defined as −

  1. D − A set of words, i.e., the indexing terms present in a document. Here, each term is either present (1) or absent (0).

  2. Q − A Boolean expression, where the terms are index terms and the operators are logical products (AND), logical sums (OR) and logical differences (NOT).

  3. F − Boolean algebra over sets of terms as well as over sets of documents. If we talk about relevance feedback, then in the Boolean IR model the relevance prediction can be defined as follows −

  4. R − A document is predicted as relevant to the query expression if and only if it satisfies the query expression, as in −

((text ∨ information) ∧ retrieval ∧ ¬ theory)

We can explain this model by viewing a query term as an unambiguous definition of a set of documents.

For example, the query term “economic” defines the set of documents that are indexed with the term “economic”.

Now, what would be the result of combining terms with the Boolean AND operator? It will define a document set that is smaller than or equal to the document sets of any of the single terms. For example, the query with the terms “social” and “economic” will produce the set of documents that are indexed with both terms − in other words, the intersection of both sets.

Now, what would be the result of combining terms with the Boolean OR operator? It will define a document set that is bigger than or equal to the document sets of any of the single terms. For example, the query with the terms “social” or “economic” will produce the set of documents that are indexed with either the term “social” or the term “economic” − in other words, the union of both sets.
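
These set operations translate directly into code. The sketch below runs Boolean AND, OR and NOT over a toy collection in which each document is represented as its set of index terms.

```python
# Boolean retrieval: AND = intersection, OR = union, NOT = set difference.
documents = {
    1: {"social", "economic", "policy"},
    2: {"economic", "growth"},
    3: {"social", "media"},
}

def docs_with(term):
    return {d for d, terms in documents.items() if term in terms}

print(docs_with("social") & docs_with("economic"))  # AND -> {1}
print(docs_with("social") | docs_with("economic"))  # OR  -> {1, 2, 3}
print(docs_with("economic") - docs_with("social"))  # AND NOT social -> {2}
```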

Advantages of the Boolean Model

The advantages of the Boolean model are as follows −

  1. The simplest model, which is based on sets.

  2. Easy to understand and implement.

  3. It retrieves only exact matches.

  4. It gives the user a sense of control over the system.

Disadvantages of the Boolean Model

The disadvantages of the Boolean model are as follows −

  1. The model’s similarity function is Boolean. Hence, there would be no partial matches. This can be annoying for the users.

  2. In this model, the usage of Boolean operators has much more influence than a critical word.

  3. The query language is expressive, but it is complicated too.

  4. No ranking for retrieved documents.

Vector Space Model

Due to the above disadvantages of the Boolean model, Gerard Salton and his colleagues suggested a model based on Luhn’s similarity criterion, which states that “the more two representations agree in given elements and their distribution, the higher the probability of their representing similar information.”

Consider the following important points to understand more about the Vector Space Model −

  1. The index representations (documents) and the queries are considered as vectors embedded in a high dimensional Euclidean space.

  2. The similarity measure of a document vector to a query vector is usually the cosine of the angle between them.

Cosine Similarity Measure Formula

The cosine is a normalized dot product, which can be calculated with the help of the following formula −

Score \lgroup \vec{d},\vec{q} \rgroup = \frac{\sum_{k=1}^m d_{k}\cdot q_{k}}{\sqrt{\sum_{k=1}^m \lgroup d_{k} \rgroup^2}\cdot \sqrt{\sum_{k=1}^m \lgroup q_{k} \rgroup^2}}

Score \lgroup \vec{d},\vec{q} \rgroup = 1\:when\:d = q

Score \lgroup \vec{d},\vec{q} \rgroup = 0\:when\:d\:and\:q\:share\:no\:terms
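
The short sketch below evaluates this formula for a toy query and three toy document vectors over the two terms used in the next example (car and insurance); the weights are invented for illustration.

```python
# Cosine similarity between a query vector and document vectors.
import math

def cosine(d, q):
    dot = sum(dk * qk for dk, qk in zip(d, q))
    norm = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(x * x for x in q))
    return dot / norm if norm else 0.0

q = [1.0, 1.0]        # query mentions both car and insurance
d1 = [3.0, 0.5]       # car-heavy document
d2 = [2.0, 2.0]       # both terms salient
d3 = [0.4, 3.0]       # insurance-heavy document

for name, d in [("d1", d1), ("d2", d2), ("d3", d3)]:
    print(name, round(cosine(d, q), 3))  # d2 scores highest
```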

Vector Space Representation with Query and Document

The query and the documents are represented in a two-dimensional vector space whose terms are car and insurance. There is one query and three documents in the vector space.

[Figure: a query q and three documents d1, d2, d3 in the two-dimensional (car, insurance) vector space]

The top-ranked document in response to the terms car and insurance will be document d2, because the angle between q and d2 is the smallest. The reason is that both concepts, car and insurance, are salient in d2 and hence have high weights. On the other hand, d1 and d3 also mention both terms, but in each case one of them is not a centrally important term in the document.

Term Weighting

Term weighting refers to the weights on the terms in the vector space. The higher the weight of a term, the greater the impact of that term on the cosine. More weight should be assigned to the more important terms in the model. Now the question that arises here is how we can model this.

One way to do this is to count the words in a document and use the counts as term weights. However, do you think this would be an effective method?

Another, more effective method is to use term frequency (tfij), document frequency (dfi) and collection frequency (cfi).

Term Frequency (tfij)

It may be defined as the number of occurrences of wi in dj. The information captured by term frequency is how salient a word is within the given document; in other words, the higher the term frequency, the better that word describes the content of the document.

Document Frequency (dfi)

It may be defined as the total number of documents in the collection in which wi occurs. It is an indicator of informativeness: semantically focused words will occur several times in a document, unlike semantically unfocused words.

Collection Frequency (cfi)

It may be defined as the total number of occurrences of wi in the collection.

Mathematically, $df_{i}\leq cf_{i}\:and\:\sum_{j}tf_{ij} = cf_{i}$

Forms of Document Frequency Weighting

Let us now learn about the different forms of document frequency weighting. The forms are described below −

Term Frequency Factor

This is also classified as the term frequency factor, which means that if a term t appears often in a document, then a query containing t should retrieve that document. We can combine a word’s term frequency (tfij) and document frequency (dfi) into a single weight as follows −

weight \left ( i,j \right ) = \begin{cases}(1+log(tf_{ij}))\:log\frac{N}{df_{i}} & if\:tf_{ij}\geq 1\\0 & if\:tf_{ij} = 0\end{cases}

Here N is the total number of documents.

Inverse Document Frequency (idf)

This is another form of document frequency weighting, often called idf weighting or inverse document frequency weighting. The important point of idf weighting is that a term’s scarcity across the collection is a measure of its importance, and importance is inversely proportional to frequency of occurrence.

Mathematically,

idf_{t} = log\left(1+\frac{N}{n_{t}}\right)

idf_{t} = log\left(\frac{N-n_{t}}{n_{t}}\right)

Here,

N = the number of documents in the collection

nt = the number of documents containing the term t
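
The sketch below evaluates the combined weight and the two idf variants for one invented term (the counts N, tf and df are illustrative only; natural logarithms are used).

```python
# tf-idf style weighting: weight(i, j) = (1 + log tf) * log(N / df).
import math

N = 1000   # documents in the collection
tf = 5     # occurrences of the term in this document
df = 20    # documents containing the term

weight = (1 + math.log(tf)) * math.log(N / df) if tf >= 1 else 0.0
print(round(weight, 3))

# The two idf variants given above, for the same term:
print(round(math.log(1 + N / df), 3))       # idf_t = log(1 + N/n_t)
print(round(math.log((N - df) / df), 3))    # idf_t = log((N - n_t)/n_t)
```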

User Query Improvement

The primary goal of any information retrieval system must be accuracy − producing relevant documents as per the user’s requirement. However, the question that arises here is how we can improve the output by improving the user’s query formulation style. Certainly, the output of any IR system depends on the user’s query, and a well-formed query will produce more accurate results. The user can improve his or her query with the help of relevance feedback, an important aspect of any IR model.

Relevance Feedback

Relevance feedback takes the output that is initially returned for the given query. This initial output can be used to gather user information and to determine whether that output is relevant for performing a new query. The feedback can be classified as follows −

Explicit Feedback

It may be defined as the feedback that is obtained from assessors of relevance. These assessors indicate the relevance of each document retrieved for the query. In order to improve query retrieval performance, the relevance feedback information needs to be interpolated with the original query.

Assessors or other users of the system may indicate relevance explicitly by using the following relevance systems −

  1. Binary relevance system − This relevance feedback system indicates that a document is either relevant (1) or irrelevant (0) for a given query.

  2. Graded relevance system − The graded relevance feedback system indicates the relevance of a document, for a given query, on the basis of grading by using numbers, letters or descriptions. The description can be like “not relevant”, “somewhat relevant”, “very relevant” or “relevant”.

Implicit Feedback

It is feedback that is inferred from user behavior, such as the time a user spends viewing a document, which documents are selected for viewing and which are not, and page browsing and scrolling actions. One of the best examples of implicit feedback is dwell time, a measure of how much time a user spends viewing the page linked to in a search result.

Pseudo Feedback

It is also called blind feedback. It provides a method for automatic local analysis. The manual part of relevance feedback is automated with the help of pseudo relevance feedback, so that the user gets improved retrieval performance without an extended interaction. The main advantage of this feedback system is that it does not require assessors, unlike the explicit relevance feedback system.

Consider the following steps to implement this feedback −

  1. Step 1 − First, the result returned by the initial query must be taken as the relevant result. The range of relevant results should be within the top 10-50 results.

  2. Step 2 − Now, select the top 20-30 terms from those documents using, for instance, the term frequency (tf) - inverse document frequency (idf) weight.

  3. Step 3 − Add these terms to the query and match the returned documents. Then return the most relevant documents, as in the sketch below.
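
Below is an illustrative Python sketch of how these three steps could fit together; the search(query) function is a hypothetical stand-in for the underlying IR system, and the term weighting is done with scikit-learn −

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

def pseudo_relevance_feedback(query, search, top_docs=10, top_terms=20):
   # Step 1 − treat the top-ranked results of the initial query as relevant
   relevant = search(query)[:top_docs]

   # Step 2 − pick the terms with the highest tf-idf weight in those documents
   vectorizer = TfidfVectorizer(stop_words="english")
   tfidf = vectorizer.fit_transform(relevant)
   scores = np.asarray(tfidf.sum(axis=0)).ravel()
   terms = np.array(vectorizer.get_feature_names_out())
   expansion = terms[np.argsort(scores)[::-1][:top_terms]]

   # Step 3 − add the expansion terms to the query and search again
   return search(query + " " + " ".join(expansion))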

Applications of NLP

Natural Language Processing (NLP) is an emerging technology that underpins the various forms of AI we see today, and its use for creating a seamless and interactive interface between humans and machines will continue to be a top priority for today’s and tomorrow’s increasingly cognitive applications. Here, we are going to discuss some of the very useful applications of NLP.

Machine Translation

Machine translation (MT), the process of translating one source language or text into another language, is one of the most important applications of NLP. We can understand the process of machine translation with the help of the following flowchart −

[Figure: machine translation flowchart]

Types of Machine Translation Systems

There are different types of machine translation systems. Let us see what the different types are.

Bilingual MT System

Bilingual MT systems produce translations between two particular languages.

Multilingual MT System

Multilingual MT systems produce translations between any pair of languages. They may be either uni-directional or bi-directional in nature.

Approaches to Machine Translation (MT)

Let us now learn about the important approaches to Machine Translation. The approaches to MT are as follows −

Direct MT Approach

It is the oldest, though less popular, approach to MT. Systems that use this approach are capable of translating SL (source language) directly into TL (target language). Such systems are bilingual and uni-directional in nature.

Interlingua Approach

Systems that use the Interlingua approach translate SL into an intermediate language called Interlingua (IL) and then translate IL into TL. The Interlingua approach can be understood with the help of the following MT pyramid −

[Figure: MT pyramid illustrating the Interlingua approach]

Transfer Approach

Three stages are involved in this approach.

  1. In the first stage, source language (SL) texts are converted to abstract SL-oriented representations.

  2. In the second stage, SL-oriented representations are converted into equivalent target language (TL)-oriented representations.

  3. In the third stage, the final text is generated.

Empirical MT Approach

This is an emerging approach for MT. Basically, it uses a large amount of raw data in the form of parallel corpora. The raw data consists of texts and their translations. Analogy-based, example-based and memory-based machine translation techniques use the empirical MT approach.

Fighting Spam

One of the most common problems these days is unwanted email. This makes spam filters all the more important, because they are the first line of defense against this problem.

A spam filtering system can be developed by using NLP functionality, taking into account the major false-positive and false-negative issues.

Existing NLP models for spam filtering

The following are some existing NLP models for spam filtering −

N-gram Modeling

An N-Gram model is an N-character slice of a longer string. In this model, N-grams of several different lengths are used simultaneously in processing and detecting spam emails.
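
For example, the character N-grams of different lengths for a single word can be produced with a few lines of Python (a toy illustration) −

def char_ngrams(text, n):
   # all N-character slices of the string
   return [text[i:i + n] for i in range(len(text) - n + 1)]

for n in (2, 3, 4):
   print(n, char_ngrams("viagra", n))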

Word Stemming

Spammers, the generators of spam emails, usually change one or more characters of the offending words in their spam so that they can breach content-based spam filters. That is why we can say that content-based filters are not useful if they cannot understand the meaning of the words or phrases in the email. In order to eliminate such issues in spam filtering, a rule-based word stemming technique, which can match words that look alike and sound alike, has been developed.

Bayesian Classification

This has now become a widely used technique for spam filtering. It is a statistical technique in which the incidence of words in an email is measured against their typical occurrence in a database of unsolicited (spam) and legitimate (ham) email messages.
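
A toy sketch of this idea using scikit-learn’s MultinomialNB is shown below; the four training messages and their labels are invented purely for illustration −

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
   "win money now", "cheap pills offer",          # spam
   "meeting at noon", "project report attached",  # ham
]
labels = [1, 1, 0, 0]   # 1 = spam, 0 = ham

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

classifier = MultinomialNB().fit(X, labels)
print(classifier.predict(vectorizer.transform(["cheap money offer"])))   # [1]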

Automatic Summarization

In this digital era, the most valuable thing is data, or you could say information. However, do we really get useful information in the required amount? The answer is ‘NO’, because information is overloaded and our access to knowledge and information far exceeds our capacity to understand it. We are in serious need of automatic text summarization, because the flood of information over the internet is not going to stop.

Text summarization may be defined as the technique of creating a short, accurate summary of longer text documents. Automatic text summarization will help us obtain relevant information in less time. Natural language processing (NLP) plays an important role in developing automatic text summarization.

Question-answering

Another major application of natural language processing (NLP) is question answering. Search engines put the information of the world at our fingertips, but they are still lacking when it comes to answering the questions posed by human beings in their natural language. Big tech companies like Google are also working in this direction.

Question answering is a Computer Science discipline within the fields of AI and NLP. It focuses on building systems that automatically answer questions posed by human beings in their natural language. A computer system that understands natural language can translate sentences written by humans into an internal representation, so that valid answers can be generated by the system. The exact answers can be generated by performing syntactic and semantic analysis of the questions. Lexical gaps, ambiguity and multilingualism are some of the challenges for NLP in building a good question answering system.

Sentiment Analysis

Another important application of natural language processing (NLP) is sentiment analysis. As the name suggests, sentiment analysis is used to identify sentiments in posts, including cases where the emotions are not expressed explicitly. Companies are using sentiment analysis to identify the opinions and sentiments of their customers online. It helps companies understand what their customers think about their products and services. With the help of sentiment analysis, companies can judge their overall reputation from customer posts. In this way, we can say that beyond determining simple polarity, sentiment analysis understands sentiments in context, helping us better understand what is behind an expressed opinion.

Natural Language Processing - Python

In this chapter, we will learn about language processing using Python.

The following features make Python different from other languages −

  1. Python is interpreted − We do not need to compile our Python program before executing it because the interpreter processes Python at runtime.

  2. Interactive − We can directly interact with the interpreter to write our Python programs.

  3. Object-oriented − Python is object-oriented in nature, which makes programs easier to write because this programming technique encapsulates code within objects.

  4. Easy for beginners to learn − Python is also called a beginner’s language because it is very easy to understand, and it supports the development of a wide range of applications.

Prerequisites

The latest release of Python 3 is Python 3.7.1, which is available for Windows, Mac OS and most flavors of Linux.

  1. For Windows, we can go to the link www.python.org/downloads/windows/ to download and install Python.

  2. For Mac OS, we can use the link www.python.org/downloads/mac-osx/.

  3. In the case of Linux, different flavors of Linux use different package managers to install new packages. For example, to install Python 3 on Ubuntu Linux, we can use the following command from the terminal −

$ sudo apt-get install python3-minimal

To study more about Python programming, read the Python 3 basic tutorial.

Getting Started with NLTK

We will be using the Python library NLTK (Natural Language Toolkit) for text analysis in the English language. The Natural Language Toolkit (NLTK) is a collection of Python libraries designed especially for identifying and tagging parts of speech found in natural-language text such as English.

Installing NLTK

Before starting to use NLTK, we need to install it. With the help of the following command, we can install it in our Python environment −

pip install nltk

If we are using Anaconda, then a Conda package for NLTK can be installed by using the following command −

conda install -c anaconda nltk

Downloading NLTK’s Data

After installing NLTK, another important task is to download its preset text repositories so that they can be easily used. However, before that we need to import NLTK the way we import any other Python module. The following command will help us import NLTK −

import nltk

Now, download the NLTK data with the help of the following command −

nltk.download()

It will take some time to install all available packages of NLTK.

Other Necessary Packages

Some other Python packages, such as gensim and pattern, are also necessary for text analysis as well as for building natural language processing applications using NLTK. The packages can be installed as shown below −

gensim

gensim is a robust semantic modeling library which can be used for many applications. We can install it with the following command −

pip install gensim

pattern

It can be used to make the gensim package work properly. The following command helps in installing pattern −

pip install pattern

Tokenization

Tokenization may be defined as the process of breaking given text into smaller units called tokens. Words, numbers or punctuation marks can be tokens. It may also be called word segmentation.

Example

Input − Bed and chair are types of furniture.

[Figure: tokenized output of the example sentence]

NLTK provides different packages for tokenization. We can use these packages based on our requirements. The packages and the details of their installation are as follows −

sent_tokenize package

This package can be used to divide the input text into sentences. We can import it by using the following command −

from nltk.tokenize import sent_tokenize

word_tokenize package

This package can be used to divide the input text into words. We can import it by using the following command −

from nltk.tokenize import word_tokenize

WordPunctTokenizer package

This package can be used to divide the input text into words and punctuation marks. We can import it by using the following command −

from nltk.tokenize import WordPunctTokenizer
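
The three tokenizers can be compared on a small example; note that sent_tokenize and word_tokenize need the punkt data, which can be fetched once with nltk.download('punkt') −

from nltk.tokenize import sent_tokenize, word_tokenize, WordPunctTokenizer

text = "Bed and chair are types of furniture. Aren't they?"

print(sent_tokenize(text))
# ['Bed and chair are types of furniture.', "Aren't they?"]

print(word_tokenize(text))
# word_tokenize splits the contraction into 'Are' and "n't"

print(WordPunctTokenizer().tokenize(text))
# WordPunctTokenizer splits "Aren't" into 'Aren', "'", 't'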

Stemming

Due to grammatical reasons, language includes a lot of variation: English, like other languages, has different forms of a word, for example the words democracy, democratic, and democratization. For machine learning projects, it is very important for machines to understand that such different words have the same base form. That is why it is very useful to extract the base forms of words while analyzing text.

Stemming is a heuristic process that helps in extracting the base forms of words by chopping off their ends.

The different packages for stemming provided by the NLTK module are as follows −

PorterStemmer package

Porter’s algorithm is used by this stemming package to extract the base form of the words. With the help of the following command, we can import this package −

from nltk.stem.porter import PorterStemmer

For example, ‘write’ would be the output of the word ‘writing’ given as the input to this stemmer.

LancasterStemmer package

Lancaster’s algorithm is used by this stemming package to extract the base form of the words. With the help of the following command, we can import this package −

from nltk.stem.lancaster import LancasterStemmer

For example, ‘writ’ would be the output of the word ‘writing’ given as the input to this stemmer.

SnowballStemmer package

The Snowball algorithm is used by this stemming package to extract the base form of the words. With the help of the following command, we can import this package −

from nltk.stem.snowball import SnowballStemmer

For example, ‘write’ would be the output of the word ‘writing’ given as the input to this stemmer.
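
Putting the three stemmers side by side reproduces the outputs mentioned above −

from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.snowball import SnowballStemmer

word = "writing"
print(PorterStemmer().stem(word))              # write
print(LancasterStemmer().stem(word))           # writ
print(SnowballStemmer("english").stem(word))   # write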

Lemmatization

It is another way to extract the base forms of words, normally aiming to remove inflectional endings by using vocabulary and morphological analysis. After lemmatization, the base form of a word is called its lemma.

The NLTK module provides the following package for lemmatization −

WordNetLemmatizer package

This package will extract the base form of a word depending on whether it is used as a noun or as a verb. The following command can be used to import this package −

from nltk.stem import WordNetLemmatizer
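
For example, the lemma depends on the pos argument ('n' for noun, 'v' for verb); the WordNet data must be downloaded once with nltk.download('wordnet') −

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("writing", pos="v"))   # write
print(lemmatizer.lemmatize("feet", pos="n"))      # foot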

Counting POS Tags − Chunking

The identification of parts of speech (POS) and short phrases can be done with the help of chunking. It is one of the important processes in natural language processing. Just as tokenization creates tokens, chunking labels those tokens; in other words, chunking gives us the structure of the sentence.

Example

In the following example, we will implement Noun-Phrase chunking, a category of chunking that finds the noun phrase chunks in a sentence, using the NLTK Python module.

Consider the following steps to implement noun-phrase chunking −

Step 1: Chunk grammar definition

In this step, we need to define the grammar for chunking. It consists of the rules that we need to follow.

Step 2: Chunk parser creation

Next, we need to create a chunk parser. It will parse the grammar and give the output.

Step 3: The Output

In this step, we will get the output in a tree format.

Running the NLP Script

Start by importing the NLTK package −

import nltk

Now, we need to define the sentence.

Here,

  1. DT is the determiner

  2. VBP is the verb

  3. JJ is the adjective

  4. IN is the preposition

  5. NN is the noun

sentence = [("a", "DT"),("clever","JJ"),("fox","NN"),("was","VBP"),
   ("jumping","VBP"),("over","IN"),("the","DT"),("wall","NN")]

Next, the grammar should be given in the form of a regular expression.

grammar = "NP:{<DT>?<JJ>*<NN>}"

Now, we need to define a parser for parsing the grammar.

parser_chunking = nltk.RegexpParser(grammar)

Now, the parser will parse the sentence as follows −

parser_chunking.parse(sentence)

Next, store the output in a variable as follows −

output = parser_chunking.parse(sentence)

Now, the following code will help you draw your output in the form of a tree.

output.draw()
[Figure: the chunked sentence drawn as a tree by output.draw()]