Natural Language Processing: A Concise Tutorial

Natural Language Discourse Processing

Processing natural language by computer is widely regarded as one of the hardest problems in artificial intelligence. Within NLP, one of the major problems is discourse processing: building theories and models of how utterances stick together to form coherent discourse. Language does not consist of isolated, unrelated sentences; like a movie, it consists of collocated, structured, and coherent groups of sentences. Such a coherent group of sentences is referred to as a discourse.

Concept of Coherence

Coherence and discourse structure are interconnected in many ways. Coherence, along with other properties of good text, is used to evaluate the output quality of natural language generation systems. What does it mean for a text to be coherent? Suppose we collected one sentence from every page of a newspaper; would the result be a discourse? Of course not, because these sentences do not exhibit coherence. A coherent discourse must possess the following properties:

Coherence relation between utterances

A discourse is coherent if there are meaningful connections between its utterances. Such a connection is called a coherence relation. For example, there must be some sort of explanation that justifies the connection between utterances.

Relationship between entities

Another property that makes a discourse coherent is that its utterances must be related through the entities they mention. This kind of coherence is called entity-based coherence.

Discourse structure

An important question regarding discourse is what kind of structure it must have. The answer depends on the segmentation we apply to the discourse. Discourse segmentation may be defined as determining the types of structure in a large discourse. It is quite difficult to implement, but it is very important for applications such as information retrieval, text summarization, and information extraction.

Algorithms for Discourse Segmentation

In this section, we will learn about the algorithms for discourse segmentation. The algorithms are described below:

Unsupervised Discourse Segmentation

Unsupervised discourse segmentation is often framed as linear segmentation. As an example of the linear segmentation task, consider segmenting a text into multi-paragraph units, each of which represents a passage of the original text. These algorithms rely on cohesion, which may be defined as the use of certain linguistic devices to tie textual units together. Lexical cohesion, in particular, is cohesion indicated by a relationship between two or more words in two units, such as the use of synonyms.
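The idea of using lexical cohesion for linear segmentation can be sketched in a few lines of Python. This is only a minimal illustration, not a published algorithm: it compares adjacent sentences rather than multi-paragraph blocks, treats word repetition as the only cohesion signal (no synonyms), and the stop-word list, the `threshold` value, and all function names are assumptions made for this example.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for this illustration.
STOP_WORDS = {"the", "a", "an", "this", "of", "to", "and"}


def bag_of_words(sentence):
    """Count the lowercase alphabetic tokens of a sentence, minus stop words."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def segment(sentences, threshold=0.1):
    """Return each index i where a boundary is placed before sentences[i].

    A boundary is proposed wherever two adjacent sentences share too
    little vocabulary, i.e. where lexical cohesion drops below threshold.
    """
    boundaries = []
    for i in range(1, len(sentences)):
        left = bag_of_words(sentences[i - 1])
        right = bag_of_words(sentences[i])
        if cosine(left, right) < threshold:
            boundaries.append(i)
    return boundaries


sentences = [
    "The bank raised interest rates.",
    "Higher rates worry the bank customers.",
    "The monsoon arrived early this year.",
    "Heavy monsoon rain flooded the fields.",
]
print(segment(sentences))  # [2]
```

The single boundary falls where the vocabulary shifts from banking to weather. A more realistic system would compare larger blocks of text and smooth the similarity scores before choosing boundaries.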

Supervised Discourse Segmentation

The unsupervised method above does not use any hand-labeled segment boundaries. Supervised discourse segmentation, on the other hand, needs boundary-labeled training data, which is often easy to acquire. In supervised discourse segmentation, discourse markers, also called cue words, play an important role. A discourse marker or cue word is a word or phrase that signals discourse structure. These discourse markers are domain-specific.
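As a rough illustration of how cue words might feed a supervised segmenter, the sketch below turns each sentence into a small feature dictionary and pairs it with a hand-assigned boundary label. The cue-phrase list, the feature names, and the two labeled sentences are all hypothetical; a real system would use a domain-specific cue list and pass such rows to a classifier of its choice.

```python
# Hypothetical cue phrases; real lists are domain-specific.
CUE_PHRASES = ("however", "meanwhile", "in conclusion", "on the other hand")


def boundary_features(sentence):
    """Build features a boundary classifier might consume for one sentence."""
    text = sentence.lower().strip()
    starts_with_cue = any(text.startswith(cue) for cue in CUE_PHRASES)
    return {"starts_with_cue": starts_with_cue, "num_tokens": len(text.split())}


# Label 1 means a new segment starts at this sentence, 0 means it does not.
labeled = [
    ("Ram opened an account.", 0),
    ("Meanwhile, Shyam closed his shop.", 1),
]
training_rows = [(boundary_features(s), label) for s, label in labeled]

for features, label in training_rows:
    print(features, label)
```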

Text Coherence

Lexical repetition is one way to find structure in a discourse, but it does not by itself satisfy the requirement that the discourse be coherent. To achieve coherent discourse, we must focus specifically on coherence relations. As we know, a coherence relation defines a possible connection between utterances in a discourse. Hobbs proposed the following relations:

We use two terms, S0 and S1, to represent the meanings of the two related sentences:

Result

Result infers that the state asserted by S0 could cause the state asserted by S1. For example, the following two statements show the Result relation: Ram was caught in the fire. His skin burned.

Explanation

Explanation infers that the state asserted by S1 could cause the state asserted by S0. For example, the following two statements show the Explanation relation: Ram fought with Shyam’s friend. He was drunk.

Parallel

Parallel infers p(a1, a2, …) from the assertion of S0 and p(b1, b2, …) from the assertion of S1, where ai and bi are similar for all i. For example, the following two statements are parallel: Ram wanted a car. Shyam wanted money.

Elaboration

Elaboration infers the same proposition P from both assertions, S0 and S1. For example, the following two statements show the Elaboration relation: Ram was from Chandigarh. He grew up in that city.

Occasion

Occasion holds when a change of state can be inferred from the assertion of S0 whose final state can be inferred from S1, or vice versa. For example, the following two statements show the Occasion relation: Ram picked up the book. He gave it to Shyam.

Building Hierarchical Discourse Structure

The coherence of an entire discourse can also be captured by a hierarchical structure over coherence relations. For example, the following passage can be represented as a hierarchical structure:

S1 − Ram went to the bank to deposit money.
S2 − He then took a train to Shyam’s cloth shop.
S3 − He wanted to buy some clothes.
S4 − He did not have new clothes for the party.
S5 − He also wanted to talk to Shyam regarding his health.

[Figure: hierarchical discourse structure of the passage above]
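One way to make such a hierarchy concrete is to store it as a tree whose leaves are the sentences S1 to S5 and whose internal nodes are coherence relations. The sketch below is an illustration only: the particular relation assigned to each node is one plausible reading of the passage, assumed for this example, and the `Relation` class and `render` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Relation:
    """An internal tree node: a coherence relation over two sub-discourses."""
    name: str
    left: Union["Relation", str]
    right: Union["Relation", str]


def render(node, depth=0):
    """Return the tree as a list of indented lines, one per node."""
    indent = "  " * depth
    if isinstance(node, str):  # a leaf is just a sentence label
        return [indent + node]
    lines = [indent + node.name]
    lines += render(node.left, depth + 1)
    lines += render(node.right, depth + 1)
    return lines


# Assumed analysis: S3 and S4 explain each other, S5 parallels that reason,
# the combined reasons explain S2, and S1 is followed by S2 (Occasion).
tree = Relation(
    "Occasion",
    "S1",
    Relation(
        "Explanation",
        "S2",
        Relation("Parallel", Relation("Explanation", "S3", "S4"), "S5"),
    ),
)
print("\n".join(render(tree)))
```

Printing the tree shows each relation above the two units it connects, which is the same information a diagram of the hierarchy would convey.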

Reference Resolution

Interpreting the sentences of a discourse is another important task, and to do it we need to know who or what entity is being talked about. Interpreting references is the key element here. A reference may be defined as a linguistic expression that denotes an entity or individual. For example, in the passage "Ram, the manager of ABC bank, saw his friend Shyam at a shop. He went to meet him.", the linguistic expressions Ram, his, and He are references.

Similarly, reference resolution may be defined as the task of determining which entities are referred to by which linguistic expressions.

Terminology Used in Reference Resolution

We use the following terminology in reference resolution:

Referring expression − A natural language expression that is used to perform reference is called a referring expression. For example, Ram, his, and He in the passage above are referring expressions.
Referent − The entity that is referred to. For example, in the passage above the person Ram is a referent.
Corefer − When two expressions refer to the same entity, they are said to corefer. For example, Ram and He corefer.
Antecedent − An expression that licenses the use of a later expression. For example, Ram is the antecedent of the reference He.
Anaphora & Anaphoric − Anaphora is reference to an entity that has previously been introduced into the discourse; the referring expression used is called anaphoric.
Discourse model − The model that contains representations of the entities that have been referred to in the discourse and the relationships in which they participate.
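The terms above fit together naturally in a small data structure. The following sketch models a discourse model as a dictionary of entities, each holding the referring expressions that have mentioned it so far. The class and method names (`DiscourseModel`, `evoke`, `access`, `antecedent`) are assumptions made for this illustration, and treating the first mention as the antecedent is a simplification.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A referent together with the referring expressions that mention it."""
    entity_id: str
    mentions: list = field(default_factory=list)


class DiscourseModel:
    """Tracks the entities referred to so far in a discourse."""

    def __init__(self):
        self.entities = {}

    def evoke(self, entity_id, expression):
        """First mention: introduce a new entity into the model."""
        self.entities[entity_id] = Entity(entity_id, [expression])

    def access(self, entity_id, expression):
        """Later mention: attach the expression to an existing entity."""
        self.entities[entity_id].mentions.append(expression)

    def antecedent(self, entity_id):
        """Simplification: use the first mention as the antecedent."""
        return self.entities[entity_id].mentions[0]


# "Ram, the manager of ABC bank, saw his friend Shyam at a shop.
#  He went to meet him."
model = DiscourseModel()
model.evoke("ram", "Ram")
model.access("ram", "his")
model.evoke("shyam", "his friend Shyam")
model.access("ram", "He")
model.access("shyam", "him")

print(model.entities["ram"].mentions)  # ['Ram', 'his', 'He']
print(model.antecedent("shyam"))       # his friend Shyam
```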

Types of Referring Expressions

Let us now look at the different types of referring expressions. The five types are described below:

Indefinite Noun Phrases

This kind of reference introduces entities that are new to the hearer in the discourse context. For example, in the sentence "Ram had gone around one day to bring him some food", some food is an indefinite reference.

Definite Noun Phrases

In contrast to the above, this kind of reference denotes entities that are not new to the hearer, or that the hearer can identify, in the discourse context. For example, in the sentence "I used to read The Times of India", The Times of India is a definite reference.

Pronouns

A pronoun is a form of definite reference. For example, in the sentence "Ram laughed as loud as he could", the word he is a pronoun referring expression.

Demonstratives

Demonstrative pronouns behave differently from simple definite pronouns. For example, this and that are demonstrative pronouns.

Names

A name is the simplest type of referring expression. It can be the name of a person, an organization, or a location. For example, in the examples above, Ram is a name referring expression.

Reference Resolution Tasks

The two reference resolution tasks are described below.

Coreference Resolution

Coreference resolution is the task of finding the referring expressions in a text that refer to the same entity; in simple words, it is the task of finding expressions that corefer. A set of coreferring expressions is called a coreference chain. For example, Ram, the manager of ABC bank, his, and He form a coreference chain in the passage about Ram and Shyam given earlier.
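Once pairs of coreferring expressions are known, grouping them into coreference chains is a simple clustering step. The sketch below assumes the pairwise links are already given (deciding them is the hard part of the task) and merges them with a small union-find structure. The function name and the hand-written `links` are hypothetical, and mentions are represented as unique strings for simplicity.

```python
def coreference_chains(mentions, links):
    """Group mentions into chains, given pairs of mentions known to corefer."""
    parent = {m: m for m in mentions}

    def find(m):
        # Follow parent pointers to the chain representative,
        # shortening the path along the way.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for a, b in links:
        parent[find(a)] = find(b)

    chains = {}
    for m in mentions:  # keeps mentions in textual order within a chain
        chains.setdefault(find(m), []).append(m)
    return list(chains.values())


# Mentions from: "Ram, the manager of ABC bank, saw his friend Shyam
# at a shop. He went to meet him."
mentions = ["Ram", "the manager of ABC bank", "his", "Shyam", "He", "him"]
links = [
    ("Ram", "the manager of ABC bank"),
    ("Ram", "his"),
    ("his", "He"),
    ("Shyam", "him"),
]
chains = coreference_chains(mentions, links)
print(chains)
```

With these links the function yields two chains, one for Ram and one for Shyam.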

Constraint on Coreference Resolution

In English, the main problem for coreference resolution is the pronoun it, because it has many uses. For example, it can refer to an entity, much as he and she do, but it can also be used without referring to any specific thing. For example, in "It’s raining. It is really good." the first it does not refer to any specific thing.
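A coreference system therefore usually needs some way to set aside uses of it that do not refer to anything (often called pleonastic uses) before looking for an antecedent. The sketch below uses two hand-written regular expressions for this. The pattern list is hypothetical and far from complete; it only illustrates the idea of filtering such uses.

```python
import re

# Hypothetical, deliberately small pattern list: weather expressions
# and "it is <adjective> that ..." constructions.
PLEONASTIC_PATTERNS = [
    r"\bit(?:'s|’s| is| was)\s+(?:raining|snowing|sunny|cold|hot)\b",
    r"\bit(?:'s|’s| is| was)\s+(?:clear|likely|possible|important)\s+that\b",
]


def is_pleonastic(sentence):
    """Return True if the sentence matches a known non-referential use of it."""
    text = sentence.lower()
    return any(re.search(pattern, text) for pattern in PLEONASTIC_PATTERNS)


print(is_pleonastic("It’s raining."))       # True
print(is_pleonastic("It is really good."))  # False
```

Sentences that pass this filter would still need ordinary resolution to decide what it refers to.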

Pronominal Anaphora Resolution

Unlike coreference resolution, pronominal anaphora resolution may be defined as the task of finding the antecedent of a single pronoun. For example, given the pronoun his, the task of pronominal anaphora resolution is to find the word Ram, because Ram is the antecedent.
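A very simple baseline for this task picks the most recent preceding mention that agrees with the pronoun in gender and number. The sketch below implements only that heuristic; the `Mention` class, the `PRONOUNS` table, and the hand-assigned gender and number values are assumptions for illustration, not part of any particular system.

```python
from dataclasses import dataclass


@dataclass
class Mention:
    text: str
    gender: str  # "m", "f", or "n"
    number: str  # "sg" or "pl"


# Agreement features per pronoun; None means any gender is acceptable.
PRONOUNS = {
    "he": ("m", "sg"), "his": ("m", "sg"), "him": ("m", "sg"),
    "she": ("f", "sg"), "her": ("f", "sg"),
    "it": ("n", "sg"), "they": (None, "pl"),
}


def resolve(pronoun, preceding):
    """Return the most recent preceding mention that agrees with the pronoun."""
    gender, number = PRONOUNS[pronoun.lower()]
    for candidate in reversed(preceding):
        if candidate.number == number and gender in (None, candidate.gender):
            return candidate
    return None


# Mentions that precede "his" in
# "Ram, the manager of ABC bank, saw his friend Shyam at a shop."
preceding = [Mention("Ram", "m", "sg"), Mention("ABC bank", "n", "sg")]
antecedent = resolve("his", preceding)
print(antecedent.text)  # Ram
```

Recency and agreement alone are not enough in general: in "He went to meet him", both Ram and Shyam agree with either pronoun, so further syntactic or semantic constraints would be needed.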