Natural Language Processing: A Concise Tutorial

NLP - Word Sense Disambiguation

Words can have different meanings depending on the context in which they are used in a sentence. Human languages are ambiguous for the same reason: many words can be interpreted in multiple ways depending on the context in which they occur.

In natural language processing (NLP), word sense disambiguation (WSD) may be defined as the ability to determine which meaning of a word is activated by its use in a particular context. Lexical ambiguity, whether syntactic or semantic, is one of the very first problems that any NLP system faces. Part-of-speech (POS) taggers with a high level of accuracy can resolve a word’s syntactic ambiguity. Resolving semantic ambiguity, on the other hand, is the task called WSD, and it is harder than resolving syntactic ambiguity.

For example, consider two sentences that use distinct senses of the word “bass” −

I can hear bass sound.
He likes to eat grilled bass.

The two occurrences of the word bass clearly carry distinct meanings. In the first sentence it means frequency, and in the second it means fish. Hence, once WSD has disambiguated them, the correct meaning can be assigned to each sentence as follows −

I can hear bass/frequency sound.
He likes to eat grilled bass/fish.

Evaluation of WSD

The evaluation of WSD requires the following two inputs −

A Dictionary

The first input for the evaluation of WSD is a dictionary, which is used to specify the senses to be disambiguated.

Test Corpus

The other input required for evaluating WSD is a test corpus annotated with the target, or correct, senses. Test corpora can be of two types −

Lexical sample − This kind of corpus is used when the system is required to disambiguate a small sample of words.
All-words − This kind of corpus is used when the system is expected to disambiguate all the words in a piece of running text.
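
The two inputs above can be combined into a simple score. The following is a minimal sketch, assuming accuracy (the fraction of annotated instances tagged with the correct sense) as the metric; the text does not name a metric, and the instance IDs, sense labels and the `accuracy` helper are hypothetical.

```python
def accuracy(predicted, gold):
    """Fraction of gold-annotated instances whose predicted sense matches."""
    if not gold:
        raise ValueError("gold annotations must not be empty")
    correct = sum(
        1 for instance_id, sense in gold.items()
        if predicted.get(instance_id) == sense
    )
    return correct / len(gold)


# Hypothetical lexical-sample test corpus for the target word "bass":
# each instance ID maps to the sense chosen by a human annotator.
gold = {
    "s1": "bass/frequency",
    "s2": "bass/fish",
    "s3": "bass/fish",
    "s4": "bass/frequency",
}

# Hypothetical output of some WSD system on the same instances.
predicted = {
    "s1": "bass/frequency",
    "s2": "bass/fish",
    "s3": "bass/frequency",  # wrong: the annotator chose bass/fish
    "s4": "bass/frequency",
}

print(f"accuracy = {accuracy(predicted, gold):.2f}")  # 3 of 4 correct
```

An instance that the system leaves untagged counts as wrong in this sketch, because `predicted.get` returns `None` for it.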

Approaches and Methods to Word Sense Disambiguation (WSD)

Approaches and methods to WSD are classified according to the source of knowledge used in word disambiguation.

Let us now look at the four conventional methods for WSD −

Dictionary-based or Knowledge-based Methods

As the name suggests, these methods rely primarily on dictionaries, thesauri and lexical knowledge bases for disambiguation. They do not use corpus evidence. The Lesk method, introduced by Michael Lesk in 1986, is the seminal dictionary-based method. The definition on which the Lesk algorithm is based is “measure overlap between sense definitions for all words in context”. In 2000, however, Kilgarriff and Rosenzweig gave the simplified Lesk definition as “measure overlap between sense definitions of word and current context”, which means identifying the correct sense for one word at a time. Here, the current context is the set of words in the surrounding sentence or paragraph.
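
The simplified Lesk idea can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: the two glosses for “bass”, the stopword list and the function names are all hypothetical, and a real system would take its sense definitions from an actual dictionary.

```python
import re

# Hypothetical two-sense dictionary entry for "bass"; the glosses are
# written for this illustration and not taken from a real dictionary.
SENSES = {
    "bass": {
        "frequency": "the lowest range of sound or musical tone that a person can hear",
        "fish": "a freshwater or sea fish that people catch and eat",
    }
}

# Small illustrative stopword list so that function words do not count as overlap.
STOPWORDS = {"a", "an", "the", "of", "or", "to", "that", "and", "can", "i", "he"}


def tokens(text):
    """Lowercase content words of a text, as a set."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def simplified_lesk(word, sentence, senses=SENSES):
    """Pick the sense whose gloss shares the most words with the context."""
    context = tokens(sentence) - {word}
    best_sense, best_overlap = None, -1
    for sense, gloss in senses[word].items():
        overlap = len(context & tokens(gloss))
        if overlap > best_overlap:  # on a tie, the earlier sense is kept
            best_sense, best_overlap = sense, overlap
    return best_sense


print(simplified_lesk("bass", "I can hear bass sound."))        # frequency
print(simplified_lesk("bass", "He likes to eat grilled bass."))  # fish
```

The first sentence shares “hear” and “sound” with the frequency gloss, and the second shares “eat” with the fish gloss. With such short glosses the overlap is often zero, which is a known weakness of overlap counting.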

Supervised Methods

For disambiguation, supervised machine learning methods are trained on sense-annotated corpora. These methods assume that the context alone provides enough evidence to disambiguate the sense, so world knowledge and reasoning are deemed unnecessary. The context is represented as a set of “features” of the words, including information about the surrounding words. Support vector machines and memory-based learning are the most successful supervised learning approaches to WSD. These methods rely on a substantial amount of manually sense-tagged corpora, which are very expensive to create.
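
As an illustration of memory-based learning, the sketch below stores sense-tagged sentences and labels a new sentence by a majority vote among its k most similar stored examples. It assumes a bag-of-words feature set and Jaccard similarity; the six training sentences, the feature choice and the function names are hypothetical and far smaller than any real sense-tagged corpus.

```python
import re
from collections import Counter


def features(sentence, target):
    """Bag-of-words features: the words surrounding the target word."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return {w for w in words if w != target}


def jaccard(a, b):
    """Set overlap between two feature sets, in the range 0..1."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def knn_sense(sentence, target, training, k=3):
    """Label a sentence with the majority sense of its k nearest examples."""
    query = features(sentence, target)
    ranked = sorted(
        training,
        key=lambda example: jaccard(query, features(example[0], target)),
        reverse=True,
    )
    votes = Counter(sense for _, sense in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical sense-tagged training examples for the target word "bass".
TRAINING = [
    ("the bass guitar sound was loud", "frequency"),
    ("turn up the bass on the speaker", "frequency"),
    ("he played a deep bass sound", "frequency"),
    ("we caught a bass in the lake", "fish"),
    ("she grilled the bass for dinner", "fish"),
    ("they like to eat fried bass", "fish"),
]

print(knn_sense("I can hear bass sound", "bass", TRAINING))         # frequency
print(knn_sense("He likes to eat grilled bass", "bass", TRAINING))  # fish
```

Nothing is learned ahead of time here: all the work happens at query time, which is why the quality of the result depends directly on how many tagged examples are stored.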

Semi-supervised Methods

Because training corpora are scarce, most word sense disambiguation algorithms use semi-supervised learning methods, which make use of both labelled and unlabelled data. These methods require only a small amount of annotated text and a large amount of plain unannotated text. The technique used by semi-supervised methods is bootstrapping from seed data.
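
Bootstrapping from seed data can be sketched as a loop that labels only the sentences it is sure about and then learns new indicator words from them. This is a simplified illustration, not a specific published algorithm: the seed words, the sentences, the stopword list and the "strictly more matches than any other sense" confidence rule are all assumptions made for the example.

```python
import re

# Small illustrative stopword list (an assumption for this example).
STOPWORDS = {"i", "he", "she", "you", "a", "the", "to", "can", "on", "from", "was"}


def context_words(sentence, target):
    """Content words of a sentence, excluding the target word itself."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return {w for w in words if w != target and w not in STOPWORDS}


def bootstrap(seeds, unlabeled, target, max_rounds=10):
    """Grow per-sense indicator words from seeds; return sentence -> sense."""
    indicators = {sense: set(words) for sense, words in seeds.items()}
    labels = {}
    for _ in range(max_rounds):
        newly_labeled = {}
        for sentence in unlabeled:
            if sentence in labels:
                continue
            words = context_words(sentence, target)
            scores = {s: len(words & ind) for s, ind in indicators.items()}
            ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
            best_sense, best_score = ranked[0]
            runner_up = ranked[1][1] if len(ranked) > 1 else 0
            if best_score > runner_up:  # label only unambiguous winners
                newly_labeled[sentence] = best_sense
        if not newly_labeled:
            break  # nothing new was learned, so stop
        for sentence, sense in newly_labeled.items():
            labels[sentence] = sense
            indicators[sense] |= context_words(sentence, target)
    return labels


# One hypothetical seed word per sense, plus plain unannotated sentences.
seeds = {"frequency": {"sound"}, "fish": {"eat"}}
unlabeled = [
    "I can hear bass sound",
    "He likes to eat grilled bass",
    "You can hear the bass from the speaker",
    "She grilled a bass on the fire",
    "The bass was huge",
]

for sentence, sense in bootstrap(seeds, unlabeled, "bass").items():
    print(f"{sense:10s} {sentence}")
```

In the first round only the two sentences containing a seed word are labelled. They contribute “hear” and “grilled” as new indicators, which lets the second round label the third and fourth sentences. “The bass was huge” never matches any indicator and stays unlabelled.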

Unsupervised Methods

These methods assume that similar senses occur in similar contexts. Senses can therefore be induced from text by clustering word occurrences using some measure of context similarity. This task is called word sense induction or discrimination. Because they do not depend on manual effort, unsupervised methods have great potential to overcome the knowledge acquisition bottleneck.
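
Clustering occurrences by context similarity can be sketched as follows. This is a deliberately naive, order-dependent greedy clustering, assuming Jaccard similarity over context words and a hand-picked threshold; the sentences, the threshold value and the function names are hypothetical.

```python
import re

# Small illustrative stopword list (an assumption for this example).
STOPWORDS = {"the", "a", "from", "for", "we"}


def context_words(sentence, target):
    """Content words of a sentence, excluding the target word itself."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return {w for w in words if w != target and w not in STOPWORDS}


def jaccard(a, b):
    """Set overlap between two word sets, in the range 0..1."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def induce_senses(sentences, target, threshold=0.2):
    """Greedily group occurrences of the target word by context similarity."""
    clusters = []  # each cluster pools the context words of its members
    for sentence in sentences:
        words = context_words(sentence, target)
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = jaccard(words, cluster["words"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(sentence)
            best["words"] |= words
        else:
            clusters.append({"words": set(words), "members": [sentence]})
    return [cluster["members"] for cluster in clusters]


sentences = [
    "loud bass sound from the speaker",
    "grilled bass for dinner",
    "the speaker makes a deep bass sound",
    "we cooked bass for dinner",
]

for i, members in enumerate(induce_senses(sentences, "bass"), start=1):
    print(f"induced sense {i}: {members}")
```

On this input the sketch yields two groups, one built around “sound” and “speaker” and one around “dinner”. The induced senses carry no names: no dictionary or annotation was used, so the output is a partition of occurrences rather than a labelling.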

Applications of Word Sense Disambiguation (WSD)

Word sense disambiguation (WSD) is applied in almost every application of language technology.

Let us now look at the scope of WSD −

Machine Translation

Machine translation (MT) is the most obvious application of WSD. In MT, WSD performs lexical choice for words that have distinct translations for different senses. The senses in MT are represented as words in the target language. Most machine translation systems, however, do not use an explicit WSD module.

Information Retrieval (IR)

Information retrieval (IR) may be defined as a software program that deals with the organization, storage, retrieval and evaluation of information, particularly textual information, from document repositories. The system assists users in finding the information they require, but it does not explicitly return answers to their questions. WSD is used to resolve ambiguities in the queries submitted to an IR system. As with MT, current IR systems do not explicitly use a WSD module; they rely on the user typing enough context in the query to retrieve only relevant documents.

Text Mining and Information Extraction (IE)

In most applications, WSD is necessary for accurate analysis of text. For example, WSD helps an intelligence-gathering system flag the correct words: a medical intelligence system might need to flag “illegal drugs” rather than “medical drugs”.

Lexicography

WSD and lexicography can work together in a loop because modern lexicography is corpus-based. For lexicography, WSD provides rough empirical sense groupings as well as statistically significant contextual indicators of sense.

Difficulties in Word Sense Disambiguation (WSD)

The following are some of the difficulties faced by word sense disambiguation (WSD) −

Differences between dictionaries

The major problem of WSD is deciding what the senses of a word are, because different senses can be very closely related. Even different dictionaries and thesauri can divide words into senses differently.

Different algorithms for different applications

Another problem of WSD is that completely different algorithms might be needed for different applications. For example, in machine translation the problem takes the form of target word selection, whereas in information retrieval a sense inventory is not required.

Inter-judge variance

Another problem of WSD is that WSD systems are generally tested by comparing their results on a task against those of human annotators. Human judges, however, often disagree with one another about the correct sense. This is called the problem of inter-judge variance.

Word-sense discreteness

Another difficulty in WSD is that words cannot easily be divided into discrete sub-meanings.