Natural Language Toolkit - Quick Guide

Natural Language Toolkit - Introduction

What is Natural Language Processing (NLP)?

The method of communication with the help of which humans can speak, read, and write, is language. In other words, we humans can think, make plans, make decisions in our natural language. Here the big question is, in the era of artificial intelligence, machine learning and deep learning, can humans communicate in natural language with computers/machines? Developing NLP applications is a huge challenge for us because computers require structured data, but on the other hand, human speech is unstructured and often ambiguous in nature.

Natural Language Processing (NLP) is a subfield of computer science, more specifically of AI, which enables computers/machines to understand, process and manipulate human language. In simple words, NLP is a way for machines to analyze, understand and derive meaning from human natural languages like Hindi, English, French, Dutch, etc.

How does it work?

Before getting deep into the working of NLP, we must understand how human beings use language. Every day, we humans use hundreds or thousands of words, and other humans interpret them and answer accordingly. It is simple communication for humans, isn’t it? But we know that words run much deeper than that, and we always derive context from what we say and how we say it. That is why we can say that NLP draws on contextual patterns rather than focusing on voice modulation.

Let us understand it with an example −

Man is to woman as king is to what?
We can interpret it easily and answer as follows:
Man relates to king, so woman can relate to queen.
Hence the answer is Queen.

How do humans know what each word means? The answer to this question is that we learn through our experience. But how do machines/computers learn the same?

Let us understand it with the following easy steps −

  1. First, we need to feed the machines with enough data so that machines can learn from experience.

  2. Then machine will create word vectors, by using deep learning algorithms, from the data we fed earlier as well as from its surrounding data.

  3. Then by performing simple algebraic operations on these word vectors, the machine would be able to provide the answer just as human beings do (see the sketch below).
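To make the three steps concrete, here is a minimal sketch in plain Python. The tiny two-dimensional vectors are invented purely for illustration; a real system would learn high-dimensional word vectors from large amounts of text.

# Toy word vectors (made up for illustration only)
vectors = {
   'man':   [1.0, 0.0],
   'woman': [1.0, 1.0],
   'king':  [5.0, 0.0],
   'queen': [5.0, 1.0],
}

def sub(v, w): return [a - b for a, b in zip(v, w)]
def add(v, w): return [a + b for a, b in zip(v, w)]
def dist(v, w): return sum((a - b) ** 2 for a, b in zip(v, w)) ** 0.5

# king - man + woman should land closest to queen
target = add(sub(vectors['king'], vectors['man']), vectors['woman'])
print(min(vectors, key=lambda word: dist(vectors[word], target)))   # queen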

Components of NLP

The following diagram represents the components of natural language processing (NLP) −

[Diagram: Components of NLP − Morphological Processing, Syntax Analysis, Semantic Analysis, Pragmatic Analysis]

Morphological Processing

Morphological processing is the first component of NLP. It includes breaking of chunks of language input into sets of tokens corresponding to paragraphs, sentences and words. For example, a word like “everyday” can be broken into two sub-word tokens as “every-day”.

Syntax analysis

Syntax Analysis, the second component, is one of the most important components of NLP. The purposes of this component are as follows −

  1. To check whether a sentence is well formed or not.

  2. To break it up into a structure that shows the syntactic relationships between the different words.

  3. E.g. The sentences like “The school goes to the student” would be rejected by syntax analyzer.

Semantic analysis

Semantic Analysis is the third component of NLP which is used to check the meaningfulness of the text. It includes drawing exact meaning, or we can say dictionary meaning from the text. E.g. The sentences like “It’s a hot ice-cream.” would be discarded by semantic analyzer.

Pragmatic analysis

Pragmatic analysis is the fourth component of NLP. It includes fitting the actual objects or events that exist in each context with object references obtained by previous component i.e. semantic analysis. E.g. The sentences like “Put the fruits in the basket on the table” can have two semantic interpretations hence the pragmatic analyzer will choose between these two possibilities.

Examples of NLP Applications

NLP is an emerging technology that drives many of the forms of AI we see today. For today’s and tomorrow’s increasingly cognitive applications, the use of NLP in creating a seamless and interactive interface between humans and machines will continue to be a top priority. Following are some of the very useful applications of NLP.

Machine Translation

Machine translation (MT) is one of the most important applications of natural language processing. MT is basically a process of translating one source language or text into another language. A machine translation system can be either bilingual or multilingual.

Fighting Spam

Due to the enormous increase in unwanted emails, spam filters have become important as the first line of defense against this problem. Treating false positives and false negatives as the key issues, the functionality of NLP can be used to develop a spam filtering system.

N-gram modelling, Word Stemming and Bayesian classification are some of the existing NLP models that can be used for spam filtering.
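As a rough, hedged illustration of the Bayesian classification idea mentioned above, the sketch below trains NLTK's NaiveBayesClassifier on a tiny, invented set of labelled messages with simple bag-of-words features; both the data and the feature function are assumptions made for this example.

import nltk

# Tiny hand-labelled training set (invented for illustration)
train = [
   ("win a free prize now", "spam"),
   ("cheap meds limited offer", "spam"),
   ("meeting agenda for tomorrow", "ham"),
   ("lunch with the project team", "ham"),
]

def bag_of_words(text):
   # mark which words are present in the message
   return {word: True for word in text.lower().split()}

train_set = [(bag_of_words(text), label) for (text, label) in train]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(classifier.classify(bag_of_words("free prize offer")))          # likely 'spam'
print(classifier.classify(bag_of_words("agenda for the team lunch"))) # likely 'ham'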

Most of the search engines like Google, Yahoo, Bing, WolframAlpha, etc., base their machine translation (MT) technology on NLP deep learning models. Such deep learning models allow algorithms to read the text on a webpage, interpret its meaning and translate it into another language.

Automatic Text Summarization

Automatic text summarization is a technique which creates a short, accurate summary of longer text documents. Hence, it helps us in getting relevant information in less time. In this digital era, we are in a serious need of automatic text summarization because we have the flood of information over internet which is not going to stop. NLP and its functionalities play an important role in developing an automatic text summarization.
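A minimal frequency-based summarizer can be sketched with the NLTK pieces covered in this tutorial (sentence and word tokenization, stopwords, FreqDist). The summarize() helper and its scoring scheme are illustrative assumptions, not a standard NLTK API, and the punkt and stopwords resources must be downloaded first.

from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.tokenize import sent_tokenize, word_tokenize

def summarize(text, num_sentences=2):
   # Score each sentence by the frequency of its non-stopword words
   stops = set(stopwords.words('english'))
   words = [w.lower() for w in word_tokenize(text) if w.isalpha() and w.lower() not in stops]
   freq = FreqDist(words)
   sentences = sent_tokenize(text)
   scores = [sum(freq[w.lower()] for w in word_tokenize(s)) for s in sentences]
   # Keep the highest-scoring sentences in their original order
   best = sorted(sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:num_sentences])
   return ' '.join(sentences[i] for i in best)

sample = ("NLP is a subfield of AI. NLP helps machines understand human language. "
   "Tokenization, stemming and lemmatization are common NLP tasks. Many chatbots use NLP.")
print(summarize(sample))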

Grammar Correction

Spelling correction & grammar correction is a very useful feature of word processor software like Microsoft Word. Natural language processing (NLP) is widely used for this purpose.

Question-answering

Question-answering, another main application of natural language processing (NLP), focuses on building systems which automatically answer the question posted by user in their natural language.

Sentiment analysis

Sentiment analysis is another important application of natural language processing (NLP). As its name implies, sentiment analysis is used to −

  1. Identify the sentiments among several posts and

  2. Identify the sentiment where the emotions are not expressed explicitly.

Online E-commerce companies like Amazon, ebay, etc., are using sentiment analysis to identify the opinion and sentiment of their customers online. It will help them to understand what their customers think about their products and services.
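As a small illustration, NLTK ships the VADER sentiment analyzer, which can score short pieces of customer feedback; the review sentences below are invented, and the vader_lexicon resource must be downloaded once.

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')   # one-time download of the VADER lexicon

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("The product is great and the delivery was fast."))
print(sia.polarity_scores("Terrible quality, I want my money back."))
# Each call returns a dict of 'neg', 'neu', 'pos' and 'compound' scores;
# the 'compound' value summarizes the overall sentiment.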

Speech engines

Speech engines like Siri, Google Voice, Alexa are built on NLP so that we can communicate with them in our natural language.

Implementing NLP

In order to build the above-mentioned applications, we need a specific skill set with a great understanding of language and tools to process language efficiently. To achieve this, various tools are available. Some of them are open source, while others are developed by organizations to build their own NLP applications. Following is a list of some NLP tools −

  1. Natural Language Tool Kit (NLTK)

  2. Mallet

  3. GATE

  4. Open NLP

  5. UIMA

  6. Gensim

  7. Stanford toolkit

Most of these tools are written in Java.

Natural Language Tool Kit (NLTK)

Among the above-mentioned NLP tools, NLTK scores very high when it comes to ease of use and explanation of concepts. The learning curve of Python is very fast, and since NLTK is written in Python, it also makes a very good learning toolkit. NLTK covers most of the common tasks, such as tokenization, stemming, lemmatization, punctuation handling, character counts and word counts. It is very elegant and easy to work with.

Natural Language Toolkit - Getting Started

In order to install NLTK, we must have Python installed on our computers. You can go to the link www.python.org/downloads and select the latest version for your OS i.e. Windows, Mac and Linux/Unix. For basic tutorial on Python you can refer to the link www.tutorialspoint.com/python3/index.htm.

Now, once you have Python installed on your computer system, let us understand how we can install NLTK.

Installing NLTK

We can install NLTK on various OS as follows −

On Windows

In order to install NLTK on Windows OS, follow the below steps −

  1. First, open the Windows command prompt and navigate to the location of the pip folder.

  2. Next, enter the following command to install NLTK −

pip3 install nltk

Now, open the PythonShell from Windows Start Menu and type the following command in order to verify NLTK’s installation −

import nltk

If you get no error, you have successfully installed NLTK on your Windows OS having Python3.

On Mac/Linux

In order to install NLTK on Mac/Linux OS, write the following command −

sudo pip install -U nltk

If you don’t have pip installed on your computer, then follow the instruction given below to first install pip

First, update the package index using the following command −

sudo apt update

Now, type the following command to install pip for python 3 −

sudo apt install python3-pip

Through Anaconda

In order to install NLTK through Anaconda, follow the below steps −

First, to install Anaconda, go to the link https://www.anaconda.com/download and then select the version of Python you need to install.

Once you have Anaconda on your computer system, go to its command prompt and write the following command −

conda install -c anaconda nltk

You need to review the output and enter ‘yes’. NLTK will be downloaded and installed in your Anaconda package.

Downloading NLTK’s Dataset and Packages

Now we have NLTK installed on our computers, but in order to use it we need to download the datasets (corpora) available in it. Some of the important datasets available are stopwords, gutenberg, framenet_v15 and so on.

With the help of following commands, we can download all the NLTK datasets −

import nltk
nltk.download()

You will get the following NLTK download window.

[Screenshot: NLTK Downloader window]

Now, click on the download button to download the datasets.
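If you prefer not to download the full collection through the window, individual datasets can also be fetched programmatically; the resources below are simply the ones used later in this tutorial.

import nltk

nltk.download('punkt')      # pre-trained sentence and word tokenizer models
nltk.download('stopwords')  # stopword lists for several languages
nltk.download('wordnet')    # the WordNet lexical database
nltk.download('webtext')    # the webtext corpus used in a later chapter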

How to run NLTK script?

Following is an example in which we implement the Porter stemming algorithm by using the PorterStemmer nltk class. With this example you would be able to understand how to run an NLTK script.

First, we need to import the natural language toolkit(nltk).

import nltk

Now, import the PorterStemmer class to implement the Porter Stemmer algorithm.

from nltk.stem import PorterStemmer

Next, create an instance of Porter Stemmer class as follows −

word_stemmer = PorterStemmer()

Now, input the word you want to stem −

word_stemmer.stem('writing')

Output

'write'
word_stemmer.stem('eating')

Output

'eat'

Natural Language Toolkit - Tokenizing Text

What is Tokenizing?

It may be defined as the process of breaking up a piece of text into smaller parts, such as sentences and words. These smaller parts are called tokens. For example, a word is a token in a sentence, and a sentence is a token in a paragraph.

As we know that NLP is used to build applications such as sentiment analysis, QA systems, language translation, smart chatbots, voice systems, etc., hence, in order to build them, it becomes vital to understand the pattern in the text. The tokens, mentioned above, are very useful in finding and understanding these patterns. We can consider tokenization as the base step for other recipes such as stemming and lemmatization.

NLTK package

nltk.tokenize is the package provided by NLTK module to achieve the process of tokenization.

Tokenizing sentences into words

Splitting the sentence into words or creating a list of words from a string is an essential part of every text processing activity. Let us understand it with the help of various functions/modules provided by nltk.tokenize package.

word_tokenize module

word_tokenize module is used for basic word tokenization. Following example will use this module to split a sentence into words.

Example

import nltk
from nltk.tokenize import word_tokenize
word_tokenize('Tutorialspoint.com provides high quality technical tutorials for free.')

Output

['Tutorialspoint.com', 'provides', 'high', 'quality', 'technical', 'tutorials', 'for', 'free', '.']

TreebankWordTokenizer Class

The word_tokenize module used above is basically a wrapper function that calls the tokenize() method of an instance of the TreebankWordTokenizer class. It will give the same output as we get while using the word_tokenize() module for splitting sentences into words. Let us see the same example implemented above −

Example

First, we need to import the natural language toolkit(nltk).

import nltk

Now, import the TreebankWordTokenizer class to implement the word tokenizer algorithm −

from nltk.tokenize import TreebankWordTokenizer

Next, create an instance of TreebankWordTokenizer class as follows −

Tokenizer_wrd = TreebankWordTokenizer()

Now, input the sentence you want to convert to tokens −

Tokenizer_wrd.tokenize(
   'Tutorialspoint.com provides high quality technical tutorials for free.'
)

Output

[
   'Tutorialspoint.com', 'provides', 'high', 'quality',
   'technical', 'tutorials', 'for', 'free', '.'
]

Complete implementation example

Let us see the complete implementation example below

import nltk
from nltk.tokenize import TreebankWordTokenizer
tokenizer_wrd = TreebankWordTokenizer()
tokenizer_wrd.tokenize('Tutorialspoint.com provides high quality technical tutorials for free.')

Output

[
   'Tutorialspoint.com', 'provides', 'high', 'quality',
   'technical', 'tutorials','for', 'free', '.'
]

The most significant convention of a tokenizer is to separate contractions. For example, if we use word_tokenize() module for this purpose, it will give the output as follows −

Example

import nltk
from nltk.tokenize import word_tokenize
word_tokenize("won't")

Output

['wo', "n't"]

Such kind of convention by TreebankWordTokenizer is unacceptable. That’s why we have two alternative word tokenizers namely PunktWordTokenizer and WordPunctTokenizer.

WordPunctTokenizer Class

An alternative word tokenizer that splits all punctuation into separate tokens. Let us understand it with the following simple example −

Example

from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
tokenizer.tokenize(" I can't allow you to go home early")

Output

['I', 'can', "'", 't', 'allow', 'you', 'to', 'go', 'home', 'early']

Tokenizing text into sentences

In this section we are going to split text/paragraph into sentences. NLTK provides sent_tokenize module for this purpose.

Why is it needed?

An obvious question that comes to mind is: when we have a word tokenizer, why do we need a sentence tokenizer, or why do we need to tokenize text into sentences? Suppose we need to count the average number of words per sentence; how can we do this? To accomplish this task, we need both sentence tokenization and word tokenization.

Let us understand the difference between sentence and word tokenizer with the help of following simple example −

Example

import nltk
from nltk.tokenize import sent_tokenize
text = "Let us understand the difference between sentence & word tokenizer. It is going to be a simple example."
sent_tokenize(text)

Output

[
   "Let us understand the difference between sentence & word tokenizer.",
   'It is going to be a simple example.'
]
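Coming back to the earlier question of counting the average number of words per sentence, here is a minimal sketch that combines both tokenizers; note that word_tokenize() counts punctuation marks as tokens too.

from nltk.tokenize import sent_tokenize, word_tokenize

text = "Let us understand the difference between sentence & word tokenizer. It is going to be a simple example."
sentences = sent_tokenize(text)
words_per_sentence = [len(word_tokenize(sentence)) for sentence in sentences]
print(words_per_sentence)
print(sum(words_per_sentence) / len(sentences))   # average words per sentence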

Sentence tokenization using regular expressions

If you feel that the output of the word tokenizer is unacceptable and you want complete control over how to tokenize the text, you can use regular expressions while doing tokenization. NLTK provides the RegexpTokenizer class to achieve this.

Let us understand the concept with the help of two examples below.

In the first example, we will be using a regular expression for matching alphanumeric tokens plus single quotes so that we don’t split contractions like “won’t”.

Example 1

import nltk
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"[\w']+")
tokenizer.tokenize("won't is a contraction.")
tokenizer.tokenize("can't is a contraction.")

Output

["won't", 'is', 'a', 'contraction']
["can't", 'is', 'a', 'contraction']

In the second example, we will be using a regular expression to tokenize on whitespace.

Example 2

import nltk
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\s+', gaps = True)
tokenizer.tokenize("won't is a contraction.")

Output

["won't", 'is', 'a', 'contraction.']

From the above output, we can see that the punctuation remains in the tokens. The parameter gaps = True means the pattern is going to identify the gaps to tokenize on. On the other hand, if we use the gaps = False parameter, then the pattern is used to identify the tokens themselves, as can be seen in the following example −

import nltk
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\s+', gaps = False)
tokenizer.tokenize("won't is a contraction.")

Output

[' ', ' ', ' ']

This time the output contains only the whitespace matches themselves, because with gaps = False the pattern describes the tokens rather than the separators.

Training Tokenizer & Filtering Stopwords

Why to train own sentence tokenizer?

This is a very important question: if we have NLTK’s default sentence tokenizer, then why do we need to train a sentence tokenizer? The answer lies in the quality of NLTK’s default sentence tokenizer, which is basically a general-purpose tokenizer. Although it works very well, it may not be a good choice for nonstandard text, which perhaps our text is, or for text that has unique formatting. To tokenize such text and get the best results, we should train our own sentence tokenizer.

Implementation Example

For this example, we will be using the webtext corpus. The text file which we are going to use from this corpus has the text formatted as dialogs, as shown below −

Guy: How old are you?
Hipster girl: You know, I never answer that question. Because to me, it's about
how mature you are, you know? I mean, a fourteen year old could be more mature
than a twenty-five year old, right? I'm sorry, I just never answer that question.
Guy: But, uh, you're older than eighteen, right?
Hipster girl: Oh, yeah.

We have saved this text file with the name of training_tokenizer. NLTK provides a class named PunktSentenceTokenizer with the help of which we can train on raw text to produce a custom sentence tokenizer. We can get raw text either by reading in a file or from an NLTK corpus using the raw() method.

Let us see the example below to get more insight into it −

First, import PunktSentenceTokenizer class from nltk.tokenize package −

from nltk.tokenize import PunktSentenceTokenizer

Now, import webtext corpus from nltk.corpus package

from nltk.corpus import webtext

Next, by using raw() method, get the raw text from training_tokenizer.txt file as follows −

text = webtext.raw('C://Users/Leekha/training_tokenizer.txt')

Now, create an instance of PunktSentenceTokenizer and print the tokenized sentences from the text file as follows −

sent_tokenizer = PunktSentenceTokenizer(text)
sents_1 = sent_tokenizer.tokenize(text)
print(sents_1[0])

Output

White guy: So, do you have any plans for this evening?
print(sents_1[1])
Output:
Asian girl: Yeah, being angry!
print(sents_1[670])
Output:
Guy: A hundred bucks?
print(sents_1[675])
Output:
Girl: But you already have a Big Mac...

Complete implementation example

from nltk.tokenize import PunktSentenceTokenizer
from nltk.corpus import webtext
text = webtext.raw('C://Users/Leekha/training_tokenizer.txt')
sent_tokenizer = PunktSentenceTokenizer(text)
sents_1 = sent_tokenizer.tokenize(text)
print(sents_1[0])

Output

White guy: So, do you have any plans for this evening?

To understand the difference between NLTK’s default sentence tokenizer and our own trained sentence tokenizer, let us tokenize the same file with default sentence tokenizer i.e. sent_tokenize().

from nltk.tokenize import sent_tokenize
from nltk.corpus import webtext
text = webtext.raw('C://Users/Leekha/training_tokenizer.txt')
sents_2 = sent_tokenize(text)

print(sents_2[0])
Output:

White guy: So, do you have any plans for this evening?
print(sents_2[675])
Output:
Hobo: Y'know what I'd do if I was rich?

The difference in the output helps us understand why it is useful to train our own sentence tokenizer.

What are stopwords?

Stopwords are common words that are present in the text but do not contribute to the meaning of a sentence. Such words are not at all important for the purpose of information retrieval or natural language processing. The most common stopwords are ‘the’ and ‘a’.

NLTK stopwords corpus

Actually, the Natural Language Toolkit comes with a stopword corpus containing word lists for many languages. Let us understand its usage with the help of the following example −

First, import the stopwords corpus from the nltk.corpus package −

from nltk.corpus import stopwords

Now, we will be using the stopwords from the English language

english_stops = set(stopwords.words('english'))
words = ['I', 'am', 'a', 'writer']
[word for word in words if word not in english_stops]

Output

['I', 'writer']

Complete implementation example

from nltk.corpus import stopwords
english_stops = set(stopwords.words('english'))
words = ['I', 'am', 'a', 'writer']
[word for word in words if word not in english_stops]

Output

['I', 'writer']

Finding complete list of supported languages

With the help of following Python script, we can also find the complete list of languages supported by NLTK stopwords corpus −

from nltk.corpus import stopwords
stopwords.fileids()

Output

[
   'arabic', 'azerbaijani', 'danish', 'dutch', 'english', 'finnish', 'french',
   'german', 'greek', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali',
   'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 'spanish',
   'swedish', 'tajik', 'turkish'
]

Looking up words in Wordnet

What is Wordnet?

Wordnet is a large lexical database of English, which was created by Princeton University. It is a part of the NLTK corpus. Nouns, verbs, adjectives and adverbs are all grouped into sets of cognitive synonyms called synsets. Each synset expresses a distinct meaning. Following are some use cases of Wordnet −

  1. It can be used to look up the definition of a word

  2. We can find synonyms and antonyms of a word

  3. Word relations and similarities can be explored using Wordnet

  4. Word sense disambiguation for those words having multiple uses and definitions (see the sketch after this list)
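For the word sense disambiguation use case, NLTK ships a simple Lesk implementation in nltk.wsd. The sentence below is an invented example, and the exact synset returned depends on the overlap heuristic, so treat this as a sketch.

from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

# Disambiguate 'bank' using the surrounding words as context
context = word_tokenize("I went to the bank to deposit my money")
sense = lesk(context, 'bank')
print(sense, '-', sense.definition())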

How to import Wordnet?

Wordnet can be imported with the help of following command −

from nltk.corpus import wordnet

For more compact command, use the following −

from nltk.corpus import wordnet as wn

Synset instances

Synsets are groupings of synonymous words that express the same concept. When you use Wordnet to look up words, you will get a list of Synset instances.

wordnet.synsets(word)

To get a list of Synsets, we can look up any word in Wordnet by using wordnet.synsets(word). For example, in next Python recipe, we are going to look up the Synset for the ‘dog’ along with some properties and methods of Synset −

Example

First, import the wordnet as follows −

from nltk.corpus import wordnet as wn

Now, provide the word you want to look up the Synset for −

syn = wn.synsets('dog')[0]

Here, we are using name() method to get the unique name for the synset which can be used to get the Synset directly −

syn.name()
Output:
'dog.n.01'

Next, we are using definition() method which will give us the definition of the word −

syn.definition()
Output:
'a member of the genus Canis (probably descended from the common wolf) that has
been domesticated by man since prehistoric times; occurs in many breeds'

Another method is examples() which will give us the examples related to the word −

syn.examples()
Output:
['the dog barked all night']

Complete implementation example

from nltk.corpus import wordnet as wn
syn = wn.synsets('dog')[0]
syn.name()
syn.definition()
syn.examples()

Getting Hypernyms

Synsets are organized in an inheritance tree like structure in which Hypernyms represents more abstracted terms while Hyponyms represents the more specific terms. One of the important things is that this tree can be traced all the way to a root hypernym. Let us understand the concept with the help of the following example −

from nltk.corpus import wordnet as wn
syn = wn.synsets('dog')[0]
syn.hypernyms()

Output

[Synset('canine.n.02'), Synset('domestic_animal.n.01')]

Here, we can see that canine and domestic_animal are the hypernyms of ‘dog’.

Now, we can find hyponyms of ‘dog’ as follows −

syn.hypernyms()[0].hyponyms()

Output

[
   Synset('bitch.n.04'),
   Synset('dog.n.01'),
   Synset('fox.n.01'),
   Synset('hyena.n.01'),
   Synset('jackal.n.01'),
   Synset('wild_dog.n.01'),
   Synset('wolf.n.01')
]

From the above output, we can see that ‘dog’ is only one of the many hyponyms of ‘canine’.

To find the root of all these, we can use the following command −

syn.root_hypernyms()

Output

[Synset('entity.n.01')]

From the above output, we can see it has only one root.

Complete implementation example

from nltk.corpus import wordnet as wn
syn = wn.synsets('dog')[0]
syn.hypernyms()
syn.hypernyms()[0].hyponyms()
syn.root_hypernyms()

Output

[Synset('entity.n.01')]

Lemmas in Wordnet

In linguistics, the canonical form or morphological form of a word is called a lemma. To find a synonym as well as antonym of a word, we can also lookup lemmas in WordNet. Let us see how.

Finding Synonyms

By using the lemmas() method, we can find the number of lemmas (synonyms) of a Synset. Let us apply this method on the ‘dog’ synset −

Example

from nltk.corpus import wordnet as wn
syn = wn.synsets('dog')[0]
lemmas = syn.lemmas()
len(lemmas)

Output

3

The above output shows ‘dog’ has three lemmas.

Getting the name of the first lemma as follows −

lemmas[0].name()
Output:
'dog'

Getting the name of the second lemma as follows −

lemmas[1].name()
Output:
'domestic_dog'

Getting the name of the third lemma as follows −

lemmas[2].name()
Output:
'Canis_familiaris'

Actually, a Synset represents a group of lemmas that all have similar meaning while a lemma represents a distinct word form.
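Building on this, a common recipe is to collect the lemma names from every synset of a word to get a plain list of its synonyms. This is a small sketch; the resulting set mixes lemmas from all senses of ‘dog’, including, for example, the sausage sense.

from nltk.corpus import wordnet as wn

synonyms = set()
for syn in wn.synsets('dog'):
   for lemma in syn.lemmas():
      synonyms.add(lemma.name())
print(synonyms)   # e.g. {'dog', 'domestic_dog', 'Canis_familiaris', 'frank', ...}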

Finding Antonyms

In WordNet, some lemmas also have antonyms. For example, the word ‘good’ has a total of 27 synsets; among them, 5 have lemmas with antonyms. Let us find the antonyms (when the word ‘good’ is used as a noun and when the word ‘good’ is used as an adjective).

Example 1

from nltk.corpus import wordnet as wn
syn1 = wn.synset('good.n.02')
antonym1 = syn1.lemmas()[0].antonyms()[0]
antonym1.name()

Output

'evil'
antonym1.synset().definition()

Output

'the quality of being morally wrong in principle or practice'

The above example shows that the word ‘good’, when used as a noun, has ‘evil’ as its first antonym.

Example 2

from nltk.corpus import wordnet as wn
syn2 = wn.synset('good.a.01')
antonym2 = syn2.lemmas()[0].antonyms()[0]
antonym2.name()

Output

'bad'
antonym2.synset().definition()

Output

'having undesirable or negative qualities'

The above example shows that the word ‘good’, when used as an adjective, has ‘bad’ as its first antonym.

Stemming & Lemmatization

What is Stemming?

Stemming is a technique used to extract the base form of the words by removing affixes from them. It is just like cutting down the branches of a tree to its stems. For example, the stem of the words eating, eats, eaten is eat.

Search engines use stemming for indexing the words. That’s why rather than storing all forms of a word, a search engine can store only the stems. In this way, stemming reduces the size of the index and increases retrieval accuracy.
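A minimal sketch of this idea: several inflected forms collapse to one stem, so a search index needs only a single entry for them (the word list is an arbitrary illustration).

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ['write', 'writes', 'writing']
# All three surface forms map to the same stem 'write'
print({word: stemmer.stem(word) for word in words})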

Various Stemming algorithms

In NLTK, StemmerI is the interface that declares the stem() method, and all the stemmers we are going to cover next implement it. Let us understand it with the following diagram

[Diagram: StemmerI interface and the stemmer classes that implement it]

Porter stemming algorithm

It is one of the most common stemming algorithms which is basically designed to remove and replace well-known suffixes of English words.

PorterStemmer class

NLTK has PorterStemmer class with the help of which we can easily implement Porter Stemmer algorithms for the word we want to stem. This class knows several regular word forms and suffixes with the help of which it can transform the input word to a final stem. The resulting stem is often a shorter word having the same root meaning. Let us see an example −

First, we need to import the natural language toolkit(nltk).

import nltk

Now, import the PorterStemmer class to implement the Porter Stemmer algorithm.

from nltk.stem import PorterStemmer

Next, create an instance of Porter Stemmer class as follows −

word_stemmer = PorterStemmer()

Now, input the word you want to stem.

word_stemmer.stem('writing')

Output

'write'
word_stemmer.stem('eating')

Output

'eat'

Complete implementation example

import nltk
from nltk.stem import PorterStemmer
word_stemmer = PorterStemmer()
word_stemmer.stem('writing')

Output

'write'

Lancaster stemming algorithm

It was developed at Lancaster University and is another very common stemming algorithm.

LancasterStemmer class

NLTK has LancasterStemmer class with the help of which we can easily implement Lancaster Stemmer algorithms for the word we want to stem. Let us see an example −

First, we need to import the natural language toolkit(nltk).

import nltk

Now, import the LancasterStemmer class to implement Lancaster Stemmer algorithm

from nltk.stem import LancasterStemmer

Next, create an instance of LancasterStemmer class as follows −

Lanc_stemmer = LancasterStemmer()

Now, input the word you want to stem.

Lanc_stemmer.stem('eats')

Output

'eat'

Complete implementation example

import nltk
from nltk.stem import LancasterStemmer
Lanc_stemmer = LancasterStemmer()
Lanc_stemmer.stem('eats')

Output

'eat'

Regular Expression stemming algorithm

With the help of this stemming algorithm, we can construct our own stemmer.

RegexpStemmer class

NLTK has RegexpStemmer class with the help of which we can easily implement Regular Expression Stemmer algorithms. It basically takes a single regular expression and removes any prefix or suffix that matches the expression. Let us see an example −

First, we need to import the natural language toolkit(nltk).

import nltk

Now, import the RegexpStemmer class to implement the Regular Expression Stemmer algorithm.

from nltk.stem import RegexpStemmer

Next, create an instance of RegexpStemmer class and provides the suffix or prefix you want to remove from the word as follows −

Reg_stemmer = RegexpStemmer('ing')

Now, input the word you want to stem.

Reg_stemmer.stem('eating')

Output

'eat'
Reg_stemmer.stem('ingeat')

Output

'eat'
Reg_stemmer.stem('eats')

Output

'eats'

Since ‘eats’ does not contain the pattern ‘ing’, it is returned unchanged.

Complete implementation example

import nltk
from nltk.stem import RegexpStemmer
Reg_stemmer = RegexpStemmer('ing')
Reg_stemmer.stem('ingeat')

Output

'eat'

Snowball stemming algorithm

It is another very useful stemming algorithm.

SnowballStemmer class

NLTK has a SnowballStemmer class with the help of which we can easily implement the Snowball Stemmer algorithms. It supports 15 non-English languages. In order to use this stemming class, we need to create an instance with the name of the language we are using and then call the stem() method. Let us see an example −

First, we need to import the natural language toolkit(nltk).

import nltk

Now, import the SnowballStemmer class to implement Snowball Stemmer algorithm

from nltk.stem import SnowballStemmer

Let us see the languages it supports −

SnowballStemmer.languages

Output

(
   'arabic',
   'danish',
   'dutch',
   'english',
   'finnish',
   'french',
   'german',
   'hungarian',
   'italian',
   'norwegian',
   'porter',
   'portuguese',
   'romanian',
   'russian',
   'spanish',
   'swedish'
)

Next, create an instance of SnowballStemmer class with the language you want to use. Here, we are creating the stemmer for ‘French’ language.

French_stemmer = SnowballStemmer('french')

Now, call the stem() method and input the word you want to stem.

French_stemmer.stem('Bonjoura')

Output

'bonjour'

Complete implementation example

import nltk
from nltk.stem import SnowballStemmer
French_stemmer = SnowballStemmer('french')
French_stemmer.stem('Bonjoura')

Output

'bonjour'

What is Lemmatization?

Lemmatization technique is like stemming. The output we will get after lemmatization is called ‘lemma’, which is a root word rather than root stem, the output of stemming. After lemmatization, we will be getting a valid word that means the same thing.

NLTK provides the WordNetLemmatizer class, which is a thin wrapper around the wordnet corpus. This class uses the morphy() function of the WordNet CorpusReader class to find a lemma. Let us understand it with an example −

Example

First, we need to import the natural language toolkit(nltk).

import nltk

Now, import the WordNetLemmatizer class to implement the lemmatization technique.

from nltk.stem import WordNetLemmatizer

Next, create an instance of WordNetLemmatizer class.

lemmatizer = WordNetLemmatizer()

Now, call the lemmatize() method and input the word of which you want to find lemma.

lemmatizer.lemmatize('eating')

Output

'eating'
lemmatizer.lemmatize('books')

Output

'book'
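The reason ‘eating’ came back unchanged is that lemmatize() treats every word as a noun unless a part of speech is passed. The short sketch below shows the pos argument; the example words are arbitrary.

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('eating'))            # 'eating' (treated as a noun by default)
print(lemmatizer.lemmatize('eating', pos='v'))   # 'eat'
print(lemmatizer.lemmatize('better', pos='a'))   # 'good'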

Complete implementation example

import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('books')

Output

'book'

Difference between Stemming & Lemmatization

Let us understand the difference between Stemming and Lemmatization with the help of the following example −

import nltk
from nltk.stem import PorterStemmer
word_stemmer = PorterStemmer()
word_stemmer.stem('believes')

Output

believ
import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('believes')

Output

'belief'

The output of both programs shows the major difference between stemming and lemmatization. The PorterStemmer class simply chops off the ‘es’ from the word, whereas the WordNetLemmatizer class finds a valid word, ‘belief’. In simple words, the stemming technique only looks at the form of the word, while the lemmatization technique looks at the meaning of the word. It means after applying lemmatization, we will always get a valid word.

Natural Language Toolkit - Word Replacement

Stemming and lemmatization can be considered as a kind of linguistic compression. In the same sense, word replacement can be thought of as text normalization or error correction.

But why do we need word replacement? Suppose we talk about tokenization: it has issues with contractions (like can’t, won’t, etc.). So, to handle such issues, we need word replacement. For example, we can replace contractions with their expanded forms.

Word replacement using regular expression

First, we are going to replace words that matches the regular expression. But for this we must have a basic understanding of regular expressions as well as python re module. In the example below, we will be replacing contraction with their expanded forms (e.g. “can’t” will be replaced with “cannot”), all that by using regular expressions.

Example

First, import the necessary package re to work with regular expressions.

import re
from nltk.corpus import wordnet

Next, define the replacement patterns of your choice as follows −

R_patterns = [
   (r'won\'t', 'will not'),
   (r'can\'t', 'cannot'),
   (r'i\'m', 'i am'),
   (r'(\w+)\'ll', r'\g<1> will'),
   (r'(\w+)n\'t', r'\g<1> not'),
   (r'(\w+)\'ve', r'\g<1> have'),
   (r'(\w+)\'s', r'\g<1> is'),
   (r'(\w+)\'re', r'\g<1> are'),
]

Now, create a class that can be used for replacing words −

class REReplacer(object):
   def __init__(self, patterns = R_patterns):
      self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]
   def replace(self, text):
      s = text
      for (pattern, repl) in self.patterns:
         s = re.sub(pattern, repl, s)
      return s

Save this python program (say replacerRE.py) and run it from the python command prompt. After running it, import the REReplacer class when you want to replace words. Let us see how.

from replacerRE import REReplacer
rep_word = REReplacer()
rep_word.replace("I won't do it")
Output:
'I will not do it'
rep_word.replace("I can't do it")
Output:
'I cannot do it'

Complete implementation example

import re
from nltk.corpus import wordnet
R_patterns = [
   (r'won\'t', 'will not'),
   (r'can\'t', 'cannot'),
   (r'i\'m', 'i am'),
   (r'(\w+)\'ll', r'\g<1> will'),
   (r'(\w+)n\'t', r'\g<1> not'),
   (r'(\w+)\'ve', r'\g<1> have'),
   (r'(\w+)\'s', r'\g<1> is'),
   (r'(\w+)\'re', r'\g<1> are'),
]
class REReplacer(object):
   def __init__(self, patterns = R_patterns):
      self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]
   def replace(self, text):
      s = text
      for (pattern, repl) in self.patterns:
         s = re.sub(pattern, repl, s)
      return s

Now once you saved the above program and run it, you can import the class and use it as follows −

from replacerRE import REReplacer
rep_word = REReplacer()
rep_word.replace("I won't do it")

Output

'I will not do it'

Replacement before text processing

One of the common practices while working with natural language processing (NLP) is to clean up the text before processing it. In this concern, we can also use the REReplacer class created in the previous example as a preliminary step before text processing, i.e. tokenization.

Example

from nltk.tokenize import word_tokenize
from replacerRE import REReplacer
rep_word = REReplacer()
word_tokenize("I won't be able to do this now")
Output:
['I', 'wo', "n't", 'be', 'able', 'to', 'do', 'this', 'now']
word_tokenize(rep_word.replace("I won't be able to do this now"))
Output:
['I', 'will', 'not', 'be', 'able', 'to', 'do', 'this', 'now']

In the above Python recipe, we can easily understand the difference between the output of word tokenizer without and with using regular expression replace.

Removal of repeating characters

Are we strictly grammatical in our everyday language? No, we are not. For example, sometimes we write ‘Hiiiiiiiiiiii Mohan’ in order to emphasize the word ‘Hi’. But the computer system does not know that ‘Hiiiiiiiiiiii’ is a variation of the word “Hi”. In the example below, we will be creating a class named Rep_word_removal which can be used for removing the repeating characters.

Example

First, import the necessary package re to work with regular expressions

import re
from nltk.corpus import wordnet

Now, create a class that can be used for removing the repeating characters −

class Rep_word_removal(object):
   def __init__(self):
      self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
      self.repl = r'\1\2\3'
   def replace(self, word):
      if wordnet.synsets(word):
         return word
      repl_word = self.repeat_regexp.sub(self.repl, word)
      if repl_word != word:
         return self.replace(repl_word)
      else:
         return repl_word

Save this python program (say removalrepeat.py) and run it from the python command prompt. After running it, import the Rep_word_removal class when you want to remove the repeating characters. Let us see how.

from removalrepeat import Rep_word_removal
rep_word = Rep_word_removal()
rep_word.replace ("Hiiiiiiiiiiiiiiiiiiiii")
Output:
'Hi'
rep_word.replace("Hellooooooooooooooo")
Output:
'Hello'

Complete implementation example

import re
from nltk.corpus import wordnet
class Rep_word_removal(object):
   def __init__(self):
      self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
      self.repl = r'\1\2\3'
   def replace(self, word):
      if wordnet.synsets(word):
         return word
      replace_word = self.repeat_regexp.sub(self.repl, word)
      if replace_word != word:
         return self.replace(replace_word)
      else:
         return replace_word

Now once you saved the above program and run it, you can import the class and use it as follows −

from removalrepeat import Rep_word_removal
rep_word = Rep_word_removal()
rep_word.replace ("Hiiiiiiiiiiiiiiiiiiiii")

Output

'Hi'

Synonym & Antonym Replacement

Replacing words with common synonyms

While working with NLP, especially in the case of frequency analysis and text indexing, it is always beneficial to compress the vocabulary without losing meaning because it saves lots of memory. To achieve this, we must have to define mapping of a word to its synonyms. In the example below, we will be creating a class named word_syn_replacer which can be used for replacing the words with their common synonyms.

Example

First, import the necessary package re to work with regular expressions.

import re
from nltk.corpus import wordnet

Next, create the class that takes a word replacement mapping −

class word_syn_replacer(object):
   def __init__(self, word_map):
      self.word_map = word_map
   def replace(self, word):
      return self.word_map.get(word, word)

Save this python program (say replacesyn.py) and run it from python command prompt. After running it, import word_syn_replacer class when you want to replace words with common synonyms. Let us see how.

from replacesyn import word_syn_replacer
rep_syn = word_syn_replacer({'bday': 'birthday'})
rep_syn.replace('bday')

Output

'birthday'

Complete implementation example

import re
from nltk.corpus import wordnet
class word_syn_replacer(object):
   def __init__(self, word_map):
      self.word_map = word_map
   def replace(self, word):
      return self.word_map.get(word, word)

Now once you saved the above program and run it, you can import the class and use it as follows −

from replacesyn import word_syn_replacer
rep_syn = word_syn_replacer({'bday': 'birthday'})
rep_syn.replace('bday')

Output

'birthday'

The disadvantage of the above method is that we should have to hardcode the synonyms in a Python dictionary. We have two better alternatives in the form of CSV and YAML file. We can save our synonym vocabulary in any of the above-mentioned files and can construct word_map dictionary from them. Let us understand the concept with the help of examples.

Using CSV file

In order to use a CSV file for this purpose, the file should have two columns: the first column consists of the word and the second column consists of the synonym meant to replace it. Let us save this file as syn.csv. In the example below, we will be creating a class named CSVword_syn_replacer which extends word_syn_replacer in the replacesyn.py file and will be used to construct the word_map dictionary from the syn.csv file.

Example

First, import the necessary packages.

import csv

Next, create the class that takes a word replacement mapping −

class CSVword_syn_replacer(word_syn_replacer):
   def __init__(self, fname):
      word_map = {}
      for line in csv.reader(open(fname)):
         word, syn = line
         word_map[word] = syn
      super(CSVword_syn_replacer, self).__init__(word_map)

After running it, import the CSVword_syn_replacer class when you want to replace words with common synonyms. Let us see how.

from replacesyn import CSVword_syn_replacer
rep_syn = CSVword_syn_replacer('syn.csv')
rep_syn.replace('bday')

Output

'birthday'

Complete implementation example

import csv
from replacesyn import word_syn_replacer   # the class saved earlier in replacesyn.py
class CSVword_syn_replacer(word_syn_replacer):
   def __init__(self, fname):
      word_map = {}
      for line in csv.reader(open(fname)):
         word, syn = line
         word_map[word] = syn
      super(CSVword_syn_replacer, self).__init__(word_map)

Now once you saved the above program and run it, you can import the class and use it as follows −

from replacesyn import CSVword_syn_replacer
rep_syn = CSVword_syn_replacer('syn.csv')
rep_syn.replace('bday')

Output

'birthday'

Using YAML file

As we have used a CSV file, we can also use a YAML file for this purpose (we must have PyYAML installed). Let us save the file as syn.yaml. In the example below, we will be creating a class named YAMLword_syn_replacer which extends word_syn_replacer in the replacesyn.py file and will be used to construct the word_map dictionary from the syn.yaml file.

Example

First, import the necessary packages.

import yaml

Next, create the class that takes a word replacement mapping −

class YAMLword_syn_replacer(word_syn_replacer):
   def __init__(self, fname):
      word_map = yaml.safe_load(open(fname))
      super(YAMLword_syn_replacer, self).__init__(word_map)

After running it, import the YAMLword_syn_replacer class when you want to replace words with common synonyms. Let us see how.

from replacesyn import YAMLword_syn_replacer
rep_syn = YAMLword_syn_replacer('syn.yaml')
rep_syn.replace('bday')

Output

'birthday'

Complete implementation example

import yaml
from replacesyn import word_syn_replacer   # the class saved earlier in replacesyn.py
class YAMLword_syn_replacer(word_syn_replacer):
   def __init__(self, fname):
      word_map = yaml.safe_load(open(fname))
      super(YAMLword_syn_replacer, self).__init__(word_map)

Now once you saved the above program and run it, you can import the class and use it as follows −

from replacesyn import YAMLword_syn_replacer
rep_syn = YAMLword_syn_replacer('syn.yaml')
rep_syn.replace('bday')

Output

'birthday'

Antonym replacement

As we know, an antonym is a word having the opposite meaning of another word, and the opposite of synonym replacement is called antonym replacement. In this section, we will be dealing with antonym replacement, i.e., replacing words with unambiguous antonyms by using WordNet. In the example below, we will be creating a class named word_antonym_replacer which has two methods, one for replacing the word and the other for removing the negations.

Example

First, import the necessary packages.

from nltk.corpus import wordnet

Next, create the class named word_antonym_replacer

class word_antonym_replacer(object):
   def replace(self, word, pos=None):
      antonyms = set()
      for syn in wordnet.synsets(word, pos=pos):
         for lemma in syn.lemmas():
            for antonym in lemma.antonyms():
               antonyms.add(antonym.name())
      if len(antonyms) == 1:
         return antonyms.pop()
      else:
         return None
   def replace_negations(self, sent):
      i, l = 0, len(sent)
      words = []
      while i < l:
         word = sent[i]
         if word == 'not' and i+1 < l:
            ant = self.replace(sent[i+1])
            if ant:
               words.append(ant)
               i += 2
               continue
         words.append(word)
         i += 1
      return words

保存这个 Python 程序(例如 replaceantonym.py)并从 Python 命令提示符运行它。运行它后,当您想用明确的反义词替换单词时,导入 word_antonym_replacer 类。让我们来看看怎么做。

Save this Python program (say replaceantonym.py) and run it from the Python command prompt. After running it, import the word_antonym_replacer class when you want to replace words with their unambiguous antonyms. Let us see how.

from replaceantonym import word_antonym_replacer
rep_antonym = word_antonym_replacer()
rep_antonym.replace('uglify')

Output

'beautify'
sentence = ["Let us", 'not', 'uglify', 'our', 'country']
rep_antonym.replace_negations(sentence)

Output

["Let us", 'beautify', 'our', 'country']

Complete implementation example

from nltk.corpus import wordnet

class word_antonym_replacer(object):
   def replace(self, word, pos=None):
      antonyms = set()
      for syn in wordnet.synsets(word, pos=pos):
         for lemma in syn.lemmas():
            for antonym in lemma.antonyms():
               antonyms.add(antonym.name())
      if len(antonyms) == 1:
         return antonyms.pop()
      else:
         return None

   def replace_negations(self, sent):
      i, l = 0, len(sent)
      words = []
      while i < l:
         word = sent[i]
         if word == 'not' and i+1 < l:
            ant = self.replace(sent[i+1])
            if ant:
               words.append(ant)
               i += 2
               continue
         words.append(word)
         i += 1
      return words

现在,一旦您保存了上述程序并运行它,您就可以导入该类并按如下方式使用它 −

Now once you saved the above program and run it, you can import the class and use it as follows −

from replaceantonym import word_antonym_replacer
rep_antonym = word_antonym_replacer()
rep_antonym.replace('uglify')
sentence = ["Let us", 'not', 'uglify', 'our', 'country']
rep_antonym.replace_negations(sentence)

Output

["Let us", 'beautify', 'our', 'country']

Corpus Readers and Custom Corpora

What is a corpus?

语料库是由以自然交流环境产生的并以机器可读文本形式存在的结构化文档组成的庞大集合。Corpora 一词是语料库的复数形式。语料库可以通过以下多种方式获取 −

A corpus is a large, structured collection of machine-readable texts that have been produced in a natural communicative setting. The word corpora is the plural of corpus. A corpus can be derived in many ways, as follows −

  1. From the text that was originally electronic

  2. From the transcripts of spoken language

  3. From optical character recognition and so on

语料库代表性、语料库平衡、抽样、语料库大小是在设计语料库的过程中起重要作用的元素。一些最常用于 NLP 任务的语料库是 TreeBank、PropBank、VerbNet 和 WordNet。

Corpus representativeness, corpus balance, sampling and corpus size are the elements that play an important role while designing a corpus. Some of the most popular corpora for NLP tasks are TreeBank, PropBank, VerbNet and WordNet.
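
As a quick illustrative sketch (not part of the original recipe), and assuming the corresponding corpus packages have already been downloaded with nltk.download(), two of these corpora can be accessed through NLTK as follows −

from nltk.corpus import treebank, wordnet

# First few (word, tag) pairs of the first sentence in the Penn Treebank sample
print(treebank.tagged_sents()[0][:5])

# Definition of the first WordNet synset for the word 'corpus'
print(wordnet.synsets('corpus')[0].definition())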

How to build custom corpus?

在下载 NLTK 时,我们还安装了 NLTK 数据包。因此,我们的计算机上已经安装了 NLTK 数据包。如果我们谈论 Windows,我们将假定此数据包已安装在 C:\nltk_data 处,如果我们谈论 Linux、Unix 和 Mac OS X,我们将假定此数据包已安装在 /usr/share/nltk_data 处。

While downloading NLTK, we also installed the NLTK data package. So, we already have the NLTK data package installed on our computer. If we talk about Windows, we'll assume that this data package is installed at C:\nltk_data, and if we talk about Linux, Unix and Mac OS X, we'll assume that this data package is installed at /usr/share/nltk_data.

在以下 Python 配方中,我们将创建自定义语料库,它必须位于 NLTK 定义的其中一条路径中,这样才能被 NLTK 找到。为了避免与官方 NLTK 数据包发生冲突,让我们在主目录中创建一个自定义 nltk_data 目录。

In the following Python recipe, we are going to create a custom corpus which must be within one of the paths defined by NLTK, so that it can be found by NLTK. In order to avoid conflict with the official NLTK data package, let us create a custom nltk_data directory in our home directory.

import os, os.path
path = os.path.expanduser('~/nltk_data')
if not os.path.exists(path):
   os.mkdir(path)
os.path.exists(path)

Output

True

现在,让我们检查这个 nltk_data 目录是否位于 NLTK 的数据搜索路径中 −

Now, let us check whether this nltk_data directory is in NLTK's data search path −

import nltk.data
path in nltk.data.path

Output

True

由于我们得到了输出 True,这意味着主目录中的 nltk_data 目录位于 NLTK 的数据搜索路径中。

As we got the output True, it means the nltk_data directory in our home directory is on NLTK's data path.

现在,我们将创建一个名为 wordfile.txt 的单词表文件,并将其放入名为 corpus 的目录中 nltk_data 目录 (~/nltk_data/corpus/wordfile.txt) ,并将通过使用 nltk.data.load 加载它

Now we will make a wordlist file named wordfile.txt, put it in a folder named corpus inside the nltk_data directory (~/nltk_data/corpus/wordfile.txt), and load it by using nltk.data.load −

import nltk.data
nltk.data.load('corpus/wordfile.txt', format = 'raw')

Output

b’tutorialspoint\n’

Corpus readers

NLTK 提供各种 CorpusReader 类。我们将在以下 python 配方中介绍它们

NLTK provides various CorpusReader classes. We are going to cover them in the following Python recipes −

Creating wordlist corpus

NLTK 具有 WordListCorpusReader 类,它提供对包含单词列表的文件的访问。对于以下 Python 配方,我们需要创建一个单词表文件,该文件可以是 CSV 或普通文本文件。例如,我们创建了一个名为“列表”的文件,其中包含以下数据:

NLTK has WordListCorpusReader class that provides access to the file containing a list of words. For the following Python recipe, we need to create a wordlist file which can be CSV or normal text file. For example, we have created a file named ‘list’ that contains the following data −

tutorialspoint
Online
Free
Tutorials

现在让我们实例化一个 WordListCorpusReader 类,该类从我们创建的文件 ‘list’ 中生成单词列表。

Now Let us instantiate a WordListCorpusReader class producing the list of words from our created file ‘list’.

from nltk.corpus.reader import WordListCorpusReader
reader_corpus = WordListCorpusReader('.', ['list'])
reader_corpus.words()

Output

['tutorialspoint', 'Online', 'Free', 'Tutorials']

Creating POS tagged word corpus

NLTK 具有 TaggedCorpusReader 类,借助它,我们可以创建一个 POS 标记词语料库。实际上,POS 标记是识别单词的词性标记的过程。

NLTK has TaggedCorpusReader class with the help of which we can create a POS tagged word corpus. Actually, POS tagging is the process of identifying the part-of-speech tag for a word.

标记语料库最简单的格式之一是“单词/标记”形式,如下所示,摘自布朗语料库

One of the simplest formats for a tagged corpus is the ‘word/tag’ form, like the following excerpt from the Brown corpus −

The/at-tl expense/nn and/cc time/nn involved/vbn are/ber
astronomical/jj ./.

在上面的摘录中,每个单词都有一个标记,表示其 POS。例如, vb 表示动词。

In the above excerpt, each word has a tag which denotes its POS. For example, vb refers to a verb.

现在让我们实例化一个 TaggedCorpusReader 类,从包含上述摘录的文件 ‘list.pos’ 中生成 POS 标记的单词。

Now let us instantiate a TaggedCorpusReader class producing POS tagged words from the file ‘list.pos’, which has the above excerpt.

from nltk.corpus.reader import TaggedCorpusReader
reader_corpus = TaggedCorpusReader('.', r'.*\.pos')
reader_corpus.tagged_words()

Output

[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ...]

Creating Chunked phrase corpus

NLTK 具有 ChunkedCorpusReader 类,借助它,我们可以创建一个分块短语语料库。实际上,块是句子中的短语。

NLTK has ChunkedCorpusReader class with the help of which we can create a chunked phrase corpus. Actually, a chunk is a short phrase in a sentence.

例如,我们有以下摘录,摘自标记 treebank 语料库

For example, we have the following excerpt from the tagged treebank corpus −

[Earlier/JJR staff-reduction/NN moves/NNS] have/VBP trimmed/VBN about/
IN [300/CD jobs/NNS] ,/, [the/DT spokesman/NN] said/VBD ./.

在上面的摘录中,每个块都是一个名词短语,但不在括号中的单词是句子树的一部分,而不是任何名词短语子树的一部分。

In the above excerpt, every chunk is a noun phrase but the words that are not in brackets are part of the sentence tree and not part of any noun phrase subtree.

现在让我们实例化一个 ChunkedCorpusReader 类,该类从文件 ‘list.chunk’ 中生成分块短语,其中包含上述摘录。

Now Let us instantiate a ChunkedCorpusReader class producing chunked phrase from the file ‘list.chunk’, which has the above excerpt.

from nltk.corpus.reader import ChunkedCorpusReader
reader_corpus = ChunkedCorpusReader('.', r'.*\.chunk')
reader_corpus.chunked_words()

Output

[
   Tree('NP', [('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS')]),
   ('have', 'VBP'), ...
]

Creating Categorized text corpus

NLTK 具有 CategorizedPlaintextCorpusReader 类,借助它,我们可以创建一个分类文本语料库。当我们拥有大量的文本语料库并希望将其分类到不同的部分时,这非常有用。

NLTK has CategorizedPlaintextCorpusReader class with the help of which we can create a categorized text corpus. It is very useful in case when we have a large corpus of text and want to categorize that into separate sections.

例如,布朗语料库有几个不同的类别。让我们借助以下 Python 代码找出它们

For example, the brown corpus has several different categories. Let us find out them with the help of following Python code −

from nltk.corpus import brown
brown.categories()

Output

[
   'adventure', 'belles_lettres', 'editorial', 'fiction', 'government',
   'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion',
   'reviews', 'romance', 'science_fiction'
]

对语料库进行分类的最简单方法是针对每个类别创建一个文件。例如,让我们看看 movie_reviews 语料库中的两个摘录

One of the easiest ways to categorize a corpus is to have one file for every category. For example, let us see the two excerpts from the movie_reviews corpus −

movie_pos.txt

细红线有缺陷,但它会激起反应。

The thin red line is flawed but it provokes.

movie_neg.txt

高成本且制作精良的制作无法弥补其电视剧中普遍缺乏的自发性。

A big-budget and glossy production cannot make up for a lack of spontaneity that permeates their tv show.

因此,从上述两个文件中,我们有两个类别,即 posneg

So, from above two files, we have two categories namely pos and neg.

现在,让我们实例化一个 CategorizedPlaintextCorpusReader 类。

Now let us instantiate a CategorizedPlaintextCorpusReader class.

from nltk.corpus.reader import CategorizedPlaintextCorpusReader
reader_corpus = CategorizedPlaintextCorpusReader('.', r'movie_.*\.txt',
cat_pattern = r'movie_(\w+)\.txt')
reader_corpus.categories()
reader_corpus.fileids(categories = [‘neg’])
reader_corpus.fileids(categories = [‘pos’])

Output

['neg', 'pos']
['movie_neg.txt']
['movie_pos.txt']

Basics of Part-of-Speech (POS) Tagging

What is POS tagging?

标记是一种分类,是对标记的描述的自动分配。我们称描述符为“标记”,它表示词性(名词、动词、副词、形容词、代词、连词及其子类别)、语义信息等之一。

Tagging, a kind of classification, is the automatic assignment of descriptors to tokens. We call the descriptor a ‘tag’, which may represent one of the parts of speech (noun, verb, adverb, adjective, pronoun, conjunction and their sub-categories), semantic information and so on.

另一方面,如果我们谈论词性 (POS) 标记,可以将其定义为将一个句子转换为单词列表,再转换为元组列表的过程。此处,元组的形式为 (word, tag)。我们还可以称 POS 标记为将词性之一分配给给定单词的过程。

On the other hand, if we talk about Part-of-Speech (POS) tagging, it may be defined as the process of converting a sentence in the form of a list of words, into a list of tuples. Here, the tuples are in the form of (word, tag). We can also call POS tagging a process of assigning one of the parts of speech to the given word.

下表列出了宾夕法尼亚树库语料库中最常用的一些 POS 标记 −

The following table lists some of the most frequently used POS tags in the Penn Treebank corpus −

Sr.No   Tag     Description
1       NNP     Proper noun, singular
2       NNPS    Proper noun, plural
3       PDT     Predeterminer
4       POS     Possessive ending
5       PRP     Personal pronoun
6       PRP$    Possessive pronoun
7       RB      Adverb
8       RBR     Adverb, comparative
9       RBS     Adverb, superlative
10      RP      Particle
11      SYM     Symbol (mathematical or scientific)
12      TO      to
13      UH      Interjection
14      VB      Verb, base form
15      VBD     Verb, past tense
16      VBG     Verb, gerund/present participle
17      VBN     Verb, past participle
18      WP      Wh-pronoun
19      WP$     Possessive wh-pronoun
20      WRB     Wh-adverb
21      #       Pound sign
22      $       Dollar sign
23      .       Sentence-final punctuation
24      ,       Comma
25      :       Colon, semi-colon
26      (       Left bracket character
27      )       Right bracket character
28      "       Straight double quote
29      `       Left open single quote
30      ``      Left open double quote
31      '       Right close single quote
32      ''      Right close double quote

Example

让我们通过一个 Python 实验来理解它 -

Let us understand it with a Python experiment −

import nltk
from nltk import word_tokenize
sentence = "I am going to school"
print (nltk.pos_tag(word_tokenize(sentence)))

Output

[('I', 'PRP'), ('am', 'VBP'), ('going', 'VBG'), ('to', 'TO'), ('school', 'NN')]

Why POS tagging?

词性标注是自然语言处理中的重要组成部分,因为它可作为进一步的自然语言处理分析的前提,如下所示 −

POS tagging is an important part of NLP because it works as the prerequisite for further NLP analysis as follows −

  1. Chunking

  2. Syntax Parsing

  3. Information extraction

  4. Machine Translation

  5. Sentiment Analysis

  6. Grammar analysis & word-sense disambiguation

TaggerI - Base class

所有标记器均位于 NLTK 的 nltk.tag 包中。这些标记器的基础类为 TaggerI ,这意味着所有标记器都从该类继承。

All the taggers reside in NLTK’s nltk.tag package. The base class of these taggers is TaggerI, which means all the taggers inherit from this class.

Methods − TaggerI 类具有以下两个方法,必须由其所有子类实现 −

Methods − The TaggerI class has the following two methods which must be implemented by all its subclasses −

  1. tag() method − As the name implies, this method takes a list of words as input and returns a list of tagged words as output.

  2. evaluate() method − With the help of this method, we can evaluate the accuracy of the tagger.

taggeri

The Baseline of POS Tagging

词性标注的基本步骤或基准是 Default Tagging ,可以使用 NLTK 的 DefaultTagger 类执行它。默认标注会为每个标记分配相同的词性标记。默认标注还提供了衡量准确性提升的基准。

The baseline or the basic step of POS tagging is Default Tagging, which can be performed using the DefaultTagger class of NLTK. Default tagging simply assigns the same POS tag to every token. Default tagging also provides a baseline to measure accuracy improvements.

DefaultTagger class

通过使用 DefaultTagger 类执行默认标注,该类接受单个参数,即我们想要应用的标记。

Default tagging is performed by using the DefaultTagger class, which takes a single argument, i.e., the tag we want to apply.

How does it work?

如前所述,所有标记器都继承自 TaggerI 类。 DefaultTaggerSequentialBackoffTagger 继承,而后者是 TaggerI class 的子类。让我们使用以下图表理解它 −

As told earlier, all the taggers are inherited from TaggerI class. The DefaultTagger is inherited from SequentialBackoffTagger which is a subclass of TaggerI class. Let us understand it with the following diagram −

taggeri class

作为 SequentialBackoffTagger 的一部分, DefaultTagger 必须实现 choose_tag() 方法,该方法接受以下三个参数。

Being a part of SequentialBackoffTagger, the DefaultTagger must implement the choose_tag() method, which takes the following three arguments (a small illustrative sketch follows the list).

  1. Token’s list

  2. Current token’s index

  3. Previous token’s list, i.e., the history
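
For illustration only, here is a minimal sketch (not from the tutorial) of a custom SequentialBackoffTagger subclass implementing choose_tag(). The class name NumberTagger is hypothetical; it tags purely numeric tokens as 'CD' and returns None for everything else, so those tokens can fall through to a backoff tagger if one is given −

from nltk.tag import SequentialBackoffTagger

class NumberTagger(SequentialBackoffTagger):
   # Illustrative sketch: tag digit-only tokens as 'CD', defer the rest to any backoff tagger
   def choose_tag(self, tokens, index, history):
      word = tokens[index]
      return 'CD' if word.isdigit() else None

exp_number_tagger = NumberTagger()
print(exp_number_tagger.tag(['The', 'year', '2020']))
# [('The', None), ('year', None), ('2020', 'CD')]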

Example

import nltk
from nltk.tag import DefaultTagger
exptagger = DefaultTagger('NN')
exptagger.tag(['Tutorials','Point'])

Output

[('Tutorials', 'NN'), ('Point', 'NN')]

在此示例中,我们选择了名词标记,因为它是单词中最常见的类型。此外, DefaultTagger 在选择最常见的词性标记时也最有效。

In this example, we chose the noun tag because nouns are the most common type of word. Moreover, DefaultTagger is most useful when we choose the most common POS tag.

Accuracy evaluation

DefaultTagger 也是评估标记器准确性的基准。这就是我们可以将它与 evaluate() 方法结合使用以衡量准确性的原因。 evaluate() 方法将标记标记的列表作为黄金标准,以便评估标记器。

The DefaultTagger is also the baseline for evaluating accuracy of taggers. That is the reason we can use it along with evaluate() method for measuring accuracy. The evaluate() method takes a list of tagged tokens as a gold standard to evaluate the tagger.

以下是一个示例,我们在其中使用了我们创建的默认标记器 exptagger (以上创建),以评估 treebank 语料库标注语句的子集的准确性 −

Following is an example in which we used our default tagger, named exptagger, created above, to evaluate the accuracy of a subset of treebank corpus tagged sentences −

Example

import nltk
from nltk.tag import DefaultTagger
exptagger = DefaultTagger('NN')
from nltk.corpus import treebank
testsentences = treebank.tagged_sents()[1000:]
exptagger.evaluate(testsentences)

Output

0.13198749536374715

以上输出表明,在为每个标记选择 NN 的情况下,我们可以在 treebank 语料库的 1000 个条目上执行准确度测试,实现大约 13% 的准确度。

The output above shows that, by choosing NN for every token, we achieve around 13% accuracy when testing on the tagged sentences of the treebank corpus from index 1000 onwards.

Tagging a list of sentences

除了标记单个语句外,NLTK 的 TaggerI 类还为我们提供 tag_sents() 方法,我们可借助该方法标记一系列语句。以下是我们标记两个简单语句的示例

In addition to tagging a single sentence, NLTK’s TaggerI class also provides us a tag_sents() method with the help of which we can tag a list of sentences. Following is an example in which we tag two simple sentences −

Example

import nltk
from nltk.tag import DefaultTagger
exptagger = DefaultTagger('NN')
exptagger.tag_sents([['Hi', ','], ['How', 'are', 'you', '?']])

Output

[
   [
      ('Hi', 'NN'),
      (',', 'NN')
   ],
   [
      ('How', 'NN'),
      ('are', 'NN'),
      ('you', 'NN'),
      ('?', 'NN')
   ]
]

在上例中,我们使用了我们先前创建的、名为 exptagger 的默认标记器。

In the above example, we used our earlier created default tagger named exptagger.

Un-tagging a sentence

我们还可以取消对语句的标记。NLTK 提供 nltk.tag.untag() 方法用于此目的。它将接受一个标记的语句作为输入,并提供一个不带标记的单词列表。让我们看一个示例 −

We can also un-tag a sentence. NLTK provides nltk.tag.untag() method for this purpose. It will take a tagged sentence as input and provides a list of words without tags. Let us see an example −

Example

import nltk
from nltk.tag import untag
untag([('Tutorials', 'NN'), ('Point', 'NN')])

Output

['Tutorials', 'Point']

Natural Language Toolkit - Unigram Tagger

What is Unigram Tagger?

顾名思义,一元标记器是一种只使用单个单词作为上下文来确定 POS(词性)标记的标记器。用简单的话来说,一元标记器是一种基于上下文的标记器,其上下文是一个单词,即一元。

As the name implies, unigram tagger is a tagger that only uses a single word as its context for determining the POS(Part-of-Speech) tag. In simple words, Unigram Tagger is a context-based tagger whose context is a single word, i.e., Unigram.

How does it work?

NLTK 提供了一个名为 UnigramTagger 的模块来实现此目的。但在深入了解其工作原理之前,让我们借助以下图表了解层次结构 −

NLTK provides a module named UnigramTagger for this purpose. But before getting deep dive into its working, let us understand the hierarchy with the help of following diagram −

unigram tagger

从上图可以看出, UnigramTagger 继承自 NgramTagger ,后者是 ContextTagger 的子类,它从 SequentialBackoffTagger 继承。

From the above diagram, it is understood that UnigramTagger is inherited from NgramTagger which is a subclass of ContextTagger, which inherits from SequentialBackoffTagger.

UnigramTagger 的工作原理通过以下步骤进行解释 −

The working of UnigramTagger is explained with the help of the following steps (a small illustrative sketch follows the list) −

  1. As we have seen, UnigramTagger inherits from ContextTagger, it implements a context() method. This context() method takes the same three arguments as choose_tag() method.

  2. The result of context() method will be the word token which is further used to create the model. Once the model is created, the word token is also used to look up the best tag.

  3. In this way, UnigramTagger will build a context model from the list of tagged sentences.
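
Before training on a full corpus, this context-model idea can be seen with a tiny, hand-made training set (an illustrative sketch, not part of the original recipe). Words seen during training are tagged by simple lookup, while unseen words get None −

from nltk.tag import UnigramTagger

toy_train_sentences = [[('the', 'DT'), ('cat', 'NN'), ('sleeps', 'VBZ')]]
toy_tagger = UnigramTagger(toy_train_sentences)
print(toy_tagger.tag(['the', 'cat', 'runs']))
# [('the', 'DT'), ('cat', 'NN'), ('runs', None)]  -- 'runs' was never seen in training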

Training a Unigram Tagger

NLTK 的 UnigramTagger 可以在初始化时通过提供标记句子的列表来进行训练。在下面的示例中,我们将使用树库语料库的标记句子。我们将使用该语料库中的前 2500 个句子。

NLTK’s UnigramTagger can be trained by providing a list of tagged sentences at the time of initialization. In the example below, we are going to use the tagged sentences of the treebank corpus. We will be using first 2500 sentences from that corpus.

Example

首先从 NLTK 导入 UnigramTagger 模块 −

First import the UnigramTagger module from nltk −

from nltk.tag import UnigramTagger

接下来,导入想要使用的语料库。这里我们使用 treebank 语料库 −

Next, import the corpus you want to use. Here we are using treebank corpus −

from nltk.corpus import treebank

现在,获取句子以用于训练。我们使用前 2500 个句子以用于训练目的并且将会标记它们 −

Now, take the sentences for training purpose. We are taking first 2500 sentences for training purpose and will tag them −

train_sentences = treebank.tagged_sents()[:2500]

接下来,将 UnigramTagger 应用到用于训练目的的句子上 −

Next, apply UnigramTagger on the sentences used for training purpose −

Uni_tagger = UnigramTagger(train_sentences)

接下来,取一些句子用于测试。这里我们取索引 1500 之后的带标记句子用于测试 −

Next, take some sentences for testing purposes. Here we are taking the tagged sentences from index 1500 onwards −

test_sentences = treebank.tagged_sents()[1500:]
Uni_tagger.evaluate(test_sentences)

Output

0.8942306156033808

这里,一个标记器使用单个单词查找来确定词性标记,我们获得了大约 89% 的精确度。

Here, we got around 89 percent accuracy for a tagger that uses single word lookup to determine the POS tag.

Complete implementation example

from nltk.tag import UnigramTagger
from nltk.corpus import treebank
train_sentences = treebank.tagged_sents()[:2500]
Uni_tagger = UnigramTagger(train_sentences)
test_sentences = treebank.tagged_sents()[1500:]
Uni_tagger.evaluate(test_sentences)

Output

0.8942306156033808

Overriding the context model

从上面显示 UnigramTagger 层级的图表中,我们知道所有从 ContextTagger 继承的标记器,都可以使用预构建的模型,而不是训练自己的模型。该预构建的模型只是一组将上下文键映射到标记的 Python 字典。对于 UnigramTagger ,上下文键是个别单词,而对于其他 NgramTagger 子类,它将是元组。

From the above diagram showing hierarchy for UnigramTagger, we know all the taggers that inherit from ContextTagger, instead of training their own, can take a pre-built model. This pre-built model is simply a Python dictionary mapping of a context key to a tag. And for UnigramTagger, context keys are individual words while for other NgramTagger subclasses, it will be tuples.

我们可以将另一个简单模型传递到 UnigramTagger 类中,而不是传递训练集,来重写此上下文模型。让我们借助下面的一个简单示例来理解它 −

We can override this context model by passing another simple model to the UnigramTagger class instead of passing training set. Let us understand it with the help of an easy example below −

Example

from nltk.tag import UnigramTagger
from nltk.corpus import treebank
Override_tagger = UnigramTagger(model = {'Vinken': 'NN'})
Override_tagger.tag(treebank.sents()[0])

Output

[
   ('Pierre', None),
   ('Vinken', 'NN'),
   (',', None),
   ('61', None),
   ('years', None),
   ('old', None),
   (',', None),
   ('will', None),
   ('join', None),
   ('the', None),
   ('board', None),
   ('as', None),
   ('a', None),
   ('nonexecutive', None),
   ('director', None),
   ('Nov.', None),
   ('29', None),
   ('.', None)
]

因为我们的模型包含“Vinken”作为唯一的上下文键,所以你可以从上面的输出中观察到只有此单词获得了标记且其他每个单词都有 None 作为标记。

As our model contains ‘Vinken’ as the only context key, you can observe from the output above that only this word got tag and every other word has None as a tag.

Setting a minimum frequency threshold

为了决定哪个标记最可能用于给定上下文, ContextTagger 类使用出现频率。即使上下文单词和标记仅出现一次,它默认也会这样做,但我们可以通过将 cutoff 值传递给 UnigramTagger 类来设置最小频率阈值。在下面的示例中,我们为前面配方中训练的 UnigramTagger 传递了一个截止值 −

For deciding which tag is most likely for a given context, the ContextTagger class uses the frequency of occurrence. It will do so by default even if the context word and tag occur only once, but we can set a minimum frequency threshold by passing a cutoff value to the UnigramTagger class. In the example below, we pass a cutoff value to the UnigramTagger that we trained in the previous recipe −

Example

from nltk.tag import UnigramTagger
from nltk.corpus import treebank
train_sentences = treebank.tagged_sents()[:2500]
Uni_tagger = UnigramTagger(train_sentences, cutoff = 4)
test_sentences = treebank.tagged_sents()[1500:]
Uni_tagger.evaluate(test_sentences)

Output

0.7357651629613641

Natural Language Toolkit - Combining Taggers

Combining Taggers

组合标记器或将标记器相互链接是 NLTK 的一大重要特性。组合标记器的基本概念是,如果一个标记器不知道如何标记一个单词,则可以将其传递到链接的标记器。为了实现此目的, SequentialBackoffTagger 为我们提供了 Backoff tagging 特性。

Combining taggers or chaining taggers with each other is one of the important features of NLTK. The main concept behind combining taggers is that, in case if one tagger doesn’t know how to tag a word, it would be passed to the chained tagger. To achieve this purpose, SequentialBackoffTagger provides us the Backoff tagging feature.

Backoff Tagging

正如前面所述,反向标记是 SequentialBackoffTagger 的重要特性之一,它允许我们组合标记器,如果一个标记器不知道如何标记一个单词,则该单词将传递给下一个标记器,以此类推,直到没有反向标记器可以检查。

As told earlier, backoff tagging is one of the important features of SequentialBackoffTagger, which allows us to combine taggers in a way that if one tagger doesn’t know how to tag a word, the word would be passed to the next tagger and so on until there are no backoff taggers left to check.

How does it work?

实际上, SequentialBackoffTagger 的每个子类都可以使用“反向标记”关键字参数。此关键字参数的值是 SequentialBackoffTagger 的另一个实例。现在,每当初始化该 SequentialBackoffTagger 类时,将创建一个内部反向标记器列表(它本身为第一个元素)。此外,如果给出反向标记器,则将追加该反向标记器的内部列表。

Actually, every subclass of SequentialBackoffTagger can take a ‘backoff’ keyword argument. The value of this keyword argument is another instance of a SequentialBackoffTagger. Now whenever this SequentialBackoffTagger class is initialized, an internal list of backoff taggers (with itself as the first element) will be created. Moreover, if a backoff tagger is given, the internal list of this backoff taggers would be appended.
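
As a small illustrative sketch (with a hand-made training set rather than the treebank data used below), the chaining behaviour looks like this: the UnigramTagger handles words it has seen, and anything unknown falls back to the DefaultTagger −

from nltk.tag import DefaultTagger, UnigramTagger

toy_train_sentences = [[('the', 'DT'), ('movie', 'NN'), ('was', 'VBD'), ('good', 'JJ')]]
toy_tagger = UnigramTagger(toy_train_sentences, backoff = DefaultTagger('NN'))
print(toy_tagger.tag(['the', 'film', 'was', 'good']))
# [('the', 'DT'), ('film', 'NN'), ('was', 'VBD'), ('good', 'JJ')]
# 'film' was never seen, so the DefaultTagger backs it off to 'NN'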

在下面的示例中,我们将上面 Python 配方中的 DefaultTagger 作为反向标记器,配合我们训练的 UnigramTagger 使用。

In the example below, we are taking DefaultTagger as the backoff tagger for the UnigramTagger we trained in the above Python recipe.

Example

在此示例中,我们使用 DefaultTagger 作为反向标记器。每当 UnigramTagger 无法标记一个单词时,反向标记器(在本例中为 DefaultTagger)将用 ‘NN’ 标记它。

In this example, we are using DefaultTagger as the backoff tagger. Whenever the UnigramTagger is unable to tag a word, the backoff tagger, i.e. DefaultTagger in our case, will tag it with ‘NN’.

from nltk.tag import UnigramTagger
from nltk.tag import DefaultTagger
from nltk.corpus import treebank
train_sentences = treebank.tagged_sents()[:2500]
back_tagger = DefaultTagger('NN')
Uni_tagger = UnigramTagger(train_sentences, backoff = back_tagger)
test_sentences = treebank.tagged_sents()[1500:]
Uni_tagger.evaluate(test_sentences)

Output

0.9061975746536931

从上面的输出中,你可以看到通过添加反向标记器,准确度提高了 1% 以上。

From the above output, you can observe that by adding a backoff tagger the accuracy increases by a little over 1%.

Saving taggers with pickle

正如我们所看到的,训练标记器非常麻烦,而且费时。为了节省时间,我们可以将训练过的标记器腌制起来,以便稍后使用。在下面的示例中,我们将对已训练过的标记器 ‘Uni_tagger’ 执行此操作。

As we have seen, training a tagger is cumbersome and also takes time. To save time, we can pickle a trained tagger for later use. In the example below, we are going to do this with our already trained tagger named ‘Uni_tagger’.

Example

import pickle
f = open('Uni_tagger.pickle','wb')
pickle.dump(Uni_tagger, f)
f.close()
f = open('Uni_tagger.pickle','rb')
Uni_tagger = pickle.load(f)

NgramTagger Class

从前面单元中讨论的层次结构图可知, UnigramTagger 继承自 NgramTagger 类,而 NgramTagger 类还有另外两个子类 −

From the hierarchy diagram discussed in the previous unit, UnigramTagger is inherited from the NgramTagger class, but we have two more subclasses of the NgramTagger class −

BigramTagger subclass

实际上,n 元语法是 n 个项目的子序列,因此,顾名思义, BigramTagger 子类查看两个项目:第一个项目是前一个单词的标记,第二个项目是当前单词。

Actually an ngram is a subsequence of n items, hence, as the name implies, the BigramTagger subclass looks at two items: the first item is the tag of the previous word and the second item is the current word.

TrigramTagger subclass

与 BigramTagger 类似, TrigramTagger 子类查看三个项目,即前两个单词的标记和当前单词。

On the same note as BigramTagger, the TrigramTagger subclass looks at three items, i.e. the tags of the two previous words and the current word.

实际上,如果我们像使用 UnigramTagger 子类一样单独应用 BigramTaggerTrigramTagger 子类,它们的表现都非常糟糕。让我们看下面的示例:

Practically if we apply BigramTagger and TrigramTagger subclasses individually as we did with UnigramTagger subclass, they both perform very poorly. Let us see in the examples below:

Using BigramTagger Subclass

from nltk.tag import BigramTagger
from nltk.corpus import treebank
train_sentences = treebank.tagged_sents()[:2500]
Bi_tagger = BigramTagger(train_sentences)
test_sentences = treebank.tagged_sents()[1500:]
Bi_tagger.evaluate(test_sentences)

Output

0.44669191071913594

Using TrigramTagger Subclass

from nltk.tag import TrigramTagger
from nltk.corpus import treebank
train_sentences = treebank.tagged_sents()[:2500]
Tri_tagger = TrigramTagger(train_sentences)
test_sentences = treebank.tagged_sents()[1500:]
Tri_tagger.evaluate(test_sentences)

Output

0.41949863394526193

你可以比较我们以前使用过的 UnigramTagger 的性能(准确度约为 89%)与 BigramTagger(准确度约为 44%)和 TrigramTagger(准确度约为 41%)。原因是 Bigram 和 Trigram 标记器无法从句子中的第一个单词(几个单词)中学习上下文。另一方面,UnigramTagger 类不关心之前的上下文,并猜测每个单词最常见的标记,因此能够有很高的基线准确度。

You can compare the performance of the UnigramTagger we used previously (around 89% accuracy) with the BigramTagger (around 44% accuracy) and the TrigramTagger (around 41% accuracy). The reason is that Bigram and Trigram taggers cannot learn context from the first word(s) in a sentence. On the other hand, the UnigramTagger class doesn’t care about the previous context and guesses the most common tag for each word, and is hence able to achieve a high baseline accuracy.

Combining ngram taggers

从上面的示例中,显然当我们将 Bigram 和 Trigram 标记器与反向标记结合使用时,它们可以有所贡献。在下面的示例中,我们将 Unigram、Bigram 和 Trigram 标记器与反向标记结合使用。该概念与将 UnigramTagger 与反向标记器结合使用时的前一个方法相同。唯一的区别在于,我们使用下面给出的 tagger_util.py 中名为 backoff_tagger() 的函数进行反向操作。

As from the above examples, it is obvious that Bigram and Trigram taggers can contribute when we combine them with backoff tagging. In the example below, we are combining Unigram, Bigram and Trigram taggers with backoff tagging. The concept is same as the previous recipe while combining the UnigramTagger with backoff tagger. The only difference is that we are using the function named backoff_tagger() from tagger_util.py, given below, for backoff operation.

def backoff_tagger(train_sentences, tagger_classes, backoff=None):
   # Train each tagger class in turn, using the previously trained tagger as its backoff
   for cls in tagger_classes:
      backoff = cls(train_sentences, backoff=backoff)
   return backoff

Example

from tagger_util import backoff_tagger
from nltk.tag import UnigramTagger
from nltk.tag import BigramTagger
from nltk.tag import TrigramTagger
from nltk.tag import DefaultTagger
from nltk.corpus import treebank
train_sentences = treebank.tagged_sents()[:2500]
back_tagger = DefaultTagger('NN')
Combine_tagger = backoff_tagger(train_sentences,
[UnigramTagger, BigramTagger, TrigramTagger], backoff = back_tagger)
test_sentences = treebank.tagged_sents()[1500:]
Combine_tagger.evaluate(test_sentences)

Output

0.9234530029238365

从上面的输出中,我们可以看到它将准确度提高了约 3%。

From the above output, we can see it increases the accuracy by around 3%.

More Natural Language Toolkit Taggers

Affix Tagger

ContextTagger 子类的另一个重要类是 AffixTagger。在 AffixTagger 类中,上下文是单词的前缀或后缀。由于这个原因,AffixTagger 类能够根据单词开头或结尾的固定长度子串来学习标记。

One another important class of ContextTagger subclass is AffixTagger. In AffixTagger class, the context is either prefix or suffix of a word. That is the reason AffixTagger class can learn tags based on fixed-length substrings of the beginning or ending of a word.

How does it work?

它的运作取决于名为 affix_length 的参数,该参数指定前缀或后缀的长度。默认值为 3。但是,它如何区别 AffixTagger 类学习的是单词的前缀还是后缀?

Its working depends upon the argument named affix_length which specifies the length of the prefix or suffix. The default value is 3. But how it distinguishes whether AffixTagger class learned word’s prefix or suffix?

  1. affix_length=positive − If the value of affix_length is positive, it means that the AffixTagger class will learn word prefixes.

  2. affix_length=negative − If the value of affix_length is negative, it means that the AffixTagger class will learn word suffixes.

为了让其更清晰,在下面的示例中,我们将对带标记的树库句子使用 AffixTagger 类。

To make it clearer, in the example below, we will be using AffixTagger class on tagged treebank sentences.

Example

在此示例中,AffixTagger 将学习单词的前缀,因为我们没有为 affix_length 参数指定任何值。参数将采用默认值 3 −

In this example, AffixTagger will learn word’s prefix because we are not specifying any value for affix_length argument. The argument will take default value 3 −

from nltk.tag import AffixTagger
from nltk.corpus import treebank
train_sentences = treebank.tagged_sents()[:2500]
Prefix_tagger = AffixTagger(train_sentences)
test_sentences = treebank.tagged_sents()[1500:]
Prefix_tagger.evaluate(test_sentences)

Output

0.2800492099250667

让我们在下面的示例中看看,当我们为 affix_length 参数提供值 4 时准确度会如何 −

Let us see in the example below what will be the accuracy when we provide value 4 to affix_length argument −

from nltk.tag import AffixTagger
from nltk.corpus import treebank
train_sentences = treebank.tagged_sents()[:2500]
Prefix_tagger = AffixTagger(train_sentences, affix_length=4 )
test_sentences = treebank.tagged_sents()[1500:]
Prefix_tagger.evaluate(test_sentences)

Output

0.18154947354966527

Example

在此示例中,AffixTagger 将学习单词的后缀,因为我们将为 affix_length 参数指定负值。

In this example, AffixTagger will learn word’s suffix because we will specify negative value for affix_length argument.

from nltk.tag import AffixTagger
from nltk.corpus import treebank
train_sentences = treebank.tagged_sents()[:2500]
Suffix_tagger = AffixTagger(train_sentences, affix_length = -3)
test_sentences = treebank.tagged_sents()[1500:]
Suffix_tagger.evaluate(test_sentences)

Output

0.2800492099250667

Brill Tagger

Brill Tagger 是一种基于转换的标记器。NLTK 提供了 BrillTagger 类,这是我们介绍的第一个不是 SequentialBackoffTagger 子类的标记器。相反, BrillTagger 使用一系列规则来校正初始标记器的结果。

Brill Tagger is a transformation-based tagger. NLTK provides the BrillTagger class, which is the first tagger covered here that is not a subclass of SequentialBackoffTagger. Instead, BrillTagger uses a series of rules to correct the results of an initial tagger.

How does it work?

要使用 BrillTaggerTrainer 训练 BrillTagger 类,我们定义以下函数−

To train a BrillTagger class using BrillTaggerTrainer we define the following function −

from nltk.tag import brill, brill_trainer

def train_brill_tagger(initial_tagger, train_sentences, **kwargs):
   templates = [
      brill.Template(brill.Pos([-1])),
      brill.Template(brill.Pos([1])),
      brill.Template(brill.Pos([-2])),
      brill.Template(brill.Pos([2])),
      brill.Template(brill.Pos([-2, -1])),
      brill.Template(brill.Pos([1, 2])),
      brill.Template(brill.Pos([-3, -2, -1])),
      brill.Template(brill.Pos([1, 2, 3])),
      brill.Template(brill.Pos([-1]), brill.Pos([1])),
      brill.Template(brill.Word([-1])),
      brill.Template(brill.Word([1])),
      brill.Template(brill.Word([-2])),
      brill.Template(brill.Word([2])),
      brill.Template(brill.Word([-2, -1])),
      brill.Template(brill.Word([1, 2])),
      brill.Template(brill.Word([-3, -2, -1])),
      brill.Template(brill.Word([1, 2, 3])),
      brill.Template(brill.Word([-1]), brill.Word([1])),
   ]
   trainer = brill_trainer.BrillTaggerTrainer(initial_tagger, templates, deterministic=True)
   return trainer.train(train_sentences, **kwargs)

正如我们所看到的,此函数需要 initial_taggertrain_sentences 。它采用一个 initial_tagger 参数和一个模板列表,实现了 BrillTemplate 接口。 BrillTemplate 接口位于 nltk.tbl.template 模块中。此类实现之一是 brill.Template 类。

As we can see, this function requires initial_tagger and train_sentences. It takes an initial_tagger argument and a list of templates, which implements the BrillTemplate interface. The BrillTemplate interface is found in the nltk.tbl.template module. One of such implementation is brill.Template class.

基于转换的标记器的主要作用是生成转换规则,校正初始标记器的输出,使其更符合训练语句。让我们看看下面的工作流−

The main role of transformation-based tagger is to generate transformation rules that correct the initial tagger’s output to be more in-line with the training sentences. Let us see the workflow below −

brilltemplate

Example

对于此示例,我们将使用我们在组合标记器时(上一份配方中)创建的 Combine_tagger,即由 NgramTagger 类构成的反向链,作为 initial_tagger。首先,让我们使用 Combine_tagger 评估结果,然后将其用作 initial_tagger 来训练 brill 标记器。

For this example, we will be using the Combine_tagger which we created while combining taggers (in the previous recipe) from a backoff chain of NgramTagger classes, as the initial_tagger. First, let us evaluate the result using Combine_tagger and then use that as the initial_tagger to train the brill tagger.

from tagger_util import backoff_tagger
from nltk.tag import UnigramTagger
from nltk.tag import BigramTagger
from nltk.tag import TrigramTagger
from nltk.tag import DefaultTagger
from nltk.corpus import treebank
train_sentences = treebank.tagged_sents()[:2500]
back_tagger = DefaultTagger('NN')
Combine_tagger = backoff_tagger(
   train_sentences, [UnigramTagger, BigramTagger, TrigramTagger], backoff = back_tagger
)
test_sentences = treebank.tagged_sents()[1500:]
Combine_tagger.evaluate(test_sentences)

Output

0.9234530029238365

现在,让我们看看当 Combine_tagger 用作 initial_tagger 来训练 brill 标记器时的评估结果。

Now, let us see the evaluation result when Combine_tagger is used as initial_tagger to train brill tagger −

from tagger_util import train_brill_tagger
brill_tagger = train_brill_tagger(Combine_tagger, train_sentences)
brill_tagger.evaluate(test_sentences)

Output

0.9246832510505041

我们可以注意到, BrillTagger 类的准确度比 Combine_tagger 略有提高。

We can notice that BrillTagger class has slight increased accuracy over the Combine_tagger.

Complete implementation example

from tagger_util import backoff_tagger
from nltk.tag import UnigramTagger
from nltk.tag import BigramTagger
from nltk.tag import TrigramTagger
from nltk.tag import DefaultTagger
from nltk.corpus import treebank
train_sentences = treebank.tagged_sents()[:2500]
back_tagger = DefaultTagger('NN')
Combine_tagger = backoff_tagger(train_sentences,
[UnigramTagger, BigramTagger, TrigramTagger], backoff = back_tagger)
test_sentences = treebank.tagged_sents()[1500:]
Combine_tagger.evaluate(test_sentences)
from tagger_util import train_brill_tagger
brill_tagger = train_brill_tagger(Combine_tagger, train_sentences)
brill_tagger.evaluate(test_sentences)

Output

0.9234530029238365
0.9246832510505041

TnT Tagger

TnT 标记器,表示 Trigrams’nTags,是一个基于二阶马尔可夫模型的统计标记器。

TnT Tagger, stands for Trigrams’nTags, is a statistical tagger which is based on second order Markov models.

How does it work?

我们可以借助以下步骤了解 TnT 标记器的工作原理:

We can understand the working of TnT tagger with the help of following steps −

  1. First, based on the training data, the TnT tagger maintains several internal FreqDist and ConditionalFreqDist instances.

  2. After that, unigrams, bigrams and trigrams are counted by these frequency distributions.

  3. Now, during tagging, it uses these frequencies to calculate the probabilities of possible tags for each word.

这就是为什么它不构建 NgramTagger 的后备链,而是将所有 ngram 模型一起用来为每个单词选择最佳标记。让我们在以下示例中评估 TnT 标记器的准确性:

That’s why instead of constructing a backoff chain of NgramTagger, it uses all the ngram models together to choose the best tag for each word. Let us evaluate the accuracy with TnT tagger in the following example −

from nltk.tag import tnt
from nltk.corpus import treebank
train_sentences = treebank.tagged_sents()[:2500]
tnt_tagger = tnt.TnT()
tnt_tagger.train(train_sentences)
test_sentences = treebank.tagged_sents()[1500:]
tnt_tagger.evaluate(test_sentences)

Output

0.9165508316157791

我们的准确度略低于 Brill Tagger。

We get slightly less accuracy than we got with the Brill Tagger.

请注意,在 evaluate() 之前,我们需要调用 train() ,否则我们将获得 0% 的准确度。

Please note that we need to call train() before evaluate() otherwise we will get 0% accuracy.
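
Once trained, the TnT tagger can be used like any other TaggerI implementation. A minimal sketch (assuming the treebank corpus data is available, and using the same training slice as above) −

from nltk.tag import tnt
from nltk.corpus import treebank

tnt_tagger = tnt.TnT()
tnt_tagger.train(treebank.tagged_sents()[:2500])
# Words never seen during training are tagged 'Unk' unless an unknown-word tagger is supplied
print(tnt_tagger.tag(['Pierre', 'Vinken', 'will', 'join', 'the', 'board']))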

Natural Language Toolkit - Parsing

Parsing and its relevance in NLP

单词“Parsing”源自拉丁语单词 ‘pars’ (含义为 ‘part’ ),用于从文本中提取确切含义或字典含义。它也称为句法分析或语法分析。通过比较正式语法的规则,语法分析检查了文本的含义。例如,诸如“给我热冰淇淋”之类句子将被解析器或句法分析器所拒绝。

The word ‘Parsing’, whose origin is the Latin word ‘pars’ (which means ‘part’), is used to draw exact meaning or dictionary meaning from the text. It is also called syntactic analysis or syntax analysis. By comparing the text against the rules of a formal grammar, syntax analysis checks it for meaningfulness. A sentence like “Give me hot ice-cream”, for example, would be rejected by the parser or syntactic analyzer.

从这个意义上说,我们可以定义解析、句法分析或语法分析如下 −

In this sense, we can define parsing or syntactic analysis or syntax analysis as follows −

可以将其定义为分析符号串的过程,该串是符合正式语法规则的自然语言。

It may be defined as the process of analyzing the strings of symbols in natural language conforming to the rules of formal grammar.

relevance

我们可以借助以下几点理解解析在 NLP 中的相关性 −

We can understand the relevance of parsing in NLP with the help of following points −

  1. Parser is used to report any syntax error.

  2. It helps to recover from commonly occurring error so that the processing of the remainder of program can be continued.

  3. Parse tree is created with the help of a parser.

  4. Parser is used to create symbol table, which plays an important role in NLP.

  5. Parser is also used to produce intermediate representations (IR).

Deep Vs Shallow Parsing

Deep Parsing

  1. In deep parsing, the search strategy gives a complete syntactic structure to a sentence.

  2. It is suitable for complex NLP applications.

  3. Dialogue systems and summarization are examples of NLP applications where deep parsing is used.

  4. It is also called full parsing.

Shallow Parsing

  1. It parses only a limited part of the syntactic information from the given text.

  2. It can be used for less complex NLP applications.

  3. Information extraction and text mining are examples of NLP applications where shallow parsing is used.

  4. It is also called chunking.

Various types of parsers

如讨论的那样,解析器基本上是对语法的程序解释。它在通过各种树的空间搜索后为给定的句子找到了最佳树。请参阅下方一些可用的解析器 –

As discussed, a parser is basically a procedural interpretation of grammar. It finds an optimal tree for the given sentence after searching through the space of a variety of trees. Let us see some of the available parsers below −

Recursive descent parser

递归下降解析是最直接的解析形式之一。以下是有关递归下降解析器的一些重要要点 –

Recursive descent parsing is one of the most straightforward forms of parsing. Following are some important points about recursive descent parser −

  1. It follows a top down process.

  2. It attempts to verify that the syntax of the input stream is correct or not.

  3. It reads the input sentence from left to right.

  4. One necessary operation for a recursive descent parser is to read characters from the input stream and match them against the terminals of the grammar.
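
As a minimal sketch (with a hand-written toy grammar, not part of the original text), NLTK's RecursiveDescentParser can be used as follows −

import nltk

toy_grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DT NN
VP -> VBZ NP
DT -> 'the'
NN -> 'dog' | 'cat'
VBZ -> 'sees'
""")
rd_parser = nltk.RecursiveDescentParser(toy_grammar)
# Top-down parsing of a tokenized sentence against the toy grammar
for tree in rd_parser.parse(['the', 'dog', 'sees', 'the', 'cat']):
   print(tree)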

Shift-reduce parser

以下是有关移-约解析器的一些重要要点 –

Following are some important points about shift-reduce parser −

  1. It follows a simple bottom-up process.

  2. It tries to find a sequence of words and phrases that correspond to the right-hand side of a grammar production and replaces them with the left-hand side of the production.

  3. The above attempt to find a sequence of words continues until the whole sentence is reduced.

  4. In other simple words, shift-reduce parser starts with the input symbol and tries to construct the parser tree up to the start symbol.

Chart parser

以下是有关图表解析器的一些重要要点 –

Following are some important points about chart parser −

  1. It is mainly useful or suitable for ambiguous grammars, including grammars of natural languages.

  2. It applies dynamic programming to the parsing problems.

  3. Because of dynamic programming, partial hypothesized results are stored in a structure called a ‘chart’.

  4. The ‘chart’ can also be re-used.
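
A short illustrative sketch (toy grammar, not from the original text) showing how a chart parser enumerates both readings of an ambiguous sentence −

import nltk

ambiguous_grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DT NN | NP PP
VP -> VBD NP | VP PP
PP -> IN NP
DT -> 'the'
NN -> 'dog' | 'man' | 'park'
VBD -> 'saw'
IN -> 'in'
""")
chart_parser = nltk.ChartParser(ambiguous_grammar)
# Both PP-attachment readings are produced from the shared chart
for tree in chart_parser.parse(['the', 'dog', 'saw', 'the', 'man', 'in', 'the', 'park']):
   print(tree)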

Regexp parser

正则表达式解析是最常用的解析技术之一。以下是关于正则表达式解析器的一些重要要点 -

Regexp parsing is one of the most commonly used parsing techniques. Following are some important points about the Regexp parser −

  1. As the name implies, it uses a regular expression defined in the form of grammar on top of a POS-tagged string.

  2. It basically uses these regular expressions to parse the input sentences and generate a parse tree out of this.

Example

以下是正则表达式解析器的实际示例 -

Following is a working example of Regexp Parser −

import nltk
sentence = [
   ("a", "DT"),
   ("clever", "JJ"),
   ("fox","NN"),
   ("was","VBP"),
   ("jumping","VBP"),
   ("over","IN"),
   ("the","DT"),
   ("wall","NN")
]
grammar = "NP:{<DT>?<JJ>*<NN>}"
Reg_parser = nltk.RegexpParser(grammar)
Reg_parser.parse(sentence)
Output = Reg_parser.parse(sentence)
Output.draw()

Output

regexp parser

Dependency Parsing

依存关系解析 (DP),一种现代解析机制,其主要概念是每个语言单元(即单词)通过直接链接相互关联。这些直接链接在语言学中实际上是 ‘dependencies’ 。例如,下图显示了句子 “John can hit the ball” 的依存关系语法。

Dependency Parsing (DP) is a modern parsing mechanism whose main concept is that each linguistic unit, i.e. each word, relates to the others by direct links. These direct links are actually ‘dependencies’ in linguistics. For example, the following diagram shows the dependency grammar for the sentence “John can hit the ball”.

dependency parsing

NLTK Package

以下是使用 NLTK 进行依存关系解析的两种方式 -

We have the following two ways to do dependency parsing with NLTK −

Probabilistic, projective dependency parser

这是我们可以使用 NLTK 进行依存关系解析的第一种方式。但此解析器对使用有限的训练数据进行训练有限制。

This is the first way we can do dependency parsing with NLTK. But this parser has the restriction of training with a limited set of training data.
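
Related to this, NLTK also ships a simple rule-based ProjectiveDependencyParser that works with a hand-written dependency grammar. The following minimal sketch (toy grammar, purely for illustration and not the probabilistic parser itself) shows the dependency-parsing idea in NLTK −

import nltk

dep_grammar = nltk.DependencyGrammar.fromstring("""
'hit' -> 'John' | 'ball'
'ball' -> 'the'
""")
dep_parser = nltk.ProjectiveDependencyParser(dep_grammar)
# Each word depends directly on its head, as declared in the grammar above
for tree in dep_parser.parse(['John', 'hit', 'the', 'ball']):
   print(tree)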

Stanford parser

这是我们可以使用 NLTK 执行依存关系解析的另一种方式。斯坦福解析器是一种最先进的依存关系解析器。NLTK 对此进行了包装。要使用它,我们需要下载以下两样东西 -

This is another way we can do dependency parsing with NLTK. Stanford parser is a state-of-the-art dependency parser. NLTK has a wrapper around it. To use it we need to download following two things −

Stanford 解析器本身,以及所需语言的 Language model 。例如,英语语言模型。

The Stanford parser itself, and the language model for the desired language. For example, the English language model.

Example

下载模型后,我们可以通过 NLTK 使用它,如下所示 -

Once you downloaded the model, we can use it through NLTK as follows −

from nltk.parse.stanford import StanfordDependencyParser
path_jar = 'path_to/stanford-parser-full-2014-08-27/stanford-parser.jar'
path_models_jar = 'path_to/stanford-parser-full-2014-08-27/stanford-parser-3.4.1-models.jar'
dep_parser = StanfordDependencyParser(
   path_to_jar = path_jar, path_to_models_jar = path_models_jar
)
result = dep_parser.raw_parse('I shot an elephant in my sleep')
dependency = next(result)
list(dependency.triples())

Output

[
   ((u'shot', u'VBD'), u'nsubj', (u'I', u'PRP')),
   ((u'shot', u'VBD'), u'dobj', (u'elephant', u'NN')),
   ((u'elephant', u'NN'), u'det', (u'an', u'DT')),
   ((u'shot', u'VBD'), u'prep', (u'in', u'IN')),
   ((u'in', u'IN'), u'pobj', (u'sleep', u'NN')),
   ((u'sleep', u'NN'), u'poss', (u'my', u'PRP$'))
]

Chunking & Information Extraction

What is Chunking?

分块是自然语言处理中最重要的过程之一,用于识别词性 (POS) 和短语。换句话说,通过分块,我们可以获得句子的结构。它也称为 partial parsing

Chunking, one of the important processes in natural language processing, is used to identify parts of speech (POS) and short phrases. In other simple words, with chunking, we can get the structure of the sentence. It is also called partial parsing.

Chunk patterns and chinks

Chunk patterns 是词性 (POS) 标记模式,用于定义组成块的单词类型。我们可以借助改进的正则表达式来定义块模式。

Chunk patterns are the patterns of part-of-speech (POS) tags that define what kind of words made up a chunk. We can define chunk patterns with the help of modified regular expressions.

此外,我们还可以定义不应该在块中的单词类型的模式,这些未分块的单词称为 chinks

Moreover, we can also define patterns for what kind of words should not be in a chunk and these unchunked words are known as chinks.

Implementation example

在以下示例中,除了解析句子 “the book has many chapters”, 的结果之外,一个名词短语的语法将块和裂纹模式都结合在一起 −

In the example below, along with the result of parsing the sentence “the book has many chapters”, there is a grammar for noun phrases that combines both a chunk and a chink pattern −

import nltk
sentence = [
   ("the", "DT"),
   ("book", "NN"),
   ("has","VBZ"),
   ("many","JJ"),
   ("chapters","NNS")
]
chunker = nltk.RegexpParser(
   r'''
   NP:{<DT><NN.*><.*>*<NN.*>}
   }<VB.*>{
   '''
)
chunker.parse(sentence)
Output = chunker.parse(sentence)
Output.draw()

Output

chunk patterns

正如上文所示,用于指定块的模式如下使用大括号 −

As seen above, the pattern for specifying a chunk is to use curly braces as follows −

{<DT><NN>}

为了指定一个细微差别,我们可以翻转大括号,如下所示 −

And to specify a chink, we can flip the braces, as follows −

}<VB>{

现在,就特定的短语类型而言,这些规则可以组合成一个语法。

Now, for a particular phrase type, these rules can be combined into a grammar.

Information Extraction

我们已经经历了标记器以及可以用来构建信息提取引擎的解析器。让我们看看一个基本的信息提取管道 −

We have gone through taggers as well as parsers that can be used to build information extraction engine. Let us see a basic information extraction pipeline −

extraction

信息提取有很多应用程序,包括 −

Information extraction has many applications including −

  1. Business intelligence

  2. Resume harvesting

  3. Media analysis

  4. Sentiment detection

  5. Patent search

  6. Email scanning

Named-entity recognition (NER)

命名实体识别 (NER) 实际上是提取一些最常见实体(如姓名、组织、位置等)的一种方法。让我们看看一个示例,它采用了所有预处理步骤,如句子标记化、词性标注、组块、NER,并遵循上图中提供的管道。

Named-entity recognition (NER) is actually a way of extracting some of the most common entities like names, organizations, locations, etc. Let us see an example that takes all the preprocessing steps, such as sentence tokenization, POS tagging, chunking and NER, and follows the pipeline provided in the figure above.

Example

import nltk
file = open (
   # provide here the absolute path for the file of text for which we want NER
)
data_text = file.read()
sentences = nltk.sent_tokenize(data_text)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
for sent in tagged_sentences:
   print(nltk.ne_chunk(sent))
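
For a single sentence, the same pipeline can be sketched in a few lines (an illustrative example with a made-up sentence, assuming the punkt, tagger, maxent_ne_chunker and words data packages have been downloaded) −

import nltk

sentence = "Mark Zuckerberg founded Facebook in California"
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)
# ne_chunk returns a Tree in which named entities appear as labelled subtrees
print(nltk.ne_chunk(tagged))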

一些修改后的命名实体识别 (NER) 也可用于提取诸如产品名称、生物医学实体、品牌名称等实体。

Some of the modified Named-entity recognition (NER) can also be used to extract entities such as product names, bio-medical entities, brand name and much more.

Relation extraction

关系提取是另一种常用的信息提取操作,它是提取各种实体之间不同关系的过程。可能有不同的关系,如继承、同义词、类比等,其定义取决于信息需求。例如,假设如果我们想要查找一本书的作者,那么作者身份将是作者名称和书名之间的关系。

Relation extraction, another commonly used information extraction operation, is the process of extracting the different relationships between various entities. There can be different relationships like inheritance, synonymy, analogy, etc., whose definition depends on the information need. For example, suppose we want to look for the writer of a book; then authorship would be a relation between the author name and the book name.

Example

在以下示例中,我们使用与上图所示相同的 IE 管道,一直使用到命名实体识别 (NER),并使用基于 NER 标记的关系模式对其进行扩展。

In the following example, we use the same IE pipeline shown in the above diagram, which we used up to Named-entity recognition (NER), and extend it with a relation pattern based on the NER tags.

import nltk
import re
IN = re.compile(r'.*\bin\b(?!\b.+ing)')
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
   for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus = 'ieer', pattern = IN):
      print(nltk.sem.rtuple(rel))

Output

[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']
[ORG: 'McGlashan & Sarrail'] 'firm in' [LOC: 'San Mateo']
[ORG: 'Freedom Forum'] 'in' [LOC: 'Arlington']
[ORG: 'Brookings Institution'] ', the research group in' [LOC: 'Washington']
[ORG: 'Idealab'] ', a self-described business incubator based in' [LOC: 'Los Angeles']
[ORG: 'Open Text'] ', based in' [LOC: 'Waterloo']
[ORG: 'WGBH'] 'in' [LOC: 'Boston']
[ORG: 'Bastille Opera'] 'in' [LOC: 'Paris']
[ORG: 'Omnicom'] 'in' [LOC: 'New York']
[ORG: 'DDB Needham'] 'in' [LOC: 'New York']
[ORG: 'Kaplan Thaler Group'] 'in' [LOC: 'New York']
[ORG: 'BBDO South'] 'in' [LOC: 'Atlanta']
[ORG: 'Georgia-Pacific'] 'in' [LOC: 'Atlanta']

在上面的代码中,我们使用了一个名为 ieer 的内置语料库。在这个语料库中,句子已被标注到命名实体识别 (NER) 级别。在这里,我们只需要指定我们想要的关系模式以及我们希望关系涉及的命名实体类型。在我们的示例中,我们定义了组织和位置之间的关系,并提取了这些模式的所有组合。

In the above code, we have used an inbuilt corpus named ieer. In this corpus, the sentences are annotated up to the Named-entity recognition (NER) level. Here we only need to specify the relation pattern that we want and the kinds of named entities we want the relation to involve. In our example, we defined a relationship between an organization and a location, and we extracted all combinations of these patterns.

Natural Language Toolkit - Transforming Chunks

Why transforming Chunks?

到目前为止,我们已经从句子中获得了部分或短语,但我们应该用它们做什么。其中一项重要任务是对它们进行转换。但是为什么呢?是为了执行以下操作 -

Till now we have got chunks or phrases from sentences, but what are we supposed to do with them? One of the important tasks is to transform them. But why? It is to do the following −

  1. grammatical correction and

  2. rearranging phrases

Filtering insignificant/useless words

假设您想判断短语的含义,那么有许多常用单词,例如,“the”、“a”是无关紧要的或无用的。例如,请看以下短语 -

Suppose you want to judge the meaning of a phrase; there are many commonly used words, such as ‘the’ and ‘a’, that are insignificant or useless. For example, see the following phrase −

“The movie was good”。

‘The movie was good’.

这里最重要的单词是“movie”和“good”。其他单词“the”和“was”都是无用的或无关紧要的。这是因为没有它们,我们也可以获得短语的相同含义。“Good movie”。

Here the most significant words are ‘movie’ and ‘good’. The other words, ‘the’ and ‘was’, are both useless or insignificant, because even without them we get the same meaning from the phrase: ‘Good movie’.

在以下 Python 配方中,我们将学习如何使用 POS 标记删除无用/无关紧要的单词并保留有意义的单词。

In the following python recipe, we will learn how to remove useless/insignificant words and keep the significant words with the help of POS tags.

Example

首先,通过查看 treebank 语料库以获取停用词,我们需要确定哪些词性标记是有意义的,哪些是没有意义的。让我们看看下表,其中包含无关紧要的单词和标记 -

First, by looking through the treebank corpus for stopwords, we need to decide which part-of-speech tags are significant and which are not. Let us see the following table of insignificant words and tags −

Word    Tag
a       DT
All     PDT
An      DT
And     CC
Or      CC
That    WDT
The     DT

从上表中,我们可以看到除了 CC 之外,所有其他标签都以 DT 结尾,这意味着我们可以通过查看标签的后缀来过滤掉无关紧要的单词。

From the above table, we can see other than CC, all the other tags end with DT which means we can filter out insignificant words by looking at the tag’s suffix.

对于此示例,我们将使用一个名为 filter() 的函数,它获取一个块并返回一个不带任何无关紧要标记单词的新块。此函数会过滤掉所有以 DT 或 CC 结尾的标记。

For this example, we are going to use a function named filter() which takes a single chunk and returns a new chunk without any insignificant tagged words. This function filters out any tags that end with DT or CC.

Example

import nltk
def filter(chunk, tag_suffixes=['DT', 'CC']):
   significant = []
   for word, tag in chunk:
      ok = True
      for suffix in tag_suffixes:
         if tag.endswith(suffix):
            ok = False
            break
      if ok:
         significant.append((word, tag))
   return (significant)

现在,让我们在 Python 配方中使用此函数 filter() 来删除无关紧要的单词 -

Now, let us use this function filter() in our Python recipe to delete insignificant words −

from chunk_parse import filter
filter([('the', 'DT'),('good', 'JJ'),('movie', 'NN')])

Output

[('good', 'JJ'), ('movie', 'NN')]

Verb Correction

在现实世界语言中,我们经常看到不正确的动词形式。例如,“is you fine?”是不正确的。这个句子中的动词形式不正确。这个句子应该是“are you fine?”NLTK 通过创建动词更正映射为我们提供了纠正此类错误的方法。这些更正映射的使用取决于块中是否有复数或单数名词。

Many times, in real-world language we see incorrect verb forms. For example, ‘is you fine?’ is not correct. The verb form is not correct in this sentence. The sentence should be ‘are you fine?’ NLTK provides us the way to correct such mistakes by creating verb correction mappings. These correction mappings are used depending on whether there is a plural or singular noun in the chunk.

Example

要实现 Python 配方,我们首先需要定义动词更正映射。让我们创建两个映射,如下所示 -

To implement the Python recipe, we first need to define verb correction mappings. Let us create two mappings as follows −

Plural to Singular mappings

Plural to Singular mappings

plural= {
   ('is', 'VBZ'): ('are', 'VBP'),
   ('was', 'VBD'): ('were', 'VBD')
}

Singular to Plural mappings

Singular to Plural mappings

singular = {
   ('are', 'VBP'): ('is', 'VBZ'),
   ('were', 'VBD'): ('was', 'VBD')
}

如上所示,每个映射都有一个标记动词,它映射到另一个标记动词。我们示例中的初始映射涵盖了映射 is to are, was to were 的基础,反之亦然。

As seen above, each mapping has a tagged verb which maps to another tagged verb. The initial mappings in our example cover the basic mappings of is to are and was to were, and vice versa.

接下来,我们将定义一个名为 verbs() 的函数,您可以在其中传递一个动词形式不正确的部分,我们将会从 verb() 函数获取一个更正的部分。要完成此操作, verb() 函数使用一个名为 index_chunk() 的帮助函数,它将在片段中搜索第一个标记单词的位置。

Next, we will define a function named verbs(), to which you can pass a chunk with an incorrect verb form and get a corrected chunk back. To get this done, the verbs() function uses a helper function named index_chunk() which searches the chunk for the position of the first tagged word that matches a given predicate.

让我们看看这些函数 -

Let us see these functions −

def index_chunk(chunk, pred, start = 0, step = 1):
   l = len(chunk)
   end = l if step > 0 else -1
   for i in range(start, end, step):
      if pred(chunk[i]):
         return i
   return None

def tag_startswith(prefix):
   def f(wt):
      return wt[1].startswith(prefix)
   return f

def verbs(chunk):
   vbidx = index_chunk(chunk, tag_startswith('VB'))
   if vbidx is None:
      return chunk
   verb, vbtag = chunk[vbidx]
   nnpred = tag_startswith('NN')
   nnidx = index_chunk(chunk, nnpred, start = vbidx+1)
   if nnidx is None:
      nnidx = index_chunk(chunk, nnpred, start = vbidx-1, step = -1)
   if nnidx is None:
      return chunk
   noun, nntag = chunk[nnidx]
   if nntag.endswith('S'):
      chunk[vbidx] = plural.get((verb, vbtag), (verb, vbtag))
   else:
      chunk[vbidx] = singular.get((verb, vbtag), (verb, vbtag))
   return chunk

将这些函数保存在 Python 文件中,在安装了 Python 或 Anaconda 的本地目录中运行该文件。我已将该文件保存在 verbcorrect.py 中。

Save these functions in a Python file in your local directory where Python or Anaconda is installed and run it. I have saved it as verbcorrect.py.

现在,我们对 is you fine 块上的 verbs() 函数使用 POS 标记 −

Now, let us call verbs() function on a POS tagged is you fine chunk −

from verbcorrect import verbs
verbs([('is', 'VBZ'), ('you', 'PRP$'), ('fine', 'VBG')])

Output

[('are', 'VBP'), ('you', 'PRP$'), ('fine','VBG')]

Eliminating passive voice from phrases

另一项有用的任务是从短语中消除被动语态。可以使用围绕动词交换单词来执行此操作。例如,可以将 ‘the tutorial was great’ 转变成 ‘the great tutorial’

Another useful task is to eliminate passive voice from phrases. This can be done with the help of swapping the words around a verb. For example, ‘the tutorial was great’ can be transformed into ‘the great tutorial’.

Example

为实现这一点,我们定义了一个名为 eliminate_passive() 的函数,其通过使用动词作为枢轴点来将块的右侧与左侧进行交换。为了找到围绕动词进行枢纽,它还将使用上述 index_chunk() 函数。

To achieve this we are defining a function named eliminate_passive() that will swap the right-hand side of the chunk with the left-hand side by using the verb as the pivot point. In order to find the verb to pivot around, it will also use the index_chunk() function defined above.

def eliminate_passive(chunk):
   # Predicate matching a non-gerund verb tag such as VBD or VBN
   def vbpred(wt):
      word, tag = wt
      return tag != 'VBG' and tag.startswith('VB') and len(tag) > 2
   vbidx = index_chunk(chunk, vbpred)
   if vbidx is None:
      return chunk
   # Pivot around the verb: everything after it comes first, and the verb itself is dropped
   return chunk[vbidx+1:] + chunk[:vbidx]

现在,我们对 the tutorial was great 块上的 eliminate_passive() 函数使用 POS 标记 −

Now, let us call the eliminate_passive() function on a POS-tagged ‘the tutorial was great’ chunk −

from passiveverb import eliminate_passive
eliminate_passive(
   [
      ('the', 'DT'), ('tutorial', 'NN'), ('was', 'VBD'), ('great', 'JJ')
   ]
)

Output

[('great', 'JJ'), ('the', 'DT'), ('tutorial', 'NN')]

Swapping noun cardinals

如我们所知,诸如 5 的基数词在块中标记为 CD。这些基数词常常在名词之前或之后出现,但出于规范化目的,将其始终放在名词之前会很有用。例如,可以将日期 January 5 写成 5 January 。让我们通过以下示例来理解。

As we know, a cardinal word such as 5 is tagged as CD in a chunk. These cardinal words often occur before or after a noun, but for normalization purposes it is useful to always put them before the noun. For example, the date January 5 can be written as 5 January. Let us understand it with the following example.

Example

为实现这一点,我们定义了一个名为 swapping_cardinals() 的函数,其将紧跟在名词之后的任何基数与该名词进行交换。通过此操作,基数将立即出现在名词之前。为与给定的标记进行相等性比较,该函数将使用一个名为 tag_eql() 的辅助函数。

To achieve this, we define a function named swapping_cardinals() that will swap any cardinal occurring immediately after a noun with that noun, so that the cardinal ends up immediately before the noun. In order to do an equality comparison with the given tag, it uses a helper function which we have named tag_eql().

def tag_eql(tag):
   def f(wt):
      return wt[1] == tag
   return f

现在,我们可以定义 swapping_cardinals()−

Now we can define swapping_cardinals() −

def swapping_cardinals(chunk):
   cdidx = index_chunk(chunk, tag_eql('CD'))
   # The cardinal must not be first in the chunk and must be preceded by a noun
   if not cdidx or not chunk[cdidx-1][1].startswith('NN'):
      return chunk
   noun, nntag = chunk[cdidx-1]
   # Swap the cardinal with the preceding noun
   chunk[cdidx-1] = chunk[cdidx]
   chunk[cdidx] = noun, nntag
   return chunk

现在,让我们对 “January 5” 上的 swapping_cardinals() 函数使用日期 −

Now, let us call the swapping_cardinals() function on the date “January 5” −

from Cardinals import swapping_cardinals
swapping_cardinals([('January', 'NNP'), ('5', 'CD')])

Output

[('5', 'CD'), ('January', 'NNP')]

Natural Language Toolkit - Transforming Trees

以下是变换树的两个原因 −

Following are the two reasons to transform the trees −

  1. To modify a deep parse tree, and

  2. To flatten deep parse trees

Converting Tree or Subtree to Sentence

我们即将在此讨论的第一个秘诀是将树或子树转换回句子或块字符串。这非常简单,让我们通过以下示例进行了解 −

The first recipe we are going to discuss here is how to convert a Tree or subtree back to a sentence or chunk string. This is very simple, as we can see in the following example −

Example

from nltk.corpus import treebank_chunk
tree = treebank_chunk.chunked_sents()[2]
' '.join([w for w, t in tree.leaves()])

Output

'Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields
PLC , was named a nonexecutive director of this British industrial
conglomerate .'

Deep tree flattening

嵌套短语的深度树无法用于训练块,因此我们必须在使用之前将其展开。在以下示例中,我们将从 treebank 语料库中使用第 3 个分析句子,它是嵌套短语的深度树。

Deep trees of nested phrases can’t be used for training a chunker, hence we must flatten them before use. In the following example, we are going to use the 3rd parsed sentence, which is a deep tree of nested phrases, from the treebank corpus.

Example

为实现这一点,我们定义了一个名为 deeptree_flat() 的函数,该函数将采用一个 Tree 并将返回一个仅保留最低级别树的新 Tree。为对大多数工作进行操作,该函数将使用一个名为 childtree_flat() 的辅助函数。

To achieve this, we define a function named deeptree_flat() that takes a single Tree and returns a new Tree that keeps only the lowest-level trees. In order to do most of the work, it uses a helper function which we have named childtree_flat().

from nltk.tree import Tree

def childtree_flat(trees):
   children = []
   for t in trees:
      if t.height() < 3:
         # Already a flat (word, tag) level tree; keep its leaves
         children.extend(t.pos())
      elif t.height() == 3:
         # Lowest-level phrase tree; keep it as a chunk
         children.append(Tree(t.label(), t.pos()))
      else:
         # Deeper tree; recursively flatten its children
         children.extend(childtree_flat([c for c in t]))
   return children

def deeptree_flat(tree):
   return Tree(tree.label(), childtree_flat([c for c in tree]))

现在,让我们从 treebank 语料库中对第 3 个分析句子(是嵌套短语的深度树)调用 deeptree_flat() 函数。我们将这些函数保存在名为 deeptree.py 的文件中。

Now, let us call the deeptree_flat() function on the 3rd parsed sentence, which is a deep tree of nested phrases, from the treebank corpus. We saved these functions in a file named deeptree.py.

from deeptree import deeptree_flat
from nltk.corpus import treebank
deeptree_flat(treebank.parsed_sents()[2])

Output

Tree('S', [Tree('NP', [('Rudolph', 'NNP'), ('Agnew', 'NNP')]),
(',', ','), Tree('NP', [('55', 'CD'),
('years', 'NNS')]), ('old', 'JJ'), ('and', 'CC'),
Tree('NP', [('former', 'JJ'),
('chairman', 'NN')]), ('of', 'IN'), Tree('NP', [('Consolidated', 'NNP'),
('Gold', 'NNP'), ('Fields', 'NNP'), ('PLC',
'NNP')]), (',', ','), ('was', 'VBD'),
('named', 'VBN'), Tree('NP-SBJ', [('*-1', '-NONE-')]),
Tree('NP', [('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN')]),
('of', 'IN'), Tree('NP',
[('this', 'DT'), ('British', 'JJ'),
('industrial', 'JJ'), ('conglomerate', 'NN')]), ('.', '.')])

Building Shallow tree

在前面一节中,我们通过仅保留最低级别的子树来展开嵌套短语的深度树。在本节中,我们将仅保留最高级别的子树,即构建浅树。在以下示例中,我们将从 treebank 语料库中使用第 3 个分析句子,它是嵌套短语的深度树。

In the previous section, we flattened a deep tree of nested phrases by keeping only the lowest-level subtrees. In this section, we are going to keep only the highest-level subtrees, i.e. build the shallow tree. In the following example we are going to use the 3rd parsed sentence, which is a deep tree of nested phrases, from the treebank corpus.

Example

为实现这一点,我们定义了一个名为 tree_shallow() 的函数,该函数将通过仅保留顶部子树标签来消除所有的嵌套子树。

To achieve this, we are defining a function named tree_shallow() that will eliminate all the nested subtrees by keeping only the top subtree labels.

from nltk.tree import Tree

def tree_shallow(tree):
   children = []
   for t in tree:
      if t.height() < 3:
         # Already flat; keep the (word, tag) pairs
         children.extend(t.pos())
      else:
         # Collapse the whole subtree into a single chunk under its top label
         children.append(Tree(t.label(), t.pos()))
   return Tree(tree.label(), children)

现在,让我们从 treebank 语料库中对第 3 个分析句子(是嵌套短语的深度树)调用 tree_shallow() 函数。我们将这些函数保存在名为 shallowtree.py 的文件中。

Now, let us call the tree_shallow() function on the 3rd parsed sentence, which is a deep tree of nested phrases, from the treebank corpus. We saved these functions in a file named shallowtree.py.

from shallowtree import tree_shallow
from nltk.corpus import treebank
tree_shallow(treebank.parsed_sents()[2])

Output

Tree('S', [Tree('NP-SBJ-1', [('Rudolph', 'NNP'), ('Agnew', 'NNP'), (',', ','),
('55', 'CD'), ('years', 'NNS'), ('old', 'JJ'), ('and', 'CC'),
('former', 'JJ'), ('chairman', 'NN'), ('of', 'IN'), ('Consolidated', 'NNP'),
('Gold', 'NNP'), ('Fields', 'NNP'), ('PLC', 'NNP'), (',', ',')]),
Tree('VP', [('was', 'VBD'), ('named', 'VBN'), ('*-1', '-NONE-'), ('a', 'DT'),
('nonexecutive', 'JJ'), ('director', 'NN'), ('of', 'IN'), ('this', 'DT'),
('British', 'JJ'), ('industrial', 'JJ'), ('conglomerate', 'NN')]), ('.', '.')])

借助于获取树的高度,我们可以看到差异 −

We can see the difference by getting the heights of the trees −

from shallowtree import tree_shallow
from nltk.corpus import treebank
tree_shallow(treebank.parsed_sents()[2]).height()

Output

3
from nltk.corpus import treebank
treebank.parsed_sents()[2].height()

Output

9

Tree labels conversion

在剖析树中有多种 Tree 标签类型,但在区块树中不存在。但是,在使用剖析树训练区块器时,我们希望通过将一些树标签转换为更常见的标签类型来减少这种差异。例如,我们有两个替代的 NP 子树,即 NP-SBL 和 NP-TMP。我们可以将它们都转换为 NP。让我们在以下示例中了解如何执行此操作。

In parse trees there is a variety of Tree label types that are not present in chunk trees. But while using parse trees to train a chunker, we would like to reduce this variety by converting some of the Tree labels to more common label types. For example, we have two alternative NP subtree labels, namely NP-SBJ and NP-TMP. We can convert both of them into NP. Let us see how to do it in the following example.

Example

为了实现这一点,我们正在定义一个名为 tree_convert() 的函数,它接受以下两个参数 −

To achieve this, we define a function named tree_convert() that takes the following two arguments −

  1. Tree to convert

  2. A label conversion mapping

此函数将返回一棵新树,其中所有匹配的标签都根据映射中的值进行替换。

This function will return a new Tree with all matching labels replaced based on the values in the mapping.

from nltk.tree import Tree

def tree_convert(tree, mapping):
   children = []
   for t in tree:
      if isinstance(t, Tree):
         # Recursively convert the labels of subtrees
         children.append(tree_convert(t, mapping))
      else:
         children.append(t)
   # Replace this tree's label if it appears in the mapping
   label = mapping.get(tree.label(), tree.label())
   return Tree(label, children)

现在,让我们对 treebank 语料库中的第 3 个已剖析句子(嵌套短语的深度树)调用 tree_convert() 函数。我们将这些函数保存在名为 converttree.py 的文件中。

Now, let us call the tree_convert() function on the 3rd parsed sentence, which is a deep tree of nested phrases, from the treebank corpus. We saved these functions in a file named converttree.py.

from converttree import tree_convert
from nltk.corpus import treebank
mapping = {'NP-SBJ': 'NP', 'NP-TMP': 'NP'}
tree_convert(treebank.parsed_sents()[2], mapping)

Output

Tree('S', [Tree('NP-SBJ-1', [Tree('NP', [Tree('NNP', ['Rudolph']),
Tree('NNP', ['Agnew'])]), Tree(',', [',']),
Tree('UCP', [Tree('ADJP', [Tree('NP', [Tree('CD', ['55']),
Tree('NNS', ['years'])]),
Tree('JJ', ['old'])]), Tree('CC', ['and']),
Tree('NP', [Tree('NP', [Tree('JJ', ['former']),
Tree('NN', ['chairman'])]), Tree('PP', [Tree('IN', ['of']),
Tree('NP', [Tree('NNP', ['Consolidated']),
Tree('NNP', ['Gold']), Tree('NNP', ['Fields']),
Tree('NNP', ['PLC'])])])])]), Tree(',', [','])]),
Tree('VP', [Tree('VBD', ['was']),Tree('VP', [Tree('VBN', ['named']),
Tree('S', [Tree('NP', [Tree('-NONE-', ['*-1'])]),
Tree('NP-PRD', [Tree('NP', [Tree('DT', ['a']),
Tree('JJ', ['nonexecutive']), Tree('NN', ['director'])]),
Tree('PP', [Tree('IN', ['of']), Tree('NP',
[Tree('DT', ['this']), Tree('JJ', ['British']), Tree('JJ', ['industrial']),
Tree('NN', ['conglomerate'])])])])])])]), Tree('.', ['.'])])

Natural Language Toolkit - Text Classification

What is text classification?

正如名称所暗示的那样,文本分类是对文本或文档进行分类的方法。但是,这里出现了一个问题:为什么我们需要使用文本分类器?通过检查文档或文本中的单词用法,分类器将能够决定应为其分配哪个类标签。

Text classification, as the name implies, is the way to categorize pieces of text or documents. But here the question arises: why do we need to use text classifiers? By examining the word usage in a document or piece of text, a classifier can decide what class label should be assigned to it.

Binary Classifier

顾名思义,二元分类器将在两个标签之间进行选择。例如,正或负。在其中,文本或文档片段既可以是其中一个标签,也可以是另一个标签,但不能同时是这两个标签。

As the name implies, a binary classifier decides between two labels, for example positive or negative. Here the piece of text or document can carry one label or the other, but not both.

Multi-label Classifier

与二元分类器相反,多标签分类器可以向文本或文档片段分配一个或多个标签。

In contrast to a binary classifier, a multi-label classifier can assign one or more labels to a piece of text or document.

Labeled Vs Unlabeled Feature set

将特征名称映射为特征值称为特征集。标记的特征集或训练数据对于分类训练非常重要,以便它以后可以对未标记的特征集进行分类。

A key-value mapping of feature names to feature values is called a feature set. Labeled feature sets, or training data, are very important for training a classifier so that it can later classify unlabeled feature sets.

Labeled Feature Set

  1. It is a tuple that looks like (feat, label).

  2. It is an instance with a known class label.

  3. It is used for training a classification algorithm.

Unlabeled Feature Set

  1. It is the feat itself.

  2. Without an associated label, we can call it an instance.

  3. Once trained, a classification algorithm can classify an unlabeled feature set.
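As a quick, hypothetical illustration of the two forms (the feature names and label below are made up for this example) −

# A labeled feature set: a (features, label) tuple
labeled_instance = ({'great': True, 'movie': True}, 'pos')

# The same features without a label form an unlabeled feature set
unlabeled_instance = {'great': True, 'movie': True}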

Text Feature Extraction

正如名称所暗示的那样,文本特征提取是将单词列表转换成本分类器可用的特征集的过程。我们必须将文本转换为 ‘dict’ 样式的特征集,因为自然语言工具包 (NLTK) 需要 ‘dict’ 样式的特征集。

Text feature extraction, as the name implies, is the process of transforming a list of words into a feature set that is usable by a classifier. We have to transform our text into ‘dict’ style feature sets because the Natural Language Toolkit (NLTK) expects ‘dict’ style feature sets.

Bag of Words (BoW) model

BoW 是 NLP 中最简单的模型之一,用于从文本或文档片段中提取特征,以便可以在建模中使用它,这样它就可以在 ML 算法中使用。它基本上从实例的所有单词中构造单词存在特征集。此方法背后的概念是,它不关心单词出现多少次或单词的顺序,它只关心单词是否存在于单词列表中。

BoW, one of the simplest models in NLP, is used to extract features from a piece of text or document so that they can be used in modeling, for example in ML algorithms. It basically constructs a word presence feature set from all the words of an instance. The concept behind this method is that it doesn’t care about how many times a word occurs or about the order of the words; it only cares whether the word is present in the list of words or not.

Example

对于此示例,我们将定义一个名为 bow() 的函数 −

For this example, we are going to define a function named bow() −

def bow(words):
   # Map every word to True, recording only word presence
   return dict([(word, True) for word in words])

现在,让我们对单词调用 bow() 函数。我们将这些函数保存在名为 bagwords.py 的文件中。

Now, let us call the bow() function on some words. We saved this function in a file named bagwords.py.

from bagwords import bow
bow(['we', 'are', 'using', 'tutorialspoint'])

Output

{'we': True, 'are': True, 'using': True, 'tutorialspoint': True}

Training classifiers

在前面的章节中,我们学习了如何从文本中提取特征。所以现在我们可以训练一个分类器。第一个也是最简单的分类器是 NaiveBayesClassifier 类。

In previous sections, we learned how to extract features from text. So now we can train a classifier. The first and easiest classifier to work with is the NaiveBayesClassifier class.

Naïve Bayes Classifier

为了预测给定的特征集合属于特定标签的概率,它使用贝叶斯定理。贝叶斯定理的公式如下。

To predict the probability that a given feature set belongs to a particular label, it uses Bayes theorem. The formula of Bayes theorem is as follows.
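
P(A|B) = (P(B|A) * P(A)) / P(B)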

在此,

Here,

P(A|B) − 它也被称为后验概率,即在已知第二个事件(即 B)发生的情况下,第一个事件(即 A)发生的概率。

P(A|B) − It is also called the posterior probability, i.e. the probability of the first event A occurring given that the second event B has occurred.

P(B|A) − 它是在第一个事件(即 A)发生后,第二个事件(即 B)发生的概率。

P(B|A) − It is the probability of the second event B occurring given that the first event A has occurred.

P(A), P(B) − 它也被称为先验概率,即第一个事件(即 A)或第二个事件(即 B)发生的概率。

P(A), P(B) − These are also called prior probabilities, i.e. the probabilities of the first event A or the second event B occurring on their own.

为了训练朴素贝叶斯分类器,我们将使用 NLTK 中的 movie_reviews 语料库。此语料库有两个类别的文本,即: posneg 。这些类别使训练它们的分类器成为二元分类器。语料库中的每个文件都由两部分组成,一部分是正面的电影评论,另一部分是负面的电影评论。在我们的例子中,我们将每个文件用作训练和测试分类器的单个实例。

To train the Naïve Bayes classifier, we will be using the movie_reviews corpus from NLTK. This corpus has two categories of text, namely pos and neg. These categories make a classifier trained on them a binary classifier. The corpus is composed of two sets of files: one contains positive movie reviews and the other contains negative movie reviews. In our example, we are going to use each file as a single instance for both training and testing the classifier.

Example

对于训练分类器,我们需要一个标记的特征集合列表,其形式为 [( featureset, label )]. 这里 featureset 变量是一个 dict ,标签是 featureset 的已知类标签。我们准备创建一个名为 label_corpus() 的函数,它将接受一个名为 movie_reviews*and also a function named *feature_detector 的语料库,该语料库默认为 bag of words 。它将构建并返回一个这种形式的映射:{label: [featureset]}。之后,我们将使用此映射来创建一个标记的训练实例和测试实例列表。

For training the classifier, we need a list of labeled feature sets, which will be in the form [(featureset, label)]. Here the featureset variable is a dict and label is the known class label for that featureset. We are going to create a function named label_corpus() which will take a corpus (here movie_reviews) and also a feature_detector function, which defaults to bow (bag of words). It will construct and return a mapping of the form {label: [featureset]}. After that we will use this mapping to create a list of labeled training instances and testing instances.

import collections

def label_corpus(corp, feature_detector=bow):
   label_feats = collections.defaultdict(list)
   for label in corp.categories():
      for fileid in corp.fileids(categories=[label]):
         feats = feature_detector(corp.words(fileids=[fileid]))
         label_feats[label].append(feats)
   return label_feats

借助上述函数,我们将得到一个映射 {label:fetaureset} 。现在,我们准备定义另一个名为 split 的函数,它将接受 label_corpus() 函数返回的映射,并将每个特征集合的列表分割为标记的训练实例和测试实例。

With the help of the above function we will get a mapping {label: [featureset]}. Now we are going to define one more function named split() that will take the mapping returned from the label_corpus() function and split each list of feature sets into labeled training as well as testing instances.

def split(lfeats, split=0.75):
   train_feats = []
   test_feats = []
   for label, feats in lfeats.items():
      cutoff = int(len(feats) * split)
      train_feats.extend([(feat, label) for feat in feats[:cutoff]])
      test_feats.extend([(feat, label) for feat in feats[cutoff:]])
   return train_feats, test_feats

现在,让我们在我们的语料库中使用这些函数,即 movie_reviews −

Now, let us use these functions on our corpus, i.e. movie_reviews, assuming we have saved them in a file named featx.py −

from nltk.corpus import movie_reviews
from featx import label_corpus, split
movie_reviews.categories()

Output

['neg', 'pos']

Example

lfeats = label_corpus(movie_reviews)
lfeats.keys()

Output

dict_keys(['neg', 'pos'])

Example

train_feats, test_feats = split(lfeats, split = 0.75)
len(train_feats)

Output

1500

Example

len(test_feats)

Output

500

我们已经看到,在 movie_reviews 语料库中有 1000 个 pos文件和 1000 个 neg 文件。我们最终得到了 1500 个标记的训练实例和 500 个标记的测试实例。

We have seen that in the movie_reviews corpus, there are 1000 pos files and 1000 neg files. We thus end up with 1500 labeled training instances and 500 labeled testing instances.
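
As an optional sanity check, the file counts mentioned above can be read directly from the corpus reader −

from nltk.corpus import movie_reviews
len(movie_reviews.fileids(categories = ['pos']))   # 1000, as stated above
len(movie_reviews.fileids(categories = ['neg']))   # 1000, as stated above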

现在,让我们使用其 train() 类方法训练 NaïveBayesClassifier

Now let us train NaïveBayesClassifier using its train() class method −

from nltk.classify import NaiveBayesClassifier
NBC = NaiveBayesClassifier.train(train_feats)
NBC.labels()

Output

['neg', 'pos']
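
Once trained, the classifier can also label a new, unlabeled feature set. As a small, hypothetical illustration (the words below are made up and the predicted label depends on the trained model), we can reuse the bow() function defined earlier −

from bagwords import bow
NBC.classify(bow(['great', 'movie']))   # returns 'pos' or 'neg', depending on the model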

Decision Tree Classifier

另一个重要的分类器是决策树分类器。这里为了训练 DecisionTreeClassifier 类将创建一个树结构。在此树结构中,每个节点对应一个特征名称,分支对应特征值。沿着分支向下,我们将到达树叶,即分类标签。

Another important classifier is the decision tree classifier. To train it, the DecisionTreeClassifier class will create a tree structure. In this tree structure each node corresponds to a feature name and the branches correspond to feature values. Following the branches down, we reach the leaves of the tree, i.e. the classification labels.

为了训练决策树分类器,我们将使用我们从 movie_reviews 语料库中创建的相同的训练和测试特征,即 train_featstest_feats 变量。

To train the decision tree classifier, we will use the same training and testing features, i.e. the train_feats and test_feats variables we created from the movie_reviews corpus.

Example

为了训练此分类器,我们将调用 DecisionTreeClassifier.train() 类方法,如下所示:

To train this classifier, we will call DecisionTreeClassifier.train() class method as follows −

from nltk.classify import DecisionTreeClassifier
from nltk.classify.util import accuracy
decisiont_classifier = DecisionTreeClassifier.train(
   train_feats, binary = True, entropy_cutoff = 0.8,
   depth_cutoff = 5, support_cutoff = 30
)
accuracy(decisiont_classifier, test_feats)

Output

0.725
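
If you want to see the node/branch structure described above, the trained classifier can print an approximate pseudocode view of its decision tree (the exact output depends on the training data, so it is shown here only as a suggestion) −

# Inspect the learned tree: nodes are feature names, leaves are labels
print(decisiont_classifier.pseudocode(depth = 2))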

Maximum Entropy Classifier

另一个重要的分类器是 MaxentClassifier ,它也被称为 conditional exponential classifierlogistic regression classifier 。在此处为了训练它, MaxentClassifier 类将使用编码将标记的特征集合转换为矢量。

Another important classifier is MaxentClassifier, which is also known as a conditional exponential classifier or logistic regression classifier. To train it, the MaxentClassifier class will convert labeled feature sets to vectors using an encoding.

为了训练最大熵分类器,我们将使用我们从 movie_reviews 语料库中创建的相同的训练和测试特征,即 train_feats 和 test_feats 变量。

To train the Maxent classifier, we will use the same training and testing features, i.e. the train_feats and test_feats variables we created from the movie_reviews corpus.

Example

为了训练此分类器,我们将调用 MaxentClassifier.train() 类方法,如下所示:

To train this classifier, we will call MaxentClassifier.train() class method as follows −

from nltk.classify import MaxentClassifier
maxent_classifier = MaxentClassifier.train(
   train_feats, algorithm = 'gis', trace = 0, max_iter = 10, min_lldelta = 0.5
)
accuracy(maxent_classifier, test_feats)

Output

0.786

Scikit-learn Classifier

最好的机器学习(ML)库之一是 Scikit-Learn。它实际上包含各种用途的各种 ML 算法,但它们都具有以下相同的拟合设计模式 −

One of the best machine learning (ML) libraries is Scikit-learn. It actually contains all sorts of ML algorithms for various purposes, but they all have the same fit design pattern as follows −

  1. Fitting the model to the data

  2. And use that model to make predictions
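
As a minimal sketch of this fit/predict pattern (using a tiny, made-up data set and scikit-learn's MultinomialNB purely for illustration), the two steps look like this −

from sklearn.naive_bayes import MultinomialNB

X = [[1, 0], [0, 1], [1, 1]]   # toy feature vectors, made up for illustration
y = ['pos', 'neg', 'pos']      # toy labels
model = MultinomialNB()
model.fit(X, y)                # step 1: fit the model to the data
model.predict([[1, 0]])        # step 2: use that model to make predictions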

与直接访问 scikit-learn 模型不同,我将使用 NLTK 的 SklearnClassifier 类。此类是 scikit-learn 模型的封装类,用于使其符合 NLTK 的 Classifier 接口。

Rather than accessing scikit-learn models directly, here we are going to use NLTK’s SklearnClassifier class. This class is a wrapper class around a scikit-learn model to make it conform to NLTK’s Classifier interface.

我们将按照以下步骤训练 SklearnClassifier 类−

We will follow following steps to train a SklearnClassifier class −

Step 1 − 我们先创建训练功能,就像我们在之前的食谱中所做的那样。

Step 1 − First we will create training features as we did in previous recipes.

Step 2 − 现在,选择并导入 Scikit-learn 算法。

Step 2 − Now, choose and import a Scikit-learn algorithm.

Step 3 − 接下来,我们需要使用所选算法构造 SklearnClassifier 类。

Step 3 − Next, we need to construct a SklearnClassifier class with the chosen algorithm.

Step 4 − 最后,我们将用训练功能训练 SklearnClassifier 类。

Step 4 − Last, we will train SklearnClassifier class with our training features.

让我们在下面的 Python 食谱中实现这些步骤−

Let us implement these steps in the below Python recipe −

from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB
sklearn_classifier = SklearnClassifier(MultinomialNB())
sklearn_classifier.train(train_feats)
<SklearnClassifier(MultinomialNB(alpha = 1.0,class_prior = None,fit_prior = True))>
accuracy(sklearn_classifier, test_feats)

Output

0.885
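
Hypothetically, the same wrapper pattern works for any other scikit-learn estimator; for example, a logistic regression model could be plugged in the same way (the resulting accuracy will differ) −

from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.linear_model import LogisticRegression
logreg_classifier = SklearnClassifier(LogisticRegression()).train(train_feats)
accuracy(logreg_classifier, test_feats)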

Measuring precision and recall

在训练各种分类器时,我们还测量了它们的准确性。但除了准确性之外,还有许多其他指标可用于评估分类器。其中两个指标是 precisionrecall

While training the various classifiers we have also measured their accuracy. But apart from accuracy there are a number of other metrics which are used to evaluate classifiers. Two of these metrics are precision and recall.
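
As a tiny, made-up illustration of what these two metrics measure, NLTK computes them from a reference set (the items that truly carry a label) and a test set (the items the classifier assigned to that label) −

from nltk import metrics

reference = {1, 2, 3, 4}             # items that truly carry the label (hypothetical indices)
test = {2, 3, 5}                     # items the classifier assigned to the label
metrics.precision(reference, test)   # 2/3: fraction of predicted items that are correct
metrics.recall(reference, test)      # 2/4: fraction of true items that were found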

Example

在此示例中,我们将计算我们之前训练的 NaiveBayesClassifier 类的精度和召回率。为此,我们将创建一个名为 metrics_PR() 的函数,它将使用两个参数,一个是训练后的分类器,另一个是带标签的测试功能。这两个参数与我们在计算分类器准确性时传递的参数相同−

In this example, we are going to calculate the precision and recall of the NaiveBayesClassifier we trained earlier. To achieve this we will create a function named metrics_PR() which takes two arguments: one is the trained classifier and the other is the labeled test features. Both arguments are the same as those we passed while calculating the accuracy of the classifiers −

import collections
from nltk import metrics

def metrics_PR(classifier, testfeats):
   refsets = collections.defaultdict(set)
   testsets = collections.defaultdict(set)
   # Build reference sets (true labels) and test sets (predicted labels)
   for i, (feats, label) in enumerate(testfeats):
      refsets[label].add(i)
      observed = classifier.classify(feats)
      testsets[observed].add(i)
   precisions = {}
   recalls = {}
   # Compute precision and recall for every label
   for label in classifier.labels():
      precisions[label] = metrics.precision(refsets[label], testsets[label])
      recalls[label] = metrics.recall(refsets[label], testsets[label])
   return precisions, recalls

让我们调用此函数以查找精度和召回率−

Let us call this function to find the precision and recall −

from metrics_classification import metrics_PR
nb_precisions, nb_recalls = metrics_PR(NBC, test_feats)
nb_precisions['pos']

Output

0.6713532466435213

Example

nb_precisions['neg']

Output

0.9676271186440678

Example

nb_recalls['pos']

Output

0.96

Example

nb_recalls['neg']

Output

0.478

Combination of classifier and voting

组合分类器是提高分类性能的最佳方法之一。投票是合并多个分类器的最佳方式之一。对于投票,我们需要奇数个分类器。在以下 Python 食谱中,我们将结合三个分类器,即 NaiveBayesClassifier 类、DecisionTreeClassifier 类和 MaxentClassifier 类。

Combining classifiers is one of the best ways to improve classification performance, and voting is one of the best ways to combine multiple classifiers. For voting we need an odd number of classifiers so that ties between two labels cannot occur. In the following Python recipe we are going to combine three classifiers, namely the NaiveBayesClassifier, DecisionTreeClassifier and MaxentClassifier classes.

为了实现这一点,我们将定义一个名为 voting_classifiers() 的函数,如下所示。

To achieve this we are going to define a class named Voting_classifiers as follows.

import itertools
from nltk.classify import ClassifierI
from nltk.probability import FreqDist

class Voting_classifiers(ClassifierI):
   def __init__(self, *classifiers):
      self._classifiers = classifiers
      # Union of all labels known to the wrapped classifiers
      self._labels = sorted(set(itertools.chain(*[c.labels() for c in classifiers])))
   def labels(self):
      return self._labels
   def classify(self, feats):
      # Each classifier votes; the label with the most votes wins
      counts = FreqDist()
      for classifier in self._classifiers:
         counts[classifier.classify(feats)] += 1
      return counts.max()

让我们调用此函数以组合三个分类器并找到准确性−

Let us call this function to combine three classifiers and find the accuracy −

from vote_classification import Voting_classifiers
combined_classifier = Voting_classifiers(NBC, decisiont_classifier, maxent_classifier)
combined_classifier.labels()

Output

['neg', 'pos']

Example

accuracy(combined_classifier, test_feats)

Output

0.948

从上面的输出中,我们可以看到合并的分类器的准确性高于各个分类器。

From the above output, we can see that the combined classifier achieved higher accuracy than the individual classifiers.