Natural Language Processing Tutorial
Natural Language Processing - Python
In this chapter, we will learn about language processing using Python.
The following features make Python different from other languages −
- Python is interpreted − We do not need to compile our Python program before executing it because the interpreter processes Python at runtime.
- Interactive − We can directly interact with the interpreter to write our Python programs.
- Object-oriented − Python is object-oriented in nature, which makes programs easier to write because this programming technique encapsulates code within objects.
- Easy for beginners to learn − Python is also called a beginner's language because it is very easy to understand, and it supports the development of a wide range of applications.
Prerequisites
The latest release of Python 3 is Python 3.7.1, which is available for Windows, Mac OS, and most flavors of Linux.
- For Windows, we can go to www.python.org/downloads/windows/ to download and install Python.
- For Mac OS, we can use www.python.org/downloads/mac-osx/.
- In the case of Linux, different flavors of Linux use different package managers for the installation of new packages. For example, to install Python 3 on Ubuntu Linux, we can use the following command from the terminal −
$ sudo apt-get install python3-minimal
To study more about Python programming, read a Python 3 basic tutorial.
Getting Started with NLTK
We will be using the Python library NLTK (Natural Language Toolkit) for doing text analysis in the English language. The Natural Language Toolkit (NLTK) is a collection of Python libraries designed especially for identifying and tagging parts of speech found in natural-language text such as English.
Installing NLTK
Before starting to use NLTK, we need to install it. With the help of the following command, we can install it in our Python environment −
pip install nltk
If we are using Anaconda, then a Conda package for NLTK can be installed by using the following command −
conda install -c anaconda nltk
Downloading NLTK’s Data
After installing NLTK, another important task is to download its preset text repositories so that it can be easily used. However, before that we need to import NLTK the way we import any other Python module. The following command will help us in importing NLTK −
import nltk
Now, download NLTK data with the help of the following command −
nltk.download()
It will take some time to download all the available packages of NLTK.
Other Necessary Packages
Some other Python packages, such as gensim and pattern, are also necessary for text analysis as well as for building natural language processing applications with NLTK. The packages can be installed as shown below −
pip install gensim
pip install pattern
Tokenization
Tokenization may be defined as the process of breaking the given text into smaller units called tokens. Words, numbers, or punctuation marks can be tokens. It may also be called word segmentation.
Example
Input − Bed and chair are types of furniture.
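As a rough illustration of what a tokenizer does, the example input above can be split into word and punctuation tokens with a single regular expression. The `simple_tokenize` helper below is a hypothetical sketch, not part of NLTK; NLTK's tokenizers are considerably more sophisticated.

```python
import re

# A minimal tokenizer sketch: a token is either a run of word characters
# (letters/digits/underscore) or a single non-space punctuation mark.
def simple_tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Bed and chair are types of furniture."))
# ['Bed', 'and', 'chair', 'are', 'types', 'of', 'furniture', '.']
```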
NLTK provides different packages for tokenization. We can use these packages based on our requirements. The packages and the details of their installation are as follows −
sent_tokenize package
This package can be used to divide the input text into sentences. We can import it by using the following command −
from nltk.tokenize import sent_tokenize
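To see the idea behind sentence tokenization, here is a naive sketch that splits after sentence-final punctuation followed by whitespace and a capital letter. `naive_sent_tokenize` is a hypothetical helper; `sent_tokenize` handles abbreviations, quotes, and many other cases far more robustly.

```python
import re

# Naive sentence splitter: break after '.', '!' or '?' when followed by
# whitespace and a capital letter. A sketch only, not NLTK's algorithm.
def naive_sent_tokenize(text):
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

print(naive_sent_tokenize("I love NLP. It is fun!"))
# ['I love NLP.', 'It is fun!']
```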
Stemming
Due to grammatical reasons, language includes lots of variation: English, like other languages, has different forms of a word, for example democracy, democratic, and democratization. For machine learning projects, it is very important for machines to understand that such different words have the same base form. That is why it is very useful to extract the base forms of words while analyzing text.
Stemming is a heuristic process that helps in extracting the base forms of words by chopping off their ends.
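The idea of suffix chopping can be sketched with a toy stemmer. The suffix list below is hypothetical and far smaller than the rule set of a real stemmer such as Porter's; note how crude chopping turns "writing" into "writ" rather than "write", which is exactly why real stemmers add many extra rules.

```python
# Toy suffix-stripping stemmer: try each suffix in order and chop it off,
# keeping at least three characters of stem. Hypothetical rules only.
SUFFIXES = ["ization", "ational", "ing", "ed", "es", "s"]

def toy_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(toy_stem("writing"))  # writ
print(toy_stem("jumped"))   # jump
print(toy_stem("chairs"))   # chair
```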
The different packages for stemming provided by NLTK module are as follows −
PorterStemmer package
Porter’s algorithm is used by this stemming package to extract the base form of the words. With the help of the following command, we can import this package −
from nltk.stem.porter import PorterStemmer
例如, ‘writing’ 输入到这个词干提取器后, ‘write’ 将是输出。
For example, ‘write’ would be the output of the word ‘writing’ given as the input to this stemmer.
LancasterStemmer package
Lancaster's algorithm is used by this stemming package to extract the base form of the words. With the help of the following command, we can import this package −
from nltk.stem.lancaster import LancasterStemmer
For example, ‘writ’ would be the output of the word ‘writing’ given as the input to this stemmer.
SnowballStemmer package
The Snowball algorithm is used by this stemming package to extract the base form of the words. With the help of the following command, we can import this package −
from nltk.stem.snowball import SnowballStemmer
For example, ‘write’ would be the output of the word ‘writing’ given as the input to this stemmer.
Lemmatization
Lemmatization is another way to extract the base form of words, normally aiming to remove inflectional endings by using vocabulary and morphological analysis. After lemmatization, the base form of a word is called its lemma.
The NLTK module provides the following package for lemmatization −
WordNetLemmatizer package
This package extracts the base form of a word depending on whether it is used as a noun or as a verb. We can import it by using the following command −
from nltk.stem import WordNetLemmatizer
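Conceptually, lemmatization can be pictured as a vocabulary lookup rather than suffix chopping. The tiny table below is a hypothetical sketch, not how NLTK's lemmatizer works internally, but it shows why lemmatization can map irregular forms that no suffix rule would catch.

```python
# Hypothetical lemma lookup table: maps inflected forms to their lemmas,
# including irregular forms that suffix chopping cannot handle.
LEMMA_TABLE = {
    "was": "be",
    "feet": "foot",
    "better": "good",
    "writing": "write",
}

def toy_lemmatize(word):
    # Fall back to the word itself when it is not in the vocabulary.
    return LEMMA_TABLE.get(word, word)

print(toy_lemmatize("feet"))     # foot
print(toy_lemmatize("writing"))  # write
```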
Counting POS Tags–Chunking
Chunking makes it possible to identify parts of speech (POS) and short phrases. It is one of the important processes in natural language processing. Just as tokenization creates tokens, chunking labels those tokens. In other words, chunking gives us the structure of the sentence.
Example
In the following example, we will implement noun-phrase chunking, a category of chunking that finds the noun-phrase chunks in a sentence, by using the NLTK Python module.
Consider the following steps to implement noun-phrase chunking −
Step 1: Chunk grammar definition
In this step, we need to define the grammar for chunking. It would consist of the rules, which we need to follow.
Step 2: Chunk parser creation
Next, we need to create a chunk parser. It would parse the grammar and give the output.
Step 3: The Output
In this step, we will get the output in a tree format.
Running the NLP Script
Start by importing the NLTK package −
import nltk
Now, we need to define the sentence.
Here,
- DT is the determiner
- VBP is the verb
- JJ is the adjective
- IN is the preposition
- NN is the noun
sentence = [("a", "DT"),("clever","JJ"),("fox","NN"),("was","VBP"),
("jumping","VBP"),("over","IN"),("the","DT"),("wall","NN")]
Next, the grammar should be given in the form of a regular expression.
grammar = "NP:{<DT>?<JJ>*<NN>}"
Now, we need to define a parser for parsing the grammar.
parser_chunking = nltk.RegexpParser(grammar)
Now, the parser will parse the sentence as follows −
parser_chunking.parse(sentence)
Next, the output will be stored in a variable as follows −
output = parser_chunking.parse(sentence)
Now, the following code will help you draw your output in the form of a tree.
output.draw()
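For intuition, the noun-phrase grammar used above can also be emulated without NLTK by pattern-matching DT? JJ* NN over the tag sequence. The `np_chunks` helper below is a hypothetical sketch of what the grammar "NP:{&lt;DT&gt;?&lt;JJ&gt;*&lt;NN&gt;}" expresses, not NLTK's RegexpParser implementation.

```python
import re

# Tagged sentence from the example above.
sentence = [("a", "DT"), ("clever", "JJ"), ("fox", "NN"), ("was", "VBP"),
            ("jumping", "VBP"), ("over", "IN"), ("the", "DT"), ("wall", "NN")]

def np_chunks(tagged):
    # Join the tags into one space-separated string, then match the
    # pattern DT? JJ* NN over it, mimicking "NP:{<DT>?<JJ>*<NN>}".
    tags = " ".join(tag for _, tag in tagged) + " "
    chunks = []
    for m in re.finditer(r"((?:DT )?(?:JJ )*NN )", tags):
        # Map character offsets back to token indices by counting spaces.
        start = tags[:m.start()].count(" ")
        end = start + m.group(1).count(" ")
        chunks.append([word for word, _ in tagged[start:end]])
    return chunks

print(np_chunks(sentence))
# [['a', 'clever', 'fox'], ['the', 'wall']]
```

These are the same two noun phrases the NLTK chunk parser draws as NP subtrees in the output tree.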