Natural Language Toolkit Tutorial

Natural Language Toolkit - Tokenizing Text

What is Tokenizing?

Tokenizing may be defined as the process of breaking up a piece of text into smaller parts, such as sentences and words. These smaller parts are called tokens. For example, a word is a token in a sentence, and a sentence is a token in a paragraph.

As we know, NLP is used to build applications such as sentiment analysis, QA systems, language translation, smart chatbots, and voice systems; hence, in order to build them, it becomes vital to understand the patterns in the text. The tokens mentioned above are very useful in finding and understanding these patterns. We can consider tokenization as the base step for other recipes such as stemming and lemmatization.
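
As a rough illustration (not part of the original example), the following sketch tokenizes a sentence and then stems each token with NLTK's PorterStemmer; it assumes the Punkt tokenizer models are already installed (see the download note in the next section), and the sentence is only an illustrative input.

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Tokenize first, then stem each resulting token.
tokens = word_tokenize("The runners were running quickly.")
print([stemmer.stem(token) for token in tokens])
# ['the', 'runner', 'were', 'run', 'quickli', '.']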

NLTK package

nltk.tokenize is the package provided by the NLTK module to achieve the process of tokenization.
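
Note that the word and sentence tokenizers shown in this chapter rely on NLTK's pretrained Punkt models. If they are not already present, a one-time download along these lines is usually needed (the exact resource name can vary slightly between NLTK versions):

import nltk

# Download the pretrained Punkt models used by word_tokenize and sent_tokenize.
nltk.download('punkt')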

Tokenizing sentences into words

Splitting a sentence into words, or creating a list of words from a string, is an essential part of every text processing activity. Let us understand it with the help of the various functions/modules provided by the nltk.tokenize package.

word_tokenize module

The word_tokenize module is used for basic word tokenization. The following example uses this module to split a sentence into words.

Example

import nltk
from nltk.tokenize import word_tokenize
word_tokenize('Tutorialspoint.com provides high quality technical tutorials for free.')

Output

['Tutorialspoint.com', 'provides', 'high', 'quality', 'technical', 'tutorials', 'for', 'free', '.']

TreebankWordTokenizer Class

The word_tokenize module used above is basically a wrapper function that calls the tokenize() function on an instance of the TreebankWordTokenizer class. It gives the same output as we get when using the word_tokenize() module to split sentences into words. Let us see the same example implemented again −

Example

First, we need to import the Natural Language Toolkit (nltk).

import nltk

Now, import the TreebankWordTokenizer class to implement the word tokenizer algorithm −

from nltk.tokenize import TreebankWordTokenizer

Next, create an instance of the TreebankWordTokenizer class as follows −

Tokenizer_wrd = TreebankWordTokenizer()

Now, input the sentence you want to convert to tokens −

Tokenizer_wrd.tokenize(
   'Tutorialspoint.com provides high quality technical tutorials for free.'
)

Output

[
   'Tutorialspoint.com', 'provides', 'high', 'quality',
   'technical', 'tutorials', 'for', 'free', '.'
]

Complete implementation example

Let us see the complete implementation example below.

import nltk
from nltk.tokenize import TreebankWordTokenizer
tokenizer_wrd = TreebankWordTokenizer()
tokenizer_wrd.tokenize('Tutorialspoint.com provides high quality technical tutorials for free.')

Output

[
   'Tutorialspoint.com', 'provides', 'high', 'quality',
   'technical', 'tutorials', 'for', 'free', '.'
]

One of the most significant conventions of a word tokenizer is to separate contractions. For example, if we use the word_tokenize() module for this purpose, it will give the output as follows −

Example

import nltk
from nltk.tokenize import word_tokenize
word_tokenize("won't")

Output

['wo', "n't"]

Such a convention of TreebankWordTokenizer may be unacceptable. That is why we have two alternative word tokenizers, namely PunktWordTokenizer and WordPunctTokenizer.

WordPunctTokenizer Class

WordPunctTokenizer is an alternative word tokenizer that splits all punctuation into separate tokens. Let us understand it with the following simple example −

Example

from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
tokenizer.tokenize(" I can't allow you to go home early")

Output

['I', 'can', "'", 't', 'allow', 'you', 'to', 'go', 'home', 'early']

Tokenizing text into sentences

In this section, we are going to split text/paragraphs into sentences. NLTK provides the sent_tokenize module for this purpose.

Why is it needed?

An obvious question that comes to mind is: if we have a word tokenizer, why do we need a sentence tokenizer, i.e., why do we need to tokenize text into sentences? Suppose we need to count the average number of words per sentence; how can we do this? To accomplish this task, we need both sentence tokenization and word tokenization.
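
For instance, here is a minimal sketch of computing the average number of words per sentence by combining sent_tokenize and word_tokenize (the paragraph below is only an illustrative input):

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

text = "NLTK is a leading platform. It is easy to use. It supports many tokenizers."

# Split into sentences first, then count the word tokens in each sentence.
sentences = sent_tokenize(text)
word_counts = [len(word_tokenize(sentence)) for sentence in sentences]
print(sum(word_counts) / len(sentences))   # punctuation marks are counted as tokens here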

Let us understand the difference between the sentence and word tokenizers with the help of the following simple example −

Example

import nltk
from nltk.tokenize import sent_tokenize
text = "Let us understand the difference between sentence & word tokenizer. It is going to be a simple example."
sent_tokenize(text)

Output

[
   'Let us understand the difference between sentence & word tokenizer.',
   'It is going to be a simple example.'
]

Sentence tokenization using regular expressions

If you feel that the output of the word tokenizer is unacceptable and you want complete control over how the text is tokenized, you can use regular expressions while doing tokenization. NLTK provides the RegexpTokenizer class to achieve this.

Let us understand the concept with the help of the two examples below.

In the first example, we will use a regular expression to match alphanumeric tokens plus single quotes, so that contractions like "won't" are not split.

Example 1

import nltk
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer("[\w']+")
tokenizer.tokenize("won't is a contraction.")
tokenizer.tokenize("can't is a contraction.")

Output

["won't", 'is', 'a', 'contraction']
["can't", 'is', 'a', 'contraction']

In the second example, we will use a regular expression to tokenize on whitespace.

Example 2

import nltk
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\s+' , gaps = True)
tokenizer.tokenize("won't is a contraction.")

Output

["won't", 'is', 'a', 'contraction']

From the above output, we can see that the punctuation remains in the tokens. The parameter gaps = True means the pattern is going to identify the gaps to tokenize on. On the other hand, if we use the gaps = False parameter, then the pattern is used to identify the tokens themselves, as can be seen in the following example −

import nltk
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\s+' , gaps = False)
tokenizer.tokenize("won't is a contraction.")

Output

[' ', ' ', ' ']

It gives us only the whitespace matches, not the words, because with gaps = False the pattern identifies the tokens themselves rather than the gaps between them.
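
To obtain actual word tokens with gaps = False, the pattern itself has to describe the tokens rather than the separators. A minimal sketch along these lines (the pattern and the sentence are only illustrative):

import nltk
from nltk.tokenize import RegexpTokenizer

# With gaps = False the pattern matches the tokens themselves,
# so a word-like pattern returns the words.
tokenizer = RegexpTokenizer(r"[\w']+", gaps = False)
print(tokenizer.tokenize("won't is a contraction."))
# ["won't", 'is', 'a', 'contraction']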