Python Data Science 简明教程

Python - Stemming and Lemmatization

在自然语言处理领域中,我们会遇到两种或两种以上单词具有相同词根的情况。例如,三个单词“同意”(agreed), “同意中”(agreeing)和“令人愉快的”(agreeable)具有相同的词根“agree”。涉及任何这些单词的搜索都应将它们视为具有相同词根的同一个单词。因此,将所有单词链接到它们的词根变得至关重要。 NLTK 库具有执行此链接并给出显示词根的输出的方法。

In the areas of Natural Language Processing we come across situation where two or more words have a common root. For example, the three words - agreed, agreeing and agreeable have the same root word agree. A search involving any of these words should treat them as the same word which is the root word. So it becomes essential to link all the words into their root word. The NLTK library has methods to do this linking and give the output showing the root word.

下面的程序使用 Porter Stemming Algorithm 来进行词干提取。

The below program uses the Porter Stemming Algorithm for stemming.

import nltk
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()

word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"
# First Word tokenization
nltk_tokens = nltk.word_tokenize(word_data)
#Next find the roots of the word
for w in nltk_tokens:
       print "Actual: %s  Stem: %s"  % (w,porter_stemmer.stem(w))

当我们执行上面的代码时,它会产生以下结果。

When we execute the above code, it produces the following result.

Actual: It  Stem: It
Actual: originated  Stem: origin
Actual: from  Stem: from
Actual: the  Stem: the
Actual: idea  Stem: idea
Actual: that  Stem: that
Actual: there  Stem: there
Actual: are  Stem: are
Actual: readers  Stem: reader
Actual: who  Stem: who
Actual: prefer  Stem: prefer
Actual: learning  Stem: learn
Actual: new  Stem: new
Actual: skills  Stem: skill
Actual: from  Stem: from
Actual: the  Stem: the
Actual: comforts  Stem: comfort
Actual: of  Stem: of
Actual: their  Stem: their
Actual: drawing  Stem: draw
Actual: rooms  Stem: room

词形还原与词干提取类似,但它为单词带来了上下文的概念。因此,它通过将具有相似含义的单词链接到一个单词,从而迈出了进一步的步骤。例如,如果一个段落具有像汽车(cars)、火车(trains)和汽车(automobile)这样的词,那么它将所有这些词链接到汽车(automobile)。在下面的程序中,我们使用 WordNet 词汇数据库来进行词形还原。

Lemmatization is similar ti stemming but it brings context to the words.So it goes a steps further by linking words with similar meaning to one word. For example if a paragraph has words like cars, trains and automobile, then it will link all of them to automobile. In the below program we use the WordNet lexical database for lemmatization.

import nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"
nltk_tokens = nltk.word_tokenize(word_data)
for w in nltk_tokens:
       print "Actual: %s  Lemma: %s"  % (w,wordnet_lemmatizer.lemmatize(w))

当我们执行上面的代码时,它会产生以下结果。

When we execute the above code, it produces the following result.

Actual: It  Lemma: It
Actual: originated  Lemma: originated
Actual: from  Lemma: from
Actual: the  Lemma: the
Actual: idea  Lemma: idea
Actual: that  Lemma: that
Actual: there  Lemma: there
Actual: are  Lemma: are
Actual: readers  Lemma: reader
Actual: who  Lemma: who
Actual: prefer  Lemma: prefer
Actual: learning  Lemma: learning
Actual: new  Lemma: new
Actual: skills  Lemma: skill
Actual: from  Lemma: from
Actual: the  Lemma: the
Actual: comforts  Lemma: comfort
Actual: of  Lemma: of
Actual: their  Lemma: their
Actual: drawing  Lemma: drawing
Actual: rooms  Lemma: room