Python Data Science: A Concise Tutorial
Python - Stemming and Lemmatization
In Natural Language Processing we often come across situations where two or more words share a common root. For example, the three words agreed, agreeing and agreeable have the same root word, agree. A search involving any of these words should treat them all as the root word, so it becomes essential to link each word to its root. The NLTK library has methods that perform this linking and output the root word.
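To make the idea of linking words to a shared root concrete, here is a minimal sketch in plain Python. It is not NLTK's algorithm: `toy_stem`, `group_by_stem` and the `SUFFIXES` list are hypothetical names invented for this illustration, and the rules are far cruder than a real stemmer's.

```python
from collections import defaultdict

# Illustrative suffix list (an assumption); real stemmers use many more rules.
SUFFIXES = ("able", "ing", "ed", "s")

def toy_stem(word):
    """Crude stemmer: strip one known suffix, then any trailing 'e'."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    return word.rstrip("e")

def group_by_stem(words):
    """Map each stem to the list of words that reduce to it."""
    groups = defaultdict(list)
    for w in words:
        groups[toy_stem(w)].append(w)
    return dict(groups)

groups = group_by_stem(["agree", "agreed", "agreeing", "agreeable", "room", "rooms"])
print(groups)
# {'agr': ['agree', 'agreed', 'agreeing', 'agreeable'], 'room': ['room', 'rooms']}
```

All four forms of agree land under the same key, so a search for any one of them can retrieve the others. The key itself (agr) is not a dictionary word, which is also common with real stemmers.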
The program below uses the Porter stemming algorithm to reduce each word to its stem.
import nltk
from nltk.stem.porter import PorterStemmer

# Note: word_tokenize needs NLTK's tokenizer data to be installed (see nltk.download).
porter_stemmer = PorterStemmer()
word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"

# First, split the sentence into word tokens
nltk_tokens = nltk.word_tokenize(word_data)

# Next, find the stem of each token
for w in nltk_tokens:
    print("Actual: %s Stem: %s" % (w, porter_stemmer.stem(w)))
When we execute the above code, it produces the following result. Recent NLTK releases lowercase the stem by default, so the first line may show "it" rather than "It".
Actual: It Stem: It
Actual: originated Stem: origin
Actual: from Stem: from
Actual: the Stem: the
Actual: idea Stem: idea
Actual: that Stem: that
Actual: there Stem: there
Actual: are Stem: are
Actual: readers Stem: reader
Actual: who Stem: who
Actual: prefer Stem: prefer
Actual: learning Stem: learn
Actual: new Stem: new
Actual: skills Stem: skill
Actual: from Stem: from
Actual: the Stem: the
Actual: comforts Stem: comfort
Actual: of Stem: of
Actual: their Stem: their
Actual: drawing Stem: draw
Actual: rooms Stem: room
Lemmatization is similar to stemming, but it brings context to the words. It goes a step further by using a vocabulary and the word's part of speech to return a valid dictionary form (the lemma) instead of a truncated stem. For example, a lemmatizer maps cars to car and, when the word is treated as a verb, went to go. It does not merge synonyms: cars and automobile reduce to the separate lemmas car and automobile. In the program below, we use the WordNet lexical database for lemmatization.
import nltk
from nltk.stem import WordNetLemmatizer

# Note: requires NLTK's tokenizer and WordNet data to be installed (see nltk.download).
wordnet_lemmatizer = WordNetLemmatizer()
word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"

# Tokenize, then look up the lemma of each token (treated as a noun by default)
nltk_tokens = nltk.word_tokenize(word_data)
for w in nltk_tokens:
    print("Actual: %s Lemma: %s" % (w, wordnet_lemmatizer.lemmatize(w)))
When we execute the above code, it produces the following result.
Actual: It Lemma: It
Actual: originated Lemma: originated
Actual: from Lemma: from
Actual: the Lemma: the
Actual: idea Lemma: idea
Actual: that Lemma: that
Actual: there Lemma: there
Actual: are Lemma: are
Actual: readers Lemma: reader
Actual: who Lemma: who
Actual: prefer Lemma: prefer
Actual: learning Lemma: learning
Actual: new Lemma: new
Actual: skills Lemma: skill
Actual: from Lemma: from
Actual: the Lemma: the
Actual: comforts Lemma: comfort
Actual: of Lemma: of
Actual: their Lemma: their
Actual: drawing Lemma: drawing
Actual: rooms Lemma: room
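In the output above, originated, are and learning come back unchanged. The reason is that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a part of speech is passed through its `pos` argument. The sketch below mimics that behaviour in plain Python. `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names, and the small hand-written table stands in for WordNet, so this illustrates the idea and is not NLTK's implementation.

```python
# Hypothetical lookup table standing in for WordNet: (word, part of speech) -> lemma.
# "n" = noun, "v" = verb, the same tag letters that WordNetLemmatizer accepts.
TOY_LEMMAS = {
    ("readers", "n"): "reader",
    ("skills", "n"): "skill",
    ("rooms", "n"): "room",
    ("are", "v"): "be",
    ("learning", "v"): "learn",
    ("originated", "v"): "originate",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos); unknown pairs come back unchanged."""
    return TOY_LEMMAS.get((word.lower(), pos), word)

for w in ["readers", "learning", "are"]:
    print("Actual: %s  As noun: %s  As verb: %s"
          % (w, toy_lemmatize(w), toy_lemmatize(w, pos="v")))
```

With the real lemmatizer, the equivalent step would be to pass a part of speech, for example `wordnet_lemmatizer.lemmatize(w, pos="v")` for verbs. A part-of-speech tagger is usually run first to choose the tag for each token.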