Python Data Science: A Concise Tutorial
Python - Stemming and Lemmatization
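In natural language processing, we often encounter two or more words that share a common root. For example, the three words agreed, agreeing, and agreeable all share the root "agree". A search involving any of these words should treat them as the same word, namely the root. It is therefore essential to link each word to its root. The NLTK library provides methods that perform this linking and return the root as output.
The program below uses the Porter Stemming Algorithm for stemming.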
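To make the search idea concrete, here is a minimal sketch of an index keyed by root. The names toy_stem, build_stem_index, and search are hypothetical, and toy_stem is a deliberately crude suffix stripper used only so the example runs without extra libraries. It is not the Porter algorithm; any real stemmer function could be passed in its place.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stand-in stemmer: strips one common suffix, then trailing 'e's."""
    word = word.lower()
    for suffix in ("able", "ing", "ed", "s"):
        # Only strip when at least three characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    return word.rstrip("e")


def build_stem_index(words, stem=toy_stem):
    """Group words under their stem so related forms share one key."""
    index = defaultdict(set)
    for w in words:
        index[stem(w)].add(w)
    return index


def search(index, query, stem=toy_stem):
    """Return every indexed word whose stem matches the stem of the query."""
    return sorted(index.get(stem(query), ()))


index = build_stem_index(["agreed", "agreeing", "agreeable", "room"])
print(search(index, "agree"))
```

Because the query is stemmed with the same function as the indexed words, searching for any one form retrieves all forms stored under that root.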
import nltk
from nltk.stem.porter import PorterStemmer

# word_tokenize needs the Punkt tokenizer data; if it is missing, fetch it once
# with nltk.download('punkt') (newer NLTK releases may ask for 'punkt_tab')
porter_stemmer = PorterStemmer()
word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"
# First, tokenize the sentence into words
nltk_tokens = nltk.word_tokenize(word_data)
# Next, find the stem of each word
for w in nltk_tokens:
    print("Actual: %s Stem: %s" % (w, porter_stemmer.stem(w)))
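When we execute the code above, it produces the following result. Recent NLTK releases lowercase each token before stemming, so on those versions the first line may read "Stem: it".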
Actual: It Stem: It
Actual: originated Stem: origin
Actual: from Stem: from
Actual: the Stem: the
Actual: idea Stem: idea
Actual: that Stem: that
Actual: there Stem: there
Actual: are Stem: are
Actual: readers Stem: reader
Actual: who Stem: who
Actual: prefer Stem: prefer
Actual: learning Stem: learn
Actual: new Stem: new
Actual: skills Stem: skill
Actual: from Stem: from
Actual: the Stem: the
Actual: comforts Stem: comfort
Actual: of Stem: of
Actual: their Stem: their
Actual: drawing Stem: draw
Actual: rooms Stem: room
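Most tokens in the listing above pass through unchanged. The following sketch filters the output down to the tokens that stemming actually altered. The helper name changed_by_stemming is hypothetical, and lookup_stem is a stand-in built from the stems printed above so that the example runs without NLTK; with NLTK installed, porter_stemmer.stem could be passed instead.

```python
def changed_by_stemming(tokens, stem):
    """Return (token, stem) pairs where the stem differs from the token, ignoring case."""
    changed = []
    for token in tokens:
        root = stem(token)
        if root.lower() != token.lower():
            changed.append((token, root))
    return changed


# Stand-in for PorterStemmer().stem, copied from the stems shown in the listing above
listed_stems = {
    "originated": "origin", "readers": "reader", "learning": "learn",
    "skills": "skill", "comforts": "comfort", "drawing": "draw", "rooms": "room",
}


def lookup_stem(word):
    """Hypothetical lookup stemmer: returns the listed stem, or the word itself."""
    return listed_stems.get(word, word)


tokens = ("It originated from the idea that there are readers who prefer "
          "learning new skills from the comforts of their drawing rooms").split()
for token, root in changed_by_stemming(tokens, lookup_stem):
    print("%s -> %s" % (token, root))
```

The comparison ignores case so that a stemmer which lowercases its input does not report "It" as changed.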
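Lemmatization is similar to stemming, but it brings context to the word, in particular its part of speech. Instead of chopping off suffixes, it looks the word up in a vocabulary and returns a valid dictionary form, the lemma. For example, a lemmatizer maps cars to car and, when told that the word is a verb, maps are to be. In the program below, we use the WordNet lexical database for lemmatization.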
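import nltk
from nltk.stem import WordNetLemmatizer

# Needs the "wordnet" corpus and the Punkt tokenizer data (fetch once with nltk.download)
wordnet_lemmatizer = WordNetLemmatizer()
word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"
# Tokenize the sentence into words
nltk_tokens = nltk.word_tokenize(word_data)
# Look up the lemma of each word (every token is treated as a noun by default)
for w in nltk_tokens:
    print("Actual: %s Lemma: %s" % (w, wordnet_lemmatizer.lemmatize(w)))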
When we execute the code above, it produces the following result.
Actual: It Lemma: It
Actual: originated Lemma: originated
Actual: from Lemma: from
Actual: the Lemma: the
Actual: idea Lemma: idea
Actual: that Lemma: that
Actual: there Lemma: there
Actual: are Lemma: are
Actual: readers Lemma: reader
Actual: who Lemma: who
Actual: prefer Lemma: prefer
Actual: learning Lemma: learning
Actual: new Lemma: new
Actual: skills Lemma: skill
Actual: from Lemma: from
Actual: the Lemma: the
Actual: comforts Lemma: comfort
Actual: of Lemma: of
Actual: their Lemma: their
Actual: drawing Lemma: drawing
Actual: rooms Lemma: room
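In the output above, originated, learning, and drawing come back unchanged because lemmatize treats every word as a noun unless a part of speech is passed through its pos argument. The sketch below shows one common way to supply that argument from Penn Treebank tags such as those returned by nltk.pos_tag. The helper names penn_to_wordnet and lemmatize_tagged are hypothetical, and the lemmatizer is passed in as a parameter so the helpers themselves do not depend on NLTK being installed.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the single-letter POS code that WordNet uses."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # fall back to noun, the default of lemmatize()


def lemmatize_tagged(tagged_tokens, lemmatizer):
    """Lemmatize (word, penn_tag) pairs with any object exposing lemmatize(word, pos=...)."""
    return [
        lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
        for word, tag in tagged_tokens
    ]
```

With NLTK installed, the pairs would come from nltk.pos_tag(nltk_tokens) and the lemmatizer would be the WordNetLemmatizer instance from the program above. For instance, wordnet_lemmatizer.lemmatize("learning", pos="v") returns "learn", whereas the default noun lookup leaves the word as it is.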