Natural Language Toolkit Tutorial
Stemming & Lemmatization
What is Stemming?
Stemming is a technique for extracting the base form of a word by removing affixes from it. It is just like cutting the branches of a tree back to the trunk. For example, the stem of the words eating, eats, and eaten is eat.
Search engines use stemming to index words. Rather than storing every form of a word, a search engine can store only its stem. In this way, stemming reduces the size of the index and improves recall.
Various Stemming algorithms
In NLTK, all the stemmers we are going to cover next implement the StemmerI interface, which defines the stem() method.
Porter stemming algorithm
It is one of the most widely used stemming algorithms. It is essentially designed to remove and replace well-known suffixes of English words.
PorterStemmer class
NLTK provides the PorterStemmer class, with which we can easily apply the Porter Stemmer algorithm to the words we want to stem. The class knows several regular word forms and suffixes, and uses them to transform an input word into its final stem. The resulting stem is often a shorter word with the same root meaning. Let us see an example −
First, we need to import the Natural Language Toolkit (nltk).
import nltk
Now, import the PorterStemmer class to implement the Porter Stemmer algorithm.
from nltk.stem import PorterStemmer
Next, create an instance of the PorterStemmer class as follows −
word_stemmer = PorterStemmer()
Now, input the word you want to stem.
word_stemmer.stem('writing')
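Putting the steps above together, the following sketch stems a few words at once; the word list is just illustrative:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Porter reduces regular inflections to a common stem
for word in ['writing', 'eating', 'eats', 'believes']:
    print(word, '->', stemmer.stem(word))
# writing -> write
# eating -> eat
# eats -> eat
# believes -> believ
```

Note that the stem need not be a dictionary word: 'believes' becomes 'believ', which is a stem rather than a valid English word.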
Lancaster stemming algorithm
It was developed at Lancaster University and is another very common stemming algorithm.
LancasterStemmer class
NLTK provides the LancasterStemmer class, with which we can easily apply the Lancaster Stemmer algorithm to the words we want to stem. Let us see an example −
First, we need to import the Natural Language Toolkit (nltk).
import nltk
Now, import the LancasterStemmer class to implement the Lancaster Stemmer algorithm.
from nltk.stem import LancasterStemmer
Next, create an instance of the LancasterStemmer class as follows −
Lanc_stemmer = LancasterStemmer()
Now, input the word you want to stem.
Lanc_stemmer.stem('eats')
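The Lancaster stemmer is generally more aggressive than the Porter stemmer, so it often produces shorter stems. A small sketch comparing the two (the word choices are illustrative):

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Print each word alongside its Porter and Lancaster stems
for word in ['eats', 'eating', 'maximum']:
    print(word, '| porter:', porter.stem(word), '| lancaster:', lancaster.stem(word))
```

For simple inflections such as 'eats' and 'eating', both stemmers agree on the stem 'eat'; on less regular words their outputs can diverge.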
Regular Expression stemming algorithm
With the help of this stemming algorithm, we can construct our own stemmer.
RegexpStemmer class
NLTK provides the RegexpStemmer class, with which we can easily implement the Regular Expression Stemmer algorithm. It takes a single regular expression and removes any prefix or suffix that matches the expression. Let us see an example −
First, we need to import the Natural Language Toolkit (nltk).
import nltk
Now, import the RegexpStemmer class to implement the Regular Expression Stemmer algorithm.
from nltk.stem import RegexpStemmer
Next, create an instance of the RegexpStemmer class and provide the suffix or prefix you want to remove from the word as follows −
Reg_stemmer = RegexpStemmer('ing')
Now, input the word you want to stem.
Reg_stemmer.stem('eating')
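In practice it helps to anchor the pattern to the end of the word and to use the min parameter, which leaves words shorter than the given length untouched. A sketch with an illustrative pattern:

```python
from nltk.stem import RegexpStemmer

# Strip a few common suffixes, but only from words of length >= 4
stemmer = RegexpStemmer('ing$|s$|able$', min=4)

print(stemmer.stem('eating'))  # eat
print(stemmer.stem('cars'))    # car
print(stemmer.stem('is'))      # is  (shorter than min, left unchanged)
```

Anchoring with $ prevents the pattern from removing matching letters from the middle of a word.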
Snowball stemming algorithm
It is another very useful stemming algorithm.
SnowballStemmer class
NLTK provides the SnowballStemmer class, with which we can easily implement the Snowball Stemmer algorithm. It supports stemmers for 15 languages, including English (the list below also contains 'porter', which is an alias for the original Porter algorithm rather than a language). To use this stemming class, we need to create an instance with the name of the language we are using and then call the stem() method. Let us see an example −
First, we need to import the Natural Language Toolkit (nltk).
import nltk
Now, import the SnowballStemmer class to implement the Snowball Stemmer algorithm.
from nltk.stem import SnowballStemmer
Let us see the languages it supports −
SnowballStemmer.languages
Output
(
'arabic',
'danish',
'dutch',
'english',
'finnish',
'french',
'german',
'hungarian',
'italian',
'norwegian',
'porter',
'portuguese',
'romanian',
'russian',
'spanish',
'swedish'
)
Next, create an instance of the SnowballStemmer class with the language you want to use. Here, we create a stemmer for French.
French_stemmer = SnowballStemmer('french')
Now, call the stem() method and input the word you want to stem.
French_stemmer.stem('Bonjoura')
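The same class works for English as well; a minimal sketch (the French example word is just illustrative):

```python
from nltk.stem import SnowballStemmer

# English Snowball stemmer (also known as Porter2)
english_stemmer = SnowballStemmer('english')
print(english_stemmer.stem('running'))  # run

# The same API, instantiated for French
french_stemmer = SnowballStemmer('french')
print(french_stemmer.stem('manger'))
```

Each instance handles one language, so a multilingual pipeline typically keeps one stemmer per language.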
What is Lemmatization?
The lemmatization technique is similar to stemming. The output we get after lemmatization is called a 'lemma', which is a root word rather than a root stem (the output of stemming). After lemmatization, we get a valid dictionary word with the same meaning.
NLTK provides the WordNetLemmatizer class, which is a thin wrapper around the wordnet corpus. This class uses the morphy() function of the WordNet CorpusReader class to find a lemma. Let us understand it with an example −
Example
First, we need to import the Natural Language Toolkit (nltk).
import nltk
Now, import the WordNetLemmatizer class to implement the lemmatization technique.
from nltk.stem import WordNetLemmatizer
Next, create an instance of the WordNetLemmatizer class.
lemmatizer = WordNetLemmatizer()
Now, call the lemmatize() method and input the word whose lemma you want to find.
lemmatizer.lemmatize('eating')
Difference between Stemming & Lemmatization
Let us understand the difference between Stemming and Lemmatization with the help of the following example −
import nltk
from nltk.stem import PorterStemmer
word_stemmer = PorterStemmer()
word_stemmer.stem('believes')
Output
believ
import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('believes')
Output
belief
The output of the two programs shows the major difference between stemming and lemmatization. The PorterStemmer class simply chops 'es' off the word, producing the stem 'believ', whereas the WordNetLemmatizer class finds the valid word 'belief'. In simple terms, the stemming technique only looks at the form of the word, whereas the lemmatization technique looks at its meaning. This means that after applying lemmatization, we always get a valid word.