Artificial Intelligence With Python: A Concise Tutorial

AI with Python – NLTK Package

In this chapter, we will learn how to get started with the Natural Language Toolkit (NLTK) package.

Prerequisite

Building applications with natural language processing is difficult mainly because meaning depends on context. Context influences how a machine interprets a particular sentence. Hence, we need to develop natural language applications with machine learning approaches, so that the machine can interpret context the way a human does.

To build such applications, we will use the Python package called NLTK (Natural Language Toolkit).

Importing NLTK

We need to install NLTK before using it. It can be installed with the following command −

pip install nltk

To install NLTK with conda instead, use the following command −

conda install -c anaconda nltk

After installing the NLTK package, we need to import it. We can do so by typing the following command at the Python prompt −

>>> import nltk

Downloading NLTK’s Data

After importing NLTK, we need to download the required data. This can be done with the following command at the Python prompt −

>>> nltk.download()

Installing Other Necessary Packages

To build natural language processing applications with NLTK, we need to install some additional packages. The packages are as follows −

gensim

It is a robust semantic modeling library that is useful for many applications. We can install it by executing the following command −

pip install gensim

pattern

It is used to make the gensim package work properly. We can install it by executing the following command −

pip install pattern

Concept of Tokenization, Stemming, and Lemmatization

In this section, we will understand what tokenization, stemming, and lemmatization are.

Tokenization

It may be defined as the process of breaking the given text, i.e. a character sequence, into smaller units called tokens. The tokens may be words, numbers or punctuation marks. It is also called word segmentation. Following is a simple example of tokenization −

Input − Mango, banana, pineapple and apple all are fruits.

Output −

The text is broken up by locating the word boundaries. The end of one word and the beginning of the next are called word boundaries. The writing system and the typographical structure of the words influence the boundaries.

The Python NLTK module provides different packages related to tokenization, which we can use to divide the text into tokens as per our requirements. Some of the packages are as follows −

sent_tokenize package

As the name suggests, this package divides the input text into sentences. We can import it with the following Python code −

from nltk.tokenize import sent_tokenize

word_tokenize package

This package divides the input text into words. We can import it with the following Python code −

from nltk.tokenize import word_tokenize

WordPunctTokenizer package

This package divides the input text into words as well as punctuation marks. We can import it with the following Python code −

from nltk.tokenize import WordPunctTokenizer
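As a minimal sketch of how such a tokenizer behaves, the snippet below applies WordPunctTokenizer to the example sentence from the Tokenization section. WordPunctTokenizer is rule-based and needs no downloaded data; sent_tokenize and word_tokenize rely on models fetched through nltk.download(), so they are left out of this sketch.

```python
from nltk.tokenize import WordPunctTokenizer

text = "Mango, banana, pineapple and apple all are fruits."

# Split the text into runs of word characters and runs of punctuation
tokenizer = WordPunctTokenizer()
tokens = tokenizer.tokenize(text)

print(tokens)
# ['Mango', ',', 'banana', ',', 'pineapple', 'and', 'apple', 'all', 'are', 'fruits', '.']
```

Each comma and the final period come out as separate tokens, which is the behavior described for this package.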

Stemming

While working with words, we come across a lot of variations due to grammatical reasons. Variation here means that we have to deal with different forms of the same word, such as democracy, democratic, and democratization. Machines need to understand that these different words share the same base form, so it is useful to extract the base forms of words while analyzing text.

We can achieve this by stemming. Stemming is the heuristic process of extracting the base forms of words by chopping off their endings.

The Python NLTK module provides different packages related to stemming, each implementing a different algorithm. They can be used to get the base forms of words. Some of the packages are as follows −

PorterStemmer package

This Python package uses Porter's algorithm to extract the base form. We can import it with the following Python code −

from nltk.stem.porter import PorterStemmer

For example, if we give the word ‘writing’ as the input to this stemmer, we will get the word ‘write’ after stemming.

LancasterStemmer package

This Python package uses the Lancaster algorithm to extract the base form. We can import it with the following Python code −

from nltk.stem.lancaster import LancasterStemmer

For example, if we give the word ‘writing’ as the input to this stemmer, we will get the shorter stem ‘writ’ after stemming, because the Lancaster algorithm is more aggressive.

SnowballStemmer package

This Python package uses the Snowball algorithm to extract the base form. We can import it with the following Python code −

from nltk.stem.snowball import SnowballStemmer

For example, if we give the word ‘writing’ as the input to this stemmer, we will get the word ‘write’ after stemming.

These algorithms differ in their level of strictness. Of the three, the Porter stemmer is the least strict and Lancaster is the strictest. The Snowball stemmer is a good compromise in terms of both speed and strictness.

Lemmatization

We can also extract the base form of words by lemmatization. It does this with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only. The base form obtained this way is called the lemma.

The main difference between stemming and lemmatization is the use of a vocabulary and morphological analysis of the words. Another difference is that stemming most commonly collapses derivationally related words, whereas lemmatization commonly collapses only the different inflectional forms of a lemma. For example, given the word ‘saw’, stemming might return just ‘s’, whereas lemmatization would attempt to return either ‘see’ or ‘saw’, depending on whether the token was used as a verb or a noun.

The Python NLTK module provides the following package for lemmatization, which we can use to get the base forms of words −

WordNetLemmatizer package

This Python package extracts the base form of a word depending on whether it is used as a noun or as a verb. We can import it with the following Python code −

from nltk.stem import WordNetLemmatizer

Chunking: Dividing Data into Chunks

Chunking is one of the important processes in natural language processing. Its main job is to identify parts of speech and short phrases such as noun phrases. We have already studied tokenization, the creation of tokens. Chunking is essentially the labeling of those tokens. In other words, chunking shows us the structure of the sentence.

In the following section, we will learn about the different types of chunking.

Types of chunking

There are two types of chunking. The types are as follows −

Chunking up

In this process of chunking, the objects, things, etc. become more general and the language becomes more abstract. There are more chances of agreement. In this process, we zoom out. For example, if we chunk up the question “What purpose do cars serve?”, we may get the answer “transport”.

Chunking down

In this process of chunking, the objects, things, etc. become more specific and the language becomes more detailed. The deeper structure is examined when chunking down. In this process, we zoom in. For example, if we chunk down the request “Tell me specifically about a car”, we will get smaller pieces of information about the car.

Example

In this example, we will do noun-phrase chunking, a category of chunking that finds the noun-phrase chunks in a sentence, by using the NLTK module in Python.

Follow these steps in Python to implement noun-phrase chunking −

Step 1 − Define the grammar for chunking. It consists of the rules that the chunks must follow.

Step 2 − Create a chunk parser. It applies the grammar to the sentence and gives the output.

Step 3 − The output is produced in a tree format.

Let us import the necessary NLTK package as follows −

import nltk

Now, we need to define the sentence as a list of (word, tag) pairs. Here, DT means determiner, VBP means verb, JJ means adjective, IN means preposition and NN means noun.

sentence=[("a","DT"),("clever","JJ"),("fox","NN"),("was","VBP"),
          ("jumping","VBP"),("over","IN"),("the","DT"),("wall","NN")]

Now, we need to give the grammar. Here, we give it in the form of a regular expression over the tags: an optional determiner, followed by any number of adjectives, followed by a noun.

grammar = "NP:{<DT>?<JJ>*<NN>}"

We need to define a parser that will apply the grammar.

parser_chunking = nltk.RegexpParser(grammar)

The parser parses the sentence as follows −

parser_chunking.parse(sentence)

Next, we need to get the output. The parse result is stored in a variable called output_chunk.

output_chunk = parser_chunking.parse(sentence)

Upon execution of the following code, we can draw our output in the form of a tree.

output_chunk.draw()
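The steps above can be put together as one short script. As a sketch for environments without a graphical display, it prints the tree as text instead of drawing it; the noun_phrases list is a helper name introduced here for illustration and is not part of the original example.

```python
import nltk

# Sentence given as (word, part-of-speech tag) pairs
sentence = [("a", "DT"), ("clever", "JJ"), ("fox", "NN"), ("was", "VBP"),
            ("jumping", "VBP"), ("over", "IN"), ("the", "DT"), ("wall", "NN")]

# NP = optional determiner, any number of adjectives, then a noun
grammar = "NP:{<DT>?<JJ>*<NN>}"

parser_chunking = nltk.RegexpParser(grammar)
output_chunk = parser_chunking.parse(sentence)

# Text rendering of the tree (no GUI needed)
print(output_chunk)

# Collect the words of every NP subtree
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in output_chunk.subtrees(filter=lambda t: t.label() == "NP")
]
print(noun_phrases)  # ['a clever fox', 'the wall']
```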

Bag of Words (BoW) Model

Bag of Words (BoW), a model in natural language processing, is used to extract features from text so that the text can be used in modeling, for example in machine learning algorithms.

Why do we need to extract features from text? Machine learning algorithms cannot work with raw text; they need numeric data from which they can extract meaningful information. The conversion of text data into numeric data is called feature extraction or feature encoding.

How it works

This is a very simple approach for extracting features from text. Suppose we have a text document and we want to convert it into numeric data, that is, extract features from it. First, the model extracts a vocabulary from all the words in the document. Then it builds a model by using a document term matrix. In this way, BoW represents the document as a bag of words only. Any information about the order or structure of words in the document is discarded.

Concept of document term matrix

The BoW algorithm builds a model by using the document term matrix. As the name suggests, the document term matrix is a matrix of the counts of the various words that occur in the documents. With the help of this matrix, a text document can be represented as a weighted combination of various words. By setting a threshold and choosing the words that are more meaningful, we can build a histogram of the words in the documents that can be used as a feature vector. Following is an example to understand the concept of the document term matrix −

Example

Suppose we have the following two sentences −

Sentence 1 − We are using the Bag of Words model.
Sentence 2 − Bag of Words model is used for extracting the features.

Considering these two sentences together, we have the following 13 distinct words −

we
are
using
the
bag
of
words
model
is
used
for
extracting
features

Now, we need to build a histogram for each sentence by counting how often each of these words occurs in it −

Sentence 1 − [1,1,1,1,1,1,1,1,0,0,0,0,0]
Sentence 2 − [0,0,0,1,1,1,1,1,1,1,1,1,1]

In this way, we have extracted the feature vectors. Each feature vector is 13-dimensional because we have 13 distinct words.

Concept of the Statistics

The statistic used here is called Term Frequency-Inverse Document Frequency (tf-idf). Words differ in how important they are to a document, and this statistic helps us understand the importance of each word.

Term Frequency(tf)

It is the measure of how frequently each word appears in a document. It can be obtained by dividing the count of each word by the total number of words in the given document.

Inverse Document Frequency(idf)

It is the measure of how unique a word is to this document within the given set of documents. To calculate idf and formulate a distinctive feature vector, we need to reduce the weights of commonly occurring words such as ‘the’ and increase the weights of rare words.

Building a Bag of Words Model in NLTK

In this section, we will use CountVectorizer to create vectors from a collection of sentences.

Let us import the necessary package −

from sklearn.feature_extraction.text import CountVectorizer

Now define the set of sentences, then fit the vectorizer on them and print the learned vocabulary.

Sentences = ['We are using the Bag of Word model',
             'Bag of Word model is used for extracting the features.']

# Learn the vocabulary and build the dense document term matrix
vectorizer_count = CountVectorizer()

features_text = vectorizer_count.fit_transform(Sentences).todense()

print(vectorizer_count.vocabulary_)

The above program generates the output shown below. It shows that we have 13 distinct words in the above two sentences, each mapped to its column index −

{'we': 11, 'are': 0, 'using': 10, 'the': 8, 'bag': 1, 'of': 7,
 'word': 12, 'model': 6, 'is': 5, 'used': 9, 'for': 4, 'extracting': 2, 'features': 3}

The corresponding feature vectors (the text in numeric form) are stored in features_text and can be used for machine learning.

Solving Problems

In this section, we will solve a few related problems.

Category Prediction

In a set of documents, not only the words but also the category of the text in which a particular word occurs is important. For example, we may want to predict whether a given sentence belongs to the category email, news, sports, computer, etc. In the following example, we use tf-idf to formulate a feature vector and find the category of documents. We will use data from the 20 newsgroups dataset available through sklearn.

We need to import the necessary packages −

from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

Define the category map. We are using five different categories named Religion, Autos, Hockey, Electronics and Space.

category_map = {'talk.religion.misc':'Religion','rec.autos':'Autos',
   'rec.sport.hockey':'Hockey','sci.electronics':'Electronics', 'sci.space': 'Space'}

Create the training set −

training_data = fetch_20newsgroups(subset = 'train',
   categories = list(category_map.keys()), shuffle = True, random_state = 5)

Build a count vectorizer and extract the term counts −

vectorizer_count = CountVectorizer()
train_tc = vectorizer_count.fit_transform(training_data.data)
print("\nDimensions of training data:", train_tc.shape)

The tf-idf transformer is created as follows −

tfidf = TfidfTransformer()
train_tfidf = tfidf.fit_transform(train_tc)

Now, define the test data −

input_data = [
   'Discovery was a space shuttle',
   'Hindu, Christian, Sikh all are religions',
   'We must have to drive safely',
   'Puck is a disk made of rubber',
   'Television, Microwave, Refrigrated all uses electricity'
]

Train a Multinomial Naive Bayes classifier on the tf-idf features of the training data −

classifier = MultinomialNB().fit(train_tfidf, training_data.target)

Transform the input data using the count vectorizer −

input_tc = vectorizer_count.transform(input_data)

Now, we will transform the vectorized data using the tf-idf transformer −

input_tfidf = tfidf.transform(input_tc)

We will predict the output categories −

predictions = classifier.predict(input_tfidf)

The output is generated as follows −

for sent, category in zip(input_data, predictions):
   print('\nInput Data:', sent, '\n Category:', \
      category_map[training_data.target_names[category]])

The category predictor generates the following output −

Dimensions of training data: (2755, 39297)

Input Data: Discovery was a space shuttle
Category: Space

Input Data: Hindu, Christian, Sikh all are religions
Category: Religion

Input Data: We must have to drive safely
Category: Autos

Input Data: Puck is a disk made of rubber
Category: Hockey

Input Data: Television, Microwave, Refrigrated all uses electricity
Category: Electronics
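The same count, tf-idf and Naive Bayes chain can be tried without downloading the 20 newsgroups data. The following is a minimal sketch on a handful of made-up training sentences (an assumption for illustration only); it uses scikit-learn's make_pipeline to chain the three steps, which is a convenience and not how the example above is written.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up training set (illustrative, not from 20 newsgroups)
train_docs = [
    "the rocket launched into orbit",
    "astronauts aboard the space shuttle",
    "the goalie stopped the puck",
    "hockey players skate on ice",
]
train_labels = ["Space", "Space", "Hockey", "Hockey"]

# Term counts -> tf-idf weights -> Multinomial Naive Bayes
model = make_pipeline(CountVectorizer(), TfidfTransformer(), MultinomialNB())
model.fit(train_docs, train_labels)

new_docs = ["a shuttle reached orbit", "the puck slid across the ice"]
predicted = list(model.predict(new_docs))

for sent, category in zip(new_docs, predicted):
    print('Input Data:', sent, '| Category:', category)
```

With a pipeline, the vectorizer and transformer fitted on the training data are reused automatically when predicting, which mirrors the separate transform calls shown earlier.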

Gender Finder

In this problem, a classifier is trained to predict the gender (male or female) from a given name. We need to use a heuristic to construct a feature vector and train the classifier. We will use the labeled names corpus that ships with NLTK. Following is the Python code to build a gender finder −

Let us import the necessary packages −

import random

from nltk import NaiveBayesClassifier
from nltk.classify import accuracy as nltk_accuracy
from nltk.corpus import names

Now we need to extract the last N letters from the input word. These letters will act as features −

def extract_features(word, N = 2):
   # Use the last N letters of the name, lowercased, as the only feature
   last_n_letters = word[-N:]
   return {'feature': last_n_letters.lower()}

Create the training data using the labeled names (male as well as female) available in NLTK −

male_list = [(name, 'male') for name in names.words('male.txt')]
female_list = [(name, 'female') for name in names.words('female.txt')]
data = (male_list + female_list)

random.seed(5)
random.shuffle(data)

Now, the input names for prediction are created as follows −

namesInput = ['Rajesh', 'Gaurav', 'Swati', 'Shubha']

Define the number of samples used for training with the following code; the remaining samples are used for testing −

train_sample = int(0.8 * len(data))

Now, we need to iterate over different numbers of end letters so that the accuracies can be compared −

for i in range(1, 6):
   print('\nNumber of end letters:', i)
   features = [(extract_features(n, i), gender) for (n, gender) in data]
   train_data, test_data = features[:train_sample], features[train_sample:]
   classifier = NaiveBayesClassifier.train(train_data)

Still inside the loop, the accuracy of the classifier can be computed as follows −

   accuracy_classifier = round(100 * nltk_accuracy(classifier, test_data), 2)
   print('Accuracy = ' + str(accuracy_classifier) + '%')

Now, still inside the loop, we can predict the output −

   for name in namesInput:
      print(name, '->', classifier.classify(extract_features(name, i)))

The above program will generate the following output −

Number of end letters: 1
Accuracy = 74.7%
Rajesh -> female
Gaurav -> male
Swati -> female
Shubha -> female

Number of end letters: 2
Accuracy = 78.79%
Rajesh -> male
Gaurav -> male
Swati -> female
Shubha -> female

Number of end letters: 3
Accuracy = 77.22%
Rajesh -> male
Gaurav -> female
Swati -> female
Shubha -> female

Number of end letters: 4
Accuracy = 69.98%
Rajesh -> female
Gaurav -> female
Swati -> female
Shubha -> female

Number of end letters: 5
Accuracy = 64.63%
Rajesh -> female
Gaurav -> female
Swati -> female
Shubha -> female

In the above output, we can see that the accuracy peaks when two end letters are used and decreases as the number of end letters increases further.

Topic Modeling: Identifying Patterns in Text Data

We know that documents are generally grouped into topics. Sometimes we need to identify the patterns in a text that correspond to a particular topic. The technique of doing this is called topic modeling. In other words, topic modeling is a technique to uncover abstract themes or hidden structure in a given set of documents.

We can use the topic modeling technique in the following scenarios −

Text Classification

With the help of topic modeling, classification can be improved because it groups similar words together rather than using each word separately as a feature.

Recommender Systems

With the help of topic modeling, we can build recommender systems by using similarity measures.

Algorithms for Topic Modeling

Topic modeling can be implemented with several algorithms. The algorithms are as follows −

Latent Dirichlet Allocation(LDA)

This algorithm is the most popular one for topic modeling. It uses probabilistic graphical models to implement topic modeling. We need to import the gensim package in Python to use the LDA algorithm.

Latent Semantic Analysis(LSA) or Latent Semantic Indexing(LSI)

This algorithm is based upon linear algebra. It applies SVD (Singular Value Decomposition) to the document term matrix.

Non-Negative Matrix Factorization (NMF)

It is also based upon linear algebra.

All of the above algorithms for topic modeling take the number of topics as a parameter and the Document-Word Matrix as input, and produce the WTM (Word Topic Matrix) and the TDM (Topic Document Matrix) as output.