Natural Language Toolkit Tutorial

Natural Language Toolkit - Text Classification

What is text classification?

Text classification, as the name implies, is the way to categorize pieces of text or documents. But here the question arises: why do we need text classifiers? By examining the word usage in a document or piece of text, a classifier can decide which class label should be assigned to it.

Binary Classifier

As the name implies, a binary classifier decides between two labels, for example positive or negative. The piece of text or document can be assigned one label or the other, but not both.

Multi-label Classifier

In contrast to a binary classifier, a multi-label classifier can assign one or more labels to a piece of text or document.
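
As a minimal illustration (plain Python data, not an NLTK API), a multi-label result might be represented as a set of labels per document −

# Hypothetical example: each document maps to one or more labels.
doc_labels = {
   'doc1': {'sports'},              # a single label
   'doc2': {'sports', 'politics'},  # more than one label
}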

Labeled vs Unlabeled Feature Sets

A key-value mapping of feature names to feature values is called a feature set. Labeled feature sets, or training data, are very important for classification training so that the classifier can later classify unlabeled feature sets.

Labeled Feature Set

  1. It is a tuple that looks like (feat, label).

  2. It is an instance with a known class label.

  3. It is used for training a classification algorithm.

Unlabeled Feature Set

  1. It is the feat by itself.

  2. Without an associated label, we simply call it an instance.

  3. Once trained, a classification algorithm can classify an unlabeled feature set.
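
As a quick illustration (with hypothetical words as feature names), the two forms look like this in Python −

# An unlabeled feature set: just the feat, i.e. an instance.
unlabeled = {'great': True, 'movie': True}

# A labeled feature set: a (feat, label) tuple with a known class label.
labeled = ({'great': True, 'movie': True}, 'pos')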

Text Feature Extraction

Text feature extraction, as the name implies, is the process of transforming a list of words into a feature set that is usable by a classifier. We have to transform our text into ‘dict’ style feature sets because the Natural Language Toolkit (NLTK) expects ‘dict’ style feature sets.

Bag of Words (BoW) model

BoW, one of the simplest models in NLP, is used to extract features from a piece of text or document so that they can be used in modeling, for example in ML algorithms. It basically constructs a word presence feature set from all the words of an instance. The concept behind this method is that it doesn’t care about how many times a word occurs or about the order of the words; it only cares whether the word is present in a list of words or not.

Example

For this example, we are going to define a function named bow() −

def bow(words):
   # Map every word to True, recording only word presence.
   return dict([(word, True) for word in words])

Now, let us call the bow() function on some words. We saved this function in a file named bagwords.py.

from bagwords import bow
bow(['we', 'are', 'using', 'tutorialspoint'])

Output

{'we': True, 'are': True, 'using': True, 'tutorialspoint': True}

Training classifiers

In the previous sections, we learned how to extract features from text. So now we can train a classifier. The first and easiest classifier is the NaiveBayesClassifier class.

Naïve Bayes Classifier

To predict the probability that a given feature set belongs to a particular label, it uses Bayes theorem. The formula of Bayes theorem is as follows −

P(A|B) = P(B|A) * P(A) / P(B)

Here,

P(A|B) − It is also called the posterior probability, i.e. the probability of the first event, A, occurring given that the second event, B, has occurred.

P(B|A) − It is the probability of the second event, B, occurring given that the first event, A, has occurred.

P(A), P(B) − These are also called the prior probabilities, i.e. the probability of the first event, A, or the second event, B, occurring on its own.
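
As a small worked example with made-up numbers: suppose a word (event B) appears in 40% of pos reviews and 10% of neg reviews, and both classes are equally likely (event A being “the review is pos”) −

# Hypothetical probabilities, for illustration only.
p_word_given_pos = 0.4                                        # P(B|A)
p_word_given_neg = 0.1
p_pos = p_neg = 0.5                                           # P(A): equal priors
p_word = p_word_given_pos * p_pos + p_word_given_neg * p_neg  # P(B) = 0.25
p_pos_given_word = (p_word_given_pos * p_pos) / p_word        # P(A|B) = 0.8
print(p_pos_given_word)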

To train the Naïve Bayes classifier, we will be using the movie_reviews corpus from NLTK. This corpus has two categories of text, namely pos and neg. These categories make a classifier trained on them a binary classifier. Every file in the corpus belongs to one of the two categories, i.e. it is either a positive movie review or a negative one. In our example, we are going to use each file as a single instance for both training and testing the classifier.
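
Assuming the corpus has already been downloaded (for example via nltk.download('movie_reviews')), we can quickly check its size −

from nltk.corpus import movie_reviews
print(len(movie_reviews.fileids('pos')))   # 1000 positive reviews
print(len(movie_reviews.fileids('neg')))   # 1000 negative reviews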

Example

For training the classifier, we need a list of labeled feature sets, which will be in the form [(featureset, label)]. Here the featureset variable is a dict and label is the known class label for the featureset. We are going to create a function named label_feats_from_corpus() which will take a corpus such as movie_reviews and also a function named feature_detector, which defaults to bow (bag of words). It will construct and return a mapping of the form {label: [featureset]}. After that, we will use this mapping to create a list of labeled training instances and testing instances.

import collections
from bagwords import bow

def label_feats_from_corpus(corp, feature_detector=bow):
   label_feats = collections.defaultdict(list)
   for label in corp.categories():
      for fileid in corp.fileids(categories=[label]):
         # Turn the words of each file into a feature set for its label.
         feats = feature_detector(corp.words(fileids=[fileid]))
         label_feats[label].append(feats)
   return label_feats

With the help of the above function, we get a mapping {label: [featureset]}. Now we are going to define one more function, named split_label_feats, that will take the mapping returned from the label_feats_from_corpus() function and split each list of feature sets into labeled training as well as testing instances. We save both functions in a file named featx.py.

def split_label_feats(lfeats, split=0.75):
   train_feats = []
   test_feats = []
   for label, feats in lfeats.items():
      # Use the first 75% (by default) of each label's feature sets for training.
      cutoff = int(len(feats) * split)
      train_feats.extend([(feat, label) for feat in feats[:cutoff]])
      test_feats.extend([(feat, label) for feat in feats[cutoff:]])
   return train_feats, test_feats

Now, let us use these functions on our corpus, i.e. movie_reviews −

from nltk.corpus import movie_reviews
from featx import label_feats_from_corpus, split_label_feats
movie_reviews.categories()

Output

['neg', 'pos']

Example

lfeats = label_feats_from_corpus(movie_reviews)
lfeats.keys()

Output

dict_keys(['neg', 'pos'])

Example

train_feats, test_feats = split_label_feats(lfeats, split = 0.75)
len(train_feats)

Output

1500

Example

len(test_feats)

Output

500

We have seen that the movie_reviews corpus has 1000 pos files and 1000 neg files. With the 0.75 split we end up with 1500 labeled training instances and 500 labeled testing instances.

Now let us train a NaiveBayesClassifier using its train() class method −

from nltk.classify import NaiveBayesClassifier
NBC = NaiveBayesClassifier.train(train_feats)
NBC.labels()

Output

['neg', 'pos']
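
With the classifier trained, we can also evaluate it on the test set and classify a new instance. A brief sketch (the exact accuracy value and predicted label depend on the trained model) −

from nltk.classify.util import accuracy
from bagwords import bow
print(accuracy(NBC, test_feats))              # overall accuracy on the test set
print(NBC.classify(bow(['great', 'film'])))   # e.g. 'pos', depending on the model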

Decision Tree Classifier

Another important classifier is the decision tree classifier. To train it, the DecisionTreeClassifier class will create a tree structure. In this tree structure, each node corresponds to a feature name and the branches correspond to the feature values. Following the branches down, we get to the leaves of the tree, i.e. the classification labels.

To train the decision tree classifier, we will use the same training and testing features, i.e. the train_feats and test_feats variables we created from the movie_reviews corpus.

Example

To train this classifier, we will call the DecisionTreeClassifier.train() class method as follows −

from nltk.classify import DecisionTreeClassifier
from nltk.classify.util import accuracy
decisiont_classifier = DecisionTreeClassifier.train(
   train_feats,
   binary=True,          # features are binary (word present or not)
   entropy_cutoff=0.8,   # stop refining when label entropy falls below this
   depth_cutoff=5,       # maximum depth of the tree
   support_cutoff=30     # minimum number of instances required to refine the tree
)
accuracy(decisiont_classifier, test_feats)

Output

0.725

Maximum Entropy Classifier

Another important classifier is MaxentClassifier, which is also known as a conditional exponential classifier or logistic regression classifier. To train it, the MaxentClassifier class converts labeled feature sets to vectors using an encoding.

To train the maximum entropy classifier, we will again use the same training and testing features, i.e. the train_feats and test_feats variables we created from the movie_reviews corpus.

Example

To train this classifier, we will call the MaxentClassifier.train() class method as follows −

from nltk.classify import MaxentClassifier
from nltk.classify.util import accuracy
maxent_classifier = MaxentClassifier.train(
   train_feats, algorithm='gis', trace=0, max_iter=10, min_lldelta=0.5
)
accuracy(maxent_classifier, test_feats)

Output

0.786
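
MaxentClassifier can also report the features that carry the most weight in its decisions. A quick sketch (the features actually printed depend on the trained model) −

maxent_classifier.show_most_informative_features(4)   # print the 4 most informative features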

Scikit-learn Classifier

One of the best machine learning (ML) libraries is Scikit-learn. It actually contains all sorts of ML algorithms for various purposes, but they all follow the same fit design pattern, outlined in the two steps below and sketched in code right after −

  1. Fitting the model to the data

  2. Using that model to make predictions
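
As a minimal sketch of this pattern with toy data (the count vectors and labels below are made up purely for illustration) −

from sklearn.naive_bayes import MultinomialNB

X = [[2, 0, 1], [0, 3, 0], [1, 0, 2]]   # toy word-count vectors
y = ['pos', 'neg', 'pos']               # toy class labels
clf = MultinomialNB()
clf.fit(X, y)                           # 1. fit the model to the data
print(clf.predict([[1, 0, 1]]))         # 2. use that model to make predictions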

Rather than accessing scikit-learn models directly, here we are going to use NLTK’s SklearnClassifier class. This class is a wrapper around a scikit-learn model that makes it conform to NLTK’s Classifier interface.

We will follow the steps below to train a SklearnClassifier class −

Step 1 − First, we will create training features as we did in the previous recipes.

Step 2 − Now, choose and import a Scikit-learn algorithm.

Step 3 − Next, we need to construct a SklearnClassifier instance with the chosen algorithm.

Step 4 − Last, we will train the SklearnClassifier with our training features.

Let us implement these steps in the below Python recipe −

from nltk.classify.scikitlearn import SklearnClassifier
from nltk.classify.util import accuracy
from sklearn.naive_bayes import MultinomialNB
sklearn_classifier = SklearnClassifier(MultinomialNB())
sklearn_classifier.train(train_feats)
# <SklearnClassifier(MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))>
accuracy(sklearn_classifier, test_feats)

Output

0.885
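
The same wrapper works with other scikit-learn estimators as well; for example (our own substitution, not part of the original recipe), LogisticRegression can be plugged in the same way −

from nltk.classify.scikitlearn import SklearnClassifier
from nltk.classify.util import accuracy
from sklearn.linear_model import LogisticRegression
lr_classifier = SklearnClassifier(LogisticRegression())   # wrap a different estimator
lr_classifier.train(train_feats)
accuracy(lr_classifier, test_feats)   # will generally differ from MultinomialNB's score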

Measuring precision and recall

While training the various classifiers, we also measured their accuracy. But apart from accuracy there are a number of other metrics used to evaluate classifiers. Two of these metrics are precision and recall: for a given label, precision is the fraction of instances assigned that label which truly belong to it, while recall is the fraction of instances truly belonging to that label which the classifier actually assigns to it.

Example

In this example, we are going to calculate the precision and recall of the NaiveBayesClassifier we trained earlier. To achieve this, we will create a function named metrics_PR() which takes two arguments: one is the trained classifier and the other is the labeled test features. Both arguments are the same as those we passed while calculating the accuracy of the classifiers. We save this function in a file named metrics_classification.py −

import collections
from nltk import metrics

def metrics_PR(classifier, testfeats):
   refsets = collections.defaultdict(set)
   testsets = collections.defaultdict(set)
   for i, (feats, label) in enumerate(testfeats):
      # Record each instance under its true label and its predicted label.
      refsets[label].add(i)
      observed = classifier.classify(feats)
      testsets[observed].add(i)
   precisions = {}
   recalls = {}
   for label in classifier.labels():
      precisions[label] = metrics.precision(refsets[label], testsets[label])
      recalls[label] = metrics.recall(refsets[label], testsets[label])
   return precisions, recalls

Let us call this function to find the precision and recall −

from metrics_classification import metrics_PR
nb_precisions, nb_recalls = metrics_PR(NBC, test_feats)
nb_precisions['pos']

Output

0.6713532466435213

Example

nb_precisions['neg']

Output

0.9676271186440678

Example

nb_recalls['pos']

Output

0.96

Example

nb_recalls['neg']

Output

0.478

Combining classifiers and voting

Combining classifiers is one of the best ways to improve classification performance, and voting is one of the best ways to combine multiple classifiers. For voting we need to have an odd number of classifiers. In the following Python recipe we are going to combine three classifiers, namely the NaiveBayesClassifier, DecisionTreeClassifier and MaxentClassifier classes.

To achieve this, we are going to define a class named Voting_classifiers, saved in a file named vote_classification.py, as follows.

import itertools
from nltk.classify import ClassifierI
from nltk.probability import FreqDist

class Voting_classifiers(ClassifierI):
   def __init__(self, *classifiers):
      self._classifiers = classifiers
      # Collect the union of all labels known to the wrapped classifiers.
      self._labels = sorted(set(itertools.chain(*[c.labels() for c in classifiers])))
   def labels(self):
      return self._labels
   def classify(self, feats):
      # Each classifier casts one vote; the most common label wins.
      counts = FreqDist()
      for classifier in self._classifiers:
         counts[classifier.classify(feats)] += 1
      return counts.max()

Let us use this class to combine our three classifiers and find the accuracy −

from vote_classification import Voting_classifiers
combined_classifier = Voting_classifiers(NBC, decisiont_classifier, maxent_classifier)
combined_classifier.labels()

Output

['neg', 'pos']

Example

accuracy(combined_classifier, test_feats)

Output

0.948

From the above output, we can see that the combined classifier achieved higher accuracy than the individual classifiers.