Machine Learning With Python: A Concise Tutorial

Classification Algorithms - Naïve Bayes

Introduction to Naïve Bayes Algorithm

Naïve Bayes is a classification technique based on applying Bayes’ theorem with the strong assumption that all predictors are independent of each other. In simple words, the assumption is that the presence of a feature in a class is independent of the presence of any other feature in the same class. For example, a phone may be considered smart if it has a touch screen, internet access, a good camera, and so on. Even though these features may depend on each other, they are treated as contributing independently to the probability that the phone is a smartphone.

In Bayesian classification, the main interest is to find the posterior probability, i.e. the probability of a label given some observed features, 𝑃(𝐿 | 𝑓𝑒𝑎𝑡𝑢𝑟𝑒𝑠). With the help of Bayes’ theorem, we can express this in quantitative form as follows −

𝑃(𝐿 | 𝑓𝑒𝑎𝑡𝑢𝑟𝑒𝑠) = 𝑃(𝐿) 𝑃(𝑓𝑒𝑎𝑡𝑢𝑟𝑒𝑠 | 𝐿) / 𝑃(𝑓𝑒𝑎𝑡𝑢𝑟𝑒𝑠)

Here, 𝑃(𝐿 | 𝑓𝑒𝑎𝑡𝑢𝑟𝑒𝑠) is the posterior probability of the class.

𝑃(𝐿) is the prior probability of the class.

𝑃(𝑓𝑒𝑎𝑡𝑢𝑟𝑒𝑠 | 𝐿) is the likelihood, i.e. the probability of the predictors given the class.

𝑃(𝑓𝑒𝑎𝑡𝑢𝑟𝑒𝑠) is the prior probability of the predictors (the evidence).
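To make the four terms concrete, here is a minimal sketch in plain Python that reuses the smartphone example. All probabilities, and the names `prior`, `feature_likelihood`, and `posterior`, are made up for illustration and do not come from real data. The likelihood for each label is formed as a product of per-feature probabilities, which is exactly the naïve independence assumption.

```python
# Hypothetical numbers, chosen only to illustrate the formula.
prior = {"smart": 0.4, "basic": 0.6}                      # P(L)

# P(feature present | L) for each observed feature.
feature_likelihood = {
    "smart": {"touch_screen": 0.9, "good_camera": 0.8},
    "basic": {"touch_screen": 0.2, "good_camera": 0.3},
}

# Naive assumption: P(features | L) is the product of the per-feature terms.
likelihood = {}
for label, per_feature in feature_likelihood.items():
    product = 1.0
    for p in per_feature.values():
        product *= p
    likelihood[label] = product

# P(features) = sum over labels of P(features | L) * P(L)
evidence = sum(likelihood[label] * prior[label] for label in prior)

# P(L | features) = P(L) * P(features | L) / P(features)
posterior = {label: prior[label] * likelihood[label] / evidence for label in prior}

for label, p in posterior.items():
    print(label, round(p, 3))
```

The posteriors sum to 1 because dividing by the evidence normalises them. For picking the most probable label alone, the evidence term could be dropped, since it is the same for every label.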

Building a model using Naïve Bayes in Python

Scikit-learn is a widely used Python library for building Naïve Bayes models. Among others, it provides the following three types of Naïve Bayes model −

Gaussian Naïve Bayes

It is the simplest Naïve Bayes classifier. It assumes that the data for each label is drawn from a simple Gaussian distribution.
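As a rough illustration of what this assumption means in practice, the sketch below scores a point by hand with NumPy: for each label it estimates a per-feature mean and variance, then adds the log prior to the sum of per-feature Gaussian log densities. The toy data and the helper name `gaussian_nb_log_posterior` are hypothetical, and this is a simplified sketch rather than scikit-learn's implementation.

```python
import numpy as np

# Tiny made-up dataset: two features, two labels.
X_toy = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
                  [4.0, 5.0], [4.2, 4.8], [3.8, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

def gaussian_nb_log_posterior(X_train, y_train, x):
    """Return the labels and the unnormalised log posterior of x for each label."""
    labels = np.unique(y_train)
    scores = []
    for label in labels:
        rows = X_train[y_train == label]
        mean = rows.mean(axis=0)
        var = rows.var(axis=0) + 1e-9       # small constant avoids division by zero
        log_prior = np.log(len(rows) / len(X_train))
        # Independence assumption: sum the per-feature Gaussian log densities.
        log_likelihood = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
        scores.append(log_prior + log_likelihood)
    return labels, np.array(scores)

toy_labels, toy_scores = gaussian_nb_log_posterior(X_toy, y_toy, np.array([1.1, 2.1]))
predicted = toy_labels[np.argmax(toy_scores)]
print(predicted)
```

Working in log space avoids numerical underflow when many small densities are multiplied together.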

Multinomial Naïve Bayes

Another useful Naïve Bayes classifier is Multinomial Naïve Bayes, in which the features are assumed to be drawn from a simple multinomial distribution. This kind of Naïve Bayes is most appropriate for features that represent discrete counts.
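A minimal sketch of count features with scikit-learn's `MultinomialNB` might look as follows. The word-count matrix, the column meanings, and the labels are invented for illustration only.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word counts per document; columns: "free", "meeting", "offer".
X_counts = np.array([[3, 0, 2],
                     [2, 0, 3],
                     [0, 3, 0],
                     [0, 2, 1]])
y_counts = np.array([1, 1, 0, 0])   # made-up labels: 1 = spam, 0 = not spam

model_MNB = MultinomialNB()
model_MNB.fit(X_counts, y_counts)

# Two new documents expressed with the same three count features.
X_new_counts = np.array([[2, 0, 1],
                         [0, 4, 0]])
pred_counts = model_MNB.predict(X_new_counts)
print(pred_counts)
```

The first new document is dominated by "free" and "offer" and is assigned label 1; the second contains only "meeting" and is assigned label 0.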

Bernoulli Naïve Bayes

Another important model is Bernoulli Naïve Bayes, in which the features are assumed to be binary (0s and 1s). Text classification with a ‘bag of words’ model is one application of Bernoulli Naïve Bayes.
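The binary case can be sketched with scikit-learn's `BernoulliNB`, where each feature records only whether a word is present. The data below is hypothetical and only meant to show the shape of the input.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical presence/absence features;
# columns: contains "free", contains "meeting", contains "offer".
X_binary = np.array([[1, 0, 1],
                     [1, 0, 1],
                     [0, 1, 0],
                     [0, 1, 1]])
y_binary = np.array([1, 1, 0, 0])   # made-up labels: 1 = spam, 0 = not spam

model_BNB = BernoulliNB()
model_BNB.fit(X_binary, y_binary)

X_new_binary = np.array([[1, 0, 0],
                         [0, 1, 0]])
pred_binary = model_BNB.predict(X_new_binary)
print(pred_binary)
```

Unlike the multinomial model, the Bernoulli model also takes the absence of a feature into account, so a missing word counts as evidence too.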

Example

Depending on our data set, we can choose any of the Naïve Bayes models explained above. Here, we implement a Gaussian Naïve Bayes model in Python.

We start with the required imports as follows −

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

Now, using the make_blobs() function of Scikit-learn, we can generate blobs of points with a Gaussian distribution as follows −

from sklearn.datasets import make_blobs
X, y = make_blobs(300, 2, centers=2, random_state=2, cluster_std=1.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='summer');

Next, to use the GaussianNB model, we import it, create an instance, and fit it to the data as follows −

from sklearn.naive_bayes import GaussianNB
model_GNB = GaussianNB()
model_GNB.fit(X, y);

Now, we make predictions. To do so, we first generate some new data as follows −

rng = np.random.RandomState(0)
Xnew = [-6, -14] + [14, 18] * rng.rand(2000, 2)
ynew = model_GNB.predict(Xnew)

Next, we plot the new data, coloured by predicted label, to see where the decision boundary lies −

plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='summer')
lim = plt.axis()
plt.scatter(Xnew[:, 0], Xnew[:, 1], c=ynew, s=20, cmap='summer', alpha=0.1)
plt.axis(lim);

Now, with the following lines of code, we can find the posterior probabilities of the first and second labels −

yprob = model_GNB.predict_proba(Xnew)
yprob[-10:].round(3)

Output

array([[0.998, 0.002],
       [1.   , 0.   ],
       [0.987, 0.013],
       [1.   , 0.   ],
       [1.   , 0.   ],
       [1.   , 0.   ],
       [1.   , 0.   ],
       [1.   , 0.   ],
       [0.   , 1.   ],
       [0.986, 0.014]])
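To read such an array, it helps to know that each row is a probability distribution over the labels and that the columns follow the order of the model's `classes_` attribute. The following self-contained sketch rebuilds the same model and checks both properties; the variable names `rows_sum_to_one` and `argmax_matches_predict` are introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB

# Same data and model as in the example above.
X, y = make_blobs(n_samples=300, n_features=2, centers=2,
                  random_state=2, cluster_std=1.5)
model_GNB = GaussianNB().fit(X, y)

rng = np.random.RandomState(0)
Xnew = [-6, -14] + [14, 18] * rng.rand(2000, 2)
yprob = model_GNB.predict_proba(Xnew)

# Column i of yprob holds the posterior for model_GNB.classes_[i].
print(model_GNB.classes_)

# Each row should sum to 1, since it is a distribution over the labels.
rows_sum_to_one = np.allclose(yprob.sum(axis=1), 1.0)
print(rows_sum_to_one)

# The predicted label is expected to be the one with the highest posterior.
argmax_matches_predict = np.array_equal(
    model_GNB.classes_[yprob.argmax(axis=1)], model_GNB.predict(Xnew))
print(argmax_matches_predict)
```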

Pros & Cons

Pros

The following are some pros of using Naïve Bayes classifiers −

  1. Naïve Bayes classification is easy to implement and fast.

  2. It tends to converge faster than discriminative models such as logistic regression.

  3. It requires relatively little training data.

  4. It is highly scalable: training time grows linearly with the number of predictors and data points.

  5. It can make probabilistic predictions and can handle continuous as well as discrete data.

  6. It can be used for both binary and multi-class classification problems.

Cons

The following are some cons of using Naïve Bayes classifiers −

  1. One of the most important cons of Naïve Bayes classification is its strong feature-independence assumption, because in real life it is almost impossible to have a set of features that are completely independent of each other.

  2. Another issue is the ‘zero frequency’ problem: if a categorical variable has a category that was not observed in the training data set, the Naïve Bayes model assigns it a zero probability and is unable to make a prediction.
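The ‘zero frequency’ issue can be seen in a few lines of plain Python. The sketch below also shows one common remedy that the list above does not cover, additive (Laplace) smoothing, which adds a small constant to every count. The data, the category list, and the helper `category_probability` are hypothetical.

```python
from collections import Counter

# Hypothetical training values of one categorical feature within a single class.
observed = ["red", "red", "blue", "red", "blue"]
categories = ["red", "blue", "green"]   # "green" never appears in training

counts = Counter(observed)

def category_probability(category, alpha):
    """Estimate P(category | class) with additive smoothing strength alpha."""
    return (counts[category] + alpha) / (len(observed) + alpha * len(categories))

# alpha = 0 is the raw frequency estimate; the unseen category gets probability 0,
# which would zero out the whole product of likelihoods for this class.
unsmoothed = {c: category_probability(c, alpha=0.0) for c in categories}

# alpha = 1 (Laplace smoothing) gives every category a non-zero probability.
smoothed = {c: category_probability(c, alpha=1.0) for c in categories}

print(unsmoothed["green"], smoothed["green"])
```

In scikit-learn, the `alpha` parameter of `MultinomialNB` and `BernoulliNB` plays this smoothing role.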

Applications of Naïve Bayes classification

The following are some common applications of Naïve Bayes classification −

Real-time prediction − Because it is easy to implement and fast to compute, it can be used to make predictions in real time.

Multi-class prediction − The Naïve Bayes classification algorithm can be used to predict the posterior probability of multiple classes of the target variable.

Text classification − Because it supports multi-class prediction, the Naïve Bayes classification algorithm is well suited to text classification. That is why it is also used to solve problems such as spam filtering and sentiment analysis.

Recommendation system − Together with algorithms such as collaborative filtering, Naïve Bayes can be used to build a recommendation system that filters unseen information and predicts whether a user would like a given resource or not.
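The multi-class case listed above can be sketched by fitting `GaussianNB` on three synthetic blobs instead of two. The data is generated with `make_blobs()` and the variable names are chosen here for illustration; `predict_proba()` then returns one posterior column per class.

```python
from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB

# Three synthetic clusters, one per class (illustrative data only).
X3, y3 = make_blobs(n_samples=300, n_features=2, centers=3,
                    random_state=0, cluster_std=1.0)

model_multi = GaussianNB().fit(X3, y3)

# One row per sample, one column per class.
probs3 = model_multi.predict_proba(X3[:5])
print(model_multi.classes_)
print(probs3.shape)
```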