Scikit Learn Tutorial

Scikit Learn - Classification with Naïve Bayes

Naïve Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem with the strong assumption that all the predictors are independent of each other, i.e. the presence of a feature in a class is independent of the presence of any other feature in the same class. This is a naïve assumption, which is why these methods are called naïve Bayes methods.

Bayes' theorem states the following relationship in order to find the posterior probability of a class, i.e. the probability of a label given some observed features, $P(Y \mid \text{features})$:

$$P(Y \mid \text{features}) = \frac{P(Y)\,P(\text{features} \mid Y)}{P(\text{features})}$$

Here, $P(Y \mid \text{features})$ is the posterior probability of the class.

$P(Y)$ is the prior probability of the class.

$P(\text{features} \mid Y)$ is the likelihood, i.e. the probability of the predictors given the class.

$P(\text{features})$ is the prior probability of the predictors.
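As a quick numeric illustration of how these terms combine, the posterior can be computed directly from the formula above; the prior, likelihood and evidence values below are made up for illustration, not taken from any dataset:

```python
# Bayes' theorem: P(Y | feature) = P(feature | Y) * P(Y) / P(feature)
p_y = 0.3              # prior P(Y), e.g. 30% of samples belong to class Y (assumed)
p_feat_given_y = 0.8   # likelihood P(feature | Y) (assumed)
p_feat = 0.4           # evidence P(feature) (assumed)

posterior = p_feat_given_y * p_y / p_feat
print(posterior)  # P(Y | feature) = 0.6
```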

Scikit-learn provides different naïve Bayes classifier models, namely Gaussian, Multinomial, Complement and Bernoulli. They differ mainly in the assumption they make regarding the distribution of $P(\text{features} \mid Y)$, i.e. the probability of the predictors given the class.

Sr.No Model & Description

1. Gaussian Naïve Bayes: The Gaussian Naïve Bayes classifier assumes that the data from each label is drawn from a simple Gaussian distribution.

2. Multinomial Naïve Bayes: It assumes that the features are drawn from a simple multinomial distribution.

3. Bernoulli Naïve Bayes: The assumption in this model is that the features are binary (0s and 1s) in nature. An application of Bernoulli Naïve Bayes classification is text classification with the 'bag of words' model.

4. Complement Naïve Bayes: It was designed to correct the severe assumptions made by the Multinomial Naïve Bayes classifier. This kind of NB classifier is suitable for imbalanced data sets.
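All four variants live in sklearn.naive_bayes and share the same fit/predict API; a minimal sketch of instantiating them on a tiny made-up count matrix (the data below is purely illustrative, chosen non-negative so that MultinomialNB and ComplementNB apply):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB, ComplementNB

# Tiny illustrative dataset: 4 samples, 3 count-valued features, 2 classes
X = np.array([[2, 0, 1],
              [0, 3, 0],
              [1, 0, 2],
              [0, 2, 1]])
y = np.array([0, 1, 0, 1])

# Each classifier makes a different assumption about P(features | Y),
# but all are trained and used the same way
for clf in (GaussianNB(), MultinomialNB(), BernoulliNB(), ComplementNB()):
    clf.fit(X, y)
    print(clf.__class__.__name__, clf.predict(X))
```

Note that BernoulliNB binarizes the inputs at 0 by default (its binarize parameter), so the counts above are treated as presence/absence features.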

Building Naïve Bayes Classifier

We can also apply a naïve Bayes classifier to a Scikit-learn dataset. In the example below, we apply GaussianNB and fit it on the breast_cancer dataset of Scikit-learn.

Example

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Load the breast cancer dataset
data = load_breast_cancer()
label_names = data['target_names']
labels = data['target']
feature_names = data['feature_names']
features = data['data']
print(label_names)
print(labels[0])
print(feature_names[0])
print(features[0])

# Split the data into training and test sets
train, test, train_labels, test_labels = train_test_split(
   features, labels, test_size = 0.40, random_state = 42
)

# Build a Gaussian Naive Bayes model, fit it and predict on the test set
GNBclf = GaussianNB()
model = GNBclf.fit(train, train_labels)
preds = GNBclf.predict(test)
print(preds)

Output

[
   1 0 0 1 1 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 1 1
   1 1 0 1 1 1 1 1 1 0 1 0 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1
   1 1 1 0 0 1 1 0 0 1 1 1 0 0 1 1 0 0 1 0 1 1 1 1 1 1 0
   1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 1 0 0 1 1 1 0
   1 1 0 1 1 0 0 0 1 1 1 0 0 1 1 0 1 0 0 1 1 0 0 0 1 1 1
   0 1 1 0 0 1 0 1 1 0 1 0 0 1 1 1 1 1 1 1 0 0 1 1 1 1 1
   1 1 1 1 1 1 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0 0 1 1 0
   1 0 1 1 1 1 0 1 1 0 1 1 1 0 1 0 0 1 1 1 1 1 1 1 1 0 1
   1 1 1 1 0 1 0 0 1 1 0 1
]

The above output consists of a series of 0s and 1s, which are the predicted values for the tumor classes, namely malignant (0) and benign (1).
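Beyond inspecting the raw predictions, one can also score them against the held-out test_labels; this evaluation step is not part of the example above, and uses accuracy_score from sklearn.metrics:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Reproduce the same split as in the example above
data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
   data['data'], data['target'], test_size = 0.40, random_state = 42
)

preds = GaussianNB().fit(train, train_labels).predict(test)

# Fraction of test tumors classified correctly
print(accuracy_score(test_labels, preds))
```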