Machine Learning: A Concise Tutorial

Machine Learning - Naive Bayes Algorithm

The Naive Bayes algorithm is a classification algorithm based on Bayes' theorem. It assumes that the features are independent of one another given the class, which is why it is called "naive." It estimates the probability that a sample belongs to a particular class from the probabilities of its individual features. For example, a phone may be considered a smartphone if it has a touch screen, internet access, and a good camera. Even if these features depend on one another in reality, the algorithm treats each of them as contributing independently to the probability that the phone is a smartphone.

In Bayesian classification, the main interest is in finding the posterior probability, i.e. the probability of a label L given some observed features, P(L | features). With the help of Bayes' theorem, we can express this in quantitative form as follows −

$$P\left ( L| features\right )=\frac{P\left ( L \right )P\left (features| L\right )}{P\left (features\right )}$$

Here,

  1. $P\left ( L| features\right )$ is the posterior probability of the class given the observed features.

  2. $P\left ( L \right )$ is the prior probability of the class.

  3. $P\left (features| L\right )$ is the likelihood, i.e. the probability of the features (predictors) given the class.

  4. $P\left (features\right )$ is the prior probability of the features (predictors), also called the evidence. It is the same for every class.
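To make the four terms concrete, here is a small worked example of Bayes' theorem for a single feature. The class names and all probabilities are hypothetical values chosen only for illustration; they do not come from any dataset.

```python
# Hypothetical numbers chosen only for illustration.
prior = {"smartphone": 0.6, "feature_phone": 0.4}        # P(L)
likelihood = {"smartphone": 0.9, "feature_phone": 0.2}   # P(touch screen | L)

# Evidence P(features): total probability of observing a touch screen.
evidence = sum(prior[c] * likelihood[c] for c in prior)

# Posterior P(L | features) for each class, via Bayes' theorem.
posterior = {c: prior[c] * likelihood[c] / evidence for c in prior}

for label, p in posterior.items():
    print(f"P({label} | touch screen) = {p:.3f}")
```

Because the evidence is shared by both classes, the posteriors sum to 1, and the class with the larger product of prior and likelihood is the more probable one.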

In the Naive Bayes algorithm, we use Bayes' theorem to calculate the probability that a sample belongs to a particular class. We estimate the probability of each feature value of the sample given the class and multiply these probabilities together to obtain the likelihood of the sample under that class. We then multiply the likelihood by the prior probability of the class. The result is proportional to the posterior probability of the class; the denominator P(features) is the same for every class, so it does not affect the comparison. We repeat this process for each class and assign the sample to the class with the highest value.
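The procedure described above can be sketched from scratch in plain Python. This is a minimal illustration rather than a production implementation: it assumes Gaussian feature likelihoods, the helper names (`gaussian_pdf`, `fit`, `predict`) and the toy data are hypothetical, and it multiplies probabilities directly to mirror the text, whereas practical implementations usually sum log-probabilities to avoid numerical underflow.

```python
import math
from collections import defaultdict

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit(X, y):
    """Estimate the prior and the per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # Small constant keeps the variance away from zero (an assumption of this sketch).
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(zip(*rows), means)]
        model[label] = (n / len(X), means, variances)
    return model

def predict(model, x):
    """Return the class with the highest prior * likelihood for sample x."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        likelihood = 1.0
        for value, m, v in zip(x, means, variances):
            likelihood *= gaussian_pdf(value, m, v)  # naive independence assumption
        scores[label] = prior * likelihood           # proportional to the posterior
    return max(scores, key=scores.get)

# Hypothetical toy data: two features, two well-separated classes.
X = [[1.0, 2.1], [1.2, 1.9], [0.9, 2.0], [3.0, 4.1], [3.2, 3.9], [2.9, 4.0]]
y = ["a", "a", "a", "b", "b", "b"]

model = fit(X, y)
print(predict(model, [1.1, 2.0]))
print(predict(model, [3.1, 4.0]))
```

The scores are not normalized by P(features); since that term is identical for all classes, the highest score still identifies the most probable class.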

Types of Naive Bayes Algorithm

There are three commonly used types of Naive Bayes algorithm −

  1. Gaussian Naive Bayes − This algorithm is used when the features are continuous variables. It assumes that, within each class, each feature follows a Gaussian (normal) distribution, i.e. a bell-shaped curve.

  2. Multinomial Naive Bayes − This algorithm is used when the features are discrete counts. It is commonly used in text classification tasks, where the features are the frequencies of words in a document.

  3. Bernoulli Naive Bayes − This algorithm is used when the features are binary variables. It is also commonly used in text classification tasks, where each feature records whether a word is present in a document or not.
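As a rough illustration of how the three variants map onto feature types, the sketch below fits each scikit-learn estimator on a tiny dataset of the matching kind. The arrays are hypothetical toy data invented for this example, and the `predictions` dictionary is only a convenience for collecting results.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Hypothetical toy data: four samples, two classes.
y = np.array([0, 0, 1, 1])

# Continuous measurements -> GaussianNB
X_continuous = np.array([[1.0, 2.1], [1.2, 1.9], [3.0, 4.1], [3.2, 3.9]])
# Word counts per document -> MultinomialNB
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
# Word present (1) or absent (0) -> BernoulliNB
X_binary = (X_counts > 0).astype(int)

predictions = {}
for model, X in [(GaussianNB(), X_continuous),
                 (MultinomialNB(), X_counts),
                 (BernoulliNB(), X_binary)]:
    model.fit(X, y)
    # Predict the class of the first training sample as a quick sanity check.
    predictions[type(model).__name__] = int(model.predict(X[:1])[0])

print(predictions)
```

All three estimators share the same `fit` and `predict` interface, so switching variants is mostly a matter of choosing the one whose assumptions match the feature type.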

Implementation in Python

Here we will implement the Gaussian Naive Bayes algorithm in Python. We will use the iris dataset, a popular dataset for classification tasks. It contains 150 samples of iris flowers, each with four features: sepal length, sepal width, petal length, and petal width. The flowers belong to three classes: setosa, versicolor, and virginica.

First, we import the necessary libraries, load the dataset, and split it into training and testing sets −

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# load the iris dataset
iris = load_iris()

# split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.35, random_state=0)

We then create an instance of the Gaussian Naive Bayes classifier and train it on the training set −

# Create a Gaussian Naive Bayes classifier
gnb = GaussianNB()

# Fit the classifier to the training data
gnb.fit(X_train, y_train)

We can now use the trained classifier to make predictions on the testing set −

# Make predictions on the testing data
y_pred = gnb.predict(X_test)
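Besides hard class labels, the fitted classifier can also report the posterior probability of each class through `predict_proba`, which connects the prediction step back to Bayes' theorem. The sketch below repeats the loading and training steps so that it runs on its own; showing only the first three test samples is an arbitrary choice for brevity.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.35, random_state=0)
gnb = GaussianNB().fit(X_train, y_train)

# One row per sample, one column per class; each row sums to 1.
proba = gnb.predict_proba(X_test[:3])
for row in proba:
    print(dict(zip(iris.target_names, row.round(3))))
```

The label returned by `predict` corresponds to the column with the highest probability in each row.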

We can evaluate the performance of the classifier by calculating its accuracy, i.e. the fraction of test samples that are classified correctly −

# Calculate the accuracy of the classifier
accuracy = np.sum(y_pred == y_test) / len(y_test)
print("Accuracy:", accuracy)
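The manual accuracy computation can be cross-checked against scikit-learn's `accuracy_score`, and a confusion matrix shows which classes are mistaken for one another. This is a self-contained sketch that repeats the earlier steps; the variable names `manual_accuracy`, `library_accuracy`, and `cm` are introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.35, random_state=0)
y_pred = GaussianNB().fit(X_train, y_train).predict(X_test)

# Fraction of correct predictions, computed two ways.
manual_accuracy = np.sum(y_pred == y_test) / len(y_test)
library_accuracy = accuracy_score(y_test, y_pred)
print("Manual accuracy: ", manual_accuracy)
print("Library accuracy:", library_accuracy)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
print(cm)
```

The diagonal of the confusion matrix counts correct predictions, so its sum divided by the total number of test samples equals the accuracy.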

Complete Implementation Example

Given below is the complete implementation of the Gaussian Naive Bayes classifier in Python using the iris dataset −

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Load the iris dataset
iris = load_iris()

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.35, random_state=0)

# Create a Gaussian Naive Bayes classifier
gnb = GaussianNB()

# Fit the classifier to the training data
gnb.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = gnb.predict(X_test)

# Calculate the accuracy of the classifier
accuracy = np.sum(y_pred == y_test) / len(y_test)
print("Accuracy:", accuracy)

When you execute this program, it produces the following output −

Accuracy: 0.9622641509433962