Artificial Intelligence With Python Tutorial

AI with Python – Supervised Learning: Classification

In this chapter, we will focus on implementing supervised learning − classification.

The classification technique or model attempts to draw conclusions from observed values. In classification problems, the output is categorical, such as “Black” or “White”, or “Teaching” and “Non-Teaching”. To build a classification model, we need a training dataset that contains data points and their corresponding labels. For example, suppose we want to check whether an image is of a car or not. To check this, we would build a training dataset with two classes, “car” and “no car”, and then train the model using those training samples. Classification models are mainly used in face recognition, spam identification, etc.

Steps for Building a Classifier in Python

For building a classifier in Python, we are going to use Python 3 and Scikit-learn, a tool for machine learning. Follow these steps to build a classifier in Python −

Step 1 − Import Scikit-learn

This is the very first step for building a classifier in Python. In this step, we will install a Python package called Scikit-learn, which is one of the best machine learning modules in Python. Once installed, the following command will help us import the package −

import sklearn

Step 2 − Import Scikit-learn’s dataset

In this step, we can begin working with the dataset for our machine learning model. Here, we are going to use the Breast Cancer Wisconsin Diagnostic Database. The dataset includes various information about breast cancer tumors, as well as classification labels of malignant or benign. It has 569 instances, or data points, on 569 tumors and includes information on 30 attributes, or features, such as the radius of the tumor, texture, smoothness, and area. With the help of the following command, we can import Scikit-learn’s breast cancer dataset −

from sklearn.datasets import load_breast_cancer

Now, the following command will load the dataset.

data = load_breast_cancer()

Following is a list of important dictionary keys −

  1. Classification label names (target_names)

  2. The actual labels (target)

  3. The attribute/feature names (feature_names)

  4. The attribute values (data)
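
load_breast_cancer() returns a dictionary-like Bunch object, so as a quick sanity check (a small sketch, not part of the original walkthrough) you can list all of its keys directly −

print(data.keys())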

Now, with the help of the following command, we can create new variables for each important set of information and assign the data. In other words, we can organize the data with the following commands −

label_names = data['target_names']
labels = data['target']
feature_names = data['feature_names']
features = data['data']

Now, to make it clearer, we can print the class labels, the first data instance’s label, our feature names and the feature values with the help of the following commands −

print(label_names)

The above command will print the class names, which are malignant and benign. They are shown in the output below −

['malignant' 'benign']

Now, the command below will show that they are mapped to binary values 0 and 1. Here 0 represents malignant cancer and 1 represents benign cancer. You will receive the following output −

print(labels[0])
0

The two commands given below will produce the feature names and feature values.

print(feature_names[0])
mean radius
print(features[0])
[ 1.79900000e+01 1.03800000e+01 1.22800000e+02 1.00100000e+03
  1.18400000e-01 2.77600000e-01 3.00100000e-01 1.47100000e-01
  2.41900000e-01 7.87100000e-02 1.09500000e+00 9.05300000e-01
  8.58900000e+00 1.53400000e+02 6.39900000e-03 4.90400000e-02
  5.37300000e-02 1.58700000e-02 3.00300000e-02 6.19300000e-03
  2.53800000e+01 1.73300000e+01 1.84600000e+02 2.01900000e+03
  1.62200000e-01 6.65600000e-01 7.11900000e-01 2.65400000e-01
  4.60100000e-01 1.18900000e-01]

From the above output, we can see that the first data instance is a malignant tumor whose mean radius is 1.79900000e+01.

Step 3 − Organizing data into sets

In this step, we will divide our data into two parts, namely a training set and a test set. Splitting the data into these sets is very important because we have to test our model on unseen data. To split the data, sklearn has a function called train_test_split(). With the help of the following commands, we can split the data into these sets −

from sklearn.model_selection import train_test_split

The above command will import the train_test_split function from sklearn, and the command below will split the data into training and test data. In the example given below, we are using 40% of the data for testing and the remaining data for training the model.

train, test, train_labels, test_labels = train_test_split(features, labels, test_size = 0.40, random_state = 42)

Step 4 − Building the model

In this step, we will be building our model. We are going to use the Naïve Bayes algorithm for building the model. The following commands can be used to build the model −

from sklearn.naive_bayes import GaussianNB

The above command will import the GaussianNB module. Now, the following command will help you initialize the model.

gnb = GaussianNB()

We will train the model by fitting it to the data using gnb.fit().

model = gnb.fit(train, train_labels)

Step 5 − Evaluating the model and its accuracy

In this step, we are going to evaluate the model by making predictions on our test data, and then we will also find out its accuracy. For making predictions, we will use the predict() function. The following commands will help you do this −

preds = gnb.predict(test)
print(preds)

[1 0 0 1 1 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1
 0 1 1 1 1 1 1 0 1 0 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0
 0 1 1 0 0 1 1 1 0 0 1 1 0 0 1 0 1 1 1 1 1 1 0 1 1 0 0 0 0
 0 1 1 1 1 1 1 1 1 0 0 1 0 0 1 0 0 1 1 1 0 1 1 0 1 1 0 0 0
 1 1 1 0 0 1 1 0 1 0 0 1 1 0 0 0 1 1 1 0 1 1 0 0 1 0 1 1 0
 1 0 0 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0
 1 1 0 1 1 1 1 1 1 0 0 0 1 1 0 1 0 1 1 1 1 0 1 1 0 1 1 1 0
 1 0 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 0 0 1 1 0 1]

The series of 0s and 1s above contains the predicted values for the two tumor classes, malignant and benign.
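
To read these predictions as class names rather than binary values, one option (a small sketch outside the original walkthrough) is to index the label_names array created earlier with the prediction array −

print(label_names[preds])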

Now, by comparing the two arrays namely test_labels and preds, we can find out the accuracy of our model. We are going to use the accuracy_score() function to determine the accuracy. Consider the following command for this −

from sklearn.metrics import accuracy_score
print(accuracy_score(test_labels,preds))
0.951754385965

The result shows that the Naïve Bayes classifier is 95.17% accurate.

In this way, with the help of the above steps, we can build our classifier in Python.

Building a Classifier in Python

In this section, we will learn how to build a classifier in Python.

Naïve Bayes Classifier

Naïve Bayes is a classification technique used to build classifiers using Bayes’ theorem. The assumption is that the predictors are independent. In simple words, it assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For building a Naïve Bayes classifier, we need to use the Python library called scikit-learn. There are three types of Naïve Bayes models, named Gaussian, Multinomial and Bernoulli, under the scikit-learn package.
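
All three variants live in sklearn.naive_bayes and share the same fit/predict interface; a minimal import sketch −

from sklearn.naive_bayes import GaussianNB      # suited to continuous features
from sklearn.naive_bayes import MultinomialNB   # suited to count features
from sklearn.naive_bayes import BernoulliNB     # suited to binary features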

To build a Naïve Bayes machine learning classifier model, we need the following −

Dataset

We are going to use the dataset named Breast Cancer Wisconsin Diagnostic Database. The dataset includes various information about breast cancer tumors, as well as classification labels of malignant or benign. It has 569 instances, or data points, on 569 tumors and includes information on 30 attributes, or features, such as the radius of the tumor, texture, smoothness, and area. We can import this dataset from the sklearn package.

Naïve Bayes Model

For building a Naïve Bayes classifier, we need a Naïve Bayes model. As mentioned earlier, there are three types of Naïve Bayes models, named Gaussian, Multinomial and Bernoulli, under the scikit-learn package. Here, in the following example, we are going to use the Gaussian Naïve Bayes model.

By using the above, we are going to build a Naïve Bayes machine learning model that uses the tumor information to predict whether a tumor is malignant or benign.

To begin with, we need to install the sklearn module on our system. Once installed, it can be imported with the help of the following command −

import sklearn

Now, we need to import the dataset named Breast Cancer Wisconsin Diagnostic Database.

from sklearn.datasets import load_breast_cancer

Now, the following command will load the dataset.

data = load_breast_cancer()

The data can be organized as follows −

label_names = data['target_names']
labels = data['target']
feature_names = data['feature_names']
features = data['data']

Now, to make it clearer, we can print the class labels, the first data instance’s label, our feature names and the feature values with the help of the following commands −

print(label_names)

The above command will print the class names, which are malignant and benign. They are shown in the output below −

['malignant' 'benign']

Now, the command given below will show that they are mapped to binary values 0 and 1. Here, 0 represents malignant cancer and 1 represents benign cancer. The output is shown below −

print(labels[0])
0

The following two commands will produce the feature names and feature values.

print(feature_names[0])
mean radius
print(features[0])

[ 1.79900000e+01 1.03800000e+01 1.22800000e+02 1.00100000e+03
  1.18400000e-01 2.77600000e-01 3.00100000e-01 1.47100000e-01
  2.41900000e-01 7.87100000e-02 1.09500000e+00 9.05300000e-01
  8.58900000e+00 1.53400000e+02 6.39900000e-03 4.90400000e-02
  5.37300000e-02 1.58700000e-02 3.00300000e-02 6.19300000e-03
  2.53800000e+01 1.73300000e+01 1.84600000e+02 2.01900000e+03
  1.62200000e-01 6.65600000e-01 7.11900000e-01 2.65400000e-01
  4.60100000e-01 1.18900000e-01]

From the above output, we can see that the first data instance is a malignant tumor whose mean radius is 1.79900000e+01.

For testing our model on unseen data, we need to split our data into training and testing data. It can be done with the help of the following code −

from sklearn.model_selection import train_test_split

The above command will import the train_test_split function from sklearn, and the command below will split the data into training and test data. In the example below, we are using 40% of the data for testing and the remaining data for training the model.

train, test, train_labels, test_labels = train_test_split(features, labels, test_size = 0.40, random_state = 42)

Now, we are building the model with the following commands −

from sklearn.naive_bayes import GaussianNB

The above command will import the GaussianNB module. Now, with the command given below, we need to initialize the model.

gnb = GaussianNB()

We will train the model by fitting it to the data using gnb.fit().

model = gnb.fit(train, train_labels)

Now, evaluate the model by making predictions on the test data. It can be done as follows −

preds = gnb.predict(test)
print(preds)

[1 0 0 1 1 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1
 0 1 1 1 1 1 1 0 1 0 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0
 0 1 1 0 0 1 1 1 0 0 1 1 0 0 1 0 1 1 1 1 1 1 0 1 1 0 0 0 0
 0 1 1 1 1 1 1 1 1 0 0 1 0 0 1 0 0 1 1 1 0 1 1 0 1 1 0 0 0
 1 1 1 0 0 1 1 0 1 0 0 1 1 0 0 0 1 1 1 0 1 1 0 0 1 0 1 1 0
 1 0 0 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0
 1 1 0 1 1 1 1 1 1 0 0 0 1 1 0 1 0 1 1 1 1 0 1 1 0 1 1 1 0
 1 0 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 0 0 1 1 0 1]

The series of 0s and 1s above contains the predicted values for the tumor classes, i.e., malignant and benign.

Now, by comparing the two arrays namely test_labels and preds, we can find out the accuracy of our model. We are going to use the accuracy_score() function to determine the accuracy. Consider the following command −

from sklearn.metrics import accuracy_score
print(accuracy_score(test_labels,preds))
0.951754385965

The result shows that the Naïve Bayes classifier is 95.17% accurate.

That was a machine learning classifier based on the Gaussian Naïve Bayes model.

Support Vector Machines (SVM)

Basically, a Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both regression and classification. The main concept of SVM is to plot each data item as a point in n-dimensional space, with the value of each feature being the value of a particular coordinate. Here, n is the number of features we have. Following is a simple graphical representation to understand the concept of SVM −

(Figure: support vector machine)

In the above diagram, we have two features. Hence, we first need to plot these two variables in two-dimensional space, where each point has two coordinates. A line splits the data into two different classified groups; the data points closest to this line are called support vectors. This line is the classifier.

Here, we are going to build an SVM classifier by using scikit-learn and the iris dataset. The scikit-learn library has the sklearn.svm module and provides sklearn.svm.SVC for classification. The SVM classifier to predict the class of the iris plant based on 4 features is shown below.

Dataset

We will use the iris dataset, which contains 3 classes of 50 instances each, where each class refers to a type of iris plant. Each instance has four features, namely sepal length, sepal width, petal length and petal width.

Kernel

This is a technique used by SVM. Basically, kernels are functions which take a low-dimensional input space and transform it into a higher-dimensional space. They convert a non-separable problem into a separable problem. The kernel function can be any one among linear, polynomial, rbf and sigmoid. In this example, we will use the linear kernel.
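
To illustrate, the kernel is simply a constructor argument of sklearn.svm.SVC. The following short, self-contained sketch (not part of the original example) compares the four options on the iris data −

from sklearn import svm, datasets

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target

# Only the kernel argument changes; the rest of the interface stays the same.
for kernel in ('linear', 'poly', 'rbf', 'sigmoid'):
   clf = svm.SVC(kernel = kernel).fit(X, y)
   print(kernel, clf.score(X, y))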

Let us now import the following packages −

import pandas as pd
import numpy as np
from sklearn import svm, datasets
import matplotlib.pyplot as plt

Now, load the input data −

iris = datasets.load_iris()

We are taking the first two features −

X = iris.data[:, :2]
y = iris.target

We will plot the support vector machine boundaries together with the original data. For this, we are creating a mesh to plot on.

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
h = (x_max - x_min) / 100   # mesh step size based on the feature range
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
X_plot = np.c_[xx.ravel(), yy.ravel()]

We need to give the value of the regularization parameter.

C = 1.0

We need to create the SVM classifier object.

svc_classifier = svm.SVC(kernel = 'linear',
C = C, decision_function_shape = 'ovr').fit(X, y)
Z = svc_classifier.predict(X_plot)
Z = Z.reshape(xx.shape)
plt.figure(figsize = (15, 5))
plt.subplot(121)
plt.contourf(xx, yy, Z, cmap = plt.cm.tab10, alpha = 0.3)
plt.scatter(X[:, 0], X[:, 1], c = y, cmap = plt.cm.Set1)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.title('SVC with linear kernel')
plt.show()
(Figure: SVC with linear kernel)

Logistic Regression

Basically, the logistic regression model is a member of the supervised classification algorithm family. Logistic regression measures the relationship between dependent and independent variables by estimating probabilities using a logistic function.

Here, the dependent variable is the target class variable we are going to predict, while the independent variables are the features we are going to use to predict the target class.

In logistic regression, estimating the probabilities means predicting the likelihood of the event occurring. For example, a shop owner would like to predict whether a customer entering the shop will buy, say, a play station. There would be many features of the customer − gender, age, etc. − which the shopkeeper would observe to predict the likelihood of the event, i.e., buying a play station or not. The logistic function is the sigmoid curve that is used to build the function with various parameters.
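
As a quick aside, the logistic function itself can be written in a few lines (a minimal sketch, separate from the classifier code below) −

import numpy as np

def sigmoid(z):
   # The logistic (sigmoid) function maps any real value into (0, 1),
   # which can be interpreted as a probability.
   return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))    # 0.5, the decision boundary
print(sigmoid(4))    # close to 1
print(sigmoid(-4))   # close to 0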

Prerequisites

Before building the classifier using logistic regression, we need the Tkinter package installed on our system (it is used by matplotlib's default GUI backend). Its documentation is available at https://docs.python.org/2/library/tkinter.html.

Now, with the help of the code given below, we can create a classifier using logistic regression −

First, we will import some packages −

import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt

Now, we need to define the sample data, which can be done as follows −

X = np.array([[2, 4.8], [2.9, 4.7], [2.5, 5], [3.2, 5.5], [6, 5], [7.6, 4],
              [3.2, 0.9], [2.9, 1.9],[2.4, 3.5], [0.5, 3.4], [1, 4], [0.9, 5.9]])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])

Next, we need to create the logistic regression classifier, which can be done as follows −

Classifier_LR = linear_model.LogisticRegression(solver = 'liblinear', C = 75)

Last but not least, we need to train this classifier −

Classifier_LR.fit(X, y)

Now, how can we visualize the output? It can be done by creating a function named Logistic_visualize() −

def Logistic_visualize(Classifier_LR, X, y):
   min_x, max_x = X[:, 0].min() - 1.0, X[:, 0].max() + 1.0
   min_y, max_y = X[:, 1].min() - 1.0, X[:, 1].max() + 1.0

In the above lines, we defined the minimum and maximum X and Y values to be used in the mesh grid. In addition, we will define the step size for plotting the mesh grid.

mesh_step_size = 0.02

Let us define the mesh grid of X and Y values as follows −

x_vals, y_vals = np.meshgrid(np.arange(min_x, max_x, mesh_step_size),
                 np.arange(min_y, max_y, mesh_step_size))

With the help of the following code, we can run the classifier on the mesh grid −

output = Classifier_LR.predict(np.c_[x_vals.ravel(), y_vals.ravel()])
output = output.reshape(x_vals.shape)
plt.figure()
plt.pcolormesh(x_vals, y_vals, output, cmap = plt.cm.gray)

plt.scatter(X[:, 0], X[:, 1], c = y, s = 75, edgecolors = 'black',
linewidth = 1, cmap = plt.cm.Paired)

The following lines of code will specify the boundaries of the plot −

plt.xlim(x_vals.min(), x_vals.max())
plt.ylim(y_vals.min(), y_vals.max())
plt.xticks((np.arange(int(X[:, 0].min() - 1), int(X[:, 0].max() + 1), 1.0)))
plt.yticks((np.arange(int(X[:, 1].min() - 1), int(X[:, 1].max() + 1), 1.0)))
plt.show()

Now, after running the code, we will get the following output for the logistic regression classifier −

(Figure: logistic regression classifier output)

Decision Tree Classifier

A decision tree is basically a binary tree flowchart where each node splits a group of observations according to some feature variable.

Here, we are building a decision tree classifier for predicting male or female. We will take a very small dataset having 19 samples. These samples consist of two features, ‘height’ and ‘length of hair’.

Prerequisite

For building the following classifier, we need to install pydotplus and graphviz. Basically, graphviz is a tool for drawing graphics using dot files, and pydotplus is a module for Graphviz’s Dot language. They can be installed with the package manager or pip.

Now, we can build the decision tree classifier with the help of the following Python code −

To begin with, let us import some important libraries as follows −

import pydotplus
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split   # replaces the deprecated cross_validation module
import collections

Now, we need to provide the dataset as follows −

X = [[165,19],[175,32],[136,35],[174,65],[141,28],[176,15],[131,32],
[166,6],[128,32],[179,10],[136,34],[186,2],[126,25],[176,28],[112,38],
[169,9],[171,36],[116,25],[196,25]]

Y = ['Man','Woman','Woman','Man','Woman','Man','Woman','Man','Woman',
'Man','Woman','Man','Woman','Woman','Woman','Man','Woman','Woman','Man']
data_feature_names = ['height','length of hair']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.40, random_state = 5)

After providing the dataset, we need to fit the model which can be done as follows −

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X,Y)

Prediction can be made with the help of the following Python code −

prediction = clf.predict([[133,37]])
print(prediction)

We can visualize the decision tree with the help of the following Python code −

dot_data = tree.export_graphviz(clf, feature_names = data_feature_names,
            out_file = None, filled = True, rounded = True)
graph = pydotplus.graph_from_dot_data(dot_data)
colors = ('orange', 'yellow')
edges = collections.defaultdict(list)

for edge in graph.get_edge_list():
   edges[edge.get_source()].append(int(edge.get_destination()))

for edge in edges:
   edges[edge].sort()
   for i in range(2):
      dest = graph.get_node(str(edges[edge][i]))[0]
      dest.set_fillcolor(colors[i])

graph.write_png('Decisiontree16.png')

It will give the prediction for the above code as [‘Woman’] and create the following decision tree −

(Figure: decision tree)

We can change the values of the features in the prediction to test the classifier.
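
For example, with a hypothetical sample of our own −

prediction = clf.predict([[180, 10]])
print(prediction)   # a tall sample with short hair; for this training data it is expected to print ['Man']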

Random Forest Classifier

As we know, ensemble methods are methods which combine machine learning models into a more powerful machine learning model. Random Forest, a collection of decision trees, is one of them. It is better than a single decision tree because, while retaining the predictive power, it can reduce over-fitting by averaging the results. Here, we are going to implement the random forest model on the scikit-learn cancer dataset.

Import the necessary packages −

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
import numpy as np

Now, we need to provide the dataset, which can be done as follows −

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state = 0)

After providing the dataset, we need to fit the model which can be done as follows −

forest = RandomForestClassifier(n_estimators = 50, random_state = 0)
forest.fit(X_train,y_train)

Now, get the accuracy on the training as well as the testing subset. Increasing the number of estimators tends to increase the accuracy on the testing subset as well.

print('Accuracy on the training subset: {:.3f}'.format(forest.score(X_train, y_train)))
print('Accuracy on the testing subset: {:.3f}'.format(forest.score(X_test, y_test)))

Output

Accuracy on the training subset: 1.000
Accuracy on the testing subset: 0.965
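
To see this effect directly, here is a small sketch (reusing the train/test split above; not part of the original example) that sweeps over several values of n_estimators −

for n in (5, 20, 50, 100):
   forest = RandomForestClassifier(n_estimators = n, random_state = 0)
   forest.fit(X_train, y_train)
   print(n, 'estimators, accuracy on the testing subset: {:.3f}'.format(forest.score(X_test, y_test)))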

Now, like the decision tree, the random forest has the feature_importances_ attribute, which gives a better view of feature weights than the decision tree. It can be plotted and visualized as follows −

n_features = cancer.data.shape[1]
plt.barh(range(n_features),forest.feature_importances_, align='center')
plt.yticks(np.arange(n_features),cancer.feature_names)
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.show()
(Figure: feature importance plot)

Performance of a Classifier

After implementing a machine learning algorithm, we need to find out how effective the model is. The criteria for measuring effectiveness may be based upon datasets and metrics. For evaluating different machine learning algorithms, we can use different performance metrics. For example, if a classifier is used to distinguish between images of different objects, we can use classification performance metrics such as average accuracy, AUC, etc. The metric we choose to evaluate our machine learning model is very important, because the choice of metric influences how the performance of a machine learning algorithm is measured and compared. Following are some of the metrics −

Confusion Matrix

Basically, it is used for classification problems where the output can be of two or more types of classes. It is the easiest way to measure the performance of a classifier. A confusion matrix is basically a table with two dimensions, namely “Actual” and “Predicted”. Both dimensions have “True Positives (TP)”, “True Negatives (TN)”, “False Positives (FP)” and “False Negatives (FN)”.

(Figure: confusion matrix)

In the confusion matrix above, 1 is for the positive class and 0 is for the negative class.

Following are the terms associated with the confusion matrix −

  1. True Positives − TPs are the cases when the actual class of the data point was 1 and the predicted class is also 1.

  2. True Negatives − TNs are the cases when the actual class of the data point was 0 and the predicted class is also 0.

  3. False Positives − FPs are the cases when the actual class of the data point was 0 and the predicted class is 1.

  4. False Negatives − FNs are the cases when the actual class of the data point was 1 and the predicted class is 0.
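
scikit-learn can compute this table directly. The sketch below reuses the test_labels and preds arrays from the Naïve Bayes example earlier −

from sklearn.metrics import confusion_matrix

# Rows correspond to the actual classes, columns to the predicted classes.
print(confusion_matrix(test_labels, preds))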

Accuracy

The confusion matrix itself is not a performance measure as such, but almost all performance metrics are based on the confusion matrix. One of them is accuracy. In classification problems, it may be defined as the number of correct predictions made by the model over all predictions made. The formula for calculating the accuracy is as follows −

Accuracy = \frac{TP+TN}{TP+FP+FN+TN}

Precision

It is mostly used in document retrieval. It may be defined as how many of the returned documents are correct. Following is the formula for calculating the precision −

Precision = \frac{TP}{TP+FP}

Recall or Sensitivity

It may be defined as how many of the positives the model returns. Following is the formula for calculating the recall/sensitivity of the model −

Recall = \frac{TP}{TP+FN}

Specificity

It may be defined as how many of the negatives the model returns. It is exactly the opposite of recall. Following is the formula for calculating the specificity of the model −

Specificity = \frac{TN}{TN+FP}
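
All three of these quantities can also be computed with scikit-learn. The sketch below reuses test_labels and preds from earlier; specificity has no dedicated function, but it equals recall computed with the negative class treated as positive −

from sklearn.metrics import precision_score, recall_score

print(precision_score(test_labels, preds))              # TP / (TP + FP)
print(recall_score(test_labels, preds))                 # TP / (TP + FN)
print(recall_score(test_labels, preds, pos_label = 0))  # specificity: TN / (TN + FP)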

Class Imbalance Problem

Class imbalance is the scenario where the number of observations belonging to one class is significantly lower than the number belonging to the other classes. For example, this problem is prominent in scenarios where we need to identify rare diseases, fraudulent transactions in a bank, etc.

Example of imbalanced classes

Let us consider an example of fraud detection data set to understand the concept of imbalanced class −

Total observations = 5000
Fraudulent Observations = 50
Non-Fraudulent Observations = 4950
Event Rate = 1%

Solution

Balancing the classes acts as a solution to imbalanced classes. The main objective of balancing the classes is to either increase the frequency of the minority class or decrease the frequency of the majority class. Following are the approaches to solve the issue of imbalanced classes −

Re-Sampling

Re-sampling is a series of methods used to reconstruct the sample datasets − both training sets and testing sets. Re-sampling is done to improve the accuracy of the model. Following are some re-sampling techniques −

  1. Random Under-Sampling − This technique aims to balance class distribution by randomly eliminating majority class examples. This is done until the majority and minority class instances are balanced out.

Total observations = 5000
Fraudulent Observations = 50
Non-Fraudulent Observations = 4950
Event Rate = 1%

In this case, we are taking 10% samples without replacement from non-fraud instances and then combine them with the fraud instances −

Non-fraudulent observations after random under sampling = 10% of 4950 = 495

Total observations after combining them with fraudulent observations = 50+495 = 545

Hence, the event rate for the new dataset after under-sampling = 50/545 ≈ 9%

The main advantage of this technique is that it can reduce run time and improve storage. But on the other hand, it can discard useful information while reducing the number of training data samples.
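
A minimal NumPy sketch of random under-sampling on the fraud example above, where X and y are hypothetical feature and label arrays with label 1 marking fraud −

import numpy as np

rng = np.random.RandomState(42)
fraud_idx = np.where(y == 1)[0]      # the 50 minority instances
nonfraud_idx = np.where(y == 0)[0]   # the 4950 majority instances

# Keep 10% of the majority class, sampled without replacement.
kept = rng.choice(nonfraud_idx, size = int(0.1 * len(nonfraud_idx)), replace = False)
balanced_idx = np.concatenate([fraud_idx, kept])
X_balanced, y_balanced = X[balanced_idx], y[balanced_idx]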

  2. Random Over-Sampling − This technique aims to balance class distribution by increasing the number of instances in the minority class by replicating them.

Total observations = 5000
Fraudulent Observations = 50
Non-Fraudulent Observations = 4950
Event Rate = 1%

If we replicate the 50 fraudulent observations 30 times, the number of fraudulent observations after replicating the minority class observations would be 1500. The total observations in the new data after oversampling would then be 4950 + 1500 = 6450. Hence the event rate for the new dataset would be 1500/6450 = 23%.

The main advantage of this method is that there is no loss of useful information. But on the other hand, it has an increased chance of over-fitting because it replicates the minority class events.
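
Similarly, a sketch of random over-sampling by replicating minority instances (again with hypothetical arrays X and y) −

import numpy as np

rng = np.random.RandomState(42)
fraud_idx = np.where(y == 1)[0]

# Draw minority indices with replacement until we have 30 times as many.
extra = rng.choice(fraud_idx, size = 30 * len(fraud_idx), replace = True)
over_idx = np.concatenate([np.where(y == 0)[0], extra])
X_over, y_over = X[over_idx], y[over_idx]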

Ensemble Techniques

This methodology is basically used to modify existing classification algorithms to make them appropriate for imbalanced datasets. In this approach, we construct several two-stage classifiers from the original data and then aggregate their predictions. The random forest classifier is an example of an ensemble-based classifier.
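
As a rough illustration of the aggregate-their-predictions idea (a sketch of the general approach, not a specific published algorithm), one can train several classifiers on independently under-sampled subsets and take a majority vote −

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def undersample_ensemble_predict(X_train, y_train, X_test, n_models = 10, seed = 0):
   # Train one tree per balanced subset, then majority-vote the predictions.
   rng = np.random.RandomState(seed)
   minority = np.where(y_train == 1)[0]
   majority = np.where(y_train == 0)[0]
   votes = []
   for _ in range(n_models):
      kept = rng.choice(majority, size = len(minority), replace = False)
      idx = np.concatenate([minority, kept])
      clf = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
      votes.append(clf.predict(X_test))
   return (np.mean(votes, axis = 0) >= 0.5).astype(int)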