Machine Learning With Python: A Concise Tutorial
Classification - Introduction
Introduction to Classification
Classification is the process of predicting a class or category from observed values or given data points. The categorized output can take forms such as “Black” or “White”, or “spam” or “no spam”.
Mathematically, classification is the task of approximating a mapping function (f) from input variables (X) to output variables (Y). It belongs to supervised machine learning, in which the targets are provided along with the input data set.
Spam detection in emails is an example of a classification problem. There are only two output categories, “spam” and “no spam”; hence this is a binary classification.
To implement this classification, we first need to train the classifier. For this example, emails labeled “spam” and “no spam” would be used as the training data. After the classifier has been trained successfully, it can be used to classify an unknown email.
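The train-then-classify workflow described above can be sketched on a toy scale. This is only an illustration under stated assumptions: the four messages, their labels and the unknown message are hypothetical, and the choice of word counts (CountVectorizer) with a multinomial Naïve Bayes model is ours, not something the tutorial prescribes.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data: four short messages with known labels.
train_emails = [
    "win a free prize now",
    "cheap loans click here",
    "meeting agenda for monday",
    "lunch with the project team",
]
train_targets = ["spam", "spam", "no spam", "no spam"]

# Turn each message into a vector of word counts.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_emails)

# Train the classifier on the labeled examples.
clf = MultinomialNB()
clf.fit(X_train, train_targets)

# Classify a previously unseen message with the trained model.
unknown = ["claim your free prize"]
prediction = clf.predict(vectorizer.transform(unknown))
print(prediction[0])
```

A real spam filter would need far more training data than this; the sketch only shows the order of the steps.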
Types of Learners in Classification
There are two types of learners for classification problems −
Lazy Learners
As the name suggests, lazy learners store the training data and wait for the testing data to appear. Classification is done only after the testing data arrives. They spend less time on training but more time on predicting. Examples of lazy learners are k-nearest neighbors and case-based reasoning.
Eager Learners
In contrast to lazy learners, eager learners construct a classification model from the training data without waiting for the testing data to appear. They spend more time on training but less time on predicting. Examples of eager learners are Decision Trees, Naïve Bayes and Artificial Neural Networks (ANN).
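The difference between the two kinds of learners can be illustrated with scikit-learn. The sketch below is a rough illustration, not a benchmark: it assumes KNeighborsClassifier as the lazy learner and DecisionTreeClassifier as the eager one, and it fits both on the full breast cancer dataset purely for brevity.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Lazy learner: fit() mostly stores the training data (possibly in an
# index structure); the neighbor search happens at prediction time.
lazy = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Eager learner: fit() builds the whole decision tree up front;
# prediction only walks the finished tree.
eager = DecisionTreeClassifier(random_state=0).fit(X, y)

lazy_preds = lazy.predict(X[:3])
eager_preds = eager.predict(X[:3])
print(lazy_preds)
print(eager_preds)
print("tree depth:", eager.get_depth())
```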
Building a Classifier in Python
Scikit-learn, a Python library for machine learning, can be used to build a classifier in Python. The steps are as follows −
Step1: Importing necessary python package
To build a classifier using scikit-learn, we need to import it. We can import it with the following script −
import sklearn
Step2: Importing dataset
After importing the necessary package, we need a dataset to build the classification model. We can import one from sklearn's datasets or use another one as required. We are going to use sklearn’s Breast Cancer Wisconsin Diagnostic Database. We can import it with the following script −
from sklearn.datasets import load_breast_cancer
The following script will load the dataset −
data = load_breast_cancer()
We also need to organize the data, which can be done with the following script −
label_names = data['target_names']
labels = data['target']
feature_names = data['feature_names']
features = data['data']
The following command will print the names of the labels, which are ‘malignant’ and ‘benign’ for our database −
print(label_names)
The output of the above command is the names of the labels −
['malignant' 'benign']
These labels are mapped to the binary values 0 and 1. Malignant cancer is represented by 0 and benign cancer is represented by 1.
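The mapping between the integer labels and the class names can be checked directly. The following is a small sketch that indexes the name array with the label array and counts the samples per class; the variable counts is introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
label_names = data['target_names']
labels = data['target']

# Map the first five integer labels back to their class names.
print(label_names[labels[:5]])

# Count how many samples fall into each class (index 0, then index 1).
counts = np.bincount(labels)
for name, count in zip(label_names, counts):
    print(name, count)
```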
The feature names and feature values can be seen with the help of the following commands −
print(feature_names[0])
The output of the above command is the name of the first feature. The feature names are the same for every sample, whether malignant or benign −
mean radius
Similarly, the name of the second feature can be printed as follows −
print(feature_names[1])
The output of the above command is the name of the second feature −
mean texture
We can print the feature values of the first sample with the help of the following command −
print(features[0])
This will give the following output −
[
1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
4.601e-01 1.189e-01
]
Similarly, we can print the feature values of the second sample with the following command −
print(features[1])
This will give the following output −
[
2.057e+01 1.777e+01 1.329e+02 1.326e+03 8.474e-02 7.864e-02 8.690e-02
7.017e-02 1.812e-01 5.667e-02 5.435e-01 7.339e-01 3.398e+00 7.408e+01
5.225e-03 1.308e-02 1.860e-02 1.340e-02 1.389e-02 3.532e-03 2.499e+01
2.341e+01 1.588e+02 1.956e+03 1.238e-01 1.866e-01 2.416e-01 1.860e-01
2.750e-01 8.902e-02
]
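Each printed array above holds one value per feature for a single sample. As a quick sketch, the shape of the feature matrix and the pairing of names with values can be inspected as follows; the dictionary first_sample is a helper introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
feature_names = data['feature_names']
features = data['data']

# One row per sample, one column per feature.
print(features.shape)

# Pair each feature name with the first sample's value for that feature.
first_sample = dict(zip(feature_names, features[0]))
for name in feature_names[:3]:
    print(name, first_sample[name])
```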
Step3: Organizing data into training & testing sets
Because we need to test our model on unseen data, we will divide our dataset into two parts: a training set and a test set. We can use the train_test_split() function of the sklearn Python package to split the data. The following command will import the function −
from sklearn.model_selection import train_test_split
The next command will split the data into training and testing data. In this example, we use 40 percent of the data for testing and 60 percent for training −
train, test, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.40, random_state=42
)
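It can be useful to confirm the sizes of the two sets after the split. The sketch below repeats the split from this step and prints the resulting shapes. With 569 samples and test_size = 0.40, scikit-learn rounds the test set up to 228 samples, which leaves 341 for training.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()

# Same split as in the tutorial: 40% test, fixed random seed.
train, test, train_labels, test_labels = train_test_split(
    data['data'], data['target'], test_size=0.40, random_state=42
)

# Rows are samples; the number of feature columns is unchanged.
print(train.shape, test.shape)
print(len(train_labels), len(test_labels))
```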
Step4: Model evaluation
After dividing the data into training and testing sets, we need to build the model. We will use the Naïve Bayes algorithm for this purpose. The following command will import the GaussianNB class −
from sklearn.naive_bayes import GaussianNB
Now, initialize the model as follows −
gnb = GaussianNB()
Next, we can train the model with the following command −
model = gnb.fit(train, train_labels)
Now, to evaluate the model, we need to make predictions. This can be done with the predict() function as follows −
preds = gnb.predict(test)
print(preds)
This will give the following output −
[
1 0 0 1 1 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1
0 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 1 1 0 0 1 1 1 0 0 1 1 0 0 1 0 1 1
1 1 1 1 0 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 1 0 0 1 1 1 0 1 1 0 1 1 0
0 0 1 1 1 0 0 1 1 0 1 0 0 1 1 0 0 0 1 1 1 0 1 1 0 0 1 0 1 1 0 1 0 0 1 1 1 1
1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0 0 1 1 0
1 0 1 1 1 1 0 1 1 0 1 1 1 0 1 0 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 0 0 1 1 0
1
]
The above series of 0s and 1s in the output are the predicted classes: 0 for malignant and 1 for benign tumors.
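The raw 0s and 1s can be made easier to read by mapping them back to class names, and GaussianNB can also report a probability for each class through predict_proba(). The following sketch repeats the earlier steps so that it runs on its own; the names predicted_names and probs are introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data['data'], data['target'], test_size=0.40, random_state=42
)

# Train the model and predict the classes of the test samples.
gnb = GaussianNB().fit(train, train_labels)
preds = gnb.predict(test)

# Translate the first five predictions into class names.
predicted_names = data['target_names'][preds[:5]]
print(predicted_names)

# Class probabilities for the same five samples:
# column 0 is malignant, column 1 is benign.
probs = gnb.predict_proba(test[:5])
print(probs.round(3))
```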
Step5: Finding accuracy
We can find the accuracy of the model built in the previous step by comparing the two arrays test_labels and preds. We will use the accuracy_score() function to determine the accuracy.
from sklearn.metrics import accuracy_score
print(accuracy_score(test_labels,preds))
0.951754385965
The above output shows that the Naïve Bayes classifier is about 95.18% accurate.
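Accuracy is simply the share of test samples whose predicted label equals the true label, so it can also be computed by hand. The sketch below compares a manual calculation with accuracy_score(); the variable names accuracy_manual and accuracy_sklearn are ours.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data['data'], data['target'], test_size=0.40, random_state=42
)
preds = GaussianNB().fit(train, train_labels).predict(test)

# Fraction of positions where prediction and true label agree.
accuracy_manual = (preds == test_labels).mean()

# The same quantity computed by scikit-learn.
accuracy_sklearn = accuracy_score(test_labels, preds)

print(accuracy_manual, accuracy_sklearn)
```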
Classification Evaluation Metrics
The job is not done even after you have finished implementing your machine learning application or model. We still have to find out how effective the model is. Different evaluation metrics are available, but we must choose among them carefully, because the choice of metric influences how the performance of a machine learning algorithm is measured and compared.
The following are some important classification evaluation metrics, from which you can choose based on your dataset and kind of problem −
Confusion Matrix
It is the easiest way to measure the performance of a classification problem where the output can be of two or more classes. A confusion matrix is a table with two dimensions, “Actual” and “Predicted”. For a binary problem, its four cells hold the counts of “True Positives (TP)”, “True Negatives (TN)”, “False Positives (FP)” and “False Negatives (FN)”.
The terms associated with the confusion matrix are explained as follows −
- True Positives (TP) − The case where both the actual class and the predicted class of a data point are 1.
- True Negatives (TN) − The case where both the actual class and the predicted class of a data point are 0.
- False Positives (FP) − The case where the actual class of a data point is 0 and the predicted class is 1.
- False Negatives (FN) − The case where the actual class of a data point is 1 and the predicted class is 0.
We can find the confusion matrix with the help of the confusion_matrix() function of sklearn. The following script computes the confusion matrix of the binary classifier built above −
from sklearn.metrics import confusion_matrix
print(confusion_matrix(test_labels, preds))
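When reading the result, note that scikit-learn puts the actual classes in rows and the predicted classes in columns, with labels in sorted order. For labels 0 and 1, with 1 treated as the positive class, flattening the matrix therefore yields TN, FP, FN, TP in that order. Which class is treated as positive determines which cells count as TP and TN. The sketch below repeats the earlier steps so that it runs on its own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data['data'], data['target'], test_size=0.40, random_state=42
)
preds = GaussianNB().fit(train, train_labels).predict(test)

# Rows: actual class (0, 1). Columns: predicted class (0, 1).
cm = confusion_matrix(test_labels, preds)
print(cm)

# Flattened row by row, with label 1 as the positive class.
tn, fp, fn, tp = cm.ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```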
Accuracy
It may be defined as the fraction of predictions that our ML model got right, that is, the number of correct predictions divided by the total number of predictions. We can easily calculate it from the confusion matrix with the help of the following formula −
Accuracy = (TP + TN) / (TP + FP + FN + TN)
For the binary classifier built above, TP + TN = 73 + 144 = 217 and TP + FP + FN + TN = 73 + 7 + 4 + 144 = 228.
Hence, Accuracy = 217/228 = 0.951754385965, which is the same value we calculated after creating our binary classifier.
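The arithmetic above can be reproduced in a few lines. This sketch simply plugs in the four counts quoted in the text and does not retrain the model.

```python
# Counts as quoted in the text above.
tp, tn, fp, fn = 73, 144, 7, 4

# Correct predictions divided by all predictions.
correct = tp + tn
total = tp + tn + fp + fn
accuracy = correct / total

print(correct, total)
print(round(accuracy, 6))
```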
Precision
Precision, a measure used in document retrieval, may be defined as the fraction of the documents returned by our ML model that are actually correct. We can easily calculate it from the confusion matrix with the help of the following formula −
Precision = TP / (TP + FP)
For the binary classifier built above, TP = 73 and TP + FP = 73 + 7 = 80.
Hence, Precision = 73/80 = 0.9125
Recall or Sensitivity
Recall may be defined as the fraction of the actual positives that our ML model returns. We can easily calculate it from the confusion matrix with the help of the following formula −
Recall = TP / (TP + FN)
For the binary classifier built above, TP = 73 and TP + FN = 73 + 4 = 77.
Hence, Recall = 73/77 = 0.94805
Specificity
Specificity, in contrast to recall, may be defined as the fraction of the actual negatives that our ML model returns as negative. We can easily calculate it from the confusion matrix with the help of the following formula −
Specificity = TN / (TN + FP)
For the binary classifier built above, TN = 144 and TN + FP = 144 + 7 = 151.
Hence, Specificity = 144/151 = 0.95364
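Precision, recall and specificity are all simple ratios of confusion-matrix counts. The following sketch wraps them in small helper functions and evaluates them on the counts quoted in the text; the function names are hypothetical and are not part of scikit-learn.

```python
def precision(tp, fp):
    # Share of predicted positives that are truly positive.
    return tp / (tp + fp)


def recall(tp, fn):
    # Share of actual positives that were predicted positive.
    return tp / (tp + fn)


def specificity(tn, fp):
    # Share of actual negatives that were predicted negative.
    return tn / (tn + fp)


# Counts as quoted in the text above.
tp, tn, fp, fn = 73, 144, 7, 4

p = precision(tp, fp)
r = recall(tp, fn)
s = specificity(tn, fp)
print(round(p, 4), round(r, 5), round(s, 5))
```

For metrics computed directly from label arrays, scikit-learn offers precision_score() and recall_score() in sklearn.metrics.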
Various ML Classification Algorithms
The following are some important ML classification algorithms −
- Logistic Regression
- Support Vector Machine (SVM)
- Decision Tree
- Naïve Bayes
- Random Forest
We will discuss all these classification algorithms in detail in later chapters.
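As a preview, all five algorithms are available in scikit-learn and share the same fit/score interface. The sketch below trains each on the split used in this chapter and reports test accuracy. The hyperparameters are library defaults, and putting StandardScaler in front of logistic regression and the SVM is our assumption (both tend to behave better on scaled features), so the numbers are only a rough comparison.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.40, random_state=42
)

# One estimator per algorithm in the list above.
models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "Support Vector Machine": make_pipeline(StandardScaler(), SVC()),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(random_state=0),
}

# Fit each model on the training set and score it on the test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(name, round(scores[name], 3))
```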