Machine Learning: A Concise Tutorial
Machine Learning - Classification Algorithms
Classification is a supervised learning technique that predicts a categorical target variable from a set of input features. It is commonly used for problems such as spam detection, fraud detection, image recognition, and sentiment analysis.
The goal of a classification model is to learn a mapping function (f) from the input features (X) to the target variable (Y). This function is often visualized as a decision boundary that separates the classes in the input feature space. Once trained, the model can predict the class of new, unseen examples.
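As a minimal sketch of this idea, the hypothetical function below maps a single input feature to a class label. The threshold of 0.5 plays the role of the decision boundary; both the function and the sample values are made up for illustration.

```python
# Hypothetical one-feature classifier: the decision boundary is the threshold.
def f(x, threshold=0.5):
    """Map an input feature x to a class label (0 or 1)."""
    return 1 if x >= threshold else 0

# Predict the class of a few unseen examples.
predictions = [f(x) for x in [0.1, 0.4, 0.6, 0.9]]
print(predictions)  # [0, 0, 1, 1]
```

A trained model does the same thing, except that the boundary is learned from labeled data instead of being fixed by hand.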
Let us now look at the steps involved in building a classification model:
Data Preparation
The first step is to collect and preprocess the data. This involves cleaning the data, handling missing values, and converting categorical variables to numerical values.
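The sketch below illustrates two of these preprocessing tasks with scikit-learn: filling a missing value and encoding a categorical column. The toy columns (`ages`, `colors`) are invented for this example, and mean imputation with one-hot encoding is only one of several reasonable choices.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Toy data (made up): a numeric column with a missing value
# and a categorical column.
ages = np.array([[25.0], [np.nan], [35.0]])
colors = np.array([["red"], ["blue"], ["red"]])

# Replace the missing age with the mean of the observed ages.
ages_filled = SimpleImputer(strategy="mean").fit_transform(ages)

# Convert the categorical column into one-hot numeric columns.
colors_encoded = OneHotEncoder().fit_transform(colors).toarray()

# Combine everything into one numeric feature matrix.
prepared = np.hstack([ages_filled, colors_encoded])
print(prepared)
```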
Feature Extraction/Selection
The next step is to extract or select relevant features from the data. This step matters because the quality of the features can greatly affect the performance of the model. Common techniques include correlation analysis and feature importance ranking for feature selection, and principal component analysis (PCA) for feature extraction.
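As an illustration, the following sketch applies one selection technique and one extraction technique to scikit-learn's breast cancer dataset. The choice of `SelectKBest` with the ANOVA F-score and of five features or components is an assumption made for the example, not a recommendation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)

# Feature selection: keep the 5 original features with the highest F-score.
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)

# Feature extraction: project the data onto 5 principal components.
X_pca = PCA(n_components=5).fit_transform(X)

print(X.shape, X_selected.shape, X_pca.shape)
```

Selection keeps a subset of the original columns, while PCA builds new columns as combinations of all of them.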
Model Selection
Once the features are selected, the next step is to choose an appropriate classification algorithm. There are many algorithms to choose from, each with its own strengths and weaknesses. Popular algorithms include logistic regression, decision trees, random forests, support vector machines, and neural networks.
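One common way to compare candidate algorithms is cross-validation. The sketch below scores three classifiers on the breast cancer dataset; the particular candidates and the 5-fold setting are assumptions chosen to keep the example short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate models to compare (an arbitrary selection for illustration).
candidates = {
    "naive_bayes": GaussianNB(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

# Mean accuracy over 5 cross-validation folds for each candidate.
scores = {}
for name, model in candidates.items():
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(name, round(scores[name], 3))
```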
Model Training
After selecting a suitable algorithm, the next step is to train the model on the labeled training data. During training, the model learns the mapping function between the input features and the target variable. For many algorithms, the model parameters are adjusted iteratively to minimize the difference between the predicted outputs and the actual outputs.
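To make the iterative adjustment concrete, here is a minimal sketch of gradient descent for a one-feature logistic regression. The toy data, the learning rate, and the number of iterations are all assumptions for illustration; library implementations use more refined optimizers.

```python
import numpy as np

# Toy data (made up): the label is 1 when x is positive.
X = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0, 0, 1, 1])

def predict_proba(w, b):
    """Sigmoid of the linear score w * x + b."""
    return 1.0 / (1.0 + np.exp(-(w * X + b)))

def log_loss(w, b):
    """Average difference between predictions and labels (cross-entropy)."""
    p = predict_proba(w, b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

w, b = 0.0, 0.0
learning_rate = 0.5
initial_loss = log_loss(w, b)

# Repeatedly nudge the parameters in the direction that lowers the loss.
for _ in range(100):
    p = predict_proba(w, b)
    w -= learning_rate * np.mean((p - y) * X)
    b -= learning_rate * np.mean(p - y)

final_loss = log_loss(w, b)
print(initial_loss, final_loss)
```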
Model Evaluation
Once the model is trained, the next step is to evaluate its performance on a separate set of validation data. This is done to estimate the model's accuracy and generalization performance. Common evaluation metrics include accuracy, precision, recall, F1-score, and the area under the receiver operating characteristic (ROC) curve.
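The metrics listed above are all available in `sklearn.metrics`. The sketch below computes them on a small set of hand-written labels, predictions, and scores (all made up for illustration) so the values can be checked by hand.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Made-up ground truth, hard predictions, and predicted scores.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
y_score = [0.9, 0.2, 0.4, 0.8, 0.6, 0.7]

accuracy = accuracy_score(y_true, y_pred)    # 4 of 6 correct
precision = precision_score(y_true, y_pred)  # 3 of 4 predicted positives are right
recall = recall_score(y_true, y_pred)        # 3 of 4 actual positives are found
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_score)         # uses scores, not hard predictions

print(accuracy, precision, recall, f1, auc)
```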
Hyperparameter Tuning
In many cases, the performance of the model can be further improved by tuning its hyperparameters. Hyperparameters are settings chosen before training that control aspects such as the learning rate, the regularization strength, and the number of hidden layers in a neural network. Grid search, random search, and Bayesian optimization are common techniques for hyperparameter tuning.
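As a sketch of grid search, the example below tunes two hyperparameters of a decision tree with scikit-learn's `GridSearchCV`. The model, the parameter grid, and the 5-fold setting are assumptions chosen for brevity.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Every combination in this grid is evaluated with 5-fold cross-validation.
param_grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```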
Model Deployment
Once the model has been trained and evaluated, the final step is to deploy it in a production environment. This involves integrating the model into a larger system, testing it on real-world data, and monitoring its performance over time.
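A small part of deployment is persisting the trained model so another process can load it. The sketch below round-trips a model through Python's `pickle` module; this is one simple option assumed for illustration, and pickled files should only be loaded from trusted sources.

```python
import pickle

from sklearn.datasets import load_breast_cancer
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
model = GaussianNB().fit(X, y)

# Serialize the trained model to bytes (these could be written to a file).
blob = pickle.dumps(model)

# Later, or in another process: restore the model and use it for prediction.
restored = pickle.loads(blob)
same = bool((model.predict(X[:5]) == restored.predict(X[:5])).all())
print(same)
```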
Types of Learners in Classification
There are two types of learners in classification problems:
Lazy Learners
As the name suggests, lazy learners simply store the training data and wait until test data appears. Classification is performed only when the test data arrives. They spend less time on training but more time on predicting. Examples of lazy learners are k-nearest neighbors and case-based reasoning.
Eager Learners
In contrast to lazy learners, eager learners construct a classification model from the training data without waiting for test data to appear. They spend more time on training but less time on predicting. Examples of eager learners are decision trees, Naïve Bayes, and artificial neural networks (ANN).
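The difference can be sketched with two hypothetical one-feature classifiers written from scratch. The lazy one only stores the data in `fit` and does its work in `predict`; the eager one computes a threshold in `fit`, so `predict` is a single comparison. The class names and toy data are invented for this example, and the eager classifier assumes class 1 has the larger mean.

```python
class OneNearestNeighbor:
    """Lazy learner: fit() stores the data, predict() does the work."""

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, x):
        # Scan all stored examples and return the label of the closest one.
        distances = [abs(x - xi) for xi in self.X]
        return self.y[distances.index(min(distances))]


class ThresholdClassifier:
    """Eager learner: fit() builds the model, predict() is one comparison."""

    def fit(self, X, y):
        class0 = [xi for xi, yi in zip(X, y) if yi == 0]
        class1 = [xi for xi, yi in zip(X, y) if yi == 1]
        # Midpoint between the two class means (assumes class 1 lies higher).
        self.threshold = (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2
        return self

    def predict(self, x):
        return 1 if x >= self.threshold else 0


X_train, y_train = [1.0, 2.0, 8.0, 9.0], [0, 0, 1, 1]
lazy = OneNearestNeighbor().fit(X_train, y_train)
eager = ThresholdClassifier().fit(X_train, y_train)

print(lazy.predict(3.0), eager.predict(3.0))  # 0 0
print(lazy.predict(7.0), eager.predict(7.0))  # 1 1
```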
Building a Classifier in Python
Scikit-learn, a Python library for machine learning, can be used to build a classifier in Python. The steps are as follows:
Step 1: Importing the necessary Python package
To build a classifier with scikit-learn, we first need to import it. We can do so with the following script:
import sklearn
Step 2: Importing dataset
After importing the necessary package, we need a dataset to build a classification model. We can load one from sklearn's bundled datasets or use another one as required. We are going to use sklearn's Breast Cancer Wisconsin (Diagnostic) dataset. We can import its loader with the following script:
from sklearn.datasets import load_breast_cancer
The following script loads the dataset:
data = load_breast_cancer()
We also need to organize the data, which can be done with the following script:
label_names = data['target_names']
labels = data['target']
feature_names = data['feature_names']
features = data['data']
The following command prints the names of the labels, which are 'malignant' and 'benign' for this dataset:
print(label_names)
The output of the above command is the names of the labels:
['malignant' 'benign']
These labels are mapped to the binary values 0 and 1: malignant tumors are represented by 0 and benign tumors by 1.
The feature names and the feature values of the samples can be inspected with the following commands:
print(feature_names[0])
The output of the above command is the name of the first feature:
mean radius
Similarly, the name of the second feature can be printed as follows:
print(feature_names[1])
The output of the above command is the name of the second feature:
mean texture
We can print the feature values of the first sample with the following command:
print(features[0])
This gives the following output:
[1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
4.601e-01 1.189e-01]
Similarly, we can print the feature values of the second sample:
print(features[1])
This gives the following output:
[2.057e+01 1.777e+01 1.329e+02 1.326e+03 8.474e-02 7.864e-02 8.690e-02
7.017e-02 1.812e-01 5.667e-02 5.435e-01 7.339e-01 3.398e+00 7.408e+01
5.225e-03 1.308e-02 1.860e-02 1.340e-02 1.389e-02 3.532e-03 2.499e+01
2.341e+01 1.588e+02 1.956e+03 1.238e-01 1.866e-01 2.416e-01 1.860e-01
2.750e-01 8.902e-02]
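Beyond individual rows, it can help to check the overall size of the dataset and how the two classes are distributed. The following sketch does this with NumPy's `bincount`; the variable names mirror the ones used above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
features, labels = data["data"], data["target"]

# Number of samples and number of features per sample.
print(features.shape)

# How many samples carry each label (index 0 = malignant, 1 = benign).
counts = np.bincount(labels)
for name, count in zip(data["target_names"], counts):
    print(name, count)
```

The dataset contains 569 samples with 30 features each, of which 212 are malignant and 357 are benign.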
Step 3: Organizing data into training & testing sets
Because we need to test our model on unseen data, we divide the dataset into two parts: a training set and a test set. We can use the train_test_split() function of sklearn to split the data. The following command imports the function:
from sklearn.model_selection import train_test_split
The next command splits the data into training and testing sets. In this example, we use 40 percent of the data for testing and 60 percent for training:
train, test, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.40, random_state=42)
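As a quick sanity check, the sketch below repeats the split in a self-contained form and prints the size of each part. With 569 samples and `test_size=0.40`, scikit-learn rounds the test set up to 228 samples, leaving 341 for training.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
features, labels = data["data"], data["target"]

# Same split as in the tutorial: 40% test, 60% train, fixed random seed.
train, test, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.40, random_state=42)

print(len(train), len(test))
```

If the class proportions should be preserved in both parts, `train_test_split` also accepts a `stratify=labels` argument; the tutorial does not use it.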
Step 4: Building the model
After dividing the data into training and testing sets, we need to build the model. We will use the Naïve Bayes algorithm for this purpose. The following command imports the GaussianNB class:
from sklearn.naive_bayes import GaussianNB
Now, initialize the model as follows:
gnb = GaussianNB()
Next, we can train the model with the following command:
model = gnb.fit(train, train_labels)
To evaluate the model, we need to make predictions on the test set. This can be done with the predict() function as follows:
preds = gnb.predict(test)
print(preds)
This gives the following output:
[1 0 0 1 1 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0
1 0 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 1 1 0 0 1 1 1 0 0 1 1 0 0 1 0
1 1 1 1 1 1 0 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 1 0 0 1 1 1 0 1 1 0
1 1 0 0 0 1 1 1 0 0 1 1 0 1 0 0 1 1 0 0 0 1 1 1 0 1 1 0 0 1 0 1 1 0 1 0 0
1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0
0 1 1 0 1 0 1 1 1 1 0 1 1 0 1 1 1 0 1 0 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1
0 0 1 1 0 1]
The series of 0s and 1s in the output above are the predicted classes for the test samples: 0 for malignant and 1 for benign.
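To make the predictions easier to read, the numeric classes can be mapped back to their names by indexing the `target_names` array. The sketch below repeats the earlier steps in a self-contained form; `pred_names` is a name introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data["data"], data["target"], test_size=0.40, random_state=42)

preds = GaussianNB().fit(train, train_labels).predict(test)

# Index the array of label names with the predicted class numbers.
pred_names = data["target_names"][preds]
print(pred_names[:5])
```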
Step 5: Finding accuracy
We can find the accuracy of the model built in the previous step by comparing the two arrays test_labels and preds. We will use the accuracy_score() function to determine the accuracy:
from sklearn.metrics import accuracy_score
print(accuracy_score(test_labels,preds))
0.951754385965
The above output shows that the Naïve Bayes classifier is about 95.18% accurate on the test set.
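Accuracy is simply the fraction of predictions that match the true labels. The following self-contained sketch computes it by hand with NumPy and compares the result with `accuracy_score()`; the variable `manual_accuracy` is introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data["data"], data["target"], test_size=0.40, random_state=42)
preds = GaussianNB().fit(train, train_labels).predict(test)

# Fraction of test samples where the prediction equals the true label.
manual_accuracy = np.mean(preds == test_labels)
library_accuracy = accuracy_score(test_labels, preds)
print(manual_accuracy, library_accuracy)
```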
Classification Evaluation Metrics
The job is not done once you have finished implementing your machine learning application or model. We still have to find out how effective the model is. There are different evaluation metrics, and we must choose among them carefully, because the choice of metric influences how the performance of a machine learning algorithm is measured and compared.
The following are some important classification evaluation metrics, from which you can choose based on your dataset and the kind of problem:
- Confusion Matrix: It is the easiest way to measure the performance of a classification problem where the output can be of two or more classes.
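As a brief illustration, the sketch below builds a confusion matrix with scikit-learn from made-up labels and predictions. In scikit-learn's convention, rows correspond to the true classes and columns to the predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Made-up ground truth and predictions for a binary problem.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

# Layout for labels [0, 1]:
# [[true negatives, false positives],
#  [false negatives, true positives]]
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

Here the matrix is [[1, 1], [1, 3]]: one true negative, one false positive, one false negative, and three true positives.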
Various ML Classification Algorithms
Important ML classification algorithms, including those mentioned earlier such as logistic regression, decision trees, random forests, support vector machines, k-nearest neighbors, and Naïve Bayes, are discussed in detail in later chapters.
Applications
Some of the most important applications of classification algorithms are as follows:
- Speech Recognition
- Handwriting Recognition
- Biometric Identification
- Document Classification
In the subsequent chapters, we will discuss some of the most popular classification algorithms in machine learning.