Machine Learning - Support Vector Machine
Support vector machines (SVMs) are powerful yet flexible supervised machine learning algorithms used for both classification and regression, though generally they are used for classification problems. SVMs were first introduced in the 1960s and later refined in the 1990s. They have a unique way of implementation compared to other machine learning algorithms, and nowadays they are extremely popular because of their ability to handle multiple continuous and categorical variables.
Working of SVM
The goal of SVM is to find a hyperplane that separates the data points into different classes. A hyperplane is a line in 2D space, a plane in 3D space, or a higher-dimensional surface in n-dimensional space. The hyperplane is chosen in such a way that it maximizes the margin, which is the distance between the hyperplane and the closest data points of each class. The closest data points are called the support vectors.
The distance between the hyperplane and a data point "x" can be calculated using the formula −
distance = (w · x + b) / ||w||
where "w" is the weight vector, "b" is the bias term, and "||w||" is the Euclidean norm of the weight vector. The weight vector "w" is perpendicular to the hyperplane and determines its orientation, while the bias term "b" determines its position.
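As a minimal sketch (the values of w, b, and x below are made up purely for illustration), this distance can be computed with NumPy as follows −

import numpy as np

# hypothetical hyperplane parameters and a sample point
w = np.array([2.0, 1.0])   # weight vector
b = -3.0                   # bias term
x = np.array([1.0, 2.0])   # data point

# signed distance from x to the hyperplane w · x + b = 0
distance = (np.dot(w, x) + b) / np.linalg.norm(w)
print(distance)   # positive on one side of the hyperplane, negative on the other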
The optimal hyperplane is found by solving an optimization problem, which is to maximize the margin subject to the constraint that all data points are correctly classified. In other words, we want to find the hyperplane that maximizes the margin between the two classes while ensuring that no data point is misclassified. This is a convex optimization problem that can be solved using quadratic programming.
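Concretely, for linearly separable data with class labels y_i ∈ {-1, +1}, the problem can be written as −

minimize   (1/2) ||w||^2
subject to y_i (w · x_i + b) >= 1 for every training point (x_i, y_i)

Maximizing the margin is equivalent to minimizing ||w||^2 under these constraints, which gives the problem its quadratic form.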
If the data points are not linearly separable, we can use a technique called the kernel trick to map them into a higher-dimensional space where they become separable. The kernel function computes the inner product between the mapped data points without computing the mapping itself. This allows us to work with the data points in the higher-dimensional space without incurring the computational cost of mapping them.
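As a minimal sketch of this idea, the degree-2 polynomial kernel (x · y)^2 gives the same value as the inner product of an explicitly mapped pair of points, while the kernel computation never constructs the mapping −

import numpy as np

def phi(v):
    # explicit feature map for the degree-2 polynomial kernel in 2D:
    # phi(v) = (v1^2, sqrt(2)*v1*v2, v2^2)
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

print(np.dot(x, y) ** 2)        # kernel value: 121.0
print(np.dot(phi(x), phi(y)))   # same value via the explicit mapping: 121.0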
Let's understand it in detail with the help of the following diagram −
Given below are the important concepts in SVM −
- Support Vectors − Data points that are closest to the hyperplane are called support vectors. The separating line is defined with the help of these data points.
- Hyperplane − As we can see in the above diagram, it is a decision plane or space that separates a set of objects belonging to different classes.
- Margin − It may be defined as the gap between two lines drawn on the closest data points of different classes. It can be calculated as the perpendicular distance from the line to the support vectors. A large margin is considered a good margin and a small margin is considered a bad margin.
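scikit-learn exposes the support vectors of a fitted classifier directly. As a minimal sketch on a toy dataset (make_blobs here just generates two separable clusters for illustration) −

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# a small, linearly separable toy problem
X, y = make_blobs(n_samples=40, centers=2, random_state=0)
clf = SVC(kernel='linear')
clf.fit(X, y)

# coordinates of the support vectors and their count per class
print(clf.support_vectors_)
print(clf.n_support_)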
Implementation in Python
We will use the scikit-learn library to implement SVM in Python. Scikit-learn is a popular machine learning library that provides a wide range of algorithms for classification, regression, clustering, and dimensionality reduction tasks.
We will use the famous Iris dataset, which contains the sepal length, sepal width, petal length, and petal width of three species of iris flowers: Iris setosa, Iris versicolor, and Iris virginica. The goal is to classify the flowers into their respective species based on these four features.
Example
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# load the iris dataset
iris = load_iris()
# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(iris.data,
iris.target, test_size=0.2, random_state=42)
# create an SVM classifier with a linear kernel
svm = SVC(kernel='linear')
# train the SVM classifier on the training set
svm.fit(X_train, y_train)
# make predictions on the testing set
y_pred = svm.predict(X_test)
# calculate the accuracy of the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
We start by importing the necessary modules from scikit-learn: load_iris to load the iris dataset, train_test_split to split the data into training and testing sets, SVC to create an SVM classifier with a linear kernel, and accuracy_score to calculate the accuracy of the classifier.
We load the iris dataset using load_iris and split the data into training and testing sets using train_test_split. We use a test size of 0.2, which means that 20% of the data will be used for testing and 80% for training. We set the random state to 42 to ensure reproducibility of the results.
We create an SVM classifier with a linear kernel using SVC(kernel='linear'). We then train the SVM classifier on the training set using svm.fit(X_train, y_train).
Once the classifier is trained, we make predictions on the testing set using svm.predict(X_test). We then calculate the accuracy of the classifier using accuracy_score(y_test, y_pred) and print it to the console.
The output of the code should be something like this −
Accuracy: 1.0
Tuning SVM Parameters
In practice, SVMs often require tuning of their parameters to achieve optimal performance. The most important parameters to tune are the kernel, the regularization parameter C, and the kernel-specific parameters.
The kernel parameter determines the type of kernel to use. The most common kernel types are linear, polynomial, radial basis function (RBF), and sigmoid. The linear kernel is used for linearly separable data, while the other kernels are used for non-linearly separable data.
The regularization parameter C controls the trade-off between maximizing the margin and minimizing the classification error. A higher value of C means that the classifier will try to minimize the classification error at the expense of a smaller margin, while a lower value of C means that the classifier will try to maximize the margin even if it means more misclassifications.
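As a quick sketch of this trade-off (reusing the X_train, y_train, X_test, and y_test split from the earlier example; the C values are illustrative), we can train the same linear SVM with several values of C and compare test accuracy −

for C in [0.01, 1, 100]:
    clf = SVC(kernel='linear', C=C)
    clf.fit(X_train, y_train)
    print(C, clf.score(X_test, y_test))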
The kernel-specific parameters depend on the type of kernel being used. For example, the polynomial kernel has parameters for the degree of the polynomial and the coefficient of the polynomial, while the RBF kernel has a parameter for the width of the Gaussian function.
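In scikit-learn's SVC these correspond to the degree, coef0, and gamma parameters. The values below are purely illustrative −

# polynomial kernel: degree of the polynomial and the independent term coef0
poly_svm = SVC(kernel='poly', degree=3, coef0=1.0)
# RBF kernel: gamma controls the width of the Gaussian (larger gamma = narrower kernel)
rbf_svm = SVC(kernel='rbf', gamma=0.5)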
We can use cross-validation to tune the parameters of the SVM. Cross-validation involves splitting the data into several subsets (folds) and, for each fold, training the classifier on the remaining folds while testing it on that fold. This allows us to evaluate the performance of the classifier on different subsets of the data and choose the best set of parameters.
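A minimal sketch of this idea with scikit-learn's cross_val_score, again reusing the training split from the earlier example −

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation of a linear SVM on the training data
scores = cross_val_score(SVC(kernel='linear'), X_train, y_train, cv=5)
print(scores.mean())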
Example
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
# define the parameter grid
param_grid = {
'C': [0.1, 1, 10, 100],
'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
'degree': [2, 3, 4],
'coef0': [0.0, 0.1, 0.5],
'gamma': ['scale', 'auto']
}
# create an SVM classifier
svm = SVC()
# perform grid search to find the best set of parameters
grid_search = GridSearchCV(svm, param_grid, cv=5)
grid_search.fit(X_train, y_train)
# print the best set of parameters and their accuracy
print("Best parameters:", grid_search.best_params_)
print("Best accuracy:", grid_search.best_score_)
We start by importing the GridSearchCV module from scikit-learn, which is a tool for performing grid search on a set of parameters. We define a parameter grid that contains the possible values for each parameter we want to tune.
We create an SVM classifier using SVC() and then pass it to GridSearchCV along with the parameter grid and the number of cross-validation folds (cv=5). We then call grid_search.fit(X_train, y_train) to perform the grid search.
Once the grid search is complete, we print the best set of parameters and the corresponding cross-validation accuracy using grid_search.best_params_ and grid_search.best_score_, respectively.
On executing this program, you will get the following output −
Best parameters: {'C': 0.1, 'coef0': 0.5, 'degree': 3, 'gamma': 'scale', 'kernel': 'poly'}
Best accuracy: 0.975
This means that the best set of parameters found by the grid search are: C=0.1, coef0=0.5, degree=3, gamma=scale, and kernel=poly. The mean cross-validation accuracy achieved by this set of parameters on the training data is 97.5%.
You can now use these parameters to create a new SVM classifier and test its performance on the testing set.
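As a minimal sketch of that final step: GridSearchCV with its default refit=True refits the best estimator on the full training set, so it can be evaluated on the held-out test set directly −

# evaluate the refit best estimator on the held-out test set
best_svm = grid_search.best_estimator_
print("Test accuracy:", best_svm.score(X_test, y_test))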