Machine Learning With Python Tutorial

Support Vector Machine (SVM)

Introduction to SVM

Support vector machines (SVMs) are powerful yet flexible supervised machine learning algorithms which are used both for classification and regression. But generally, they are used in classification problems. SVMs were first introduced in the 1960s and later refined in the 1990s. SVMs have their own unique way of implementation as compared to other machine learning algorithms. Lately, they are extremely popular because of their ability to handle multiple continuous and categorical variables.

Working of SVM

An SVM model is basically a representation of different classes separated by a hyperplane in multidimensional space. The hyperplane is generated in an iterative manner by SVM so that the error can be minimized. The goal of SVM is to divide the datasets into classes by finding a maximum marginal hyperplane (MMH).

[Figure: two classes separated by a hyperplane, with the margin and support vectors highlighted]

The following are important concepts in SVM −

  1. Support Vectors − Data points that are closest to the hyperplane are called support vectors. The separating line is defined with the help of these data points.

  2. Hyperplane − As we can see in the above diagram, it is a decision plane or space that separates a set of objects belonging to different classes.

  3. Margin − It may be defined as the gap between two lines at the closest data points of different classes. It can be calculated as the perpendicular distance from the line to the support vectors. A large margin is considered a good margin and a small margin is considered a bad margin (a short sketch after the steps below shows how to compute it from a fitted model).

The main goal of SVM is to divide the datasets into classes to find a maximum marginal hyperplane (MMH), which can be done in the following two steps −

  1. First, SVM will generate hyperplanes iteratively that segregate the classes in the best way.

  2. Then, it will choose the hyperplane that separates the classes with the largest margin.
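
As a concrete illustration of the margin (a hedged sketch, not part of the original tutorial: it assumes a scikit-learn linear SVC, whose coef_ attribute holds the weight vector w), the margin width of a linear SVM can be computed as 2/||w|| −

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters, similar to the dataset generated in the next section
X, y = make_blobs(n_samples=100, centers=2, random_state=0, cluster_std=0.50)

# A very large C approximates a hard-margin SVM
model = SVC(kernel='linear', C=1E10).fit(X, y)

# For a linear SVM, the margin width equals 2 / ||w||
w = model.coef_[0]
print('margin width:', 2 / np.linalg.norm(w))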

Implementing SVM in Python

For implementing SVM in Python, we will start by importing the standard libraries as follows −

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns; sns.set()

Next, we are creating a sample dataset, having linearly separable data, using make_blobs from sklearn.datasets for classification with SVM −

from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=100, centers=2, random_state=0, cluster_std=0.50)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='summer');

The following would be the output after generating a sample dataset having 100 samples and 2 clusters −

[Figure: scatter plot of the 100 generated samples, colored by cluster]

We know that SVM supports discriminative classification: it divides the classes from each other by simply finding a line in the case of two dimensions, or a manifold in the case of multiple dimensions. It is implemented on the above dataset as follows −

xfit = np.linspace(-1, 3.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='summer')
plt.plot([0.6], [2.1], 'x', color='black', markeredgewidth=4, markersize=12)
for m, b in [(1, 0.65), (0.5, 1.6), (-0.2, 2.9)]:
   plt.plot(xfit, m * xfit + b, '-k')
plt.xlim(-1, 3.5);

The output is as follows −

[Figure: the two clusters with three candidate separating lines]

We can see from the above output that there are three different separators that perfectly discriminate the above samples.

As discussed, the main goal of SVM is to divide the datasets into classes to find a maximum marginal hyperplane (MMH). Hence, rather than drawing a zero-width line between the classes, we can draw around each line a margin of some width, up to the nearest point. It can be done as follows −

xfit = np.linspace(-1, 3.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='summer')
for m, b, d in [(1, 0.65, 0.33), (0.5, 1.6, 0.55), (-0.2, 2.9, 0.2)]:
   yfit = m * xfit + b
   plt.plot(xfit, yfit, '-k')
   plt.fill_between(xfit, yfit - d, yfit + d, edgecolor='none',
         color='#AAAAAA', alpha=0.4)
plt.xlim(-1, 3.5);

[Figure: the three candidate lines, each with a shaded margin up to the nearest points]

From the above output image, we can easily observe the “margins” within the discriminative classifiers. SVM will choose the line that maximizes the margin.

Next, we will use Scikit-Learn’s support vector classifier to train an SVM model on this data. Here, we are using a linear kernel to fit the SVM, as follows −

from sklearn.svm import SVC # "Support vector classifier"
model = SVC(kernel='linear', C=1E10)   # a very large C enforces a (nearly) hard margin
model.fit(X, y)

The output is as follows −

SVC(C=10000000000.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
kernel='linear', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False)

Now, for a better understanding, the following function will plot the decision function for a 2D SVC −

def decision_function(model, ax=None, plot_support=True):
   # Plot the decision function for a 2D SVC
   if ax is None:
      ax = plt.gca()
   xlim = ax.get_xlim()
   ylim = ax.get_ylim()

For evaluating the model, we need to create a grid as follows (this snippet continues the body of the function above) −

   x = np.linspace(xlim[0], xlim[1], 30)
   y = np.linspace(ylim[0], ylim[1], 30)
   Y, X = np.meshgrid(y, x)
   xy = np.vstack([X.ravel(), Y.ravel()]).T
   P = model.decision_function(xy).reshape(X.shape)

Next, we need to plot the decision boundaries and margins (still inside the function) as follows −

   ax.contour(X, Y, P, colors='k',
      levels=[-1, 0, 1], alpha=0.5,
      linestyles=['--', '-', '--'])

Now, similarly plot the support vectors as follows −

   if plot_support:
      ax.scatter(model.support_vectors_[:, 0],
         model.support_vectors_[:, 1],
         s=300, linewidth=1, facecolors='none');
   ax.set_xlim(xlim)
   ax.set_ylim(ylim)

Now, use this function to plot the fitted model as follows −

plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='summer')
decision_function(model);

[Figure: decision boundary (solid line), margins (dashed lines) and circled support vectors]

We can observe from the above output that the SVM classifier fits the data with margins, i.e. the dashed lines, and support vectors, the pivotal elements of this fit, touching the dashed lines. These support vector points are stored in the support_vectors_ attribute of the classifier, as follows −

model.support_vectors_

The output is as follows −

array([[0.5323772 , 3.31338909],
   [2.11114739, 3.57660449],
   [1.46870582, 1.86947425]])

SVM Kernels

In practice, the SVM algorithm is implemented with a kernel that transforms the input data space into the required form. SVM uses a technique called the kernel trick, in which the kernel takes a low-dimensional input space and transforms it into a higher-dimensional space. In simple words, the kernel converts non-separable problems into separable problems by adding more dimensions to them. This makes SVM more powerful, flexible and accurate.
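
As a minimal illustration of the kernel trick (a sketch added for intuition; the data and the feature map phi(x) = (x, x^2) are illustrative assumptions, not part of the original example), a 1-D dataset that no single threshold can separate becomes linearly separable after mapping −

import numpy as np

# 1-D data: class 1 sits between the class-0 points, so no threshold separates them
x = np.array([-2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0])
y = np.array([0, 0, 1, 1, 1, 0, 0])

# The feature map phi(x) = (x, x^2) lifts the data into two dimensions
X_mapped = np.column_stack([x, x ** 2])

# In the mapped space, the horizontal line x2 = 1 separates the classes
print(X_mapped[y == 1])   # second column < 1 for class 1
print(X_mapped[y == 0])   # second column > 1 for class 0

The following are some of the types of kernels used by SVM −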

Linear Kernel

It can be used as a dot product between any two observations. The formula of the linear kernel is as below −

K(x, xi) = sum(x * xi)

From the above formula, we can see that the product between two vectors, say x and xi, is the sum of the products of each pair of input values.
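
A quick numerical check of this formula (a sketch using scikit-learn's linear_kernel helper from sklearn.metrics.pairwise; the sample vectors are illustrative) −

import numpy as np
from sklearn.metrics.pairwise import linear_kernel

x = np.array([[1.0, 2.0, 3.0]])
xi = np.array([[4.0, 5.0, 6.0]])

# Manual sum of element-wise products: 1*4 + 2*5 + 3*6 = 32
manual = np.sum(x * xi)

# scikit-learn's linear kernel gives the same value
print(manual, linear_kernel(x, xi)[0, 0])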

Polynomial Kernel

It is a more generalized form of the linear kernel and can distinguish curved or nonlinear input spaces. Following is the formula for the polynomial kernel −

K(x, xi) = (1 + sum(x * xi))^d

Here d is the degree of the polynomial, which we need to specify manually in the learning algorithm.
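
To sanity-check this formula (a sketch using scikit-learn's polynomial_kernel helper, which computes (gamma * sum(x * xi) + coef0)^degree; setting gamma=1 and coef0=1 reduces it to the formula above) −

import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

x = np.array([[1.0, 2.0]])
xi = np.array([[3.0, 0.5]])
d = 3

# Manual (1 + sum(x * xi))^d = (1 + 4)^3 = 125
manual = (1 + np.sum(x * xi)) ** d

# scikit-learn equivalent
print(manual, polynomial_kernel(x, xi, degree=d, gamma=1, coef0=1)[0, 0])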

Radial Basis Function (RBF) Kernel

The RBF kernel, mostly used in SVM classification, maps the input space into an infinite-dimensional space. The following formula explains it mathematically −

K(x, xi) = exp(-gamma * sum((x - xi)^2))

Here, gamma must be specified manually in the learning algorithm; values between 0 and 1 are a common choice, and a good default value is 0.1.
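
Again, as a small sanity check (a sketch using scikit-learn's rbf_kernel helper from sklearn.metrics.pairwise; the vectors and gamma value are illustrative) −

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.array([[1.0, 2.0]])
xi = np.array([[2.0, 4.0]])
gamma = 0.1

# Manual exp(-gamma * ||x - xi||^2) = exp(-0.1 * 5)
manual = np.exp(-gamma * np.sum((x - xi) ** 2))

# scikit-learn equivalent
print(manual, rbf_kernel(x, xi, gamma=gamma)[0, 0])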

Having implemented SVM for linearly separable data, we can now implement it in Python for data that is not linearly separable. This can be done by using kernels.

Example

The following is an example of creating an SVM classifier using kernels. We will be using the iris dataset from scikit-learn −

We will start by importing the following packages −

import pandas as pd
import numpy as np
from sklearn import svm, datasets
import matplotlib.pyplot as plt

Now, we need to load the input data −

iris = datasets.load_iris()

From this dataset, we are taking the first two features as follows −

X = iris.data[:, :2]
y = iris.target

Next, we will plot the SVM boundaries with the original data as follows −

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
h = (x_max - x_min) / 100   # step size of the mesh
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
   np.arange(y_min, y_max, h))
X_plot = np.c_[xx.ravel(), yy.ravel()]

Now, we need to provide the value of the regularization parameter C as follows −

C = 1.0

Next, the SVM classifier object can be created as follows −

svc_classifier = svm.SVC(kernel='linear', C=C).fit(X, y)

Z = svc_classifier.predict(X_plot)
Z = Z.reshape(xx.shape)
plt.figure(figsize=(15, 5))
plt.subplot(121)
plt.contourf(xx, yy, Z, cmap=plt.cm.tab10, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.title('Support Vector Classifier with linear kernel')

Output

Text(0.5, 1.0, 'Support Vector Classifier with linear kernel')

[Figure: decision regions of the linear-kernel SVC plotted over the first two iris features]

For creating an SVM classifier with the rbf kernel, we can change the kernel to rbf as follows −

svc_classifier = svm.SVC(kernel='rbf', gamma='auto', C=C).fit(X, y)
Z = svc_classifier.predict(X_plot)
Z = Z.reshape(xx.shape)
plt.figure(figsize=(15, 5))
plt.subplot(121)
plt.contourf(xx, yy, Z, cmap=plt.cm.tab10, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.title('Support Vector Classifier with rbf kernel')

Output

Text(0.5, 1.0, 'Support Vector Classifier with rbf kernel')

[Figure: decision regions of the rbf-kernel SVC plotted over the first two iris features]

We set the value of gamma to ‘auto’, but you can also provide a value between 0 and 1.
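
For instance, a numeric gamma can be passed directly (a minimal sketch continuing the example above; 0.5 is just an illustrative value) −

# Same classifier, but with an explicit numeric gamma instead of 'auto'
svc_classifier = svm.SVC(kernel='rbf', gamma=0.5, C=C).fit(X, y)
print(svc_classifier.score(X, y))   # training accuracy for this choice of gamma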

Pros and Cons of SVM Classifiers

Pros of SVM classifiers

SVM classifiers offer great accuracy and work well in high-dimensional spaces. SVM classifiers basically use a subset of the training points, and hence use very little memory.

Cons of SVM classifiers

They have a long training time, hence in practice they are not suitable for large datasets. Another disadvantage is that SVM classifiers do not work well with overlapping classes.