Scikit Learn Tutorial

Scikit Learn - Stochastic Gradient Descent

Here, we will learn about an optimization algorithm in Sklearn, termed as Stochastic Gradient Descent (SGD).

Stochastic Gradient Descent (SGD) is a simple yet efficient optimization algorithm used to find the values of parameters/coefficients of functions that minimize a cost function. In other words, it is used for discriminative learning of linear classifiers under convex loss functions such as SVM and Logistic regression. It has been successfully applied to large-scale datasets because the update to the coefficients is performed for each training instance, rather than once at the end of all the instances.
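
To make the per-instance update concrete, here is a minimal sketch (not the scikit-learn implementation) of one epoch of plain SGD for a linear model with squared loss; the learning rate eta, the weights w and the toy data are all hypothetical choices for illustration −

import numpy as np

# Hypothetical toy data: 4 samples, 2 features
X = np.array([[-1.0, -1.0], [-2.0, -1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

w = np.zeros(X.shape[1])   # weight vector
b = 0.0                    # intercept
eta = 0.01                 # constant learning rate

# One epoch of plain SGD with squared loss: the coefficients are updated
# after every single training instance, not once per full pass over the data.
for xi, yi in zip(X, y):
   error = (w @ xi + b) - yi   # prediction error for this instance
   w -= eta * error * xi       # gradient step on the weights
   b -= eta * error            # gradient step on the intercept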

SGD Classifier

Stochastic Gradient Descent (SGD) classifier basically implements a plain SGD learning routine supporting various loss functions and penalties for classification. Scikit-learn provides SGDClassifier module to implement SGD classification.

Parameters

The following table consists of the parameters used by the SGDClassifier module −

Sr.No

Parameter & Description

1

loss − str, default = ‘hinge’ It represents the loss function to be used while implementing. The default value is ‘hinge’, which will give us a linear SVM. The other options which can be used are −

  1. log − This loss will give us logistic regression i.e. a probabilistic classifier.

  2. modified_huber − a smooth loss that brings tolerance to outliers along with probability estimates.

  3. squared_hinge − similar to ‘hinge’ loss but it is quadratically penalized.

  4. perceptron − as the name suggests, it is a linear loss which is used by the perceptron algorithm.

(An example using the ‘log’ loss is shown after this table.)

2

penalty − str, ‘none’, ‘l2’, ‘l1’, ‘elasticnet’, default = ‘l2’ It is the regularization term used in the model. By default, it is L2. We can use L1 or ‘elasticnet’ as well, but both might bring sparsity to the model (feature selection), which is not achievable with L2.

3

alpha − float, default = 0.0001 Alpha, the constant that multiplies the regularization term, is the tuning parameter that decides how much we want to penalize the model. The default value is 0.0001.

4

l1_ratio − float, default = 0.15 This is called the ElasticNet mixing parameter. Its range is 0 <= l1_ratio <= 1. If l1_ratio = 1, the penalty would be an L1 penalty. If l1_ratio = 0, the penalty would be an L2 penalty.

5

fit_intercept − Boolean, Default = True This parameter specifies that a constant (bias or intercept) should be added to the decision function. If it is set to False, no intercept will be used in the calculation and the data will be assumed to be already centered.

6

tol − float or none, optional, default = 1.e-3 This parameter represents the stopping criterion for iterations. Its default value is 1e-3. If it is not set to None, the iterations will stop when loss > best_loss - tol for n_iter_no_change successive epochs.

7

shuffle − Boolean, optional, default = True This parameter represents whether we want our training data to be shuffled after each epoch or not.

8

verbose − integer, default = 0 It represents the verbosity level. Its default value is 0.

9

epsilon − float, default = 0.1 This parameter specifies the width of the insensitive region. If loss = ‘epsilon_insensitive’, any difference between the current prediction and the correct label that is less than the threshold would be ignored.

10

max_iter − int, optional, default = 1000 As the name suggests, it represents the maximum number of passes over the training data (i.e. epochs).

11

warm_start − bool, optional, default = false With this parameter set to True, we can reuse the solution of the previous call to fit as initialization. If we choose default i.e. false, it will erase the previous solution.

12

random_state − int, RandomState instance or None, optional, default = None This parameter represents the seed of the pseudo random number generator which is used while shuffling the data. Following are the options −

  1. int − In this case, random_state is the seed used by the random number generator.

  2. RandomState instance − In this case, random_state is the random number generator.

  3. None − In this case, the random number generator is the RandomState instance used by np.random.

13

n_jobs − int or none, optional, Default = None It represents the number of CPUs to be used in OVA (One Versus All) computation, for multi-class problems. The default value is none which means 1.

14

learning_rate − string, optional, default = ‘optimal’ If learning rate is ‘constant’, eta = eta0; If learning rate is ‘optimal’, eta = 1.0/(alpha*(t+t0)), where t0 is chosen by Leon Bottou; If learning rate is ‘invscaling’, eta = eta0/pow(t, power_t). If learning rate is ‘adaptive’, eta = eta0.

15

eta0 − double, default = 0.0 It represents the initial learning rate for the above mentioned learning rate options i.e. ‘constant’, ‘invscaling’, or ‘adaptive’.

16

power_t − double, default = 0.5 It is the exponent for the ‘invscaling’ learning rate.

17

early_stopping − bool, default = False This parameter represents the use of early stopping to terminate training when the validation score is not improving. Its default value is False, but when set to True, it automatically sets aside a stratified fraction of the training data as validation and stops training when the validation score is not improving.

18

validation_fraction − float, default = 0.1 It is only used when early_stopping is True. It represents the proportion of training data to set aside as the validation set for early termination of training.

19

n_iter_no_change − int, default = 5 It represents the number of iterations with no improvement that the algorithm should run before early stopping.

20

class_weight − dict, {class_label: weight} or “balanced”, or None, optional This parameter represents the weights associated with classes. If not provided, the classes are supposed to have weight 1.

21

average − Boolean or int, optional, default = False If set to True, it computes the averaged SGD weights and stores the result in the coef_ attribute. If set to an int greater than 1, averaging will begin once the total number of samples seen reaches that number.
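
As a quick illustration of the loss and penalty parameters above, the following sketch fits a classifier with loss = ‘log’, which turns SGDClassifier into a probabilistic classifier with predict_proba support (note that recent scikit-learn releases rename this option to ‘log_loss’); the toy data is purely illustrative −

import numpy as np
from sklearn import linear_model

X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
Y = np.array([1, 1, 2, 2])

# Logistic-regression loss with an L1 penalty; 'log' is the option name in
# older scikit-learn versions (renamed to 'log_loss' in recent releases).
clf = linear_model.SGDClassifier(loss='log', penalty='l1', alpha=0.0001, max_iter=1000, tol=1e-3)
clf.fit(X, Y)

# Because the loss is logistic, class probabilities are available.
clf.predict_proba([[2., 2.]])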

Attributes

The following table consists of the attributes used by the SGDClassifier module −

Sr.No

Attributes & Description

1

coef_ − array, shape (1, n_features) if n_classes==2, else (n_classes, n_features) This attribute provides the weight assigned to the features.

2

intercept_ − array, shape (1,) if n_classes==2, else (n_classes,) It represents the independent term in decision function.

3

n_iter_ − int It gives the number of iterations to reach the stopping criterion.
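
For a multi-class problem, the coef_ and intercept_ attributes have one row/entry per class because SGDClassifier combines binary classifiers in a one-versus-all scheme; a small sketch with three hypothetical classes −

import numpy as np
from sklearn.linear_model import SGDClassifier

# Hypothetical 3-class toy data with 2 features
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1], [3, 3], [4, 3]])
y = np.array([0, 0, 1, 1, 2, 2])

clf = SGDClassifier(max_iter=1000, tol=1e-3).fit(X, y)
clf.coef_.shape        # (3, 2) − one weight row per class
clf.intercept_.shape   # (3,)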

Implementation Example

Like other classifiers, Stochastic Gradient Descent (SGD) has to be fitted with the following two arrays −

  1. An array X holding the training samples. It is of size [n_samples, n_features].

  2. An array Y holding the target values i.e. class labels for the training samples. It is of size [n_samples].

Example

Following Python script uses SGDClassifier linear model −

import numpy as np
from sklearn import linear_model

# Training samples (X) and their class labels (Y)
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
Y = np.array([1, 1, 2, 2])

# Fit an SGD classifier with an elastic-net penalty
SGDClf = linear_model.SGDClassifier(max_iter=1000, tol=1e-3, penalty="elasticnet")
SGDClf.fit(X, Y)

Output

SGDClassifier(
   alpha = 0.0001, average = False, class_weight = None,
   early_stopping = False, epsilon = 0.1, eta0 = 0.0, fit_intercept = True,
   l1_ratio = 0.15, learning_rate = 'optimal', loss = 'hinge', max_iter = 1000,
   n_iter = None, n_iter_no_change = 5, n_jobs = None, penalty = 'elasticnet',
   power_t = 0.5, random_state = None, shuffle = True, tol = 0.001,
   validation_fraction = 0.1, verbose = 0, warm_start = False
)

Example

Now, once fitted, the model can predict new values as follows −

SGDClf.predict([[2.,2.]])

Output

array([2])

Example

For the above example, we can get the weight vector with the help of following python script −

SGDClf.coef_

Output

array([[19.54811198, 9.77200712]])

Example

Similarly, we can get the value of intercept with the help of following python script −

SGDClf.intercept_

Output

array([10.])

Example

We can get the signed distance to the hyperplane by using SGDClassifier.decision_function as used in the following python script −

SGDClf.decision_function([[2., 2.]])

Output

array([68.6402382])
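
Example

This signed distance is just the value of the linear decision function coef_ · x + intercept_, which we can verify manually for the point [2., 2.] −

import numpy as np

# 19.54811198 * 2 + 9.77200712 * 2 + 10 ≈ 68.6402382
np.dot([2., 2.], SGDClf.coef_[0]) + SGDClf.intercept_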

SGD Regressor

Stochastic Gradient Descent (SGD) regressor basically implements a plain SGD learning routine supporting various loss functions and penalties to fit linear regression models. Scikit-learn provides SGDRegressor module to implement SGD regression.

Parameters

Parameters used by SGDRegressor are almost the same as those used in the SGDClassifier module. The difference lies in the ‘loss’ parameter. For the SGDRegressor module’s loss parameter, the possible values are as follows −

  1. squared_loss − It refers to the ordinary least squares fit.

  2. huber − It corrects the outliers by switching from squared to linear loss past a distance of epsilon. The work of ‘huber’ is to modify ‘squared_loss’ so that the algorithm focuses less on correcting outliers.

  3. epsilon_insensitive − Actually, it ignores the errors less than epsilon.

  4. squared_epsilon_insensitive − It is same as epsilon_insensitive. The only difference is that it becomes squared loss past a tolerance of epsilon.

Another difference is that the parameter named ‘power_t’ has the default value of 0.25 rather than 0.5 as in SGDClassifier. Furthermore, it doesn’t have ‘class_weight’ and ‘n_jobs’ parameters.
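
For illustration, a regressor using the ‘epsilon_insensitive’ loss described above could be configured as in the sketch below; the data and the epsilon value of 0.05 are arbitrary choices for the example −

import numpy as np
from sklearn import linear_model

rng = np.random.RandomState(0)
X = rng.randn(10, 5)
y = rng.randn(10)

# Errors smaller than epsilon are simply ignored by this loss
reg = linear_model.SGDRegressor(loss='epsilon_insensitive', epsilon=0.05, max_iter=1000, tol=1e-3)
reg.fit(X, y)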

Attributes

Attributes of SGDRegressor are also the same as those of the SGDClassifier module. In addition, it has three extra attributes as follows −

  1. average_coef_ − array, shape(n_features,)

As the name suggests, it provides the average weights assigned to the features.

  2. average_intercept_ − array, shape(1,)

As the name suggests, it provides the averaged intercept term.

  3. t_ − int

It provides the number of weight updates performed during the training phase.

Note − The attributes average_coef_ and average_intercept_ will work after setting the parameter ‘average’ to True.

Implementation Example

Following Python script uses SGDRegressor linear model −

import numpy as np
from sklearn import linear_model

# Generate a small random regression dataset
n_samples, n_features = 10, 5
rng = np.random.RandomState(0)
y = rng.randn(n_samples)
X = rng.randn(n_samples, n_features)

# Fit an SGD regressor with the Huber loss, an elastic-net penalty and averaging
SGDReg = linear_model.SGDRegressor(
   max_iter=1000, penalty="elasticnet", loss='huber', tol=1e-3, average=True
)
SGDReg.fit(X, y)

Output

SGDRegressor(
   alpha = 0.0001, average = True, early_stopping = False, epsilon = 0.1,
   eta0 = 0.01, fit_intercept = True, l1_ratio = 0.15,
   learning_rate = 'invscaling', loss = 'huber', max_iter = 1000,
   n_iter = None, n_iter_no_change = 5, penalty = 'elasticnet', power_t = 0.25,
   random_state = None, shuffle = True, tol = 0.001, validation_fraction = 0.1,
   verbose = 0, warm_start = False
)

Example

Now, once fitted, we can get the weight vector with the help of following python script −

SGDReg.coef_

Output

array([-0.00423314, 0.00362922, -0.00380136, 0.00585455, 0.00396787])

Example

Similarly, we can get the value of intercept with the help of following python script −

SGDReg.intercept_

Output

array([...])

Example

We can get the number of weight updates during training phase with the help of the following python script −

SGDReg.t_

Output

61.0
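
Example

Since average = True was passed when constructing SGDReg above, the averaged attributes described earlier are also available (their exact values will vary from run to run) −

SGDReg.average_coef_
SGDReg.average_intercept_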

Pros and Cons of SGD

Following are the pros of SGD −

  1. Stochastic Gradient Descent (SGD) is very efficient.

  2. It is very easy to implement as there are lots of opportunities for code tuning.

Following are the cons of SGD −

  1. Stochastic Gradient Descent (SGD) requires several hyperparameters like regularization parameters.

  2. It is sensitive to feature scaling, so it is recommended to scale the data first (see the sketch below).
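
Because SGD is sensitive to feature scaling, it is common practice to standardize the features before fitting; a minimal sketch using a scikit-learn pipeline (the toy data is purely illustrative) −

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier

X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
Y = np.array([1, 1, 2, 2])

# Standardize the features to zero mean and unit variance before SGD
clf = make_pipeline(StandardScaler(), SGDClassifier(max_iter=1000, tol=1e-3))
clf.fit(X, Y)
clf.predict([[2., 2.]])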