Machine Learning With Python: A Concise Tutorial

Improving Performance of ML Model (Contd…)

Performance Improvement with Algorithm Tuning

ML models are parameterized so that their behavior can be adjusted for a specific problem. Algorithm tuning means finding the combination of these parameters that gives the best model performance. This process is sometimes called hyperparameter optimization: the settings of the algorithm itself are called hyperparameters, while the coefficients that the ML algorithm learns from the data are called parameters.
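To make the distinction concrete, here is a minimal sketch that is not part of the original recipe: a one-feature ridge fit without intercept, written in plain Python. The function name `fit_ridge_1d` and the toy data are assumptions chosen for illustration. `alpha` is the hyperparameter, fixed before fitting; `w` is the parameter, computed from the data.

```python
def fit_ridge_1d(xs, ys, alpha):
    """Fit y ~ w * x (no intercept) with an L2 penalty alpha * w**2.

    Minimising sum((y - w*x)**2) + alpha * w**2 over w gives the
    closed form w = sum(x*y) / (sum(x*x) + alpha).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)


# Toy data (hypothetical): exactly y = 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# alpha is set by us (hyperparameter); w is learned (parameter)
weights = {alpha: fit_ridge_1d(xs, ys, alpha) for alpha in (0.0, 1.0, 10.0)}
for alpha, w in weights.items():
    print(f"alpha={alpha}: learned w={w:.4f}")
```

With `alpha=0` the fit recovers the true slope; larger values of `alpha` shrink the learned coefficient toward zero, which is why the choice of `alpha` is worth tuning.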

Here, we discuss some of the methods that Python's Scikit-learn provides for algorithm parameter tuning.

Grid Search Parameter Tuning

Grid search is a parameter tuning approach that methodically builds and evaluates a model for every possible combination of the algorithm parameters specified in a grid. In other words, it is an exhaustive search over the grid.
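The idea of trying every combination can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `grid_search` and `toy_score` are hypothetical names, the score function is made up, and there is no cross-validation here, unlike in sklearn's GridSearchCV.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Score every combination in param_grid.

    Returns (best_params, best_score, n_evaluated); higher scores are better.
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    n_evaluated = 0
    # product() yields one tuple per combination of the listed values
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        n_evaluated += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_evaluated


def toy_score(alpha, fit_intercept):
    """Made-up score surface that peaks at alpha=0.1 with an intercept."""
    return -(alpha - 0.1) ** 2 + (0.05 if fit_intercept else 0.0)


param_grid = {
    "alpha": [1, 0.1, 0.01, 0.001, 0.0001, 0],
    "fit_intercept": [True, False],
}
best_params, best_score, n_evaluated = grid_search(param_grid, toy_score)
print(n_evaluated, best_params, best_score)
```

The number of evaluations is the product of the list lengths (6 x 2 = 12 here), so the cost of a grid search grows quickly as more parameters or values are added.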

Example

In the following Python recipe, we perform a grid search with the GridSearchCV class of sklearn to evaluate various alpha values for the Ridge Regression algorithm on the Pima Indians diabetes dataset.

First, import the required packages as follows:

import numpy
from pandas import read_csv
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

Now, load the Pima Indians diabetes dataset as we did in previous examples:

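# Adjust the path to wherever the CSV file is stored
path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
array = data.values
X = array[:,0:8]  # first eight columns: input features
Y = array[:,8]    # last column: class label, used as the target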

Next, define the alpha values to evaluate as follows:

alphas = numpy.array([1,0.1,0.01,0.001,0.0001,0])
param_grid = dict(alpha=alphas)

Now, apply grid search to our model:

model = Ridge()
grid = GridSearchCV(estimator=model, param_grid=param_grid)
grid.fit(X, Y)

Print the results with the following script lines:

print(grid.best_score_)
print(grid.best_estimator_.alpha)

Output

0.2796175593129722
1.0

The output above shows the best score and the parameter value in the grid that achieved it. Because no scoring argument is passed, the score is the estimator's default, which for Ridge is R² averaged over the cross-validation folds. The best alpha value in this case is 1.0.
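To see how each candidate in the grid performed, and not only the winner, the fitted search object can be inspected through its `cv_results_` attribute. The sketch below is an assumption-laden variant of the recipe: it uses synthetic data from `make_regression` instead of the Pima CSV so that it runs without the file, and it sets `cv=5` explicitly. Its numbers are therefore not comparable to the output above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the dataset (assumption: 200 rows, 8 features)
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=7)

alphas = [1.0, 0.1, 0.01, 0.001, 0.0001]
grid = GridSearchCV(estimator=Ridge(), param_grid={"alpha": alphas}, cv=5)
grid.fit(X, y)

# One entry per candidate: its parameters and its mean score across the folds
results = grid.cv_results_
for params, mean_score in zip(results["params"], results["mean_test_score"]):
    print(params, round(mean_score, 4))

print("best:", grid.best_params_, grid.best_score_)
```

`best_score_` is simply the highest of the mean test scores listed in `cv_results_`, and `best_params_` is the candidate that produced it.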

Random Search Parameter Tuning

Random search is a parameter tuning approach that samples the algorithm parameters from a random distribution for a fixed number of iterations and evaluates a model for each sampled combination.
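The sampling loop can be sketched in plain Python as follows. This is a simplified illustration, not sklearn's implementation: `random_search`, `sample_alpha`, and `toy_score` are hypothetical names, the score function is made up, and no cross-validation is performed.

```python
import random


def random_search(sample_params, score_fn, n_iter, seed):
    """Draw n_iter parameter sets at random and keep the best-scoring one.

    Returns (best_params, best_score, trials); higher scores are better.
    """
    rng = random.Random(seed)  # seeded so that the run is reproducible
    trials = []
    for _ in range(n_iter):
        params = sample_params(rng)
        trials.append((score_fn(**params), params))
    best_score, best_params = max(trials, key=lambda trial: trial[0])
    return best_params, best_score, trials


def sample_alpha(rng):
    """Sample alpha uniformly between 0 and 1."""
    return {"alpha": rng.uniform(0.0, 1.0)}


def toy_score(alpha):
    """Made-up score surface that peaks at alpha=0.3."""
    return -(alpha - 0.3) ** 2


best_params, best_score, trials = random_search(
    sample_alpha, toy_score, n_iter=50, seed=7
)
print(best_params, best_score)
```

Unlike grid search, the cost is fixed by `n_iter` regardless of how many parameters are sampled, and the candidates are not restricted to a predefined list of values.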

Example

In the following Python recipe, we perform a random search with the RandomizedSearchCV class of sklearn to evaluate different alpha values between 0 and 1 for the Ridge Regression algorithm on the Pima Indians diabetes dataset.

First, import the required packages as follows:

from pandas import read_csv
from scipy.stats import uniform
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

Now, load the Pima Indians diabetes dataset as we did in previous examples:

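# Adjust the path to wherever the CSV file is stored
path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
array = data.values
X = array[:,0:8]  # first eight columns: input features
Y = array[:,8]    # last column: class label, used as the target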

Next, evaluate randomly sampled alpha values for the Ridge Regression algorithm as follows:

# uniform() with default arguments samples alpha from the interval [0, 1]
param_grid = {'alpha': uniform()}
model = Ridge()
random_search = RandomizedSearchCV(estimator=model, param_distributions=param_grid,
                                   n_iter=50, random_state=7)
random_search.fit(X, Y)

Print the results with the following script lines:

print(random_search.best_score_)
print(random_search.best_estimator_.alpha)

Output

0.27961712703051084
0.9779895119966027

The output above shows a best score almost identical to the one found by grid search (about 0.2796 in both cases), reached here at a sampled alpha of roughly 0.978.