Machine Learning With Python: A Concise Tutorial
Improving Performance of ML Models
Performance Improvement with Ensembles
Ensembles can improve machine learning results by combining several models. An ensemble model consists of several individually trained supervised learning models whose results are merged in various ways to achieve better predictive performance than a single model. Ensemble methods are commonly divided into two groups: sequential methods, in which base learners are trained one after another, and parallel methods, in which base learners are trained independently of each other.
Ensemble Learning Methods
The following are the most popular ensemble learning methods, i.e. the methods for combining the predictions from different models −
Bagging
Bagging is also known as bootstrap aggregation. In bagging methods, the ensemble model tries to improve prediction accuracy and decrease model variance by combining the predictions of individual models trained on randomly generated training samples, each drawn from the training set with replacement. The final prediction of the ensemble is obtained by averaging the predictions of the individual estimators (or, for classification, by majority vote). One of the best-known examples of a bagging method is the random forest.
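The idea can be illustrated with a minimal sketch in plain Python. The toy dataset, the `fit_stump` base learner and the helper names are all hypothetical and chosen only for illustration; real bagging would use a proper estimator such as a decision tree.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def fit_stump(sample):
    """Toy base learner: predict 1 when x exceeds the sample's mean x."""
    threshold = sum(x for x, _ in sample) / len(sample)
    return lambda x: 1 if x > threshold else 0

def bagged_predict(models, x):
    """Aggregate the base learners by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

rng = random.Random(7)
# Hypothetical one-feature dataset: (x, label), with label 1 for larger x.
data = [(x, 1 if x >= 5 else 0) for x in range(10)]
# Each base learner sees a different bootstrap sample of the same data.
models = [fit_stump(bootstrap_sample(data, rng)) for _ in range(25)]

print(bagged_predict(models, 1), bagged_predict(models, 8))
```

Each stump's threshold varies with its bootstrap sample, and the vote smooths out that variation, which is the variance reduction that bagging aims for.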
Boosting
In boosting methods, the ensemble model is built incrementally by training each base estimator sequentially. Boosting combines several weak base learners, trained sequentially over multiple iterations of the training data, into a powerful ensemble. During training, higher weights are assigned to the training instances that earlier learners misclassified, so that later learners concentrate on them. AdaBoost is an example of a boosting method.
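The reweighting step can be sketched as follows. This is a simplified, AdaBoost-style update for binary classification written from the textbook formulation, not sklearn's internal implementation; the function name `reweight` and the example values are assumptions made for illustration.

```python
import math

def reweight(weights, correct):
    """One AdaBoost-style round: return (alpha, new normalized weights).

    Assumes the weighted error is strictly between 0 and 1.
    """
    # Weighted error of the current weak learner.
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    # The learner's say in the final vote; larger when the error is smaller.
    alpha = 0.5 * math.log((1 - error) / error)
    # Shrink the weights of correct instances, grow those of misclassified ones.
    raw = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(raw)
    return alpha, [w / total for w in raw]

# Five equally weighted instances; the weak learner gets the last one wrong.
weights = [0.2] * 5
correct = [True, True, True, True, False]
alpha, new_weights = reweight(weights, correct)
print(round(alpha, 3), [round(w, 3) for w in new_weights])
```

After one round the single misclassified instance carries half of the total weight, so the next weak learner is pushed to classify it correctly.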
Voting
In voting, multiple models of different types are built, and simple statistics, such as the mean or median (or a majority vote over class labels), are used to combine their predictions into the final prediction.
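As a small illustration of these combination rules, the sketch below applies the mean, the median and a majority vote to three made-up model outputs. The probability values are hypothetical and do not come from any model in this tutorial.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical predicted probabilities of the positive class from three models.
probabilities = [0.9, 0.6, 0.2]
# The same models' hard class labels at a 0.5 threshold.
labels = [1 if p >= 0.5 else 0 for p in probabilities]

mean_vote = mean(probabilities)      # soft combination by averaging
median_vote = median(probabilities)  # less sensitive to one outlying model
majority_label = Counter(labels).most_common(1)[0][0]  # hard majority vote

print(mean_vote, median_vote, majority_label)
```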
Bagging Ensemble Algorithms
The following are three bagging ensemble algorithms −
Bagged Decision Tree
Bagging ensemble methods work well with algorithms that have high variance, and decision trees are a prime example. In the following Python recipe, we are going to build a bagged decision tree ensemble model by using the BaggingClassifier class of sklearn with DecisionTreeClassifier (a classification and regression trees algorithm) on the Pima Indians diabetes dataset.
First, import the required packages as follows −
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
Now, we need to load the Pima Indians diabetes dataset as we did in the previous examples −
path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]
Next, set up 10-fold cross validation as follows −
seed = 7
# random_state is omitted here: without shuffle=True it has no effect, and recent scikit-learn versions reject it
kfold = KFold(n_splits=10)
cart = DecisionTreeClassifier()
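For reference, the sketch below shows how KFold's `shuffle` and `random_state` parameters interact. The tiny array `X_demo` is a made-up stand-in, not the Pima data. Without shuffling, folds are taken in row order; with `shuffle=True`, the seed makes the shuffled folds reproducible.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical 10-row dataset used only to inspect the fold indices.
X_demo = np.arange(20).reshape(10, 2)

plain = KFold(n_splits=5)  # folds follow the original row order
shuffled_a = KFold(n_splits=5, shuffle=True, random_state=7)
shuffled_b = KFold(n_splits=5, shuffle=True, random_state=7)

# Test indices of the first fold for each splitter.
first_plain = next(plain.split(X_demo))[1]
first_a = next(shuffled_a.split(X_demo))[1]
first_b = next(shuffled_b.split(X_demo))[1]

print(first_plain, first_a, first_b)
```

Shuffling is worth considering when the rows of a dataset are ordered in some meaningful way, since unshuffled folds would then not be representative.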
We need to provide the number of trees we are going to build. Here we are building 150 trees −
num_trees = 150
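Next, build the model with the help of the following script −
# The base estimator is passed positionally because its keyword was renamed from base_estimator to estimator in scikit-learn 1.2
model = BaggingClassifier(cart, n_estimators=num_trees, random_state=seed)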
Calculate and print the result as follows −
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
Output
0.7733766233766234
The output above shows that our bagged decision tree classifier model achieved an accuracy of around 77%.
Random Forest
Random forest is an extension of bagged decision trees. For the individual classifiers, samples of the training dataset are drawn with replacement, but the trees are constructed in a way that reduces the correlation between them: at each split point, only a random subset of the features is considered, rather than greedily choosing the best split over all features.
In the following Python recipe, we are going to build a random forest ensemble model by using the RandomForestClassifier class of sklearn on the Pima Indians diabetes dataset.
First, import the required packages as follows −
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
Now, we need to load the Pima Indians diabetes dataset as we did in the previous examples −
path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]
Next, set up 10-fold cross validation as follows −
seed = 7
# random_state is omitted here: without shuffle=True it has no effect, and recent scikit-learn versions reject it
kfold = KFold(n_splits=10)
We need to provide the number of trees to build. Here we are building 150 trees, with each split point chosen from a random subset of 5 features −
num_trees = 150
max_features = 5
Next, build the model with the help of the following script −
model = RandomForestClassifier(n_estimators=num_trees, max_features=max_features)
Calculate and print the result as follows −
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
Output
0.7629357484620642
The output above shows that our random forest classifier model achieved an accuracy of around 76%. The exact value varies between runs because the model is not given a random_state.
Extra Trees
Extra Trees (extremely randomized trees) is another extension of the bagged decision tree ensemble method. In this method, randomized trees are constructed from samples of the training dataset, and split thresholds are drawn at random for the candidate features rather than searched for.
In the following Python recipe, we are going to build an Extra Trees ensemble model by using the ExtraTreesClassifier class of sklearn on the Pima Indians diabetes dataset.
First, import the required packages as follows −
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import ExtraTreesClassifier
Now, we need to load the Pima Indians diabetes dataset as we did in the previous examples −
path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]
Next, set up 10-fold cross validation as follows −
seed = 7
# random_state is omitted here: without shuffle=True it has no effect, and recent scikit-learn versions reject it
kfold = KFold(n_splits=10)
We need to provide the number of trees to build. Here we are building 150 trees, with each split point chosen from a random subset of 5 features −
num_trees = 150
max_features = 5
Next, build the model with the help of the following script −
model = ExtraTreesClassifier(n_estimators=num_trees, max_features=max_features)
Calculate and print the result as follows −
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
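One practical difference between the two forest variants in sklearn can be checked directly: by default, RandomForestClassifier trains each tree on a bootstrap sample, while ExtraTreesClassifier trains each tree on the whole training set. The short sketch below only inspects the constructor defaults and does not fit anything; the variable names are chosen for illustration.

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

forest = RandomForestClassifier(n_estimators=150, max_features=5)
extra = ExtraTreesClassifier(n_estimators=150, max_features=5)

# The bootstrap parameter controls whether each tree sees a resampled dataset.
print(forest.bootstrap, extra.bootstrap)
```

Passing `bootstrap=True` to ExtraTreesClassifier would make it resample the training rows as well, if that behaviour is wanted.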
Boosting Ensemble Algorithms
The following are the two most common boosting ensemble algorithms −
AdaBoost
AdaBoost is one of the most successful boosting ensemble algorithms. The key to this algorithm is the way it assigns weights to the instances in the dataset: instances that are hard to classify receive higher weights, so the algorithm pays more attention to them, and less to easily classified instances, while constructing subsequent models.
In the following Python recipe, we are going to build an AdaBoost ensemble model for classification by using the AdaBoostClassifier class of sklearn on the Pima Indians diabetes dataset.
First, import the required packages as follows −
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import AdaBoostClassifier
Now, we need to load the Pima Indians diabetes dataset as we did in the previous examples −
path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]
Next, set up 10-fold cross validation as follows −
seed = 5
# random_state is omitted here: without shuffle=True it has no effect, and recent scikit-learn versions reject it
kfold = KFold(n_splits=10)
We need to provide the number of trees to build. Here we are building 50 trees −
num_trees = 50
Next, build the model with the help of the following script −
model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)
Calculate and print the result as follows −
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
Output
0.7539473684210527
The output above shows that our AdaBoost classifier ensemble model achieved an accuracy of around 75%.
Stochastic Gradient Boosting
Stochastic gradient boosting is also called gradient boosting machines. In the following Python recipe, we are going to build a stochastic gradient boosting ensemble model for classification by using the GradientBoostingClassifier class of sklearn on the Pima Indians diabetes dataset.
First, import the required packages as follows −
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
Now, we need to load the Pima Indians diabetes dataset as we did in the previous examples −
path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]
Next, set up 10-fold cross validation as follows −
seed = 5
# random_state is omitted here: without shuffle=True it has no effect, and recent scikit-learn versions reject it
kfold = KFold(n_splits=10)
We need to provide the number of trees to build. Here we are building 50 trees −
num_trees = 50
Next, build the model with the help of the following script −
model = GradientBoostingClassifier(n_estimators=num_trees, random_state=seed)
Calculate and print the result as follows −
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
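The "stochastic" part of the name refers to fitting each tree on a random fraction of the training rows. In sklearn this is controlled by the `subsample` parameter, which defaults to 1.0 (all rows). The sketch below sets it to 0.8 on synthetic stand-in data generated with `make_classification`. The data, the 0.8 value and the variable names are assumptions for illustration, not part of the recipe above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption; this is not the Pima dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=5)

# subsample < 1.0 fits each tree on a random 80% of the rows.
sgb = GradientBoostingClassifier(n_estimators=50, subsample=0.8, random_state=5)
cv = KFold(n_splits=5, shuffle=True, random_state=5)
scores = cross_val_score(sgb, X_demo, y_demo, cv=cv)
print(scores.mean())
```

Subsampling typically trades a little bias for lower variance and faster training, so the best fraction is worth tuning per dataset.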
Voting Ensemble Algorithms
As discussed, voting first creates two or more standalone models from the training dataset. A voting classifier then wraps these models and combines the predictions of the sub-models whenever it is asked to predict on new data.
In the following Python recipe, we are going to build a voting ensemble model for classification by using the VotingClassifier class of sklearn on the Pima Indians diabetes dataset. We are combining the predictions of logistic regression, a decision tree classifier and an SVM for a classification problem as follows −
First, import the required packages as follows −
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
Now, we need to load the Pima Indians diabetes dataset as we did in the previous examples −
path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]
Next, set up 10-fold cross validation as follows −
# random_state is omitted here: without shuffle=True it has no effect, and recent scikit-learn versions reject it
kfold = KFold(n_splits=10)
Next, we need to create sub-models as follows −
estimators = []
model1 = LogisticRegression()
estimators.append(('logistic', model1))
model2 = DecisionTreeClassifier()
estimators.append(('cart', model2))
model3 = SVC()
estimators.append(('svm', model3))
Now, create the voting ensemble model by combining the predictions of the sub-models created above.
ensemble = VotingClassifier(estimators)
results = cross_val_score(ensemble, X, Y, cv=kfold)
print(results.mean())
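VotingClassifier uses hard (majority) voting by default. A possible variation is soft voting, which averages the predicted class probabilities instead. The sketch below shows it on synthetic stand-in data. The data, the `max_iter` value and the variable names are assumptions for illustration. SVC needs `probability=True` here because soft voting calls `predict_proba` on every sub-model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; this is not the Pima dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=7)

soft_estimators = [
    ('logistic', LogisticRegression(max_iter=1000)),
    ('cart', DecisionTreeClassifier(random_state=7)),
    ('svm', SVC(probability=True, random_state=7)),  # enables predict_proba
]
# voting='soft' averages the sub-models' class probabilities.
soft = VotingClassifier(soft_estimators, voting='soft').fit(X_demo, y_demo)
proba = soft.predict_proba(X_demo[:3])
print(proba)
```

Soft voting tends to help when the sub-models produce reasonably calibrated probabilities; otherwise hard voting is the safer default.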