Machine Learning - Stacking
Stacking, also known as stacked generalization, is an ensemble learning technique in machine learning where multiple models are combined in a hierarchical manner to improve prediction accuracy. The technique involves training a set of base models on the original training dataset, and then using the predictions of these base models as inputs to a meta-model, which is trained to make the final predictions.
The basic idea behind stacking is to leverage the strengths of multiple models by combining them in a way that compensates for their individual weaknesses. By using a diverse set of models that make different assumptions and capture different aspects of the data, we can improve the overall predictive power of the ensemble.
The stacking technique can be divided into two stages −
- Base Model Training − In this stage, a set of base models is trained on the original training data. These models can be of any type, such as decision trees, random forests, support vector machines, neural networks, or any other algorithm. Using cross-validation, each model is fit on a subset of the training data and produces out-of-fold predictions for the remaining data points.
- Meta-model Training − In this stage, the predictions of the base models are used as input features for a meta-model, which is trained against the original target values. The goal of the meta-model is to learn how to combine the base models' predictions into more accurate final predictions. The meta-model can be of any type, such as linear regression, logistic regression, or any other algorithm. The base-model predictions fed to it are generated with cross-validation so that the meta-model does not overfit to predictions the base models made on their own training data. A minimal sketch of both stages is shown after this list.
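As a rough sketch of these two stages, the base models' out-of-fold predictions can be generated with cross-validation and then used as training features for the meta-model. The specific models, the use of predicted probabilities, and the 5-fold split below are illustrative choices, not a fixed recipe.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Stage 1: each base model produces out-of-fold class-probability predictions
base_models = [
   RandomForestClassifier(n_estimators=10, random_state=42),
   GradientBoostingClassifier(random_state=42),
]
meta_features = np.hstack([
   cross_val_predict(model, X, y, cv=5, method="predict_proba")
   for model in base_models
])

# Stage 2: the meta-model is trained on the base models' predictions
meta_model = LogisticRegression(max_iter=1000)
meta_model.fit(meta_features, y)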
Once the meta-model is trained, it can be used to make predictions on new data points by passing the predictions of the base models as inputs. The predictions of the base models can be combined in different ways, such as by taking the average, weighted average, or maximum.
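For instance, a simple way to blend two fitted base models without a learned meta-model is to average their predicted class probabilities and pick the most probable class. The snippet below is a minimal sketch of that idea; the models and the 70/30 split are only placeholders.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

rf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_train, y_train)
gb = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)

# Average the two models' class probabilities and take the most probable class
avg_proba = (rf.predict_proba(X_test) + gb.predict_proba(X_test)) / 2
y_pred = avg_proba.argmax(axis=1)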
Example
Here is an example implementation of stacking in Python using scikit-learn −
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.metrics import accuracy_score
# Load the iris dataset
iris = load_iris()
X, y = iris.data, iris.target
# Define the base models
rf = RandomForestClassifier(n_estimators=10, random_state=42)
gb = GradientBoostingClassifier(random_state=42)
# Define the meta-model
lr = LogisticRegression()
# Define the stacking classifier (it internally uses cross-validated
# base-model predictions to train the final estimator)
stack = StackingClassifier(estimators=[('rf', rf), ('gb', gb)], final_estimator=lr)
# Use cross-validation to obtain out-of-fold predictions from the stacked model
y_pred = cross_val_predict(stack, X, y, cv=5)
# Evaluate the performance of the stacked model
acc = accuracy_score(y, y_pred)
print(f"Accuracy: {acc}")
In this code, we first load the iris dataset and define the base models, which are a random forest and a gradient boosting classifier. We then define the meta-model, which is a logistic regression model.
We create a StackingClassifier object with the base models and the meta-model, use cross-validation to obtain out-of-fold predictions from the stacked model, and finally evaluate its performance using the accuracy score.
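As a follow-up usage sketch, the fitted stack behaves like any other scikit-learn estimator, so it can also be trained on one portion of the data and used to predict unseen points. The snippet below reuses X, y, stack, and accuracy_score from the example above; the 70/30 split is an illustrative choice.

from sklearn.model_selection import train_test_split

# Hold out a test set and fit the stacked model on the training portion only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
stack.fit(X_train, y_train)

# Predict the unseen points with the trained stack and report held-out accuracy
print("Held-out accuracy:", accuracy_score(y_test, stack.predict(X_test)))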