Machine Learning - Bootstrap Aggregation (Bagging)
Bagging is an ensemble learning technique that combines the predictions of multiple models to improve the accuracy and stability of a single model. It involves creating multiple subsets of the training data by randomly sampling with replacement. Each subset is then used to train a separate model, and the final prediction is made by averaging the predictions of all models.
The main idea behind Bagging is to reduce the variance of a single model by using multiple models that are less complex but still accurate. By averaging the predictions of multiple models, Bagging reduces the risk of overfitting and improves the stability of the model.
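As a minimal sketch of this averaging idea (an illustration added here, not part of the library example below; the synthetic data and variable names are assumptions), the following NumPy snippet draws bootstrap samples from noisy observations, computes a simple estimate on each sample, and averages the per-sample estimates, which is the variance-reduction effect Bagging relies on.

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=100)   # synthetic noisy observations (assumption)

n_models = 50
estimates = []
for _ in range(n_models):
    # Draw a bootstrap sample: same size as the data, sampled with replacement
    sample = rng.choice(data, size=data.size, replace=True)
    # "Train" a very simple model on the sample (here, just its mean)
    estimates.append(sample.mean())

# The bagged prediction averages the individual models' outputs
print("Bagged estimate:", np.mean(estimates))
print("Spread of individual estimates:", np.std(estimates))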
How Does Bagging Work?
The Bagging algorithm works in the following steps −
- Create multiple subsets of the training data by randomly sampling with replacement.
- Train a separate model on each subset of the data.
- Make predictions on the testing data using each model.
- Combine the predictions of all models by taking the average or majority vote.
The key feature of Bagging is that each model is trained on a different subset of the training data, which introduces diversity into the ensemble. The individual models are typically instances of a simple base estimator, such as a decision tree, logistic regression, or support vector machine.
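To make these steps concrete, here is a hand-rolled sketch of the same procedure (for illustration only; the scikit-learn BaggingClassifier used in the example below does this internally, and the variable names here are assumptions). Each decision tree is fit on its own bootstrap subset of the Iris training data, and the test predictions are combined by majority vote.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rng = np.random.default_rng(42)
n_models = 10
all_preds = []
for _ in range(n_models):
    # Step 1: sample row indices with replacement (a bootstrap subset)
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Step 2: train a separate base model on this subset
    tree = DecisionTreeClassifier(max_depth=3).fit(X_train[idx], y_train[idx])
    # Step 3: predict on the test data with this model
    all_preds.append(tree.predict(X_test))

# Step 4: combine the predictions by majority vote across the ensemble
all_preds = np.stack(all_preds)                      # shape: (n_models, n_test)
ensemble_pred = np.apply_along_axis(
    lambda votes: np.bincount(votes).argmax(), axis=0, arr=all_preds)
print("Hand-rolled bagging accuracy:", (ensemble_pred == y_test).mean())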
Example
Now let’s see how we can implement Bagging in Python using the Scikit-learn library. For this example, we will use the famous Iris dataset.
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load the Iris dataset
iris = load_iris()
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
# Define the base estimator: a shallow decision tree
base_estimator = DecisionTreeClassifier(max_depth=3)
# Define the Bagging classifier with 10 trees
# (scikit-learn >= 1.2 uses the "estimator" keyword; older versions use "base_estimator")
bagging = BaggingClassifier(estimator=base_estimator, n_estimators=10, random_state=42)
# Train the Bagging classifier
bagging.fit(X_train, y_train)
# Make predictions on the testing set
y_pred = bagging.predict(X_test)
# Evaluate the model's accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
In this example, we first load the Iris dataset using Scikit-learn’s load_iris function and split it into training and testing sets using the train_test_split function.
We then define the base estimator, which is a decision tree with a maximum depth of 3, and the Bagging classifier, which consists of 10 decision trees.
We train the Bagging classifier using the fit method and make predictions on the testing set using the predict method. Finally, we evaluate the model’s accuracy using the accuracy_score function from Scikit-learn’s metrics module.
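As an optional follow-up (a sketch assuming scikit-learn 1.2 or later, where the keyword is estimator), BaggingClassifier can also report an out-of-bag accuracy estimate: each tree is scored on the training rows that its bootstrap sample left out, which gives a validation-style score without touching the test set.

# Optional: out-of-bag evaluation (reuses X_train and y_train from the example above)
bagging_oob = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=3),
    n_estimators=50,      # more trees so nearly every training row is left out at least once
    oob_score=True,
    random_state=42)
bagging_oob.fit(X_train, y_train)
print("Out-of-bag score:", bagging_oob.oob_score_)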