Machine Learning - Cross Validation

Cross-validation is a powerful technique used in machine learning to estimate the performance of a model on unseen data. It is an essential step in building a robust machine learning model, as it helps to identify overfitting or underfitting and to determine the optimal model hyperparameters.

What is Cross-Validation?

Cross-validation is a technique used to evaluate the performance of a model by partitioning the dataset into subsets, training the model on a portion of the data, and then validating the model on the remaining data. The basic idea behind cross-validation is to use a subset of the data to train the model and another subset to test its performance. This allows the machine learning model to be trained on a variety of data and to generalize better to new data.

There are different types of cross-validation techniques, but the most widely used is k-fold cross-validation. In k-fold cross-validation, the data is partitioned into k equally sized folds. The model is then trained on k-1 folds and tested on the remaining fold. This process is repeated k times, with each of the k folds used exactly once as the validation data. The scores from the k iterations are then averaged to obtain an estimate of the model's performance.
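
To make the fold mechanics concrete, the short sketch below (an illustration on a toy array, not part of the tutorial's main example) uses Scikit-learn's KFold splitter to show which samples land in the training and validation folds at each of the k iterations.

from sklearn.model_selection import KFold
import numpy as np

# A toy dataset of 10 samples, just to illustrate the splitting
X = np.arange(10).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# 5 folds: each iteration trains on 4 folds and validates on the remaining one
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    print("Fold {}: train indices {}, validation indices {}".format(fold, train_idx, val_idx))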

Why is Cross-Validation Important?

Cross-validation is an essential technique in machine learning because it helps to prevent overfitting or underfitting of a model. Overfitting occurs when the model is too complex and fits the training data too closely, resulting in poor performance on new data. On the other hand, underfitting occurs when the model is too simple and does not capture the underlying patterns in the data, resulting in poor performance on both the training and test data.
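
As an illustration of how cross-validation exposes overfitting, the sketch below (a side example, not part of the main walkthrough) compares the training accuracy of an unconstrained decision tree with its mean cross-validation accuracy on the Iris dataset. A training score noticeably higher than the cross-validation score is a typical sign that the model is overfitting.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
import numpy as np

iris = load_iris()
X, y = iris.data, iris.target

# An unconstrained tree can fit the training data almost perfectly
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X, y)
train_accuracy = tree.score(X, y)

# Cross-validation scores the model on folds it was not trained on
cv_accuracy = np.mean(cross_val_score(tree, X, y, cv=5))

print("Training accuracy: {:.2f}".format(train_accuracy))
print("Mean cross-validation accuracy: {:.2f}".format(cv_accuracy))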

Cross-validation also helps to determine the optimal model hyperparameters. Hyperparameters are the settings that control the behavior of the model. For example, in a decision tree algorithm, the maximum depth of the tree is a hyperparameter that determines the level of complexity of the model. By using cross-validation to evaluate the performance of the model at different hyperparameter values, we can select the optimal hyperparameters that maximize the model’s performance.
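
One common way to run this search is Scikit-learn's GridSearchCV, which scores every candidate hyperparameter value with cross-validation and keeps the best one. The sketch below is an illustrative example using the Iris dataset and the max_depth hyperparameter mentioned above.

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Candidate values for the max_depth hyperparameter
param_grid = {"max_depth": [1, 2, 3, 4, 5, None]}

# GridSearchCV evaluates each candidate with 5-fold cross-validation
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid.fit(iris.data, iris.target)

print("Best max_depth:", grid.best_params_["max_depth"])
print("Best cross-validation score: {:.2f}".format(grid.best_score_))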

Implementing Cross-Validation in Python

In this section, we will discuss how to implement k-fold cross-validation in Python using the Scikit-learn library. Scikit-learn is a popular Python library for machine learning that provides a range of algorithms and tools for data preprocessing, model selection, and evaluation.

To demonstrate how to implement cross-validation in Python, we will use the famous Iris dataset. The Iris dataset contains measurements of the sepal length, sepal width, petal length, and petal width of three different species of iris flowers. The goal is to build a model that can predict the species of an iris flower based on its measurements.

First, we need to load the dataset using the Scikit-learn load_iris() function and split it into a training set and a test set using the train_test_split() function. The training set will be used to train the model, and the test set will be used to evaluate the performance of the model.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset
iris = load_iris()

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

Next, we will create a decision tree classifier using the Scikit-learn DecisionTreeClassifier() function.

from sklearn.tree import DecisionTreeClassifier

# Create a decision tree classifier
clf = DecisionTreeClassifier(random_state=42)

Now, we can use k-fold cross-validation to evaluate the performance of the model. We will use the cross_val_score() function from Scikit-learn to perform k-fold cross-validation. The function takes as input the model, the training data, the target variable, and the number of folds. It returns an array of scores, one for each fold.

from sklearn.model_selection import cross_val_score

# Perform k-fold cross-validation
scores = cross_val_score(clf, X_train, y_train, cv=5)

Here, we have specified the number of folds as 5, meaning that the data will be partitioned into 5 equally sized folds. The cross_val_score() function will train the model on 4 folds and test it on the remaining fold. This process will be repeated 5 times, with each fold used once as the validation data. The function returns an array of scores, one for each fold.
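
Note that when an integer is passed as cv for a classifier, Scikit-learn builds the folds with a stratified strategy, so each fold keeps roughly the same class proportions. If you want explicit control over how the folds are constructed, you can pass a splitter object instead of an integer. The sketch below (reusing clf, X_train, and y_train from the snippets above) shows one way to do this.

from sklearn.model_selection import StratifiedKFold

# Pass an explicit splitter to control shuffling and the random seed
cv_splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, X_train, y_train, cv=cv_splitter)

print(scores)   # one accuracy score per fold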

Finally, we can calculate the mean and standard deviation of the scores to get an estimate of the model’s performance.

import numpy as np

# Calculate the mean and standard deviation of the scores
mean_score = np.mean(scores)
std_score = np.std(scores)

print("Mean cross-validation score: {:.2f}".format(mean_score))
print("Standard deviation of cross-validation score: {:.2f}".format(std_score))

The output of this code will be the mean and standard deviation of the scores. The mean score represents the average performance of the model across all folds, while the standard deviation represents the variability of the scores.

Example

Here is the complete implementation of Cross-Validation in Python −

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

# Load the iris dataset
iris = load_iris()

# Define the features and target variables
X = iris.data
y = iris.target

# Create a decision tree classifier
clf = DecisionTreeClassifier(random_state=42)

# Perform k-fold cross-validation
scores = cross_val_score(clf, X, y, cv=5)

# Calculate the mean and standard deviation of the scores
mean_score = np.mean(scores)
std_score = np.std(scores)

print("Mean cross-validation score: {:.2f}".format(mean_score))
print("Standard deviation of cross-validation score: {:.2f}".format(std_score))

When you execute this code, it will produce the following output −

Mean cross-validation score: 0.95
Standard deviation of cross-validation score: 0.03