Machine Learning Concise Tutorial

Machine Learning - Data Leakage

Data leakage is a common problem in machine learning. It occurs when information from outside the training dataset, such as the target itself or data from the test set, is used to create or evaluate a model. The model then looks better during development than it really is: it is too closely tailored to the data it has seen and performs poorly on new data.

There are two main types of data leakage: target leakage and train-test contamination.

Target Leakage

Target leakage occurs when features that are not available at prediction time are used to create the model. For example, if we are predicting whether a customer will churn and we include the customer’s cancellation date as a feature, the model has access to information that exists only after the outcome is known. This can lead to unrealistically high accuracy during training and validation, and poor performance on new data.
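The effect can be illustrated with a minimal sketch on synthetic data. Everything here is an assumption made for illustration: the feature names `monthly_usage` and `has_cancellation_date`, the way the churn label is generated, and the choice of a decision tree as the model.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_customers = 1000

# Hypothetical feature that is known before the prediction is made.
monthly_usage = rng.normal(size=n_customers)

# Synthetic churn label, only weakly related to usage.
noise = rng.normal(scale=2.0, size=n_customers)
churned = (monthly_usage + noise < 0).astype(int)

# Hypothetical leaky feature: a cancellation date exists only for customers
# who have already churned, so this flag is a copy of the target.
has_cancellation_date = churned.copy()

X_honest = monthly_usage.reshape(-1, 1)
X_leaky = np.column_stack([monthly_usage, has_cancellation_date])


def holdout_accuracy(X, y):
    """Train a decision tree and return its accuracy on a held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0
    )
    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    return model.score(X_test, y_test)


honest_accuracy = holdout_accuracy(X_honest, churned)
leaky_accuracy = holdout_accuracy(X_leaky, churned)

print("Without leaky feature:", honest_accuracy)
print("With leaky feature:", leaky_accuracy)
```

The model with the leaky feature scores far higher on the held-out split, but that score says nothing about real use, because the cancellation flag would not exist for a customer whose churn we are still trying to predict.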

Train-test Contamination

Train-test contamination occurs when information from the test set is inadvertently used in the training process. For example, if we normalize the data using the mean and standard deviation of the entire dataset instead of the training set alone, the scaled training data carries information about the test set. The test set is then no longer truly unseen, which leads to overly optimistic estimates of model performance.
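A small sketch of the difference, using synthetic NumPy data (the array shapes and the 80/20 split are assumptions for illustration): the contaminated statistics are computed over all rows, while the correct ones use the training rows only and are then reused for the test rows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 100 samples, 3 features.
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_train, X_test = X[:80], X[80:]

# Contaminated: statistics computed on the entire dataset,
# so the test rows influence how the training data is scaled.
leaky_mean = X.mean(axis=0)
leaky_std = X.std(axis=0)

# Correct: statistics computed on the training set only...
train_mean = X_train.mean(axis=0)
train_std = X_train.std(axis=0)

# ...and the same training statistics applied to both sets.
X_train_scaled = (X_train - train_mean) / train_std
X_test_scaled = (X_test - train_mean) / train_std

print("Mean from all rows:     ", leaky_mean)
print("Mean from training rows:", train_mean)
```

The two sets of statistics differ, and that difference is exactly the information about the test rows that would otherwise leak into training.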

How to Prevent Data Leakage?

To prevent data leakage, preprocess the data carefully and make sure that no information from the test set is used during training. Some strategies for preventing data leakage include:

  1. Splitting the data into separate training and test sets before doing any preprocessing or feature engineering.

  2. Only using features that would be available at the time of prediction.

  3. Using cross-validation to evaluate model performance instead of a single train-test split, with every preprocessing step refit inside each fold.

  4. Ensuring that all preprocessing steps (such as normalization or scaling) are fitted on the training set only, and then applying the same fitted transformations to the test set.

  5. Being aware of any potential sources of leakage, such as date or time-based features, and handling them appropriately.
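Strategies 3 and 4 can be combined by passing a scikit-learn `Pipeline` to `cross_val_score`, so that the scaler is refit on the training portion of every fold. The following is a minimal sketch; the choice of dataset, model, and `cv=5` is an assumption for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Load features and labels
X, y = load_breast_cancer(return_X_y=True)

# Scaling is part of the pipeline, so it is fitted only on the
# training portion of each fold and never sees the validation rows.
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC())
])

# 5-fold cross-validation; returns one accuracy score per fold
scores = cross_val_score(pipeline, X, y, cv=5)

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

Scaling the full dataset first and then cross-validating only the classifier would reintroduce train-test contamination, because each validation fold would have contributed to the scaling statistics.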

Implementation in Python

Here is an example that uses the scikit-learn breast cancer dataset and a pipeline to ensure that no information from the test set leaks into the model during training:

Example

from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Load the breast cancer dataset
data = load_breast_cancer()

# Separate features and labels
X, y = data.data, data.target

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC())
])

# Fit the pipeline on the train set
pipeline.fit(X_train, y_train)

# Make predictions on the test set
y_pred = pipeline.predict(X_test)

# Evaluate the model performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

When you execute this code, it will produce the following output:

Accuracy: 0.9824561403508771