Machine Learning: A Concise Tutorial

Machine Learning - Train and Test

In machine learning, the train-test split is a common technique for evaluating a model's performance. The available data is divided into two sets: a training set, which is used to fit the model, and a testing set, which is used to evaluate it.

The train-test split matters because it lets us test the model on data it has not seen before. If we evaluated the model on the same data it was trained on, it could score well simply by fitting that data closely, and the score would tell us little about how well the model generalizes to new data.
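To make this concrete, here is a minimal sketch of the gap that can appear between training and testing scores. The synthetic dataset from make_classification (with some labels randomly flipped) and the DecisionTreeClassifier are assumptions chosen for illustration only; they are not part of the tutorial's main example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (an illustrative assumption): flip_y randomly
# reassigns a fraction of the labels, so some labels cannot be learned.
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.2, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# A decision tree with no depth limit can memorize the training set
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Score the same model on the data it was trained on and on held-out data
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"Training accuracy: {train_accuracy:.2f}")
print(f"Testing accuracy:  {test_accuracy:.2f}")
```

The training score here reflects memorization of the training samples, noise included, so only the testing score gives a usable estimate of how the model behaves on new data.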

Example

In Python, the train_test_split function from the sklearn.model_selection module splits data into training and testing sets. Here is an example implementation:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load the iris dataset
data = load_iris()
X = data.data
y = data.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a logistic regression model and fit it to the training data
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate the model on the testing data
accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2f}")

In this example, we load the iris dataset and split it into training and testing sets using the train_test_split function. We then create a logistic regression model and fit it to the training data. Finally, we evaluate the model on the testing data using the model's score method, which for a classifier returns the mean accuracy.

The test_size parameter of train_test_split specifies the proportion of the data reserved for testing. In this example it is set to 0.2, so 20% of the data is used for testing and 80% for training; for the 150-sample iris dataset, that is 30 testing samples and 120 training samples. The random_state parameter fixes the seed used to shuffle the data, so the split is reproducible and we get the same split every time we run the code.
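The effect of both parameters can be checked directly. The following is a small sketch using the same iris data and the same arguments as the example; the variable names ending in _again are introduced here only for the comparison.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same arguments as in the example: 20% held out, fixed seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)

# Splitting again with the same random_state selects the same rows
X_train_again, X_test_again, y_train_again, y_test_again = train_test_split(
    X, y, test_size=0.2, random_state=42
)
same_split = np.array_equal(X_test, X_test_again)
print(f"Same split on second call: {same_split}")
```

Leaving random_state unset would shuffle the data differently on each run, so the reported accuracy could vary from one execution to the next.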

Output

When you execute this code, it will produce the following output:

Accuracy: 1.00

Overall, the train-test split is a crucial step in evaluating a machine learning model. Holding out a testing set does not by itself prevent overfitting, but it lets us check whether the model is overfitting the training data and estimate how well it generalizes to new data.