Microsoft Cognitive Toolkit Tutorial
CNTK - Measuring Performance
This chapter will explain how to measure model performance in CNTK.
Strategy to validate model performance
After building an ML model, we train it on a set of data samples. Through this training, the model learns and derives some general rules. What matters is how the model performs when we feed it new samples, i.e. samples different from the ones provided during training. The model may behave differently in that case and may be poor at making good predictions on those new samples.
However, the model must also work well on new samples, because in a production environment it will receive inputs different from the sample data used for training. That is why we should validate an ML model using a set of samples different from the ones we trained on. Here, we are going to discuss two different techniques for creating a dataset to validate a NN.
Hold-out dataset
It is one of the easiest methods for creating a dataset to validate a NN. As the name implies, in this method we hold back one set of samples from training (say 20%) and use it to test the performance of our ML model. The following diagram shows the ratio between training and validation samples −

The hold-out dataset approach ensures that we have enough data to train our ML model, while still keeping a reasonable number of samples to get a good measurement of the model's performance.
When assigning samples to the training set and the test set, it is good practice to pick them at random from the main dataset. This ensures an even distribution between the training and test sets.
Following is an example in which we produce our own hold-out dataset by using the train_test_split function from the scikit-learn library.
Example
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Here test_size=0.2 means that 20% of the data is held out as test data.
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
classifier_knn = KNeighborsClassifier(n_neighbors=3)
classifier_knn.fit(X_train, y_train)
y_pred = classifier_knn.predict(X_test)
# Provide new sample data; the model will make predictions from it
sample = [[5, 5, 3, 2], [2, 4, 3, 5]]
preds = classifier_knn.predict(sample)
pred_species = [iris.target_names[p] for p in preds]
print("Predictions:", pred_species)
Output
Predictions: ['versicolor', 'virginica']
While using CNTK, we need to randomise the order of our dataset each time we train our model, because −
- Deep learning algorithms are highly influenced by random-number generators.
- The order in which we provide the samples to a NN during training greatly affects its performance.
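A simple way to randomise the sample order is to shuffle a shared index array before each training run. The following is only a minimal, generic NumPy sketch of that idea; the placeholder arrays and variable names are assumptions for illustration and no CNTK-specific API is used.

import numpy as np

# Placeholder feature and label arrays, only for illustration.
X = np.arange(20).reshape(10, 2).astype(np.float32)
y = np.arange(10)

rng = np.random.default_rng()          # fresh seed on every run
permutation = rng.permutation(len(X))  # a random order of indices

# Shuffle features and labels together so the pairs stay aligned.
X_shuffled = X[permutation]
y_shuffled = y[permutation]
print(y_shuffled)                      # a different order on every run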
The major downside of using the hold-out dataset technique is that it is unreliable: sometimes we get very good results and sometimes we get bad results, depending on which samples happen to end up in the test set.
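We can see this unreliability directly by repeating the earlier hold-out split with different random seeds; the measured accuracy changes from split to split. This is only an illustrative sketch reusing the iris data and k-NN classifier from the example above.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()

# The same model can score differently depending on which 20% is held out.
for seed in [1, 2, 3, 4, 5]:
   X_train, X_test, y_train, y_test = train_test_split(
      iris.data, iris.target, test_size=0.2, random_state=seed)
   model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
   print('seed %d: accuracy %.3f' % (seed, accuracy_score(y_test, model.predict(X_test))))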
K-fold cross validation
To make our ML model more reliable, there is a technique called K-fold cross validation. In essence, the K-fold cross validation technique is the same as the previous one, but it repeats the process several times, usually 5 to 10 times. The following diagram represents its concept −

Working of K-fold cross validation
The working of K-fold cross validation can be understood with the help of the following steps −
Step 1 − As in the hold-out dataset technique, in the K-fold cross validation technique we first need to split the dataset into a training and a test set. Ideally, the ratio is 80-20, i.e. 80% training set and 20% test set.
Step 2 − Next, we need to train our model using the training set.
Step 3 − Finally, we use the test set to measure the performance of our model. The only difference between the hold-out dataset technique and the K-fold cross validation technique is that the above process is repeated, usually 5 to 10 times, and at the end the average is calculated over all the performance metrics. That average is the final performance metric.
Let us see an example with a small dataset −
Example
from numpy import array
from sklearn.model_selection import KFold
data = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
for train, test in kfold.split(data):
   print('train: %s, test: %s' % (data[train], data[test]))
Output
train: [0.1 0.2 0.4 0.5 0.6 0.7 0.8 0.9], test: [0.3 1. ]
train: [0.1 0.2 0.3 0.4 0.6 0.8 0.9 1. ], test: [0.5 0.7]
train: [0.2 0.3 0.5 0.6 0.7 0.8 0.9 1. ], test: [0.1 0.4]
train: [0.1 0.3 0.4 0.5 0.6 0.7 0.9 1. ], test: [0.2 0.8]
train: [0.1 0.2 0.3 0.4 0.5 0.7 0.8 1. ], test: [0.6 0.9]
As we can see, because it uses a more realistic training and test scenario, the K-fold cross validation technique gives us a much more stable performance measurement. On the downside, it takes a lot of time when validating deep learning models.
CNTK does not have built-in support for K-fold cross validation, hence we need to write our own script to do so.
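A minimal sketch of such a script is shown below. It uses scikit-learn's KFold only to generate the fold indices and assumes a hypothetical train_and_evaluate(train_idx, test_idx) helper that wraps your own CNTK training and evaluation code and returns a single metric value; the averaging at the end corresponds to step 3 above, and the dataset size is an assumption.

import numpy as np
from sklearn.model_selection import KFold

# Hypothetical helper: it should train a fresh CNTK model on the samples
# selected by train_idx and return the chosen metric measured on test_idx.
def train_and_evaluate(train_idx, test_idx):
   ...   # your own CNTK training and evaluation code goes here
   return 0.0

num_samples = 150                      # size of the full dataset (assumed)
kfold = KFold(n_splits=5, shuffle=True, random_state=1)

scores = []
for train_idx, test_idx in kfold.split(np.arange(num_samples)):
   scores.append(train_and_evaluate(train_idx, test_idx))

# Step 3: the final performance metric is the average over all folds.
print('average metric over %d folds: %.3f' % (len(scores), np.mean(scores)))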
Detecting underfitting and overfitting
Whether we use the hold-out dataset or the K-fold cross validation technique, we will discover that the output of the metrics is different for the dataset used for training and the dataset used for validation.
Detecting overfitting
The phenomenon called overfitting is a situation where our ML model models the training data exceptionally well but fails to perform well on the testing data, i.e. it is not able to predict the test data.
It happens when an ML model learns the specific patterns and noise in the training data to such an extent that it negatively impacts the model's ability to generalise from the training data to new, i.e. unseen, data. Here, noise is the irrelevant information or randomness in a dataset.
Following are the two ways with the help of which we can detect whether our model is overfit or not −
- The overfit model will perform well on the same samples we used for training, but it will perform very badly on new samples, i.e. samples different from the training data.
- The model is overfit during validation if the metric on the test set is lower than the same metric on the training set.
Detecting underfitting
Another situation that can arise in ML is underfitting. This is a situation where our ML model does not model the training data well and fails to predict useful output. When we start training, during the first epoch our model will be underfitting, but it will become less underfit as training progresses.
One of the ways to detect whether our model is underfit or not is to look at the metrics for the training set and the test set. Our model is underfit if the metric on the test set is higher than the metric on the training set.
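The two rules described above can be turned into a simple check that compares the same metric on the training set and on the test set. The sketch below reuses the hold-out example from earlier in this chapter and uses accuracy as the metric; the 0.1 gap threshold is an illustrative assumption, not a fixed value.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Rebuild the hold-out example from earlier in this chapter.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
   iris.data, iris.target, test_size=0.2, random_state=1)
model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print('train accuracy: %.3f, test accuracy: %.3f' % (train_acc, test_acc))

# Overfitting: the test metric is noticeably worse than the training metric.
if train_acc - test_acc > 0.1:         # 0.1 is an illustrative threshold
   print('Possible overfitting')
# Underfitting heuristic from the text: the test metric is higher than the training metric.
elif test_acc > train_acc:
   print('Possible underfitting')
else:
   print('Training and test metrics are in reasonable agreement')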