Microsoft Cognitive Toolkit Tutorial
CNTK - Neural Network Classification
In this chapter, we will study how to perform neural network classification using CNTK.
Introduction
Classification may be defined as the process of predicting categorical output labels or responses for given input data. The categorical output, which is based on what the model has learned in the training phase, can take forms such as "Black" or "White", or "spam" or "no spam".
Mathematically, classification is the task of approximating a mapping function, say f, from input variables X to output variables Y.
A classic example of a classification problem is spam detection in e-mails. Obviously, there are only two categories of output: "spam" and "no spam".
To implement such a classification, we first need to train the classifier, using "spam" and "no spam" emails as the training data. Once the classifier has been trained successfully, it can be used to classify an unknown email.
Here, we are going to create a 4-5-3 NN using the iris flower dataset. The network has the following structure:
- 4 input nodes (one for each predictor value).
- 5 hidden processing nodes.
- 3 output nodes (because there are three possible species in the iris dataset).
Loading Dataset
We will be using the iris flower dataset, in which we want to classify the species of iris flowers based on the physical properties of sepal width and length and petal width and length. The dataset describes the physical properties of different varieties of iris flowers:
- Sepal length
- Sepal width
- Petal length
- Petal width
- Class, i.e. iris setosa, iris versicolor, or iris virginica
We have the iris.CSV file, which we also used in previous chapters. It can be loaded with the help of the Pandas library. But before using it for our classifier, we need to prepare the training and test files, so that the data can be used easily with CNTK.
Preparing training & test files
The iris dataset is one of the most popular datasets for ML projects. It has 150 data items, and the raw data looks as follows:
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
…
7.0 3.2 4.7 1.4 versicolor
6.4 3.2 4.5 1.5 versicolor
…
6.3 3.3 6.0 2.5 virginica
5.8 2.7 5.1 1.9 virginica
As mentioned earlier, the first four values on each line describe the physical properties of the flower, i.e. sepal length, sepal width, petal length, and petal width, and the fifth value gives the species.
However, we have to convert the data into a format that CNTK can use easily, namely a .ctf file (we also created an iris.ctf in a previous section). It looks as follows:
|attribs 5.1 3.5 1.4 0.2|species 1 0 0
|attribs 4.9 3.0 1.4 0.2|species 1 0 0
…
|attribs 7.0 3.2 4.7 1.4|species 0 1 0
|attribs 6.4 3.2 4.5 1.5|species 0 1 0
…
|attribs 6.3 3.3 6.0 2.5|species 0 0 1
|attribs 5.8 2.7 5.1 1.9|species 0 0 1
In the above data, the |attribs tag marks the start of the feature values and the |species tag marks the class label values. We can also use any other tag names we wish, and we can even add an item ID. For example, look at the following data:
|ID 001 |attribs 5.1 3.5 1.4 0.2|species 1 0 0 |#setosa
|ID 002 |attribs 4.9 3.0 1.4 0.2|species 1 0 0 |#setosa
…
|ID 051 |attribs 7.0 3.2 4.7 1.4|species 0 1 0 |#versicolor
|ID 052 |attribs 6.4 3.2 4.5 1.5|species 0 1 0 |#versicolor
…
There are 150 data items in total in the iris dataset. For this example, we will use the 80-20 hold-out rule, i.e. 80% of the data items (120 items) for training and the remaining 20% (30 items) for testing.
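The conversion from the raw space-delimited data to the CTF format is easy to script. Below is a minimal sketch; the file names iris_raw.txt, iris_train.ctf, and iris_test.ctf are illustrative assumptions rather than names used elsewhere in this tutorial, and the shuffle ensures each species ends up in both files:

import random

# One-hot encoding that matches the |species values shown above.
species_to_onehot = {
    "setosa":     "1 0 0",
    "versicolor": "0 1 0",
    "virginica":  "0 0 1",
}

# Read the raw, space-delimited data (hypothetical file name).
with open("iris_raw.txt") as f:
    rows = [line.split() for line in f if line.strip()]

# Build one CTF line per data item: |attribs <4 floats> |species <one-hot>.
ctf_lines = ["|attribs " + " ".join(r[:4]) + " |species " + species_to_onehot[r[4]]
             for r in rows]

# 80-20 hold-out split: shuffle, then 120 items for training, 30 for testing.
random.seed(1)
random.shuffle(ctf_lines)
with open("iris_train.ctf", "w") as f:
    f.write("\n".join(ctf_lines[:120]) + "\n")
with open("iris_test.ctf", "w") as f:
    f.write("\n".join(ctf_lines[120:]) + "\n")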
Constructing the Classification Model
First, we need to read the data files in CNTK format, and for that we are going to use a helper function named create_reader, as follows:
def create_reader(path, input_dim, output_dim, rnd_order, sweeps):
    x_strm = C.io.StreamDef(field='attribs', shape=input_dim, is_sparse=False)
    y_strm = C.io.StreamDef(field='species', shape=output_dim, is_sparse=False)
    streams = C.io.StreamDefs(x_src=x_strm, y_src=y_strm)
    deserial = C.io.CTFDeserializer(path, streams)
    mb_src = C.io.MinibatchSource(deserial, randomize=rnd_order, max_sweeps=sweeps)
    return mb_src
Now, we need to set the architecture arguments for our NN and also provide the location of the data files. This can be done with the following Python code:
def main():
    print("Using CNTK version = " + str(C.__version__) + "\n")
    input_dim = 4
    hidden_dim = 5
    output_dim = 3
    train_file = ".\\...\\" # provide the name of the training file (120 data items)
    test_file = ".\\...\\"  # provide the name of the test file (30 data items)
Now, with the help of the following lines of code, our program will create the untrained NN:
X = C.ops.input_variable(input_dim, np.float32)
Y = C.ops.input_variable(output_dim, np.float32)
with C.layers.default_options(init=C.initializer.uniform(scale=0.01, seed=1)):
    hLayer = C.layers.Dense(hidden_dim, activation=C.ops.tanh, name='hidLayer')(X)
    oLayer = C.layers.Dense(output_dim, activation=None, name='outLayer')(hLayer)
nnet = oLayer
model = C.ops.softmax(nnet)
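At this point the network exists but has learned nothing; its weights hold only the small uniform initial values. If you want to sanity-check the wiring before training, a quick sketch like the following (using the nnet variable defined above) prints each parameter's shape; for a 4-5-3 network we expect (4, 5) and (5,) for the hidden layer and (5, 3) and (3,) for the output layer:

# Optional sanity check: list the parameter tensors of the untrained net.
for p in nnet.parameters:
    print(p.name, p.shape)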
Now, once we have created the untrained network (nnet) and its softmax wrapper (model), we need to set up a Learner algorithm object and afterwards use it to create a Trainer object. We are going to use the SGD learner and the cross_entropy_with_softmax loss function:
tr_loss = C.cross_entropy_with_softmax(nnet, Y)
tr_clas = C.classification_error(nnet, Y)
max_iter = 2000
batch_size = 10
learn_rate = 0.01
learner = C.sgd(nnet.parameters, learn_rate)
trainer = C.Trainer(nnet, (tr_loss, tr_clas), [learner])
Alternatively, we can code the learning algorithm with learning-rate and momentum schedules, using the fsadagrad learner in place of the SGD learner above:
max_iter = 2000
batch_size = 10
lr_schedule = C.learning_parameter_schedule_per_sample([(1000, 0.05), (1, 0.01)])
mom_sch = C.momentum_schedule([(100, 0.99), (0, 0.95)], batch_size)
learner = C.fsadagrad(nnet.parameters, lr=lr_schedule, momentum=mom_sch)
trainer = C.Trainer(nnet, (tr_loss, tr_clas), [learner])
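The exact meaning of the (count, value) pairs in these schedules is easiest to confirm empirically: CNTK training-parameter schedules can be indexed by the number of samples seen so far. The sketch below is only a probe of the API, and the printed values should be verified against your CNTK version:

# Query the learning-rate schedule at a few sample counts; indexing a
# schedule returns the parameter value in effect at that point.
for n in (0, 500, 1500, 5000):
    print(n, lr_schedule[n])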
Now, once we have finished with the Trainer object, we need to create a reader to read the training data:
rdr = create_reader(train_file, input_dim, output_dim, rnd_order=True, sweeps=C.io.INFINITELY_REPEAT)
iris_input_map = { X : rdr.streams.x_src, Y : rdr.streams.y_src }
Now it's time to train our NN model:
for i in range(0, max_iter):
    curr_batch = rdr.next_minibatch(batch_size, input_map=iris_input_map)
    trainer.train_minibatch(curr_batch)
    if i % 500 == 0:
        mcee = trainer.previous_minibatch_loss_average
        macc = (1.0 - trainer.previous_minibatch_evaluation_average) * 100
        print("batch %4d: mean loss = %0.4f, mean accuracy = %0.2f%%" % (i, mcee, macc))
Once we are done with the training, let's evaluate the model using the test data items:
print("\nEvaluating test data \n")
rdr = create_reader(test_file, input_dim, output_dim, rnd_order=False, sweeps=1)
iris_input_map = { X : rdr.streams.x_src, Y : rdr.streams.y_src }
num_test = 30
all_test = rdr.next_minibatch(num_test, input_map=iris_input_map) acc = (1.0 - trainer.test_minibatch(all_test)) * 100
print("Classification accuracy = %0.2f%%" % acc)
After evaluating the accuracy of our trained NN model, we will use it to make a prediction on unseen data:
np.set_printoptions(precision = 1, suppress=True)
unknown = np.array([[6.4, 3.2, 4.5, 1.5]], dtype=np.float32)
print("\nPredicting Iris species for input features: ")
print(unknown[0])
pred_prob = model.eval(unknown)
np.set_printoptions(precision = 4, suppress=True)
print("Prediction probabilities are: ")
print(pred_prob[0])
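Since the three probabilities line up with the one-hot encoding used in the CTF files (setosa, versicolor, virginica, in that order), the predicted species is simply the index of the largest probability. A small follow-up sketch:

# Map the highest probability back to a species name; the label order
# matches the one-hot encoding used in the CTF files.
species = ["setosa", "versicolor", "virginica"]
print("Predicted species: " + species[np.argmax(pred_prob[0])])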
Complete Classification Model
import numpy as np
import cntk as C

def create_reader(path, input_dim, output_dim, rnd_order, sweeps):
    x_strm = C.io.StreamDef(field='attribs', shape=input_dim, is_sparse=False)
    y_strm = C.io.StreamDef(field='species', shape=output_dim, is_sparse=False)
    streams = C.io.StreamDefs(x_src=x_strm, y_src=y_strm)
    deserial = C.io.CTFDeserializer(path, streams)
    mb_src = C.io.MinibatchSource(deserial, randomize=rnd_order, max_sweeps=sweeps)
    return mb_src

def main():
    print("Using CNTK version = " + str(C.__version__) + "\n")
    input_dim = 4
    hidden_dim = 5
    output_dim = 3
    train_file = ".\\...\\" # provide the name of the training file (120 data items)
    test_file = ".\\...\\"  # provide the name of the test file (30 data items)

    # Create the untrained 4-5-3 network.
    X = C.ops.input_variable(input_dim, np.float32)
    Y = C.ops.input_variable(output_dim, np.float32)
    with C.layers.default_options(init=C.initializer.uniform(scale=0.01, seed=1)):
        hLayer = C.layers.Dense(hidden_dim, activation=C.ops.tanh, name='hidLayer')(X)
        oLayer = C.layers.Dense(output_dim, activation=None, name='outLayer')(hLayer)
    nnet = oLayer
    model = C.ops.softmax(nnet)

    # Loss, error metric, and training parameters.
    tr_loss = C.cross_entropy_with_softmax(nnet, Y)
    tr_clas = C.classification_error(nnet, Y)
    max_iter = 2000
    batch_size = 10
    learn_rate = 0.01
    learner = C.sgd(nnet.parameters, learn_rate)
    trainer = C.Trainer(nnet, (tr_loss, tr_clas), [learner])

    # Alternative learner: fsadagrad with learning-rate and momentum
    # schedules; this replaces the SGD learner and trainer above.
    lr_schedule = C.learning_parameter_schedule_per_sample([(1000, 0.05), (1, 0.01)])
    mom_sch = C.momentum_schedule([(100, 0.99), (0, 0.95)], batch_size)
    learner = C.fsadagrad(nnet.parameters, lr=lr_schedule, momentum=mom_sch)
    trainer = C.Trainer(nnet, (tr_loss, tr_clas), [learner])

    # Train the network.
    rdr = create_reader(train_file, input_dim, output_dim, rnd_order=True, sweeps=C.io.INFINITELY_REPEAT)
    iris_input_map = { X : rdr.streams.x_src, Y : rdr.streams.y_src }
    for i in range(0, max_iter):
        curr_batch = rdr.next_minibatch(batch_size, input_map=iris_input_map)
        trainer.train_minibatch(curr_batch)
        if i % 500 == 0:
            mcee = trainer.previous_minibatch_loss_average
            macc = (1.0 - trainer.previous_minibatch_evaluation_average) * 100
            print("batch %4d: mean loss = %0.4f, mean accuracy = %0.2f%%" % (i, mcee, macc))

    # Evaluate the model on the 30 held-out test items.
    print("\nEvaluating test data \n")
    rdr = create_reader(test_file, input_dim, output_dim, rnd_order=False, sweeps=1)
    iris_input_map = { X : rdr.streams.x_src, Y : rdr.streams.y_src }
    num_test = 30
    all_test = rdr.next_minibatch(num_test, input_map=iris_input_map)
    acc = (1.0 - trainer.test_minibatch(all_test)) * 100
    print("Classification accuracy = %0.2f%%" % acc)

    # Make a prediction for an unseen flower.
    np.set_printoptions(precision = 1, suppress=True)
    unknown = np.array([[7.0, 3.2, 4.7, 1.4]], dtype=np.float32)
    print("\nPredicting species for input features: ")
    print(unknown[0])
    pred_prob = model.eval(unknown)
    np.set_printoptions(precision = 4, suppress=True)
    print("Prediction probabilities: ")
    print(pred_prob[0])

if __name__ == "__main__":
    main()
Output
Using CNTK version = 2.7
batch 0: mean loss = 1.0986, mean accuracy = 40.00%
batch 500: mean loss = 0.6677, mean accuracy = 80.00%
batch 1000: mean loss = 0.5332, mean accuracy = 70.00%
batch 1500: mean loss = 0.2408, mean accuracy = 100.00%
Evaluating test data
Classification accuracy = 94.58%
Predicting species for input features:
[7.0 3.2 4.7 1.4]
Prediction probabilities:
[0.0847 0.736 0.113]
Saving the trained model
This iris dataset has only 150 data items, hence training the NN classifier model takes only a few seconds, but training on a large dataset with thousands or millions of data items can take hours or even days.
We can save our model so that we won't have to retrain it from scratch. With the help of the following Python code, we can save our trained NN:
nn_classifier = ".\\neuralclassifier.model" # provide the name of the file
model.save(nn_classifier, format=C.ModelFormat.CNTKv2)
Following are the arguments of the save() function used above:
- The file name is the first argument of save(). It can also be written together with the path of the file.
- Another parameter is the format parameter, which has the default value C.ModelFormat.CNTKv2.
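For example, the call shown above is equivalent to passing the format explicitly. CNTK can also export models to ONNX via C.ModelFormat.ONNX, though ONNX export covers only a subset of CNTK operations, so treat the second line below as a sketch to verify against your own model and CNTK version:

# Equivalent to the default CNTKv2 format:
model.save(nn_classifier, format=C.ModelFormat.CNTKv2)
# ONNX export (supported for many, but not all, CNTK operations):
model.save(".\\neuralclassifier.onnx", format=C.ModelFormat.ONNX)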
Loading the trained model
Once you have saved the trained model, it is very easy to load it. We only need to use the load() function. Let's check this in the following example:
import numpy as np
import cntk as C
model = C.ops.functions.Function.load(".\\neuralclassifier.model")
np.set_printoptions(precision = 1, suppress=True)
unknown = np.array([[7.0, 3.2, 4.7, 1.4]], dtype=np.float32)
print("\nPredicting species for input features: ")
print(unknown[0])
pred_prob = model.eval(unknown)
np.set_printoptions(precision = 4, suppress=True)
print("Prediction probabilities: ")
print(pred_prob[0])
The benefit of a saved model is that, once you load it, it can be used exactly as if the model had just been trained.