Big Data Analytics - Online Learning
Online learning is a subfield of machine learning that makes it possible to scale supervised learning models to massive datasets. The basic idea is that we don't need to read all the data into memory to fit a model; we only need to read one instance at a time.
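As an illustration, the following R sketch (not part of the tutorial's code; stream.csv is a hypothetical file with one labeled instance per row) shows the one-instance-at-a-time reading pattern that online learning relies on −

con = file('stream.csv', open = 'r')
while (length(line <- readLines(con, n = 1)) > 0) {
   # Parse a single comma-separated instance
   instance = as.numeric(strsplit(line, ',')[[1]])
   # An online learner would update its weights with this single
   # instance here, instead of loading the full dataset into memory
}
close(con)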
Here, we will show how to implement an online learning algorithm using logistic regression. As in most supervised learning algorithms, there is a cost function to be minimized. In logistic regression, the cost function is defined as −
J(\theta) \: = \: \frac{-1}{m} \left[ \sum_{i = 1}^{m} y^{(i)} \log(h_{\theta}(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_{\theta}(x^{(i)})) \right]
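To make the formula concrete, here is a small R example (using made-up data, not the titanic dataset) that evaluates J(θ) for a fixed θ −

sigmoid = function(x) 1 / (1 + exp(-x))

# Tiny made-up design matrix: the first column is the intercept term
X = matrix(c(1,  0.5,
             1, -1.2,
             1,  2.0), ncol = 2, byrow = TRUE)
y = c(1, 0, 1)
theta = c(0.1, 0.3)

h = sigmoid(X %*% theta)                      # the hypothesis h_theta(x)
J = -mean(y * log(h) + (1 - y) * log(1 - h))  # the cost function above
J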
where J(θ) represents the cost function and hθ(x) represents the hypothesis. In the case of logistic regression, the hypothesis is defined by the following formula −
h_\theta(x) \: = \: \frac{1}{1 + e^{-\theta^{T} x}}
Now that we have defined the cost function, we need to find an algorithm to minimize it. The simplest algorithm for achieving this is called stochastic gradient descent. The update rule of the algorithm for the weights of the logistic regression model is defined as −
\theta_j := \theta_j - \alpha(h_\theta(x) - y)x_j
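The following R function is a minimal sketch of this update rule, for illustration only; vowpal wabbit's implementation, discussed next, is far more refined −

sigmoid = function(x) 1 / (1 + exp(-x))

# X is a numeric feature matrix, y a vector of 0/1 labels
sgd_logistic = function(X, y, alpha = 0.5, passes = 20) {
   theta = rep(0, ncol(X))
   for (pass in 1:passes) {
      for (i in 1:nrow(X)) {
         # One instance at a time: theta_j := theta_j - alpha * (h - y) * x_j
         h = sigmoid(sum(theta * X[i, ]))
         theta = theta - alpha * (h - y[i]) * X[i, ]
      }
   }
   theta
}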
There are several implementations of this algorithm, but the one in the vowpal wabbit library is by far the most mature. The library allows training of large-scale regression models using small amounts of RAM. In the creators' own words, it is described as: "The Vowpal Wabbit (VW) project is a fast out-of-core learning system sponsored by Microsoft Research and (previously) Yahoo! Research".
We will be working with the titanic dataset from a kaggle competition. The original data can be found in the bda/part3/vw folder. Here, we have two files −
- the training data (train_titanic.csv), and
- the unlabeled data used to make new predictions (test_titanic.csv).
To convert the csv format to the vowpal wabbit input format, use the csv_to_vowpal_wabbit.py python script. You will obviously need to have python installed for this. Navigate to the bda/part3/vw folder, open a terminal, and execute the following command −
python csv_to_vowpal_wabbit.py
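The script writes the data in the vowpal wabbit input format: each line starts with the label (-1 or 1 for binary classification), followed by a namespace marker such as |f and a list of feature:value pairs. The exact features depend on the script, but the resulting lines look roughly like this −

1 |f Pclass:3 Age:22 SibSp:1 Fare:7.25
-1 |f Pclass:1 Age:38 SibSp:1 Fare:71.28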
Note that for this section, if you are using Windows, you will need to install a Unix command line; you can get one from the cygwin website.
Then, still in the bda/part3/vw folder, execute the following command in the terminal −
vw train_titanic.vw -f model.vw --binary --passes 20 -c -q ff --sgd --l1 0.00000001 --l2 0.0000001 --learning_rate 0.5 --loss_function logistic
Let us break down what each argument of the vw call means.
- -f model.vw − saves the model in the model.vw file for making predictions later.
- --binary − reports the loss as binary classification with -1,1 labels.
- --passes 20 − the data is used 20 times to learn the weights.
- -c − creates a cache file.
- -q ff − uses quadratic features in the f namespace.
- --sgd − uses the regular/classic/simple stochastic gradient descent update, i.e., non-adaptive, non-normalized, and non-invariant.
- --l1 --l2 − L1 and L2 norm regularization.
- --learning_rate 0.5 − the learning rate α as defined in the update rule formula.
The following output shows the results of running the regression model from the command line. The results include the average log-loss and a small report on the algorithm's performance.
creating quadratic features for pairs: ff
using l1 regularization = 1e-08
using l2 regularization = 1e-07
final_regressor = model.vw
Num weight bits = 18
learning rate = 0.5
initial_t = 1
power_t = 0.5
decay_learning_rate = 1
using cache_file = train_titanic.vw.cache
ignoring text input in favor of cache input
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.000000 0.000000 1 1.0 -1.0000 -1.0000 57
0.500000 1.000000 2 2.0 1.0000 -1.0000 57
0.250000 0.000000 4 4.0 1.0000 1.0000 57
0.375000 0.500000 8 8.0 -1.0000 -1.0000 73
0.625000 0.875000 16 16.0 -1.0000 1.0000 73
0.468750 0.312500 32 32.0 -1.0000 -1.0000 57
0.468750 0.468750 64 64.0 -1.0000 1.0000 43
0.375000 0.281250 128 128.0 1.0000 -1.0000 43
0.351562 0.328125 256 256.0 1.0000 -1.0000 43
0.359375 0.367188 512 512.0 -1.0000 1.0000 57
0.274336 0.274336 1024 1024.0 -1.0000 -1.0000 57 h
0.281938 0.289474 2048 2048.0 -1.0000 -1.0000 43 h
0.246696 0.211454 4096 4096.0 -1.0000 -1.0000 43 h
0.218922 0.191209 8192 8192.0 1.0000 1.0000 43 h
finished run
number of examples per pass = 802
passes used = 11
weighted example sum = 8822
weighted label sum = -2288
average loss = 0.179775 h
best constant = -0.530826
best constant’s loss = 0.659128
total feature number = 427878
Now we can use the model.vw that we trained to generate predictions on new data.
vw -d test_titanic.vw -t -i model.vw -p predictions.txt
Here, -t tells vw to run in test-only mode, -i loads the trained model, and -p writes the raw predictions to predictions.txt. The predictions generated by the previous command are not normalized to the [0, 1] range. To obtain probabilities, we apply a sigmoid transformation.
# Load the data.table library, which provides the fread function
library(data.table)

# Read the raw vowpal wabbit predictions
preds = fread('vw/predictions.txt')

# Define the sigmoid function to map raw scores to probabilities
sigmoid = function(x) {
   1 / (1 + exp(-x))
}
probs = sigmoid(preds[[1]])

# Generate class labels with a 0.5 probability threshold
preds = ifelse(probs > 0.5, 1, 0)
head(preds)
# [1] 0 1 0 0 1 0
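From here, a kaggle submission file could be assembled. The following snippet continues the R session above and is a sketch only, assuming the original test_titanic.csv keeps the competition's PassengerId column; this step is not part of the tutorial's code −

# Hypothetical submission step: pair each prediction with its PassengerId
test = fread('vw/test_titanic.csv')
submission = data.table(PassengerId = test$PassengerId, Survived = preds)
write.csv(submission, 'vw/submission.csv', row.names = FALSE)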