H2O Tutorial

H2O - Running Sample Application

Click on the Airlines Delay Flow link in the list of samples as shown in the screenshot below −

[Screenshot: sample application]

After you confirm, the new notebook will be loaded.

Clearing All Outputs

Before we explain the code statements in the notebook, let us clear all the outputs and then run the notebook gradually. To clear all outputs, select the following menu option −

Flow / Clear All Cell Contents

This is shown in the following screenshot −

[Screenshot: clearing outputs]

Once all outputs are cleared, we will run each cell in the notebook individually and examine its output.

Running the First Cell

Click the first cell. A red flag appears on the left, indicating that the cell is selected. This is shown in the screenshot below −

[Screenshot: first cell]

The contents of this cell are just a program comment written in Markdown (MD). The comment describes what the loaded application does. To run the cell, click the Run icon as shown in the screenshot below −

[Screenshot: markdown]

You will not see any output underneath the cell as there is no executable code in the current cell. The cursor now moves automatically to the next cell, which is ready to execute.

Importing Data

The next cell contains the following Python statement −

importFiles ["https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"]

The statement imports the allyears2k.csv file from Amazon AWS into the system. When you run the cell, it imports the file and gives you the following output.

[Screenshot: statement imports]

Setting Up Data Parser

Now, we need to parse the data and make it suitable for our ML algorithm. This is done using the following command −

setupParse paths: [ "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv" ]

Upon execution of the above statement, a setup configuration dialog appears. The dialog allows you to adjust several settings for parsing the file. This is shown in the screenshot below −

[Screenshot: configuration dialog]

In this dialog, you can select the desired parser from the given drop-down list and set other parameters such as the field separator, etc.

Parsing Data

The next statement, which actually parses the datafile using the above configuration, is a long one and is as shown here −

parseFiles
paths: ["https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"]
destination_frame: "allyears2k.hex"
parse_type: "CSV"
separator: 44
number_columns: 31
single_quotes: false
column_names: ["Year","Month","DayofMonth","DayOfWeek","DepTime","CRSDepTime",
   "ArrTime","CRSArrTime","UniqueCarrier","FlightNum","TailNum",
   "ActualElapsedTime","CRSElapsedTime","AirTime","ArrDelay","DepDelay",
   "Origin","Dest","Distance","TaxiIn","TaxiOut","Cancelled","CancellationCode",
   "Diverted","CarrierDelay","WeatherDelay","NASDelay","SecurityDelay",
   "LateAircraftDelay","IsArrDelayed","IsDepDelayed"]
column_types: ["Enum","Enum","Enum","Enum","Numeric","Numeric","Numeric"
   ,"Numeric","Enum","Enum","Enum","Numeric","Numeric","Numeric","Numeric",
   "Numeric","Enum","Enum","Numeric","Numeric","Numeric","Enum","Enum",
   "Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Enum","Enum"]
delete_on_done: true
check_header: 1
chunk_size: 4194304

Observe that the parameters you have set up in the configuration box are listed in the above code. Now, run this cell. After a while, the parsing completes and you will see the following output −

[Screenshot: configuration box]
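For quick reference, two of the numeric parameters in the parse statement can be decoded directly: separator: 44 is the ASCII code for the comma character (hence parse_type: "CSV"), and chunk_size: 4194304 bytes is 4 MiB. A small Python check (illustrative only; this is not part of the Flow notebook):

```python
# separator 44 in the parse setup is the ASCII code point for a comma
assert chr(44) == ","

# chunk_size 4194304 is 4 MiB (4 * 1024 * 1024 bytes)
assert 4194304 == 4 * 1024 * 1024

print("separator:", repr(chr(44)))           # separator: ','
print("chunk size (MiB):", 4194304 // (1024 * 1024))   # chunk size (MiB): 4
```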

Examining Dataframe

After the processing, it generates a dataframe, which can be examined using the following statement −

getFrameSummary "allyears2k.hex"

Upon execution of the above statement, you will see the following output −

[Screenshot: examining dataframe]

Now, your data is ready to be fed into a Machine Learning algorithm.

The next statement is a program comment that says we will be using the regression model and specifies the preset regularization and the lambda values.

Building the Model

Next comes the most important statement: building the model itself. This is specified in the following statement −

buildModel 'glm', {
   "model_id":"glm_model","training_frame":"allyears2k.hex",
   "ignored_columns":[
      "DayofMonth","DepTime","CRSDepTime","ArrTime","CRSArrTime","TailNum",
      "ActualElapsedTime","CRSElapsedTime","AirTime","ArrDelay","DepDelay",
      "TaxiIn","TaxiOut","Cancelled","CancellationCode","Diverted","CarrierDelay",
      "WeatherDelay","NASDelay","SecurityDelay","LateAircraftDelay","IsArrDelayed"],
   "ignore_const_cols":true,"response_column":"IsDepDelayed","family":"binomial",
   "solver":"IRLSM","alpha":[0.5],"lambda":[0.00001],"lambda_search":false,
   "standardize":true,"non_negative":false,"score_each_iteration":false,
   "max_iterations":-1,"link":"family_default","intercept":true,
   "objective_epsilon":0.00001,"beta_epsilon":0.0001,"gradient_epsilon":0.0001,
   "prior":-1,"max_active_predictors":-1
}

We use glm, which is a Generalized Linear Model suite, with the family type set to binomial. You can see these highlighted in the above statement. In our case, the expected output is binary, which is why we use the binomial type. You may examine the other parameters yourself; for example, look at the alpha and lambda values that we specified earlier. Refer to the GLM model documentation for an explanation of all the parameters.
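To make the key parameters concrete, here is a minimal pure-Python sketch (a conceptual illustration, not H2O's implementation) of the two pieces the statement configures: the binomial family uses the logistic (sigmoid) link to turn a linear score into a probability, and alpha and lambda define an elastic-net penalty that mixes L1 and L2 regularization:

```python
import math

def sigmoid(z):
    """Logistic link used by the binomial family: maps a score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def elastic_net_penalty(coeffs, lam, alpha):
    """Elastic-net regularization: alpha mixes the L1 and L2 terms,
    and lam scales the overall penalty (the model above uses
    alpha = 0.5 and lambda = 0.00001)."""
    l1 = sum(abs(c) for c in coeffs)
    l2 = sum(c * c for c in coeffs) / 2.0
    return lam * (alpha * l1 + (1.0 - alpha) * l2)

# A score of 0 maps to probability 0.5, i.e. the decision boundary
print(sigmoid(0.0))  # 0.5
print(elastic_net_penalty([1.0, -2.0], lam=0.00001, alpha=0.5))
```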

Now, run this statement. Upon execution, the following output will be generated −

[Screenshot: generated frame]

Certainly, the execution time will differ on your machine. Now comes the most interesting part of this sample code.

Examining Output

We simply output the model that we have built using the following statement −

getModel "glm_model"

Note that glm_model is the model ID that we specified as the model_id parameter while building the model in the previous statement. This gives us a huge output detailing the results with several varying parameters. A partial output of the report is shown in the screenshot below −

[Screenshot: examining output]

As you can see in the output, it says that this is the result of running the Generalized Linear Modeling algorithm on your dataset.

Right above the SCORING HISTORY, you will see the MODEL PARAMETERS tag; expand it, and you will see the list of all the parameters used while building the model. This is shown in the screenshot below.

[Screenshot: scoring history]

Likewise, each tag provides a detailed output of a specific type. Expand the various tags yourself to study the outputs of different kinds.

Building Another Model

Next, we will build a Deep Learning model on our dataframe. The next statement in the sample code is just a program comment. The following statement is actually a model building command. It is as shown here −

buildModel 'deeplearning', {
   "model_id":"deeplearning_model","training_frame":"allyears2k.hex",
   "ignored_columns":[
      "DepTime","CRSDepTime","ArrTime","CRSArrTime","FlightNum","TailNum",
      "ActualElapsedTime","CRSElapsedTime","AirTime","ArrDelay","DepDelay",
      "TaxiIn","TaxiOut","Cancelled","CancellationCode","Diverted",
      "CarrierDelay","WeatherDelay","NASDelay","SecurityDelay",
      "LateAircraftDelay","IsArrDelayed"],
   "ignore_const_cols":true,"response_column":"IsDepDelayed",
   "activation":"Rectifier","hidden":[200,200],"epochs":"100",
   "variable_importances":false,"balance_classes":false,
   "checkpoint":"","use_all_factor_levels":true,
   "train_samples_per_iteration":-2,"adaptive_rate":true,
   "input_dropout_ratio":0,"l1":0,"l2":0,"loss":"Automatic","score_interval":5,
   "score_training_samples":10000,"score_duty_cycle":0.1,"autoencoder":false,
   "overwrite_with_best_model":true,"target_ratio_comm_to_comp":0.02,
   "seed":6765686131094811000,"rho":0.99,"epsilon":1e-8,"max_w2":"Infinity",
   "initial_weight_distribution":"UniformAdaptive","classification_stop":0,
   "diagnostics":true,"fast_mode":true,"force_load_balance":true,
   "single_node_mode":false,"shuffle_training_data":false,"missing_values_handling":
   "MeanImputation","quiet_mode":false,"sparse":false,"col_major":false,
   "average_activation":0,"sparsity_beta":0,"max_categorical_features":2147483647,
   "reproducible":false,"export_weights_and_biases":false
}

As you can see in the above code, we specify deeplearning for building the model, with several parameters set to appropriate values as specified in the documentation of the deeplearning model. When you run this statement, it will take longer than the GLM model building. You will see the following output when the model building completes, though the timings on your machine will differ.

[Screenshot: building another model]
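As a rough illustration of what activation: "Rectifier" and hidden: [200,200] mean, here is a toy forward pass through two fully connected layers with ReLU activation. The layer sizes and hand-picked weights below are invented for demonstration; this is a sketch of the concept, not H2O's implementation:

```python
def relu(v):
    """Rectifier activation: max(0, x) applied element-wise."""
    return [max(0.0, x) for x in v]

def dense(inputs, weights, biases):
    """One fully connected layer: each row of weights belongs to one neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

# Two hidden layers (2 toy neurons each, instead of the model's 200 and 200)
x = [1.0, -1.0]
h1 = relu(dense(x, [[0.5, 0.5], [1.0, -1.0]], [0.0, 0.0]))
h2 = relu(dense(h1, [[1.0, 0.5], [-1.0, 1.0]], [0.0, 0.0]))
print(h1, h2)  # [0.0, 2.0] [1.0, 2.0]
```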

Examining Deep Learning Model Output

This generates output that can be examined using the following statement, as in the earlier case.

getModel "deeplearning_model"

We will consider the ROC curve output as shown below for quick reference.

[Screenshot: deep learning]
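A ROC curve plots the true-positive rate against the false-positive rate as the classification threshold varies. A generic sketch of how a single point on the curve is computed (illustrative Python with made-up labels and scores, unrelated to H2O's internals):

```python
def roc_point(labels, scores, threshold):
    """Return (false_positive_rate, true_positive_rate) at a given threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    return fp / (fp + tn), tp / (tp + fn)

labels = [1, 1, 0, 0, 1, 0]          # true classes
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]  # predicted probabilities
print(roc_point(labels, scores, 0.5))    # one (FPR, TPR) point on the curve
```

Sweeping the threshold from 0 to 1 and plotting all the resulting points traces out the full curve shown in the screenshot.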

Like in the earlier case, expand the various tabs and study the different outputs.

Saving the Model

After you have studied the output of the different models, you may decide to use one of them in your production environment. H2O allows you to save a model as a POJO (Plain Old Java Object).

Expand the last tag, PREVIEW POJO, in the output, and you will see the Java code for your fine-tuned model. Use this in your production environment.

[Screenshot: saving model]

Next, we will learn about a very exciting feature of H2O. We will learn how to use AutoML to test and rank various algorithms based on their performance.