H2O Concise Tutorial
H2O - Running Sample Application
Click the Airlines Delay Flow link in the list of samples, as shown in the screenshot below:
After you confirm, the new notebook is loaded.
Clearing All Outputs
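Before we explain the code statements in the notebook, let us clear all the outputs and then run the notebook step by step. To clear all outputs, select the following menu option: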
Flow / Clear All Cell Contents
This is shown in the following screenshot:
Once all outputs are cleared, we will run each cell in the notebook individually and examine its output.
Running the First Cell
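Click the first cell. A red flag appears on the left, indicating that the cell is selected, as shown in the screenshot below:
The content of this cell is just a program comment written in Markdown (MD). It describes what the loaded application does. To run the cell, click the Run icon, as shown in the screenshot below:
You will not see any output below the cell, because the current cell contains no executable code. The cursor now moves automatically to the next cell, which is ready to execute.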
Importing Data
The next cell contains the following Flow statement (Flow cells use H2O's own routine syntax, not Python):
importFiles ["https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"]
This statement imports the allyears2k.csv file from Amazon S3 into the system. When you run the cell, it imports the file and gives you the following output.
Setting Up Data Parser
Now we need to parse the data to make it suitable for our ML algorithm. This is done with the following command:
setupParse paths: [ "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv" ]
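After you execute the above statement, a parse setup dialog appears. It lets you configure several settings for parsing the file, as shown in the screenshot below:
In this dialog, you can select the desired parser from the drop-down list and set other parameters such as the field separator.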
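To give a feel for what a parse setup step has to work out, here is a minimal Python sketch that guesses the field separator of a small CSV sample with the standard library's csv.Sniffer. This only illustrates the idea; it is not how H2O detects parse settings, and the sample rows are made up for the example.

```python
import csv

# Hypothetical four-line sample in the style of the airlines file;
# the values are made up for illustration.
sample = (
    "Year,Month,DayofMonth,UniqueCarrier\n"
    "1987,10,14,PS\n"
    "1987,10,15,PS\n"
    "1987,10,17,PS\n"
)

# Ask the sniffer to choose among a few common candidate separators.
dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")

detected_separator = dialect.delimiter
# Character code of the detected separator, as an integer.
separator_code = ord(detected_separator)

print(repr(detected_separator), separator_code)
```

The character code printed here is the same kind of value that the separator parameter carries in the parseFiles statement of the next section.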
Parsing Data
The next statement actually parses the data file using the above configuration. It is a long statement, shown here:
parseFiles
  paths: ["https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"]
  destination_frame: "allyears2k.hex"
  parse_type: "CSV"
  separator: 44
  number_columns: 31
  single_quotes: false
  column_names: ["Year","Month","DayofMonth","DayOfWeek","DepTime","CRSDepTime",
    "ArrTime","CRSArrTime","UniqueCarrier","FlightNum","TailNum",
    "ActualElapsedTime","CRSElapsedTime","AirTime","ArrDelay","DepDelay",
    "Origin","Dest","Distance","TaxiIn","TaxiOut","Cancelled","CancellationCode",
    "Diverted","CarrierDelay","WeatherDelay","NASDelay","SecurityDelay",
    "LateAircraftDelay","IsArrDelayed","IsDepDelayed"]
  column_types: ["Enum","Enum","Enum","Enum","Numeric","Numeric","Numeric",
    "Numeric","Enum","Enum","Enum","Numeric","Numeric","Numeric","Numeric",
    "Numeric","Enum","Enum","Numeric","Numeric","Numeric","Enum","Enum",
    "Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Enum","Enum"]
  delete_on_done: true
  check_header: 1
  chunk_size: 4194304
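Note that the parameters you set in the configuration dialog are listed in the above code. Now run this cell. After a while, parsing completes and you will see the following output: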
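A few of the numeric settings in the parseFiles statement are easier to read once decoded. The following Python sketch copies the values from that statement and checks them with plain standard-library code; it does not call H2O, and the variable names are chosen only for this illustration.

```python
# Values copied from the parseFiles statement above.
separator = 44
number_columns = 31
chunk_size = 4194304

column_names = [
    "Year", "Month", "DayofMonth", "DayOfWeek", "DepTime", "CRSDepTime",
    "ArrTime", "CRSArrTime", "UniqueCarrier", "FlightNum", "TailNum",
    "ActualElapsedTime", "CRSElapsedTime", "AirTime", "ArrDelay", "DepDelay",
    "Origin", "Dest", "Distance", "TaxiIn", "TaxiOut", "Cancelled",
    "CancellationCode", "Diverted", "CarrierDelay", "WeatherDelay", "NASDelay",
    "SecurityDelay", "LateAircraftDelay", "IsArrDelayed", "IsDepDelayed",
]
column_types = [
    "Enum", "Enum", "Enum", "Enum", "Numeric", "Numeric", "Numeric",
    "Numeric", "Enum", "Enum", "Enum", "Numeric", "Numeric", "Numeric",
    "Numeric", "Numeric", "Enum", "Enum", "Numeric", "Numeric", "Numeric",
    "Enum", "Enum", "Numeric", "Numeric", "Numeric", "Numeric", "Numeric",
    "Numeric", "Enum", "Enum",
]

# The separator is given as a character code; 44 is the comma.
separator_char = chr(separator)

# Pair every column name with its declared type.
schema = dict(zip(column_names, column_types))
enum_columns = [name for name, kind in schema.items() if kind == "Enum"]
numeric_columns = [name for name, kind in schema.items() if kind == "Numeric"]

# The chunk size is given in bytes.
chunk_size_mib = chunk_size / (1024 * 1024)

print(repr(separator_char), len(schema), len(enum_columns),
      len(numeric_columns), chunk_size_mib)
```

Both lists contain one entry per column, which matches number_columns, and the response column used later, IsDepDelayed, is declared as Enum (categorical).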
Examining Dataframe
Parsing produces a dataframe, which you can examine with the following statement:
getFrameSummary "allyears2k.hex"
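After executing the above statement, you will see the following output:
Your data is now ready to be fed into a machine learning algorithm.
The next statement is a program comment saying that we will use a regression model with preset regularization and lambda values.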
Building the Model
Next comes the most important statement, the one that builds the model itself:
buildModel 'glm', {
"model_id":"glm_model","training_frame":"allyears2k.hex",
"ignored_columns":[
"DayofMonth","DepTime","CRSDepTime","ArrTime","CRSArrTime","TailNum",
"ActualElapsedTime","CRSElapsedTime","AirTime","ArrDelay","DepDelay",
"TaxiIn","TaxiOut","Cancelled","CancellationCode","Diverted","CarrierDelay",
"WeatherDelay","NASDelay","SecurityDelay","LateAircraftDelay","IsArrDelayed"],
"ignore_const_cols":true,"response_column":"IsDepDelayed","family":"binomial",
"solver":"IRLSM","alpha":[0.5],"lambda":[0.00001],"lambda_search":false,
"standardize":true,"non_negative":false,"score_each_iteration":false,
"max_iterations":-1,"link":"family_default","intercept":true,
"objective_epsilon":0.00001,"beta_epsilon":0.0001,"gradient_epsilon":0.0001,
"prior":-1,"max_active_predictors":-1
}
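We use glm, the Generalized Linear Model suite, with the family type set to binomial; both settings appear in the statement above. In our example the expected output is binary, which is why we use the binomial family. You can examine the other parameters yourself; for example, look at the alpha and lambda values mentioned earlier. Refer to the GLM model documentation for an explanation of all the parameters.
Now run this statement. On execution, the following output is generated:
The execution time will of course differ on your machine. Now comes the most interesting part of this sample code.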
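To make the roles of alpha, lambda and the binomial family a little more concrete, here is a minimal Python sketch. It assumes the usual elastic net form of the penalty, lambda * (alpha * L1 + (1 - alpha) / 2 * L2 squared), and the logistic function that maps a linear score to a probability under a binomial model with a logit link. The coefficient values are hypothetical and the code does not use H2O; it only evaluates the formulas.

```python
import math


def elastic_net_penalty(beta, alpha, lam):
    """Return lam * (alpha * ||beta||_1 + (1 - alpha) / 2 * ||beta||_2^2)."""
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)


def logistic(eta):
    """Map a linear score eta to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical coefficients, chosen only for illustration.
beta = [0.8, -1.2, 0.0]

# alpha and lambda as given in the buildModel statement above.
penalty = elastic_net_penalty(beta, alpha=0.5, lam=0.00001)

# A linear score of zero corresponds to even odds.
p_even = logistic(0.0)

print(penalty, p_even)
```

With alpha at 0.5 the penalty is an even mix of the L1 and L2 terms, and with lambda this small it contributes very little, so the fit stays close to an unregularized logistic regression.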
Examining Output
We simply output the model we have just built, using the following statement:
getModel "glm_model"
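Note that glm_model is the model ID we specified as the model_id parameter when building the model in the previous statement. This gives us a detailed result report containing many different parameters. Part of the report is shown in the screenshot below:
As the output indicates, this is the result of running the Generalized Linear Modeling algorithm on your dataset.
Just above SCORING HISTORY you will see the MODEL PARAMETERS tag. Expand it to see the list of all parameters used while building the model, as shown in the screenshot below.
Likewise, each tag provides a detailed output of a specific type. Expand the various tags to study the different kinds of output.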
Building Another Model
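Next, we will build a Deep Learning model on our dataframe. The next statement in the sample code is just a program comment. The statement after it is the actual model-building command, shown here: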
buildModel 'deeplearning', {
"model_id":"deeplearning_model","training_frame":"allyears2k.hex",
"ignored_columns":[
"DepTime","CRSDepTime","ArrTime","CRSArrTime","FlightNum","TailNum",
"ActualElapsedTime","CRSElapsedTime","AirTime","ArrDelay","DepDelay",
"TaxiIn","TaxiOut","Cancelled","CancellationCode","Diverted",
"CarrierDelay","WeatherDelay","NASDelay","SecurityDelay",
"LateAircraftDelay","IsArrDelayed"],
"ignore_const_cols":true,"response_column":"IsDepDelayed",
"activation":"Rectifier","hidden":[200,200],"epochs":"100",
"variable_importances":false,"balance_classes":false,
"checkpoint":"","use_all_factor_levels":true,
"train_samples_per_iteration":-2,"adaptive_rate":true,
"input_dropout_ratio":0,"l1":0,"l2":0,"loss":"Automatic","score_interval":5,
"score_training_samples":10000,"score_duty_cycle":0.1,"autoencoder":false,
"overwrite_with_best_model":true,"target_ratio_comm_to_comp":0.02,
"seed":6765686131094811000,"rho":0.99,"epsilon":1e-8,"max_w2":"Infinity",
"initial_weight_distribution":"UniformAdaptive","classification_stop":0,
"diagnostics":true,"fast_mode":true,"force_load_balance":true,
"single_node_mode":false,"shuffle_training_data":false,"missing_values_handling":
"MeanImputation","quiet_mode":false,"sparse":false,"col_major":false,
"average_activation":0,"sparsity_beta":0,"max_categorical_features":2147483647,
"reproducible":false,"export_weights_and_biases":false
}
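As you can see in the above code, building the model with deeplearning involves setting a number of parameters to appropriate values, as specified in the deeplearning model documentation. When you run this statement, it will take longer than building the GLM model. Once the model is built you will see the following output, though with different timings.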
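The two buildModel statements do not ignore exactly the same columns, which is easy to miss in the long parameter lists. The following Python sketch works out which columns are left as predictors for each model, using only the column names and ignored_columns values listed in this tutorial. It is plain list handling, not an H2O call, and it assumes that no further columns are dropped by ignore_const_cols.

```python
all_columns = [
    "Year", "Month", "DayofMonth", "DayOfWeek", "DepTime", "CRSDepTime",
    "ArrTime", "CRSArrTime", "UniqueCarrier", "FlightNum", "TailNum",
    "ActualElapsedTime", "CRSElapsedTime", "AirTime", "ArrDelay", "DepDelay",
    "Origin", "Dest", "Distance", "TaxiIn", "TaxiOut", "Cancelled",
    "CancellationCode", "Diverted", "CarrierDelay", "WeatherDelay", "NASDelay",
    "SecurityDelay", "LateAircraftDelay", "IsArrDelayed", "IsDepDelayed",
]
response = "IsDepDelayed"

# Columns ignored by both models.
shared_ignored = {
    "DepTime", "CRSDepTime", "ArrTime", "CRSArrTime", "TailNum",
    "ActualElapsedTime", "CRSElapsedTime", "AirTime", "ArrDelay", "DepDelay",
    "TaxiIn", "TaxiOut", "Cancelled", "CancellationCode", "Diverted",
    "CarrierDelay", "WeatherDelay", "NASDelay", "SecurityDelay",
    "LateAircraftDelay", "IsArrDelayed",
}
glm_ignored = shared_ignored | {"DayofMonth"}
deeplearning_ignored = shared_ignored | {"FlightNum"}


def predictors(ignored):
    """Columns that are neither ignored nor the response, in file order."""
    return [c for c in all_columns if c not in ignored and c != response]


glm_predictors = predictors(glm_ignored)
deeplearning_predictors = predictors(deeplearning_ignored)

print("GLM:", glm_predictors)
print("Deep Learning:", deeplearning_predictors)
```

Each model is left with eight predictors. The GLM keeps FlightNum and drops DayofMonth, while the Deep Learning model does the opposite.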