H2O Tutorial
H2O - Introduction
Have you ever been asked to develop a Machine Learning model on a huge database? Typically, the customer provides the database and asks you to make certain predictions, such as who the potential buyers will be, or whether fraudulent cases can be detected early. To answer these questions, your task is to develop a Machine Learning algorithm that answers the customer's query. Developing a Machine Learning algorithm from scratch is not an easy task, and why should you do so when several ready-to-use Machine Learning libraries are available in the market?
These days, you would rather use these libraries, apply a well-tested algorithm from them, and look at its performance. If the performance is not within acceptable limits, you would try to either fine-tune the current algorithm or try an altogether different one.
Likewise, you may try multiple algorithms on the same dataset and then pick the one that satisfactorily meets the customer's requirements. This is where H2O comes to your rescue. It is an open-source Machine Learning framework with fully tested implementations of several widely accepted ML algorithms. You just have to pick an algorithm from its huge repository and apply it to your dataset. It contains the most widely used statistical and ML algorithms.
To mention a few, it includes gradient boosted machines (GBM), the generalized linear model (GLM), deep learning, and many more. Not only that, it also supports AutoML functionality, which ranks the performance of different algorithms on your dataset, reducing the effort of finding the best-performing model. H2O is used worldwide by more than 18,000 organizations and interfaces well with R and Python for ease of development. It is an in-memory platform that provides superb performance.
In this tutorial, you will first learn to install H2O on your machine, with both the Python and R options. We will see how to use it from the command line so that you understand its working line by line. If you are a Python lover, you may use Jupyter or any other IDE of your choice for developing H2O applications. If you prefer R, you may use RStudio for development.
In this tutorial, we will work through an example to understand how to go about working with H2O. We will also learn how to change the algorithm in your program code and compare its performance with the earlier one. H2O also provides a web-based tool, called Flow, to test the different algorithms on your dataset.
The tutorial will introduce you to the use of Flow. Alongside, we will discuss the use of AutoML, which identifies the best-performing algorithm on your dataset. Are you not excited to learn H2O? Keep reading!
H2O - Installation
H2O can be configured and used with five different options as listed below −
- Install in Python
- Install in R
- Web-based Flow GUI
- Hadoop
- Anaconda Cloud
In the subsequent sections, you will see the instructions for installing H2O based on the available options. You are likely to use only one of them.
Install in Python
To run H2O with Python, the installation requires several dependencies. So let us start installing the minimum set of dependencies to run H2O.
Installing Dependencies
To install a dependency, execute the following pip command −
$ pip install requests
Open your console window and type the above command to install the requests package. The following screenshot shows the execution of the above command on our Mac machine −
After installing requests, you need to install three more packages as shown below −
$ pip install tabulate
$ pip install "colorama >= 0.3.8"
$ pip install future
The most up-to-date list of dependencies is available on the H2O GitHub page. At the time of this writing, the following dependencies are listed there.
python
pip >= 9.0.1
setuptools
colorama >= 0.3.7
future >= 0.15.2
Removing Older Versions
After installing the above dependencies, you need to remove any existing H2O installation. To do so, run the following command −
$ pip uninstall h2o
Installing the Latest Version
Now, let us install the latest version of H2O using the following command −
$ pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o
After successful installation, you should see the following message displayed on the screen −
Installing collected packages: h2o
Successfully installed h2o-3.26.0.1
Testing the Installation
To test the installation, we will run one of the sample applications provided in the H2O installation. First, start the Python prompt by typing the following command −
$ python3
Once the Python interpreter starts, type the following Python statement on the Python command prompt −
>>> import h2o
The above command imports the H2O package in your program. Next, initialize the H2O system using the following command −
>>> h2o.init()
Your screen will show the cluster information and should look like the following at this stage −
Now, you are ready to run the sample code. Type the following command on the Python prompt and execute it.
>>> h2o.demo("glm")
The demo consists of a Python notebook with a series of commands. After executing each command, its output is shown immediately on the screen, and you will be asked to hit a key to continue with the next step. A partial screenshot of executing the last statement in the notebook is shown here −
At this stage, your Python installation is complete and you are ready for your own experimentation.
Install in R
Installing H2O for R development is very similar to installing it for Python, except that you would be using the R prompt for the installation.
Starting R Console
Start the R console by clicking on the R application icon on your machine. The console screen will appear as shown in the following screenshot −
Your H2O installation would be done on the above R prompt. If you prefer using RStudio, type the commands in the R console subwindow.
Removing Older Versions
To begin with, remove older versions using the following command on the R prompt −
> if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
> if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }
Downloading Dependencies
Download the dependencies for H2O using the following code −
> pkgs <- c("RCurl","jsonlite")
for (pkg in pkgs) {
if (! (pkg %in% rownames(installed.packages()))) { install.packages(pkg) }
}
Installing H2O
Install H2O by typing the following command on the R prompt −
> install.packages("h2o", type = "source", repos = (c("http://h2o-release.s3.amazonaws.com/h2o/latest_stable_R")))
The following screenshot shows the expected output −
There is another way of installing H2O in R.
Install in R from CRAN
To install H2O from CRAN, use the following command on the R prompt −
> install.packages("h2o")
You will be asked to select a mirror −
--- Please select a CRAN mirror for use in this session ---
A dialog box displaying the list of mirror sites is shown on your screen. Select the nearest location or the mirror of your choice.
Installing Web GUI Flow
To install the GUI Flow, download the installation file from the H2O site. Unzip the downloaded file into your preferred folder. Note the presence of the h2o.jar file in the installation. Run this file in a command window using the following command −
$ java -jar h2o.jar
After a while, the following will appear in your console window.
07-24 16:06:37.304 192.168.1.18:54321 3294 main INFO: H2O started in 7725ms
07-24 16:06:37.304 192.168.1.18:54321 3294 main INFO:
07-24 16:06:37.305 192.168.1.18:54321 3294 main INFO: Open H2O Flow in your web browser: http://192.168.1.18:54321
07-24 16:06:37.305 192.168.1.18:54321 3294 main INFO:
To start Flow, open the URL printed on the console (or http://localhost:54321) in your browser. The following screen will appear −
At this stage, your Flow installation is complete.
Install on Hadoop / Anaconda Cloud
Unless you are a seasoned developer, you would not think of using H2O on Big Data. It is sufficient to say here that H2O models run efficiently on huge databases of several terabytes. If your data is on your Hadoop installation or in the Cloud, follow the steps given on H2O site to install it for your respective database.
Now that you have successfully installed and tested H2O on your machine, you are ready for real development. First, we will see the development from the command prompt. In subsequent lessons, we will learn how to do model testing in H2O Flow.
Developing in Command Prompt
Let us now consider using H2O to classify plants of the well-known iris dataset that is freely available for developing Machine Learning applications.
Start the Python interpreter by typing the following command in your shell window −
$ python3
This starts the Python interpreter. Import the h2o platform using the following command −
>>> import h2o
We will use the Random Forest algorithm for classification. This is provided in the H2ORandomForestEstimator package. We import this package using the import statement, as follows −
>>> from h2o.estimators import H2ORandomForestEstimator
We initialize the H2O environment by calling its init method.
>>> h2o.init()
On successful initialization, you should see the following message on the console along with the cluster information.
Checking whether there is an H2O instance running at http://localhost:54321 . connected.
Now, we will import the iris data using the import_file method in H2O.
>>> data = h2o.import_file('iris.csv')
The progress will display as shown in the following screenshot −
After the file is loaded into memory, you can verify this by displaying the first 10 rows of the loaded table. You use the head method to do so −
>>> data.head()
You will see the following output in tabular format.
The table also displays the column names. We will use the first four columns as the features for our ML algorithm and the last column, class, as the predicted output. We specify this in the call to our ML algorithm by first creating the following two variables.
>>> features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
>>> output = 'class'
Next, we split the data into training and testing by calling the split_frame method.
>>> train, test = data.split_frame(ratios = [0.8])
The data is split in approximately an 80:20 ratio. We use 80% of the data for training and 20% for testing.
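For intuition, H2O's split_frame assigns each row to a split probabilistically, which is why the resulting 80:20 ratio is approximate rather than exact. Here is a plain-Python sketch of that idea (a conceptual illustration, not H2O's actual implementation):

```python
import random

def approx_split(n_rows, ratio=0.8, seed=42):
    """Mimic a probabilistic row split: each row lands in the training
    set with probability `ratio`, so the split is only approximate."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for i in range(n_rows):
        (train_idx if rng.random() < ratio else test_idx).append(i)
    return train_idx, test_idx

# The iris dataset has 150 rows
train_idx, test_idx = approx_split(150, ratio=0.8, seed=42)
print(len(train_idx), len(test_idx))  # roughly 120 / 30, not exactly
```

This is also why two runs without a fixed seed can produce slightly different train/test sizes.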
Now, we load the built-in Random Forest model into the system.
>>> model = H2ORandomForestEstimator(ntrees = 50, max_depth = 20, nfolds = 10)
In the above call, we set the number of trees to 50, the maximum depth of each tree to 20, and the number of folds for cross-validation to 10. We now need to train the model. We do so by calling the train method as follows −
>>> model.train(x = features, y = output, training_frame = train)
The train method receives the features and the output that we created earlier as its first two parameters. The training dataset is set to train, which is 80% of our full dataset. During training, you will see the progress as shown here −
Now, as the model building process is over, it is time to test the model. We do this by calling the model_performance method on the trained model object.
>>> performance = model.model_performance(test_data=test)
In the above method call, we pass the test data as a parameter.
It is time now to see the output, which is the performance of our model. You do this by simply printing the performance.
>>> print (performance)
This will give you the following output −
The output shows the Mean Squared Error (MSE), Root Mean Squared Error (RMSE), LogLoss, and even the Confusion Matrix.
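If you want to see exactly how the first two of those metrics are defined, here is a small plain-Python sketch using hypothetical predictions (not values taken from the model above):

```python
import math

# Hypothetical actual labels and model scores, for illustration only
actual    = [0.0, 1.0, 1.0, 0.0, 1.0]
predicted = [0.1, 0.9, 0.4, 0.2, 0.8]

# MSE: mean of squared residuals; RMSE: its square root
mse  = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
rmse = math.sqrt(mse)
print(round(mse, 3), round(rmse, 3))  # 0.092 0.303
```

Lower values of both metrics indicate a better fit on the test data.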
Running in Jupyter
We have seen the execution from the command line and understood the purpose of each line of code. You may run the entire code in a Jupyter environment, either line by line or the whole program at a time. The complete listing is given here −
import h2o
from h2o.estimators import H2ORandomForestEstimator
h2o.init()
data = h2o.import_file('iris.csv')
features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
output = 'class'
train, test = data.split_frame(ratios=[0.8])
model = H2ORandomForestEstimator(ntrees = 50, max_depth = 20, nfolds = 10)
model.train(x = features, y = output, training_frame = train)
performance = model.model_performance(test_data=test)
print (performance)
Run the code and observe the output. You can now appreciate how easy it is to apply and test a Random Forest algorithm on your dataset. The power of H2O goes far beyond this capability. What if you want to try another model on the same dataset to see if you can get better performance? This is explained in the subsequent section.
Applying a Different Algorithm
Now, we will learn how to apply a Gradient Boosting algorithm to our earlier dataset to see how it performs. In the above full listing, you need to make only two minor changes: the import statement and the estimator used, as seen in the code below −
import h2o
from h2o.estimators import H2OGradientBoostingEstimator
h2o.init()
data = h2o.import_file('iris.csv')
features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
output = 'class'
train, test = data.split_frame(ratios = [0.8])
model = H2OGradientBoostingEstimator(ntrees = 50, max_depth = 20, nfolds = 10)
model.train(x = features, y = output, training_frame = train)
performance = model.model_performance(test_data = test)
print (performance)
Run the code and you will get the following output −
Just compare the results such as MSE, RMSE, and the Confusion Matrix with the previous output, and decide which one to use for production deployment. As a matter of fact, you can apply several different algorithms to decide on the one that best meets your purpose.
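The comparison itself can be as simple as checking which model reports the lower error. A toy sketch, with purely hypothetical RMSE values standing in for the two printed performance reports:

```python
# Hypothetical RMSE values read off the two performance reports
glm_rmse = 0.212  # e.g. from the Random Forest run (illustrative)
gbm_rmse = 0.179  # e.g. from the Gradient Boosting run (illustrative)

# Lower RMSE means a better fit on the held-out test data
winner = "GBM" if gbm_rmse < glm_rmse else "RF"
print(winner)  # GBM
```

In practice, you would weigh several metrics (RMSE, LogLoss, the confusion matrix) rather than a single number.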
H2O - Flow
In the last lesson, you learned to create H2O-based ML models using the command-line interface. H2O Flow fulfils the same purpose, but with a web-based interface.
In the following lessons, I will show you how to start H2O Flow and to run a sample application.
Starting H2O Flow
The H2O installation that you downloaded earlier contains the h2o.jar file. To start H2O Flow, first run this jar file from the command prompt −
$ java -jar h2o.jar
When the jar runs successfully, you will get the following message on the console −
Open H2O Flow in your web browser: http://192.168.1.10:54321
Now, open the browser of your choice and type the above URL. You would see the H2O web-based desktop as shown here −
This is basically a notebook similar to Colab or Jupyter. I will show you how to load and run a sample application in this notebook while explaining the various features in Flow. Click on the view example Flows link on the above screen to see the list of provided examples.
I will describe the Airlines Delay Flow example from the samples.
H2O - Running Sample Application
Click on the Airlines Delay Flow link in the list of samples as shown in the screenshot below −
After you confirm, the new notebook would be loaded.
Clearing All Outputs
Before we explain the code statements in the notebook, let us clear all the outputs and then run the notebook gradually. To clear all outputs, select the following menu option −
Flow / Clear All Cell Contents
This is shown in the following screenshot −
Once all outputs are cleared, we will run each cell in the notebook individually and examine its output.
Running the First Cell
Click the first cell. A red flag appears on the left indicating that the cell is selected. This is as shown in the screenshot below −
The contents of this cell are just a program comment written in Markdown. The content describes what the loaded application does. To run the cell, click the Run icon as shown in the screenshot below −
You will not see any output underneath the cell as there is no executable code in the current cell. The cursor now moves automatically to the next cell, which is ready to execute.
Importing Data
The next cell contains the following Flow statement −
importFiles ["https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"]
The statement imports the allyears2k.csv file from Amazon AWS into the system. When you run the cell, it imports the file and gives you the following output.
Setting Up Data Parser
Now, we need to parse the data and make it suitable for our ML algorithm. This is done using the following command −
setupParse paths: [ "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv" ]
Upon execution of the above statement, a setup configuration dialog appears. The dialog offers several settings for parsing the file. This is as shown in the screenshot below −
In this dialog, you can select the desired parser from the given drop-down list and set other parameters such as the field separator, etc.
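As an aside, the separator value 44 that appears in the generated parse configuration below is simply the ASCII code for a comma. A small plain-Python illustration of parsing one comma-separated record:

```python
import csv
import io

# separator 44 in the parse setup is the ASCII code for a comma
assert chr(44) == ","

# A hypothetical first record from the airlines file, for illustration
sample = "1987,10,14,3,741,730"
row = next(csv.reader(io.StringIO(sample)))
print(row[0], len(row))  # 1987 6
```

The parser configuration then decides, per column, whether such a string field is treated as Numeric or as a categorical Enum.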
Parsing Data
The next statement, which actually parses the data file using the above configuration, is a long one, as shown here −
parseFiles
paths: ["https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"]
destination_frame: "allyears2k.hex"
parse_type: "CSV"
separator: 44
number_columns: 31
single_quotes: false
column_names: ["Year","Month","DayofMonth","DayOfWeek","DepTime","CRSDepTime",
"ArrTime","CRSArrTime","UniqueCarrier","FlightNum","TailNum",
"ActualElapsedTime","CRSElapsedTime","AirTime","ArrDelay","DepDelay",
"Origin","Dest","Distance","TaxiIn","TaxiOut","Cancelled","CancellationCode",
"Diverted","CarrierDelay","WeatherDelay","NASDelay","SecurityDelay",
"LateAircraftDelay","IsArrDelayed","IsDepDelayed"]
column_types: ["Enum","Enum","Enum","Enum","Numeric","Numeric","Numeric"
,"Numeric","Enum","Enum","Enum","Numeric","Numeric","Numeric","Numeric",
"Numeric","Enum","Enum","Numeric","Numeric","Numeric","Enum","Enum",
"Numeric","Numeric","Numeric","Numeric","Numeric","Numeric","Enum","Enum"]
delete_on_done: true
check_header: 1
chunk_size: 4194304
Observe that the parameters you have set up in the configuration box are listed in the above code. Now, run this cell. After a while, the parsing completes and you will see the following output −
Examining Dataframe
After the processing, it generates a dataframe, which can be examined using the following statement −
getFrameSummary "allyears2k.hex"
Upon execution of the above statement, you will see the following output −
Now, your data is ready to be fed into a Machine Learning algorithm.
The next statement is a program comment that says we will be using the regression model and specifies the preset regularization and the lambda values.
Building the Model
Next comes the most important statement: building the model itself. This is specified in the following statement −
buildModel 'glm', {
"model_id":"glm_model","training_frame":"allyears2k.hex",
"ignored_columns":[
"DayofMonth","DepTime","CRSDepTime","ArrTime","CRSArrTime","TailNum",
"ActualElapsedTime","CRSElapsedTime","AirTime","ArrDelay","DepDelay",
"TaxiIn","TaxiOut","Cancelled","CancellationCode","Diverted","CarrierDelay",
"WeatherDelay","NASDelay","SecurityDelay","LateAircraftDelay","IsArrDelayed"],
"ignore_const_cols":true,"response_column":"IsDepDelayed","family":"binomial",
"solver":"IRLSM","alpha":[0.5],"lambda":[0.00001],"lambda_search":false,
"standardize":true,"non_negative":false,"score_each_iteration":false,
"max_iterations":-1,"link":"family_default","intercept":true,
"objective_epsilon":0.00001,"beta_epsilon":0.0001,"gradient_epsilon":0.0001,
"prior":-1,"max_active_predictors":-1
}
We use glm, which is a Generalized Linear Model suite, with the family type set to binomial. You can see these in the above statement. In our case, the expected output is binary, which is why we use the binomial family. You may examine the other parameters by yourself; for example, look at the alpha and lambda values we specified earlier. Refer to the GLM model documentation for the explanation of all the parameters.
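For intuition on the binomial family: a GLM with family binomial is a logistic regression, so the model's linear predictor is mapped to a probability through the logistic (sigmoid) link. A minimal sketch of that transform:

```python
import math

def sigmoid(z):
    # Logistic transform: maps the GLM's linear predictor z to a
    # probability in (0, 1), which is what the binomial family models.
    return 1.0 / (1.0 + math.exp(-z))

# z = 0 sits exactly on the decision boundary; larger z means
# a higher predicted probability of the positive class.
print(round(sigmoid(0.0), 2), round(sigmoid(2.0), 2))  # 0.5 0.88
```

This is why the response column (IsDepDelayed, a yes/no outcome) fits the binomial setting naturally.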
Now, run this statement. Upon execution, the following output will be generated −
Certainly, the execution time would be different on your machine. Now, comes the most interesting part of this sample code.
Examining Output
We simply output the model that we have built using the following statement −
getModel "glm_model"
Note that glm_model is the model ID that we specified as the model_id parameter while building the model in the previous statement. This gives us a huge output detailing the results, with several varying parameters. A partial output of the report is shown in the screenshot below −
As you can see in the output, it says that this is the result of running the Generalized Linear Modeling algorithm on your dataset.
Right above SCORING HISTORY, you will see the MODEL PARAMETERS tag; expand it and you will see the list of all the parameters used while building the model. This is shown in the screenshot below.
Likewise, each tag provides a detailed output of a specific type. Expand the various tags yourself to study the outputs of different kinds.
Building Another Model
Next, we will build a Deep Learning model on our dataframe. The next statement in the sample code is just a program comment. The following statement is actually a model building command. It is as shown here −
buildModel 'deeplearning', {
"model_id":"deeplearning_model","training_frame":"allyears2k.hex","ignored_columns":[
"DepTime","CRSDepTime","ArrTime","CRSArrTime","FlightNum","TailNum",
"ActualElapsedTime","CRSElapsedTime","AirTime","ArrDelay","DepDelay",
"TaxiIn","TaxiOut","Cancelled","CancellationCode","Diverted",
"CarrierDelay","WeatherDelay","NASDelay","SecurityDelay",
"LateAircraftDelay","IsArrDelayed"],
"ignore_const_cols":true,"response_column":"IsDepDelayed",
"activation":"Rectifier","hidden":[200,200],"epochs":"100",
"variable_importances":false,"balance_classes":false,
"checkpoint":"","use_all_factor_levels":true,
"train_samples_per_iteration":-2,"adaptive_rate":true,
"input_dropout_ratio":0,"l1":0,"l2":0,"loss":"Automatic","score_interval":5,
"score_training_samples":10000,"score_duty_cycle":0.1,"autoencoder":false,
"overwrite_with_best_model":true,"target_ratio_comm_to_comp":0.02,
"seed":6765686131094811000,"rho":0.99,"epsilon":1e-8,"max_w2":"Infinity",
"initial_weight_distribution":"UniformAdaptive","classification_stop":0,
"diagnostics":true,"fast_mode":true,"force_load_balance":true,
"single_node_mode":false,"shuffle_training_data":false,"missing_values_handling":
"MeanImputation","quiet_mode":false,"sparse":false,"col_major":false,
"average_activation":0,"sparsity_beta":0,"max_categorical_features":2147483647,
"reproducible":false,"export_weights_and_biases":false
}
As you can see in the above code, we specify deeplearning for building the model, with several parameters set to appropriate values as specified in the documentation of the deeplearning model. When you run this statement, it will take longer than the GLM model building. You will see the following output when the model building completes, albeit with different timings.
Examining Deep Learning Model Output
This generates an output that can be examined using the following statement, as in the earlier case.
getModel "deeplearning_model"
We will consider the ROC curve output as shown below for quick reference.
Like in the earlier case, expand the various tabs and study the different outputs.
Saving the Model
After you have studied the output of the different models, you can decide to use one of them in your production environment. H2O allows you to save a model as a POJO (Plain Old Java Object).
Expand the last tag, PREVIEW POJO, in the output and you will see the Java code for your fine-tuned model. Use this in your production environment.
Next, we will learn about a very exciting feature of H2O. We will learn how to use AutoML to test and rank various algorithms based on their performance.
H2O - AutoML
To use AutoML, start a new Jupyter notebook and follow the steps shown below.
Importing AutoML
First, import the H2O and AutoML packages into the project using the following two statements −
import h2o
from h2o.automl import H2OAutoML
Initialize H2O
Initialize h2o using the following statement −
h2o.init()
You should see the cluster information on the screen as shown in the screenshot below −
Loading Data
We will use the same iris.csv dataset that you used earlier in this tutorial. Load the data using the following statement −
data = h2o.import_file('iris.csv')
Preparing Dataset
We need to decide on the features and the prediction column. We use the same features and prediction column as in our earlier case. Set the features and the output column using the following two statements −
features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
output = 'class'
Split the data in 80:20 ratio for training and testing −
train, test = data.split_frame(ratios=[0.8])
Applying AutoML
Now, we are all set for applying AutoML on our dataset. The AutoML will run for a fixed amount of time set by us and give us the optimized model. We set up the AutoML using the following statement −
aml = H2OAutoML(max_models = 30, max_runtime_secs=300, seed = 1)
The first parameter specifies the maximum number of models that we want to evaluate and compare.
The second parameter specifies the maximum time for which the algorithm runs.
We now call the train method on the AutoML object as shown here −
aml.train(x = features, y = output, training_frame = train)
We specify x as the features array that we created earlier, y as the output variable indicating the predicted value, and the dataframe as the train dataset.
Run the code; you will have to wait up to 5 minutes (we set max_runtime_secs to 300) until you get the following output −
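Conceptually, what AutoML automates is training many candidate models, scoring each on held-out data, and ranking them by the chosen metric. A toy sketch of that ranking step, with hypothetical model names and scores:

```python
# Hypothetical logloss scores for a few candidate models; in a real
# AutoML run these come from cross-validation or held-out scoring.
scores = {
    "GLM": 0.120,
    "DRF": 0.095,
    "GBM": 0.081,
    "DeepLearning": 0.067,
}

# Lower logloss is better, so sort ascending to build a leaderboard
leaderboard = sorted(scores.items(), key=lambda kv: kv[1])
print(leaderboard[0][0])  # DeepLearning
```

The real leaderboard that H2O produces works the same way, just with many more models and metrics.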
Printing the Leaderboard
When the AutoML processing completes, it creates a leaderboard ranking all the models (up to the 30 we allowed) that it has evaluated. To see the first 10 records of the leaderboard, use the following code −
lb = aml.leaderboard
lb.head()
Upon execution, the above code will generate the following output −
Clearly, the DeepLearning algorithm has got the maximum score.
Predicting on Test Data
Now that you have the models ranked, you can see the performance of the top-rated model on your test data. To do so, run the following code statement −
preds = aml.predict(test)
The processing continues for a while and you will see the following output when it completes.
Printing Result
Print the predicted result using the following statement −
print (preds)
Upon execution of the above statement, you will see the following result −
Printing the Ranking for All
If you want to see the ranks of all the tested algorithms, run the following code statement −
lb.head(rows = lb.nrows)
Upon execution of the above statement, the following output will be generated (partially shown) −
Conclusion
H2O provides an easy-to-use, open-source platform for applying different ML algorithms on a given dataset. It offers several statistical and ML algorithms, including deep learning. During testing, you can fine-tune the parameters of these algorithms. You can do so using the command line or the provided web-based interface called Flow. H2O also supports AutoML, which ranks several algorithms based on their performance. H2O also performs well on Big Data. This is definitely a boon for data scientists, who can apply different Machine Learning models on their datasets and pick the one that best meets their needs.