H2O Concise Tutorial
H2O - Installation
H2O can be configured and used in five different ways, as listed below:
- Install in Python
- Install in R
- Web-based Flow GUI
- Hadoop
- Anaconda Cloud
The following sections give installation instructions for each of these options. You will most likely need only one of them.
Install in Python
Running H2O with Python requires several dependencies, so let us start by installing the minimum set needed to run H2O.
Installing Dependencies
To install a dependency, execute the following pip command:
$ pip install requests
Open your console window and type the above command to install the requests package. The following screenshot shows the execution of the above command on our Mac machine:

After installing requests, you need to install three more packages as shown below:
$ pip install tabulate
$ pip install "colorama >= 0.3.8"
$ pip install future
The most up-to-date list of dependencies is available on the H2O GitHub page. At the time of this writing, the following dependencies are listed there:
python
pip >= 9.0.1
setuptools
colorama >= 0.3.7
future >= 0.15.2
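To see which of these packages are already present before running pip, a small check along the following lines may help. This is a minimal sketch that uses only the standard library (`importlib.metadata`, available from Python 3.8). The helper name `installed_versions` is our own and is not part of H2O.

```python
from importlib import metadata

# Package names taken from the pip commands and the dependency list above.
REQUIRED = ["requests", "tabulate", "colorama", "future"]

def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'not installed'}")
```

Any package reported as not installed can then be installed with the pip commands shown earlier.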
Removing Older Versions
After installing the above dependencies, you need to remove any existing H2O installation. To do so, run the following command:
$ pip uninstall h2o
Installing the Latest Version
Now, let us install the latest version of H2O using the following command:
$ pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o
After a successful installation, you should see a message like the following on the screen (the version number depends on the release you installed):
Installing collected packages: h2o
Successfully installed h2o-3.26.0.1
Testing the Installation
To test the installation, we will run one of the sample applications provided with H2O. First, start the Python prompt by typing the following command:
$ python3
Once the Python interpreter starts, type the following statement at the Python prompt:
>>> import h2o
The above statement imports the H2O package into your program. Next, initialize the H2O system using the following command:
>>> h2o.init()
Your screen will show the cluster information and should look like the following at this stage:

Now you are ready to run the sample code. Type the following command at the Python prompt and execute it.
>>> h2o.demo("glm")
The demo consists of a Python notebook with a series of commands. After each command runs, its output is shown immediately and you are asked to press a key to continue with the next step. A partial screenshot taken after executing the last statement in the notebook is shown here:

At this stage, your Python installation is complete and you are ready for your own experimentation.
Install in R
Installing H2O for R development is very similar to installing it for Python, except that you use the R prompt for the installation.
Starting R Console
Start the R console by clicking the R application icon on your machine. The console screen appears as shown in the following screenshot:

You will install H2O from the R prompt shown above. If you prefer using RStudio, type the commands in the R console subwindow.
Removing Older Versions
To begin with, remove older versions using the following commands at the R prompt:
> if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
> if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }
Downloading Dependencies
Download the dependencies for H2O using the following code:
> pkgs <- c("RCurl","jsonlite")
for (pkg in pkgs) {
  if (! (pkg %in% rownames(installed.packages()))) { install.packages(pkg) }
}
Installing H2O
Install H2O by typing the following command at the R prompt:
> install.packages("h2o", type = "source", repos = (c("http://h2o-release.s3.amazonaws.com/h2o/latest_stable_R")))
The following screenshot shows the expected output:

There is another way of installing H2O in R.
Install in R from CRAN
To install H2O from CRAN, use the following command at the R prompt:
> install.packages("h2o")
You will be asked to select a mirror:
--- Please select a CRAN mirror for use in this session ---

A dialog box listing the mirror sites is shown on your screen. Select the nearest location or the mirror of your choice.
Installing Web GUI Flow
To install the Flow GUI, download the installation file from the H2O site and unzip it into your preferred folder. Note the h2o.jar file in the unzipped installation. From a command window in that folder, run this file using the following command:
$ java -jar h2o.jar
After a while, output similar to the following appears in your console window.
07-24 16:06:37.304 192.168.1.18:54321 3294 main INFO: H2O started in 7725ms
07-24 16:06:37.304 192.168.1.18:54321 3294 main INFO:
07-24 16:06:37.305 192.168.1.18:54321 3294 main INFO: Open H2O Flow in your web browser: http://192.168.1.18:54321
07-24 16:06:37.305 192.168.1.18:54321 3294 main INFO:
To start Flow, open http://localhost:54321 (or the URL printed in the console output) in your browser. The following screen will appear:

At this stage, your Flow installation is complete.
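If the browser cannot reach the page, it can help to check whether anything is listening on the Flow port at all. The following is a minimal sketch using only the Python standard library. The helper name `is_port_open` is our own, and the default port 54321 is taken from the console output above. A successful connection shows only that some server is listening on that port, not that it is H2O.

```python
import socket

def is_port_open(host="localhost", port=54321, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if is_port_open():
    print("Something is listening on port 54321; try opening Flow in the browser.")
else:
    print("Nothing is listening on port 54321; check the java -jar h2o.jar console.")
```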
Install on Hadoop / Anaconda Cloud
Unless you are a seasoned developer, you are unlikely to start by using H2O on Big Data. Suffice it to say here that H2O models run efficiently on huge databases of several terabytes. If your data resides in a Hadoop installation or in the cloud, follow the steps given on the H2O site to install it for your environment.
Now that you have successfully installed and tested H2O on your machine, you are ready for real development. First, we will look at development from a command prompt. In subsequent lessons, we will learn how to do model testing in H2O Flow.
Developing in Command Prompt
Let us now use H2O to classify the plants in the well-known iris dataset, which is freely available for developing Machine Learning applications.
Start the Python interpreter by typing the following command in your shell window:
$ python3
This starts the Python interpreter. Import the h2o package using the following command:
>>> import h2o
We will use the Random Forest algorithm for classification. It is provided by the H2ORandomForestEstimator class in the h2o.estimators module, which we import as follows:
>>> from h2o.estimators import H2ORandomForestEstimator
We initialize the H2O environment by calling its init method.
>>> h2o.init()
On successful initialization, you should see the following message on the console along with the cluster information.
Checking whether there is an H2O instance running at http://localhost:54321 . connected.
Now, we will import the iris data using the import_file method in H2O. The call below assumes that iris.csv is in the current working directory.
>>> data = h2o.import_file('iris.csv')
The progress is displayed as shown in the following screenshot:

After the file is loaded into memory, you can verify this by displaying the first 10 rows of the loaded table. Use the head method to do so:
>>> data.head()
You will see the following output in tabular format.

The table also displays the column names. We will use the first four columns as the features for our ML algorithm and the last column, class, as the value to predict. We specify this in the call to our ML algorithm by first creating the following two variables.
>>> features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
>>> output = 'class'
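Rather than typing the feature names by hand, the list can also be derived from the column names. The sketch below works on a plain Python list; with a real frame you would pass `data.columns` instead. The helper name `feature_columns` is our own, and the column names are those this tutorial assumes for `iris.csv`.

```python
def feature_columns(columns, target):
    """Return every column name except the target, preserving the original order."""
    if target not in columns:
        raise ValueError(f"target column {target!r} not found in {columns}")
    return [name for name in columns if name != target]

# Column names as used in this tutorial; with H2O you would use data.columns.
columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
output = 'class'
features = feature_columns(columns, output)
print(features)
```

Deriving the list this way keeps the feature names in step with the file if its header uses different column names.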
Next, we split the data into training and testing sets by calling the split_frame method.
>>> train, test = data.split_frame(ratios = [0.8])
The data is split in an approximate 80:20 ratio. We use about 80% of the data for training and the remaining 20% for testing.
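Note that `split_frame` assigns rows to the splits at random rather than cutting at an exact row count, so the proportions are approximate. It also accepts a `seed` argument to make the split repeatable. The following plain-Python sketch illustrates the idea only. It is not H2O's implementation, and the function name `probabilistic_split` is hypothetical.

```python
import random

def probabilistic_split(rows, ratio=0.8, seed=None):
    """Send each row to the first split with probability `ratio`, else to the second."""
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        (first if rng.random() < ratio else second).append(row)
    return first, second

rows = list(range(150))  # the iris dataset has 150 rows
train, test = probabilistic_split(rows, ratio=0.8, seed=42)
# The sizes are close to 120 and 30, but not necessarily exact.
print(len(train), len(test))
```

Running the sketch with the same seed always yields the same split, which is useful when comparing models on identical training and test sets.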
Now, we create an instance of the built-in Random Forest estimator.
>>> model = H2ORandomForestEstimator(ntrees = 50, max_depth = 20, nfolds = 10)
In the above call, we set the number of trees to 50, the maximum depth of each tree to 20 and the number of folds for cross-validation to 10. We now need to train the model. We do so by calling the train method as follows:
>>> model.train(x = features, y = output, training_frame = train)
The train method receives the features and the output that we created earlier as its x and y arguments. The training frame is set to train, which holds about 80% of our full dataset. During training, a progress bar is displayed.
Now that the model has been built, it is time to test it. We do this by calling the model_performance method on the trained model object.
>>> performance = model.model_performance(test_data=test)
In the above method call, we pass the test data as the argument.
It is now time to see the output, which is the performance of our model. You do this by simply printing the performance object.
>>> print (performance)
This will give you the following output:

The output shows the Mean Squared Error (MSE), the Root Mean Squared Error (RMSE), the LogLoss and the Confusion Matrix.
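To make these numbers concrete, the sketch below computes the first three by hand for three made-up predictions. It assumes one common convention for classification metrics, in which the per-row error is one minus the probability the model assigned to the true class. The probabilities are invented for illustration, and H2O's own report may differ in detail.

```python
import math

def classification_metrics(true_class_probs):
    """Compute MSE, RMSE and LogLoss from the probability given to each row's true class."""
    n = len(true_class_probs)
    mse = sum((1.0 - p) ** 2 for p in true_class_probs) / n
    rmse = math.sqrt(mse)
    logloss = -sum(math.log(p) for p in true_class_probs) / n
    return mse, rmse, logloss

# Hypothetical probabilities a model assigned to the correct class of three test rows.
probs = [0.9, 0.8, 0.5]
mse, rmse, logloss = classification_metrics(probs)
print(f"MSE={mse:.4f} RMSE={rmse:.4f} LogLoss={logloss:.4f}")
# prints: MSE=0.1000 RMSE=0.3162 LogLoss=0.3406
```

For all three metrics, lower values indicate predictions that are closer to the true classes.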
Running in Jupyter
We have seen the execution from the command prompt and understood the purpose of each line of code. You may also run the code in a Jupyter environment, either line by line or as a whole program. The complete listing is given here:
import h2o
from h2o.estimators import H2ORandomForestEstimator
h2o.init()
data = h2o.import_file('iris.csv')
features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
output = 'class'
train, test = data.split_frame(ratios=[0.8])
model = H2ORandomForestEstimator(ntrees = 50, max_depth = 20, nfolds = 10)
model.train(x = features, y = output, training_frame = train)
performance = model.model_performance(test_data=test)
print (performance)
Run the code and observe the output. You can now appreciate how easy it is to apply and test a Random Forest algorithm on your dataset. The power of H2O goes far beyond this capability. What if you want to try another model on the same dataset to see whether you can get better performance? This is explained in the next section.
Applying a Different Algorithm
Now, we will learn how to apply a Gradient Boosting algorithm to our earlier dataset to see how it performs. In the full listing above, you need to make only two minor changes: import H2OGradientBoostingEstimator instead of H2ORandomForestEstimator, and use it to create the model. The modified listing is given below:
import h2o
from h2o.estimators import H2OGradientBoostingEstimator
h2o.init()
data = h2o.import_file('iris.csv')
features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
output = 'class'
train, test = data.split_frame(ratios = [0.8])
model = H2OGradientBoostingEstimator(ntrees = 50, max_depth = 20, nfolds = 10)
model.train(x = features, y = output, training_frame = train)
performance = model.model_performance(test_data = test)
print (performance)
Run the code and you will get the following output:

Compare results such as the MSE, RMSE and Confusion Matrix with the previous output and decide which model to use for production deployment. In fact, you can apply several different algorithms and choose the one that best meets your purpose.
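The comparison can also be done in code instead of by eye. The sketch below trains each estimator in a dictionary and collects the LogLoss on the test frame. The helper names `compare_models` and `best_model` are our own, not part of H2O, and using LogLoss as the single criterion is an assumption made here for simplicity. The helpers rely only on the `train`, `model_performance` and `logloss` methods used or reported earlier in this tutorial.

```python
def compare_models(estimators, features, target, train, test):
    """Train each estimator and return {name: LogLoss on the test frame}."""
    scores = {}
    for name, estimator in estimators.items():
        estimator.train(x=features, y=target, training_frame=train)
        performance = estimator.model_performance(test_data=test)
        scores[name] = performance.logloss()
    return scores

def best_model(scores):
    """Return the name with the lowest LogLoss (lower is better)."""
    return min(scores, key=scores.get)

# Usage with the objects built earlier in this tutorial (requires a running H2O cluster):
# from h2o.estimators import H2ORandomForestEstimator, H2OGradientBoostingEstimator
# scores = compare_models(
#     {"random_forest": H2ORandomForestEstimator(ntrees=50, max_depth=20, nfolds=10),
#      "gradient_boosting": H2OGradientBoostingEstimator(ntrees=50, max_depth=20, nfolds=10)},
#     features, output, train, test)
# print(scores, best_model(scores))
```

On a dataset as small as iris, differences between models can easily come from the random split, so it is worth repeating the comparison with several seeds before choosing.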