H2O Concise Tutorial
H2O - AutoML
To use AutoML, start a new Jupyter notebook and follow the steps below.
Importing AutoML
First, import the H2O and AutoML packages into the project using the following two statements:
import h2o
from h2o.automl import H2OAutoML
Initialize H2O
Initialize H2O using the following statement:
h2o.init()
You should see the cluster information on the screen, as shown in the screenshot below:
Loading Data
We will use the same iris.csv dataset that you used earlier in this tutorial. Load the data using the following statement:
data = h2o.import_file('iris.csv')
Preparing Dataset
We need to decide on the feature columns and the prediction column. We use the same features and prediction column as in the earlier example. Set the features and the output column using the following two statements:
features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
output = 'class'
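Rather than typing the feature names by hand, the feature list can also be derived by dropping the output column from the full list of column names. The following is a minimal sketch: the plain list `columns` is a stand-in for the frame's column names (with H2O these would come from `data.columns`), and the names are assumed to match the header of the iris.csv file used here.

```python
# Stand-in for data.columns; the names are assumed to match the iris.csv header.
columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
output = 'class'

# Every column except the output column is treated as a feature.
features = [name for name in columns if name != output]
print(features)
```

This keeps the feature list in sync with the file if columns are added or renamed.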
Split the data in an approximately 80:20 ratio for training and testing (split_frame assigns rows probabilistically, so the resulting sizes are not exact):
train, test = data.split_frame(ratios=[0.8])
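Note that split_frame does not guarantee an exact 80:20 split, because each row is assigned to a subset at random. The sketch below illustrates that idea in plain Python. It is not H2O's implementation, and `probabilistic_split` is a hypothetical helper written only for this illustration.

```python
import random

def probabilistic_split(rows, ratio, seed):
    """Send each row to the first subset with probability `ratio`."""
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        # Each row is decided independently, so subset sizes vary slightly.
        (first if rng.random() < ratio else second).append(row)
    return first, second

rows = list(range(150))  # the iris dataset has 150 rows
train_rows, test_rows = probabilistic_split(rows, ratio=0.8, seed=1)
print(len(train_rows), len(test_rows))
```

The two sizes always add up to 150, but the first is only close to 120, not necessarily equal to it. In H2O itself, split_frame accepts a seed argument if you want the split to be repeatable.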
Applying AutoML
We are now all set to apply AutoML to our dataset. AutoML will run for at most the amount of time that we set and give us an optimized model. We set up AutoML using the following statement:
aml = H2OAutoML(max_models=30, max_runtime_secs=300, seed=1)
The first parameter, max_models, specifies the maximum number of models that we want to build and compare.
The second parameter, max_runtime_secs, specifies the maximum time, in seconds, for which AutoML runs. The run stops as soon as either limit is reached.
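When both max_models and max_runtime_secs are given, the run ends at whichever limit is hit first. The sketch below imitates that stopping rule with a simulated clock. It is an illustration only, not H2O's scheduler: `run_search`, the candidate names, and the 20-second cost per model are all hypothetical.

```python
def run_search(candidate_costs, max_models, max_runtime_secs):
    """Build candidates in order until either limit is reached."""
    built = []
    elapsed = 0.0  # simulated seconds, not wall-clock time
    for name, cost_secs in candidate_costs:
        if len(built) >= max_models or elapsed >= max_runtime_secs:
            break
        elapsed += cost_secs
        built.append(name)
    return built, elapsed

# Hypothetical candidates, each assumed to take 20 simulated seconds to train.
candidates = [(f"model_{i}", 20.0) for i in range(40)]
built, elapsed = run_search(candidates, max_models=30, max_runtime_secs=300)
print(len(built), elapsed)
```

With these made-up costs the time budget runs out before 30 models are built, which is why a leaderboard can contain fewer models than max_models.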
We now call the train method on the AutoML object, as shown here:
aml.train(x=features, y=output, training_frame=train)
We specify x as the list of features that we created earlier, y as the output column to be predicted, and training_frame as the train dataset.
Run the code. You will have to wait up to 5 minutes (we set max_runtime_secs to 300) until you get the following output:
Printing the Leaderboard
When AutoML processing completes, it creates a leaderboard ranking all the models that it has evaluated (at most 30 here, not counting any stacked ensembles). To see the first 10 records of the leaderboard, use the following code:
lb = aml.leaderboard
lb.head()
Upon execution, the above code will generate the following output:
In this run, a DeepLearning model ranks at the top of the leaderboard.
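For a multi-class problem such as iris, the leaderboard is sorted by default on mean per-class error, where a lower value is better, so the top row is the model with the smallest error. The sketch below shows that ordering on a toy table. The model ids and error values are made up for illustration and are not results from a real run.

```python
# Hypothetical leaderboard rows: (model_id, mean_per_class_error).
# The ids and values are invented for illustration, not taken from a real run.
rows = [("model_a", 0.08), ("model_b", 0.03), ("model_c", 0.05)]

# Lower error is better, so sort in ascending order of the metric.
ranked = sorted(rows, key=lambda row: row[1])
leader_id = ranked[0][0]
print(leader_id)
```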
Predicting on Test Data
Now that you have the models ranked, you can see the performance of the top-rated model on your test data. To do so, run the following statement:
preds = aml.predict(test)
The processing continues for a while, and you will see the following output when it completes.
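Once predictions are available, one way to judge them is to compare the predicted labels with the true labels class by class. The sketch below computes the mean per-class error on two small label lists. The function `mean_per_class_error` is written here for illustration, and the labels are toy values, not output from the model above.

```python
from collections import defaultdict

def mean_per_class_error(actual, predicted):
    """Average, over classes, of the fraction of rows of that class predicted wrongly."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for true_label, predicted_label in zip(actual, predicted):
        totals[true_label] += 1
        if true_label != predicted_label:
            errors[true_label] += 1
    return sum(errors[label] / totals[label] for label in totals) / len(totals)

# Toy labels for illustration only; one versicolor row is misclassified.
actual = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
predicted = ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"]
print(mean_per_class_error(actual, predicted))
```

With H2O, the predicted labels are in the predict column of preds and the true labels are in the class column of test. H2O can also compute such metrics directly, for example with aml.leader.model_performance(test).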
Printing Result
Print the predicted result using the following statement:
print(preds)
Upon execution of the above statement, you will see the following result:
Printing the Ranking for All
If you want to see the ranks of all the tested models, run the following statement:
lb.head(rows = lb.nrows)
Upon execution of the above statement, the following output will be generated (partially shown):
Conclusion
H2O provides an easy-to-use open source platform for applying different ML algorithms to a given dataset. It provides several statistical and ML algorithms, including deep learning. During testing, you can fine-tune the parameters of these algorithms, either from the command line or through the provided web-based interface called Flow. H2O also supports AutoML, which ranks several algorithms based on their performance, and it performs well on Big Data. This is a boon for data scientists, who can apply different machine learning models to their dataset and pick the one that best meets their needs.