Weka Concise Tutorial

Weka - Classifiers

Many machine learning applications involve classification. For example, you may want to classify a tumor as malignant or benign, or decide whether to play an outdoor game depending on the weather. Such a decision generally depends on several features of the weather, which makes a tree classifier a natural choice for deciding whether to play.

In this chapter, we will learn how to build such a tree classifier on weather data to decide on the playing conditions.

Setting Test Data

We will use the preprocessed weather data file from the previous lesson. Open the saved file using the Open file… option under the Preprocess tab, then click the Classify tab. You will see the following screen:

[Screenshot: Classify tab]

Before looking at the available classifiers, let us examine the Test options. There are four, listed below:

  1. Use training set

  2. Supplied test set

  3. Cross-validation

  4. Percentage split

Unless you have your own training set or a client-supplied test set, you would use the cross-validation or percentage split option. With cross-validation, you set the number of folds into which the data is split; in each iteration one fold is held out for testing and the remaining folds are used for training. With percentage split, the data is divided between training and testing according to the percentage you set.
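The two splitting strategies can be sketched in plain Python, independent of Weka. This is only an illustration of the idea: the function names, the seed, the rounding rule and the figure of 14 instances are assumptions made for the example, and the folds here are not stratified by class as a real tool might do.

```python
import random

def percentage_split(n_instances, train_percent, seed=1):
    """Shuffle instance indices and split them into (train, test) lists."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Assumed rounding rule: nearest whole number of training instances.
    n_train = round(n_instances * train_percent / 100)
    return indices[:n_train], indices[n_train:]

def k_fold_splits(n_instances, k, seed=1):
    """Return k (train, test) index pairs; every instance is tested exactly once."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits

# Assumed dataset size of 14 instances, split 66% / 34%.
train, test = percentage_split(14, 66)
print(len(train), len(test))  # 9 5

# 7-fold cross-validation over the same 14 instances: 2 test instances per fold.
for train, test in k_fold_splits(14, 7):
    print(sorted(test))
```

With cross-validation every instance is used for testing once, which is why it is usually preferred over a single percentage split when the dataset is small.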

Now, keep the default option, play, as the output class:

[Screenshot: play option]

Next, you will select the classifier.

Selecting Classifier

Click the Choose button and select the following classifier:

weka > classifiers > trees > J48

This is shown in the screenshot below:

[Screenshot: J48 under weka > classifiers > trees]

Click the Start button to start the classification process. After a while, the classification results are presented on your screen as shown here:

[Screenshot: classification results after clicking Start]

Let us examine the output shown on the right-hand side of the screen.

The output says the size of the tree is 6. You will shortly see the visual representation of the tree. The Summary reports 2 correctly classified instances and 3 incorrectly classified instances, and a Relative absolute error of 110%. It also shows the Confusion Matrix. A detailed analysis of these results is beyond the scope of this tutorial. However, you can readily see from them that the classification is not acceptable: you will need more data, refined feature selection, a rebuilt model and so on until you are satisfied with the model's accuracy. That is what WEKA is all about: it allows you to test your ideas quickly.
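To make the Summary figures concrete, the following sketch computes the accuracy implied by 2 correct and 3 incorrect classifications and shows how a confusion matrix is tallied. The function names and the five actual/predicted labels are invented for illustration; they are not the instances or the matrix Weka reported.

```python
from collections import Counter

def accuracy(correct, incorrect):
    """Percentage of test instances that were classified correctly."""
    return 100.0 * correct / (correct + incorrect)

def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

# Counts taken from the Summary described above.
print(accuracy(2, 3))  # 40.0

# Hypothetical labels chosen so that 2 of 5 predictions are correct.
actual = ["yes", "yes", "yes", "no", "no"]
predicted = ["yes", "no", "no", "no", "yes"]

matrix = confusion_matrix(actual, predicted, ["yes", "no"])
for row in matrix:
    print(row)
# [1, 2]
# [1, 1]
```

The diagonal of the matrix holds the correctly classified instances, so its sum (here 2) matches the correct count in the Summary.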

Visualize Results

To see a visual representation of the results, right-click the result in the Result list box. Several options pop up on the screen as shown here:

[Screenshot: Result list context menu]

Select Visualize tree to get a visual representation of the decision tree, as seen in the screenshot below:
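A tree like the one visualized here is read from the root down: test one attribute, follow the matching branch, and repeat until a leaf gives the class. The sketch below shows that walk in Python. The tree in it is hypothetical, written only for illustration, and is not the tree Weka built in this chapter; the attribute values are likewise assumed.

```python
# Hypothetical tree, NOT the one shown in the screenshot.
# An internal node is (attribute, {value: subtree}); a leaf is a class label.
TREE = ("outlook", {
    "sunny": ("humidity", {"high": "no", "normal": "yes"}),
    "overcast": "yes",
    "rainy": ("windy", {"TRUE": "no", "FALSE": "yes"}),
})

def classify(tree, instance):
    """Walk from the root to a leaf, following the instance's attribute values."""
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[instance[attribute]]
    return tree

def size(tree):
    """Count all nodes, internal nodes plus leaves."""
    if isinstance(tree, str):
        return 1
    _, branches = tree
    return 1 + sum(size(subtree) for subtree in branches.values())

day = {"outlook": "sunny", "humidity": "high", "windy": "FALSE"}
print(classify(TREE, day))  # no
print(size(TREE))           # 8
```

The `size` helper assumes that a tree's size counts every node, leaves included; under that assumption this hypothetical tree has size 8, whereas the tree in this chapter is reported with size 6.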

[Screenshot: tree visualization]

Selecting Visualize classifier errors plots the results of classification as shown here:

[Screenshot: classifier errors plot]

A cross represents a correctly classified instance, while a square represents an incorrectly classified instance. At the lower left corner of the plot you see a cross indicating that if outlook is sunny, the game is played; this is a correctly classified instance. Because several instances can fall on the same point, you can separate them by introducing some jitter with the Jitter slider.
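The effect of jitter can be sketched as adding a small random offset to each plotted point so that instances sharing the same coordinates no longer hide one another. This is a generic illustration, not Weka's implementation; the function name, the `amount` parameter, the seed and the sample points are all assumptions.

```python
import random

def jitter(points, amount, seed=0):
    """Shift every (x, y) point by a random offset of at most `amount` per axis."""
    rng = random.Random(seed)
    return [
        (x + rng.uniform(-amount, amount), y + rng.uniform(-amount, amount))
        for x, y in points
    ]

# Hypothetical plot coordinates: three instances share the position (0, 1).
points = [(0, 1), (0, 1), (0, 1), (2, 0)]
print(len(set(points)))  # 2 distinct positions before jitter

spread = jitter(points, 0.1)
for x, y in spread:
    print(round(x, 3), round(y, 3))
```

Keeping `amount` small relative to the spacing between attribute values separates overlapping points without moving them into a neighboring value's region.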

The current plot is outlook versus play, as indicated by the two drop-down list boxes at the top of the screen.

[Screenshot: plot of outlook versus play]

Now, try a different selection in each of these boxes and notice how the X and Y axes change. The same can be achieved using the horizontal strips on the right-hand side of the plot. Each strip represents an attribute. A left click on a strip sets that attribute on the X-axis, while a right click sets it on the Y-axis.

Several other plots are provided for deeper analysis. Use them judiciously to fine-tune your model. One such plot, Cost/Benefit analysis, is shown below for quick reference.

[Screenshot: Cost/Benefit analysis plot]

Explaining the analysis in these charts is beyond the scope of this tutorial. The reader is encouraged to brush up on the analysis of machine learning algorithms.

In the next chapter, we will look at the next set of machine learning algorithms: clustering.