Weka Concise Tutorial
Weka - Clustering
A clustering algorithm finds groups of similar instances within a dataset. WEKA supports several clustering algorithms, such as EM, FilteredClusterer, HierarchicalClusterer, and SimpleKMeans. Understanding how these algorithms work will help you make full use of WEKA's capabilities.
As with classification, WEKA lets you visualize the detected clusters graphically. To demonstrate clustering, we will use the provided iris dataset. It contains three classes of 50 instances each, and each class refers to a type of iris plant.
Loading Data
In the WEKA Explorer, select the Preprocess tab. Click the Open file … option and select the iris.arff file in the file selection dialog. Once the data is loaded, the screen looks as shown below:

You can observe that there are 150 instances and 5 attributes. The attributes are named sepallength, sepalwidth, petallength, petalwidth, and class. The first four attributes are numeric, while class is a nominal attribute with 3 distinct values. Examine each attribute to understand the characteristics of the dataset. We will not preprocess this data and will proceed straight to model building.
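The summary that the Preprocess tab reports (attribute names, their types, and the instance count) comes from the header and data sections of the ARFF file. The following is a minimal Python sketch of that idea, not WEKA's own loader. The embedded sample text and the helper name `summarize_arff` are assumptions made for illustration; the real iris.arff holds 150 data rows and may declare its types differently.

```python
# Hypothetical, abbreviated ARFF text modelled on the iris description above.
SAMPLE_ARFF = """\
@relation iris
@attribute sepallength numeric
@attribute sepalwidth numeric
@attribute petallength numeric
@attribute petalwidth numeric
@attribute class {Iris-setosa,Iris-versicolor,Iris-virginica}
@data
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""


def summarize_arff(text):
    """Return (attributes, instance_count) for a simple, dense ARFF text.

    attributes is a list of (name, type) pairs. The type is "nominal" for a
    brace-delimited value list, otherwise the declared type in lower case.
    Quoted names, sparse rows and other ARFF features are not handled.
    """
    attributes = []
    instance_count = 0
    in_data = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        if in_data:
            instance_count += 1  # every remaining line is one instance
        elif line.lower().startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            kind = "nominal" if spec.startswith("{") else spec.lower()
            attributes.append((name, kind))
        elif line.lower().startswith("@data"):
            in_data = True
    return attributes, instance_count


attributes, instance_count = summarize_arff(SAMPLE_ARFF)
print(len(attributes), "attributes,", instance_count, "instances")
for name, kind in attributes:
    print(f"  {name}: {kind}")
```

Run against the full iris.arff text instead of the sample, a parser along these lines should report the 5 attributes and 150 instances described above.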
Clustering
Click the Cluster tab to apply clustering algorithms to the loaded data. Click the Choose button. You will see the following screen:

Now, select EM as the clustering algorithm. In the Cluster mode sub-window, select the Classes to clusters evaluation option, as shown in the screenshot below:

Click the Start button to process the data. After a short while, the results will appear on the screen.
Next, let us study the results.
Examining Output
The output of the data processing is shown in the screen below:

From the output screen, you can observe that:
- EM detected 5 clusters in the dataset.
- Cluster 0 represents setosa, Cluster 1 represents virginica, and Cluster 2 represents versicolor, while the last two clusters have no class associated with them.
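To see why some clusters can end up with no class, it helps to sketch how a classes-to-clusters evaluation can pair clusters with classes. The Python sketch below is a simplified illustration under stated assumptions, not WEKA's implementation: it assumes each class is paired with at most one cluster, searches all pairings by brute force, and uses made-up toy assignments. The function name `map_classes_to_clusters` is hypothetical.

```python
from collections import Counter
from itertools import permutations


def map_classes_to_clusters(cluster_ids, class_labels):
    """Pair each class with at most one cluster so that as many instances
    as possible sit in the cluster paired with their own class.

    Returns (mapping, incorrect): mapping is {cluster_id: class_label} and
    incorrect counts the instances not covered by the pairing.
    Brute force, so only suitable for a handful of clusters and classes.
    """
    counts = Counter(zip(cluster_ids, class_labels))
    clusters = sorted(set(cluster_ids))
    classes = sorted(set(class_labels))

    # Enumerate every one-to-one pairing of the smaller set into the larger.
    if len(clusters) >= len(classes):
        pairings = (list(zip(p, classes))
                    for p in permutations(clusters, len(classes)))
    else:
        pairings = (list(zip(clusters, p))
                    for p in permutations(classes, len(clusters)))

    best_mapping, best_correct = {}, -1
    for pairing in pairings:
        correct = sum(counts[pair] for pair in pairing)
        if correct > best_correct:
            best_correct, best_mapping = correct, dict(pairing)
    return best_mapping, len(cluster_ids) - best_correct


# Toy data (assumed for illustration): 10 instances spread over 5 clusters.
cluster_ids = [0, 0, 0, 1, 1, 1, 2, 2, 3, 4]
class_labels = ["setosa", "setosa", "setosa",
                "virginica", "virginica", "versicolor",
                "versicolor", "versicolor", "versicolor", "virginica"]

mapping, incorrect = map_classes_to_clusters(cluster_ids, class_labels)
for cluster in sorted(set(cluster_ids)):
    print(f"Cluster {cluster} <-- {mapping.get(cluster, 'No class')}")
print("Incorrectly clustered instances:", incorrect)
```

With three classes and five clusters, two clusters necessarily stay unpaired, which mirrors the "no class" clusters in the output discussed above.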
If you scroll up in the output window, you will also see statistics giving the mean and standard deviation of each attribute in each detected cluster. This is shown in the screenshot below.
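As a rough illustration of what such per-cluster statistics summarize, the Python sketch below groups one attribute by cluster and computes its mean and standard deviation. It rests on several assumptions: the values are invented, each instance belongs to exactly one cluster (EM itself works with soft membership probabilities), and the sample standard deviation from `statistics.stdev` is used, which may not match the estimator WEKA reports.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical petal lengths with hard cluster labels (illustration only).
petallength = [1.4, 1.5, 1.3, 4.7, 4.5, 4.9, 6.0, 5.8, 6.1]
cluster_ids = [0, 0, 0, 1, 1, 1, 2, 2, 2]


def per_cluster_stats(values, cluster_ids):
    """Return {cluster_id: (mean, sample_std_dev)} for one attribute.

    Each cluster needs at least two values for stdev to be defined.
    """
    groups = defaultdict(list)
    for value, cluster in zip(values, cluster_ids):
        groups[cluster].append(value)
    return {c: (mean(v), stdev(v)) for c, v in sorted(groups.items())}


stats = per_cluster_stats(petallength, cluster_ids)
for cluster, (mu, sigma) in stats.items():
    print(f"Cluster {cluster}: mean={mu:.3f} std.dev={sigma:.3f}")
```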

Next, we will look at the visual representation of the clusters.
Visualizing Clusters
To visualize the clusters, right-click the EM result in the Result list. You will see the following options:

Select Visualize cluster assignments. You will see the following output:

As with classification, you will notice the distinction between correctly and incorrectly identified instances. You can change the X and Y axes to analyze the results from different angles. As with classification, you can use jittering to see where correctly identified instances are concentrated. The operations available in the visualization plot are similar to those you studied in the case of classification.
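Jittering helps because many iris instances share identical attribute values and are drawn on top of each other. The Python sketch below shows the general idea of jitter: adding a small random offset to each plotted coordinate. The uniform noise, the `amount` parameter, and the sample points are assumptions for illustration; how WEKA generates its jitter is not stated in this tutorial.

```python
import random


def jitter(points, amount, seed=0):
    """Return a copy of points with each coordinate shifted by a random
    offset drawn uniformly from [-amount, amount].

    Only the plotted positions change; the underlying data is untouched.
    """
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    return [(x + rng.uniform(-amount, amount),
             y + rng.uniform(-amount, amount))
            for x, y in points]


# Three instances with identical (petallength, petalwidth) values would be
# drawn as a single dot; after jittering they become distinguishable.
points = [(1.4, 0.2), (1.4, 0.2), (1.4, 0.2), (4.7, 1.4)]
jittered = jitter(points, amount=0.05)
for original, moved in zip(points, jittered):
    print(original, "->", tuple(round(v, 3) for v in moved))
```

A larger `amount` spreads overlapping points further apart, at the cost of showing their true positions less precisely.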
Applying Hierarchical Clusterer
To demonstrate the power of WEKA, let us now apply another clustering algorithm. In the WEKA Explorer, select HierarchicalClusterer as the clustering algorithm, as shown in the screenshot below:

Set the Cluster mode to Classes to clusters evaluation, and click the Start button. You will see the following output:

Notice that the Result list now contains two results: the first is the EM result and the second is the current Hierarchical result. In the same way, you can apply multiple ML algorithms to the same dataset and quickly compare their results.
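For intuition about what a hierarchical clusterer produces, the Python sketch below builds a merge tree bottom-up: every point starts as its own cluster, and the two closest clusters are merged until one remains. Single linkage (the distance between the nearest members of two clusters) and Euclidean distance are assumptions chosen for illustration; WEKA's HierarchicalClusterer offers several link types, and this is not its implementation. The sample points and the function name `single_linkage_tree` are hypothetical.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def single_linkage_tree(points):
    """Agglomerative clustering with single linkage.

    Returns a nested tuple of point indices describing the merge order,
    e.g. ((0, 1), 2) means points 0 and 1 merged first, then point 2 joined.
    Assumes at least one point.
    """
    # Each cluster is (tree, member_indices).
    clusters = [(i, [i]) for i in range(len(points))]
    while len(clusters) > 1:
        # Pick the pair of clusters whose closest members are nearest.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: min(
                dist(points[i], points[j])
                for i in clusters[pair[0]][1]
                for j in clusters[pair[1]][1]
            ),
        )
        merged = ((clusters[a][0], clusters[b][0]),
                  clusters[a][1] + clusters[b][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters[0][0]


# Two tight pairs of points and one distant outlier (illustrative data).
points = [(1.0, 1.0), (1.2, 1.1), (5.0, 5.0), (5.3, 5.1), (10.0, 1.0)]
tree = single_linkage_tree(points)
print(tree)
```

Reading the nested tuple from the inside out gives the merge order: the two tight pairs form first, then join each other, and the outlier is attached last.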
If you examine the tree produced by this algorithm, you will see the following output:

In the next chapter, you will study the Associate type of ML algorithms.