Weka - Quick Guide

Weka - Introduction

The foundation of any Machine Learning application is data: not just a little data, but a large volume of it, commonly termed Big Data.

To train a machine to analyze big data, the data must satisfy several conditions −

  1. The data must be clean.

  2. It should not contain null values.

Besides, not all the columns in the data table are useful for the type of analytics you are trying to achieve. The irrelevant data columns, or 'features' as they are called in Machine Learning terminology, must be removed before the data is fed into a machine learning algorithm.
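The checks described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not WEKA code: the table, its column names and the use of None for a missing value are all assumptions made for the example.

```python
# Hypothetical raw records; None stands for a missing (null) value.
rows = [
    {"age": 34, "income": 52000, "phone": "555-0101"},
    {"age": None, "income": 61000, "phone": "555-0102"},
    {"age": 45, "income": None, "phone": "555-0103"},
    {"age": 29, "income": 48000, "phone": "555-0104"},
]

def count_missing(records):
    """Return {column: number of records where the column is None}."""
    counts = {}
    for record in records:
        for column, value in record.items():
            counts[column] = counts.get(column, 0) + (value is None)
    return counts

def drop_columns(records, unwanted):
    """Return copies of the records without the unwanted columns."""
    return [{k: v for k, v in r.items() if k not in unwanted} for r in records]

def drop_incomplete(records):
    """Keep only the records that have no missing value."""
    return [r for r in records if all(v is not None for v in r.values())]

missing = count_missing(rows)
# Remove the irrelevant "phone" column first, then the incomplete rows.
cleaned = drop_incomplete(drop_columns(rows, {"phone"}))
print(missing)
print(len(cleaned))
```

Dropping incomplete rows is only one possible policy; replacing missing values with a mean or mode is another, and the right choice depends on your data.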

In short, your big data needs a lot of preprocessing before it can be used for Machine Learning. Once the data is ready, you apply Machine Learning algorithms such as classification, regression, or clustering to solve your problem.

The type of algorithm you apply depends largely on your domain knowledge. Even within one type, for example classification, several algorithms are available. You may want to test different algorithms of the same class to build an efficient machine learning model. While doing so, you will want to visualize the processed data, so you also need visualization tools.

In the upcoming chapters, you will learn about Weka, a software package that accomplishes all of the above and lets you work with big data comfortably.

What is Weka?

WEKA is open source software that provides tools for data preprocessing, implementations of several Machine Learning algorithms, and visualization tools, so that you can develop machine learning techniques and apply them to real-world data mining problems. What WEKA offers is summarized in the following diagram −

weka summarized

If you follow the flow in the image from the beginning, you will see that there are many stages in making Big Data suitable for machine learning −

First, you start with the raw data collected from the field. This data may contain several null values and irrelevant fields. You use the data preprocessing tools provided in WEKA to cleanse the data.

Then, you save the preprocessed data in your local storage for applying ML algorithms.

Next, depending on the kind of ML model that you are trying to develop, you select one of the options such as Classify, Cluster, or Associate. The Select Attributes option allows the automatic selection of features to create a reduced dataset.

Note that under each category, WEKA provides implementations of several algorithms. You select an algorithm of your choice, set the desired parameters and run it on the dataset.

Then, WEKA gives you the statistical output of the model processing. It also provides a visualization tool to inspect the data.

Various models can be applied to the same dataset. You can then compare the outputs of the different models and select the one that best meets your purpose.

Thus, on the whole, using WEKA speeds up the development of machine learning models.

Now that we have seen what WEKA is and what it does, let us learn in the next chapter how to install WEKA on your local computer.

Weka - Installation

To install WEKA on your machine, visit WEKA's official website and download the installation file. WEKA supports installation on Windows, Mac OS X and Linux. Follow the instructions on the download page to install WEKA for your OS.

The steps for installing on Mac are as follows −

  1. Download the Mac installation file.

  2. Double click on the downloaded weka-3-8-3-corretto-jvm.dmg file.

You will see the following screen on successful installation.

weka installation
  3. Click on the weka-3-8-3-corretto-jvm icon to start Weka.

  4. Optionally, you may start it from the command line −

java -jar weka.jar

The WEKA GUI Chooser application will start and you will see the following screen −

weka application

The GUI Chooser application allows you to run five different types of applications as listed here −

  1. Explorer

  2. Experimenter

  3. KnowledgeFlow

  4. Workbench

  5. Simple CLI

We will be using Explorer in this tutorial.

Weka - Launching Explorer

In this chapter, let us look into the various functionalities that the Explorer provides for working with big data.

When you click on the Explorer button in the Applications selector, it opens the following screen −

explorer button

On the top, you will see several tabs as listed here −

  1. Preprocess

  2. Classify

  3. Cluster

  4. Associate

  5. Select Attributes

  6. Visualize

Under these tabs, there are several pre-implemented machine learning algorithms. Let us look into each of the tabs in detail now.

Preprocess Tab

Initially, when you open the Explorer, only the Preprocess tab is enabled. The first step in machine learning is to preprocess the data. Thus, in the Preprocess tab, you select the data file, process it and make it fit for applying the various machine learning algorithms.

Classify Tab

The Classify tab provides several machine learning algorithms for classifying your data. To list a few, you may apply algorithms such as Linear Regression, Logistic Regression, Support Vector Machines, Decision Trees, RandomTree, RandomForest, NaiveBayes, and so on. The list is extensive and covers supervised learners for both classification and regression; unsupervised algorithms are found under the Cluster and Associate tabs.

Cluster Tab

Under the Cluster tab, several clustering algorithms are provided, such as SimpleKMeans, FilteredClusterer, HierarchicalClusterer, and so on.

Associate Tab

Under the Associate tab, you would find Apriori, FilteredAssociator and FPGrowth.

Select Attributes Tab

Select Attributes allows you to perform feature selection based on several algorithms such as ClassifierSubsetEval, PrincipalComponents, etc.

Visualize Tab

Lastly, the Visualize tab allows you to visualize your processed data for analysis.

As you have seen, WEKA provides several ready-to-use algorithms for testing and building your machine learning applications. To use WEKA effectively, you need a sound knowledge of these algorithms: how they work, which one to choose under what circumstances, what to look for in their output, and so on. In short, you need a solid foundation in machine learning to use WEKA effectively in building your apps.

In the upcoming chapters, you will study each tab in the Explorer in depth.

Weka - Loading Data

In this chapter, we start with the first tab, which you use to preprocess the data. Preprocessing is common to all the algorithms that you apply to your data for building a model, and it precedes all subsequent operations in WEKA.

For a machine learning algorithm to give acceptable accuracy, you must cleanse your data first. This is because the raw data collected from the field may contain null values, irrelevant columns and so on.

In this chapter, you will learn how to preprocess the raw data and create a clean, meaningful dataset for further use.

First, you will learn to load a data file into the WEKA Explorer. The data can be loaded from the following sources −

  1. Local file system

  2. Web

  3. Database

In this chapter, we will look at all three options for loading data.

Loading Data from Local File System

Just under the tabs that you studied in the previous chapter, you will find the following three buttons −

  1. Open file …

  2. Open URL …

  3. Open DB …

Click on the Open file … button. A directory navigator window opens as shown in the following screen −

local file system

Now, navigate to the folder where your data files are stored. The WEKA installation comes with many sample datasets for you to experiment with. These are available in the data folder of the WEKA installation.

For learning purposes, select any data file from this folder. The contents of the file are loaded into the WEKA environment. We will soon learn how to inspect and process this loaded data. Before that, let us look at how to load a data file from the Web.

Loading Data from Web

Once you click on the Open URL … button, you will see a window as follows −

loading data from web

We will open the file from a public URL. Type the following URL in the popup box −

https://storm.cis.fordham.edu/~gweiss/data-mining/weka-data/weather.nominal.arff

You may specify any other URL where your data is stored. The Explorer will load the data from the remote site into its environment.

Loading Data from DB

Once you click on the Open DB … button, you will see a window as follows −

loading data from db

Set the connection string to your database, set up the query for data selection, process the query and load the selected records in WEKA.

Weka - File Formats

WEKA supports a large number of file formats for the data. Here is the complete list −

  1. arff

  2. arff.gz

  3. bsi

  4. csv

  5. dat

  6. data

  7. json

  8. json.gz

  9. libsvm

  10. m

  11. names

  12. xrff

  13. xrff.gz

The types of files that it supports are listed in the drop-down list box at the bottom of the screen. This is shown in the screenshot given below.

drop down list

As you will notice, it supports several formats including CSV and JSON. The default file type is Arff.

Arff Format

An Arff file contains two sections - header and data.

  1. The header describes the attribute types.

  2. The data section contains a comma separated list of data.

As an example of the Arff format, the Weather data file loaded from the WEKA sample databases is shown below −

sample databases

From the screenshot, you can infer the following points −

  1. The @relation tag defines the name of the database.

  2. The @attribute tag defines the attributes.

  3. The @data tag starts the list of data rows each containing the comma separated fields.

  4. The attributes can take nominal values as in the case of outlook shown here −

@attribute outlook {sunny, overcast, rainy}

  5. The attributes can take real values as in this case −

@attribute temperature real

  6. You can also set a Target or a Class variable called play as shown here −

@attribute play {yes, no}

  7. The Target assumes two nominal values yes or no.
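To make the header/data split concrete, here is a minimal sketch of a reader for such a file in plain Python. It is not WEKA's loader and handles only the simple case described above (unquoted, comma-separated values); the function name parse_arff and the three-row sample are assumptions made for illustration.

```python
def parse_arff(text):
    """Parse a small ARFF document into (relation, attributes, rows).

    Simplified: handles @relation, @attribute and @data with unquoted,
    comma-separated values; skips blank lines and % comments.
    """
    relation = None
    attributes = []   # list of (name, type); type is a str or a list of values
    rows = []
    in_data = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("%"):
            continue
        if in_data:
            rows.append([field.strip() for field in line.split(",")])
            continue
        keyword = line.split(None, 1)[0].lower()
        if keyword == "@relation":
            relation = line.split(None, 1)[1]
        elif keyword == "@attribute":
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                # Nominal attribute: the allowed values are listed in braces.
                spec = [v.strip() for v in spec.strip("{}").split(",")]
            attributes.append((name, spec))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows

sample = """\
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute play {yes, no}
@data
sunny,85,no
overcast,83,yes
rainy,70,yes
"""

relation, attributes, rows = parse_arff(sample)
print(relation, attributes, len(rows))
```

Everything before @data is the header; every line after it is one instance. Real files may also contain quoted strings, dates and sparse rows, which this sketch deliberately ignores.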

Other Formats

The Explorer can load data in any of the formats mentioned earlier. As arff is the preferred format in WEKA, you may load the data from any format and save it in arff format for later use. After preprocessing the data, save it in arff format for further analysis.
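WEKA performs this conversion for you when you load a file and save it as arff. Purely to show what the conversion involves, the following sketch turns CSV text into Arff text in plain Python. The function csv_to_arff and its rule for choosing attribute types are assumptions for this example, not WEKA's converter.

```python
import csv
import io

def csv_to_arff(csv_text, relation):
    """Convert CSV text (first row = header) into ARFF text.

    A column is declared numeric if every value parses as a float;
    otherwise it becomes a nominal attribute listing its distinct values.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = [row for row in reader if row]

    def is_number(value):
        try:
            float(value)
            return True
        except ValueError:
            return False

    lines = ["@relation " + relation, ""]
    for index, name in enumerate(header):
        column = [row[index] for row in rows]
        if all(is_number(v) for v in column):
            lines.append("@attribute %s numeric" % name)
        else:
            distinct = sorted(set(column))
            lines.append("@attribute %s {%s}" % (name, ",".join(distinct)))
    lines += ["", "@data"]
    lines += [",".join(row) for row in rows]
    return "\n".join(lines) + "\n"

csv_text = "outlook,temperature,play\nsunny,85,no\novercast,83,yes\nrainy,70,yes\n"
arff_text = csv_to_arff(csv_text, "weather")
print(arff_text)
```

The sketch does no quoting and does not handle missing values, so treat it as a reading aid rather than a replacement for the Save … button.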

Now that you have learned how to load data into WEKA, in the next chapter, you will learn how to preprocess the data.

Weka - Preprocessing the Data

The data that is collected from the field contains many unwanted things that lead to wrong analysis. For example, the data may contain null fields, it may contain columns that are irrelevant to the current analysis, and so on. Thus, the data must be preprocessed to meet the requirements of the type of analysis you are seeking. This is done in the preprocessing module.

To demonstrate the available features in preprocessing, we will use the Weather database that is provided in the installation.

Using the Open file … option under the Preprocess tab, select the weather.nominal.arff file.

weather nominal

When you open the file, your screen looks as shown here −

weka explore

This screen tells us several things about the loaded data, which are discussed further in this chapter.

Understanding Data

Let us first look at the highlighted Current relation sub window. It shows the name of the database that is currently loaded. You can infer two points from this sub window −

  1. There are 14 instances - the number of rows in the table.

  2. The table contains 5 attributes - the fields, which are discussed in the upcoming sections.

On the left side, notice the Attributes sub window that displays the various fields in the database.

weka attributes

The weather database contains five fields: outlook, temperature, humidity, windy and play. When you select an attribute from this list by clicking on it, further details on the attribute itself are displayed on the right hand side.

Let us select the temperature attribute first. When you click on it, you will see the following screen −

temperature attribute

In the Selected Attribute subwindow, you can observe the following −

  1. The name and the type of the attribute are displayed.

  2. The type for the temperature attribute is Nominal.

  3. The number of Missing values is zero.

  4. There are three distinct values with no unique value.

  5. The table underneath this information shows the nominal values for this field as hot, mild and cool.

  6. It also shows the count and weight in terms of a percentage for each nominal value.
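The figures in this sub window are simple column statistics. A minimal sketch in plain Python, assuming a temperature column with 4 hot, 6 mild and 4 cool values (the values are written out here for illustration, not read from the file), and taking "unique" to mean a value that occurs exactly once:

```python
from collections import Counter

# Assumed temperature column: 14 rows with 4 hot, 6 mild and 4 cool.
temperature = ["hot", "hot", "hot", "mild", "cool", "cool", "cool",
               "mild", "cool", "mild", "mild", "mild", "hot", "mild"]

def summarize(values):
    """Return (missing, distinct, unique, table) for a nominal column.

    'unique' counts values that occur exactly once; 'table' maps each
    value to (count, percentage of all rows).
    """
    present = [v for v in values if v is not None]
    counts = Counter(present)
    table = {v: (c, 100.0 * c / len(values)) for v, c in counts.items()}
    unique = sum(1 for c in counts.values() if c == 1)
    return len(values) - len(present), len(counts), unique, table

missing, distinct, unique, table = summarize(temperature)
for value, (count, percent) in table.items():
    print(value, count, round(percent, 1))
```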

At the bottom of the window, you see a visual representation of the class values.

If you click on the Visualize All button, you will see all the features in one single window as shown here −

visualize all

Removing Attributes

Often, the data that you want to use for model building comes with many irrelevant fields. For example, a customer database may contain the customer's mobile number, which is irrelevant in analysing their credit rating.

removing attributes

To remove attributes, select them and click on the Remove button at the bottom.

The selected attributes are removed from the database. After you fully preprocess the data, you can save it for model building.

Next, you will learn to preprocess the data by applying filters to it.

Applying Filters

Some machine learning techniques, such as association rule mining, require categorical data. To illustrate the use of filters, we will use the weather.numeric.arff database, which contains two numeric attributes: temperature and humidity.

We will convert these to nominal by applying a filter to our raw data. Click on the Choose button in the Filter subwindow and select the following filter −

weka→filters→supervised→attribute→Discretize

weka discretize

Click on the Apply button and examine the temperature and/or humidity attribute. You will notice that these have changed from numeric to nominal types.
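Discretization replaces each number with the label of the interval it falls into. The sketch below shows the simplest variant, equal-width binning, in plain Python. Note that this is a swapped-in, unsupervised technique chosen for brevity; the supervised Discretize filter selected above picks its cut points using the class attribute, so its intervals will generally differ. The readings and the bin labels are hypothetical.

```python
def equal_width_bins(values, bins):
    """Map each number to a bin label using `bins` equal-width intervals."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    labels = []
    for v in values:
        # The maximum value belongs to the last bin, not to a new one.
        index = min(int((v - low) / width), bins - 1) if width else 0
        labels.append("bin%d" % index)
    return labels

# Hypothetical numeric temperature readings.
temperature = [64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85]
labels = equal_width_bins(temperature, 3)
print(list(zip(temperature, labels)))
```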

humidity attribute

Let us look into another filter now. Suppose you want to select the best attributes for deciding play. Select and apply the following filter −

weka→filters→supervised→attribute→AttributeSelection

You will notice that it removes the temperature and humidity attributes from the database.

weka attribute selection

After you are satisfied with the preprocessing of your data, save the data by clicking the Save … button. You will use this saved file for model building.

In the next chapter, we will explore model building using several predefined ML algorithms.

Weka - Classifiers

Many machine learning applications are classification related. For example, you may want to classify a tumor as malignant or benign. You may want to decide whether to play an outdoor game depending on the weather conditions. Generally, this decision depends on several features of the weather, so you may prefer to use a tree classifier to decide whether to play or not.

In this chapter, we will learn how to build such a tree classifier on the weather data to decide on the playing conditions.

Setting Test Data

We will use the preprocessed weather data file from the previous chapter. Open the saved file by using the Open file … option under the Preprocess tab, then click on the Classify tab. You will see the following screen −

classify tab

Before you learn about the available classifiers, let us examine the Test options. You will notice the four testing options listed below −

  1. Training set

  2. Supplied test set

  3. Cross-validation

  4. Percentage split

Unless you have your own training set or a client-supplied test set, you would use the cross-validation or percentage split option. Under cross-validation, you set the number of folds into which the entire dataset is split; in each iteration, one fold is held out for testing and the remaining folds are used for training. Under percentage split, the data is divided once between training and testing according to the percentage you set.
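The two splitting schemes can be sketched on row indices alone. The following plain-Python example is an illustration under stated assumptions, not WEKA's implementation: the function names, the seed and the rounding rule are choices made for this sketch, and the folds are not stratified by class.

```python
import random

def percentage_split(n, train_percent, seed=1):
    """Shuffle indices 0..n-1 and split them into (train, test)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = round(n * train_percent / 100.0)
    return indices[:cut], indices[cut:]

def k_folds(n, k, seed=1):
    """Yield (train, test) index lists; each index is tested exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

# 14 instances, as in the weather data.
train, test = percentage_split(14, 66)
folds = list(k_folds(14, 10))
print(len(train), len(test), len(folds))
```

With a percentage split the model is evaluated once on the held-out part; with k folds it is trained and evaluated k times, so every instance serves as test data exactly once.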

Now, keep the default play option for the output class −

play option

Next, you will select the classifier.

Selecting Classifier

Click on the Choose button and select the following classifier −

weka→classifiers→trees→J48

This is shown in the screenshot below −

weka trees

Click on the Start button to start the classification process. After a while, the classification results are presented on your screen as shown here −

start button

Let us examine the output shown on the right hand side of the screen.

It says the size of the tree is 6. You will shortly see the visual representation of the tree. The Summary reports 2 correctly classified instances and 3 incorrectly classified instances. It also reports a Relative absolute error of 110% and shows the Confusion Matrix. A detailed analysis of these results is beyond the scope of this tutorial. However, you can easily see from these results that the classification is not acceptable: you will need more data for analysis, to refine your feature selection, rebuild the model and so on until you are satisfied with the model's accuracy. That is what WEKA is all about. It allows you to test your ideas quickly.
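To see how the summary counts and the confusion matrix relate, here is a small plain-Python sketch. The actual and predicted labels are hypothetical, chosen only so that 2 of 5 instances are classified correctly as in the summary above; they are not taken from WEKA's output.

```python
def confusion_matrix(actual, predicted, labels):
    """Return a nested dict: matrix[actual_label][predicted_label] = count."""
    matrix = {a: {p: 0 for p in labels} for a in labels}
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

def accuracy(actual, predicted):
    """Fraction of instances whose predicted label equals the actual one."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

# Hypothetical labels for five test instances: 2 right, 3 wrong.
actual    = ["yes", "yes", "no", "yes", "no"]
predicted = ["yes", "no",  "yes", "no", "no"]
matrix = confusion_matrix(actual, predicted, ["yes", "no"])
print(accuracy(actual, predicted))
print(matrix)
```

The diagonal entries of the matrix (actual equals predicted) add up to the correctly classified count; everything off the diagonal is an error.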

Visualize Results

To see the visual representation of the results, right click on the result in the Result list box. Several options will pop up on the screen as shown here −

result list

Select Visualize tree to get a visual representation of the decision tree as seen in the screenshot below −

visualize tree

Selecting Visualize classifier errors plots the results of classification as shown here −

classifier errors

A cross represents a correctly classified instance while a square represents an incorrectly classified instance. At the lower left corner of the plot you see a cross indicating that if outlook is sunny then the game is played. So this is a correctly classified instance. To locate instances that are plotted on top of one another, you can introduce some jitter by sliding the Jitter slide bar.

The current plot is outlook versus play. These are indicated by the two drop-down list boxes at the top of the screen.

outlook versus play

Now, try a different selection in each of these boxes and notice how the X and Y axes change. The same can be achieved by using the horizontal strips on the right hand side of the plot. Each strip represents an attribute. A left click on a strip sets the selected attribute on the X-axis, while a right click sets it on the Y-axis.

Several other plots are provided for deeper analysis. Use them judiciously to fine tune your model. One such plot, the Cost/Benefit analysis, is shown below for quick reference.

cost benefit analysis

Explaining the analysis in these charts is beyond the scope of this tutorial. The reader is encouraged to brush up on the analysis of machine learning algorithms.

In the next chapter, we will learn the next set of machine learning algorithms, that is, clustering.

Weka - Clustering

A clustering algorithm finds groups of similar instances in the entire dataset. WEKA supports several clustering algorithms such as EM, FilteredClusterer, HierarchicalClusterer, SimpleKMeans and so on. You should understand these algorithms well to fully exploit WEKA's capabilities.

As in the case of classification, WEKA allows you to visualize the detected clusters graphically. To demonstrate clustering, we will use the provided iris database. The dataset contains three classes of 50 instances each. Each class refers to a type of iris plant.
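To give a feel for what "finding groups of similar instances" means, here is a minimal k-means sketch in plain Python. This chapter uses EM in the Explorer; k-means is shown instead only because it is the shortest to write down, and the code is not WEKA's SimpleKMeans. The six points and the choice of starting centers are hypothetical.

```python
def k_means(points, centers, iterations=10):
    """Plain k-means on 2-D points, starting from the given centers.

    Returns (centers, assignment) where assignment[i] is the index of
    the center closest to points[i].
    """
    centers = list(centers)
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        assignment = [
            min(range(len(centers)),
                key=lambda c: (p[0] - centers[c][0]) ** 2
                            + (p[1] - centers[c][1]) ** 2)
            for p in points
        ]
        # Update step: move each center to the mean of its points.
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return centers, assignment

# Hypothetical (petal length, petal width) pairs forming two groups.
points = [(1.4, 0.2), (1.3, 0.2), (1.5, 0.3), (4.7, 1.4), (4.5, 1.5), (4.9, 1.5)]
centers, assignment = k_means(points, centers=[points[0], points[3]])
print(assignment)
print(centers)
```

Note that the class label plays no part in the computation: the groups are found from the attribute values alone.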

Loading Data

In the WEKA Explorer, select the Preprocess tab. Click on the Open file … option and select the iris.arff file in the file selection dialog. When you load the data, the screen looks as shown below −

screen looks

You can observe that there are 150 instances and 5 attributes. The attributes are listed as sepallength, sepalwidth, petallength, petalwidth and class. The first four attributes are of numeric type while the class is a nominal type with 3 distinct values. Examine each attribute to understand the features of the database. We will not do any preprocessing on this data and will proceed straight to model building.

Clustering

Click on the Cluster tab to apply the clustering algorithms to our loaded data. Click on the Choose button. You will see the following screen −

cluster tab

Now, select EM as the clustering algorithm. In the Cluster mode sub window, select the Classes to clusters evaluation option as shown in the screenshot below −

clustering algorithm

Click on the Start button to process the data. After a while, the results are presented on the screen.

Next, let us study the results.

Examining Output

The output of the data processing is shown in the screen below −

examining  output

From the output screen, you can observe that −

  1. There are 5 clusters detected in the database.

  2. The Cluster 0 represents setosa, Cluster 1 represents virginica, Cluster 2 represents versicolor, while the last two clusters do not have any class associated with them.
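The idea behind evaluating clusters against classes can be sketched as follows: label each cluster with a class, then count the instances whose own class disagrees with their cluster's label. The plain-Python sketch below uses a simple majority vote per cluster, which is an assumption made for illustration; WEKA's own mapping procedure may differ (for example, in the output above some clusters receive no class at all). The eight assignments are hypothetical.

```python
from collections import Counter

def label_clusters(cluster_ids, class_labels):
    """Map each cluster id to the most frequent class among its members."""
    members = {}
    for cluster, label in zip(cluster_ids, class_labels):
        members.setdefault(cluster, Counter())[label] += 1
    return {c: counts.most_common(1)[0][0] for c, counts in members.items()}

def incorrectly_clustered(cluster_ids, class_labels):
    """Count instances whose class differs from their cluster's label."""
    mapping = label_clusters(cluster_ids, class_labels)
    return sum(1 for c, l in zip(cluster_ids, class_labels) if mapping[c] != l)

# Hypothetical cluster assignments for eight iris instances.
clusters = [0, 0, 0, 1, 1, 1, 2, 2]
classes = ["setosa", "setosa", "setosa", "virginica", "virginica",
           "versicolor", "versicolor", "versicolor"]
mapping = label_clusters(clusters, classes)
errors = incorrectly_clustered(clusters, classes)
print(mapping, errors)
```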

If you scroll up the output window, you will also see statistics that give the mean and standard deviation of each attribute in the various detected clusters. This is shown in the screenshot given below −

detected clusters

Next, we will look at the visual representation of the clusters.

Visualizing Clusters

To visualize the clusters, right click on the EM result in the Result list. You will see the following options −

clusters result list

Select Visualize cluster assignments. You will see the following output −

cluster assignments

As in the case of classification, you will notice the distinction between the correctly and incorrectly identified instances. You can experiment by changing the X and Y axes to analyze the results. You may use jittering, as in the case of classification, to find out the concentration of correctly identified instances. The operations in the visualization plot are similar to those you studied in the case of classification.

Applying Hierarchical Clusterer

To demonstrate the power of WEKA, let us now look at an application of another clustering algorithm. In the WEKA Explorer, select HierarchicalClusterer as your ML algorithm as shown in the screenshot below −

hierarchical clusterer

Set the Cluster mode selection to Classes to clusters evaluation, and click on the Start button. You will see the following output −

cluster evaluation

Notice that in the Result list, there are two results listed: the first one is the EM result and the second one is the current Hierarchical result. Likewise, you can apply multiple ML algorithms to the same dataset and quickly compare their results.

If you examine the tree produced by this algorithm, you will see the following output −

examine algorithm

In the next chapter, you will study the Associate type of ML algorithms.

Weka - Association

It has been observed that people who buy beer also buy diapers at the same time. That is, there is an association between buying beer and buying diapers. Though this may not seem convincing, this association rule was mined from huge supermarket databases. Similarly, an association may be found between peanut butter and bread.

Finding such associations is vital for supermarkets: they can stock diapers next to beer so that customers can locate both items easily, resulting in increased sales for the supermarket.

The Apriori algorithm is one such ML algorithm that finds probable associations and creates association rules. WEKA provides an implementation of the Apriori algorithm. You can define the minimum support and an acceptable confidence level while computing these rules. You will apply the Apriori algorithm to the supermarket data provided in the WEKA installation.
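Support and confidence, the two thresholds mentioned above, are simple ratios. The plain-Python sketch below computes them for a single rule; it is not the Apriori search itself, which additionally enumerates candidate item sets. The five baskets are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical shopping baskets.
baskets = [
    {"beer", "diapers", "bread"},
    {"beer", "diapers"},
    {"bread", "peanut butter"},
    {"beer", "bread"},
    {"diapers", "bread", "peanut butter"},
]
rule_support = support(baskets, {"beer", "diapers"})
rule_confidence = confidence(baskets, {"beer"}, {"diapers"})
print(rule_support, rule_confidence)
```

Here the rule "beer implies diapers" holds in 2 of the 5 baskets (support) and in 2 of the 3 baskets that contain beer (confidence). A rule is reported only if both values reach the thresholds you set.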

Loading Data

In the WEKA Explorer, open the Preprocess tab, click on the Open file … button and select the supermarket.arff database from the installation folder. After the data is loaded, you will see the following screen −

loading data

The database contains 4627 instances and 217 attributes. You can easily see how difficult it would be to detect associations among such a large number of attributes. Fortunately, this task is automated with the help of the Apriori algorithm.

Associator

Click on the Associate tab and click on the Choose button. Select the Apriori associator as shown in the screenshot −

associate tab

To set the parameters for the Apriori algorithm, click on its name. A window pops up, as shown below, that allows you to set the parameters −

apriori algorithm

After you set the parameters, click the Start button. After a while, you will see the results as shown in the screenshot below −

start parameters

At the bottom, you will find the best association rules detected. These will help the supermarket stock its products on appropriate shelves.

Weka - Feature Selection

When a database contains a large number of attributes, several of them will not be significant for the analysis you are currently pursuing. Thus, removing the unwanted attributes from the dataset becomes an important task in developing a good machine learning model.

You could examine the entire dataset visually and decide which attributes are irrelevant. This could be a huge task for databases containing a large number of attributes, such as the supermarket case that you saw in an earlier lesson. Fortunately, WEKA provides an automated tool for feature selection.

This chapter demonstrates this feature on a database containing a large number of attributes.

Loading Data

In the Preprocess tab of the WEKA Explorer, select the labor.arff file for loading into the system. When you load the data, you will see the following screen −

loading data

Notice that there are 17 attributes. Our task is to create a reduced dataset by eliminating some of the attributes that are irrelevant to our analysis.
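The attribute and instance counts that the Preprocess tab reports can also be read directly from an ARFF file, since the format declares each attribute on an `@attribute` line and lists the instances after `@data`. The Python sketch below is a minimal illustration, assuming a dense (non-sparse) ARFF file. The function name `summarize_arff` and the miniature sample are hypothetical, and the sample is not the real labor.arff.

```python
def summarize_arff(text):
    """Return (attribute_names, instance_count) for a dense ARFF document."""
    attributes = []
    instances = 0
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lowered = line.lower()
        if in_data:
            instances += 1  # each remaining line is one instance
        elif lowered.startswith("@attribute"):
            rest = line[len("@attribute"):].strip()
            quote = rest[:1]
            if quote in ("'", '"'):
                # Quoted name: take everything up to the closing quote.
                name = rest[1:rest.index(quote, 1)]
            else:
                name = rest.split()[0]
            attributes.append(name)
        elif lowered.startswith("@data"):
            in_data = True
    return attributes, instances


# Hypothetical miniature file for illustration only.
SAMPLE = """% hypothetical miniature file, not the real labor.arff
@relation demo
@attribute 'duration' numeric
@attribute pension {none,ret_allw,empl_contr}
@attribute class {bad,good}
@data
1,none,bad
2,empl_contr,good
"""

names, count = summarize_arff(SAMPLE)
print(len(names), "attributes,", count, "instances")
```

To try it on a real file, read the file contents into a string and pass that string to `summarize_arff`.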

Features Extraction

Click on the Select attributes tab. You will see the following screen −

select attributes

Under Attribute Evaluator and Search Method, you will find several options. We will just use the defaults here. Under Attribute Selection Mode, choose the Use full training set option.
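An attribute evaluator assigns each attribute (or subset of attributes) a score that reflects how useful it is for predicting the class, and the search method decides which candidates to score. As a rough illustration of the scoring step only, the Python sketch below ranks attributes by information gain. This is a simple stand-in chosen for clarity and is not necessarily the evaluator that WEKA selects by default; the toy rows and attribute names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in class entropy obtained by splitting on attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


def rank_attributes(rows, target):
    """Return (gain, attribute) pairs sorted from most to least useful."""
    attributes = [a for a in rows[0] if a != target]
    scored = [(information_gain(rows, a, target), a) for a in attributes]
    return sorted(scored, reverse=True)


# Hypothetical toy data: "pension" determines the class, "shift" does not.
ROWS = [
    {"pension": "yes", "shift": "day", "class": "good"},
    {"pension": "yes", "shift": "night", "class": "good"},
    {"pension": "no", "shift": "day", "class": "bad"},
    {"pension": "no", "shift": "night", "class": "bad"},
]

for gain, name in rank_attributes(ROWS, "class"):
    print(name, round(gain, 3))
```

In this toy data the low-scoring attribute carries no information about the class, so it is the one a feature selection step would drop.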

Click on the Start button to process the dataset. You will see the following output −

start dataset

At the bottom of the result window, you will get the list of Selected attributes. To get a visual representation, right-click on the result in the Result list.

The output is shown in the following screenshot −

screenshot output

Clicking on any of the squares will give you a data plot for further analysis. A typical data plot is shown below −

data plot

This is similar to the plots we have seen in earlier chapters. Play around with the different options available to analyze the results.

What’s Next?

So far, you have seen the power of WEKA in quickly developing machine learning models. We used a graphical tool called Explorer to develop these models. WEKA also provides a command line interface that gives you more power than the Explorer provides.

Clicking the Simple CLI button in the GUI Chooser application starts this command line interface, which is shown in the screenshot below −

gui chooser

Type your commands in the input box at the bottom. You will be able to do everything you have done so far in the Explorer, plus much more. Refer to the WEKA documentation (https://www.cs.waikato.ac.nz/ml/weka/documentation.html) for further details.

Lastly, WEKA is developed in Java and exposes its functionality through a Java API. So if you are a Java developer keen to include WEKA ML implementations in your own Java projects, you can do so easily.

Conclusion

WEKA is a powerful tool for developing machine learning models. It provides implementations of several of the most widely used ML algorithms. It also allows you to preprocess the data before these algorithms are applied to your dataset. The supported algorithm types are grouped under Classify, Cluster, Associate, and Select attributes. The results at various stages of processing can be visualized with powerful visual representations. This makes it easier for a data scientist to quickly apply various machine learning techniques to a dataset, compare the results, and create the best model for the final use.