Weka: A Concise Tutorial

Weka - Preprocessing the Data

Data collected from the field contains many unwanted things that lead to wrong analysis. For example, the data may contain null fields, or it may contain columns that are irrelevant to the current analysis. Thus, the data must be preprocessed to meet the requirements of the type of analysis you are seeking. This is done in the preprocessing module.

To demonstrate the features available in preprocessing, we will use the Weather dataset that is provided with the installation.

Using the Open file … option under the Preprocess tab, select the weather.nominal.arff file.

weather nominal

When you open the file, your screen looks like the one shown here:

weka explore

This screen tells us several things about the loaded data, which are discussed further in this chapter.

Understanding Data

Let us first look at the highlighted Current relation sub window. It shows the name of the relation (the dataset) that is currently loaded. You can infer two points from this sub window:

  1. There are 14 instances - the number of rows in the table.

  2. The table contains 5 attributes - the fields, which are discussed in the upcoming sections.
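The two figures reported in the Current relation sub window can be reproduced by reading the ARFF file directly. The following is a minimal Python sketch, not Weka's own parser: `parse_arff` is a hypothetical helper that handles only the simple comma-separated subset of ARFF used here, and the file contents are inlined for illustration rather than read from the installation directory.

```python
# Minimal ARFF reader (illustrative sketch; not Weka's parser).
ARFF_TEXT = """\
@relation weather.symbolic

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
"""


def parse_arff(text):
    """Return (relation name, [(attribute name, type or label list)], data rows)."""
    relation = None
    attributes = []
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):  # nominal attribute: keep its labels as a list
                spec = [label.strip() for label in spec.strip("{}").split(",")]
            attributes.append((name, spec))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(ARFF_TEXT)
print("Relation:  ", relation)
print("Instances: ", len(rows))
print("Attributes:", len(attributes))
```

The instance count is simply the number of data rows, and the attribute count is the number of `@attribute` declarations in the header.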

On the left side, notice the Attributes sub window, which displays the various fields in the dataset.

weka attributes

The weather dataset contains five fields: outlook, temperature, humidity, windy and play. When you select an attribute from this list by clicking on it, further details on that attribute are displayed on the right-hand side.

Let us select the temperature attribute first. When you click on it, you will see the following screen:

temperature attribute

In the Selected Attribute sub window, you can observe the following:

  1. The name and the type of the attribute are displayed.

  2. The type for the temperature attribute is Nominal.

  3. The number of Missing values is zero.

  4. There are three distinct values with no unique value.

  5. The table underneath this information shows the nominal values for this field as hot, mild and cool.

  6. It also shows the count and the weight for each nominal value.
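The statistics in the Selected Attribute sub window are straightforward to recompute. The sketch below is plain Python, not Weka's API: `summarize` is a hypothetical helper, the temperature column is inlined from the weather data for illustration, and "unique" is taken to mean a value that occurs exactly once. The `?` marker is how ARFF files denote a missing value.

```python
from collections import Counter

# The temperature column of the nominal weather data, inlined for illustration.
temperature = [
    "hot", "hot", "hot", "mild", "cool", "cool", "cool",
    "mild", "cool", "mild", "mild", "mild", "hot", "mild",
]


def summarize(values, missing_marker="?"):
    """Compute missing/distinct/unique counts and per-label counts for one attribute."""
    present = [v for v in values if v != missing_marker]
    counts = Counter(present)
    return {
        "missing": len(values) - len(present),
        "distinct": len(counts),
        # Assumption: a "unique" value is one that appears in exactly one instance.
        "unique": sum(1 for c in counts.values() if c == 1),
        "counts": dict(counts),
    }


summary = summarize(temperature)
for key, value in summary.items():
    print(f"{key}: {value}")
```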

At the bottom of the window, you see a visual representation of the class values.

If you click on the Visualize All button, you will be able to see all attributes in a single window, as shown here:

visualize all

Removing Attributes

Often, the data that you want to use for model building comes with many irrelevant fields. For example, a customer database may contain the customer's mobile number, which is irrelevant when analysing their credit rating.

removing attributes

To remove one or more attributes, select them and click the Remove button at the bottom.

The selected attributes are removed from the dataset. After you have fully preprocessed the data, you can save it for model building.
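Conceptually, removing attributes means dropping the same columns from the header and from every row. The following Python sketch illustrates that idea only and is not how Weka implements the Remove button; `remove_attributes` is a hypothetical helper and the three rows are a small excerpt of the weather data.

```python
header = ["outlook", "temperature", "humidity", "windy", "play"]
rows = [
    ["sunny", "hot", "high", "FALSE", "no"],
    ["overcast", "hot", "high", "FALSE", "yes"],
    ["rainy", "mild", "high", "FALSE", "yes"],
]


def remove_attributes(header, rows, to_remove):
    """Return a new header and new rows without the named attributes."""
    keep = [i for i, name in enumerate(header) if name not in to_remove]
    new_header = [header[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_header, new_rows


new_header, new_rows = remove_attributes(header, rows, {"temperature", "humidity"})
print(new_header)
for row in new_rows:
    print(row)
```

The helper returns copies instead of modifying its inputs, so the original table stays available for comparison.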

Next, you will learn to preprocess the data by applying filters to it.

Applying Filters

Some machine learning techniques, such as association rule mining, require categorical data. To illustrate the use of filters, we will use the weather.numeric.arff dataset, which contains two numeric attributes: temperature and humidity.

We will convert these to nominal by applying a filter to the raw data. Click on the Choose button in the Filter sub window and select the following filter:

weka→filters→supervised→attribute→Discretize

weka discretize

Click the Apply button and examine the temperature and/or humidity attribute. You will notice that these have changed from numeric to nominal type.
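To make the idea of discretization concrete, here is a small Python sketch that turns the numeric temperature values into nominal bin labels. It deliberately uses plain equal-width binning, a simpler unsupervised technique, and not the class-aware method that the supervised Discretize filter applies, so the bins it produces need not match what the Explorer shows. The `equal_width_bins` helper, the choice of three bins and the `bin1`..`bin3` labels are all assumptions made for illustration.

```python
# Temperature column of the numeric weather data, inlined for illustration.
temperature = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71]


def equal_width_bins(values, n_bins=3):
    """Map each numeric value to a label 'bin1'..'binN' using equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:  # all values identical: everything falls into a single bin
        return ["bin1"] * len(values)
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        # The maximum value would land in bin n_bins + 1, so clamp it to the last bin.
        index = min(int((v - lo) / width), n_bins - 1)
        labels.append(f"bin{index + 1}")
    return labels


labels = equal_width_bins(temperature)
for value, label in zip(temperature, labels):
    print(value, "->", label)
```

After this step the attribute holds a small set of labels instead of numbers, which is what techniques that need categorical input expect.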

humidity attribute

Let us look at another filter now. Suppose you want to select the best attributes for predicting play. Select and apply the following filter:

weka→filters→supervised→attribute→AttributeSelection

You will notice that it removes the temperature and humidity attributes from the dataset.
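Attribute selection rests on scoring how much each attribute tells you about the class. The Python sketch below illustrates one such score, information gain with respect to play, computed on the nominal version of the weather data. This is a stand-in chosen for clarity: the AttributeSelection filter's result depends on the evaluator and search method it is configured with, and on the data it is applied to, so this ranking should not be expected to reproduce the filter's output. The helper names are hypothetical.

```python
from collections import Counter
from math import log2

HEADER = ["outlook", "temperature", "humidity", "windy", "play"]
ROWS = [line.split(",") for line in """\
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no""".splitlines()]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attr_index, class_index):
    """Class entropy minus the weighted class entropy after splitting on one attribute."""
    base = entropy([row[class_index] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attr_index], []).append(row[class_index])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


class_index = HEADER.index("play")
gains = {
    name: information_gain(ROWS, i, class_index)
    for i, name in enumerate(HEADER)
    if i != class_index
}
ranking = sorted(gains, key=gains.get, reverse=True)
for name in ranking:
    print(f"{name:12s} {gains[name]:.3f}")
```

A selection step would then keep the top-scoring attributes and drop the rest; where to draw that line is a separate design choice.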

weka attribute selection

After you are satisfied with the preprocessing of your data, save it by clicking the Save … button. You will use this saved file for model building.
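The saved file is an ordinary ARFF text file: a header that declares the relation and its attributes, followed by the data rows. As a rough Python sketch of that layout (not Weka's writer), the hypothetical `to_arff` helper below serializes a small preprocessed table; the relation name `weather-preprocessed` and the two rows are made up for illustration.

```python
def to_arff(relation, attributes, rows):
    """Serialize a table to ARFF text. Nominal attributes are given as label lists."""
    lines = [f"@relation {relation}", ""]
    for name, spec in attributes:
        if isinstance(spec, list):  # nominal attribute: write its labels in braces
            spec = "{" + ", ".join(spec) + "}"
        lines.append(f"@attribute {name} {spec}")
    lines += ["", "@data"]
    lines += [",".join(str(value) for value in row) for row in rows]
    return "\n".join(lines) + "\n"


attributes = [
    ("outlook", ["sunny", "overcast", "rainy"]),
    ("windy", ["TRUE", "FALSE"]),
    ("play", ["yes", "no"]),
]
rows = [
    ["sunny", "FALSE", "no"],
    ["overcast", "FALSE", "yes"],
]

arff_text = to_arff("weather-preprocessed", attributes, rows)
print(arff_text)
```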

In the next chapter, we will explore model building using several predefined ML algorithms.