KNIME - Quick Guide

KNIME - Introduction

Developing machine learning models is often considered challenging because the work can seem cryptic. Traditionally, building a machine learning application required a skilled developer comfortable with command-driven development. KNIME brings the development of machine learning models within reach of non-programmers.

KNIME provides a graphical user interface for the entire development process. You define a workflow by connecting predefined components, called nodes, that are available in its repository. There are nodes for tasks such as reading data, applying ML algorithms, and visualizing data in various formats. As a result, the basic use of KNIME requires no programming knowledge.

The upcoming chapters of this tutorial teach you how to perform data analytics using several well-tested ML algorithms.

KNIME - Installation

KNIME Analytics Platform is available for Windows, Linux, and macOS. This chapter walks through the installation on a Mac. If you use Windows or Linux, follow the installation instructions given on the KNIME download page, which provides binary installers for all three platforms.

Mac Installation

Download the binary installer from the official KNIME site. Double-click the downloaded dmg file to start the installation. When the installation completes, drag the KNIME icon to the Applications folder as seen here −

mac installation
copy knime

KNIME - First Run

Double-click the KNIME icon to start the KNIME Analytics Platform. Initially, you will be asked to set up a workspace folder for saving your work. Your screen will look like the following −

launch knime

You may set the selected folder as the default; the next time you launch KNIME, this dialog will not be shown again.

After a while, the KNIME platform will start on your desktop. This is the workbench where you carry out your analytics work. Let us now look at the various parts of the workbench.

KNIME - Workbench

When KNIME starts, you will see the following screen −

workbench

The workbench consists of several views. The views of immediate use to us are marked in the screenshot and listed below −

  1. Workspace

  2. Outline

  3. Node Repository

  4. KNIME Explorer

  5. Console

  6. Description

Let us look at each of these views in detail.

Workspace View

The most important view for us is the Workspace view. This is where you create your machine learning model. The Workspace view is highlighted in the screenshot below −

workspace view

The screenshot shows an opened workspace. You will soon learn how to open an existing workspace.

Each workspace contains one or more nodes. You will learn the significance of these nodes later in the tutorial. The nodes are connected using arrows. Generally, the program flow is laid out from left to right, though this is not required. You may freely move each node anywhere in the workspace; the connecting lines follow the nodes so that the connections are maintained. You may add or remove connections between nodes at any time. Optionally, a short description may be added to each node.

Outline View

The Workspace view may not be able to show the entire workflow at once. That is why the Outline view is provided.

outline view

The Outline view shows a miniature view of the entire workspace. It contains a zoom window that you can slide to bring different portions of the workflow into the Workspace view.

Node Repository

This is the next important view in the workbench. The Node Repository lists the various nodes available for your analytics. The repository is categorized by node function. You will find categories such as −

  1. IO

  2. Views

  3. Analytics

node repository

Under each category you will find several options. Expand a category to see what it contains. Under the IO category, you will find nodes for reading data in various file formats, such as ARFF, CSV, PMML, and XLS.

node repository io

Depending on the format of your input data, you select the appropriate node for reading your dataset.

By now, you have probably understood the purpose of a node. A node defines a certain kind of functionality that you can visually include in your workflow.

The Analytics category contains nodes for various machine learning algorithms, such as Bayes, Clustering, Decision Tree, Ensemble Learning, and so on.

node repository analytics

These nodes provide the implementations of the various ML algorithms. To apply an algorithm in your analytics, pick the desired node from the repository and add it to your workspace. Connect the output of the data reader node to the input of this ML node and your workflow is created.

We suggest that you explore the various nodes available in the repository.

KNIME Explorer

The next important view in the workbench is the Explorer view, as shown in the screenshot below −

explorer

The first two categories list the workspaces defined on the KNIME server. The third option, LOCAL, stores all the workspaces that you create on your local machine. Try expanding these tabs to see the various predefined workspaces, especially the EXAMPLES tab.

knime explorer

KNIME provides several examples to get you started with the platform. In the next chapter, you will use one of these examples to get acquainted with the platform.

Console View

As the name indicates, the Console view shows the various console messages produced while your workflow executes.

console view

The Console view is useful for diagnosing the workflow and examining the analytics results.

Description View

The last view of immediate relevance to us is the Description view. This view provides a description of the item selected in the workspace. A typical view is shown in the screenshot below −

description view

The above view shows the description of a File Reader node. When you select the File Reader node in your workspace, you will see its description in this view. Clicking any other node shows the description of that node. This view is therefore very useful in the initial stages of learning, when you do not yet know the purpose of the various nodes in the workspace or the Node Repository.

Toolbar

Besides the views described above, the workbench has other elements such as the toolbar. The toolbar contains icons for quick actions. The icons are enabled or disabled depending on the context. Hover the mouse over an icon to see the action it performs. The following screen shows the tooltip for the Configure icon.

toolbar

Enabling/Disabling Views

The views that you have seen so far can be turned on and off easily. Clicking the Close icon in a view closes it. To reinstate a view, go to the View menu and select the desired view. The selected view will be added to the workbench.

enabling disabling views

Now that you are acquainted with the workbench, the next chapter shows how to run a workflow and study the analytics it performs.

KNIME - Running Your First Workflow

KNIME ships with several good workflows for ease of learning. In this chapter, we pick one of the workflows provided in the installation to explain the features and power of the analytics platform. We will use a simple classifier based on a Decision Tree for our study.

Loading Decision Tree Classifier

In the KNIME Explorer, locate the following workflow −

LOCAL / Example Workflows / Basic Examples / Building a Simple Classifier

This is also shown in the screenshot below for quick reference −

tree classifier

Double-click the selected item to open the workflow. Observe the Workspace view. You will see the workflow containing several nodes. The purpose of this workflow is to predict the income group from the demographic attributes of the adult data set taken from the UCI Machine Learning Repository. The task of this ML model is to classify the people in a specific region as having an income greater than or less than 50K.

The Workspace view along with its outline is shown in the screenshot below −

workspace

Notice the several nodes picked from the Node Repository and connected into a workflow by arrows. A connection indicates that the output of one node is fed to the input of the next node. Before we learn the functionality of each node in the workflow, let us first execute the entire workflow.

Executing Workflow

Before we look into the execution of the workflow, it is important to understand the status indicator of each node. Examine any node in the workflow. At the bottom of each node you will find a status indicator containing three circles. The Decision Tree Learner node is shown in the screenshot below −

workflow decision

The status indicator is red, indicating that this node has not been executed so far. During execution, the center circle, which is yellow, lights up. On successful execution, the last circle turns green. There are further indicators that give you status information in case of errors. You will learn about them when an error occurs during processing.

Note that the indicators on all nodes are currently red, indicating that no node has been executed so far. To run all nodes, click the following menu item −

Node → Execute All
execution workflow

After a while, you will find that each node's status indicator has turned green, indicating that there are no errors.
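The idea behind Execute All can be sketched in a few lines of Python. This is only an illustration, not how KNIME is implemented: the `workflow` dictionary, the `execute_all` function, and some of the edges between nodes (for example, Partitioning reading from Color Manager) are assumptions made for the sketch. Each node runs only after the nodes feeding it have run, and its status moves from red to yellow to green.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical wiring of the example workflow: node -> nodes it reads from.
# Some of these edges are assumed for illustration.
workflow = {
    "File Reader": [],
    "Color Manager": ["File Reader"],
    "Statistics": ["File Reader"],
    "Partitioning": ["Color Manager"],
    "Decision Tree Learner": ["Partitioning"],
    "Decision Tree Predictor": ["Decision Tree Learner", "Partitioning"],
    "Scorer": ["Decision Tree Predictor"],
    "Scatter Plot": ["Decision Tree Predictor"],
}


def execute_all(graph):
    """Run every node after its inputs and track a traffic-light status."""
    status = {name: "red" for name in graph}  # red: not executed yet
    order = []
    for name in TopologicalSorter(graph).static_order():
        status[name] = "yellow"  # yellow: executing
        # A real node would read its inputs and produce its outputs here.
        status[name] = "green"  # green: executed successfully
        order.append(name)
    return order, status


order, status = execute_all(workflow)
print(order)
```

The topological ordering is what guarantees that a node never runs before the data it depends on is available.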

In the next chapter, we will explore the functionality of the various nodes in the workflow.

KNIME - Exploring Workflow

If you look at the nodes in the workflow, you can see that it contains the following −

  1. File Reader

  2. Color Manager

  3. Partitioning

  4. Decision Tree Learner

  5. Decision Tree Predictor

  6. Scorer

  7. Interactive Table

  8. Scatter Plot

  9. Statistics

These are easily seen in the Outline view as shown here −

outline

Each node provides a specific functionality in the workflow. We will now look at how these nodes are configured to provide the desired functionality. Note that we discuss only the nodes that are relevant to exploring this workflow.

File Reader

The File Reader node is depicted in the screenshot below −

file reader

At the top of the window there is a description provided by the creator of the workflow. It states that this node reads the adult data set. The name of the file is adult.csv, as seen from the description underneath the node symbol. The File Reader has two outputs: one goes to the Color Manager node and the other goes to the Statistics node.

If you right-click the File Reader node, a popup menu shows up as follows −

file manager

The Configure menu option opens the node configuration. The Execute menu option runs the node. Note that if the node has already been run and is in the green state, this option is disabled. Also note the Edit Node Description menu option, which allows you to write a description for your node.

Now select the Configure menu option. It shows a screen containing the data from the adult.csv file, as seen in the screenshot here −

adult csv file

When you execute this node, the data is loaded into memory. The data loading code is entirely hidden from the user. You can now appreciate the usefulness of such nodes: no coding is required.
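For comparison, here is a minimal sketch of the kind of loading code a reader node spares you from writing. It uses Python's standard `csv` module on a made-up two-row sample; the `SAMPLE` text and the `read_table` helper are assumptions for illustration, and the real adult.csv has more columns and rows.

```python
import csv
import io

# Made-up sample standing in for a file such as adult.csv.
SAMPLE = """age,income
39,<=50K
52,>50K
"""


def read_table(text_stream):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(text_stream))


rows = read_table(io.StringIO(SAMPLE))
print(rows[0]["age"], rows[0]["income"])
```

To read a real file, you would pass an open file object instead of the `io.StringIO` wrapper. Note that every value arrives as a string, so type detection is one more task the node handles for you.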

Our next node is the Color Manager.

Color Manager

Select the Color Manager node, right-click it, and open its configuration. A color settings dialog appears. Select the income column from the dropdown list.

Your screen will look like the following −

color manager

Notice the two constraints. If the income is less than 50K, the data point is colored green; if it is more, the data point is colored red. You will see the data point mappings when we look at the scatter plot later in this chapter.

Partitioning

In machine learning, we usually split the available data into two parts. The larger part is used for training the model, while the smaller part is used for testing. There are different strategies for partitioning the data.

To define the desired partitioning, right-click the Partitioning node and select the Configure option. You will see the following screen −

partitioning

In this case, the workflow's author has used the Relative (%) mode, and the data is split in an 80:20 ratio. The data points for each part are drawn randomly, which helps keep the test data from being biased. With linear sampling, the 20% of the data used for testing may not correctly represent the training data if the data was collected in a biased order.

If you are sure that the data was collected in a random order, you may select linear sampling. Once your data is ready for training the model, feed it to the next node, which is the Decision Tree Learner.
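The difference between a random and an ordered split can be sketched as follows. This is an illustration under assumptions, not the Partitioning node's code: the `partition` function and its parameters are hypothetical, and the ordered branch simply keeps the rows in file order as a stand-in for a non-random strategy.

```python
import random


def partition(rows, train_fraction=0.8, draw_randomly=True, seed=42):
    """Split rows into (train, test) by a relative fraction."""
    rows = list(rows)
    if draw_randomly:
        # Shuffle a copy so that the original order cannot bias the split.
        random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


train, test = partition(range(100))
print(len(train), len(test))

# Without shuffling, the test part is simply the tail of the data.
ordered_train, ordered_test = partition(range(10), draw_randomly=False)
print(ordered_test)
```

If the rows were sorted by some attribute when they were collected, the ordered split would put only one end of that range into the test set, which is the bias the random draw avoids.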

Decision Tree Learner

As the name suggests, the Decision Tree Learner node uses the training data to build a model. The configuration settings of this node are depicted in the screenshot below −

decision tree learner

As you can see, the Class column is income. The tree is therefore built to predict the income column, which is what we are trying to achieve in this model: separating people with an income greater than 50K from those with less.

After this node runs successfully, your model is ready for testing.
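To give a feel for what a tree learner does internally, the sketch below picks the best threshold on a single numeric feature by minimizing Gini impurity, one common split criterion. The tutorial does not say which quality measure this workflow uses, so treat the criterion, the function names, and the four made-up data points as assumptions for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best binary split."""
    pairs = sorted(zip(values, labels))
    best_score, best_cut = float("inf"), None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold fits between equal values
        cut = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_score, best_cut = score, cut
    return best_cut, best_score


# Made-up ages and income classes, for illustration only.
ages = [25, 30, 45, 50]
incomes = ["<=50K", "<=50K", ">50K", ">50K"]
print(best_threshold(ages, incomes))
```

A full decision tree repeats this search over every feature and then recurses into the two resulting subsets until a stopping rule applies.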

Decision Tree Predictor

The Decision Tree Predictor node applies the trained model to the test data set and appends the model predictions.

tree predictor

The output of the predictor is fed to two different nodes: Scorer and Scatter Plot. Next, we will examine the output of the prediction.

Scorer

This node generates the confusion matrix. To view it, right-click the node. You will see the following popup menu −

scorer

Click the View: Confusion Matrix menu option, and the matrix will pop up in a separate window as shown in the screenshot here −

confusion matrix

It indicates that the accuracy of our model is 83.71%. If you are not satisfied with this, you may experiment with other parameters in model building; in particular, you may want to revisit and cleanse your data.
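Accuracy is derived from the confusion matrix as the share of predictions that fall on the diagonal. A minimal sketch follows; the `accuracy` helper and the counts in `example` are made up for illustration and are not the values from the screenshot.

```python
def accuracy(confusion):
    """Fraction of correct predictions in a square confusion matrix.

    Rows are actual classes and columns are predicted classes, so the
    diagonal holds the correctly classified counts.
    """
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Made-up counts for two classes (not the tutorial's actual matrix).
example = [
    [50, 10],  # actual <=50K: 50 predicted correctly, 10 misclassified
    [5, 35],   # actual >50K: 5 misclassified, 35 predicted correctly
]
print(f"{accuracy(example):.2%}")
```

The off-diagonal cells show which kind of mistake dominates, which is often more informative than the single accuracy figure.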

Scatter Plot

To see the scatter plot of the data distribution, right-click the Scatter Plot node and select the menu option Interactive View: Scatter Plot. You will see the following plot −

scatter plot

The plot shows the distribution of people in the two income groups, split at the 50K threshold, as dots in two different colors: red and blue. These are the colors set in our Color Manager node. The distribution is plotted against age on the x-axis. You may select a different feature for the x-axis by changing the configuration of the node.

The configuration dialog is shown here, where we have selected marital-status as the feature for the x-axis.

marital status

This completes our discussion of the predefined model provided by KNIME. We suggest that you study the other two nodes in the model (Statistics and Interactive Table) on your own.

Let us now move on to the most important part of the tutorial: creating your own model.

KNIME - Building Your Own Model

In this chapter, you will build your own machine learning model to categorize plants based on a few observed features. We will use the well-known iris dataset from the UCI Machine Learning Repository for this purpose. The dataset contains three different classes of plants. We will train our model to classify an unknown plant into one of these three classes.

We start by creating a new workflow in KNIME for our machine learning model.

Creating Workflow

To create a new workflow, select the following menu option in the KNIME workbench −

File → New

You will see the following screen −

creating workflow

Select the New KNIME Workflow option and click the Next button. On the next screen, you will be asked for a name for the workflow and the destination folder for saving it. Enter this information as desired and click Finish to create a new workspace.

A new workspace with the given name is added to the Workspace view as seen here −

creating workspace

You will now add the various nodes to this workspace to create your model. Before you add nodes, you have to download and prepare the iris dataset.

Preparing Dataset

Download the iris dataset from the UCI Machine Learning Repository site. The downloaded iris.data file is in CSV format but has no header row, so we will edit it to add the column names.

Open the downloaded file in your favorite text editor and add the following line at the beginning.

sepal length, sepal width, petal length, petal width, class

When our File Reader node reads this file, it automatically uses the above fields as column names.
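If you prefer not to edit the file by hand, the same header line can be prepended with a short script. This is a sketch under assumptions: the `add_header` helper is hypothetical, and the column order used in `HEADER` follows the attribute order documented for the UCI iris data (sepal length, sepal width, petal length, petal width, class).

```python
from pathlib import Path

# Column names in the attribute order documented for the UCI iris data.
HEADER = "sepal length, sepal width, petal length, petal width, class"


def add_header(path, header=HEADER):
    """Prepend the header line to a CSV file unless it is already there.

    Returns True if the file was changed and False otherwise.
    """
    path = Path(path)
    text = path.read_text()
    if text.startswith(header):
        return False  # already prepared, so leave the file untouched
    path.write_text(header + "\n" + text)
    return True
```

Calling `add_header("iris.data")` in the folder that contains the download prepares the file; running it a second time changes nothing.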

Now you will start adding the various nodes.

Adding File Reader

Go to the Node Repository view and type “file” in the search box to locate the File Reader node. This is seen in the screenshot below −

adding file reader

Double-click the File Reader to add the node to the workspace. Alternatively, you may drag and drop the node into the workspace. After the node is added, you have to configure it. Right-click the node and select the Configure menu option, as you did in the earlier chapter.

The settings screen looks like the following after the data file is loaded.

adding datafile

To load your dataset, click the Browse button and select the location of your iris.data file. The node loads the contents of the file, which are displayed in the lower portion of the configuration dialog. Once you are satisfied that the data file has been located and loaded properly, click the OK button to close the configuration dialog.

You will now add an annotation to this node. Right-click the node and select the New Workflow Annotation menu option. An annotation box appears on the screen as shown in the screenshot here:

workflow annotation

Click inside the box and add the following annotation −

Reads iris.data

Click anywhere outside the box to exit the edit mode. Resize the box and place it around the node as desired. Finally, double-click the Node 1 text underneath the node and change it to the following −

Loads data

At this point, your screen will look like the following −

iris data

We will now add a new node for partitioning the loaded dataset into training and testing sets.

Adding Partitioning Node

In the Node Repository search box, type a few characters to locate the Partitioning node, as seen in the screenshot below −

locate partitioning

Add the node to the workspace. Set its configuration as follows −

Relative (%) : 95
Draw Randomly

The following screenshot shows the configuration parameters.

configuration parameters

Next, make the connection between the two nodes. To do so, click the output of the File Reader node and keep the mouse button pressed; a rubber band line appears. Drag it to the input of the Partitioning node and release the mouse button. A connection is now established between the two nodes.

Add the annotation, change the description, and position the node and annotation as desired. Your screen should look like the following at this stage −

file reader partitioning

Next, we will add the k-Means node.

Adding k-Means Node

Select the k-Means node from the repository and add it to the workspace. If you want to refresh your knowledge of the k-Means algorithm, look up its description in the Description view of the workbench. This is shown in the screenshot below −

k means

Incidentally, you may look up the descriptions of different algorithms in the Description view before deciding which one to use.
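As a refresher, the textbook k-Means procedure can be sketched in plain Python. This is a minimal illustration and not the KNIME node's implementation, whose initialization and stopping rules may differ; the function names and the toy points are assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))


def k_means(points, k, iterations=100, seed=0):
    """Return k cluster centers for a list of equal-length tuples."""
    centers = random.Random(seed).sample(points, k)  # start from k data points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for point in points:
            clusters[nearest(point, centers)].append(point)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        new_centers = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # assignments are stable, so the algorithm has converged
        centers = new_centers
    return centers


# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
print(sorted(k_means(points, k=2)))
```

For the iris data you would pass the four numeric measurements of each row as a tuple and set `k=3`, one cluster per plant class.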

Open the configuration dialog for the node. We will use the defaults for all fields as shown here −

configuration dialog

Click OK to accept the defaults and close the dialog.

Set the annotation and description to the following −

  1. Annotation: Classify clusters

  2. Description: Perform clustering

Connect the top output of the Partitioning node to the input of the k-Means node. Reposition your items; your screen should look like the following −

partitioning node

Next, we will add a Cluster Assigner node.

Adding Cluster Assigner

The Cluster Assigner assigns new data to an existing set of prototypes. It takes two inputs: the prototype model and the data table containing the input data. Look up the node’s description in the Description view, which is depicted in the screenshot below −

adding cluster assigner

Thus, for this node you have to make two connections −

  1. PMML Cluster Model output of the k-Means node → Prototypes input of the Cluster Assigner

  2. Second partition output of the Partitioning node → Input data of the Cluster Assigner

These two connections are shown in the screenshot below −

cluster assigner

The Cluster Assigner does not need any special configuration. Just accept the defaults.
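Conceptually, assigning data to prototypes means labeling each row with its nearest cluster center. The sketch below shows that idea only; the `assign_clusters` function, the two prototypes, and the sample rows are hypothetical, and the actual node reads its prototypes from the PMML model rather than from a Python list.

```python
def assign_clusters(prototypes, rows):
    """Return, for each row, the index of the closest prototype."""

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(prototypes)), key=lambda i: squared_distance(row, prototypes[i]))
        for row in rows
    ]


# Hypothetical prototypes (cluster centers) and unseen rows.
prototypes = [(0.0, 0.0), (10.0, 10.0)]
new_rows = [(1.0, 1.0), (9.0, 8.0)]
print(assign_clusters(prototypes, new_rows))
```

No training happens at this stage: the prototypes stay fixed and only the held-out rows receive labels.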

Now add an annotation and a description to this node and rearrange your nodes. Your screen should look like the following −

shape manager

At this point, our clustering is complete. We need to visualize the output graphically, so we will add a scatter plot. To give the three classes different colors and shapes in the scatter plot, we pass the output of the k-Means node first through the Color Manager node and then through the Shape Manager node.

Adding Color Manager

Locate the Color Manager node in the repository and add it to the workspace. Leave the configuration at its defaults. Note that you must still open the configuration dialog and click OK to accept the defaults. Set the description text for the node.

Make a connection from the output of k-Means to the input of the Color Manager. Your screen will look like the following at this stage −

color manager screen

Adding Shape Manager

Locate the Shape Manager in the repository and add it to the workspace. Leave its configuration at the defaults. As with the previous node, you must open the configuration dialog and click OK to accept the defaults. Connect the output of the Color Manager to the input of the Shape Manager. Set the description for the node.

Your screen should look like the following −

adding shape manager

Now you will add the last node in our model: the scatter plot.

Adding Scatter Plot

Locate the Scatter Plot node in the repository and add it to the workspace. Connect the output of the Shape Manager to the input of the Scatter Plot. Leave the configuration at the defaults. Set the description.

Finally, add a group annotation to the three recently added nodes −

Annotation: Visualization

Reposition the nodes as desired. Your screen should look like the following at this stage.

annotation visualization

This completes the task of model building.

KNIME - Testing the Model

To test the model, select the following menu option: Node → Execute All

If everything goes well, the status indicator at the bottom of each node turns green. If not, look up the errors in the Console view, fix them, and re-run the workflow.

Now you are ready to visualize the predicted output of the model. Right-click the Scatter Plot node and select the following menu option: Interactive View: Scatter Plot

This is shown in the screenshot below −

interactive view

You will see the scatter plot on the screen as shown here −

scatter plot screen

You can explore different visualizations by changing the x- and y-axes. To do so, click the settings menu at the top right corner of the scatter plot. A popup menu appears as shown in the screenshot below −

visualizations changing

On this screen, you can set the various parameters of the plot to visualize the data from several aspects.

This completes our task of building and testing the model.

KNIME - Summary and Future Work

KNIME provides a graphical tool for building machine learning models. In this tutorial, you learned how to download and install KNIME on your machine.

Summary

You learned about the various views provided in the KNIME workbench. KNIME provides several predefined workflows for learning, and we used one of them to explore the capabilities of KNIME. KNIME provides several pre-programmed nodes for reading data in various formats, analyzing data using several ML algorithms, and visualizing data in many different ways. Towards the end of the tutorial, you created your own model from scratch, using the k-Means algorithm to group the plants in the well-known iris dataset.

You are now ready to use these techniques for your own analytics.

Future Work

If you are a developer and would like to use KNIME components in your own applications, you will be glad to know that KNIME integrates with a wide range of programming languages such as Java, R, Python, and many more.