KNIME: A Concise Tutorial
KNIME - Building Your Own Model
In this chapter, you will build your own machine learning model to categorize plants based on a few observed features. We will use the well-known iris dataset from the UCI Machine Learning Repository for this purpose. The dataset contains three classes of plants, and we will train our model to assign an unknown plant to one of them.
We will start by creating a new workflow in KNIME for our machine learning model.
Creating Workflow
To create a new workflow, select the following menu option in the KNIME workbench.
File → New
You will see the following screen:
Select the New KNIME Workflow option and click the Next button. On the next screen, you will be asked for a name for the workflow and the destination folder in which to save it. Enter this information as desired and click Finish to create the new workflow.
A new workflow with the given name is added to the Workspace view, as seen here:
You will now add the various nodes to this workspace to create your model. Before you add nodes, you have to download and prepare the iris dataset.
Preparing Dataset
Download the iris dataset from the UCI Machine Learning Repository site. The downloaded iris.data file is in CSV format but has no header row, so we will edit it to add the column names.
Open the downloaded file in your favorite text editor and add the following line at the beginning.
sepal length, sepal width, petal length, petal width, class
When our File Reader node reads this file, it will automatically take the above fields as the column names.
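If you prefer not to edit the file by hand, the same preparation step can be scripted. The following is a minimal sketch in Python, not part of KNIME. The function name `add_header` and the constant `HEADER` are hypothetical, and the column order is the one documented for the UCI iris.data file (sepal length, sepal width, petal length, petal width, class).

```python
from pathlib import Path

# Assumption: column order as documented for the UCI iris.data file,
# which ships without a header row.
HEADER = "sepal length, sepal width, petal length, petal width, class"


def add_header(path, header=HEADER):
    """Prepend the header line to a CSV file unless it is already present.

    Returns True if the file was modified, False if it already had the header.
    """
    file = Path(path)
    text = file.read_text(encoding="utf-8")
    if text.startswith(header):
        return False  # already prepared, leave the file untouched
    file.write_text(header + "\n" + text, encoding="utf-8")
    return True
```

Calling `add_header("iris.data")` in the folder that holds the downloaded file rewrites it in place. Running it a second time changes nothing.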
Now, you will start adding the various nodes.
Adding File Reader
Go to the Node Repository view and type "file" in the search box to locate the File Reader node. This is seen in the screenshot below:
Double-click the File Reader to add the node to the workspace. Alternatively, you may drag and drop the node into the workspace. After the node is added, you will have to configure it. Right-click the node and select the Configure menu option. You have done this in the earlier lesson.
The settings screen looks like the following after the data file is loaded.
To load your dataset, click the Browse button and select the location of your iris.data file. The node loads the contents of the file and displays them in the lower portion of the configuration box. Once you are satisfied that the data file has been located and loaded properly, click the OK button to close the configuration dialog.
You will now add an annotation to this node. Right-click the node and select the New Workflow Annotation menu option. An annotation box appears on the screen, as shown in the screenshot here:
Click inside the box and add the following annotation:
Reads iris.data
Click anywhere outside the box to exit edit mode. Resize the box and place it around the node as desired. Finally, double-click the Node 1 text underneath the node and change it to the following:
Loads data
At this point, your screen should look like the following:
We will now add a new node that partitions the loaded dataset into a training set and a testing set.
Adding Partitioning Node
In the Node Repository search window, type a few characters to locate the Partitioning node, as seen in the screenshot below:
Add the node to our workspace. Set its configuration as follows:
Relative (%) : 95
Draw Randomly
The following screenshot shows the configuration parameters.
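To make the effect of these two settings concrete, here is a rough sketch in Python of a relative, randomly drawn split. It illustrates the idea only and is not KNIME's implementation. The function name `partition`, its parameters, and the rounding rule for the cut point are assumptions.

```python
import random


def partition(rows, relative_percent=95, seed=None):
    """Split rows into two lists, drawing the first partition at random.

    relative_percent is the share of rows (0-100) that goes to the first list.
    """
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    shuffled = list(rows)      # copy, so the caller's data is not reordered
    rng.shuffle(shuffled)
    # Assumption: the cut point is rounded; KNIME may round differently.
    cut = round(len(shuffled) * relative_percent / 100)
    return shuffled[:cut], shuffled[cut:]
```

The first returned list plays the role of the node's top output (the larger partition used for training), and the second list plays the role of the second output (the held-out rows).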
Next, connect the two nodes. Click on the output of the File Reader node and hold the mouse button down; a rubber-band line appears. Drag it to the input of the Partitioning node and release the mouse button. A connection is now established between the two nodes.
Add the annotation, change the description, and position the node and annotation as desired. Your screen should look like the following at this stage:
Next, we will add the k-Means node.
Adding k-Means Node
Select the k-Means node from the repository and add it to the workspace. If you want to refresh your knowledge of the k-Means algorithm, look up its description in the description view of the workbench. This is shown in the screenshot below:
Incidentally, you may look up the descriptions of different algorithms in the description window before making a final decision on which one to use.
Open the configuration dialog for the node. We will use the defaults for all fields, as shown here:
Click OK to accept the defaults and close the dialog.
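For readers who want to see the idea behind the node in code, here is a minimal sketch of the standard k-means iteration (Lloyd's algorithm) in plain Python. It is not KNIME's implementation. The function name `k_means`, the parameters `k`, `max_iterations`, and `seed`, and the random choice of starting centroids are all assumptions made for illustration.

```python
import math
import random


def k_means(points, k=3, max_iterations=100, seed=0):
    """Return k centroids for a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Assumption: start from k data points picked at random.
    centroids = rng.sample(points, k)
    for _ in range(max_iterations):
        # Assignment step: group each point with its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        updated = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if updated == centroids:
            break  # nothing moved, so the assignments are stable
        centroids = updated
    return centroids
```

With the iris data, `points` would be the four numeric measurements of each row, and `k=3` would match the three plant classes in the dataset.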
Set the annotation and description to the following:
- Annotation: Classify clusters
- Description: Perform clustering
Connect the top output of the Partitioning node to the input of the k-Means node. Reposition your items; your screen should look like the following:
Next, we will add a Cluster Assigner node.
Adding Cluster Assigner
The Cluster Assigner assigns new data to an existing set of prototypes. It takes two inputs: the prototype model and the data table containing the input data. Look up the node's description in the description window, which is depicted in the screenshot below:
Thus, for this node you have to make two connections:
- The PMML Cluster Model output of the k-Means node → Prototypes input of Cluster Assigner
- Second partition output of the Partitioning node → Input data of Cluster Assigner
These two connections are shown in the screenshot below:
The Cluster Assigner does not need any special configuration. Just accept the defaults.
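Conceptually, assigning data to prototypes means labeling each row with its closest prototype. The sketch below shows that idea in Python, assuming Euclidean distance. The function name `assign_clusters`, the prototype names, and the sample values are made up for illustration and are not taken from KNIME or from the iris data.

```python
import math


def assign_clusters(rows, prototypes):
    """Label each row with the name of its nearest prototype.

    rows is a list of numeric tuples; prototypes maps a name to a centroid.
    """
    return [
        min(prototypes, key=lambda name: math.dist(row, prototypes[name]))
        for row in rows
    ]


# Toy values, invented for this example only.
prototypes = {"cluster_0": (0.0, 0.0), "cluster_1": (10.0, 10.0)}
rows = [(1.0, 1.0), (9.0, 8.0)]
print(assign_clusters(rows, prototypes))  # ['cluster_0', 'cluster_1']
```

In the workflow, the prototypes come from the trained k-Means model and the rows come from the held-out partition, which is why the node needs both inputs.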
Now, add an annotation and a description to this node. Rearrange your nodes. Your screen should look like the following:
At this point, our clustering is complete. We need to visualize the output graphically. For this, we will add a scatter plot. We will give each of the three classes its own color and shape in the scatter plot. To do so, we will pass the output of the k-Means node first through the Color Manager node and then through the Shape Manager node.
Adding Color Manager
Locate the Color Manager node in the repository and add it to the workspace. Leave the configuration at its defaults. Note that you must open the configuration dialog and click OK to accept the defaults. Set the description text for the node.
Make a connection from the output of k-Means to the input of Color Manager. Your screen should look like the following at this stage:
Adding Shape Manager
Locate the Shape Manager in the repository and add it to the workspace. Leave its configuration at the defaults. As with the previous node, you must open the configuration dialog and click OK to accept the defaults. Connect the output of Color Manager to the input of Shape Manager. Set the description for the node.
Your screen should look like the following:
Now, you will add the last node in our model, the scatter plot.
Adding Scatter Plot
Locate the Scatter Plot node in the repository and add it to the workspace. Connect the output of Shape Manager to the input of Scatter Plot. Leave the configuration at its defaults. Set the description.
Finally, add a group annotation to the three most recently added nodes:
Annotation: Visualization
Reposition the nodes as desired. Your screen should look like the following at this stage.
This completes the task of model building.