Data Mining - Decision Tree Induction

A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label. The topmost node in the tree is the root node.

The following decision tree is for the concept buy_computer that indicates whether a customer at a company is likely to buy a computer or not. Each internal node represents a test on an attribute. Each leaf node represents a class.

[Figure: decision tree for the buy_computer concept]
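
Such a tree can be read as a set of nested attribute tests ending in class labels. The sketch below expresses a tree of this general shape as plain Python conditionals; the attribute names, values, and outcomes are illustrative assumptions, not taken from the figure.

# Illustrative only: a hand-written classifier mirroring the shape of a
# buy_computer decision tree (attribute names and values are assumptions).
def buys_computer(customer: dict) -> str:
    # Root node: test the 'age' attribute.
    if customer["age"] == "youth":
        # Internal node: test the 'student' attribute.
        return "yes" if customer["student"] == "yes" else "no"
    elif customer["age"] == "middle_aged":
        # Leaf node: class label.
        return "yes"
    else:  # senior
        # Internal node: test the 'credit_rating' attribute.
        return "yes" if customer["credit_rating"] == "fair" else "no"

print(buys_computer({"age": "youth", "student": "yes", "credit_rating": "fair"}))  # yes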

The benefits of having a decision tree are as follows −

  1. It does not require any domain knowledge.

  2. It is easy to comprehend.

  3. The learning and classification steps of a decision tree are simple and fast.

Decision Tree Induction Algorithm

In 1980, a machine learning researcher named J. Ross Quinlan developed a decision tree algorithm known as ID3 (Iterative Dichotomiser). Later, he presented C4.5, the successor of ID3. ID3 and C4.5 adopt a greedy approach. In these algorithms there is no backtracking; the tree is constructed in a top-down, recursive, divide-and-conquer manner.
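
At each step of this greedy process, the algorithm picks the attribute that best separates the classes. The sketch below computes information gain, the attribute selection measure used by ID3; it is a minimal standalone example, and the toy dataset and attribute names are made up for illustration.

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels.
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attribute_index):
    # Expected reduction in entropy from splitting on one attribute.
    base = entropy(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute_index], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part) for part in partitions.values())
    return base - remainder

# Toy data: each row is (age, student), the label is buys_computer.
rows = [("youth", "no"), ("youth", "yes"), ("middle_aged", "no"), ("senior", "yes")]
labels = ["no", "yes", "yes", "yes"]
print(information_gain(rows, labels, 0))  # gain of splitting on 'age'
print(information_gain(rows, labels, 1))  # gain of splitting on 'student'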

Algorithm: Generate_decision_tree. Generate a decision tree from the training tuples of data partition D.

Input:
Data partition, D, which is a set of training tuples
and their associated class labels;
attribute_list, the set of candidate attributes;
Attribute_selection_method, a procedure to determine the
splitting criterion that best partitions the data
tuples into individual classes. This criterion consists of a
splitting_attribute and, possibly, either a split point or a splitting subset.

Output:
 A decision tree

Method
create a node N;

if tuples in D are all of the same class, C, then
   return N as a leaf node labeled with the class C;

if attribute_list is empty then
   return N as a leaf node labeled with
   the majority class in D; // majority voting

apply Attribute_selection_method(D, attribute_list)
to find the best splitting_criterion;
label node N with splitting_criterion;

if splitting_attribute is discrete-valued and
   multiway splits allowed then  // not restricted to binary trees
   attribute_list = attribute_list - splitting_attribute; // remove splitting attribute

for each outcome j of splitting_criterion

   // partition the tuples and grow subtrees for each partition
   let Dj be the set of data tuples in D satisfying outcome j; // a partition

   if Dj is empty then
      attach a leaf labeled with the majority
      class in D to node N;
   else
      attach the node returned by
      Generate_decision_tree(Dj, attribute_list) to node N;
end for
return N;
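
In practice this induction step is rarely coded by hand. The sketch below uses scikit-learn's DecisionTreeClassifier, a CART-style learner rather than literal ID3/C4.5, on a small hand-encoded dataset; the feature encoding and the toy tuples are assumptions for illustration.

from sklearn.tree import DecisionTreeClassifier, export_text

# Toy buy_computer-style data, manually encoded as integers (assumed encoding):
# age: 0 = youth, 1 = middle_aged, 2 = senior; student: 0/1; credit_rating: 0 = fair, 1 = excellent
X = [
    [0, 0, 0], [0, 1, 0], [1, 0, 0], [2, 0, 0],
    [2, 1, 1], [0, 0, 1], [1, 1, 1], [2, 0, 1],
]
y = ["no", "yes", "yes", "yes", "no", "no", "yes", "no"]

# criterion="entropy" selects splits by information gain, similar in spirit to ID3/C4.5.
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, y)

# Inspect the induced tree and classify a new tuple.
print(export_text(clf, feature_names=["age", "student", "credit_rating"]))
print(clf.predict([[0, 1, 0]]))  # a young student with fair credit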

Tree Pruning

Tree pruning is performed in order to remove branches that reflect anomalies in the training data caused by noise or outliers. Pruned trees are smaller and less complex.

Tree Pruning Approaches

There are two approaches to prune a tree −

  1. Pre-pruning − The tree is pruned by halting its construction early.

  2. Post-pruning − This approach removes a sub-tree from a fully grown tree (see the sketch after this list).
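
As a rough illustration of the two approaches, the sketch below uses scikit-learn: pre-pruning is expressed through stopping parameters such as max_depth and min_samples_leaf, and post-pruning through cost-complexity pruning with ccp_alpha. The dataset is a stand-in and the parameter values are illustrative, not recommendations.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: halt construction early with stopping criteria.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
pre_pruned.fit(X, y)

# Post-pruning: grow the full tree, then remove sub-trees via cost-complexity pruning.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]   # pick one candidate alpha
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
post_pruned.fit(X, y)

print(pre_pruned.get_n_leaves(), post_pruned.get_n_leaves())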

Cost Complexity

The cost complexity is measured by the following two parameters −

  1. Number of leaves in the tree, and

  2. Error rate of the tree.
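
A common way to combine these two parameters is a score of the form error rate + alpha × number of leaves, where alpha trades tree size against accuracy. The sketch below computes such a score for a fitted scikit-learn tree; the exact weighting and the alpha value are assumptions for illustration.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

n_leaves = clf.get_n_leaves()          # parameter 1: number of leaves in the tree
error_rate = 1.0 - clf.score(X, y)     # parameter 2: error rate (here, resubstitution error)
alpha = 0.01                           # illustrative trade-off weight (assumption)

cost_complexity = error_rate + alpha * n_leaves
print(n_leaves, error_rate, cost_complexity)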