Data Mining - Decision Tree Induction
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label. The topmost node in the tree is the root node.
The following decision tree is for the concept buy_computer, indicating whether a customer at a company is likely to buy a computer. Each internal node represents a test on an attribute, and each leaf node represents a class.
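To make the structure concrete, here is a minimal sketch of how such a learned tree could classify a single customer record. The attribute names and branch values (age, student, credit_rating) are assumed from the classic buy_computer textbook example and may differ from the tree in the figure; the function name classify is illustrative only.

def classify(customer):
    # Walk the tree from the root: each internal node tests one attribute,
    # each branch is one test outcome, and the leaf reached gives the class label.
    if customer["age"] == "youth":                  # root test on age
        return "yes" if customer["student"] == "yes" else "no"
    elif customer["age"] == "middle_aged":          # leaf: middle-aged customers buy
        return "yes"
    else:                                           # senior: test credit_rating
        return "yes" if customer["credit_rating"] == "fair" else "no"

print(classify({"age": "youth", "student": "yes", "credit_rating": "fair"}))  # prints: yes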
The benefits of having a decision tree are as follows −
- It does not require any domain knowledge.
- It is easy to comprehend.
- The learning and classification steps of a decision tree are simple and fast.
Decision Tree Induction Algorithm
A machine learning researcher named J. Ross Quinlan developed a decision tree algorithm known as ID3 (Iterative Dichotomiser) in 1980. Later, he presented C4.5, the successor of ID3. ID3 and C4.5 adopt a greedy approach: there is no backtracking, and the tree is constructed in a top-down, recursive, divide-and-conquer manner.
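The greedy step is the attribute selection: at each node, ID3 scores every remaining attribute by information gain (the reduction in entropy) and splits on the best one, while C4.5 uses the gain ratio instead. A minimal sketch of information gain for discrete-valued attributes follows; the helper names entropy and information_gain are illustrative, not part of any library.

from collections import Counter
from math import log2

def entropy(labels):
    # Shannon entropy of a list of class labels.
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(tuples, labels, attribute):
    # Entropy of D minus the weighted entropy of the partitions
    # produced by splitting D on the given discrete attribute.
    partitions = {}
    for row, label in zip(tuples, labels):
        partitions.setdefault(row[attribute], []).append(label)
    expected = sum(len(part) / len(labels) * entropy(part) for part in partitions.values())
    return entropy(labels) - expected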
Generating a decision tree from training tuples of data partition D
Algorithm : Generate_decision_tree
Input:
Data partition, D, which is a set of training tuples
and their associated class labels.
attribute_list, the set of candidate attributes.
attribute_selection_method, a procedure to determine the
splitting criterion that best partitions the data
tuples into individual classes. This criterion includes a
splitting_attribute and either a splitting point or a splitting subset.
Output:
A Decision Tree
Method:
create a node N;
if tuples in D are all of the same class, C, then
return N as a leaf node labeled with the class C;
if attribute_list is empty then
return N as a leaf node labeled
with the majority class in D; // majority voting
apply attribute_selection_method(D, attribute_list)
to find the best splitting_criterion;
label node N with splitting_criterion;
if splitting_attribute is discrete-valued and
multiway splits allowed then // not restricted to binary trees
attribute_list = attribute_list - splitting_attribute; // remove splitting attribute
for each outcome j of splitting_criterion
// partition the tuples and grow subtrees for each partition
let Dj be the set of data tuples in D satisfying outcome j; // a partition
if Dj is empty then
attach a leaf labeled with the majority
class in D to node N;
else
attach the node returned by
Generate_decision_tree(Dj, attribute_list) to node N;
end for
return N;
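The pseudocode above translates almost line for line into a short recursive function. The sketch below assumes discrete-valued attributes, uses information gain as the attribute_selection_method (as ID3 does), and represents the tree as nested dictionaries; the names are illustrative rather than a standard API.

from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def info_gain(tuples, labels, attribute):
    # Entropy of D minus the weighted entropy of the partitions induced by the attribute.
    partitions = {}
    for row, label in zip(tuples, labels):
        partitions.setdefault(row[attribute], []).append(label)
    expected = sum(len(p) / len(labels) * entropy(p) for p in partitions.values())
    return entropy(labels) - expected

def generate_decision_tree(tuples, labels, attribute_list):
    # Tuples in D are all of the same class: return a leaf labeled with that class.
    if len(set(labels)) == 1:
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    # attribute_list is empty: return a leaf labeled with the majority class (majority voting).
    if not attribute_list:
        return majority
    # Greedy, non-backtracking step: pick the best splitting attribute.
    best = max(attribute_list, key=lambda a: info_gain(tuples, labels, a))
    remaining = [a for a in attribute_list if a != best]  # remove the splitting attribute
    node = {best: {}}
    # Multiway split: partition D and grow one subtree per outcome of the test.
    for value in sorted({row[best] for row in tuples}):
        subset = [(row, label) for row, label in zip(tuples, labels) if row[best] == value]
        # The pseudocode attaches a majority-class leaf for an empty partition Dj;
        # that case cannot occur here because the outcomes are taken from the data itself.
        sub_tuples, sub_labels = zip(*subset)
        node[best][value] = generate_decision_tree(list(sub_tuples), list(sub_labels), remaining)
    return node

Calling generate_decision_tree on a list of training tuples returns a nested dictionary in which each key is a splitting attribute, each sub-key is one outcome of the test, and each leaf value is a class label.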