Data Mining - Rule Based Classification
IF-THEN Rules
Rule-based classifiers make use of a set of IF-THEN rules for classification. We can express a rule in the following form −
IF condition THEN conclusion
Let us consider a rule R1,
R1: IF age = youth AND student = yes
THEN buy_computer = yes
Points to remember −
- The IF part of the rule is called the rule antecedent or precondition.
- The THEN part of the rule is called the rule consequent.
- The antecedent part of the rule consists of one or more attribute tests, and these tests are logically ANDed.
- The consequent part consists of the class prediction.
Note − We can also write rule R1 as follows −
R1: (age = youth) ^ (student = yes) => (buy_computer = yes)
If the condition holds true for a given tuple, then the antecedent is satisfied.
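To make the coverage idea concrete, the following Python sketch represents R1 as a small data structure and checks whether a tuple satisfies its antecedent. The attribute names come from R1 above; the helper function and example tuples are purely illustrative.

```python
# A minimal sketch of rule R1 as a data structure: the antecedent is a list of
# (attribute, required_value) tests, the consequent is the class prediction.
rule_R1 = {
    "antecedent": [("age", "youth"), ("student", "yes")],
    "consequent": ("buy_computer", "yes"),
}

def antecedent_satisfied(rule, tup):
    """True when every attribute test holds for the tuple (tests are ANDed)."""
    return all(tup.get(attr) == value for attr, value in rule["antecedent"])

tuple_1 = {"age": "youth", "student": "yes", "income": "medium"}
tuple_2 = {"age": "senior", "student": "yes", "income": "high"}

print(antecedent_satisfied(rule_R1, tuple_1))  # True  -> R1 covers tuple_1
print(antecedent_satisfied(rule_R1, tuple_2))  # False -> the age test fails
```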
Rule Extraction
Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from a decision tree.
Points to remember −
To extract a rule from a decision tree (see the code sketch after this list) −
- One rule is created for each path from the root to a leaf node.
- To form the rule antecedent, each splitting criterion along the path is logically ANDed.
- The leaf node holds the class prediction, forming the rule consequent.
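As a sketch of these steps, the Python snippet below walks a small, hypothetical decision tree (a nested dict whose internal nodes test one attribute and whose leaves hold a class label) and produces one IF-THEN rule per root-to-leaf path. The tree itself is an illustrative assumption, not data from this tutorial.

```python
# A sketch of rule extraction from a decision tree. The tree is a hypothetical
# nested-dict structure; other tree representations can be traversed the same way.
tree = {
    "attribute": "age",
    "branches": {
        "youth": {
            "attribute": "student",
            "branches": {
                "yes": {"leaf": "buy_computer = yes"},
                "no": {"leaf": "buy_computer = no"},
            },
        },
        "middle_aged": {"leaf": "buy_computer = yes"},
        "senior": {"leaf": "buy_computer = no"},
    },
}

def extract_rules(node, conditions=()):
    """One rule per root-to-leaf path: the splitting criteria along the path are
    ANDed into the antecedent; the leaf's class label becomes the consequent."""
    if "leaf" in node:
        antecedent = " AND ".join(conditions) if conditions else "TRUE"
        return [f"IF {antecedent} THEN {node['leaf']}"]
    rules = []
    for value, child in node["branches"].items():
        test = f"{node['attribute']} = {value}"
        rules.extend(extract_rules(child, conditions + (test,)))
    return rules

for rule in extract_rules(tree):
    print(rule)
```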
Rule Induction Using Sequential Covering Algorithm
The Sequential Covering Algorithm can be used to extract IF-THEN rules from the training data. We do not need to generate a decision tree first. In this algorithm, each rule for a given class covers many of the tuples of that class.
Some of the sequential covering algorithms are AQ, CN2, and RIPPER. As per the general strategy, the rules are learned one at a time. Each time a rule is learned, the tuples covered by that rule are removed, and the process continues for the remaining tuples.
Note − Decision tree induction can be considered as learning a set of rules simultaneously, since the path to each leaf in a decision tree corresponds to a rule.
The following is the sequential learning algorithm, where rules are learned for one class at a time. When learning a rule for class Ci, we want the rule to cover all the tuples of class Ci and no tuples of any other class.
Algorithm: Sequential Covering
Input:
D, a data set of class-labeled tuples;
Att_vals, the set of all attributes and their possible values.
Output: A set of IF-THEN rules.
Method:
Rule_set = { };   // initial set of rules learned is empty
for each class c do
   repeat
      Rule = Learn_One_Rule(D, Att_vals, c);
      remove tuples covered by Rule from D;
      Rule_set = Rule_set + Rule;   // add the new rule to the rule set
   until terminating condition;
end for
return Rule_set;
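The pseudocode above translates fairly directly into Python. The sketch below is a toy implementation under simplifying assumptions: tuples are dicts with a "class" key, and learn_one_rule greedily ANDs in the single most accurate attribute test at each step. It is an illustrative stand-in for Learn_One_Rule, not the exact procedure used by AQ, CN2, or RIPPER.

```python
# A toy sketch of the Sequential Covering algorithm above (not AQ/CN2/RIPPER).
# Tuples are dicts such as {"age": "youth", "student": "yes", "class": "yes"}.

def covers(rule, tup):
    """A rule (list of (attribute, value) tests) covers a tuple if all tests hold."""
    return all(tup.get(a) == v for a, v in rule)

def learn_one_rule(data, att_vals, target):
    """Greedily grow one rule: at each step, AND in the attribute test whose
    covered subset has the highest proportion of target-class tuples."""
    rule, covered, used = [], list(data), set()
    while any(t["class"] != target for t in covered) and len(used) < len(att_vals):
        best = None
        for attr, values in att_vals.items():
            if attr in used:
                continue
            for v in values:
                subset = [t for t in covered if t.get(attr) == v]
                if not subset:
                    continue
                acc = sum(t["class"] == target for t in subset) / len(subset)
                if best is None or acc > best[0]:
                    best = (acc, (attr, v), subset)
        if best is None:
            break
        used.add(best[1][0])
        rule.append(best[1])
        covered = best[2]
    return rule

def sequential_covering(data, att_vals, classes):
    rule_set = []
    for c in classes:                                   # for each class c do
        remaining = list(data)
        while any(t["class"] == c for t in remaining):  # terminating condition
            rule = learn_one_rule(remaining, att_vals, c)
            remaining = [t for t in remaining if not covers(rule, t)]  # remove covered tuples
            rule_set.append((rule, c))                  # add the new rule to the rule set
            if not rule:   # an empty rule covers everything; stop for this class
                break
    return rule_set
```

Learning rules class by class and removing covered tuples mirrors the loop structure of the pseudocode; only the greedy rule-growing heuristic inside learn_one_rule is an assumption made for brevity.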
Rule Pruning
The rule is pruned due to the following reasons −
- The assessment of quality is made on the original set of training data. The rule may perform well on the training data but less well on subsequent data. That is why rule pruning is required.
- The rule is pruned by removing a conjunct. Rule R is pruned if the pruned version of R has greater quality than R, as assessed on an independent set of tuples.
FOIL is one of the simple and effective methods for rule pruning. For a given rule R,
FOIL_Prune(R) = (pos - neg) / (pos + neg)
where pos and neg are the number of positive and negative tuples covered by R, respectively.
Note − This value will increase with the accuracy of R on the pruning set. Hence, if the FOIL_Prune value is higher for the pruned version of R, then we prune R.
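Given pos and neg counted on a pruning set that is independent of the training data, FOIL_Prune is simple to compute. The sketch below evaluates a rule and greedily drops conjuncts while doing so raises its FOIL_Prune value; the helper names and data layout are assumptions for illustration, matching the earlier sketches.

```python
# A sketch of FOIL_Prune-based pruning: FOIL_Prune(R) = (pos - neg) / (pos + neg),
# with pos and neg counted on an independent pruning set (tuples are dicts with
# a "class" key, as in the earlier sketches).

def covers(rule, tup):
    return all(tup.get(a) == v for a, v in rule)

def foil_prune(rule, pruning_set, target):
    covered = [t for t in pruning_set if covers(rule, t)]
    if not covered:
        return float("-inf")
    pos = sum(t["class"] == target for t in covered)
    neg = len(covered) - pos
    return (pos - neg) / (pos + neg)

def prune_rule(rule, pruning_set, target):
    """Repeatedly remove a conjunct whose removal increases FOIL_Prune."""
    best = list(rule)
    improved = True
    while improved and len(best) > 1:
        improved = False
        current = foil_prune(best, pruning_set, target)
        for i in range(len(best)):
            candidate = best[:i] + best[i + 1:]
            if foil_prune(candidate, pruning_set, target) > current:
                best, improved = candidate, True
                break
    return best
```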