Data Mining: A Concise Tutorial
Miscellaneous Classification Methods
Here we discuss other classification methods: genetic algorithms, the rough set approach, and the fuzzy set approach.
Genetic Algorithms
The idea of a genetic algorithm is derived from natural evolution. A genetic algorithm starts by creating an initial population, which consists of randomly generated rules. Each rule can be represented by a string of bits.
For example, suppose the samples in a given training set are described by two Boolean attributes, A1 and A2, and the training set contains two classes, C1 and C2.
We can encode the rule IF A1 AND NOT A2 THEN C2 as the bit string 100. In this representation, the two leftmost bits represent the attributes A1 and A2, respectively, and the rightmost bit represents the class.
Likewise, the rule IF NOT A1 AND NOT A2 THEN C1 can be encoded as 001.
Note: If an attribute has K values, where K > 2, then K bits can be used to encode the attribute's values. Classes are encoded in the same manner.
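The two-attribute encoding above can be sketched in a few lines of Python. This is only an illustration: the function names `encode_rule` and `decode_rule` are hypothetical, and the mapping of the class bit (1 for C1, 0 for C2) is inferred from the two examples in the text rather than stated by it.

```python
def encode_rule(a1, a2, is_c1):
    """Encode a rule over Boolean attributes A1 and A2 as a 3-bit string.

    Bits 1-2: required value of A1 and A2 (1 = attribute, 0 = NOT attribute).
    Bit 3: the class, assuming 1 stands for C1 and 0 for C2.
    """
    return "".join("1" if bit else "0" for bit in (a1, a2, is_c1))


def decode_rule(bits):
    """Turn a 3-bit string back into a readable IF-THEN rule."""
    a1, a2, cls = bits
    cond_a1 = "A1" if a1 == "1" else "NOT A1"
    cond_a2 = "A2" if a2 == "1" else "NOT A2"
    label = "C1" if cls == "1" else "C2"
    return f"IF {cond_a1} AND {cond_a2} THEN {label}"


print(encode_rule(True, False, False))  # 100
print(decode_rule("001"))               # IF NOT A1 AND NOT A2 THEN C1
```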
Points to remember:
- Based on the notion of survival of the fittest, a new population is formed that consists of the fittest rules in the current population together with the offspring of these rules.
- The fitness of a rule is assessed by its classification accuracy on a set of training samples.
- Genetic operators such as crossover and mutation are applied to create offspring.
- In crossover, substrings from a pair of rules are swapped to form a new pair of rules.
- In mutation, randomly selected bits in a rule's string are inverted.
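The points above can be sketched as a small Python program. It is a minimal sketch under stated assumptions, not a definitive genetic algorithm: rules and samples are 3-bit strings (two attribute bits plus one class bit), fitness is taken to be a rule's accuracy on the samples whose attribute bits it matches, crossover is single-point, and the names `fitness`, `crossover`, `mutate`, `next_generation`, and the parameters `keep` and `rate` are all hypothetical.

```python
import random


def fitness(rule, samples):
    """Accuracy of the rule on the samples whose attribute bits it covers.

    Assumption: a rule that covers no sample gets fitness 0.
    """
    covered = [s for s in samples if s[:2] == rule[:2]]
    if not covered:
        return 0.0
    return sum(s[2] == rule[2] for s in covered) / len(covered)


def crossover(p1, p2, point):
    """Single-point crossover: swap the substrings after `point`."""
    return p1[:point] + p2[point:], p2[:point] + p1[point:]


def mutate(rule, rate, rng):
    """Invert each bit independently with probability `rate`."""
    return "".join(
        ("1" if b == "0" else "0") if rng.random() < rate else b
        for b in rule
    )


def next_generation(population, samples, rng, keep=2, rate=0.1):
    """Keep the `keep` fittest rules and refill with their mutated offspring."""
    ranked = sorted(population, key=lambda r: fitness(r, samples), reverse=True)
    survivors = ranked[:keep]
    offspring = []
    while len(survivors) + len(offspring) < len(population):
        p1, p2 = rng.sample(survivors, 2)
        point = rng.randrange(1, len(p1))
        for child in crossover(p1, p2, point):
            offspring.append(mutate(child, rate, rng))
    return (survivors + offspring)[:len(population)]


rng = random.Random(0)
samples = ["100", "100", "001", "110"]  # made-up training samples
population = ["".join(rng.choice("01") for _ in range(3)) for _ in range(6)]
for _ in range(5):
    population = next_generation(population, samples, rng)
print(population)
```

Keeping the fittest rules unchanged in each generation is one simple way to realize "survival of the fittest"; other selection schemes would fit the description equally well.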
Rough Set Approach
The rough set approach can be used to discover structural relationships within imprecise and noisy data.
Note: This approach can only be applied to discrete-valued attributes. Continuous-valued attributes must therefore be discretized before it is used.
Rough set theory is based on establishing equivalence classes within the given training data. The tuples that form an equivalence class are indiscernible; that is, the samples are identical with respect to the attributes describing the data.
In real-world data, some classes cannot be distinguished in terms of the available attributes. Rough sets can be used to define such classes approximately, or "roughly".
For a given class C, the rough set definition is approximated by two sets:
- Lower Approximation of C: all the data tuples that, based on knowledge of the attributes, are certain to belong to class C.
- Upper Approximation of C: all the tuples that, based on knowledge of the attributes, cannot be described as not belonging to C.
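The two approximations can be computed directly from the equivalence classes. The sketch below assumes each tuple is a pair of (attribute values, class label); the function name `approximations` and the sample data are hypothetical and only illustrate the definitions.

```python
from collections import defaultdict


def approximations(rows, target):
    """Return (lower, upper) approximations of class `target` as sets of row indices.

    rows: list of (attribute_tuple, class_label) pairs.
    Rows with identical attribute tuples form one equivalence class.
    """
    blocks = defaultdict(list)
    for i, (attrs, _label) in enumerate(rows):
        blocks[attrs].append(i)

    lower, upper = set(), set()
    for members in blocks.values():
        labels = {rows[i][1] for i in members}
        if labels == {target}:   # every indiscernible tuple is in the class
            lower.update(members)
        if target in labels:     # at least one indiscernible tuple is in the class
            upper.update(members)
    return lower, upper


rows = [
    (("young", "low"), "C"),
    (("young", "low"), "C"),
    (("old", "high"), "C"),
    (("old", "high"), "not C"),
    (("old", "low"), "not C"),
]
lower, upper = approximations(rows, "C")
print(sorted(lower))  # [0, 1]
print(sorted(upper))  # [0, 1, 2, 3]
```

Rows 2 and 3 share the same attribute values but differ in class, so they fall in the upper approximation only; the difference between the two sets is the region where the attributes cannot decide membership.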
The following diagram shows the upper and lower approximations of class C:
Fuzzy Set Approaches
Fuzzy set theory is also called possibility theory. It was proposed by Lotfi Zadeh in 1965 as an alternative to two-valued logic and probability theory. The theory lets us work at a high level of abstraction and provides a means of dealing with imprecise measurements of data.
Fuzzy set theory also allows us to deal with vague or inexact facts. For example, membership in the set of high incomes is inexact (e.g., if $50,000 is high, then what about $49,000 and $48,000?). In a traditional crisp set, an element belongs either to a set S or to its complement; in fuzzy set theory, an element can belong to more than one fuzzy set.
For example, the income value $49,000 belongs to both the medium and high fuzzy sets, but to differing degrees. The fuzzy set notation for this income value is as follows:
$m_{\text{medium\_income}}(\$49\text{k}) = 0.15$ and $m_{\text{high\_income}}(\$49\text{k}) = 0.96$
where m is the membership function, operating on the fuzzy sets medium_income and high_income, respectively. This notation can also be shown diagrammatically.
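To make graded membership concrete, here is a minimal Python sketch using piecewise-linear membership functions. The text does not give the shape of the functions, so the breakpoints below are hypothetical: they were chosen only so that the functions reproduce the two degrees quoted above for $49,000.

```python
def ramp_up(x, start, end):
    """0 at or below `start`, 1 at or above `end`, linear in between."""
    if x <= start:
        return 0.0
    if x >= end:
        return 1.0
    return (x - start) / (end - start)


def trapezoid(x, a, b, c, d):
    """Rises over [a, b], stays at 1 over [b, c], falls over [c, d]."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


def m_medium_income(x):
    # Hypothetical breakpoints, not taken from the text.
    return trapezoid(x, 10_000, 20_000, 32_000, 52_000)


def m_high_income(x):
    # Hypothetical breakpoints, not taken from the text.
    return ramp_up(x, 25_000, 50_000)


income = 49_000
print(f"{m_medium_income(income):.2f}")  # 0.15
print(f"{m_high_income(income):.2f}")    # 0.96
```

Unlike a crisp set, where the same income would be either in or out of "high", both functions return a nonzero degree here, so $49,000 is simultaneously a weak member of medium_income and a strong member of high_income.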