Data Mining: A Concise Tutorial
Data Mining - Bayesian Classification
Bayesian classification is based on Bayes' Theorem. Bayesian classifiers are statistical classifiers: they can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.
Bayes' Theorem
Bayes' Theorem is named after Thomas Bayes. It involves two types of probabilities:
- Posterior Probability [P(H|X)]
- Prior Probability [P(H)]
where X is a data tuple and H is some hypothesis.
According to Bayes' Theorem,
P(H|X) = P(X|H) P(H) / P(X)
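As a small worked illustration of Bayes' Theorem, the sketch below computes a posterior P(H|X) from a prior P(H) and the two likelihoods P(X|H) and P(X|not H), expanding P(X) with the law of total probability. The function name `posterior` and all numeric values are assumptions chosen for illustration; they do not come from this tutorial.

```python
def posterior(prior_h, likelihood_x_given_h, likelihood_x_given_not_h):
    """Return P(H|X) for a binary hypothesis H using Bayes' Theorem."""
    # P(X) = P(X|H) P(H) + P(X|not H) P(not H)  (law of total probability)
    evidence = (likelihood_x_given_h * prior_h
                + likelihood_x_given_not_h * (1.0 - prior_h))
    if evidence == 0.0:
        raise ValueError("P(X) is zero, so the posterior is undefined")
    # P(H|X) = P(X|H) P(H) / P(X)
    return likelihood_x_given_h * prior_h / evidence


# Hypothetical numbers: a rare hypothesis (1%) and fairly strong evidence.
p = posterior(prior_h=0.01,
              likelihood_x_given_h=0.9,
              likelihood_x_given_not_h=0.05)
print(round(p, 4))  # 0.1538
```

Even with strong evidence, the posterior stays modest here because the prior is small, which is the usual intuition behind the formula.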
Bayesian Belief Network
Bayesian Belief Networks specify joint conditional probability distributions. They are also known as Belief Networks, Bayesian Networks, or Probabilistic Networks.
- A Belief Network allows class conditional independencies to be defined between subsets of variables.
- It provides a graphical model of causal relationships on which learning can be performed.
- A trained Bayesian Network can be used for classification.
Two components define a Bayesian Belief Network:
- A directed acyclic graph
- A set of conditional probability tables
Directed Acyclic Graph
- Each node in a directed acyclic graph represents a random variable.
- These variables may be discrete or continuous valued.
- These variables may correspond to the actual attributes given in the data.
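One possible way to hold such a graph in code is a mapping from each node to the list of its parents. The sketch below is a minimal illustration, not a definitive implementation: the node names are taken from the lung cancer example in this tutorial, the arc from LungCancer to PositiveXray is an assumption, and `topological_order` is a hypothetical helper. It relies on `graphlib` from the Python standard library (Python 3.9 or later) to confirm that the graph has no cycle.

```python
from graphlib import CycleError, TopologicalSorter

# Each node (a random variable) maps to the list of its parent nodes.
parents = {
    "FamilyHistory": [],
    "Smoker": [],
    "LungCancer": ["FamilyHistory", "Smoker"],
    "PositiveXray": ["LungCancer"],  # assumed arc, for illustration only
}


def topological_order(parent_map):
    """Return the nodes with every parent listed before its children.

    Raises ValueError if the graph contains a cycle (i.e. is not a DAG).
    """
    try:
        # TopologicalSorter expects node -> predecessors, which is exactly
        # the node -> parents layout used above.
        return list(TopologicalSorter(parent_map).static_order())
    except CycleError as exc:
        raise ValueError("graph contains a cycle, so it is not a DAG") from exc


order = topological_order(parents)
print(order)
```

Storing parents rather than children is a convenient choice here, because each conditional probability table is indexed by the values of a node's parents.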
Directed Acyclic Graph Representation
The following diagram shows a directed acyclic graph for six Boolean variables.
The arcs in the diagram represent causal knowledge. For example, lung cancer is influenced by a person's family history of lung cancer, as well as by whether or not the person is a smoker. Note that, given that we know the patient has lung cancer, the variable PositiveXray is conditionally independent of both the patient's family history of lung cancer and whether the patient is a smoker.
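The conditional independence described above can be checked numerically on a small sub-network. The sketch below assumes the factorization P(FH) P(S) P(LC|FH,S) P(PX|LC) over four of the variables, with an assumed arc from LungCancer to PositiveXray. Every probability value is an illustrative placeholder, and the helper names (`bernoulli`, `joint`, `p_xray_given`) are hypothetical.

```python
from itertools import product

# Illustrative placeholder probabilities (assumptions, not source data).
P_FH = 0.1                      # P(FamilyHistory = yes)
P_S = 0.3                       # P(Smoker = yes)
P_LC = {                        # P(LungCancer = yes | FH, S)
    (True, True): 0.8,
    (True, False): 0.5,
    (False, True): 0.7,
    (False, False): 0.1,
}
P_PX = {True: 0.9, False: 0.2}  # P(PositiveXray = yes | LungCancer)


def bernoulli(p_yes, value):
    """Probability that a Boolean variable takes `value`, given P(yes)."""
    return p_yes if value else 1.0 - p_yes


def joint(fh, s, lc, px):
    """Joint probability under the assumed network factorization."""
    return (bernoulli(P_FH, fh) * bernoulli(P_S, s)
            * bernoulli(P_LC[(fh, s)], lc) * bernoulli(P_PX[lc], px))


def p_xray_given(lc, fh, s):
    """P(PositiveXray = yes | LC, FH, S), computed from the joint."""
    yes = joint(fh, s, lc, True)
    no = joint(fh, s, lc, False)
    return yes / (yes + no)


# With LungCancer fixed, the answer should not change with FH or S.
results = {(fh, s): p_xray_given(True, fh, s)
           for fh, s in product([True, False], repeat=2)}
for key, value in results.items():
    print(key, round(value, 6))
```

All four conditionals agree, up to floating-point rounding, with P(PositiveXray = yes | LungCancer = yes), which is what the independence statement predicts for this factorization.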
Conditional Probability Table
The conditional probability table for the variable LungCancer (LC) gives its distribution for each possible combination of the values of its parent nodes, FamilyHistory (FH) and Smoker (S), as follows:
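A conditional probability table of this shape can be sketched as a dictionary keyed by the parents' values. In the example below the probabilities are illustrative placeholders rather than the values of the table above, and the names `CPT_LC`, `p_lung_cancer`, and `classify` are hypothetical. Since LC is Boolean, only P(LC = yes | FH, S) is stored, and P(LC = no | FH, S) is its complement.

```python
# (FH, S) -> P(LC = yes | FH, S); values are illustrative assumptions.
CPT_LC = {
    (True, True): 0.8,
    (True, False): 0.5,
    (False, True): 0.7,
    (False, False): 0.1,
}

# Every entry must be a valid probability.
for parent_values, p_yes in CPT_LC.items():
    if not 0.0 <= p_yes <= 1.0:
        raise ValueError(f"invalid probability for {parent_values}: {p_yes}")


def p_lung_cancer(lc, fh, s):
    """Look up P(LC = lc | FH = fh, S = s) in the table."""
    p_yes = CPT_LC[(fh, s)]
    return p_yes if lc else 1.0 - p_yes


def classify(fh, s):
    """Pick the more probable LungCancer value for the given parents."""
    return "yes" if p_lung_cancer(True, fh, s) >= 0.5 else "no"


print(p_lung_cancer(True, fh=False, s=True))
print(classify(fh=False, s=False))
```

Each column of the table (one combination of parent values) sums to 1 by construction, because the "no" entry is derived from the "yes" entry.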