Big Data Analytics: A Concise Tutorial
Big Data Analytics - Association Rules
Let I = {i1, i2, …, in} be a set of n binary attributes called items, and let D = {t1, t2, …, tm} be a set of transactions called the database. Each transaction in D has a unique transaction ID and contains a subset of the items in I. A rule is defined as an implication of the form X ⇒ Y, where X, Y ⊆ I and X ∩ Y = ∅.
The sets of items (item-sets for short) X and Y are called the antecedent (left-hand side, or LHS) and the consequent (right-hand side, or RHS) of the rule.
To illustrate the concepts, we use a small example from the supermarket domain. The set of items is I = {milk, bread, butter, beer}, and a small database containing these items is shown in the following table (Table 1).
Transaction ID | Items
1 | milk, bread
2 | bread, butter
3 | beer
4 | milk, bread, butter
5 | bread, butter
An example rule for the supermarket could be {milk, bread} ⇒ {butter}, meaning that customers who buy milk and bread also buy butter. To select interesting rules from the set of all possible rules, constraints on various measures of significance and interest can be used. The best-known constraints are minimum thresholds on support and confidence.
The support supp(X) of an item-set X is defined as the proportion of transactions in the data set that contain the item-set. In the example database in Table 1, the item-set {milk, bread} has a support of 2/5 = 0.4, since it occurs in 2 of the 5 transactions (40%). Finding frequent item-sets can be seen as a simplification of the unsupervised learning problem.
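As a minimal sketch of this definition, the base-R snippet below recomputes the support of {milk, bread} on the example database. The names `transactions` and `support` are illustrative helpers introduced here; they are not part of the arules package.

```r
# The five example transactions from Table 1, one character vector each.
transactions <- list(
  c("milk", "bread"),
  c("bread", "butter"),
  c("beer"),
  c("milk", "bread", "butter"),
  c("bread", "butter")
)

# support(): hypothetical helper returning the share of transactions
# that contain every item of `itemset`.
support <- function(itemset, db) {
  mean(vapply(db, function(tr) all(itemset %in% tr), logical(1)))
}

support(c("milk", "bread"), transactions)  # 0.4
```

Transactions 1 and 4 are the only ones that contain both items, which gives 2/5.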
The confidence of a rule is defined as conf(X ⇒ Y) = supp(X ∪ Y)/supp(X). For example, the rule {milk, bread} ⇒ {butter} has a confidence of 0.2/0.4 = 0.5 in the database in Table 1: only transaction 4 contains milk, bread, and butter (support 1/5 = 0.2), so the rule holds for 50% of the transactions that contain milk and bread. Confidence can be interpreted as an estimate of the probability P(Y|X), that is, the probability of finding the RHS of the rule in a transaction given that the transaction also contains the LHS.
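The confidence formula can be checked the same way. The sketch below is self-contained base R; `support`, `confidence`, and the two threshold values are assumptions made for illustration, not names or defaults taken from arules.

```r
# The five example transactions from Table 1.
transactions <- list(
  c("milk", "bread"),
  c("bread", "butter"),
  c("beer"),
  c("milk", "bread", "butter"),
  c("bread", "butter")
)

# support(): hypothetical helper, share of transactions containing `itemset`.
support <- function(itemset, db) {
  mean(vapply(db, function(tr) all(itemset %in% tr), logical(1)))
}

# confidence(): hypothetical helper implementing supp(X u Y) / supp(X).
confidence <- function(lhs, rhs, db) {
  support(union(lhs, rhs), db) / support(lhs, db)
}

conf <- confidence(c("milk", "bread"), "butter", transactions)
conf  # 0.5

# Illustrative minimum thresholds; a rule is kept only if it meets both.
min_support <- 0.2
min_confidence <- 0.5
rule_support <- support(c("milk", "bread", "butter"), transactions)
keep <- rule_support >= min_support && conf >= min_confidence
```

Raising either threshold above the rule's own value would drop the rule, which is how the minimum-support and minimum-confidence constraints prune the set of candidate rules.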
The code that implements the apriori algorithm can be found in the script bda/part3/apriori.R.
# Load the library for mining association rules
# install.packages("arules")
library(arules)
# Load the AdultUCI data set that ships with arules
data("AdultUCI")
# Data preprocessing: drop two columns that are not used for mining
AdultUCI[["fnlwgt"]] <- NULL
AdultUCI[["education-num"]] <- NULL
# Discretize the numeric columns into ordered categories
AdultUCI[["age"]] <- ordered(cut(AdultUCI[["age"]], c(15, 25, 45, 65, 100)),
   labels = c("Young", "Middle-aged", "Senior", "Old"))
AdultUCI[["hours-per-week"]] <- ordered(cut(AdultUCI[["hours-per-week"]],
   c(0, 25, 40, 60, 168)), labels = c("Part-time", "Full-time", "Over-time", "Workaholic"))
# For capital gain and loss, split the positive values at their median
AdultUCI[["capital-gain"]] <- ordered(cut(AdultUCI[["capital-gain"]],
   c(-Inf, 0, median(AdultUCI[["capital-gain"]][AdultUCI[["capital-gain"]] > 0]), Inf)),
   labels = c("None", "Low", "High"))
AdultUCI[["capital-loss"]] <- ordered(cut(AdultUCI[["capital-loss"]],
   c(-Inf, 0, median(AdultUCI[["capital-loss"]][AdultUCI[["capital-loss"]] > 0]), Inf)),
   labels = c("none", "low", "high"))
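The preprocessing above relies on the combination of `cut()` and `ordered()`. As a small base-R sketch of what that idiom does, the example below bins a few made-up ages with the same breaks; the `ages` vector is hypothetical and was chosen so that every bin contains at least one value.

```r
# Hypothetical ages, chosen so that every bin contains at least one value.
ages <- c(22, 38, 51, 70, 25)

# cut() assigns each age to a right-closed interval:
# (15,25], (25,45], (45,65], (65,100]
bins <- cut(ages, c(15, 25, 45, 65, 100))

# ordered() then replaces the interval names with readable, ordered labels.
age_group <- ordered(bins, labels = c("Young", "Middle-aged", "Senior", "Old"))

as.character(age_group)
# "Young" "Middle-aged" "Senior" "Old" "Young"
```

Because the intervals are closed on the right, an age of exactly 25 still falls into the "Young" group.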
To generate rules using the apriori algorithm, we first need to convert the data into a transaction matrix. The following code shows how to do this in R.
# Convert the data into a transactions format
Adult <- as(AdultUCI, "transactions")
# Print a short summary of the transactions object
Adult
# transactions in sparse format with
# 48842 transactions (rows) and
# 115 items (columns)
# Plot frequent item-sets
itemFrequencyPlot(Adult, support = 0.1, cex.names = 0.8)
# Generate rules that meet the minimum support and confidence thresholds
min_support <- 0.01
confidence <- 0.6
rules <- apriori(Adult, parameter = list(support = min_support, confidence = confidence))
# Inspect a few of the generated rules
inspect(rules[100:110])
# lhs                               rhs                               support     confidence  lift
# {occupation = Farming-fishing} => {sex = Male}                      0.02856148  0.9362416   1.4005486
# {occupation = Farming-fishing} => {race = White}                    0.02831579  0.9281879   1.0855456
# {occupation = Farming-fishing} => {native-country = United-States}  0.02671881  0.8758389   0.9759474
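To make the transaction matrix and the lift column more concrete, here is a minimal base-R sketch on the supermarket example that does not use arules. It assumes the usual definition lift(X ⇒ Y) = conf(X ⇒ Y)/supp(Y), which the text above does not state; `incidence` and `supp` are hypothetical names introduced for illustration.

```r
# The five example transactions from Table 1.
transactions <- list(
  c("milk", "bread"),
  c("bread", "butter"),
  c("beer"),
  c("milk", "bread", "butter"),
  c("bread", "butter")
)
items <- sort(unique(unlist(transactions)))

# Logical incidence matrix: one row per transaction, one column per item.
incidence <- t(vapply(transactions, function(tr) items %in% tr,
                      logical(length(items))))
colnames(incidence) <- items

# supp(): hypothetical helper, share of rows where all columns in `itemset` are TRUE.
supp <- function(itemset) {
  mean(rowSums(incidence[, itemset, drop = FALSE]) == length(itemset))
}

# Confidence and lift of {milk, bread} => {butter}.
conf <- supp(c("milk", "bread", "butter")) / supp(c("milk", "bread"))
lift <- conf / supp("butter")
lift  # about 0.83
```

A lift below 1 means the RHS is less frequent among transactions containing the LHS than in the database overall; a lift above 1 means it is more frequent. Under the same definition, the first rule in the output above implies supp({sex = Male}) = 0.9362416/1.4005486, roughly 0.67.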