Data Mining: A Concise Tutorial
Data Mining - Query Language
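The Data Mining Query Language (DMQL) was proposed by Han, Fu, and Wang for the DBMiner data mining system. DMQL is based on the Structured Query Language (SQL). It is designed to support ad hoc and interactive data mining, and it provides commands for specifying data mining primitives. DMQL works with both databases and data warehouses, and it can be used to define data mining tasks. In particular, we examine how to define data warehouses and data marts in DMQL.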
Syntax for Task-Relevant Data Specification
The DMQL syntax for specifying task-relevant data is as follows:
use database database_name
or
use data warehouse data_warehouse_name
in relevance to att_or_dim_list
from relation(s)/cube(s) [where condition]
order by order_list
group by grouping_list
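To see how the clauses above fit together, here is a minimal Python sketch that assembles a task-relevant data specification as text. The helper name `task_relevant_data` and its parameters are hypothetical, and treating `order by` and `group by` as optional is an assumption; the sketch only formats a string and does not talk to any DMQL engine.

```python
def task_relevant_data(source, attributes, relations, *, warehouse=False,
                       where=None, order_by=None, group_by=None):
    """Assemble the task-relevant data clauses of a DMQL query as text."""
    kind = "data warehouse" if warehouse else "database"
    lines = [
        f"use {kind} {source}",
        "in relevance to " + ", ".join(attributes),
        "from " + ", ".join(relations) + (f" where {where}" if where else ""),
    ]
    # Assumption: order by and group by are emitted only when supplied.
    if order_by:
        lines.append("order by " + ", ".join(order_by))
    if group_by:
        lines.append("group by " + ", ".join(group_by))
    return "\n".join(lines)


query = task_relevant_data(
    "AllElectronics_db",
    ["C.age", "I.type"],
    ["customer C", "item I"],
    where="I.price >= 100",
    group_by=["C.age"],
)
print(query)
```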
Syntax for Specifying the Kind of Knowledge
Here we discuss the syntax for characterization, discrimination, association, classification, and prediction.
Characterization
The syntax for characterization is as follows:
mine characteristics [as pattern_name]
analyze {measure(s) }
The analyze clause specifies aggregate measures, such as count, sum, or count%.
For example, the following describes customer purchasing habits:
mine characteristics as customerPurchasing
analyze count%
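As a rough illustration of what the `count%` measure reports, the Python sketch below computes, for each group of identical tuples, its share of all task-relevant tuples. The sample data and the variable names are hypothetical; a real system would first generalize the tuples before counting.

```python
from collections import Counter

# Hypothetical task-relevant tuples: (age_group, item_type).
purchases = [
    ("young", "computer"),
    ("young", "computer"),
    ("senior", "printer"),
    ("young", "printer"),
]

counts = Counter(purchases)      # the count measure for each distinct tuple
total = sum(counts.values())     # number of task-relevant tuples

# count% expresses each group's count as a percentage of the total.
count_pct = {group: 100.0 * n / total for group, n in counts.items()}

for group, pct in sorted(count_pct.items()):
    print(group, f"{pct:.1f}%")
```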
Discrimination
The syntax for discrimination is as follows:
mine comparison [as pattern_name]
for {target_class} where {target_condition}
{versus {contrast_class_i}
where {contrast_condition_i}}
analyze {measure(s)}
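For example, a user may define big spenders as customers who purchase items that cost $100 or more on average, and budget spenders as customers who purchase items that cost less than $100 on average. The mining of discriminant descriptions for customers from each of these categories can be specified in DMQL as: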
mine comparison as purchaseGroups
for bigSpenders where avg(I.price) ≥ $100
versus budgetSpenders where avg(I.price) < $100
analyze count
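The target and contrast conditions in this query can be sketched in Python as follows. The customer names and prices are made up, and `spender_class` is a hypothetical helper; the sketch only shows how the two classes are separated and counted, not how discriminant descriptions are mined.

```python
from statistics import mean

# Hypothetical data: prices of the items each customer purchased.
purchases = {
    "alice": [250.0, 120.0],
    "bob": [40.0, 60.0],
    "carol": [100.0],
    "dave": [99.99, 20.0],
}


def spender_class(prices):
    """Apply the target condition avg(price) >= 100, else the contrast class."""
    return "bigSpenders" if mean(prices) >= 100 else "budgetSpenders"


# The count measure named in the analyze clause, per class.
counts = {"bigSpenders": 0, "budgetSpenders": 0}
for prices in purchases.values():
    counts[spender_class(prices)] += 1

print(counts)
```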
Association
The syntax for association is as follows:
mine associations [ as {pattern_name} ]
{matching {metapattern} }
For example:
mine associations as buyingHabits
matching P(X: customer, W) ^ Q(X, Y) ⇒ buys(X, Z)
where X is a key of the customer relation; P and Q are predicate variables; and W, Y, and Z are object variables.
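One way to read the metapattern is as a template: the predicate variables P and Q are bound to concrete attributes of the customer relation, and each binding yields a candidate rule shape to test against the data. The Python sketch below enumerates such bindings. The attribute list and the helper `instantiate` are hypothetical, and the sketch assumes P and Q must be distinct attributes whose order does not matter.

```python
from itertools import combinations

# Hypothetical attributes of the customer relation that P and Q may be bound to.
attributes = ["age", "income", "occupation"]


def instantiate(attrs):
    """Bind P and Q to each unordered pair of distinct attributes."""
    return [
        f"{p}(X, W) ^ {q}(X, Y) => buys(X, Z)"
        for p, q in combinations(attrs, 2)
    ]


for candidate in instantiate(attributes):
    print(candidate)
```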
Syntax for Concept Hierarchy Specification
To specify a concept hierarchy, use the following syntax:
use hierarchy <hierarchy> for <attribute_or_dimension>
Different syntax is used to define different types of hierarchies, for example:
-schema hierarchies
define hierarchy time_hierarchy on date as [date, month, quarter, year]
-set-grouping hierarchies
define hierarchy age_hierarchy for age on customer as
level1: {young, middle_aged, senior} < level0: all
level2: {20, ..., 39} < level1: young
level2: {40, ..., 59} < level1: middle_aged
level2: {60, ..., 89} < level1: senior
-operation-derived hierarchies
define hierarchy age_hierarchy for age on customer as
{age_category(1), ..., age_category(5)}
:= cluster(default, age, 5) < all(age)
-rule-based hierarchies
define hierarchy profit_margin_hierarchy on item as
level_1: low_profit_margin < level_0: all
if (price - cost) < $50
level_1: medium_profit_margin < level_0: all
if ((price - cost) > $50) and ((price - cost) ≤ $250)
level_1: high_profit_margin < level_0: all
if (price - cost) > $250
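The set-grouping and rule-based hierarchies above amount to mapping a low-level value to a higher-level concept. The Python sketch below shows that mapping for the age and profit-margin examples. The function names are hypothetical, and the sketch assumes that high_profit_margin covers margins above $250. It follows the rules literally, so a margin of exactly $50 matches no level.

```python
def age_level1(age):
    """Map an age to its level-1 concept in the set-grouping hierarchy."""
    if 20 <= age <= 39:
        return "young"
    if 40 <= age <= 59:
        return "middle_aged"
    if 60 <= age <= 89:
        return "senior"
    return None  # ages outside 20..89 are not covered by the hierarchy


def profit_margin_level1(price, cost):
    """Map an item to its level-1 concept in the rule-based hierarchy."""
    margin = price - cost
    if margin < 50:
        return "low_profit_margin"
    if 50 < margin <= 250:
        return "medium_profit_margin"
    if margin > 250:  # assumed upper rule for high_profit_margin
        return "high_profit_margin"
    return None  # a margin of exactly 50 matches none of the rules as written


print(age_level1(45), profit_margin_level1(400, 100))
```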
Syntax for Interestingness Measures Specification
Users can specify interestingness measures and their thresholds with the following statement:
with <interest_measure_name> threshold = threshold_value
For example:
with support threshold = 0.05
with confidence threshold = 0.7
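To make these two thresholds concrete, the Python sketch below applies the standard definitions of support and confidence to a candidate association rule and keeps the rule only if both thresholds are met. The transactions and the helper names are hypothetical.

```python
# Hypothetical transactions, each a set of purchased items.
transactions = [
    {"computer", "printer"},
    {"computer", "printer", "scanner"},
    {"computer", "printer"},
    {"printer"},
]

MIN_SUPPORT = 0.05      # with support threshold = 0.05
MIN_CONFIDENCE = 0.7    # with confidence threshold = 0.7


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent):
    """Support of the whole rule divided by support of its antecedent."""
    return support(antecedent | consequent) / support(antecedent)


def passes(antecedent, consequent):
    """Keep a rule only if it meets both thresholds."""
    return (support(antecedent | consequent) >= MIN_SUPPORT
            and confidence(antecedent, consequent) >= MIN_CONFIDENCE)


print(passes({"computer"}, {"printer"}))   # every computer buyer bought a printer
print(passes({"computer"}, {"scanner"}))   # only one in three bought a scanner
```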
Syntax for Pattern Presentation and Visualization Specification
The following syntax lets users specify one or more forms in which the discovered patterns are displayed.
display as <result_form>
For example:
display as table
Full Specification of DMQL
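As the marketing manager of a company, you would like to characterize the buying habits of customers who purchase items priced at no less than $100, with respect to the customer's age, the type of item purchased, and the place where the item was purchased. You would like to know the percentage of customers having that characteristic. In particular, you are only interested in purchases made in Canada and paid for with an American Express credit card. You would like to view the resulting descriptions in the form of a table.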
use database AllElectronics_db
use hierarchy location_hierarchy for B.address
mine characteristics as customerPurchasing
analyze count%
in relevance to C.age, I.type, I.place_made
from customer C, item I, purchase P, items_sold S, branch B
where I.item_ID = S.item_ID and P.cust_ID = C.cust_ID and
P.method_paid = "AmEx" and B.address = "Canada" and I.price ≥ 100
with noise threshold = 5%
display as table