Data Mining: A Concise Tutorial
Data Mining - Query Language
The Data Mining Query Language (DMQL) was proposed by Han, Fu, Wang, et al. for the DBMiner data mining system. DMQL is based on the Structured Query Language (SQL) and is designed to support ad hoc and interactive data mining. It provides commands for specifying data mining primitives, works with both databases and data warehouses, and can be used to define data mining tasks. In particular, we examine how to define data warehouses and data marts in DMQL.
Syntax for Task-Relevant Data Specification
Here is the DMQL syntax for specifying task-relevant data −
use database database_name
or
use data warehouse data_warehouse_name
in relevance to att_or_dim_list
from relation(s)/cube(s) [where condition]
order by order_list
group by grouping_list
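To make the roles of these clauses concrete, here is a minimal Python sketch that emulates "in relevance to", "from", "where", and "order by" over an in-memory list of rows. It is not DMQL and not part of DBMiner; the `customers` rows, the column names, and the `task_relevant` helper are all hypothetical and exist only for illustration.

```python
from operator import itemgetter

# Hypothetical toy relation; rows and column names are illustrative only.
customers = [
    {"name": "Ann", "age": 34, "income": 52000, "country": "Canada"},
    {"name": "Bob", "age": 51, "income": 71000, "country": "USA"},
    {"name": "Cy", "age": 27, "income": 39000, "country": "Canada"},
]

def task_relevant(rows, attributes, condition, order_by):
    """Emulate: in relevance to <attributes> from <rows> where <condition> order by <order_by>."""
    selected = [row for row in rows if condition(row)]            # where
    selected.sort(key=itemgetter(order_by))                       # order by
    return [{a: row[a] for a in attributes} for row in selected]  # in relevance to

result = task_relevant(
    customers,
    attributes=["age", "income"],
    condition=lambda row: row["country"] == "Canada",
    order_by="age",
)
print(result)
```

Only the attributes named in the "in relevance to" list survive into the result, which is the point of the clause: it narrows the data the mining step has to consider.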
Syntax for Specifying the Kind of Knowledge
This section covers the syntax for characterization, discrimination, association, classification, and prediction.
Characterization
The syntax for characterization is −
mine characteristics [as pattern_name]
analyze {measure(s)}
The analyze clause specifies aggregate measures, such as count, sum, or count%.
For example, the following statement describes customer purchasing habits −
mine characteristics as customerPurchasing
analyze count%
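As a rough illustration of what the count% measure reports, the sketch below computes each distinct tuple's share of the total. The `purchases` tuples and the `count_percent` helper are hypothetical; a real system would first generalize the raw data before counting.

```python
from collections import Counter

# Hypothetical generalized tuples (age group, item type); illustrative only.
purchases = [
    ("young", "computer"),
    ("young", "computer"),
    ("senior", "camera"),
    ("middle_aged", "computer"),
]

def count_percent(tuples):
    """Return each distinct tuple's share of the total, as a percentage."""
    counts = Counter(tuples)
    total = sum(counts.values())
    return {key: 100.0 * n / total for key, n in counts.items()}

shares = count_percent(purchases)
for key, pct in shares.items():
    print(key, f"{pct:.1f}%")
```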
Discrimination
The syntax for discrimination is −
mine comparison [as pattern_name]
for target_class where target_condition
{versus contrast_class_i
where contrast_condition_i}
analyze {measure(s)}
For example, a user may define big spenders as customers who purchase items costing $100 or more on average, and budget spenders as customers who purchase items costing less than $100 on average. Mining discriminant descriptions for customers in each of these categories can be specified in DMQL as −
mine comparison as purchaseGroups
for bigSpenders where avg(I.price) ≥ $100
versus budgetSpenders where avg(I.price) < $100
analyze count
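The two class conditions in this query can be sketched in Python as a split on the average item price per customer, followed by a count per class. The `prices_by_customer` data and the `split_spenders` helper are hypothetical and only illustrate how the target and contrast classes are separated; they do not perform the actual discriminant mining.

```python
from statistics import mean

# Hypothetical purchase history: customer -> prices of items bought.
prices_by_customer = {
    "c1": [120.0, 250.0],
    "c2": [15.0, 40.0, 95.0],
    "c3": [100.0],
    "c4": [99.0, 100.0],
}

def split_spenders(history, threshold=100.0):
    """Split customers into (target, contrast) by average item price."""
    target, contrast = [], []
    for customer, prices in history.items():
        # avg(I.price) >= threshold -> target class, otherwise contrast class
        (target if mean(prices) >= threshold else contrast).append(customer)
    return target, contrast

big, budget = split_spenders(prices_by_customer)
class_counts = {"bigSpenders": len(big), "budgetSpenders": len(budget)}
print(class_counts)
```

A customer whose average is exactly $100 falls into the target class, matching the ≥ in the query.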
Association
The syntax for association is −
mine associations [as pattern_name]
[matching metapattern]
For example −
mine associations as buyingHabits
matching P(X: customer, W) ∧ Q(X, Y) ⇒ buys(X, Z)
where X is a key of the customer relation; P and Q are predicate variables; and W, Y, and Z are object variables.
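One way to read the metapattern is as a template whose predicate variables P and Q get bound to concrete attributes before any rule is evaluated. The sketch below enumerates such candidate rules, assuming P and Q range over distinct attributes of the customer relation; the attribute names and the `instantiate_metapattern` helper are hypothetical.

```python
from itertools import combinations

# Hypothetical attributes of the customer relation that may fill P and Q.
customer_predicates = ["age", "income", "occupation"]

def instantiate_metapattern(predicates):
    """List candidate rules of the form P(X, W) ^ Q(X, Y) => buys(X, Z)."""
    return [
        f"{p}(X, W) ^ {q}(X, Y) => buys(X, Z)"
        for p, q in combinations(predicates, 2)
    ]

candidates = instantiate_metapattern(customer_predicates)
for rule in candidates:
    print(rule)
```

The metapattern restricts the search to rules of this one shape, so a miner only has to test the instantiated candidates against the data rather than every possible rule.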
Classification
The syntax for classification is −
mine classification [as pattern_name]
analyze classifying_attribute_or_dimension
For example, to mine patterns that classify customer credit rating, where the classes are determined by the attribute credit_rating and the pattern is named classifyCustomerCreditRating −
mine classification as classifyCustomerCreditRating
analyze credit_rating
Syntax for Concept Hierarchy Specification
To specify concept hierarchies, use the following syntax −
use hierarchy <hierarchy> for <attribute_or_dimension>
Different types of hierarchies are defined with different syntax, for example −
- schema hierarchies
define hierarchy time_hierarchy on date as [date, month, quarter, year]
- set-grouping hierarchies
define hierarchy age_hierarchy for age on customer as
level1: {young, middle_aged, senior} < level0: all
level2: {20, ..., 39} < level1: young
level2: {40, ..., 59} < level1: middle_aged
level2: {60, ..., 89} < level1: senior
- operation-derived hierarchies
define hierarchy age_hierarchy for age on customer as
{age_category(1), ..., age_category(5)}
:= cluster(default, age, 5) < all(age)
- rule-based hierarchies
define hierarchy profit_margin_hierarchy on item as
level_1: low_profit_margin < level_0: all
if (price - cost) < $50
level_1: medium_profit_margin < level_0: all
if ((price - cost) > $50) and ((price - cost) ≤ $250)
level_1: high_profit_margin < level_0: all
if (price - cost) > $250
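The hierarchy definitions above can be read as mappings from a low-level value to a higher-level concept. The following Python sketch mirrors the schema, set-grouping, and rule-based examples; the function names are hypothetical, and the high_profit_margin rule is assumed to cover margins above $250.

```python
from datetime import date

def time_levels(d):
    """Schema hierarchy: date < month < quarter < year."""
    return {
        "date": d.isoformat(),
        "month": d.month,
        "quarter": (d.month - 1) // 3 + 1,
        "year": d.year,
    }

def age_group(age):
    """Set-grouping hierarchy: map an age to its level-1 concept."""
    if 20 <= age <= 39:
        return "young"
    if 40 <= age <= 59:
        return "middle_aged"
    if 60 <= age <= 89:
        return "senior"
    return None  # ages outside 20..89 are not covered by the hierarchy

def profit_margin_level(price, cost):
    """Rule-based hierarchy over the margin (price - cost)."""
    margin = price - cost
    if margin < 50:
        return "low_profit_margin"
    if 50 < margin <= 250:
        return "medium_profit_margin"
    if margin > 250:  # assumed condition for the high level
        return "high_profit_margin"
    return None  # a margin of exactly 50 matches none of the stated rules

print(time_levels(date(2024, 5, 17)))
print(age_group(45), profit_margin_level(400, 100))
```

Writing the rules out this way exposes a gap in the rule-based example: a margin of exactly $50 satisfies neither the low nor the medium condition as stated.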
Syntax for Interestingness Measures Specification
The user can specify interestingness measures and their thresholds with the following statement −
with <interest_measure_name> threshold = threshold_value
For example −
with support threshold = 0.05
with confidence threshold = 0.7
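To show how such thresholds filter results, here is a small Python sketch that computes support and confidence for an association rule and keeps it only if both meet the thresholds from the example above. The `transactions` data and the helper names are hypothetical, and the definitions used are the standard ones for support and confidence rather than anything specific to DBMiner.

```python
# Hypothetical market-basket transactions; illustrative only.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """support(antecedent and consequent) / support(antecedent)."""
    base = support(antecedent, baskets)
    if base == 0:
        return 0.0  # rule never applies, so treat its confidence as zero
    return support(antecedent | consequent, baskets) / base

def is_interesting(antecedent, consequent, baskets,
                   min_support=0.05, min_confidence=0.7):
    """Keep a rule only if it meets both thresholds."""
    return (support(antecedent | consequent, baskets) >= min_support
            and confidence(antecedent, consequent, baskets) >= min_confidence)

print(is_interesting({"software"}, {"computer"}, transactions))
print(is_interesting({"computer"}, {"software"}, transactions))
```

In this toy data both rules have the same support, but only the rule from software to computer clears the confidence threshold, which is why the two measures are specified separately.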
Syntax for Pattern Presentation and Visualization Specification
The following syntax lets users specify how discovered patterns are displayed, in one or more forms −
display as <result_form>
For example −
display as table
Full Specification of DMQL
As a market manager of a company, you would like to characterize the buying habits of customers who purchase items priced at no less than $100, with respect to the customer's age, the type of item purchased, and the place where the item was made. You would like to know the percentage of customers having that characteristic. In particular, you are only interested in purchases made in Canada and paid for with an American Express credit card. You would like to view the resulting descriptions in the form of a table.
use database AllElectronics_db
use hierarchy location_hierarchy for B.address
mine characteristics as customerPurchasing
analyze count%
in relevance to C.age, I.type, I.place_made
from customer C, item I, purchase P, items_sold S, branch B
where I.item_ID = S.item_ID and P.cust_ID = C.cust_ID and
P.method_paid = "AmEx" and B.address = "Canada" and I.price ≥ 100
with noise threshold = 5%
display as table
Data Mining Languages Standardization
Standardizing data mining languages serves the following purposes −
- Helps the systematic development of data mining solutions.
- Improves interoperability among multiple data mining systems and functions.
- Promotes education and rapid learning.
- Promotes the use of data mining systems in industry and society.