Data Mining: A Concise Tutorial

Data Mining - Query Language

The Data Mining Query Language (DMQL) was proposed by Han, Fu, Wang, et al. for the DBMiner data mining system. It is based on the Structured Query Language (SQL) and is designed to support ad hoc and interactive data mining. DMQL provides commands for specifying data mining primitives, works with both databases and data warehouses, and can be used to define data mining tasks. In particular, we examine how to define data warehouses and data marts in DMQL.

Syntax for Task-Relevant Data Specification

Here is the DMQL syntax for specifying task-relevant data:

use database database_name

or

use data warehouse data_warehouse_name
in relevance to att_or_dim_list
from relation(s)/cube(s) [where condition]
order by order_list
group by grouping_list

Syntax for Specifying the Kind of Knowledge

Here we discuss the syntax for characterization, discrimination, association, classification, and prediction.

Characterization

The syntax for characterization is:

mine characteristics [as pattern_name]
   analyze {measure(s)}

The analyze clause specifies aggregate measures such as count, sum, or count%.

For example, the following query describes customer purchasing habits:
mine characteristics as customerPurchasing
analyze count%

Discrimination

The syntax for discrimination is:

mine comparison [as {pattern_name}]
for {target_class} where {target_condition}
{versus {contrast_class_i}
where {contrast_condition_i}}
analyze {measure(s)}

For example, a user may define big spenders as customers who purchase items that cost $100 or more on average, and budget spenders as customers who purchase items that cost less than $100 on average. Mining discriminant descriptions for the customers in each of these categories can be specified in DMQL as:

mine comparison as purchaseGroups
for bigSpenders where avg(I.price) ≥ $100
versus budgetSpenders where avg(I.price) < $100
analyze count
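The comparison above can be pictured with a small Python sketch. It is only an illustration of what the query asks for, not how DBMiner evaluates it: the purchase records, the function name compare_spenders, and the cutoff parameter are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical purchase records: (customer_id, item_price). Illustrative data only.
purchases = [
    ("c1", 250.0), ("c1", 120.0),
    ("c2", 40.0), ("c2", 80.0),
    ("c3", 100.0),
    ("c4", 15.0),
]


def compare_spenders(records, cutoff=100.0):
    """Count customers in the target and contrast classes by average item price."""
    prices = defaultdict(list)
    for customer_id, price in records:
        prices[customer_id].append(price)

    counts = {"bigSpenders": 0, "budgetSpenders": 0}
    for customer_prices in prices.values():
        average = sum(customer_prices) / len(customer_prices)
        # avg(I.price) >= $100 is the target class; below $100 is the contrast class.
        label = "bigSpenders" if average >= cutoff else "budgetSpenders"
        counts[label] += 1
    return counts


purchase_groups = compare_spenders(purchases)
```

With this sample data, purchase_groups holds the count measure for each class, which is what the analyze count clause requests.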

Association

The syntax for association is:

mine associations [ as {pattern_name} ]
{matching {metapattern} }

For example:

mine associations as buyingHabits
matching P(X:customer, W) ^ Q(X, Y) ⇒ buys(X, Z)

where X is the key of the customer relation; P and Q are predicate variables; and W, Y, and Z are object variables.

Classification

The syntax for classification is:

mine classification [as pattern_name]
analyze classifying_attribute_or_dimension

For example, to mine patterns that classify customer credit rating, where the classes are determined by the attribute credit_rating and the resulting pattern is named classifyCustomerCreditRating:

mine classification as classifyCustomerCreditRating
analyze credit_rating

Prediction

The syntax for prediction is:

mine prediction [as pattern_name]
analyze prediction_attribute_or_dimension
{set {attribute_or_dimension_i= value_i}}

Syntax for Concept Hierarchy Specification

To specify a concept hierarchy, use the following syntax:

use hierarchy <hierarchy> for <attribute_or_dimension>

Different types of hierarchies are defined with different syntax, for example:

- schema hierarchies
define hierarchy time_hierarchy on date as [date, month, quarter, year]

- set-grouping hierarchies
define hierarchy age_hierarchy for age on customer as
level1: {young, middle_aged, senior} < level0: all
level2: {20, ..., 39} < level1: young
level2: {40, ..., 59} < level1: middle_aged
level2: {60, ..., 89} < level1: senior
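
To make the set-grouping idea concrete, here is a minimal Python sketch that rolls a raw age up through the hierarchy. The names AGE_GROUPS, generalize_age, and roll_up are hypothetical and chosen for this example; ages outside 20 to 89 are not covered by the definition, so the sketch maps them to None.

```python
# Age ranges taken from the set-grouping definition above. Ages outside
# 20..89 are not covered by that definition, so this sketch returns None for them.
AGE_GROUPS = {
    "young": range(20, 40),
    "middle_aged": range(40, 60),
    "senior": range(60, 90),
}


def generalize_age(age):
    """Climb one level of the hierarchy: raw age -> age group."""
    for group, ages in AGE_GROUPS.items():
        if age in ages:
            return group
    return None


def roll_up(age):
    """Return the path from the raw value up to the root concept 'all'."""
    group = generalize_age(age)
    return [age, group, "all"] if group is not None else [age, "all"]
```

For instance, roll_up(45) walks from the raw value through middle_aged to all, mirroring the order of the levels from the most specific to the most general.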

- operation-derived hierarchies
define hierarchy age_hierarchy for age on customer as
{age_category(1), ..., age_category(5)}
:= cluster(default, age, 5) < all(age)

- rule-based hierarchies
define hierarchy profit_margin_hierarchy on item as
level_1: low_profit_margin < level_0: all
   if (price - cost) < $50
level_1: medium_profit_margin < level_0: all
   if ((price - cost) > $50) and ((price - cost) ≤ $250)
level_1: high_profit_margin < level_0: all
   if (price - cost) > $250
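
A rule-based hierarchy is essentially a set of conditions that assign each item to a concept. The Python sketch below shows one way to read the profit margin rules. The function name profit_margin_level is hypothetical, and the cutoff for the high band (a margin above $250) is assumed from the upper bound of the medium band.

```python
def profit_margin_level(price, cost):
    """Map an item to a level_1 concept using the profit margin rules."""
    margin = price - cost
    if margin < 50:
        return "low_profit_margin"
    if 50 < margin <= 250:
        return "medium_profit_margin"
    if margin > 250:
        # Assumed cutoff: everything above the medium band counts as high.
        return "high_profit_margin"
    # A margin of exactly $50 matches none of the rules as written.
    return None
```

Note that the rules leave a margin of exactly $50 unassigned, so the sketch returns None in that case rather than guessing a band.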

Syntax for Interestingness Measures Specification

The user can specify interestingness measures and their thresholds with the statement:

with <interest_measure_name>  threshold = threshold_value

For example:

with support threshold = 0.05
with confidence threshold = 0.7
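
To illustrate what these two thresholds filter on, the following Python sketch computes support and confidence for an association rule over a handful of transactions. The transaction data and the function names support, confidence, and is_interesting are assumptions made for this example; the default thresholds match the statements above.

```python
# Hypothetical transactions, each a set of purchased items. Illustrative only.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """support(antecedent and consequent) divided by support(antecedent)."""
    antecedent_support = support(antecedent, baskets)
    if antecedent_support == 0:
        return 0.0
    return support(antecedent | consequent, baskets) / antecedent_support


def is_interesting(antecedent, consequent, baskets,
                   min_support=0.05, min_confidence=0.7):
    """Keep a rule only if it meets both the support and confidence thresholds."""
    if support(antecedent | consequent, baskets) < min_support:
        return False
    return confidence(antecedent, consequent, baskets) >= min_confidence
```

In this sample, the rule software ⇒ computer passes both thresholds, while computer ⇒ software has enough support but falls short of the 0.7 confidence threshold.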

Syntax for Pattern Presentation and Visualization Specification

The following syntax lets users specify one or more forms in which the discovered patterns are displayed:

display as <result_form>

For example:

display as table

Full Specification of DMQL

Suppose that, as a market manager of a company, you would like to characterize the buying habits of customers who purchase items priced at no less than $100, with respect to the customer's age, the type of item purchased, and the place where the item was made. You would like to know the percentage of customers having that characteristic. In particular, you are only interested in purchases made in Canada and paid for with an American Express credit card. You would like to view the resulting descriptions in the form of a table.

use database AllElectronics_db
use hierarchy location_hierarchy for B.address
mine characteristics as customerPurchasing
analyze count%
in relevance to C.age, I.type, I.place_made
from customer C, item I, purchase P, items_sold S, branch B
where I.item_ID = S.item_ID and P.cust_ID = C.cust_ID and
P.method_paid = "AmEx" and B.address = "Canada" and I.price ≥ 100
with noise threshold = 5%
display as table
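
As a rough picture of what this query computes, the Python sketch below filters a few already-joined purchase rows and reports each group's share of the matching rows. Everything in it is an assumption for illustration: the Purchase record, the sample rows, and the reading of count% as a percentage of the task-relevant rows. The concept hierarchy and the noise threshold are left out to keep the sketch short.

```python
from collections import Counter, namedtuple

# Hypothetical, already-joined rows; the field names mirror the query above.
Purchase = namedtuple("Purchase", "age type place_made price method_paid country")

rows = [
    Purchase(34, "laptop", "Japan", 900, "AmEx", "Canada"),
    Purchase(34, "laptop", "Japan", 950, "AmEx", "Canada"),
    Purchase(34, "laptop", "Japan", 1200, "AmEx", "Canada"),
    Purchase(52, "camera", "USA", 300, "AmEx", "Canada"),
    Purchase(52, "camera", "USA", 300, "Visa", "Canada"),  # wrong payment method
    Purchase(29, "cable", "China", 20, "AmEx", "Canada"),  # price below 100
]


def characterize(records):
    """Return count% per (age, type, place_made) over the task-relevant rows."""
    relevant = [
        r for r in records
        if r.method_paid == "AmEx" and r.country == "Canada" and r.price >= 100
    ]
    groups = Counter((r.age, r.type, r.place_made) for r in relevant)
    total = sum(groups.values())
    if total == 0:
        return {}
    return {key: 100.0 * count / total for key, count in groups.items()}


customer_purchasing = characterize(rows)
```

Each key of customer_purchasing corresponds to one row of the table that display as table would present, with the percentage as its measure.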

Data Mining Languages Standardization

Standardizing data mining languages serves the following purposes:

Helps systematic development of data mining solutions.
Improves interoperability among multiple data mining systems and functions.
Promotes education and rapid learning.
Promotes the use of data mining systems in industry and society.