Data Mining: A Concise Tutorial

Data Mining - Tasks

Data mining deals with the kinds of patterns that can be mined. Based on the kind of data to be mined, the functions involved in data mining fall into two categories −

  1. Descriptive

  2. Classification and Prediction

Descriptive Function

The descriptive function deals with the general properties of the data in the database. The descriptive functions are −

  1. Class/Concept Description

  2. Mining of Frequent Patterns

  3. Mining of Associations

  4. Mining of Correlations

  5. Mining of Clusters

Class/Concept Description

Class/Concept refers to the data associated with classes or concepts. For example, in a company, the classes of items for sale include computers and printers, and the concepts of customers include big spenders and budget spenders. Descriptions of such a class or concept are called class/concept descriptions. They can be derived in the following two ways −

  1. Data Characterization − This refers to summarizing the data of the class under study. The class under study is called the Target Class.

  2. Data Discrimination − This refers to comparing the general features of the target class with those of one or more predefined contrasting groups or classes.
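
As a rough illustration, characterization can be read as summarizing the target class, and discrimination as setting that summary against a contrasting class. The customer records, the function names, and the choice of the average spend as the summary are all assumptions made for this sketch.

```python
from statistics import mean

# Hypothetical customer records: (customer concept, yearly spend).
customers = [
    ("big spender", 5200), ("big spender", 4800),
    ("budget spender", 900), ("budget spender", 1100),
]

def characterize(records, target_class):
    """Summarize the target class: record count and average spend."""
    spends = [spend for label, spend in records if label == target_class]
    return {"count": len(spends), "avg_spend": mean(spends)}

def discriminate(records, target_class, contrasting_class):
    """Difference in average spend between the target and a contrasting class."""
    target = characterize(records, target_class)
    contrast = characterize(records, contrasting_class)
    return target["avg_spend"] - contrast["avg_spend"]

print(characterize(customers, "big spender"))
print(discriminate(customers, "big spender", "budget spender"))
```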

Mining of Frequent Patterns

Frequent patterns are patterns that occur frequently in transactional data. The kinds of frequent patterns are −

  1. Frequent Item Set − It refers to a set of items that frequently appear together, for example, milk and bread.

  2. Frequent Subsequence − A sequence of patterns that occurs frequently, such as the purchase of a camera followed by the purchase of a memory card.

  3. Frequent Substructure − Substructure refers to different structural forms, such as graphs, trees, or lattices, which may be combined with item sets or subsequences.
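
The frequent item set idea can be sketched by counting how often pairs of items appear in the same transaction. The transactions, the `frequent_pairs` name, and the `min_support` threshold are assumptions for illustration; real miners such as Apriori handle item sets of any size.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; each is the set of items in one purchase.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "bread", "biscuits"},
]

def frequent_pairs(baskets, min_support):
    """Return item pairs whose support (fraction of baskets) is >= min_support."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    total = len(baskets)
    return {pair: n / total for pair, n in counts.items() if n / total >= min_support}

print(frequent_pairs(transactions, min_support=0.5))
```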

Mining of Associations

Associations are used in retail sales to identify items that are frequently purchased together. Association mining is the process of uncovering relationships among data and determining association rules.

For example, a retailer generates an association rule showing that milk is sold with bread 70% of the time, while biscuits are sold with bread only 30% of the time.
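
One way to read the 70% and 30% figures is as the confidence of the rules "bread, then milk" and "bread, then biscuits". The sketch below assumes that reading; the baskets are made up so that the numbers match the example.

```python
# Hypothetical baskets: 10 contain bread; 7 of those also contain milk
# and the other 3 contain biscuits.
baskets = [{"bread", "milk"}] * 7 + [{"bread", "biscuits"}] * 3

def confidence(baskets, antecedent, consequent):
    """Fraction of baskets containing `antecedent` that also contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent <= b]
    if not with_antecedent:
        return 0.0
    both = [b for b in with_antecedent if consequent <= b]
    return len(both) / len(with_antecedent)

print(confidence(baskets, {"bread"}, {"milk"}))
print(confidence(baskets, {"bread"}, {"biscuits"}))
```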

Mining of Correlations

Correlation mining is an additional analysis performed to uncover interesting statistical correlations between associated attribute-value pairs or between two item sets, in order to determine whether they have a positive, negative, or no effect on each other.
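
The text does not name a specific correlation measure. Lift is one commonly used choice, and the sketch below uses it as an assumption: a value above 1 suggests a positive correlation, below 1 a negative one, and close to 1 no effect. The baskets are hypothetical.

```python
def lift(baskets, a, b):
    """Lift of items a and b: P(a and b) / (P(a) * P(b)).

    Assumes both items occur in at least one basket.
    """
    n = len(baskets)
    p_a = sum(a in basket for basket in baskets) / n
    p_b = sum(b in basket for basket in baskets) / n
    p_ab = sum(a in basket and b in basket for basket in baskets) / n
    return p_ab / (p_a * p_b)

# Hypothetical baskets: milk and bread always appear together,
# milk and tea never do.
baskets = [{"milk", "bread"}, {"milk", "bread"}, {"tea"}, {"tea"}]

print(lift(baskets, "milk", "bread"))  # above 1: positive correlation
print(lift(baskets, "milk", "tea"))    # below 1: negative correlation
```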

Mining of Clusters

A cluster is a group of similar objects. Cluster analysis refers to forming groups of objects that are very similar to each other but highly different from the objects in other clusters.
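
A minimal sketch of cluster analysis, assuming one-dimensional data and the k-means procedure (the text does not prescribe an algorithm). The spend values and the initial centers are invented for the example.

```python
def k_means_1d(values, centers, iterations=10):
    """Tiny 1-D k-means: assign each value to its nearest center, then
    move each center to the mean of the values assigned to it."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters

# Hypothetical yearly spends; the initial centers are arbitrary guesses.
spends = [900, 1000, 1100, 4800, 5000, 5200]
print(k_means_1d(spends, centers=[0, 6000]))
```

Objects within each resulting group are close to one another and far from the other group, which is the property the definition asks for.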

Classification and Prediction

Classification is the process of finding a model that describes data classes or concepts. The purpose is to use this model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data and can be presented in the following forms −

  1. Classification (IF-THEN) Rules

  2. Decision Trees

  3. Mathematical Formulae

  4. Neural Networks
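
To make the IF-THEN form concrete, the sketch below derives a single rule from labeled training data and applies it to an object with an unknown label. The training records, the two class names, and the midpoint-of-means rule are assumptions chosen for brevity, not a recommended learning method.

```python
# Hypothetical training data: (yearly spend, known class label).
training = [(900, "budget"), (1100, "budget"), (4800, "big"), (5200, "big")]

def learn_threshold(samples):
    """Derive a one-rule model: the midpoint between the two class means."""
    budget = [x for x, label in samples if label == "budget"]
    big = [x for x, label in samples if label == "big"]
    return (sum(budget) / len(budget) + sum(big) / len(big)) / 2

def classify(spend, threshold):
    """IF spend >= threshold THEN 'big' ELSE 'budget'."""
    return "big" if spend >= threshold else "budget"

threshold = learn_threshold(training)
print(threshold)
print(classify(4000, threshold))
print(classify(1500, threshold))
```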

The functions involved in these processes are as follows −

  1. Classification − It predicts the class of objects whose class label is unknown. Its objective is to find a derived model that describes and distinguishes data classes or concepts. The derived model is based on the analysis of a set of training data, i.e. data objects whose class labels are known.

  2. Prediction − It is used to predict missing or unavailable numerical data values rather than class labels. Regression analysis is generally used for prediction. Prediction can also be used to identify distribution trends based on the available data.

  3. Outlier Analysis − Outliers are data objects that do not comply with the general behavior or model of the available data.

  4. Evolution Analysis − Evolution analysis describes and models regularities or trends for objects whose behavior changes over time.
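
Two of these functions lend themselves to a short sketch: prediction of a missing numerical value by least-squares regression, and outlier analysis by distance from the mean. The data, the function names, and the cutoff of 2 standard deviations are assumptions for illustration only.

```python
from statistics import mean, pstdev

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar

def outliers(values, z_threshold=2.0):
    """Values more than z_threshold standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    return [v for v in values if sigma and abs(v - mu) / sigma > z_threshold]

# Hypothetical monthly sales with month 4 missing; predict it from the trend.
months, sales = [1, 2, 3, 5], [10, 20, 30, 50]
slope, intercept = fit_line(months, sales)
print(slope * 4 + intercept)

# Hypothetical readings with one value far from the rest.
print(outliers([10, 11, 9, 10, 12, 10, 50]))
```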

Data Mining Task Primitives

  1. We can specify a data mining task in the form of a data mining query.

  2. This query is input to the system.

  3. A data mining query is defined in terms of data mining task primitives.

Note − These primitives allow us to communicate with the data mining system interactively. The data mining task primitives are −

  1. Set of task relevant data to be mined.

  2. Kind of knowledge to be mined.

  3. Background knowledge to be used in discovery process.

  4. Interestingness measures and thresholds for pattern evaluation.

  5. Representation for visualizing the discovered patterns.
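
One way to picture a data mining query is as a record with one field per primitive. The `MiningQuery` class, its field names, and the sample values below are hypothetical; they do not correspond to any particular query language.

```python
from dataclasses import dataclass, field

@dataclass
class MiningQuery:
    """Hypothetical container with one field per task primitive."""
    relevant_data: list            # attributes or dimensions of interest
    knowledge_kind: str            # e.g. "association", "classification"
    background_knowledge: dict = field(default_factory=dict)  # e.g. concept hierarchies
    thresholds: dict = field(default_factory=dict)            # interestingness measures
    presentation: str = "rules"    # how discovered patterns are displayed

query = MiningQuery(
    relevant_data=["item", "customer_city"],
    knowledge_kind="association",
    background_knowledge={"city_to_country": {"Paris": "France"}},
    thresholds={"min_support": 0.5, "min_confidence": 0.7},
    presentation="table",
)
print(query)
```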

Set of task relevant data to be mined

This is the portion of the database in which the user is interested. It includes the following −

  1. Database Attributes

  2. Data Warehouse dimensions of interest

Kind of knowledge to be mined

It refers to the kind of functions to be performed. These functions are −

  1. Characterization

  2. Discrimination

  3. Association and Correlation Analysis

  4. Classification

  5. Prediction

  6. Clustering

  7. Outlier Analysis

  8. Evolution Analysis

Background knowledge

Background knowledge allows data to be mined at multiple levels of abstraction. Concept hierarchies, for example, are one form of background knowledge that makes this possible.
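
A concept hierarchy can be sketched as a mapping from a lower level of abstraction to a higher one. The city-to-country mapping, the sales records, and the `roll_up` helper are assumptions made for this example.

```python
from collections import Counter

# Hypothetical concept hierarchy: city -> country.
city_to_country = {"Paris": "France", "Lyon": "France", "Berlin": "Germany"}

# Hypothetical sales records: (city, units sold).
sales = [("Paris", 5), ("Lyon", 3), ("Berlin", 4)]

def roll_up(records, hierarchy):
    """Aggregate units from the lower level (city) to the higher level (country)."""
    totals = Counter()
    for city, units in records:
        totals[hierarchy[city]] += units
    return dict(totals)

print(roll_up(sales, city_to_country))
```

The same records can then be mined at the city level or at the country level, depending on which level of the hierarchy is used.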

Interestingness measures and thresholds for pattern evaluation

These are used to evaluate the patterns discovered by the knowledge discovery process. Different kinds of knowledge have different interestingness measures.
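
For association rules, support and confidence are commonly used interestingness measures. The sketch below assumes those two measures and filters a made-up list of rules against user-chosen thresholds.

```python
# Hypothetical discovered rules with their measures.
rules = [
    {"rule": "bread -> milk", "support": 0.6, "confidence": 0.7},
    {"rule": "bread -> biscuits", "support": 0.2, "confidence": 0.3},
]

def interesting(rules, min_support, min_confidence):
    """Keep only the rules that meet both thresholds."""
    return [
        r["rule"] for r in rules
        if r["support"] >= min_support and r["confidence"] >= min_confidence
    ]

print(interesting(rules, min_support=0.5, min_confidence=0.6))
```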

Representation for visualizing the discovered patterns

This refers to the form in which the discovered patterns are to be displayed. These representations may include the following −

  1. Rules

  2. Tables

  3. Charts

  4. Graphs

  5. Decision Trees

  6. Cubes