Data Mining: A Concise Tutorial

Data Mining - Terminologies

Data Mining

Data mining is defined as the extraction of information from a very large set of data. In other words, data mining is the mining of knowledge from data. The extracted information can be used in any of the following applications:

  1. Market Analysis

  2. Fraud Detection

  3. Customer Retention

  4. Production Control

  5. Science Exploration

Data Mining Engine

The data mining engine is essential to a data mining system. It consists of a set of functional modules that perform the following functions:

  1. Characterization

  2. Association and Correlation Analysis

  3. Classification

  4. Prediction

  5. Cluster Analysis

  6. Outlier Analysis

  7. Evolution Analysis

Knowledge Base

The knowledge base holds the domain knowledge. This knowledge is used to guide the search or to evaluate the interestingness of the resulting patterns.
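As one illustration of how domain knowledge can be used to evaluate interestingness, the sketch below filters association rules against minimum support and confidence thresholds. The `Rule` type, the sample rules, the threshold values, and the function name `interesting_rules` are all hypothetical and chosen only for illustration; support and confidence are one common pair of interestingness measures, not the only possible ones.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    antecedent: str
    consequent: str
    support: float     # fraction of transactions containing both sides
    confidence: float  # fraction of antecedent transactions that also contain the consequent


# Hypothetical thresholds, standing in for knowledge supplied by a domain expert.
MIN_SUPPORT = 0.05
MIN_CONFIDENCE = 0.60


def interesting_rules(rules, min_support=MIN_SUPPORT, min_confidence=MIN_CONFIDENCE):
    """Keep only the rules that meet both thresholds."""
    return [
        r for r in rules
        if r.support >= min_support and r.confidence >= min_confidence
    ]


# Made-up rules for illustration only.
rules = [
    Rule("bread", "butter", support=0.12, confidence=0.75),
    Rule("milk", "batteries", support=0.01, confidence=0.90),  # too rare
    Rule("tea", "sugar", support=0.08, confidence=0.40),       # too unreliable
]
kept = interesting_rules(rules)
```

Here only the bread and butter rule survives: the second rule fails the support threshold and the third fails the confidence threshold.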

Knowledge Discovery

Some people treat data mining as a synonym for knowledge discovery, while others view data mining as an essential step in the knowledge discovery process. The steps involved in the knowledge discovery process are:

  1. Data Cleaning

  2. Data Integration

  3. Data Selection

  4. Data Transformation

  5. Data Mining

  6. Pattern Evaluation

  7. Knowledge Presentation

User Interface

The user interface is the module of a data mining system that supports communication between users and the system. It allows the user to do the following:

  1. Interact with the system by specifying a data mining query task.

  2. Provide information to help focus the search.

  3. Mine further based on intermediate data mining results.

  4. Browse database and data warehouse schemas or data structures.

  5. Evaluate mined patterns.

  6. Visualize the patterns in different forms.

Data Integration

Data integration is a data preprocessing technique that merges data from multiple heterogeneous data sources into a coherent data store. Data integration may involve inconsistent data and therefore needs data cleaning.
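A minimal sketch of the idea, assuming two hypothetical sources (`crm_rows` and `billing_rows`) that describe the same customers under different key names and value types. The field names and the `integrate` helper are illustrative assumptions, not part of any particular system.

```python
# Two hypothetical sources with different schemas for the same entity.
crm_rows = [
    {"customer_id": 1, "name": "Ana", "city": "Lisbon"},
    {"customer_id": 2, "name": "Ben", "city": "Oslo"},
]
billing_rows = [
    {"cust": "1", "total_spent": "120.50"},  # key and amount stored as strings
    {"cust": "3", "total_spent": "40.00"},   # customer unknown to the CRM
]


def integrate(crm, billing):
    """Merge both sources into one store keyed by an integer customer id."""
    store = {}
    for row in crm:
        store[row["customer_id"]] = dict(row)
    for row in billing:
        cid = int(row["cust"])  # reconcile the key name and type
        record = store.setdefault(cid, {"customer_id": cid})
        record["total_spent"] = float(row["total_spent"])
    return store


merged = integrate(crm_rows, billing_rows)
```

After the merge, customer 2 has no `total_spent` and customer 3 has no `name` or `city`. Gaps like these are the kind of inconsistency that a subsequent cleaning step would have to resolve.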

Data Cleaning

Data cleaning is a technique applied to remove noisy data and correct inconsistencies in the data. It involves transformations that correct wrong data. Data cleaning is performed as a data preprocessing step when preparing data for a data warehouse.
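The sketch below shows both aspects on a small scale: dropping noisy records and correcting inconsistent values. The record layout, the plausible age range, and the `COUNTRY_ALIASES` table are hypothetical assumptions made for this example; real cleaning rules depend on the data and the domain.

```python
# Hypothetical mapping from inconsistent spellings to one canonical code.
COUNTRY_ALIASES = {"UNITED STATES": "US", "U.S.": "US", "USA": "US"}


def clean(records, min_age=0, max_age=120):
    """Drop records with missing or implausible ages and normalize country codes."""
    cleaned = []
    for rec in records:
        age = rec.get("age")
        if age is None or not (min_age <= age <= max_age):
            continue  # treat missing or out-of-range ages as noise
        country = rec["country"].strip().upper()
        country = COUNTRY_ALIASES.get(country, country)
        cleaned.append({"age": age, "country": country})
    return cleaned


raw = [
    {"age": 34, "country": " usa "},
    {"age": -5, "country": "US"},             # implausible age
    {"age": 51, "country": "United States"},
    {"age": None, "country": "DE"},           # missing age
]
result = clean(raw)
```

Dropping a record is only one way to handle a noisy value; depending on the task, imputing or flagging the value may be more appropriate.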

Data Selection

Data selection is the process of retrieving the data relevant to the analysis task from the database. Sometimes data transformation and consolidation are performed before the data selection process.
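As a small illustration, the following sketch uses Python's standard `sqlite3` module with an in-memory database. The `sales` table, its columns, and the sample rows are hypothetical; the point is only that selection retrieves the rows and attributes relevant to the task (here, region and amount for one year) and leaves the rest behind.

```python
import sqlite3

# Build a throwaway in-memory database with a hypothetical sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, region TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (1, "north", 2022, 100.0),
        (2, "south", 2023, 250.0),
        (3, "north", 2023, 175.0),
    ],
)

# Select only the columns and rows relevant to an analysis of 2023 sales.
rows = conn.execute(
    "SELECT region, amount FROM sales WHERE year = ? ORDER BY id",
    (2023,),
).fetchall()
conn.close()
```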

Clusters

A cluster is a group of similar objects. Cluster analysis is the process of forming groups of objects that are very similar to each other but very different from the objects in other clusters.
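To make this concrete, here is a minimal sketch that uses k-means, one familiar clustering method picked for illustration; the text above does not prescribe a particular algorithm. The function name `k_means`, the sample points, and the caller-supplied initial centroids are assumptions of this example. Similarity is measured with squared Euclidean distance.

```python
def k_means(points, initial_centroids, max_iter=100):
    """Assign each point to its nearest centroid, then recompute centroids, until stable."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        new_labels = [
            min(
                range(len(centroids)),
                key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
            )
            for point in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for i in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Made-up 2-D points forming two visually separate groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = k_means(points, initial_centroids=[(1.0, 1.0), (8.0, 8.0)])
```

With these starting centroids the first three points end up in one cluster and the last three in the other. The result of k-means depends on the initial centroids, so a poor starting choice can produce a worse grouping.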

Data Transformation

In this step, data is transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.
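A minimal sketch of an aggregation operation, assuming hypothetical daily sales rows of the form (ISO date, store, amount). The `monthly_totals` helper and the sample data are illustrative only; they show daily records being consolidated into per-month, per-store totals.

```python
from collections import defaultdict

# Hypothetical daily sales records: (ISO date, store, amount).
daily_sales = [
    ("2023-01-03", "store_a", 120.0),
    ("2023-01-17", "store_a", 80.0),
    ("2023-01-09", "store_b", 50.0),
    ("2023-02-01", "store_a", 200.0),
]


def monthly_totals(rows):
    """Aggregate daily amounts into totals keyed by (month, store)."""
    totals = defaultdict(float)
    for date, store, amount in rows:
        month = date[:7]  # "YYYY-MM" prefix of the ISO date
        totals[(month, store)] += amount
    return dict(totals)


summary = monthly_totals(daily_sales)
```

The four daily rows collapse into three summary rows, a coarser form that is often easier to mine for trends than the raw records.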