Data Mining: A Concise Tutorial
Data Mining - Evaluation
Data Warehouse
A data warehouse exhibits the following characteristics to support management’s decision-making process −
- Subject Oriented − A data warehouse is subject oriented because it provides information about a subject rather than about the organization’s ongoing operations. These subjects can be products, customers, suppliers, sales, revenue, etc. The data warehouse does not focus on ongoing operations; rather, it focuses on the modelling and analysis of data for decision-making.
- Integrated − A data warehouse is constructed by integrating data from heterogeneous sources such as relational databases and flat files. This integration makes analysis of the data more effective.
- Time Variant − The data collected in a data warehouse is identified with a particular time period. The data in a data warehouse provides information from a historical point of view.
- Non-volatile − Non-volatile means that previous data is not removed when new data is added. The data warehouse is kept separate from the operational database, so frequent changes in the operational database are not reflected in the data warehouse.
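The time-variant and non-volatile characteristics can be illustrated with a minimal sketch. The `SalesFact` and `Warehouse` names and the sample figures below are hypothetical, chosen only for illustration: every row carries its time period, and loading new data appends to the store rather than overwriting it.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SalesFact:
    """One warehouse row: a subject (product) measured for a time period."""
    product: str
    period: date      # time variant: every row is tied to a time period
    revenue: float


class Warehouse:
    """Hypothetical append-only store: loading never removes earlier rows."""

    def __init__(self):
        self._rows = []

    def load(self, facts):
        # Non-volatile: new data is appended, previous data stays in place.
        self._rows.extend(facts)

    def revenue_by_period(self, product):
        # Historical view: total revenue per period for one subject.
        totals = {}
        for row in self._rows:
            if row.product == product:
                totals[row.period] = totals.get(row.period, 0.0) + row.revenue
        return dict(sorted(totals.items()))


warehouse = Warehouse()
warehouse.load([SalesFact("widget", date(2023, 1, 1), 100.0)])
warehouse.load([SalesFact("widget", date(2023, 2, 1), 150.0)])
history = warehouse.revenue_by_period("widget")
```

After the second load, `history` still contains the January figure alongside the February one, which is what allows analysis from a historical point of view.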
Data Warehousing
Data warehousing is the process of constructing and using a data warehouse. A data warehouse is constructed by integrating data from multiple heterogeneous sources. It supports analytical reporting, structured and/or ad hoc queries, and decision making.
Data warehousing involves data cleaning, data integration, and data consolidation. To integrate heterogeneous databases, there are two approaches −
- Query-Driven Approach
- Update-Driven Approach
Query-Driven Approach
This is the traditional approach to integrating heterogeneous databases. It builds wrappers and integrators on top of multiple heterogeneous databases. These integrators are also known as mediators.
Process of the Query-Driven Approach
- When a query is issued at the client side, a metadata dictionary translates it into queries appropriate for each heterogeneous site involved.
- These queries are then mapped and sent to the local query processors.
- The results from the heterogeneous sites are integrated into a global answer set.
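The three steps above can be sketched as follows. This is a minimal illustration, not a real mediator system: the two sources (an in-memory SQLite table and a list of semicolon-separated lines standing in for a flat file), the function names, and the sample rows are all assumptions made for the example.

```python
import sqlite3


def make_sql_source():
    """Hypothetical relational site holding customer rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (cust_name TEXT, city TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [("Ann", "Oslo"), ("Bob", "Rome")],
    )
    return conn


# Hypothetical flat-file site: one "name;city" record per line.
FLAT_FILE_ROWS = ["Cid;Oslo", "Dee;Lima"]


def sql_wrapper(conn, city):
    """Wrapper: translate the global query into SQL for the relational site."""
    cur = conn.execute(
        "SELECT cust_name FROM customers WHERE city = ?", (city,)
    )
    return [name for (name,) in cur]


def flat_file_wrapper(lines, city):
    """Wrapper: translate the global query into a scan of the flat file."""
    matches = []
    for line in lines:
        name, line_city = line.split(";")
        if line_city == city:
            matches.append(name)
    return matches


def mediator(city, sql_conn, flat_lines):
    """Mediator: send the query to each site at query time and merge
    the local results into one global answer set."""
    results = sql_wrapper(sql_conn, city) + flat_file_wrapper(flat_lines, city)
    return sorted(results)


answer = mediator("Oslo", make_sql_source(), FLAT_FILE_ROWS)
```

Note that every call to `mediator` touches both sources again, so the cost of translation and integration is paid on each query.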
Update-Driven Approach
Today’s data warehouse systems follow the update-driven approach rather than the traditional approach discussed earlier. In the update-driven approach, information from multiple heterogeneous sources is integrated in advance and stored in a warehouse. This information is then available for direct querying and analysis.
Advantages
This approach has the following advantages −
- It provides high performance, because queries run against data that has already been integrated.
- The data can be copied, processed, integrated, annotated, summarized, and restructured in the semantic data store in advance.
- Query processing does not require an interface with the processing at local sources.
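A minimal sketch of the update-driven idea, assuming two hypothetical sources (rows from a relational site and comma-separated lines from a flat file) and an in-memory SQLite database standing in for the warehouse. The table names and sample values are illustrative only.

```python
import sqlite3

# Hypothetical data already extracted from two heterogeneous sites.
RELATIONAL_ROWS = [("widget", "2023-01", 100.0), ("gadget", "2023-01", 40.0)]
FLAT_FILE_LINES = ["widget,2023-01,25.5", "widget,2023-02,60.0"]


def build_warehouse(relational_rows, flat_lines):
    """Integrate both sources in advance and store the result."""
    wh = sqlite3.connect(":memory:")
    wh.execute("CREATE TABLE sales (product TEXT, month TEXT, revenue REAL)")
    wh.executemany("INSERT INTO sales VALUES (?, ?, ?)", relational_rows)

    # Convert the flat-file records into the warehouse schema.
    parsed = []
    for line in flat_lines:
        product, month, revenue = line.split(",")
        parsed.append((product, month, float(revenue)))
    wh.executemany("INSERT INTO sales VALUES (?, ?, ?)", parsed)

    # Summarize in advance so later queries need no work at the sources.
    wh.execute(
        """CREATE TABLE monthly_sales AS
           SELECT product, month, SUM(revenue) AS revenue
           FROM sales GROUP BY product, month"""
    )
    wh.commit()
    return wh


warehouse = build_warehouse(RELATIONAL_ROWS, FLAT_FILE_LINES)

# Direct querying: only the warehouse is consulted, not the local sources.
rows = warehouse.execute(
    "SELECT month, revenue FROM monthly_sales "
    "WHERE product = 'widget' ORDER BY month"
).fetchall()
```

In contrast to a mediator, the integration work here happens once at load time, and each query is a plain lookup against the pre-summarized table.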
From Data Warehousing (OLAP) to Data Mining (OLAM)
Online Analytical Mining (OLAM) integrates Online Analytical Processing (OLAP) with data mining, so that knowledge can be mined in multidimensional databases. The following diagram shows the integration of OLAP and OLAM −
[Diagram: integration of OLAP and OLAM]
Importance of OLAM
OLAM is important for the following reasons −
- High quality of data in data warehouses − Data mining tools need to work on integrated, consistent, and cleaned data, and producing such data is a costly preprocessing step. Data warehouses constructed through such preprocessing are therefore valuable sources of high-quality data for both OLAP and data mining.
- Available information processing infrastructure surrounding data warehouses − This infrastructure covers the access, integration, consolidation, and transformation of multiple heterogeneous databases, web access and service facilities, and reporting and OLAP analysis tools.
- OLAP-based exploratory data analysis − Exploratory data analysis is required for effective data mining. OLAM provides facilities for data mining on various subsets of data and at different levels of abstraction.
- Online selection of data mining functions − Integrating OLAP with multiple data mining functions gives users the flexibility to select the desired data mining functions and to swap data mining tasks dynamically.
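The last two points can be sketched as a small pipeline: OLAP operations pick a subset and a level of abstraction, and a mining function chosen by name is then applied to the result. Everything here is a hypothetical illustration; the fact rows are invented, and `top_member` and `share_of_total` are toy stand-ins for real data mining functions.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, city, product, units)
FACTS = [
    ("north", "Oslo", "widget", 10),
    ("north", "Oslo", "gadget", 3),
    ("north", "Bergen", "widget", 7),
    ("south", "Rome", "widget", 2),
    ("south", "Rome", "gadget", 9),
]
DIMENSIONS = {"region": 0, "city": 1, "product": 2}


def slice_cube(facts, dimension, value):
    """OLAP slice: keep only rows where one dimension has a fixed value."""
    idx = DIMENSIONS[dimension]
    return [row for row in facts if row[idx] == value]


def roll_up(facts, dimension):
    """OLAP roll-up: total units at the chosen level of abstraction."""
    idx = DIMENSIONS[dimension]
    totals = defaultdict(int)
    for row in facts:
        totals[row[idx]] += row[3]
    return dict(totals)


def top_member(totals):
    """Toy mining function: the member with the largest total."""
    return max(totals, key=totals.get)


def share_of_total(totals):
    """Toy mining function: each member's fraction of the overall total."""
    overall = sum(totals.values())
    return {member: units / overall for member, units in totals.items()}


# Registry that lets the caller swap mining functions by name.
MINING_FUNCTIONS = {"top": top_member, "share": share_of_total}


def olam(facts, slice_on, slice_value, level, function_name):
    """Slice the data, roll it up to a level, then apply the chosen function."""
    subset = slice_cube(facts, slice_on, slice_value)
    return MINING_FUNCTIONS[function_name](roll_up(subset, level))


best_city = olam(FACTS, "product", "widget", "city", "top")
region_share = olam(FACTS, "product", "widget", "region", "share")
```

The same subset ("widget" sales) is analyzed once at the city level and once at the region level, with a different function each time, which mirrors the flexibility described above.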