Dwh 简明教程

Data Warehousing - System Processes

我们有用于应用于操作型数据库的固定数量的操作,并且我们有 use normalized datakeep table small 等明确界定的技术。这些技术适用于提供解决方案。但在决策支持系统中,我们不知道将来需要执行哪些查询和操作。因此,应用于操作型数据库的技术不适用于数据仓库。

We have a fixed number of operations to be applied on the operational databases and we have well-defined techniques such as use normalized data, keep table small, etc. These techniques are suitable for delivering a solution. But in case of decision-support systems, we do not know what query and operation needs to be executed in future. Therefore techniques applied on operational databases are not suitable for data warehouses.

在本章中,我们将讨论如何基于 Unix 和关系数据库等开放系统技术构建数据仓库解决方案。

In this chapter, we will discuss how to build data warehousing solutions on top open-system technologies like Unix and relational databases.

Process Flow in Data Warehouse

有四个主要过程有助于建立一个数据仓库:

There are four major processes that contribute to a data warehouse −

  1. Extract and load the data.

  2. Cleaning and transforming the data.

  3. Backup and archive the data.

  4. Managing queries and directing them to the appropriate data sources.

process flow

Extract and Load Process

数据提取从源系统获取数据。数据加载提取提取的数据并将其加载到数据仓库中。

Data extraction takes data from the source systems. Data load takes the extracted data and loads it into the data warehouse.

Note - 在将数据加载到数据仓库之前,必须对从外部源提取的信息进行重建。

Note − Before loading the data into the data warehouse, the information extracted from the external sources must be reconstructed.

Controlling the Process

控制流程包括确定何时开始数据提取和对数据进行一致性检查。控制流程确保工具、逻辑模块和程序按正确的顺序和时间执行。

Controlling the process involves determining when to start data extraction and the consistency check on data. Controlling process ensures that the tools, the logic modules, and the programs are executed in correct sequence and at correct time.

When to Initiate Extract

数据在提取时需要处于一致状态,也就是说,数据仓库应向用户表示单一的、一致的信息版本。

Data needs to be in a consistent state when it is extracted, i.e., the data warehouse should represent a single, consistent version of the information to the user.

例如,在电信部门的客户分析数据仓库中,将周三下午 8 点从客户数据库中提取的客户列表与周二下午 8 点之前的客户订阅事件合并是不合逻辑的。这意味着我们正在查找没有关联订阅的客户。

For example, in a customer profiling data warehouse in telecommunication sector, it is illogical to merge the list of customers at 8 pm on Wednesday from a customer database with the customer subscription events up to 8 pm on Tuesday. This would mean that we are finding the customers for whom there are no associated subscriptions.

Loading the Data

在提取数据之后,将其加载到临时数据存储中,在那里对其进行清理并使其保持一致。

After extracting the data, it is loaded into a temporary data store where it is cleaned up and made consistent.

Note - 仅当所有数据源加载到临时数据存储中时,才执行一致性检查。

Note − Consistency checks are executed only when all the data sources have been loaded into the temporary data store.

Clean and Transform Process

一旦数据被提取并加载到临时数据存储中,就该执行清理和转换了。以下是清理和转换涉及的步骤列表:

Once the data is extracted and loaded into the temporary data store, it is time to perform Cleaning and Transforming. Here is the list of steps involved in Cleaning and Transforming −

  1. Clean and transform the loaded data into a structure

  2. Partition the data

  3. Aggregation

Clean and Transform the Loaded Data into a Structure

清理并转换加载的数据有助于加快查询。它可以通过使数据一致来完成:

Cleaning and transforming the loaded data helps speed up the queries. It can be done by making the data consistent −

  1. within itself.

  2. with other data within the same data source.

  3. with the data in other source systems.

  4. with the existing data present in the warehouse.

转换包括将源数据转换为结构。对数据进行结构化有助于提高查询性能并降低运行成本。数据仓库中包含的数据必须经过转换,以支持性能需求并控制持续的运行成本。

Transforming involves converting the source data into a structure. Structuring the data increases the query performance and decreases the operational cost. The data contained in a data warehouse must be transformed to support performance requirements and control the ongoing operational costs.

Partition the Data

这样将优化硬件性能并简化数据仓库管理。我们在此将每个事实表划分为多个单独分区。

It will optimize the hardware performance and simplify the management of data warehouse. Here we partition each fact table into multiple separate partitions.

Aggregation

需要聚合来加速通用查询。聚合依赖于大多数通用查询将分析详细数据的子集或聚合这一事实。

Aggregation is required to speed up common queries. Aggregation relies on the fact that most common queries will analyze a subset or an aggregation of the detailed data.

Backup and Archive the Data

为了在数据丢失、软件故障或硬件故障时恢复数据,有必要定期进行备份。归档涉及以一种可以根据需要快速恢复的方式从系统中删除旧数据。

In order to recover the data in the event of data loss, software failure, or hardware failure, it is necessary to keep regular back ups. Archiving involves removing the old data from the system in a format that allow it to be quickly restored whenever required.

例如,在零售销售分析数据仓库中,可能需要保留 3 年的数据,而最近 6 个月的数据会保留在线。在这样的情况下,通常需要能够对今年和去年逐月进行比较。在这种情况下,我们需要从归档中恢复一些数据。

For example, in a retail sales analysis data warehouse, it may be required to keep data for 3 years with the latest 6 months data being kept online. In such as scenario, there is often a requirement to be able to do month-on-month comparisons for this year and last year. In this case, we require some data to be restored from the archive.

Query Management Process

此流程执行以下功能−

This process performs the following functions −

  1. manages the queries.

  2. helps speed up the execution time of queris.

  3. directs the queries to their most effective data sources.

  4. ensures that all the system sources are used in the most effective way.

  5. monitors actual query profiles.

仓库管理流程使用此流程生成的信息来确定要生成哪些聚合。此流程通常不会在将信息常规加载到数据仓库时运行。

The information generated in this process is used by the warehouse management process to determine which aggregations to generate. This process does not generally operate during the regular load of information into data warehouse.