Dwh 简明教程

Data Warehousing - Architecture

在本章中,我们将讨论用于数据仓库设计的数据仓库架构业务分析框架。

In this chapter, we will discuss the business analysis framework for the data warehouse design and architecture of a data warehouse.

Business Analysis Framework

业务分析师从数据仓库中获取信息,以衡量绩效并做出关键调整,以便在市场中胜过其他业务持有人。拥有数据仓库可提供以下优势 −

The business analyst get the information from the data warehouses to measure the performance and make critical adjustments in order to win over other business holders in the market. Having a data warehouse offers the following advantages −

  1. Since a data warehouse can gather information quickly and efficiently, it can enhance business productivity.

  2. A data warehouse provides us a consistent view of customers and items, hence, it helps us manage customer relationship.

  3. A data warehouse also helps in bringing down the costs by tracking trends, patterns over a long period in a consistent and reliable manner.

要设计出一个有效率的数据仓库,我们需要了解和分析业务需求,并构建 business analysis framework 。每个人对数据仓库设计都有不同的看法。这些视图如下所示 −

To design an effective and efficient data warehouse, we need to understand and analyze the business needs and construct a business analysis framework. Each person has different views regarding the design of a data warehouse. These views are as follows −

  1. The top-down view − This view allows the selection of relevant information needed for a data warehouse.

  2. The data source view − This view presents the information being captured, stored, and managed by the operational system.

  3. The data warehouse view − This view includes the fact tables and dimension tables. It represents the information stored inside the data warehouse.

  4. The business query view − It is the view of the data from the viewpoint of the end-user.

Three-Tier Data Warehouse Architecture

一般来说,数据仓库采用三层架构。以下是数据仓库架构的三层。

Generally a data warehouses adopts a three-tier architecture. Following are the three tiers of the data warehouse architecture.

  1. Bottom Tier − The bottom tier of the architecture is the data warehouse database server. It is the relational database system. We use the back end tools and utilities to feed data into the bottom tier. These back end tools and utilities perform the Extract, Clean, Load, and refresh functions.

  2. Middle Tier − In the middle tier, we have the OLAP Server that can be implemented in either of the following ways. By Relational OLAP (ROLAP), which is an extended relational database management system. The ROLAP maps the operations on multidimensional data to standard relational operations. By Multidimensional OLAP (MOLAP) model, which directly implements the multidimensional data and operations.

  3. Top-Tier − This tier is the front-end client layer. This layer holds the query tools and reporting tools, analysis tools and data mining tools.

下图描绘了数据仓库的三层架构 −

The following diagram depicts the three-tier architecture of data warehouse −

dwh architecture

Data Warehouse Models

从数据仓库架构的角度来看,我们有以下数据仓库模型 −

From the perspective of data warehouse architecture, we have the following data warehouse models −

  1. Virtual Warehouse

  2. Data mart

  3. Enterprise Warehouse

Virtual Warehouse

对操作数据仓库的视图被称为虚拟仓库。构建虚拟仓库很容易。构建虚拟仓库需要操作数据库服务器上的过剩容量。

The view over an operational data warehouse is known as a virtual warehouse. It is easy to build a virtual warehouse. Building a virtual warehouse requires excess capacity on operational database servers.

Data Mart

数据市包含组织范围内的数据子集。这一数据子集对特定组织群体有价值。

Data mart contains a subset of organization-wide data. This subset of data is valuable to specific groups of an organization.

换句话说,我们可以声称数据市包含特定群体特有的数据。例如,营销数据市可能包含与项目、客户和销售相关的数据。数据市局限于主题。

In other words, we can claim that data marts contain data specific to a particular group. For example, the marketing data mart may contain data related to items, customers, and sales. Data marts are confined to subjects.

关于数据市需要注意的要点 −

Points to remember about data marts −

  1. Window-based or Unix/Linux-based servers are used to implement data marts. They are implemented on low-cost servers.

  2. The implementation data mart cycles is measured in short periods of time, i.e., in weeks rather than months or years.

  3. The life cycle of a data mart may be complex in long run, if its planning and design are not organization-wide.

  4. Data marts are small in size.

  5. Data marts are customized by department.

  6. The source of a data mart is departmentally structured data warehouse.

  7. Data mart are flexible.

Enterprise Warehouse

  1. An enterprise warehouse collects all the information and the subjects spanning an entire organization

  2. It provides us enterprise-wide data integration.

  3. The data is integrated from operational systems and external information providers.

  4. This information can vary from a few gigabytes to hundreds of gigabytes, terabytes or beyond.

Load Manager

这一组件执行提取和加载流程所需的运算。

This component performs the operations required to extract and load process.

加载管理器的尺寸和复杂度在从一个数据仓库到其他数据仓库的具体解决方案中会各有不同。

The size and complexity of the load manager varies between specific solutions from one data warehouse to other.

Load Manager Architecture

加载管理器执行以下功能−

The load manager performs the following functions −

  1. Extract the data from source system.

  2. Fast Load the extracted data into temporary data store.

  3. Perform simple transformations into structure similar to the one in the data warehouse.

load manager

Extract Data from Source

数据是从操作数据库或外部信息的提供者中提取的。网关是用于提取数据的应用程序。它由底层的 DBMS 支持,允许客户端程序在服务器上生成要执行的 SQL。开放数据库连接 (ODBC)、Java 数据库连接 (JDBC) 是网关示例。

The data is extracted from the operational databases or the external information providers. Gateways is the application programs that are used to extract data. It is supported by underlying DBMS and allows client program to generate SQL to be executed at a server. Open Database Connection(ODBC), Java Database Connection (JDBC), are examples of gateway.

Fast Load

  1. In order to minimize the total load window the data need to be loaded into the warehouse in the fastest possible time.

  2. The transformations affects the speed of data processing.

  3. It is more effective to load the data into relational database prior to applying transformations and checks.

  4. Gateway technology proves to be not suitable, since they tend not be performant when large data volumes are involved.

Simple Transformations

在加载时,可能需要执行简单的转换。在完成此操作后,我们就能够执行复杂的检查。假设我们正在加载 EPOS 销售交易,我们需要执行以下检查:

While loading it may be required to perform simple transformations. After this has been completed we are in position to do the complex checks. Suppose we are loading the EPOS sales transaction we need to perform the following checks:

  1. Strip out all the columns that are not required within the warehouse.

  2. Convert all the values to required data types.

Warehouse Manager

仓库管理员负责仓库管理流程。它包含第三方系统软件、C 程序和 shell 脚本。

A warehouse manager is responsible for the warehouse management process. It consists of third-party system software, C programs, and shell scripts.

仓库管理器的尺寸和复杂度在具体解决方案中会各有不同。

The size and complexity of warehouse managers varies between specific solutions.

Warehouse Manager Architecture

仓库管理员包括以下内容 -

A warehouse manager includes the following −

  1. The controlling process

  2. Stored procedures or C with SQL

  3. Backup/Recovery tool

  4. SQL Scripts

warehouse manager

Operations Performed by Warehouse Manager

  1. A warehouse manager analyzes the data to perform consistency and referential integrity checks.

  2. Creates indexes, business views, partition views against the base data.

  3. Generates new aggregations and updates existing aggregations. Generates normalizations.

  4. Transforms and merges the source data into the published data warehouse.

  5. Backup the data in the data warehouse.

  6. Archives the data that has reached the end of its captured life.

Note − 仓库管理员也会分析查询概要以确定指标和聚合是合适的。

Note − A warehouse Manager also analyzes query profiles to determine index and aggregations are appropriate.

Query Manager

  1. Query manager is responsible for directing the queries to the suitable tables.

  2. By directing the queries to appropriate tables, the speed of querying and response generation can be increased.

  3. Query manager is responsible for scheduling the execution of the queries posed by the user.

Query Manager Architecture

下图展示了查询管理器的架构。它包括以下内容:

The following screenshot shows the architecture of a query manager. It includes the following:

  1. Query redirection via C tool or RDBMS

  2. Stored procedures

  3. Query management tool

  4. Query scheduling via C tool or RDBMS

  5. Query scheduling via third-party software

query manager

Detailed Information

不保留详细的信息,而是将其汇总到下一个详细信息级别,然后存档至磁带。数据仓库的详细信息部分将详细的信息保存在星型雪花模式中。详细的信息加载到数据仓库以补充汇总数据。

Detailed information is not kept online, rather it is aggregated to the next level of detail and then archived to tape. The detailed information part of data warehouse keeps the detailed information in the starflake schema. Detailed information is loaded into the data warehouse to supplement the aggregated data.

下图以图形形式展示了详细的信息存储在哪以及如何使用。

The following diagram shows a pictorial impression of where detailed information is stored and how it is used.

detailed information

Note - 如果详细的信息离线保存以最小化磁盘存储,在存档之前应确保数据已提取、清理并转换到星型雪花模式。

Note − If detailed information is held offline to minimize disk storage, we should make sure that the data has been extracted, cleaned up, and transformed into starflake schema before it is archived.

Summary Information

汇总信息是数据仓库的组成部分,存储预定义的汇总。这些汇总由仓库管理器生成。必须将汇总信息视为瞬态。为了对不断变化的查询概况做出响应,它会随时发生改变。

Summary Information is a part of data warehouse that stores predefined aggregations. These aggregations are generated by the warehouse manager. Summary Information must be treated as transient. It changes on-the-go in order to respond to the changing query profiles.

有关汇总信息的要点如下 −

The points to note about summary information are as follows −

  1. Summary information speeds up the performance of common queries.

  2. It increases the operational cost.

  3. It needs to be updated whenever new data is loaded into the data warehouse.

  4. It may not have been backed up, since it can be generated fresh from the detailed information.