Data Warehousing - Quick Guide

Data Warehousing - Overview

The term "Data Warehouse" was first coined by Bill Inmon in 1990. According to Inmon, a data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data. This data helps analysts make informed decisions in an organization.

An operational database undergoes frequent changes on a daily basis on account of the transactions that take place. Suppose a business executive wants to analyze previous feedback on some data, such as a product, a supplier, or any consumer data. The executive will then have no data available to analyze, because the previous data has been overwritten by transactions.

A data warehouse provides generalized and consolidated data in a multidimensional view. Along with this generalized and consolidated view of data, a data warehouse also provides Online Analytical Processing (OLAP) tools. These tools help us analyze data in a multidimensional space interactively and effectively. This analysis results in data generalization and data mining.

Data mining functions such as association, clustering, classification, and prediction can be integrated with OLAP operations to enhance interactive mining of knowledge at multiple levels of abstraction. This is why the data warehouse has become an important platform for data analysis and online analytical processing.

Understanding a Data Warehouse

  1. A data warehouse is a database, which is kept separate from the organization’s operational database.

  2. There is no frequent updating done in a data warehouse.

  3. It possesses consolidated historical data, which helps the organization to analyze its business.

  4. A data warehouse helps executives to organize, understand, and use their data to take strategic decisions.

  5. Data warehouse systems help in integrating a diversity of application systems.

  6. A data warehouse system helps in consolidated historical data analysis.

Why a Data Warehouse is Separated from Operational Databases

A data warehouse is kept separate from operational databases for the following reasons −

  1. An operational database is constructed for well-known tasks and workloads such as searching particular records, indexing, etc. In contrast, data warehouse queries are often complex and they present a general form of data.

  2. Operational databases support concurrent processing of multiple transactions. Concurrency control and recovery mechanisms are required for operational databases to ensure robustness and consistency of the database.

  3. An operational database query allows both read and modify operations, while an OLAP query needs only read-only access to stored data.

  4. An operational database maintains current data. On the other hand, a data warehouse maintains historical data.

Data Warehouse Features

The key features of a data warehouse are discussed below −

  1. Subject Oriented − A data warehouse is subject oriented because it provides information around a subject rather than the organization’s ongoing operations. These subjects can be product, customers, suppliers, sales, revenue, etc. A data warehouse does not focus on the ongoing operations, rather it focuses on modelling and analysis of data for decision making.

  2. Integrated − A data warehouse is constructed by integrating data from heterogeneous sources such as relational databases, flat files, etc. This integration enhances the effective analysis of data.

  3. Time Variant − The data collected in a data warehouse is identified with a particular time period. The data in a data warehouse provides information from the historical point of view.

  4. Non-volatile − Non-volatile means that previous data is not erased when new data is added. A data warehouse is kept separate from the operational database, and therefore frequent changes in the operational database are not reflected in the data warehouse.

Note − A data warehouse does not require transaction processing, recovery, or concurrency controls, because it is physically stored separately from the operational database.

Data Warehouse Applications

As discussed before, a data warehouse helps business executives organize, analyze, and use their data for decision making. A data warehouse serves as a core part of a plan-execute-assess "closed-loop" feedback system for enterprise management. Data warehouses are widely used in the following fields −

  1. Financial services

  2. Banking services

  3. Consumer goods

  4. Retail sectors

  5. Controlled manufacturing

Types of Data Warehouse

Information processing, analytical processing, and data mining are the three types of data warehouse applications that are discussed below −

  1. Information Processing − A data warehouse allows us to process the data stored in it. The data can be processed by means of querying, basic statistical analysis, and reporting using crosstabs, tables, charts, or graphs.

  2. Analytical Processing − A data warehouse supports analytical processing of the information stored in it. The data can be analyzed by means of basic OLAP operations, including slice-and-dice, drill down, drill up, and pivoting.

  3. Data Mining − Data mining supports knowledge discovery by finding hidden patterns and associations, constructing analytical models, performing classification and prediction. These mining results can be presented using the visualization tools.
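The basic OLAP operations mentioned above can be illustrated with a tiny in-memory sketch. The fact records, dimension values, and function names below are invented for illustration; a real OLAP server would execute these operations over a cube, not over Python lists.

```python
from collections import defaultdict

# Hypothetical fact records: (year, quarter, item_type, city, units_sold)
facts = [
    (2009, "Q1", "Keyboard", "Delhi", 150),
    (2009, "Q1", "Mouse", "Delhi", 200),
    (2009, "Q2", "Keyboard", "Mumbai", 120),
    (2010, "Q1", "Mouse", "Delhi", 250),
]

def slice_cube(facts, city):
    """Slice: fix one dimension (location) and keep the rest."""
    return [f for f in facts if f[3] == city]

def drill_up_to_year(facts):
    """Drill-up (roll-up): aggregate the quarter level up to the year level."""
    totals = defaultdict(int)
    for year, quarter, item, city, units in facts:
        totals[year] += units
    return dict(totals)

print(slice_cube(facts, "Delhi"))   # only the Delhi facts
print(drill_up_to_year(facts))      # {2009: 470, 2010: 250}
```

Drill-down would be the inverse of `drill_up_to_year`: moving from yearly totals back to the quarterly detail rows.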

The following table distinguishes a data warehouse (OLAP) from an operational database (OLTP) −

| Sr.No. | Data Warehouse (OLAP) | Operational Database (OLTP) |
| --- | --- | --- |
| 1 | It involves historical processing of information. | It involves day-to-day processing. |
| 2 | OLAP systems are used by knowledge workers such as executives, managers, and analysts. | OLTP systems are used by clerks, DBAs, or database professionals. |
| 3 | It is used to analyze the business. | It is used to run the business. |
| 4 | It focuses on information out. | It focuses on data in. |
| 5 | It is based on the Star Schema, Snowflake Schema, and Fact Constellation Schema. | It is based on the Entity Relationship Model. |
| 6 | It is subject oriented. | It is application oriented. |
| 7 | It contains historical data. | It contains current data. |
| 8 | It provides summarized and consolidated data. | It provides primitive and highly detailed data. |
| 9 | It provides a summarized and multidimensional view of data. | It provides a detailed and flat relational view of data. |
| 10 | The number of users is in hundreds. | The number of users is in thousands. |
| 11 | The number of records accessed is in millions. | The number of records accessed is in tens. |
| 12 | The database size ranges from 100 GB to 100 TB. | The database size ranges from 100 MB to 100 GB. |
| 13 | It is highly flexible. | It provides high performance. |

Data Warehousing - Concepts

What is Data Warehousing?

Data warehousing is the process of constructing and using a data warehouse. A data warehouse is constructed by integrating data from multiple heterogeneous sources to support analytical reporting, structured and/or ad hoc queries, and decision making. Data warehousing involves data cleaning, data integration, and data consolidation.

Using Data Warehouse Information

There are decision support technologies that help utilize the data available in a data warehouse. These technologies help executives use the warehouse quickly and effectively. They can gather data, analyze it, and make decisions based on the information present in the warehouse. The information gathered in a warehouse can be used in any of the following domains −

  1. Tuning Production Strategies − Product strategies can be fine-tuned by repositioning products and managing product portfolios by comparing sales quarterly or yearly.

  2. Customer Analysis − Customer analysis is done by analyzing the customer’s buying preferences, buying time, budget cycles, etc.

  3. Operations Analysis − Data warehousing also helps in customer relationship management, and making environmental corrections. The information also allows us to analyze business operations.

Integrating Heterogeneous Databases

To integrate heterogeneous databases, we have two approaches −

  1. Query-driven Approach

  2. Update-driven Approach

Query-Driven Approach

This is the traditional approach to integrating heterogeneous databases. In this approach, wrappers and integrators are built on top of multiple heterogeneous databases. These integrators are also known as mediators.

Process of Query-Driven Approach

  1. When a query is issued on the client side, a metadata dictionary translates the query into a form appropriate for the individual heterogeneous sites involved.

  2. Now these queries are mapped and sent to the local query processor.

  3. The results from heterogeneous sites are integrated into a global answer set.
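The three steps above can be sketched as a toy mediator. The two "sites", their schemas, and the per-site field mappings are invented for illustration; a real mediator would translate full queries, not single field names.

```python
# Toy mediator for the query-driven approach. The two "sites" and their
# translation rules (the metadata dictionary) are invented for illustration.
SITES = {
    "hr_db":    {"field_map": {"name": "emp_name"},
                 "rows": [{"emp_name": "Asha"}, {"emp_name": "Ravi"}]},
    "sales_db": {"field_map": {"name": "customer"},
                 "rows": [{"customer": "Ravi"}, {"customer": "Meena"}]},
}

def translate(query_field, site):
    """Metadata dictionary step: map a global field to the site's schema."""
    return SITES[site]["field_map"][query_field]

def run_global_query(query_field):
    """Send the translated query to each local site, then merge the
    local results into a single global answer set."""
    answers = set()
    for site in SITES:
        local_field = translate(query_field, site)
        for row in SITES[site]["rows"]:
            answers.add(row[local_field])
    return answers

print(run_global_query("name"))  # union of names from both sites
```

Even in this toy form, the cost of the approach is visible: every query pays for translation, per-site execution, and result merging, which is why the disadvantages below follow.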

Disadvantages

  1. The query-driven approach needs complex integration and filtering processes.

  2. This approach is very inefficient.

  3. It is very expensive for frequent queries.

  4. This approach is also very expensive for queries that require aggregations.

Update-Driven Approach

This is an alternative to the traditional approach. Today's data warehouse systems follow the update-driven approach rather than the traditional approach discussed earlier. In the update-driven approach, the information from multiple heterogeneous sources is integrated in advance and stored in a warehouse. This information is available for direct querying and analysis.

Advantages

This approach has the following advantages −

  1. This approach provides high performance.

  2. The data is copied, processed, integrated, annotated, summarized, and restructured in a semantic data store in advance.

  3. Query processing does not require an interface to process data at local sources.

Functions of Data Warehouse Tools and Utilities

The following are the functions of data warehouse tools and utilities −

  1. Data Extraction − Involves gathering data from multiple heterogeneous sources.

  2. Data Cleaning − Involves finding and correcting the errors in data.

  3. Data Transformation − Involves converting the data from legacy format to warehouse format.

  4. Data Loading − Involves sorting, summarizing, consolidating, checking integrity, and building indices and partitions.

  5. Refreshing − Involves propagating updates from the data sources to the warehouse.
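The extraction, cleaning, transformation, and loading functions above can be sketched as a minimal in-memory pipeline. The source records, the "legacy" DD/MM/YYYY date format, and the function names are invented for illustration.

```python
# A minimal ETL sketch following the functions above; the source rows
# and the legacy date format are invented for illustration.
from datetime import datetime

def extract():
    """Data extraction: gather rows from the (here, hard-coded) sources."""
    return [{"item": "Mouse ", "sold_on": "03/01/2010", "units": "5"},
            {"item": "Keyboard", "sold_on": "04/01/2010", "units": "x"}]

def clean(rows):
    """Data cleaning: drop rows whose units field is not numeric."""
    return [r for r in rows if r["units"].isdigit()]

def transform(rows):
    """Data transformation: legacy DD/MM/YYYY dates to ISO, trimmed names."""
    return [{"item": r["item"].strip(),
             "sold_on": datetime.strptime(r["sold_on"], "%d/%m/%Y").date().isoformat(),
             "units": int(r["units"])} for r in rows]

def load(rows, warehouse):
    """Data loading: append the prepared rows into the warehouse table."""
    warehouse.extend(rows)

warehouse = []
load(transform(clean(extract())), warehouse)
print(warehouse)  # [{'item': 'Mouse', 'sold_on': '2010-01-03', 'units': 5}]
```

Refreshing would then be the same pipeline re-run periodically against the sources to pick up new rows.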

Note − Data cleaning and data transformation are important steps in improving the quality of data and data mining results.

Data Warehousing - Terminologies

In this chapter, we will discuss some of the most commonly used terms in data warehousing.

Metadata

Metadata is simply defined as data about data. The data that is used to represent other data is known as metadata. For example, the index of a book serves as metadata for the contents of the book. In other words, we can say that metadata is the summarized data that leads us to the detailed data.

In terms of a data warehouse, we can define metadata as follows −

  1. Metadata is a road-map to the data warehouse.

  2. Metadata in data warehouse defines the warehouse objects.

  3. Metadata acts as a directory. This directory helps the decision support system to locate the contents of a data warehouse.

Metadata Repository

Metadata repository is an integral part of a data warehouse system. It contains the following metadata −

  1. Business metadata − It contains the data ownership information, business definition, and changing policies.

  2. Operational metadata − It includes the currency of data and data lineage. Currency of data refers to data being active, archived, or purged. Lineage of data means the history of the data's migration and the transformations applied to it.

  3. Data for mapping from the operational environment to the data warehouse − This metadata includes source databases and their contents, data extraction, data partitioning, cleaning and transformation rules, and data refresh and purging rules.

  4. The algorithms for summarization − It includes dimension algorithms, data on granularity, aggregation, summarizing, etc.
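The four categories of repository contents above can be pictured as a simple mapping. Every name and value below (the table names, rules, and owner) is invented for illustration; a real repository would be a managed catalog, not a dict.

```python
# A sketch of a metadata repository as a plain mapping; all names and
# values are hypothetical, chosen only to illustrate the four categories.
metadata_repository = {
    "business": {
        "owner": "Sales department",
        "definition": "Units sold per item, branch, and month",
    },
    "operational": {
        "currency": "active",  # active / archived / purged
        "lineage": ["extracted from pos_db", "currency converted to USD"],
    },
    "mapping": {
        "source": "pos_db.sales",
        "target": "warehouse.sales_fact",
        "cleaning_rules": ["drop rows with null item_id"],
    },
    "summarization": {
        "granularity": "month",
        "aggregation": "SUM(units)",
    },
}

def lineage(repo):
    """Answer 'where did this data come from?' from operational metadata."""
    return repo["operational"]["lineage"]

print(lineage(metadata_repository))
```

This is the "directory" role described earlier: a decision support tool consults entries like these to locate and interpret warehouse contents.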

Data Cube

A data cube helps us represent data in multiple dimensions. It is defined by dimensions and facts. The dimensions are the entities with respect to which an enterprise preserves the records.

Illustration of Data Cube

Suppose a company wants to keep track of sales records, with the help of a sales data warehouse, with respect to time, item, branch, and location. These dimensions allow the company to keep track of monthly sales and of the branches at which the items were sold. There is a table associated with each dimension, known as a dimension table. For example, the "item" dimension table may have attributes such as item_name, item_type, and item_brand.

The following table represents the 2-D view of Sales Data for a company with respect to time, item, and location dimensions.

[Figure: 2-D view of the sales data]

In this 2-D table, however, we have records with respect to time and item only. The sales for New Delhi are shown with respect to the time and item dimensions, according to the type of items sold. If we want to view the sales data with one more dimension, say the location dimension, then a 3-D view is useful. The 3-D view of the sales data with respect to time, item, and location is shown in the table below −

[Figure: 3-D view of the sales data]

The above 3-D table can be represented as a 3-D data cube, as shown in the following figure −

[Figure: 3-D data cube]
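The cube above can be modelled as a mapping from (time, item, location) coordinates to a measure. The cell values below are invented for illustration; rolling up the location dimension recovers the 2-D (time, item) view discussed earlier.

```python
from collections import defaultdict

# Hypothetical cells of a (time, item, location) -> units_sold cube.
cube = {
    ("Q1", "Mouse", "Delhi"): 200, ("Q1", "Mouse", "Kolkata"): 150,
    ("Q1", "Keyboard", "Delhi"): 120, ("Q2", "Mouse", "Delhi"): 180,
}

def roll_up_location(cube):
    """Collapse the location dimension to get the 2-D (time, item) view."""
    view = defaultdict(int)
    for (time, item, _location), units in cube.items():
        view[(time, item)] += units
    return dict(view)

print(roll_up_location(cube))
# {('Q1', 'Mouse'): 350, ('Q1', 'Keyboard'): 120, ('Q2', 'Mouse'): 180}
```

The same pattern generalizes: collapsing any one coordinate of the cube yields a lower-dimensional view of the same facts.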

Data Mart

Data marts contain a subset of organization-wide data that is valuable to specific groups of people in an organization. In other words, a data mart contains only the data that is specific to a particular group. For example, a marketing data mart may contain only data related to items, customers, and sales. Data marts are confined to subjects.

Points to Remember About Data Marts

  1. Windows-based or Unix/Linux-based servers are used to implement data marts. They are implemented on low-cost servers.

  2. The implementation cycle of a data mart is measured in short periods of time, i.e., in weeks rather than months or years.

  3. The life cycle of data marts may be complex in the long run, if their planning and design are not organization-wide.

  4. Data marts are small in size.

  5. Data marts are customized by department.

  6. The source of a data mart is a departmentally structured data warehouse.

  7. Data marts are flexible.

The following figure shows a graphical representation of data marts.

[Figure: data marts]

Virtual Warehouse

The view over an operational data warehouse is known as a virtual warehouse. It is easy to build a virtual warehouse, but building one requires excess capacity on operational database servers.

Data Warehousing - Delivery Process

A data warehouse is never static; it evolves as the business expands. As the business evolves, its requirements keep changing, and therefore a data warehouse must be designed to keep pace with these changes. Hence, a data warehouse system needs to be flexible.

Ideally, there should be a delivery process for delivering a data warehouse. However, data warehouse projects normally suffer from various issues that make it difficult to complete tasks and deliverables in the strict, ordered fashion demanded by the waterfall method. Most of the time, the requirements are not completely understood. The architectures, designs, and build components can be completed only after gathering and studying all the requirements.

Delivery Method

The delivery method is a variant of the joint application development approach adopted for the delivery of a data warehouse. We have staged the data warehouse delivery process to minimize risks. The approach that we will discuss here does not reduce the overall delivery time-scales but ensures the business benefits are delivered incrementally through the development process.

Note − The delivery process is broken into phases to reduce the project and delivery risk.

The following diagram explains the stages in the delivery process −

[Figure: stages in the delivery process]

IT Strategy

Data warehouses are strategic investments that require a business process to generate benefits. An IT strategy is required to procure and retain funding for the project.

Business Case

The objective of the business case is to estimate the business benefits that should be derived from using a data warehouse. These benefits may not be quantifiable, but the projected benefits need to be clearly stated. If a data warehouse does not have a clear business case, then it tends to suffer from credibility problems at some stage during the delivery process. Therefore, in data warehouse projects, we need to understand the business case for the investment.

Education and Prototyping

Organizations experiment with the concept of data analysis and educate themselves on the value of having a data warehouse before settling on a solution. This is addressed by prototyping, which helps in understanding the feasibility and benefits of a data warehouse. Prototyping on a small scale can promote the educational process as long as −

  1. The prototype addresses a defined technical objective.

  2. The prototype can be thrown away after the feasibility concept has been shown.

  3. The activity addresses a small subset of eventual data content of the data warehouse.

  4. The activity timescale is non-critical.

The following points are to be kept in mind to produce an early release and deliver business benefits.

  1. Identify the architecture that is capable of evolving.

  2. Focus on business requirements and technical blueprint phases.

  3. Limit the scope of the first build phase to the minimum that delivers business benefits.

  4. Understand the short-term and medium-term requirements of the data warehouse.

Business Requirements

To provide quality deliverables, we should make sure the overall requirements are understood. If we understand the business requirements for both short-term and medium-term, then we can design a solution to fulfil short-term requirements. The short-term solution can then be grown to a full solution.

The following aspects are determined in this stage −

  1. The business rule to be applied on data.

  2. The logical model for information within the data warehouse.

  3. The query profiles for the immediate requirement.

  4. The source systems that provide this data.

Technical Blueprint

This phase needs to deliver an overall architecture satisfying the long-term requirements. It also delivers the components that must be implemented in the short term to derive any business benefit. The blueprint needs to identify the following −

  1. The overall system architecture.

  2. The data retention policy.

  3. The backup and recovery strategy.

  4. The server and data mart architecture.

  5. The capacity plan for hardware and infrastructure.

  6. The components of database design.

Building the Version

In this stage, the first production deliverable is produced. This production deliverable is the smallest component of a data warehouse. This smallest component adds business benefit.

History Load

This is the phase where the remainder of the required history is loaded into the data warehouse. In this phase, we do not add new entities, but additional physical tables would probably be created to store increased data volumes.

Let us take an example. Suppose the build version phase has delivered a retail sales analysis data warehouse with 2 months' worth of history. This information allows the user to analyze only the recent trends and address short-term issues. The user in this case cannot identify annual and seasonal trends. To help him do so, the last 2 years' sales history could be loaded from the archive. Now the 40 GB of data is extended to 400 GB.

Note − The backup and recovery procedures may become complex, therefore it is recommended to perform this activity within a separate phase.

Ad hoc Query

In this phase, we configure an ad hoc query tool that is used to operate a data warehouse. These tools can generate the database query.

Note − It is recommended not to use these access tools when the database is being substantially modified.

Automation

In this phase, operational management processes are fully automated. These would include −

  1. Transforming the data into a form suitable for analysis.

  2. Monitoring query profiles and determining appropriate aggregations to maintain system performance.

  3. Extracting and loading data from different source systems.

  4. Generating aggregations from predefined definitions within the data warehouse.

  5. Backing up, restoring, and archiving the data.

Extending Scope

In this phase, the data warehouse is extended to address a new set of business requirements. The scope can be extended in two ways −

  1. By loading additional data into the data warehouse.

  2. By introducing new data marts using the existing information.

Note − This phase should be performed separately, since it involves substantial efforts and complexity.

Requirements Evolution

From the perspective of delivery process, the requirements are always changeable. They are not static. The delivery process must support this and allow these changes to be reflected within the system.

This issue is addressed by designing the data warehouse around the use of data within business processes, as opposed to the data requirements of existing queries.

The architecture is designed to change and grow to match the business needs. The process operates as a pseudo-application development process, in which new requirements are continually fed into the development activities and partial deliverables are produced. These partial deliverables are fed back to the users and then reworked, ensuring that the overall system is continually updated to meet the business needs.

Data Warehousing - System Processes

We have a fixed number of operations to be applied on operational databases, and we have well-defined techniques such as using normalized data and keeping tables small. These techniques are suitable for delivering a solution. But in the case of decision-support systems, we do not know what queries and operations will need to be executed in the future. Therefore, techniques applied to operational databases are not suitable for data warehouses.

In this chapter, we will discuss how to build data warehousing solutions on top of open-system technologies like Unix and relational databases.

Process Flow in Data Warehouse

There are four major processes that contribute to a data warehouse −

  1. Extracting and loading the data.

  2. Cleaning and transforming the data.

  3. Backing up and archiving the data.

  4. Managing queries and directing them to the appropriate data sources.

[Figure: process flow in a data warehouse]

Extract and Load Process

Data extraction takes data from the source systems. Data load takes the extracted data and loads it into the data warehouse.

Note − Before loading the data into the data warehouse, the information extracted from the external sources must be reconstructed.

Controlling the Process

Controlling the process involves determining when to start data extraction and running consistency checks on the data. The controlling process ensures that the tools, logic modules, and programs are executed in the correct sequence and at the correct time.

When to Initiate Extract

Data needs to be in a consistent state when it is extracted, i.e., the data warehouse should represent a single, consistent version of the information to the user.

For example, in a customer profiling data warehouse in telecommunication sector, it is illogical to merge the list of customers at 8 pm on Wednesday from a customer database with the customer subscription events up to 8 pm on Tuesday. This would mean that we are finding the customers for whom there are no associated subscriptions.

Loading the Data

After extracting the data, it is loaded into a temporary data store where it is cleaned up and made consistent.

Note − Consistency checks are executed only when all the data sources have been loaded into the temporary data store.

Clean and Transform Process

Once the data is extracted and loaded into the temporary data store, it is time to perform Cleaning and Transforming. Here is the list of steps involved in Cleaning and Transforming −

  1. Clean and transform the loaded data into a structure

  2. Partition the data

  3. Aggregation

Clean and Transform the Loaded Data into a Structure

Cleaning and transforming the loaded data helps speed up the queries. It can be done by making the data consistent −

  1. within itself.

  2. with other data within the same data source.

  3. with the data in other source systems.

  4. with the existing data present in the warehouse.

Transforming involves converting the source data into a structure. Structuring the data increases the query performance and decreases the operational cost. The data contained in a data warehouse must be transformed to support performance requirements and control the ongoing operational costs.

Partition the Data

Partitioning optimizes hardware performance and simplifies the management of the data warehouse. Here we partition each fact table into multiple separate partitions.

Aggregation

Aggregation is required to speed up common queries. Aggregation relies on the fact that most common queries will analyze a subset or an aggregation of the detailed data.
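Partitioning and aggregation can be sketched together on a toy fact table. The rows and the month-based partitioning scheme are invented for illustration; a real warehouse would do this with physical table partitions and materialized aggregates.

```python
from collections import defaultdict

# Hypothetical fact rows: (sale_date, item, units).
facts = [("2010-01-05", "Mouse", 3), ("2010-01-20", "Keyboard", 2),
         ("2010-02-02", "Mouse", 7)]

def partition_by_month(facts):
    """Partitioning: one bucket per month of the fact table."""
    parts = defaultdict(list)
    for row in facts:
        parts[row[0][:7]].append(row)  # key on 'YYYY-MM'
    return dict(parts)

def monthly_totals(parts):
    """Aggregation: precompute total units per partition, so the common
    'units per month' query never has to scan the detail rows."""
    return {month: sum(r[2] for r in rows) for month, rows in parts.items()}

parts = partition_by_month(facts)
print(monthly_totals(parts))  # {'2010-01': 5, '2010-02': 7}
```

A query for one month now touches a single partition, and a query for monthly totals reads the precomputed aggregate instead of the detail.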

Backup and Archive the Data

In order to recover the data in the event of data loss, software failure, or hardware failure, it is necessary to keep regular backups. Archiving involves removing old data from the system in a format that allows it to be quickly restored whenever required.

For example, in a retail sales analysis data warehouse, it may be required to keep data for 3 years, with the latest 6 months of data kept online. In such a scenario, there is often a requirement to be able to do month-on-month comparisons for this year and last year. In this case, we require some data to be restored from the archive.

Query Management Process

This process performs the following functions −

  1. manages the queries.

  2. helps speed up the execution time of queries.

  3. directs the queries to their most effective data sources.

  4. ensures that all the system sources are used in the most effective way.

  5. monitors actual query profiles.

The information generated in this process is used by the warehouse management process to determine which aggregations to generate. This process does not generally operate during the regular load of information into the data warehouse.

Data Warehousing - Architecture

In this chapter, we will discuss the business analysis framework for the data warehouse design and architecture of a data warehouse.

Business Analysis Framework

Business analysts get information from the data warehouse to measure performance and make critical adjustments in order to win over other business stakeholders in the market. Having a data warehouse offers the following advantages −

  1. Since a data warehouse can gather information quickly and efficiently, it can enhance business productivity.

  2. A data warehouse provides us a consistent view of customers and items, hence, it helps us manage customer relationship.

  3. A data warehouse also helps in bringing down costs by tracking trends and patterns over a long period in a consistent and reliable manner.

To design an effective and efficient data warehouse, we need to understand and analyze the business needs and construct a business analysis framework. Each person has different views regarding the design of a data warehouse. These views are as follows −

  1. The top-down view − This view allows the selection of relevant information needed for a data warehouse.

  2. The data source view − This view presents the information being captured, stored, and managed by the operational system.

  3. The data warehouse view − This view includes the fact tables and dimension tables. It represents the information stored inside the data warehouse.

  4. The business query view − It is the view of the data from the viewpoint of the end-user.

Three-Tier Data Warehouse Architecture

Generally, a data warehouse adopts a three-tier architecture. Following are the three tiers of the data warehouse architecture.

  1. Bottom Tier − The bottom tier of the architecture is the data warehouse database server. It is the relational database system. We use the back end tools and utilities to feed data into the bottom tier. These back end tools and utilities perform the Extract, Clean, Load, and refresh functions.

  2. Middle Tier − In the middle tier, we have the OLAP Server that can be implemented in either of the following ways. By Relational OLAP (ROLAP), which is an extended relational database management system. The ROLAP maps the operations on multidimensional data to standard relational operations. By Multidimensional OLAP (MOLAP) model, which directly implements the multidimensional data and operations.

  3. Top-Tier − This tier is the front-end client layer. This layer holds the query tools and reporting tools, analysis tools and data mining tools.

The following diagram depicts the three-tier architecture of data warehouse −

dwh architecture

Data Warehouse Models

From the perspective of data warehouse architecture, we have the following data warehouse models −

  1. Virtual Warehouse

  2. Data mart

  3. Enterprise Warehouse

Virtual Warehouse

The view over an operational database is known as a virtual warehouse. It is easy to build a virtual warehouse, but building one requires excess capacity on the operational database servers.

Data Mart

A data mart contains a subset of organization-wide data. This subset of data is valuable to specific groups of an organization.

In other words, we can claim that data marts contain data specific to a particular group. For example, the marketing data mart may contain data related to items, customers, and sales. Data marts are confined to subjects.

Points to remember about data marts −

  1. Windows-based or Unix/Linux-based servers are used to implement data marts. They are implemented on low-cost servers.

  2. The implementation cycle of a data mart is measured in short periods of time, i.e., in weeks rather than months or years.

  3. The life cycle of a data mart may be complex in the long run, if its planning and design are not organization-wide.

  4. Data marts are small in size.

  5. Data marts are customized by department.

  6. The source of a data mart is a departmentally structured data warehouse.

  7. Data marts are flexible.

Enterprise Warehouse

  1. An enterprise warehouse collects all the information and the subjects spanning an entire organization.

  2. It provides us enterprise-wide data integration.

  3. The data is integrated from operational systems and external information providers.

  4. This information can vary from a few gigabytes to hundreds of gigabytes, terabytes or beyond.

Load Manager

This component performs the operations required for the extract and load process.

The size and complexity of a load manager varies between specific solutions from one data warehouse to another.

Load Manager Architecture

The load manager performs the following functions −

  1. Extract the data from the source system.

  2. Fast load the extracted data into a temporary data store.

  3. Perform simple transformations into a structure similar to the one in the data warehouse.

load manager

Extract Data from Source

The data is extracted from the operational databases or the external information providers. Gateways are the application programs that are used to extract data. A gateway is supported by the underlying DBMS and allows a client program to generate SQL to be executed at a server. Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) are examples of gateways.

Fast Load

  1. In order to minimize the total load window, the data needs to be loaded into the warehouse in the fastest possible time.

  2. The transformations affect the speed of data processing.

  3. It is more effective to load the data into a relational database prior to applying transformations and checks.

  4. Gateway technology proves to be not suitable, since it tends not to be performant when large data volumes are involved.

Simple Transformations

While loading, it may be required to perform simple transformations. Only after these have been completed are we in a position to do the complex checks. Suppose we are loading the EPOS sales transactions; we need to perform the following checks:

  1. Strip out all the columns that are not required within the warehouse.

  2. Convert all the values to required data types.
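These two load-time checks can be sketched as a small routine. Note that the column names and target types below are hypothetical placeholders, not taken from a real EPOS feed.

```python
# Sketch of simple load-time transformations: strip columns that are not
# required within the warehouse and convert values to required data types.
# The REQUIRED mapping is illustrative only.
REQUIRED = {"txn_id": int, "amount": float, "store": str}

def transform(record):
    """Keep only the required columns and cast each to its target type."""
    return {col: cast(record[col]) for col, cast in REQUIRED.items()}

raw = {"txn_id": "1001", "amount": "12.50", "store": "S16", "till_ip": "10.0.0.7"}
clean = transform(raw)   # till_ip is stripped; strings become int/float
```

More complex integrity checks would run later in the warehouse manager, once data of this cleaned shape has been fast-loaded.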

Warehouse Manager

A warehouse manager is responsible for the warehouse management process. It consists of third-party system software, C programs, and shell scripts.

The size and complexity of a warehouse manager varies between specific solutions.

Warehouse Manager Architecture

A warehouse manager includes the following −

  1. The controlling process

  2. Stored procedures or C with SQL

  3. Backup/Recovery tool

  4. SQL Scripts

warehouse manager

Operations Performed by Warehouse Manager

  1. A warehouse manager analyzes the data to perform consistency and referential integrity checks.

  2. Creates indexes, business views, partition views against the base data.

  3. Generates new aggregations and updates existing aggregations. Generates normalizations.

  4. Transforms and merges the source data into the published data warehouse.

  5. Backs up the data in the data warehouse.

  6. Archives the data that has reached the end of its captured life.

Note − A warehouse manager also analyzes query profiles to determine whether indexes and aggregations are appropriate.

Query Manager

  1. Query manager is responsible for directing the queries to the suitable tables.

  2. By directing the queries to appropriate tables, the speed of querying and response generation can be increased.

  3. Query manager is responsible for scheduling the execution of the queries posed by the user.

Query Manager Architecture

The following diagram shows the architecture of a query manager. It includes the following:

  1. Query redirection via C tool or RDBMS

  2. Stored procedures

  3. Query management tool

  4. Query scheduling via C tool or RDBMS

  5. Query scheduling via third-party software

query manager

Detailed Information

Detailed information is not kept online; rather, it is aggregated to the next level of detail and then archived to tape. The detailed information part of the data warehouse keeps the detailed information in the starflake schema. Detailed information is loaded into the data warehouse to supplement the aggregated data.

The following diagram shows a pictorial impression of where detailed information is stored and how it is used.

detailed information

Note − If detailed information is held offline to minimize disk storage, we should make sure that the data has been extracted, cleaned up, and transformed into starflake schema before it is archived.

Summary Information

Summary information is a part of the data warehouse that stores predefined aggregations. These aggregations are generated by the warehouse manager. Summary information must be treated as transient; it changes on-the-go in order to respond to changing query profiles.

The points to note about summary information are as follows −

  1. Summary information speeds up the performance of common queries.

  2. It increases the operational cost.

  3. It needs to be updated whenever new data is loaded into the data warehouse.

  4. It may not need to be backed up, since it can be generated fresh from the detailed information.

Data Warehousing - OLAP

An Online Analytical Processing (OLAP) server is based on the multidimensional data model. It allows managers and analysts to get an insight into the information through fast, consistent, and interactive access to information. This chapter covers the types of OLAP servers, OLAP operations, and the differences between OLAP and statistical databases and between OLAP and OLTP.

Types of OLAP Servers

We have four types of OLAP servers −

  1. Relational OLAP (ROLAP)

  2. Multidimensional OLAP (MOLAP)

  3. Hybrid OLAP (HOLAP)

  4. Specialized SQL Servers

Relational OLAP

ROLAP servers are placed between the relational back-end server and client front-end tools. To store and manage warehouse data, ROLAP uses a relational or extended-relational DBMS.

ROLAP includes the following −

  1. Implementation of aggregation navigation logic.

  2. Optimization for each DBMS back end.

  3. Additional tools and services.

Multidimensional OLAP

MOLAP uses array-based multidimensional storage engines for multidimensional views of data. With multidimensional data stores, the storage utilization may be low if the data set is sparse. Therefore, many MOLAP servers use two levels of data storage representation to handle dense and sparse data sets.

Hybrid OLAP

Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers the higher scalability of ROLAP and the faster computation of MOLAP. HOLAP servers allow storing large volumes of detailed information; the aggregations are stored separately in a MOLAP store.

Specialized SQL Servers

Specialized SQL servers provide advanced query language and query processing support for SQL queries over star and snowflake schemas in a read-only environment.

OLAP Operations

Since OLAP servers are based on a multidimensional view of data, we will discuss OLAP operations on multidimensional data.

Here is the list of OLAP operations −

  1. Roll-up

  2. Drill-down

  3. Slice and dice

  4. Pivot (rotate)

Roll-up

Roll-up performs aggregation on a data cube in any of the following ways −

  1. By climbing up a concept hierarchy for a dimension

  2. By dimension reduction

The following diagram illustrates how roll-up works.

rollup
  1. Roll-up is performed by climbing up a concept hierarchy for the dimension location.

  2. Initially the concept hierarchy was "street < city < province < country".

  3. On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country.

  4. The data is grouped into countries rather than cities.

  5. When roll-up is performed, one or more dimensions from the data cube are removed.
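The climb from city to country can be sketched in a few lines; the sales figures and city-to-country mapping below are illustrative data, not from the diagram.

```python
# Sketch of roll-up along a location concept hierarchy: measures recorded
# at the city level are aggregated up to the country level.
city_sales = {"Vancouver": 120, "Victoria": 80, "Chicago": 200}
country_of = {"Vancouver": "Canada", "Victoria": "Canada", "Chicago": "USA"}

def roll_up(sales, hierarchy):
    """Aggregate measures by climbing one level of the concept hierarchy."""
    out = {}
    for city, amount in sales.items():
        country = hierarchy[city]
        out[country] = out.get(country, 0) + amount
    return out

country_sales = roll_up(city_sales, country_of)
```

Drill-down is the inverse: it would descend the same hierarchy, recovering the finer-grained city figures from detailed data.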

Drill-down

Drill-down is the reverse operation of roll-up. It is performed by either of the following ways −

  1. By stepping down a concept hierarchy for a dimension

  2. By introducing a new dimension.

The following diagram illustrates how drill-down works −

drill down
  1. Drill-down is performed by stepping down a concept hierarchy for the dimension time.

  2. Initially the concept hierarchy was "day < month < quarter < year."

  3. On drilling down, the time dimension is descended from the level of quarter to the level of month.

  4. When drill-down is performed, one or more dimensions from the data cube are added.

  5. It navigates the data from less detailed data to highly detailed data.

Slice

The slice operation selects one particular dimension from a given cube and provides a new sub-cube. Consider the following diagram that shows how slice works.

slice
  1. Here Slice is performed for the dimension "time" using the criterion time = "Q1".

  2. It will form a new sub-cube by selecting one or more dimensions.

Dice

Dice selects two or more dimensions from a given cube and provides a new sub-cube. Consider the following diagram that shows the dice operation.

dice

The dice operation on the cube based on the following selection criteria involves three dimensions.

  1. (location = "Toronto" or "Vancouver")

  2. (time = "Q1" or "Q2")

  3. (item =" Mobile" or "Modem")
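The dice operation above amounts to filtering cube cells on several dimensions at once. A minimal sketch, using a small illustrative list of cells rather than the full cube from the diagram:

```python
# Sketch of the dice operation: a cell survives only if its value on every
# named dimension is in the allowed set for that dimension.
cells = [
    {"location": "Toronto",   "time": "Q1", "item": "Mobile", "units": 10},
    {"location": "Vancouver", "time": "Q2", "item": "Modem",  "units": 5},
    {"location": "Toronto",   "time": "Q3", "item": "Mobile", "units": 7},
]

def dice(cube, **criteria):
    """Select a sub-cube using selection criteria on two or more dimensions."""
    return [c for c in cube
            if all(c[dim] in allowed for dim, allowed in criteria.items())]

sub_cube = dice(cells,
                location={"Toronto", "Vancouver"},
                time={"Q1", "Q2"},
                item={"Mobile", "Modem"})
```

A slice is the degenerate case with a single criterion on one dimension, e.g. `dice(cells, time={"Q1"})`.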

Pivot

The pivot operation is also known as rotation. It rotates the data axes in view in order to provide an alternative presentation of data. Consider the following diagram that shows the pivot operation.

pivot

OLAP vs OLTP

| Sr.No. | Data Warehouse (OLAP) | Operational Database (OLTP) |
| --- | --- | --- |
| 1 | Involves historical processing of information. | Involves day-to-day processing. |
| 2 | OLAP systems are used by knowledge workers such as executives, managers, and analysts. | OLTP systems are used by clerks, DBAs, or database professionals. |
| 3 | Useful in analyzing the business. | Useful in running the business. |
| 4 | It focuses on Information out. | It focuses on Data in. |
| 5 | Based on Star Schema, Snowflake Schema, and Fact Constellation Schema. | Based on the Entity Relationship Model. |
| 6 | Contains historical data. | Contains current data. |
| 7 | Provides summarized and consolidated data. | Provides primitive and highly detailed data. |
| 8 | Provides a summarized and multidimensional view of data. | Provides a detailed and flat relational view of data. |
| 9 | Number of users is in hundreds. | Number of users is in thousands. |
| 10 | Number of records accessed is in millions. | Number of records accessed is in tens. |
| 11 | Database size is from 100 GB to 1 TB. | Database size is from 100 MB to 1 GB. |
| 12 | Highly flexible. | Provides high performance. |

Data Warehousing - Relational OLAP

Relational OLAP servers are placed between the relational back-end server and client front-end tools. To store and manage the warehouse data, relational OLAP uses a relational or extended-relational DBMS.

ROLAP includes the following −

  1. Implementation of aggregation navigation logic

  2. Optimization for each DBMS back-end

  3. Additional tools and services

Points to Remember

  1. ROLAP servers are highly scalable.

  2. ROLAP tools analyze large volumes of data across multiple dimensions.

  3. ROLAP tools store and analyze highly volatile and changeable data.

Relational OLAP Architecture

ROLAP includes the following components −

  1. Database server

  2. ROLAP server

  3. Front-end tool.

rolap architecture

Advantages

  1. ROLAP servers can be easily used with existing RDBMS.

  2. Data can be stored efficiently, since no zero facts need to be stored.

  3. ROLAP tools do not use pre-calculated data cubes.

  4. The DSS server of MicroStrategy adopts the ROLAP approach.

Disadvantages

  1. Poor query performance.

  2. Some limitations of scalability depending on the technology architecture that is utilized.

Data Warehousing - Multidimensional OLAP

Multidimensional OLAP (MOLAP) uses array-based multidimensional storage engines for multidimensional views of data. With multidimensional data stores, the storage utilization may be low if the dataset is sparse. Therefore, many MOLAP servers use two levels of data storage representation to handle dense and sparse datasets.

Points to Remember −

  1. MOLAP tools process information with consistent response time regardless of level of summarizing or calculations selected.

  2. MOLAP tools need to avoid many of the complexities of creating a relational database to store data for analysis.

  3. MOLAP tools need the fastest possible performance.

  4. A MOLAP server adopts two levels of storage representation to handle dense and sparse data sets.

  5. Denser sub-cubes are identified and stored as array structures.

  6. Sparse sub-cubes employ compression technology.

MOLAP Architecture

MOLAP includes the following components −

  1. Database server.

  2. MOLAP server.

  3. Front-end tool.

molap architecture

Advantages

  1. MOLAP allows the fastest indexing to the pre-computed summarized data.

  2. Helps the users connected to a network who need to analyze larger, less-defined data.

  3. Easier to use, therefore MOLAP is suitable for inexperienced users.

Disadvantages

  1. MOLAP is not capable of containing detailed data.

  2. The storage utilization may be low if the data set is sparse.

MOLAP vs ROLAP

| Sr.No. | MOLAP | ROLAP |
| --- | --- | --- |
| 1 | Information retrieval is fast. | Information retrieval is comparatively slow. |
| 2 | Uses a sparse array to store data sets. | Uses relational tables. |
| 3 | MOLAP is best suited for inexperienced users, since it is very easy to use. | ROLAP is best suited for experienced users. |
| 4 | Maintains a separate database for data cubes. | It may not require space other than that available in the data warehouse. |
| 5 | DBMS facility is weak. | DBMS facility is strong. |

Data Warehousing - Schemas

A schema is a logical description of the entire database. It includes the name and description of records of all record types, including all associated data items and aggregates. Much like a database, a data warehouse also requires a schema to be maintained. A database uses the relational model, while a data warehouse uses the Star, Snowflake, and Fact Constellation schemas. In this chapter, we will discuss the schemas used in a data warehouse.

Star Schema

  1. Each dimension in a star schema is represented with only one dimension table.

  2. This dimension table contains the set of attributes.

  3. The following diagram shows the sales data of a company with respect to the four dimensions, namely time, item, branch, and location.

star schema
  1. There is a fact table at the center. It contains the keys to each of four dimensions.

  2. The fact table also contains the attributes, namely dollars sold and units sold.

Note − Each dimension has only one dimension table and each table holds a set of attributes. For example, the location dimension table contains the attribute set {location_key, street, city, province_or_state, country}. This constraint may cause data redundancy. For example, "Vancouver" and "Victoria" are both cities in the Canadian province of British Columbia. The entries for such cities may cause data redundancy along the attributes province_or_state and country.
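The redundancy the note describes is easy to see in a sketch: in a star schema every fact row carries a foreign key into each dimension table, and answering a query means resolving those keys. The key values and attributes below are illustrative, not the diagram's actual data.

```python
# Sketch of a star-schema lookup: a fact row holds a location_key that is
# resolved against the (denormalized) location dimension table.
location_dim = {
    101: {"street": "Main St", "city": "Vancouver",
          "province_or_state": "British Columbia", "country": "Canada"},
}
fact = {"time_key": 1, "item_key": 2, "branch_key": 3,
        "location_key": 101, "dollars_sold": 250.0, "units_sold": 10}

# Join the fact row with its location dimension row.
resolved = {**fact, **location_dim[fact["location_key"]]}
```

Every Vancouver fact resolves to the same repeated province and country strings, which is exactly the redundancy the snowflake schema removes by normalizing the dimension table.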

Snowflake Schema

  1. Some dimension tables in the Snowflake schema are normalized.

  2. The normalization splits up the data into additional tables.

  3. Unlike in a star schema, the dimension tables in a snowflake schema are normalized. For example, the item dimension table in the star schema is normalized and split into two dimension tables, namely the item and supplier tables.

snowflake
  1. Now the item dimension table contains the attributes item_key, item_name, type, brand, and supplier-key.

  2. The supplier key is linked to the supplier dimension table. The supplier dimension table contains the attributes supplier_key and supplier_type.

Note − Due to normalization in the snowflake schema, the redundancy is reduced; therefore, it becomes easier to maintain and saves storage space.

Fact Constellation Schema

  1. A fact constellation has multiple fact tables. It is also known as galaxy schema.

  2. The following diagram shows two fact tables, namely sales and shipping.

fact constellation
  1. The sales fact table is same as that in the star schema.

  2. The shipping fact table has the five dimensions, namely item_key, time_key, shipper_key, from_location, to_location.

  3. The shipping fact table also contains two measures, namely dollars cost and units shipped.

  4. It is also possible to share dimension tables between fact tables. For example, time, item, and location dimension tables are shared between the sales and shipping fact table.

Schema Definition

A multidimensional schema is defined using Data Mining Query Language (DMQL). The two primitives, cube definition and dimension definition, can be used for defining data warehouses and data marts.

Syntax for Cube Definition

define cube <cube_name> [<dimension_list>]: <measure_list>

Syntax for Dimension Definition

define dimension <dimension_name> as (<attribute_or_dimension_list>)

Star Schema Definition

The star schema that we have discussed can be defined using Data Mining Query Language (DMQL) as follows −

define cube sales star [time, item, branch, location]:

dollars sold = sum(sales in dollars), units sold = count(*)

define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state, country)

Snowflake Schema Definition

Snowflake schema can be defined using DMQL as follows −

define cube sales snowflake [time, item, branch, location]:

dollars sold = sum(sales in dollars), units sold = count(*)

define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier (supplier key, supplier type))
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city (city key, city, province or state, country))

Fact Constellation Schema Definition

Fact constellation schema can be defined using DMQL as follows −

define cube sales [time, item, branch, location]:

dollars sold = sum(sales in dollars), units sold = count(*)

define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state,country)
define cube shipping [time, item, shipper, from location, to location]:

dollars cost = sum(cost in dollars), units shipped = count(*)

define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper key, shipper name, location as location in cube sales, shipper type)
define dimension from location as location in cube sales
define dimension to location as location in cube sales

Data Warehousing - Partitioning Strategy

Partitioning is done to enhance performance and facilitate easy management of data. Partitioning also helps in balancing the various requirements of the system. It optimizes the hardware performance and simplifies the management of the data warehouse by partitioning each fact table into multiple separate partitions. In this chapter, we will discuss different partitioning strategies.

Why is it Necessary to Partition?

Partitioning is important for the following reasons −

  1. For easy management,

  2. To assist backup/recovery,

  3. To enhance performance.

For Easy Management

The fact table in a data warehouse can grow up to hundreds of gigabytes in size. A fact table of this size is very hard to manage as a single entity. Therefore it needs partitioning.

To Assist Backup/Recovery

If we do not partition the fact table, then we have to load the complete fact table with all the data. Partitioning allows us to load only as much data as is required on a regular basis. It reduces the time to load and also enhances the performance of the system.

Note − To cut down on the backup size, all partitions other than the current partition can be marked as read-only. We can then put these partitions into a state where they cannot be modified. Then they can be backed up. It means only the current partition is to be backed up.

To Enhance Performance

By partitioning the fact table into sets of data, the query procedures can be enhanced. Query performance is enhanced because now the query scans only those partitions that are relevant. It does not have to scan the whole data.

Horizontal Partitioning

There are various ways in which a fact table can be partitioned. In horizontal partitioning, we have to keep in mind the requirements for manageability of the data warehouse.

Partitioning by Time into Equal Segments

In this partitioning strategy, the fact table is partitioned on the basis of time period. Here each time period represents a significant retention period within the business. For example, if the user queries for month-to-date data, then it is appropriate to partition the data into monthly segments. We can reuse the partitioned tables by removing the data in them.
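Routing fact rows into monthly segments can be sketched as follows; the row layout and field name `sales_date` are illustrative assumptions.

```python
# Sketch of horizontal partitioning by time into equal (monthly) segments:
# each fact row is routed to the partition for its (year, month).
from collections import defaultdict
from datetime import date

def monthly_partition(rows):
    """Group fact rows into per-month partitions keyed by (year, month)."""
    parts = defaultdict(list)
    for row in rows:
        d = row["sales_date"]
        parts[(d.year, d.month)].append(row)
    return parts

rows = [{"sales_date": date(2013, 8, 3), "value": 3.67},
        {"sales_date": date(2013, 9, 3), "value": 5.33}]
parts = monthly_partition(rows)
```

Reusing a partition then amounts to clearing one (year, month) bucket and loading new rows into it, leaving all other segments untouched.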

Partition by Time into Different-sized Segments

This kind of partitioning is done where aged data is accessed infrequently. It is implemented as a set of small partitions for relatively current data and larger partitions for inactive data.

partition

Points to Note

  1. The detailed information remains available online.

  2. The number of physical tables is kept relatively small, which reduces the operating cost.

  3. This technique is suitable where a mix of data dipping into recent history and data mining through the entire history is required.

  4. This technique is not useful where the partitioning profile changes on a regular basis, because repartitioning will increase the operation cost of data warehouse.

Partition on a Different Dimension

The fact table can also be partitioned on the basis of dimensions other than time such as product group, region, supplier, or any other dimension. Let’s have an example.

Suppose a market function has been structured into distinct regional departments, for example, on a state-by-state basis. If each region wants to query on information captured within its region, it would prove to be more effective to partition the fact table into regional partitions. This will cause the queries to speed up because they do not require scanning information that is not relevant.

Points to Note

  1. The query does not have to scan irrelevant data which speeds up the query process.

  2. This technique is not appropriate where the dimensions are unlikely to change in future. So, it is worth determining that the dimension does not change in future.

  3. If the dimension changes, then the entire fact table would have to be repartitioned.

Note − We recommend performing the partition only on the basis of the time dimension, unless you are certain that the suggested dimension grouping will not change within the life of the data warehouse.

Partition by Size of Table

When there is no clear basis for partitioning the fact table on any dimension, we should partition the fact table on the basis of its size. We can set a predetermined size as a critical point. When the table exceeds the predetermined size, a new table partition is created.
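The "critical point" idea can be sketched as a small routine; the threshold of two rows per partition is purely illustrative.

```python
# Sketch of partitioning by table size: rows accumulate in the current
# partition until it reaches a predetermined size, then a new partition
# is started (max_rows=2 is an illustrative critical point).
def partition_by_size(rows, max_rows=2):
    partitions, current = [], []
    for row in rows:
        current.append(row)
        if len(current) == max_rows:   # critical point reached
            partitions.append(current)
            current = []
    if current:
        partitions.append(current)
    return partitions

parts = partition_by_size(list(range(5)), max_rows=2)
```

The metadata requirement noted below follows directly: nothing about a row tells you which partition it landed in, so a separate catalog must record the contents of each partition.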

Points to Note

  1. This partitioning is complex to manage.

  2. It requires metadata to identify what data is stored in each partition.

Partitioning Dimensions

If a dimension contains a large number of entries, then it is required to partition the dimensions. Here we have to check the size of the dimension.

Consider a large design that changes over time. If we need to store all the variations in order to apply comparisons, that dimension may be very large. This would definitely affect the response time.

Round Robin Partitions

In the round-robin technique, when a new partition is needed, the old one is archived. It uses metadata to allow the user access tool to refer to the correct table partition.

This technique makes it easy to automate table management facilities within the data warehouse.

Vertical Partition

Vertical partitioning splits the data vertically. The following diagram depicts how vertical partitioning is done.

vertical partitioning

Vertical partitioning can be performed in the following two ways −

  1. Normalization

  2. Row Splitting

Normalization

Normalization is the standard relational method of database organization. In this method, repeated rows are collapsed into a single row, hence reducing space. Take a look at the following tables that show how normalization is performed.

Table before Normalization

| Product_id | Qty | Value | sales_date | Store_id | Store_name | Location | Region |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 30 | 5 | 3.67 | 3-Aug-13 | 16 | sunny | Bangalore | S |
| 35 | 4 | 5.33 | 3-Sep-13 | 16 | sunny | Bangalore | S |
| 40 | 5 | 2.50 | 3-Sep-13 | 64 | san | Mumbai | W |
| 45 | 7 | 5.66 | 3-Sep-13 | 16 | sunny | Bangalore | S |

Table after Normalization

| Store_id | Store_name | Location | Region |
| --- | --- | --- | --- |
| 16 | sunny | Bangalore | S |
| 64 | san | Mumbai | W |

| Product_id | Quantity | Value | sales_date | Store_id |
| --- | --- | --- | --- | --- |
| 30 | 5 | 3.67 | 3-Aug-13 | 16 |
| 35 | 4 | 5.33 | 3-Sep-13 | 16 |
| 40 | 5 | 2.50 | 3-Sep-13 | 64 |
| 45 | 7 | 5.66 | 3-Sep-13 | 16 |

Row Splitting

Row splitting tends to leave a one-to-one map between partitions. The motive of row splitting is to speed up access to a large table by reducing its size.
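A minimal sketch of row splitting: each wide row becomes two narrow rows that share a key, so the mapping between the two partitions stays one-to-one. The column names (`name`, `region`, `audit_blob`) are hypothetical.

```python
# Sketch of row splitting: frequently accessed ("hot") columns go to one
# partition, the rest ("cold") to another, joined one-to-one on the key.
def split_row(row, hot_cols, key="id"):
    hot = {key: row[key], **{c: row[c] for c in hot_cols}}
    cold = {k: v for k, v in row.items() if k == key or k not in hot_cols}
    return hot, cold

row = {"id": 1, "name": "sunny", "region": "S", "audit_blob": "..."}
hot, cold = split_row(row, hot_cols=["name", "region"])
```

Queries that touch only the hot columns now scan a much narrower table, which is the speed-up the text describes.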

Note − 在使用垂直分区时,确保在两个分区之间无需执行主要连接操作。

Note − While using vertical partitioning, make sure that there is no requirement to perform a major join operation between two partitions.
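A minimal sketch of row splitting, under the assumption that a few "hot" columns are accessed frequently and the rest can live in an overflow table. The column names and the two-dictionary representation are made up for illustration; each row keeps a one-to-one mapping via the key.

```python
def row_split(rows, hot_cols, key):
    """Split each wide row into a frequently-accessed table and an
    overflow table, keeping a one-to-one mapping via `key`."""
    hot, cold = [], []
    for row in rows:
        hot.append({c: row[c] for c in (key, *hot_cols)})
        cold.append({c: row[c] for c in row if c == key or c not in hot_cols})
    return hot, cold

rows = [
    {"product_id": 30, "qty": 5, "value": 3.67, "description": "long text..."},
    {"product_id": 35, "qty": 4, "value": 5.33, "description": "long text..."},
]
hot, cold = row_split(rows, hot_cols=("qty", "value"), key="product_id")
print(hot[0])   # {'product_id': 30, 'qty': 5, 'value': 3.67}
print(cold[0])  # {'product_id': 30, 'description': 'long text...'}
```

Queries that only touch the hot columns now scan a much smaller table, while the shared key lets the two halves be rejoined when needed.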

Identify Key to Partition

选择正确的分区键非常关键。选择错误的分区键将导致重新组织事实表。让我们举个例子。假设我们想要对以下表格进行分区。

It is very crucial to choose the right partition key. Choosing a wrong partition key will lead to reorganizing the fact table. Let’s have an example. Suppose we want to partition the following table.

Account_Txn_Table
transaction_id
account_id
transaction_type
value
transaction_date
region
branch_name

我们可以选择对任何键进行分区。两个可能的键可以是

We can choose to partition on any key. The two possible keys could be

  1. region

  2. transaction_date

假设业务组织在 30 个地理区域,每个区域都有不同数量的分支机构。这将为我们提供 30 个分区,这是合理的。此分区足够好,因为我们的要求捕获表明绝大多数查询都限制在用户自己的业务区域。

Suppose the business is organized into 30 geographical regions and each region has a different number of branches. That will give us 30 partitions, which is reasonable. This partitioning is good enough because our requirements capture has shown that a vast majority of queries are restricted to the user’s own business region.

如果我们按交易日期而不是按区域分区,那么每个区域的最新交易将位于一个分区中。现在,希望查看自己区域内数据的用户必须跨多个分区查询。

If we partition by transaction_date instead of region, then the latest transaction from every region will be in one partition. Now the user who wants to look at data within his own region has to query across multiple partitions.

因此,值得确定正确的分区键。

Hence it is worth determining the right partitioning key.
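The effect of the partition key choice can be demonstrated with a toy model that assumes one partition per distinct key value. The data and the query predicate are invented for the example; the point is only to count how many partitions a region-restricted query must touch under each key.

```python
transactions = [
    {"transaction_id": i, "region": r, "transaction_date": d}
    for i, (r, d) in enumerate([
        ("north", "2013-09-01"), ("north", "2013-09-02"),
        ("south", "2013-09-01"), ("west",  "2013-09-02"),
    ])
]

def partitions_scanned(rows, partition_key, predicate):
    """Count how many partitions a query touches, assuming one
    partition per distinct value of `partition_key` (toy model)."""
    return len({row[partition_key] for row in rows if predicate(row)})

# A typical query: "transactions for my own region"
my_region = lambda row: row["region"] == "north"

print(partitions_scanned(transactions, "region", my_region))            # one partition
print(partitions_scanned(transactions, "transaction_date", my_region))  # several partitions
```

Partitioning by region confines the common query to a single partition; partitioning by transaction_date forces the same query across every partition that holds a matching date.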

Data Warehousing - Metadata Concepts

What is Metadata?

元数据简单地被定义为关于数据的数据。用于表示其他数据的数据称为元数据。例如,一本的索引充当了书中内容的元数据。换言之,我们可以说元数据是将我们引导到详细数据的汇总数据。就数据仓库而言,我们可以将元数据定义如下。

Metadata is simply defined as data about data. The data that is used to represent other data is known as metadata. For example, the index of a book serves as metadata for the contents in the book. In other words, we can say that metadata is the summarized data that leads us to the detailed data. In terms of a data warehouse, we can define metadata as follows.

  1. Metadata is the road-map to a data warehouse.

  2. Metadata in a data warehouse defines the warehouse objects.

  3. Metadata acts as a directory. This directory helps the decision support system to locate the contents of a data warehouse.

Note − 在数据仓库中,我们创建元数据,用于给定数据仓库的数据名称和定义。在此元数据中,我们还创建其他元数据,用于对提取的任何数据的时间戳记进行时间戳记和提取数据的源。

Note − In a data warehouse, we create metadata for the data names and definitions of a given data warehouse. Along with this metadata, additional metadata is also created for time-stamping any extracted data and for recording the source of the extracted data.

Categories of Metadata

元数据可以大致分为三类 −

Metadata can be broadly categorized into three categories −

  1. Business Metadata − It has the data ownership information, business definition, and changing policies.

  2. Technical Metadata − It includes database system names, table and column names and sizes, data types and allowed values. Technical metadata also includes structural information such as primary and foreign key attributes and indices.

  3. Operational Metadata − It includes currency of data and data lineage. Currency of data means whether the data is active, archived, or purged. Lineage of data means the history of data migrated and transformation applied on it.

metadata categories
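A concrete way to picture the three categories is a single metadata entry for one warehouse column. The field names and values below are purely illustrative; a real repository would hold many such entries in a catalog.

```python
# A toy metadata entry for one warehouse column, grouping the three
# categories described above (all names and values are made up).
column_metadata = {
    "business": {
        "owner": "Sales department",
        "definition": "Net value of a sales transaction",
    },
    "technical": {
        "table": "sales_fact", "column": "value",
        "data_type": "DECIMAL(10,2)", "primary_key": False,
    },
    "operational": {
        "currency": "active",   # active / archived / purged
        "lineage": ["epos_feed", "strip_columns", "convert_types"],
    },
}

# A decision-support tool can use the entry as a directory into the warehouse:
print(column_metadata["technical"]["table"])       # which table to query
print(column_metadata["operational"]["currency"])  # whether the data is live
```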

Role of Metadata

元数据在数据仓库中发挥着非常重要的作用。元数据在仓库中的作用与仓库数据的作用不同,但它扮演着重要的作用。下面解释了元数据的各种作用。

Metadata has a very important role in a data warehouse. The role of metadata in a warehouse is different from the warehouse data, yet it plays an important role. The various roles of metadata are explained below.

  1. Metadata acts as a directory.

  2. This directory helps the decision support system to locate the contents of the data warehouse.

  3. Metadata helps in decision support system for mapping of data when data is transformed from operational environment to data warehouse environment.

  4. Metadata helps in summarization between current detailed data and highly summarized data.

  5. Metadata also helps in summarization between lightly detailed data and highly summarized data.

  6. Metadata is used for query tools.

  7. Metadata is used in extraction and cleansing tools.

  8. Metadata is used in reporting tools.

  9. Metadata is used in transformation tools.

  10. Metadata plays an important role in loading functions.

下图显示了元数据的角色。

The following diagram shows the roles of metadata.

metadata role

Metadata Repository

元数据存储库是数据仓库系统的一个组成部分。它具有以下元数据:

Metadata repository is an integral part of a data warehouse system. It has the following metadata −

  1. Definition of data warehouse − It includes the description of structure of data warehouse. The description is defined by schema, view, hierarchies, derived data definitions, and data mart locations and contents.

  2. Business metadata − It contains the data ownership information, business definition, and changing policies.

  3. Operational Metadata − It includes currency of data and data lineage. Currency of data means whether the data is active, archived, or purged. Lineage of data means the history of data migrated and transformation applied on it.

  4. Data for mapping from operational environment to data warehouse − It includes the source databases and their contents, data extraction, data partition cleaning, transformation rules, data refresh and purging rules.

  5. Algorithms for summarization − It includes dimension algorithms, data on granularity, aggregation, summarizing, etc.

Challenges for Metadata Management

元数据的重要性不容忽视。元数据有助于提高报告的准确性、验证数据转换并确保计算准确性。元数据还强制终端用户遵守业务术语的定义。除了所有这些对元数据的应用,元数据也面临着挑战。以下讨论了部分挑战。

The importance of metadata cannot be overstated. Metadata helps in driving the accuracy of reports, validates data transformation, and ensures the accuracy of calculations. Metadata also enforces the definition of business terms on business end-users. With all these uses of metadata, it also has its challenges. Some of the challenges are discussed below.

  1. Metadata in a big organization is scattered across the organization. This metadata is spread in spreadsheets, databases, and applications.

  2. Metadata could be present in text files or multimedia files. To use this data for information management solutions, it has to be correctly defined.

  3. There are no industry-wide accepted standards. Data management solution vendors have narrow focus.

  4. There are no easy and accepted methods of passing metadata.

Data Warehousing - Data Marting

Why Do We Need a Data Mart?

以下是创建数据市集的原因列表:

Listed below are the reasons to create a data mart −

  1. To partition data in order to impose access control strategies.

  2. To speed up the queries by reducing the volume of data to be scanned.

  3. To segment data into different hardware platforms.

  4. To structure data in a form suitable for a user access tool.

Note − 不要出于任何其他原因建立数据市集,因为数据市集的运营成本可能非常高。在建立数据市集之前,请确保数据市集策略适合你的特定解决方案。

Note − Do not create a data mart for any other reason, since the operational cost of data marting could be very high. Before data marting, make sure that the data marting strategy is appropriate for your particular solution.

Cost-effective Data Marting

按照以下步骤使数据市集具有成本效益:

Follow the steps given below to make data marting cost-effective −

  1. Identify the Functional Splits

  2. Identify User Access Tool Requirements

  3. Identify Access Control Issues

Identify the Functional Splits

在此步骤中,我们确定组织是否有自然的职能划分。我们寻找部门之间的划分,并确定各部门使用信息的方式是否与组织的其余部分隔离。我们举个例子。

In this step, we determine if the organization has natural functional splits. We look for departmental splits, and we determine whether the way in which departments use information tend to be in isolation from the rest of the organization. Let’s have an example.

考虑一个零售组织,其中每个商家负责最大化销售一组产品。为此,以下是有价值的信息 -

Consider a retail organization, where each merchant is accountable for maximizing the sales of a group of products. For this, the following are the valuable information −

  1. sales transaction on a daily basis

  2. sales forecast on a weekly basis

  3. stock position on a daily basis

  4. stock movements on a daily basis

由于商家对他们不经手的产品不感兴趣,所以数据市集是处理的产品组感兴趣的数据的一个子集。下图显示了面向不同用户的数仓。

As the merchants are not interested in the products they are not dealing with, the data mart is a subset of the data dealing with the product groups of interest. The following diagram shows data marting for different users.

data marting

以下是在确定功能划分时需要考虑的问题 -

Given below are the issues to be taken into account while determining the functional split −

  1. The structure of the department may change.

  2. The products might switch from one department to other.

  3. The merchant could query the sales trend of other products to analyze what is happening to the sales.

Note - 我们需要确定使用数据市集的业务优势和技术可行性。

Note − We need to determine the business benefits and technical feasibility of using a data mart.

Identify User Access Tool Requirements

我们需要数据市集来支持 user access tools ,它们需要内部数据结构。此类结构中的数据不受数据仓库控制,但需要定期填充和更新。

We need data marts to support user access tools that require internal data structures. The data in such structures is outside the control of the data warehouse, but it needs to be populated and updated on a regular basis.

有些工具直接从源系统填充,但有些工具不能。因此,需要识别超出工具范围的额外需求,以备将来使用。

There are some tools that populate directly from the source system, but some cannot. Therefore, additional requirements outside the scope of the tool need to be identified for the future.

Note - 为了确保所有访问工具中数据的一致性,不应直接从数据仓库填充数据,而是每个工具都必须有自己的数据市集。

Note − In order to ensure consistency of data across all access tools, the data should not be directly populated from the data warehouse, rather each tool must have its own data mart.

Identify Access Control Issues

应该制定隐私规则,以确保只有经过授权的用户才能访问数据。例如,零售银行机构的数据仓库可确保所有帐户都属于同一法人实体。隐私法可能会迫使你完全禁止访问不属于特定银行的信息。

There should be privacy rules to ensure the data is accessed by authorized users only. For example, a data warehouse for a retail banking institution ensures that all the accounts belong to the same legal entity. Privacy laws can force you to totally prevent access to information that is not owned by the specific bank.

数据市集允许我们在数据仓库内物理分离数据段,从而构建一个完整的墙。为了避免可能的隐私问题,可以从数据仓库中删除详细数据。我们可以为每个法人实体创建数据市集,并通过数据仓库加载,同时提供详细的帐户数据。

Data marts allow us to build a complete wall by physically separating data segments within the data warehouse. To avoid possible privacy problems, the detailed data can be removed from the data warehouse. We can create a data mart for each legal entity and load it via the data warehouse, with detailed account data.

Designing Data Marts

数据市集应设计为数据仓库中星型雪花模式的较小版本,并且应与数据仓库的数据库设计相匹配。它有助于控制数据库实例。

Data marts should be designed as a smaller version of the starflake schema within the data warehouse and should match the database design of the data warehouse. It helps in maintaining control over database instances.

designing datamart

摘要以与在数据仓库内设计相同的方式进行数据市集。汇总表有助于利用星型雪花模式中的所有维度数据。

The summaries are data marted in the same way as they would have been designed within the data warehouse. Summary tables help to utilize all dimension data in the starflake schema.

Cost of Data Marting

数据市集的成本度量如下 -

The cost measures for data marting are as follows −

  1. Hardware and Software Cost

  2. Network Access

  3. Time Window Constraints

Hardware and Software Cost

虽然数据集市创建在相同硬件上,但需要一些额外的硬件和软件。为了处理用户查询,它需要额外的处理能力和磁盘存储。如果在数据仓库中存在详细信息和数据集市,那么我们将面临额外的存储和管理复制数据的花费。

Although data marts are created on the same hardware, they require some additional hardware and software. To handle user queries, additional processing power and disk storage are required. If detailed data and the data mart exist within the data warehouse, then we would face additional cost to store and manage the replicated data.

Note − 数据集市比聚合花费更高,因此它应当用作一种附加策略,而不是备选策略。

Note − Data marting is more expensive than aggregations, therefore it should be used as an additional strategy and not as an alternative strategy.

Network Access

数据集市可能位于数据仓库的不同位置,因此我们应当确保局域网或广域网有能力处理在 data mart load process. 内传输的数据量

A data mart could be at a different location from the data warehouse, so we should ensure that the LAN or WAN has the capacity to handle the data volumes being transferred within the data mart load process.

Time Window Constraints

数据集市加载进程消耗可用时间窗口的程度取决于转换的复杂性和运送的数据量。确定有多少个数据集市取决于 −

The extent to which a data mart loading process will eat into the available time window depends on the complexity of the transformations and the data volumes being shipped. The determination of how many data marts are possible depends on −

  1. Network capacity.

  2. Time window available

  3. Volume of data being transferred

  4. Mechanisms being used to insert data into a data mart
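The trade-off among these factors can be sketched with some back-of-the-envelope arithmetic. The function names, volumes, and bandwidth figures below are invented for illustration; the point is only that the number of marts that fit is bounded by the time window and the transfer capacity.

```python
def load_time_hours(volume_gb, network_gb_per_hour, transform_factor=1.0):
    """Rough estimate of how much of the nightly window one data mart
    load consumes; transform_factor > 1 models transformation overhead."""
    return volume_gb * transform_factor / network_gb_per_hour

def marts_that_fit(mart_volumes_gb, window_hours, network_gb_per_hour):
    """Count how many mart loads fit into the available time window."""
    used, fitted = 0.0, 0
    for volume in mart_volumes_gb:
        cost = load_time_hours(volume, network_gb_per_hour)
        if used + cost > window_hours:
            break
        used += cost
        fitted += 1
    return fitted

# Three marts of 40/80/120 GB, an 8-hour window, 20 GB/hour of bandwidth:
print(marts_that_fit([40, 80, 120], window_hours=8, network_gb_per_hour=20))
```

With these (made-up) numbers, only the first two marts fit: 2 + 4 hours are used and the 6-hour third load would overrun the window.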

Data Warehousing - System Managers

系统管理对于数据仓库的成功实施至关重要。最重要的系统管理器包括 −

System management is mandatory for the successful implementation of a data warehouse. The most important system managers are −

  1. System configuration manager

  2. System scheduling manager

  3. System event manager

  4. System database manager

  5. System backup recovery manager

System Configuration Manager

  1. The system configuration manager is responsible for the management of the setup and configuration of data warehouse.

  2. The structure of configuration manager varies from one operating system to another.

  3. In Unix, the structure of the configuration manager varies from vendor to vendor.

  4. Configuration managers have a single user interface.

  5. The interface of configuration manager allows us to control all aspects of the system.

Note − 最重要的配置工具是 I/O 管理器。

Note − The most important configuration tool is the I/O manager.

System Scheduling Manager

系统调度管理器负责数据仓库的成功实施。其目的是调度临时查询。每个操作系统都有自己 的调度器,并采用某种形式的批处理控制机制。系统调度管理器必须具备的功能列表如下 −

The system scheduling manager is responsible for the successful implementation of the data warehouse. Its purpose is to schedule ad hoc queries. Every operating system has its own scheduler with some form of batch control mechanism. The list of features a system scheduling manager must have is as follows −

  1. Work across cluster or MPP boundaries

  2. Deal with international time differences

  3. Handle job failure

  4. Handle multiple queries

  5. Support job priorities

  6. Restart or re-queue the failed jobs

  7. Notify the user or a process when job is completed

  8. Maintain the job schedules across system outages

  9. Re-queue jobs to other queues

  10. Support the stopping and starting of queues

  11. Log Queued jobs

  12. Deal with inter-queue processing

Note − 上述列表可用作评估良好调度器的评估参数。

Note − The above list can be used as evaluation parameters for the evaluation of a good scheduler.

调度器必须能够处理的一些重要任务包括 −

Some important jobs that a scheduler must be able to handle are as follows −

  1. Daily and ad hoc query scheduling

  2. Execution of regular report requirements

  3. Data load

  4. Data processing

  5. Index creation

  6. Backup

  7. Aggregation creation

  8. Data transformation

Note −如果数据仓库在一个群集或 MPP 体系结构上运行,那么系统调度管理器一定要能够跨体系结构运行。

Note − If the data warehouse is running on a cluster or MPP architecture, then the system scheduling manager must be capable of running across the architecture.

System Event Manager

事件管理器是一种软件。事件管理器管理数据仓库系统上所定义的事件。我们无法手动管理数据仓库,因为数据仓库的结构非常复杂。因此,我们需要一种无需用户干预就能自动处理所有事件的工具。

The event manager is a kind of software. The event manager manages the events that are defined on the data warehouse system. We cannot manage the data warehouse manually because the structure of a data warehouse is very complex. Therefore, we need a tool that automatically handles all the events without any intervention from the user.

Note −事件管理器监视事件发生并处理它们。事件管理器还会追踪这个复杂的数据仓库系统中可能出现的大量问题。

Note − The Event manager monitors the events occurrences and deals with them. The event manager also tracks the myriad of things that can go wrong on this complex data warehouse system.

Events

事件是由用户或系统本身生成的活动。可以注意到,事件是一个定义活动的可度量、可观察的发生。

Events are the actions that are generated by the user or the system itself. It may be noted that an event is a measurable, observable occurrence of a defined action.

下面是一个需要追踪的常见事件列表。

Given below is a list of common events that are required to be tracked.

  1. Hardware failure

  2. Running out of space on certain key disks

  3. A process dying

  4. A process returning an error

  5. CPU usage exceeding an 80% threshold

  6. Internal contention on database serialization points

  7. Buffer cache hit ratios exceeding or falling below the threshold

  8. A table reaching the maximum of its size

  9. Excessive memory swapping

  10. A table failing to extend due to lack of space

  11. Disk exhibiting I/O bottlenecks

  12. Usage of temporary or sort area reaching certain thresholds

  13. Any other database shared memory usage

事件最重要的事情是,它们应该能够自己执行。事件包定义预定义事件的过程。与每个事件关联的代码称为事件处理程序。此代码在发生事件时执行。

The most important thing about events is that they should be capable of executing on their own. Event packages define the procedures for the predefined events. The code associated with each event is known as the event handler. This code is executed whenever an event occurs.
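The event/handler relationship can be sketched as a small dispatcher: handlers are registered for predefined events and fire automatically when the event is raised. The class, event names, and handler actions are hypothetical.

```python
class EventManager:
    """Sketch of an event manager: handlers registered for predefined
    events run automatically, with no user intervention."""

    def __init__(self):
        self.handlers = {}
        self.log = []

    def on(self, event, handler):
        """Register an event handler for a predefined event."""
        self.handlers.setdefault(event, []).append(handler)

    def raise_event(self, event, **details):
        """Fire every handler registered for the event."""
        for handler in self.handlers.get(event, []):
            self.log.append(handler(**details))

mgr = EventManager()
mgr.on("disk_full", lambda disk: f"archiving oldest partition on {disk}")
mgr.on("cpu_high", lambda usage: f"throttling ad hoc queries at {usage}%")

mgr.raise_event("disk_full", disk="/dw/data01")
mgr.raise_event("cpu_high", usage=85)
print(mgr.log)
```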

System and Database Manager

系统和数据库管理器可能是两个独立的软件,但它们执行相同的工作。这些工具的目标是自动化某些过程并简化其他过程的执行。选择系统和数据库管理器的标准如下 −

System and database manager may be two separate pieces of software, but they do the same job. The objective of these tools is to automate certain processes and to simplify the execution of others. The criteria for choosing a system and the database manager are as follows −

  1. increase user’s quota.

  2. assign and de-assign roles to the users

  3. assign and de-assign the profiles to the users

  4. perform database space management

  5. monitor and report on space usage

  6. tidy up fragmented and unused space

  7. add and expand the space

  8. add and remove users

  9. manage user password

  10. manage summary or temporary tables

  11. assign or deassign temporary space to and from the user

  12. reclaim the space from old or out-of-date temporary tables

  13. manage error and trace logs

  14. to browse log and trace files

  15. redirect error or trace information

  16. switch on and off error and trace logging

  17. perform system space management

  18. monitor and report on space usage

  19. clean up old and unused file directories

  20. add or expand space.

System Backup Recovery Manager

备份和恢复工具使用户操作和管理人员可以轻松备份数据。注意,系统备份管理器必须与正在使用的计划管理器软件集成。备份管理所需的重要功能如下:

The backup and recovery tool makes it easy for operations and management staff to back up the data. Note that the system backup manager must be integrated with the schedule manager software being used. The important features that are required for the management of backups are as follows −

  1. Scheduling

  2. Backup data tracking

  3. Database awareness

仅进行备份以防止数据丢失。以下是要记住的重要事项:

Backups are taken only to protect against data loss. Following are the important points to remember −

  1. The backup software will keep some form of database of where and when the piece of data was backed up.

  2. The backup recovery manager must have a good front-end to that database.

  3. The backup recovery software should be database aware.

  4. Being aware of the database, the software can then be addressed in database terms, and will not perform backups that would not be viable.
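The "database of where and when the piece of data was backed up" mentioned in the first point can be pictured as a small catalog. This is an illustrative sketch; the class, tape labels, and dataset names are made up.

```python
from datetime import date

class BackupCatalog:
    """Toy catalog recording where and when each piece of data was
    backed up, the kind of record a backup manager keeps."""

    def __init__(self):
        self.entries = []

    def record(self, dataset, tape, when):
        self.entries.append({"dataset": dataset, "tape": tape, "date": when})

    def latest(self, dataset):
        """Front-end query: find the most recent backup of a dataset."""
        matches = [e for e in self.entries if e["dataset"] == dataset]
        return max(matches, key=lambda e: e["date"]) if matches else None

catalog = BackupCatalog()
catalog.record("sales_fact", "TAPE-014", date(2013, 9, 1))
catalog.record("sales_fact", "TAPE-021", date(2013, 9, 8))
print(catalog.latest("sales_fact")["tape"])   # the most recent tape
```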

Data Warehousing - Process Managers

进程管理器负责维护进出数据仓库的数据流。有三种不同类型的进程管理器:

Process managers are responsible for maintaining the flow of data both into and out of the data warehouse. There are three different types of process managers −

  1. Load manager

  2. Warehouse manager

  3. Query manager

Data Warehouse Load Manager

加载管理器执行从数据库中提取和加载数据的所需操作。加载管理器的规模和复杂性因从一个数据仓库到另一个数据仓库的特定解决方案而异。

Load manager performs the operations required to extract and load the data into the database. The size and complexity of a load manager varies between specific solutions from one data warehouse to another.

Load Manager Architecture

负载管理器执行以下函数 -

The load manager performs the following functions −

  1. Extract data from the source system.

  2. Fast load the extracted data into temporary data store.

  3. Perform simple transformations into structure similar to the one in the data warehouse.

load manager

Extract Data from Source

数据是从操作数据库或外部信息提供程序中提取的。网关是指用于提取数据应用程序程序。它受底层 DBMS 支持,并允许客户端程序生成要在服务器上执行的 SQL。开放数据库连接 (ODBC) 和 Java 数据库连接 (JDBC) 都是网关的示例。

The data is extracted from the operational databases or the external information providers. Gateways are the application programs that are used to extract data. They are supported by the underlying DBMS and allow the client program to generate SQL to be executed at a server. Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) are examples of gateways.

Fast Load

  1. In order to minimize the total load window, the data needs to be loaded into the warehouse in the fastest possible time.

  2. Transformations affect the speed of data processing.

  3. It is more effective to load the data into a relational database prior to applying transformations and checks.

  4. Gateway technology is not suitable, since gateways are inefficient when large data volumes are involved.

Simple Transformations

在加载时,可能需要执行简单转换。在完成简单转换后,我们可以做复杂检查。假设我们正在加载 EPOS 销售交易,我们需要执行以下检查 -

While loading, it may be required to perform simple transformations. After completing simple transformations, we can do complex checks. Suppose we are loading the EPOS sales transaction, we need to perform the following checks −

  1. Strip out all the columns that are not required within the warehouse.

  2. Convert all the values to required data types.
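The two checks above can be sketched as a tiny transformation step. The EPOS record layout, column names, and target types are invented for the example; a real load manager would drive this from metadata.

```python
def simple_transform(record, keep, types):
    """Strip the columns not required within the warehouse and convert
    the remaining values to the required data types."""
    return {col: types[col](record[col]) for col in keep}

# A raw EPOS sales transaction as it might arrive in a flat file:
epos_record = {"product_id": "30", "qty": "5", "value": "3.67",
               "till_id": "7", "cashier": "anna"}

keep = ("product_id", "qty", "value")            # columns the warehouse needs
types = {"product_id": int, "qty": int, "value": float}

print(simple_transform(epos_record, keep, types))
# the unwanted columns are stripped and the strings become typed values
```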

Warehouse Manager

仓库管理员负责仓库管理流程,它由第三方系统软件、C 程序与 Shell 脚本组成。仓库管理员的大小和复杂性因具体解决方案而异。

The warehouse manager is responsible for the warehouse management process. It consists of a third-party system software, C programs, and shell scripts. The size and complexity of a warehouse manager varies between specific solutions.

Warehouse Manager Architecture

仓库管理员包括以下内容 -

A warehouse manager includes the following −

  1. The controlling process

  2. Stored procedures or C with SQL

  3. Backup/Recovery tool

  4. SQL scripts

warehouse manager

Functions of Warehouse Manager

仓库管理员执行以下函数 -

A warehouse manager performs the following functions −

  1. Analyzes the data to perform consistency and referential integrity checks.

  2. Creates indexes, business views, partition views against the base data.

  3. Generates new aggregations and updates the existing aggregations.

  4. Generates normalizations.

  5. Transforms and merges the source data of the temporary store into the published data warehouse.

  6. Backs up the data in the data warehouse.

  7. Archives the data that has reached the end of its captured life.

Note - 仓库管理员分析查询配置文件以确定索引和聚合是否合适。

Note − A warehouse manager analyzes query profiles to determine whether the indexes and aggregations are appropriate.

Query Manager

查询管理器负责将查询指向合适的表。通过将查询指向合适的表,它可以加快查询请求和响应的过程。此外,查询管理器还负责调度用户发布的查询的执行。

The query manager is responsible for directing the queries to suitable tables. By directing the queries to appropriate tables, it speeds up the query request and response process. In addition, the query manager is responsible for scheduling the execution of the queries posted by the user.

Query Manager Architecture

查询管理器包括以下组件 -

A query manager includes the following components −

  1. Query redirection via C tool or RDBMS

  2. Stored procedures

  3. Query management tool

  4. Query scheduling via C tool or RDBMS

  5. Query scheduling via third-party software

query manager

Functions of Query Manager

  1. It presents the data to the user in a form they understand.

  2. It schedules the execution of the queries posted by the end-user.

  3. It stores query profiles to allow the warehouse manager to determine which indexes and aggregations are appropriate.

Data Warehousing - Security

数据仓库的目标是让用户可以轻松访问大量数据,从而让用户能够提取有关整个业务的信息。但我们知道,可以对数据应用一些安全限制,这可能会成为访问信息时的障碍。如果分析师对数据的看法受到限制,那么就不可能全面了解业务中的趋势。

The objective of a data warehouse is to make large amounts of data easily accessible to the users, hence allowing the users to extract information about the business as a whole. But we know that there could be some security restrictions applied on the data that can be an obstacle for accessing the information. If the analyst has a restricted view of data, then it is impossible to capture a complete picture of the trends within the business.

可以汇总每个分析师的数据并传递给管理层,在管理层可以汇总不同的汇总。由于汇总汇总与整体汇总不同,因此有可能错过数据中的一些信息趋势,除非有人整体分析数据。

The data from each analyst can be summarized and passed on to management, where the different summaries can be aggregated. As an aggregation of summaries is not the same as an aggregation of the data as a whole, it is possible to miss some information trends in the data unless someone analyzes the data as a whole.

Security Requirements

添加安全功能会影响数据仓库的性能,因此尽早确定安全需求非常重要。在数据仓库正式投入使用后添加安全功能很困难。

Adding security features affects the performance of the data warehouse; therefore, it is important to determine the security requirements as early as possible. It is difficult to add security features after the data warehouse has gone live.

在数据仓库的设计阶段,我们应该记住将来可能添加哪些数据源,以及添加这些数据源会产生什么影响。我们在设计阶段应该考虑以下可能性。

During the design phase of the data warehouse, we should keep in mind what data sources may be added later and what would be the impact of adding those data sources. We should consider the following possibilities during the design phase.

  1. Whether the new data sources will require new security and/or audit restrictions to be implemented?

  2. Whether new users will be added who have restricted access to data that is already generally available?

当未来的用户和数据源不明确时,会出现这种情况。在这种情况下,我们需要利用业务知识和数据仓库的目标来了解可能的条件。

This situation arises when the future users and the data sources are not well known. In such a situation, we need to use the knowledge of business and the objective of data warehouse to know likely requirements.

以下活动会受到安全措施的影响 −

The following activities get affected by security measures −

  1. User access

  2. Data load

  3. Data movement

  4. Query generation

User Access

我们需要先对数据进行分类,然后再根据用户可以访问的数据对用户进行分类。换句话说,用户会根据其可访问的数据进行分类。

We need to first classify the data and then classify the users on the basis of the data they can access. In other words, the users are classified according to the data they can access.

Data Classification

Data Classification

可以使用以下两种方法对数据进行分类 −

The following two approaches can be used to classify the data −

  1. Data can be classified according to its sensitivity. Highly-sensitive data is classified as highly restricted and less-sensitive data is classified as less restricted.

  2. Data can also be classified according to the job function. This restriction allows only specific users to view particular data. Here we restrict the users to view only the part of the data in which they are interested and for which they are responsible.

第二种方法存在一些问题。为了理解,我们举个例子。假设您正在为银行构建数据仓库。请考虑存储在数据仓库中的数据是所有帐户的交易数据。问题是,谁可以查看交易数据。解决方案在于根据职能对数据进行分类。

There are some issues in the second approach. To understand, let’s have an example. Suppose you are building the data warehouse for a bank. Consider that the data being stored in the data warehouse is the transaction data for all the accounts. The question here is, who is allowed to see the transaction data. The solution lies in classifying the data according to the function.
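A crude sketch of how data classification and user classification combine into an access check. The sensitivity levels, role names, and columns below are all hypothetical; a real system would hold them in the security metadata.

```python
# Data classified by sensitivity level, users classified by clearance
# (all levels and names are made up for illustration).
SENSITIVITY = {"account_balance": 3, "transaction_date": 1, "region": 0}
CLEARANCE = {"analyst": 1, "branch_manager": 2, "auditor": 3}

def can_view(role, column):
    """A user may view a column only if their clearance meets the
    column's sensitivity classification."""
    return CLEARANCE[role] >= SENSITIVITY[column]

print(can_view("analyst", "region"))            # low-sensitivity data is open
print(can_view("analyst", "account_balance"))   # restricted for low clearance
print(can_view("auditor", "account_balance"))   # open to the highest clearance
```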

User classification

User classification

下列方法可用于对用户分类 −

The following approaches can be used to classify the users −

  1. Users can be classified as per the hierarchy of users in an organization, i.e., users can be classified by departments, sections, groups, and so on.

  2. Users can also be classified according to their role, with people grouped across departments based on their role.

Classification on basis of Department

Classification on basis of Department

让我们举一个数据仓库的示例,其中用户来自销售和营销部门。我们可以通过自上而下的公司视图进行安全管理,访问权限以不同部门为中心。但对不同级别的用户可能会有一些限制。此结构如下图所示。

Let’s have an example of a data warehouse where the users are from sales and marketing department. We can have security by top-to-down company view, with access centered on the different departments. But there could be some restrictions on users at different levels. This structure is shown in the following diagram.

user access hierarchy

但是,如果每个部门访问不同的数据,那么我们应该为每个部门单独设计安全访问权限。可以通过部门数据市实现此目的。由于这些数据市与数据仓库分离,因此我们可以在每个数据市上强制执行单独的安全限制。此方法如下图所示。

But if each department accesses different data, then we should design the security access for each department separately. This can be achieved by departmental data marts. Since these data marts are separated from the data warehouse, we can enforce separate security restrictions on each data mart. This approach is shown in the following figure.

user restriction

Classification Based on Role

Classification Based on Role

如果数据通常对所有部门可用,那么遵循角色访问层次结构十分有用。换句话说,如果数据通常由所有部门访问,那么按照用户的角色应用安全限制。角色访问层次结构如下图所示。

If the data is generally available to all the departments, then it is useful to follow the role access hierarchy. In other words, if the data is generally accessed by all the departments, then apply security restrictions as per the role of the user. The role access hierarchy is shown in the following figure.

role access hierarchy

Audit Requirements

审计是安全性的子集,也是一项成本高昂的活动。审计可能会导致系统的大量开销。为了及时完成审计,我们需要更多的硬件,因此,建议在可能的情况下关闭审计。审计要求可分类如下 −

Auditing is a subset of security and a costly activity. Auditing can cause heavy overheads on the system. To complete an audit in time, we require more hardware; therefore, it is recommended that, wherever possible, auditing should be switched off. Audit requirements can be categorized as follows −

  1. Connections

  2. Disconnections

  3. Data access

  4. Data change

Note −对于上述每个类别,都需要审计成功、失败或两者。从安全性的角度考虑,审计失败非常重要。审计失败很重要,因为它们可以突出显示未经授权或欺诈性访问。

Note − For each of the above-mentioned categories, it is necessary to audit success, failure, or both. From a security perspective, the auditing of failures is very important, because it can highlight unauthorized or fraudulent access.
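A minimal sketch of such an audit trail, recording successes and failures and then filtering for the failures that matter most. The event names and users are invented for the example.

```python
from datetime import datetime

audit_log = []

def audit(event, user, success):
    """Record connection, disconnection, data-access and data-change
    events; failed attempts are the ones that flag possible fraud."""
    audit_log.append({"event": event, "user": user,
                      "success": success, "at": datetime.now()})

audit("connection", "alice", success=True)
audit("data_access", "mallory", success=False)
audit("data_access", "mallory", success=False)

# Failure auditing: repeated failed access by one user stands out.
failures = [e for e in audit_log if not e["success"]]
print(len(failures), {e["user"] for e in failures})
```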

Network Requirements

网络安全与其他安全性一样重要。我们不能忽视网络安全要求。我们需要考虑以下问题 −

Network security is as important as other security measures. We cannot ignore the network security requirements. We need to consider the following issues −

  1. Is it necessary to encrypt data before transferring it to the data warehouse?

  2. Are there restrictions on which network routes the data can take?

这些限制需要谨慎考虑。以下是需要记住的要点 −

These restrictions need to be considered carefully. Following are the points to remember −

  1. The process of encryption and decryption will increase overheads. It would require more processing power and processing time.

  2. The cost of encryption can be high if the system is already heavily loaded, because the encryption overhead is borne by the source system.

Data Movement

在移动数据时存在潜在的安全隐患。假设我们需要传输一些受限数据作为需要加载的平面文件。将数据加载到数据仓库时,会产生以下问题 −

There exist potential security implications while moving the data. Suppose we need to transfer some restricted data as a flat file to be loaded. When the data is loaded into the data warehouse, the following questions are raised −

  1. Where is the flat file stored?

  2. Who has access to that disk space?

如果我们讨论这些平面文件备份,将出现以下问题 -

If we talk about the backup of these flat files, the following questions are raised −

  1. Do you backup encrypted or decrypted versions?

  2. Do these backups need to be made to special tapes that are stored separately?

  3. Who has access to these tapes?

数据移动的其他一些形式(如查询结果集)也需要考虑。创建临时表时提出的问题如下 -

Some other forms of data movement like query result sets also need to be considered. The questions raised while creating the temporary table are as follows −

  1. Where is that temporary table to be held?

  2. How do you make such a table visible?

我们应该避免意外违背安全限制。如果一个可以访问受限数据的用户可以生成可访问的临时表,数据将对未授权用户可见。通过为有权访问受限数据的用户提供一个单独的临时区域,我们可以解决这个问题。

We should avoid the accidental flouting of security restrictions. If a user with access to the restricted data can generate accessible temporary tables, data can be visible to non-authorized users. We can overcome this problem by having a separate temporary area for users with access to restricted data.

Documentation

The audit and security requirements need to be properly documented. This will be treated as a part of the justification. This document can contain all the information gathered from −

  1. Data classification

  2. User classification

  3. Network requirements

  4. Data movement and storage requirements

  5. All auditable actions

Impact of Security on Design

Security affects the application code and the development timescales. Security affects the following areas −

  1. Application development

  2. Database design

  3. Testing

Application Development

Security affects the overall application development, and it also affects the design of the important components of the data warehouse such as the load manager, warehouse manager, and query manager. The load manager may require checking code to filter records and place them in different locations. More transformation rules may also be required to hide certain data. Extra metadata may also be required to handle any extra objects.

To create and maintain extra views, the warehouse manager may require extra code to enforce security. Extra checks may have to be coded into the data warehouse to prevent it from being fooled into moving data into a location where it should not be available. The query manager requires changes to handle any access restrictions. The query manager will need to be aware of all extra views and aggregations.

Database design

The database layout is also affected because when security measures are implemented, there is an increase in the number of views and tables. Adding security increases the size of the database and hence increases the complexity of the database design and management. It will also add complexity to the backup management and recovery plan.

Testing

Testing the data warehouse is a complex and lengthy process. Adding security to the data warehouse also affects the testing time complexity. It affects the testing in the following two ways −

  1. It will increase the time required for integration and system testing.

  2. There is added functionality to be tested which will increase the size of the testing suite.

Data Warehousing - Backup

A data warehouse is a complex system and it contains a huge volume of data. Therefore, it is important to back up all the data so that it is available for recovery in the future as required. In this chapter, we will discuss the issues in designing the backup strategy.

Backup Terminologies

Before proceeding further, you should know some of the backup terminologies discussed below.

  1. Complete backup − It backs up the entire database at the same time. This backup includes all the database files, control files, and journal files.

  2. Partial backup − As the name suggests, it does not create a complete backup of the database. Partial backups are very useful in large databases because they allow a strategy whereby various parts of the database are backed up in a round-robin fashion on a day-to-day basis, so that the whole database is effectively backed up once a week.

  3. Cold backup − A cold backup is taken while the database is completely shut down. In a multi-instance environment, all the instances should be shut down.

  4. Hot backup − A hot backup is taken while the database engine is up and running. The requirements of a hot backup vary from RDBMS to RDBMS.

  5. Online backup − It is quite similar to a hot backup.
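
The round-robin strategy behind partial backups can be sketched as a simple schedule; the table-space names and the seven-day cycle below are illustrative assumptions.

```python
# A minimal round-robin schedule for partial backups. Each day one part of
# the database is backed up, so the whole database is covered once a week.
TABLESPACES = ["sales_q1", "sales_q2", "sales_q3", "sales_q4",
               "dim_customer", "dim_product", "indexes"]

def partition_for_day(day_number: int) -> str:
    """Return which part of the database to back up on a given day (0 = Monday)."""
    return TABLESPACES[day_number % len(TABLESPACES)]

# Over one week, every part is backed up exactly once.
week = [partition_for_day(d) for d in range(7)]
```

A real scheduler would also weight the rotation by table size and change rate, but the modular rotation is the core of the strategy.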

Hardware Backup

It is important to decide which hardware to use for the backup. The speed of backup and restore processing depends on the hardware being used, how the hardware is connected, the bandwidth of the network, the backup software, and the speed of the server's I/O system. Here we will discuss some of the hardware choices that are available, along with their pros and cons. These choices are as follows −

  1. Tape Technology

  2. Disk Backups

Tape Technology

The tape choice can be categorized as follows −

  1. Tape media

  2. Standalone tape drives

  3. Tape stackers

  4. Tape silos

Tape Media

There exist several varieties of tape media. Some tape media standards are listed in the table below −

Tape Media    Capacity    I/O rates
DLT           40 GB       3 MB/s
3490e         1.6 GB      3 MB/s
8 mm          14 GB       1 MB/s
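
From the capacity and I/O-rate figures above, a rough elapsed-time estimate for streaming a full tape can be derived; this back-of-the-envelope calculation ignores compression and drive start/stop delays.

```python
def backup_hours(capacity_gb: float, rate_mb_per_s: float) -> float:
    """Elapsed hours to stream a full tape at the sustained I/O rate."""
    seconds = (capacity_gb * 1024) / rate_mb_per_s  # GB -> MB, then divide by MB/s
    return seconds / 3600

# Using the table's figures:
dlt_hours   = backup_hours(40, 3)    # DLT: 40 GB at 3 MB/s, ~3.8 hours
e3490_hours = backup_hours(1.6, 3)   # 3490e: well under an hour
mm8_hours   = backup_hours(14, 1)    # 8 mm: ~4 hours despite the smaller capacity
```

Estimates like these show why the I/O rate matters as much as the capacity when sizing an overnight backup window.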

Other factors that need to be considered are as follows −

  1. Reliability of the tape medium

  2. Cost of tape medium per unit

  3. Scalability

  4. Cost of upgrades to the tape system

  5. Shelf life of tape medium

Standalone Tape Drives

The tape drives can be connected in the following ways −

  1. Direct to the server

  2. As network available devices

  3. Remotely to other machine

There could be issues in connecting the tape drives to a data warehouse.

  1. Suppose the server is a 48-node MPP machine. It is not obvious which node to connect the tape drives to, or how to spread them over the server nodes to get optimal performance with the least disruption of the server and the least internal I/O latency.

  2. Connecting the tape drive as a network-available device requires the network to be capable of handling the huge data transfer rates. Make sure that sufficient bandwidth is available during the time you require it.

  3. Connecting the tape drives remotely also requires high bandwidth.

Tape Stackers

A device that loads multiple tapes into a single tape drive is known as a tape stacker. The stacker dismounts the current tape when it has finished with it and loads the next one, so only one tape is accessible at a time. Prices and capabilities vary, but the common feature is that they can perform unattended backups.

Tape Silos

Tape silos provide large storage capacities. A tape silo can store and manage thousands of tapes, and can integrate multiple tape drives. It has the software and hardware to label and store the tapes it holds. It is very common for the silo to be connected remotely over a network or a dedicated link. We should ensure that the bandwidth of the connection is up to the job.

Disk Backups

Methods of disk backups are −

  1. Disk-to-disk backups

  2. Mirror breaking

These methods are used in OLTP systems. They minimize database downtime and maximize availability.

Disk-to-Disk Backups

Here, the backup is taken on disk rather than on tape. Disk-to-disk backups are done for the following reasons −

  1. Speed of initial backups

  2. Speed of restore

Backing up data from disk to disk is much faster than backing up to tape. However, it is typically an intermediate step; the data is later backed up to tape. Another advantage of disk-to-disk backups is that they give you an online copy of the latest backup.
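
A minimal sketch of this two-stage flow, with ordinary directories standing in for the staging disk and the tape device (the file and directory names are illustrative):

```python
import shutil
import tempfile
from pathlib import Path

def disk_to_disk_backup(source: Path, stage_dir: Path) -> Path:
    """Fast first stage: copy to disk. The staged copy doubles as the
    online copy of the latest backup until it is migrated to tape."""
    target = stage_dir / source.name
    shutil.copy2(source, target)
    return target

def migrate_to_tape(staged: Path, tape_dir: Path) -> Path:
    # A directory stands in for the tape device in this sketch.
    target = tape_dir / staged.name
    shutil.copy2(staged, target)
    return target

root = Path(tempfile.mkdtemp())
src = root / "fact_sales.db"
src.write_bytes(b"x" * 1024)          # a stand-in database file
stage, tape = root / "stage", root / "tape"
stage.mkdir()
tape.mkdir()

online_copy = disk_to_disk_backup(src, stage)   # fast: disk to disk
tape_copy = migrate_to_tape(online_copy, tape)  # later: disk to tape
```

The staged copy stays available for quick restores even after the slower tape migration completes.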

Mirror Breaking

The idea is to have disks mirrored for resilience during the working day. When backup is required, one of the mirror sets can be broken out. This technique is a variant of disk-to-disk backups.

Note − The database may need to be shut down to guarantee consistency of the backup.

Optical Jukeboxes

Optical jukeboxes allow data to be stored near-line. This technique allows a large number of optical disks to be managed in the same way as a tape stacker or a tape silo. The drawback of this technique is that it has a slower write speed than magnetic disks, but optical media offer long life and reliability, which makes them a good choice of medium for archiving.

Software Backups

There are software tools available that help in the backup process. These software tools come as a package. They not only take backups, but also effectively manage and control the backup strategy. There are many software packages available in the market. Some of them are listed in the following table −

Package Name    Vendor
Networker       Legato
ADSM            IBM
Epoch           Epoch Systems
Omniback II     HP
Alexandria      Sequent

Criteria for Choosing Software Packages

The criteria for choosing the best software package are listed below −

  1. How scalable is the product as tape drives are added?

  2. Does the package have a client-server option, or must it run on the database server itself?

  3. Will it work in cluster and MPP environments?

  4. What degree of parallelism is required?

  5. What platforms are supported by the package?

  6. Does the package support easy access to information about tape contents?

  7. Is the package database aware?

  8. What tape drive and tape media are supported by the package?

Data Warehousing - Tuning

A data warehouse keeps evolving, and it is unpredictable what queries users will pose in the future. Therefore it becomes more difficult to tune a data warehouse system. In this chapter, we will discuss how to tune the different aspects of a data warehouse, such as performance, data load, queries, etc.

Difficulties in Data Warehouse Tuning

Tuning a data warehouse is a difficult procedure due to the following reasons −

  1. A data warehouse is dynamic; it never remains constant.

  2. It is very difficult to predict what queries the user is going to pose in the future.

  3. Business requirements change with time.

  4. Users and their profiles keep changing.

  5. The user can switch from one group to another.

  6. The data load on the warehouse also changes with time.

Note − It is very important to have a complete knowledge of the data warehouse.

Performance Assessment

Here is a list of objective measures of performance −

  1. Average query response time

  2. Scan rates

  3. Time used per day per query

  4. Memory usage per process

  5. I/O throughput rates
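
Two of these objective measures can be computed directly from a query log; the log format and figures below are a hypothetical example, not from the text.

```python
# A hypothetical query log: one entry per completed query.
query_log = [
    {"query": "Q1", "seconds": 4.0, "rows_scanned": 1_200_000},
    {"query": "Q2", "seconds": 1.0, "rows_scanned":   300_000},
    {"query": "Q3", "seconds": 7.0, "rows_scanned": 2_100_000},
]

def average_response_time(log) -> float:
    """Average query response time in seconds -- compare against the SLA target."""
    return sum(q["seconds"] for q in log) / len(log)

def scan_rate(log) -> float:
    """Rows scanned per second across the whole workload."""
    return sum(q["rows_scanned"] for q in log) / sum(q["seconds"] for q in log)

avg = average_response_time(query_log)
rate = scan_rate(query_log)
```

Tracking these numbers over time is what makes the SLA comparison in the next list meaningful.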

Following are the points to remember.

  1. It is necessary to specify the measures in service level agreement (SLA).

  2. It is of no use trying to tune response times if they are already better than required.

  3. It is essential to have realistic expectations while making performance assessment.

  4. It is also essential that the users have feasible expectations.

  5. To hide the complexity of the system from the user, aggregations and views should be used.

  6. It is also possible that the user can write a query you had not tuned for.

Data Load Tuning

Data load is a critical part of overnight processing. Nothing else can run until data load is complete. This is the entry point into the system.

Note − If there is a delay in transferring the data or in the arrival of data, then the entire system is affected badly. Therefore, it is very important to tune the data load first.

There are various approaches to tuning the data load; they are discussed below −

  1. The most common approach is to insert data using the SQL layer. In this approach, normal checks and constraints need to be performed. When the data is inserted into the table, code runs to check whether there is enough space to insert the data. If sufficient space is not available, more space may have to be allocated to these tables. These checks take time to perform and are costly in CPU time.

  2. The second approach is to bypass all these checks and constraints and place the data directly into the preformatted blocks. These blocks are later written to the database. It is faster than the first approach, but it can work only with whole blocks of data. This can lead to some space wastage.

  3. The third approach is to maintain the indexes while loading data into a table that already contains data.

  4. The fourth approach is to drop the indexes before loading data into tables that already contain data, and recreate them when the load is complete. The choice between the third and fourth approaches depends on how much data is already loaded and how many indexes need to be rebuilt.
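
A sketch of the fourth approach using SQLite as a stand-in RDBMS; the table and index names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount INTEGER)")
conn.execute("CREATE INDEX idx_amount ON sales (amount)")
# The table already contains data before the bulk load starts.
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(i, i * 10) for i in range(100)])

def load_with_index_rebuild(conn, rows):
    """Fourth approach: drop indexes, bulk-load, then recreate them."""
    conn.execute("DROP INDEX idx_amount")
    # No per-row index maintenance during the load.
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    # One index rebuild at the end instead of an update per inserted row.
    conn.execute("CREATE INDEX idx_amount ON sales (amount)")

load_with_index_rebuild(conn, [(i, i * 10) for i in range(100, 1000)])
total = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```

Whether this beats maintaining the index during the load (the third approach) depends on the ratio of new rows to existing rows, exactly as the text describes.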

Integrity Checks

Integrity checking highly affects the performance of the load. Following are the points to remember −

  1. Integrity checks need to be limited because they require heavy processing power.

  2. Integrity checks should be applied on the source system to avoid degrading the performance of the data load.

Tuning Queries

We have two kinds of queries in a data warehouse −

  1. Fixed queries

  2. Ad hoc queries

Fixed Queries

Fixed queries are well defined. Following are the examples of fixed queries −

  1. Regular reports

  2. Canned queries

  3. Common aggregations

Tuning the fixed queries in a data warehouse is the same as in a relational database system. The only difference is that the amount of data to be queried may be different. It is good to store the most successful execution plan while testing fixed queries. Storing these execution plans allows us to spot changes in data size and data skew, as these will cause the execution plan to change.
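
A minimal sketch of storing and comparing execution plans, using SQLite's EXPLAIN QUERY PLAN as a stand-in for whatever plan output the warehouse RDBMS provides; the table and query are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_product (id INTEGER PRIMARY KEY, name TEXT)")

FIXED_QUERY = "SELECT name FROM dim_product WHERE id = ?"

def capture_plan(conn, sql: str) -> list:
    """Snapshot the execution plan of a fixed query as plain text lines."""
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail) in SQLite.
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, (1,))]

baseline = capture_plan(conn, FIXED_QUERY)   # stored while testing the query
# ... later, after data volumes have grown ...
current = capture_plan(conn, FIXED_QUERY)
plan_changed = baseline != current           # a change flags data-size or skew effects
```

When `plan_changed` becomes true in production, it is a signal to re-examine the query's indexes and statistics rather than a failure in itself.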

Note − We cannot do much more with the fact table, but when dealing with dimension tables or aggregations, the usual collection of SQL tweaks, storage mechanisms, and access methods can be used to tune these queries.

Ad hoc Queries

To understand ad hoc queries, it is important to know the ad hoc users of the data warehouse. For each user or group of users, you need to know the following −

  1. The number of users in the group

  2. Whether they use ad hoc queries at regular intervals of time

  3. Whether they use ad hoc queries frequently

  4. Whether they use ad hoc queries occasionally at unknown intervals.

  5. The maximum size of query they tend to run

  6. The average size of query they tend to run

  7. Whether they require drill-down access to the base data

  8. The elapsed login time per day

  9. The peak time of daily usage

  10. The number of queries they run per peak hour

Points to Note

  1. It is important to track the user’s profiles and identify the queries that are run on a regular basis.

  2. It is also important that the tuning performed does not adversely affect performance.

  3. Identify similar and ad hoc queries that are frequently run.

  4. If these queries are identified, then the database can be changed and new indexes can be added for those queries.

  5. If these queries are identified, then new aggregations can be created specifically for those queries that would result in their efficient execution.

Data Warehousing - Testing

Testing is very important for data warehouse systems to make them work correctly and efficiently. Three basic levels of testing are performed on a data warehouse −

  1. Unit testing

  2. Integration testing

  3. System testing

Unit Testing

  1. In unit testing, each component is separately tested.

  2. Each module, i.e., each procedure, program, SQL script, or Unix shell script, is tested.

  3. This test is performed by the developer.

Integration Testing

  1. In integration testing, the various modules of the application are brought together and then tested against a number of inputs.

  2. It is performed to test whether the various components do well after integration.

System Testing

  1. In system testing, the whole data warehouse application is tested together.

  2. The purpose of system testing is to check whether the entire system works correctly together or not.

  3. System testing is performed by the testing team.

  4. Since the size of the whole data warehouse is very large, it is usually only possible to perform minimal system testing before the test plan is enacted.

Test Schedule

First of all, the test schedule is created in the process of developing the test plan. In this schedule, we predict the estimated time required for the testing of the entire data warehouse system.

There are different methodologies available to create a test schedule, but none of them are perfect because the data warehouse is very complex and large. Also the data warehouse system is evolving in nature. One may face the following issues while creating a test schedule −

  1. Even a simple problem may involve a very large query that takes a day or more to complete, i.e., the query does not complete in the desired time scale.

  2. There may be hardware failures such as losing a disk or human errors such as accidentally deleting a table or overwriting a large table.

Note − Due to the above-mentioned difficulties, it is recommended to always double the amount of time you would normally allow for testing.

Testing Backup Recovery

Testing the backup recovery strategy is extremely important. Here is the list of scenarios for which this testing is needed −

  1. Media failure

  2. Loss or damage of table space or data file

  3. Loss or damage of redo log file

  4. Loss or damage of control file

  5. Instance failure

  6. Loss or damage of archive file

  7. Loss or damage of table

  8. Failure during data movement

Testing Operational Environment

There are a number of aspects that need to be tested. These aspects are listed below.

  1. Security − A separate security document is required for security testing. This document contains a list of disallowed operations and tests devised for each.

  2. Scheduler − Scheduling software is required to control the daily operations of a data warehouse. It needs to be tested during system testing. The scheduling software requires an interface with the data warehouse, which will need the scheduler to control overnight processing and the management of aggregations.

  3. Disk Configuration − Disk configuration also needs to be tested to identify I/O bottlenecks. The test should be performed multiple times with different settings.

  4. Management Tools − It is required to test all the management tools during system testing: the event manager, system manager, database manager, configuration manager, and backup recovery manager.

Testing the Database

The database is tested in the following three ways −

  1. Testing the database manager and monitoring tools − To test the database manager and the monitoring tools, they should be used in the creation, running, and management of the test database.

  2. Testing database features − Here is the list of features that we have to test: parallel querying, parallel index creation, and parallel data load.

  3. Testing database performance − Query execution plays a very important role in data warehouse performance measures. There are sets of fixed queries that need to be run regularly and they should be tested. To test ad hoc queries, one should go through the user requirement document and understand the business completely. Take time to test the most awkward queries that the business is likely to ask against different index and aggregation strategies.

Testing the Application

  1. All the managers should be integrated correctly and work together to ensure that the end-to-end load, index, aggregate, and query operations work as per expectations.

  2. Each function of each manager should work correctly.

  3. It is also necessary to test the application over a period of time.

  4. Week-end and month-end tasks should also be tested.

Logistic of the Test

The aim of the system test is to test all of the following areas −

  1. Scheduling software

  2. Day-to-day operational procedures

  3. Backup recovery strategy

  4. Management and scheduling tools

  5. Overnight processing

  6. Query performance

Note − The most important point is to test scalability. Failure to do so will leave us with a system design that does not work when the system grows.

Data Warehousing - Future Aspects

Following are the future aspects of data warehousing.

  1. As we have seen, the size of the open database has approximately doubled in magnitude in the last few years, which shows the significant value that it contains.

  2. As the size of databases grows, estimates of what constitutes a very large database continue to grow.

  3. The hardware and software that are available today do not make it feasible to keep a large amount of data online. For example, a telco's call records require 10 TB of data to be kept online, and that is just one month's records. If records of sales, marketing, customers, employees, etc. also need to be kept, the size will be more than 100 TB.

  4. The records contain textual information and some multimedia data. Multimedia data cannot be manipulated as easily as text data. Searching multimedia data is not an easy task, whereas textual information can be retrieved by the relational software available today.

  5. Apart from size planning, it is complex to build and run data warehouse systems that are ever increasing in size. As the number of users increases, the size of the data warehouse also increases. These users will also require access to the system.

  6. With the growth of the Internet, there is a requirement of users to access data online.

Hence the future shape of data warehouse will be very different from what is being created today.