DWH Concise Tutorial
Data Warehousing - Delivery Process
A data warehouse is never static; it evolves as the business expands. Because business requirements keep changing, a data warehouse must be designed to accommodate those changes, which means the system needs to be flexible.
Ideally, a data warehouse would be delivered through a well-defined delivery process. In practice, data warehouse projects suffer from issues that make it difficult to complete tasks and deliverables in the strict, ordered fashion demanded by the waterfall method. Most of the time the requirements are not completely understood up front, yet the architecture, design, and build components can be completed only after all the requirements have been gathered and studied.
Delivery Method
The delivery method is a variant of the joint application development approach, adapted for delivering a data warehouse. The delivery process is staged to minimize risk. The approach discussed here does not shorten the overall delivery timescale, but it ensures that business benefits are delivered incrementally throughout development.
Note − The delivery process is broken into phases to reduce project and delivery risk.
The following diagram explains the stages in the delivery process −
IT Strategy
Data warehouses are strategic investments that require a business process to generate benefits. An IT strategy is required to procure and retain funding for the project.
Business Case
The objective of the business case is to estimate the business benefits that should be derived from using a data warehouse. These benefits may not be quantifiable, but the projected benefits need to be clearly stated. If a data warehouse does not have a clear business case, the project tends to suffer from credibility problems at some stage of the delivery process. In data warehouse projects, therefore, we need to understand the business case for the investment.
Education and Prototyping
Before settling on a solution, organizations experiment with the concept of data analysis and educate themselves on the value of having a data warehouse. Prototyping serves this purpose: it helps in understanding the feasibility and benefits of a data warehouse. A small-scale prototyping activity can support the educational process as long as −
- The prototype addresses a defined technical objective.
- The prototype can be thrown away after the feasibility concept has been shown.
- The activity addresses a small subset of eventual data content of the data warehouse.
- The activity timescale is non-critical.
Keep the following points in mind to produce an early release and deliver business benefits −
- Identify the architecture that is capable of evolving.
- Focus on business requirements and technical blueprint phases.
- Limit the scope of the first build phase to the minimum that delivers business benefits.
- Understand the short-term and medium-term requirements of the data warehouse.
Business Requirements
To provide quality deliverables, we should make sure the overall requirements are understood. If we understand both the short-term and the medium-term business requirements, we can design a solution that fulfils the short-term requirements and can later be grown into a full solution.
The following aspects are determined in this stage −
- The business rules to be applied to the data.
- The logical model for information within the data warehouse.
- The query profiles for the immediate requirement.
- The source systems that provide this data.
Technical Blueprint
This phase needs to deliver an overall architecture that satisfies the long-term requirements. It also delivers the components that must be implemented in the short term to derive any business benefit. The blueprint needs to identify the following −
- The overall system architecture.
- The data retention policy.
- The backup and recovery strategy.
- The server and data mart architecture.
- The capacity plan for hardware and infrastructure.
- The components of database design.
Building the Version
In this stage, the first production deliverable is produced. It is the smallest component of the data warehouse that adds business benefit.
History Load
This is the phase where the remainder of the required history is loaded into the data warehouse. No new entities are added in this phase, but additional physical tables would probably be created to store the increased data volumes.
Consider an example. Suppose the build version phase has delivered a retail sales analysis data warehouse with 2 months' worth of history. This lets users analyze only recent trends and address short-term issues; they cannot identify annual or seasonal trends. To enable that, the last 2 years of sales history could be loaded from the archive, which grows the 40 GB of data to 400 GB.
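A history load like this is often planned as a series of month-sized batches. The following is a minimal sketch of that planning step, not a prescribed procedure; the function name `months_to_backfill`, the example dates, and the figure of 22 missing months (2 years minus the 2 months already loaded) are assumptions made for illustration.

```python
from datetime import date


def months_to_backfill(earliest_loaded: date, count: int) -> list[str]:
    """Return 'YYYY-MM' labels for the `count` months before earliest_loaded, oldest first."""
    year, month = earliest_loaded.year, earliest_loaded.month
    labels = []
    for _ in range(count):
        # Step back one calendar month, rolling over the year boundary.
        month -= 1
        if month == 0:
            year, month = year - 1, 12
        labels.append(f"{year:04d}-{month:02d}")
    return list(reversed(labels))


# Assumption: March and April 2024 were loaded by the build version phase,
# so 22 earlier months are needed to reach two years of history.
plan = months_to_backfill(date(2024, 3, 1), count=22)
print(plan[0], "to", plan[-1], "in", len(plan), "batches")
```

Loading one month per batch keeps each step small enough to verify and rerun on its own if it fails.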
Note − The backup and recovery procedures may become complex, so it is recommended to perform this activity within a separate phase.
Ad hoc Query
In this phase, we configure an ad hoc query tool that is used to operate the data warehouse. Such tools can generate database queries.
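To illustrate what "generating the database query" can mean, here is a small sketch in which a user's selection of columns is turned into SQL and run against an in-memory SQLite database. The `sales` table, its columns, and the `build_query` helper are hypothetical; a real ad hoc query tool is considerably more elaborate.

```python
import sqlite3

# Hypothetical columns a user may group by or sum over.
ALLOWED_DIMENSIONS = {"region", "product", "sale_month"}
ALLOWED_MEASURES = {"amount", "quantity"}


def build_query(group_by: list[str], measure: str) -> str:
    """Generate a GROUP BY query from the user's selection."""
    # Only whitelisted identifiers are interpolated into the SQL text.
    if not group_by or not set(group_by) <= ALLOWED_DIMENSIONS:
        raise ValueError("unknown dimension")
    if measure not in ALLOWED_MEASURES:
        raise ValueError("unknown measure")
    cols = ", ".join(group_by)
    return (f"SELECT {cols}, SUM({measure}) AS total_{measure} "
            f"FROM sales GROUP BY {cols} ORDER BY {cols}")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, "
             "sale_month TEXT, amount REAL, quantity INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?)", [
    ("North", "Tea", "2024-03", 120.0, 10),
    ("North", "Coffee", "2024-03", 80.0, 4),
    ("South", "Tea", "2024-04", 50.0, 5),
])

sql = build_query(["region"], "amount")
rows = conn.execute(sql).fetchall()
print(sql)
print(rows)
```

Restricting the generated SQL to a fixed set of column names is one simple way to keep user input from being injected into the query text.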
Note − It is recommended not to use these access tools while the database is being substantially modified.
Automation
In this phase, the operational management processes are fully automated. These would include −
- Transforming the data into a form suitable for analysis.
- Monitoring query profiles and determining appropriate aggregations to maintain system performance.
- Extracting and loading data from different source systems.
- Generating aggregations from predefined definitions within the data warehouse.
- Backing up, restoring, and archiving the data.
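Several of the processes listed above can be chained into one unattended job. The sketch below shows extraction from two sources, transformation into a common shape, and aggregation from a predefined definition. All names and sample rows (`POS_ROWS`, `WEB_ROWS`, `AGGREGATIONS`, `run_nightly`) are invented for illustration, and monitoring, backup, and archiving are left out.

```python
from collections import defaultdict

# Hypothetical source systems, each using its own field names.
POS_ROWS = [{"store": "N1", "amt": "12.50"}, {"store": "S1", "amt": "7.00"}]
WEB_ROWS = [{"shop_id": "N1", "total": 20.0}]

# Predefined aggregation definitions: aggregate name -> column to group by.
AGGREGATIONS = {"sales_by_store": "store"}


def extract() -> list[dict]:
    """Pull raw rows from every source, tagging each row with its origin."""
    return ([{"source": "pos", **row} for row in POS_ROWS]
            + [{"source": "web", **row} for row in WEB_ROWS])


def transform(raw: list[dict]) -> list[dict]:
    """Map source-specific fields onto one shape suitable for analysis."""
    facts = []
    for row in raw:
        if row["source"] == "pos":
            facts.append({"store": row["store"], "amount": float(row["amt"])})
        else:
            facts.append({"store": row["shop_id"], "amount": float(row["total"])})
    return facts


def aggregate(facts: list[dict]) -> dict:
    """Build every predefined aggregate by summing amounts per group key."""
    result = {}
    for name, key in AGGREGATIONS.items():
        totals = defaultdict(float)
        for row in facts:
            totals[row[key]] += row["amount"]
        result[name] = dict(totals)
    return result


def run_nightly() -> dict:
    """One automated cycle: extract, transform, then aggregate."""
    return aggregate(transform(extract()))


summary = run_nightly()
print(summary)
```

Keeping the aggregation definitions in data rather than in code means a new aggregate can be added without touching the job itself.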
Extending Scope
In this phase, the data warehouse is extended to address a new set of business requirements. The scope can be extended in two ways −
- By loading additional data into the data warehouse.
- By introducing new data marts using the existing information.
Note − This phase should be performed separately, since it involves substantial effort and complexity.
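The two ways of extending scope can be sketched with an in-memory SQLite database. The `sales` table, the `mart_region_sales` data mart, and the sample rows are hypothetical and stand in for a real warehouse schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("North", "Tea", 120.0),
    ("North", "Coffee", 80.0),
    ("South", "Tea", 50.0),
])

# Way 1: load additional data into the existing warehouse table.
conn.execute("INSERT INTO sales VALUES (?, ?, ?)", ("South", "Coffee", 30.0))

# Way 2: introduce a new data mart derived from information already held.
conn.execute("""
    CREATE TABLE mart_region_sales AS
    SELECT region, SUM(amount) AS total_amount
    FROM sales
    GROUP BY region
""")

mart = conn.execute(
    "SELECT region, total_amount FROM mart_region_sales ORDER BY region"
).fetchall()
print(mart)
```

The data mart here is a materialized copy, so it would need to be rebuilt or refreshed whenever the underlying table changes.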
Requirements Evolution
From the perspective of the delivery process, the requirements are never static; they can always change. The delivery process must support this and allow these changes to be reflected in the system.
This issue is addressed by designing the data warehouse around the use of data within business processes, as opposed to the data requirements of existing queries.
The architecture is designed to change and grow to match the business needs. The process operates as a pseudo-application development process, in which new requirements are continually fed into the development activities and partial deliverables are produced. These partial deliverables are fed back to the users and then reworked, ensuring that the overall system is continually updated to meet the business needs.