Dwh 简明教程

Data Warehousing - Data Marting

Why Do We Need a Data Mart?

以下是创建数据市集的原因列表:

Listed below are the reasons to create a data mart −

  1. To partition data in order to impose access control strategies.

  2. To speed up the queries by reducing the volume of data to be scanned.

  3. To segment data into different hardware platforms.

  4. To structure data in a form suitable for a user access tool.

Note − 不要出于任何其他原因建立数据市集,因为数据市集的运营成本可能非常高。在建立数据市集之前,请确保数据市集策略适合你的特定解决方案。

Note − Do not data mart for any other reason since the operation cost of data marting could be very high. Before data marting, make sure that data marting strategy is appropriate for your particular solution.

Cost-effective Data Marting

按照以下步骤使数据市集具有成本效益:

Follow the steps given below to make data marting cost-effective −

  1. Identify the Functional Splits

  2. Identify User Access Tool Requirements

  3. Identify Access Control Issues

Identify the Functional Splits

在此步骤中,我们确定组织是否有自然的职能划分。我们寻找部门之间的划分,并确定各部门使用信息的方式是否与组织的其余部分隔离。我们举个例子。

In this step, we determine if the organization has natural functional splits. We look for departmental splits, and we determine whether the way in which departments use information tend to be in isolation from the rest of the organization. Let’s have an example.

考虑一个零售组织,其中每个商家负责最大化销售一组产品。为此,以下是有价值的信息 -

Consider a retail organization, where each merchant is accountable for maximizing the sales of a group of products. For this, the following are the valuable information −

  1. sales transaction on a daily basis

  2. sales forecast on a weekly basis

  3. stock position on a daily basis

  4. stock movements on a daily basis

由于商家对他们不经手的产品不感兴趣,所以数据市集是处理的产品组感兴趣的数据的一个子集。下图显示了面向不同用户的数仓。

As the merchant is not interested in the products they are not dealing with, the data marting is a subset of the data dealing which the product group of interest. The following diagram shows data marting for different users.

data marting

以下是在确定功能划分时需要考虑的问题 -

Given below are the issues to be taken into account while determining the functional split −

  1. The structure of the department may change.

  2. The products might switch from one department to other.

  3. The merchant could query the sales trend of other products to analyze what is happening to the sales.

Note - 我们需要确定使用数据市集的业务优势和技术可行性。

Note − We need to determine the business benefits and technical feasibility of using a data mart.

Identify User Access Tool Requirements

我们需要数据市集来支持 user access tools ,它们需要内部数据结构。此类结构中的数据不受数据仓库控制,但需要定期填充和更新。

We need data marts to support user access tools that require internal data structures. The data in such structures are outside the control of data warehouse but need to be populated and updated on a regular basis.

有些工具直接从源系统填充,但有些工具不能。因此,需要识别超出工具范围的额外需求,以备将来使用。

There are some tools that populate directly from the source system but some cannot. Therefore additional requirements outside the scope of the tool are needed to be identified for future.

Note - 为了确保所有访问工具中数据的一致性,不应直接从数据仓库填充数据,而是每个工具都必须有自己的数据市集。

Note − In order to ensure consistency of data across all access tools, the data should not be directly populated from the data warehouse, rather each tool must have its own data mart.

Identify Access Control Issues

应该制定隐私规则,以确保只有经过授权的用户才能访问数据。例如,零售银行机构的数据仓库可确保所有帐户都属于同一法人实体。隐私法可能会迫使你完全禁止访问不属于特定银行的信息。

There should to be privacy rules to ensure the data is accessed by authorized users only. For example a data warehouse for retail banking institution ensures that all the accounts belong to the same legal entity. Privacy laws can force you to totally prevent access to information that is not owned by the specific bank.

数据市集允许我们在数据仓库内物理分离数据段,从而构建一个完整的墙。为了避免可能的隐私问题,可以从数据仓库中删除详细数据。我们可以为每个法人实体创建数据市集,并通过数据仓库加载,同时提供详细的帐户数据。

Data marts allow us to build a complete wall by physically separating data segments within the data warehouse. To avoid possible privacy problems, the detailed data can be removed from the data warehouse. We can create data mart for each legal entity and load it via data warehouse, with detailed account data.

Designing Data Marts

数据市集应设计为数据仓库中星型雪花模式的较小版本,并且应与数据仓库的数据库设计相匹配。它有助于控制数据库实例。

Data marts should be designed as a smaller version of starflake schema within the data warehouse and should match with the database design of the data warehouse. It helps in maintaining control over database instances.

designing datamart

摘要以与在数据仓库内设计相同的方式进行数据市集。汇总表有助于利用星型雪花模式中的所有维度数据。

The summaries are data marted in the same way as they would have been designed within the data warehouse. Summary tables help to utilize all dimension data in the starflake schema.

Cost of Data Marting

数据市集的成本度量如下 -

The cost measures for data marting are as follows −

  1. Hardware and Software Cost

  2. Network Access

  3. Time Window Constraints

Hardware and Software Cost

虽然数据集市创建在相同硬件上,但需要一些额外的硬件和软件。为了处理用户查询,它需要额外的处理能力和磁盘存储。如果在数据仓库中存在详细信息和数据集市,那么我们将面临额外的存储和管理复制数据的花费。

Although data marts are created on the same hardware, they require some additional hardware and software. To handle user queries, it requires additional processing power and disk storage. If detailed data and the data mart exist within the data warehouse, then we would face additional cost to store and manage replicated data.

Note − 数据集市比聚合花费更高,因此它应当用作一种附加策略,而不是备选策略。

Note − Data marting is more expensive than aggregations, therefore it should be used as an additional strategy and not as an alternative strategy.

Network Access

数据集市可能位于数据仓库的不同位置,因此我们应当确保局域网或广域网有能力处理在 data mart load process. 内传输的数据量

A data mart could be on a different location from the data warehouse, so we should ensure that the LAN or WAN has the capacity to handle the data volumes being transferred within the data mart load process.

Time Window Constraints

数据集市加载进程消耗可用时间窗口的程度取决于转换的复杂性和运送的数据量。确定有多少个数据集市取决于 −

The extent to which a data mart loading process will eat into the available time window depends on the complexity of the transformations and the data volumes being shipped. The determination of how many data marts are possible depends on −

  1. Network capacity.

  2. Time window available

  3. Volume of data being transferred

  4. Mechanisms being used to insert data into a data mart