Dwh 简明教程
Data Warehousing - Partitioning Strategy
分区旨在提高性能,便于数据管理。分区还有助于平衡系统的各种需求。它通过将每个事实表分区到多个单独的分区来优化硬件性能并简化数据仓库管理。在本章中,我们将讨论不同的分区策略。
Partitioning is done to enhance performance and facilitate easy management of data. Partitioning also helps in balancing the various requirements of the system. It optimizes the hardware performance and simplifies the management of data warehouse by partitioning each fact table into multiple separate partitions. In this chapter, we will discuss different partitioning strategies.
Why is it Necessary to Partition?
分区出于以下原因非常重要 −
Partitioning is important for the following reasons −
-
For easy management,
-
To assist backup/recovery,
-
To enhance performance.
For Easy Management
数据仓库中的事实表可能增长到数百个 GB 大小。如此巨大的事实表大小作为单个实体进行管理非常困难。因此它需要分区。
The fact table in a data warehouse can grow up to hundreds of gigabytes in size. This huge size of fact table is very hard to manage as a single entity. Therefore it needs partitioning.
To Assist Backup/Recovery
如果我们不对事实表进行分区,那么我们就必须加载包含所有数据的完整事实表。分区允许我们只加载定期所需的数据。它减少了加载时间,并提高了系统的性能。
If we do not partition the fact table, then we have to load the complete fact table with all the data. Partitioning allows us to load only as much data as is required on a regular basis. It reduces the time to load and also enhances the performance of the system.
Note − 为了减少备份大小,可以将除当前分区外的所有分区标记为只读。然后,我们可以将这些分区置于无法修改的状态。然后可以对其进行备份。这意味着只需要备份当前分区。
Note − To cut down on the backup size, all partitions other than the current partition can be marked as read-only. We can then put these partitions into a state where they cannot be modified. Then they can be backed up. It means only the current partition is to be backed up.
To Enhance Performance
通过将事实表分区为数据集合,可以增强查询过程。查询性能得到增强,因为现在查询仅扫描那些相关的分区。它不必扫描整个数据。
By partitioning the fact table into sets of data, the query procedures can be enhanced. Query performance is enhanced because now the query scans only those partitions that are relevant. It does not have to scan the whole data.
Horizontal Partitioning
有各种方法可以对事实表进行分区。在水平分区中,我们必须牢记数据仓库可管理性的要求。
There are various ways in which a fact table can be partitioned. In horizontal partitioning, we have to keep in mind the requirements for manageability of the data warehouse.
Partitioning by Time into Equal Segments
在这种分区策略中,事实表根据时间段进行分区。此处的每个时间段都代表业务中的一个重要保留期。例如,如果用户查询 month to date data ,那么将数据分区到按月细分的段中是合适的。我们可以通过删除其中的数据来重用分区表。
In this partitioning strategy, the fact table is partitioned on the basis of time period. Here each time period represents a significant retention period within the business. For example, if the user queries for month to date data then it is appropriate to partition the data into monthly segments. We can reuse the partitioned tables by removing the data in them.
Partition by Time into Different-sized Segments
当访问过时的旧数据不频繁时采用这种划分。它作为一系列小分区实现,用于相对较新的数据,而较大的分区则用于非活动数据。
This kind of partition is done where the aged data is accessed infrequently. It is implemented as a set of small partitions for relatively current data, larger partition for inactive data.
Points to Note
-
The detailed information remains available online.
-
The number of physical tables is kept relatively small, which reduces the operating cost.
-
This technique is suitable where a mix of data dipping recent history and data mining through entire history is required.
-
This technique is not useful where the partitioning profile changes on a regular basis, because repartitioning will increase the operation cost of data warehouse.
Partition on a Different Dimension
事实表还可以根据时间维度之外的其他维度(如产品组、地区、供应商或任何其他维度)进行划分。让我们举个例子。
The fact table can also be partitioned on the basis of dimensions other than time such as product group, region, supplier, or any other dimension. Let’s have an example.
假设市场功能已按 state by state 方式被划分到不同的地区部门中。如果每个地区想要查询其区域内获取的信息,将事实表划分为区域分区会更有效。这会因为无需扫描不相关的信息而加速查询。
Suppose a market function has been structured into distinct regional departments like on a state by state basis. If each region wants to query on information captured within its region, it would prove to be more effective to partition the fact table into regional partitions. This will cause the queries to speed up because it does not require to scan information that is not relevant.
Points to Note
-
The query does not have to scan irrelevant data which speeds up the query process.
-
This technique is not appropriate where the dimensions are unlikely to change in future. So, it is worth determining that the dimension does not change in future.
-
If the dimension changes, then the entire fact table would have to be repartitioned.
Note - 我们建议仅根据时间维度进行分区,除非你确定建议的维度组在数据仓库的生命周期内不会改变。
Note − We recommend to perform the partition only on the basis of time dimension, unless you are certain that the suggested dimension grouping will not change within the life of the data warehouse.
Partition by Size of Table
当没有任何明确的维度可以将事实表进行分区时,我们应该 partition the fact table on the basis of their size. 我们将预先确定的尺寸设为一个临界点。当表超过预定的尺寸时,将创建一个新的表分区。
When there are no clear basis for partitioning the fact table on any dimension, then we should partition the fact table on the basis of their size. We can set the predetermined size as a critical point. When the table exceeds the predetermined size, a new table partition is created.
Points to Note
-
This partitioning is complex to manage.
-
It requires metadata to identify what data is stored in each partition.
Partitioning Dimensions
如果维度包含大量条目,则需要对维度进行分区。这里我们必须检查维度的尺寸。
If a dimension contains large number of entries, then it is required to partition the dimensions. Here we have to check the size of a dimension.
考虑一个随着时间推移而变化的大型设计。如果我们需要存储所有变体以进行比较,该维度可能非常大。这肯定会影响响应时间。
Consider a large design that changes over time. If we need to store all the variations in order to apply comparisons, that dimension may be very large. This would definitely affect the response time.
Round Robin Partitions
在循环法中,当需要一个新分区时,旧的分区将被存档。它使用元数据来允许用户访问工具引用正确的表分区。
In the round robin technique, when a new partition is needed, the old one is archived. It uses metadata to allow user access tool to refer to the correct table partition.
这种技术使得自动化数据仓库中的表管理功能变得容易。
This technique makes it easy to automate table management facilities within the data warehouse.
Vertical Partition
垂直分区垂直划分数据。下图描绘了如何执行垂直分区。
Vertical partitioning, splits the data vertically. The following images depicts how vertical partitioning is done.
垂直分区可以通过以下两种方式执行:
Vertical partitioning can be performed in the following two ways −
-
Normalization
-
Row Splitting
Normalization
规范化是数据库组织的标准关系方法。在此方法中,行合并到单行中,因此它减少了空间。看看以下表格,显示如何执行规范化。
Normalization is the standard relational method of database organization. In this method, the rows are collapsed into a single row, hence it reduce space. Take a look at the following tables that show how normalization is performed.
规范化之前的表格
Table before Normalization
Product_id |
Qty |
Value |
sales_date |
Store_id |
Store_name |
Location |
Region |
30 |
5 |
3.67 |
3-Aug-13 |
16 |
sunny |
Bangalore |
S |
35 |
4 |
5.33 |
3-Sep-13 |
16 |
sunny |
Bangalore |
S |
40 |
5 |
2.50 |
3-Sep-13 |
64 |
san |
Mumbai |
W |
45 |
7 |
5.66 |
3-Sep-13 |
16 |
sunny |
Bangalore |
S |
规范化之后的表格
Table after Normalization
Store_id |
Store_name |
Location |
Region |
16 |
sunny |
Bangalore |
W |
64 |
san |
Mumbai |
S |
Product_id |
Quantity |
Value |
sales_date |
Store_id |
30 |
5 |
3.67 |
3-Aug-13 |
16 |
35 |
4 |
5.33 |
3-Sep-13 |
16 |
40 |
5 |
2.50 |
3-Sep-13 |
64 |
45 |
7 |
5.66 |
3-Sep-13 |
16 |
Row Splitting
行拆分倾向于在分区之间保留一对一映射。行拆分的动机是通过减小其大小来加快对大表的访问速度。
Row splitting tends to leave a one-to-one map between partitions. The motive of row splitting is to speed up the access to large table by reducing its size.
Note − 在使用垂直分区时,确保在两个分区之间无需执行主要连接操作。
Note − While using vertical partitioning, make sure that there is no requirement to perform a major join operation between two partitions.
Identify Key to Partition
选择正确的分区键非常关键。选择错误的分区键将导致重新组织事实表。让我们举个例子。假设我们想要对以下表格进行分区。
It is very crucial to choose the right partition key. Choosing a wrong partition key will lead to reorganizing the fact table. Let’s have an example. Suppose we want to partition the following table.
Account_Txn_Table
transaction_id
account_id
transaction_type
value
transaction_date
region
branch_name
我们可以选择对任何键进行分区。两个可能的键可以是
We can choose to partition on any key. The two possible keys could be
-
region
-
transaction_date
假设业务组织在 30 个地理区域,每个区域都有不同数量的分支机构。这将为我们提供 30 个分区,这是合理的。此分区足够好,因为我们的要求捕获表明绝大多数查询都限制在用户自己的业务区域。
Suppose the business is organized in 30 geographical regions and each region has different number of branches. That will give us 30 partitions, which is reasonable. This partitioning is good enough because our requirements capture has shown that a vast majority of queries are restricted to the user’s own business region.
如果我们按交易日期而不是按区域分区,那么每个区域的最新交易将位于一个分区中。现在,希望查看自己区域内数据的用户必须跨多个分区查询。
If we partition by transaction_date instead of region, then the latest transaction from every region will be in one partition. Now the user who wants to look at data within his own region has to query across multiple partitions.
因此,值得确定正确的分区键。
Hence it is worth determining the right partitioning key.