Dwh 简明教程

Data Warehousing - Tuning

数据仓库不断发展,无法预测用户将来会发布什么查询。因此,调整数据仓库系统变得更加困难。在本章中,我们将讨论如何调整数据仓库的不同方面,例如性能、数据加载、查询等。

A data warehouse keeps evolving and it is unpredictable what query the user is going to post in the future. Therefore it becomes more difficult to tune a data warehouse system. In this chapter, we will discuss how to tune the different aspects of a data warehouse such as performance, data load, queries, etc.

Difficulties in Data Warehouse Tuning

由于以下原因,调整数据仓库是一项困难的过程:

Tuning a data warehouse is a difficult procedure due to following reasons −

  1. Data warehouse is dynamic; it never remains constant.

  2. It is very difficult to predict what query the user is going to post in the future.

  3. Business requirements change with time.

  4. Users and their profiles keep changing.

  5. The user can switch from one group to another.

  6. The data load on the warehouse also changes with time.

Note − 对数据仓库有全面的了解非常重要。

Note − It is very important to have a complete knowledge of data warehouse.

Performance Assessment

下面是有关性能的客观衡量标准。

Here is a list of objective measures of performance −

  1. Average query response time

  2. Scan rates

  3. Time used per day query

  4. Memory usage per process

  5. I/O throughput rates

以下为需要记住的要点。

Following are the points to remember.

  1. It is necessary to specify the measures in service level agreement (SLA).

  2. It is of no use trying to tune response time, if they are already better than those required.

  3. It is essential to have realistic expectations while making performance assessment.

  4. It is also essential that the users have feasible expectations.

  5. To hide the complexity of the system from the user, aggregations and views should be used.

  6. It is also possible that the user can write a query you had not tuned for.

Data Load Tuning

数据加载是夜间处理的关键部分。在数据加载完成之前,不能运行其他任何操作。这是进入系统时的入口。

Data load is a critical part of overnight processing. Nothing else can run until data load is complete. This is the entry point into the system.

Note − 如果数据传输延迟或数据到达延迟,则整个系统都将受到严重影响。因此,首先调优数据加载非常重要。

Note − If there is a delay in transferring the data, or in arrival of data then the entire system is affected badly. Therefore it is very important to tune the data load first.

有不同的数据加载调优方法,如下所述 −

There are various approaches of tuning data load that are discussed below −

  1. The very common approach is to insert data using the SQL Layer. In this approach, normal checks and constraints need to be performed. When the data is inserted into the table, the code will run to check for enough space to insert the data. If sufficient space is not available, then more space may have to be allocated to these tables. These checks take time to perform and are costly to CPU.

  2. The second approach is to bypass all these checks and constraints and place the data directly into the preformatted blocks. These blocks are later written to the database. It is faster than the first approach, but it can work only with whole blocks of data. This can lead to some space wastage.

  3. The third approach is that while loading the data into the table that already contains the table, we can maintain indexes.

  4. The fourth approach says that to load the data in tables that already contain data, drop the indexes & recreate them when the data load is complete. The choice between the third and the fourth approach depends on how much data is already loaded and how many indexes need to be rebuilt.

Integrity Checks

完整性检查会极大地影响加载的性能。以下为需要记住的要点 −

Integrity checking highly affects the performance of the load. Following are the points to remember −

  1. Integrity checks need to be limited because they require heavy processing power.

  2. Integrity checks should be applied on the source system to avoid performance degrade of data load.

Tuning Queries

数据仓库中有两种查询 −

We have two kinds of queries in data warehouse −

  1. Fixed queries

  2. Ad hoc queries

Fixed Queries

固定查询定义明确。以下是固定查询的示例 −

Fixed queries are well defined. Following are the examples of fixed queries −

  1. regular reports

  2. Canned queries

  3. Common aggregations

数据仓库中固定查询的调整与关系数据库系统中的调整相同。唯一的区别在于要查询的数据量可能不同。在测试固定查询时,最好存储最成功的执行计划。存储这些执行计划将使我们能够发现不断变化的数据大小和数据偏斜,因为它将导致执行计划发生更改。

Tuning the fixed queries in a data warehouse is same as in a relational database system. The only difference is that the amount of data to be queried may be different. It is good to store the most successful execution plan while testing fixed queries. Storing these executing plan will allow us to spot changing data size and data skew, as it will cause the execution plan to change.

Note − 我们不能对事实表进行更多操作,但在处理维表或汇总时,可使用通常收集的 SQL 调整、存储机制和访问方法来调整这些查询。

Note − We cannot do more on fact table but while dealing with dimension tables or the aggregations, the usual collection of SQL tweaking, storage mechanism, and access methods can be used to tune these queries.

Ad hoc Queries

要理解即席查询,了解数据仓库的即席用户非常重要。对于每个用户或用户组,你需要了解以下内容 −

To understand ad hoc queries, it is important to know the ad hoc users of the data warehouse. For each user or group of users, you need to know the following −

  1. The number of users in the group

  2. Whether they use ad hoc queries at regular intervals of time

  3. Whether they use ad hoc queries frequently

  4. Whether they use ad hoc queries occasionally at unknown intervals.

  5. The maximum size of query they tend to run

  6. The average size of query they tend to run

  7. Whether they require drill-down access to the base data

  8. The elapsed login time per day

  9. The peak time of daily usage

  10. The number of queries they run per peak hour

Points to Note

Points to Note

  1. It is important to track the user’s profiles and identify the queries that are run on a regular basis.

  2. It is also important that the tuning performed does not affect the performance.

  3. Identify similar and ad hoc queries that are frequently run.

  4. If these queries are identified, then the database will change and new indexes can be added for those queries.

  5. If these queries are identified, then new aggregations can be created specifically for those queries that would result in their efficient execution.