Snowflake 简明教程

Snowflake - Cache

Snowflake 有一个独特的缓存功能。它基于此缓存提供快速的结果,扫描更少的数据。它甚至可以帮助客户降低账单。

Snowflake 中基本上有三种类型的缓存。

  1. Metadata Caching

  2. Query Result Caching

  3. Data Caching

默认情况下,为所有雪花会话启用缓存。但用户可以根据自己的需要将其禁用。但是,用户只能禁用“查询结果”缓存,但不能禁用 Metadata CachingData Caching

在本章中,我们将讨论不同类型的缓存以及 Snowflake 如何确定缓存。

Metadata Caching

元数据存储在云服务层中,因此缓存也在同一层中。这种元数据缓存始终对每个人启用。

它基本上包含以下详细信息:

  1. 表中的行数。

  2. 列的最小/最大值

  3. Number of DISTINCT values in a column>

  4. Number of NULL values in a column

  5. Details of different table versions

  6. References of physical files

This information is basically used by SQL optimizer to execute faster and quicker. There could be a few queries those can be answered completely by metadata itself. For such kind of queries no virtual warehouse is required but Cloud service charges may be applicable.

Such queries are like −

  1. All SHOW commands

  2. MIN, MAX but limited to only Integer/Number/Date data types of columns.

  3. COUNT

Let’s run a query to see how Metadata Caching works and user can validate.

Login into Snowflake and go to Worksheets. Suspend the warehouse by running following query −

ALTER WAREHOUSE COMPUTE_WH SUSPEND;

Now, run following queries sequentially −

USE SCHEMA SNOWFLAKE_SAMPLE_DATA.TPCH_SF100;
SELECT MIN(L_orderkey), MAX(l_orderkey), COUNT(*) FROM lineitem;

User will be able to see the result in less than 100ms as shown in the following screenshot. Click the Query Id. It will display the link of query Id. Then click the link as shown below −

query id

By default, it opens the details page where SQL is mentioned. Click the Profile tab. It displays 100% Metadata-Based Result. It means that without any compute warehouse it ran the result and fetched details based on Metadata caching.

The following screenshot displays above steps −

profile

Query Result Caching

Query Results are stored and managed by Cloud Service Layer. It is very useful if the same query run multiple times, but condition is underlying data or base tables are not changed between time duration when query has to run multiple times. This caching has unique feature that is available for other users within the same account.

For example, If user1 runs a query first time, the result gets stored in caching. When user2 also tries to run same query (by assuming that base tables and data are not changed), it fetches the result from Query Result caching.

Result cached are available for 24hours. But, counter of 24hours get reset each time when the same query re-run. For example, if a query ran at 10AM, its caching will be available until 10AM next day. If the same query re-run at 2PM on same day, now the caching will be available until 2PM next day.

There are some criteria to fulfil to use query result cache −

  1. Exact same SQL query should be re-run.

  2. There should not be any random function in the SQL.

  3. User must have right permissions to use it.

  4. 在运行查询时应用启用查询结果。默认情况下会启用,直至另行设置。

以下是一些使用查询结果缓存的情况 −

  1. 需要大量计算的查询,如聚合函数和半结构化数据分析。

  2. 非常频繁运行的查询。

  3. Queries those are complex.

  4. 重构另一个查询的输出,如“使用表函数 RESULT_SCAN(<query_id>)”。

让我们运行一个查询,看看查询结果缓存是如何工作的,用户可以验证。

登录 Snowflake 并转到工作表。通过运行以下查询恢复仓库 −

ALTER WAREHOUSE COMPUTE_WH Resume;

Now, run following queries sequentially −

USE SCHEMA SNOWFLAKE_SAMPLE_DATA.TPCH_SF100;
SELECT l_returnflag, l_linestatus,
SUM(l_quantity) AS sum_qty,
SUM(l_extendedprice) AS sum_base_price,
SUM(l_extendedprice * (l_discount)) AS sum_disc_price,
SUM(l_extendedprice * (l_discount) * (1+l_tax)) AS sum_charge,
AVG(l_quantity) AS avg_qty,
AVG(l_extendedprice) AS avg_price,
AVG(l_discount) AS avg_disc,
COUNT(*) AS count_order
FROM lineitem
WHERE l_shipdate <= dateadd(day, 90, to_date('1998-12-01'))
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;

单击查询 ID。它将显示查询 ID 的链接。然后单击链接,如上一个示例(Metadata-Caching)中所示。检查查询概要文件,它将如下所示 −

query profile

它显示已扫描 80.5% 的数据,因此未涉及缓存。通过运行以下查询暂停仓库 −

ALTER WAREHOUSE COMPUTE_WH Suspend;

再次运行与之前相同的查询 −

USE SCHEMA SNOWFLAKE_SAMPLE_DATA.TPCH_SF100;
SELECT l_returnflag, l_linestatus,
SUM(l_quantity) AS sum_qty,
SUM(l_extendedprice) AS sum_base_price,
SUM(l_extendedprice * (l_discount)) AS sum_disc_price,
SUM(l_extendedprice * (l_discount) * (1+l_tax)) AS sum_charge,
AVG(l_quantity) AS avg_qty,
AVG(l_extendedprice) AS avg_price,
AVG(l_discount) AS avg_disc,
COUNT(*) AS count_order
FROM lineitem
WHERE l_shipdate <= dateadd(day, 90, to_date('1998-12-01'))
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;

单击 Query ID。它将显示查询 ID 的链接。然后单击链接,如上一个示例(Metadata-Caching)中所示。检查查询概要文件,它将如下所示 −

query result

它显示查询结果重复使用。这意味着它在没有仓库查询的情况下成功运行,整个结果集已从查询结果缓存中获取。

Data Cache

数据缓存发生在存储层。它缓存存储文件标头和查询列数据。它存储来自所有查询的数据,但并不完全像查询结果一样。它将这些数据存储到虚拟仓库的 SS 中。当运行类似查询时,Snowflake 尽可能使用数据缓存。用户无法禁用数据缓存。数据缓存适用于在同一虚拟仓库上运行的所有查询。这意味着与元数据和查询结果缓存不同,数据缓存无法在没有虚拟仓库的情况下运行。

当运行查询时,其标头和列数据存储在虚拟仓库的 SSD 上。虚拟仓库首先读取本地可用数据(虚拟仓库的 SSD),然后从远程云存储(Snowflake 的实际存储系统)读取剩余数据。当缓存存储空间填满时,数据会根据最近最少使用(LRU)原则进行丢弃。

让我们运行一个查询,看看查询结果缓存是如何工作的,用户可以验证。

登录 Snowflake 并转到 Worksheets 。通过运行以下查询恢复仓库 −

ALTER WAREHOUSE COMPUTE_WH Resume;

使用以下 SQL 禁用 Query_Result 缓存 −

ALTER SESSION SET USE_CACHED_RESULT = FALSE;

运行以下查询 −

SELECT l_returnflag, l_linestatus,
SUM(l_quantity) AS sum_qty,
SUM(l_extendedprice) AS sum_base_price,
SUM(l_extendedprice * (l_discount)) AS sum_disc_price, SUM(l_extendedprice *
(l_discount) * (1+l_tax))
AS sum_charge, AVG(l_quantity) AS avg_qty,
AVG(l_extendedprice) AS avg_price,
AVG(l_discount) AS avg_disc,
COUNT(*) as count_order
FROM lineitem
WHERE l_shipdate <= dateadd(day, 90, to_date('1998-12-01'))
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;

单击 Query ID。它将显示查询 ID 的链接。然后单击链接,如上一个示例(Metadata-Caching)中所示。检查查询概要文件,它将如下所示 −

data cache

根据查询概要文件,已扫描 88.6% 的数据。如果您注意到右侧,则本地磁盘 IO = 2%,而远程磁盘 IO = 80%。这意味着几乎没有使用或根本没有使用数据缓存。现在,运行以下查询。WHERE 子句略有不同 −

SELECT l_returnflag, l_linestatus,
SUM(l_quantity) AS sum_qty,
SUM(l_extendedprice) AS sum_base_price,
SUM(l_extendedprice * (l_discount)) AS sum_disc_price, SUM(l_extendedprice *
(l_discount) * (1+l_tax))
AS sum_charge,
AVG(l_quantity) AS avg_qty,
AVG(l_extendedprice) AS avg_price,
AVG(l_discount) AS avg_disc,
COUNT(*) as count_order
FROM lineitem
WHERE l_shipdate <= dateadd(day, 90, to_date('1998-12-01'))
and l_extendedprice <= 20000
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;

单击 Query ID。它将显示查询 ID 的链接。然后单击链接,如上一个示例(Metadata-Caching)中所示。检查查询概要文件,它将如下所示 −

click query id

根据查询概要文件,已扫描 58.9% 的数据,这比第一次低得多。如果你注意到右侧,则本地磁盘 IO 已增至 4%,而远程磁盘 IO = 0%。这意味着几乎没有使用或根本没有从远程使用数据。