Snowflake 简明教程
Snowflake - Cache
Snowflake 有一个独特的缓存功能。它基于此缓存提供快速的结果,扫描更少的数据。它甚至可以帮助客户降低账单。
Snowflake has a unique feature of caching. It provides fast and quick result with less data scan based on this caching. It even helps the customer to reduce their billing as well.
Snowflake 中基本上有三种类型的缓存。
There are basically three types of caching in Snowflake.
-
Metadata Caching
-
Query Result Caching
-
Data Caching
默认情况下,为所有雪花会话启用缓存。但用户可以根据自己的需要将其禁用。但是,用户只能禁用“查询结果”缓存,但不能禁用 Metadata Caching 和 Data Caching 。
By default, cache is enabled for all snowflake session. But user can disable it based on their needs. However, user can disable only Query Result caching but there is no way to disable Metadata Caching as well as Data Caching.
在本章中,我们将讨论不同类型的缓存以及 Snowflake 如何确定缓存。
In this chapter, we will discuss about the different types of cache and how snowflake decides caching.
Metadata Caching
元数据存储在云服务层中,因此缓存也在同一层中。这种元数据缓存始终对每个人启用。
Metadata stores at Cloud Service Layer hence caching is also at same layer. These metadata caching is always enabled for everyone.
它基本上包含以下详细信息:
It basically contains the following details −
-
Row Count in a table.
-
MIN/MAX value of a column
-
Number of DISTINCT values in a column>
-
Number of NULL values in a column
-
Details of different table versions
-
References of physical files
这些信息基本上由 SQL 优化器用于更快地执行。可能会有一些查询可以通过元数据本身完全得到解答。对于此类查询,无需虚拟仓库,但可能会适用云服务费用。
This information is basically used by SQL optimizer to execute faster and quicker. There could be a few queries those can be answered completely by metadata itself. For such kind of queries no virtual warehouse is required but Cloud service charges may be applicable.
这类查询类似于:
Such queries are like −
-
All SHOW commands
-
MIN, MAX but limited to only Integer/Number/Date data types of columns.
-
COUNT
我们来运行一条查询,查看元数据缓存是如何工作的,以及用户如何验证。
Let’s run a query to see how Metadata Caching works and user can validate.
登录 Snowflake 并转到“工作表”。通过运行以下查询暂停仓库:
Login into Snowflake and go to Worksheets. Suspend the warehouse by running following query −
ALTER WAREHOUSE COMPUTE_WH SUSPEND;
现在,按顺序运行以下查询:
Now, run following queries sequentially −
USE SCHEMA SNOWFLAKE_SAMPLE_DATA.TPCH_SF100;
SELECT MIN(L_orderkey), MAX(l_orderkey), COUNT(*) FROM lineitem;
用户将能够在不到 100 毫秒的时间内看到结果,如下图所示。单击查询 ID。它将显示查询 ID 的链接。然后单击该链接,如下所示:
User will be able to see the result in less than 100ms as shown in the following screenshot. Click the Query Id. It will display the link of query Id. Then click the link as shown below −

默认情况下,它将打开提及 SQL 的详细信息页面。单击 Profile 选项卡。它将显示 100% 基于元数据的查询结果。这意味着无需任何计算仓库,就能根据元数据缓存运行并获取结果。
By default, it opens the details page where SQL is mentioned. Click the Profile tab. It displays 100% Metadata-Based Result. It means that without any compute warehouse it ran the result and fetched details based on Metadata caching.
下图显示了上述步骤:
The following screenshot displays above steps −

Query Result Caching
查询结果由云服务层存储和管理。这在同一查询多次运行时非常有用,但前提条件是查询必须多次运行期间的基础数据或基础表保持不变。此缓存具有一个独特的特性,对同一个帐户内的其他用户可用。
Query Results are stored and managed by Cloud Service Layer. It is very useful if the same query run multiple times, but condition is underlying data or base tables are not changed between time duration when query has to run multiple times. This caching has unique feature that is available for other users within the same account.
例如,如果 user1 首次运行查询,结果将存储在缓存中。当 user2 也尝试运行同一个查询(假设基础表和数据未更改)时,它会从查询结果缓存中获取结果。
For example, If user1 runs a query first time, the result gets stored in caching. When user2 also tries to run same query (by assuming that base tables and data are not changed), it fetches the result from Query Result caching.
缓存结果可保存 24 小时。但是,每次重新运行同一查询时,24 小时的计时器都会重置。例如,如果查询在上午 10 点运行,其缓存将持续到次日上午 10 点。如果同一查询在同一天下午 2 点重新运行,现在缓存将持续到次日下午 2 点。
Result cached are available for 24hours. But, counter of 24hours get reset each time when the same query re-run. For example, if a query ran at 10AM, its caching will be available until 10AM next day. If the same query re-run at 2PM on same day, now the caching will be available until 2PM next day.
要使用查询结果缓存,需要满足一些条件:
There are some criteria to fulfil to use query result cache −
-
Exact same SQL query should be re-run.
-
There should not be any random function in the SQL.
-
User must have right permissions to use it.
-
Query result should be enabled while running the query. By default, it’s enabled until set otherwise.
以下是一些使用查询结果缓存的情况 −
Some cases for Query result caching are −
-
Queries those required massive amount of computing like Aggregate function and semi structured data analysis.
-
Queries those run very frequently.
-
Queries those are complex.
-
Refactor the output of another query like "USE TABLE function RESULT_SCAN(<query_id>)".
让我们运行一个查询,看看查询结果缓存是如何工作的,用户可以验证。
Let’s run a query to see how Query Result Caching works and user can validate.
登录 Snowflake 并转到工作表。通过运行以下查询恢复仓库 −
Login into Snowflake and go to Worksheets. Resume the warehouse by running following query −
ALTER WAREHOUSE COMPUTE_WH Resume;
现在,按顺序运行以下查询:
Now, run following queries sequentially −
USE SCHEMA SNOWFLAKE_SAMPLE_DATA.TPCH_SF100;
SELECT l_returnflag, l_linestatus,
SUM(l_quantity) AS sum_qty,
SUM(l_extendedprice) AS sum_base_price,
SUM(l_extendedprice * (l_discount)) AS sum_disc_price,
SUM(l_extendedprice * (l_discount) * (1+l_tax)) AS sum_charge,
AVG(l_quantity) AS avg_qty,
AVG(l_extendedprice) AS avg_price,
AVG(l_discount) AS avg_disc,
COUNT(*) AS count_order
FROM lineitem
WHERE l_shipdate <= dateadd(day, 90, to_date('1998-12-01'))
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;
单击查询 ID。它将显示查询 ID 的链接。然后单击链接,如上一个示例(Metadata-Caching)中所示。检查查询概要文件,它将如下所示 −
Click the Query Id. It will display the link of query Id. Then click on link as shown in previous example (Metadata-Caching). Check the Query profile, it will be displayed as shown below −

它显示已扫描 80.5% 的数据,因此未涉及缓存。通过运行以下查询暂停仓库 −
It shows 80.5% data is scanned so no cache was involved. Suspend the warehouse by running following query −
ALTER WAREHOUSE COMPUTE_WH Suspend;
再次运行与之前相同的查询 −
Run the same query again as we previously did −
USE SCHEMA SNOWFLAKE_SAMPLE_DATA.TPCH_SF100;
SELECT l_returnflag, l_linestatus,
SUM(l_quantity) AS sum_qty,
SUM(l_extendedprice) AS sum_base_price,
SUM(l_extendedprice * (l_discount)) AS sum_disc_price,
SUM(l_extendedprice * (l_discount) * (1+l_tax)) AS sum_charge,
AVG(l_quantity) AS avg_qty,
AVG(l_extendedprice) AS avg_price,
AVG(l_discount) AS avg_disc,
COUNT(*) AS count_order
FROM lineitem
WHERE l_shipdate <= dateadd(day, 90, to_date('1998-12-01'))
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;
单击 Query ID。它将显示查询 ID 的链接。然后单击链接,如上一个示例(Metadata-Caching)中所示。检查查询概要文件,它将如下所示 −
Click the Query Id. It will display the link of query Id. Then click on link as shown in previous example (Metadata-Caching). Check the Query profile, it will be displayed as shown below −

它显示查询结果重复使用。这意味着它在没有仓库查询的情况下成功运行,整个结果集已从查询结果缓存中获取。
It shows query result reuse. It means that without warehouse query it ran successfully and entire result set has been taken from Query Result Caching.
Data Cache
数据缓存发生在存储层。它缓存存储文件标头和查询列数据。它存储来自所有查询的数据,但并不完全像查询结果一样。它将这些数据存储到虚拟仓库的 SS 中。当运行类似查询时,Snowflake 尽可能使用数据缓存。用户无法禁用数据缓存。数据缓存适用于在同一虚拟仓库上运行的所有查询。这意味着与元数据和查询结果缓存不同,数据缓存无法在没有虚拟仓库的情况下运行。
Data cache takes place at storage layer. It caches storage file headers and column data from queries. It stores the data from all queries but not exactly as query result. It stores these data into SS of virtual warehouses. When similar query runs, Snowflake uses data cache as much as possible. User can not disable data cache. Data cache is available for all queries those run on the same virtual warehouse. It means that Data cache cannot work without virtual warehouse unlike Metadata and Query Result Cache.
当运行查询时,其标头和列数据存储在虚拟仓库的 SSD 上。虚拟仓库首先读取本地可用数据(虚拟仓库的 SSD),然后从远程云存储(Snowflake 的实际存储系统)读取剩余数据。当缓存存储空间填满时,数据会根据最近最少使用(LRU)原则进行丢弃。
When a query runs, it’s header and column data are stored on SSD of virtual warehouse. Virtual warehouse first read locally available data (SSD of Virtual warehouse) and then read remaining from remote cloud storage (actual snowflake’s storage system). Data keeps dropping on the bases of Least Used fashion when cache storages fills.
让我们运行一个查询,看看查询结果缓存是如何工作的,用户可以验证。
Let’s run a query to see how Query Result Caching works and user can validate.
登录 Snowflake 并转到 Worksheets 。通过运行以下查询恢复仓库 −
Login into Snowflake and go to the Worksheets. Resume the warehouse by running following query −
ALTER WAREHOUSE COMPUTE_WH Resume;
使用以下 SQL 禁用 Query_Result 缓存 −
Disable Query_Result cache by using following SQL −
ALTER SESSION SET USE_CACHED_RESULT = FALSE;
运行以下查询 −
Run the following query −
SELECT l_returnflag, l_linestatus,
SUM(l_quantity) AS sum_qty,
SUM(l_extendedprice) AS sum_base_price,
SUM(l_extendedprice * (l_discount)) AS sum_disc_price, SUM(l_extendedprice *
(l_discount) * (1+l_tax))
AS sum_charge, AVG(l_quantity) AS avg_qty,
AVG(l_extendedprice) AS avg_price,
AVG(l_discount) AS avg_disc,
COUNT(*) as count_order
FROM lineitem
WHERE l_shipdate <= dateadd(day, 90, to_date('1998-12-01'))
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;
单击 Query ID。它将显示查询 ID 的链接。然后单击链接,如上一个示例(Metadata-Caching)中所示。检查查询概要文件,它将如下所示 −
Click the Query Id. It will display the link of query Id. Then click on link as shown in previous example (Metadata-Caching). Check the Query profile, it will be displayed as shown below −

根据查询概要文件,已扫描 88.6% 的数据。如果您注意到右侧,则本地磁盘 IO = 2%,而远程磁盘 IO = 80%。这意味着几乎没有使用或根本没有使用数据缓存。现在,运行以下查询。WHERE 子句略有不同 −
As per query profile, 88.6% data is scanned. If you notice in the right side, Local Disk IO = 2% while Remote Disk IO = 80%. It means that very low or no data caching has been utilized. Now, run the following query. There is a little difference in WHERE clause −
SELECT l_returnflag, l_linestatus,
SUM(l_quantity) AS sum_qty,
SUM(l_extendedprice) AS sum_base_price,
SUM(l_extendedprice * (l_discount)) AS sum_disc_price, SUM(l_extendedprice *
(l_discount) * (1+l_tax))
AS sum_charge,
AVG(l_quantity) AS avg_qty,
AVG(l_extendedprice) AS avg_price,
AVG(l_discount) AS avg_disc,
COUNT(*) as count_order
FROM lineitem
WHERE l_shipdate <= dateadd(day, 90, to_date('1998-12-01'))
and l_extendedprice <= 20000
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;
单击 Query ID。它将显示查询 ID 的链接。然后单击链接,如上一个示例(Metadata-Caching)中所示。检查查询概要文件,它将如下所示 −
Click the Query Id. It will display the link of query Id. Then click on link as shown in previous example (Metadata-Caching). Check the Query profile, it will be displayed as shown below −

根据查询概要文件,已扫描 58.9% 的数据,这比第一次低得多。如果你注意到右侧,则本地磁盘 IO 已增至 4%,而远程磁盘 IO = 0%。这意味着几乎没有使用或根本没有从远程使用数据。
As per query profile, 58.9% data is scanned that is quite lower than 1st time. If you notice in the right side, Local Disk IO is increased to 4% while Remote Disk IO = 0%. It means that very low or no data has been utilized from remote.