Snowflake - Data Architecture
Snowflake's data architecture is built around a new SQL query engine designed solely for the cloud. Snowflake is not built on top of any existing database technology, and it does not use big data software platforms such as Hadoop. It provides all the functionality of an analytical database, plus a number of additional unique features and capabilities.
Snowflake has a central data repository that stores structured and semi-structured data. This data is accessible from all compute nodes in the Snowflake platform. Snowflake uses virtual warehouses as the compute environment for processing queries. While processing queries, it relies on multi-cluster warehouses, micro-partitioning, and advanced caching. Snowflake's cloud services handle a user's request end to end, from validating the login to returning the result of a SELECT query.
Snowflake's data architecture has three main layers:
- Database Storage
- Query Processing
- Cloud Services
The following diagram shows Snowflake's data architecture:
[Figure: Snowflake data architecture diagram]

Database Storage
Snowflake can load data from files stored in Amazon S3, Microsoft Azure, and Google Cloud. Users upload files (.csv, .txt, and so on) to the cloud and then create a connection in Snowflake to bring the data in. The total data size is unlimited, but an individual file can be at most 5 GB, depending on the cloud service. Once data is loaded, Snowflake applies its internal optimization and compression techniques and stores the data in the central repository in a columnar format. The central repository itself resides in cloud storage.
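To make the idea of columnar, compressed storage concrete, here is a minimal sketch in Python. It is not Snowflake's actual storage format: the sample rows, the column names, and the use of `zlib` are all assumptions chosen for illustration. It only shows that grouping values by column puts similar values next to each other, which tends to compress well.

```python
import zlib

# Hypothetical sample rows; the column names are made up for illustration.
rows = [
    {"id": i, "country": "US" if i % 2 == 0 else "DE", "amount": 100 + (i % 5)}
    for i in range(1000)
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def serialize_column(values):
    """Serialize one column as newline-separated text."""
    return "\n".join(str(v) for v in values).encode("utf-8")


columns = to_columns(rows)

# Compress each column independently (an assumption, not Snowflake's codec).
compressed = {
    name: zlib.compress(serialize_column(values))
    for name, values in columns.items()
}

for name, blob in compressed.items():
    raw_size = len(serialize_column(columns[name]))
    print(f"{name}: {raw_size} bytes raw, {len(blob)} bytes compressed")
```

Low-cardinality columns such as `country` shrink the most, which is one reason analytical databases favor column-oriented layouts.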
Snowflake takes responsibility for all aspects of data management: how data is stored (including automatic clustering), how it is organized and structured, how it is compressed and split across many micro-partitions, and how metadata and statistics are maintained. Snowflake stores data as data objects that users cannot see or access directly. Users access the data only through SQL queries, run either in Snowflake's UI or from a programming language such as Java, Python, PHP, or Ruby.
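The role of per-partition metadata can be illustrated with a small sketch. The `MicroPartition` class, the fixed partition size, and the min/max statistics below are hypothetical simplifications, not Snowflake's internal structures. The sketch shows how keeping a value range for each partition lets a range query skip partitions that cannot contain matching rows.

```python
from dataclasses import dataclass


@dataclass
class MicroPartition:
    """Hypothetical stand-in for one micro-partition and its metadata."""
    rows: list      # values of a single column stored in this partition
    min_value: int  # smallest value in the partition
    max_value: int  # largest value in the partition


def build_partitions(values, partition_size):
    """Split values into fixed-size partitions and record min/max for each."""
    partitions = []
    for start in range(0, len(values), partition_size):
        chunk = values[start:start + partition_size]
        partitions.append(MicroPartition(chunk, min(chunk), max(chunk)))
    return partitions


def query_range(partitions, low, high):
    """Return values in [low, high] and the number of partitions scanned."""
    scanned = 0
    result = []
    for part in partitions:
        if part.max_value < low or part.min_value > high:
            continue  # skipped using metadata only, no rows are read
        scanned += 1
        result.extend(v for v in part.rows if low <= v <= high)
    return result, scanned


partitions = build_partitions(list(range(1, 101)), partition_size=10)
matches, scanned = query_range(partitions, 42, 47)
print(f"{len(matches)} rows matched, {scanned} of {len(partitions)} partitions scanned")
```

Skipping works best when similar values are stored together, which is where clustering of the data comes in.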
Query Processing
Query execution takes place in the processing layer, also called the compute layer. To process a query, Snowflake requires a compute environment, which it calls a "virtual warehouse". A virtual warehouse is a compute cluster that provides the CPU, memory, and temporary storage needed to execute SQL and DML (Data Manipulation Language) operations, including the following:
- Executing SQL SELECT statements
- Updating data using UPDATE, INSERT, and DELETE
- Loading data into tables using COPY INTO <table>
- Unloading data from tables using COPY INTO <location>
The number of servers in a cluster depends on the size of the virtual warehouse. For example, an X-Small warehouse has 1 server per cluster and a Small warehouse has 2 servers per cluster; the count doubles with each step up in size (Large, X-Large, and so on).
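The doubling rule described above can be written as a short function. This is a sketch only: the size names in `SIZES` are illustrative labels, Medium is assumed to sit between Small and Large, and the optional `clusters` parameter is a hypothetical addition to show the total for a multi-cluster warehouse.

```python
# Warehouse sizes in ascending order. Medium is assumed to sit between
# Small and Large; the text above names only some of the sizes.
SIZES = ["XSMALL", "SMALL", "MEDIUM", "LARGE", "XLARGE"]


def servers_per_cluster(size):
    """Return the server count for one cluster, doubling at each size step."""
    index = SIZES.index(size.upper())
    return 2 ** index


def total_servers(size, clusters=1):
    """Return the server count across all clusters of a warehouse."""
    if clusters < 1:
        raise ValueError("a warehouse needs at least one cluster")
    return servers_per_cluster(size) * clusters


for size in SIZES:
    print(size, servers_per_cluster(size))
```

For instance, `total_servers("SMALL", clusters=3)` multiplies the 2 servers of a Small cluster by 3 clusters.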
While executing a query, Snowflake analyzes it, reads the latest micro-partitions, and checks its caches at several stages to improve performance and reduce the time needed to fetch the data. Less execution time means fewer credits consumed by the user.
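The effect of caching on compute usage can be sketched as follows. This is a toy model under stated assumptions, not Snowflake's caching implementation: `CachingExecutor` and `fake_scan` are hypothetical names, the cache is keyed on the exact query text, and invalidation when the underlying data changes is not modeled.

```python
class CachingExecutor:
    """Toy executor that reuses the result of an identical earlier query."""

    def __init__(self, run_query):
        self._run_query = run_query  # function that does the expensive work
        self._result_cache = {}      # exact query text -> cached result
        self.executions = 0          # how many times compute was really used

    def execute(self, sql):
        if sql in self._result_cache:
            return self._result_cache[sql]  # served from cache, no compute
        self.executions += 1
        result = self._run_query(sql)
        self._result_cache[sql] = result
        return result


def fake_scan(sql):
    """Hypothetical stand-in for scanning micro-partitions on a warehouse."""
    return [("row", len(sql))]


executor = CachingExecutor(fake_scan)
first = executor.execute("SELECT * FROM orders")
second = executor.execute("SELECT * FROM orders")
print(f"compute used {executor.executions} time(s) for 2 requests")
```

The second request returns the stored result without calling `fake_scan` again, which is the sense in which caching saves both time and credits.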
Cloud Services
The cloud services layer is the "brain" of Snowflake. It coordinates and manages activities across Snowflake, and it ties all of Snowflake's components together to process user requests, from validating the login to delivering the query's response.
The following services are managed at this layer:
- It provides centralized management for all storage.
- It manages the compute environments that work with storage.
- It is responsible for upgrades, updates, patching, and configuration of Snowflake in the cloud.
- It performs cost-based optimization of SQL queries.
- It automatically gathers statistics such as credits used and storage capacity utilization.
- It handles security, such as authentication and access control based on roles and users.
- It performs encryption and key management.
- It stores metadata as data is loaded into the system.
And many more…
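As an illustration of the access control item in the list above, here is a minimal sketch of a role-based check. The users, roles, object names, and the `is_allowed` helper are all hypothetical; the sketch only shows the general idea that privileges are granted to roles and roles are granted to users.

```python
# Hypothetical grants: the privileges each role holds on each object.
ROLE_PRIVILEGES = {
    "ANALYST": {("SALES_DB.ORDERS", "SELECT")},
    "LOADER": {("SALES_DB.ORDERS", "SELECT"), ("SALES_DB.ORDERS", "INSERT")},
}

# Hypothetical users and the roles granted to them.
USER_ROLES = {
    "alice": {"ANALYST"},
    "bob": {"ANALYST", "LOADER"},
}


def is_allowed(user, obj, privilege):
    """Return True if any role granted to the user holds the privilege."""
    for role in USER_ROLES.get(user, set()):
        if (obj, privilege) in ROLE_PRIVILEGES.get(role, set()):
            return True
    return False


print(is_allowed("alice", "SALES_DB.ORDERS", "INSERT"))
print(is_allowed("bob", "SALES_DB.ORDERS", "INSERT"))
```

Because permissions attach to roles rather than directly to users, changing what a group of users may do means changing one role's grants.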