According to IDC, the global data volume is expected to increase from 50 ZB today to 175 ZB in 2025 and 30% of that data will be generated in real-time. More than ever, we’ll have to be ready to ingest, analyze and learn from that data, and more than ever, scalable data solutions will be necessary to fulfill that demand.
Enter cloud data warehousing. Cloud data warehousing stands out from on-premises data warehousing, not only because of its scalability, but also because of its ease of integration with other systems, its security and its low maintenance. Yes, on-premises solutions do give you full control over your stack, but does this really outweigh the costs, system downtime, necessary maintenance and performance issues that might come along with it?
The picture below depicts a simplified distinction between on-premises and cloud data warehouses. Discussing the differences one by one would lead us too far, but if you’re interested in this topic, I highly recommend reading Bjorn’s related blog, which compares ETL with ELT.
Now, after considering the key factors that distinguish cloud from on-premises data warehouses, how does Snowflake differentiate itself from other cloud data warehouse platforms?
A lot of people talk about cloud data warehousing these days, but what exactly are the key differences between cloud and on-premises data warehouses, and how does Snowflake differentiate itself from the others? This blog post explains the basics of Snowflake’s revolutionary architecture and its components.
If we take a look at Snowflake’s architecture, there are three layers: Storage, Compute and Services. Each layer in this framework plays a crucial role in making the efficient, performant and scalable solution that Snowflake is. Let’s go through them one by one.
The first component of Snowflake’s architecture is the Storage Layer. This layer, also referred to as the Data Layer, provides the long-term storage of data, i.e. the Remote Disk, which is built on top of Amazon S3, Azure Blob Storage or Google Cloud Storage.
Here, Snowflake stores all of the data independently from the other layers, so storage scales independently of compute. The Storage Layer consists of data structured in databases, schemas, tables and, optionally, materialized views. Whereas the other layers of Snowflake have their own cache, the Storage Layer does not; instead, this layer is responsible for data resilience.
The second layer is called Compute, also known as the Query Processing Layer. This is where the actual work, i.e. querying data, is done by server clusters. Query processing happens with the use of (virtual) warehouses. In contrast to traditional data solutions, where a data warehouse consists of multiple databases, a Snowflake warehouse represents the required resources such as CPU, memory and cache to perform all kinds of operations related to data processing, such as selecting, inserting, deleting and updating data.
Warehouses can differ in size, like the size of a hoodie. Scaling up your warehouse, i.e. going from a smaller hoodie size (S) to a larger size (L), can be beneficial when the complexity of the query increases considerably. Be aware that running operations on larger warehouses will result in higher costs as well, and larger warehouses are not necessarily more performant than smaller warehouses when dealing with relatively simple queries.
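To make the cost trade-off of scaling up a bit more tangible, here is a minimal Python sketch. It assumes the credit rates Snowflake publishes for standard warehouses (doubling with each size step, starting at 1 credit per hour for XS); the query runtimes are purely hypothetical numbers chosen for illustration, and per-second billing minimums are ignored.

```python
# Assumed credits per hour for standard warehouses; each size step doubles the rate.
CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8, "XL": 16}


def credits_used(size: str, runtime_seconds: float) -> float:
    """Credits consumed by a single-cluster warehouse of `size` running for `runtime_seconds`."""
    return CREDITS_PER_HOUR[size] * runtime_seconds / 3600


# Hypothetical simple query: the larger warehouse is not faster, only more expensive.
simple_on_s = credits_used("S", 60)
simple_on_l = credits_used("L", 60)

# Hypothetical complex query: L finishes four times faster than S at the same total cost.
complex_on_s = credits_used("S", 40 * 60)
complex_on_l = credits_used("L", 10 * 60)

print(f"simple query:  S={simple_on_s:.3f} credits, L={simple_on_l:.3f} credits")
print(f"complex query: S={complex_on_s:.3f} credits, L={complex_on_l:.3f} credits")
```

In this made-up scenario, the simple query costs four times as much on size L without any gain, while the complex query costs the same on both sizes but returns much sooner on L.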
Warehouses can also differ in their type of clustering. By default, a Snowflake warehouse consists of a single cluster of servers. With multi-cluster warehouses, a warehouse can control multiple clusters of the predefined warehouse size.
For example, think of a warehouse cluster as a group of technicians fixing the plumbing in your house or apartment. Fixing a pipe has a certain level of complexity, and a group of technicians needs some time to fix it while sticking together as a team. The moment they start working, other pipes suddenly break. A single group of technicians can only move on to another pipe once they finish or abandon the current one. Now, in this story, where groups of plumbers are clusters of servers, Snowflake would add clusters as soon as more issues pop up, resulting in more problems being solved in parallel.
That’s an absurd way of depicting multi-cluster warehouses, but the point stands: higher concurrency, i.e. more simultaneous workload, can be handled by scaling out your warehouse. It is important to note that a more complex problem wouldn’t be fixed by scaling out your warehouse, because having more independent clusters of servers doesn’t mean that any single problem gets solved faster.
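The plumber story can be sketched in a few lines of Python. This is a deliberately simplified model, not how Snowflake schedules work: it assumes each cluster handles one query at a time (real clusters run several queries concurrently), and the function name and durations are made up for illustration.

```python
import heapq


def makespan(query_durations, clusters):
    """Time until all queries finish when each cluster runs one query at a time."""
    # Min-heap holding the time at which each cluster becomes free again.
    free_at = [0.0] * clusters
    heapq.heapify(free_at)
    finish = 0.0
    for duration in query_durations:
        start = heapq.heappop(free_at)  # pick the cluster that is free earliest
        end = start + duration
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish


six_broken_pipes = [10.0] * 6  # six equally complex queries arriving at once

one_team = makespan(six_broken_pipes, clusters=1)
three_teams = makespan(six_broken_pipes, clusters=3)
single_pipe = makespan([10.0], clusters=3)

print(one_team, three_teams, single_pipe)
```

With one cluster the six queries take 60 time units in total, with three clusters 20, but a single query still takes 10 no matter how many clusters are available. Scaling out helps with concurrency, not with complexity.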
The cache in this layer is often referred to as the Raw Data Cache or Warehouse Cache. This cache contains the actual raw data that was queried recently, and will disappear once a warehouse is suspended or dropped.
The last layer is called the Cloud Services Layer, or simply Services Layer. If we refer to the Compute Layer as the muscle of Snowflake's architecture, then the Cloud Services Layer definitely is the brain or intellect. Snowflake's Services Layer consists of several sub-services, namely:
Query Compilation & Optimization
Infrastructure & User Management
Authentication & Security
So basically, the Cloud Services Layer not only forms the (front-end) layer that an end-user interacts with, it also provides a highly available and distributed metadata store that doesn't need user compute resources. This means that gathering statistics like table size, the current schema you're in, or recent query results can be done without having a running warehouse.
Certain Data Definition Language (DDL) commands don't require a running warehouse. For example, CREATE, ALTER or DROP DATABASE/SCHEMA/TABLE can be run without a warehouse, for the simple reason that we're changing the underlying metadata of the object rather than the actual data.
Some simple queries can also be answered without an active, running warehouse. The Services Layer keeps some statistics and table information that don't need actual processing, such as COUNT, MIN, MAX and table size in bytes.
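Why no warehouse is needed for these becomes clear with a small sketch. The Python below assumes that per-partition statistics (row count, minimum, maximum) are already stored as metadata; the `PartitionStats` class and the numbers are hypothetical and only illustrate the idea, not Snowflake's internal format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionStats:
    """Hypothetical statistics kept for one column of one micro-partition."""
    row_count: int
    min_value: int
    max_value: int


def table_count(stats):
    # COUNT(*) is the sum of the stored row counts; no rows are read.
    return sum(p.row_count for p in stats)


def table_min(stats):
    # The table-wide minimum is the smallest of the per-partition minima.
    return min(p.min_value for p in stats)


def table_max(stats):
    # The table-wide maximum is the largest of the per-partition maxima.
    return max(p.max_value for p in stats)


order_amount_stats = [
    PartitionStats(row_count=1000, min_value=5, max_value=120),
    PartitionStats(row_count=800, min_value=1, max_value=90),
    PartitionStats(row_count=1200, min_value=30, max_value=400),
]

print(table_count(order_amount_stats),
      table_min(order_amount_stats),
      table_max(order_amount_stats))
```

All three answers come from combining a handful of stored numbers, which is why this kind of question can be answered from metadata alone.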
Query compilation and optimization are also performed in the Cloud Services Layer. As you might be aware, the data within Snowflake's centralized storage is organized in micro-partitions, about 10-100 MB compressed in size (50-500 MB uncompressed). To query this data optimally, Snowflake uses the concepts of data pruning and clustering, which are handled by the Services Layer as well.
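The idea behind pruning can be illustrated with a short Python sketch. It assumes that the minimum and maximum value of a date column are known for every micro-partition; the `MicroPartition` class, the partition names and the dates are invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroPartition:
    """Hypothetical metadata entry: the date range covered by one micro-partition."""
    name: str
    min_date: str  # ISO dates compare correctly as plain strings
    max_date: str


def prune(partitions, start, end):
    """Keep only partitions whose [min_date, max_date] range overlaps [start, end]."""
    return [p for p in partitions if p.max_date >= start and p.min_date <= end]


partitions = [
    MicroPartition("p1", "2023-01-01", "2023-01-31"),
    MicroPartition("p2", "2023-02-01", "2023-02-28"),
    MicroPartition("p3", "2023-02-15", "2023-03-20"),
    MicroPartition("p4", "2023-04-01", "2023-04-30"),
]

# A filter like: WHERE order_date BETWEEN '2023-02-10' AND '2023-02-20'
to_scan = prune(partitions, "2023-02-10", "2023-02-20")
print([p.name for p in to_scan])
```

Only `p2` and `p3` can contain matching rows, so the other partitions never have to be read. The better the data is clustered on the filter column, the narrower these ranges are and the more partitions can be skipped.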
Infrastructure, user management, authentication and security also happen in the Cloud Services Layer. Explaining these in detail would lead us too far, but, nonetheless, it is important to know that they're strictly separated from the other layers within Snowflake.
What is worth highlighting is data sharing, and how Snowflake differs here from more traditional systems. With traditional solutions, companies struggle to share their data, especially outside the organization, which often results in scattered, inconsistent data in different places, partly internal and partly external. In short: the limitations of shared-disk and shared-nothing environments.
Snowflake manages to overcome this issue in a highly efficient way, i.e. via the data sharing functionality. Unlike in traditional data warehouses, storage, compute and services are strictly separated from each other, although they do communicate seamlessly with each other. With its massively parallel processing, multi-cluster shared data design, Snowflake became one of the first solutions with virtually unlimited storage and compute, on account of the cloud's scalability.
Data sharing with external parties does not duplicate your data externally. Instead of keeping copies in several environments, your data is stored in one centralized storage, with several references to this data. Data cloning or sharing is nothing more than creating references to your centralized storage, which can be either internal or external.
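The "references instead of copies" idea can be sketched in Python as follows. This is a conceptual toy, not Snowflake's implementation: the `Table` class is hypothetical, and the assumption that a write on a clone only adds a new partition for that clone is mine, made to keep the example small.

```python
class Table:
    """Toy table: nothing more than a list of references to immutable partitions."""

    def __init__(self, partitions):
        self.partitions = list(partitions)

    def clone(self):
        # Copies the list of references only; the partitions themselves are shared.
        return Table(self.partitions)

    def append(self, rows):
        # Assumption: new data lands in a new partition referenced by this table only.
        self.partitions.append(tuple(rows))


orders = Table([(1, 2, 3), (4, 5, 6)])
orders_dev = orders.clone()

# Both tables point at the very same partition objects; nothing was duplicated.
shared = orders_dev.partitions[0] is orders.partitions[0]

orders_dev.append([7, 8])
print(shared, len(orders.partitions), len(orders_dev.partitions))
```

The clone shares its existing partitions with the original, and adding data to the clone leaves the original table untouched.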
Caching in the Services Layer also plays a crucial role in Snowflake's performance. This layer manages two types of cache: Metadata Cache and Results Cache.
Metadata is stored for two things: tables and micro-partitions. For tables, as mentioned above, statistics like row count, table size in bytes and file references are kept. For micro-partitions, Snowflake's contiguous units of storage, the count of NULL values, the number of distinct values and the minimum/maximum of all values are kept.
The Results Cache contains aggregated data that was queried recently. This means that if you calculated the total profit by month for this year, and you (or someone else with the same user role) run the exact same query no later than 24 hours after the initial query, Snowflake doesn't have to spin up or use a warehouse to retrieve this result. The aggregated results will be served directly from the cache, and no warehouse costs will be incurred.
Note that the Results Cache can only be used when the underlying data hasn't changed in the designated timeframe, i.e. 24 hours.
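The rules above (same query text, same role, within 24 hours, unchanged data) can be captured in a small Python sketch. The `ResultsCache` class, the `data_version` counter and the injectable clock are all hypothetical constructs for illustration; they show the behaviour described here, not how Snowflake implements it.

```python
import time

TTL_SECONDS = 24 * 60 * 60  # results stay reusable for 24 hours


class ResultsCache:
    """Toy results cache keyed on the exact query text and the user role."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._entries = {}  # (query_text, role) -> (result, data_version, stored_at)

    def put(self, query_text, role, data_version, result):
        self._entries[(query_text, role)] = (result, data_version, self._clock())

    def get(self, query_text, role, data_version):
        entry = self._entries.get((query_text, role))
        if entry is None:
            return None
        result, cached_version, stored_at = entry
        expired = self._clock() - stored_at > TTL_SECONDS
        if expired or cached_version != data_version:
            # Too old, or the underlying data changed: the result is no longer valid.
            del self._entries[(query_text, role)]
            return None
        return result


now = [0.0]  # fake clock so the example does not have to wait 24 hours
cache = ResultsCache(clock=lambda: now[0])
query = "SELECT month, SUM(profit) FROM sales GROUP BY month"

cache.put(query, "ANALYST", data_version=1, result={"jan": 100})
hit = cache.get(query, "ANALYST", data_version=1)          # same query, same role
other_role = cache.get(query, "INTERN", data_version=1)    # different role
now[0] = 25 * 60 * 60                                      # 25 hours later
too_late = cache.get(query, "ANALYST", data_version=1)
print(hit, other_role, too_late)
```

Only the first lookup returns the cached result; in the other cases a warehouse would have to run the query again.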