Snowflake – Part 1: Introduction

 

Microsoft have a well-established and successful
architecture for modern data warehousing, which Adatis has implemented
at multiple clients. A typical implementation of this architecture
looks like the following:

 

[Figure: a typical Microsoft modern data warehousing architecture on Azure]

 

 

Azure SQL Data Warehouse is Microsoft’s cloud-based data
warehousing offering. It provides an MPP architecture in which compute can be
scaled elastically and quickly with limited downtime, it integrates easily with
other Azure services such as Azure Databricks and Azure Analysis Services, and
it can be secured using Azure Active Directory.

 

 

 

While Azure SQL Data Warehouse will allow you to ingest
large datasets and provides very fast query times, there are times when it
might not exactly fit your requirements. One example is that Azure SQL
Data Warehouse requires downtime to scale; due to its architecture this is
unavoidable, yet some companies want 100% uptime even while scaling.
This is where new technologies such as Snowflake come in. Snowflake sells
itself as a new approach to data warehousing, promising to address some of the
issues other cloud data warehouse solutions have, especially performance issues
around concurrency and the downtime needed to scale. This blog will introduce
Snowflake’s architecture and how its Azure implementation can fit into a larger
Microsoft workflow. Later blogs in the series will look at setting up a
Snowflake instance and comparing Snowflake’s performance to Azure SQL Data
Warehouse.
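
To make the scaling difference concrete, here is a minimal Python sketch
using the snowflake-connector-python package. The account details and the
warehouse name (ANALYTICS_WH) are placeholders, not part of any real
deployment:

    import snowflake.connector

    # Connection details are placeholders - substitute your own account and credentials.
    conn = snowflake.connector.connect(
        user="MY_USER",
        password="MY_PASSWORD",
        account="my_account",
    )
    cur = conn.cursor()

    # Resize the (hypothetical) warehouse ANALYTICS_WH while it stays online:
    # in-flight queries finish on the old size and new queries pick up the new
    # size, so there is no downtime - unlike scaling Azure SQL Data Warehouse,
    # which pauses query execution while the service objective changes.
    cur.execute("ALTER WAREHOUSE ANALYTICS_WH SET WAREHOUSE_SIZE = 'LARGE'")

    conn.close()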

 

 

 

Snowflake’s architecture consists of three components:

 

[Figure: Snowflake’s three-layer architecture – database storage, query processing and cloud services]

 

 

The Database storage layer is where data is stored when
loaded into Snowflake, which manages all aspects of how it is stored,
including organisation, file sizes, compression and metadata. Data is stored
in a columnar format, much like a clustered columnstore index in an Azure SQL
Data Warehouse, and is replicated across multiple regions to ensure high
availability. The data stored in this layer is accessible only through SQL
queries in Snowflake; there is no direct access to the underlying files as
there would be in a SQL Server database.
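
To illustrate, everything in the storage layer is manipulated through SQL
statements, for example bulk loading files from a stage with COPY INTO
rather than writing to the underlying storage directly. A minimal Python
sketch, assuming a table (sales) and a stage (sales_stage) that are purely
hypothetical:

    import snowflake.connector

    # Placeholders - substitute your own account and credentials.
    conn = snowflake.connector.connect(
        user="MY_USER", password="MY_PASSWORD", account="my_account"
    )
    cur = conn.cursor()

    # Load CSV files from a (hypothetical) stage into a table. Snowflake decides
    # how the data is physically organised, compressed and stored; the
    # underlying files are never touched directly.
    cur.execute("""
        COPY INTO sales
        FROM @sales_stage
        FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
    """)

    conn.close()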

 

 

 

The Query processing layer is a series of “virtual
warehouses”, which are independent MPP clusters. This means you can provision
multiple clusters without any performance impact on the others, and there
is a high level of resilience: while a virtual warehouse only exists in a
single region, if that region becomes unavailable, Snowflake will automatically
re-provision the virtual warehouse in another. Another feature of the Query
processing layer is that it automatically caches data for every query it
processes, which the query optimiser in the Cloud services layer can use for
subsequent queries – greatly reducing query times in some
cases.
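
As a sketch of what this looks like in practice, provisioning an extra
virtual warehouse is a single SQL statement; the name and settings below are
illustrative:

    import snowflake.connector

    # Placeholders - substitute your own account and credentials.
    conn = snowflake.connector.connect(
        user="MY_USER", password="MY_PASSWORD", account="my_account"
    )
    cur = conn.cursor()

    # Create an independent MPP cluster for ETL work. It shares the storage
    # layer with every other warehouse but competes with none of them for
    # compute.
    cur.execute("""
        CREATE WAREHOUSE IF NOT EXISTS ETL_WH
        WITH WAREHOUSE_SIZE = 'MEDIUM'
             AUTO_SUSPEND = 300   -- suspend after 5 idle minutes to save credits
             AUTO_RESUME = TRUE   -- wake automatically on the next query
    """)

    conn.close()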

 

 

 

Finally, the Cloud services layer is the coordination
layer for Snowflake, handling everything from authentication to query
optimisation to infrastructure management. This means much of the administration
associated with cloud data warehousing is removed from the user and handled
internally by Snowflake.

 

 

 

By completely separating storage and compute, Snowflake
can take advantage of the performance benefits of existing data architectures
(such as shared-disk and shared-nothing*) while eliminating some of their
problems, such as contention under concurrency and data movement and
reorganisation between disks. On concurrency: many existing architectures
struggle when multiple connections all compete for the same resources;
Snowflake manages this by allowing virtual warehouses to be scaled instantly,
making more resources available. On data movement and reorganisation: a
shared-nothing architecture relies on choosing the correct type of distribution
across nodes to minimise data movement at query time, and if this is misjudged
it can lead to much slower query times. With a single data store, no such data
movement is required in Snowflake.
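
As a hedged sketch of how this plays out, a virtual warehouse can be resized
on demand and, on Snowflake’s Enterprise edition, configured as a
multi-cluster warehouse that adds and removes clusters automatically as
concurrent demand changes. Names and limits below are illustrative:

    import snowflake.connector

    # Placeholders - substitute your own account and credentials.
    conn = snowflake.connector.connect(
        user="MY_USER", password="MY_PASSWORD", account="my_account"
    )
    cur = conn.cursor()

    # Scale up instantly when a heavy workload arrives...
    cur.execute("ALTER WAREHOUSE ANALYTICS_WH SET WAREHOUSE_SIZE = 'XLARGE'")

    # ...and let Snowflake add clusters under concurrent load (multi-cluster
    # warehouses require the Enterprise edition or above).
    cur.execute("""
        ALTER WAREHOUSE ANALYTICS_WH SET
            MIN_CLUSTER_COUNT = 1
            MAX_CLUSTER_COUNT = 3
    """)

    conn.close()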

 

 

 

Conceptually this is all well and good, but as a
Microsoft shop, how could Snowflake fit with the work that Adatis does?
Previously Snowflake was deployed on AWS components, but it was recently
announced that it would be made available on Azure. This means using Azure
Blob Storage for storage and Azure Compute (read: VMs) for
compute. With this separation, it is straightforward to see how seamless
scaling is possible. Need more compute? Add a more powerful VM to your
cluster. Need concurrency between an analytical workload, a data ingestion
workload and a data science workload? Split these workloads between different
compute clusters (virtual warehouses). In addition, Snowflake can use Azure
Active Directory for enterprise-grade security, use Azure Data Lake Store as a
source, connect from Azure Data Factory via a custom activity using the .NET
connector, and export ARM templates so it can be integrated
into a DevOps pipeline.
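
The .NET connector mentioned above has equivalents in other languages; the
Python sketch below shows the same ideas – an external stage over Azure Blob
Storage and workload isolation via USE WAREHOUSE. The container URL, SAS token
and warehouse and table names are all placeholders:

    import snowflake.connector

    # Placeholders - substitute your own account and credentials.
    conn = snowflake.connector.connect(
        user="MY_USER", password="MY_PASSWORD", account="my_account"
    )
    cur = conn.cursor()

    # Point an external stage at an Azure Blob Storage container
    # (URL and SAS token are placeholders).
    cur.execute("""
        CREATE STAGE IF NOT EXISTS azure_stage
        URL = 'azure://myaccount.blob.core.windows.net/mycontainer'
        CREDENTIALS = (AZURE_SAS_TOKEN = 'my-sas-token')
    """)

    # Route the ingestion workload to its own compute so it cannot contend
    # with the analytical or data science warehouses.
    cur.execute("USE WAREHOUSE INGEST_WH")
    cur.execute("COPY INTO staging_sales FROM @azure_stage FILE_FORMAT = (TYPE = 'CSV')")

    conn.close()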
 

 

 

 

As we have seen, Snowflake on Azure offers a solution to
some of the issues that exist with other cloud data warehouses. However, there
are some potential red flags, including a perceived lack of control over settings
and configuration, and a slightly opaque costing model. In the next blog in
this series, we will look at what you need to do to provision a Snowflake
instance, as well as explore these red flags further.

 

* Shared-disk architecture is where all data is stored
on a single disk which is accessible by all compute nodes – as in a traditional
on-premises data warehouse. Shared-nothing architecture is where data is
distributed across multiple nodes – as in Azure SQL Data Warehouse.