Snowflake – Part 2: Provisioning and Potential Red Flags

It's been a while, just over 500 days, since I published my first Snowflake blog, but things got in the way. Here, at last, is the promised second part. Fingers crossed it won't be another 500 days until the next one.

Now that you have decided on Snowflake as your data warehouse of choice, how do you get one going? And how do you load data into it? In this blog, the second in a series on Snowflake, I will show you how to create your database and load data into it. I will also discuss a couple of the red flags that struck me while looking into Snowflake.

Provisioning and loading a Snowflake database

For this example, I've chosen an open dataset of 2017 taxi rides in New York City. There are a few options for interacting with Snowflake: a dialog-box approach in the web-based GUI, SQL statements in the Worksheets tab of the GUI, or a CLI called SnowSQL. For this example I used SQL statements, as I find it easier to follow what's happening that way. Once you have set up your account (or trial) and logged in, you need to create your first database.

Go to the Worksheets tab and click the + to create a new worksheet (if one isn't already open). To create your database, use the following:

CREATE DATABASE TAXIRIDE;
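Worksheet statements run against a database and a virtual warehouse, so it is worth setting that context before going any further. A minimal sketch, assuming the default trial warehouse name COMPUTE_WH (yours may be named differently):

USE DATABASE TAXIRIDE;
USE WAREHOUSE COMPUTE_WH; -- assumed warehouse name, substitute your own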


Now that you have a database, you can create your first table. In the worksheet, use the following:

CREATE TABLE "TAXIRIDE"."PUBLIC"."TAXIRIDES" (
    "ID" VARCHAR(50) NULL,
    "VENDORID" VARCHAR(50) NULL,
    "TPEP_PICKUP_DATETIME" VARCHAR(50) NULL,
    "TPEP_DROPOFF_DATETIME" VARCHAR(50) NULL,
    "PASSENGER_COUNT" VARCHAR(50) NULL,
    "TRIP_DISTANCE" VARCHAR(50) NULL,
    "RATECODEID" VARCHAR(50) NULL,
    "STORE_AND_FWD_FLAG" VARCHAR(50) NULL,
    "PULOCATIONID" VARCHAR(50) NULL,
    "DOLOCATIONID" VARCHAR(50) NULL,
    "PAYMENT_TYPE" VARCHAR(50) NULL,
    "FARE_AMOUNT" FLOAT NULL,
    "EXTRA" FLOAT NULL,
    "MTA_TAX" FLOAT NULL,
    "TIP_AMOUNT" FLOAT NULL,
    "TOLLS_AMOUNT" FLOAT NULL,
    "IMPROVEMENT_SURCHARGE" FLOAT NULL,
    "TOTAL_AMOUNT" FLOAT NULL
);


To load data into your table, you need to create a File Format and a Stage. The File Format is the definition of the file (or files) used in bulk loading. To create it:

CREATE FILE FORMAT "TAXIRIDE"."PUBLIC".TAXI_SOURCE
    FIELD_DELIMITER = ','
    RECORD_DELIMITER = '\n'
    SKIP_HEADER = 0;
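If your source CSVs include a header row or wrap values in quotes, the File Format can account for that too. A hedged variation of the same statement (whether your files actually need these options is an assumption about the export, not something the dataset guarantees):

CREATE OR REPLACE FILE FORMAT "TAXIRIDE"."PUBLIC".TAXI_SOURCE
    TYPE = 'CSV'                          -- CSV is the default type anyway
    FIELD_DELIMITER = ','
    RECORD_DELIMITER = '\n'
    SKIP_HEADER = 1                       -- skip a single header row
    FIELD_OPTIONALLY_ENCLOSED_BY = '"';   -- strip quotes around field values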


The Stage is another metadata object; it defines the connection between Snowflake and the location where your files are stored. The steps to set this up are a blog in themselves, but in summary you need the following things:

  • Upload your file(s) to a blob store or data lake – Snowflake has a file limit of 100 MB, so you may need to do some preprocessing to split larger files up
  • Create a shared access signature (SAS) token on your Azure store
  • Create a user-defined encryption key on your Azure store – data in Azure is stored encrypted by default, so Snowflake needs a key to decrypt it. Also, by default, Azure uses Microsoft-issued encryption keys which you cannot access, so you need to set up user-defined keys. This series explains this in much more detail: https://docs.microsoft.com/en-us/azure/storage/common/storage-service-encryption?toc=%2fazure%2fstorage%2fblobs%2ftoc.json.


With the above in place, you can set up your Stage in Snowflake with the following SQL:

CREATE OR REPLACE STAGE my_azure_stage
url = '<url to your Azure store>'
credentials=(azure_sas_token='<your SAS token>')
encryption=(type='AZURE_CSE' master_key='<your encryption key>')
file_format = TAXI_SOURCE;
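Before running the load, it is worth checking that Snowflake can actually see the files. LIST is a standard Snowflake command and uses the stage name created above:

LIST @my_azure_stage;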


With the table and connection metadata objects created, we can load the data into the table:

COPY INTO TAXIRIDES
FROM @my_azure_stage;
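By default the load aborts when it hits a malformed record. If you expect a few messy rows, COPY INTO accepts an ON_ERROR option; 'CONTINUE' below is just one choice, and assumes you are happy to skip bad rows rather than fail the whole load:

COPY INTO TAXIRIDES
FROM @my_azure_stage
ON_ERROR = 'CONTINUE';   -- skip rows that fail to parse and keep loading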


This loads our data from Azure storage into Snowflake. When the load has finished, you can check it by running a simple SELECT statement:

SELECT COUNT(*) FROM TAXIRIDES;
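If the count looks off, Snowflake records the outcome of recent loads and you can query it. A sketch using the COPY_HISTORY table function in INFORMATION_SCHEMA (the one-hour window is an arbitrary choice):

SELECT file_name, status, row_count, first_error_message
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
    TABLE_NAME => 'TAXIRIDES',
    START_TIME => DATEADD(hour, -1, CURRENT_TIMESTAMP())
));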


This brief overview of how to set up and load data into Snowflake shows how easy it is to get an instance up and running and to get data into it. In my last blog I mentioned a couple of potential red flags that I see with Snowflake: cost and a lack of fine-tuning controls. I will discuss these in more detail now.


Cost

Snowflake sells itself as a more cost-effective solution for data warehousing, based on the lack of upfront capital costs and the ability to scale compute up or down as needed. For example, if you are doing a heavy daily load overnight you can scale up a virtual warehouse for the load and pause it when it is not needed; your analysts are not going to be querying overnight, so their compute can be paused outside business hours. This makes sense, but in practice Snowflake's presented costs are slightly opaque, especially around compute.
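Both the scale up and the pause are single statements. A minimal sketch, assuming a warehouse named ETL_WH (the name and sizes are placeholders):

-- Resize for the heavy overnight load
ALTER WAREHOUSE ETL_WH SET WAREHOUSE_SIZE = 'XLARGE';
-- ...run the load...
-- Pause the compute once the load has finished
ALTER WAREHOUSE ETL_WH SUSPEND;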

Storage is costed at $50 per terabyte per month and the Services layer has no additional cost. Compute is costed using Snowflake Credits. These vary in cost depending on the tier of service you choose (https://www.snowflake.com/pricing/), and how quickly they are used depends on the size of the virtual warehouse. An indicative cost from Snowflake is that 8 analysts using a medium (M) sized virtual warehouse will use 8 credits per day. A credit at the Enterprise tier (which offers 24×7 support) is $3.70, so that is roughly $30 per day, or just under $8,000 over a year of working days. Pretty cheap – but one of the advantages of Snowflake is the ability to have multiple virtual warehouses performing operations on a database at the same time, and if you scale up the sizes across several of them it is easy to see costs ballooning without a watchful eye. One mitigating feature is the Auto Suspend timeout, which can be set as low as 5 minutes to cap the time warehouses sit idle.
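The Auto Suspend timeout, and a broader credit cap via a Resource Monitor, can both be set in SQL. A sketch, assuming a warehouse named ANALYST_WH and a 100-credit monthly quota (both values are placeholders):

-- Suspend after 5 minutes (300 seconds) of inactivity, resume on the next query
ALTER WAREHOUSE ANALYST_WH SET AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;

-- Cap monthly credit usage and suspend the warehouse if the cap is reached
CREATE RESOURCE MONITOR monthly_cap WITH CREDIT_QUOTA = 100
    TRIGGERS ON 100 PERCENT DO SUSPEND;
ALTER WAREHOUSE ANALYST_WH SET RESOURCE_MONITOR = monthly_cap;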


Lack of fine-tuning controls

Another of Snowflake's selling points is its ease of setup and use. As we've seen above, setting up a database and loading data is straightforward and quick. The services layer provisions resources automatically and removes the need to fine-tune the database for optimal performance. This is a welcome step forward and removes barriers to quicker adoption in many cases, but it is not a panacea for all ailments.

Snowflake's approach to performance issues seems to be a "scale up" one. Queries running too slowly? You can scale up your virtual warehouse instantly. Need to ingest more data, quicker? Scale up. However, this ignores some of the key causes of slow queries or ingestion: perhaps your indexing is causing inefficient querying, or the way data is distributed across nodes means it has to be moved around to complete queries. By throwing more compute at the problem you get to a solution, but potentially at a higher cost and energy usage.

Perhaps a hybrid approach is best: by combining a system with a greater level of fine-tuned control, and the skilled workers to use it (whether internal or brought in through a consultancy partner), with a platform like Snowflake, you can keep existing implementations running in tandem with the new one, getting the best of both worlds while keeping costs low.