Data Engineering

Loading Data in Snowflake

Posted on 16th May 20197th October 2019 by Dan Bilsborough

16
May

In this blog I will provide an overview of the process for loading data into Snowflake. It will explain what staging the data means in the context of Snowflake, before illustrating the process of how data can be loaded into the Snowflake database from an Azure container. The goal of this blog is to help you understand from a conceptual point of view how data is loaded into Snowflake.

Staging

Before being loaded into a Snowflake table, the data can be optionally staged, which is essentially just a pointer to a location where the files are stored. There are different types of stages including:

– User stages, which each user will have by default

– Table stages, which each table will have by default

– Internal named stages, meaning staged within Snowflake

Internal named stages are the best option for regular data loads, if you are thinking along the lines of your standard daily ETL process. One benefit of these is the flexibility in that they are database objects, so you can grant privileges to roles to access these objects as you would expect. Alternatively, there are external stages, such as Azure Blob storage.

Let’s have a high level look at how you would load data into Snowflake.

1. Source data is stored in files in an Azure container (Blob Storage, Azure Data Lake Store Gen1/Gen2).

2. Optionally, data can be staged, either internally within a Snowflake table, or externally, in an Azure container. If you choose to stage the data internally, you will need to use the PUT command to load the data into stage if it is stored on a local machine. This cannot be done from the Web UI for Snowflake, it needs to be done using SnowSQL, the CLI client for Snowflake. If you want to stage the data externally, such as in Blob Storage for example, you cannot use the PUT command, and may need to manually upload the data, use PowerShell, or Azure Data Factory for example, to load the data into the desired external stage location.

3. Use the ‘COPY INTO’ command (shown below) while a virtual warehouse is running with reference to a stage to load the data, or if the data is not staged, it can be loaded directly from the location the data is stored in, such as referencing a SAS token to load straight from Blob Storage.

Snowpipe

However, there is a second way that you can load data into Snowflake, which is called Snowpipe, allowing data to be continuously loaded in Snowflake once a file becomes available in stage. Essentially, you can call a public REST endpoint with a list of filenames and a pipe name, and if any of these files are in stage, they are added to the queue to be loaded. Discussing Snowpipe in-depth is a whole other blog in itself, but it’s worth noting that it is an option if you need data to be loaded into Snowflake more often than your daily ETL can provide.

Additional Features

Hopefully that should give a very brief overview of the concept behind loading data into Snowflake, however, there are a couple of features that are worth mentioning here.

– Firstly, Snowflake can load semi-structured data, think JSON or Parquet for example, which get loaded into a single VARIANT column, or you can transform the data into separate columns using the COPY INTO command.

– Secondly, if using an internal stage, unencrypted data will automatically be encrypted. Encrypted data can also be stored in an internal stage, providing the key used to encrypt the data is supplied to Snowflake.

Thanks for reading!

Dan Bilsborough

Introduction to Data Wrangler in Microsoft Fabric

What is Data Wrangler? A key selling point of Microsoft Fabric is the Data Science

25
Jul

Autogen Power BI Model in Tabular Editor

In the realm of business intelligence, Power BI has emerged as a powerful tool for

12
Jul

Microsoft Healthcare Accelerator for Fabric

Microsoft released the Healthcare Data Solutions in Microsoft Fabric in Q1 2024. It was introduced

09
Jul

Unlock the Power of Colour: Make Your Power BI Reports Pop

Colour is a powerful visual tool that can enhance the appeal and readability of your

09
Jul

Python vs. PySpark: Navigating Data Analytics in Databricks – Part 2

Part 2: Exploring Advanced Functionalities in Databricks Welcome back to our Databricks journey! In this

20
May

GPT-4 with Vision vs Custom Vision in Anomaly Detection

Businesses today are generating data at an unprecedented rate. Automated processing of data is essential

08
May

Exploring DALL·E Capabilities

What is DALL·E? DALL·E is text-to-image generation system developed by OpenAI using deep learning methodologies.

03
May

Using Copilot Studio to Develop a HR Policy Bot

The next addition to Microsoft’s generative AI and large language model tools is Microsoft Copilot

29
Apr

Data Engineering

Loading Data in Snowflake

Staging

Snowpipe

Additional Features

Dan Bilsborough

Introduction to Data Wrangler in Microsoft Fabric

Autogen Power BI Model in Tabular Editor

Microsoft Healthcare Accelerator for Fabric

Unlock the Power of Colour: Make Your Power BI Reports Pop

Python vs. PySpark: Navigating Data Analytics in Databricks – Part 2

GPT-4 with Vision vs Custom Vision in Anomaly Detection

Exploring DALL·E Capabilities

Using Copilot Studio to Develop a HR Policy Bot

London

Aberdeen

Surrey

Bulgaria

India

Glasgow

Manchester

Get in touch

Our Social Channels

Staging

Snowpipe

Additional Features

Related Posts

Dan Bilsborough

Introduction to Data Wrangler in Microsoft Fabric

Autogen Power BI Model in Tabular Editor

Microsoft Healthcare Accelerator for Fabric

Unlock the Power of Colour: Make Your Power BI Reports Pop

Python vs. PySpark: Navigating Data Analytics in Databricks – Part 2

GPT-4 with Vision vs Custom Vision in Anomaly Detection

Exploring DALL·E Capabilities

Using Copilot Studio to Develop a HR Policy Bot