Modern Data Warehouse

Azure SQLDW-Polybase Limitations

Posted on 27th June 20172nd April 2020 by Simon Whiteley

27
Jun

If it wasn’t clear from my other blogs, I LOVE polybase. It’s a super powerful tool that seamlessly integrates flat-file parallelism with relational querying. However – there are some ways that the implementation really has to improve – These are true as of June 2017 – given how fast things change, I’m hoping they won’t be limitations for too much longer…

Push-down predicates

This one is a biggie – if you’re querying over a whole range of flat files that are organised into [YEAR]/[MONTH] folders, for example, you should be able to write a query like the following:

SELECT * FROM dbo.MyExternalTable WHERE [YEAR] > 2016

This filter would be pushed down to the polybase engine and tell it to ignore any files that have been vertically partitioned outside of our chosen range. Instead, all files are read and returned to the SQL Server engine and the filtering is done in-memory on the returned dataset. This is obviously hugely inefficient in some cases – especially when you’re using Polybase as a data loading mechanism. This feature is available in HIVE tables and you can do it in U-SQL – hopefully it’s only a matter of time before a similar feature is implemented here.

File Metadata

As a BI Developer, I love metadata – I want to know where a record came from, which flat file it was carried into the Lake aboard. The ability to query from thousands of files at once recursively is insanely useful – it gives you the ability to keep delta files as an immutable raw layer that you can query as a single entity. However, there is currently no ability to include the “filename” within the columns returned by the external table.

Again, this is something that you can do in equivalent technologies and seems mad that it’s not included by default. When you look at the DMVs underneath the external readers, you can see, in detail, the file that each thread is currently reading, it’s very aware of the file that it’s pulling data from.

Just query the sys.dm_exec_external_work table whilst running a query over a large set of files in an external table and you’ll see what I mean.

Rejection Handling

This one is even more frustrating – Polybase has a really useful feature where you can tell it to reject rows that fail to be parsed into the data types you’ve specified.

The most common usage for this is around file headers, discussed in the next limitation. However – the reject functionality has two flaws:

Firstly – you can’t see the rows that failed. Such a simple thing, but if you’re using Polybase as a data loading tool for an analytics warehouse, you can’t accept that some rows simply won’t make it through and you’ll never know which rows failed.

This is a tricky limitation however – I have no idea how they would model this effectively within SQLDW. If we were in the standard SQL engine, I’d like to see External Tables to have an option to automatically have a shadow “External Error Rows” table created. Any row that fails parsing gets dumped into that table as a JSON record with a query date and object_id.

I’ve actually raised the idea on connect here: http://bit.ly/2rWT35D

There is also the issue of speed – I’ve seen real performance problems when having rejection percentages turned on, the continuous checking against tolerance seems to be a big drain on execution.

Header Row Skipping

Related to the above issue is Header Row skipping. Such a common thing to do, available in U-SQL, HIVE and ETL tools going back to the dawn of time… but Polybase cannot currently skip header rows. Preferably this would give you the option of skipping x rows of the file for situations where there is a metadata/header section at the top of your file.

Currently, you either have to accept your fate and perform some pre-processing on files to strip out and header data, maybe by using U-SQL or accept the performance overhead of having a reject rows tolerance set within your external tables.

Conclusions

None of the above are deal-breakers for me – the sheer speed and opportunities opened up by being able to query over a huge flat-file store, whilst still within your relational engine more than makes up for the limitations. However many of the limitations don’t exist within similar tools – this at least gives me hope that they can be fixed and we should be seeing improvements over time.

Unfortunately, for the time being the focus seems to be on increasing Polybase compatibility – adding further data stores and additional security protocols rather than improving the functionality of the tool itself.

In the next post, I want to share some design patterns about Polybase use cases – dynamically switching external locations, wider architectures and more.

Simon Whiteley

Introduction to Data Wrangler in Microsoft Fabric

What is Data Wrangler? A key selling point of Microsoft Fabric is the Data Science

25
Jul

Autogen Power BI Model in Tabular Editor

In the realm of business intelligence, Power BI has emerged as a powerful tool for

12
Jul

Microsoft Healthcare Accelerator for Fabric

Microsoft released the Healthcare Data Solutions in Microsoft Fabric in Q1 2024. It was introduced

09
Jul

Unlock the Power of Colour: Make Your Power BI Reports Pop

Colour is a powerful visual tool that can enhance the appeal and readability of your

09
Jul

Python vs. PySpark: Navigating Data Analytics in Databricks – Part 2

Part 2: Exploring Advanced Functionalities in Databricks Welcome back to our Databricks journey! In this

20
May

GPT-4 with Vision vs Custom Vision in Anomaly Detection

Businesses today are generating data at an unprecedented rate. Automated processing of data is essential

08
May

Exploring DALL·E Capabilities

What is DALL·E? DALL·E is text-to-image generation system developed by OpenAI using deep learning methodologies.

03
May

Using Copilot Studio to Develop a HR Policy Bot

The next addition to Microsoft’s generative AI and large language model tools is Microsoft Copilot

29
Apr