Background
Until recently, I was one of the SQL Server developers with the bad habit of using the NOT IN clause. It is an easy way of finding data in one table that does not exist in another. I thought NOT IN would help me conceptualise a query result and make the code easier for someone else to read. In fact, although the performance (judged by the execution plan) is OK, the overall query can return incorrect results.
The Problem
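The NOT IN clause is problematic in only one, but VERY IMPORTANT, way: it does not cope with NULLs in the comparison table. If the list being compared against contains even a single NULL, the comparison can never evaluate to true for a non-matching row, so that row is silently dropped. Please see the example below: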
Create two tables for NOT IN example:
Query results for both tables:
NOT IN query:
As you can see, 0 records were returned. We would expect the record (containing Striker, Andy Cole) in the NewFootyPlayers table to be returned. Because the comparison table contains a NULL, the NOT IN comparison evaluates to unknown rather than true, so the row is filtered out.
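The behaviour can be reproduced with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for SQL Server, on the assumption that both engines follow the standard SQL rules for NULL comparisons in NOT IN. The comparison table name FootyPlayers, the column PlayerName and every row other than Striker, Andy Cole are hypothetical, made up purely for illustration.

```python
import sqlite3

# SQLite is a stand-in for SQL Server; the NULL handling of NOT IN is standard SQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- FootyPlayers, PlayerName and the 'Player A/B' rows are hypothetical.
    CREATE TABLE FootyPlayers    (Position TEXT, PlayerName TEXT);
    CREATE TABLE NewFootyPlayers (Position TEXT, PlayerName TEXT);
    INSERT INTO FootyPlayers    VALUES ('Defender', 'Player A'), (NULL, 'Player B');
    INSERT INTO NewFootyPlayers VALUES ('Defender', 'Player A'), ('Striker', 'Andy Cole');
""")

# 'Striker' NOT IN ('Defender', NULL) expands to
# 'Striker' <> 'Defender' AND 'Striker' <> NULL, which is unknown, not true.
not_in_rows = conn.execute("""
    SELECT Position, PlayerName
    FROM NewFootyPlayers
    WHERE Position NOT IN (SELECT Position FROM FootyPlayers)
""").fetchall()

print(not_in_rows)  # [] : the Striker row is missing
```

Removing the NULL row from the comparison table makes the same query return the Striker record, which is what makes this bug so easy to miss during testing.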
NOTE
Adding an additional ‘WHERE Position IS NOT NULL’ filter to the subquery inside the NOT IN clause would return the expected record, but a lot of people will forget to add it and spend a substantial amount of time wondering why certain records are missing from their result set.
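A sketch of that filter, again using sqlite3 as a stand-in for SQL Server and the same hypothetical FootyPlayers table and sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample data; only 'Striker', 'Andy Cole' comes from the article.
    CREATE TABLE FootyPlayers    (Position TEXT, PlayerName TEXT);
    CREATE TABLE NewFootyPlayers (Position TEXT, PlayerName TEXT);
    INSERT INTO FootyPlayers    VALUES ('Defender', 'Player A'), (NULL, 'Player B');
    INSERT INTO NewFootyPlayers VALUES ('Defender', 'Player A'), ('Striker', 'Andy Cole');
""")

# Excluding NULLs inside the subquery leaves only real values to compare against.
filtered_rows = conn.execute("""
    SELECT Position, PlayerName
    FROM NewFootyPlayers
    WHERE Position NOT IN (SELECT Position
                           FROM FootyPlayers
                           WHERE Position IS NOT NULL)
""").fetchall()

print(filtered_rows)  # [('Striker', 'Andy Cole')]
```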
The Solution(s)
There are a number of clauses or SQL syntax that can be used instead of NOT IN. Although most do not have any major performance benefits, they actually return what is expected. The three examples below all return the one expected record:
All three return the below result, which we expected in the first place:
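As a rough sketch, three common alternatives are NOT EXISTS, a LEFT JOIN with an IS NULL check, and EXCEPT. The article names NOT EXISTS and EXCEPT; the LEFT JOIN form is an assumption about the third option. The code runs against sqlite3 as a stand-in for SQL Server, with the same hypothetical FootyPlayers table and sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample data; only 'Striker', 'Andy Cole' comes from the article.
    CREATE TABLE FootyPlayers    (Position TEXT, PlayerName TEXT);
    CREATE TABLE NewFootyPlayers (Position TEXT, PlayerName TEXT);
    INSERT INTO FootyPlayers    VALUES ('Defender', 'Player A'), (NULL, 'Player B');
    INSERT INTO NewFootyPlayers VALUES ('Defender', 'Player A'), ('Striker', 'Andy Cole');
""")

queries = {
    # Correlated subquery: a NULL in FootyPlayers simply never matches.
    "NOT EXISTS": """
        SELECT n.Position, n.PlayerName
        FROM NewFootyPlayers AS n
        WHERE NOT EXISTS (SELECT 1 FROM FootyPlayers AS f
                          WHERE f.Position = n.Position)
    """,
    # Anti-join: keep rows from the left table that found no partner.
    "LEFT JOIN": """
        SELECT n.Position, n.PlayerName
        FROM NewFootyPlayers AS n
        LEFT JOIN FootyPlayers AS f ON f.Position = n.Position
        WHERE f.Position IS NULL
    """,
    # EXCEPT compares entire rows, so both columns take part here.
    "EXCEPT": """
        SELECT Position, PlayerName FROM NewFootyPlayers
        EXCEPT
        SELECT Position, PlayerName FROM FootyPlayers
    """,
}

results = {name: conn.execute(sql).fetchall() for name, sql in queries.items()}
for name, rows in results.items():
    print(name, rows)  # each prints [('Striker', 'Andy Cole')]
```

Note that EXCEPT works on whole rows rather than a single column, so it only matches the other two queries when the selected columns are the ones you actually want to compare.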
Recommended Solution
Whilst none of the solutions above causes major performance problems, there is one method that is better than the others. If we are working with hundreds of millions of records in both tables, NOT EXISTS is the most efficient query. Its performance is similar to NOT IN and EXCEPT, and it produces an identical plan, but it is not prone to the potential issues caused by NULLs or duplicates.
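One reading of the remark about duplicates is that EXCEPT returns distinct rows, whereas NOT EXISTS keeps every qualifying row. The sketch below illustrates that difference with sqlite3 standing in for SQL Server; the FootyPlayers table and the 'Player A/B/C' rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample data with two strikers in the new table.
    CREATE TABLE FootyPlayers    (Position TEXT, PlayerName TEXT);
    CREATE TABLE NewFootyPlayers (Position TEXT, PlayerName TEXT);
    INSERT INTO FootyPlayers    VALUES ('Defender', 'Player A'), (NULL, 'Player B');
    INSERT INTO NewFootyPlayers VALUES ('Striker', 'Andy Cole'), ('Striker', 'Player C');
""")

# EXCEPT is a set operator: duplicate 'Striker' values collapse into one row.
except_rows = conn.execute("""
    SELECT Position FROM NewFootyPlayers
    EXCEPT
    SELECT Position FROM FootyPlayers
""").fetchall()

# NOT EXISTS filters row by row, so both striker rows survive.
not_exists_rows = conn.execute("""
    SELECT n.Position
    FROM NewFootyPlayers AS n
    WHERE NOT EXISTS (SELECT 1 FROM FootyPlayers AS f
                      WHERE f.Position = n.Position)
""").fetchall()

print(except_rows)      # [('Striker',)]
print(not_exists_rows)  # [('Striker',), ('Striker',)]
```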
I would be interested to see if anyone else has performance tested each query type and if there are better alternatives to NOT EXISTS. One thing I am certain of, however, is that no one should have to use the NOT IN clause.