Databricks Vs Synapse Spark Pools – What, When and Where?

Databricks or Synapse seems to be the question on everyone’s lips. Whether it’s people asking why you are using Databricks instead of Synapse Spark Pools, or people asking what they should use and when, it’s definitely a hot topic right now. Given that Microsoft is developing, investing in, and recommending Synapse, it is a reasonable question for everyone to be asking.

As we know, generating insights from big data is a challenge for every organisation, since the data is collected from various sources and is mostly unstructured. Both Microsoft and Databricks provide scalable analytics platforms, Synapse and the Databricks Workspace respectively, that combine enterprise data warehousing, ETL pipelines, and machine learning workflows.

Synapse provides an end-to-end analytics solution by blending big data analytics, data lake, data warehousing, and data integration into a single unified platform. It can query relational and non-relational data at petabyte scale. The Synapse architecture consists of four components: Synapse SQL, Spark, Synapse Pipelines, and Synapse Studio. Whilst Synapse SQL runs SQL queries, Apache Spark executes batch and stream processing on big data. Synapse Pipelines provide ETL and data integration capabilities, and Synapse Studio provides a secure, collaborative, cloud-based analytics workspace that brings AI, ML, IoT, and BI together in a single place.

Synapse also offers T-SQL based analytics, comprising dedicated and serverless SQL pools for analytics and data storage. The dedicated SQL pool provides the necessary infrastructure for implementing large-scale data warehouses, whilst the serverless model enables ad-hoc querying of the data lake and provisioning of logical data warehouses on a pay-per-query arrangement.

Databricks provides a zero-management cloud platform built around Spark clusters, with an interactive workspace. It enables Data Analysts, Data Scientists, and Developers to extract value from big data efficiently. It also integrates with third-party applications such as BI and domain-specific tools for generating insights. Large enterprises use the platform for a broad spectrum of work, including ETL, data warehousing, and dashboarding.

Databricks has a ‘lakehouse’ architecture that combines data lake and data warehouse elements to provide low-cost data management. This architecture supports atomicity, consistency, isolation, and durability (ACID) transactions, robust data governance, storage decoupled from compute, and end-to-end streaming.

The two products differ in a few areas. Both are powered by Apache Spark, but not in the same way. Synapse runs the open-source version of Spark with built-in support for .NET, whereas Databricks runs an optimised version of Spark which offers increased performance. Databricks also allows users to select GPU-enabled clusters, which will process data faster and support higher data concurrency.

Synapse integrates analytical services to bring enterprise data warehousing and big data analytics into a single platform, whereas Databricks not only does big data analytics but also allows users to build complex ML products.

Databricks provides a set of notebook utilities known as DBUtils (Databricks Utilities), and Microsoft has invested time in Synapse to bring out an equivalent known as MSSparkUtils.
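To make the equivalence a little more concrete, here is a minimal sketch of a helper that returns whichever file-system utility the current notebook runtime offers. It assumes that `mssparkutils` can be imported from the `notebookutils` package in a Synapse notebook and that `dbutils` exists as a notebook global in Databricks. The name `get_fs_utils` is hypothetical, and the function simply returns `None` when run outside both environments.

```python
def get_fs_utils():
    """Return the notebook file-system helper for the current runtime, or None."""
    try:
        # Assumption: in Synapse Spark Pools, MSSparkUtils is importable like this.
        from notebookutils import mssparkutils
        return mssparkutils.fs
    except ImportError:
        pass
    try:
        # Assumption: Databricks notebooks expose DBUtils as the global `dbutils`.
        return dbutils.fs  # noqa: F821
    except NameError:
        return None


fs = get_fs_utils()
if fs is None:
    print("Not running in a Synapse or Databricks notebook")
else:
    # Both utilities expose fs.ls(<path>) for listing files in storage.
    print("File-system utilities available")
```

Wrapping the lookup like this is one way to keep notebook code portable between the two services, although the two utilities are not identical beyond the common file-system calls.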

When to use Databricks and When to use Synapse

Given that there are so many new features in Synapse now, and so much similar functionality between the two, it raises the question of when to use which.

  • Both can access data in the Data Lake; however, in Databricks you need to mount the Data Lake first, whereas this is not needed in Synapse.
  • They both use Spark, but Synapse runs the open-source version and tends to be on a different version than Databricks, whereas Databricks has a data processing engine built on an optimised version of Spark, offering higher performance.
  • Notebooks are used in both, the main difference being that Databricks allows co-authoring in real time, whereas Synapse requires the notebook to be saved before the other person can see the changes.
  • Synapse has a traditional SQL engine and will feel familiar to the traditional BI developer, but it also has a Spark engine which will suit data scientists and analysts. It is both a data warehouse and an interface tool. Databricks, on the other hand, is not a data warehouse tool but a Spark-based notebook tool with a focus on Spark.
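The first point above comes down to how a notebook addresses the same folder in the lake. The sketch below builds both styles of path, assuming the lake is Azure Data Lake Storage Gen2. The account, container, and folder names are hypothetical, and `/mnt/<name>` is only the conventional location for a Databricks mount point.

```python
def abfss_uri(account: str, container: str, path: str = "") -> str:
    """Build the abfss:// URI that addresses a path in ADLS Gen2 directly."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def mounted_path(mount_name: str, path: str = "") -> str:
    """Build the path the same data would have under a Databricks mount point."""
    return f"/mnt/{mount_name}/{path.lstrip('/')}"


# Hypothetical storage account, container, and folder names.
print(abfss_uri("mydatalake", "raw", "sales/2021/"))
# abfss://raw@mydatalake.dfs.core.windows.net/sales/2021/
print(mounted_path("raw", "sales/2021/"))
# /mnt/raw/sales/2021/
```

Either string can then be passed to a Spark reader, provided the workspace has been granted access to the storage account.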

There are different cases for using each, depending on specific needs and requirements. Synapse and Databricks are similar, but each has its own specialities, or rather areas where it is ahead of the other.

Data Lake – They both allow you to query data in the data lake. Synapse uses either the serverless SQL pool (formerly SQL on-demand) or Spark, and Databricks uses the Databricks workspace once you have mounted the data lake. If you are predominantly a SQL user and prefer the BI developer feel, then Synapse would be the correct choice, whereas if you are a Data Scientist and prefer to code in Python or R, then Databricks would feel more like home.

Data Warehousing and SQL Analytics – If your requirements are around data warehousing and SQL analytics, this is possible with Databricks, but note that it does not have the same breadth of SQL and data warehouse capabilities and does not provide a full T-SQL experience. You would be better off using Synapse, which provides the full set of SQL features, brings together the best SQL technologies, and allows a full data warehouse, including a full relational data model and stored procedures.

Machine Learning – Databricks has machine-learning-optimised runtimes which include some of the most popular libraries, such as PyTorch and Keras, and it provides a managed and hosted version of MLflow. Synapse, however, has support for Azure ML, and you can use open-source MLflow. Databricks goes broader in its ML features, so if you are going to use machine learning heavily, Databricks would be the favourable choice. Having said that, the capabilities in Synapse are more than sufficient for most companies’ uses of machine learning; it is only those anticipating machine learning on a big scale that would need Databricks.

Reporting and BI – Synapse would be the preferred choice for BI as you can use Power BI directly from the Synapse Studio.

Putting it into action – What we found out

Here at Adatis we are not all talk. We like to put things into action and back up our words with facts and with what we have discovered by trying things out. We decided to amend our current framework to replace Databricks with Synapse Spark Pools and then see what we could uncover.

We currently use Databricks to cleanse and run validation on the data as we take it from our raw layer into our base layer. We then use PolyBase to push the data down the line into Synapse. We noticed some areas for improvement that could be addressed by changing our framework to use Synapse instead of Databricks. These were as follows:

Data Duplication – Our current framework was built around having two base layers, one in the Data Lake and one in the Data Warehouse, which obviously uses more space but also raises more complex issues around things like GDPR.

Cost – Synapse dedicated pools can be expensive, especially if you need them to be available all the time.

We amended our framework to use Synapse Spark Pools instead of Databricks. This changed the flow of the data: we now take the data from the raw layer, use the Spark Pools to cleanse and run validation, and place the data into the base layer. The main difference, apart from using Synapse instead of Databricks, is that we now have only one base layer. We have also ensured this one base layer is in Delta instead of Parquet, which has other benefits, such as making incremental loads easy and allowing time travel to see the data as it was at a point in time.
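As an illustration of the time travel benefit, the sketch below reads a Delta table as it was at an earlier version or timestamp, using the `versionAsOf` and `timestampAsOf` reader options that Delta Lake provides. The helper name `read_delta_as_of` and the path in the usage comment are hypothetical, and `spark` is assumed to be the session that a Synapse or Databricks notebook already supplies.

```python
def read_delta_as_of(spark, path, version=None, timestamp=None):
    """Read a Delta table, optionally as of an earlier version or timestamp."""
    reader = spark.read.format("delta")
    if version is not None:
        # Pin the read to a specific commit version of the table.
        reader = reader.option("versionAsOf", version)
    elif timestamp is not None:
        # Pin the read to the table state at a timestamp, e.g. "2021-06-01".
        reader = reader.option("timestampAsOf", timestamp)
    return reader.load(path)


# Usage inside a notebook (hypothetical base-layer path):
# current = read_delta_as_of(spark, "/base/customers")
# original = read_delta_as_of(spark, "/base/customers", version=0)
```

Plain Parquet files keep no transaction log, so there is no equivalent way to ask for the state of the data at an earlier point.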

Other benefits which have been drivers for the change are:

  • Flexibility – Amending our framework has allowed us to use any type of Spark, including Databricks, meaning we are not locked in to any one service.
  • Event Driven – We have been able to make our framework fully event driven.
  • Cost Optimisation – Serverless is cheap, costing around £3.40 per 1 TB processed, and is charged per query you execute, so there are no more issues around keeping Synapse clusters online constantly; you pay per use. Compression also helps with cost optimisation, as compressed data is smaller to query and therefore costs less.
  • Simplicity – Amending the framework in the way we have means we have fewer components, and therefore it is simpler.
  • Agility – As the framework is simpler, it is easier to make small changes and push them out, working in a much more agile way.
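The pay-per-query arithmetic behind the cost point can be sketched as follows. The rate is the approximate figure quoted above; the decimal definition of a terabyte, the 2 TB scan size, and the 4:1 compression ratio are assumptions chosen purely for illustration, and real pricing varies by region and over time.

```python
RATE_GBP_PER_TB = 3.40    # approximate figure quoted in the text
BYTES_PER_TB = 10 ** 12   # assumption: decimal terabytes


def serverless_query_cost(bytes_processed, rate_per_tb=RATE_GBP_PER_TB):
    """Estimate the charge for a query from the volume of data it processes."""
    return bytes_processed / BYTES_PER_TB * rate_per_tb


uncompressed_scan = 2 * BYTES_PER_TB        # hypothetical 2 TB scan
compressed_scan = uncompressed_scan / 4     # hypothetical 4:1 compression

print(f"Uncompressed: GBP {serverless_query_cost(uncompressed_scan):.2f}")
# Uncompressed: GBP 6.80
print(f"Compressed:   GBP {serverless_query_cost(compressed_scan):.2f}")
# Compressed:   GBP 1.70
```

Because the charge scales with the data processed rather than with uptime, anything that shrinks what a query has to read lowers the bill directly.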

Summary

In conclusion, I feel the differences between the two are relatively small most of the time, and the choice depends on the requirements and needs of the company. Given that Microsoft appears to be putting a lot of manpower and resources into Synapse, I would say it is worth using, especially if you are already using parts of Synapse, as it will allow you to simplify the architecture. If you are looking for a very powerful machine learning tool and plan on putting it to use on a huge amount of data, then Databricks would be the best option. Other than that, my opinion is that Synapse is probably the right fit.
