The What and Why of Apache Spark on Azure HDInsight

Big data, i.e. data that ticks the 3 V’s boxes (volume, velocity and variety), has been around for a long time in the world of engineering, and certainly since the days of the exciting Human Genome Project, completed in April 2003. Back then I was a fairly new BI developer, and I almost switched careers into bioinformatics because I was so inspired by all that was being done in this area. If any of you remember, the summer of 2003 was a glorious one in England; I enjoyed it and decided to stick with Microsoft BI technologies, from which I have derived much pleasure. However, I have always kept an eye on what the bioinformaticians are doing. With data, whether it be big or not, there are always non-functional requirements one needs to address with regard to:

Availability – how soon can I get my data after it is born? In traditional warehousing this is usually daily, fitting the circadian rhythm of our working days. With data sets up to biggish in size, this was sometimes possible with traditional Microsoft BI technologies.
Accessibility – how soon can I get data that makes sense after it is born? Usually data needs to be cleaned, scrubbed and conformed before we unleash it onto the end user, so the daily ETL processes fitted quite nicely into this.
Interactivity – OK, so I have this report and I see something of interest. How easily and quickly can I dig into that little trough on the graph and see it in a different way so that I can understand it? Yes, this could be done, but it was usually met with frustration as one got mesmerised by a little spinning wheel, or just gave up and raised a ticket with the reporting team for an ‘ad hoc query’.

So, what does this have to do with Apache Spark on HDInsight?

Apache Spark on Azure HDInsight is the actual open source Apache Spark processing framework, so here we see Microsoft supporting open source. The Microsoft angle is that it is hosted in the cloud on Azure HDInsight. Spark is a fast, general purpose engine that supports in-memory operations. OK, so what does this mean, and how might it address some of the challenges we face delivering BI solutions on big data to users who are now demanding the answers right away? In the words of Freddie Mercury, “I want it all and I want it now”.

Microsoft also announced yesterday a “major commitment to Apache Spark”; the announcement is the first of the starter links at the end of this post.

Speed

It is fast. Spark is claimed to run up to 100 times faster in memory than Hadoop MapReduce, while still scaling out by processing data across multiple nodes in a cluster. This is because it uses a DAG (directed acyclic graph) execution engine built for in-memory parallel computing: the steps of a job are planned as a graph, and intermediate results can stay in memory rather than being written to disk between steps. For anyone from the ETL world who wants to visualise a DAG, it looks much like a data flow diagram, with transformations as boxes and the data passing between them as arrows that never loop back. Data can be persisted in memory or on disk. In HDInsight, the on-disk storage would be Azure Blob storage or Azure Data Lake Store. Unlike Hadoop MapReduce, which writes intermediate results to disk between steps, Spark can manipulate data in memory. Manipulating data could apply to ETL and/or reporting operations. The lines between ETL and extracting data for reporting and analysis are blurring. We see this in Power BI, where one can perform pretty powerful ETL operations and then visualise right away.
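To make the DAG and in-memory persistence idea a little more concrete, here is a toy sketch in plain Python. It is not Spark's API: LazyDataset and its methods are names I have made up for illustration. It only shows the general pattern of recording transformations as a graph, running them when a result is asked for, and optionally keeping an intermediate result in memory so it is not recomputed.

```python
from typing import Callable, List, Optional


class LazyDataset:
    """Hypothetical stand-in for a distributed data set: one node in a graph of pending work."""

    def __init__(self, compute: Callable[[], List[int]]):
        self._compute = compute                    # how to produce this node's rows
        self._cached: Optional[List[int]] = None   # in-memory copy, if persisted
        self._persist = False
        self.evaluations = 0                       # how many times the rows were really computed

    def map(self, fn: Callable[[int], int]) -> "LazyDataset":
        # Transformation: only records a new node that depends on this one.
        return LazyDataset(lambda: [fn(x) for x in self.collect()])

    def filter(self, pred: Callable[[int], bool]) -> "LazyDataset":
        # Transformation: again, nothing is computed yet.
        return LazyDataset(lambda: [x for x in self.collect() if pred(x)])

    def persist(self) -> "LazyDataset":
        # Ask for the result to be kept in memory after its first computation.
        self._persist = True
        return self

    def collect(self) -> List[int]:
        # Action: walks back up the graph and runs the recorded steps.
        if self._cached is not None:
            return self._cached
        self.evaluations += 1
        rows = self._compute()
        if self._persist:
            self._cached = rows
        return rows


source = LazyDataset(lambda: list(range(10)))
cleaned = source.filter(lambda x: x % 2 == 0).persist()  # shared intermediate step
doubled = cleaned.map(lambda x: x * 2)

total = sum(doubled.collect())   # first action: computes and caches `cleaned`
count = len(cleaned.collect())   # second action: served from memory
```

Both results depend on the same cleaned step, but because it was persisted it is only computed once. Real Spark does this across many machines and with far more care; this is just the shape of the idea.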

A BI Big Data one stop shop

Spark is a single platform that supports all of the following flavours of data manipulation operations on big data:

Batch processing
Real time and Stream Analytics
Machine Learning and Predictive Analytics
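One way to picture the "single platform" point for batch and streaming is that the same piece of logic can be applied either to the whole data set at once or to small chunks as they arrive (Spark Streaming works on micro-batches). The sketch below is plain Python, not Spark code, and the function names and sample events are invented for illustration.

```python
from collections import Counter
from typing import Iterable, List


def count_events(records: Iterable[str]) -> Counter:
    """Shared logic: count how often each event type occurs, ignoring case."""
    return Counter(r.strip().lower() for r in records)


def run_batch(all_records: List[str]) -> Counter:
    # Batch: the whole data set is available up front.
    return count_events(all_records)


def run_stream(micro_batches: Iterable[List[str]]) -> Counter:
    # Streaming: records arrive in small chunks, so keep a running total.
    running: Counter = Counter()
    for chunk in micro_batches:
        running.update(count_events(chunk))
    return running


# Made-up sample data.
events = ["click", "view", "Click", "purchase", "view", "click"]

batch_result = run_batch(events)
stream_result = run_stream([events[0:2], events[2:4], events[4:6]])
```

The business logic in count_events is written once, and the batch and streaming runs arrive at the same totals.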

It is general purpose, so it is not tied to a single language. Developers can write data manipulation jobs and queries in:

Java
Scala
Python
R – with R Server now also being hosted on Spark on HDInsight

Looking at this list of languages, we see the worlds of data scientists and traditional developers, Microsoft and open source, collide. Techniques used by those long-standing bioinformaticians can be applied to our corporate big data, in tools they are used to working with. This opens up all sorts of possibilities for recruiting and changes the landscape of the traditional BI team. And for us traditional ETL developers: while SSIS has an in-memory pipeline for data transformation and manipulation, the data does need to be materialised along the way, and/or processed into an in-memory cache in another reporting-specific technology, before the end user can access it. Spark, being a multi-purpose in-memory data manipulation platform, shortens the gap between data being born and it being available and accessible, in all the layers from 100% raw and organic to packaged and processed.

Ready to Go

Apache Spark being supported on Azure HDInsight means that it is in the cloud. It is offered “as a service”, with Microsoft managing the cluster software for you. The cost of utilising a Spark cluster includes the managed service costs. One does not need to get a physical server; you are just using someone else’s. With it being in the cloud, the storage is separated from the cluster, and you pay for each (cluster and storage) separately. The storage is cheap; the cluster is more expensive. The cluster does the computation, so it can be turned on and off when required, and you only pay for the up-time and size of your cluster in terms of the number of nodes. The cluster is scalable: you can increase the number of nodes depending on how much data you have to process and how quickly.
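The pay-separately model is easiest to see with a bit of arithmetic. The sketch below compares a cluster left running all month with one that is only up during working hours. Every number in it is made up purely for illustration; these are not Azure prices, and the function and parameter names are my own.

```python
def monthly_cost(nodes: int,
                 hours_up: float,
                 node_rate_per_hour: float,
                 stored_tb: float,
                 storage_rate_per_tb_month: float) -> float:
    """Compute and storage are billed separately, then added together."""
    compute = nodes * hours_up * node_rate_per_hour   # only while the cluster is up
    storage = stored_tb * storage_rate_per_tb_month   # charged whether or not it is up
    return compute + storage


# Hypothetical rates, NOT real Azure pricing.
NODE_RATE = 0.50       # currency units per node per hour
STORAGE_RATE = 20.0    # currency units per TB per month

# 4 nodes running 24 hours a day for 30 days, with 5 TB stored.
always_on = monthly_cost(4, 24 * 30, NODE_RATE, 5, STORAGE_RATE)

# The same cluster, up for 8 hours on each of 22 working days.
on_demand = monthly_cost(4, 8 * 22, NODE_RATE, 5, STORAGE_RATE)

print(f"always on: {always_on:.2f}, on demand: {on_demand:.2f}")
```

With these invented rates, the storage part of the bill is identical in both cases; only the cluster up-time changes the total.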

When you spin up a Spark cluster in Azure HDInsight, it comes pre-loaded with Jupyter and Apache Zeppelin notebooks. These are browser-based tools in which Python or Scala scripts can be created, and in which a power user can run queries and visualise data using Spark SQL.

Spark also integrates with all our favourite reporting and visualisation tools, e.g.

Power BI – also supporting Spark Streaming
Tableau
Qlik
SAP Lumira

So with drag and drop operations, the Spark SQL gets generated behind the scenes almost immediately, even on very large to big data sets.

In Conclusion

I have been keeping an eye on this one. My delight has been *sparked* by the appearance of R Server being hosted on Spark on HDInsight. I even heard a Microsoft engineer say he “was in love with this technology”. Next I will be looking at automating the process of getting data into the cloud-based in-memory structures on Spark.

If your curiosity has been piqued, here are some good starter links…

https://blogs.technet.microsoft.com/dataplatforminsider/2016/06/06/microsoft-announces-major-commitment-to-apache-spark/

https://azure.microsoft.com/en-us/services/hdinsight/apache-spark/

https://spark.apache.org/

And if you are really keen and want to understand more about DAG execution and the Dryad Microsoft Research project there is a very detailed paper here…

http://research.microsoft.com/en-us/projects/dryad/eurosys07.pdf
