SQL Server 2017 Graph data processing. An introduction.

 

On the 19th of April 2017 at the Microsoft Data AMP, Microsoft announced SQL Server 2017 and a few new advanced analytics features (slide above). You can watch the full AMP here https://www.microsoft.com/en-us/sql-server/data-amp. One of the announcements related to the new support of Graph objects. There have been rumblings over the past few years of Microsoft working on a Graph engine. Project Trinity appeared to be just that https://www.microsoft.com/en-us/research/project/trinity/. Trinity never really made much of an impact, possibly due to other vendors having an easier to use product. Hopefully Microsoft are changing this by introducing graph querying directly in to the engine for SQL Server 2017. In this blog post I will look at the upcoming SQL Graph the graph processing engine in SQL Server 2017.

What is a Graph Database?
Relational databases are fantastic at answering many types of queries. There are developers who can write a query to answer any question you can ask of your data. But sometimes a relational database is not always the best place to ask those questions.

The rise of NoSQL databases in the early 2000s is in part due to this. Developers felt they could make a better solution which was built to solve a particular problem or support a particular process, rather than hack a generalised system (relational database). One of those patterns was social graphs. In a social graph you want to analyse deep hieratical data and traverse up and down hierarchies. Imagine you are a developer at LinkedIn. You want to know who to suggest to a new member who they should connect with. A simple choice is to show them everyone who is connected to someone they already know – Think about the six degrees of separation to Kevin Bacon.

Imagine we have the following social graph. Simon is new to LinkedIn. He has recently signed up and has made connections with Emma and Sacha. We want to suggest connections to Simon based on the people Sacha and Emma know. In the diagram below you can see our social graph. We want to recommend the people in light blue. We can do this by moving one level up in our hierarch away from Simon and collecting the returned results.

image

Each of the elements on the diagram above has a name. The circles are referred to as Nodes. The lines connecting them are relationships or Edges. You can think of a node as an entity (a thing, a person, an account, a movie, an actor etc). The Edge is what connects them, it is what gives them context. A person knows a person, a movie is directed by a director. Node->Relationship->Node OR Node->Edge->Node. There can be multiple nodes and multiple edges. Simon might know Emma and work with Sacha (know and work are our edges/relationships). With this set up we can then ask questions such “Of the people who know Simon, who do they know?”. This would return our people coloured blue. A query that is relatively trivial in SQL at this level becomes incredibly complex when you add in “now show me who each of the blue nodes knows” and so forth, moving up and up the nodes.

This query pattern is perfect for creating a recommendation engine. Simon bought a book on Lambda Architecture, what did people who bought the same book also buy. That query would be very hard (yes possible, but complicated) to write in SQL. So graph databases were built to solve this problem. There are a few big players already in the Graph world, Neo4j being the big one. Neo4j has its own SQL like syntax for writing these matching processes call Cypher Query Language or CQL. CQL is very mature and we equipped to answer complex graph queries.

A quick intro to SQL Graph
SQL Graph is a similar concept to what is described above, but built in to the core SQL Server engine. This means 2 new table types NODE and EDGE and a few new TSQL functions in particular MATCH(). SQL Graph at the time of writing is only available in SQL 2017 ctp 2.0. You can read more and download ctp2.0 here https://blogs.technet.microsoft.com/dataplatforminsider/2017/04/19/sql-server-2017-community-technology-preview-2-0-now-available/. Once ctp 2.0 is installed there is nothing else you need to do to enable the new graph syntax and storage.

There is an example you can download from Microsoft which is a similar set up to the example in the image above. However I have used some real data shredded from IMDB the internet movie database. This data is available to download from Kaggle https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset

This data contains 5000 movies with the following information:

  1. The name of the movie
  2. The 3 main actors
  3. The director
  4. The genres

New syntax – https://docs.microsoft.com/en-us/sql/t-sql/statements/create-table-sql-graph

I have loaded this in to a SQL Server 2017 ctp 2.0 instance. Graph queries will not just work on your existing tables, you need to transform the data in to NODES and EDGES.

NODE – A representation of an entity or object

Create NODE tables DROP TABLE IF EXISTS Genre; CREATE TABLE Genre (GenreId INTEGER PRIMARY KEY, Genre VARCHAR(100)) AS NODE;

When data is inserted in to a NODE a unique $_nodeid is generated. This is a key value pair with metadata about that NODE.

EDGE – A representation of a relationship between two NODES

Create EDGE tables. DROP TABLE IF EXISTS ActedIn; CREATE TABLE ActedIn AS EDGE; Actor ActedIn Movie

Using the CREATE TABLE AS EDGE will build the required structure for you EDGE. When you create an EDGE, you are creating a relationship between two $_nodeid. That could be a movie and an actor. An EDGE needs to be inserted with two corresponding $_nodeid.

MATCH – How to query your graph objects

SELECT Movie.Movie FROM Actor Actor1, ActedIn, Movie WHERE MATCH (Actor1(ActedIn)->Movie) AND Actor1.Actor = 50 Cent

MATCH uses a pattern based query structure. In the example above this is NODE-(EDGE)->NODE. Executing this query will return all the films 50 Cent has acted in (more complicated examples are shown in the video).

In the following video I will take you through the process of creating a graph with SQL Server 2017.

 

Final thoughts
SQL Server 2017 ctp 2.0 is classed as production ready, so SQL Graph is ready to be used in a production environment. The level of functionality listed in the documentation is quite lacking when compared with competitors tools. Cypher for Neo4j has a fantastic pattern matching function which allows you to hop n relationships. This is not currently available nor is the ability to find any node connected to my current node. SQL Server Graph currently feels like a version 0.1. It is almost there, but not quite. I look forward to seeing how graph develops over the coming months and hope to see the missing functionality appear. You can read more about the limitations here https://blogs.technet.microsoft.com/dataplatforminsider/2017/04/20/graph-data-processing-with-sql-server-2017/

Links
A SQL Server Graph overview – https://docs.microsoft.com/en-us/sql/relational-databases/graphs/sql-graph-overview
Sample from Microsoft – https://docs.microsoft.com/en-us/sql/relational-databases/graphs/sql-graph-sample
Limitations – https://blogs.technet.microsoft.com/dataplatforminsider/2017/04/20/graph-data-processing-with-sql-server-2017/