Data Engineering for Graduates and New Starters

This blog post aims to give potential graduates and other new starters in the industry some background knowledge on the role of the Data Engineer and the technologies they use in their day-to-day work.

Over the past five years the world of data within the IT and computing industry has exploded. Many new job titles were coined, and many buzzwords were generated by the marketing teams of the big technology companies spearheading this data revolution. One of the most popular roles to come out of this data craze was the Data Scientist, which a Harvard Business Review article declared one of the best jobs of the 21st century. Companies all around the globe bought into this, and data science teams were set up to unlock new insights and develop advanced machine learning models to enable predictive capabilities within their organisations.

Though many data science projects were successful, they would often fall short in certain areas. There were many reasons for this, but two of the biggest were:

  1. Middle management didn’t always know what to do with their data science teams, and often gave them tasks that a data or business analyst might do in their day-to-day role.
  2. Data science teams did not have the necessary experience to deploy models to a live production environment. This is where the Data Engineer comes in.

The Data Engineer is another role that has emerged in this space over the last couple of years. I would argue that it is, in part, an evolution of the BI / ETL developer role crossed with the software engineering discipline, although the role may differ slightly depending on the organisation. In essence, Data Engineers build, maintain and scale an organisation’s data infrastructure, whether that be on premises or in the cloud. Typically, Data Engineers and Data Scientists work in tandem to achieve their goals. The Data Engineers source the data and build the software infrastructure so that the Data Scientists can deploy their models, applications and visualisations to a live environment.

Technologies

Technology in this industry moves very quickly. Web developers frequently find themselves working with new JavaScript frameworks, and the same is true for Data Engineers. New libraries and languages are created, and cloud vendors release new offerings that organisations want to buy into. This rapid pace of change means that you are constantly learning new things and creating innovative solutions. While it can be a challenging domain to master, it is also very rewarding!

The Data Engineering role covers many areas: ingesting source data from file shares and SFTP servers, developing data pipelines in the cloud, building and conforming Data Warehouses and Data Lakes, transforming large volumes of structured and unstructured data using open source platforms, and more. Below is a brief list of some of the technologies used here at Adatis within our engineering framework. Many of these services are managed and built on Azure, which students can sign up for and access with a specialised student account. Alternatively, there is a trial that gives users access to some of these services:

  • Azure Data Lake – A central repository to store both structured and unstructured data
  • Azure Data Factory – An ETL tool in the cloud used to build dynamic pipelines
  • Azure Databricks – Advanced distributed computing platform built upon Apache Spark
  • Azure Synapse – Storage of relational data, formerly Azure SQL Data Warehouse
  • Azure Functions – Serverless compute that allows for additional application functionality
  • Azure Event Hubs – Real-time data streaming, often used for IoT projects
  • Azure DevOps – Continuous Integration / Continuous Delivery and version control with Git
  • Azure Cosmos DB – NoSQL database with graph engine capabilities. Also supports multiple third-party APIs
  • Kubernetes – Container orchestration service for automating application deployments
  • Docker – Used to build and run containers, and often used in conjunction with Kubernetes

Below are two of the most important languages used in Data Engineering:

  • SQL – Used to manipulate data in relational databases and warehouses. Though we are seeing more use of NoSQL systems, SQL itself is still one of the most important languages used in this space
  • Python – Often used for data transformation with the Pandas library. PySpark, the Python API for Apache Spark, is often used on Spark clusters. Python has also become the go-to language for machine learning
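
To make the pairing of these two languages a little more concrete, here is a minimal sketch of a tiny extract, transform and load step. It uses only Python's standard library (csv and sqlite3) rather than Pandas or any Azure service, and the sample data, table name and function names are all made up for illustration, so treat it as one possible shape rather than how any particular framework does it.

```python
import csv
import io
import sqlite3

# Hypothetical raw extract, standing in for a file landed from a source system.
RAW_CSV = """order_id,customer,amount
1,alice,10.50
2,bob,
3,alice,4.25
"""


def extract(raw_text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Drop rows with a missing amount and convert the remaining values to proper types."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records
        cleaned.append((int(row["order_id"]), row["customer"], float(row["amount"])))
    return cleaned


def load(conn, rows):
    """Create the (hypothetical) orders table and insert the cleaned rows."""
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def total_per_customer(conn):
    """Use SQL to aggregate the loaded data."""
    cursor = conn.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
    )
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
load(conn, transform(extract(RAW_CSV)))
totals = total_per_customer(conn)
print(totals)
```

Python handles the parsing and cleaning here, while SQL handles the aggregation once the data is in a relational table. Real pipelines are far larger, but the division of labour is often similar.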

For any potential graduates looking to get into the realm of data engineering, I would recommend taking any modules that include:

Database Design and Theory – This includes learning relational modelling and SQL, the core language of any RDBMS.

Programming – This one may seem obvious but learning how to program in a language such as Python is very important and will give you insight into programming paradigms and syntax. A declarative language like SQL is written very differently to an object-oriented language like C#. Additionally, knowing a scripting language such as PowerShell or Bash could be useful when managing cloud resources.
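
As a rough illustration of how differently a declarative language reads, the sketch below computes the same per-sensor average twice: once as a step-by-step Python loop, and once as a single SQL statement run through Python's built-in sqlite3 module. Python stands in for the general-purpose language here (the same contrast would apply to C#), and the sample readings and function names are invented for the example.

```python
import sqlite3

# Hypothetical sample data: (sensor name, reading).
readings = [("sensor_a", 20.0), ("sensor_b", 30.0), ("sensor_a", 22.0)]


def average_imperative(rows):
    """Spell out HOW to compute the averages, one step at a time."""
    sums = {}
    counts = {}
    for name, value in rows:
        sums[name] = sums.get(name, 0.0) + value
        counts[name] = counts.get(name, 0) + 1
    return {name: sums[name] / counts[name] for name in sums}


def average_declarative(rows):
    """Describe WHAT result is wanted and let the database engine work out how."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    cursor = conn.execute("SELECT sensor, AVG(value) FROM readings GROUP BY sensor")
    result = dict(cursor)  # each row is a (sensor, average) pair
    conn.close()
    return result


print(average_imperative(readings))
print(average_declarative(readings))
```

Both functions return the same dictionary. The loop manages its own bookkeeping, whereas the SQL version only states the grouping and the aggregate it wants.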

Data Structures and Algorithms – Ties in with the programming element and is often taken at a more advanced level. Learning how arrays, lists and graphs work will complement your skillset and further your understanding of writing code. Furthermore, learning about file structures such as B-trees and heap files will enhance your knowledge within the database domain.
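
To give a feel for why these structures matter to databases, here is a simplified sketch contrasting two ways of finding a record. The unordered list plays the part of a heap file, which has to be scanned from the start. The sorted list of keys searched with Python's bisect module is only a stand-in for an index: a real B-tree is a multi-level, disk-friendly structure, but the idea of narrowing the search through ordered keys is similar. The records and function names are hypothetical.

```python
import bisect

# A "heap file": records stored in arrival order, with no ordering on the key.
heap_file = [(42, "carol"), (7, "alice"), (19, "bob"), (88, "dave")]


def heap_lookup(records, key):
    """Scan every record until the key is found (linear time)."""
    for record_key, value in records:
        if record_key == key:
            return value
    return None


# A simplified index: keys kept in sorted order, each pointing at a position in the heap file.
index = sorted((record_key, position) for position, (record_key, _) in enumerate(heap_file))
index_keys = [record_key for record_key, _ in index]


def index_lookup(records, key):
    """Binary-search the sorted keys, then jump straight to the record (logarithmic time)."""
    i = bisect.bisect_left(index_keys, key)
    if i < len(index_keys) and index_keys[i] == key:
        _, position = index[i]
        return records[position][1]
    return None


print(heap_lookup(heap_file, 19))
print(index_lookup(heap_file, 19))
```

With four records the difference is invisible, but on millions of rows the scan touches every record while the ordered lookup only needs a handful of comparisons.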

Contact and Questions

I hope this short post has given you more insight into what Data Engineering is and what tools are used to go along with it. If you have any questions, please feel free to contact me on LinkedIn and I would be happy to answer them for you. Alternatively, comment down below and I’ll do my best to get back to you quickly!

Thanks for reading. If you’d like to read our other blogs, click here.