When you search online, you'll find loads of discussions about the distinctions between data science and data engineering. These conversations often emphasise the different technical skills each role requires.
I used to agree with these views. But, having worked as a data scientist and now tackling data engineering projects, I've learnt that these roles have more in common than many believe and that understanding each other's roles is vital for successful projects.
A tale of two roles
My data science journey
In my previous job as a data scientist, we had a wealth of data stored in a data warehouse, hosted and accessed through Snowflake, a data cloud product. This setup allowed us not only to read the data but also to write in the development clusters. It was fantastic for enhancing datasets with publicly available information and easily accessing data in Python or R using tools like Pandas or Spark.
We worked on diverse projects, from consolidating data from multiple sources to building predictive models with machine learning and deep learning. We had flexibility, but lacked proper data models and documentation, with our primary focus being on providing data access to various stakeholders. This led to a lot of exploration, data cleaning, ETL work, and working with engineers to prepare our data.
We had data engineers on the team, and one of them had a strong understanding of both data engineering and data science. But, for the most part, both parties were very detached from one another.
Data engineering in the present
Back to the present, I've now taken on a new role as an external data engineering expert for an ETL project. The client needed external support due to their lack of internal data engineering experience. The final product we are creating will benefit various teams, but the data science team will be its first and primary users.
During the project's discovery phase, I engaged with various stakeholders, including the data scientists. What struck me was that when I spoke to them, they immediately felt more understood and listened to because I could speak their language, something previous engineers on the project couldn’t do.
This showed me that effective communication between these parties, who excel in their individual fields but speak different "languages," is crucial. The feedback I received was that it was “so good to have a data engineer who understands data science”.
Now, with the examples out of the way, here is what I believe data engineers and data scientists need to do to work together more effectively.
Bridging the gap
Most discussions about data roles centre on technical skills. But as someone who's done both data science and data engineering, I think it's crucial for data scientists and data engineers to grasp each other's jobs on projects, or it just gets messy.
Data Engineers generally aim to provide broad database solutions for diverse business needs. Regardless of the data source or format, they develop pipelines to ingest, cleanse, and load data into accessible databases. They prioritise serving multiple business stakeholders, avoiding customisation unless necessary.
Data Scientists mostly strive to deliver accurate predictions and uncover insights. Trustworthy data, carefully maintained and deduplicated, is vital. They require easy access to data, preferably from cloud databases. Data scientists engage in exploratory analysis, data cleaning, and reshaping. They often create new data objects to enhance their predictive models, and need the ability to integrate that new data with the original.
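To make the "deduplicate, then integrate new data with the original" step concrete, here is a minimal sketch in plain Python (rather than Pandas or Spark, purely to keep it self-contained). The `deduplicate` and `enrich` helpers, the column names, and the sample values are all hypothetical and made up for illustration; they are not taken from any project described here.

```python
def deduplicate(rows, key):
    """Keep the first row seen for each value of `key`."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


def enrich(rows, lookup, key):
    """Return new rows with extra fields from `lookup` added where a match exists."""
    return [{**row, **lookup.get(row[key], {})} for row in rows]


# Hypothetical warehouse extract, including one duplicate record.
warehouse_rows = [
    {"customer_id": 1, "region": "NW"},
    {"customer_id": 1, "region": "NW"},
    {"customer_id": 2, "region": "SE"},
    {"customer_id": 3, "region": "XX"},  # no matching public data
]

# Hypothetical publicly available data, keyed by region code.
public_data = {
    "NW": {"population_band": "high"},
    "SE": {"population_band": "medium"},
}

clean_rows = deduplicate(warehouse_rows, key="customer_id")
enriched_rows = enrich(clean_rows, public_data, key="region")

for row in enriched_rows:
    print(row)
```

Rows without a match in the public data are kept unchanged rather than dropped, so the enriched dataset can still be lined up against the original.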
Understanding both mindsets makes it easier to meet each other's needs. Data engineers who know that data scientists can handle their own specific ETL needs can explain that the standard data engineering format is a flexible starting point, which data scientists can adapt to their preferences. Similarly, data scientists who understand data engineers' resource constraints and broad user base can request precisely what they require.
Daily collaboration is essential if teams are to create more refined data products together.
Data engineers use data models, code explanations, and tool documentation to document their databases and processes. This documentation aids data scientists in finding needed data and identifying gaps. A detailed data model can help them understand why certain data is missing and collaborate with data engineers to fill those gaps.
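As a rough illustration of how a documented data model helps identify gaps, here is a minimal Python sketch. The table and column names and the `find_gaps` helper are hypothetical, invented for this example; a real data model would usually live in dedicated documentation or a catalogue tool rather than a dictionary.

```python
# Hypothetical data model: table name -> set of documented columns.
DATA_MODEL = {
    "customers": {"customer_id", "signup_date", "region"},
    "orders": {"order_id", "customer_id", "order_total"},
}


def find_gaps(data_model, required):
    """Return the required columns that the data model does not document."""
    gaps = {}
    for table, columns in required.items():
        documented = data_model.get(table, set())
        missing = set(columns) - documented
        if missing:
            gaps[table] = sorted(missing)
    return gaps


# Columns a hypothetical churn model would need.
needed = {
    "customers": ["customer_id", "region", "age_band"],
    "orders": ["order_id", "order_total"],
    "web_sessions": ["session_id"],
}

gaps = find_gaps(DATA_MODEL, needed)
print(gaps)
```

A list like this gives the data scientist something specific to take to the data engineers, instead of a general complaint that data is missing.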
Data scientists often work with tools like Jupyter notebooks, using markdown to document their analyses. Thorough documentation of this kind helps data engineers understand what data scientists need, which in turn gives data scientists a good reason to keep it thorough.
Both sides need to share these tools and models with one another and demonstrate their benefits.
Not every data engineer must be a data science expert, and not every data scientist must be a data engineering whiz. Building a team with everyone excelling in both areas is often impractical. Specialisation has its merits.
Some data engineering projects serve non-data scientist users, making in-depth data science knowledge unnecessary for data engineers in such cases.
Thanks to my experiences, I’ve come to the conclusion that bridging the gap between data engineering and data science is essential for successful projects. While discussions often focus on technical skills, both roles share common tools and goals. Through understanding each other's mindsets, effective communication and collaboration, and balancing expertise, both parties and companies as a whole can ensure their projects and teams thrive.