
Bridging the data science and engineering divide


by Tasos Pardalis

What my experiences have taught me about both roles and the need to work together.

When you search online, you'll find loads of discussions about the distinctions between data science and data engineering. These conversations often emphasise the different technical skills each role requires.

I used to agree with these views. But, having worked as a data scientist and now tackling data engineering projects, I've learnt that these roles have more in common than many believe and that understanding each other's roles is vital for successful projects.

A tale of two roles

My data science journey

In my previous job as a data scientist, we had a wealth of data stored in a data warehouse, hosted and accessed through Snowflake, a data cloud product. This setup allowed us not only to read the data but also to write to the development clusters. It was fantastic for enhancing datasets with publicly available information and easily accessing data in Python or R using tools like Pandas or Spark.

We worked on diverse projects, from consolidating data from multiple sources to building predictive models with machine learning and deep learning. We had flexibility, but lacked proper data models and documentation, with our primary focus being on providing data access to various stakeholders. This led to a lot of exploration, data cleaning, ETL work, and working with engineers to prepare our data.

We had data engineers on the team, and one of them had a strong understanding of both data engineering and data science. But, for the most part, both parties were very detached from one another.

Data engineering in the present

Back to the present, I've now taken on a new role as an external data engineering expert for an ETL project. The client needed external support due to their lack of internal data engineering experience. The final product we are creating will benefit various teams, but the data science team will be its first and primary users.

During the project's discovery phase, I engaged with various stakeholders, including the data scientists. What hit me was that when I spoke to them, they immediately felt more understood and listened to because I could speak their language, something previous engineers on the project couldn’t do.

This showed me that effective communication between these parties, who excel in their individual fields but speak different "languages," is crucial. The feedback I received was that it was “so good to have a data engineer who understands data science”.

Anyway, with the examples out of the way, here is what I believe data engineers and data scientists need to do to work together more effectively.

Bridging the gap

Most discussions about data roles centre on technical skills. But as someone who's done both data science and data engineering, I think it's crucial for data scientists and data engineers to grasp each other's jobs on projects, or it just gets messy.

Data Engineers generally aim to provide broad database solutions for diverse business needs. Regardless of the data source or format, they develop pipelines to ingest, cleanse, and load data into accessible databases. They prioritise serving multiple business stakeholders, avoiding customisation unless necessary.
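To make the ingest, cleanse, and load steps concrete, here is a minimal sketch using only the Python standard library. It is not taken from any project described here: the CSV content, the column names, and the `customers` table are all invented for illustration, and an in-memory SQLite database stands in for a real warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; every column name and value is made up for this sketch.
RAW_CSV = """customer_id,country,signup_date
1, uk ,2023-01-05
2,DE,2023-02-11
,FR,2023-03-02
3,de,2023-03-09
"""


def ingest(text):
    """Read raw CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def cleanse(rows):
    """Drop rows without an ID, trim whitespace, and upper-case country codes."""
    cleaned = []
    for row in rows:
        customer_id = row["customer_id"].strip()
        if not customer_id:
            continue  # no key, so the row cannot be loaded reliably
        cleaned.append(
            (int(customer_id), row["country"].strip().upper(), row["signup_date"].strip())
        )
    return cleaned


def load(rows, conn):
    """Write the cleansed rows into a table that other teams can query."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers "
        "(customer_id INTEGER PRIMARY KEY, country TEXT, signup_date TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO customers VALUES (?, ?, ?)", rows)
    conn.commit()


# An in-memory SQLite database stands in for the warehouse (an assumption).
conn = sqlite3.connect(":memory:")
load(cleanse(ingest(RAW_CSV)), conn)
result = conn.execute(
    "SELECT customer_id, country FROM customers ORDER BY customer_id"
).fetchall()
print(result)
```

Notice that the cleansing rules are deliberately generic. Nothing in them is tailored to one team, which is the kind of trade-off a data engineer serving many stakeholders tends to make.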

Data Scientists mostly strive to deliver accurate predictions and uncover insights. Trustworthy data, carefully maintained and deduplicated, is vital. They require easy access to data, preferably from cloud databases. Data scientists engage in exploratory analysis, data cleaning, and reshaping. They often create new data objects, enhancing predictive models, and need the ability to integrate new data with the original.
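As a rough illustration of that workflow, the sketch below deduplicates some rows, enriches them with an external lookup, and keeps the result joined to the original records. The rows, the `region_by_country` lookup, and the helper names are all hypothetical; in practice this would more likely be done with Pandas or Spark against the warehouse.

```python
# Hypothetical rows as they might come out of a warehouse table.
warehouse_rows = [
    {"customer_id": 1, "country": "UK"},
    {"customer_id": 2, "country": "DE"},
    {"customer_id": 2, "country": "DE"},  # duplicate record
    {"customer_id": 3, "country": "FR"},
]

# Hypothetical external lookup; the region labels are placeholders, not real data.
region_by_country = {"UK": "Region A", "DE": "Region B"}


def deduplicate(rows, key):
    """Keep the first row seen for each value of `key`."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


def enrich(rows, lookup):
    """Add a `region` field to each row; None marks a gap in the lookup."""
    return [{**row, "region": lookup.get(row["country"])} for row in rows]


enriched = enrich(deduplicate(warehouse_rows, "customer_id"), region_by_country)

# Rows the lookup could not cover are worth raising with the data engineers.
missing = [row["customer_id"] for row in enriched if row["region"] is None]
print(enriched)
print(missing)
```

The `missing` list is the useful by-product here: it turns "the data has gaps" into a precise request that a data engineer can act on.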

Understanding both mindsets makes it easier to meet each other's needs. Data engineers, aware that data scientists can handle their specific ETL needs, can explain that the standard data engineering format is flexible. Data scientists can adapt it to their preferences. Similarly, data scientists, understanding data engineers' resource constraints and broad user needs, can request precisely what they require.

Daily collaboration is essential if teams are to create more refined data products together.

Effective documentation

Data engineers use data models, code explanations, and tool documentation to document their databases and processes. This documentation aids data scientists in finding needed data and identifying gaps. A detailed data model can help them understand why certain data is missing and collaborate with data engineers to fill those gaps.

Data scientists often work with tools like Jupyter notebooks, using markdown to document their analyses. When that documentation is thorough, it helps data engineers understand what data scientists need, which is a good reason for data scientists to keep it complete.

Both sides need to share these tools and models with one another and demonstrate their benefits.

Balancing Expertise

Not every data engineer must be a data science expert, and not every data scientist must be a data engineering whiz. Building a team with everyone excelling in both areas is often impractical. Specialisation has its merits.

Some data engineering projects serve non-data scientist users, making in-depth data science knowledge unnecessary for data engineers in such cases.

Thanks to my experiences, I’ve come to the conclusion that bridging the gap between data engineering and data science is essential for successful projects. While discussions often focus on technical skills, both roles share common tools and goals. By understanding each other's mindsets, communicating and collaborating effectively, and balancing expertise, both parties, and companies as a whole, can ensure their projects and teams thrive.


Tasos Pardalis

Data Engineering Specialist

Tasos Pardalis has four passions: data engineering and science, business, and family. His main hobbies include watching TV series, travelling with his partner, reading interesting books, and playing electronic games.
