Technical roles in Data Science: Who is doing what?
As data has become a formidable asset and a main driver of innovation and growth, companies have started to transform their teams to be more data-driven. With that, a diverse range of roles has emerged, working collaboratively to unlock the power of data in virtually every domain, because every domain has data.
“Who is doing what?” is a question whose answer depends on the organization you are referring to. In this article, we’ll try to outline the most common distribution of technical roles within ML. Please note that the definitions of these roles are not rigidly fixed. What truly matters is understanding the duties and assignments associated with each role, rather than the specific title assigned to an individual. In some organizations, the roles below might be combined into a single role or, the other way around, split into multiple roles.
Data Scientist: The backbone of any data science team. They focus on data processing, feature engineering, model building, and more. Data scientists are expected to have statistical expertise, machine learning knowledge, and programming skills in addition to domain knowledge of the use case they are working on.
Data processing
Feature engineering
Model building
Statistical expertise
Machine learning knowledge
Programming skills
Domain knowledge
Intersects with: Data Analysts (when the work involves data analysis and interpretation), Machine Learning Engineers (when data scientists also deploy models), and Research Scientists (when data scientists engage in research-oriented projects).
Data Engineer: Their role is to build and maintain the infrastructure and workflows to handle vast amounts of data. They create ETL pipelines for data integration, cleansing, transformation, and data warehousing. They make sure the data science team and other data consumers get reliable and high-quality data.
Infrastructure building
Data pipelines (ETL)
Data cleansing
Data warehousing
Scalability
Reliability
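To make the ETL idea above a bit more concrete, here is a minimal sketch of an extract, transform, load flow using only the Python standard library. Everything in it is an assumption made for illustration: the `RAW_CSV` sample data, the `payments` table, and the `extract`, `transform`, and `load` helper names are hypothetical, and an in-memory SQLite database stands in for a real data warehouse. Production pipelines typically rely on dedicated orchestration and storage tools instead.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; column names and values are invented for illustration.
RAW_CSV = """user_id,country,amount
1,de,10.50
2,US,
3,us,7.25
3,us,7.25
"""


def extract(raw_text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: drop rows with a missing amount, normalize country
    codes to upper case, and remove exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # cleansing: skip incomplete records
        record = (
            int(row["user_id"]),
            row["country"].strip().upper(),
            float(row["amount"]),
        )
        if record in seen:
            continue  # cleansing: skip duplicates
        seen.add(record)
        cleaned.append(record)
    return cleaned


def load(records, connection):
    """Load: write cleaned records into a table (SQLite stands in for a warehouse)."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS payments "
        "(user_id INTEGER, country TEXT, amount REAL)"
    )
    connection.executemany("INSERT INTO payments VALUES (?, ?, ?)", records)
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), connection)
rows = connection.execute(
    "SELECT user_id, country, amount FROM payments ORDER BY user_id"
).fetchall()
print(rows)
```

In this sketch the row with a missing amount and the duplicated row are dropped during the transform step, so downstream consumers only see cleaned records.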
Data Analyst: Their main focus is gathering meaningful insights from data and visualizing them for other teams within the organization. They often work with business intelligence tools and with stakeholders to understand their requirements and develop automated or ad-hoc reports and dashboards for product teams. Good data analysts act as the antennae of the data organization: they often know the data best and can be invaluable in coming up with new use cases.
Data exploration
Data visualization
Business intelligence (BI)
Data communication
Data monitoring
Intersects with: Data Scientists (both roles involve data analysis) and Data Engineers (to access and structure data for analysis).
Machine Learning Engineer (AI Engineer): They play a pivotal role in productionizing ML models. They work closely with data scientists to transform models into scalable and deployable products. We also see ML engineers being called AI engineers in some companies.
Model productionization
Scalable deployments
Integration
Monitoring
Model versioning
Intersects with: Data Scientists (when ML engineers conduct data preprocessing, feature engineering, and model selection) and MLOps Engineers (when ML engineers get involved in infrastructure and monitoring).
MLOps Engineers / Teams: They are responsible for the deployment, scaling, monitoring, and maintenance of machine learning models and pipelines in production environments. They also create an MLOps system or platform, which is a set of tools and processes that ensures the systematic development and productionization of ML models.
Deployment
Scaling
Monitoring
Maintenance
MLOps system/platform management
Intersects with: ML Engineers (as they collaborate on model deployment and scalability) and Platform Engineers (when they create infrastructure and monitoring tools).
Platform Engineer: They are responsible for designing and maintaining the underlying infrastructure that enables ML engineers and data scientists to deploy models. They gather requirements from ML engineers or data science teams and provide scalable and robust platforms. They are sometimes also called Infrastructure Engineers or Cloud Engineers.
Infrastructure design
Scalable platforms
Requirements gathering
Security
Intersects with: Data Engineers if they collaborate on data infrastructure.
Research Scientist: Their primary focus is to find and develop new ML algorithms and build state-of-the-art models. Unlike other roles, they do not focus on the production side. Instead, they spend most of their time developing and evaluating new techniques.
Algorithm development
Model innovation
Experimentation
Publication
Collaboration
Depending on the team size at a company, the same person might play multiple roles. You can be a data scientist who also deploys models together with platform engineers. Or you can be a data engineer and transition to a data scientist role by building an ML model. You can be an ML engineer who also develops models. In large organizations, roles are likely to be distributed among different teams, not just people, so different teams might get involved in different phases of the ML model development cycle.