The MLOps engineer role differs from the ML engineer role. Although responsibilities vary from company to company, ML engineers generally focus on bringing individual projects to production, while MLOps engineers build the platform that machine learning engineers and data scientists use.
Building such platforms requires many different skills. Here is a roadmap that will help you become an MLOps engineer. It is intended to be followed step by step, starting with programming skills and ending with MLOps components and infrastructure as code.
Remember: you do not need to know all the tools. A proper understanding of, and experience with, just one tool of each type is enough. Here, we suggest a few of the most common and popular tools to get you started.
The roadmap may have some updates during the year. If you want to see the latest version and have some suggestions, check out the MLOps roadmap repo by Marvelous MLOps.
1. Programming
Programming skills are essential for an MLOps engineer. Python is the most common language used for machine learning. Since MLOps engineers collaborate with machine learning engineers and data scientists, learning Python is important.
1.1. Python & IDEs
We suggest learning Python by reading a proper Python book and practicing the concepts.
Tutorial suggestion: https://realpython.com
Book suggestion: Python Crash Course, 3rd Edition: A Hands-On, Project-Based Introduction to Programming 3rd Edition by Eric Matthes
Code practice suggestion: https://leetcode.com/problemset/
Course suggestions: Learn Python 3
Tracks suggestions: Python fundamentals, Python programming
Important things to know about using Python:
Installing Python, using virtual environments. Check out The right way to install Python on Mac article.
Using an IDE. Check out guide How to configure VS Code for ML
Python basics (part 1 of Python Crash Course book)
Pytest (part 1 of Python Crash Course book, Python programming track)
Packaging: How to build and publish Python packages with poetry.
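To make the testing step above concrete, here is a minimal pytest-style sketch. The function and test names are our own illustrative examples, not taken from the book; pytest discovers any function whose name starts with `test_` and runs it.

```python
# calculator.py -- a tiny module under test (illustrative only)
def add(a: float, b: float) -> float:
    """Return the sum of two numbers."""
    return a + b


# test_calculator.py -- pytest collects functions prefixed with `test_`
import pytest


def test_add_integers():
    assert add(2, 3) == 5


def test_add_floats():
    # pytest.approx absorbs floating-point rounding error
    assert add(0.1, 0.2) == pytest.approx(0.3)
```

Running `pytest` from the project root finds and executes both tests.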
1.2. Bash basics & command line editors
You will need to understand bash basics to add steps to your CI/CD pipelines, write Dockerfiles, and much more.
Book suggestion: The Linux Command Line, 2nd Edition by William E. Shotts
Course suggestion: Bash mastery
VIM is one of the most widely used command-line editors. It is lightweight and easy to get started with.
Tutorial suggestion: VIM beginners guide, VIM adventures, VIM by Daniel Miessler.
2. Containerization and Kubernetes
Containers are isolated software environments that streamline software development and deployment, regardless of the underlying infrastructure. Containerization is an essential piece of modern software engineering best practices.
2.1. Docker
Docker is one of the most popular open-source containerization platforms, also widely used in MLOps for multiple purposes: code development, model training, and endpoint deployment.
Docker roadmap: https://roadmap.sh/docker
Tutorial suggestion: Full docker tutorial by Techworld by Nana
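As a small illustration of what a containerized Python application looks like, here is a minimal Dockerfile sketch. The Python version, file names, and entry point are assumptions, not a prescription:

```dockerfile
# Start from a slim official Python base image (version is illustrative)
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first so this layer is cached between builds
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code
COPY . .

# Run the application (entry point is a placeholder)
CMD ["python", "main.py"]
```

Build and run it with `docker build -t my-app .` followed by `docker run my-app`.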
2.2. Kubernetes
Kubernetes is a must-learn for an MLOps engineer. It is widely used for machine learning model training, model endpoint deployment, and serving dashboards.
Kubernetes roadmap: https://roadmap.sh/kubernetes
Tutorial suggestion: Kubernetes course by freecodecamp.com
Course suggestion: Kubernetes mastery
K9s is a powerful CLI tool that makes managing your Kubernetes clusters easy:
https://k9scli.io. Great for development!
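To give a feel for what you will be writing, here is a minimal Kubernetes Deployment manifest sketch. The names, image, and port are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-api              # placeholder name
spec:
  replicas: 2                  # run two pods for availability
  selector:
    matchLabels:
      app: model-api
  template:
    metadata:
      labels:
        app: model-api
    spec:
      containers:
        - name: model-api
          image: my-registry/model-api:latest   # placeholder image
          ports:
            - containerPort: 8000
```

Apply it with `kubectl apply -f deployment.yaml` and inspect the pods with `kubectl get pods` (or K9s).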
3. Machine learning fundamentals
An MLOps engineer works with machine learning engineers and data scientists and should have some basic understanding of machine learning models.
Without a proper understanding of what data scientists and machine learning engineers do, you cannot fully embrace MLOps principles.
Course suggestions:
https://mlcourse.ai/
https://course.fast.ai
Book suggestion: Applied Machine Learning and AI for Engineers by Jeff Prosise
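As a taste of the fundamentals covered by these resources, here is a minimal train-and-evaluate sketch using scikit-learn (the dataset and model choice are illustrative):

```python
# Minimal train/evaluate loop with scikit-learn (illustrative sketch)
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a toy dataset and hold out 20% for evaluation
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit a simple baseline model and measure test accuracy
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Understanding this loop (data split, fit, evaluate) is the minimum needed to reason about what an MLOps platform must support.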
4. MLOps principles
MLOps engineers must be aware of MLOps principles and the factors that contribute to MLOps maturity.
Books:
Designing Machine Learning Systems by Chip Huyen
Introducing MLOps by Mark Treveil and Dataiku
Check out MLOps maturity assessment and ml-ops.org.
5. MLOps components
An MLOps platform consists of multiple components, such as version control, CI/CD, orchestration, compute, serving, and feature stores. In the end, an MLOps framework is about combining these tools. Check out the Minimum set of must-haves for MLOps article.
Book suggestion: ML Engineering with Python by Andy McMahon
5.1. Version control & CI/CD pipelines
Without version control and CI/CD pipelines, ML model deployments are neither traceable nor reproducible.
Git is the most popular version control system. GitLab and GitHub are the most popular version control services. You do not have to learn both (though over the course of your career you probably will).
Books:
Learning GitHub Actions by Brent Laster
Learning Git by Anna Skoulikari
Tutorials & courses:
Taking Python to Production: A Professional Onboarding Guide
https://learngitbranching.js.org/
Pre-commit hooks are super useful for keeping your code neat and are an important piece of your CI pipeline. Check out Welcome to pre-commit heaven article.
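As a starting point, here is a minimal `.pre-commit-config.yaml` sketch. The hook repositories are real, but the pinned revisions are examples and may be outdated:

```yaml
# .pre-commit-config.yaml -- example hooks (revs are illustrative)
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
  - repo: https://github.com/psf/black
    rev: 24.3.0
    hooks:
      - id: black
```

After running `pre-commit install` once, the hooks run automatically on every `git commit`.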
5.2. Orchestration
Just like in data engineering, orchestration systems like Mage or Airflow are popular for machine learning engineering. There are also ML-specific orchestration tools (that do more than just orchestration), such as Kubeflow or Metaflow. Airflow still seems to be more common in industry.
Orchestration systems keep all your model runs in the same place and help with:
Sharing variables between different jobs running on the compute
Identifying which runs failed on the compute and repairing them
Defining complex execution logic
Course suggestion: Introduction to Airflow in Python
Note: ML Engineering with Python book by Andy McMahon and The full stack 7-steps MLOps framework also use Airflow.
5.3. Experiment tracking and model registries
Experiment tracking means logging the metadata, parameters, and artifacts that belong to different model training runs. What is stored depends on the algorithm and your needs. Experiment tracking makes it possible to compare runs with each other. Models from different experiment runs can be registered and linked to their experiments, which helps traceability.
MLflow is probably the most popular tool for model registry and experiment tracking out there. MLflow is open source and integrates with a lot of platforms and tools. Check out the Find your way to MLflow without confusion article.
Course suggestion: MLflow Udemy course, End-to-end machine learning (MLflow piece)
5.4. Data lineage and feature stores
Feature stores have become quite popular recently and now can be considered an important component of MLOps infrastructure. A feature store helps to keep track of feature use and allows the sharing of features across models.
Every major cloud provider or ML platform (like Databricks) has a feature store available, so consider using it. If you need an open-source solution, consider Feast as it seems to be the most popular one (based on the number of GitHub stars).
Tutorial suggestion: Creating a feature store with Feast part 1, part 2, part 3
You do not strictly need a feature store if you do not have many models sharing the same features. But you do need to track what data was used to produce a model artifact; consider using DVC for that purpose.
Course suggestion: End-to-end machine learning (DVC piece)
5.5. Model training & serving
Where to train your model and how to serve it is probably the most controversial topic in MLOps. The answer is usually "it depends".
Many data science teams rely on cloud-native solutions like AWS SageMaker, Azure ML, or Vertex AI for training and serving their models.
If your organization relies heavily on Kubernetes and you have a proper team supporting it, consider using it. If you use Airflow for orchestration, its KubernetesPodOperator allows you to trigger a model training job on Kubernetes. For endpoint deployment, FastAPI is the most common choice.
Repository suggestion: ML deployment k8s FastAPI
Tutorial suggestion: How to build machine learning app with FastAPI
If you have Kubeflow as an orchestrator, you can use Kubeflow pipelines for training and KServe for serving.
Tutorial suggestions: Basic kubeflow pipeline, Building and deploying machine learning pipelines, KServe tutorial
5.6. Monitoring & observability
Monitoring and observability are crucial parts of an MLOps platform. Even though these terms can be used interchangeably, there is a difference between them. Check out ML monitoring vs Observability article.
For ML system observability, the combination of Prometheus and Grafana is probably the most common choice out there. We suggest checking out the Mastering Prometheus and Grafana course.
When it comes to ML-specific monitoring, like data and model drift, the major cloud providers have their own solutions built into their ML offerings. There are also open-source solutions available, such as Evidently.ai and NannyML.
Course suggestion: Machine learning monitoring concepts, Monitoring machine learning in Python.
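To show what instrumenting a serving endpoint for Prometheus looks like, here is a minimal sketch using the `prometheus_client` library. The metric names are our own choices:

```python
# Exposing custom metrics with prometheus_client (metric names are illustrative)
from prometheus_client import CollectorRegistry, Counter, Histogram, generate_latest

registry = CollectorRegistry()

# Count how many prediction requests the service has handled
REQUESTS = Counter(
    "prediction_requests_total",
    "Total prediction requests",
    registry=registry,
)
# Track how long each prediction takes
LATENCY = Histogram(
    "prediction_latency_seconds",
    "Prediction latency in seconds",
    registry=registry,
)


def handle_request() -> None:
    REQUESTS.inc()
    with LATENCY.time():
        pass  # model inference would happen here


handle_request()
# generate_latest renders all metrics in the Prometheus text exposition format
print(generate_latest(registry).decode())
```

In a real service these metrics would be exposed on a `/metrics` endpoint for Prometheus to scrape, and Grafana would dashboard them.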
6. Infrastructure as code: Terraform
Infrastructure as code is crucial to make your MLOps framework reproducible. Terraform is the most popular and powerful IaC tool. It works with all common cloud providers and platforms.
Course suggestion: Terraform course for beginners
Short video: 8 Terraform best practices by Techworld by Nana
Book suggestion: Terraform: Up and Running, 3rd Edition by Yevgeniy Brikman
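To give a sense of what Terraform configuration looks like, here is a minimal sketch that provisions an S3 bucket for ML artifacts. The provider version, region, and bucket name are placeholders:

```hcl
# Minimal Terraform example (names and versions are placeholders)
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "eu-west-1"
}

resource "aws_s3_bucket" "model_artifacts" {
  bucket = "my-team-model-artifacts"   # S3 bucket names must be globally unique
}
```

The usual workflow is `terraform init` to download providers, `terraform plan` to preview changes, and `terraform apply` to create the resources.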