The MLOps engineer role differs from the ML engineer role. Although responsibilities vary from company to company, ML engineers generally focus on bringing individual projects to production, while MLOps engineers build the platform that machine learning engineers and data scientists use.
Building such platforms requires many different skills. Here is a roadmap that will help you become an MLOps engineer. It is intended to be followed step by step, starting with programming skills and ending with MLOps components and infrastructure as code.
Remember: you do not need to know all the tools. A proper understanding of, and experience with, just one tool of each type is enough. Here, we suggest a few of the most common and popular tools to get you started.
The roadmap may have some updates during the year. If you want to see the latest version and have some suggestions, check out the MLOps roadmap repo by Marvelous MLOps.
1. Programming
Programming skills are essential for an MLOps engineer. Python is the most common language used for machine learning. Since MLOps engineers collaborate with machine learning engineers and data scientists, learning Python is important.
1.1. Python & IDEs
We suggest learning Python by reading a proper Python book and practicing the concepts.
Tutorial suggestion: https://realpython.com
Book suggestion: Python Crash Course, 3rd Edition: A Hands-On, Project-Based Introduction to Programming 3rd Edition by Eric Matthes
Code practice suggestion: https://leetcode.com/problemset/
Course suggestions: Learn Python 3
Tracks suggestions: Python fundamentals, Python programming
Important things to know about using Python:
Installing Python, using virtual environments. Check out The right way to install Python on Mac article.
Using an IDE. Check out guide How to configure VS Code for ML
Python basics (part 1 of Python Crash Course book)
Pytest (part 1 of Python Crash Course book, Python programming track)
Packaging: How to build and publish Python packages with poetry.
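To make the testing step above concrete, here is a minimal pytest-style sketch. The function and test names are our own illustrative examples, not taken from the book; pytest discovers any function whose name starts with `test_` and runs it.

```python
# calculator.py -- a tiny module under test (illustrative only)
def add(a: float, b: float) -> float:
    """Return the sum of two numbers."""
    return a + b


# test_calculator.py -- pytest collects functions prefixed with `test_`
import pytest


def test_add_integers():
    assert add(2, 3) == 5


def test_add_floats():
    # pytest.approx absorbs floating-point rounding error
    assert add(0.1, 0.2) == pytest.approx(0.3)
```

Running `pytest` from the project root finds and executes both tests.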
1.2. Bash basics & command line editors
You will need to understand bash basics to add steps to your CI/CD pipelines, write Dockerfiles, and much more.
Book suggestion: The Linux Command Line, 2nd Edition by William E. Shotts
Course suggestion: Bash mastery
VIM is one of the most widely used command-line editors. It is lightweight and easy to get started with.
Tutorial suggestion: VIM beginners guide, VIM adventures, VIM by Daniel Miessler.
2. Containerization and Kubernetes
Containers are isolated software environments that streamline software development and deployment, regardless of the underlying infrastructure. Containerization is an essential piece of modern software engineering best practices.
2.1. Docker
Docker is one of the most popular open-source containerization platforms, also widely used in MLOps for multiple purposes: code development, model training, and endpoint deployment.
Docker roadmap: https://roadmap.sh/docker
Tutorial suggestion: Full docker tutorial by Techworld by Nana
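As a small illustration of what a containerized Python application looks like, here is a minimal Dockerfile sketch. The Python version, file names, and entry point are assumptions, not a prescription:

```dockerfile
# Start from a slim official Python base image (version is illustrative)
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first so this layer is cached between builds
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code
COPY . .

# Run the application (entry point is a placeholder)
CMD ["python", "main.py"]
```

Build and run it with `docker build -t my-app .` followed by `docker run my-app`.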
2.2. Kubernetes
Kubernetes is a must-learn for an MLOps engineer. It is widely used for machine learning model training, model endpoint deployment, and serving dashboards.
Kubernetes roadmap: https://roadmap.sh/kubernetes
Tutorial suggestion: Kubernetes course by freecodecamp.com
Course suggestion: Kubernetes mastery
K9s is a powerful CLI tool that makes managing your Kubernetes clusters easy:
https://k9scli.io. Great for development!
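To give a feel for what you will be writing, here is a minimal Kubernetes Deployment manifest sketch. The names, image, and port are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-api              # placeholder name
spec:
  replicas: 2                  # run two pods for availability
  selector:
    matchLabels:
      app: model-api
  template:
    metadata:
      labels:
        app: model-api
    spec:
      containers:
        - name: model-api
          image: my-registry/model-api:latest   # placeholder image
          ports:
            - containerPort: 8000
```

Apply it with `kubectl apply -f deployment.yaml` and inspect the pods with `kubectl get pods` (or K9s).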
3. Machine learning fundamentals
An MLOps engineer works with machine learning engineers and data scientists and should have some basic understanding of machine learning models.
Without a proper understanding of what data scientists and machine learning engineers do, you cannot fully embrace MLOps principles.
Course suggestions:
https://mlcourse.ai/
https://course.fast.ai
Book suggestion: Applied Machine Learning and AI for Engineers by Jeff Prosise
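As a taste of the fundamentals covered by these resources, here is a minimal train-and-evaluate sketch using scikit-learn (the dataset and model choice are illustrative):

```python
# Minimal train/evaluate loop with scikit-learn (illustrative sketch)
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a toy dataset and hold out 20% for evaluation
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit a simple baseline model and measure test accuracy
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Understanding this loop (data split, fit, evaluate) is the minimum needed to reason about what an MLOps platform must support.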
4. MLOps principles
MLOps engineers must be aware of MLOps principles and the factors that contribute to MLOps maturity.
Books:
Designing Machine Learning Systems by Chip Huyen
Introducing MLOps by Mark Treveil and Dataiku
Check out MLOps maturity assessment and ml-ops.org.
5. MLOps components
An MLOps platform consists of multiple components, such as version control, CI/CD, orchestration, compute, serving, and feature stores. In the end, an MLOps framework is about combining these tools. Check out the Minimum set of must-haves for MLOps article.
Book suggestion: ML Engineering with Python by Andy McMahon
5.1. Version control & CI/CD pipelines
Without version control and CI/CD pipelines, ML model deployments are neither traceable nor reproducible.
Git is the most popular version control system. GitLab and GitHub are the most popular version control services. You do not have to learn both (though over the course of your career you probably will).
Books:
Learning GitHub Actions by Brent Laster
Learning Git by Anna Skoulikari
Tutorials & courses:
Taking Python to Production: A Professional Onboarding Guide
https://learngitbranching.js.org/
Pre-commit hooks are super useful for keeping your code neat and are an important piece of your CI pipeline. Check out Welcome to pre-commit heaven article.
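As a starting point, here is a minimal `.pre-commit-config.yaml` sketch. The hook repositories are real, but the pinned revisions are examples and may be outdated:

```yaml
# .pre-commit-config.yaml -- example hooks (revs are illustrative)
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
  - repo: https://github.com/psf/black
    rev: 24.3.0
    hooks:
      - id: black
```

After running `pre-commit install` once, the hooks run automatically on every `git commit`.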
5.2. Orchestration
Just like in data engineering, orchestration systems like Mage or Airflow are popular for machine learning engineering. There are also ML-specific orchestration tools (that do more than just orchestration), such as Kubeflow or Metaflow. Airflow still seems to be more common in industry.
Orchestration systems keep all your model runs in the same place and help with:
Sharing variables between different jobs running on the compute
Identifying which runs failed on the compute and repairing them
Defining complex execution logic
Course suggestion: Introduction to Airflow in Python
Note: ML Engineering with Python book by Andy McMahon and The full stack 7-steps MLOps framework also use Airflow.
5.3. Experiment tracking and model registries
Experiment tracking means logging the metadata, parameters, and artifacts that belong to different model training runs. What is stored depends on the algorithm and your needs. Experiment tracking makes it possible to compare runs with each other. Models from different experiment runs can be registered and linked to their experiments, which helps traceability.
MLflow is probably the most popular tool for model registry and experiment tracking out there. MLflow is open source and integrates with a lot of platforms and tools. Check out the Find your way to MLflow without confusion article.
Course suggestion: MLflow Udemy course, End-to-end machine learning (MLflow piece)
5.4. Data lineage and feature stores
Feature stores have become quite popular recently and now can be considered an important component of MLOps infrastructure. A feature store helps to keep track of feature use and allows the sharing of features across models.
Every major cloud provider or ML platform (like Databricks) has a feature store available, so consider using it. If you need an open-source solution, consider Feast as it seems to be the most popular one (based on the number of GitHub stars).
Tutorial suggestion: Creating a feature store with Feast part 1, part 2, part 3
You do not strictly need a feature store if you do not have many models sharing the same features. But you do need to track what data was used to produce a model artifact; consider using DVC for that purpose.
Course suggestion: End-to-end machine learning (DVC piece)
5.5. Model training & serving
Where to train your model and how to serve it is probably the most controversial topic in MLOps. The answer is usually "it depends".
Many data science teams rely on cloud-native solutions like AWS SageMaker, Azure ML, or Vertex AI for training and serving their models.
If your organization relies heavily on Kubernetes and you have a proper team supporting it, consider using it. If you use Airflow for orchestration, its KubernetesPodOperator allows you to trigger a model training job on Kubernetes. For endpoint deployment, FastAPI is the most common choice.
Repository suggestion: ML deployment k8s FastAPI
Tutorial suggestion: How to build machine learning app with FastAPI
If you have Kubeflow as an orchestrator, you can use Kubeflow pipelines for training and KServe for serving.
Tutorial suggestions: Basic kubeflow pipeline, Building and deploying machine learning pipelines, KServe tutorial
5.6. Monitoring & observability
Monitoring and observability are crucial parts of an MLOps platform. Even though these terms can be used interchangeably, there is a difference between them. Check out ML monitoring vs Observability article.
For ML system observability, the combination of Prometheus and Grafana is probably the most common choice out there. We suggest checking out the Mastering Prometheus and Grafana course.
When it comes to ML-specific monitoring, like data and model drift, the major cloud providers have their own solutions built into their ML offerings. There are also open-source solutions available, such as Evidently.ai and NannyML.
Course suggestion: Machine learning monitoring concepts, Monitoring machine learning in Python.
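To show what instrumenting a serving endpoint for Prometheus looks like, here is a minimal sketch using the `prometheus_client` library. The metric names are our own choices:

```python
# Exposing custom metrics with prometheus_client (metric names are illustrative)
from prometheus_client import CollectorRegistry, Counter, Histogram, generate_latest

registry = CollectorRegistry()

# Count how many prediction requests the service has handled
REQUESTS = Counter(
    "prediction_requests_total",
    "Total prediction requests",
    registry=registry,
)
# Track how long each prediction takes
LATENCY = Histogram(
    "prediction_latency_seconds",
    "Prediction latency in seconds",
    registry=registry,
)


def handle_request() -> None:
    REQUESTS.inc()
    with LATENCY.time():
        pass  # model inference would happen here


handle_request()
# generate_latest renders all metrics in the Prometheus text exposition format
print(generate_latest(registry).decode())
```

In a real service these metrics would be exposed on a `/metrics` endpoint for Prometheus to scrape, and Grafana would dashboard them.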
6. Infrastructure as code: Terraform
Infrastructure as code is crucial to make your MLOps framework reproducible. Terraform is the most popular and powerful IaC tool. It works with all common cloud providers and platforms.
Course suggestion: Terraform course for beginners
Short video: 8 Terraform best practices by Techworld by Nana
Book suggestion: Terraform: Up and Running, 3rd Edition by Yevgeniy Brikman
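To give a sense of what Terraform configuration looks like, here is a minimal sketch that provisions an S3 bucket for ML artifacts. The provider version, region, and bucket name are placeholders:

```hcl
# Minimal Terraform example (names and versions are placeholders)
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "eu-west-1"
}

resource "aws_s3_bucket" "model_artifacts" {
  bucket = "my-team-model-artifacts"   # S3 bucket names must be globally unique
}
```

The usual workflow is `terraform init` to download providers, `terraform plan` to preview changes, and `terraform apply` to create the resources.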