This article explains the need for CI pipelines as a part of CI/CD practices. First I’ll share my thoughts on why they are so useful and what their added value is. Then I’ll show you how to build your first simple CI pipeline using GitLab.
Why I love CI so much, and why you should too
CI pipelines are a key step, and usually the first step, in your automated deployment. And holy popcorn Batman, are they great! Once you go CI, you’ll never go back. You will ask yourself “how could I ever work without this?”. Let me break down the advantages for you:
1. Consistency: Automation ensures consistent execution, avoiding human errors.
2. Speed: Automation speeds up the ML lifecycle. Automated pipelines for preprocessing, model (re)training, testing, and deployment are much faster than manual processes. Eventually this will leave more time and headspace for the creative solutions that add value.
3. Scalability: Automation makes it easier to scale processes up or down as needed, without significant manual effort. You also want to build well designed pipelines where you can just adjust the configuration parameters and voilà, your ML runs at scale!
4. Version Control: Automated pipelines integrate with version control systems, making it easier to track changes, collaborate with team members, and roll back to previous versions. You can tie your pipelines to all kinds of version control events: a push, a commit on a certain branch, a certain merge request, etc. Customization here is only limited by your creativity, the sky’s the limit!
5. Reproducibility: Automated pipelines record all the steps and parameters used during model deployment. Together with a data snapshot, this makes it possible to recreate the exact same model. This can be crucial for fixing problems and in some cases auditing.
6. Testing, Quality Assurance and Validation: Automated pipelines can include comprehensive testing and validation steps. This helps catch issues early in the development process, ensuring high-quality code. You can write your own tests or use existing checks, for example in the form of pre-commit hooks. Check out my article on pre-commit hooks.
Are you convinced yet?
Okay chill Tom Cruise, no need to shout. Let’s check out the code.
The code
GitLab CI/CD is a powerful tool that allows you to build your own customised CI/CD pipelines. In this part, we’ll be building a CI pipeline. I will walk you through a simple GitLab CI configuration file for a Python project, focusing on its various stages and jobs. A GitLab CI configuration file usually lives in the root of your repository as `.gitlab-ci.yml`. It is a YAML file that defines your GitLab pipeline. For more information please see the documentation on GitLab CI/CD.
In this ML project repository I’ve included a GitLab CI configuration file. Its mere presence will create and trigger a CI pipeline on every code push (since I haven’t defined any other conditional statements for triggering). You can find the full CI configuration file here or check out its full contents at the bottom of this article.
I do not want you to look at the actual ML Python code too much! That is why it is just one preprocessing function with its unit test. We will build a full project repo with all the bells and whistles in a future article. For now we’ll just focus on the CI.
Let me explain the CI configuration file step by step, alternating snippets of code with explanations. Mind you, all these snippets should actually be concatenated together in one YAML file! You can find the full file at the end of the article or in the repository.
The start of our CI configuration
image: python:3.11
stages:
- test
- package
- docker
services:
- docker:20.10.17-dind
The start of the file defines the building blocks that we are going to use in our pipeline. Some are optional and some are mandatory. With a Docker-based runner, every job runs in a container! The rest is up to you and your use case.
`image: python:3.11`: This line specifies the base Docker image to be used for the CI/CD pipeline’s environment. In this case, it is a Linux-based image (the standard on GitLab) with Python 3.11. This will be our runtime environment.
`stages`: This section defines the different stages of the pipeline. We have three stages: `test`, `package`, and `docker`. Each stage represents a phase in the development and deployment process. There are some different conventions for this, but please organise it in a way that works for you and your teams! As we say in Dutch “it’s your party!”.
`services`: Here, you can specify any additional services needed during the CI/CD process. In this case, we will be using Docker as a service with version 20.10.17-dind (Docker in Docker). This allows us to build Docker images within our CI/CD pipeline, which we will want to do in the `docker` stage, the last stage of the pipeline.
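Because `services` sits at the top level here, the Docker-in-Docker service is attached to every job, including the Python-only ones. One possible variation, sketched below and not what the repository's file does, is to declare the service only on the job that needs it. The job name `build-image` matches the job shown later in this article; its `script` is shortened to a single illustrative line.

```yaml
# Sketch only: top-level `services` removed, so the lint/test/package jobs
# start without a Docker-in-Docker sidecar.
image: python:3.11

stages:
  - test
  - package
  - docker

build-image:
  stage: docker
  image: docker:20.10.17
  # The dind service is now started for this job only.
  services:
    - docker:20.10.17-dind
  variables:
    DOCKER_TLS_CERTDIR: ""
  script:
    - docker info
```

Whether this is worth it depends on your pipeline; for a small project like this one the difference is mostly a slightly faster start for the Python jobs.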
The stages of the pipeline run in sequence: `test` first, then `package`, then `docker`. Jobs within the same stage can run in parallel if your runner setup allows it, but we are not going to get into that for now. If a job fails, the later stages will not run and the pipeline is marked as “failed”. If all jobs are successful, the pipeline will have “passed”.
Note that before each job that requires pip I like to upgrade pip. Upgrading pip is important because it ensures that you have access to the latest features and bug fixes. Additionally, upgrading pip can help you avoid compatibility issues with other packages and dependencies. This will make your pipeline more robust! 💪🏽
Now, let’s dive into the individual jobs within each stage.
Important note: you’ll need a runner
To run pipelines you will need a GitLab runner. The availability and type of runner will depend on where and how you are running GitLab. The easiest way to play around with the code in this article is to just get a free GitLab SaaS account on gitlab.com. There you might need to create a runner if you have never used one before. I needed to do it on my new account here, but it was easy. I just navigated to Pipelines in the side menu and from there it was 3–4 clicks away. It should say something like “Looks like you don’t have a runner yet!” and give you a button to create one. If you are on self-hosted GitLab with self-hosted runners, become very friendly with your local platform engineer. You should be doing that anyway on your journey into MLOps ;)
Stage 1: Testing
lint:
  stage: test
  before_script:
    - pip install --upgrade pip
    - pip install pre-commit
  script:
    - pre-commit run --all-files
`lint` is the name of this job, which falls under the `test` stage. It’s responsible for ensuring code quality through linting. Here, linting simply means running pre-commit hooks. We do this through the `pre-commit` package, which uses the pre-commit configuration file to run a selection of pre-commit hooks. In this repo I just use three hooks: black, flake8 and mypy. It’s advisable to use more pre-commit hooks. To select the best hooks for your ML project check out my article on pre-commit hooks.
`before_script`: This section defines the commands to be executed before the `script` section runs. Here, as announced, we upgrade `pip` and install the `pre-commit` package.
`script`: In this section, we use `pre-commit` to run code linting checks on all project files. This uses existing libraries to maintain code consistency and identify potential issues early in the development process. It will be a big part of our automated testing. With the right collection of pre-commit hooks our code is screened against hundreds of protocols, auto formatted, auto adjusted and improved through active feedback. Isn’t the open source community just beautiful?
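For reference, a pre-commit configuration with the three hooks mentioned above could look roughly like the sketch below. It would live in the repository root as `.pre-commit-config.yaml`. The repository URLs are the hooks' public homes; the `rev` pins are assumptions chosen for illustration and may differ from what the actual repo uses, so pin them to the versions you want.

```yaml
# .pre-commit-config.yaml (illustrative sketch, not the article repo's actual file)
repos:
  - repo: https://github.com/psf/black
    rev: 23.9.1          # assumed version pin
    hooks:
      - id: black
  - repo: https://github.com/pycqa/flake8
    rev: 6.1.0           # assumed version pin
    hooks:
      - id: flake8
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.5.1          # assumed version pin
    hooks:
      - id: mypy
```

The `lint` job picks this file up automatically, because `pre-commit run --all-files` reads the configuration from the repository root.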
Unit Testing
unit-test:
  stage: test
  before_script:
    - pip install --upgrade pip
    - pip install -r requirements.txt
    - pip install .[test]
  script:
    - pytest
`unit-test` is another job within the `test` stage, dedicated to running unit tests.
`before_script`: Similar to the previous job, we upgrade `pip` and install project dependencies from `requirements.txt`. Additionally, we install test-specific dependencies using `pip install .[test]`. For this you need to define test-specific dependencies (a `test` extra) in your Python package configuration. This allows you to manage multiple testing dependencies in one place. I think it’s the right way to go, but it requires some context. The simpler route could be to install your test dependencies inline. Here that would mean `pip install .[test]` could be replaced by `pip install pytest`.
`script`: Here, we execute the `pytest` command, which runs the project’s unit tests. Unit tests validate the correctness of individual components of your code, provided you have written them.
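The simpler route described above could look like the sketch below. Installing the package itself with `pip install .` is an assumption on my part, added so that the tests can import the code under test; depending on your project layout you may not need it.

```yaml
# Sketch of the simpler route: no `test` extra needed in the package configuration.
unit-test:
  stage: test
  before_script:
    - pip install --upgrade pip
    - pip install -r requirements.txt
    - pip install .        # assumption: install the package so the tests can import it
    - pip install pytest   # test dependency installed inline instead of via .[test]
  script:
    - pytest
```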
Stage 2: Packaging
build-package:
  stage: package
  before_script:
    - pip install --upgrade build pip
  script:
    - python -m build --wheel
  artifacts:
    paths:
      - dist/*.whl
    expire_in: 1 week
`build-package` is a job within the `package` stage, responsible for building a Python package.
`before_script`: By now you know we upgrade pip, but this time we also install (or upgrade) the `build` package.
`script`: In this section, we execute `python -m build --wheel` to create a Python package distribution in wheel format. The resulting package will be used in the next stage.
`artifacts`: This section specifies the artifacts produced by this job. We are archiving the wheel distribution files in the `dist` directory, which can be used in subsequent stages. Here the artifacts will expire in one week, but choose whatever time window you’d like. To learn more check out GitLab’s artifact documentation.
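To make the hand-over to later stages concrete, here is a rough sketch of a job that consumes the wheel. The job name `install-wheel` is hypothetical and not part of the pipeline in this article. Jobs in later stages download artifacts from earlier stages by default; the `dependencies` keyword just narrows that down to the jobs listed.

```yaml
# Hypothetical consumer job, for illustration only.
install-wheel:
  stage: docker
  image: python:3.11
  # Only fetch artifacts from build-package (the default is all earlier stages).
  dependencies:
    - build-package
  script:
    - ls dist/                   # the wheel archived by build-package is restored here
    - pip install dist/*.whl
```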
Stage 3: Docker
build-image:
  stage: docker
  image: docker:20.10.17
  variables:
    DOCKER_DRIVER: overlay2
    DOCKER_TLS_CERTDIR: ""
    IMAGE_TAG: $CI_REGISTRY_IMAGE:$CI_COMMIT_REF_SLUG
  before_script:
    - docker info
  script:
    - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
    - docker build . -t $IMAGE_TAG
    - docker push $IMAGE_TAG
Here we’ll build a Docker image, using the Python package from above. For this part you’ll need some basic knowledge about Docker. The most important part is that `docker build .` will build the Docker image from the Dockerfile. There should be a Dockerfile in the root of your project repository. The Dockerfile contains the instructions on how to build your Docker image.
`build-image` is the single job within the `docker` stage, dedicated to building a Docker image from our Python project.
`image`: Here, we specify the Docker image to be used for this job. It’s using Docker version 20.10.17.
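`variables`: This section defines environment variables for the Docker build process. It tells Docker to use the `overlay2` storage driver, sets `DOCKER_TLS_CERTDIR` to an empty string so that the Docker-in-Docker service runs without generating TLS certificates, and defines `IMAGE_TAG`, the name and tag under which the image will be pushed.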
`before_script`: We use `docker info` to display system-wide information regarding the Docker installation. This may come in handy when we open our CI job’s output to see what’s up.
`script`: Here we have some $ signs for using CI variables. You can forget about them for now. The most important thing to note here is that the Docker image is built and tagged. Subsequently it’s pushed. Whereto? That’s the beauty: to the GitLab project’s own Container Registry, which can be found under Deploy > Container Registry in the navigation menu. So now we have an image sitting ready for anyone or any service that wants to pull it and spin it up.
Also note that every time we push a new image for the same branch, the current one with that tag in the container registry is overwritten. Don’t we want to save the old versions? No, we don’t have to, because we have all our code versions in the repository and all our pipelines are tied to commit hashes. We can always retrigger an old pipeline with an older version of the code to build an older version of our image. Now that’s reproducibility!
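If you would rather keep older images around in the registry, one option is to add a second, commit-specific tag. The sketch below is a variation on the job above, not what the article's repository does. `SHA_TAG` is a made-up variable name, while `CI_COMMIT_SHORT_SHA` is one of GitLab's predefined CI variables.

```yaml
build-image:
  stage: docker
  image: docker:20.10.17
  variables:
    DOCKER_DRIVER: overlay2
    DOCKER_TLS_CERTDIR: ""
    IMAGE_TAG: $CI_REGISTRY_IMAGE:$CI_COMMIT_REF_SLUG
    # Hypothetical extra tag: unique per commit, so it is never overwritten.
    SHA_TAG: $CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA
  script:
    - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
    - docker build . -t $IMAGE_TAG -t $SHA_TAG
    - docker push $IMAGE_TAG
    - docker push $SHA_TAG
```

The trade-off is registry storage: every commit leaves an image behind, so you would probably want some cleanup policy alongside it.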
Running the pipelines
Because we haven’t set any conditions for running the pipelines, and we have configured our GitLab runner to run untagged jobs, the pipeline will be fired off in full every time we push a commit. We can find an overview of our pipelines in the pipelines section of this project. When we click on one of the icons in the status column, for example “passed”, it takes us directly to an overview of the pipeline and its jobs.
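As a taste of what such conditions can look like, here is a minimal sketch using GitLab's `workflow` and `rules` keywords. It is not part of this article's configuration file; limiting pipelines to merge requests and the default branch is just an example choice.

```yaml
# Sketch: placed at the top level of .gitlab-ci.yml, this decides
# whether a pipeline is created at all.
workflow:
  rules:
    # Create a pipeline for merge request events...
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    # ...and for commits on the default branch (e.g. main).
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
```

With a block like this, pushes to other branches without a merge request would no longer start a pipeline.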
Unleash the power of pipelines
Congratulations! You can now set up a GitLab CI configuration file for an ML Python project. This configuration should make your life so much easier by automating various stages of deployment: linting, unit testing, building your package, creating a Docker image and pushing your image to a registry. Automating these tasks with GitLab CI/CD not only saves time but also ensures consistency and reliability throughout your ML projects, making it easier to maintain and deploy your models and applications.
In the next parts of this article series we will be:
Fully setting up our project repo so it’s ready for deployment
Polishing up our CI script with some docstrings and echo statements
Adding a smoke test
Adding and uploading your package to a package registry
Adding conditionals
Learning about dependency management
Caching things for your CI
Centrally configuring your CI for use with multiple, modularly built pipelines
Centrally configuring and managing your pre-commit hooks
Adding a CD step… maybe? 👨🏽💻 ⏩ 💥