Environment reproducibility, and keeping development and production environments in sync, is one of the most important principles developers should follow.
Achieving this on Databricks is challenging because there is no perfect solution. But let's find the one with as few compromises as possible.
In this article, I explore multiple options:
Databricks notebooks
VS Code + Databricks extension + virtual environment
VS Code + custom docker image (+ devcontainers)
1. Databricks notebooks
The most common (and most controversial) approach you will find everywhere is using a Databricks notebook. Let's explore how to do that in the best possible way.
Step 1: Use Git Folder
First of all, do not create a notebook directly in the workspace. Instead, create a Git folder that is initialized from the Git repository where the project is located. You can authenticate by linking your Git account to Databricks.
After the folder is created, you can create a notebook. Note that a Databricks notebook is stored as a Python file by default. The first line in this file is "# Databricks notebook source"; that is how the file gets recognized as a Databricks notebook. Cells are delimited by lines containing "# COMMAND ----------".
In the Databricks UI the notebook is rendered as separate cells; in the repository it is saved as a single Python file.
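As a sketch (assuming a simple notebook with one markdown cell and one code cell), the saved file would look roughly like this:
# Databricks notebook source
# MAGIC %md
# MAGIC # Demo notebook

# COMMAND ----------

# a regular code cell
import pandas as pd

df = pd.DataFrame({"value": [1, 2, 3]})
display(df)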
Step 2: Customized Environment
To run a Databricks notebook, you need to choose a cluster. Let's assume we are working with classic compute and choose runtime 15.4 LTS. That runtime comes with preinstalled libraries: https://docs.databricks.com/en/release-notes/runtime/15.4lts.html. These pre-installed libraries may not match the requirements of your project.
Notebooks do not come with a clean environment. If you need some specific package versions (for example, a newer version of pandas than the one available within the runtime), you must install them on top of the defined environment.
To install specific libraries in a Databricks notebook, use the %pip magic command. This allows you to install them from PyPI, or from a wheel located in the Workspace or in a Volume. Here is the official documentation with the steps: https://docs.databricks.com/en/libraries/notebooks-python-libraries.html
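For example (the package version and the Volume path below are placeholders, not part of the project):
# install a specific version from PyPI
%pip install pandas==2.2.2

# or install a wheel stored in a Unity Catalog Volume (hypothetical path)
%pip install /Volumes/main/default/artifacts/demo-0.0.1-py3-none-any.whl

# restart the Python process (usually in a separate cell) so the new versions are picked up
dbutils.library.restartPython()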
You can also install libraries on top of the cluster by modifying the cluster specification; these libraries will then be available in any notebook that uses that cluster. Project modules can be imported using relative imports, but it is not possible to install packages in editable mode.
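For example, assuming the repository contains a package directory demo/ next to the notebook (module and function names are hypothetical), the notebook can do:
# the Git folder root is added to sys.path automatically for notebooks
from demo.preprocessing import clean_data  # hypothetical project module

df_clean = clean_data(df)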
Step 3: Commit Your Changes
When you modify a notebook, make sure to commit your changes and push them to the Git branch you are working on. This ensures your work is version-controlled.
Downsides:
It is not an IDE. Even though it is now possible to use a debugger in a Databricks notebook, you do not get all the advantages of an IDE, such as code linters and formatters, and it is hard to work on packaging your code or on unit testing in a Databricks notebook.
Merge conflicts. It can be hard to resolve merge conflicts using Git folders, and you cannot use pre-commit hooks this way.
Code quality risks. Code quality will eventually suffer if the team chooses this approach to develop on Databricks.
Upsides:
Environment consistency. It is the only approach that guarantees environment consistency while developing on Databricks and running the code in production as a notebook scheduled in a workflow.
Handles high memory requirements. If your code has high memory requirements on the driver node and you cannot work on a sample of the data (for example, if you need to load a PySpark dataframe into pandas to train a scikit-learn model), this is the only way to develop.
Since I am not a fan of notebooks, I explored all other ways to develop on Databricks.
Let’s move to an alternative: using VS Code with Databricks extension.
2. VS Code + Databricks extension + virtual environment
Many developers prefer writing code in a Python IDE (for example, VS Code), where the first thing you do is create a virtual environment. On the Databricks side, a runtime must be specified when defining the cluster that will run the code as part of a workflow. As mentioned earlier, a Databricks runtime already comes with a certain Python version and preinstalled libraries.
The question arises: how can you create a virtual environment that keeps the development and production environments in sync? The answer is: you can’t fully achieve this, but you can create an approximation of the production environment. With integration testing included, you can ensure everything runs as intended in production.
Step 1: Set Up a Virtual Environment
Your first instinct may be to install all the runtime Python dependencies in a virtual environment that matches the runtime's Python version. This will likely fail to resolve the dependencies (depending on which OS you are using). So, just start with a plain virtual environment and only pin the exact versions of the packages that you need.
Let’s start with the assumption that our project has these high-level dependencies specified in a very minimalistic pyproject.toml file:
[project]
name = "demo"
version = "0.0.1"
description = "Demo project"
requires-python = ">=3.11"
dependencies = [
    "mlflow==2.16.0",
    "numpy==1.26.4",
    "pandas==2.2.2",
]

[project.optional-dependencies]
dev = [
    "databricks-connect>=15.4.1, <16",
    "databricks-sdk>=0.32.0, <0.33",
    "ipykernel>=6.29.5, <7",
    "pip>=24.2",
]
Step 2: Install and Lock Dependencies
Let's use uv to create a virtual environment and install all the requirements, including the optional dependencies. You will notice that many more packages are installed than we specified; those are intermediate (transitive) dependencies. We lock them using the uv lock command and commit the platform-agnostic lock file to the code repository so that all developers use the same versions of all packages.
uv venv -p 3.11.0 .venv
source .venv/bin/activate
uv pip install -r pyproject.toml --all-extras
uv lock
Step 3: Run Code Locally and on Databricks
Let's say we want to run a Python script that contains some Spark code. We can run that script using Databricks Connect, which is specified as an optional dependency. Spark code is executed on the cluster, while non-Spark code is executed locally. You can read more about it in the documentation: https://docs.databricks.com/en/dev-tools/databricks-connect/index.html
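Here is a minimal sketch of such a script, assuming the Databricks extension or a CLI profile already provides the connection details, and using a sample table that is available in many workspaces:
# Databricks Connect creates a Spark session that talks to a remote cluster
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()

# the query runs on the cluster; the resulting pandas dataframe lives locally
df = spark.sql("SELECT * FROM samples.nyctaxi.trips LIMIT 10").toPandas()
print(df.head())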
So, we can run our code locally, but what about executing it as part of a workflow on Databricks? We cannot install all the intermediate requirements from the local lock file on the cluster; we would run into the same problems as when trying to install all the runtime libraries in the local virtual environment.
This means we can only install the high-level requirements again, this time specified as task requirements. Keep in mind that we do not install them into an empty environment, and that the intermediate packages of the runtime do not change over time (except for bug and security fixes). So, it is safe to assume that specifying only high-level requirements keeps the environment stable.
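One way to express this is through the databricks-sdk that we already have as a dev dependency (a sketch; the cluster id and script path are hypothetical, and the same can be configured via the UI or asset bundles):
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute, jobs

w = WorkspaceClient()

# only the high-level requirements are installed on top of the runtime
task_libraries = [
    compute.Library(pypi=compute.PythonPyPiLibrary(package="mlflow==2.16.0")),
    compute.Library(pypi=compute.PythonPyPiLibrary(package="numpy==1.26.4")),
    compute.Library(pypi=compute.PythonPyPiLibrary(package="pandas==2.2.2")),
]

w.jobs.create(
    name="demo-workflow",
    tasks=[
        jobs.Task(
            task_key="main",
            existing_cluster_id="1234-567890-abcdefgh",  # hypothetical cluster id
            spark_python_task=jobs.SparkPythonTask(
                python_file="/Workspace/Users/me@example.com/demo/main.py"  # hypothetical path
            ),
            libraries=task_libraries,
        )
    ],
)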
To make sure that the code developed locally runs in exactly the same way as part of a workflow, we must have integration tests that run the code on Databricks on a sample of the data.
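A minimal sketch of such a test, assuming a hypothetical transformation function add_features in our demo package and reusing Databricks Connect:
import pandas as pd
from databricks.connect import DatabricksSession

from demo.features import add_features  # hypothetical project module


def test_add_features_on_databricks():
    # run the transformation on the cluster against a tiny sample of data
    spark = DatabricksSession.builder.getOrCreate()
    sample = spark.createDataFrame(pd.DataFrame({"amount": [1.0, 2.0, 3.0]}))
    result = add_features(sample).toPandas()
    assert "amount" in result.columns
    assert len(result) == 3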
Downsides:
Environment approximation. Only an approximation of the production environment can be created locally.
Compatibility issues. Some commands may not work via Databricks Connect (for example, SQL commands that include table_changes). I also experienced some problems with the feature-engineering package, which worked without issues in a notebook.
High memory requirements. If the code that must be executed locally has high memory requirements, this option will not work.
Upsides:
IDE experience. You get all the benefits of an IDE.
Best practices. It is easier to promote best coding practices in the team.
3. VS Code + custom docker image (+ devcontainers)
There is another option that involves as few compromises as possible, but it is the trickiest one.
Step 1: Define a Dockerfile
A Databricks cluster can be initialized from a Docker image, for example one built from a Dockerfile that looks like this:
FROM databricksruntime/python:15.4-LTS
ARG PROJECT_DIR=/project
RUN pip install uv==0.4.20
WORKDIR ${PROJECT_DIR}
COPY pyproject.toml ${PROJECT_DIR}/
RUN uv pip install --python /databricks/python3 -r pyproject.toml
A few things are important to notice:
The base image is databricksruntime/python:15.4-LTS, which matches our chosen runtime.
Project dependencies must be installed into /databricks/python3, which is an existing Python environment with some pre-installed libraries; we install our libraries on top of those.
Step 2: Set up a devcontainer
Because we modify an existing environment, it is not straightforward to reproduce it locally, unless we use this same Dockerfile to create a devcontainer and develop inside the devcontainer.
Here is one of our earlier articles on devcontainers. It is important to set up the Databricks extension within the devcontainer; we will use it to develop on Databricks just like in the previous setup we discussed.
After the devcontainer is set up and the extension is configured, you need to create a virtual environment. To do that, point uv to the Databricks-specific Python installation instead of a plain Python 3.11:
uv venv -p /databricks/python3 .venv
Step 3: Build an image & use it to initialize a cluster
Now you can develop as usual and, when done, build and push the image to a Docker registry:
docker build -t marvelousmlops/databricks_docker:latest .
docker push marvelousmlops/databricks_docker:latest
Then use the image to initialize the cluster. This can be done via the UI, and of course you can also use the image in a Databricks job and set it up programmatically. If the Docker tab is not available in the cluster creation UI, a workspace administrator can enable it by running the CLI command:
databricks workspace-conf set-status --json '{"enableDcs": "true"}'
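For the programmatic route, here is a sketch using the databricks-sdk (the node type is an assumption and depends on your cloud; for a private registry you would also pass basic_auth):
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute

w = WorkspaceClient()

w.clusters.create(
    cluster_name="docker-runtime-cluster",
    spark_version="15.4.x-scala2.12",
    node_type_id="i3.xlarge",  # assumption: AWS node type, adjust per cloud
    num_workers=1,
    docker_image=compute.DockerImage(
        url="marvelousmlops/databricks_docker:latest",
        # for a private registry: basic_auth=compute.DockerBasicAuth(username=..., password=...)
    ),
)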
Enjoy!
Downsides:
Can be slow. This option requires a beefy machine.
Compatibility issues; the Databricks extension may not always work as expected.
Complicated.
Code may run locally but break on Databricks, because Databricks installs some extra "secret sauce" on top of the runtime image. This is very hard to debug.
Upsides: Everything else :-)
Conclusion
Each method for developing on Databricks has its advantages and disadvantages. Databricks notebooks offer consistency with production but lack the full capabilities of an IDE. VS Code with a virtual environment allows for more advanced coding features, but it only approximates the production environment. Using Docker gives more control, but comes with extra complexity.
By understanding the trade-offs of each approach, you can choose the best development strategy for your team and project needs.