Databricks recently introduced Free Edition, which opened the door for us to create a free hands-on course on MLOps with Databricks.
This article is part of that course series, where we walk through the tools, patterns, and best practices for building and deploying machine learning workflows on Databricks.
Watch lecture 8 on YouTube:
In this lecture, we’ll explore how to structure your data and assets for robust, secure, and scalable machine learning operations on Databricks, and how to automate deployments using CI/CD pipelines.
Unity Catalog, Workspaces, and Data Organization
We’ve already interacted with Unity Catalog, using it to create Delta tables and register models. For a workspace to use Unity Catalog, it must be attached to a Unity Catalog metastore, which is the top-level container for all data and AI asset metadata.
There can be only one metastore per cloud region per account, and each workspace can be attached to only one metastore.
Unity Catalog organizes assets in a three-tier hierarchy:
Catalogs (e.g., mlops_dev, mlops_acc, mlops_prd)
Schemas within catalogs (in our case, we have the same schema name in each catalog, marvel_characters)
Assets within schemas (tables, views, models, etc.)
Assets are referenced using a three-part naming convention: catalog.schema.asset.
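For example, here is a minimal sketch (the table and model names are illustrative, not the exact ones from the course repository) of how the three-part names are used from a Databricks notebook, where spark is predefined:

# Read a Delta table via its three-part name: catalog.schema.asset
# (the table name below is illustrative)
df = spark.table("mlops_dev.marvel_characters.characters_features")

# Write results back into the dev catalog using the same convention
df.write.mode("overwrite").saveAsTable("mlops_dev.marvel_characters.characters_predictions")

# Registered models follow the same convention in Unity Catalog, e.g.
# "mlops_prd.marvel_characters.marvel_character_model"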
Access Control: Securables and Permissions
In Databricks, permissions can be set at the workspace level and at the Unity Catalog level.
Workspace-level securables (notebooks, clusters, jobs): access is controlled via access control lists (ACLs).
Unity Catalog-level securables (tables, schemas, models): access is controlled via privileges managed in the metastore.
Workspace binding and access modes: a catalog in OPEN mode can be accessed from any workspace attached to the metastore; set it to ISOLATED to restrict it to specific workspaces and control cross-project access.
In a typical setup, an ML project or team has a set of Databricks workspaces (dev, acc, and prd) and a set of catalogs, or schemas within a larger catalog.
In the course example, we use a single shared workspace for development, acceptance, and production for simplicity. However, we use dedicated catalogs for dev, acc, and prd with limited permissions, so that things do not get broken unintentionally.
In an ideal setup, you follow these rules:
All ML pipelines from all workspaces (dev, acc, prd) have read access to production data (e.g., prd_gold), ensuring consistency.
From each workspace, data can only be written to that workspace’s own catalog.
Users only have direct access to the dev workspace; deployments to acc/prd must go through CI/CD pipelines, using service principals for security and traceability.
In the example below, the hotel booking team and the dev SPN have write permissions on mlops_dev.hotel_booking, read permissions on prd_gold.hotel_booking (production data delivered by the data engineering team), and read permissions on the hotel_booking schema in the mlops_acc and mlops_prd catalogs.
Service principals (SPNs) have scoped access and operate only within their intended workspace and catalog.
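To make the rules above concrete, here is a rough sketch of what the corresponding Unity Catalog grants could look like when run from a notebook; the principal name hotel-booking-team and the exact privilege sets are assumptions, not the course’s actual configuration:

# Illustrative Unity Catalog grants for the hotel_booking example
# (principal and object names are assumptions)
grants = [
    # Read access to production data delivered by the data engineering team
    "GRANT USE CATALOG ON CATALOG prd_gold TO `hotel-booking-team`",
    "GRANT USE SCHEMA, SELECT ON SCHEMA prd_gold.hotel_booking TO `hotel-booking-team`",
    # Write access only to the team's own dev catalog
    "GRANT USE CATALOG ON CATALOG mlops_dev TO `hotel-booking-team`",
    "GRANT ALL PRIVILEGES ON SCHEMA mlops_dev.hotel_booking TO `hotel-booking-team`",
    # Read-only access to the acc and prd schemas
    "GRANT USE CATALOG ON CATALOG mlops_acc TO `hotel-booking-team`",
    "GRANT USE SCHEMA, SELECT ON SCHEMA mlops_acc.hotel_booking TO `hotel-booking-team`",
    "GRANT USE CATALOG ON CATALOG mlops_prd TO `hotel-booking-team`",
    "GRANT USE SCHEMA, SELECT ON SCHEMA mlops_prd.hotel_booking TO `hotel-booking-team`",
]
for statement in grants:
    spark.sql(statement)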
Branching and Release Strategy
In the course, we use a version of Git Flow:
Feature branches are created from main.
Developers open PRs to main, triggering the CI pipeline.
CI runs pre-commit checks, unit tests, and version checks.
At least 2 approvals are required to merge (enforced via branch protection rules).
Direct pushes to main are not allowed.
Once merged, the CD pipeline deploys to acceptance and production, using environment-scoped secrets and SPNs. Production deployment should be protected by deployment protection rules (only deployed after approval).
CI/CD in Action
Let’s look at how this is implemented in our Marvelous MLOps codebase.
CI Pipeline: .github/workflows/ci.yml
This pipeline is triggered on every pull request to main:
name: CI

on:
  pull_request:
    branches:
      - main

jobs:
  pytest_and_checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 #v4.2.2
        with:
          # Fetch all history for all branches and tags
          fetch-depth: 0
          ref: ${{ github.head_ref }}
      - name: Git tag from version.txt
        run: |
          VERSION=$(cat version.txt)
          echo "VERSION=$VERSION"
          git tag $VERSION
      - name: Install uv
        uses: astral-sh/setup-uv@0c5e2b8115b80b4c7c5ddf6ffdd634974642d182 #v5.4.1
      - name: Install the dependencies
        run: |
          uv sync --extra test
      - name: Run pre-commit checks
        run: |
          uv run pre-commit run --all-files
      - name: Run pytest
        run: |
          uv run pytest -m "not ci_exclude"
What it does:
Runs on pull requests targeting main
Installs dependencies, runs pre-commit checks (linting) and tests
Creates a git tag from version.txt; this fails if the tag already exists, ensuring the version is unique and preventing accidental duplicate releases
CD Pipeline: .github/workflows/cd.yml
This pipeline is triggered after a successful merge to main:
name: CD

on:
  workflow_dispatch:
  push:
    branches:
      - 'main'

jobs:
  deploy:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        environment: [acc, prd]
    environment: ${{ matrix.environment }}
    permissions:
      contents: write # to push tag
    env:
      DATABRICKS_HOST: ${{ vars.DATABRICKS_HOST }}
      DATABRICKS_CLIENT_ID: ${{ secrets.DATABRICKS_CLIENT_ID }}
      DATABRICKS_CLIENT_SECRET: ${{ secrets.DATABRICKS_CLIENT_SECRET }}
    steps:
      - name: Checkout Source Code
        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 #v4.2.2
      - name: Install Databricks CLI
        uses: databricks/setup-cli@49580195afe1ccb06d195764a1d0ae9fabfe2edd #v0.246.0
        with:
          version: 0.246.0
      - name: Configure Databricks CLI
        run: |
          mkdir -p ~/.databricks
          cat > ~/.databrickscfg << EOF
          [marvelous]
          host = ${{ vars.DATABRICKS_HOST }}
          client_id = ${{ secrets.DATABRICKS_CLIENT_ID }}
          client_secret = ${{ secrets.DATABRICKS_CLIENT_SECRET }}
          EOF
      - name: Install uv
        uses: astral-sh/setup-uv@0c5e2b8115b80b4c7c5ddf6ffdd634974642d182 #v5.4.1
      - name: Deploy to Databricks
        env:
          DATABRICKS_BUNDLE_ENV: ${{ matrix.environment }}
        run: |
          databricks bundle deploy \
            --var="git_sha=${{ github.sha }}" \
            --var="branch=${{ github.ref_name }}"
          if [ "${{ matrix.environment }}" = "prd" ]; then
            VERSION=$(cat version.txt)
            echo "VERSION=$VERSION"
            git tag $VERSION
            git push origin $VERSION
          fi
What it does:
Runs on pushes to main (and can also be triggered manually via workflow_dispatch)
Builds the wheel (databricks bundle deploy takes care of that)
Deploys the Lakeflow job to both acceptance and production using environment-specific secrets
Uses the Databricks CLI to deploy bundles
Setting Up Service Principals (SPNs) for Secure CI/CD
For your CD pipeline to deploy to Databricks automatically and securely, you must use a Service Principal (SPN). This ensures that all deployments are performed by a dedicated identity with tightly scoped permissions, rather than by personal user credentials.
A Service Principal is a special, non-human identity used by applications, automation tools, or CI/CD pipelines to authenticate and interact with cloud resources securely.
Why use an SPN instead of a user account?
Security: SPNs are not tied to any individual, so if someone leaves the team, you don’t risk losing access or exposing credentials.
Least privilege: SPNs can be granted only the permissions they need for deployment — nothing more.
Auditability: All actions performed by the CI/CD pipeline are clearly attributable to the SPN, making it easy to track changes and meet compliance requirements.
Automation: SPNs enable fully automated, hands-off deployments, since their credentials (client ID and secret) can be securely stored in your CI/CD system (like GitHub Actions).
Configuring SPNs
1. Create a Service Principal in Databricks:
Go to the Databricks workspace admin console.
Navigate to User Management → Service Principals.
Click Add Service Principal and follow the prompts.
Generate an OAuth secret for the service principal, and note the Client ID and Client Secret.
Assign Permissions to the SPN:
Grant the SPN the necessary privileges on Unity Catalog objects (the relevant catalogs and schemas) and workspace resources (e.g., CAN_MANAGE on the relevant jobs and endpoints).
Make sure the SPN has access only to the environments it should deploy to (e.g., acc and prd). A scripted alternative to these UI steps is sketched below.
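If you prefer to script these steps instead of clicking through the UI, here is a minimal, hypothetical sketch using the Databricks Python SDK (databricks-sdk); the display name is an assumption, and the OAuth secret is still generated in the admin console:

# Hypothetical sketch: create the CD service principal programmatically.
# Requires workspace admin rights; the display name is an assumption.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # authenticates with your own (admin) credentials

spn = w.service_principals.create(display_name="mlops-cd-spn")
print(spn.application_id)  # this application ID is the client ID used by CI/CD

# Unity Catalog privileges can then be granted to the SPN, for example:
# GRANT ALL PRIVILEGES ON SCHEMA mlops_acc.marvel_characters TO `<application-id>`;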
2. Add SPN Credentials to GitHub Actions (or your CI/CD system)
In your GitHub repository, go to Settings → Environments. Create environments “prd” and “acc”.
Add the following Environment secrets:
DATABRICKS_CLIENT_ID (the SPN’s client ID)
DATABRICKS_CLIENT_SECRET (the SPN’s client secret)
Add the following Environment variable:
DATABRICKS_HOST (your Databricks workspace URL)
These secrets and variables will be available to your GitHub Actions workflows and can be referenced as environment variables.
In .github/workflows/cd.yml, we reference these secrets and this variable in the env block of the deploy job, and we set the GitHub environment per matrix run:
strategy:
  matrix:
    environment: [acc, prd]
environment: ${{ matrix.environment }}
env:
  DATABRICKS_HOST: ${{ vars.DATABRICKS_HOST }}
  DATABRICKS_CLIENT_ID: ${{ secrets.DATABRICKS_CLIENT_ID }}
  DATABRICKS_CLIENT_SECRET: ${{ secrets.DATABRICKS_CLIENT_SECRET }}
The Databricks CLI and bundle deployment commands will pick up these variables and authenticate as the SPN.
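To verify locally that the SPN credentials work before wiring them into GitHub Actions, you can run a small check like this sketch, assuming DATABRICKS_HOST, DATABRICKS_CLIENT_ID, and DATABRICKS_CLIENT_SECRET are exported in your shell:

# Sketch: the Databricks SDK, like the CLI, picks up DATABRICKS_HOST,
# DATABRICKS_CLIENT_ID and DATABRICKS_CLIENT_SECRET from the environment
# and authenticates via OAuth as the service principal.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
print(w.current_user.me().user_name)  # the identity you are authenticated as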
Bonus: Authenticating to Serverless Endpoints with Service Principals
In Lecture 6, we showed how to send requests to a Databricks Serverless endpoint using a Personal Access Token (PAT).
However, for production systems and service-to-service communication, Service Principals (SPNs) provide a more secure and scalable authentication method.
Step 1: Grant Permissions to Your SPN
Before making requests, ensure your Service Principal has CAN_QUERY permission on the model serving endpoint:
Navigate to your endpoint in the Databricks UI
Click on “Permissions”
Add your Service Principal with “Can Query” permission
Step 2: Generate an OAuth Token Using SPN Credentials
import os

import requests
from requests.auth import HTTPBasicAuth


def get_token():
    response = requests.post(
        f"https://{os.environ['DBR_HOST']}/oidc/v1/token",
        auth=HTTPBasicAuth(
            os.environ["DATABRICKS_CLIENT_ID"],
            os.environ["DATABRICKS_CLIENT_SECRET"],
        ),
        data={
            "grant_type": "client_credentials",
            "scope": "all-apis",
        },
    )
    return response.json()["access_token"]


os.environ["DBR_TOKEN"] = get_token()
Step 3: Use the Token to Call Your Model Endpoint
Now you can use this token with the same endpoint invocation code from Lecture 6, replacing the authentication method:
def call_endpoint(record):
    """
    Calls the model serving endpoint with a given input record.
    """
    # Ensure the host URL is complete with domain suffix (.com, etc.)
    host = os.environ["DBR_HOST"]
    # If the host doesn't contain a dot, it's likely missing the domain suffix
    if "." not in host:
        print(f"Warning: DBR_HOST '{host}' may be incomplete. Adding '.com' domain suffix.")
        host = f"{host}.com"

    serving_endpoint = f"https://{host}/serving-endpoints/marvel-character-model-serving/invocations"
    print(f"Calling endpoint: {serving_endpoint}")

    response = requests.post(
        serving_endpoint,
        headers={"Authorization": f"Bearer {os.environ['DBR_TOKEN']}"},
        json={"dataframe_records": record},
    )
    return response.status_code, response.text
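As a quick sanity check, you could call the function with a single record; the feature names and values below are placeholders, not the actual Marvel character schema:

# Placeholder record: use the real feature names from your training data
sample_record = [{"feature_1": 1.0, "feature_2": "hero"}]

status_code, response_text = call_endpoint(sample_record)
print(status_code)
print(response_text)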
Key Takeaways
In this lecture, we discussed and showed how:
Catalogs, schemas, and workspaces provide clean separation and access control.
Service principals ensure automation is secure and scoped.
Git flow and branch protection rules enforce code quality and review.
CI/CD pipelines automate validation and deployment, with no manual pushes to production.
In the next lecture, we’ll talk about monitoring: machine learning models only start living once they are deployed, and there is no MLOps without proper monitoring.