CI/CD Pipelines for Databricks Projects: A Production-Ready Guide
One of the most common pain points for data engineering teams adopting Databricks is the lack of proper CI/CD. Notebooks get edited directly in the UI, "deployment" means copying files between folders, and production incidents are traced back to someone running ad-hoc code in the wrong workspace.
This guide walks through a production-grade CI/CD setup for Databricks using Databricks Asset Bundles (DAB) and GitHub Actions. By the end, you'll have automated testing, environment promotion, and deployment that your software engineering colleagues will actually respect.
The Problem with Notebook-First Development
Most Databricks teams start with notebooks. They're fast to iterate, easy to share, and require zero setup. But they scale poorly:
- No diff visibility — Git shows base64-encoded JSON, not readable Python
- Manual promotion — "Deploy to prod" means dragging a notebook between workspace folders
- No automated testing — how do you test a notebook before it runs on production data?
- Environment drift — prod and dev notebooks diverge silently
The solution is treating your Databricks project like a proper software project: code in .py files, infrastructure as code, automated tests, and deployment pipelines.
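As a flavor of what that shift looks like, here is a typical notebook cell rewritten as an importable, testable function. The function and field names are hypothetical, purely illustrative:

```python
# Before: dedup logic lived in a notebook cell and could only be verified
# by running it against workspace data. After: a pure function that pytest
# can exercise locally. (Names are illustrative, not from a real pipeline.)

def dedupe_latest(rows: list[dict], key: str = "event_id", ts: str = "ingested_at") -> list[dict]:
    """Keep only the most recent record per key."""
    latest: dict = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[ts] > latest[k][ts]:
            latest[k] = row
    return list(latest.values())
```

The same logic can later be rewritten against Spark DataFrames; the point is that it is now importable, diffable, and testable without a cluster.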
Project Structure
```
harbinger-pipelines/
├── .github/
│   └── workflows/
│       ├── ci.yml
│       └── deploy.yml
├── databricks.yml
├── resources/
│   ├── jobs.yml
│   └── pipelines.yml
├── src/
│   ├── pipelines/
│   │   ├── bronze/
│   │   │   └── ingest_events.py
│   │   ├── silver/
│   │   │   └── clean_events.py
│   │   └── gold/
│   │       └── aggregate_signals.py
│   └── utils/
│       ├── schema_utils.py
│       └── quality_checks.py
├── tests/
│   ├── unit/
│   │   └── test_clean_events.py
│   └── integration/
│       └── test_pipeline_e2e.py
├── notebooks/
├── pyproject.toml
├── requirements.txt
└── requirements-dev.txt
```
Databricks Asset Bundles (DAB)
Asset Bundles are Databricks' native IaC framework. They replace the older dbx tool and are the current recommended approach.
Root databricks.yml
```yaml
bundle:
  name: harbinger-pipelines

variables:
  environment:
    description: Deployment environment
    default: dev
  catalog:
    description: Unity Catalog name
    default: harbinger_dev

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://adb-7405613637854743.3.azuredatabricks.net
    variables:
      environment: dev
      catalog: harbinger_dev

  staging:
    mode: development
    workspace:
      host: https://adb-7405613637854743.3.azuredatabricks.net
    variables:
      environment: staging
      catalog: harbinger_staging

  prod:
    mode: production
    workspace:
      host: https://adb-7405613637854743.3.azuredatabricks.net
    variables:
      environment: prod
      catalog: harbinger_prod
    run_as:
      service_principal_name: harbinger-prod-sp
```
Job Definition (resources/jobs.yml)
```yaml
resources:
  jobs:
    events_ingestion_job:
      name: "harbinger-events-ingestion-${var.environment}"
      schedule:
        quartz_cron_expression: "0 0 * * * ?"
        timezone_id: "UTC"
      tasks:
        - task_key: ingest_bronze
          python_wheel_task:
            package_name: harbinger_pipelines
            entry_point: ingest_events
          job_cluster_key: default
        - task_key: clean_silver
          depends_on:
            - task_key: ingest_bronze
          python_wheel_task:
            package_name: harbinger_pipelines
            entry_point: clean_events
          job_cluster_key: default
      job_clusters:
        - job_cluster_key: default
          new_cluster:
            spark_version: "15.4.x-scala2.12"
            node_type_id: "Standard_DS3_v2"
            autoscale:
              min_workers: 2
              max_workers: 8
```
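For the python_wheel_task entries above to resolve, the wheel must register matching entry points in its packaging metadata, and each entry point is just a plain main-style function. A minimal sketch of what ingest_events might look like; the argument names, defaults, and module paths here are hypothetical:

```python
# src/pipelines/bronze/ingest_events.py (sketch)
# The wheel's pyproject.toml would expose this function as an entry point,
# e.g. (path is illustrative):
#   [project.scripts]
#   ingest_events = "harbinger_pipelines.pipelines.bronze.ingest_events:main"
import argparse


def target_table(catalog: str) -> str:
    """Compute the fully qualified bronze table for a given catalog."""
    return f"{catalog}.bronze.events"


def main() -> None:
    parser = argparse.ArgumentParser(description="Ingest raw events to bronze")
    parser.add_argument("--catalog", default="harbinger_dev")
    args = parser.parse_args()
    # A real implementation would read the source and write with Spark here;
    # keeping main() thin leaves target_table() trivially unit-testable.
    print(f"Writing to {target_table(args.catalog)}")
```

Keeping main() as a thin argument-parsing shell around pure helpers is the same I/O separation discussed in the testing section below.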
Setting Up GitHub Actions
CI Workflow (.github/workflows/ci.yml)
Runs on every pull request: lints, runs unit tests, validates the bundle configuration, and deploys to the dev target as a smoke test.
```yaml
name: CI

on:
  pull_request:
    branches: [main, staging]

jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: |
          pip install -r requirements-dev.txt
          pip install -e .

      - name: Run linting
        run: |
          ruff check src/ tests/
          mypy src/

      - name: Run unit tests
        run: pytest tests/unit/ -v --tb=short

      - name: Install Databricks CLI
        run: |
          curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

      - name: Validate bundle
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST_DEV }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN_DEV }}
        run: databricks bundle validate --target dev

      - name: Deploy to dev
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST_DEV }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN_DEV }}
        run: databricks bundle deploy --target dev
```
Deploy Workflow (.github/workflows/deploy.yml)
Runs on merge to main. Deploys to staging, runs integration tests, then promotes to prod.
```yaml
name: Deploy

on:
  push:
    branches: [main]

jobs:
  deploy-staging:
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: |
          pip install -r requirements-dev.txt
          pip install -e .

      - name: Install Databricks CLI
        run: curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

      - name: Deploy to staging
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN_STAGING }}
        run: databricks bundle deploy --target staging

      - name: Run integration tests
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN_STAGING }}
        run: pytest tests/integration/ -v --timeout=300

  deploy-prod:
    runs-on: ubuntu-latest
    needs: deploy-staging
    environment: production
    steps:
      - uses: actions/checkout@v4

      - name: Install Databricks CLI
        run: curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

      - name: Deploy to production
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN_PROD }}
        run: databricks bundle deploy --target prod
```

Note the added Python setup and dependency install in deploy-staging: without them, the pytest step has nothing to run with on a fresh runner.
Writing Testable Code
The key to testable Databricks code is separating business logic from Spark I/O.
```python
# src/pipelines/silver/clean_events.py
from pyspark.sql import DataFrame
from pyspark.sql.functions import col, trim, upper, when


def clean_event_type(df: DataFrame) -> DataFrame:
    """Normalize raw event_type aliases into canonical categories."""
    normalized = upper(trim(col("event_type")))
    return df.withColumn(
        "event_type",
        when(normalized.isin(["CONFLICT", "WAR", "BATTLE"]), "CONFLICT")
        .when(normalized.isin(["NATURAL_DISASTER", "EARTHQUAKE", "FLOOD"]), "DISASTER")
        .otherwise("UNKNOWN"),
    )


def filter_low_quality_events(df: DataFrame, min_severity: float = 0.5) -> DataFrame:
    """Drop events below the severity threshold (inclusive)."""
    return df.filter(col("severity") >= min_severity)
```
```python
# tests/unit/test_clean_events.py
import pytest
from pyspark.sql import SparkSession

from src.pipelines.silver.clean_events import clean_event_type, filter_low_quality_events


@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[1]").appName("test").getOrCreate()


def test_clean_event_type_normalizes_aliases(spark):
    data = [("war", 5.0), ("FLOOD", 7.0), ("political", 3.0)]
    df = spark.createDataFrame(data, ["event_type", "severity"])

    result = clean_event_type(df)

    types = {row.event_type for row in result.collect()}
    assert "CONFLICT" in types
    assert "DISASTER" in types
    assert "UNKNOWN" in types


def test_filter_removes_low_severity(spark):
    data = [("CONFLICT", 0.3), ("DISASTER", 0.8), ("CONFLICT", 0.5)]
    df = spark.createDataFrame(data, ["event_type", "severity"])

    result = filter_low_quality_events(df, min_severity=0.5)

    assert result.count() == 2
```
Branching Strategy
We recommend a trunk-based development model with short-lived feature branches:
```
main (production)
  |
  +-- feature/add-gdelt-source    PR -> main
  +-- feature/optimize-silver     PR -> main
  +-- hotfix/fix-null-handling    PR -> main (fast-track)
```
- main always reflects production state
- feature branches are short-lived (1-3 days max)
- No long-lived dev/staging branches — use environment targets in DAB instead
Secrets Management in CI/CD
Store Databricks tokens as GitHub Actions secrets (never in code):
```bash
gh secret set DATABRICKS_TOKEN_PROD --body "dapi..."
gh secret set DATABRICKS_TOKEN_STAGING --body "dapi..."
gh secret set DATABRICKS_HOST --body "https://adb-xxx.azuredatabricks.net"
```
In Databricks, use secret scopes rather than hardcoding credentials in job configs.
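In code, it helps to put the lookup behind one small helper so CI (environment variables) and notebooks (secret scopes) stay interchangeable. A sketch, with the environment-variable naming assumed to match the GitHub secrets above:

```python
import os


def get_databricks_token(env: str = "dev") -> str:
    """Resolve a Databricks token from the environment.

    In CI the token arrives as an env var injected from GitHub secrets
    (e.g. DATABRICKS_TOKEN_STAGING); inside a workspace you would call
    dbutils.secrets.get(scope=..., key=...) instead.
    """
    var = f"DATABRICKS_TOKEN_{env.upper()}"
    token = os.environ.get(var)
    if not token:
        raise KeyError(f"{var} is not set; check your CI secrets")
    return token
```

Failing loudly on a missing variable turns a silent auth failure mid-deploy into an obvious configuration error at startup.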
Monitoring Deployments
After deployment, verify the bundle state:
```bash
# Check what is deployed in each target
databricks bundle summary --target prod

# Run a job manually to verify
databricks bundle run events_ingestion_job --target prod
```
Common Issues and Fixes
| Issue | Cause | Fix |
|---|---|---|
| bundle validate fails in CI | Missing DATABRICKS_HOST env var | Set secrets in GitHub repo settings |
| Unit tests fail with Spark errors | JVM not installed in CI runner | Add an actions/setup-java step with java-version: '11' |
| Deploy succeeds but job fails | Wheel not uploaded before job run | Ensure bundle deploy uploads artifacts |
| Integration tests timeout | Dev cluster cold start | Use keep_alive: true for test clusters |
| Prod deploy skipped | Staging tests failed | Fix staging tests; never skip the gate |
Wrapping Up
Proper CI/CD for Databricks is not optional — it's what separates teams that ship confidently from teams that dread Friday deploys. Databricks Asset Bundles, combined with GitHub Actions, give you a clean, version-controlled, environment-aware deployment pipeline.
Start small: even just adding bundle validate to your PR checks will catch configuration errors before they reach production.
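If you adopt nothing else from this guide, a validate-only PR check is a one-file change. A minimal sketch, assuming the same secret names used in the CI workflow above:

```yaml
name: Validate Bundle
on:
  pull_request:
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Databricks CLI
        run: curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
      - name: Validate bundle
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST_DEV }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN_DEV }}
        run: databricks bundle validate --target dev
```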
Try Harbinger Explorer free for 7 days — built on the same CI/CD principles described here, deploying multiple times per day with confidence. Start your free trial at harbingerexplorer.com.