CI/CD Pipelines for Databricks Projects: A Production-Ready Guide

11 min read · Tags: databricks, cicd, github-actions, devops, asset-bundles, testing

One of the most common pain points for data engineering teams adopting Databricks is the lack of proper CI/CD. Notebooks get edited directly in the UI, "deployment" means copying files between folders, and production incidents are traced back to someone running ad-hoc code in the wrong workspace.

This guide walks through a production-grade CI/CD setup for Databricks using Databricks Asset Bundles (DAB) and GitHub Actions. By the end, you'll have automated testing, environment promotion, and deployment that your software engineering colleagues will actually respect.


The Problem with Notebook-First Development

Most Databricks teams start with notebooks. They're fast to iterate, easy to share, and require zero setup. But they scale poorly:

  • No diff visibility — Git shows base64-encoded JSON, not readable Python
  • Manual promotion — "Deploy to prod" means dragging a notebook between workspace folders
  • No automated testing — how do you test a notebook before it runs on production data?
  • Environment drift — prod and dev notebooks diverge silently

The solution is treating your Databricks project like a proper software project: code in .py files, infrastructure as code, automated tests, and deployment pipelines.


Project Structure

harbinger-pipelines/
├── .github/
│   └── workflows/
│       ├── ci.yml
│       └── deploy.yml
├── databricks.yml
├── resources/
│   ├── jobs.yml
│   └── pipelines.yml
├── src/
│   ├── pipelines/
│   │   ├── bronze/
│   │   │   └── ingest_events.py
│   │   ├── silver/
│   │   │   └── clean_events.py
│   │   └── gold/
│   │       └── aggregate_signals.py
│   └── utils/
│       ├── schema_utils.py
│       └── quality_checks.py
├── tests/
│   ├── unit/
│   │   └── test_clean_events.py
│   └── integration/
│       └── test_pipeline_e2e.py
├── notebooks/
├── pyproject.toml
├── requirements.txt
└── requirements-dev.txt

Databricks Asset Bundles (DAB)

Asset Bundles are Databricks' native infrastructure-as-code (IaC) framework for declaring jobs, pipelines, and other workspace resources in YAML alongside your code. They replace the older dbx tool and are the currently recommended approach.

Root databricks.yml

bundle:
  name: harbinger-pipelines

variables:
  environment:
    description: Deployment environment
    default: dev
  catalog:
    description: Unity Catalog name
    default: harbinger_dev

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://adb-7405613637854743.3.azuredatabricks.net
    variables:
      environment: dev
      catalog: harbinger_dev

  staging:
    mode: development
    workspace:
      host: https://adb-7405613637854743.3.azuredatabricks.net
    variables:
      environment: staging
      catalog: harbinger_staging

  prod:
    mode: production
    workspace:
      host: https://adb-7405613637854743.3.azuredatabricks.net
    variables:
      environment: prod
      catalog: harbinger_prod
    run_as:
      service_principal_name: harbinger-prod-sp

Job Definition (resources/jobs.yml)

resources:
  jobs:
    events_ingestion_job:
      name: "harbinger-events-ingestion-${var.environment}"
      schedule:
        quartz_cron_expression: "0 0 * * * ?"
        timezone_id: "UTC"
      tasks:
        - task_key: ingest_bronze
          python_wheel_task:
            package_name: harbinger_pipelines
            entry_point: ingest_events
          job_cluster_key: default

        - task_key: clean_silver
          depends_on:
            - task_key: ingest_bronze
          python_wheel_task:
            package_name: harbinger_pipelines
            entry_point: clean_events
          job_cluster_key: default

      job_clusters:
        - job_cluster_key: default
          new_cluster:
            spark_version: "15.4.x-scala2.12"
            node_type_id: "Standard_DS3_v2"
            autoscale:
              min_workers: 2
              max_workers: 8

Setting Up GitHub Actions

CI Workflow (.github/workflows/ci.yml)

Runs on every pull request: lints the code, runs unit tests, validates the bundle configuration, and deploys to the dev target so reviewers can inspect the deployed result.

name: CI

on:
  pull_request:
    branches: [main]

jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: |
          pip install -r requirements-dev.txt
          pip install -e .

      - name: Run linting
        run: |
          ruff check src/ tests/
          mypy src/

      - name: Run unit tests
        run: pytest tests/unit/ -v --tb=short

      - name: Install Databricks CLI
        run: |
          curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

      - name: Validate bundle
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST_DEV }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN_DEV }}
        run: databricks bundle validate --target dev

      - name: Deploy to dev
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST_DEV }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN_DEV }}
        run: databricks bundle deploy --target dev

Deploy Workflow (.github/workflows/deploy.yml)

Runs on merge to main. Deploys to staging, runs integration tests, then promotes to prod.

name: Deploy

on:
  push:
    branches: [main]

jobs:
  deploy-staging:
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: |
          pip install -r requirements-dev.txt
          pip install -e .

      - name: Install Databricks CLI
        run: curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

      - name: Deploy to staging
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN_STAGING }}
        run: databricks bundle deploy --target staging

      - name: Run integration tests
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN_STAGING }}
        run: pytest tests/integration/ -v --timeout=300

  deploy-prod:
    runs-on: ubuntu-latest
    needs: deploy-staging
    environment: production
    steps:
      - uses: actions/checkout@v4

      - name: Install Databricks CLI
        run: curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

      - name: Deploy to production
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN_PROD }}
        run: databricks bundle deploy --target prod

Writing Testable Code

The key to testable Databricks code is separating business logic from Spark I/O.

# src/pipelines/silver/clean_events.py

from pyspark.sql import DataFrame
from pyspark.sql.functions import col, trim, upper, when

def clean_event_type(df: DataFrame) -> DataFrame:
    return df.withColumn(
        "event_type",
        when(upper(trim(col("event_type"))).isin(["CONFLICT", "WAR", "BATTLE"]), "CONFLICT")
        .when(upper(trim(col("event_type"))).isin(["NATURAL_DISASTER", "EARTHQUAKE", "FLOOD"]), "DISASTER")
        .otherwise("UNKNOWN")
    )

def filter_low_quality_events(df: DataFrame, min_severity: float = 0.5) -> DataFrame:
    return df.filter(col("severity") >= min_severity)

# tests/unit/test_clean_events.py

import pytest
from pyspark.sql import SparkSession
from src.pipelines.silver.clean_events import clean_event_type, filter_low_quality_events

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[1]").appName("test").getOrCreate()

def test_clean_event_type_normalizes_aliases(spark):
    data = [("war", 5.0), ("FLOOD", 7.0), ("political", 3.0)]
    df = spark.createDataFrame(data, ["event_type", "severity"])
    result = clean_event_type(df)
    types = {row.event_type for row in result.collect()}
    assert "CONFLICT" in types
    assert "DISASTER" in types
    assert "UNKNOWN" in types

def test_filter_removes_low_severity(spark):
    data = [("CONFLICT", 0.3), ("DISASTER", 0.8), ("CONFLICT", 0.5)]
    df = spark.createDataFrame(data, ["event_type", "severity"])
    result = filter_low_quality_events(df, min_severity=0.5)
    assert result.count() == 2
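
The quality_checks.py utility from the project layout can follow the same principle: keep the threshold logic pure so it needs no SparkSession. A hedged sketch (function names and the default threshold are illustrative, not from the original code); the Spark-facing wrapper would compute the counts with df.filter(col(name).isNull()).count():

```python
def null_fraction(null_count: int, total_count: int) -> float:
    """Fraction of null values in a column, given simple counts."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    return null_count / total_count


def passes_null_check(null_count: int, total_count: int,
                      max_fraction: float = 0.05) -> bool:
    """True when the column's null fraction is within the allowed budget."""
    return null_fraction(null_count, total_count) <= max_fraction
```

Because these functions take plain integers, the unit tests for them run in milliseconds with no JVM involved.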

Branching Strategy

We recommend a trunk-based development model with short-lived feature branches:

main  (production)
  |
  +-- feature/add-gdelt-source     PR -> main
  +-- feature/optimize-silver      PR -> main
  +-- hotfix/fix-null-handling     PR -> main (fast-track)

  • main always reflects production state
  • feature branches are short-lived (1-3 days max)
  • No long-lived dev/staging branches — use environment targets in DAB instead

Secrets Management in CI/CD

Store Databricks tokens as GitHub Actions secrets (never in code):

gh secret set DATABRICKS_TOKEN_PROD --body "dapi..."
gh secret set DATABRICKS_TOKEN_STAGING --body "dapi..."
gh secret set DATABRICKS_HOST --body "https://adb-xxx.azuredatabricks.net"

In Databricks, use secret scopes rather than hardcoding credentials in job configs.
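
In application code, the same separation works well: a small resolver that reads from a secret scope when dbutils is available on a cluster, and falls back to environment variables in CI or local runs. A sketch under assumed naming conventions (the scope/key names and the env-var mapping are illustrative):

```python
import os


def resolve_secret(scope: str, key: str, dbutils=None) -> str:
    """Fetch a credential from a Databricks secret scope when running on a
    cluster (dbutils passed in), otherwise from an environment variable,
    e.g. scope "harbinger" + key "api-key" -> HARBINGER_API_KEY.
    """
    if dbutils is not None:
        return dbutils.secrets.get(scope=scope, key=key)
    env_name = f"{scope}_{key}".upper().replace("-", "_")
    try:
        return os.environ[env_name]
    except KeyError:
        raise KeyError(f"No environment variable {env_name} set for {scope}/{key}")
```

Passing dbutils in as an argument, rather than importing it, keeps the function importable and testable outside Databricks.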


Monitoring Deployments

After deployment, verify the bundle state:

# Check what is deployed in each target
databricks bundle summary --target prod

# Run a job manually to verify
databricks bundle run events_ingestion_job --target prod

Common Issues and Fixes

| Issue | Cause | Fix |
| --- | --- | --- |
| bundle validate fails in CI | Missing DATABRICKS_HOST env var | Set the secrets in the GitHub repo settings |
| Unit tests fail with Spark errors | No JVM on the CI runner | Add an actions/setup-java step with java-version: '11' |
| Deploy succeeds but the job fails | Wheel not built/uploaded before the job ran | Ensure bundle deploy builds and uploads artifacts |
| Integration tests time out | Job cluster cold start | Use an instance pool or a pre-warmed cluster for tests |
| Prod deploy skipped | Staging tests failed | Fix the staging tests; never skip the gate |

Wrapping Up

Proper CI/CD for Databricks is not optional — it's what separates teams that ship confidently from teams that dread Friday deploys. Databricks Asset Bundles, combined with GitHub Actions, give you a clean, version-controlled, environment-aware deployment pipeline.

Start small: even just adding bundle validate to your PR checks will catch configuration errors before they reach production.
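
A validate-only check needs nothing beyond the CLI and a token. A minimal sketch, reusing the same secret names as the full CI workflow above:

```yaml
name: Validate Bundle

on:
  pull_request:

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Databricks CLI
        run: curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
      - name: Validate bundle
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST_DEV }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN_DEV }}
        run: databricks bundle validate --target dev
```

You can grow this into the full lint/test/deploy pipeline incrementally.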


Try Harbinger Explorer free for 7 days — built on the same CI/CD principles described here, deploying multiple times per day with confidence. Start your free trial at harbingerexplorer.com.

