Databricks Asset Bundles (DABs): The Complete Deployment Guide
Databricks Asset Bundles (DABs) are Databricks' answer to infrastructure-as-code for data pipelines. With DABs, you define your entire Databricks project — jobs, Delta Live Tables pipelines, notebooks, clusters, permissions — in YAML and Python, deploy it consistently across environments, and version everything in Git.
This guide covers everything from project initialization to production CI/CD pipelines.
What Are Databricks Asset Bundles?
A bundle is a collection of Databricks resources defined in YAML that can be deployed as a single unit. Think of it as Terraform for Databricks — but with first-class support for Databricks-specific concepts like jobs, pipelines, and notebooks.
What you can define in a bundle:
| Resource Type | Description |
|---|---|
| `jobs` | Databricks Workflows (job clusters, task dependencies) |
| `pipelines` | Delta Live Tables (DLT) pipelines |
| `experiments` | MLflow experiments |
| `models` | MLflow model registrations |
| `clusters` | All-purpose cluster configurations |
| `dashboards` | Databricks SQL dashboards |
| `permissions` | Access control for any resource |
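As a first taste of how these resource types fit together, here is a minimal bundle that defines one job with an access rule. This is an illustrative sketch — the bundle, job, and notebook names are placeholders:

```yaml
# databricks.yml — minimal illustrative bundle (names are placeholders)
bundle:
  name: hello-bundle

resources:
  jobs:
    hello_job:
      name: "Hello Job"
      tasks:
        - task_key: main
          notebook_task:
            notebook_path: ./src/hello.py
      permissions:
        - group_name: data-analysts
          level: CAN_VIEW
```

The rest of this guide expands each of these sections into production-grade configuration.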
Installation and Setup
```bash
# Install the new Databricks CLI (v0.205+, which includes bundle support)
curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

# Verify installation
databricks --version

# Authenticate (prompts for host and token)
databricks configure

# Or use environment variables:
export DATABRICKS_HOST="https://adb-WORKSPACE_ID.azuredatabricks.net"
export DATABRICKS_TOKEN="dapi..."
```
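`databricks configure` writes credentials to `~/.databrickscfg`, a plain INI file, which also lets you keep multiple workspaces as named profiles. A sketch with placeholder values:

```ini
# ~/.databrickscfg (placeholder hosts and tokens)
[DEFAULT]
host  = https://adb-WORKSPACE_ID.azuredatabricks.net
token = dapi...

[staging]
host  = https://adb-STAGING_ID.azuredatabricks.net
token = dapi...
```

You can then select a profile with the CLI's `--profile` flag, e.g. `databricks bundle validate --profile staging`.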
Project Initialization
```bash
# Initialize a new bundle project
databricks bundle init

# Choose from templates:
#   1. Default Python (recommended starting point)
#   2. DLT Python
#   3. MLOps Stacks

cd my-databricks-project
ls -la
# databricks.yml       -- main bundle config
# src/                 -- Python source files
# resources/           -- YAML resource definitions
# tests/               -- unit tests
# .github/workflows/   -- CI/CD (if selected)
```
Bundle Configuration Deep Dive
databricks.yml — The Root Config
```yaml
bundle:
  name: harbinger-data-platform

# Define environments (targets)
targets:
  dev:
    mode: development  # prefixes resource names with [dev <username>]
    default: true
    workspace:
      host: https://adb-WORKSPACE_ID.azuredatabricks.net

  staging:
    mode: production
    workspace:
      host: https://adb-WORKSPACE_ID.azuredatabricks.net
    variables:
      env: staging
      cluster_size: "2g"

  prod:
    mode: production
    workspace:
      host: https://adb-WORKSPACE_ID.azuredatabricks.net
    variables:
      env: prod
      cluster_size: "8g"
    permissions:
      - group_name: data-engineers
        level: CAN_MANAGE
      - group_name: data-analysts
        level: CAN_VIEW

# Variable definitions with defaults
variables:
  env:
    description: "Target environment"
    default: dev
  cluster_size:
    description: "Worker node size tier"
    default: "2g"

# Include resource files
include:
  - resources/jobs/*.yml
  - resources/pipelines/*.yml
  - resources/clusters/*.yml
```
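To build intuition for how `${var.*}` references resolve against target-level variable overrides, here is a toy resolver. This is illustrative only — the real CLI handles scoping, precedence, and many more reference types (`${bundle.*}`, `${workspace.*}`, etc.):

```python
import re


def resolve_variables(text: str, variables: dict[str, str]) -> str:
    """Substitute ${var.name} references, loosely mimicking DABs interpolation."""
    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"undefined bundle variable: {name}")
        return variables[name]

    return re.sub(r"\$\{var\.([A-Za-z_][A-Za-z0-9_]*)\}", replace, text)


# The prod target overrides the defaults declared under `variables:`
merged = {"env": "prod", "cluster_size": "8g"}
print(resolve_variables("Medallion Pipeline [${var.env}]", merged))
# → Medallion Pipeline [prod]
```

The key point: defaults from the top-level `variables:` block are merged with the selected target's `variables:` overrides before any `${var.*}` reference is substituted.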
Defining Jobs
```yaml
# resources/jobs/medallion_pipeline.yml
resources:
  jobs:
    medallion_pipeline:
      name: "Medallion Pipeline [${var.env}]"

      email_notifications:
        on_failure:
          - data-alerts@yourcompany.com
        on_success:
          - data-reports@yourcompany.com

      health:
        rules:
          - metric: RUN_DURATION_SECONDS
            op: GREATER_THAN
            value: 7200

      trigger:
        pause_status: UNPAUSED
        periodic:
          interval: 1
          unit: DAYS

      tasks:
        - task_key: bronze_ingestion
          description: "Ingest raw events from the landing zone"
          new_cluster:
            spark_version: "14.3.x-scala2.12"
            node_type_id: "Standard_DS3_v2"
            num_workers: 4
            spark_conf:
              spark.databricks.delta.optimizeWrite.enabled: "true"
          notebook_task:
            notebook_path: ./src/notebooks/bronze_ingestion
            base_parameters:
              env: ${var.env}
          libraries:
            - pypi:
                package: delta-spark==3.0.0

        - task_key: silver_transform
          depends_on:
            - task_key: bronze_ingestion
          new_cluster:
            spark_version: "14.3.x-scala2.12"
            node_type_id: "Standard_DS4_v2"
            num_workers: 8
            azure_attributes:
              availability: SPOT_WITH_FALLBACK_AZURE
          python_wheel_task:
            package_name: harbinger_transforms
            entry_point: run_silver
            parameters:
              - "--env"
              - ${var.env}
          libraries:
            - whl: ./dist/harbinger_transforms-*.whl

        - task_key: gold_aggregation
          depends_on:
            - task_key: silver_transform
          sql_task:
            warehouse_id: ${var.sql_warehouse_id}
            query:
              query_id: ""

        - task_key: dq_validation
          depends_on:
            - task_key: gold_aggregation
          run_if: ALL_SUCCESS
          notebook_task:
            notebook_path: ./src/notebooks/data_quality_check
```
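Databricks Workflows derives the execution order from the `depends_on` entries; conceptually this is a topological sort of the task graph. A small sketch of that scheduling logic over the four tasks above:

```python
from graphlib import TopologicalSorter

# task_key -> set of upstream task_keys (mirrors the depends_on entries above)
deps = {
    "bronze_ingestion": set(),
    "silver_transform": {"bronze_ingestion"},
    "gold_aggregation": {"silver_transform"},
    "dq_validation": {"gold_aggregation"},
}

# static_order() yields each task only after all of its dependencies
order = list(TopologicalSorter(deps).static_order())
print(order)
# → ['bronze_ingestion', 'silver_transform', 'gold_aggregation', 'dq_validation']
```

Tasks with no path between them can run in parallel; here the chain is linear, so the order is fully determined.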
Defining DLT Pipelines
```yaml
# resources/pipelines/streaming_events.yml
resources:
  pipelines:
    streaming_events:
      name: "Streaming Events Pipeline [${var.env}]"
      catalog: ${var.env}
      target: "silver"
      # Bundle YAML does not evaluate inline expressions; set `development`
      # per target (dev targets in development mode, prod overrides to false)
      development: true
      continuous: false
      clusters:
        - label: default
          num_workers: 4
          node_type_id: "Standard_DS3_v2"
      libraries:
        - notebook:
            path: ./src/dlt/streaming_pipeline
        - notebook:
            path: ./src/dlt/quality_expectations
      configuration:
        pipelines.applyChangesPreviewEnabled: "true"
        spark.databricks.delta.optimizeWrite.enabled: "true"
```
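Because bundle YAML does not evaluate expressions like `${bundle.target == 'dev'}`, environment-specific values such as the `development` flag are typically set through per-target resource overrides in `databricks.yml`. A hedged sketch, reusing the target and pipeline names from this guide:

```yaml
# databricks.yml — per-target pipeline overrides (sketch; names assumed from above)
targets:
  dev:
    resources:
      pipelines:
        streaming_events:
          development: true    # fast iteration, relaxed retries
  prod:
    resources:
      pipelines:
        streaming_events:
          development: false   # production-mode updates
```

Anything under a target's `resources:` block is merged over the base definition at deploy time.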
Python Source Structure
```text
src/
├── notebooks/
│   ├── bronze_ingestion.py
│   ├── silver_transform.py
│   └── data_quality_check.py
├── dlt/
│   ├── streaming_pipeline.py
│   └── quality_expectations.py
├── transforms/
│   ├── __init__.py
│   ├── bronze.py
│   ├── silver.py
│   └── gold.py
└── utils/
    ├── __init__.py
    ├── spark_helpers.py
    └── schema_registry.py
```
Writing Bundle-Compatible Notebooks
```python
# src/notebooks/bronze_ingestion.py
# Databricks notebook source

# COMMAND ----------

# Parameters (injected by Databricks Workflows)
dbutils.widgets.text("env", "dev")
dbutils.widgets.text("batch_date", "2024-01-01")

env = dbutils.widgets.get("env")
batch_date = dbutils.widgets.get("batch_date")
print(f"Running bronze ingestion for env={env}, date={batch_date}")

# COMMAND ----------

from pyspark.sql.functions import current_timestamp, input_file_name, lit

source_path = f"abfss://landing@{env}storage.dfs.core.windows.net/events/{batch_date}/"
target_table = f"{env}.bronze.events_raw"

(
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", f"/mnt/checkpoints/{env}/bronze/events/schema")
    .load(source_path)
    .withColumn("_ingested_at", current_timestamp())
    .withColumn("_source_file", input_file_name())
    .withColumn("_env", lit(env))
    .writeStream
    .format("delta")
    .option("checkpointLocation", f"/mnt/checkpoints/{env}/bronze/events/stream")
    .option("mergeSchema", "true")
    .outputMode("append")
    .trigger(availableNow=True)
    .table(target_table)
    .awaitTermination()
)

print(f"Bronze ingestion complete: {target_table}")
```
CLI Workflow
```bash
# Validate bundle configuration (syntax and schema check)
databricks bundle validate

# Deploy to dev (default target)
databricks bundle deploy

# Deploy to a specific target
databricks bundle deploy --target staging
databricks bundle deploy --target prod

# Run a specific job after deployment
# (bundle run waits for completion and streams run status by default)
databricks bundle run medallion_pipeline

# Run with parameter overrides
databricks bundle run medallion_pipeline \
  --python-params '["--env", "staging", "--batch-date", "2024-01-15"]'

# Destroy all deployed resources
databricks bundle destroy --target dev --auto-approve
```
CI/CD with GitHub Actions
```yaml
# .github/workflows/databricks-deploy.yml
name: Deploy Databricks Bundle

on:
  push:
    branches: [main, staging]
  pull_request:
    branches: [main]

env:
  DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
  DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}

jobs:
  validate:
    name: Validate Bundle
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Databricks CLI
        uses: databricks/setup-cli@main
      - name: Validate Bundle
        run: databricks bundle validate
      - name: Run Unit Tests
        run: |
          pip install pytest pyspark delta-spark
          pytest tests/ -v --tb=short

  deploy-staging:
    name: Deploy to Staging
    needs: validate
    if: github.ref == 'refs/heads/staging'
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4
      - name: Setup Databricks CLI
        uses: databricks/setup-cli@main
      - name: Build Python Wheel
        run: |
          pip install build
          python -m build
      - name: Deploy to Staging
        run: databricks bundle deploy --target staging
        env:
          DATABRICKS_TOKEN: ${{ secrets.STAGING_DATABRICKS_TOKEN }}
      - name: Run Integration Test
        run: databricks bundle run smoke_test_job --target staging
        env:
          DATABRICKS_TOKEN: ${{ secrets.STAGING_DATABRICKS_TOKEN }}

  deploy-prod:
    name: Deploy to Production
    needs: validate
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
      - name: Setup Databricks CLI
        uses: databricks/setup-cli@main
      - name: Build Python Wheel
        run: |
          pip install build
          python -m build
      - name: Deploy to Production
        run: databricks bundle deploy --target prod
        env:
          DATABRICKS_TOKEN: ${{ secrets.PROD_DATABRICKS_TOKEN }}
```
Variable Management and Secrets
```yaml
# databricks.yml — variable declarations
variables:
  sql_warehouse_id:
    description: "SQL Warehouse ID for Gold queries"
    # Set via environment variable: BUNDLE_VAR_sql_warehouse_id
  db_password:
    description: "Database password"
    # Use a Databricks secret scope — never put secret values in YAML
```
```bash
# Inject the variable at deploy time
BUNDLE_VAR_sql_warehouse_id=d3f8f59331f78ac5 databricks bundle deploy --target prod
```
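The CLI maps environment variables of the form `BUNDLE_VAR_<name>` onto bundle variables. A toy version of that lookup, for intuition only — the real precedence order also includes `--var` flags and target-level overrides:

```python
def bundle_vars_from_env(environ: dict[str, str]) -> dict[str, str]:
    """Collect BUNDLE_VAR_* entries into a {variable_name: value} mapping."""
    prefix = "BUNDLE_VAR_"
    return {
        key[len(prefix):]: value
        for key, value in environ.items()
        if key.startswith(prefix)
    }


env = {"BUNDLE_VAR_sql_warehouse_id": "d3f8f59331f78ac5", "PATH": "/usr/bin"}
print(bundle_vars_from_env(env))
# → {'sql_warehouse_id': 'd3f8f59331f78ac5'}
```

In practice you would pass `os.environ` rather than a literal dict; unrelated environment variables are simply ignored.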
For secrets, always use Databricks Secret Scopes (never YAML):
```python
# In notebook/Python code running on Databricks
db_password = dbutils.secrets.get(scope="harbinger", key="db-password")
```
Bundle Testing Strategies
Unit Tests (No Cluster Required)
```python
# tests/test_silver_transform.py
import pytest
from pyspark.sql import SparkSession

from src.transforms.silver import deduplicate_events


@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[2]").appName("test").getOrCreate()


def test_deduplication(spark):
    data = [
        ("evt_001", "2024-01-15T10:00:00", "PURCHASE", 100.0),
        ("evt_001", "2024-01-15T10:00:00", "PURCHASE", 100.0),  # duplicate
        ("evt_002", "2024-01-15T11:00:00", "CLICK", 0.0),
    ]
    df = spark.createDataFrame(data, ["event_id", "event_ts", "event_type", "amount"])

    result = deduplicate_events(df, key_col="event_id")

    assert result.count() == 2  # duplicate removed
```
Integration Tests (Require Cluster)
```bash
# Run a lightweight integration test job defined in the bundle;
# bundle run waits for completion and exits non-zero if the run fails
databricks bundle run integration_test --target staging

# Check exit code
echo "Exit code: $?"
```
Monitoring Deployed Bundles
Track your deployed resources via System Tables:
```sql
-- Recent job runs with durations
-- (system schema was renamed from system.workflow to system.lakeflow;
--  adjust table and column names to your workspace's schema version)
SELECT
  job_id,
  run_id,
  result_state,
  period_start_time,
  period_end_time,
  DATEDIFF(SECOND, period_start_time, period_end_time) AS duration_sec
FROM system.lakeflow.job_run_timeline
WHERE DATE(period_start_time) >= CURRENT_DATE - 7
ORDER BY period_start_time DESC;
-- To filter by name (e.g. LIKE '%Medallion Pipeline%'), join system.lakeflow.jobs on job_id
```
Connect your bundle deployments to Harbinger Explorer for centralized monitoring — track job run history, detect failures, and correlate deployment events with pipeline anomalies across environments.
Common Pitfalls
| Problem | Solution |
|---|---|
| Bundle validate passes but deploy fails | Check workspace permissions; ensure service principal has CAN_MANAGE on workspace |
| Notebook paths not found | Use ./ relative paths from bundle root, not absolute workspace paths |
| Variable not injected | Check BUNDLE_VAR_<name> env var convention (uppercase) |
| Dev mode name conflicts | mode: development adds [dev username] prefix — don't hardcode resource names |
| Wheel not found | Build wheel before deploy; ensure dist/*.whl is included |
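For the last pitfall, a small preflight helper can fail fast when no wheel has been built yet. This is a sketch — the `dist/` layout is assumed from this guide's project structure:

```python
from pathlib import Path


def find_wheels(dist_dir: str = "dist") -> list[Path]:
    """Return built wheels, newest first; raise if none exist yet."""
    wheels = sorted(
        Path(dist_dir).glob("*.whl"),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    if not wheels:
        raise FileNotFoundError(
            f"No wheel in {dist_dir}/ — run `python -m build` "
            "before `databricks bundle deploy`"
        )
    return wheels
```

Calling `find_wheels()` at the top of a deploy script turns a confusing mid-deploy failure into an immediate, actionable error.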
Conclusion
Databricks Asset Bundles are the right way to manage Databricks infrastructure at scale. They bring software engineering discipline — version control, CI/CD, environment parity, code review — to data platform deployments.
The learning curve is worth it: once your team adopts DABs, deployments become reproducible, rollbacks become trivial, and cross-environment consistency is guaranteed. Start with a single job, then expand to your full platform incrementally.
Try Harbinger Explorer free for 7 days — get full visibility into your DABs-deployed resources, track deployments across environments, and monitor job health without custom tooling. Start your free trial at harbingerexplorer.com