
Databricks Cost Optimization: 12 Strategies to Cut Your Cloud Bill

11 min read · Tags: databricks, cost-optimization, dbu, spot-instances, cluster-management, data-engineering


Databricks is powerful — and like all powerful infrastructure, it can get expensive fast. Teams routinely discover surprise bills from idle clusters, over-provisioned nodes, and inefficient queries. In this guide, we cover 12 actionable strategies that data engineering teams use to reduce Databricks spending without sacrificing performance.


Understanding Databricks Pricing

Before optimizing, understand what drives costs:

| Cost Driver | Description |
|---|---|
| DBUs (Databricks Units) | Compute time × node tier (Jobs, All-Purpose, SQL) |
| Cloud VMs | Underlying AWS/Azure/GCP instance costs |
| Storage | Delta Lake files in cloud object storage |
| Network egress | Cross-region data transfer |

Jobs compute is typically 2–3× cheaper than All-Purpose compute for the same workload — this single insight drives several optimizations below.
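To see what that multiplier means in dollars, here's a back-of-envelope sketch in Python; the per-DBU rates are illustrative assumptions, not published prices:

```python
# Compare the same nightly 2-hour job on All-Purpose vs Jobs compute.
# Rates below are assumptions for illustration — check your own SKU pricing.
ALL_PURPOSE_RATE = 0.55  # $/DBU (assumed)
JOBS_RATE = 0.15         # $/DBU (assumed)

def monthly_compute_cost(dbus_per_hour: float, hours_per_run: float,
                         runs_per_month: int, rate: float) -> float:
    """DBU-based cost estimate for a recurring workload."""
    return dbus_per_hour * hours_per_run * runs_per_month * rate

# 4-node cluster, ~1 DBU per node-hour, running nightly
all_purpose = monthly_compute_cost(4, 2, 30, ALL_PURPOSE_RATE)
jobs = monthly_compute_cost(4, 2, 30, JOBS_RATE)

print(f"All-Purpose: ${all_purpose:.2f}/month")
print(f"Jobs:        ${jobs:.2f}/month")
```

Even with rough rates, moving the same workload to Jobs compute cuts this example's bill by more than two-thirds.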


Strategy 1: Kill Idle All-Purpose Clusters

The #1 cost killer: clusters left running after work hours or during lunch. A forgotten 4-node cluster at $0.40/DBU (assuming roughly 1 DBU per node-hour) burns 4 nodes × 10 idle hours × $0.40 = $16/day, or about $5,840/year.

Set aggressive auto-termination:

import requests

DATABRICKS_HOST = "https://adb-WORKSPACE_ID.azuredatabricks.net"
TOKEN = "dapi..."

# Note: clusters/edit requires the full cluster spec, not just the
# changed fields — fetch the current spec via clusters/get first.
response = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/edit",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "cluster_id": "YOUR_CLUSTER_ID",
        "spark_version": "14.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 4,
        "autotermination_minutes": 20,
    },
)
response.raise_for_status()

Default cluster policies (enforce via Cluster Policies):

{
  "autotermination_minutes": {
    "type": "fixed",
    "value": 30
  }
}
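As a sanity check before submitting a cluster spec, you can mirror the policy's "fixed" fields client-side. This validator is a hypothetical helper, not part of the Databricks SDK:

```python
# Hypothetical pre-flight check: flag cluster-spec fields that conflict
# with "fixed" policy values, mirroring what cluster policies enforce
# server-side on create/edit.
def violates_fixed_policy(policy: dict, cluster_spec: dict) -> list:
    """Return the policy keys the proposed spec would violate."""
    violations = []
    for key, rule in policy.items():
        if rule.get("type") == "fixed" and key in cluster_spec:
            if cluster_spec[key] != rule["value"]:
                violations.append(key)
    return violations

policy = {"autotermination_minutes": {"type": "fixed", "value": 30}}
spec = {"autotermination_minutes": 120, "num_workers": 4}
print(violates_fixed_policy(policy, spec))  # ['autotermination_minutes']
```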

Strategy 2: Move Batch Jobs to Jobs Compute

All-Purpose clusters are for interactive development. Production jobs should always use Jobs compute (job clusters) — they're cheaper and isolated:

# In your Databricks Asset Bundle (databricks.yml)
resources:
  jobs:
    nightly_etl:
      tasks:
        - task_key: transform
          new_cluster:
            spark_version: "14.3.x-scala2.12"
            node_type_id: "Standard_DS3_v2"
            num_workers: 4

Jobs clusters spin up for the job, run, and terminate — you pay only for runtime.


Strategy 3: Use Spot/Preemptible Instances

Spot instances (AWS) or Spot VMs (Azure) are typically 60–80% cheaper than on-demand. For fault-tolerant batch workloads, use a hybrid: on-demand driver + spot workers.

cluster_config = {
    "cluster_name": "spot-workers-cluster",
    "spark_version": "14.3.x-scala2.12",
    "driver_node_type_id": "Standard_DS3_v2",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 8,
    "azure_attributes": {
        "availability": "SPOT_WITH_FALLBACK_AZURE",
        "spot_bid_max_price": -1
    }
}

When to use spot: Bronze ingestion, Silver batch transforms, ML training. When to avoid spot: Interactive notebooks, streaming with exactly-once semantics.
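On AWS the same hybrid pattern uses `aws_attributes`; a sketch, assuming `m5.xlarge` workers (swap in your own node type):

```python
# AWS equivalent of the Azure config above: keep the driver on-demand
# and fall back to on-demand when spot capacity disappears.
def spot_cluster_config(num_workers: int) -> dict:
    return {
        "cluster_name": "spot-workers-cluster",
        "spark_version": "14.3.x-scala2.12",
        "node_type_id": "m5.xlarge",  # assumed worker type for illustration
        "num_workers": num_workers,
        "aws_attributes": {
            "first_on_demand": 1,  # driver stays on-demand
            "availability": "SPOT_WITH_FALLBACK",
            "spot_bid_price_percent": 100,  # bid up to 100% of on-demand
        },
    }

config = spot_cluster_config(8)
```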


Strategy 4: Right-Size Your Clusters

Over-provisioning is rampant: many teams run clusters 2–3× larger than their workloads actually need.

Profile before sizing:

SELECT
  statement_id,
  executed_by,
  spilled_local_bytes / 1e9 AS gb_spilled_to_disk
FROM system.query.history
WHERE start_time >= current_date() - 7
ORDER BY 4 DESC
LIMIT 20

Sizing heuristics:

| Workload | Recommendation |
|---|---|
| SQL analytics | 2–4 nodes, memory-optimized |
| Spark ETL (small) | 2–8 nodes, general-purpose |
| Spark ETL (large) | 8–32 nodes, compute-optimized |
| ML training | GPU nodes only when needed |
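Those heuristics can be turned into a rough triage rule. The thresholds below are assumptions to tune against your own spill and utilization data:

```python
# Rough sizing triage (thresholds are assumptions, not Databricks guidance):
# heavy disk spill suggests undersized memory; no spill plus low CPU
# utilization suggests the cluster can shrink.
def sizing_hint(gb_spilled_to_disk: float, avg_cpu_util: float) -> str:
    if gb_spilled_to_disk > 10:
        return "scale up or switch to memory-optimized nodes"
    if gb_spilled_to_disk == 0 and avg_cpu_util < 0.3:
        return "scale down"
    return "size looks reasonable"

print(sizing_hint(42.0, 0.8))   # heavy spill -> scale up
print(sizing_hint(0.0, 0.15))   # idle and spill-free -> scale down
```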

Strategy 5: Enable Photon Engine

Photon is Databricks' vectorized query engine — it dramatically speeds up SQL and DataFrame operations, meaning jobs finish faster and cost less:

cluster_config = {
    "runtime_engine": "PHOTON",
    "node_type_id": "Standard_DS3_v2",
    "spark_version": "14.3.x-photon-scala2.12"
}

Photon typically delivers 2–8× speedups on SQL-heavy workloads. Photon compute bills DBUs at a higher rate, but because you pay per DBU-hour, a speedup that exceeds the rate premium still lowers the total bill.

Best Photon workloads: Delta Lake reads/writes, SQL aggregations, joins, sorts.
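Since Photon only pays off when the speedup beats the DBU premium, a quick breakeven check helps. The 2× premium below is an assumed default, not a quoted rate:

```python
# Cost ∝ runtime × DBU rate, so Photon's relative cost is premium / speedup.
# The 2.0x DBU premium is an assumption — check your actual SKU multiplier.
def photon_relative_cost(speedup: float, dbu_premium: float = 2.0) -> float:
    """Photon job cost as a fraction of the non-Photon cost."""
    return dbu_premium / speedup

print(round(photon_relative_cost(3.0), 2))  # 0.67 -> cheaper with Photon
print(round(photon_relative_cost(1.5), 2))  # 1.33 -> more expensive
```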


Strategy 6: Use SQL Warehouses for BI Queries

Never use All-Purpose clusters for BI tool connections (Tableau, Power BI, Looker). Use Databricks SQL Warehouses instead — they're designed for concurrent SQL and are significantly cheaper for that pattern:

SELECT
  warehouse_id,
  SUM(execution_duration_ms) / 3600000.0 AS total_hours,
  COUNT(*) AS query_count,
  AVG(execution_duration_ms / 1000.0) AS avg_duration_sec
FROM system.query.history
WHERE DATE(start_time) >= CURRENT_DATE - 30
GROUP BY 1
ORDER BY 2 DESC;

Configure auto-stop and auto-scale:

{
  "auto_stop_mins": 10,
  "min_num_clusters": 1,
  "max_num_clusters": 3,
  "cluster_size": "Small"
}

Strategy 7: Partition and Z-Order Strategically

Poorly designed tables force full scans, wasting compute. Use partitioning and Z-ordering to minimize data scanned:

CREATE TABLE prod.silver.events
USING DELTA
PARTITIONED BY (event_date)
AS SELECT *, CAST(event_timestamp AS DATE) AS event_date
FROM prod.bronze.events_raw;

OPTIMIZE prod.silver.events
ZORDER BY (user_id, event_type);

Impact: Partition pruning can reduce scanned data by 90%+, directly cutting query cost.


Strategy 8: Leverage Delta Cache

The Delta cache keeps decoded columnar data in SSD/memory for reuse. Enable it on frequently read tables:

spark.conf.set("spark.databricks.io.cache.enabled", "true")
spark.conf.set("spark.databricks.io.cache.maxDiskUsage", "50g")
spark.conf.set("spark.databricks.io.cache.maxMetaDataCache", "1g")

Pre-warm the cache before scheduled jobs:

spark.sql("CACHE SELECT * FROM prod.gold.customers")
spark.sql("CACHE SELECT * FROM prod.silver.products")

Strategy 9: Vacuum Delta Tables Regularly

Delta keeps old file versions for time travel. Without VACUUM, storage costs compound:

VACUUM prod.silver.events RETAIN 168 HOURS;

-- Optionally shorten the retention window (this also shortens time travel)
ALTER TABLE prod.silver.events
  SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 3 days');

To automate across a schema:

def vacuum_all_tables(schema: str, retention_hours: int = 168):
    """Vacuum every Delta table in a schema."""
    # Note: retention below 168 hours is rejected unless the safety check
    # spark.databricks.delta.retentionDurationCheck.enabled is disabled
    # or the table's deletedFileRetentionDuration has been lowered.
    tables = spark.sql(f"SHOW TABLES IN {schema}").collect()
    for row in tables:
        table = f"{schema}.{row['tableName']}"
        print(f"Vacuuming {table}...")
        spark.sql(f"VACUUM {table} RETAIN {retention_hours} HOURS")

vacuum_all_tables("prod.silver", retention_hours=72)

Strategy 10: Implement Cluster Policies

Prevent engineers from spinning up oversized clusters by enforcing policies:

policy = {
    "name": "Data Engineer Standard",
    "definition": {
        "dbus_per_hour": {
            "type": "range",
            "maxValue": 100
        },
        "autotermination_minutes": {
            "type": "fixed",
            "value": 30
        },
        "node_type_id": {
            "type": "allowlist",
            "values": ["Standard_DS3_v2", "Standard_DS4_v2"]
        },
        "num_workers": {
            "type": "range",
            "maxValue": 8
        }
    }
}

response = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/policies/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=policy
)

Strategy 11: Use Structured Streaming Efficiently

Streaming clusters often run 24/7 — optimize them carefully:

# Micro-batch every 5 minutes instead of continuous
query = df.writeStream \
  .trigger(processingTime="5 minutes") \
  .format("delta") \
  .option("checkpointLocation", "/checkpoints/events") \
  .toTable("prod.silver.events")  # toTable() starts the stream

# For workloads that can run on a schedule: process everything
# available, then stop (so the job cluster can terminate)
df.writeStream \
  .trigger(availableNow=True) \
  .format("delta") \
  .option("checkpointLocation", "/checkpoints/events") \
  .toTable("prod.silver.events") \
  .awaitTermination()
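To see why `availableNow` on a schedule beats an always-on stream, compare billable hours; the run cadence and durations below are illustrative:

```python
# An always-on streaming cluster bills every hour of the month, while a
# scheduled availableNow job bills only while it runs. Numbers below are
# illustrative assumptions, not measurements.
HOURS_PER_MONTH = 730

def monthly_hours_scheduled(runs_per_day: int, minutes_per_run: float) -> float:
    """Billable hours for a scheduled job over a 30-day month."""
    return runs_per_day * (minutes_per_run / 60) * 30

always_on = HOURS_PER_MONTH                  # 730 billable hours
scheduled = monthly_hours_scheduled(24, 10)  # hourly 10-minute runs
print(f"always-on: {always_on} h, scheduled: {scheduled:.0f} h")
```

Here the scheduled variant bills roughly a sixth of the hours, at the cost of up to an hour of data latency.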

Strategy 12: Monitor Costs Continuously

You can't optimize what you don't measure. Use Databricks System Tables:

SELECT
  usage_metadata.job_id,
  SUM(usage_quantity) AS total_dbus,
  SUM(usage_quantity) * 0.40 AS estimated_cost_usd,  -- illustrative $/DBU rate
  COUNT(DISTINCT usage_metadata.job_run_id) AS run_count
FROM system.billing.usage
WHERE usage_date >= CURRENT_DATE - 30
  AND sku_name LIKE '%JOBS%'
  AND usage_unit = 'DBU'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 20;
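Once you have DBU totals, converting them to dollars is a simple rate lookup. The per-SKU rates here are placeholders; substitute your negotiated pricing:

```python
# Turn raw DBU counts into dollar estimates with per-SKU rates.
# These rates are placeholder assumptions, not published prices.
ASSUMED_RATES = {  # $/DBU
    "JOBS": 0.15,
    "ALL_PURPOSE": 0.55,
    "SQL": 0.22,
}

def estimated_cost(dbus_by_sku: dict) -> float:
    """Sum DBUs × rate per SKU; unknown SKUs fall back to the max rate."""
    fallback = max(ASSUMED_RATES.values())
    return sum(dbus * ASSUMED_RATES.get(sku, fallback)
               for sku, dbus in dbus_by_sku.items())

print(estimated_cost({"JOBS": 1000, "ALL_PURPOSE": 200}))  # ~260.0
```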

Connect your billing data to a monitoring tool like Harbinger Explorer for automated anomaly detection — get alerted when a job's DBU consumption spikes unexpectedly.


Cost Optimization Checklist

| Action | Estimated Savings |
|---|---|
| Auto-terminate idle clusters (30 min) | 20–40% |
| Migrate batch to Jobs compute | 15–30% |
| Enable spot workers | 40–60% |
| Right-size clusters | 20–35% |
| Enable Photon | 10–25% (via speed) |
| Optimize partitioning/Z-Order | 10–30% |
| Regular VACUUM | 5–15% (storage) |
| Cluster policies | 10–25% (governance) |

These savings overlap rather than stack neatly, but teams that implement most of these strategies commonly cut their Databricks bill by 50–70%.
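A note on combining the percentages above: each strategy shrinks the remaining bill, so independent savings multiply rather than add:

```python
# Each cut applies to what's left of the bill, so three independent 30%
# cuts leave 0.7^3 ≈ 34% of the original — about 66% saved, not 90%.
from functools import reduce

def combined_savings(cuts: list) -> float:
    """Fraction of the original bill saved after applying each cut in turn."""
    remaining = reduce(lambda r, c: r * (1 - c), cuts, 1.0)
    return 1 - remaining

print(f"{combined_savings([0.3, 0.3, 0.3]):.0%}")  # 66%
```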


Try Harbinger Explorer free for 7 days — get automated cost anomaly detection, cluster utilization reports, and DBU budget alerts across your entire Databricks workspace. Start your free trial at harbingerexplorer.com

