Databricks SQL Warehouse Sizing and Cost Optimization Guide

Databricks SQL Warehouses are the compute layer behind Databricks SQL — powering BI dashboards, ad-hoc queries, and dbt transformations. But with DBU costs that add up fast, sizing them wrong is one of the most common ways teams overbuild (or underbuild) their data platform.

This guide covers everything you need to make informed SQL Warehouse decisions: types, sizing, auto-stop tuning, and concrete cost optimization tactics.


SQL Warehouse Types

Databricks offers three SQL Warehouse types:

| Type       | Best For                                 | Cold Start | Cost                 |
|------------|------------------------------------------|------------|----------------------|
| Serverless | Unpredictable/bursty workloads, BI tools | ~2–5 s     | Pay per query second |
| Pro        | Consistent workloads, advanced features  | ~2–4 min   | Pay per running hour |
| Classic    | Legacy workloads, cost predictability    | ~2–4 min   | Pay per running hour |

Serverless SQL Warehouses are the default recommendation for new deployments in 2024. They start in seconds (no cluster bootstrap), scale to zero immediately when idle, and are billed only for query execution time — not idle time.

# Create a serverless SQL warehouse via CLI
# (serverless uses the PRO warehouse type with serverless compute enabled)
databricks warehouses create \
  --name "analytics-serverless" \
  --warehouse-type PRO \
  --cluster-size "Small" \
  --auto-stop-mins 5 \
  --enable-serverless-compute

T-Shirt Sizing Reference

SQL Warehouses use t-shirt sizes that map to DBU consumption:

| Size     | DBUs/Hour | vCPUs | RAM    | Best For                                |
|----------|-----------|-------|--------|-----------------------------------------|
| 2X-Small | 1         | 4     | 16 GB  | Dev/test, single analyst                |
| X-Small  | 2         | 8     | 32 GB  | Light BI, < 5 concurrent users          |
| Small    | 4         | 16    | 64 GB  | Moderate BI, 5–15 users                 |
| Medium   | 8         | 32    | 128 GB | Heavy BI, 15–30 users                   |
| Large    | 16        | 64    | 256 GB | Enterprise BI, dbt production           |
| X-Large  | 32        | 128   | 512 GB | Very large datasets, ML feature stores  |
| 2X-Large | 64        | 256   | 1 TB   | Extreme scale                           |
| 3X-Large | 128       | 512   | 2 TB   | Rarely needed                           |
| 4X-Large | 256       | 1024  | 4 TB   | Benchmark workloads only                |

Note: For Serverless warehouses, the DBU rate is typically 1.4–1.6x Classic, but you eliminate idle costs, often resulting in lower total spend.


Clusters (Scaling within a Warehouse)

Each SQL Warehouse can run multiple clusters to handle concurrent queries. One cluster handles about 10 concurrent queries efficiently; a sizing sketch follows the scaling rules below.

# Configure via Python SDK
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.sql import CreateWarehouseRequestWarehouseType

w = WorkspaceClient()

warehouse = w.warehouses.create(
    name="production-bi",
    cluster_size="Medium",
    min_num_clusters=1,
    max_num_clusters=5,       # Auto-scales up to 5 clusters
    auto_stop_mins=30,
    enable_serverless_compute=True,
    warehouse_type=CreateWarehouseRequestWarehouseType.PRO,
).result()                    # create() returns a waiter; result() blocks until ready
print(f"Created warehouse: {warehouse.id}")

Scaling rules:

  • min_num_clusters=1: Keeps one cluster always warm (eliminates cold start for first query)
  • max_num_clusters=N: Sets the concurrency ceiling
  • Auto-scale adds clusters when queue depth exceeds threshold (~2 queries waiting)
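
If you take the ~10-concurrent-queries-per-cluster rule of thumb at face value, choosing max_num_clusters becomes simple arithmetic. A minimal sketch (the function name and peak-concurrency figure are illustrative, not an official Databricks formula):

# Size max_num_clusters from expected peak concurrency
import math

QUERIES_PER_CLUSTER = 10  # rule of thumb from above

def suggest_max_clusters(peak_concurrent_queries: int) -> int:
    """Smallest cluster count that covers the expected peak."""
    return max(1, math.ceil(peak_concurrent_queries / QUERIES_PER_CLUSTER))

# Example: a BI tool firing up to 42 simultaneous queries at 9am
print(suggest_max_clusters(42))  # -> 5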

Auto-Stop Configuration: The Biggest Cost Lever

Auto-stop is the single most impactful cost control for SQL Warehouses. An idle warehouse running all night costs the same as a busy one.

# Set auto-stop to 5 minutes for development warehouses
databricks warehouses update \
  --id <warehouse-id> \
  --auto-stop-mins 5

# 30 minutes for production BI (balance cost vs cold start for the morning rush)
databricks warehouses update \
  --id <warehouse-id> \
  --auto-stop-mins 30

# Query current warehouse settings
databricks warehouses get --id <warehouse-id>

Auto-stop strategy by use case:

| Warehouse Purpose                | Recommended Auto-Stop                             |
|----------------------------------|---------------------------------------------------|
| Development/ad-hoc               | 5 minutes                                         |
| Dashboard refresh (daytime only) | 10 minutes                                        |
| Production BI (business hours)   | 30 minutes                                        |
| Always-on API endpoint           | 60 minutes, or min_num_clusters=1 + low auto-stop |
| dbt production jobs              | 10 minutes (restarts fast)                        |
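
Auto-stop settings drift as teams create new warehouses, so it is worth auditing them periodically. A minimal sketch using the Python SDK's warehouses.list() (the 30-minute ceiling is an example policy, not a Databricks default):

# Flag warehouses with loose or disabled auto-stop
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

MAX_AUTO_STOP_MINS = 30  # example policy ceiling

for wh in w.warehouses.list():
    # auto_stop_mins == 0 means auto-stop is disabled entirely
    if not wh.auto_stop_mins:
        print(f"{wh.name}: auto-stop DISABLED -- pays for idle time indefinitely")
    elif wh.auto_stop_mins > MAX_AUTO_STOP_MINS:
        print(f"{wh.name}: auto-stop {wh.auto_stop_mins} min exceeds {MAX_AUTO_STOP_MINS} min")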

Query Routing and Warehouse Assignment

Map different workload types to appropriately sized warehouses:

-- Query profiles to understand your workload mix
SELECT
  compute.warehouse_id AS warehouse_id,
  COUNT(*) AS query_count,
  AVG(total_duration_ms) / 1000 AS avg_duration_sec,
  PERCENTILE(total_duration_ms, 0.95) / 1000 AS p95_duration_sec,
  SUM(total_duration_ms) / 1000 / 3600 AS total_compute_hours
FROM system.query.history
WHERE start_time >= NOW() - INTERVAL 30 DAYS
GROUP BY compute.warehouse_id
ORDER BY total_compute_hours DESC;

Recommended warehouse architecture:

┌─────────────────────────────────────────────────┐
│  Production BI Dashboard Warehouse              │
│  Size: Medium, Min: 1, Max: 4, Stop: 30 min     │
│  Users: Tableau, Power BI, Metabase             │
└─────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────┐
│  Analytics / Ad-Hoc Warehouse                   │
│  Size: Small (Serverless), Max: 3, Stop: 10 min │
│  Users: Data analysts, SQL notebooks            │
└─────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────┐
│  dbt Production Warehouse                       │
│  Size: Large, Min: 0, Max: 1, Stop: 10 min      │
│  Users: dbt Cloud / dbt Core CI/CD              │
└─────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────┐
│  Developer / Sandbox Warehouse                  │
│  Size: 2X-Small, Min: 0, Max: 1, Stop: 5 min    │
│  Users: Engineers testing queries               │
└─────────────────────────────────────────────────┘

Cost Monitoring via system.billing

Databricks exposes billing data in the system.billing.usage table:

-- SQL Warehouse cost breakdown by warehouse (last 30 days)
SELECT
  u.usage_metadata.warehouse_id AS warehouse_id,
  w.warehouse_name,
  SUM(u.usage_quantity) AS total_dbus,
  SUM(u.usage_quantity) * 0.22 AS estimated_cost_usd  -- adjust for your DBU rate
FROM system.billing.usage u
LEFT JOIN system.compute.warehouses w
  ON u.usage_metadata.warehouse_id = w.warehouse_id
WHERE u.usage_date >= CURRENT_DATE - INTERVAL 30 DAYS
  AND u.sku_name LIKE '%SQL%'
GROUP BY 1, 2
ORDER BY total_dbus DESC;

-- Identify wasteful idle compute: hours billed with zero queries running
SELECT
  date_trunc('hour', u.usage_start_time) AS hour,
  u.usage_metadata.warehouse_id AS warehouse_id,
  SUM(u.usage_quantity) AS idle_dbus
FROM system.billing.usage u
LEFT JOIN system.query.history q
  ON q.compute.warehouse_id = u.usage_metadata.warehouse_id
  AND q.start_time < u.usage_end_time
  AND q.end_time > u.usage_start_time
WHERE u.sku_name LIKE '%SQL%'
  AND u.usage_date >= CURRENT_DATE - INTERVAL 7 DAYS
GROUP BY 1, 2
HAVING COUNT(q.statement_id) = 0 AND SUM(u.usage_quantity) > 1
ORDER BY idle_dbus DESC;

Query-Level Cost Attribution

For chargeback models, trace cost to users or teams:

-- Cost per user for the last 30 days
SELECT
  executed_by AS user_name,
  COUNT(*) AS query_count,
  SUM(total_duration_ms) / 1000 / 3600 AS compute_hours,
  ROUND(SUM(total_duration_ms) / 1000 / 3600 * 0.22, 2) AS est_cost_usd
FROM system.query.history
WHERE start_time >= NOW() - INTERVAL 30 DAYS
  AND execution_status = 'FINISHED'
GROUP BY executed_by
ORDER BY est_cost_usd DESC
LIMIT 20;
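
Per-user numbers cover people, but dashboards and service principals are often easier to bill through warehouse tags, which flow into system.billing.usage as the custom_tags column. A sketch using the Python SDK (the tag keys and warehouse settings here are illustrative):

# Create a tagged warehouse so billing rows carry team/env labels
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.sql import EndpointTags, EndpointTagPair

w = WorkspaceClient()

warehouse = w.warehouses.create(
    name="analytics-team-serverless",
    cluster_size="Small",
    auto_stop_mins=10,
    enable_serverless_compute=True,
    tags=EndpointTags(custom_tags=[
        EndpointTagPair(key="team", value="analytics"),
        EndpointTagPair(key="env", value="prod"),
    ]),
).result()
print(f"Tagged warehouse: {warehouse.id}")

In the billing query above, you can then group by custom_tags['team'] instead of warehouse_id to produce a per-team bill.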

Performance Optimization for SQL Warehouses

1. Result Caching

SQL Warehouses cache identical query results for up to 24 hours. Exploit this:

-- This query hits cache on second execution (same SQL, same data)
SELECT country, SUM(revenue) FROM prod.gold.sales GROUP BY country;

-- Force cache bypass (for testing): disable result reuse for the session
SET use_cached_result = false;
SELECT country, SUM(revenue) FROM prod.gold.sales GROUP BY country;

2. Query Optimization

-- Use EXPLAIN COST to understand query plans
EXPLAIN COST
SELECT
  c.customer_name,
  SUM(o.amount) AS total_spend
FROM prod.gold.orders o
JOIN prod.gold.customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2024-01-01'
GROUP BY c.customer_name
ORDER BY total_spend DESC
LIMIT 100;

3. Materialized Views and Dashboards

For dashboards that run the same expensive aggregations repeatedly, precompute a summary (shown here as a plain scheduled table; Databricks also supports CREATE MATERIALIZED VIEW for the same pattern):

-- Create a materialized summary (refresh daily)
CREATE OR REPLACE TABLE prod.gold.daily_revenue_summary AS
SELECT
  date_trunc('day', order_date) AS day,
  product_category,
  region,
  SUM(amount) AS revenue,
  COUNT(*) AS order_count
FROM prod.gold.orders
GROUP BY 1, 2, 3;

-- Dashboard queries hit the summary instead of raw table
-- 1M row aggregation → 365 row lookup

Serverless vs Pro: Total Cost Comparison

For a realistic scenario: four analysts issuing 50 queries per day in total, each query taking 30 seconds on average, across an 8-hour business day:

| Warehouse         | Size  | Running Hours/Day         | DBUs/Hour   | Daily DBU Cost |
|-------------------|-------|---------------------------|-------------|----------------|
| Pro/Classic Small | Small | 8 h (always on)           | 4           | 32 DBU         |
| Serverless Small  | Small | ~25 min actual query time | 4 × 1.5 = 6 | ~2.5 DBU       |

Serverless saves roughly 92% in this scenario (2.5 vs 32 DBU). The break-even point where Classic/Pro wins: a warehouse that is busy for more than about two-thirds of its running hours, the inverse of the 1.5× serverless rate premium.
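
To sanity-check those numbers, here is the arithmetic spelled out (assuming the 4 DBU/hour Small rate from the sizing table and the 1.5x serverless premium noted earlier):

# Daily DBU math for the scenario above
CLASSIC_DBU_PER_HOUR = 4      # Small, from the sizing table
SERVERLESS_PREMIUM = 1.5      # mid-range of the 1.4-1.6x note

query_hours = 50 * 30 / 3600                                  # ~0.42 h of real query time
classic_daily = 8 * CLASSIC_DBU_PER_HOUR                      # 32 DBU: billed all 8 h
serverless_daily = query_hours * CLASSIC_DBU_PER_HOUR * SERVERLESS_PREMIUM  # 2.5 DBU

print(f"classic: {classic_daily} DBU, serverless: {serverless_daily:.1f} DBU")
print(f"savings: {1 - serverless_daily / classic_daily:.0%}")  # ~92%

# Break-even: serverless costs more once the warehouse is busy
# more than 1 / SERVERLESS_PREMIUM, i.e. about 67% of its running hours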


Databricks SQL + Harbinger Explorer

Teams using Harbinger Explorer can connect directly to their SQL Warehouses to run exploration queries against external API datasets — without spinning up a separate ETL pipeline. Harbinger's API crawler can push structured results directly into Delta tables queryable by your SQL Warehouse, turning external data (geopolitical events, economic indicators, supply chain signals) into SQL-accessible assets in minutes.

-- After Harbinger ingests external data to your Delta table:
SELECT
  h.event_country,
  h.event_type,
  COUNT(*) AS event_count,
  SUM(s.revenue_impact) AS revenue_impact
FROM harbinger.geopolitical.events h
JOIN prod.gold.sales s ON h.event_country = s.ship_to_country
  AND h.event_date = s.sale_date
WHERE h.event_date >= '2024-01-01'
GROUP BY 1, 2
ORDER BY revenue_impact DESC;

Checklist: Cost-Optimized SQL Warehouse Setup

  • Use Serverless for bursty/unpredictable workloads
  • Set auto-stop ≤ 10 minutes for dev warehouses
  • Set auto-stop ≤ 30 minutes for production BI
  • Separate dev/analytics/production into different warehouses
  • Enable result caching (on by default, don't disable)
  • Monitor system.billing.usage weekly
  • Right-size to the smallest warehouse that meets P95 query latency SLA
  • Use materialized tables for expensive recurring aggregations
  • Tag warehouses for cost attribution (team, environment, owner)

Conclusion

SQL Warehouse cost optimization is primarily about matching supply to demand: right-sized compute, aggressive auto-stop for development environments, and Serverless for everything with unpredictable load. Start with Serverless unless you have a specific reason to use Pro. Monitor system.billing.usage and system.query.history weekly to catch runaway costs early.

The teams that get this right spend 40–70% less on SQL compute than those who size up "just to be safe" and forget to tune auto-stop.


Try Harbinger Explorer free for 7 days — connect to your Databricks SQL Warehouse, explore external datasets, and build data products that combine internal lakehouse data with real-world signals from across the web. harbingerexplorer.com

