Databricks SQL Warehouse Sizing and Cost Optimization Guide
Databricks SQL Warehouses are the compute layer behind Databricks SQL — powering BI dashboards, ad-hoc queries, and dbt transformations. But with DBU costs that add up fast, sizing them wrong is one of the most common ways teams overbuild (or underbuild) their data platform.
This guide covers everything you need to make informed SQL Warehouse decisions: types, sizing, auto-stop tuning, and concrete cost optimization tactics.
SQL Warehouse Types
Databricks offers three SQL Warehouse types:
| Type | Best For | Cold Start | Cost |
|---|---|---|---|
| Serverless | Unpredictable/bursty workloads, BI tools | ~2–5s | Pay per query second |
| Pro | Consistent workloads, advanced features | ~2–4 min | Pay per running hour |
| Classic | Legacy workloads, cost predictability | ~2–4 min | Pay per running hour |
Serverless SQL Warehouses are the default recommendation for new deployments in 2024. They start in seconds (no cluster bootstrap), scale to zero immediately when idle, and are billed only for query execution time — not idle time.
```bash
# Create a serverless SQL warehouse via the Databricks CLI
# (serverless uses warehouse type PRO with serverless compute enabled)
databricks warehouses create \
  --name "analytics-serverless" \
  --warehouse-type PRO \
  --cluster-size "Small" \
  --auto-stop-mins 5 \
  --enable-serverless-compute
```
T-Shirt Sizing Reference
SQL Warehouses use t-shirt sizes that map to DBU consumption:
| Size | DBUs/Hour | vCPUs | RAM | Best For |
|---|---|---|---|---|
| 2X-Small | 1 | 4 | 16 GB | Dev/test, single analyst |
| X-Small | 2 | 8 | 32 GB | Light BI, < 5 concurrent users |
| Small | 4 | 16 | 64 GB | Moderate BI, 5–15 users |
| Medium | 8 | 32 | 128 GB | Heavy BI, 15–30 users |
| Large | 16 | 64 | 256 GB | Enterprise BI, dbt production |
| X-Large | 32 | 128 | 512 GB | Very large datasets, ML feature stores |
| 2X-Large | 64 | 256 | 1 TB | Extreme scale |
| 3X-Large | 128 | 512 | 2 TB | Rarely needed |
| 4X-Large | 256 | 1024 | 4 TB | Benchmark workloads only |
Note: For Serverless warehouses, the DBU rate is typically 1.4–1.6x Classic, but you eliminate idle costs, often resulting in lower total spend.
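The sizing table can double as a rough size picker. A minimal sketch in Python, where the user thresholds mirror the table above; treat them as starting points, not hard limits:

```python
# Rough t-shirt size picker based on the concurrency guidance in the table.
# Thresholds are the table's "Best For" user counts, not hard limits.
SIZE_BY_MAX_USERS = [
    (1, "2X-Small"),
    (5, "X-Small"),
    (15, "Small"),
    (30, "Medium"),
]

def pick_size(concurrent_users: int) -> str:
    for max_users, size in SIZE_BY_MAX_USERS:
        if concurrent_users <= max_users:
            return size
    return "Large"  # beyond ~30 users, profile the workload before going bigger

print(pick_size(3))   # X-Small
print(pick_size(20))  # Medium
```

Start one size smaller than you think you need, then check P95 latency against your SLA before scaling up.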
Clusters (Scaling within a Warehouse)
Each SQL Warehouse can run multiple clusters to handle concurrent queries. One cluster handles about 10 concurrent queries efficiently.
```python
# Configure via the Databricks Python SDK (databricks-sdk)
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.sql import CreateWarehouseRequestWarehouseType

w = WorkspaceClient()
warehouse = w.warehouses.create(
    name="production-bi",
    cluster_size="Medium",
    min_num_clusters=1,
    max_num_clusters=5,  # auto-scales up to 5 clusters
    auto_stop_mins=30,
    enable_serverless_compute=True,
    warehouse_type=CreateWarehouseRequestWarehouseType.PRO,
).result()  # create() returns a waiter; result() blocks until the warehouse is ready
print(f"Created warehouse: {warehouse.id}")
```
Scaling rules:
- `min_num_clusters=1`: keeps one cluster always warm (eliminates cold start for the first query)
- `max_num_clusters=N`: sets the concurrency ceiling
- Auto-scale adds clusters when queue depth exceeds a threshold (roughly 2 queries waiting)
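Given the ~10 concurrent queries per cluster rule of thumb above, you can estimate `max_num_clusters` from your peak concurrency. A quick sketch:

```python
import math

QUERIES_PER_CLUSTER = 10  # rule of thumb: one cluster handles ~10 concurrent queries

def clusters_needed(peak_concurrent_queries: int) -> int:
    """Estimate max_num_clusters from observed peak query concurrency."""
    return max(1, math.ceil(peak_concurrent_queries / QUERIES_PER_CLUSTER))

print(clusters_needed(35))  # 4 clusters to absorb a 35-query peak
```

Measure peak concurrency from your query history rather than guessing; overshooting `max_num_clusters` only raises the ceiling, but it makes runaway concurrency more expensive.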
Auto-Stop Configuration: The Biggest Cost Lever
Auto-stop is the single most impactful cost control for SQL Warehouses. An idle warehouse running all night costs the same as a busy one.
```bash
# Set auto-stop to 5 minutes for development warehouses
databricks warehouses edit <warehouse-id> --auto-stop-mins 5

# 30 minutes for production BI (balance cost vs cold start for the morning rush)
databricks warehouses edit <warehouse-id> --auto-stop-mins 30

# Query current warehouse settings
databricks warehouses get <warehouse-id>
```
Auto-stop strategy by use case:
| Warehouse Purpose | Recommended Auto-Stop |
|---|---|
| Development/ad-hoc | 5 minutes |
| Dashboard refresh (daytime only) | 10 minutes |
| Production BI (business hours) | 30 minutes |
| Always-on API endpoint | 60 minutes or min_num_clusters=1 + low auto-stop |
| dbt production jobs | 10 minutes (restarts fast) |
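The table above can be enforced as a fleet-wide policy. A hedged sketch: the purpose keys and warehouse dicts below are illustrative (in practice the list would come from `WorkspaceClient().warehouses.list()` in the databricks-sdk, with purpose derived from your naming convention or tags):

```python
# Audit auto-stop settings against the policy table above.
# Purpose keys are illustrative -- adapt to your own tagging scheme.
POLICY = {"dev": 5, "dashboard": 10, "production-bi": 30, "api": 60, "dbt": 10}

def auto_stop_violations(warehouses):
    """Return names of warehouses whose auto-stop exceeds the policy
    for their purpose (unknown purposes default to 10 minutes)."""
    return [
        w["name"]
        for w in warehouses
        if w["auto_stop_mins"] > POLICY.get(w["purpose"], 10)
    ]

fleet = [
    {"name": "dev-sandbox", "purpose": "dev", "auto_stop_mins": 120},
    {"name": "bi-prod", "purpose": "production-bi", "auto_stop_mins": 30},
]
print(auto_stop_violations(fleet))  # ['dev-sandbox']
```

Running a check like this weekly catches the classic failure mode: a dev warehouse created with the 120-minute default and forgotten.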
Query Routing and Warehouse Assignment
Map different workload types to appropriately sized warehouses:
```sql
-- Profile query history to understand your workload mix
-- (column names follow the system.query.history schema; verify in your workspace)
SELECT
  compute.warehouse_id AS warehouse_id,
  COUNT(*) AS query_count,
  AVG(total_duration_ms) / 1000 AS avg_duration_sec,
  PERCENTILE(total_duration_ms, 0.95) / 1000 AS p95_duration_sec,
  SUM(total_duration_ms) / 1000 / 3600 AS total_compute_hours
FROM system.query.history
WHERE start_time >= NOW() - INTERVAL 30 DAYS
GROUP BY 1
ORDER BY total_compute_hours DESC;
```
Recommended warehouse architecture:
```
┌─────────────────────────────────────────────────┐
│ Production BI Dashboard Warehouse               │
│ Size: Medium, Min: 1, Max: 4, Stop: 30 min      │
│ Users: Tableau, Power BI, Metabase              │
└─────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────┐
│ Analytics / Ad-Hoc Warehouse                    │
│ Size: Small (Serverless), Max: 3, Stop: 10 min  │
│ Users: Data analysts, SQL notebooks             │
└─────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────┐
│ dbt Production Warehouse                        │
│ Size: Large, Min: 0, Max: 1, Stop: 10 min       │
│ Users: dbt Cloud / dbt Core CI/CD               │
└─────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────┐
│ Developer / Sandbox Warehouse                   │
│ Size: 2X-Small, Min: 0, Max: 1, Stop: 5 min     │
│ Users: Engineers testing queries                │
└─────────────────────────────────────────────────┘
```
Cost Monitoring via system.billing
Databricks exposes billing data in the system.billing.usage table:
```sql
-- SQL Warehouse cost breakdown by warehouse (last 30 days)
-- (system.compute.warehouses holds warehouse metadata; verify its
--  availability and column names in your workspace)
SELECT
  u.usage_metadata.warehouse_id AS warehouse_id,
  w.warehouse_name,
  SUM(u.usage_quantity) AS total_dbus,
  SUM(u.usage_quantity) * 0.22 AS estimated_cost_usd  -- adjust for your DBU rate
FROM system.billing.usage u
LEFT JOIN system.compute.warehouses w
  ON u.usage_metadata.warehouse_id = w.warehouse_id
WHERE u.usage_date >= CURRENT_DATE - INTERVAL 30 DAYS
  AND u.sku_name LIKE '%SQL%'
GROUP BY 1, 2
ORDER BY total_dbus DESC;
```
```sql
-- Identify wasteful idle compute: hours with billed SQL DBUs but no
-- recorded queries (anti-join billing usage against query history)
SELECT
  u.usage_metadata.warehouse_id AS warehouse_id,
  date_trunc('hour', u.usage_start_time) AS hour,
  SUM(u.usage_quantity) AS idle_dbus
FROM system.billing.usage u
LEFT ANTI JOIN system.query.history q
  ON q.compute.warehouse_id = u.usage_metadata.warehouse_id
  AND date_trunc('hour', q.start_time) = date_trunc('hour', u.usage_start_time)
WHERE u.sku_name LIKE '%SQL%'
  AND u.usage_start_time >= CURRENT_DATE - INTERVAL 7 DAYS
GROUP BY 1, 2
HAVING SUM(u.usage_quantity) > 1
ORDER BY idle_dbus DESC;
```
Query-Level Cost Attribution
For chargeback models, trace cost to users or teams:
```sql
-- Cost per user for the last 30 days
-- (executed_by / execution_status / total_duration_ms per the
--  system.query.history schema; verify in your workspace)
SELECT
  executed_by AS user_name,
  COUNT(*) AS query_count,
  SUM(total_duration_ms) / 1000 / 3600 AS compute_hours,
  -- multiply hours by your warehouse's DBUs/hour (here: Small = 4) and DBU price
  ROUND(SUM(total_duration_ms) / 1000 / 3600 * 4 * 0.22, 2) AS est_cost_usd
FROM system.query.history
WHERE start_time >= NOW() - INTERVAL 30 DAYS
  AND execution_status = 'FINISHED'
GROUP BY 1
ORDER BY est_cost_usd DESC
LIMIT 20;
```
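One caveat: summing per-user query time over-attributes cost when queries overlap, because a shared warehouse burns the same DBUs per hour regardless of how many queries are running. A fairer chargeback prorates each period's DBUs by each user's share of total query time. A sketch of that proration (the inputs are illustrative; in practice they come from `system.query.history` and `system.billing.usage`):

```python
# Prorate a warehouse's DBU burn by each user's share of query time.
def prorate_dbus(user_query_seconds: dict, total_dbus: float) -> dict:
    """Split total_dbus across users proportionally to their query seconds."""
    total = sum(user_query_seconds.values())
    if total == 0:
        return {u: 0.0 for u in user_query_seconds}
    return {u: total_dbus * s / total for u, s in user_query_seconds.items()}

# alice ran 900s of queries, bob 300s, during a period that billed 8 DBUs
shares = prorate_dbus({"alice": 900, "bob": 300}, total_dbus=8.0)
print(shares)  # {'alice': 6.0, 'bob': 2.0}
```

The proration guarantees attributed DBUs sum to billed DBUs, which a naive per-query sum does not.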
Performance Optimization for SQL Warehouses
1. Result Caching
SQL Warehouses cache identical query results for up to 24 hours. Exploit this:
```sql
-- This query hits the result cache on second execution (same SQL, same data)
SELECT country, SUM(revenue) FROM prod.gold.sales GROUP BY country;

-- Force a cache bypass (for testing)
SET use_cached_result = false;
SELECT country, SUM(revenue) FROM prod.gold.sales GROUP BY country;
```
2. Query Optimization
```sql
-- Use EXPLAIN COST to understand query plans
EXPLAIN COST
SELECT
  c.customer_name,
  SUM(o.amount) AS total_spend
FROM prod.gold.orders o
JOIN prod.gold.customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2024-01-01'
GROUP BY c.customer_name
ORDER BY total_spend DESC
LIMIT 100;
```
3. Materialized Views and Dashboards
For dashboards that run the same expensive aggregations repeatedly:
```sql
-- Create a precomputed summary table (refresh daily via a scheduled job)
CREATE OR REPLACE TABLE prod.gold.daily_revenue_summary AS
SELECT
  date_trunc('day', order_date) AS day,
  product_category,
  region,
  SUM(amount) AS revenue,
  COUNT(*) AS order_count
FROM prod.gold.orders
GROUP BY 1, 2, 3;
```

Dashboard queries hit the summary instead of the raw table: a 1M-row aggregation becomes a ~365-row lookup. On Unity Catalog with serverless compute, `CREATE MATERIALIZED VIEW` supports the same pattern with scheduled, incremental refreshes.
Serverless vs Pro: Total Cost Comparison
For a realistic scenario: four analysts running roughly 50 queries per day combined, each averaging 30 seconds, across an 8-hour business day:

| Warehouse | Size | Running Hours/Day | DBUs/Hour | Daily DBU Cost |
|---|---|---|---|---|
| Pro/Classic Small | Small | 8h (always on) | 4 | 32 DBU |
| Serverless Small | Small | ~25 min actual query time | 4 × 1.5 = 6 | ~2.5 DBU |

Serverless saves ~92% in this scenario. Classic/Pro only wins when the warehouse is under load for more than roughly 1/multiplier of its running time, i.e. about 60–70% of the day at a 1.4–1.6x serverless rate.
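The comparison above reduces to a little arithmetic you can rerun with your own numbers. A sketch, assuming a Small warehouse (4 DBU/hour) and a 1.5x serverless multiplier as in the table:

```python
# Break-even check: serverless pays a DBU-rate premium but bills only busy
# time; classic bills every running hour. The 4 DBU/h and 1.5x multiplier
# below are assumptions matching the table; substitute your own rates.
def daily_dbus(busy_hours, running_hours, base_rate=4.0, serverless_mult=1.5):
    classic = base_rate * running_hours
    serverless = base_rate * serverless_mult * busy_hours
    return classic, serverless

# Scenario from the table: ~25 min of query time over an 8-hour day
classic, serverless = daily_dbus(busy_hours=25 / 60, running_hours=8)
print(classic, round(serverless, 1))  # 32.0 2.5
print(f"savings: {1 - serverless / classic:.0%}")  # savings: 92%

# Serverless and classic cost the same when busy/running = 1/multiplier
print(f"break-even: {1 / 1.5:.0%} of running time under load")  # break-even: 67%
```

Note the break-even depends only on the multiplier, not the warehouse size: at a 1.4x rate it is ~71%, at 1.6x it is ~63%.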
Databricks SQL + Harbinger Explorer
Teams using Harbinger Explorer can connect directly to their SQL Warehouses to run exploration queries against external API datasets — without spinning up a separate ETL pipeline. Harbinger's API crawler can push structured results directly into Delta tables queryable by your SQL Warehouse, turning external data (geopolitical events, economic indicators, supply chain signals) into SQL-accessible assets in minutes.
```sql
-- After Harbinger ingests external data to your Delta table:
SELECT
  h.event_country,
  h.event_type,
  COUNT(*) AS event_count,
  SUM(s.revenue_impact) AS revenue_impact
FROM harbinger.geopolitical.events h
JOIN prod.gold.sales s
  ON h.event_country = s.ship_to_country
  AND h.event_date = s.sale_date
WHERE h.event_date >= '2024-01-01'
GROUP BY 1, 2
ORDER BY revenue_impact DESC;
```
Checklist: Cost-Optimized SQL Warehouse Setup
- Use Serverless for bursty/unpredictable workloads
- Set auto-stop ≤ 10 minutes for dev warehouses
- Set auto-stop ≤ 30 minutes for production BI
- Separate dev/analytics/production into different warehouses
- Enable result caching (on by default; don't disable it)
- Monitor `system.billing.usage` weekly
- Right-size to the smallest warehouse that meets your P95 query latency SLA
- Use materialized tables for expensive recurring aggregations
- Tag warehouses for cost attribution (`team`, `environment`, `owner`)
Conclusion
SQL Warehouse cost optimization is primarily about matching supply to demand: right-sized compute, aggressive auto-stop for development environments, and Serverless for everything with unpredictable load. Start with Serverless unless you have a specific reason to use Pro. Monitor system.billing.usage and system.query.history weekly to catch runaway costs early.
The teams that get this right spend 40–70% less on SQL compute than those who size up "just to be safe" and forget to tune auto-stop.
Try Harbinger Explorer free for 7 days — connect to your Databricks SQL Warehouse, explore external datasets, and build data products that combine internal lakehouse data with real-world signals from across the web. harbingerexplorer.com