Databricks SQL Warehouse Sizing and Cost Optimization Guide

Databricks SQL Warehouses are the compute layer behind Databricks SQL — powering BI dashboards, ad-hoc queries, and dbt transformations. But with DBU costs that add up fast, sizing them wrong is one of the most common ways teams overbuild (or underbuild) their data platform.

This guide covers everything you need to make informed SQL Warehouse decisions: types, sizing, auto-stop tuning, and concrete cost optimization tactics.


SQL Warehouse Types

Databricks offers three SQL Warehouse types:

| Type       | Best For                                 | Cold Start | Cost                 |
|------------|------------------------------------------|------------|----------------------|
| Serverless | Unpredictable/bursty workloads, BI tools | ~2–5 s     | Pay per query second |
| Pro        | Consistent workloads, advanced features  | ~2–4 min   | Pay per running hour |
| Classic    | Legacy workloads, cost predictability    | ~2–4 min   | Pay per running hour |

Serverless SQL Warehouses are the default recommendation for new deployments in 2024. They start in seconds (no cluster bootstrap), scale to zero immediately when idle, and are billed only for query execution time — not idle time.

# Create a serverless SQL warehouse via CLI
# (serverless uses the PRO warehouse type with serverless compute enabled)
databricks warehouses create \
  --name "analytics-serverless" \
  --warehouse-type PRO \
  --cluster-size "Small" \
  --auto-stop-mins 5 \
  --enable-serverless-compute

T-Shirt Sizing Reference

SQL Warehouses use t-shirt sizes that map to DBU consumption:

| Size     | DBUs/Hour | vCPUs | RAM    | Best For                                |
|----------|-----------|-------|--------|-----------------------------------------|
| 2X-Small | 1         | 4     | 16 GB  | Dev/test, single analyst                |
| X-Small  | 2         | 8     | 32 GB  | Light BI, < 5 concurrent users          |
| Small    | 4         | 16    | 64 GB  | Moderate BI, 5–15 users                 |
| Medium   | 8         | 32    | 128 GB | Heavy BI, 15–30 users                   |
| Large    | 16        | 64    | 256 GB | Enterprise BI, dbt production           |
| X-Large  | 32        | 128   | 512 GB | Very large datasets, ML feature stores  |
| 2X-Large | 64        | 256   | 1 TB   | Extreme scale                           |
| 3X-Large | 128       | 512   | 2 TB   | Rarely needed                           |
| 4X-Large | 256       | 1024  | 4 TB   | Benchmark workloads only                |

Note: For Serverless warehouses, the DBU rate is typically 1.4–1.6x Classic, but you eliminate idle costs, often resulting in lower total spend.


Clusters (Scaling within a Warehouse)

Each SQL Warehouse can run multiple clusters to handle concurrent queries. One cluster handles about 10 concurrent queries efficiently; a sizing sketch follows the scaling rules below.

# Configure via Python SDK
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.sql import CreateWarehouseRequestWarehouseType

w = WorkspaceClient()

warehouse = w.warehouses.create(
    name="production-bi",
    cluster_size="Medium",
    min_num_clusters=1,
    max_num_clusters=5,       # Auto-scales up to 5 clusters
    auto_stop_mins=30,
    enable_serverless_compute=True,
    warehouse_type=CreateWarehouseRequestWarehouseType.PRO,
).result()                    # create() returns a waiter; result() blocks until ready
print(f"Created warehouse: {warehouse.id}")

Scaling rules:

  • min_num_clusters=1: Keeps one cluster always warm (eliminates cold start for first query)
  • max_num_clusters=N: Sets the concurrency ceiling
  • Auto-scale adds clusters when queue depth exceeds threshold (~2 queries waiting)
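
If you take the ~10-concurrent-queries-per-cluster rule of thumb at face value, choosing max_num_clusters becomes simple arithmetic. A minimal sketch (the function name and peak-concurrency figure are illustrative, not an official Databricks formula):

# Size max_num_clusters from expected peak concurrency
import math

QUERIES_PER_CLUSTER = 10  # rule of thumb from above

def suggest_max_clusters(peak_concurrent_queries: int) -> int:
    """Smallest cluster count that covers the expected peak."""
    return max(1, math.ceil(peak_concurrent_queries / QUERIES_PER_CLUSTER))

# Example: a BI tool firing up to 42 simultaneous queries at 9am
print(suggest_max_clusters(42))  # -> 5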

Auto-Stop Configuration: The Biggest Cost Lever

Auto-stop is the single most impactful cost control for SQL Warehouses. An idle warehouse running all night costs the same as a busy one.

# Set auto-stop to 5 minutes for development warehouses
databricks warehouses update \
  --id <warehouse-id> \
  --auto-stop-mins 5

# 30 minutes for production BI (balance cost vs cold start for the morning rush)
databricks warehouses update \
  --id <warehouse-id> \
  --auto-stop-mins 30

# Query current warehouse settings
databricks warehouses get --id <warehouse-id>

Auto-stop strategy by use case:

| Warehouse Purpose                | Recommended Auto-Stop                             |
|----------------------------------|---------------------------------------------------|
| Development/ad-hoc               | 5 minutes                                         |
| Dashboard refresh (daytime only) | 10 minutes                                        |
| Production BI (business hours)   | 30 minutes                                        |
| Always-on API endpoint           | 60 minutes, or min_num_clusters=1 + low auto-stop |
| dbt production jobs              | 10 minutes (restarts fast)                        |
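
Auto-stop settings drift as teams create new warehouses, so it is worth auditing them periodically. A minimal sketch using the Python SDK's warehouses.list() (the 30-minute ceiling is an example policy, not a Databricks default):

# Flag warehouses with loose or disabled auto-stop
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

MAX_AUTO_STOP_MINS = 30  # example policy ceiling

for wh in w.warehouses.list():
    # auto_stop_mins == 0 means auto-stop is disabled entirely
    if not wh.auto_stop_mins:
        print(f"{wh.name}: auto-stop DISABLED -- pays for idle time indefinitely")
    elif wh.auto_stop_mins > MAX_AUTO_STOP_MINS:
        print(f"{wh.name}: auto-stop {wh.auto_stop_mins} min exceeds {MAX_AUTO_STOP_MINS} min")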

Query Routing and Warehouse Assignment

Map different workload types to appropriately sized warehouses:

-- Query profiles to understand your workload mix
SELECT
  compute.warehouse_id AS warehouse_id,
  COUNT(*) AS query_count,
  AVG(total_duration_ms) / 1000 AS avg_duration_sec,
  PERCENTILE(total_duration_ms, 0.95) / 1000 AS p95_duration_sec,
  SUM(total_duration_ms) / 1000 / 3600 AS total_compute_hours
FROM system.query.history
WHERE start_time >= NOW() - INTERVAL 30 DAYS
GROUP BY compute.warehouse_id
ORDER BY total_compute_hours DESC;

Recommended warehouse architecture:

┌─────────────────────────────────────────────────┐
│  Production BI Dashboard Warehouse              │
│  Size: Medium, Min: 1, Max: 4, Stop: 30 min     │
│  Users: Tableau, Power BI, Metabase             │
└─────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────┐
│  Analytics / Ad-Hoc Warehouse                   │
│  Size: Small (Serverless), Max: 3, Stop: 10 min │
│  Users: Data analysts, SQL notebooks            │
└─────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────┐
│  dbt Production Warehouse                       │
│  Size: Large, Min: 0, Max: 1, Stop: 10 min      │
│  Users: dbt Cloud / dbt Core CI/CD              │
└─────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────┐
│  Developer / Sandbox Warehouse                  │
│  Size: 2X-Small, Min: 0, Max: 1, Stop: 5 min    │
│  Users: Engineers testing queries               │
└─────────────────────────────────────────────────┘

Cost Monitoring via system.billing

Databricks exposes billing data in the system.billing.usage table:

-- SQL Warehouse cost breakdown by warehouse (last 30 days)
SELECT
  u.usage_metadata.warehouse_id AS warehouse_id,
  w.warehouse_name,
  SUM(u.usage_quantity) AS total_dbus,
  SUM(u.usage_quantity) * 0.22 AS estimated_cost_usd  -- adjust for your DBU rate
FROM system.billing.usage u
LEFT JOIN system.compute.warehouses w
  ON u.usage_metadata.warehouse_id = w.warehouse_id
WHERE u.usage_date >= CURRENT_DATE - INTERVAL 30 DAYS
  AND u.sku_name LIKE '%SQL%'
GROUP BY 1, 2
ORDER BY total_dbus DESC;

-- Identify wasteful idle compute: hours billed with zero queries running
SELECT
  date_trunc('hour', u.usage_start_time) AS hour,
  u.usage_metadata.warehouse_id AS warehouse_id,
  SUM(u.usage_quantity) AS idle_dbus
FROM system.billing.usage u
LEFT JOIN system.query.history q
  ON q.compute.warehouse_id = u.usage_metadata.warehouse_id
  AND q.start_time < u.usage_end_time
  AND q.end_time > u.usage_start_time
WHERE u.sku_name LIKE '%SQL%'
  AND u.usage_date >= CURRENT_DATE - INTERVAL 7 DAYS
GROUP BY 1, 2
HAVING COUNT(q.statement_id) = 0 AND SUM(u.usage_quantity) > 1
ORDER BY idle_dbus DESC;

Query-Level Cost Attribution

For chargeback models, trace cost to users or teams:

-- Cost per user for the last 30 days
SELECT
  executed_by AS user_name,
  COUNT(*) AS query_count,
  SUM(total_duration_ms) / 1000 / 3600 AS compute_hours,
  ROUND(SUM(total_duration_ms) / 1000 / 3600 * 0.22, 2) AS est_cost_usd
FROM system.query.history
WHERE start_time >= NOW() - INTERVAL 30 DAYS
  AND execution_status = 'FINISHED'
GROUP BY executed_by
ORDER BY est_cost_usd DESC
LIMIT 20;
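
Per-user numbers cover people, but dashboards and service principals are often easier to bill through warehouse tags, which flow into system.billing.usage as the custom_tags column. A sketch using the Python SDK (the tag keys and warehouse settings here are illustrative):

# Create a tagged warehouse so billing rows carry team/env labels
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.sql import EndpointTags, EndpointTagPair

w = WorkspaceClient()

warehouse = w.warehouses.create(
    name="analytics-team-serverless",
    cluster_size="Small",
    auto_stop_mins=10,
    enable_serverless_compute=True,
    tags=EndpointTags(custom_tags=[
        EndpointTagPair(key="team", value="analytics"),
        EndpointTagPair(key="env", value="prod"),
    ]),
).result()
print(f"Tagged warehouse: {warehouse.id}")

In the billing query above, you can then group by custom_tags['team'] instead of warehouse_id to produce a per-team bill.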

Performance Optimization for SQL Warehouses

1. Result Caching

SQL Warehouses cache identical query results for up to 24 hours. Exploit this:

-- This query hits cache on second execution (same SQL, same data)
SELECT country, SUM(revenue) FROM prod.gold.sales GROUP BY country;

-- Force cache bypass (for testing): disable result reuse for the session
SET use_cached_result = false;
SELECT country, SUM(revenue) FROM prod.gold.sales GROUP BY country;

2. Query Optimization

-- Use EXPLAIN COST to understand query plans
EXPLAIN COST
SELECT
  c.customer_name,
  SUM(o.amount) AS total_spend
FROM prod.gold.orders o
JOIN prod.gold.customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2024-01-01'
GROUP BY c.customer_name
ORDER BY total_spend DESC
LIMIT 100;

3. Materialized Views and Dashboards

For dashboards that run the same expensive aggregations repeatedly, precompute a summary (shown here as a plain scheduled table; Databricks also supports CREATE MATERIALIZED VIEW for the same pattern):

-- Create a materialized summary (refresh daily)
CREATE OR REPLACE TABLE prod.gold.daily_revenue_summary AS
SELECT
  date_trunc('day', order_date) AS day,
  product_category,
  region,
  SUM(amount) AS revenue,
  COUNT(*) AS order_count
FROM prod.gold.orders
GROUP BY 1, 2, 3;

-- Dashboard queries hit the summary instead of raw table
-- 1M row aggregation → 365 row lookup

Serverless vs Pro: Total Cost Comparison

For a realistic scenario: four analysts issuing 50 queries per day in total, each query taking 30 seconds on average, across an 8-hour business day:

| Warehouse         | Size  | Running Hours/Day         | DBUs/Hour   | Daily DBU Cost |
|-------------------|-------|---------------------------|-------------|----------------|
| Pro/Classic Small | Small | 8 h (always on)           | 4           | 32 DBU         |
| Serverless Small  | Small | ~25 min actual query time | 4 × 1.5 = 6 | ~2.5 DBU       |

Serverless saves roughly 92% in this scenario (2.5 vs 32 DBU). The break-even point where Classic/Pro wins: a warehouse that is busy for more than about two-thirds of its running hours, the inverse of the 1.5× serverless rate premium.
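
To sanity-check those numbers, here is the arithmetic spelled out (assuming the 4 DBU/hour Small rate from the sizing table and the 1.5x serverless premium noted earlier):

# Daily DBU math for the scenario above
CLASSIC_DBU_PER_HOUR = 4      # Small, from the sizing table
SERVERLESS_PREMIUM = 1.5      # mid-range of the 1.4-1.6x note

query_hours = 50 * 30 / 3600                                  # ~0.42 h of real query time
classic_daily = 8 * CLASSIC_DBU_PER_HOUR                      # 32 DBU: billed all 8 h
serverless_daily = query_hours * CLASSIC_DBU_PER_HOUR * SERVERLESS_PREMIUM  # 2.5 DBU

print(f"classic: {classic_daily} DBU, serverless: {serverless_daily:.1f} DBU")
print(f"savings: {1 - serverless_daily / classic_daily:.0%}")  # ~92%

# Break-even: serverless costs more once the warehouse is busy
# more than 1 / SERVERLESS_PREMIUM, i.e. about 67% of its running hours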


Databricks SQL + Harbinger Explorer

Teams using Harbinger Explorer can connect directly to their SQL Warehouses to run exploration queries against external API datasets — without spinning up a separate ETL pipeline. Harbinger's API crawler can push structured results directly into Delta tables queryable by your SQL Warehouse, turning external data (geopolitical events, economic indicators, supply chain signals) into SQL-accessible assets in minutes.

-- After Harbinger ingests external data to your Delta table:
SELECT
  h.event_country,
  h.event_type,
  COUNT(*) AS event_count,
  SUM(s.revenue_impact) AS revenue_impact
FROM harbinger.geopolitical.events h
JOIN prod.gold.sales s ON h.event_country = s.ship_to_country
  AND h.event_date = s.sale_date
WHERE h.event_date >= '2024-01-01'
GROUP BY 1, 2
ORDER BY revenue_impact DESC;

Checklist: Cost-Optimized SQL Warehouse Setup

  • Use Serverless for bursty/unpredictable workloads
  • Set auto-stop ≤ 10 minutes for dev warehouses
  • Set auto-stop ≤ 30 minutes for production BI
  • Separate dev/analytics/production into different warehouses
  • Enable result caching (on by default, don't disable)
  • Monitor system.billing.usage weekly
  • Right-size to the smallest warehouse that meets P95 query latency SLA
  • Use materialized tables for expensive recurring aggregations
  • Tag warehouses for cost attribution (team, environment, owner)

Conclusion

SQL Warehouse cost optimization is primarily about matching supply to demand: right-sized compute, aggressive auto-stop for development environments, and Serverless for everything with unpredictable load. Start with Serverless unless you have a specific reason to use Pro. Monitor system.billing.usage and system.query.history weekly to catch runaway costs early.

The teams that get this right spend 40–70% less on SQL compute than those who size up "just to be safe" and forget to tune auto-stop.


Try Harbinger Explorer free for 7 days — connect to your Databricks SQL Warehouse, explore external datasets, and build data products that combine internal lakehouse data with real-world signals from across the web. harbingerexplorer.com

