Engineering

dbt vs Spark SQL: How to Choose

7 min read · Tags: dbt, spark sql, databricks, transformation tools, analytics engineering, sql, data engineering, lakehouse

If you're building a transformation layer on a lakehouse, at some point you'll face the dbt vs Spark SQL question. The two tools don't optimize for the same problems, and choosing wrong means months of maintenance pain or missed scale requirements. This comparison breaks down where each tool fits, when to combine them, and what the honest trade-offs actually look like.

TL;DR (If You're Busy)

Choose dbt if your team thinks in SQL, you want built-in testing and documentation, and your warehouse or lakehouse supports a dbt adapter.

Choose Spark SQL if you need to process at scales beyond what your warehouse SQL engine handles efficiently, or you're working with transformations that benefit from the Spark DataFrame API.

Use both if you're on Databricks — many teams use Spark SQL for heavy transformation work and dbt for the modeling, testing, and documentation layer on top. These roles don't overlap.


What Each Tool Actually Does

dbt (data build tool) is a transformation framework, not a query engine. It wraps SQL SELECT statements into a DAG (directed acyclic graph), runs them against your existing warehouse or lakehouse, and adds testing, documentation, and lineage tracking on top. dbt doesn't execute SQL — your warehouse does. The value is the engineering discipline around the SQL, not the SQL itself.
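The DAG behavior is the core of that discipline: dbt parses `{{ ref('...') }}` calls, builds a dependency graph, and runs models in topological order. Here's a toy sketch of the concept using Python's stdlib — this illustrates the idea, not dbt's actual internals, and the model names are hypothetical:

```python
from graphlib import TopologicalSorter

# Toy version of what dbt derives from ref() calls: each model maps to the
# set of models it depends on. dbt builds this graph automatically by
# parsing {{ ref('...') }} in each model's SQL.
model_deps = {
    "stg_orders": {"raw_orders"},
    "dim_customers": {"raw_customers"},
    "fct_orders_daily": {"stg_orders", "dim_customers"},
}

# static_order() yields dependencies before dependents — a valid run order
run_order = list(TopologicalSorter(model_deps).static_order())
print(run_order)
```

Every model is guaranteed to run after everything it references — that ordering, plus testing and docs hung off the same graph, is what you're buying with dbt.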

Spark SQL is a SQL dialect that runs on Apache Spark. It can process data at scale across a distributed cluster, supports complex aggregations and window functions, and integrates tightly with the PySpark and Scala APIs. You can use it in notebooks, as part of a Spark job, or through a Databricks SQL Warehouse.

These solve fundamentally different problems: dbt is a development workflow and modeling layer; Spark SQL is a distributed query execution engine. Framing this as a direct competition misunderstands what both tools are for.


Feature Comparison

| Feature | dbt | Spark SQL |
| --- | --- | --- |
| Primary use | Transformation modeling and engineering workflow | Distributed query execution |
| SQL dialect | Adapter-dependent (Spark SQL, BigQuery SQL, etc.) | Spark SQL (ANSI SQL + extensions) |
| Built-in testing | Yes (not null, unique, accepted values, custom) | No — manual or external |
| Auto-documentation | Yes — from YAML config | No native documentation |
| Lineage tracking | Yes — built-in DAG visualization | Via Spark UI (coarse) |
| Scale ceiling | Warehouse-dependent | Handles PB-scale natively |
| Incremental models | First-class support with multiple strategies | DIY (merge, overwrite partition logic) |
| Schema evolution | Manual via migrations | Schema merge options in Delta |
| Warehouse adapters | 30+ (Snowflake, BigQuery, Databricks, Redshift, etc.) | Spark ecosystem only |
| Language | SQL + Jinja templating | SQL + PySpark / Scala API |
| Learning curve | Low for SQL practitioners | Medium (requires Spark concepts) |
| Orchestration | dbt Cloud or Airflow / Dagster | Databricks Workflows, Airflow |
| Streaming support | No | Yes (Structured Streaming) |
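To make the built-in testing and auto-documentation rows concrete, here's a minimal dbt `schema.yml` — the model and column names are illustrative, not from a real project:

```yaml
# models/marts/schema.yml — illustrative; names are hypothetical
version: 2

models:
  - name: fct_orders_daily
    description: "One row per order, loaded incrementally"
    columns:
      - name: order_id
        description: "Primary key"
        tests:
          - not_null
          - unique
      - name: order_status
        tests:
          - accepted_values:
              values: ['pending', 'shipped', 'delivered', 'cancelled']
```

`dbt test` runs these checks as SQL against your warehouse, and the same file feeds `dbt docs generate`. In Spark SQL, each of these would be a query or external framework you maintain yourself.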

Pricing

dbt Core is open source and free. Self-hosting requires your own orchestration layer (Airflow, Dagster, or similar).

dbt Cloud (Last verified: March 2026):

  • Developer: Free (1 seat)
  • Team: $100/month per seat
  • Enterprise: Custom pricing

[PRICING-CHECK] — dbt Cloud pricing; verify at getdbt.com/pricing before publication

Spark SQL — cost is your Spark infrastructure:

  • Databricks (Last verified: March 2026): DBU pricing varies by cloud provider and instance type. Serverless SQL Warehouse pricing runs approximately $0.25–$0.70/DBU depending on region. Full current pricing at databricks.com/product/pricing.
  • Apache Spark on self-managed infrastructure: You pay only compute and storage costs.

[PRICING-CHECK] — Databricks DBU pricing; verify at databricks.com/product/pricing before publication


Code Examples

dbt Model — Incremental Strategy

-- models/marts/fct_orders_daily.sql
-- SQL dialect: dbt with Databricks adapter (Spark SQL under the hood)
-- Incrementally processes new orders, merging on order_id

{{
  config(
    materialized='incremental',
    unique_key='order_id',
    incremental_strategy='merge',
    partition_by='created_date'
  )
}}

with source_orders as (
    select
        order_id,
        customer_id,
        order_status,
        total_amount,
        date(created_at) as created_date,
        created_at
    from {{ ref('stg_orders') }}

    {% if is_incremental() %}
        -- Only process records newer than the current max in the target
        where created_at > (select max(created_at) from {{ this }})
    {% endif %}
),

enriched as (
    select
        o.order_id,
        o.customer_id,
        c.customer_segment,
        o.order_status,
        o.total_amount,
        o.created_date
    from source_orders o
    left join {{ ref('dim_customers') }} c
        on o.customer_id = c.customer_id
)

select * from enriched

Equivalent in Spark SQL (PySpark)

# spark_jobs/transform_orders_daily.py
# SQL dialect: Spark SQL
# Equivalent logic to the dbt model above — merge-based incremental load
# No DAG management, no testing framework — you wire those up separately

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orders-daily-transform").getOrCreate()

# Equivalent to stg_orders — read from staging table
# Incremental filter: only records newer than current max in target
source_orders = spark.sql("""
    SELECT
        order_id,
        customer_id,
        order_status,
        total_amount,
        CAST(created_at AS DATE) AS created_date,
        created_at
    FROM catalog.staging.stg_orders
    WHERE created_at > (
        SELECT COALESCE(MAX(created_at), '1970-01-01')
        FROM catalog.marts.fct_orders_daily
    )
""")

source_orders.createOrReplaceTempView("source_orders")

# Enrich with customer dimension — equivalent to dim_customers ref
enriched = spark.sql("""
    SELECT
        o.order_id,
        o.customer_id,
        c.customer_segment,
        o.order_status,
        o.total_amount,
        o.created_date
    FROM source_orders o
    LEFT JOIN catalog.dimensions.dim_customers c
        ON o.customer_id = c.customer_id
""")

# Cache before the merge: the view is evaluated lazily, and recomputing it
# after the merge would re-apply the watermark filter against the already-
# updated target, undercounting (or zeroing) the batch
enriched.cache()
enriched.createOrReplaceTempView("enriched_orders")
row_count = enriched.count()

# Merge into target Delta table
spark.sql("""
    MERGE INTO catalog.marts.fct_orders_daily AS target
    USING enriched_orders AS source
    ON target.order_id = source.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

print(f"Processed {row_count} incremental records")

Notice what dbt handles automatically that Spark SQL doesn't: the upstream DAG (stg_orders, dim_customers resolve to the right tables without hardcoded paths), the built-in testing layer, generated documentation, and the incremental logic abstraction. The Spark SQL version is more explicit and gives you more control — but every structural decision is yours to make and maintain.
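If the MERGE semantics are unfamiliar, the same upsert behavior can be sketched with stdlib sqlite3, which needs no cluster — `ON CONFLICT ... DO UPDATE` is the closest local analogue to Delta's `WHEN MATCHED / WHEN NOT MATCHED`. Table and column names mirror the example above; nothing here touches Spark:

```python
import sqlite3

# In-memory stand-in for the target Delta table
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fct_orders_daily "
    "(order_id INTEGER PRIMARY KEY, order_status TEXT, total_amount REAL)"
)
conn.execute("INSERT INTO fct_orders_daily VALUES (1, 'pending', 10.0)")

# Incoming batch: order 1 already exists (update), order 2 is new (insert)
incoming = [(1, "shipped", 10.0), (2, "pending", 25.0)]
conn.executemany(
    """
    INSERT INTO fct_orders_daily (order_id, order_status, total_amount)
    VALUES (?, ?, ?)
    ON CONFLICT(order_id) DO UPDATE SET
        order_status = excluded.order_status,
        total_amount = excluded.total_amount
    """,
    incoming,
)

rows = conn.execute(
    "SELECT order_id, order_status FROM fct_orders_daily ORDER BY order_id"
).fetchall()
print(rows)  # order 1 updated to 'shipped', order 2 inserted
```

The matched row is updated in place and the unmatched row is inserted — exactly the behavior the Delta MERGE (and dbt's `merge` incremental strategy) provides at cluster scale.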


When to Choose dbt

  • Your team's primary language is SQL and you want transformations to look like SQL
  • You need built-in data testing, documentation, and lineage without bolting on separate tools
  • You're on Snowflake, BigQuery, or using Databricks SQL Warehouse
  • You want to enforce a modeling convention (staging → intermediate → mart) across a distributed team
  • You have junior analysts who need to contribute transformations safely within guardrails
  • You want dbt Cloud for scheduling without running your own Airflow instance

dbt's opinionated conventions are the point — they make distributed collaboration tractable. If your team writes a lot of SQL and you want software engineering practices applied to that SQL, dbt is the right tool.


When to Choose Spark SQL

  • You're transforming data at volumes that strain your warehouse SQL engine — multi-TB daily loads, complex shuffle operations
  • Your transformation logic requires iterative processing, complex UDFs, or ML feature engineering that's impractical in pure SQL
  • You're already running Spark workloads and want stack consistency
  • You need fine-grained control over partitioning, file formats, and cluster configuration
  • You're building streaming pipelines — Structured Streaming uses Spark SQL under the hood

Spark SQL is the right choice when you specifically need the Spark execution engine, not as a default for all SQL transformation work.


Honest Trade-offs

dbt's real weaknesses:

  • Model DAGs grow unwieldy in large projects without strict naming conventions and folder organization. Technical debt accumulates quietly.
  • Batch-only — dbt doesn't handle streaming or stateful processing.
  • Incremental model logic can become fragile with edge cases (late-arriving data, schema changes mid-increment).
  • dbt Cloud pricing escalates quickly for teams beyond a handful of developers.
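One common mitigation for the late-arriving-data fragility is a lookback window instead of a strict high-watermark filter. This is a sketch in dbt's Jinja-SQL against a Spark SQL backend — the three-day window is an arbitrary example you'd tune to your data's actual lateness profile:

```sql
{% if is_incremental() %}
    -- Reprocess a trailing window so late-arriving rows are picked up;
    -- the merge on unique_key makes the reprocessing idempotent
    where created_at > (
        select max(created_at) - interval 3 days from {{ this }}
    )
{% endif %}
```

The trade-off is reprocessing a few days of data on every run in exchange for not silently dropping late records. Schema changes mid-increment still require a manual full refresh.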

Spark SQL's real weaknesses:

  • No native testing framework. You're writing tests yourself or integrating Great Expectations separately.
  • Documentation is absent unless you build a separate catalog layer.
  • Cluster management adds operational overhead that SQL-first teams consistently underestimate.
  • Spark error messages from distributed jobs are notoriously hard to debug — stack traces span hundreds of lines and often obscure the actual cause.

Using Both: The Databricks Pattern

On Databricks, many teams run a hybrid: Spark SQL (or PySpark) handles the heavy transformation layer — ingestion, deduplication, large aggregations, partition management — while dbt handles the modeling layer that produces the tables business users query. The Databricks dbt adapter supports incremental strategies natively against Delta tables.

This isn't redundant — it plays to each tool's strengths. Spark handles the volume and complexity; dbt handles the structure, testing, and documentation. The interface between them is a well-defined set of Delta tables that dbt reads from.
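For reference, wiring dbt to Databricks happens in `profiles.yml`. This is an illustrative sketch — the host, HTTP path, and catalog values are placeholders, and you should check the dbt-databricks adapter documentation for the authoritative field list:

```yaml
# ~/.dbt/profiles.yml — illustrative; values are placeholders
my_project:
  target: prod
  outputs:
    prod:
      type: databricks
      catalog: main
      schema: marts
      host: adb-1234567890123456.7.azuredatabricks.net
      http_path: /sql/1.0/warehouses/abc123
      token: "{{ env_var('DATABRICKS_TOKEN') }}"
      threads: 8
```

With this in place, dbt models materialize as Delta tables in the catalog and schema you specify, reading directly from the tables your Spark jobs produce.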

If you're doing ad-hoc analysis on the output of either layer, Harbinger Explorer can run natural language queries against your data sources using DuckDB WASM in the browser — useful for exploring final mart tables without spinning up a notebook or writing SQL from scratch.


Choosing Well

dbt and Spark SQL are not competing for the same job. dbt is an engineering discipline for SQL-based transformation; Spark SQL is a distributed execution engine for workloads that need it. The question isn't which is better — it's which problem you're solving.

If your team writes SQL and needs structure, testing, and documentation: start with dbt. If you're processing at Spark-scale or embedded in the Spark ecosystem: use Spark SQL. If you're on Databricks with serious volume requirements: use both, with clear ownership of what each layer handles.



[VERIFY] — dbt Cloud pricing tiers; verify at getdbt.com/pricing
[VERIFY] — Databricks DBU pricing range ~$0.25–$0.70; confirm approximate values are accurate for March 2026
[PRICING-CHECK] — dbt Cloud: getdbt.com/pricing (Last verified: March 2026)
[PRICING-CHECK] — Databricks: databricks.com/product/pricing (Last verified: March 2026)


Continue Reading

Try Harbinger Explorer for free

Connect any API, upload files, and explore with AI — all in your browser. No credit card required.

Start Free Trial