What Is dbt? The Data Engineer's Complete Guide
You've joined a data team, and someone keeps referencing "dbt models" in pull requests. Or maybe your warehouse has 200 views and nobody knows which ones are still in use. dbt (data build tool) is the framework that turned SQL-based data transformation into a proper engineering discipline — with version control, testing, and documentation baked in.
This article explains what dbt actually does, how it works under the hood, when it's the right choice, and when it isn't.
What Is dbt and Why Does It Exist?
dbt is an open-source command-line tool that lets you transform data already loaded into your warehouse using SQL SELECT statements. You write models (SQL files); dbt compiles them into DDL/DML and executes them against your warehouse in the correct dependency order.
Before dbt, data transformation meant one of three things:
- Stored procedures nobody dared touch
- Python scripts that mixed extraction, loading, and transformation into unmaintainable blobs
- GUI-based ETL tools where logic lived in drag-and-drop canvases, invisible to version control
dbt changed the game by applying software engineering practices — git, CI/CD, modularity, testing — to the T in ELT. It doesn't extract or load data. It only transforms what's already in your warehouse.
How dbt Works: The Core Concepts
Models
A dbt model is a SQL SELECT statement saved as a .sql file. dbt wraps it in CREATE TABLE AS or CREATE VIEW AS depending on your materialization config. Here's a practical example — a staging model that cleans raw order data:
-- models/staging/stg_orders.sql
-- Dialect: Snowflake SQL (works similarly in BigQuery, Redshift, Postgres)
WITH source AS (
    SELECT * FROM {{ source('raw', 'orders') }}
),

cleaned AS (
    SELECT
        id AS order_id,
        user_id,
        TRIM(LOWER(status)) AS order_status,
        amount_cents / 100.0 AS order_amount_usd,
        created_at::timestamp AS ordered_at
    FROM source
    WHERE id IS NOT NULL
)

SELECT * FROM cleaned
The {{ source() }} and {{ ref() }} functions are Jinja macros that dbt uses to build a dependency graph (DAG) of your models. When you run dbt run, it executes models in topological order — upstream tables first.
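To see how the dependency graph forms, here is a hypothetical downstream mart model that builds on stg_orders via ref() — the model and column names are illustrative, not from any real project:

```sql
-- models/marts/fct_daily_orders.sql (hypothetical example)
-- {{ ref('stg_orders') }} tells dbt this model depends on stg_orders,
-- so stg_orders is always built first in the DAG.
SELECT
    DATE_TRUNC('day', ordered_at) AS order_date,
    COUNT(*)                      AS order_count,
    SUM(order_amount_usd)         AS total_revenue_usd
FROM {{ ref('stg_orders') }}
WHERE order_status = 'completed'
GROUP BY 1
```

At compile time, dbt replaces the ref() call with the fully qualified table name for your target environment, which is what makes the same code work in dev and prod schemas.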
Materializations
dbt supports four materialization strategies out of the box:
| Materialization | What It Creates | When to Use |
|---|---|---|
| view | SQL view | Lightweight staging layers, low query frequency |
| table | Physical table (full rebuild) | Marts and aggregations, moderate data volume |
| incremental | Appends/merges new rows only | Large fact tables, event streams |
| ephemeral | CTE (no database object) | Reusable logic that doesn't need its own table |
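The incremental strategy is the one that needs the most explanation. A sketch of what an incremental model looks like — the model, table, and column names here are assumptions for illustration:

```sql
-- models/marts/fct_events.sql (illustrative sketch)
{{ config(
    materialized='incremental',
    unique_key='event_id'
) }}

SELECT
    event_id,
    user_id,
    event_type,
    occurred_at
FROM {{ ref('stg_events') }}

{% if is_incremental() %}
  -- On incremental runs, only process rows newer than what's already loaded.
  -- {{ this }} refers to the existing target table.
  WHERE occurred_at > (SELECT MAX(occurred_at) FROM {{ this }})
{% endif %}
```

On the first run (or with --full-refresh), the is_incremental() block is skipped and the table is built from scratch; on subsequent runs, dbt merges or appends only the new rows.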
Tests and Documentation
dbt lets you define tests in YAML:
# models/staging/_stg_models.yml
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: order_status
        tests:
          - accepted_values:
              values: ['pending', 'completed', 'cancelled', 'refunded']
Run dbt test and you get pass/fail results for every assertion. This is table stakes for data quality — and something most SQL-only workflows completely lack.
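Beyond the generic YAML tests, dbt also supports singular tests: a SQL file in the tests/ directory where any returned rows count as failures. A hypothetical example (the assertion and column names are illustrative):

```sql
-- tests/assert_order_amounts_positive.sql (hypothetical singular test)
-- dbt treats every row this query returns as a test failure.
SELECT
    order_id,
    order_amount_usd
FROM {{ ref('stg_orders') }}
WHERE order_amount_usd < 0
```

Singular tests are the escape hatch for business-logic assertions that the built-in generics can't express.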
dbt Core vs dbt Cloud
This is the first decision any team evaluating dbt needs to make. Here's an honest comparison:
| Dimension | dbt Core (OSS) | dbt Cloud |
|---|---|---|
| Cost | Free (open source) | Team: $100/seat/mo, Enterprise: custom [Last verified: March 2026] |
| Execution | CLI, runs wherever you deploy it | Managed SaaS, browser-based IDE |
| Scheduling | BYO (Airflow, cron, Dagster, etc.) | Built-in job scheduler |
| CI/CD | You configure it (GitHub Actions, etc.) | Slim CI built-in (runs modified models on PR) |
| IDE | Your editor (VS Code, vim, whatever) | Cloud IDE with DAG visualization |
| Semantic Layer | MetricFlow (self-hosted) | Managed MetricFlow + API |
| Access Control | None (it's a CLI tool) | Environment-level permissions, SSO |
| Setup Effort | Medium-high (infra, orchestration, CI) | Low (sign up, connect warehouse, go) |
| Flexibility | Full control, any orchestrator | Locked into dbt Cloud patterns |
When to choose dbt Core: Your team already has orchestration (Airflow, Dagster), wants full control, and has engineers comfortable with DevOps. Cost-sensitive teams also benefit — it's free forever.
When to choose dbt Cloud: You're a smaller team without dedicated platform engineers, want fast onboarding, or need the managed semantic layer. The built-in CI alone saves significant setup time.
My take: most teams with existing Airflow or Dagster deployments get more value from dbt Core. If you're starting from scratch and don't want to manage infrastructure, dbt Cloud earns its price.
A Practical dbt Tutorial: From Zero to First Model
Here's the fastest path to running your first dbt transformation against a warehouse:
1. Install dbt Core (using your warehouse adapter):
pip install dbt-snowflake # or dbt-bigquery, dbt-postgres, dbt-redshift
2. Initialize a project:
dbt init my_analytics
cd my_analytics
3. Configure your warehouse connection in ~/.dbt/profiles.yml:
my_analytics:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: xy12345.us-east-1
      user: "{{ env_var('DBT_USER') }}"
      password: "{{ env_var('DBT_PASSWORD') }}"
      database: ANALYTICS
      warehouse: TRANSFORM_WH
      schema: DEV_MARC
      threads: 4
4. Write a model and run it:
dbt run --select stg_orders
dbt test --select stg_orders
That's it. You have a tested, version-controlled transformation. The dbt documentation covers adapters for every major warehouse.
Common Pitfalls
Having worked with dbt across multiple teams, these are the traps I see most often:
1. The "one giant staging layer" anti-pattern. Teams dump every source table into a staging model without thinking about what downstream actually needs. You end up with 300 staging models, only 40 of which are actually referenced. Audit your DAG regularly with dbt ls --resource-type model --select +marts.
2. Overusing incremental models too early. Incremental materializations add complexity — merge keys, late-arriving data handling, full-refresh cadences. Start with table materialization. Switch to incremental only when rebuild time actually becomes a problem.
3. No naming conventions. Without prefixes (stg_, int_, fct_, dim_), your project becomes a maze within months. Establish conventions on day one and enforce them with dbt-project-evaluator.
4. Ignoring the ref() graph. Hard-coding table names instead of using {{ ref('model_name') }} breaks dbt's dependency resolution. Every cross-model reference should use ref() — no exceptions.
5. Skipping tests because "the data looks fine." It looks fine until it doesn't. At minimum, test primary keys (unique, not_null) on every model. It takes 30 seconds per model and saves hours of debugging downstream dashboards.
6. Not using sources. Define your raw tables as sources with freshness checks. Without this, you have no visibility into whether upstream data is actually arriving.
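A source definition with freshness checks looks like this — the source, table, and timestamp column names are assumptions for illustration:

```yaml
# models/staging/_sources.yml (illustrative example)
version: 2

sources:
  - name: raw
    database: RAW
    tables:
      - name: orders
        loaded_at_field: _loaded_at   # assumed ingestion timestamp column
        freshness:
          warn_after: {count: 6, period: hour}
          error_after: {count: 24, period: hour}
```

Running dbt source freshness then warns or errors when the raw table hasn't received new rows within those windows, giving you an early signal before stale data propagates downstream.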
When dbt Is NOT the Right Choice
dbt is excellent at what it does, but it's not universal. Skip it when:
- Your transformations are primarily Python/Spark. If you're doing ML feature engineering, complex geospatial processing, or working with unstructured data, dbt's SQL-first approach creates friction. Yes, dbt now supports Python models, but they're a second-class citizen compared to the SQL experience.
- You don't have a warehouse. dbt transforms data inside a warehouse. If your data lives in APIs, flat files, or application databases and you need to move it first, you need an ingestion tool — not dbt.
- Your team doesn't know SQL. This sounds obvious, but dbt's value proposition assumes SQL competency. A team of Python-only engineers will fight the paradigm instead of benefiting from it.
- You're doing real-time streaming. dbt operates in batch mode. If you need sub-second latency transformations, look at Kafka Streams, Flink, or Materialize.
- Your entire pipeline is 3 queries. dbt's project structure, YAML config, and compilation step add overhead. For trivially simple transformations, a scheduled SQL script is honestly fine.
dbt in the Modern Data Stack
dbt sits at the transformation layer of the medallion architecture — specifically between raw/bronze data and the cleaned silver/gold layers your analysts query. It pairs naturally with:
- Ingestion: Fivetran, Airbyte, custom API pipelines (building REST API pipelines)
- Orchestration: Airflow, Dagster, Prefect (for dbt Core)
- Warehouse: Snowflake, BigQuery, Redshift, Databricks, Postgres
- BI / Analytics: Looker, Metabase, Power BI — or lighter-weight tools for ad-hoc exploration
That last category is where the workflow gets interesting. Once dbt has built clean, tested tables in your warehouse, someone still needs to query them. Traditional BI tools work, but sometimes you just want to point a tool at your data and ask questions. Harbinger Explorer does exactly this — you can connect to data sources through its source catalog, then use natural language queries or write SQL directly in the browser via its DuckDB engine, without setting up a full BI stack.
Whether you choose dbt or Spark SQL for your transformation layer, the choice of downstream query tooling matters just as much as the transformation framework itself.
What's Next for dbt?
The dbt ecosystem continues to evolve. The semantic layer (MetricFlow) aims to create a single source of truth for metric definitions. The introduction of Python models expands dbt beyond pure SQL. And the mesh architecture pattern — where teams publish and discover each other's models across projects — addresses the scaling challenges large organizations face.
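For context on what a Python model looks like: it's a file in your models/ directory defining a function named model that returns a DataFrame. This is a minimal sketch — the file name and upstream model are assumptions, and the DataFrame type depends on your adapter (Snowpark, PySpark, or pandas):

```python
# models/marts/orders_enriched.py (hypothetical dbt Python model)
# dbt discovers this function by name; it never runs standalone.
def model(dbt, session):
    dbt.config(materialized="table")

    # dbt.ref() returns the upstream model as a DataFrame in the
    # adapter's native flavor (Snowpark, PySpark, or pandas).
    orders = dbt.ref("stg_orders")

    # ... transform with that adapter's DataFrame API ...

    return orders
```

The ref()/source() dependency graph works the same way as in SQL models, so Python models slot into the existing DAG rather than living beside it.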
Whether dbt stays dominant depends on how well it handles the tension between simplicity (its original strength) and the feature sprawl that comes with enterprise adoption. For now, it remains the most practical way to bring engineering rigor to SQL-based data transformation.
Your next step: If you're evaluating dbt, start with dbt Core against a development schema. Build 5-10 models following the staging → intermediate → marts pattern. You'll know within a week whether it fits your team's workflow.
Continue Reading
- dbt vs Spark SQL: How to Choose — direct comparison for teams deciding between the two
- Medallion Architecture Explained — the layered pattern dbt implements
- Building a REST API Data Pipeline in Python — the ingestion layer that feeds dbt