What Is dbt? The Data Engineer's Complete Guide
You've joined a data team, and someone keeps referencing "dbt models" in pull requests. Or maybe your warehouse has 200 views and nobody knows which ones are still in use. dbt (data build tool) is the framework that turned SQL-based data transformation into a proper engineering discipline — with version control, testing, and documentation baked in.
This article explains what dbt actually does, how it works under the hood, when it's the right choice, and when it isn't.
What Is dbt and Why Does It Exist?
dbt is an open-source command-line tool that lets you transform data already loaded into your warehouse using SQL SELECT statements. You write models (SQL files); dbt compiles them into DDL/DML and executes them against your warehouse in the correct dependency order.
Before dbt, data transformation meant one of three things:
- Stored procedures nobody dared touch
- Python scripts that mixed extraction, loading, and transformation into unmaintainable blobs
- GUI-based ETL tools where logic lived in drag-and-drop canvases, invisible to version control
dbt changed the game by applying software engineering practices — git, CI/CD, modularity, testing — to the T in ELT. It doesn't extract or load data. It only transforms what's already in your warehouse.
How dbt Works: The Core Concepts
Models
A dbt model is a SQL SELECT statement saved as a .sql file. dbt wraps it in CREATE TABLE AS or CREATE VIEW AS depending on your materialization config. Here's a practical example — a staging model that cleans raw order data:
-- models/staging/stg_orders.sql
-- Dialect: Snowflake SQL (works similarly in BigQuery, Redshift, Postgres)
WITH source AS (
    SELECT * FROM {{ source('raw', 'orders') }}
),

cleaned AS (
    SELECT
        id AS order_id,
        user_id,
        TRIM(LOWER(status)) AS order_status,
        amount_cents / 100.0 AS order_amount_usd,
        created_at::timestamp AS ordered_at
    FROM source
    WHERE id IS NOT NULL
)

SELECT * FROM cleaned
The {{ source() }} and {{ ref() }} functions are Jinja macros that dbt uses to build a dependency graph (DAG) of your models. When you run dbt run, it executes models in topological order — upstream tables first.
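To see how the dependency graph forms, here is a hypothetical downstream mart model that builds on stg_orders via ref() — the model and column names are illustrative, not from any real project:

```sql
-- models/marts/fct_daily_orders.sql (hypothetical example)
-- {{ ref('stg_orders') }} tells dbt this model depends on stg_orders,
-- so stg_orders is always built first in the DAG.
SELECT
    DATE_TRUNC('day', ordered_at) AS order_date,
    COUNT(*)                      AS order_count,
    SUM(order_amount_usd)         AS total_revenue_usd
FROM {{ ref('stg_orders') }}
WHERE order_status = 'completed'
GROUP BY 1
```

At compile time, dbt replaces the ref() call with the fully qualified table name for your target environment, which is what makes the same code work in dev and prod schemas.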
Materializations
dbt supports four materialization strategies out of the box:
| Materialization | What It Creates | When to Use |
|---|---|---|
| view | SQL view | Lightweight staging layers, low query frequency |
| table | Physical table (full rebuild) | Marts and aggregations, moderate data volume |
| incremental | Appends/merges new rows only | Large fact tables, event streams |
| ephemeral | CTE (no database object) | Reusable logic that doesn't need its own table |
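The incremental strategy is the one that needs the most explanation. A sketch of what an incremental model looks like — the model, table, and column names here are assumptions for illustration:

```sql
-- models/marts/fct_events.sql (illustrative sketch)
{{ config(
    materialized='incremental',
    unique_key='event_id'
) }}

SELECT
    event_id,
    user_id,
    event_type,
    occurred_at
FROM {{ ref('stg_events') }}

{% if is_incremental() %}
  -- On incremental runs, only process rows newer than what's already loaded.
  -- {{ this }} refers to the existing target table.
  WHERE occurred_at > (SELECT MAX(occurred_at) FROM {{ this }})
{% endif %}
```

On the first run (or with --full-refresh), the is_incremental() block is skipped and the table is built from scratch; on subsequent runs, dbt merges or appends only the new rows.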
Tests and Documentation
dbt lets you define tests in YAML:
# models/staging/_stg_models.yml
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: order_status
        tests:
          - accepted_values:
              values: ['pending', 'completed', 'cancelled', 'refunded']
Run dbt test and you get pass/fail results for every assertion. This is table stakes for data quality — and something most SQL-only workflows completely lack.
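Beyond the generic YAML tests, dbt also supports singular tests: a SQL file in the tests/ directory where any returned rows count as failures. A hypothetical example (the assertion and column names are illustrative):

```sql
-- tests/assert_order_amounts_positive.sql (hypothetical singular test)
-- dbt treats every row this query returns as a test failure.
SELECT
    order_id,
    order_amount_usd
FROM {{ ref('stg_orders') }}
WHERE order_amount_usd < 0
```

Singular tests are the escape hatch for business-logic assertions that the built-in generics can't express.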
dbt Core vs dbt Cloud
This is the first decision any team evaluating dbt needs to make. Here's an honest comparison:
| Dimension | dbt Core (OSS) | dbt Cloud |
|---|---|---|
| Cost | Free (open source) | Team: $100/seat/mo, Enterprise: custom [Last verified: March 2026] |
| Execution | CLI, runs wherever you deploy it | Managed SaaS, browser-based IDE |
| Scheduling | BYO (Airflow, cron, Dagster, etc.) | Built-in job scheduler |
| CI/CD | You configure it (GitHub Actions, etc.) | Slim CI built-in (runs modified models on PR) |
| IDE | Your editor (VS Code, vim, whatever) | Cloud IDE with DAG visualization |
| Semantic Layer | MetricFlow (self-hosted) | Managed MetricFlow + API |
| Access Control | None (it's a CLI tool) | Environment-level permissions, SSO |
| Setup Effort | Medium-high (infra, orchestration, CI) | Low (sign up, connect warehouse, go) |
| Flexibility | Full control, any orchestrator | Locked into dbt Cloud patterns |
When to choose dbt Core: Your team already has orchestration (Airflow, Dagster), wants full control, and has engineers comfortable with DevOps. Cost-sensitive teams also benefit — it's free forever.
When to choose dbt Cloud: You're a smaller team without dedicated platform engineers, want fast onboarding, or need the managed semantic layer. The built-in CI alone saves significant setup time.
My take: most teams with existing Airflow or Dagster deployments get more value from dbt Core. If you're starting from scratch and don't want to manage infrastructure, dbt Cloud earns its price.
A Practical dbt Tutorial: From Zero to First Model
Here's the fastest path to running your first dbt transformation against a warehouse:
1. Install dbt Core (using your warehouse adapter):
pip install dbt-snowflake # or dbt-bigquery, dbt-postgres, dbt-redshift
2. Initialize a project:
dbt init my_analytics
cd my_analytics
3. Configure your warehouse connection in ~/.dbt/profiles.yml:
my_analytics:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: xy12345.us-east-1
      user: "{{ env_var('DBT_USER') }}"
      password: "{{ env_var('DBT_PASSWORD') }}"
      database: ANALYTICS
      warehouse: TRANSFORM_WH
      schema: DEV_MARC
      threads: 4
4. Write a model and run it:
dbt run --select stg_orders
dbt test --select stg_orders
That's it. You have a tested, version-controlled transformation. The dbt documentation covers adapters for every major warehouse.
Common Pitfalls
Having worked with dbt across multiple teams, these are the traps I see most often:
1. The "one giant staging layer" anti-pattern. Teams dump every source table into a staging model without thinking about what downstream actually needs. You end up with 300 staging models, only 40 of which are actually referenced. Audit your DAG regularly with dbt ls --resource-type model --select +marts.
2. Overusing incremental models too early. Incremental materializations add complexity — merge keys, late-arriving data handling, full-refresh cadences. Start with table materialization. Switch to incremental only when rebuild time actually becomes a problem.
3. No naming conventions. Without prefixes (stg_, int_, fct_, dim_), your project becomes a maze within months. Establish conventions on day one and enforce them with dbt-project-evaluator.
4. Ignoring the ref() graph. Hard-coding table names instead of using {{ ref('model_name') }} breaks dbt's dependency resolution. Every cross-model reference should use ref() — no exceptions.
5. Skipping tests because "the data looks fine." It looks fine until it doesn't. At minimum, test primary keys (unique, not_null) on every model. It takes 30 seconds per model and saves hours of debugging downstream dashboards.
6. Not using sources. Define your raw tables as sources with freshness checks. Without this, you have no visibility into whether upstream data is actually arriving.
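A source definition with freshness checks looks like this — the source, table, and timestamp column names are assumptions for illustration:

```yaml
# models/staging/_sources.yml (illustrative example)
version: 2

sources:
  - name: raw
    database: RAW
    tables:
      - name: orders
        loaded_at_field: _loaded_at   # assumed ingestion timestamp column
        freshness:
          warn_after: {count: 6, period: hour}
          error_after: {count: 24, period: hour}
```

Running dbt source freshness then warns or errors when the raw table hasn't received new rows within those windows, giving you an early signal before stale data propagates downstream.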
When dbt Is NOT the Right Choice
dbt is excellent at what it does, but it's not universal. Skip it when:
- Your transformations are primarily Python/Spark. If you're doing ML feature engineering, complex geospatial processing, or working with unstructured data, dbt's SQL-first approach creates friction. Yes, dbt now supports Python models, but they're a second-class citizen compared to the SQL experience.
- You don't have a warehouse. dbt transforms data inside a warehouse. If your data lives in APIs, flat files, or application databases and you need to move it first, you need an ingestion tool — not dbt.
- Your team doesn't know SQL. This sounds obvious, but dbt's value proposition assumes SQL competency. A team of Python-only engineers will fight the paradigm instead of benefiting from it.
- You're doing real-time streaming. dbt operates in batch mode. If you need sub-second latency transformations, look at Kafka Streams, Flink, or Materialize.
- Your entire pipeline is 3 queries. dbt's project structure, YAML config, and compilation step add overhead. For trivially simple transformations, a scheduled SQL script is honestly fine.
dbt in the Modern Data Stack
dbt sits at the transformation layer of the medallion architecture — specifically between raw/bronze data and the cleaned silver/gold layers your analysts query. It pairs naturally with:
- Ingestion: Fivetran, Airbyte, custom API pipelines (building REST API pipelines)
- Orchestration: Airflow, Dagster, Prefect (for dbt Core)
- Warehouse: Snowflake, BigQuery, Redshift, Databricks, Postgres
- BI / Analytics: Looker, Metabase, Power BI — or lighter-weight tools for ad-hoc exploration
That last category is where the workflow gets interesting. Once dbt has built clean, tested tables in your warehouse, someone still needs to query them. Traditional BI tools work, but sometimes you just want to point a tool at your data and ask questions. Harbinger Explorer does exactly this — you can connect to data sources through its source catalog, then use natural language queries or write SQL directly in the browser via its DuckDB engine, without setting up a full BI stack.
Whether you choose dbt or Spark SQL for your transformation layer, the choice of downstream query tooling matters just as much as the transformation framework itself.
What's Next for dbt?
The dbt ecosystem continues to evolve. The semantic layer (MetricFlow) aims to create a single source of truth for metric definitions. The introduction of Python models expands dbt beyond pure SQL. And the mesh architecture pattern — where teams publish and discover each other's models across projects — addresses the scaling challenges large organizations face.
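For context on what a Python model looks like: it's a file in your models/ directory defining a function named model that returns a DataFrame. This is a minimal sketch — the file name and upstream model are assumptions, and the DataFrame type depends on your adapter (Snowpark, PySpark, or pandas):

```python
# models/marts/orders_enriched.py (hypothetical dbt Python model)
# dbt discovers this function by name; it never runs standalone.
def model(dbt, session):
    dbt.config(materialized="table")

    # dbt.ref() returns the upstream model as a DataFrame in the
    # adapter's native flavor (Snowpark, PySpark, or pandas).
    orders = dbt.ref("stg_orders")

    # ... transform with that adapter's DataFrame API ...

    return orders
```

The ref()/source() dependency graph works the same way as in SQL models, so Python models slot into the existing DAG rather than living beside it.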
Whether dbt stays dominant depends on how well it handles the tension between simplicity (its original strength) and the feature sprawl that comes with enterprise adoption. For now, it remains the most practical way to bring engineering rigor to SQL-based data transformation.
Your next step: If you're evaluating dbt, start with dbt Core against a development schema. Build 5-10 models following the staging → intermediate → marts pattern. You'll know within a week whether it fits your team's workflow.
Continue Reading
- dbt vs Spark SQL: How to Choose — direct comparison for teams deciding between the two
- Medallion Architecture Explained — the layered pattern dbt implements
- Building a REST API Data Pipeline in Python — the ingestion layer that feeds dbt