Data Lake vs Warehouse vs Lakehouse: Which to Pick?
Most data architecture debates collapse into the same three options: data lake, data warehouse, or lakehouse. Each camp has evangelists who'll tell you their choice is obviously correct. The reality is that each solves a different problem, and picking the wrong one can cost you a year of painful migration. Here's the honest comparison.
TL;DR
| | Data Lake | Data Warehouse | Lakehouse |
|---|---|---|---|
| Storage | Object storage (S3/GCS/ADLS) | Proprietary columnar | Object storage + open format |
| Schema | Schema-on-read | Schema-on-write | Schema-on-write (open) |
| SQL | Limited / via compute engine | Native, optimized | Full SQL via engine |
| ML/AI workloads | Native | Awkward | Native |
| Cost at scale | Low storage, variable compute | High, predictable | Low-medium |
| Governance | Hard | Easy | Medium |
| Maturity | High (messy) | High (proven) | Growing fast |
Data Lake — The Raw Archive
A data lake stores data in its native format — JSON, Parquet, CSV, Avro, images, audio — in cheap object storage. There's no schema enforcement at write time. You dump data in; you apply schema when you read it.
The promise: Store everything, figure out what you need later. No data loss from upfront schema decisions.
The reality: Most data lakes become "data swamps" within 18 months. Without discipline, you end up with:
- Files no one remembers creating
- Multiple versions of the same dataset with slightly different schemas
- No queryable catalog → analysts can't find anything
- Raw JSON that no one wants to parse
```python
# Reading a messy data lake file: schema is applied only at read time
import duckdb

# read_json_auto infers column names and types per file; nothing
# guarantees the next file under this prefix has the same shape.
result = duckdb.sql("""
    SELECT
        user_id,
        event_type,
        CAST(ts AS TIMESTAMP) AS event_ts
    FROM read_json_auto('s3://my-lake/events/2024/01/**/*.json')
    WHERE event_type = 'purchase'
""")
```
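The schema-drift problem is easy to reproduce. A minimal sketch (field names and batches are hypothetical) of how two drops of the "same" dataset quietly diverge when nothing enforces a schema at write time:

```python
# Two batches of the "same" events dataset, written months apart by
# different services -- a typical schema-on-read hazard.
batch_jan = [{"user_id": "u1", "event_type": "purchase", "ts": "2024-01-05T10:00:00"}]
batch_jun = [{"userId": "u2", "event": "purchase", "timestamp": 1717600000}]  # renamed + retyped

def field_names(records):
    """Union of keys across a batch -- the 'schema' you discover at read time."""
    return set().union(*(r.keys() for r in records))

# Symmetric difference: fields present in one batch but not the other.
drift = field_names(batch_jan) ^ field_names(batch_jun)
print(sorted(drift))
# Every field differs; nothing lines up without a mapping layer,
# and nothing at write time stopped this from happening.
```

In a warehouse, the June batch would have been rejected at load time; in a lake, the divergence only surfaces when someone's query breaks.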
Best for: ML feature stores, raw event archival, teams with strong data engineering discipline, workloads mixing structured and unstructured data.
Avoid when: Your primary consumers are SQL-first analysts. They will suffer.
Data Warehouse — The SQL-First Workhorse
A data warehouse stores structured, modeled data in a proprietary columnar format. Schema is enforced at write time. Everything is typed, indexed, and optimized for SQL queries.
Major players: Snowflake, BigQuery, Redshift, Databricks SQL Warehouse, Azure Synapse.
The promise: Fast SQL, great governance, mature tooling, analysts are productive from day one.
The reality: You pay a premium for that convenience. Proprietary storage formats lock you in. Storing semi-structured data (JSON, nested arrays) is possible but often painful. Running Python-based ML models directly on the data typically means exporting it first.
```sql
-- Snowflake SQL: schema enforced, types guaranteed
CREATE OR REPLACE TABLE orders (
    order_id    STRING NOT NULL,
    customer_id STRING NOT NULL,
    order_total DECIMAL(10,2),
    placed_at   TIMESTAMP_NTZ,
    status      VARCHAR(50)
);

-- Queries are fast because data is typed and compressed
SELECT
    customer_id,
    COUNT(*) AS order_count,
    SUM(order_total) AS total_spend
FROM orders
WHERE placed_at >= DATEADD(day, -30, CURRENT_TIMESTAMP())
GROUP BY 1
ORDER BY 3 DESC;
```
Best for: BI-first teams, mature analytics organizations, teams that need predictable performance SLAs, regulated industries where governance matters.
Avoid when: You need to run ML workloads on the same data without exporting. Or your data includes raw files (images, logs, audio) that don't fit tabular storage.
Lakehouse — The Attempt to Have Both
A lakehouse layers open table formats (Delta Lake, Apache Iceberg, Apache Hudi) on top of object storage. You get the cost and flexibility of a data lake with the SQL semantics, ACID transactions, and schema enforcement of a warehouse.
```
 Object Storage (S3/GCS/ADLS)
              │
┌─────────────┴──────────────┐
│     Open Table Format      │
│  (Delta Lake / Iceberg /   │
│           Hudi)            │
│  – ACID transactions       │
│  – Schema enforcement      │
│  – Time travel             │
│  – Partition pruning       │
└─────────────┬──────────────┘
              │
     ┌────────┴────────┐
     │                 │
 SQL Engine         ML/Python
 (Spark/Trino/      (PySpark,
  Athena/DuckDB)     MLflow)
```
The promise: One storage layer, multiple compute engines, no vendor lock-in on storage, SQL for analysts and Python for data scientists, ML and BI from the same tables.
The reality: Complexity is non-trivial. You need to manage compaction, vacuuming, and file management that a warehouse handles for you automatically. The tooling is maturing fast but isn't as polished as Snowflake or BigQuery for pure SQL work.
See our Data Lakehouse Architecture Explained article for a deeper dive into open table formats.
```python
# PySpark: writing a Delta Lake table with schema enforcement
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, StringType, DecimalType, TimestampType,
)

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

schema = StructType([
    StructField("order_id", StringType(), False),
    StructField("customer_id", StringType(), False),
    StructField("order_total", DecimalType(10, 2), True),
    StructField("placed_at", TimestampType(), True),
])

df = spark.read.schema(schema).parquet("s3://raw/orders/")

(
    df.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "false")  # reject writes that drift from the table schema
    .save("s3://lakehouse/gold/orders/")
)
```
Best for: Teams that want open standards to avoid vendor lock-in, organizations running ML and BI on the same data, companies already in the Databricks or Spark ecosystem.
Avoid when: Your team is SQL-only and has no appetite for managing a distributed compute layer. A managed warehouse is almost certainly more productive.
Decision Tree
Use this to cut through the noise:
```
Do you have structured data only (tables, no raw files)?
├─ YES → Do you need ML/Python workloads on the same data?
│        ├─ YES → Lakehouse
│        └─ NO  → Data Warehouse
└─ NO  → Do you have strong data engineering discipline?
         ├─ YES → Lakehouse (open table formats over object storage)
         └─ NO  → Start with Data Warehouse, add a lake later
```
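The tree is mechanical enough to encode directly. A small sketch (the three inputs are the tree's three questions; the function name is mine):

```python
def recommend(structured_only: bool, needs_ml: bool, strong_data_eng: bool) -> str:
    """Encode the decision tree above: same questions, same outcomes."""
    if structured_only:
        # Tables only: the question is whether ML shares the data
        return "Lakehouse" if needs_ml else "Data Warehouse"
    if strong_data_eng:
        # Raw files plus discipline: open table formats over object storage
        return "Lakehouse (open table formats over object storage)"
    return "Start with Data Warehouse, add a lake later"

print(recommend(structured_only=True, needs_ml=False, strong_data_eng=False))
# -> Data Warehouse
```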
Additional signals that push you toward each option:
| Signal | Lean Toward |
|---|---|
| Team is SQL-first, BI tools are primary consumers | Warehouse |
| Lots of raw events, logs, semi-structured data | Lake or Lakehouse |
| ML/AI is a first-class citizen | Lakehouse |
| Budget is constrained, scale is large | Lake or Lakehouse |
| Compliance / fine-grained access control is critical | Warehouse |
| Vendor lock-in is a concern | Lakehouse (open formats) |
| Small team, fast time-to-value needed | Warehouse |
| Already on Databricks or Spark ecosystem | Lakehouse |
Honest Trade-Offs Side by Side
Data Lake Trade-Offs
✅ Cheapest storage
✅ Handles any data type
✅ No schema lock-in upfront
❌ Governance is hard
❌ SQL performance poor without a compute engine
❌ Swamp risk is real
Data Warehouse Trade-Offs
✅ Best SQL performance
✅ Easiest governance and access control
✅ Analysts productive immediately
❌ Proprietary format = vendor lock-in
❌ ML workloads require data export
❌ Highest cost at scale
Lakehouse Trade-Offs
✅ Open formats, no storage lock-in
✅ SQL + Python from one layer
✅ ACID + time travel without a proprietary warehouse
❌ More operational complexity to manage
❌ Compute engine selection adds decisions
❌ Less mature than warehouse for governance
The Hybrid Reality
Most companies above 50 engineers end up with a hybrid. A typical mature stack looks like:
- Raw zone (data lake): S3 with raw event data, logs, external API dumps
- Curated zone (lakehouse): Delta Lake or Iceberg tables with transformation, ACID guarantees
- Serving zone (warehouse or lakehouse): dbt-modeled marts, queried by BI tools via Snowflake or Databricks SQL
This is effectively the Medallion Architecture — bronze (raw lake), silver (cleansed lakehouse), gold (served warehouse).
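In code terms, the medallion flow is three progressively stricter transformations. A toy end-to-end sketch (record shapes and field names are hypothetical; a real pipeline would run each zone as a separate job):

```python
import json

# Bronze: raw events exactly as they landed in the lake -- untyped, messy.
bronze = [
    '{"order_id": "o1", "total": "19.99", "status": "paid"}',
    '{"order_id": "o2", "total": "5.00", "status": "paid"}',
    '{"order_id": "o1", "total": "19.99", "status": "paid"}',  # duplicate delivery
    '{"order_id": "o3", "total": null, "status": "cancelled"}',
]

# Silver: parsed, typed, deduplicated -- the cleansed lakehouse zone.
silver = {}
for line in bronze:
    rec = json.loads(line)
    if rec["total"] is None:
        continue  # a real pipeline would quarantine bad records, not drop them
    silver[rec["order_id"]] = {"total": float(rec["total"]), "status": rec["status"]}

# Gold: an aggregate mart shaped for BI consumption.
gold = {
    "paid_orders": sum(1 for r in silver.values() if r["status"] == "paid"),
    "revenue": round(sum(r["total"] for r in silver.values()), 2),
}
print(gold)  # {'paid_orders': 2, 'revenue': 24.99}
```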
Querying Across Layers Without Complexity
If you want to explore data across these layers before committing to an architecture, Harbinger Explorer lets you query CSVs, Parquet files, and API endpoints directly in the browser via DuckDB WASM. It's useful for validating data shapes and running ad-hoc SQL against files before you've stood up a full warehouse or lakehouse — no infrastructure required.
The Bottom Line
The data lake vs. warehouse debate is largely over for greenfield projects: the lakehouse pattern wins on paper. But if your team is SQL-first and has no ML workloads, a managed warehouse will get you productive faster. Start with the access pattern your team actually has today, not the architecture you might need in three years.
Next step: Map your team's primary consumers (SQL analysts vs. Python ML engineers) and your dominant data types (structured vs. raw). That answer should make the decision obvious.
Continue Reading
Data Deduplication Strategies: Hash, Fuzzy, and Record Linkage
Airflow vs Dagster vs Prefect: An Honest Comparison
An unbiased comparison of Airflow, Dagster, and Prefect — covering architecture, DX, observability, and real trade-offs to help you pick the right orchestrator.
Change Data Capture Explained
A practical guide to CDC patterns — log-based, trigger-based, and polling — with Debezium configuration examples and Kafka Connect integration.