DuckDB Tutorial: Analytical SQL Directly in Your Browser
If you've run aggregations on a 5-million-row CSV in SQLite and stared at a spinning cursor, you already feel the problem. DuckDB exists to fix that — and this tutorial gets you running real analytical SQL in under 15 minutes, no server required.
What Makes DuckDB Different
DuckDB is an in-process analytical database. No server process. No config file. No daemon to manage. You embed it directly in Python, use it from the command line, or run it in a browser via WebAssembly — and it executes OLAP-style queries using a vectorized columnar engine that consistently surprises people with how fast it is on local data.
Getting started is intentionally minimal: one pip install duckdb and you're writing analytical SQL. But the real capability is in the SQL dialect itself — read_parquet, read_csv_auto, PIVOT, window functions, and a growing set of analytical extensions. This is not SQLite with a marketing rebrand. The execution model is fundamentally different.
DuckDB Architecture: Embedded, No Server
Understanding DuckDB's architecture explains both its strengths and its limits. Everything runs inside your process — there is no separate server component.
① In-Process Execution — DuckDB runs inside your application. No network round-trips, no connection pooling, no port to open. It starts in milliseconds.
② Columnar Engine — Queries scan only the columns they need, using SIMD instructions and vectorized batching. Aggregations on wide tables are dramatically faster than row-oriented databases.
③ Direct File Access — DuckDB reads Parquet, CSV, and JSON directly from disk, remote HTTPS, or S3-compatible storage without importing them into a database first.
④ DuckDB WASM — The WebAssembly build compiles the full DuckDB engine for browser execution. No backend, no data upload — queries run client-side inside the browser tab.
DuckDB vs SQLite vs PostgreSQL
The most common confusion is where DuckDB fits relative to existing databases. Here's the comparison that matters for analytical SQL:
| Feature | DuckDB | SQLite | PostgreSQL |
|---|---|---|---|
| Primary use case | Analytical (OLAP) | Transactional (OLTP) | Transactional + mixed |
| Execution model | Columnar, vectorized | Row-based | Row-based |
| Server required | ❌ No (embedded) | ❌ No (embedded) | ✅ Yes |
| Native Parquet support | ✅ read_parquet() | ❌ No | ❌ Extension needed |
| read_csv_auto() | ✅ Yes | ❌ No | ❌ No |
| PIVOT / UNPIVOT | ✅ Native | ❌ No | ❌ Manual CASE WHEN |
| Window functions | ✅ Full support | ✅ Basic | ✅ Full support |
| Concurrent writes | ⚠️ Single writer | ⚠️ Limited | ✅ Full MVCC |
| WASM browser build | ✅ Yes | ✅ Yes | ❌ No |
| Best for | Local analytics, ETL, notebooks | Apps with light local storage | Production web applications |
Last verified: March 2026
The design philosophy difference: DuckDB optimizes for read throughput on wide tables. SQLite optimizes for small, frequent writes in embedded apps. PostgreSQL optimizes for correctness and concurrent access in server applications. They're not competitors — they solve different problems.
Tutorial: DuckDB in Practice
Step 1: Install DuckDB
pip install duckdb
No system dependencies. No Docker. No environment variables. This is the full install.
For the CLI:
# macOS
brew install duckdb
# Or download the binary from duckdb.org/docs/installation
Step 2: Query a CSV File with read_csv_auto
DuckDB's read_csv_auto reads a CSV file and infers column types automatically. You write SQL against the file directly — no import step, no CREATE TABLE first.
-- DuckDB SQL: query a CSV file without importing it
SELECT
region,
COUNT(*) AS order_count,
SUM(revenue) AS total_revenue,
ROUND(AVG(revenue), 2) AS avg_order_value
FROM read_csv_auto('sales_data.csv')
WHERE order_date >= '2024-01-01'
GROUP BY region
ORDER BY total_revenue DESC;
From Python, returning a pandas DataFrame:
import duckdb
conn = duckdb.connect() # in-memory; pass a file path to persist
result = conn.execute("""
SELECT
region,
COUNT(*) AS order_count,
SUM(revenue) AS total_revenue,
ROUND(AVG(revenue), 2) AS avg_order_value
FROM read_csv_auto('sales_data.csv')
WHERE order_date >= '2024-01-01'
GROUP BY region
ORDER BY total_revenue DESC
""").fetchdf()
print(result)
One practical note: read_csv_auto is excellent for exploration. In production pipelines, specify explicit types with read_csv('file.csv', columns={'id': 'INTEGER', 'revenue': 'DOUBLE'}) to avoid schema drift issues.
Step 3: Query Parquet Files with read_parquet
Parquet is the standard format of the modern data stack — columnar, compressed, efficient for analytics. DuckDB's read_parquet can query local files, HTTPS endpoints, or S3-compatible storage directly.
-- DuckDB SQL: aggregate a remote Parquet file with column pushdown
SELECT
year,
category,
SUM(amount) AS total_amount,
COUNT(DISTINCT user_id) AS unique_users
FROM read_parquet('s3://my-data-bucket/transactions/2024/*.parquet')
WHERE year = 2024
AND category IN ('electronics', 'apparel')
GROUP BY year, category
ORDER BY total_amount DESC;
DuckDB uses Parquet's metadata to push down column selection and row-group filters. For a 10GB Parquet file where you're aggregating 3 columns, it may read only 15% of the actual bytes. This is why DuckDB performs surprisingly well at analytical SQL over large files that would be painfully slow in SQLite.
You can also glob multiple files: read_parquet('data/2024-*.parquet') reads all matching files as a single virtual table.
Step 4: Reshape Results with PIVOT
PIVOT turns row values into columns — essential for reporting and time-series comparisons. DuckDB supports it natively; in most other databases you'd write manual CASE WHEN expressions.
-- DuckDB SQL: PIVOT monthly revenue by region
PIVOT (
SELECT
region,
strftime(order_date, '%Y-%m') AS month,
SUM(revenue) AS revenue
FROM read_csv_auto('sales_data.csv')
GROUP BY region, month
)
ON month
USING SUM(revenue)
ORDER BY region;
Output: one row per region, one column per month (2024-01, 2024-02, …). This is the structure you'd hand to a BI tool or spreadsheet — and you got there without leaving SQL.
Common Mistakes and Pitfalls
Using DuckDB for concurrent writes. DuckDB is single-writer: only one process can open a .duckdb file for writing at a time. If your application has multiple writers or you're building a transactional service, use PostgreSQL.
Relying on read_csv_auto in production. Auto-detection works 90% of the time in exploration. That 10% introduces hard-to-debug type coercion issues in pipelines. Specify explicit schemas for production use.
Treating DuckDB as a data warehouse substitute. For datasets up to ~50–100GB on a single machine, DuckDB is excellent. For petabyte-scale multi-user analytics, you still want a distributed warehouse. They complement each other — DuckDB WASM for interactive exploration, Snowflake or Databricks for the heavy shared workloads.
Ignoring memory limits. DuckDB spills to disk automatically when memory is exceeded, but query performance degrades significantly. Know your dataset size. For very large datasets, filter aggressively before aggregating.
When DuckDB Is NOT the Right Choice
DuckDB has clear boundaries. Use something else when:
- Multi-user transactional applications — concurrent writes, foreign key enforcement, row-level security → PostgreSQL
- Petabyte-scale shared analytics — distributed query execution, caching, workload management → Snowflake, BigQuery, Databricks
- Mobile or IoT embedded databases — SQLite has better multi-process support and a smaller footprint for write-heavy embedded use cases
- Long-running shared server processes — DuckDB's single-writer model doesn't work well in multi-process server architectures
- Complex OLTP schemas — triggers, extensive FK constraints, stored procedures → PostgreSQL or MySQL
The honest version: DuckDB is a specialist tool that excels at a specific job. That job is analytical SQL on local or file-based data. If your job matches, it's probably the best tool available. If it doesn't, it's the wrong tool.
DuckDB WASM: The Browser Angle
The most surprising development in the DuckDB ecosystem is the WebAssembly build, DuckDB WASM. It compiles the full DuckDB engine to run inside a browser tab, with no backend server involved. You get the same SQL dialect, the same file reading capabilities, and the same performance characteristics — client-side.
Harbinger Explorer uses DuckDB WASM as its core query engine. When you connect a CSV, upload a file, or query an API source, you're running the exact SQL from this tutorial directly in your browser. There's no server processing your data — it never leaves your tab. You can explore it yourself with a 7-day free trial.
Next Steps
You've covered the core DuckDB workflow: reading files with read_csv_auto and read_parquet, running analytical aggregations, and reshaping results with PIVOT. A few directions to go from here:
- For querying data without writing SQL at all, see Natural Language SQL: Ask Your Data in Plain English
- For organizing your data sources as your catalog grows, see What Is a Data Catalog?
The DuckDB documentation at duckdb.org covers the full SQL reference — the httpfs extension for S3/HTTPS access and the json extension are worth exploring next.
Continue Reading
- Natural Language SQL: Ask Your Data in Plain English
- What Is a Data Catalog? Tools, Trade-offs and When You Need One
- Self-Service Analytics: A Practical Guide for Data Teams