
DuckDB Tutorial: Analytical SQL Directly in Your Browser



If you've ever run aggregations on a 5-million-row CSV in SQLite and stared at a spinning cursor, you already feel the problem. DuckDB exists to fix that — and this tutorial gets you running real analytical SQL in under 15 minutes, no server required.

What Makes DuckDB Different

DuckDB is an in-process analytical database. No server process. No config file. No daemon to manage. You embed it directly in Python, use it from the command line, or run it in a browser via WebAssembly — and it executes OLAP-style queries using a vectorized columnar engine that consistently surprises people with how fast it is on local data.

Getting started with DuckDB is intentionally minimal: one pip install duckdb and you're writing analytical SQL. But the real capability is in the SQL dialect itself — read_parquet, read_csv_auto, PIVOT, window functions, and a growing set of analytical extensions. This is not SQLite with a marketing rebrand. The execution model is fundamentally different.

DuckDB Architecture: Embedded, No Server

Understanding DuckDB's architecture explains both its strengths and its limits. Everything runs inside your process — there is no separate server component.

[Architecture diagram: application layer → DuckDB core → local storage, file sources, and the browser runtime]

In-Process Execution — DuckDB runs inside your application. No network round-trips, no connection pooling, no port to open. It starts in milliseconds.

Columnar Engine — Queries scan only the columns they need, using SIMD instructions and vectorized batching. Aggregations on wide tables are dramatically faster than row-oriented databases.

Direct File Access — DuckDB reads Parquet, CSV, and JSON directly from disk, remote HTTPS, or S3-compatible storage without importing them into a database first.

DuckDB WASM — The WebAssembly build compiles the full DuckDB engine for browser execution. No backend, no data upload — queries run client-side inside the browser tab.



DuckDB vs SQLite vs PostgreSQL

The most common confusion is where DuckDB fits relative to existing databases. Here's the comparison that matters for analytical SQL:

| Feature | DuckDB | SQLite | PostgreSQL |
| --- | --- | --- | --- |
| Primary use case | Analytical (OLAP) | Transactional (OLTP) | Transactional + mixed |
| Execution model | Columnar, vectorized | Row-based | Row-based |
| Server required | ❌ No (embedded) | ❌ No (embedded) | ✅ Yes |
| Native Parquet support | ✅ read_parquet() | ❌ No | ⚠️ Extension needed |
| CSV auto-detection | ✅ read_csv_auto() | ❌ No | ❌ No |
| PIVOT / UNPIVOT | ✅ Native | ❌ No | ⚠️ Manual CASE WHEN |
| Window functions | ✅ Full support | ✅ Basic | ✅ Full support |
| Concurrent writes | ⚠️ Single writer | ⚠️ Limited | ✅ Full MVCC |
| WASM browser build | ✅ Yes | ✅ Yes | ❌ No |
| Best for | Local analytics, ETL, notebooks | Apps with light local storage | Production web applications |

Last verified: March 2026

The design philosophy difference: DuckDB optimizes for read throughput on wide tables. SQLite optimizes for small, frequent writes in embedded apps. PostgreSQL optimizes for correctness and concurrent access in server applications. They're not competitors — they solve different problems.


Tutorial: DuckDB in Practice

Step 1: Install DuckDB

pip install duckdb

No system dependencies. No Docker. No environment variables. This is the full install.

For the CLI:

# macOS
brew install duckdb

# Or download the binary from duckdb.org/docs/installation

Step 2: Query a CSV File with read_csv_auto

DuckDB's read_csv_auto reads a CSV file and infers column types automatically. You write SQL against the file directly — no import step, no CREATE TABLE first.

-- DuckDB SQL: query a CSV file without importing it
SELECT
    region,
    COUNT(*)                    AS order_count,
    SUM(revenue)                AS total_revenue,
    ROUND(AVG(revenue), 2)      AS avg_order_value
FROM read_csv_auto('sales_data.csv')
WHERE order_date >= '2024-01-01'
GROUP BY region
ORDER BY total_revenue DESC;

From Python, returning a pandas DataFrame:

import duckdb

conn = duckdb.connect()  # in-memory; pass a file path to persist

result = conn.execute("""
    SELECT
        region,
        COUNT(*)               AS order_count,
        SUM(revenue)           AS total_revenue,
        ROUND(AVG(revenue), 2) AS avg_order_value
    FROM read_csv_auto('sales_data.csv')
    WHERE order_date >= '2024-01-01'
    GROUP BY region
    ORDER BY total_revenue DESC
""").fetchdf()

print(result)

One practical note: read_csv_auto is excellent for exploration. In production pipelines, specify explicit types with read_csv('file.csv', columns={'id': 'INTEGER', 'revenue': 'DOUBLE'}) to avoid schema drift issues.

Step 3: Query Parquet Files with read_parquet

Parquet is the standard format of the modern data stack — columnar, compressed, efficient for analytics. DuckDB's read_parquet can query local files, HTTPS endpoints, or S3-compatible storage directly.

-- DuckDB SQL: aggregate a remote Parquet file with column pushdown
SELECT
    year,
    category,
    SUM(amount)             AS total_amount,
    COUNT(DISTINCT user_id) AS unique_users
FROM read_parquet('s3://my-data-bucket/transactions/2024/*.parquet')
WHERE year = 2024
  AND category IN ('electronics', 'apparel')
GROUP BY year, category
ORDER BY total_amount DESC;

DuckDB uses Parquet's metadata to push down column selection and row-group filters. For a 10GB Parquet file where you're aggregating 3 columns, it may read only 15% of the actual bytes. This is why analytical SQL in DuckDB performs surprisingly well on large files that would be painfully slow in SQLite.

You can also glob multiple files: read_parquet('data/2024-*.parquet') reads all matching files as a single virtual table.

Step 4: Reshape Results with PIVOT

PIVOT turns row values into columns — essential for reporting and time-series comparisons. DuckDB supports it natively; in most other databases you'd write manual CASE WHEN expressions.

-- DuckDB SQL: PIVOT monthly revenue by region
PIVOT (
    SELECT
        region,
        strftime(order_date, '%Y-%m') AS month,
        SUM(revenue)                   AS revenue
    FROM read_csv_auto('sales_data.csv')
    GROUP BY region, month
)
ON month
USING SUM(revenue)
ORDER BY region;

Output: one row per region, one column per month (2024-01, 2024-02, …). This is the structure you'd hand to a BI tool or spreadsheet — and you got there without leaving SQL.


Common Mistakes and Pitfalls

Using DuckDB for concurrent writes. DuckDB is single-writer: only one process at a time can open a .duckdb file for writing. If your application has multiple writers or you're building a transactional service, use PostgreSQL.

Relying on read_csv_auto in production. Auto-detection works 90% of the time in exploration. That 10% introduces hard-to-debug type coercion issues in pipelines. Specify explicit schemas for production use.

Treating DuckDB as a data warehouse substitute. For datasets up to ~50–100GB on a single machine, DuckDB is excellent. For petabyte-scale multi-user analytics, you still want a distributed warehouse. They complement each other — DuckDB WASM for interactive exploration, Snowflake or Databricks for the heavy shared workloads.

Ignoring memory limits. DuckDB spills to disk automatically when memory is exceeded, but query performance degrades significantly. Know your dataset size. For very large datasets, filter aggressively before aggregating.


When DuckDB Is NOT the Right Choice

DuckDB has clear boundaries. Use something else when:

  • Multi-user transactional applications — concurrent writes, foreign key enforcement, row-level security → PostgreSQL
  • Petabyte-scale shared analytics — distributed query execution, caching, workload management → Snowflake, BigQuery, Databricks
  • Mobile or IoT embedded databases — SQLite has better multi-process support and a smaller footprint for write-heavy embedded use cases
  • Long-running shared server processes — DuckDB's single-writer model doesn't work well in multi-process server architectures
  • Complex OLTP schemas — triggers, extensive FK constraints, stored procedures → PostgreSQL or MySQL

The honest version: DuckDB is a specialist tool that excels at a specific job. That job is analytical SQL on local or file-based data. If your job matches, it's probably the best tool available. If it doesn't, it's the wrong tool.


DuckDB WASM: The Browser Angle

The most surprising development in the DuckDB ecosystem is the WebAssembly build, DuckDB WASM. It compiles the full DuckDB engine to run inside a browser tab, with no backend server involved. You get the same SQL dialect, the same file-reading capabilities, and comparable performance characteristics — all client-side.

Harbinger Explorer uses DuckDB WASM as its core query engine. When you connect a CSV, upload a file, or query an API source, you're running the exact SQL from this tutorial directly in your browser. There's no server processing your data — it never leaves your tab. You can explore it yourself with a 7-day free trial.


Next Steps

You've covered the core DuckDB workflow: reading files with read_csv_auto and read_parquet, running analytical aggregations, and reshaping results with PIVOT. A few directions to go from here:

The DuckDB documentation at duckdb.org covers the full SQL reference — the httpfs extension for S3/HTTPS access and the json extension are worth exploring next.





Try Harbinger Explorer for free

Connect any API, upload files, and explore with AI — all in your browser. No credit card required.

Start Free Trial