Harbinger Explorer


Moving from CSV Files to Proper SQL Queries: A Practical Guide

9 min read · Tags: csv, sql, data-analysis, spreadsheets, duckdb, analytics, migration

Let's start with a confession: almost everyone starts with CSVs.

You get a data export from a client. You receive a dataset in an email attachment. You download public data from a government portal. It comes as a CSV. You open it in Excel or Google Sheets, do some filtering, write some formulas, maybe make a pivot table.

And then the file grows. Or you need to combine two files. Or you need to filter 500,000 rows. Or a stakeholder asks a question that requires comparing data across three different spreadsheets.

Suddenly, the spreadsheet starts fighting back.

This isn't a beginner problem. It happens to experienced analysts, seasoned researchers, and professional data teams. CSV + spreadsheet is the path of least resistance until it isn't — and then it becomes the path of maximum pain.

SQL is the upgrade. And with Harbinger Explorer, the upgrade doesn't require a database server, an IT department, or a weekend of setup.


The Real Problems with CSV + Spreadsheet Analysis

Problem 1: Performance walls

Excel starts struggling around 100,000 rows. Google Sheets hits walls even sooner. Formulas that reference large datasets become sluggish. Filters take seconds. VLOOKUP on large datasets is notoriously slow.

Modern data rarely fits in "comfortable Excel size." A single day of web logs might be 500k rows. A year of transaction records from a mid-sized business: millions of rows. Government datasets: routinely tens of millions.

SQL on DuckDB handles 10 million rows faster than Excel handles 100,000. That's not hyperbole — it's the architectural difference between a row-by-row formula engine and a vectorized columnar database.

Problem 2: Reproducibility and auditing

A spreadsheet with 20 formula columns and 5 pivot tables is a logic black box. What transformations were applied? In what order? Why does this cell reference that range?

SQL is executable documentation. The query is the methodology. You can read it, audit it, share it, version-control it, and re-run it on updated data. When a stakeholder asks "how did you calculate this?" you have an answer that's 10 lines of SQL, not a guided tour of a spreadsheet.
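To make "executable documentation" concrete, here is a sketch of what such a query might look like; the table and column names (orders, order_date, revenue, region, is_test_account) are hypothetical:

```sql
-- Monthly revenue per region, excluding test accounts.
-- Every filter and transformation is visible, auditable, and re-runnable.
SELECT
    region,
    DATE_TRUNC('month', order_date) AS month,
    SUM(revenue) AS total_revenue
FROM orders
WHERE is_test_account = FALSE
GROUP BY region, DATE_TRUNC('month', order_date)
ORDER BY month, region;
```

Ten lines, and the entire methodology is on the page. A reviewer can re-run it on next month's export and get the same logic applied.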

Problem 3: Multi-file analysis is painful

Combining data from multiple CSV files in spreadsheets requires either VLOOKUPs (slow, error-prone) or manually copying data between sheets (tedious, introduces errors).

In SQL: FROM table_a JOIN table_b ON table_a.id = table_b.id. One line. Handles any scale.

Problem 4: Version and collaboration chaos

CSV files proliferate. data.csv, data_cleaned.csv, data_final.csv, data_FINAL_v2.csv, data_FINAL_v2_marcedits.csv. Which one is current? Who changed what?

SQL queries live separately from the underlying data. Load the current dataset, run the same query, and the same logic applies. No version chaos.

Problem 5: Limited analytical capability

Spreadsheets handle basic aggregations. But try implementing a rolling 30-day average, a percentile calculation across groups, a lead/lag comparison over time, or a ratio that depends on a subquery result. You can do it in Excel — but you'll hate yourself.

SQL window functions make these trivial. A rolling average is AVG(value) OVER (ORDER BY date ROWS BETWEEN 29 PRECEDING AND CURRENT ROW).


Why SQL Adoption Stalls (And Why It Doesn't Have To)

For many analysts and researchers, SQL adoption stalls at the same friction point: getting data into a queryable database.

The mental model looks like this:

  1. Get a CSV
  2. ??? (set up a database, import the data, connect to it)
  3. Write SQL

Step 2 is where the dream dies. Setting up PostgreSQL or SQLite locally, importing CSVs, configuring connections — it's an hour of work before you write a single query. Every time you want to analyze a new CSV, you repeat the ritual.

So people go back to spreadsheets. Not because spreadsheets are better, but because the barrier to SQL is high enough that the path of least resistance wins.

Harbinger Explorer eliminates step 2 entirely.

DuckDB WASM runs in your browser. Drop a CSV, and it's immediately queryable. No installation, no import process, no connection configuration. The database is the browser tab.


The Migration in Practice: A Step-by-Step Guide

Step 1: Upload your CSV

Drag your CSV file into Harbinger Explorer. That's it. DuckDB infers the schema — column names, data types — automatically. Within seconds, your CSV is a queryable table.

For most CSVs, type inference is accurate. Harbinger Explorer shows you the inferred schema before you query, so you can spot any misclassifications (dates read as strings, for example).
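If a date column does come through as text, you don't need to re-export; you can cast it in the query. A sketch, assuming a hypothetical signup_date column that was inferred as VARCHAR:

```sql
-- TRY_CAST returns NULL instead of erroring on unparseable values,
-- so one malformed row won't break the whole query (DuckDB syntax).
SELECT
    *,
    TRY_CAST(signup_date AS DATE) AS signup_date_parsed
FROM my_data;
```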

Step 2: Your first query — the SELECT

The simplest SQL is just SELECT everything:

SELECT * FROM my_data LIMIT 100;

This is your sanity check. Does the data look right? Are columns named correctly? Are dates parsing correctly?

From here, layer in filtering:

SELECT * FROM my_data WHERE date >= '2024-01-01' LIMIT 100;

And basic aggregation:

SELECT category, COUNT(*) as count, AVG(value) as avg_value
FROM my_data
GROUP BY category
ORDER BY count DESC;

If you've been doing this with pivot tables, notice how much clearer the intent is in the SQL.

Step 3: Replace your VLOOKUP with a JOIN

Have a second CSV? Upload it the same way. Now JOIN them:

SELECT a.id, a.name, b.region, b.revenue
FROM customers a
JOIN sales_data b ON a.id = b.customer_id
WHERE b.revenue > 10000;

What would have taken 3 nested VLOOKUPs and prayer now takes 6 lines of SQL that run in milliseconds on any file size.

Step 4: Replace pivot tables with GROUP BY + window functions

Your average pivot table question is a GROUP BY query:

SELECT 
    region, 
    product_category,
    SUM(revenue) as total_revenue,
    COUNT(DISTINCT customer_id) as unique_customers
FROM sales_data
GROUP BY region, product_category
ORDER BY total_revenue DESC;

For time-series analysis with rolling windows:

SELECT 
    date,
    daily_revenue,
    AVG(daily_revenue) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) as rolling_7day_avg
FROM daily_sales;

This is where SQL becomes genuinely superior to spreadsheets — not just equivalent.

Step 5: Combine CSVs with live API data

This is the step that has no spreadsheet equivalent. In Harbinger Explorer, you can JOIN your CSV data against a live API source in a single query.

Example: You have a CSV of customer orders. You want to enrich it with current exchange rates from a live API. In a spreadsheet, this requires writing a custom function or exporting/reimporting. In Harbinger Explorer:

SELECT o.order_id, o.amount_usd, r.eur_rate, o.amount_usd * r.eur_rate as amount_eur
FROM orders o
JOIN live_exchange_rates r ON o.currency = r.base_currency;

Same query interface. Same SQL syntax. One table is a local CSV; one is a live API. DuckDB handles both.


SQL Patterns That Replace Common Spreadsheet Operations

Replacing IF/nested IF formulas

Spreadsheet: =IF(A2>1000, "High", IF(A2>500, "Medium", "Low"))

SQL:

SELECT 
    *,
    CASE 
        WHEN value > 1000 THEN 'High'
        WHEN value > 500 THEN 'Medium'
        ELSE 'Low'
    END as tier
FROM data;

Cleaner. Applies to every row. Readable.

Replacing COUNTIF/SUMIF

Spreadsheet: Separate COUNTIF and SUMIF formulas for each category

SQL:

SELECT category, COUNT(*) as count, SUM(value) as total
FROM data GROUP BY category;

One query. Every category. Automatically handles new categories in updated data.

Replacing date calculations

Spreadsheet: =DATEDIF(A2, TODAY(), "D") and variants

SQL:

SELECT *, DATEDIFF('day', created_date, CURRENT_DATE) as days_since_creation FROM data;

Replacing manual deduplication

Spreadsheet: Sort, remove duplicates button, hope for the best

SQL:

SELECT DISTINCT customer_id, email FROM customers;
-- or for "keep the latest record":
SELECT * FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) as rn
    FROM customers
) AS deduped WHERE rn = 1;

Natural Language Mode: SQL Without Writing SQL

For users who know what they want to find but don't yet write SQL fluently, Harbinger Explorer's natural language interface bridges the gap.

Ask questions like:

  • "Show me the top 10 customers by total purchase value in the last 90 days"
  • "What's the average order value by product category, grouped by month?"
  • "Find all rows where revenue is more than 2 standard deviations above the mean"

The AI agent generates the SQL, runs it, and shows you both the query and the result. You can inspect and modify the SQL — it's a teaching tool as much as an execution tool.

For bootcamp grads or analysts making the transition from spreadsheets, this dramatically accelerates SQL fluency. You see the query, you understand what it does, you modify it. Pattern recognition builds fast.
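For instance, the standard-deviation question above might come back as something like the following. The exact SQL the agent produces will vary; my_data and revenue are hypothetical names:

```sql
-- Rows more than 2 standard deviations above the mean revenue.
SELECT *
FROM my_data
WHERE revenue > (
    SELECT AVG(revenue) + 2 * STDDEV(revenue)
    FROM my_data
);
```

Seeing the generated query next to the plain-English question is exactly the pattern-recognition loop described above.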


When to Stick with Spreadsheets (And When Not To)

To be fair: spreadsheets aren't always wrong. They're still the right tool when:

  • You need manual data entry (SQL databases aren't spreadsheets)
  • You need visual formatting for a deliverable (export your SQL results to a spreadsheet for final formatting)
  • You're doing light analysis on a small dataset (<10k rows, simple questions)
  • You need charts and visualizations embedded in a document

For everything else — especially at scale, with multiple files, requiring joins or window functions, or when reproducibility matters — SQL wins.

The practical recommendation: use SQL for the analytical work, export the results to a spreadsheet for the formatted deliverable. Best of both worlds.
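As a sketch of that hand-off in DuckDB's dialect (table and file names are hypothetical, and your tool's export button may do the equivalent for you):

```sql
-- Run the analysis in SQL, then write the result to a CSV
-- for spreadsheet formatting (DuckDB's COPY ... TO syntax).
COPY (
    SELECT region, SUM(revenue) AS total_revenue
    FROM sales_data
    GROUP BY region
) TO 'summary_for_deliverable.csv' (HEADER, DELIMITER ',');
```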


Performance Comparison: Real Numbers

On a 1 million row CSV:

| Operation              | Excel                  | Google Sheets    | Harbinger Explorer (DuckDB) |
|------------------------|------------------------|------------------|-----------------------------|
| Open/load              | ~2 min                 | ~3 min           | ~5 seconds                  |
| Simple filter          | ~10 sec                | Timeout          | <1 second                   |
| GROUP BY + COUNT       | Pivot wizard, ~30 sec  | Often times out  | <1 second                   |
| JOIN with another file | Requires VLOOKUP, ~5 min | Often impossible | <2 seconds                |

These aren't edge cases — they're typical analytical operations on a moderately-sized dataset. The performance difference is categorical, not marginal.


Getting Started: Your First SQL Query on a CSV

  1. Visit harbingerexplorer.com — 7-day free trial, no credit card
  2. Grab any CSV file you have (client data, a public dataset, anything)
  3. Drop it into Harbinger Explorer
  4. Type SELECT * FROM your_file LIMIT 100 and run it
  5. Start exploring

The first time you run a GROUP BY query that would have taken 5 minutes in Excel and see results in under a second, you'll understand why analysts who make the switch don't go back.


The Bottom Line

CSVs aren't the problem. Spreadsheets as the only analysis tool are the problem.

SQL is faster, more reproducible, more scalable, and more powerful than spreadsheet formulas for analytical work. The traditional barrier — getting data into a database — no longer exists with browser-native DuckDB.

Drop your CSV, write SQL, get answers. No setup required.

Try Harbinger Explorer free for 7 days →

