Harbinger Explorer


Join Data From Multiple Sources in Your Browser — No Pipeline Required

9 min read · Tags: data joins, DuckDB, browser, multi-source, no-code, analytics, SQL


You have three data sources. A CSV export from your CRM. A live REST API for market data. A flat file from a vendor that updates weekly. You need to join all three to answer a single business question.

In the traditional workflow, this means: fire up Python, install pandas, write the merge logic, debug the type mismatches, handle the timezone differences, realize the CSV has duplicate keys, fix it, re-run, finally get a result. Or spin up a local Postgres instance, ingest all three sources, write the JOIN query, debug the import errors.

Experienced data engineers can do this in an hour. Analysts with less Python fluency spend a full afternoon. Researchers and bootcamp grads sometimes spend an entire day.

What if you could do it in two minutes — in a browser tab?

That's exactly what Harbinger Explorer is built for. This article explains how browser-based multi-source joins work, why they're a step change in productivity, and how to do them in Harbinger — with or without writing a line of SQL.


The Multi-Source Join Problem

Before we get into solutions, let's be precise about the problem.

A "multi-source join" means combining rows from two or more datasets based on a shared key or condition. It sounds simple. In practice, it involves a cascade of small problems that compound into large time sinks:

1. Heterogeneous formats

Your sources are almost never in the same format. One is JSON (nested), one is CSV (flat), one is a paginated API (requires crawling), one is a Parquet file from a data lake. Before you can join them, you need to get them all into a common format — which usually means code.
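As a sketch of why this step disappears with an engine like DuckDB: its reader functions accept CSV, JSON, and Parquet directly in the FROM clause, so all three formats can meet in a single query. The file names below are hypothetical.

```sql
-- Three formats, one query: no conversion step.
-- File names are illustrative; schemas are inferred automatically.
SELECT c.user_id, j.status, p.score
FROM read_csv_auto('crm_export.csv') AS c
JOIN read_json_auto('api_response.json') AS j USING (user_id)
JOIN read_parquet('lake_extract.parquet') AS p USING (user_id);
```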

2. Key mismatches

The column you want to join on is called user_id in one source, userId in another, and customer_identifier in a third. One source uses integers, another uses strings, a third pads with zeros (001, 002). These aren't insurmountable, but they require careful transformation before any join will work.
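A minimal sketch of what that transformation looks like in SQL, assuming two hypothetical sources where one side stores zero-padded string IDs and the other raw integers:

```sql
-- Normalize both sides to one key representation before joining.
-- source_a / source_b and their columns are illustrative names.
SELECT a.*, b.*
FROM source_a AS a
JOIN source_b AS b
  ON lpad(CAST(b."userId" AS VARCHAR), 3, '0')  -- 7 -> '007'
   = a.user_id;                                 -- already '007'
```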

3. Temporal misalignment

You're joining a daily snapshot with a live feed with a weekly batch file. What does a "current" join even mean? If you join on today's date, you'll drop rows from sources that don't have today's data yet. If you join on the most recent available, you're creating a mixed-vintage result. Neither is wrong — but the right choice requires explicit handling.
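One way to make the "most recent available" choice explicit is an ASOF join, which DuckDB supports: for each row on the left, it picks the latest matching row on the right at or before that row's timestamp. Table and column names here are illustrative.

```sql
-- Mixed-vintage join made explicit: each snapshot row gets the
-- newest feed observation at or before its snapshot date.
SELECT s.user_id, s.snapshot_date, f.price
FROM daily_snapshot AS s
ASOF JOIN live_feed AS f
  ON s.user_id = f.user_id
 AND s.snapshot_date >= f.observed_at;
```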

4. Volume asymmetry

One source has 50 rows; another has 50 million. A naive join can generate cartesian explosions that crash your laptop. Or silently return incomplete results because of default row limits.
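A cheap guard against fan-out is to check for duplicate join keys before running the join; any key that repeats on both sides multiplies rows. A sketch, with a hypothetical source name:

```sql
-- Surface duplicate join keys before they cause a partial
-- cartesian explosion in the join.
SELECT user_id, count(*) AS n
FROM big_source
GROUP BY user_id
HAVING count(*) > 1
ORDER BY n DESC
LIMIT 20;
```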

5. Infrastructure friction

Setting up a local environment to do any of this — Python, pandas, a local database, a Docker container — takes time before you've touched a single byte of actual data. For a one-off analysis question, this overhead is often worse than the analysis itself.


How Teams Currently Handle Multi-Source Joins

Python/pandas: The most common approach for analysts with coding skills. Flexible and powerful, but setup is time-consuming, type handling and key normalization require manual code, and none of it is accessible to non-coders on the team.

Google Sheets/Excel VLOOKUP: Works for small datasets, breaks above ~100k rows, can't handle API sources or Parquet files, and produces brittle workbooks that break whenever a column shifts.

Snowflake / BigQuery external tables: Powerful for large-scale production use. Requires a cloud account, table definitions, and either an ETL pipeline or manual upload for each source. Not practical for ad-hoc analysis on sources that haven't been loaded to a warehouse.

dbt + data warehouse: The right answer for a production data team. Also a significant engineering project. Not appropriate for a researcher doing a one-time enrichment join.

Makeshift local Postgres: Spin up Docker, create tables, import CSVs, write SQL. Works. Takes 45 minutes before the first query. Nobody does this for ad-hoc work after the first few times.


Enter DuckDB WASM: A Database in Your Browser

Harbinger Explorer is built on DuckDB WASM — DuckDB compiled to WebAssembly, running entirely inside your browser tab.

DuckDB is an analytical database engine that excels at fast, in-process queries over flat files, JSON, Parquet, and API data. The WASM version brings that power into the browser without any server-side infrastructure. Your data never leaves your machine — it's loaded into an in-memory database engine running locally, processed in seconds.

This is a fundamental architectural shift from every other approach. You're not uploading data to a cloud service. You're not connecting to a remote server. You're running a real database engine in your browser tab.

The implications:

  • Zero setup. No Python environment, no Docker, no cloud account. Open a tab.
  • Real SQL. Full DuckDB SQL — window functions, CTEs, JOINs, aggregations, string operations — not a limited spreadsheet formula language.
  • Multi-format native support. DuckDB natively reads CSV, JSON, Parquet, and Arrow. No conversion step.
  • Performance. DuckDB's vectorized execution handles millions of rows in seconds, even in a browser.
  • Privacy. Your data stays local. Nothing is sent to Harbinger's servers.
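To make the "real SQL" point concrete, here is a small self-contained example of the kind of query a spreadsheet formula language can't express: a CTE with inline data plus a window function, which runs as-is in standard DuckDB SQL.

```sql
-- A CTE and a window function: per-row values alongside a
-- per-group aggregate, with no self-join or helper column.
WITH t(region, amount) AS (
  VALUES ('EU', 10), ('EU', 30), ('US', 20)
)
SELECT region,
       amount,
       sum(amount) OVER (PARTITION BY region) AS region_total
FROM t;
```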

How Multi-Source Joins Work in Harbinger Explorer

Step 1: Add Sources to the Catalog

The Source Catalog is where you register your data sources. This can be:

  • A CSV or Parquet file (drag and drop)
  • An API endpoint (paste the URL, configure auth if needed)
  • A file from a URL (public or authenticated)
  • A previously saved query result (which becomes a virtual source)

Each source gets a name you choose — say, crm_export, market_api, vendor_weekly. These become table names you can reference in queries.

Step 2: Write Your Join (SQL or Natural Language)

Once your sources are registered, you can join them immediately. Two options:

Option A: SQL

SELECT
  c.user_id,
  c.company,
  m.market_cap,
  v.risk_score
FROM crm_export c
LEFT JOIN market_api m ON c.ticker = m.symbol
LEFT JOIN vendor_weekly v ON c.company_id = v.id
WHERE c.segment = 'enterprise'

You write standard SQL. DuckDB executes it in your browser. Results appear in seconds.

Option B: Natural Language

If you'd rather not write SQL, Harbinger's AI agent translates plain English:

"Join my CRM export with the market data API on the ticker symbol, and add the vendor risk scores. Show me only enterprise-segment companies."

The agent generates the SQL, runs it, and shows you the results — plus the SQL it used, so you can inspect and modify if needed. This is genuinely useful for analysts who know what they want but aren't SQL-fluent.

Step 3: Handle Key Mismatches Automatically

Before running your join, Harbinger's schema analyzer compares the join keys you've specified across your sources and flags potential mismatches:

  • Type mismatch: user_id is INTEGER in source A, VARCHAR in source B
  • Format mismatch: IDs in source A are zero-padded (007), source B uses raw integers (7)
  • Case inconsistency: COMPANY_NAME vs. company_name
  • Null distribution: 12% of rows in source B have null join keys — those will drop from an INNER JOIN

These are surfaced before execution, with suggested fixes. You address them explicitly rather than discovering the issue in a downstream anomaly report.
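As a sketch of what the suggested fixes amount to in plain SQL (the source and column names are illustrative; TRY_CAST and lower() are standard DuckDB functions):

```sql
-- Each flagged mismatch resolved explicitly: type cast, case
-- alignment, and null join keys filtered before the join runs.
SELECT a.*, b.score
FROM source_a AS a
JOIN source_b AS b
  ON TRY_CAST(b.user_id AS BIGINT) = a.user_id
 AND lower(b."COMPANY_NAME") = lower(a.company_name)
WHERE b.user_id IS NOT NULL;
```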

Step 4: Inspect and Export

Results are displayed in a table with column type annotations. You can:

  • Sort and filter inline
  • Run follow-up queries against the joined result (it's saved as a temporary table)
  • Export to CSV, Parquet, or JSON
  • Save the query to your workspace for reuse
  • Share the query (not the data) with a colleague who has their own Harbinger workspace
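In plain DuckDB terms, saving the result for follow-up queries and exporting it look like the sketch below (Harbinger exposes this through its UI; the table and file names are illustrative):

```sql
-- Persist the joined result for follow-up queries...
CREATE TEMP TABLE joined AS
SELECT c.user_id, m.market_cap
FROM crm_export AS c
LEFT JOIN market_api AS m ON c.ticker = m.symbol;

-- ...then export it.
COPY joined TO 'enriched.parquet' (FORMAT PARQUET);
```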

Real-World Use Cases

Freelance data consultants: Client enrichment analysis

A client has a CRM export (CSV) and wants it enriched with data from a company research API. In the old workflow: write a Python script, handle pagination, merge DataFrames, debug key mismatches, deliver a final CSV. Time: 3–4 hours.

In Harbinger: Add the CRM CSV to the catalog, add the API endpoint, ask "join these on company name and add firmographic data." Time: 15 minutes, including source setup.

Research analysts: Cross-source event study

You're analyzing the relationship between geopolitical events (from a news API) and market movements (from a financial data API), with country metadata from a reference file. Three sources, different schemas, different update frequencies.

In Harbinger: Register all three sources in the catalog, specify the join logic in natural language, inspect the results in seconds. The browser-based execution means you can iterate on the join conditions and filters instantly — no waiting for a Jupyter kernel, no re-importing data.

Bootcamp graduates: Portfolio project with real data

You're building a portfolio project that combines public economic data (from an API) with demographic data (from a CSV) and historical event data (from another API). You want to demonstrate SQL skills and multi-source analysis — but setting up a local database just for a portfolio project feels heavyweight.

Harbinger lets you do the entire analysis in the browser, export the results, and document the query logic. It's faster to set up and looks more sophisticated in a portfolio presentation.

Internal analytics teams: Ad-hoc enrichment without ETL

A stakeholder wants a one-off enrichment: take the existing customer list and add a column from a recently received vendor file, then cross-reference with a live API for current account status. The right long-term answer is to add this to the ETL pipeline — but the stakeholder needs it today.

Harbinger handles the one-off join immediately, the analyst delivers the enriched dataset in 10 minutes, and the ETL ticket goes in the backlog for later.


Harbinger vs. The Alternatives (Time Comparison)

| Approach | Setup Time | First Query Time | Skill Required |
| --- | --- | --- | --- |
| Python/pandas | 10–20 min | 20–40 min | Intermediate Python |
| Local Postgres | 30–60 min | 10–20 min | SQL + Docker |
| Google Sheets VLOOKUP | 5 min | 5 min | Basic (breaks at scale) |
| Cloud Warehouse (BigQuery) | 60–120 min | 5–10 min | SQL + GCP |
| Harbinger Explorer | 2 min | 2 min | None (NL) or Basic SQL |

These numbers assume a typical analyst doing a one-off join of 2–3 sources. For recurring production joins, a pipeline is still the right answer. But for the 80% of join work that's ad-hoc, investigative, or iterative, Harbinger removes the infrastructure friction entirely.


The Time Math: What This Is Worth

For an analyst who does multi-source join work twice a week:

  • Old approach average time: 2.5 hours per join (including setup, debugging, export)
  • Harbinger average time: 20 minutes per join
  • Weekly time saved: ~4.3 hours
  • Annual time saved: ~225 hours

At €50/hr, that's €11,250 in recovered productivity per year — per analyst. For a team of four analysts doing this kind of work regularly, you're looking at recovered time equivalent to a part-time hire.

The Harbinger Pro plan is €24/month. The math is not close.


Privacy and Security: A Word on Where Your Data Goes

When you run a join in Harbinger Explorer, your data is loaded into DuckDB WASM running in your browser. It does not leave your machine. The API crawling functionality makes requests from your browser (not through Harbinger servers). Query results are stored in your browser's local storage or exported locally.

This is meaningfully different from cloud analytics tools that ingest your data into their infrastructure. For analysts working with confidential client data, sensitive research, or proprietary datasets, the browser-local execution model is a significant advantage.


Getting Started: Your First Multi-Source Join

  1. Go to harbingerexplorer.com and start your free 7-day trial
  2. Add your first source — drag in a CSV or paste an API URL
  3. Add a second source — another file or API
  4. Ask Harbinger to join them — in natural language or SQL
  5. Inspect the results, export, and iterate

The whole thing takes under five minutes. No installation, no cloud setup, no code.


Conclusion

Multi-source data joins are one of the highest-friction, most time-consuming parts of day-to-day data work. The infrastructure overhead alone — Python environments, local databases, cloud accounts — consumes time that should be spent on analysis.

Harbinger Explorer eliminates that overhead by running a full analytical database engine (DuckDB) directly in your browser. Add your sources, write a query (or ask in plain English), get your joined result in seconds.

For freelancers, researchers, analysts, and team leads who do this work regularly, the time savings are substantial. The math is simple. The setup is two minutes.

Stop fighting infrastructure. Start joining data.

→ Try Harbinger Explorer free for 7 days

Starter plan: €8/month. Pro plan: €24/month. All plans include multi-source joins, DuckDB WASM engine, AI natural language queries, and the full Source Catalog.

