Harbinger Explorer


Spark SQL vs Pandas: When to Use Which

10 min read · Tags: spark sql, pandas, pyspark, databricks, performance, data engineering, python

The question comes up in almost every data team: should this transformation be Pandas or Spark SQL? Both are valid answers — and both are regularly the wrong one. The choice has real consequences for performance, maintainability, and cost.

The Core Difference

Pandas runs entirely in memory on a single machine. The entire DataFrame must fit in RAM. Operations are eager — they execute immediately and return results. It's Python-native, intuitive, and has the richest ecosystem of libraries.

Spark SQL distributes computation across a cluster. DataFrames are logical plans — execution is lazy until an action triggers it. It scales to terabytes without code changes. But it comes with cluster overhead, serialization costs, and a different performance mental model.

The decision isn't about which is "better" — it's about matching the tool to the data size, team skill, and operational context.

Performance: Where Each Wins

# Pandas — fast for small data, runs locally
import pandas as pd

df = pd.read_parquet("sample_1M_rows.parquet")
result = (
    df
    .groupby("customer_segment")
    .agg({"revenue": "sum", "order_count": "count"})
    .reset_index()
)
print(result)  # Instant for 1M rows on any modern laptop

# PySpark — fast for large data, distributes automatically
# (`spark` is the SparkSession, predefined in Databricks notebooks)
from pyspark.sql import functions as F

df = spark.read.parquet("s3://data-lake/transactions/year=2025/")
result = (
    df
    .groupBy("customer_segment")
    .agg(
        F.sum("revenue").alias("total_revenue"),
        F.count("order_id").alias("order_count")
    )
)
result.show()  # Processes 10B rows across cluster nodes

The crossover point where Spark becomes faster than Pandas on a single machine is roughly 500MB–2GB depending on operation type. Below that, Spark's cluster startup and serialization overhead often makes it slower than Pandas, even on a powerful cluster.
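If you want to locate the crossover for your own workload, the simplest approach is to time the same aggregation at increasing row counts. A minimal sketch of the measurement loop, using synthetic data (the sizes and column names are illustrative, not from any real dataset):

```python
import time

import numpy as np
import pandas as pd

def time_groupby(n_rows: int) -> float:
    """Time a pandas groupby-agg on n_rows of synthetic data."""
    df = pd.DataFrame({
        "customer_segment": np.random.choice(["a", "b", "c"], size=n_rows),
        "revenue": np.random.rand(n_rows),
    })
    start = time.perf_counter()
    df.groupby("customer_segment")["revenue"].sum()
    return time.perf_counter() - start

# Run the same measurement at each size; eyeball where runtime starts to grow.
timings = {n: time_groupby(n) for n in (10_000, 100_000, 1_000_000)}
```

Running the same aggregation through Spark (or pyspark.pandas) at the same sizes shows where the cluster's startup and serialization overhead stops dominating.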

API Comparison: The Same Problem, Two Ways

Aggregation

# Pandas
df.groupby(["region", "product"])["revenue"].sum().reset_index()

# Spark SQL (SQL dialect)
spark.sql("""
    SELECT region, product, SUM(revenue) AS total_revenue
    FROM transactions
    GROUP BY region, product
""")

# Spark SQL (DataFrame API — PySpark)
df.groupBy("region", "product").agg(F.sum("revenue").alias("total_revenue"))

Window Functions

# Pandas
df["rank"] = df.groupby("region")["revenue"].rank(method="dense", ascending=False)

# Spark SQL
from pyspark.sql.window import Window

window = Window.partitionBy("region").orderBy(F.desc("revenue"))
df.withColumn("rank", F.dense_rank().over(window))

String Operations

# Pandas
df["email_domain"] = df["email"].str.extract(r"@(.+)$", expand=False)
df["name_upper"] = df["name"].str.upper()

# Spark SQL
(
    df.withColumn("email_domain", F.regexp_extract("email", r"@(.+)$", 1))
      .withColumn("name_upper", F.upper("name"))
)

The Pandas API is generally more concise for complex string operations. Spark's DataFrame API requires more explicit function imports but is more portable to SQL.

Memory and Scale

| Scenario | Pandas | Spark SQL |
|---|---|---|
| 10MB CSV | ✅ Ideal | ❌ Overkill |
| 500MB JSON | ✅ Fine | ⚠️ Marginal benefit |
| 10GB Parquet | ❌ OOM risk | ✅ Handles well |
| 1TB Delta table | ❌ Impossible | ✅ Designed for this |
| 10B row aggregation | ❌ Impossible | ✅ Bread and butter |
| Complex regex on 100K rows | ✅ Fast | ⚠️ Cluster startup overhead |
| ML feature engineering (wide DataFrames) | ✅ sklearn compatible | ⚠️ Requires conversion |
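A quick way to sanity-check which row of the table you are in is to measure the DataFrame's actual in-memory footprint before reaching for a cluster. A sketch (the 500MB cutoff echoes the crossover range discussed earlier; the sample data is illustrative):

```python
import pandas as pd

def memory_mb(df: pd.DataFrame) -> float:
    """Deep in-memory size of a pandas DataFrame in megabytes."""
    return df.memory_usage(deep=True).sum() / 1024**2

df = pd.DataFrame({"region": ["emea", "apac"] * 1000, "revenue": [1.0, 2.0] * 1000})
if memory_mb(df) < 500:
    print("pandas territory")  # well under the crossover range
```

Note that `deep=True` matters: without it, object (string) columns report only pointer sizes, which can understate the real footprint by an order of magnitude.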

Pandas on Spark (pandas API on Spark)

Databricks Runtime 10.0+ includes the pyspark.pandas module (formerly Koalas), which implements the pandas API on top of Spark:

# pyspark.pandas — familiar pandas API, Spark execution
import pyspark.pandas as ps

df = ps.read_parquet("s3://data-lake/transactions/")
result = df.groupby("region")["revenue"].sum()
print(result)  # Executes on Spark cluster, returns pandas-like result

This is a genuine middle ground — teams with strong pandas skills can work with large datasets without rewriting to the native DataFrame API. The catch: not all pandas operations are supported, and some return different results due to distributed semantics (e.g., random sampling, index-dependent operations).

For production pipelines, prefer native Spark APIs. For analysts exploring large datasets, pyspark.pandas is a practical bridge.

SQL vs DataFrame API in Spark

Within Spark itself, there's a secondary choice: use SQL strings or the DataFrame API?

# Spark SQL string — familiar, readable, portable
spark.sql("""
    SELECT
        customer_id,
        SUM(order_amount) AS lifetime_value,
        COUNT(*) AS order_count,
        MAX(order_date) AS last_order
    FROM silver.orders
    WHERE order_status = 'completed'
    GROUP BY customer_id
    HAVING COUNT(*) >= 3
""")

# Spark DataFrame API — composable, type-safe, refactorable
(
    spark.table("silver.orders")
    .filter(F.col("order_status") == "completed")
    .groupBy("customer_id")
    .agg(
        F.sum("order_amount").alias("lifetime_value"),
        F.count("*").alias("order_count"),
        F.max("order_date").alias("last_order")
    )
    .filter(F.col("order_count") >= 3)
)

Both produce identical execution plans — the optimizer doesn't care. Use SQL when:

  • The logic is read by SQL-first analysts
  • You're writing dbt models or DLT SQL tables
  • Readability for non-Python engineers is important

Use the DataFrame API when:

  • Logic is parameterized (column names, thresholds are variables)
  • You're building reusable functions
  • You want static analysis and autocomplete in an IDE (see notebooks vs IDE)
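The parameterization point is worth seeing concretely. With the SQL-string route, variables have to be interpolated into the string, which loses all static checking (a hypothetical helper; the table name and threshold are illustrative):

```python
def lifetime_value_query(table: str, min_orders: int) -> str:
    """Build the lifetime-value SQL with interpolated parameters.

    Fragile: nothing verifies that `table` exists or that `min_orders`
    is an int until Spark parses the string at runtime.
    """
    return f"""
        SELECT
            customer_id,
            SUM(order_amount) AS lifetime_value,
            COUNT(*) AS order_count
        FROM {table}
        WHERE order_status = 'completed'
        GROUP BY customer_id
        HAVING COUNT(*) >= {min_orders}
    """

sql = lifetime_value_query("silver.orders", 3)
```

In the DataFrame API the same parameters are ordinary Python function arguments, so an IDE, a type checker, and unit tests can all see them.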

When Pandas Is the Right Answer

  • Unit testing: Pandas DataFrames are easy to construct and assert on. Spark requires a SparkSession fixture.
  • Library compatibility: scikit-learn, matplotlib, seaborn, statsmodels all expect pandas. Converting large Spark DataFrames to pandas with .toPandas() is expensive.
  • Local development: Pandas runs without a cluster. For development and debugging, pandas is faster iteration.
  • Data < 500MB: Below this threshold, Pandas is almost always faster end-to-end.
  • One-off analysis: Quick explorations don't justify cluster startup time.
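The unit-testing point deserves an example. A transformation written against pandas can be tested with nothing but an in-memory DataFrame, no SparkSession fixture required (a sketch; completed_order_totals is a hypothetical transformation):

```python
import pandas as pd
from pandas.testing import assert_frame_equal

def completed_order_totals(orders: pd.DataFrame) -> pd.DataFrame:
    """Sum order amounts per customer, counting completed orders only."""
    return (
        orders[orders["order_status"] == "completed"]
        .groupby("customer_id", as_index=False)["order_amount"]
        .sum()
    )

def test_completed_order_totals():
    orders = pd.DataFrame({
        "customer_id": [1, 1, 2],
        "order_status": ["completed", "cancelled", "completed"],
        "order_amount": [10.0, 99.0, 5.0],
    })
    expected = pd.DataFrame({"customer_id": [1, 2], "order_amount": [10.0, 5.0]})
    assert_frame_equal(completed_order_totals(orders), expected)

test_completed_order_totals()
```

The equivalent Spark test needs a SparkSession fixture and measurably slower setup per test run, which is exactly why many teams keep transformation logic testable at the pandas level where data size allows.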

When Spark SQL Is the Right Answer

  • Data > 1GB: Pandas will struggle; Spark is designed for this.
  • Production pipelines: Spark's fault tolerance, checkpointing, and scheduling integration make it more reliable for production.
  • Streaming data: Spark Structured Streaming is the natural extension of batch Spark. No pandas equivalent.
  • Delta Lake operations: MERGE, Z-ORDER, OPTIMIZE — Spark SQL is the native interface.
  • Multi-hop transformations: Chained jobs in Databricks Workflows are easier to manage with Spark.

Common Mistakes

1. Using .toPandas() on large DataFrames. This pulls the entire distributed DataFrame onto the driver node; on a 10GB table it will OOM the driver. Always filter or aggregate before calling .toPandas().

# Wrong — pulls 10GB to driver
df.toPandas()

# Right — aggregate first, then collect
df.groupBy("region").agg(F.sum("revenue")).toPandas()

2. Pandas-style loops over Spark rows. Never call .collect() on a Spark DataFrame just to iterate over its rows in Python (the moral equivalent of pandas .iterrows()). This anti-pattern pulls everything to the driver and negates every benefit of distribution.
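The same anti-pattern exists inside pandas itself, which makes it easy to demonstrate locally: a Python-level row loop and a vectorized column operation produce identical results, but the loop bypasses the engine entirely (a minimal sketch with toy numbers):

```python
import pandas as pd

df = pd.DataFrame({"revenue": [100.0, 50.0, 25.0], "tax_rate": [0.2, 0.1, 0.2]})

# Anti-pattern: Python loop over rows (what .collect() + a loop does in Spark)
looped = [row.revenue * row.tax_rate for row in df.itertuples()]

# Vectorized: one column-level operation, same result
df["tax"] = df["revenue"] * df["tax_rate"]

assert looped == df["tax"].tolist()
```

On three rows the difference is invisible; on millions, the loop pays Python interpreter overhead per row while the vectorized version runs in compiled code.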

3. Using Spark joins for tiny reference data. Joining a 10M-row fact table to a 100-row lookup table? Read the lookup into a Python dict or pandas DataFrame, or keep it in Spark and wrap it in F.broadcast() so the join becomes a broadcast hash join. An unhinted shuffle join against a 100-row table spends far more time moving data than joining it.
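For the lookup-table case, the pandas-side version is a one-line .map against a plain dict (a sketch with made-up region codes; the Spark-side analogue is wrapping the small table in F.broadcast() before the join):

```python
import pandas as pd

# A 100-row lookup table fits comfortably in a plain dict
region_names = {"emea": "Europe, Middle East & Africa", "apac": "Asia-Pacific"}

facts = pd.DataFrame({"region_id": ["emea", "apac", "emea"], "revenue": [1.0, 2.0, 3.0]})
facts["region_name"] = facts["region_id"].map(region_names)
```

Codes missing from the dict come back as NaN, so a `.fillna()` or an assertion on unmapped rows is a sensible follow-up in real pipelines.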

4. Ignoring the pandas API on Spark. Teams often write complex native Spark code when pyspark.pandas would be faster to write and fast enough to execute. Know that it exists.

Quick Decision Guide

Data size > 1GB?
    YES → Spark SQL
    NO  → Pandas (unless in a Spark pipeline context)

Need streaming?
    YES → Spark Structured Streaming
    NO  → Continue

Production pipeline in Databricks?
    YES → Spark SQL (native, better scheduling integration)
    NO  → Pandas for dev/test

Team is SQL-first?
    YES → Spark SQL string API
    NO  → Spark DataFrame API or pandas
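The guide above can even be collapsed into a toy function, handy as a checklist in code-review discussions (thresholds mirror the ones used in this article and are purely illustrative, not a universal rule):

```python
def choose_engine(size_gb: float, streaming: bool = False,
                  production: bool = False) -> str:
    """Toy encoding of the decision guide; thresholds are the article's."""
    if streaming:
        return "spark structured streaming"
    if size_gb > 1 or production:
        return "spark sql"
    return "pandas"
```

Real decisions also weigh team skills and the SQL-vs-DataFrame-API question, which a size threshold alone cannot capture.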

If you want to run quick ad-hoc SQL on CSV exports or sampled datasets from your Databricks environment without spinning up a cluster, Harbinger Explorer lets you query files directly in the browser via DuckDB WASM — useful for the "small data, quick look" phase before scaling up with Spark.

Key Takeaways

Pandas and Spark SQL are complements, not competitors. The right answer is almost always: Pandas for development, exploration, and sub-GB data; Spark SQL for production pipelines and large-scale transformations. The most expensive mistake is using Spark where Pandas would suffice, and the most catastrophic mistake is using Pandas where Spark is required.

Know where the crossover is for your data, and don't let familiarity with one tool make the decision for you.

