Apache Spark Tutorial: From Zero to Your First Data Pipeline
You've got a dataset that doesn't fit in pandas anymore. Maybe it's 50 million rows, maybe 500 million. Your laptop fan screams, your kernel crashes, and you start wondering if there's a better way. There is — and it's called Apache Spark.
This Apache Spark tutorial walks you through everything you need to go from "I've heard of Spark" to writing production-grade PySpark jobs. No fluff, just working code.
What Is Apache Spark?
Apache Spark is a distributed computing engine designed to process large datasets across a cluster of machines. Instead of running a transformation on one CPU core, Spark splits the work across dozens or hundreds of cores in parallel.
The key concepts:
| Concept | What It Means | Why It Matters |
|---|---|---|
| Driver | The process running your main program | Coordinates the work |
| Executors | Worker processes on cluster nodes | Actually process data |
| Partitions | Chunks of your data | Enable parallel processing |
| Lazy evaluation | Transformations aren't executed immediately | Spark optimizes the full plan before running |
| DAG | Directed Acyclic Graph of operations | Spark's execution plan |
Spark supports Python (PySpark), Scala, Java, and R. This tutorial uses PySpark because that's what most data engineers reach for first.
Setting Up Your Apache Spark Environment
Local Development (Single Machine)
The fastest way to start: install PySpark via pip.
# Python 3.8+ required
pip install pyspark==3.5.1
# Verify installation
pyspark --version
Your First SparkSession
Every Spark program starts with a SparkSession — it's your entry point to all Spark functionality.
# PySpark
from pyspark.sql import SparkSession
spark = (
SparkSession.builder
.appName("my-first-spark-job")
.master("local[*]") # Use all available cores locally
.config("spark.sql.shuffle.partitions", 8) # Reduce for local dev
.getOrCreate()
)
print(f"Spark version: {spark.version}")
print("Spark UI: http://localhost:4040")
The local[*] master means "run everything on this machine using all CPU cores." In production, you'd point this to a cluster manager like YARN or Kubernetes.
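To make the cluster case concrete, here's a sketch of what submitting the same script to a YARN cluster might look like. The script name, master, and resource sizes below are placeholders, not values from this tutorial; adjust them to your cluster.

```shell
# Hypothetical submit command: script name and resource sizes are examples only
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-cores 4 \
  --executor-memory 8g \
  my_first_spark_job.py
```

On Kubernetes you'd swap `--master yarn` for a `k8s://https://<api-server>` URL; the rest of the pattern stays the same.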
Common mistake: Leaving spark.sql.shuffle.partitions at the default (200) during local development. With small datasets, 200 partitions means 200 tiny tasks with more overhead than actual work. Set it to 2× your core count locally.
Apache Spark DataFrames: The Core Abstraction
If you know pandas or SQL, Spark DataFrames will feel familiar. The difference: they're distributed across a cluster.
Creating DataFrames
# PySpark — Creating DataFrames from different sources
# From a list of tuples
data = [
("Berlin", "DE", 3_645_000, 891.7),
("Munich", "DE", 1_472_000, 310.7),
("Hamburg", "DE", 1_841_000, 755.2),
("Paris", "FR", 2_161_000, 105.4),
("Lyon", "FR", 516_000, 47.9),
]
columns = ["city", "country", "population", "area_km2"]
df = spark.createDataFrame(data, columns)
df.show()
# +------+-------+----------+--------+
# | city|country|population|area_km2|
# +------+-------+----------+--------+
# |Berlin| DE| 3645000| 891.7|
# |Munich| DE| 1472000| 310.7|
# |Hamburg| DE| 1841000| 755.2|
# | Paris| FR| 2161000| 105.4|
# | Lyon| FR| 516000| 47.9|
# +------+-------+----------+--------+
# From CSV (most common in practice)
df_csv = (
spark.read
.option("header", True)
.option("inferSchema", True)
.csv("/data/cities.csv")
)
# From Parquet (preferred for Spark — columnar, compressed)
df_parquet = spark.read.parquet("/data/cities.parquet")
# From JSON
df_json = spark.read.json("/data/cities.json")
Schema Definition — Don't Rely on Inference
Schema inference is convenient for exploration but dangerous in production. Always define your schema explicitly.
# PySpark — Explicit schema definition
from pyspark.sql.types import (
StructType, StructField, StringType, LongType, DoubleType
)
schema = StructType([
StructField("city", StringType(), nullable=False),
StructField("country", StringType(), nullable=False),
StructField("population", LongType(), nullable=True),
StructField("area_km2", DoubleType(), nullable=True),
])
df = spark.read.schema(schema).csv("/data/cities.csv", header=True)
df.printSchema()
# root
# |-- city: string (nullable = false)
# |-- country: string (nullable = false)
# |-- population: long (nullable = true)
# |-- area_km2: double (nullable = true)
Common mistake: Using inferSchema=True in production pipelines. It reads the entire file once just to guess types, doubles your I/O cost, and sometimes guesses wrong (zip codes as integers, IDs as doubles).
Transformations: Where the Real Work Happens
Spark transformations are lazy — they build up a plan but don't execute until you call an action (like .show(), .count(), or .write).
Essential DataFrame Operations
# PySpark — Core transformations
from pyspark.sql import functions as F
# Filter rows
german_cities = df.filter(F.col("country") == "DE")
# Add computed columns
df_with_density = df.withColumn(
"density_per_km2",
F.round(F.col("population") / F.col("area_km2"), 1)
)
# Select and rename
df_clean = (
df.select(
F.col("city"),
F.col("country"),
F.col("population"),
F.col("area_km2").alias("area_sqkm"),
)
)
# Aggregation
country_stats = (
df.groupBy("country")
.agg(
F.sum("population").alias("total_population"),
F.count("city").alias("num_cities"),
F.round(F.avg("area_km2"), 1).alias("avg_area_km2"),
)
)
country_stats.show()
# +-------+----------------+----------+------------+
# |country|total_population|num_cities|avg_area_km2|
# +-------+----------------+----------+------------+
# | DE| 6958000| 3| 652.5|
# | FR| 2677000| 2| 76.7|
# +-------+----------------+----------+------------+
# Sort
df_sorted = df.orderBy(F.col("population").desc())
Spark SQL — Use Whichever Feels Natural
You can mix DataFrame API and SQL freely. Register a DataFrame as a temp view and query it with SQL.
# PySpark + Spark SQL
df.createOrReplaceTempView("cities")
result = spark.sql("""
SELECT
country,
COUNT(*) AS num_cities,
SUM(population) AS total_pop,
ROUND(AVG(population / area_km2), 1) AS avg_density
FROM cities
GROUP BY country
HAVING SUM(population) > 1000000
ORDER BY total_pop DESC
""")
result.show()
My take: use SQL for ad-hoc exploration and complex joins. Use the DataFrame API for pipelines where you need composability and testing. They compile to the same execution plan anyway.
Joins in Apache Spark
Joins are where Spark's distributed nature gets interesting — and where performance problems usually start.
# PySpark — Joins
# Country reference data (small dataset)
countries = spark.createDataFrame([
("DE", "Germany", "Europe"),
("FR", "France", "Europe"),
("US", "United States", "North America"),
], ["code", "name", "continent"])
# Inner join (default)
enriched = df.join(countries, df.country == countries.code, "inner")
# Left join — keep all cities even without country match
enriched_left = df.join(countries, df.country == countries.code, "left")
# Broadcast join — force Spark to send small table to all executors
from pyspark.sql.functions import broadcast
enriched_fast = df.join(
broadcast(countries),
df.country == countries.code,
"inner"
)
Broadcast Joins: The Single Biggest Apache Spark Optimization
When one side of a join is small, use broadcast(). This sends the small table to every executor, avoiding an expensive shuffle of the large table. Spark broadcasts automatically when a side is below spark.sql.autoBroadcastJoinThreshold (10 MB by default); for tables up to roughly 100 MB, an explicit broadcast() is usually still a win, as long as executor memory can hold the copy.
| Join Type | When to Use | Watch Out For |
|---|---|---|
| Shuffle join (default) | Both tables are large | Expensive — shuffles both sides |
| Broadcast join | One table < 100 MB | Don't broadcast large tables — OOM risk |
| Sort-merge join | Both tables large, pre-sorted | Good for repeated joins on same key |
Common mistake: Joining two large tables without checking if one side is actually small enough to broadcast. Check with df.count() and estimate size before assuming you need a shuffle join.
Building a Real Data Pipeline
Let's put it all together. Here's a pipeline that reads raw event data, cleans it, aggregates it, and writes the result — a pattern you'll use daily.
# PySpark — Complete pipeline example
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import (
StructType, StructField, StringType, TimestampType, DoubleType, LongType
)
spark = (
SparkSession.builder
.appName("event-aggregation-pipeline")
.config("spark.sql.shuffle.partitions", 16)
.getOrCreate()
)
# 1. Define schema explicitly
event_schema = StructType([
StructField("event_id", StringType(), False),
StructField("user_id", StringType(), False),
StructField("event_type", StringType(), False),
StructField("event_ts", TimestampType(), False),
StructField("amount", DoubleType(), True),
StructField("country", StringType(), True),
])
# 2. Read raw data
raw_events = (
spark.read
.schema(event_schema)
.parquet("s3a://data-lake/raw/events/")
)
# 3. Clean: remove nulls, filter invalid, deduplicate
cleaned = (
raw_events
.filter(F.col("event_type").isNotNull())
.filter(F.col("amount") >= 0)
.dropDuplicates(["event_id"])
.withColumn("event_date", F.to_date("event_ts"))
)
# 4. Aggregate: daily revenue by country
daily_revenue = (
cleaned
.filter(F.col("event_type") == "purchase")
.groupBy("event_date", "country")
.agg(
F.sum("amount").alias("total_revenue"),
F.countDistinct("user_id").alias("unique_buyers"),
F.count("event_id").alias("num_transactions"),
)
.withColumn(
"avg_order_value",
F.round(F.col("total_revenue") / F.col("num_transactions"), 2)
)
)
# 5. Write partitioned output
(
daily_revenue
    .repartition("event_date")  # Co-locate each date's rows so partitionBy writes one file per date
.write
.mode("overwrite")
.partitionBy("event_date")
.parquet("s3a://data-lake/curated/daily_revenue/")
)
# Note: .count() re-runs the aggregation plan; cache daily_revenue first if that's expensive
print(f"Wrote {daily_revenue.count()} aggregated rows")
spark.stop()
This pipeline follows a pattern you'll see everywhere:
- Read with explicit schema
- Clean (filter, deduplicate, cast)
- Transform (aggregate, join, compute)
- Write with partitioning
Performance Pitfalls Every Beginner Hits
1. Collecting Large DataFrames to the Driver
# ❌ This pulls ALL data to the driver — will crash on large datasets
all_rows = df.collect()
# ✅ Take a sample instead
sample = df.limit(100).collect()
# ✅ Or use .show() for inspection
df.show(20, truncate=False)
2. Using Python UDFs When Built-in Functions Exist
# ❌ Python UDF — serializes data to Python, kills performance
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
@udf(StringType())
def upper_udf(s):
return s.upper() if s else None
df.withColumn("city_upper", upper_udf("city"))
# ✅ Built-in function — runs in JVM, 10-100x faster
df.withColumn("city_upper", F.upper("city"))
3. Not Repartitioning Before Writes
# ❌ 200 tiny files (one per default shuffle partition)
df.write.parquet("/output/")
# ✅ Control output file count
df.repartition(4).write.parquet("/output/")
# ✅ Or coalesce (avoids shuffle, but uneven partition sizes)
df.coalesce(4).write.parquet("/output/")
4. Caching Everything
# ❌ Caching a dataset you only use once — wastes memory
df.cache()
result = df.groupBy("country").count()
# ✅ Cache only when you reuse a DataFrame multiple times
cleaned = raw.filter(...).cache()
report_a = cleaned.groupBy("country").count()
report_b = cleaned.groupBy("event_type").count()
cleaned.unpersist() # Release when done
When NOT to Use Apache Spark
This is important — Spark isn't always the right tool:
- Dataset under 10 GB? Use DuckDB or pandas. Spark's startup overhead isn't worth it for small data.
- Real-time, event-by-event processing? Consider Apache Flink or Kafka Streams. Spark Structured Streaming works for micro-batch, but it's not true event-at-a-time processing.
- Simple SQL queries on files? DuckDB reads Parquet directly without a cluster. For exploring datasets interactively, tools like Harbinger Explorer let you query data right in your browser using DuckDB WASM — no cluster setup, no JVM overhead, just drag in your files and start writing SQL.
- One-off data exploration? A Jupyter notebook with pandas or Polars is faster to set up.
Spark shines when your data is too large for a single machine, when you need a battle-tested distributed engine, or when you're building pipelines that run daily in production.
Next Steps
You've got the foundations: SparkSession, DataFrames, transformations, joins, and a complete pipeline pattern. From here:
- Learn partitioning strategy — how you partition data determines your pipeline's performance
- Explore Spark Structured Streaming — the same DataFrame API, but for streaming data
- Practice with the Spark UI — understanding the DAG visualization and stage metrics is what separates beginners from experienced Spark developers
The best way to learn Spark is to take a dataset that's too big for pandas and rebuild a pipeline you've already written. You'll immediately see where Spark's model differs and where it excels.
Continue Reading
- What Is dbt and Why Data Engineers Use It — how dbt complements Spark for transformation logic
- Data Lakehouse Architecture Explained — the storage architecture Spark pipelines commonly target
- REST API Data Pipeline Guide — ingesting data from APIs before processing with Spark