Apache Spark Tutorial: From Zero to Your First Data Pipeline
You've got a dataset that doesn't fit in pandas anymore. Maybe it's 50 million rows, maybe 500 million. Your laptop fan screams, your kernel crashes, and you start wondering if there's a better way. There is — and it's called Apache Spark.
This Apache Spark tutorial walks you through everything you need to go from "I've heard of Spark" to writing production-grade PySpark jobs. No fluff, just working code.
What Is Apache Spark?
Apache Spark is a distributed computing engine designed to process large datasets across a cluster of machines. Instead of running a transformation on one CPU core, Spark splits the work across dozens or hundreds of cores in parallel.
The key concepts:
| Concept | What It Means | Why It Matters |
|---|---|---|
| Driver | The process running your main program | Coordinates the work |
| Executors | Worker processes on cluster nodes | Actually process data |
| Partitions | Chunks of your data | Enable parallel processing |
| Lazy evaluation | Transformations aren't executed immediately | Spark optimizes the full plan before running |
| DAG | Directed Acyclic Graph of operations | Spark's execution plan |
Spark supports Python (PySpark), Scala, Java, and R. This tutorial uses PySpark because that's what most data engineers reach for first.
Setting Up Your Apache Spark Environment
Local Development (Single Machine)
The fastest way to start: install PySpark via pip.
# Python 3.8+ required
pip install pyspark==3.5.1
# Verify installation
pyspark --version
Your First SparkSession
Every Spark program starts with a SparkSession — it's your entry point to all Spark functionality.
# PySpark
from pyspark.sql import SparkSession
spark = (
SparkSession.builder
.appName("my-first-spark-job")
.master("local[*]") # Use all available cores locally
.config("spark.sql.shuffle.partitions", 8) # Reduce for local dev
.getOrCreate()
)
print(f"Spark version: {spark.version}")
print("Spark UI: http://localhost:4040")
The local[*] master means "run everything on this machine using all CPU cores." In production, you'd point this to a cluster manager like YARN or Kubernetes.
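To make the cluster case concrete, here's a sketch of what submitting the same script to a YARN cluster might look like. The script name, master, and resource sizes below are placeholders, not values from this tutorial; adjust them to your cluster.

```shell
# Hypothetical submit command: script name and resource sizes are examples only
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-cores 4 \
  --executor-memory 8g \
  my_first_spark_job.py
```

On Kubernetes you'd swap `--master yarn` for a `k8s://https://<api-server>` URL; the rest of the pattern stays the same.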
Common mistake: Leaving spark.sql.shuffle.partitions at the default (200) during local development. With small datasets, 200 partitions means 200 tiny tasks with more overhead than actual work. Set it to 2× your core count locally.
Apache Spark DataFrames: The Core Abstraction
If you know pandas or SQL, Spark DataFrames will feel familiar. The difference: they're distributed across a cluster.
Creating DataFrames
# PySpark — Creating DataFrames from different sources
# From a list of tuples
data = [
("Berlin", "DE", 3_645_000, 891.7),
("Munich", "DE", 1_472_000, 310.7),
("Hamburg", "DE", 1_841_000, 755.2),
("Paris", "FR", 2_161_000, 105.4),
("Lyon", "FR", 516_000, 47.9),
]
columns = ["city", "country", "population", "area_km2"]
df = spark.createDataFrame(data, columns)
df.show()
# +------+-------+----------+--------+
# | city|country|population|area_km2|
# +------+-------+----------+--------+
# |Berlin| DE| 3645000| 891.7|
# |Munich| DE| 1472000| 310.7|
# |Hamburg| DE| 1841000| 755.2|
# | Paris| FR| 2161000| 105.4|
# | Lyon| FR| 516000| 47.9|
# +------+-------+----------+--------+
# From CSV (most common in practice)
df_csv = (
spark.read
.option("header", True)
.option("inferSchema", True)
.csv("/data/cities.csv")
)
# From Parquet (preferred for Spark — columnar, compressed)
df_parquet = spark.read.parquet("/data/cities.parquet")
# From JSON
df_json = spark.read.json("/data/cities.json")
Schema Definition — Don't Rely on Inference
Schema inference is convenient for exploration but dangerous in production. Always define your schema explicitly.
# PySpark — Explicit schema definition
from pyspark.sql.types import (
StructType, StructField, StringType, LongType, DoubleType
)
schema = StructType([
StructField("city", StringType(), nullable=False),
StructField("country", StringType(), nullable=False),
StructField("population", LongType(), nullable=True),
StructField("area_km2", DoubleType(), nullable=True),
])
df = spark.read.schema(schema).csv("/data/cities.csv", header=True)
df.printSchema()
# root
# |-- city: string (nullable = false)
# |-- country: string (nullable = false)
# |-- population: long (nullable = true)
# |-- area_km2: double (nullable = true)
Common mistake: Using inferSchema=True in production pipelines. It reads the entire file once just to guess types, doubles your I/O cost, and sometimes guesses wrong (zip codes as integers, IDs as doubles).
Transformations: Where the Real Work Happens
Spark transformations are lazy — they build up a plan but don't execute until you call an action (like .show(), .count(), or .write).
Essential DataFrame Operations
# PySpark — Core transformations
from pyspark.sql import functions as F
# Filter rows
german_cities = df.filter(F.col("country") == "DE")
# Add computed columns
df_with_density = df.withColumn(
"density_per_km2",
F.round(F.col("population") / F.col("area_km2"), 1)
)
# Select and rename
df_clean = (
df.select(
F.col("city"),
F.col("country"),
F.col("population"),
F.col("area_km2").alias("area_sqkm"),
)
)
# Aggregation
country_stats = (
df.groupBy("country")
.agg(
F.sum("population").alias("total_population"),
F.count("city").alias("num_cities"),
F.round(F.avg("area_km2"), 1).alias("avg_area_km2"),
)
)
country_stats.show()
# +-------+----------------+----------+------------+
# |country|total_population|num_cities|avg_area_km2|
# +-------+----------------+----------+------------+
# | DE| 6958000| 3| 652.5|
# | FR| 2677000| 2| 76.7|
# +-------+----------------+----------+------------+
# Sort
df_sorted = df.orderBy(F.col("population").desc())
Spark SQL — Use Whichever Feels Natural
You can mix DataFrame API and SQL freely. Register a DataFrame as a temp view and query it with SQL.
# PySpark + Spark SQL
df.createOrReplaceTempView("cities")
result = spark.sql("""
SELECT
country,
COUNT(*) AS num_cities,
SUM(population) AS total_pop,
ROUND(AVG(population / area_km2), 1) AS avg_density
FROM cities
GROUP BY country
HAVING SUM(population) > 1000000
ORDER BY total_pop DESC
""")
result.show()
My take: use SQL for ad-hoc exploration and complex joins. Use the DataFrame API for pipelines where you need composability and testing. They compile to the same execution plan anyway.
Joins in Apache Spark
Joins are where Spark's distributed nature gets interesting — and where performance problems usually start.
# PySpark — Joins
# Country reference data (small dataset)
countries = spark.createDataFrame([
("DE", "Germany", "Europe"),
("FR", "France", "Europe"),
("US", "United States", "North America"),
], ["code", "name", "continent"])
# Inner join (default)
enriched = df.join(countries, df.country == countries.code, "inner")
# Left join — keep all cities even without country match
enriched_left = df.join(countries, df.country == countries.code, "left")
# Broadcast join — force Spark to send small table to all executors
from pyspark.sql.functions import broadcast
enriched_fast = df.join(
broadcast(countries),
df.country == countries.code,
"inner"
)
Broadcast Joins: The Single Biggest Apache Spark Optimization
When one side of a join is small, use broadcast(). This sends the small table to every executor, avoiding an expensive shuffle of the large table. Spark broadcasts automatically when a side is below spark.sql.autoBroadcastJoinThreshold (10 MB by default); for tables up to roughly 100 MB, an explicit broadcast() is usually still a win, as long as executor memory can hold the copy.
| Join Type | When to Use | Watch Out For |
|---|---|---|
| Shuffle join (default) | Both tables are large | Expensive — shuffles both sides |
| Broadcast join | One table < 100 MB | Don't broadcast large tables — OOM risk |
| Sort-merge join | Both tables large, pre-sorted | Good for repeated joins on same key |
Common mistake: Joining two large tables without checking if one side is actually small enough to broadcast. Check with df.count() and estimate size before assuming you need a shuffle join.
Building a Real Data Pipeline
Let's put it all together. Here's a pipeline that reads raw event data, cleans it, aggregates it, and writes the result — a pattern you'll use daily.
# PySpark — Complete pipeline example
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import (
StructType, StructField, StringType, TimestampType, DoubleType, LongType
)
spark = (
SparkSession.builder
.appName("event-aggregation-pipeline")
.config("spark.sql.shuffle.partitions", 16)
.getOrCreate()
)
# 1. Define schema explicitly
event_schema = StructType([
StructField("event_id", StringType(), False),
StructField("user_id", StringType(), False),
StructField("event_type", StringType(), False),
StructField("event_ts", TimestampType(), False),
StructField("amount", DoubleType(), True),
StructField("country", StringType(), True),
])
# 2. Read raw data
raw_events = (
spark.read
.schema(event_schema)
.parquet("s3a://data-lake/raw/events/")
)
# 3. Clean: remove nulls, filter invalid, deduplicate
cleaned = (
raw_events
.filter(F.col("event_type").isNotNull())
.filter(F.col("amount") >= 0)
.dropDuplicates(["event_id"])
.withColumn("event_date", F.to_date("event_ts"))
)
# 4. Aggregate: daily revenue by country
daily_revenue = (
cleaned
.filter(F.col("event_type") == "purchase")
.groupBy("event_date", "country")
.agg(
F.sum("amount").alias("total_revenue"),
F.countDistinct("user_id").alias("unique_buyers"),
F.count("event_id").alias("num_transactions"),
)
.withColumn(
"avg_order_value",
F.round(F.col("total_revenue") / F.col("num_transactions"), 2)
)
)
# 5. Write partitioned output
(
daily_revenue
    .repartition("event_date")  # Co-locate each date's rows so partitionBy writes one file per date
.write
.mode("overwrite")
.partitionBy("event_date")
.parquet("s3a://data-lake/curated/daily_revenue/")
)
# Note: .count() re-runs the aggregation plan; cache daily_revenue first if that's expensive
print(f"Wrote {daily_revenue.count()} aggregated rows")
spark.stop()
This pipeline follows a pattern you'll see everywhere:
- Read with explicit schema
- Clean (filter, deduplicate, cast)
- Transform (aggregate, join, compute)
- Write with partitioning
Performance Pitfalls Every Beginner Hits
1. Collecting Large DataFrames to the Driver
# ❌ This pulls ALL data to the driver — will crash on large datasets
all_rows = df.collect()
# ✅ Take a sample instead
sample = df.limit(100).collect()
# ✅ Or use .show() for inspection
df.show(20, truncate=False)
2. Using Python UDFs When Built-in Functions Exist
# ❌ Python UDF — serializes data to Python, kills performance
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
@udf(StringType())
def upper_udf(s):
return s.upper() if s else None
df.withColumn("city_upper", upper_udf("city"))
# ✅ Built-in function — runs in JVM, 10-100x faster
df.withColumn("city_upper", F.upper("city"))
3. Not Repartitioning Before Writes
# ❌ 200 tiny files (one per default shuffle partition)
df.write.parquet("/output/")
# ✅ Control output file count
df.repartition(4).write.parquet("/output/")
# ✅ Or coalesce (avoids shuffle, but uneven partition sizes)
df.coalesce(4).write.parquet("/output/")
4. Caching Everything
# ❌ Caching a dataset you only use once — wastes memory
df.cache()
result = df.groupBy("country").count()
# ✅ Cache only when you reuse a DataFrame multiple times
cleaned = raw.filter(...).cache()
report_a = cleaned.groupBy("country").count()
report_b = cleaned.groupBy("event_type").count()
cleaned.unpersist() # Release when done
When NOT to Use Apache Spark
This is important — Spark isn't always the right tool:
- Dataset under 10 GB? Use DuckDB or pandas. Spark's startup overhead isn't worth it for small data.
- Real-time, event-by-event processing? Consider Apache Flink or Kafka Streams. Spark Structured Streaming works for micro-batch, but it's not true event-at-a-time processing.
- Simple SQL queries on files? DuckDB reads Parquet directly without a cluster. For exploring datasets interactively, tools like Harbinger Explorer let you query data right in your browser using DuckDB WASM — no cluster setup, no JVM overhead, just drag in your files and start writing SQL.
- One-off data exploration? A Jupyter notebook with pandas or Polars is faster to set up.
Spark shines when your data is too large for a single machine, when you need a battle-tested distributed engine, or when you're building pipelines that run daily in production.
Next Steps
You've got the foundations: SparkSession, DataFrames, transformations, joins, and a complete pipeline pattern. From here:
- Learn partitioning strategy — how you partition data determines your pipeline's performance
- Explore Spark Structured Streaming — the same DataFrame API, but for streaming data
- Practice with the Spark UI — understanding the DAG visualization and stage metrics is what separates beginners from experienced Spark developers
The best way to learn Spark is to take a dataset that's too big for pandas and rebuild a pipeline you've already written. You'll immediately see where Spark's model differs and where it excels.
Continue Reading
- What Is dbt and Why Data Engineers Use It — how dbt complements Spark for transformation logic
- Data Lakehouse Architecture Explained — the storage architecture Spark pipelines commonly target
- REST API Data Pipeline Guide — ingesting data from APIs before processing with Spark