Databricks Photon Engine: When to Use It — and When Not To

Tags: databricks, photon, performance, sql, spark, query-engine


Databricks Photon is one of the most significant performance advances in the Databricks runtime — a native, vectorized C++ query execution engine that can deliver dramatic speedups for the right workloads. But "enable Photon everywhere" is not a sound engineering strategy. Photon has a specific cost profile, a focused set of strengths, and real limitations.

This guide gives you the technical foundations, benchmarks, and decision framework to deploy Photon intelligently.


What Is Photon?

Photon is a reimplementation of the Spark SQL execution engine in native C++, designed from the ground up for modern CPUs. It replaces the JVM-based Spark execution layer for supported operations, processing data in a columnar, vectorized fashion that takes full advantage of CPU SIMD instructions (AVX-512, AVX2).

Standard Spark executes queries row-by-row through a JVM call stack. Photon processes data in batches of columns, which means:

  • More data processed per CPU cycle
  • Dramatically reduced JVM overhead and garbage collection
  • Better cache locality (columns fit in CPU L1/L2 cache)
  • Native memory management (no JVM heap pressure)

The result: for the right query patterns, Photon is typically 2-10x faster than standard Spark at the same compute cost.
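The row-versus-batch distinction can be sketched in plain Python. This is an illustration of the access pattern only; Photon itself does this in native C++ with SIMD, and the data shown is made up:

```python
# Illustrative sketch: row-at-a-time vs. column-at-a-time aggregation.
# Photon's actual engine is native C++; this only shows the access pattern.

# Row-oriented: a list of records, touched one row at a time
rows = [
    {"country": "US", "revenue": 10.0},
    {"country": "DE", "revenue": 5.0},
    {"country": "US", "revenue": 7.5},
]

def row_wise_total(rows):
    total = 0.0
    for row in rows:  # one dict lookup plus one add per row
        total += row["revenue"]
    return total

# Column-oriented: each column is a contiguous array, processed in one pass.
# Tight loops over flat arrays are what let the CPU use SIMD and stay in cache.
columns = {
    "country": ["US", "DE", "US"],
    "revenue": [10.0, 5.0, 7.5],
}

def columnar_total(columns):
    return sum(columns["revenue"])  # single pass over a contiguous array

assert row_wise_total(rows) == columnar_total(columns) == 22.5
```

Both paths compute the same answer; the columnar one simply gives the hardware far more regular work per instruction.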


How to Enable Photon

Photon requires a Photon-capable instance type. On AWS these include i3, m5d, and r5d-class instances; on Azure, the Standard_E and Standard_L series; on GCP, the n2 and n2d families. The supported list evolves, so check the Databricks documentation for your cloud.

# Enable Photon via CLI when creating a cluster
databricks clusters create --json '{
  "cluster_name": "photon-cluster",
  "spark_version": "13.3.x-photon-scala2.12",
  "node_type_id": "i3.xlarge",
  "num_workers": 4
}'
# Verify Photon is active (run in a notebook attached to the cluster)
print(spark.conf.get("spark.databricks.photon.enabled"))

In the Databricks UI, simply select a Photon Accelerated runtime when creating a cluster. Photon-accelerated runtimes are clearly labeled.


Where Photon Excels

1. SQL Aggregations and Analytics

Photon's biggest wins are on heavy SQL workloads — aggregations, GROUP BY, HAVING, window functions, and joins on large tables.

-- This query type benefits massively from Photon
SELECT
  country,
  event_type,
  DATE_TRUNC('month', event_ts) AS month,
  COUNT(*)                       AS events,
  COUNT(DISTINCT user_id)        AS unique_users,
  SUM(revenue_usd)               AS total_revenue,
  AVG(session_duration_s)        AS avg_session_s
FROM catalog.schema.events
WHERE event_ts >= '2024-01-01'
GROUP BY 1, 2, 3
ORDER BY total_revenue DESC;

Benchmark (100M rows, 8-core cluster):

| Engine | Runtime | Cost |
| --- | --- | --- |
| Standard Spark | 142s | $0.48 |
| Photon | 18s | $0.06 |

Roughly an 8x speedup (142s / 18s ≈ 7.9x) at about the same price per hour. One caveat on DBU rates: Photon is included in Databricks SQL warehouse pricing, but on classic jobs and all-purpose compute a Photon DBU multiplier may apply, so confirm the rate for your tier.
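Taking the benchmark figures above at face value, a quick sanity check on the implied hourly rates (these are the table's hypothetical numbers, not official pricing):

```python
# Hypothetical figures from the benchmark table above, not official pricing.
standard_s, standard_cost = 142, 0.48
photon_s, photon_cost = 18, 0.06

speedup = standard_s / photon_s                      # ~7.9x
standard_rate = standard_cost / (standard_s / 3600)  # implied $/hour
photon_rate = photon_cost / (photon_s / 3600)

print(f"speedup: {speedup:.1f}x")
print(f"implied rates: ${standard_rate:.2f}/h vs ${photon_rate:.2f}/h")
```

The two implied rates come out nearly identical (about $12/hour), which is consistent with the benchmark running both engines at a comparable price per hour.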

2. ETL Pipelines with Column Projections and Filters

Bulk transformations that project columns, apply filters, and write to Delta benefit from Photon's vectorized scan and write path:

# Photon accelerates this entire pipeline
from pyspark.sql import functions as F

df = (
    spark.table("raw.events")
    .filter("event_date = '2024-01-15'")
    .select("user_id", "event_type", "session_id", "revenue_usd", "country")
    .withColumn("revenue_eur", F.col("revenue_usd") * 0.92)
    .groupBy("country", "event_type")
    .agg(
        F.count("*").alias("events"),
        F.sum("revenue_eur").alias("revenue_eur")
    )
)

df.write.format("delta").mode("append").saveAsTable("gold.country_events_daily")

3. Delta Lake Reads and Writes

Photon has native Delta Lake integration. It accelerates:

  • Parquet vectorized reads
  • Delta write path (including OPTIMIZE)
  • Data skipping via column statistics

-- Photon accelerates the scan and MERGE computation
MERGE INTO target t
USING source s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

4. String Operations and Regular Expressions

Photon's native C++ string library significantly outperforms JVM string operations:

-- Photon handles this 3-5x faster than standard Spark
SELECT
  user_id,
  REGEXP_EXTRACT(url, 'utm_source=([^&]+)', 1) AS utm_source,
  UPPER(country)                                AS country,
  LENGTH(description)                           AS desc_len
FROM catalog.schema.pageviews
WHERE url LIKE '%utm_%';

Where Photon Does NOT Help (or Hurts)

Understanding Photon's limitations is as important as knowing its strengths.

1. Python UDFs

This is the most critical limitation. Photon cannot execute Python UDFs. When Photon encounters a Python UDF, it falls back to standard Spark for that stage:

from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# This kills Photon acceleration for the entire stage
@udf(returnType=StringType())
def parse_custom_format(value):
    # Photon falls back to JVM/Python for this
    return value.split("|")[0].strip()

df = df.withColumn("parsed", parse_custom_format(col("raw_value")))

Fix: Replace Python UDFs with built-in Spark SQL functions wherever possible. Pandas UDFs cut serialization overhead compared with row-wise UDFs, but they still execute outside Photon, so built-ins come first:

from pyspark.sql import functions as F

# Photon can accelerate this
df = df.withColumn("parsed", F.split(F.trim(F.col("raw_value")), "\\|")[0])

2. ML and Data Science Workloads

Photon does not accelerate:

  • MLlib model training
  • Pandas operations on Spark DataFrames
  • Custom Scala/Java transformers outside the supported operator set
  • Complex nested data structures (maps, arrays of structs)

# Photon provides no benefit here
from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(numTrees=100, maxDepth=5)
model = rf.fit(training_df)

For ML, standard Spark runtimes or GPU-enabled clusters are more appropriate.

3. Small Data and Short-Running Queries

Photon has a non-trivial startup overhead per query. For queries that run in under 1-2 seconds, Photon's initialization cost can negate its throughput advantage:

| Query Duration | Photon Benefit |
| --- | --- |
| < 1 second | None or negative |
| 1-10 seconds | Marginal |
| > 30 seconds | Significant |
| > 5 minutes | Maximum benefit |
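A back-of-the-envelope model explains the pattern: with a fixed per-query overhead, Photon's raw throughput gain is diluted on short queries. The overhead and raw-speedup values below are assumptions for illustration, not measured Photon constants:

```python
def effective_speedup(base_runtime_s: float,
                      raw_speedup: float = 8.0,
                      overhead_s: float = 0.5) -> float:
    """Effective speedup once a fixed per-query overhead is paid.

    raw_speedup and overhead_s are illustrative assumptions,
    not measured Photon constants.
    """
    photon_runtime = overhead_s + base_runtime_s / raw_speedup
    return base_runtime_s / photon_runtime

for t in (1, 10, 60, 300):
    print(f"{t:>4}s query -> {effective_speedup(t):.1f}x effective")
# Prints:
#    1s query -> 1.6x effective
#   10s query -> 5.7x effective
#   60s query -> 7.5x effective
#  300s query -> 7.9x effective
```

Under these assumptions a 1-second query sees only ~1.6x while a 5-minute query approaches the full 8x, which matches the shape of the table above.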

Implication: Don't enable Photon on clusters used primarily for interactive, exploratory work on small datasets.

4. Streaming with Frequent Micro-Batches

Structured Streaming with very short trigger intervals (< 5 seconds) can see overhead from Photon query planning. Test carefully before enabling Photon on high-frequency streaming clusters:

# For streaming with very short intervals, benchmark before committing to Photon
(
    spark.readStream
    .format("delta")
    .table("raw.events")
    .writeStream
    .trigger(processingTime="2 seconds")  # Short interval — test Photon carefully
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/silver_events")  # example path
    .toTable("silver.events")  # toTable() starts the streaming query
)

The Decision Framework

Use this framework to decide whether to enable Photon for a given workload:

Is the workload primarily SQL aggregations or large Delta reads/writes?
  YES → Enable Photon

Does the workload use Python UDFs extensively?
  YES → Refactor UDFs first, then evaluate Photon

Is this ML training or model inference?
  YES → Use standard runtime or GPU cluster instead

Are queries typically < 5 seconds on small data?
  YES → Standard runtime is cheaper and equally fast

Is this high-frequency micro-batch streaming (< 5s trigger)?
  YES → Benchmark both runtimes before deciding

Benchmarking Photon vs Standard Spark

Always benchmark your specific workload — don't rely on generic claims. Here's a reproducible benchmark pattern:

# photon_benchmark.py
import time

def benchmark_query(query: str, label: str, runs: int = 3) -> float:
    times = []
    for i in range(runs):
        # Clear caches between runs; the JVM GC call goes through a
        # private gateway and is best-effort
        spark.catalog.clearCache()
        spark.sparkContext._jvm.System.gc()

        start = time.perf_counter()
        spark.sql(query).write.format("noop").mode("overwrite").save()
        elapsed = time.perf_counter() - start
        times.append(elapsed)
        print(f"  [{label}] Run {i+1}: {elapsed:.2f}s")

    avg = sum(times) / len(times)
    print(f"  [{label}] Average: {avg:.2f}s")
    return avg

BENCHMARK_QUERY = """
    SELECT country, event_type, COUNT(*), SUM(revenue_usd)
    FROM catalog.schema.events
    WHERE event_date >= '2024-01-01'
    GROUP BY 1, 2
    ORDER BY 3 DESC
"""

# Run on standard cluster, record results, then switch to Photon cluster
standard_avg = benchmark_query(BENCHMARK_QUERY, "Standard Spark")

# After switching to Photon-enabled cluster:
photon_avg = benchmark_query(BENCHMARK_QUERY, "Photon")

speedup = standard_avg / photon_avg
print(f"\nPhoton speedup: {speedup:.1f}x")

Cost Implications

On Databricks SQL warehouses, Photon is included in the DBU price; on classic jobs and all-purpose compute, check your tier, since a Photon DBU multiplier may apply. Where the rates match, a job that runs 5x faster uses 5x fewer DBUs, a direct cost reduction.

However, Photon-compatible instance types (with local NVMe SSDs) are slightly more expensive than general-purpose instances at the cloud VM level. For most analytical workloads, the DBU savings far outweigh the instance premium.

| Scenario | Standard DBUs | Photon DBUs | Cloud VM Cost | Net Result |
| --- | --- | --- | --- | --- |
| 4-hour batch job → 45 min | 4.0 | 0.75 | +15% per hour | ~75% cost reduction |
| 30-min interactive query | 0.5 | 0.5 | +15% per hour | Neutral |
| 1-second ad-hoc query | 0.02 | 0.02 | +15% per hour | Slight increase |
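The "net result" column follows from simple arithmetic. A sketch of the first scenario, using an assumed $/DBU and an assumed 15% VM premium (both illustrative, not quoted Databricks or cloud prices):

```python
# Illustrative constants, not quoted Databricks or cloud prices.
DBU_PRICE = 0.55         # assumed $ per DBU
VM_RATE = 1.00           # assumed base VM $ per hour
PHOTON_VM_PREMIUM = 1.15 # assumed +15% for Photon-capable instances

def job_cost(runtime_h: float, dbus: float, photon: bool) -> float:
    """Total cost = DBU charge + cloud VM charge for the runtime."""
    vm = VM_RATE * (PHOTON_VM_PREMIUM if photon else 1.0) * runtime_h
    return dbus * DBU_PRICE + vm

# 4-hour batch job that Photon finishes in 45 minutes (table's first row)
standard = job_cost(runtime_h=4.0, dbus=4.0, photon=False)
photon = job_cost(runtime_h=0.75, dbus=0.75, photon=True)
print(f"standard: ${standard:.2f}, photon: ${photon:.2f}, "
      f"saving: {1 - photon / standard:.0%}")
```

With these assumed rates the saving lands close to the table's ~75% figure: the VM premium applies for far fewer hours, so it is swamped by the runtime reduction.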

Photon and Liquid Clustering

Databricks Liquid Clustering — the successor to Hive-style partitioning and Z-ORDER — is deeply integrated with Photon. Liquid clustering uses column statistics and an automated file layout to enable efficient data skipping without requiring manual OPTIMIZE ZORDER calls:

-- Create a table with Liquid Clustering
CREATE TABLE catalog.schema.events
CLUSTER BY (country, event_type, user_id)
AS SELECT * FROM raw.events;

-- Clustering runs automatically during OPTIMIZE
OPTIMIZE catalog.schema.events;

Photon accelerates both the OPTIMIZE operation and the subsequent scans. If you're on Databricks Runtime 13.3+ and using Photon, Liquid Clustering is the preferred layout strategy over manual Z-ORDER.


Final Thoughts

Photon is a genuine engineering achievement — it makes Databricks SQL and Delta Lake significantly faster for analytical and ETL workloads. But it's not magic dust. Enable it where the workload profile matches: large-scale SQL, heavy Delta reads/writes, and ETL pipelines free of Python UDFs.

When you're managing multiple Databricks clusters across different workload types, tracking which clusters have Photon enabled, which jobs would benefit from migration, and how cluster configurations compare — that's exactly the operational visibility that Harbinger Explorer provides.

Try Harbinger Explorer free for 7 days and get clear visibility into how your Databricks clusters are configured and how to optimize them.

