Photon Engine: What Databricks' Native Vectorized Runtime Actually Changes
Databricks announced Photon at Data + AI Summit last month, and it's now available in Databricks Runtime 9.1. The pitch: a native vectorized query engine written in C++ that replaces the JVM-based Spark SQL execution for analytical queries. Faster scans, faster aggregations, faster joins — without changing any code.
I've been running it on a few production workloads. Here's what I actually saw.
What Photon Is
Standard Spark SQL execution runs on the JVM. Row processing there pays for Java object creation, garbage collection, and other JVM overhead. On small datasets that cost is negligible. On large analytical queries scanning millions or billions of rows, it adds up.
Photon runs natively on the CPU, processes data in columnar batches (not row by row), and takes advantage of CPU vectorization instructions (SIMD). The data stays in columnar format through the execution pipeline — no row-by-row translation until the result needs to be returned.
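To make the row-versus-columnar distinction concrete, here is a toy sketch in plain Python. It is not how Photon is implemented (Photon is C++ and uses SIMD); it only illustrates the difference in data layout. The sample data, function names, and batch size are all invented for illustration.

```python
from array import array

# Hypothetical order data, invented for illustration only.
rows = [
    {"region": "EU", "amount": 10.0},
    {"region": "US", "amount": 25.5},
    {"region": "EU", "amount": 4.5},
]


def sum_row_at_a_time(records):
    """Row-oriented: visit one record object per iteration."""
    total = 0.0
    for record in records:
        total += record["amount"]
    return total


def to_columns(records):
    """Pivot the records into one container per column."""
    return {
        "region": [r["region"] for r in records],
        # A typed array stores the doubles contiguously, like a column vector.
        "amount": array("d", (r["amount"] for r in records)),
    }


def sum_columnar(columns, batch_size=2):
    """Column-oriented: process the 'amount' column in fixed-size batches."""
    amounts = columns["amount"]
    total = 0.0
    for start in range(0, len(amounts), batch_size):
        batch = amounts[start:start + batch_size]  # a slice of contiguous doubles
        total += sum(batch)
    return total


columns = to_columns(rows)
```

Both functions return the same total. The point is only that the columnar version works on a dense run of values from a single column, which is the shape of data that vectorized CPU instructions operate on.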
Enabling Photon
You don't change your code. You select a Photon-enabled runtime when configuring your cluster:
{
  "spark_version": "9.1.x-photon-scala2.12",
  "node_type_id": "Standard_DS4_v2",
  "num_workers": 4
}
That's it. Spark SQL queries, DataFrame operations, Delta Lake reads: all go through the Photon engine automatically when it can handle the operation. Photon has its own internal list of supported operations; anything it can't execute falls back to standard Spark. The fallback is transparent, so you don't see it unless you look at query plans.
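If you do want to look at the query plans, here is a rough sketch of one way to spot fallbacks. It relies on the convention that Photon operators in a physical plan carry a `Photon` prefix (for example `PhotonGroupingAgg`). The helper name `split_plan_operators` and the sample plan text are hypothetical; real plans are more verbose, so treat the parsing as an assumption that will need adjusting.

```python
import re

# Hypothetical, heavily simplified plan text. Real plans contain far more
# detail, and the exact operator names here are illustrative.
SAMPLE_PLAN = """\
== Physical Plan ==
ColumnarToRow
+- PhotonProject
   +- BatchEvalPython
      +- PhotonGroupingAgg(keys=[region_code, product_category])
         +- PhotonScan parquet gold.daily_orders
"""


def split_plan_operators(plan_text):
    """Return (photon_ops, other_ops) based on the 'Photon' name prefix."""
    photon, other = [], []
    for line in plan_text.splitlines():
        # Drop the tree-drawing characters that precede the operator name.
        stripped = line.lstrip(" +-:|")
        if not stripped or stripped.startswith("=="):
            continue  # skip blank lines and section headers
        match = re.match(r"[A-Za-z_]\w*", stripped)
        if match is None:
            continue
        name = match.group(0)
        (photon if name.startswith("Photon") else other).append(name)
    return photon, other


photon_ops, other_ops = split_plan_operators(SAMPLE_PLAN)
```

On a cluster, the plan text would come from the query's `EXPLAIN` output or the Spark UI rather than a hard-coded string. Anything that lands in `other_ops` is a candidate for work that ran outside Photon.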
Where I Saw Performance Improvements
The best results were on queries that scan large amounts of data and do aggregations or range-based filters:
-- This query type benefits significantly from Photon
-- Large Delta table scan with aggregation
SELECT
    region_code,
    product_category,
    SUM(order_amount) AS total_revenue,
    COUNT(DISTINCT customer_id) AS unique_customers,
    AVG(order_amount) AS avg_order_value
FROM gold.daily_orders
WHERE order_date BETWEEN '2021-01-01' AND '2021-11-30'
GROUP BY region_code, product_category
ORDER BY total_revenue DESC;
On the query above (about 800M rows), the standard runtime took 4m 12s and Photon took 1m 48s, roughly 2.3x faster. The improvement comes almost entirely from the scan and aggregation phases, both of which Photon handles natively.
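For anyone repeating this comparison on their own queries, the arithmetic behind the 2.3x figure is simple enough to script. This is a minimal sketch; the `to_seconds` helper is my own and only understands durations written like the ones above.

```python
import re


def to_seconds(duration):
    """Parse a duration written like '4m 12s' into whole seconds."""
    match = re.fullmatch(r"\s*(?:(\d+)m)?\s*(?:(\d+)s)?\s*", duration)
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {duration!r}")
    minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return minutes * 60 + seconds


standard = to_seconds("4m 12s")  # 252 seconds
photon = to_seconds("1m 48s")    # 108 seconds

speedup = standard / photon                     # about 2.33
time_saved_pct = (1 - photon / standard) * 100  # about 57 percent
```

Put differently, a 2.3x speedup means the Photon run finished in a bit under half the wall-clock time of the standard run.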
Where the Improvement Was Smaller
Pandas UDFs, Python UDFs, and any operation that requires Python execution don't run through Photon. Photon handles the SQL/DataFrame operations and falls back to standard Spark for Python. If your critical path involves a complex Python UDF, Photon won't help that part of the query.
Small queries (anything under a few million rows) didn't show meaningful improvement. With that little data, there isn't enough scan and aggregation work for vectorized execution to pay off. Photon is an analytical workload optimization, not a general-purpose speed boost.
The Cost
Photon-enabled runtimes cost more in Databricks DBU pricing. The exact multiplier depends on the cloud provider and instance type. The break-even math: if Photon makes your workload 2x faster and the runtime costs 1.5x more, you pay 1.5 × 0.5 = 0.75 of the original cost, so you come out ahead. If the speedup is smaller than the price multiplier, as it can be for short workloads, Photon costs more than it saves in compute time and isn't worth it. Benchmark on your actual workloads before committing.
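The break-even reasoning above fits in a few lines. This sketch assumes cost is simply hourly price times runtime on the same cluster size, and that the price multiplier applies to the whole hourly price; in practice the multiplier applies to DBUs, and the cloud VM charge is billed separately. The function name and the 1.5x multiplier are illustrative, not actual Databricks pricing.

```python
def relative_cost(speedup, price_multiplier):
    """Cost of the Photon run divided by the cost of the standard run.

    Assumes cost = hourly price x runtime, with the same cluster size.
    A result below 1.0 means Photon is cheaper overall.
    """
    if speedup <= 0 or price_multiplier <= 0:
        raise ValueError("speedup and price_multiplier must be positive")
    return price_multiplier / speedup


# The example from the text: 2x faster at 1.5x the price.
example = relative_cost(speedup=2.0, price_multiplier=1.5)  # 0.75

# The measured query (252s vs 108s) at a hypothetical 1.5x multiplier.
measured = relative_cost(speedup=252 / 108, price_multiplier=1.5)  # about 0.64

# Break-even sits where the speedup equals the price multiplier.
break_even = relative_cost(speedup=1.5, price_multiplier=1.5)  # 1.0
```

Under these assumptions, any workload whose measured speedup stays below the price multiplier costs more on Photon, which is why short queries with little speedup are the ones to watch.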