Most joins in Spark require a shuffle: both sides get redistributed so that matching rows from each side end up on the same executor. For two large tables, that's unavoidable. But if one side of the join is small enough to fit in memory, there's a better option: copy the small table to every executor and avoid the shuffle entirely. This is a broadcast join, and it can turn a 10-minute join into a 30-second join.
How It Works
In a standard sort-merge join (Spark's default for two large tables):
- Both sides shuffle — rows with the same key go to the same partition
- Each partition sorts by join key
- The executor merges the sorted lists to produce join output
In a broadcast join:
- The small table is collected to the driver
- The driver broadcasts a copy to every executor's memory
- Each executor performs a local hash lookup — no shuffling of the large table required
The large table never moves. Only the small table does, and it goes to every executor once. For a 10GB large table and a 1MB lookup table, you've eliminated a 10GB shuffle in exchange for broadcasting 1MB × number_of_executors.
When Spark Auto-Broadcasts
# Spark automatically broadcasts tables smaller than this threshold
spark.conf.get("spark.sql.autoBroadcastJoinThreshold")
# Default: "10485760" (10MB)
# Increase it if your dimension tables are larger but still fit comfortably in executor memory
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024)) # 50MB
When you run result.explain() and see BroadcastHashJoin in the physical plan, Spark chose a broadcast join. If you see SortMergeJoin, it shuffled both sides. If you expected a broadcast but got a sort-merge, the table exceeded the threshold or Spark couldn't determine its size.
Forcing a Broadcast with a Hint
from pyspark.sql import functions as F
# Explicitly hint that this table should be broadcast
result = large_orders.join(
F.broadcast(small_products),
on="product_id",
how="inner"
)
# Verify: check the physical plan
result.explain()
# Should show: BroadcastHashJoin
The hint overrides the auto-broadcast threshold. Use it when you know the table is small but Spark can't determine its size (happens with JDBC sources, complex subqueries, or tables whose statistics aren't cached).
In SQL Syntax
-- Broadcast hint in Spark SQL
spark.sql("""
SELECT /*+ BROADCAST(p) */ o.order_id, p.product_name
FROM orders o
JOIN products p ON o.product_id = p.product_id
""")
What "Small" Means
The broadcast table needs to fit in the memory of every executor, not just the driver. If you have 10 executors each with 8GB of heap and you broadcast a 6GB table, you've consumed 6GB × 10 = 60GB of total cluster memory for the broadcast. On a cluster where each executor has 8GB available for data processing, a 6GB broadcast leaves 2GB for everything else — likely causing spills to disk and hurting overall performance.
Practical guidance for 2020-era clusters: auto-broadcast threshold of 10MB is conservative. Moving it to 50-100MB is generally safe. Anything above a few hundred MB needs careful memory accounting against your executor heap allocation.
The SQL Server Analogy
SQL Server's equivalent is the nested loop join with the inner side fully loaded into memory as a hash table — the "hash join" with build/probe sides. SQL Server's query optimizer chooses the build side (the smaller table) for the hash table, exactly as Spark broadcasts the smaller side. In SQL Server you can hint this with OPTION (HASH JOIN). In Spark you use F.broadcast() or the SQL hint.
The difference: in SQL Server, both sides still go through the same process on one machine. In Spark, broadcast avoids moving the large table across the network entirely — which is the expensive part in a distributed system.
When Not to Broadcast
- The "small" table changes frequently and you'd need to re-broadcast on every query
- The table isn't small — you're just hoping the hint makes the join faster (it won't, if the table doesn't fit in executor memory)
- You're doing a right or full outer join — broadcast hash join only supports specific join types (inner, left outer when broadcasting the right side)
Broadcast joins are the first tool I reach for when a join on a fact-to-dimension table pattern is slow. If the dimension table is under 100MB, the broadcast hint almost always helps. If it's larger, the shuffle is unavoidable and the focus shifts to partition tuning and filter pushdown.