Both repartition() and coalesce() change the number of partitions in a Spark DataFrame. They are not interchangeable. Picking the wrong one doesn't cause an error — it just makes your job slower or produces unbalanced output files. Here's the distinction.
The Difference
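repartition(n) triggers a full shuffle. Every row in the DataFrame is sent through the shuffle mechanism and redistributed across exactly n output partitions. With no column arguments, rows are distributed round-robin, so the output partitions are roughly equal in size regardless of how skewed the input was. With column arguments, as in repartition(n, col), rows are hash-partitioned on those columns: all rows with the same key land in the same partition, and a hot key can still produce an oversized partition. Use this when you need a specific number of partitions, when you want to balance an uneven dataset, or when you want to hash-partition by a key column for a downstream join or groupBy.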
coalesce(n) avoids a full shuffle by merging existing partitions. It is a narrow transformation: each output partition is built from a group of whole input partitions, and Spark tries to group partitions that already sit on the same executor to keep data movement low. The constraint: coalesce() can only reduce partition count, not increase it; asking for more partitions than currently exist leaves the count unchanged. And the output partitions may be unbalanced if the input partitions were unbalanced, because you're just merging whatever was there.
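To make the contrast concrete, here is a toy model in plain Python, not Spark code. Partitions are lists of rows, and the helper names `repartition` and `coalesce` are hypothetical stand-ins. The sketch assumes a simple round-robin deal for the shuffle and merges adjacent partitions for the coalesce, which only approximates what Spark does internally.

```python
from itertools import chain


def repartition(partitions, n):
    """Toy full shuffle: deal every row round-robin into n new partitions."""
    rows = list(chain.from_iterable(partitions))
    return [rows[i::n] for i in range(n)]


def coalesce(partitions, n):
    """Toy coalesce: merge adjacent partitions into n groups, never splitting one."""
    if n >= len(partitions):
        return [list(p) for p in partitions]  # cannot increase the count
    group_size, extra = divmod(len(partitions), n)
    merged, start = [], 0
    for i in range(n):
        end = start + group_size + (1 if i < extra else 0)
        merged.append(list(chain.from_iterable(partitions[start:end])))
        start = end
    return merged


# Skewed input: one large partition and three tiny ones (rows are just integers).
skewed = [list(range(0, 97)), [97], [98], [99]]

print([len(p) for p in repartition(skewed, 2)])  # [50, 50]
print([len(p) for p in coalesce(skewed, 2)])     # [98, 2]
print(len(coalesce(skewed, 8)))                  # 4 (count is not increased)
```

The shuffle touches every row but evens out the sizes. The merge moves nothing between groups and inherits the skew.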
When to Use Each
Use repartition() when:
- You need to increase partition count (e.g., after reading a small CSV that produced too few partitions)
- You want balanced partition sizes (e.g., before writing output with a target file size)
- You want to partition by a specific key column (pre-partitioning for a downstream join)
- Your data is severely skewed and you need to redistribute evenly
from pyspark.sql import functions as F
# Hash partition on user_id, so a downstream groupBy("user_id") can reuse this partitioning instead of shuffling again
df_repartitioned = df.repartition(32, F.col("user_id"))
# Increase partition count for a small input with too few partitions
df_small = spark.read.parquet("path/to/small/parquet/") # suppose this produces 4 partitions
df_expanded = df_small.repartition(32) # now 32 partitions, so more of the cluster can work on it
Use coalesce() when:
- You're about to write output and want fewer, larger files
- You've done heavy filtering that left many nearly-empty partitions
- You want to reduce partition count without the cost of a full shuffle
# After heavy filtering, many partitions are nearly empty
# coalesce before writing avoids many tiny output files
# (with no shuffle in between, the filter stage itself also runs in only 4 tasks)
heavy_filter_result = df.filter(F.col("status") == "critical")
heavy_filter_result.coalesce(4).write.format("delta").mode("overwrite").save("path/output/")
# Writing a small reference table -- no need for 200 partitions
lookup_df.coalesce(1).write.format("parquet").save("path/lookup/")
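The two checklists above can be condensed into a rule of thumb. The sketch below is a hypothetical helper (`pick_partition_op` is not a Spark API) that returns the suggested call as a string. It only encodes the guidance in this section and assumes you already know the current and target partition counts.

```python
def pick_partition_op(current, target, need_balance=False, key=None):
    """Rule-of-thumb choice between repartition() and coalesce()."""
    if key is not None:
        # Partitioning by a key column always requires a shuffle.
        return f"repartition({target}, {key!r})"
    if target > current:
        # coalesce() cannot increase the partition count.
        return f"repartition({target})"
    if need_balance:
        # Only a shuffle evens out skewed partitions.
        return f"repartition({target})"
    if target < current:
        # Fewer partitions, balance not critical: merge without a full shuffle.
        return f"coalesce({target})"
    return "no change"


print(pick_partition_op(4, 32))                       # repartition(32)
print(pick_partition_op(200, 4))                      # coalesce(4)
print(pick_partition_op(200, 32, key="user_id"))      # repartition(32, 'user_id')
print(pick_partition_op(200, 10, need_balance=True))  # repartition(10)
```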
The Output File Count Implication
When you write a DataFrame, each partition typically becomes one output file. 200 partitions = 200 files. For a 5GB aggregated result table, 200 files averaging 25MB each is poor for downstream readers, because every file adds its own open and metadata overhead. coalesce(10) before writing gives you 10 files averaging 500MB, which is much better for read performance.
# Common pattern: transform, coalesce for output efficiency, write
result = df.groupBy("region", "month").agg(F.sum("revenue").alias("revenue"))
result.coalesce(4).write.format("delta").mode("overwrite").saveAsTable("analytics.regional_monthly")
For large tables, OPTIMIZE handles file compaction after the fact. For small-to-medium result sets that you control, coalesce before writing is simpler and avoids the extra maintenance step.
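The file-size arithmetic in this section can be written down as a small helper. This is a sketch in plain Python. The function name is hypothetical, and it assumes you have a reasonable estimate of the output size, which on disk depends on format and compression.

```python
import math


def partitions_for_target_file_size(total_bytes, target_file_bytes):
    """Partition count so each output file is at most about target_file_bytes."""
    if total_bytes <= 0:
        return 1
    return max(1, math.ceil(total_bytes / target_file_bytes))


MB = 1000 ** 2
GB = 1000 ** 3

# The 5 GB example from the text, in decimal units to match its arithmetic.
print(5 * GB // 200 // MB)                                # 25 (MB per file at 200 partitions)
print(partitions_for_target_file_size(5 * GB, 500 * MB))  # 10
```

The result is the number you would pass to coalesce(n), or to repartition(n) if the partitions also need balancing.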
One Trap: coalesce Before a Shuffle
# Don't do this
df.coalesce(4).groupBy("category").agg(F.sum("amount"))
Coalescing down to 4 partitions before a groupBy means only 4 tasks read the input and do the map-side partial aggregation, and the shuffle still produces spark.sql.shuffle.partitions output partitions (200 by default, possibly fewer if adaptive query execution merges them). The coalesce did nothing to reduce the final partition count, and it throttled the stage that feeds the shuffle. Coalesce after the last shuffle in your job, not before.
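A back-of-envelope model of the task counts shows why the order matters. This is plain Python, not Spark. The function name and the 64-partition input are hypothetical, and the model ignores adaptive query execution.

```python
def stage_task_counts(input_partitions, coalesce_to, shuffle_partitions=200,
                      coalesce_first=True):
    """Rough task counts for one coalesce combined with one groupBy."""
    if coalesce_first:
        # df.coalesce(n).groupBy(...): the scan and map-side aggregation run
        # in n tasks, and the shuffle still emits shuffle_partitions partitions.
        map_tasks = min(input_partitions, coalesce_to)
        reduce_tasks = shuffle_partitions
    else:
        # df.groupBy(...).agg(...).coalesce(n): the map side keeps its full
        # parallelism, and the post-shuffle stage is narrowed to n tasks.
        map_tasks = input_partitions
        reduce_tasks = min(shuffle_partitions, coalesce_to)
    return {"map_tasks": map_tasks, "reduce_tasks": reduce_tasks,
            "output_partitions": reduce_tasks}


before = stage_task_counts(64, 4, coalesce_first=True)
after = stage_task_counts(64, 4, coalesce_first=False)
print(before)  # {'map_tasks': 4, 'reduce_tasks': 200, 'output_partitions': 200}
print(after)   # {'map_tasks': 64, 'reduce_tasks': 4, 'output_partitions': 4}
```

Coalescing first gives the worst of both: a narrow input stage and 200 output partitions. Coalescing last keeps the input stage wide and yields the 4 partitions you wanted.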
repartitionByRange
# Range partition instead of hash partition -- useful for ordered output
df.repartitionByRange(10, F.col("order_date"))
# Each partition contains a contiguous range of order_date values
For writing Delta tables where you want each output file to contain a contiguous date range, repartitionByRange is cleaner than hashing. Spark estimates the range boundaries by sampling the data, so the partitions are approximately rather than exactly equal in size. Because each file then covers a narrow, mostly non-overlapping range of order_date, its min/max statistics let date-range queries skip the files that cannot match.
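The effect on file skipping can be illustrated with a toy model in plain Python. The helpers below are hypothetical and only approximate Spark's behavior. They use Python's built-in hash rather than Spark's hash function, and exact quantiles rather than sampled boundaries. Dates are encoded as day numbers for simplicity.

```python
from bisect import bisect_right


def hash_partition(values, n):
    """Toy hash partitioning: place each value by hash(value) % n."""
    parts = [[] for _ in range(n)]
    for v in values:
        parts[hash(v) % n].append(v)
    return parts


def range_partition(values, n):
    """Toy range partitioning: split the sorted key space into n contiguous ranges."""
    ordered = sorted(values)
    # Boundaries taken at even quantiles of the sorted values.
    bounds = [ordered[(i * len(ordered)) // n] for i in range(1, n)]
    parts = [[] for _ in range(n)]
    for v in values:
        parts[bisect_right(bounds, v)].append(v)
    return parts


def files_to_scan(parts, lo, hi):
    """Count partitions whose [min, max] overlaps the query range [lo, hi]."""
    return sum(1 for p in parts if p and min(p) <= hi and max(p) >= lo)


# Hypothetical data: order_date encoded as day numbers 0..99, one row per day.
days = list(range(100))

print(files_to_scan(hash_partition(days, 10), 20, 29))   # 10
print(files_to_scan(range_partition(days, 10), 20, 29))  # 1
```

With hashing, every partition spans almost the full date range, so min/max statistics rule nothing out. With range partitioning, a ten-day query overlaps a single partition.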