Repartition vs. Coalesce in Spark: When to Use Each and Why the Difference Matters

Both repartition() and coalesce() change the number of partitions in a Spark DataFrame. They are not interchangeable. Picking the wrong one doesn't cause an error — it just makes your job slower or produces unbalanced output files. Here's the distinction.

The Difference

repartition(n) triggers a full shuffle. Every row in the DataFrame is sent through the shuffle mechanism (hash partitioned by default) and redistributed across exactly n output partitions. The output partitions are roughly equal in size regardless of how skewed the input was. Use this when you need a specific number of partitions, when you want to balance an uneven dataset, or when you want to hash-partition by a key column for a downstream join or groupBy.

coalesce(n) avoids a full shuffle by merging existing partitions locally. Spark combines adjacent partitions on the same executor, reducing partition count without network I/O. The constraint: coalesce() can only reduce partition count, not increase it. And the output partitions may be unbalanced if the input partitions were unbalanced — you're just merging whatever was there.

When to Use Each

Use repartition() when:

  • You need to increase partition count (e.g., after reading a small CSV that produced too few partitions)
  • You want balanced partition sizes (e.g., before writing output with a target file size)
  • You want to partition by a specific key column (pre-partitioning for a downstream join)
  • Your data is severely skewed and you need to redistribute evenly
# Hash partition on user_id — downstream groupBy(user_id) won't need to shuffle again
df_repartitioned = df.repartition(32, F.col("user_id"))

# Increase partition count for a small input with too few partitions
df_small = spark.read.parquet("path/to/small/parquet/")  # produces 4 partitions
df_expanded = df_small.repartition(32)  # now 32 partitions, uses full cluster

Use coalesce() when:

  • You're about to write output and want fewer, larger files
  • You've done heavy filtering that left many nearly-empty partitions
  • You want to reduce partition count without the cost of a full shuffle
# After heavy filtering, many partitions are nearly empty
# coalesce before writing avoids many tiny output files
heavy_filter_result = df.filter(F.col("status") == "critical")
heavy_filter_result.coalesce(4).write.format("delta").mode("overwrite").save("path/output/")

# Writing a small reference table -- no need for 200 partitions
lookup_df.coalesce(1).write.format("parquet").save("path/lookup/")

The Output File Count Implication

When you write a DataFrame, each partition becomes one output file. 200 partitions = 200 files. For a 5GB aggregated result table, 200 files averaging 25MB each is poor for downstream readers. coalesce(10) before writing gives you 10 files averaging 500MB — much better for read performance.

# Common pattern: transform, coalesce for output efficiency, write
result = df.groupBy("region", "month")            .agg(F.sum("revenue").alias("revenue"))
result.coalesce(4).write.format("delta").mode("overwrite").saveAsTable("analytics.regional_monthly")

For large tables, OPTIMIZE handles file compaction after the fact. For small-to-medium result sets that you control, coalesce before writing is simpler and avoids the extra maintenance step.

One Trap: coalesce Before a Shuffle

# Don't do this
df.coalesce(4).groupBy("category").agg(F.sum("amount"))

Coalescing down to 4 partitions before a groupBy means only 4 partitions go into the shuffle, and the default 200 shuffle output partitions come out. You've done a local coalesce that saved nothing and limited the amount of parallelism available to the groupBy stage. Coalesce after the last shuffle in your job, not before.

repartitionByRange

# Range partition instead of hash partition -- useful for ordered output
df.repartitionByRange(10, F.col("order_date"))
# Each partition contains a contiguous range of order_date values

For writing partitioned Delta tables where you want each output file to contain a contiguous date range (for time-travel-friendly compaction), repartitionByRange is cleaner than hashing. The output files' min/max statistics will accurately reflect the date ranges in each file, which helps predicate pushdown on date range queries.

Read more