Delta Lake tables accumulate small files fast. Every streaming write that commits a micro-batch produces files. Every append job that runs hourly produces files. Every MERGE that touches 1,000 rows produces files. After a few weeks of steady writes, you have a table with thousands of tiny Parquet files, and queries are paying full file-open overhead for each one. OPTIMIZE fixes this.
Why Small Files Hurt Performance
When Spark reads a Delta table, it opens each file as a separate Spark task. A table with 5,000 files means 5,000 tasks on read, each incurring file metadata API calls, file open overhead, and task scheduling cost. If those files average 5MB each, you have 25GB of data split across 5,000 tasks. A table with 250 files averaging 100MB each reads the same 25GB in 250 tasks — less overhead, better performance.
The Hive Metastore tracks partition metadata but not file counts within partitions. The file count problem can grow for months before anyone notices it in query performance — especially on tables with many small incremental writes.
Running OPTIMIZE
-- Compact all files in the table to ~1GB target file size
OPTIMIZE analytics.orders;
-- Or target a specific partition
OPTIMIZE analytics.orders WHERE order_date >= '2020-11-01';
# PySpark equivalent (Databricks Runtime 7+)
spark.sql("OPTIMIZE analytics.orders")
# Delta Python API (also available)
from delta.tables import DeltaTable
delta_table = DeltaTable.forName(spark, "analytics.orders")
delta_table.optimize().executeCompaction()
OPTIMIZE reads all the small files in the target partition, merges them into larger files (targeting ~1GB per file), writes the new larger files, and updates the transaction log. The old small files are retained for time travel until VACUUM removes them.
Z-Ordering: Clustering by Query Patterns
Z-ordering goes further than just compaction — it co-locates related data within each file, making predicate pushdown more effective:
-- Compact AND Z-order by frequently filtered columns
OPTIMIZE analytics.orders ZORDER BY (customer_id, order_date);
Z-ordering sorts data along a space-filling curve that preserves locality across multiple dimensions. When rows with similar customer_id and order_date values land in the same files, Spark can skip entire files based on the min/max statistics for those columns — far more files get skipped than with random ordering.
The T-SQL analogy: if you've ever used clustered indexes on a datetime column specifically to enable range scan performance, Z-ordering achieves the same goal — co-locating data with similar key values — but in a file-format without fixed indexes. Delta statistics act like the clustered index's b-tree statistics, letting the engine skip irrelevant data blocks.
When to Run OPTIMIZE
OPTIMIZE is a maintenance operation, not a real-time one. Common scheduling patterns:
- After large batch loads: run OPTIMIZE on the written partitions at the end of the pipeline
- Nightly maintenance job: run OPTIMIZE on the previous day's partition before business-hours query load peaks
- After streaming completes a backfill: streaming writes many small files; OPTIMIZE after backfill is essential
# Practical pipeline pattern: write, then optimize the written partition
df.write.format("delta") .mode("append") .partitionBy("order_date") .saveAsTable("analytics.orders")
# Optimize only today's partition, not the whole table
today = spark.sql("SELECT CURRENT_DATE()").collect()[0][0]
spark.sql(f"OPTIMIZE analytics.orders WHERE order_date = '{today}'")
VACUUM: Removing Old Files
OPTIMIZE writes new larger files but doesn't delete the old small ones — those are preserved for time travel. VACUUM removes files that are no longer referenced in the current version and older than the retention threshold:
-- Default 7-day retention
VACUUM analytics.orders;
-- Explicit retention
VACUUM analytics.orders RETAIN 168 HOURS; -- 7 days
-- DRY RUN: see what would be deleted without deleting it
VACUUM analytics.orders DRY RUN;
Setting retention below 7 days requires setting spark.databricks.delta.retentionDurationCheck.enabled to false — Databricks warns you because doing so prevents time travel and concurrent read isolation during open transactions.
For production Delta tables: run OPTIMIZE nightly (or after each major write batch), run VACUUM weekly. If disk costs are a concern and you don't need time travel beyond 48 hours, a shorter retention is fine — but consider the operational tradeoff of losing the ability to roll back to yesterday's version when something goes wrong.