Delta Lake MERGE: The UPSERT Pattern You Already Know, Now Running Distributed

If you've written ETL in SQL Server, you've written MERGE. It's the standard pattern for incremental loads: insert new rows, update existing rows, optionally delete rows that disappeared from the source. Delta Lake supports the same MERGE semantics, running distributed across your cluster. Here's the translation.

Basic MERGE Syntax

-- SQL Server MERGE (for reference)
MERGE INTO dbo.customers AS target
USING staging.customer_updates AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
    UPDATE SET target.name = source.name, target.updated_at = source.updated_at
WHEN NOT MATCHED BY TARGET THEN
    INSERT (customer_id, name, region, created_at, updated_at)
    VALUES (source.customer_id, source.name, source.region, GETDATE(), source.updated_at);

-- Delta Lake MERGE (Spark SQL)
MERGE INTO analytics.customers AS target
USING staging.customer_updates AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
    UPDATE SET
        target.name = source.name,
        target.updated_at = source.updated_at
WHEN NOT MATCHED THEN
    INSERT (customer_id, name, region, created_at, updated_at)
    VALUES (source.customer_id, source.name, source.region, CURRENT_TIMESTAMP(), source.updated_at);

Nearly identical syntax. The main differences: GETDATE() becomes CURRENT_TIMESTAMP(), WHEN NOT MATCHED BY TARGET shortens to WHEN NOT MATCHED, and the target must be a Delta table (not plain Parquet).

MERGE with DELETE

-- Handle deletes using a flag in the source
MERGE INTO analytics.customers AS target
USING staging.customer_changes AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND source.is_deleted = true THEN
    DELETE
WHEN MATCHED AND source.is_deleted = false THEN
    UPDATE SET
        target.name = source.name,
        target.email = source.email,
        target.updated_at = source.updated_at
WHEN NOT MATCHED AND source.is_deleted = false THEN
    INSERT (customer_id, name, email, region, created_at, updated_at)
    VALUES (source.customer_id, source.name, source.email, source.region,
            CURRENT_TIMESTAMP(), source.updated_at);

Python API for MERGE

from delta.tables import DeltaTable

target_table = DeltaTable.forName(spark, "analytics.customers")
source_df = spark.table("staging.customer_updates")

target_table.alias("target").merge(
    source_df.alias("source"),
    "target.customer_id = source.customer_id"
).whenMatchedUpdate(set={
    "name": "source.name",
    "email": "source.email",
    "updated_at": "source.updated_at"
}).whenNotMatchedInsert(values={
    "customer_id": "source.customer_id",
    "name": "source.name",
    "email": "source.email",
    "region": "source.region",
    "created_at": "current_timestamp()",
    "updated_at": "source.updated_at"
}).execute()

The Python API is chainable and more testable than inline SQL strings. For complex MERGE logic with conditional updates, the Python API is easier to read and maintain.

How Delta MERGE Works Under the Hood

Delta MERGE finds the files that contain matching rows (using file-level statistics to skip irrelevant files), rewrites those files with the updated rows incorporated, and adds new files for the inserted rows. The transaction log records all added and removed files atomically.

The performance implication: if your target table has 1,000 files but MERGE only touches rows in 50 of them, Delta only rewrites those 50 files. Non-matching files are not touched. This is significantly better than a "delete-and-reinsert" pattern that would rewrite the entire table.

MERGE Performance Tips

Partition your target by a column that's also in the MERGE condition. If your customers table is partitioned by region and your MERGE condition includes region, Delta can limit its file scan to matching partitions only.

Compact before a large MERGE. Many small files mean many file rewrites. Running OPTIMIZE before a large MERGE reduces the number of files that need to be rewritten.

Filter the source before MERGE. Only include rows in the source DataFrame that actually need to change. A source with 1,000 changed rows on a 100M-row target is much faster than a source with 10M rows where only 1,000 are different.

-- Only merge rows from the last 7 days of changes
MERGE INTO analytics.orders AS target
USING (
    SELECT * FROM staging.order_updates
    WHERE updated_at >= DATE_SUB(CURRENT_DATE(), 7)
) AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...;

Delta Lake MERGE covers 90% of the incremental load patterns I used to write with SQL Server MERGE. The syntax translates directly, the performance scales with the cluster instead of the single server, and the transaction log gives you the audit trail of every change.

Delta Lake MERGE: The UPSERT Pattern You Already Know, Now Running Distributed

Shannon Lowder

Basic MERGE Syntax

MERGE with DELETE

Python API for MERGE

How Delta MERGE Works Under the Hood

MERGE Performance Tips

Read more

The Context Problem Neither Agent Mesh Nor OpenSharing Solves

Unity AI Gateway and What a Governed Model Access Layer Actually Buys You

You Don't Need Fable. You Need a Router.

DAIS 2026: Genie One and the Context Problem Databricks Is Solving