PySpark DataFrames vs. SQL in Databricks: Which to Use When

Databricks gives you a genuine choice between PySpark DataFrames and SQL for almost every operation. Both approaches produce identical Spark execution plans — there's no performance difference. The choice is about readability, team skill set, testability, and context. Here's how I decide.

They Produce the Same Plan

First, let's put the performance question to rest:

# These produce identical physical execution plans
df_result_1 = df.filter(F.col("region") == "West")                 .groupBy("customer_id")                 .agg(F.sum("revenue").alias("total_revenue"))                 .orderBy(F.col("total_revenue").desc())

df_result_2 = spark.sql("""
    SELECT customer_id, SUM(revenue) AS total_revenue
    FROM orders
    WHERE region = 'West'
    GROUP BY customer_id
    ORDER BY total_revenue DESC
""")

df_result_1.explain() == df_result_2.explain()  # effectively true

The Catalyst optimizer sees both representations and produces the same physical plan. Choose based on what makes the code better — not performance.

When SQL Wins

Ad-hoc exploration and analysis. When you're exploring data interactively, SQL is faster to write. %sql SELECT * FROM analytics.orders LIMIT 100 is two keystrokes and universal muscle memory. The DataFrame equivalent requires import statements, method chaining, and .show().

Complex joins and subqueries that read like SQL. Some queries are easier to follow in SQL:

%sql
WITH recent_orders AS (
    SELECT customer_id, MAX(order_date) AS last_order
    FROM analytics.orders
    GROUP BY customer_id
),
high_value AS (
    SELECT customer_id, SUM(revenue) AS ltv
    FROM analytics.orders
    GROUP BY customer_id
    HAVING SUM(revenue) > 50000
)
SELECT c.name, r.last_order, h.ltv
FROM analytics.customers c
JOIN recent_orders r ON c.customer_id = r.customer_id
JOIN high_value h ON c.customer_id = h.customer_id
ORDER BY h.ltv DESC;

Teams with SQL expertise who don't know Python. If your data analysts know SQL but not Python, SQL cells let them query Delta tables directly without learning PySpark. Shared notebooks can have Python cells for complex transformations and SQL cells for the analysis layer.

DDL and DML operations. CREATE TABLE, MERGE, INSERT INTO, UPDATE, DELETE — these are more natural in SQL. You can call them via spark.sql() from Python code, but the SQL string is usually cleaner than the equivalent Python API calls.

When DataFrames Win

Complex multi-step transformations that benefit from intermediate variables. DataFrame API makes it easy to build and inspect intermediate results without registering temp views:

base = spark.read.format("delta").load("path/to/raw/")
cleaned = base.dropDuplicates(["event_id"]).filter(F.col("amount") > 0)
enriched = cleaned.join(F.broadcast(product_lookup), on="product_id", how="left")
aggregated = enriched.groupBy("customer_id", "product_category")                      .agg(F.sum("amount").alias("spend"))

Each variable is inspectable with .show(), .count(), or .printSchema(). In SQL, you'd need CTEs or temp views for each intermediate step.

Python interoperability. When the transformation involves Python libraries (numpy, scikit-learn, custom business logic), DataFrames are the integration point. You can't call a pandas function from a SQL cell.

Testability. Python functions that accept and return DataFrames are unit-testable with pytest and a small SparkSession fixture. SQL strings are harder to test in isolation.

Dynamic construction. Building queries dynamically based on configuration is cleaner in Python than string-interpolated SQL:

def filter_by_regions(df, regions):
    return df.filter(F.col("region").isin(regions))

def apply_filters(df, config):
    if config.get("min_revenue"):
        df = df.filter(F.col("revenue") >= config["min_revenue"])
    if config.get("regions"):
        df = filter_by_regions(df, config["regions"])
    return df

The Hybrid Pattern

In practice, most production notebooks mix both. Python handles the orchestration, data loading, and complex transformations. SQL handles the catalog operations, schema queries, and analytical queries that read clearly in SQL syntax. The transition is seamless:

# Python: load and transform
cleaned = (
    spark.read.format("delta").load("path/to/raw/")
         .dropDuplicates(["event_id"])
         .filter(F.col("event_date") >= start_date)
)
cleaned.createOrReplaceTempView("cleaned_events")

# SQL: analytical query on the result
result = spark.sql("""
    SELECT
        product_category,
        COUNT(DISTINCT customer_id) AS unique_buyers,
        SUM(amount) AS total_revenue,
        AVG(amount) AS avg_order_value
    FROM cleaned_events
    GROUP BY product_category
    ORDER BY total_revenue DESC
""")

# Python: write result
result.write.format("delta").mode("overwrite").saveAsTable("analytics.category_summary")

The cleanup work (deduplication, date filtering) is Python because it's procedural. The aggregation is SQL because GROUP BY with multiple aggregates is cleaner in SQL. The write is Python because it chains naturally from the Python DataFrame.

There's no wrong answer here — there's just the tool that makes the code clearest for the person who has to maintain it six months from now.

Read more