PySpark is the right tool for complex transformations, multi-step pipelines, and anything that benefits from the full DataFrame API. But if your goal is "run a SQL query against a Delta table and see the results," you don't need Python at all. Databricks SQL gives you a first-class SQL interface directly on your data lake.
Two Ways to Write SQL in Databricks
In notebooks: You can write SQL directly in a SQL cell (change the cell language with %sql at the top of the cell). The results render as an interactive table, just like running a SELECT in SSMS.
%sql
SELECT
region,
COUNT(*) AS order_count,
SUM(total_amount) AS revenue
FROM analytics.orders
WHERE order_date >= '2020-01-01'
GROUP BY region
ORDER BY revenue DESC;
Everything you've learned about T-SQL GROUP BY, window functions, subqueries, CTEs, and joins works here. Spark SQL is a superset of ANSI SQL, and most T-SQL syntax translates directly.
Via Databricks SQL (SQL Analytics, launched 2020): A dedicated SQL interface within the Databricks workspace with its own SQL warehouses (previously called SQL endpoints). Designed for BI analysts and data consumers who need SQL access without managing notebooks or clusters. Connected directly to dashboards and BI tools via JDBC/ODBC.
CTEs Work
%sql
WITH monthly_revenue AS (
SELECT
DATE_TRUNC('month', order_date) AS month,
region,
SUM(total_amount) AS revenue
FROM analytics.orders
WHERE order_date >= '2020-01-01'
GROUP BY 1, 2
),
ranked AS (
SELECT *,
RANK() OVER (PARTITION BY month ORDER BY revenue DESC) AS revenue_rank
FROM monthly_revenue
)
SELECT *
FROM ranked
WHERE revenue_rank <= 3;
Creating Tables from SELECT
%sql
-- Create a new Delta table from query results
CREATE TABLE analytics.monthly_summary
USING DELTA
AS
SELECT
DATE_TRUNC('month', order_date) AS month,
region,
COUNT(*) AS order_count,
SUM(total_amount) AS revenue
FROM analytics.orders
GROUP BY 1, 2;
-- Or create a view
CREATE OR REPLACE VIEW analytics.v_active_customers AS
SELECT customer_id, name, region
FROM analytics.customers
WHERE last_order_date >= DATE_SUB(CURRENT_DATE(), 365);
DML Operations on Delta Tables
%sql
-- UPDATE (only works on Delta tables, not plain Parquet)
UPDATE analytics.customers
SET region = 'Pacific Northwest'
WHERE region = 'Northwest' AND country = 'US';
-- DELETE
DELETE FROM analytics.customers
WHERE status = 'inactive' AND last_order_date < '2019-01-01';
-- INSERT INTO
INSERT INTO analytics.orders
SELECT * FROM staging.new_orders WHERE load_date = CURRENT_DATE();
UPDATE and DELETE on Parquet files would require rewriting entire partitions manually. On Delta tables, these are first-class operations handled by the transaction log — Delta writes new versions of affected files and records the change in the log.
MERGE: The T-SQL Developer's Favorite
%sql
MERGE INTO analytics.customers AS target
USING staging.customer_updates AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND source.is_deleted = true THEN
DELETE
WHEN MATCHED THEN
UPDATE SET
target.name = source.name,
target.email = source.email,
target.updated_date = source.updated_date
WHEN NOT MATCHED THEN
INSERT (customer_id, name, email, region, created_date, updated_date)
VALUES (source.customer_id, source.name, source.email, source.region, CURRENT_DATE(), source.updated_date);
This is the UPSERT pattern — insert if new, update if exists, delete if flagged. The same pattern you'd write in T-SQL MERGE, running distributed across your Delta Lake. On a 100M-row customers table with a 10k-row daily delta, this runs in seconds rather than the minutes a row-by-row SQL Server MERGE might take on undersized hardware.
Describing Tables and Querying the Catalog
%sql
-- Table schema
DESCRIBE TABLE analytics.orders;
-- Full metadata including storage location, format, partition info
DESCRIBE TABLE EXTENDED analytics.orders;
-- Delta-specific: history of all operations on this table
DESCRIBE HISTORY analytics.orders;
-- List tables in a database
SHOW TABLES IN analytics;
-- Table statistics (if computed)
DESCRIBE TABLE EXTENDED analytics.orders;
-- Look for: Statistics section
Interoperability Between SQL Cells and Python Cells
# In a Python cell, register a Spark DataFrame as a temp view
orders_2020 = df.filter(F.col("order_date").between("2020-01-01", "2020-12-31"))
orders_2020.createOrReplaceTempView("orders_2020")
# In the next cell (SQL), query the temp view
%sql
SELECT region, SUM(total_amount) AS revenue
FROM orders_2020
GROUP BY region;
Temp views live for the session. If you restart the cluster or start a new session, they're gone. For persistent access, register as a table in the metastore via saveAsTable() or CREATE TABLE AS SELECT.
For SQL-heavy workloads, this SQL-first approach in Databricks notebooks is genuinely productive. You get the full power of Spark's distributed execution, Delta Lake's ACID guarantees, and a SQL interface that feels like home if you came from a relational background.