2020 Data Engineering Retrospective: The Year I Made the Full Leap to Spark

A year ago I was a SQL Server and Azure SQL Data Warehouse person who had just watched Microsoft slowly back away from Azure Data Lake Analytics. Now I'm writing PySpark full-time, managing Delta Lake tables, and genuinely excited about data engineering in a way I haven't been since I first learned about SQL Server PDW.

This isn't a tools-list post. This is what actually changed about how I work.

The Mental Model Shift That Took the Longest

Distributed computing requires you to think about where computation happens in a way that single-node SQL never did. In SQL Server, the query optimizer figures out the execution plan and you trust it. You can influence it with hints and statistics updates, but fundamentally the machine is one machine and the optimizer is one optimizer.

In Spark, the physical cluster topology matters. How many executors? How many cores per executor? How many partitions is this DataFrame split into? Are my joins shuffling or broadcasting? These aren't footnotes — they're the first-order questions that determine whether a job takes 5 minutes or 50 minutes. Learning to read execution plans and Spark UI metrics was more important than learning the Python API.
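To make those questions concrete, here is a back-of-the-envelope sketch in plain Python (not Spark itself). The function names are hypothetical. The two default constants mirror Spark's documented defaults for `spark.sql.files.maxPartitionBytes` (128 MB) and `spark.sql.autoBroadcastJoinThreshold` (10 MB). The real planner considers more than this, so treat the results as rough estimates only.

```python
import math

# Assumed inputs for illustration; the sizes used with these helpers are hypothetical.
DEFAULT_TARGET_PARTITION_BYTES = 128 * 1024 * 1024   # mirrors spark.sql.files.maxPartitionBytes default
DEFAULT_BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024  # mirrors spark.sql.autoBroadcastJoinThreshold default


def estimate_partitions(data_bytes, target_partition_bytes=DEFAULT_TARGET_PARTITION_BYTES):
    """Partitions needed so each one holds roughly target_partition_bytes."""
    return max(1, math.ceil(data_bytes / target_partition_bytes))


def task_waves(num_partitions, executors, cores_per_executor):
    """Rounds of tasks a stage needs, assuming one task per core at a time."""
    slots = executors * cores_per_executor
    return math.ceil(num_partitions / slots)


def join_strategy(smaller_side_bytes, threshold_bytes=DEFAULT_BROADCAST_THRESHOLD_BYTES):
    """Crude stand-in for the planner: broadcast small tables, shuffle the rest."""
    return "broadcast" if smaller_side_bytes <= threshold_bytes else "shuffle"
```

For a hypothetical 100 GiB input, `estimate_partitions(100 * 1024**3)` gives 800 partitions, and `task_waves(800, executors=10, cores_per_executor=4)` gives 20 waves. Doubling the executors halves the waves, which is the kind of lever that separates the 5-minute job from the 50-minute one.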

The other shift: lazy evaluation forces you to think about your data pipeline as a DAG (Directed Acyclic Graph) of operations, not a sequence of statements. In T-SQL, each statement executes immediately. In Spark, transformations accumulate until an action fires, and Spark optimizes the whole DAG before running any of it. This is actually more powerful once you internalize it, but the first few weeks of debugging errors that surface at action-time rather than transformation-time were disorienting.
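The action-time surprise is easy to reproduce without a cluster. The sketch below is a toy stand-in written in plain Python, not Spark's API: `LazyPipeline` is a hypothetical class that records transformations and only runs them when `collect()` is called. Unlike Spark, it does no optimization of the recorded steps. It only shows where the error surfaces.

```python
class LazyPipeline:
    """Toy lazily evaluated dataset: records steps, runs them on collect()."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: only records the step; fn is not called here.
        return LazyPipeline(self._rows, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also just recorded.
        return LazyPipeline(self._rows, self._steps + (("filter", predicate),))

    def collect(self):
        # Action: every recorded step runs now, in the order it was added.
        rows = list(self._rows)
        for kind, fn in self._steps:
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        return rows


pipeline = LazyPipeline([4, 2, 0]).map(lambda x: 10 / x)  # bug: divides by zero, no error yet
filtered = pipeline.filter(lambda x: x > 3)               # still no error

try:
    filtered.collect()                                    # the error surfaces here
    failed_at_action = False
except ZeroDivisionError:
    failed_at_action = True
```

The buggy lambda is defined on the first pipeline line, but the exception is raised two lines later at `collect()`. That gap between where a bug is written and where it is reported is what made the early debugging sessions disorienting.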

What Actually Carried Over From SQL

More than I expected. Spark SQL is a real SQL dialect — CTEs, window functions, subqueries, MERGE, DML operations on Delta tables. If you know T-SQL well, Spark SQL is 80% familiar. The 20% that isn't: no stored procedures (you write functions in Python or Scala), no IDENTITY columns (generate UUIDs or monotonically increasing IDs explicitly), slightly different string and date function names.
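One thing worth knowing about the "monotonically increasing IDs" option: they are unique and increasing, but not consecutive the way IDENTITY values are. The plain-Python sketch below mimics the bit layout described in the Spark documentation for `monotonically_increasing_id` (partition ID in the upper bits, row position in the lower 33 bits). The helper names are hypothetical, and this is a model of the layout, not a call into Spark.

```python
import uuid

RECORD_BITS = 33  # lower 33 bits hold the row's position within its partition


def monotonic_id(partition_id, row_in_partition):
    """Mimic the documented layout: partition ID in the upper bits, row number below."""
    return (partition_id << RECORD_BITS) + row_in_partition


def ids_for_partitions(rows_per_partition):
    """IDs for a dataset described as a list of per-partition row counts."""
    return [
        monotonic_id(p, r)
        for p, count in enumerate(rows_per_partition)
        for r in range(count)
    ]


ids = ids_for_partitions([3, 3])   # two partitions, three rows each
surrogate_key = str(uuid.uuid4())  # the other option: a random UUID per row
```

With two partitions of three rows each, `ids` is `[0, 1, 2, 8589934592, 8589934593, 8589934594]`. The jump at the partition boundary is expected, so downstream code should treat these as opaque keys rather than row counts.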

The relational thinking itself carries over completely. JOIN logic, GROUP BY design, query decomposition via CTEs, incremental load patterns — all of it. The implementation is distributed, but the logic is the same logic.

Delta Lake Was the Unlock

The thing that made Spark feel production-ready for me was Delta Lake. Plain Parquet files are powerful but have no ACID guarantees, no schema enforcement, no time travel. Failed jobs leave partial data. Schema drift corrupts downstream consumers silently. Recovering from a bad pipeline run means digging through file manifests.

Delta Lake added the operational guarantees I expected from a production data store. Now a failed write doesn't corrupt the table. A schema change that doesn't match the table definition raises an error instead of silently creating mistyped columns. Time travel means "roll back to yesterday's data" is a two-line operation, not a restore-from-backup event.
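The idea behind those guarantees can be sketched with a toy model in plain Python. `ToyDeltaTable` is a hypothetical class, not the Delta Lake API. It simplifies heavily: real Delta Lake records add and remove actions per commit in a transaction log, while this toy stores the full set of visible files per version. The point is only to show why a rejected write leaves no trace and why rollback is cheap.

```python
class ToyDeltaTable:
    """Toy versioned table: an append-only log of commits over immutable files."""

    def __init__(self, schema):
        self.schema = dict(schema)  # column name -> Python type
        self._files = {}            # file name -> rows (stands in for Parquet files)
        self._log = []              # entry N is the set of files visible at version N

    def _check_schema(self, rows):
        for row in rows:
            if set(row) != set(self.schema):
                raise ValueError(f"columns {sorted(row)} do not match {sorted(self.schema)}")
            for col, expected in self.schema.items():
                if not isinstance(row[col], expected):
                    raise ValueError(f"column {col!r} expects {expected.__name__}")

    def append(self, rows):
        rows = list(rows)
        self._check_schema(rows)             # reject before anything becomes visible
        name = f"part-{len(self._files):05d}"
        self._files[name] = rows             # write the data file first
        current = self._log[-1] if self._log else frozenset()
        self._log.append(current | {name})   # one commit step makes the file visible
        return len(self._log) - 1            # the new version number

    def read(self, version=None):
        if not self._log:
            return []
        visible = self._log[-1 if version is None else version]
        return [row for name in sorted(visible) for row in self._files[name]]

    def restore(self, version):
        self._log.append(self._log[version])  # rollback is just another commit
        return len(self._log) - 1


table = ToyDeltaTable({"id": int, "name": str})
v0 = table.append([{"id": 1, "name": "a"}])
v1 = table.append([{"id": 2, "name": "b"}])

try:
    table.append([{"id": "3", "name": "c"}])  # wrong type for id: rejected, nothing committed
    rejected = False
except ValueError:
    rejected = True

yesterday = table.read(version=v0)  # time travel: read an older version
v2 = table.restore(v0)              # roll the table back to that version
```

After the rejected append the table still holds exactly the two committed rows, and after `restore(v0)` a plain `read()` returns only the first row. Nothing is deleted along the way, so version 1 remains readable too.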

The Performance Ceiling Is Gone

The thing I most wanted from SQL Server PDW, scale-out analytics on a tier I could actually afford, is now reality. A query against 10 billion rows that would require a $400k appliance and months of capacity planning on-premises takes a few minutes on a Databricks cluster billed per compute-minute. Add more nodes for a large job, scale back down when it's done. Pay for what you used.

That's not a minor improvement. That's a fundamental change in what analytical queries are feasible to run as part of a normal workflow versus reserved for "we'll run that one quarterly when we have the budget."

What's Next

In 2021, I'm going deeper on the parts I've been treating as black boxes: Structured Streaming for near-real-time pipelines, MLflow for tracking model experiments alongside data pipeline runs, and Delta Lake internals (Z-ordering, compaction strategies, the transaction log under load). The foundation is solid now. Time to build on it.

If you're a SQL Server person who's been skeptical about making the leap to distributed computing, this was the year it got accessible enough to be worth it. The learning curve is real, but the ceiling you hit on single-node SQL is gone. That trade is worth it.

Shannon Lowder
