I ran my first production Mapping Data Flow about two years ago. It was an experiment -- I was watching performance, checking that it handled my edge cases, making sure the Spark execution model didn't do anything surprising at scale. I had a Databricks fallback ready if Data Flows didn't hold up.
I never used the fallback. In 2021, for most transformation work on Azure data, Mapping Data Flows are my first choice.
What Changed Between 2019 and Now
The GA release in 2019 had the core transformation set: derived column, filter, join, aggregate, sort, select, sink. It was functional but not comprehensive.
The 2020-2021 additions filled the meaningful gaps: the Flowlet pattern (reusable sub-transformation components you define once and reference across data flows), Delta Lake as a first-class source and sink with time travel support, the Parse and Stringify transformations for working with JSON and XML within a row, the Assert transformation for in-flight data quality validation, and the Rank transformation for rank and dense_rank patterns without writing a window function in Spark SQL.
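The ranking patterns mentioned above differ only in how they treat ties, which is easy to lose track of on a canvas. Here is a minimal plain-Python sketch of the semantics. It is not Data Flow script, and the function names and sample data are made up for illustration.

```python
def rank_rows(rows, key, dense=False):
    """Return (row, rank) pairs, ranking by key in descending order.

    dense=False leaves gaps after ties (1, 2, 2, 4).
    dense=True keeps ranks consecutive (1, 2, 2, 3).
    """
    ordered = sorted(rows, key=key, reverse=True)
    ranked = []
    previous_value = object()  # sentinel that never equals a real key value
    current_rank = 0
    for position, row in enumerate(ordered, start=1):
        value = key(row)
        if value != previous_value:
            # A new value starts a new rank: next integer if dense,
            # otherwise the row's position in the sorted order.
            current_rank = current_rank + 1 if dense else position
            previous_value = value
        ranked.append((row, current_rank))
    return ranked


def row_numbers(rows, key):
    """Return (row, n) pairs where n is unique per row, ties included."""
    ordered = sorted(rows, key=key, reverse=True)
    return [(row, n) for n, row in enumerate(ordered, start=1)]


# Hypothetical sample data: (region, amount), with a tie at 500.
sales = [("north", 500), ("south", 700), ("east", 500), ("west", 300)]


def by_amount(row):
    return row[1]


standard = [rank for _, rank in rank_rows(sales, by_amount)]
dense = [rank for _, rank in rank_rows(sales, by_amount, dense=True)]
numbered = [n for _, n in row_numbers(sales, by_amount)]
```

With the tie at 500, the standard rank skips a value after the tie, the dense rank does not, and the row number ignores ties entirely.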
The Optimize tab matured in parallel -- partition count, partition key selection, broadcast thresholds, cache sinks. The tuning surface went from "trust the defaults" to "here are the levers, here's what each does."
The SSIS Comparison, Finally Resolved
For the first three years of my ADF career, I compared Data Flows to SSIS Data Flow task and found them lacking in specific dimensions. I need to update that assessment.
Things ADF Data Flows now do better than SSIS Data Flow:
- Scale. A Data Flow processing 100M rows is a Spark job. SSIS at that scale requires careful buffer tuning, engine thread management, and probably a multi-server deployment. ADF handles it with a cluster size selection.
- Delta Lake. Native Delta source and sink with time travel support, merge/upsert semantics, schema evolution handling. SSIS has no native Delta capability.
- Flowlets. Reusable transformation components defined once and referenced across multiple data flows. SSIS has nothing equivalent -- you copy component configurations manually.
- Managed infrastructure. No Spark cluster to manage, no memory tuning for the Spark executor, no patching. ADF handles it.
Things SSIS still does better than ADF Data Flows:
- Row-level debugging. In Visual Studio, you can pause SSIS execution and inspect individual rows flowing through a component. ADF's data preview is useful but it's a static sample, not an interactive debugger.
- Synchronous script execution. SSIS Script Component runs .NET code synchronously in the data flow pipeline. ADF's equivalent for custom code is an Azure Function Activity or a Databricks notebook -- external calls that add latency and complexity for simple row-level operations.
For greenfield transformation work on Azure data in 2021, ADF Data Flows are my first recommendation. The SSIS advantages that remain are real, but they're specific enough that they're the exception rather than the rule.
The Delta Lake Integration
Delta Lake as a native source and sink is more significant than it might look. Delta gives you ACID transactions on the data lake, time travel (query the table as of a specific version or timestamp), schema evolution handling, and upsert and merge semantics -- all the things that make data lake storage behave more like a database.
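To make "time travel" concrete, here is a toy model in plain Python. It is only a sketch of the idea: real Delta tables record file-level changes in a transaction log rather than storing a full snapshot per commit, and the `VersionedTable` class and its method names are invented for this example.

```python
from datetime import datetime


class VersionedTable:
    """Toy model: each commit stores a full snapshot plus a timestamp."""

    def __init__(self):
        # List of (timestamp, snapshot); the list index is the version number.
        self._commits = []

    def commit(self, rows, timestamp):
        """Record a new version and return its version number."""
        self._commits.append((timestamp, [dict(row) for row in rows]))
        return len(self._commits) - 1

    def as_of_version(self, version):
        """Read the table exactly as it was at the given version."""
        return [dict(row) for row in self._commits[version][1]]

    def as_of_timestamp(self, timestamp):
        """Read the latest version committed at or before the timestamp."""
        eligible = [snap for ts, snap in self._commits if ts <= timestamp]
        if not eligible:
            raise ValueError("no version exists at or before that timestamp")
        return [dict(row) for row in eligible[-1]]


table = VersionedTable()
table.commit([{"id": 1, "city": "Oslo"}], datetime(2021, 1, 1))
table.commit([{"id": 1, "city": "Bergen"}], datetime(2021, 2, 1))

by_version = table.as_of_version(0)
by_time = table.as_of_timestamp(datetime(2021, 1, 15))
latest = table.as_of_timestamp(datetime(2021, 3, 1))
```

Reading by version 0 and reading by a mid-January timestamp both return the Oslo row, because the Bergen update had not been committed yet.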
In a Data Flow, a Delta sink with update or upsert enabled handles SCD Type 1 (overwrite matching rows) directly. SCD Type 2 (insert new versions) takes a few extra steps in the flow to close out the previous version of a row, but still no hand-written MERGE SQL. The sink settings define the key columns and which operations are allowed (insert, update, upsert, delete), and an Alter Row transformation upstream marks each row with the operation to apply.
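The difference between the two SCD patterns is easier to see in code than in sink settings. This is a plain-Python sketch of the merge semantics only, not what the Delta sink executes. The tracking columns `valid_from`, `valid_to`, and `is_current` are assumed names, and a real dimension would likely use its own.

```python
def merge_scd1(target, incoming, key):
    """Type 1: overwrite the matching row in place, insert rows with new keys."""
    merged = {row[key]: dict(row) for row in target}
    for row in incoming:
        merged[row[key]] = dict(row)  # history of the old values is lost
    return list(merged.values())


def merge_scd2(target, incoming, key, load_date):
    """Type 2: close the current version of a changed row, append a new one."""
    result = [dict(row) for row in target]  # copy so the input is not mutated
    current = {row[key]: row for row in result if row["is_current"]}
    for row in incoming:
        existing = current.get(row[key])
        if existing is not None:
            unchanged = all(existing[col] == val for col, val in row.items())
            if unchanged:
                continue  # nothing to version
            # Expire the old version instead of overwriting it.
            existing["is_current"] = False
            existing["valid_to"] = load_date
        result.append(
            {**row, "valid_from": load_date, "valid_to": None, "is_current": True}
        )
    return result


# Hypothetical sample rows.
dim_scd1 = merge_scd1(
    [{"id": 1, "city": "Oslo"}],
    [{"id": 1, "city": "Bergen"}, {"id": 2, "city": "Turku"}],
    key="id",
)

history = [
    {"id": 1, "city": "Oslo", "valid_from": "2020-01-01",
     "valid_to": None, "is_current": True}
]
dim_scd2 = merge_scd2(
    history, [{"id": 1, "city": "Bergen"}], key="id", load_date="2021-06-01"
)
```

After the Type 1 merge the Oslo value is gone. After the Type 2 merge the Oslo row is still there, marked as expired, with the Bergen row appended as the current version.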
When Databricks Is Still the Right Call
ADF Data Flows are the right choice for most structured transformation work. There are scenarios where Databricks notebooks give you better outcomes:
Complex debugging requirements. If you're building transformation logic that you need to iteratively debug with real data, Databricks notebooks give you a real Python and Scala REPL with a running Spark session. The ADF Data Flow canvas preview is useful but it's not a REPL.
ML feature engineering. Transformations that involve scikit-learn preprocessing, custom statistical functions, or model inference inline don't fit naturally in the Data Flow canvas. Use a Databricks Notebook Activity from ADF to orchestrate a notebook that handles the ML-adjacent transformation.
The typical architecture I recommend: ADF orchestrates everything, Data Flows handle the structured ETL, Databricks handles the cases where you need a full Python and Spark environment. These are not competing choices -- they're layers in the same stack.
If you're evaluating whether a specific transformation scenario fits ADF Data Flows or requires Databricks, I'm here to help work through it.