ADF Data Flows in 2022: What Three Years of Maturation Delivers

Three Years of Production Data Flows

ADF Mapping Data Flows went GA in 2019, and I've been running them in production for three years. Not toy examples: production pipelines loading dimension tables with SCD logic, conformed zone aggregations, multi-source joins feeding analytical models. The kind of work where Spark startup latency costs you money and transformation bugs cost you credibility.

Three years is long enough to have an informed opinion. Here it is.

What's Actually Improved Since 2019

The 2019 Data Flows product and the 2022 Data Flows product are meaningfully different. Here's what maturation delivered:

Flowlets

Reusable sub-flows. The SSIS equivalent is the package template concept — define a transformation pattern once, reuse it across multiple data flows. In ADF, this is a flowlet: a self-contained set of transformations that can be referenced from multiple parent data flows with different source and sink bindings.

In practice: I've built flowlets for common dimension loading patterns (SCD Type 1, SCD Type 2 with surrogate key generation), null handling normalization, and data type standardization. These flowlets get reused across 20+ data flows without copy-pasting transformation logic. When a business rule changes, I update the flowlet once and every data flow that references it picks up the change on its next run.
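The "define once, bind to different sources" idea can be sketched outside ADF in plain Python. This is only an analogy, not how flowlets are authored: the function name normalize_nulls, the placeholder values, and the sample rows are all hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def normalize_nulls(
    rows: Iterable[dict],
    placeholders: Tuple[str, ...] = ("", "N/A", "NULL"),
) -> Iterator[dict]:
    """Reusable step: map placeholder strings to None (hypothetical rule)."""
    for row in rows:
        yield {
            col: None if isinstance(val, str) and val.strip() in placeholders else val
            for col, val in row.items()
        }


# Two hypothetical "parent flows" bind the same step to different sources.
customers = list(normalize_nulls([{"id": 1, "city": "N/A"}]))
products = list(normalize_nulls([{"sku": "A1", "colour": " "}]))
```

Changing the placeholder list in one place changes the behaviour of every caller, which is the same property that makes a flowlet worth maintaining.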

This is the pattern that makes Data Flows scale beyond "one-off transformations."

Delta Lake as a First-Class Source and Sink

You can now read and write Delta tables directly from Data Flows, with full support for Delta features: time travel (read a specific table version or timestamp), upserts using Delta's merge semantics, and schema evolution. This is not "write Parquet to a Delta Lake path and hope" — it's native Delta semantics inside the Data Flow transformation engine.

Combined with the Azure Databricks Delta Lake connector in Copy Activity, this means ADF can participate as a full citizen in a Lakehouse architecture: Copy Activity brings raw data into the Bronze layer, Data Flows handle Bronze-to-Silver transformations using Delta upsert semantics, and the same Delta tables feed Power BI or Databricks notebooks downstream.
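For readers less familiar with upsert semantics, here is a minimal sketch in plain Python of what a keyed merge does to a Silver table: rows whose key already exists are updated, the rest are inserted. It is not the Delta Lake or Data Flows API. The helper merge_upsert, the customer_id key, and the sample rows are hypothetical.

```python
from typing import Dict, Iterable, List


def merge_upsert(target: List[dict], updates: Iterable[dict], key: str) -> List[dict]:
    """Illustrative keyed merge: update matching rows, insert the rest.

    Hypothetical helper for explanation only, not a Delta Lake or ADF API.
    """
    # Index copies of the target rows by key; dicts preserve insertion order.
    merged: Dict[object, dict] = {row[key]: dict(row) for row in target}
    for row in updates:
        if row[key] in merged:
            merged[row[key]].update(row)  # matched: update the existing row
        else:
            merged[row[key]] = dict(row)  # not matched: insert a new row
    return list(merged.values())


silver = [
    {"customer_id": 1, "city": "Oslo"},
    {"customer_id": 2, "city": "Bergen"},
]
bronze_batch = [
    {"customer_id": 2, "city": "Trondheim"},  # existing key, changed value
    {"customer_id": 3, "city": "Tromso"},     # new key
]
result = merge_upsert(silver, bronze_batch, key="customer_id")
```

After the merge, customer 1 is untouched, customer 2 carries the new city, and customer 3 has been appended.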

Better Partition Optimization

The partition optimization UI now provides visual feedback on partition distribution, so you can see whether your data is skewed before it causes the "95% complete, waiting forever" problem that anyone who's run Spark on skewed data knows intimately. It isn't as detailed as the Spark UI, but it's enough to catch obvious partitioning problems before they hit production.
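The kind of skew the UI surfaces can also be estimated from a sample of partition key values before a run. The sketch below is one simple way to quantify it, assuming a metric of "largest partition divided by mean partition size". The function name skew_ratio and the sample keys are hypothetical, and this is not a metric that ADF itself reports.

```python
from collections import Counter
from typing import Hashable, Iterable


def skew_ratio(partition_keys: Iterable[Hashable]) -> float:
    """Return the largest partition's row count divided by the mean row count.

    A value near 1.0 means an even spread; larger values mean one partition
    will keep running long after the others have finished.
    """
    counts = Counter(partition_keys)
    if not counts:
        raise ValueError("no rows to measure")
    mean = sum(counts.values()) / len(counts)
    return max(counts.values()) / mean


# Hypothetical sample: 90% of rows share one key value.
keys = ["US"] * 90 + ["NO"] * 5 + ["SE"] * 5
ratio = skew_ratio(keys)  # roughly 2.7: the "US" partition dominates
```

What counts as too skewed depends on the workload, so any threshold applied to this ratio is a judgment call rather than a fixed rule.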

Inline Dataset Mode

You can now define source and sink schema inline within the Data Flow rather than requiring a separate Dataset object in the factory. For transformations where the schema is fixed and unlikely to be reused elsewhere, this reduces factory object proliferation — one fewer entity to maintain, one fewer place where a misconfiguration can cause a failure.

What I Use Data Flows For

Three years of production has given me a clear mental model of where Data Flows fit:

SCD Type 1 and Type 2 dimension loading. The Alter Row transformation handles the "insert new, update existing, don't touch unchanged" logic that SCD loading requires. The window function support handles surrogate key generation and effective date tracking. This is where Data Flows genuinely earn their complexity cost.
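The row-level decisions behind a Type 2 load can be sketched in plain Python. This is not how a Data Flow is authored (there the same decisions are expressed with Alter Row conditions and window or key-generation steps). The column names sk, valid_from, and valid_to, and the function apply_scd2, are hypothetical choices for illustration.

```python
from datetime import date
from typing import List, Sequence


def apply_scd2(
    dimension: List[dict],
    incoming: List[dict],
    business_key: str,
    tracked: Sequence[str],
    as_of: date,
) -> List[dict]:
    """Insert new keys, expire and re-insert changed rows, skip unchanged rows."""
    result = [dict(row) for row in dimension]  # never mutate the input
    # The current version of each business key is the row with no end date.
    current = {row[business_key]: row for row in result if row["valid_to"] is None}
    next_sk = max((row["sk"] for row in result), default=0) + 1

    for src in incoming:
        existing = current.get(src[business_key])
        if existing is not None and all(existing[c] == src[c] for c in tracked):
            continue  # unchanged: don't touch the current version
        if existing is not None:
            existing["valid_to"] = as_of  # changed: close the current version
        new_row = {
            "sk": next_sk,  # surrogate key: simple running counter
            business_key: src[business_key],
            **{c: src[c] for c in tracked},
            "valid_from": as_of,
            "valid_to": None,
        }
        result.append(new_row)
        current[src[business_key]] = new_row
        next_sk += 1
    return result


dim = [
    {"sk": 1, "customer_id": "C1", "city": "Oslo",
     "valid_from": date(2021, 1, 1), "valid_to": None},
    {"sk": 2, "customer_id": "C2", "city": "Bergen",
     "valid_from": date(2021, 1, 1), "valid_to": None},
]
batch = [
    {"customer_id": "C1", "city": "Oslo"},       # unchanged
    {"customer_id": "C2", "city": "Trondheim"},  # changed
    {"customer_id": "C3", "city": "Tromso"},     # new
]
loaded = apply_scd2(dim, batch, "customer_id", ["city"], date(2022, 6, 1))
```

In this run C1 is left alone, the old C2 row is closed on the load date and a new C2 version is added, and C3 arrives as a brand-new row with the next surrogate key.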

Aggregation pipelines from raw to conformed zones. Fact table loading often requires grouping, summing, filtering, and joining. Data Flows handle this without leaving the ADF abstraction layer — no need to spin up a Databricks cluster for aggregations that aren't computationally extreme.

Multi-source joins into unified models. Joining three sources, applying lookup enrichment from a fourth, filtering based on a fifth reference table — this is what the Join, Lookup, and Filter transformations are for, and they handle it cleanly.

What I Do NOT Use Data Flows For

Equally important: where I reach for something else.

Low-latency requirements. Spark startup time on a Data Flow cluster is typically 2-5 minutes. If you have a pipeline that needs to complete in under a minute, Data Flows are the wrong tool. Use Copy Activity (fast, no Spark startup) or a Stored Procedure (runs in-database).

Highly complex row-level logic. If your transformation requires complex procedural logic — iterating over rows, maintaining state across rows, calling external APIs row-by-row — a Databricks notebook is more debuggable. Data Flow transformations are declarative. When the logic is genuinely complex, declarative transformations become hard to reason about and harder to debug when they're wrong.

Very large datasets with complex shuffle operations. Data Flows run on auto-provisioned Spark clusters. For extreme-scale transformations with complex shuffles, Databricks gives you more control over cluster configuration and partition strategy. Data Flows work at scale for most workloads; for the extreme edge, Databricks is a better tool.
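The three exclusions above reduce to a short checklist, sketched here in Python. This is only a compact restatement of the rules in this section, not a formal sizing method. The Workload fields, the suggest_tool function, and the use of the low end of the 2-5 minute startup range as the cutoff are assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumption: the low end of the 2-5 minute startup range is used as the cutoff.
SPARK_STARTUP_SECONDS = 120


@dataclass
class Workload:
    latency_budget_seconds: float
    needs_row_level_procedural_logic: bool
    extreme_scale_shuffle: bool


def suggest_tool(workload: Workload) -> str:
    """Apply the three exclusion rules in order; default to a Data Flow."""
    if workload.latency_budget_seconds < SPARK_STARTUP_SECONDS:
        return "Copy Activity or Stored Procedure"
    if workload.needs_row_level_procedural_logic or workload.extreme_scale_shuffle:
        return "Databricks notebook"
    return "Data Flow"


nightly_dimension_load = Workload(
    latency_budget_seconds=3600,
    needs_row_level_procedural_logic=False,
    extreme_scale_shuffle=False,
)
choice = suggest_tool(nightly_dimension_load)
```

Real decisions involve more than three booleans, but writing the rules down this way makes the order of the checks explicit: latency first, then logic complexity and scale.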

The Comparison, Resolved

For three years, clients have asked me: "Data Flows or Databricks notebooks?" That's the wrong question. The right question is: "Which gives better observability for this specific transformation?"

Data Flows give you visual, row-level data preview at each transformation step during debug mode. You can see exactly what the data looks like after each filter, join, and aggregate. For non-engineers who need to validate transformation logic, this is invaluable. For multi-step declarative transformations being actively debugged, it's faster than anything in Databricks.

Databricks notebooks give you full Python/Scala execution context, arbitrary code, Spark UI for execution plan inspection, and the full ecosystem of Python libraries. For sophisticated data engineers building complex transformations, it's the better debugging environment.

Use both. They're not competing; they're complementary. Data Flows for GUI-verifiable transformation logic. Databricks for complex programmatic transformations. ADF orchestrates both.

If you're still treating this as an either/or choice, I'd be glad to help you think through the architecture. As always, reach out.
