Microsoft's ADF documentation covers the building blocks thoroughly: how to create a pipeline, how to parameterize an activity, how to use ForEach. What it doesn't give you is a complete, production-ready metadata-driven framework. The documentation shows you the Lego blocks. It does not show you how to build a factory.
The community did that work. And what the community built goes substantially further than Microsoft's documentation implies is possible.
What a Metadata-Driven Framework Actually Does
The core pattern: instead of one pipeline per source table, you build one parameterized pipeline that takes source and sink configuration as inputs, then drive it with a metadata table. Onboarding a new source table means adding a row to a configuration table, not building a new pipeline.
A minimal metadata table looks something like this:
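CREATE TABLE etl.pipeline_config (
    config_id        INT IDENTITY(1,1) PRIMARY KEY,
    source_name      NVARCHAR(128),   -- source table or object to extract
    source_type      NVARCHAR(50),    -- kind of source system
    sink_container   NVARCHAR(128),   -- target storage container
    sink_path        NVARCHAR(500),   -- path inside the container
    watermark_column NVARCHAR(128),   -- column used to detect new or changed rows
    watermark_value  DATETIME2,       -- high watermark from the last successful run
    load_type        NVARCHAR(20),    -- load strategy for this source
    is_active        BIT DEFAULT 1,   -- only active rows are read by the Lookup
    pipeline_name    NVARCHAR(200)    -- parameterized pipeline to invoke for this row
);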
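To make "adding a row" concrete, here is a minimal sketch of onboarding one table. Every literal below (source name, container, path, watermark column, child pipeline name) is a made-up placeholder, and the 'sqlserver' and 'incremental' labels are assumed conventions, not values the schema enforces.

```sql
-- Hypothetical example: onboard a new source table by inserting one config row.
-- All literal values are placeholders, not names from a real environment.
INSERT INTO etl.pipeline_config
    (source_name, source_type, sink_container, sink_path,
     watermark_column, watermark_value, load_type, pipeline_name)
VALUES
    (N'sales.orders',          -- source table to extract
     N'sqlserver',             -- assumed source_type label
     N'raw',                   -- target storage container
     N'sales/orders/',         -- folder inside the container
     N'modified_at',           -- column used for incremental filtering
     '1900-01-01',             -- start low so the first run loads everything
     N'incremental',           -- assumed load_type convention
     N'pl_copy_sql_to_lake');  -- hypothetical child pipeline name
-- config_id is generated by IDENTITY and is_active defaults to 1.
```

No pipeline is created or edited here. The next orchestration run picks the row up because it is active.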
The orchestration pipeline: Lookup (read active rows from etl.pipeline_config) then ForEach (iterate, invoke the appropriate parameterized pipeline per row). Each child pipeline runs independently. One failure doesn't abort the others.
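As a rough sketch, the query behind that Lookup can be as small as the one below. It assumes the Lookup has "First row only" turned off so the ForEach receives one item per row. Keep in mind that a single Lookup returns at most 5,000 rows and about 4 MB of output, so a very large config table would need paging or filtering.

```sql
-- Query for the Lookup activity: one output item per active config row.
SELECT config_id, source_name, source_type, sink_container, sink_path,
       watermark_column, watermark_value, load_type, pipeline_name
FROM etl.pipeline_config
WHERE is_active = 1
ORDER BY config_id;
```

Each column then becomes available inside the ForEach as a property of the current item and can be passed to the child pipeline as a parameter.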
What the Community Took Further
The community implementations I've studied go well past the basic pattern. Notable additions that I've either adopted directly or built variants of:
Pipeline Dependency Chaining
Run pipeline B only after pipeline A completes successfully, for the same execution window. Implemented by adding a depends_on_config_id column to the config table and a dependency-resolution pass in the orchestration layer before the ForEach executes. The orchestration pipeline builds a dependency graph, determines the safe execution order, and runs independent pipelines concurrently while respecting dependencies.
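One way to sketch the dependency-resolution pass is a recursive CTE that assigns each config row an execution level. This assumes a single optional parent per row (the depends_on_config_id column named above); the run_level alias and the query shape are my own illustration, not a specific community implementation.

```sql
-- Assumed column from the text: one optional upstream dependency per config row.
ALTER TABLE etl.pipeline_config
    ADD depends_on_config_id INT NULL
        REFERENCES etl.pipeline_config (config_id);
GO

-- Assign each active config a run_level: 0 for rows with no dependency,
-- parent level + 1 otherwise. Rows sharing a level can run concurrently.
WITH levels AS (
    SELECT config_id, 0 AS run_level
    FROM etl.pipeline_config
    WHERE is_active = 1
      AND depends_on_config_id IS NULL

    UNION ALL

    SELECT pc.config_id, l.run_level + 1
    FROM etl.pipeline_config AS pc
    JOIN levels AS l
        ON pc.depends_on_config_id = l.config_id
    WHERE pc.is_active = 1
)
SELECT config_id, run_level
FROM levels
ORDER BY run_level, config_id;
```

The orchestration layer can then process one level at a time and skip any row whose parent did not succeed in the current window. Rows that hang off an inactive parent or sit in a dependency cycle are never reached from a level-0 row, so they drop out of the result instead of running out of order.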
The Watermark Store with Locking
The naive watermark pattern stores the high watermark in the config table after a successful run. The problem: parallel pipelines can race to update the same watermark. The community solution adds lock columns to the config table (is_locked and lock_acquired_at below) and a "claim watermark" step before the pipeline runs.
-- Claim watermark before pipeline execution
UPDATE etl.pipeline_config
SET is_locked = 1,
    lock_acquired_at = GETUTCDATE()
WHERE config_id = @configId
  AND is_locked = 0;
-- Only proceed if the UPDATE affected 1 row (we won the lock)
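A fuller sketch of the claim and release steps might look like the following. The two procedure names (etl.claim_watermark, etl.release_watermark), the constraint name, and the lock_acquired output column are hypothetical. The lock columns are the ones used in the UPDATE above, which the minimal table does not yet contain.

```sql
-- Columns assumed by the lock pattern; they are not in the minimal table above.
ALTER TABLE etl.pipeline_config
    ADD is_locked BIT NOT NULL
            CONSTRAINT DF_pipeline_config_is_locked DEFAULT 0,
        lock_acquired_at DATETIME2 NULL;
GO

-- Hypothetical wrapper around the claim step.
-- Returns lock_acquired = 1 if this caller won the lock, otherwise 0.
CREATE PROCEDURE etl.claim_watermark
    @configId INT
AS
BEGIN
    SET NOCOUNT ON;

    UPDATE etl.pipeline_config
    SET is_locked = 1,
        lock_acquired_at = GETUTCDATE()
    WHERE config_id = @configId
      AND is_locked = 0;

    -- @@ROWCOUNT is 1 only when the UPDATE flipped the row from unlocked to locked.
    SELECT CAST(@@ROWCOUNT AS BIT) AS lock_acquired;
END;
GO

-- Hypothetical counterpart: after a successful load, advance the watermark and release.
CREATE PROCEDURE etl.release_watermark
    @configId INT,
    @newWatermark DATETIME2
AS
BEGIN
    SET NOCOUNT ON;

    UPDATE etl.pipeline_config
    SET watermark_value = @newWatermark,
        is_locked = 0,
        lock_acquired_at = NULL
    WHERE config_id = @configId
      AND is_locked = 1;
END;
```

The claim works because the check and the set happen in one UPDATE statement. If two runs call it at the same moment, only one finds is_locked = 0 and the other gets lock_acquired = 0 and can exit cleanly.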
The "Retry Failed Slice" Procedure
My own addition: a stored procedure that identifies config rows where the last run failed and resets them for re-execution without touching rows that succeeded. The pattern lets you fix the source issue, then call the retry procedure and rerun only the failed slices rather than the entire batch.
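CREATE PROCEDURE etl.retry_failed_runs
    @run_date DATE
AS
BEGIN
    SET NOCOUNT ON;

    -- Latest run per config for the date. Assumes etl.run_log has an
    -- increasing run_id column (an assumption; the log table is not shown here).
    WITH last_run AS (
        SELECT config_id, status, watermark_before,
               ROW_NUMBER() OVER (PARTITION BY config_id
                                  ORDER BY run_id DESC) AS rn
        FROM etl.run_log
        WHERE run_date = @run_date
    )
    -- Restore the pre-run watermark and release the lock,
    -- but only where that latest run failed.
    UPDATE pc
    SET pc.watermark_value = lr.watermark_before,
        pc.is_locked = 0
    FROM etl.pipeline_config pc
    JOIN last_run lr
        ON lr.config_id = pc.config_id
       AND lr.rn = 1
       AND lr.status = 'failed';
END;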
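The procedure reads from a run log that this post never defines, so here is one assumed shape for it, followed by a usage sketch. The run_id key, the column types, and the example date are all placeholders chosen for illustration.

```sql
-- Assumed shape of etl.run_log, inferred from the columns the procedure reads.
CREATE TABLE etl.run_log (
    run_id           INT IDENTITY(1,1) PRIMARY KEY,  -- assumed surrogate key
    config_id        INT NOT NULL
        REFERENCES etl.pipeline_config (config_id),
    run_date         DATE NOT NULL,
    status           NVARCHAR(20) NOT NULL,          -- 'failed' is the only value named in this post
    watermark_before DATETIME2 NULL                  -- watermark captured before the run started
);
GO

-- After fixing the source issue, reset that day's failed slices.
EXEC etl.retry_failed_runs @run_date = '2024-01-15';

-- List the configs with a failed run that day, showing their current watermark.
SELECT pc.config_id, pc.source_name, pc.watermark_value
FROM etl.pipeline_config AS pc
WHERE EXISTS (
    SELECT 1
    FROM etl.run_log AS rl
    WHERE rl.config_id = pc.config_id
      AND rl.run_date = '2024-01-15'
      AND rl.status = 'failed'
);
```

Capturing watermark_before at the start of every run is what makes the reset safe: the retry starts from exactly the point the failed run started from.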
Data Quality Checkpoint
After the sink write, count the rows returned by the source query and the rows that landed in the sink. If the two counts differ by more than a threshold (I use 1% as the default), mark the run as failed and alert. This catches silent failures -- truncated extracts, network-interrupted writes, Spark jobs that completed without an error but wrote fewer rows than expected.
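The check itself can live in a small stored procedure that the pipeline calls after the write. The sketch below is hypothetical: the procedure name, the parameter names, and the choice to measure the difference relative to the source count are my assumptions. How the two counts are obtained is left to the pipeline.

```sql
-- Hypothetical checkpoint procedure; the caller passes in both counts.
CREATE PROCEDURE etl.check_row_counts
    @source_rows BIGINT,
    @sink_rows   BIGINT,
    @threshold   FLOAT = 0.01   -- 1% default from the text
AS
BEGIN
    SET NOCOUNT ON;

    -- Relative difference measured against the source count (an assumption).
    -- An empty source passes only when the sink is empty too.
    DECLARE @variance FLOAT =
        CASE
            WHEN @source_rows = 0 AND @sink_rows = 0 THEN 0.0
            WHEN @source_rows = 0 THEN 1.0
            ELSE ABS(CAST(@source_rows AS FLOAT) - @sink_rows) / @source_rows
        END;

    IF @variance > @threshold
    BEGIN
        DECLARE @msg NVARCHAR(400) = CONCAT(
            N'Row count check failed: source=', @source_rows,
            N', sink=', @sink_rows,
            N', variance=', FORMAT(@variance, 'P2'));
        -- Raising an error makes the procedure call fail for the caller.
        THROW 50001, @msg, 1;
    END;

    SELECT @variance AS variance;
END;
```

With the default threshold, 1,000 source rows against 995 sink rows passes (0.5%), while 1,000 against 985 raises the error (1.5%). Letting the procedure throw means the calling activity fails, so the pipeline's normal failure path can handle marking the run and sending the alert.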
The REST API Polling Pattern
For sources where data isn't always ready at the scheduled run time (vendor FTP drops, API exports that take variable time to generate), the community pattern uses an ADF Until activity to poll for a readiness signal before triggering the actual extract. The Until loop calls an HTTP endpoint or checks for a file in storage and waits up to a configured timeout.
The Uncomfortable Observation
None of these patterns require new ADF features. The building blocks -- parameterized pipelines, ForEach, Lookup, Until, Set Variable, stored procedure calls -- have existed since ADF v2. Microsoft could have documented these as first-party reference implementations years ago. They chose not to.
The community shipped the 20% that Microsoft's documentation left blank. For anything beyond basic patterns, the GitHub repos, blog posts, and Stack Overflow answers that make up the community's ADF knowledge base are more useful for production implementations than the official documentation.
I'm not saying this to criticize Microsoft -- they have a finite documentation budget and they prioritized the connectors and the core activities over the framework patterns. The community filling in the gap is how open ecosystems work. I'm saying it because if you go looking for Microsoft's official guidance on building a metadata-driven ADF framework, you won't find a complete answer, and knowing that saves you from looking for something that isn't there.
The pattern is well-understood. I've built it multiple times. If you want to walk through the design for your specific data estate, I'm here to help.