What the Community Built That Microsoft Didn't: ADF Metadata Frameworks

Microsoft's ADF documentation covers the building blocks thoroughly: how to create a pipeline, how to parameterize an activity, how to use ForEach. What it doesn't give you is a complete, production-ready metadata-driven framework. The documentation shows you the Lego blocks. It does not show you how to build a factory.

The community did that work. And what the community built goes substantially further than Microsoft's documentation implies is possible.

What a Metadata-Driven Framework Actually Does

The core pattern: instead of one pipeline per source table, you build one parameterized pipeline that takes source and sink configuration as inputs, then drive it with a metadata table. Onboarding a new source table means adding a row to a configuration table, not building a new pipeline.

A minimal metadata table looks something like this:

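CREATE TABLE etl.pipeline_config (
    config_id        INT PRIMARY KEY IDENTITY,
    source_name      NVARCHAR(128),  -- source table or object to extract
    source_type      NVARCHAR(50),   -- kind of source system
    sink_container   NVARCHAR(128),  -- target storage container
    sink_path        NVARCHAR(500),  -- path inside the container
    watermark_column NVARCHAR(128),  -- column that drives incremental loads
    watermark_value  DATETIME2,      -- high watermark from the last successful run
    load_type        NVARCHAR(20),   -- how the row is loaded (for example full vs. incremental)
    is_active        BIT DEFAULT 1,  -- only active rows are picked up by the orchestrator
    pipeline_name    NVARCHAR(200)   -- parameterized child pipeline to invoke
);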

The orchestration pipeline has two steps. A Lookup activity reads the active rows from etl.pipeline_config, then a ForEach activity iterates over them and invokes the appropriate parameterized pipeline for each row. Each child pipeline runs independently, so one failure doesn't abort the others.
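The control flow of that Lookup-plus-ForEach pair can be sketched outside ADF. The Python below is an illustration only, not ADF code: `orchestrate` and `run_child_pipeline` are hypothetical names, and the rows are plain dictionaries shaped loosely like etl.pipeline_config.

```python
from __future__ import annotations

from typing import Callable


def orchestrate(
    config_rows: list[dict],
    run_child_pipeline: Callable[[dict], None],
) -> dict[int, str]:
    """Mimic Lookup + ForEach: run every active row and isolate failures."""
    results: dict[int, str] = {}

    # "Lookup" step: keep only the active configuration rows.
    active_rows = [row for row in config_rows if row["is_active"]]

    # "ForEach" step: each row is handled on its own, so a failure is
    # recorded and the loop simply moves on to the next row.
    for row in active_rows:
        try:
            run_child_pipeline(row)
            results[row["config_id"]] = "succeeded"
        except Exception as exc:
            results[row["config_id"]] = f"failed: {exc}"
    return results
```

The point of the sketch is the isolation: the result for each config_id is recorded separately, which is also what a run log needs later on.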

What the Community Took Further

The community implementations I've studied go well past the basic pattern. Notable additions that I've either adopted directly or built variants of:

Pipeline Dependency Chaining

Run pipeline B only after pipeline A completes successfully, for the same execution window. Implemented by adding a depends_on_config_id column to the config table and a dependency-resolution pass in the orchestration layer before the ForEach executes. The orchestration pipeline builds a dependency graph, determines the safe execution order, and runs independent pipelines concurrently while respecting dependencies.
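The dependency-resolution pass is essentially a topological sort. As a rough sketch (not how any particular community framework implements it), the Python below uses the standard library's `graphlib` to turn a mapping of config_id to depends_on_config_id into batches that could run concurrently. The function name `execution_batches` is made up for illustration, and the single-parent mapping is an assumption based on the one depends_on_config_id column.

```python
from __future__ import annotations

from graphlib import TopologicalSorter
from typing import Optional


def execution_batches(depends_on: dict[int, Optional[int]]) -> list[list[int]]:
    """Group config ids into batches; a batch depends only on earlier batches."""
    sorter: TopologicalSorter = TopologicalSorter()
    for config_id, parent_id in depends_on.items():
        if parent_id is None:
            sorter.add(config_id)             # no prerequisite
        else:
            sorter.add(config_id, parent_id)  # parent must finish first

    sorter.prepare()  # raises graphlib.CycleError on circular dependencies

    batches: list[list[int]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        batches.append(ready)
        sorter.done(*ready)  # sketch only: treats the whole batch as succeeded
    return batches
```

A real orchestrator would mark a node as done only after its pipeline actually succeeds, so that dependents of a failed pipeline are never released.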

The Watermark Store with Locking

The naive watermark pattern stores the high watermark in the config table after a successful run. The problem: parallel pipelines can race to update the same watermark. The community solution adds lock columns (is_locked and lock_acquired_at in the example below, which the minimal table above does not include) and a "claim watermark" step before the pipeline runs.

-- Claim the watermark before pipeline execution.
-- The is_locked = 0 predicate means only one caller can flip the flag.
UPDATE etl.pipeline_config
SET    is_locked = 1,
       lock_acquired_at = GETUTCDATE()
WHERE  config_id = @configId
  AND  is_locked = 0;

-- Only proceed if the UPDATE affected 1 row (we won the lock).
SELECT @@ROWCOUNT AS lock_acquired;
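
To see the "did we win the lock" check in action, here is a small self-contained sketch. It uses Python's built-in sqlite3 as a stand-in for SQL Server, so the date function differs from the T-SQL above and the table is cut down to the lock columns. `try_claim` is a hypothetical helper name.

```python
import sqlite3


def try_claim(conn: sqlite3.Connection, config_id: int) -> bool:
    """Return True only if this caller flipped is_locked from 0 to 1."""
    cur = conn.execute(
        "UPDATE pipeline_config "
        "SET is_locked = 1, lock_acquired_at = datetime('now') "
        "WHERE config_id = ? AND is_locked = 0",
        (config_id,),
    )
    conn.commit()
    # rowcount is the number of rows the UPDATE changed: 1 means we won.
    return cur.rowcount == 1


# Stand-in table holding only the columns the lock needs.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pipeline_config ("
    "config_id INTEGER PRIMARY KEY, "
    "is_locked INTEGER NOT NULL DEFAULT 0, "
    "lock_acquired_at TEXT)"
)
conn.execute("INSERT INTO pipeline_config (config_id) VALUES (1)")
conn.commit()

first = try_claim(conn, 1)   # lock was free, so this claim succeeds
second = try_claim(conn, 1)  # lock is already held, so this claim is rejected
```

Because the check and the write happen in one UPDATE statement, there is no window between reading the flag and setting it.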

The "Retry Failed Slice" Procedure

My own addition: a stored procedure that identifies config rows whose run failed on a given run date and resets them for re-execution without touching rows that succeeded. It relies on a run log table (etl.run_log) that records each run's status and the watermark value from before the run started. The pattern lets you fix the source issue, then call the retry procedure and rerun only the failed slices rather than the entire batch.

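CREATE PROCEDURE etl.retry_failed_runs
    @run_date DATE
AS
BEGIN
    SET NOCOUNT ON;

    -- Roll each failed config back to the watermark it held before the failed
    -- run, and release its lock so the next orchestration run picks it up.
    -- Caveat: this expects at most one failed etl.run_log row per config_id
    -- and run_date; with several, which watermark_before wins is not defined.
    UPDATE pc
    SET    pc.watermark_value = rl.watermark_before,
           pc.is_locked = 0
    FROM   etl.pipeline_config AS pc
    JOIN   etl.run_log AS rl
           ON  rl.config_id = pc.config_id
           AND rl.run_date  = @run_date
           AND rl.status    = 'failed';
END;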

Data Quality Checkpoint

After the sink write, compare the row count returned by the source query with the row count in the sink. If the relative difference exceeds a threshold (I use 1% as the default), mark the run as failed and alert. This catches silent failures: truncated extracts, network-interrupted writes, Spark jobs that completed without an error but wrote fewer rows than expected.
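
The comparison itself is simple arithmetic. A minimal sketch in Python, assuming the difference is measured relative to the source count and that an empty source only passes when the sink is empty too (both are my assumptions, and `row_count_check` is a hypothetical name):

```python
def row_count_check(source_rows: int, sink_rows: int, threshold: float = 0.01) -> bool:
    """Return True when the sink row count is within `threshold` of the source."""
    if source_rows < 0 or sink_rows < 0:
        raise ValueError("row counts must be non-negative")
    if source_rows == 0:
        # Nothing was expected, so pass only if nothing was written.
        return sink_rows == 0
    relative_diff = abs(source_rows - sink_rows) / source_rows
    # The run fails only when the difference exceeds the threshold.
    return relative_diff <= threshold
```

For example, 1,000 source rows against 995 sink rows is a 0.5% difference and passes at the 1% default, while 980 sink rows is a 2% difference and fails.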

The REST API Polling Pattern

For sources where data isn't always ready at the scheduled run time (vendor FTP drops, API exports that take variable time to generate), the community pattern uses an ADF Until activity to poll for a readiness signal before triggering the actual extract. On each iteration the Until loop calls an HTTP endpoint or checks for a file in storage, and it gives up once a configured timeout is reached.
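
The loop logic behind that pattern looks roughly like the following. This is a Python sketch of the control flow, not ADF's Until activity itself. `wait_until_ready` and `is_ready` are hypothetical names, and the sleep and clock functions are injectable only so the loop can be exercised without real waiting.

```python
import time
from typing import Callable


def wait_until_ready(
    is_ready: Callable[[], bool],
    timeout_seconds: float,
    poll_interval_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll is_ready() until it returns True or the timeout elapses."""
    deadline = clock() + timeout_seconds
    while True:
        if is_ready():
            return True   # readiness signal seen: safe to start the extract
        if clock() >= deadline:
            return False  # timed out: the caller should fail the run or alert
        sleep(poll_interval_seconds)
```

In practice `is_ready` would wrap the HTTP call or the storage check, and a False result would be treated as a failed run rather than silently skipped.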

The Uncomfortable Observation

None of these patterns require new ADF features. The building blocks -- parameterized pipelines, ForEach, Lookup, Until, Set Variable, stored procedure calls -- have existed since ADF v2. Microsoft could have documented these as first-party reference implementations years ago. They chose not to.

The community shipped the 20% that Microsoft's documentation left blank. For anything beyond basic patterns, the GitHub repos, blog posts, and Stack Overflow answers that make up the community's ADF knowledge base are more useful for production implementations than the official ADF documentation.

I'm not saying this to criticize Microsoft -- they have a finite documentation budget and they prioritized the connectors and the core activities over the framework patterns. The community filling in the gap is how open ecosystems work. I'm saying it because if you go looking for Microsoft's official guidance on building a metadata-driven ADF framework, you won't find a complete answer, and knowing that saves you from looking for something that isn't there.

The pattern is well-understood. I've built it multiple times. If you want to walk through the design for your specific data estate, I'm here to help.
