Building a Metadata Repository: The Config Table at the Center of a Data Platform

After enough metadata-driven pipeline work, you develop strong opinions about what belongs in the config table. Most teams start with the obvious things — source connection info, table names, extract type — and add columns reactively as each new requirement arrives. By the time the table has twenty-five columns and three cross-reference tables, the "config table" has become a schema that nobody fully understands and nobody wants to change.

There's a better starting point. Here's the structure that I've settled on after building this pattern across several client engagements.

The Core Tables

-- The source registry: what exists
CREATE TABLE [meta].[DataSource] (
    SourceID          INT IDENTITY(1,1) PRIMARY KEY,
    SourceName        NVARCHAR(100) NOT NULL UNIQUE,
    SourceType        NVARCHAR(50)  NOT NULL,  -- SQLSERVER | ORACLE | SFTP | API | KAFKA
    ConnectionString  NVARCHAR(500) NULL,       -- or reference to Key Vault secret
    SecretName        NVARCHAR(200) NULL,
    IsActive          BIT           NOT NULL DEFAULT 1,
    OwnerTeam         NVARCHAR(100) NULL,
    CreatedDate       DATE          NOT NULL DEFAULT GETDATE()
);

-- The extraction config: how to get it
CREATE TABLE [meta].[ExtractionConfig] (
    ExtractionID      INT IDENTITY(1,1) PRIMARY KEY,
    SourceID          INT           NOT NULL REFERENCES [meta].[DataSource](SourceID),
    SourceSchema      NVARCHAR(100) NOT NULL,
    SourceTable       NVARCHAR(200) NOT NULL,
    ExtractType       NVARCHAR(20)  NOT NULL,  -- FULL | INCREMENTAL | CDC
    WatermarkColumn   NVARCHAR(100) NULL,
    WatermarkValue    NVARCHAR(200) NULL,
    PartitionColumn   NVARCHAR(100) NULL,       -- for parallel JDBC reads
    PartitionCount    INT           NULL,
    TargetZone        NVARCHAR(20)  NOT NULL,   -- BRONZE | SILVER | GOLD
    TargetPath        NVARCHAR(500) NOT NULL,
    LoadFrequency     NVARCHAR(50)  NOT NULL,   -- HOURLY | DAILY | WEEKLY | ON_DEMAND
    LastSuccessfulRun DATETIME2     NULL,
    IsActive          BIT           NOT NULL DEFAULT 1
);

-- The transformation config: what to do with it
CREATE TABLE [meta].[TransformationConfig] (
    TransformID       INT IDENTITY(1,1) PRIMARY KEY,
    ExtractionID      INT           NOT NULL REFERENCES [meta].[ExtractionConfig](ExtractionID),
    NotebookPath      NVARCHAR(500) NOT NULL,
    TransformVersion  NVARCHAR(20)  NOT NULL DEFAULT '1.0',
    BusinessRulesJSON NVARCHAR(MAX) NULL,       -- source-specific overrides as JSON
    IsActive          BIT           NOT NULL DEFAULT 1
);

Why Three Tables Instead of One

The separation reflects different rates of change and different ownership. The source registry changes when new systems are onboarded — that's an IT or architecture decision. Extraction config changes when load requirements change — frequency, partitioning, watermark strategy. Transformation config changes when business rules evolve — that's data engineering work. Keeping these separate means a DBA can update a watermark value without touching anything related to how the notebook transforms data.

The Lookup Pattern

Pipelines that read from this config should always join all three tables and pass the complete config object as a parameter. Avoid having notebook code re-query the config mid-execution — the config is input state, and it should be fixed at the point the pipeline starts:

-- The lookup query used by ADF's Lookup activity
SELECT
s.SourceName,
s.SourceType,
s.SecretName,
e.SourceSchema,
e.SourceTable,
e.ExtractType,
e.WatermarkColumn,
e.WatermarkValue,
e.PartitionColumn,
e.PartitionCount,
e.TargetPath,
t.NotebookPath,
t.BusinessRulesJSON
FROM [meta].[ExtractionConfig] e
JOIN [meta].[DataSource] s ON e.SourceID = s.SourceID
LEFT JOIN [meta].[TransformationConfig] t ON t.ExtractionID = e.ExtractionID
WHERE e.IsActive = 1
AND s.IsActive = 1
AND e.LoadFrequency = 'DAILY'
ORDER BY s.SourceName;

Bootstrapping the Config

Don't populate the config by hand for existing sources. Pull it from the source system's information schema:

import pyodbc
import pandas as pd

def bootstrap_extraction_config(conn_str: str, source_id: int) -> pd.DataFrame:
with pyodbc.connect(conn_str) as conn:
df = pd.read_sql("""
SELECT
TABLE_SCHEMA as SourceSchema,
TABLE_NAME as SourceTable,
'FULL' as ExtractType,
NULL as WatermarkColumn,
'BRONZE' as TargetZone
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE'
ORDER BY TABLE_SCHEMA, TABLE_NAME
""", conn)
df['SourceID'] = source_id
df['TargetPath'] = df.apply(
lambda r: f"/mnt/bronze/{r['SourceSchema']}/{r['SourceTable']}", axis=1
)
df['LoadFrequency'] = 'DAILY'
return df

One query against the source system's information schema gives you a complete bootstrap DataFrame. Load it into ExtractionConfig and you have a starting point that reflects what actually exists — not what someone remembered to add. As always, I'm here to help.

Read more