Building a Metadata Repository: The Config Table at the Center of a Data Platform
After enough metadata-driven pipeline work, you develop strong opinions about what belongs in the config table. Most teams start with the obvious things — source connection info, table names, extract type — and add columns reactively as each new requirement arrives. By the time the table has twenty-five columns and three cross-reference tables, the "config table" has become a schema that nobody fully understands and nobody wants to change.
There's a better starting point. Here's the structure that I've settled on after building this pattern across several client engagements.
The Core Tables
-- The source registry: what exists
CREATE TABLE [meta].[DataSource] (
SourceID INT IDENTITY(1,1) PRIMARY KEY,
SourceName NVARCHAR(100) NOT NULL UNIQUE,
SourceType NVARCHAR(50) NOT NULL, -- SQLSERVER | ORACLE | SFTP | API | KAFKA
ConnectionString NVARCHAR(500) NULL, -- or reference to Key Vault secret
SecretName NVARCHAR(200) NULL,
IsActive BIT NOT NULL DEFAULT 1,
OwnerTeam NVARCHAR(100) NULL,
CreatedDate DATE NOT NULL DEFAULT GETDATE()
);
-- The extraction config: how to get it
CREATE TABLE [meta].[ExtractionConfig] (
ExtractionID INT IDENTITY(1,1) PRIMARY KEY,
SourceID INT NOT NULL REFERENCES [meta].[DataSource](SourceID),
SourceSchema NVARCHAR(100) NOT NULL,
SourceTable NVARCHAR(200) NOT NULL,
ExtractType NVARCHAR(20) NOT NULL, -- FULL | INCREMENTAL | CDC
WatermarkColumn NVARCHAR(100) NULL,
WatermarkValue NVARCHAR(200) NULL,
PartitionColumn NVARCHAR(100) NULL, -- for parallel JDBC reads
PartitionCount INT NULL,
TargetZone NVARCHAR(20) NOT NULL, -- BRONZE | SILVER | GOLD
TargetPath NVARCHAR(500) NOT NULL,
LoadFrequency NVARCHAR(50) NOT NULL, -- HOURLY | DAILY | WEEKLY | ON_DEMAND
LastSuccessfulRun DATETIME2 NULL,
IsActive BIT NOT NULL DEFAULT 1
);
-- The transformation config: what to do with it
CREATE TABLE [meta].[TransformationConfig] (
TransformID INT IDENTITY(1,1) PRIMARY KEY,
ExtractionID INT NOT NULL REFERENCES [meta].[ExtractionConfig](ExtractionID),
NotebookPath NVARCHAR(500) NOT NULL,
TransformVersion NVARCHAR(20) NOT NULL DEFAULT '1.0',
BusinessRulesJSON NVARCHAR(MAX) NULL, -- source-specific overrides as JSON
IsActive BIT NOT NULL DEFAULT 1
);Why Three Tables Instead of One
The separation reflects different rates of change and different ownership. The source registry changes when new systems are onboarded — that's an IT or architecture decision. Extraction config changes when load requirements change — frequency, partitioning, watermark strategy. Transformation config changes when business rules evolve — that's data engineering work. Keeping these separate means a DBA can update a watermark value without touching anything related to how the notebook transforms data.
The Lookup Pattern
Pipelines that read from this config should always join all three tables and pass the complete config object as a parameter. Avoid having notebook code re-query the config mid-execution — the config is input state, and it should be fixed at the point the pipeline starts:
-- The lookup query used by ADF's Lookup activity
SELECT
s.SourceName,
s.SourceType,
s.SecretName,
e.SourceSchema,
e.SourceTable,
e.ExtractType,
e.WatermarkColumn,
e.WatermarkValue,
e.PartitionColumn,
e.PartitionCount,
e.TargetPath,
t.NotebookPath,
t.BusinessRulesJSON
FROM [meta].[ExtractionConfig] e
JOIN [meta].[DataSource] s ON e.SourceID = s.SourceID
LEFT JOIN [meta].[TransformationConfig] t ON t.ExtractionID = e.ExtractionID
WHERE e.IsActive = 1
AND s.IsActive = 1
AND e.LoadFrequency = 'DAILY'
ORDER BY s.SourceName;
Bootstrapping the Config
Don't populate the config by hand for existing sources. Pull it from the source system's information schema:
import pyodbc
import pandas as pd
def bootstrap_extraction_config(conn_str: str, source_id: int) -> pd.DataFrame:
with pyodbc.connect(conn_str) as conn:
df = pd.read_sql("""
SELECT
TABLE_SCHEMA as SourceSchema,
TABLE_NAME as SourceTable,
'FULL' as ExtractType,
NULL as WatermarkColumn,
'BRONZE' as TargetZone
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE'
ORDER BY TABLE_SCHEMA, TABLE_NAME
""", conn)
df['SourceID'] = source_id
df['TargetPath'] = df.apply(
lambda r: f"/mnt/bronze/{r['SourceSchema']}/{r['SourceTable']}", axis=1
)
df['LoadFrequency'] = 'DAILY'
return df
One query against the source system's information schema gives you a complete bootstrap DataFrame. Load it into ExtractionConfig and you have a starting point that reflects what actually exists — not what someone remembered to add. As always, I'm here to help.