Auto-Generating Terraform from a Metadata Config Table
The metadata-driven pipeline generator was producing ADF pipelines and Databricks notebooks from a config table. The missing piece was infrastructure: every new client data source also needed a storage container, a Databricks cluster configuration, a linked service, and sometimes a Key Vault secret reference. Those were still being created by hand from Terraform templates that were mostly duplicated across projects and occasionally diverged in ways that caused subtle environment differences.
The fix was to close the loop: if metadata drives ADF and Databricks, it should also drive Terraform.
The Config Table Extension
-- Extend the existing IngestionConfig table or use a separate InfraConfig table
CREATE TABLE [meta].[InfraConfig] (
    InfraID            INT IDENTITY(1,1) PRIMARY KEY,
    SourceID           INT NOT NULL REFERENCES [meta].[IngestionConfig](SourceID),
    StorageAccountName NVARCHAR(100) NOT NULL,
    ContainerName      NVARCHAR(100) NOT NULL,
    ClusterSizeLabel   NVARCHAR(50) NOT NULL,  -- SMALL | MEDIUM | LARGE
    RequiresKeyVault   BIT NOT NULL DEFAULT 0,
    SecretName         NVARCHAR(200) NULL,     -- name within KV
    Tags               NVARCHAR(MAX) NULL      -- JSON object: {env, owner, cost_center}
);

The Terraform Generator
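The table definition leaves a few rules unenforced: ClusterSizeLabel is free text, SecretName can be NULL even when RequiresKeyVault is set, and Tags is an unvalidated string. A minimal sketch of a pre-flight check that could run before any Terraform is rendered is shown here. The function name validate_infra_config and the VALID_CLUSTER_SIZES constant are hypothetical, and rows are assumed to arrive as plain dicts keyed by column name.

```python
import json

# Assumption: mirrors the labels documented on InfraConfig.ClusterSizeLabel.
VALID_CLUSTER_SIZES = {"SMALL", "MEDIUM", "LARGE"}


def validate_infra_config(row: dict) -> list[str]:
    """Return a list of problems; an empty list means the row looks usable."""
    problems = []

    # These columns are NOT NULL in the table, but an empty string would still pass SQL.
    for column in ("StorageAccountName", "ContainerName", "ClusterSizeLabel"):
        if not row.get(column):
            problems.append(f"{column} is required")

    size = row.get("ClusterSizeLabel")
    if size and size not in VALID_CLUSTER_SIZES:
        problems.append(
            f"ClusterSizeLabel {size!r} is not one of {sorted(VALID_CLUSTER_SIZES)}"
        )

    # The table allows SecretName to be NULL, so the pairing is only checked here.
    if row.get("RequiresKeyVault") and not row.get("SecretName"):
        problems.append("SecretName is required when RequiresKeyVault is set")

    # Tags is stored as NVARCHAR(MAX); confirm it parses to a JSON object.
    raw_tags = row.get("Tags")
    if raw_tags:
        try:
            tags = json.loads(raw_tags)
        except json.JSONDecodeError as exc:
            problems.append(f"Tags is not valid JSON: {exc.msg}")
        else:
            if not isinstance(tags, dict):
                problems.append("Tags must be a JSON object")

    return problems
```

Returning a list instead of raising on the first failure lets a CI job report every bad row in one pass.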
import json
from pathlib import Path

# Map the ClusterSizeLabel values from meta.InfraConfig to concrete cluster settings.
CLUSTER_SIZE_MAP = {
    'SMALL': {'node_type': 'Standard_DS3_v2', 'workers': 2},
    'MEDIUM': {'node_type': 'Standard_DS4_v2', 'workers': 4},
    'LARGE': {'node_type': 'Standard_DS5_v2', 'workers': 8},
}
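
def render_storage_container(config: dict) -> str:
    # azurerm_storage_container has no tags argument, so the Tags JSON is
    # emitted as container metadata (keys are expected to be lowercase).
    tags = json.loads(config.get('Tags') or '{}')
    tags_tf = "\n".join(f'    {k} = "{v}"' for k, v in tags.items())
    metadata_tf = f"\n  metadata = {{\n{tags_tf}\n  }}" if tags else ""
    return f"""
resource "azurerm_storage_container" "{config['ContainerName']}" {{
  name                  = "{config['ContainerName']}"
  storage_account_name  = "{config['StorageAccountName']}"
  container_access_type = "private"{metadata_tf}
}}
"""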

def render_databricks_cluster(config: dict, source_name: str) -> str:
    # Raises KeyError if ClusterSizeLabel is not SMALL, MEDIUM, or LARGE.
    size = CLUSTER_SIZE_MAP[config['ClusterSizeLabel']]
    return f"""
resource "databricks_cluster" "{source_name.lower()}_cluster" {{
  cluster_name            = "{source_name}-ingest"
  spark_version           = "10.4.x-scala2.12"
  node_type_id            = "{size['node_type']}"
  autotermination_minutes = 30

  autoscale {{
    min_workers = 1
    max_workers = {size['workers']}
  }}
}}
"""

def render_keyvault_reference(config: dict) -> str:
    # Sources without a secret contribute no block.
    if not config['RequiresKeyVault']:
        return ""
    # The var.<secret>_value variable and azurerm_key_vault.pipeline_kv
    # must be declared elsewhere in the module.
    return f"""
resource "azurerm_key_vault_secret" "{config['SecretName']}_ref" {{
  name         = "{config['SecretName']}"
  value        = var.{config['SecretName'].replace('-', '_')}_value
  key_vault_id = azurerm_key_vault.pipeline_kv.id
}}
"""

def generate_terraform_module(configs: list[dict], output_dir: str) -> None:
    out_path = Path(output_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    blocks = []
    for config in configs:
        # SourceName is not an InfraConfig column; it is assumed to come
        # from a join with meta.IngestionConfig.
        source_name = config['SourceName']
        blocks.append(render_storage_container(config))
        blocks.append(render_databricks_cluster(config, source_name))
        blocks.append(render_keyvault_reference(config))

    # Skip empty blocks (sources without a Key Vault secret) before joining.
    tf_content = "\n".join(block for block in blocks if block)
    (out_path / "generated_sources.tf").write_text(tf_content)

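Two details in the generator deserve a guard. Source names flow straight into Terraform resource labels, so a name with a space or a leading digit would produce invalid HCL, and the Key Vault block references a var.<secret>_value variable that has to be declared somewhere. The sketch below shows one way to handle both. The helpers tf_identifier and render_secret_variable and the "id_" prefix are hypothetical, not part of the generator above, and the sensitive flag assumes Terraform 0.14 or later.

```python
import re


def tf_identifier(name: str) -> str:
    """Turn an arbitrary source or secret name into a safe Terraform identifier."""
    # Replace anything outside letters, digits, and underscores with an underscore.
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", name.strip())
    # Terraform identifiers must not start with a digit; "id_" is an arbitrary prefix.
    if not cleaned or cleaned[0].isdigit():
        cleaned = f"id_{cleaned}"
    return cleaned


def render_secret_variable(secret_name: str) -> str:
    """Render the variable block that a var.<name>_value reference needs."""
    return (
        f'variable "{tf_identifier(secret_name)}_value" {{\n'
        "  type      = string\n"
        "  sensitive = true\n"
        "}\n"
    )
```

For secret names made of letters, digits, and hyphens that do not start with a digit, tf_identifier gives the same result as the replace('-', '_') call in render_keyvault_reference, so the declared variable name lines up with the reference.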
Integration with the CI/CD Pipeline
The generator runs as part of the pipeline that deploys new data sources. The sequence: new row in meta.IngestionConfig + meta.InfraConfig → CI job runs the generator → generated .tf files committed to the infra repo → Terraform plan reviewed → Terraform apply provisions the infrastructure → ADF and Databricks notebook generators run → deployment complete.
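The "generated .tf files committed" step works best when the CI job can tell whether the committed file already matches what the generator would produce. A minimal sketch of that check, assuming the generated content is available as a string, follows. The function name sync_generated_file and its check_only flag are hypothetical.

```python
from pathlib import Path


def sync_generated_file(path: Path, content: str, check_only: bool = False) -> bool:
    """Return True if the file on disk differs from the freshly generated content.

    With check_only=False the file is rewritten when it differs.
    With check_only=True nothing is written, so CI can fail on stale output.
    """
    existing = path.read_text(encoding="utf-8") if path.exists() else None
    changed = existing != content
    if changed and not check_only:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(content, encoding="utf-8")
    return changed
```

In write mode the return value tells the job whether there is anything to commit. In check mode a True result on the main branch means someone changed the config table without regenerating, or edited the generated file by hand.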
No more handcrafted Terraform per source. No more infrastructure drift because someone copied last project's config and changed three of the five things that needed changing.
The Gotcha: Generated Files and Code Review
Generated Terraform files need to live in source control, but they complicate code review: a config table change that adds five sources generates five hundred lines of Terraform. The pattern that works: keep generated files in a generated/ subdirectory, note in the PR that these were auto-generated from a config change, and review the config change rather than the generated output. Trust the generator; review the config.
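One way to make "review the config, not the output" easier to follow is to stamp each generated file with a banner that says where it came from. The sketch below prepends a do-not-edit header carrying a short hash of the config rows. The helpers config_fingerprint and with_generated_header and the banner wording are hypothetical, and rows are assumed to be JSON-serializable dicts.

```python
import hashlib
import json


def config_fingerprint(configs: list[dict]) -> str:
    """Hash the config rows so the same rows always give the same short digest."""
    # sort_keys makes the digest independent of dict key order;
    # default=str covers values such as dates that json cannot encode directly.
    canonical = json.dumps(configs, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def with_generated_header(tf_content: str, configs: list[dict]) -> str:
    """Prepend a do-not-edit banner recording the row count and config digest."""
    header = (
        "# AUTO-GENERATED from meta.InfraConfig. Do not edit by hand.\n"
        f"# Sources: {len(configs)}  Config fingerprint: {config_fingerprint(configs)}\n"
        "\n"
    )
    return header + tf_content
```

The digest depends on row order, so the rows would need to be read in a stable order (for example ORDER BY InfraID) for the fingerprint to stay constant between runs.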