Auto-Generating Databricks Notebooks from Metadata: The Pattern That Scales to 200 Sources
We'd already solved the ADF side with a metadata config table and a generic pipeline. Forty-three sources, one pipeline, parameterized at runtime. The problem was what happened next. Once the raw data landed in ADLS, each source needed a Databricks notebook to clean, conform, and write it to the silver layer. And each of those notebooks was hand-written. We had thirty notebooks that were 90% identical and diverged only in column names, data types, and a handful of business rules.
That's the wrong kind of uniqueness. Let's fix it.
The Pattern: Notebooks as Templates
A metadata-generated notebook isn't magic — it's a parameterized template with the boilerplate written once and the variable parts injected from a config table. The output is a valid Databricks notebook (JSON format) that you import into your workspace or check into your repo.
Databricks notebooks are just JSON under the hood. The .dbc format is a zip file containing JSON files. You can export an existing notebook, inspect the structure, and write code that generates new notebooks following the same schema.
import json
import base64
def build_notebook_cell(source: str) -> str:
return source
def build_silver_notebook(config: dict) -> dict:
source_path = config['source_path']
target_path = config['target_path']
primary_key = config['primary_key']
columns = config['columns'] # list of {name, type, nullable}
col_cast_lines = "\n ".join(
f".withColumn('{c[\"name\"]}', col('{c[\"name\"]}').cast('{c[\"type\"]}'))"
for c in columns
)
cells = [
{
"source": "# Auto-generated silver notebook — do not edit directly\n"
"# Source config: " + config['source_name'],
"cell_type": "markdown"
},
{
"source": f"""from pyspark.sql.functions import col, current_timestamp
from delta.tables import DeltaTable
raw_df = spark.read.format("delta").load("{source_path}")
silver_df = (raw_df
{col_cast_lines}
.withColumn("_ingested_at", current_timestamp())
)""",
"cell_type": "code"
},
{
"source": f"""if DeltaTable.isDeltaTable(spark, "{target_path}"):
dt = DeltaTable.forPath(spark, "{target_path}")
(dt.alias("target")
.merge(
silver_df.alias("source"),
"target.{primary_key} = source.{primary_key}"
)
.whenMatchedUpdateAll()
.whenNotMatchedInsertAll()
.execute())
else:
silver_df.write.format("delta").mode("overwrite").save("{target_path}")""",
"cell_type": "code"
}
]
return {"cells": cells, "metadata": {"name": config['notebook_name']}}
What Lives in the Config Table
CREATE TABLE [meta].[NotebookConfig] (
NotebookID INT IDENTITY(1,1) PRIMARY KEY,
SourceName NVARCHAR(100) NOT NULL,
NotebookName NVARCHAR(200) NOT NULL,
SourcePath NVARCHAR(500) NOT NULL,
TargetPath NVARCHAR(500) NOT NULL,
PrimaryKey NVARCHAR(100) NOT NULL,
ColumnConfigJSON NVARCHAR(MAX) NOT NULL -- JSON array of {name, type, nullable}
);The ColumnConfigJSON column stores the schema as JSON. You populate it either manually or — better — by querying the source system's information schema and serializing the output.
Deploying the Generated Notebooks
Once you've generated the notebook JSON, import it via the Databricks REST API:
import requests
def import_notebook(workspace_url: str, token: str, path: str, content: dict) -> None:
encoded = base64.b64encode(
json.dumps(content).encode('utf-8')
).decode('utf-8')
resp = requests.post(
f"{workspace_url}/api/2.0/workspace/import",
headers={"Authorization": f"Bearer {token}"},
json={
"path": path,
"format": "JUPYTER",
"language": "PYTHON",
"content": encoded,
"overwrite": True
}
)
resp.raise_for_status()
The Limit of Generation
Generated notebooks handle the 90% that's structural boilerplate. The remaining 10% — business rules, source-specific quirks, non-standard date formats, columns that need lookups against reference tables — those still require hand-written logic. The generated notebook is a starting point and a scaffold, not a final artifact.
The pattern I landed on: generate the notebook, import it, then add a clearly marked "custom logic" cell where source-specific rules go. The generator overwrites everything except that cell on regeneration runs. Regeneration never touches custom logic cells. That boundary keeps the maintenance model clean. As always, I'm here to help.