Auto-Generating Databricks Notebooks from Metadata: The Pattern That Scales to 200 Sources

We'd already solved the ADF side with a metadata config table and a generic pipeline. Forty-three sources, one pipeline, parameterized at runtime. The problem was what happened next. Once the raw data landed in ADLS, each source needed a Databricks notebook to clean, conform, and write it to the silver layer. And each of those notebooks was hand-written. We had thirty notebooks that were 90% identical and diverged only in column names, data types, and a handful of business rules.

That's the wrong kind of uniqueness. Let's fix it.

The Pattern: Notebooks as Templates

A metadata-generated notebook isn't magic — it's a parameterized template with the boilerplate written once and the variable parts injected from a config table. The output is a valid Databricks notebook (JSON format) that you import into your workspace or check into your repo.

Databricks notebooks are just JSON under the hood. The .dbc format is a zip file containing JSON files. You can export an existing notebook, inspect the structure, and write code that generates new notebooks following the same schema.

import json
import base64

def build_notebook_cell(source: str) -> str:
    return source

def build_silver_notebook(config: dict) -> dict:
    source_path = config['source_path']
    target_path = config['target_path']
    primary_key = config['primary_key']
    columns = config['columns']  # list of {name, type, nullable}

    col_cast_lines = "\n    ".join(
        f".withColumn('{c[\"name\"]}', col('{c[\"name\"]}').cast('{c[\"type\"]}'))"
        for c in columns
    )

    cells = [
        {
            "source": "# Auto-generated silver notebook — do not edit directly\n"
                      "# Source config: " + config['source_name'],
            "cell_type": "markdown"
        },
        {
            "source": f"""from pyspark.sql.functions import col, current_timestamp
from delta.tables import DeltaTable

raw_df = spark.read.format("delta").load("{source_path}")

silver_df = (raw_df
    {col_cast_lines}
    .withColumn("_ingested_at", current_timestamp())
)""",
            "cell_type": "code"
        },
        {
            "source": f"""if DeltaTable.isDeltaTable(spark, "{target_path}"):
    dt = DeltaTable.forPath(spark, "{target_path}")
    (dt.alias("target")
       .merge(
           silver_df.alias("source"),
           "target.{primary_key} = source.{primary_key}"
       )
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())
else:
    silver_df.write.format("delta").mode("overwrite").save("{target_path}")""",
            "cell_type": "code"
        }
    ]
    return {"cells": cells, "metadata": {"name": config['notebook_name']}}

What Lives in the Config Table

CREATE TABLE [meta].[NotebookConfig] (
    NotebookID       INT IDENTITY(1,1) PRIMARY KEY,
    SourceName       NVARCHAR(100) NOT NULL,
    NotebookName     NVARCHAR(200) NOT NULL,
    SourcePath       NVARCHAR(500) NOT NULL,
    TargetPath       NVARCHAR(500) NOT NULL,
    PrimaryKey       NVARCHAR(100) NOT NULL,
    ColumnConfigJSON NVARCHAR(MAX) NOT NULL  -- JSON array of {name, type, nullable}
);

The ColumnConfigJSON column stores the schema as JSON. You populate it either manually or — better — by querying the source system's information schema and serializing the output.

Deploying the Generated Notebooks

Once you've generated the notebook JSON, import it via the Databricks REST API:

import requests

def import_notebook(workspace_url: str, token: str, path: str, content: dict) -> None:
    encoded = base64.b64encode(
        json.dumps(content).encode('utf-8')
    ).decode('utf-8')

    resp = requests.post(
        f"{workspace_url}/api/2.0/workspace/import",
        headers={"Authorization": f"Bearer {token}"},
        json={
            "path": path,
            "format": "JUPYTER",
            "language": "PYTHON",
            "content": encoded,
            "overwrite": True
        }
    )
    resp.raise_for_status()

The Limit of Generation

Generated notebooks handle the 90% that's structural boilerplate. The remaining 10% — business rules, source-specific quirks, non-standard date formats, columns that need lookups against reference tables — those still require hand-written logic. The generated notebook is a starting point and a scaffold, not a final artifact.

The pattern I landed on: generate the notebook, import it, then add a clearly marked "custom logic" cell where source-specific rules go. The generator overwrites everything except that cell on regeneration runs. Regeneration never touches custom logic cells. That boundary keeps the maintenance model clean. As always, I'm here to help.

Read more