Databricks Repos: Git-Native Development in Your Workspace

Databricks Repos has been available since 2021 and is fully integrated with Git. If you're still managing notebooks by exporting them as .dbc files, checking them into Git manually, or using a sync script — stop. Repos is the right answer for notebook version control, and it's been stable and production-ready for long enough that there's no reason not to be using it.

What Repos Does

Repos syncs a directory in your Databricks workspace with a Git repository. You check out a branch, make changes in the workspace, commit directly from the Databricks UI, and push to your remote. Notebooks behave like files in Git: diffs, branches, and pull requests all work the way they do in any other repository.

This is meaningfully different from the old "export notebook as .py, add to Git" workflow because changes are bidirectional and the sync is managed. You can pull updates from remote, switch branches, and work with multiple notebooks in a directory structure that maps directly to your repo layout.

Setting Up a Repo

import requests

DATABRICKS_HOST = "https://your-workspace.azuredatabricks.net"
TOKEN = "dapi..."

def create_repo(git_url: str, provider: str, path: str) -> dict:
    resp = requests.post(
        f"{DATABRICKS_HOST}/api/2.0/repos",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={
            "url": git_url,
            "provider": provider,  # gitHub | azureDevOpsServices | bitbucketCloud
            "path": path,  # e.g., "/Repos/data-engineering/order-pipeline"
        },
    )
    resp.raise_for_status()
    return resp.json()

# Create a repo linked to your Git provider
repo = create_repo(
    git_url="https://github.com/your-org/databricks-pipelines.git",
    provider="gitHub",
    path="/Repos/data-engineering/databricks-pipelines",
)
print(f"Repo ID: {repo['id']}, Branch: {repo['branch']}")
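The snippet above hard-codes the host and token for brevity. A minimal sketch of reading them from the environment instead, assuming the variable names DATABRICKS_HOST and DATABRICKS_TOKEN (the names the Databricks CLI reads). The helpers load_settings and auth_headers are hypothetical names for illustration, not part of any library:

```python
import os


def load_settings(env=None) -> tuple[str, str]:
    """Return (host, token) from environment variables, failing early if unset."""
    env = os.environ if env is None else env
    missing = [
        name for name in ("DATABRICKS_HOST", "DATABRICKS_TOKEN") if not env.get(name)
    ]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    # Strip a trailing slash so f"{host}/api/2.0/repos" does not produce "//api".
    return env["DATABRICKS_HOST"].rstrip("/"), env["DATABRICKS_TOKEN"]


def auth_headers(token: str) -> dict:
    """Build the bearer-token header used by the requests calls in this post."""
    return {"Authorization": f"Bearer {token}"}
```

In a CI job, the two variables would typically come from the pipeline's secret store, so the token never lands in the repository.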

Updating a Repo in CI/CD

def update_repo(repo_id: int, branch: str) -> None:
    resp = requests.patch(
        f"{DATABRICKS_HOST}/api/2.0/repos/{repo_id}",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"branch": branch},
    )
    resp.raise_for_status()

# Pull latest changes on main (used in a CI/CD deploy step)
update_repo(repo_id=12345, branch="main")

In a CI/CD pipeline, your deployment job calls this endpoint to pull the latest main branch into the production workspace Repo. Once the call succeeds, the notebooks in the workspace reflect the latest committed code. No file uploads, no .dbc exports.
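A deploy step usually knows the Repo's workspace path rather than its numeric ID. The sketch below looks the ID up by path and then pulls the branch. It assumes the Repos list endpoint (GET /api/2.0/repos) accepts a path_prefix query parameter and returns a "repos" array, and it ignores pagination. The session argument is assumed to be a requests.Session with the Authorization header already set; find_repo_id and deploy_branch are hypothetical helper names.

```python
def find_repo_id(session, host: str, repo_path: str) -> int:
    """Look up a Repo's numeric ID from its workspace path."""
    resp = session.get(
        f"{host}/api/2.0/repos",
        params={"path_prefix": repo_path},
        timeout=30,
    )
    resp.raise_for_status()
    # A prefix can also match siblings (e.g. ".../pipelines-old"), so compare exactly.
    for repo in resp.json().get("repos", []):
        if repo.get("path") == repo_path:
            return repo["id"]
    raise LookupError(f"No Repo found at {repo_path}")


def deploy_branch(session, host: str, repo_path: str, branch: str = "main") -> dict:
    """Check out `branch` in the Repo at `repo_path` and pull its latest commit."""
    repo_id = find_repo_id(session, host, repo_path)
    resp = session.patch(
        f"{host}/api/2.0/repos/{repo_id}",
        json={"branch": branch},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```

Passing the session in, rather than calling requests directly, keeps authentication in one place and lets the function be exercised in tests with a stub object.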

The Directory Structure Matters

databricks-pipelines/
├── notebooks/
│   ├── bronze/
│   │   ├── 01_ingest_orders.py
│   │   └── 01_ingest_products.py
│   ├── silver/
│   │   └── 02_transform_orders.py
│   └── gold/
│       └── 03_build_daily_summary.py
├── lib/
│   └── pipeline_utils.py
└── tests/
    └── test_transform_orders.py

Keep notebooks in a directory structure that mirrors your pipeline architecture. The Repos workspace path maps directly to this structure — a notebook at notebooks/silver/02_transform_orders.py in your repo becomes accessible at /Repos/data-engineering/databricks-pipelines/notebooks/silver/02_transform_orders in the workspace.
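The path mapping described above can be sketched as a small helper. It assumes the file is a notebook source file, whose extension is dropped in the workspace; plain modules such as lib/pipeline_utils.py are assumed to keep theirs. The function name workspace_notebook_path is hypothetical.

```python
from pathlib import PurePosixPath


def workspace_notebook_path(repo_root: str, repo_relative_file: str) -> str:
    """Map a notebook source file in the Git repo to its workspace path."""
    # Notebooks appear in the workspace without their source-file extension.
    relative = PurePosixPath(repo_relative_file).with_suffix("")
    return str(PurePosixPath(repo_root) / relative)


print(
    workspace_notebook_path(
        "/Repos/data-engineering/databricks-pipelines",
        "notebooks/silver/02_transform_orders.py",
    )
)
# prints /Repos/data-engineering/databricks-pipelines/notebooks/silver/02_transform_orders
```

A helper like this is handy when a job definition is generated from the repo layout, since job tasks refer to notebooks by workspace path.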

Testing Before Merge

Create a separate Repo that tracks your feature branches. CI runs on the feature branch by updating that dev Repo to the PR branch and running a job against it. Main is what's in production. Repos give you a clean environment-per-branch model without duplicating the workspace.
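A rough sketch of that CI step: point the dev Repo at the PR branch, then trigger a job. It assumes the Jobs API run-now endpoint (POST /api/2.1/jobs/run-now) takes a job_id and returns a run_id, and that the job's tasks already reference notebooks under the dev Repo's path. The session argument is assumed to be a requests.Session carrying the Authorization header; run_ci_for_branch, dev_repo_id, and job_id are hypothetical names.

```python
def run_ci_for_branch(
    session, host: str, dev_repo_id: int, job_id: int, pr_branch: str
) -> int:
    """Check out the PR branch in the dev Repo, then start the test job against it."""
    # Step 1: switch the dev Repo to the branch under review.
    update = session.patch(
        f"{host}/api/2.0/repos/{dev_repo_id}",
        json={"branch": pr_branch},
        timeout=30,
    )
    update.raise_for_status()

    # Step 2: trigger the job whose tasks point at notebooks in the dev Repo.
    run = session.post(
        f"{host}/api/2.1/jobs/run-now",
        json={"job_id": job_id},
        timeout=30,
    )
    run.raise_for_status()
    return run.json()["run_id"]
```

One caveat with a single shared dev Repo: two PRs building at the same time would switch the branch under each other, so CI runs against it likely need to be serialized.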
