A notebook that you run manually is a prototype. A notebook that runs on a schedule, retries on failure, sends you an email when it breaks, and logs each run's success or failure is a pipeline. In Databricks, Jobs are the mechanism that turns the first thing into the second.
Creating a Job
In the Databricks workspace, go to Jobs in the left sidebar and click Create Job. A job in 2020 is primarily notebook-based: you select a notebook, configure the cluster it runs on, set a schedule, and define notification settings.
The minimum viable job configuration:
- Task type: Notebook
- Notebook path: the workspace path to your notebook (e.g., /Users/shannon/pipelines/orders_etl)
- Cluster: a new job cluster, not an existing interactive cluster (job clusters are terminated after the run)
- Schedule: a cron expression or a preconfigured interval
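The same four settings can also be written down as a Jobs API request body, which is handy for keeping job definitions in version control. The following is a minimal sketch, assuming the Jobs API 2.0 field names (new_cluster, notebook_task, schedule); the helper build_job_settings, the job name, the runtime version, the VM type, and the worker count are illustrative assumptions, not values from this article.

```python
import json


def build_job_settings(name, notebook_path, cron, timezone="UTC"):
    """Return a Jobs API 2.0 style settings dict for a single-notebook job."""
    return {
        "name": name,
        # A new job cluster: created for the run, terminated afterwards.
        "new_cluster": {
            "spark_version": "7.3.x-scala2.12",  # assumed runtime version
            "node_type_id": "Standard_DS3_v2",   # assumed Azure VM type
            "num_workers": 2,                     # assumed cluster size
        },
        # Task type: Notebook, identified by its workspace path.
        "notebook_task": {"notebook_path": notebook_path},
        # Schedule: a Quartz cron expression plus the timezone it is evaluated in.
        "schedule": {
            "quartz_cron_expression": cron,
            "timezone_id": timezone,
        },
    }


settings = build_job_settings(
    name="orders_etl_daily",  # hypothetical job name
    notebook_path="/Users/shannon/pipelines/orders_etl",
    cron="0 0 2 * * ?",
)
print(json.dumps(settings, indent=2))
```

Using new_cluster rather than existing_cluster_id is what makes this a job cluster; the printed JSON could be sent to the jobs/create endpoint once the assumed values are replaced with ones valid in your workspace.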
Job Cluster vs. Interactive Cluster
Configure your job to use a new job cluster, not an existing all-purpose (interactive) cluster. Job clusters are:
- Created fresh for each run (no state from previous runs leaking in)
- Terminated as soon as the job finishes (no idle cost)
- Billed at the lower Jobs compute DBU rate
Running a scheduled job on a persistent interactive cluster is the right choice only if the job runs so frequently that cluster startup time is meaningful and the cluster was going to be running anyway. For most scheduled batch pipelines (daily, hourly), job clusters are more cost-effective and less error-prone.
Scheduling with Cron
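Databricks job scheduling uses Quartz cron format, with fields in this order: seconds, minutes, hours, day-of-month, month, day-of-week, and an optional year. Unlike standard Unix cron, the expression starts with a seconds field, and a ? marks whichever of day-of-month or day-of-week is left unspecified.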
-- Run daily at 2:00 AM UTC
0 0 2 * * ?
-- Run every 6 hours
0 0 0/6 * * ?
-- Run Monday through Friday at 8:00 AM
0 0 8 ? * MON-FRI
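Because the field order differs from Unix cron, it is easy to paste a five-field expression by mistake. Here is a small sanity check, offered as a sketch: split_quartz is a hypothetical helper, and the rule that exactly one of day-of-month and day-of-week is ? reflects my understanding of Quartz rather than anything Databricks-specific. It only labels the fields; it does not validate the values inside them.

```python
QUARTZ_FIELDS = [
    "seconds", "minutes", "hours",
    "day_of_month", "month", "day_of_week", "year",
]


def split_quartz(expression):
    """Label the fields of a Quartz cron expression (year is optional)."""
    parts = expression.split()
    if len(parts) not in (6, 7):
        raise ValueError(
            f"expected 6 or 7 fields, got {len(parts)}: {expression!r}"
        )
    # zip stops at the shorter sequence, so "year" is absent for 6 fields.
    fields = dict(zip(QUARTZ_FIELDS, parts))
    # Assumption: Quartz wants '?' in exactly one of the two day fields.
    dom_unset = fields["day_of_month"] == "?"
    dow_unset = fields["day_of_week"] == "?"
    if dom_unset == dow_unset:
        raise ValueError(
            "exactly one of day-of-month and day-of-week should be '?'"
        )
    return fields


weekday_morning = split_quartz("0 0 8 ? * MON-FRI")
print(weekday_morning)
```

A Unix-style expression such as 0 2 * * * has only five fields and is rejected by the length check before it ever reaches a job definition.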
Passing Parameters to Notebooks
Notebooks aren't parameterized by default, but Databricks Widgets let you pass values from a job's run configuration:
# In your notebook, define a widget
dbutils.widgets.text("run_date", "", "Run Date (YYYY-MM-DD)")
# Read the widget value
run_date = dbutils.widgets.get("run_date")
# Use it in your pipeline
df = spark.read.parquet(f"abfss://<container>@<storage-account>.dfs.core.windows.net/raw/orders/date={run_date}/")
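In the job configuration, set the notebook parameter run_date to a value. A fixed value makes every run process the same date, so it mainly suits one-off runs and backfills; for scheduled runs, you can leave the parameter empty and let the notebook compute the date dynamically (for example, from CURRENT_DATE() in Spark SQL) instead of relying on the passed parameter.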
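One way to support both modes is to treat an empty widget value as "use today". The sketch below is an assumption about how you might structure it, not a Databricks feature: resolve_run_date is a hypothetical helper, and date.today() follows the cluster's local timezone, which you should confirm matches the timezone of your schedule.

```python
from datetime import date, datetime


def resolve_run_date(param, today=None):
    """Return the run date as YYYY-MM-DD.

    A non-empty parameter wins (useful for backfills); an empty one
    falls back to today's date (useful for scheduled runs).
    """
    if param and param.strip():
        # strptime raises ValueError on malformed input, so a bad
        # parameter fails the run early instead of reading a wrong path.
        parsed = datetime.strptime(param.strip(), "%Y-%m-%d").date()
        return parsed.isoformat()
    return (today or date.today()).isoformat()


# In the notebook this would be:
#   run_date = resolve_run_date(dbutils.widgets.get("run_date"))
backfill_date = resolve_run_date("2020-06-01")
scheduled_date = resolve_run_date("", today=date(2020, 6, 2))
print(backfill_date, scheduled_date)
```

The today argument exists only so the fallback can be tested with a fixed date; in a job you would omit it.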
Retry Configuration
Transient failures — network timeouts, spot instance preemption, momentary storage API errors — happen in cloud environments. Configure retries so your pipeline doesn't require manual intervention for transient issues:
- Max retries: 1–3 for most batch jobs
- Retry interval: 5–10 minutes (gives transient issues time to clear)
A job that fails after all retries are exhausted triggers the failure notification. If a job fails after 1 retry within seconds, that's usually a code error, not a transient issue — investigate before increasing the retry count.
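If you define jobs through the API rather than the UI, the retry settings map onto a few fields of the job definition. This is a sketch assuming the Jobs API 2.0 field names max_retries, min_retry_interval_millis, and retry_on_timeout; the helper retry_settings and its defaults are illustrative.

```python
def retry_settings(max_retries=2, interval_minutes=5, retry_on_timeout=False):
    """Build the retry portion of a Jobs API 2.0 style job definition."""
    if max_retries < 0 or interval_minutes < 0:
        raise ValueError("retry count and interval must not be negative")
    return {
        "max_retries": max_retries,
        # The API expresses the wait between attempts in milliseconds.
        "min_retry_interval_millis": interval_minutes * 60 * 1000,
        "retry_on_timeout": retry_on_timeout,
    }


batch_retries = retry_settings(max_retries=2, interval_minutes=5)
print(batch_retries)
```

These keys would sit at the top level of the job settings, next to the cluster, task, and schedule definitions.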
Notifications
Configure email notifications for job start, success, and failure. At minimum, configure failure notifications — you want to know within minutes of a production pipeline failing, not hours later when a downstream dashboard looks wrong.
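In an API-defined job, the same choice appears as an email_notifications block. A minimal sketch, assuming the Jobs API 2.0 keys on_start, on_success, and on_failure; the helper email_notifications and the address are hypothetical.

```python
def email_notifications(on_failure, on_start=None, on_success=None):
    """Build an email_notifications block; failure recipients are required."""
    if not on_failure:
        raise ValueError("configure at least one failure recipient")
    block = {"on_failure": list(on_failure)}
    # Start and success emails are optional and omitted when unused.
    if on_start:
        block["on_start"] = list(on_start)
    if on_success:
        block["on_success"] = list(on_success)
    return block


alerts = email_notifications(on_failure=["data-oncall@example.com"])
print(alerts)
```

Requiring a failure recipient in the helper encodes the "at minimum" rule, so a job definition without failure alerts cannot be built by accident.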
Databricks also exposes the Jobs API, so you can integrate job run status with PagerDuty, Slack, or any other alerting system:
curl -X GET -H "Authorization: Bearer $DATABRICKS_TOKEN" "https://adb-XXXXXXXXX.azuredatabricks.net/api/2.0/jobs/runs/list?job_id=123"
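The same call from Python might look like the sketch below. It assumes the runs/list response contains a runs array whose entries carry run_id and a state object with result_state, as in Jobs API 2.0; fetch_runs and failed_runs are hypothetical helper names, and forwarding the result to PagerDuty or Slack is left out because it depends on your alerting setup.

```python
import json
import urllib.request


def fetch_runs(host, token, job_id):
    """Call the runs/list endpoint and return the parsed JSON response."""
    url = f"{host}/api/2.0/jobs/runs/list?job_id={job_id}"
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def failed_runs(payload):
    """Return the run_ids of runs that ended as FAILED or TIMEDOUT."""
    failed = []
    for run in payload.get("runs", []):
        result = run.get("state", {}).get("result_state")
        if result in ("FAILED", "TIMEDOUT"):
            failed.append(run["run_id"])
    return failed


# Hand-written sample in the assumed response shape, not real output.
sample = {
    "runs": [
        {"run_id": 1, "state": {"life_cycle_state": "TERMINATED", "result_state": "SUCCESS"}},
        {"run_id": 2, "state": {"life_cycle_state": "TERMINATED", "result_state": "FAILED"}},
        {"run_id": 3, "state": {"life_cycle_state": "RUNNING"}},
    ]
}
print(failed_runs(sample))
```

Keeping the filtering in its own function means it can be tested without network access; in practice you would pass it the result of fetch_runs(host, token, 123), with the workspace URL and token read from your environment.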
Monitoring Runs
The Jobs UI shows the history of all runs for each job: duration, status, cluster configuration used, and a link to the notebook run's output. Click into any run to see the notebook output as it appeared when the job executed — all print() statements and display calls are captured.
For post-run debugging, the Spark UI for a job run stays accessible for a configurable retention period. The stage-level metrics (shuffle size, task durations, data read/written) are there even after the cluster is terminated.
The operational principle that applies here is the same one that applied to SQL Server Agent jobs: if it runs on a schedule and something downstream depends on it, it needs monitoring, alerting, and a defined failure procedure. A notebook without those is a script that runs sometimes. A job with those is a pipeline.