Databricks Multi-Task Jobs: Orchestrating a Pipeline Without Leaving the Platform

Shannon Lowder

20 Nov 2020 — 1 min read

For two years I've been running Databricks pipelines as separate notebooks, either orchestrated by dbutils.notebook.run() or scheduled individually via the Jobs UI. It works, but the monitoring suffers: five separate jobs in the UI, no unified view of the pipeline, and no easy way to see that step 3 failed because step 2 returned bad data.

Multi-task Jobs changed that. Instead of five separate jobs, you define one job with five tasks and the dependencies between them. The monitoring view shows the full pipeline as a DAG, with per-task runtime and failure details. This is the version of Databricks Jobs that actually serves production pipeline needs.

Defining a Multi-Task Job

{
  "name": "daily_order_pipeline",
  "tasks": [
    {
      "task_key": "extract_raw",
      "description": "Extract from SQL Server to bronze",
      "notebook_task": {
        "notebook_path": "/pipelines/01_extract_orders",
        "base_parameters": {"processing_date": "{{start_date}}"}
      },
      "new_cluster": {
        "spark_version": "7.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 2
      }
    },
    {
      "task_key": "transform_silver",
      "description": "Clean and conform to silver layer",
      "depends_on": [{"task_key": "extract_raw"}],
      "notebook_task": {
        "notebook_path": "/pipelines/02_transform_orders"
      },
      "new_cluster": {
        "spark_version": "7.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 4
      }
    },
    {
      "task_key": "aggregate_gold",
      "description": "Build daily summary metrics",
      "depends_on": [{"task_key": "transform_silver"}],
      "notebook_task": {
        "notebook_path": "/pipelines/03_aggregate_orders"
      },
      "new_cluster": {
        "spark_version": "7.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 2
      }
    }
  ]
}
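
Before submitting a definition like the one above, it can help to check that every depends_on entry points at a task that actually exists. The following is a minimal sketch, not an official client: validate_tasks and create_job are hypothetical helper names, and create_job assumes the Jobs API 2.1 create endpoint with a personal access token supplied by the caller.

```python
import json
import urllib.request


def validate_tasks(job_spec):
    """Return a list of problems found in the job's task graph (empty if none)."""
    problems = []
    keys = [task["task_key"] for task in job_spec.get("tasks", [])]

    # Task keys must be unique within a job.
    for key in sorted(set(keys)):
        if keys.count(key) > 1:
            problems.append(f"duplicate task_key: {key}")

    # Every dependency must name a task defined in the same job.
    known = set(keys)
    for task in job_spec.get("tasks", []):
        for dep in task.get("depends_on", []):
            if dep["task_key"] not in known:
                problems.append(
                    f"{task['task_key']} depends on unknown task {dep['task_key']}"
                )
    return problems


def create_job(host, token, job_spec):
    """POST the spec to the workspace and return the parsed JSON response.

    Assumption: Jobs API 2.1 and bearer-token auth. `host` is the workspace
    URL, e.g. "https://<your-workspace>"; both values come from the caller.
    """
    problems = validate_tasks(job_spec)
    if problems:
        raise ValueError("; ".join(problems))
    request = urllib.request.Request(
        url=f"{host}/api/2.1/jobs/create",
        data=json.dumps(job_spec).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Running validate_tasks locally catches a mistyped task_key before the API call is made, which is cheaper than finding out from a rejected request.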

Parallel Tasks

Tasks that depend on the same upstream task, and not on each other, run in parallel once that task completes:

{
  "tasks": [
    {"task_key": "extract_raw", ...},
    {
      "task_key": "process_west",
      "depends_on": [{"task_key": "extract_raw"}],
      ...
    },
    {
      "task_key": "process_east",
      "depends_on": [{"task_key": "extract_raw"}],
      ...
    },
    {
      "task_key": "combine_regions",
      "depends_on": [
        {"task_key": "process_west"},
        {"task_key": "process_east"}
      ],
      ...
    }
  ]
}

process_west and process_east run simultaneously after extract_raw completes. combine_regions waits for both. This replaces the ThreadPoolExecutor pattern I was using with dbutils.notebook.run() — same outcome, but now it's visible in the Jobs UI and each task's runtime and logs are isolated.
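
To see which tasks can overlap, the depends_on graph can be grouped into "waves" of tasks whose dependencies are all satisfied. This is only an illustration of the ordering, under the assumption that a task becomes eligible as soon as all of its upstream tasks finish. execution_waves is a hypothetical helper, and the real scheduler starts each task individually rather than in strict waves.

```python
def execution_waves(tasks):
    """Group task keys into waves; every task in a wave has all dependencies done."""
    pending = {
        task["task_key"]: {dep["task_key"] for dep in task.get("depends_on", [])}
        for task in tasks
    }
    done = set()
    waves = []
    while pending:
        # A task is ready when its dependency set is a subset of finished tasks.
        ready = sorted(key for key, deps in pending.items() if deps <= done)
        if not ready:
            raise ValueError(
                "cycle or unknown dependency among: " + ", ".join(sorted(pending))
            )
        waves.append(ready)
        done.update(ready)
        for key in ready:
            del pending[key]
    return waves


# Same shape as the job above, with the elided fields left out.
tasks = [
    {"task_key": "extract_raw"},
    {"task_key": "process_west", "depends_on": [{"task_key": "extract_raw"}]},
    {"task_key": "process_east", "depends_on": [{"task_key": "extract_raw"}]},
    {
        "task_key": "combine_regions",
        "depends_on": [{"task_key": "process_west"}, {"task_key": "process_east"}],
    },
]

waves = execution_waves(tasks)
print(waves)
# [['extract_raw'], ['process_east', 'process_west'], ['combine_regions']]
```

The middle wave holds both regional tasks, which is the fan-out and fan-in shape the job definition describes.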

Task-Level Retry and Timeout

{
  "task_key": "transform_silver",
  "max_retries": 2,
  "min_retry_interval_millis": 60000,
  "retry_on_timeout": false,
  "timeout_seconds": 3600,
  ...
}
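
The retry fields above behave roughly like the loop below: one initial attempt plus up to max_retries further attempts, with a pause between them. This is a simplified sketch of the semantics, not Databricks' implementation. run_with_retries is a hypothetical name, the sleep parameter exists only so the pause can be stubbed out, and timeout_seconds and retry_on_timeout are not modeled.

```python
import time


def run_with_retries(task, max_retries, min_retry_interval_millis, sleep=time.sleep):
    """Call task() until it succeeds or the retry budget is spent.

    Returns (result, attempts). Re-raises the last error once retries run out.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except Exception:
            # The first attempt is not a retry, so allow max_retries more.
            if attempts > max_retries:
                raise
            sleep(min_retry_interval_millis / 1000)


# Usage: a task that fails twice with a transient error, then succeeds.
calls = {"count": 0}


def flaky_transform():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


pauses = []  # record the pauses instead of actually sleeping
result, attempts = run_with_retries(
    flaky_transform,
    max_retries=2,
    min_retry_interval_millis=60000,
    sleep=pauses.append,
)
print(result, attempts, pauses)
# ok 3 [60.0, 60.0]
```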

Per-task retry configuration means a transient failure in transform_silver doesn't force a rerun of the entire pipeline. Databricks retries that task, and only that task, up to max_retries times, with min_retry_interval_millis (here 60 seconds) as the minimum interval between attempts. The upstream extract is preserved; the downstream aggregation waits. That's the model you want for a production pipeline. As always, I'm here to help.
