Can an LLM Write My Airflow DAG?

The logical extension of "LLM helps me write Spark code" is "LLM writes the pipeline definition from a description." If it can generate a transformation function, can it generate the Airflow DAG that orchestrates the whole thing? I've spent the last few weeks testing this with Codex and text-davinci, and I have a concrete answer.

Short version: yes, with a specific and consistent failure mode that tells you exactly where the human still has to drive.

The Test

I described a pipeline in plain English — the kind of description you'd write in a ticket or a design doc — and asked the model to generate the Airflow DAG:

Write an Airflow DAG called daily_session_pipeline that runs at 2am UTC. It should: (1) check that yesterday's raw events partition exists in S3 at s3://data-lake/raw/events/event_date=EXECUTION_DATE, (2) run a Spark job at s3://spark-jobs/session_agg.py with the execution date as an argument, (3) run a data quality check that asserts the output table has at least 1000 rows, (4) send a Slack notification on failure. Use the execution date templating for all date references.

The output:

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.providers.slack.operators.slack_webhook import SlackWebhookOperator

default_args = {
    'owner': 'data-team',
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'on_failure_callback': lambda ctx: notify_slack(ctx),
}

with DAG(
    'daily_session_pipeline',
    default_args=default_args,
    start_date=datetime(2022, 1, 1),
    schedule_interval='0 2 * * *',
    catchup=False,
) as dag:

    check_upstream = BashOperator(
        task_id='check_upstream_partition',
        bash_command='aws s3 ls s3://data-lake/raw/events/event_date={{ ds }}/ | head -1',
    )

    run_spark = BashOperator(
        task_id='run_session_aggregation',
        bash_command='spark-submit s3://spark-jobs/session_agg.py --date {{ ds }}',
    )

    quality_check = PythonOperator(
        task_id='data_quality_check',
        python_callable=check_row_count,
        op_kwargs={'date': '{{ ds }}', 'min_rows': 1000},
    )

    check_upstream >> run_spark >> quality_check

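The structure is correct. The templating is right. The dependency chain is right. The Slack failure notification is wired into default_args as an on_failure_callback rather than as an explicit task, which is one of two valid approaches and a defensible choice. The output does leave notify_slack and check_row_count undefined and imports SlackWebhookOperator without using it, so those pieces still have to be written by hand.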
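For reference, here is a minimal sketch of the two helpers the DAG calls, check_row_count and notify_slack. None of this is model output. The count_rows hook and the build_failure_message name are hypothetical, the message format is my own, and the actual Slack delivery is left out. The only Airflow behavior it relies on is that a python_callable that raises fails its task, and that the callback context carries task_instance and ds.

```python
from types import SimpleNamespace
from typing import Callable


def check_row_count(date: str, min_rows: int, count_rows: Callable[[str], int]) -> int:
    """Raise if the output for `date` has fewer than `min_rows` rows.

    `count_rows` is a hypothetical hook: plug in whatever queries your
    warehouse or table format for the row count of that date's partition.
    """
    rows = count_rows(date)
    if rows < min_rows:
        # An exception raised inside a python_callable fails the Airflow task.
        raise ValueError(f"{date}: expected at least {min_rows} rows, found {rows}")
    return rows


def build_failure_message(context: dict) -> dict:
    """Build the payload a notify_slack(context) callback might send."""
    ti = context["task_instance"]
    return {"text": f"Task {ti.dag_id}.{ti.task_id} failed for {context['ds']}"}


# Stand-in for the context Airflow passes to on_failure_callback.
example_context = {
    "task_instance": SimpleNamespace(
        dag_id="daily_session_pipeline", task_id="run_session_aggregation"
    ),
    "ds": "2022-01-01",
}
example_message = build_failure_message(example_context)
```

Because the generated op_kwargs only pass date and min_rows, count_rows would need to be bound up front, for example with functools.partial(check_row_count, count_rows=...), before handing the callable to PythonOperator.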

The Consistent Failure Mode

What the model got wrong: the check_upstream task. I asked it to verify the partition exists. It generated aws s3 ls piped into head -1, and a shell pipeline reports the exit status of its last command unless pipefail is set. head -1 exits 0 on empty input, so the task succeeds no matter what aws s3 ls found or returned. The partition check silently passes even when there's no data.

A working check ends the pipeline with a command that fails on empty input, such as aws s3 ls ... | grep -c . (grep exits non-zero when nothing matches), or skips the shell entirely and uses an S3KeySensor. The model generated something that looks like a check and isn't.

This is the failure mode I described in the GPT-3 field notes last year: the model generates structurally valid, functionally broken logic, and you need to understand the underlying system to catch it. An engineer who doesn't know how exit codes propagate through a shell pipe would accept this DAG and ship a broken partition check.
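The exit-code behavior is easy to check locally without touching S3. The sketch below runs a few pipelines under bash -c and records the status the caller would see. It uses true and false as stand-ins for the real aws s3 ls call (true prints nothing and succeeds, false prints nothing and fails), and it assumes the task's command is run by bash without pipefail enabled.

```python
import subprocess


def bash_exit_code(command: str) -> int:
    """Run `command` with bash -c and return the exit code the caller sees."""
    return subprocess.run(["bash", "-c", command], capture_output=True).returncode


# Empty listing piped into head: head exits 0, so the "check" passes.
empty_then_head = bash_exit_code("true | head -1")            # 0

# Failed listing piped into head: the failure is masked the same way.
failed_then_head = bash_exit_code("false | head -1")          # 0

# grep -c . exits 1 when no line matches, so empty input fails the task.
empty_then_grep = bash_exit_code("true | grep -c .")          # 1

# With at least one line of output, the grep version passes.
nonempty_then_grep = bash_exit_code("echo found | grep -c .") # 0

# pipefail surfaces an upstream failure even with head at the end.
failed_with_pipefail = bash_exit_code("set -o pipefail; false | head -1")  # 1
```

Running the real bash_command through the same kind of harness against a prefix you know is empty is a cheap way to find out whether a generated check can actually fail.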

What This Tells You About the Workflow

LLM-generated DAG scaffolding is useful as a starting point, not an ending point. The structure, the templating, the dependency chain — these are exactly the tedious, pattern-consistent parts that LLMs handle well. The correctness of the individual task implementations requires domain knowledge the model doesn't consistently have.

The right workflow: generate the scaffold, review each task implementation against your knowledge of what it's actually supposed to do, fix the ones that look right but aren't. That's faster than writing from scratch and requires the same review discipline you'd apply to any code you didn't write yourself.

If you've been generating DAG scaffolding from descriptions and have hit different failure modes, I'd like to compare notes. As always, I'm here to help.
