Metadata-Driven Pipelines Meet LLMs: When Your Config Becomes a Prompt

I've been building config-driven pipelines since 2015. The pattern: a structured JSON or YAML file describes the data — source paths, target paths, column mappings, business rules — and a generic Spark job reads the config at runtime and executes it. The logic lives in the job; the knowledge about specific data lives in the config. Change what you process by changing the config, not the code.

Something clicked recently when I was writing a new pipeline config by hand: I was narrating. I was describing the source schema in English in the comments, then translating that description into JSON keys. The English description was the real specification; the JSON was the machine-readable translation. And translating structured English into structured JSON is exactly what a language model is good at.

The Experiment

I took a client's source schema documentation — the kind of thing that comes from a vendor as a PDF, describing field names, types, and semantics — and fed it to the model with a prompt describing my config schema. Asked it to generate the pipeline config.

The source description looked like this:

The orders table contains the following fields:
- order_id (string): Unique identifier for the order. Never null.
- customer_id (string): ID of the placing customer. Null for guest checkouts.
- order_ts (integer): Unix timestamp in milliseconds of order placement.
- order_status (string): One of: pending, confirmed, shipped, cancelled, returned.
- item_count (integer): Number of line items. Always >= 1.
- subtotal_cents (integer): Order subtotal in cents before tax and shipping.
- is_test_order (boolean): True for orders placed by internal QA accounts. Filter these out.

The prompt added my config schema structure and asked for the output. What came back:

{
  "name": "orders",
  "source_path": "s3://data-lake/raw/orders/",
  "target_path": "s3://data-lake/processed/orders/",
  "partition_column": "order_date",
  "column_mappings": {
    "order_id": "order_id",
    "customer_id": "customer_id",
    "order_ts": "order_ts",
    "order_status": "status",
    "item_count": "item_count",
    "subtotal_cents": "subtotal_cents"
  },
  "derived_columns": {
    "order_date": "CAST(FROM_UNIXTIME(order_ts / 1000) AS DATE)"
  },
  "filters": [
    "is_test_order = false",
    "item_count >= 1"
  ],
  "nullable_keys": ["customer_id"],
  "not_null_constraints": ["order_id", "order_ts"]
}

What It Got Right

The column mappings are correct. The derived column for order_date correctly handles the millisecond-to-date conversion from the description. The test order filter is there. The nullability constraints reflect what the documentation said. This is 80% of a working config, and it took thirty seconds to generate.

What It Got Wrong

The order_status mapping renamed the column to status — I didn't ask for that. The model made an inference about naming conventions that wasn't in the spec and that I would have caught in review, but that illustrates the category of error: silent inference about things the spec didn't specify. The model fills gaps with reasonable defaults rather than flagging that a decision was made.

More importantly, the config doesn't capture the business rule I'd want around order_status — that cancelled and returned orders should be routed to a separate partition from active orders. That rule wasn't in the source documentation. It lives in domain knowledge that exists nowhere the model can see.

The Workflow That Works

Generate the config from the documentation. Review it against what you know about the data that isn't in the documentation. Add the missing business rules. Correct the silent inferences. What you're left with is a config that reflects both the documented schema and the institutional knowledge — and you got there faster than writing it from scratch.

The model handles the translation layer. You handle the judgment layer. That division is sustainable. As always, I'm here to help.

Read more