Automatic Prompt Refinement: The Goal Worth Chasing

The goal I had been working toward since month four was simple to state and hard to execute: make the AI tools I use produce better output without requiring me to write better prompts. The knowledge retrieval system was the storage and retrieval half of that goal. Automatic prompt refinement was the other half — the layer that took retrieved knowledge and turned it into prompt context that actually improved model behavior.

After several months of running the system in production, here is what I learned about what works and what doesn't.

Why Prompts Need Refinement

The prompts I write in the course of a work session are not optimized for model comprehension. They're optimized for my speed of expression. "Why is the fiscal Q1 revenue number different from what the client expects?" is a natural way to ask that question. It's not the ideal prompt for a model that doesn't know what fiscal Q1 means in this client's context, doesn't know the client's revenue definition, and doesn't know what "different" means here (wrong sign? wrong magnitude? wrong category?).

Refinement adds that missing context automatically. The refined version of the same question might look like:

[Context from knowledge base: Fiscal Q1 runs October 1 – December 31. Revenue for this client excludes intercompany transactions. The reporting discrepancy reported in November 2024 was traced to a currency conversion bug in the v2 pipeline that has since been patched.]

Why is the fiscal Q1 revenue number different from what the client expects?

The model now has the domain vocabulary, the relevant historical context, and a potential lead on the answer — all surfaced from the knowledge base without me explicitly including them.

The Refinement Pipeline

The pipeline has four steps:

Signal extraction. Parse the prompt and identify the entities, concepts, and question type. What is the prompt about? Which project does it concern? What type of knowledge is most relevant — structured facts, domain explanations, historical decisions?

Retrieval. Run the hybrid search against the knowledge base using the extracted signals as query terms. Weight recent entries slightly higher than older ones (recent decisions are usually more relevant than older ones, all else being equal).

Ranking and selection. Take the top retrieval results and determine which are actually relevant enough to include. Not all retrieved entries improve the prompt — some will be tangentially related and would add noise. A simple relevance threshold based on the hybrid retrieval score handles most cases.

Injection. Format the selected entries as a context block and prepend to the prompt. The format matters — entries dumped as raw text perform worse than entries with clear labels identifying what each entry is and why it's included.

The Format That Works

After testing several injection formats, the one that produces the best model behavior is a structured XML-like block with explicit labels:

<project_context project="crm-ingest">
  <fact type="domain" confidence="1.0">
    Fiscal year runs October 1 – September 30.
    Fiscal Q1 = October 1 – December 31.
  </fact>
  <fact type="decision" confidence="1.0">
    Revenue excludes intercompany transactions.
    Source: finance team directive, 2024-08-15.
  </fact>
  <fact type="gotcha" confidence="0.9">
    Currency conversion bug in v2 pipeline caused Q1 revenue
    understatement of ~2.3%. Patched 2024-11-22.
  </fact>
</project_context>

The tags help the model distinguish injected context from the user's actual question. The type and confidence attributes tell the model how to weight each piece of context. The Source field in decision entries gives the model something to cite when it references the fact in its response.

What It Actually Improves

The most consistent improvement is in responses that would otherwise hallucinate domain details. When a model doesn't know your fiscal calendar, it might make up a reasonable-sounding one. With the context block present, it uses your actual definition. That's the clearest win: reducing confident-but-wrong responses by providing the correct facts before the model has to guess.

The subtler improvement is in follow-up questions. A model that has received project context in the first message can use that context in subsequent turns without me re-establishing it. The session builds coherent understanding across messages rather than starting fresh for each turn.

What It Doesn't Fix

Refinement doesn't help when the knowledge base doesn't have the relevant facts. Obvious, but worth stating: the system is only as good as what's been captured. A thin knowledge base produces thin refinement. The investment in ingestion quality pays forward into retrieval quality into response quality — the chain is only as strong as its weakest link.

Refinement also doesn't substitute for prompt craft on the question itself. A vague question refined with good context still produces a vague answer. The refinement layer handles the "what does the model need to know to answer this well?" problem. The "what is the right question to ask?" problem is still yours.

That might sound limiting. I think of it as appropriate scope. Automatic refinement should do one thing well: surface and inject relevant knowledge. Replacing human judgment about what to ask is a different problem — one that belongs further up the stack, in the orchestration layer. Which is exactly where work was headed next. As always, I'm here to help if you want to dig into the implementation details.

Read more