Vector Search and Retrieval: The Quiet Workhorse of Enterprise AI

The thing nobody says loudly at AI conferences: most GenAI application failures are retrieval failures. The model itself — the LLM doing the generation — is usually not the problem. The problem is that the model received the wrong context, incomplete context, or no relevant context at all, and generated confidently wrong output as a result. Vector search is what fixes that. It's not glamorous, it doesn't get the keynote stage, but it's the backbone of every enterprise RAG application that actually works.

Why Retrieval Matters More Than Model Size

There's a mental model that equates "better AI" with "bigger model." For enterprise applications this is wrong, and it's worth being direct about it. A 70-billion-parameter model with no access to your company's data will usually perform worse on your use case than a 7-billion-parameter model that retrieves accurate, relevant documents before generating a response.

The reason is that LLMs don't have persistent memory; they have context. Whatever is in the context window when the model generates its response determines the quality of that response. If the context contains accurate, relevant information about the specific question, the model generates a good answer. If it doesn't, the model either hallucinates or says it doesn't know. Retrieval is the mechanism that populates the context correctly. Model size is a secondary factor.
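To make "retrieval populates the context" concrete, here is a minimal sketch of the step between the retriever and the model. The names (RetrievedChunk, build_context, build_prompt) and the character budget are illustrative assumptions, not part of any particular framework, and a real system would budget in tokens rather than characters.

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    doc_id: str
    text: str
    score: float  # similarity score from the retriever; higher is better


def build_context(chunks: list[RetrievedChunk], max_chars: int = 2000) -> str:
    """Pack the highest-scoring chunks into a context string within a rough size budget."""
    parts: list[str] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        entry = f"[{chunk.doc_id}] {chunk.text}"
        if used + len(entry) > max_chars:
            break  # stop at the first chunk that would exceed the budget
        parts.append(entry)
        used += len(entry)
    return "\n\n".join(parts)


def build_prompt(question: str, chunks: list[RetrievedChunk]) -> str:
    """Assemble the text the model actually sees: instructions, context, question."""
    context = build_context(chunks)
    if not context:
        # Nothing relevant was retrieved: say so explicitly instead of leaving a gap.
        context = "(no relevant documents found)"
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

Everything the model knows about your company arrives through that context string, which is why the quality of the chunks passed in matters more than the parameter count on the other side.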

How Databricks' UC-Native Vector Search Helps

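Databricks Vector Search is generally available, and its indexes live in Unity Catalog. This sounds like a catalog feature, but it has real operational implications. The most important: a vector index is a UC securable, governed by the same grant model as the tables around it. If a user doesn't have SELECT on the index, they don't get results back from it, so access to the embedded content is decided in the same place as access to the source table. One caveat worth checking against the current documentation: as I understand it, row filters and column masks on the source table are not applied to index queries, so row-level entitlements have to be enforced with query filters in the application. Governing the index in the catalog is what prevents what I've started calling "shadow RAG": retrieval systems that accidentally let users access content they're not authorized to see, because the embedding service bypasses the data access layer.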

from databricks.vector_search.client import VectorSearchClient

vs_client = VectorSearchClient()

# Create an index that stays in sync with a Delta table.
# The index is registered in Unity Catalog and governed by UC grants.
# The source table needs Change Data Feed enabled for the sync to work.
vs_client.create_delta_sync_index(
    endpoint_name="prod-vector-search",
    index_name="prod_analytics.knowledge.policy_docs_index",
    source_table_name="prod_analytics.knowledge.policy_documents",
    pipeline_type="TRIGGERED",  # or CONTINUOUS for real-time sync
    primary_key="doc_id",
    embedding_source_column="content",
    embedding_model_endpoint_name="databricks-gte-large-en",
)

# Query the index. The caller needs access to the index in UC;
# the filter narrows results to the billing category.
index = vs_client.get_index(
    endpoint_name="prod-vector-search",
    index_name="prod_analytics.knowledge.policy_docs_index",
)
results = index.similarity_search(
    columns=["doc_id", "title", "content", "category"],
    query_text="what is the late payment policy for enterprise accounts",
    num_results=5,
    filters={"category": "billing"},
)
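The "shadow RAG" risk can also be guarded against in the application itself, as a second line of defense rather than a replacement for catalog permissions. The sketch below is a hypothetical post-retrieval check: it assumes each retrieved row has already been turned into a dict with a category field, and that the caller's allowed categories come from some entitlement lookup that is not shown. Neither the function name nor the row shape comes from the Databricks client.

```python
def filter_by_entitlement(rows: list[dict], allowed_categories: set[str]) -> list[dict]:
    """Keep only retrieved rows whose category the caller is entitled to see.

    Rows without a category are dropped, so the check fails closed.
    """
    return [row for row in rows if row.get("category") in allowed_categories]


# Hypothetical rows, shaped like the columns requested in the query above.
rows = [
    {"doc_id": "1", "title": "Late payments", "category": "billing"},
    {"doc_id": "2", "title": "Salary bands", "category": "hr"},
    {"doc_id": "3", "title": "Untagged draft"},
]
visible = filter_by_entitlement(rows, allowed_categories={"billing"})
```

Failing closed on untagged rows is a deliberate choice here: a document nobody classified is exactly the kind that leaks.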

RAG Failures I've Seen in the Field

The most common failure is embedding model mismatch: you embed documents with one model and queries with a different model, or you use the same model but the query style is so different from the document style that cosine similarity doesn't surface relevant results. Test your retrieval independently from your generation. Evaluate the top-k results for a representative query set before you ever touch the LLM layer.
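Evaluating retrieval on its own can start very small. The sketch below computes two standard metrics, hit rate at k and mean reciprocal rank, over a labeled query set. The data shapes are assumptions for illustration: retrieved maps each query to its ranked doc IDs, and relevant maps each query to the set of doc IDs a human judged correct.

```python
def hit_rate_at_k(
    retrieved: dict[str, list[str]],
    relevant: dict[str, set[str]],
    k: int = 5,
) -> float:
    """Fraction of labeled queries with at least one relevant doc in the top k."""
    if not relevant:
        raise ValueError("need at least one labeled query")
    hits = 0
    for query, relevant_ids in relevant.items():
        top_k = retrieved.get(query, [])[:k]
        if relevant_ids.intersection(top_k):
            hits += 1
    return hits / len(relevant)


def mean_reciprocal_rank(
    retrieved: dict[str, list[str]],
    relevant: dict[str, set[str]],
) -> float:
    """Average of 1/rank of the first relevant doc per query (0 if none is found)."""
    if not relevant:
        raise ValueError("need at least one labeled query")
    total = 0.0
    for query, relevant_ids in relevant.items():
        for rank, doc_id in enumerate(retrieved.get(query, []), start=1):
            if doc_id in relevant_ids:
                total += 1.0 / rank
                break
    return total / len(relevant)
```

Run these against the raw index results before any LLM is wired in. If the hit rate is low here, no amount of prompt tuning downstream will recover it.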

The second most common failure is chunk size pathology. You split documents into 500-token chunks, embed them, and then retrieve chunks that contain part of the answer but miss the context that makes the answer correct. Chunk size tuning is not glamorous work — it's table-scans-and-row-counts-adjacent work — but it has more impact on RAG quality than almost anything else.
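One common mitigation is to overlap adjacent chunks so that a sentence straddling a boundary appears whole in at least one of them. This is a minimal sketch of that idea, not a recommendation for specific sizes. It approximates tokens with whitespace-separated words, which is an assumption; a real pipeline would count tokens with the embedding model's own tokenizer.

```python
def chunk_words(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of up to chunk_size words, sharing overlap words between neighbors."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Treat chunk_size and overlap as parameters to sweep against a retrieval evaluation set, not values to pick once and forget.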

My Predictions for Retrieval in 2025

Multi-modal retrieval — embedding images, audio, and structured data alongside text — is going to move from research demo to production reality over the next year. The embedding model ecosystem is maturing fast. Hybrid search (combining dense vector similarity with sparse keyword matching) will become the default rather than the exception; pure dense retrieval misses too many cases that keyword search handles well. And the "retrieval engineer" will become a recognized specialization — the person who owns the embedding pipeline, the index configuration, the relevance evaluation, and the feedback loop. That role doesn't have a title yet. It will. As always, I'm here to help.
