Cortex Forge: Building a Semantic Memory Layer on PostgreSQL

The prototype worked. That meant it was time to stop prototyping and build the real thing.

What had started as a PostgreSQL database with a JSONB blob and a pgvector column needed to become something I could rely on across multiple concurrent client projects, maintain without weekly intervention, and extend as requirements evolved. That's a different engineering problem than "prove the concept." I called the resulting system Cortex Forge — a semantic memory layer backed by PostgreSQL.

Why Not an Off-the-Shelf Solution

The vector database market had exploded by this point. Pinecone, Weaviate, Qdrant, Chroma — there were options. I evaluated them. All of them failed at least one of my constraints: local-first deployment, no external API dependencies for embeddings, client data isolation, and cost predictability at small-to-medium scale.

The bigger issue was that most vector database products assumed a specific workflow: ingest documents, embed them with a cloud embedding API, query with another cloud API call. Every step of that workflow sent data to an external service. For client project data, that was a non-starter. The business rules, domain models, and architectural decisions I was capturing contained information that my clients would reasonably expect to remain under my control.

PostgreSQL with pgvector solved this cleanly. I already ran Postgres locally. The pgvector extension ran in-process. Embedding generation could be handled locally with a small embedding model. No data left the machine unless I explicitly sent it somewhere. That's the right default for consulting work.

The Architecture That Emerged

Cortex Forge ended up as three components, each with a clear boundary:

The store: PostgreSQL with pgvector, running locally. Handles persistence, indexing, and retrieval. The schema was redesigned from the prototype — proper entity separation, versioning support, project isolation via row-level security, and a clean separation between structured fact storage and semantic knowledge storage.

The MCP server: A Model Context Protocol server that exposed the store's retrieval capabilities to AI tools. MCP was the right interface here — it provided a standard way for models to query external knowledge bases without bespoke integration for each tool. The server exposed a small set of tools: search_knowledge, add_knowledge, list_projects, get_related.

The CLI: A command-line interface for human-in-the-loop operations — adding entries, reviewing recent additions, checking coverage for a project, running one-off queries. The CLI was the interface for me; the MCP server was the interface for the models.
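To make the store component more concrete, here is a minimal sketch of how project isolation with row-level security and a pgvector similarity query could fit together. The table name, column names, the `app.current_project` setting, and the helper functions are illustrative assumptions, not the actual Cortex Forge schema. The cursor is assumed to be any DB-API cursor that uses `%s` placeholders, as psycopg does.

```python
# Hypothetical schema: names are illustrative, not the real Cortex Forge layout.
# Intended to be run once; CREATE POLICY fails if the policy already exists.
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS knowledge_entries (
    id         bigserial   PRIMARY KEY,
    project_id text        NOT NULL,
    version    integer     NOT NULL DEFAULT 1,
    content    text        NOT NULL,
    embedding  vector(384) NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now()
);

-- Row-level security: a session only sees rows for its current project.
-- Table owners and superusers bypass RLS unless FORCE ROW LEVEL SECURITY is set.
ALTER TABLE knowledge_entries ENABLE ROW LEVEL SECURITY;

CREATE POLICY project_isolation ON knowledge_entries
    USING (project_id = current_setting('app.current_project'));
"""

# Scope the session to one project before querying (assumed setting name).
SET_PROJECT_SQL = "SELECT set_config('app.current_project', %s, false)"

# <=> is pgvector's cosine distance operator; smaller means more similar.
SEARCH_SQL = """
SELECT id, content, embedding <=> %s::vector AS distance
FROM knowledge_entries
ORDER BY distance
LIMIT %s
"""


def to_vector_literal(values):
    """Format a sequence of numbers as pgvector's text form, e.g. '[0.5,1.0]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def search(cursor, project_id, query_vector, limit=5):
    """Return the nearest entries for one project using a DB-API cursor."""
    cursor.execute(SET_PROJECT_SQL, (project_id,))
    cursor.execute(SEARCH_SQL, (to_vector_literal(query_vector), limit))
    return cursor.fetchall()
```

With this layout the isolation rule lives in the database rather than in each query, so a query that forgets a project filter still cannot read another client's rows, provided the connecting role is not the table owner or a superuser.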

The Embedding Choice

Running embedding generation locally meant choosing a model that would fit on available hardware and produce good-enough vectors for the retrieval task. "Good enough" here means: semantically similar knowledge entries should have similar vectors, and the similarity should hold for domain-specific terminology that a general-purpose model might not handle well.

I tested several options. The all-MiniLM-L6-v2 model from Sentence Transformers was fast, small (under 100MB), and produced surprisingly good results on technical content. Its 384-dimensional vectors are a quarter the size of those from OpenAI's text-embedding-ada-002 (1536 dimensions), which reduced storage and improved query speed. The tradeoff was slightly lower recall on edge-case queries, which was acceptable for my use case.
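For a rough sense of the storage difference: pgvector documents its `vector` type as taking 4 bytes per dimension plus 8 bytes of overhead. The sketch below applies that formula to both dimensionalities. The entry count is a hypothetical figure chosen for illustration, and the result ignores index and row overhead.

```python
def vector_bytes(dimensions: int) -> int:
    """Storage for one pgvector `vector` value: 4 bytes per dimension plus 8."""
    return 4 * dimensions + 8


ENTRY_COUNT = 100_000  # hypothetical knowledge base size, for illustration only

minilm_bytes = vector_bytes(384)   # 1544 bytes per entry
ada_bytes = vector_bytes(1536)     # 6152 bytes per entry

for name, per_entry in [("all-MiniLM-L6-v2", minilm_bytes),
                        ("text-embedding-ada-002", ada_bytes)]:
    total_mb = per_entry * ENTRY_COUNT / 1_000_000
    print(f"{name}: {per_entry} bytes/entry, {total_mb:.1f} MB for {ENTRY_COUNT} entries")
```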

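from sentence_transformers import SentenceTransformer

# Load the model once at startup and reuse it for every call.
model = SentenceTransformer('all-MiniLM-L6-v2')

def embed(text: str) -> list[float]:
    # normalize_embeddings=True returns a unit-length vector.
    return model.encode(text, normalize_embeddings=True).tolist()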

Local inference, no API call, no data leaves the machine. The latency per embedding was under 50ms on CPU. Fast enough for interactive use.
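One consequence of normalizing the embeddings is worth spelling out: for unit-length vectors, cosine similarity and the plain dot product are the same number, so similarity can be computed without dividing by the norms. The sketch below checks this with small hand-picked vectors rather than real embeddings; the helper names are illustrative.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine similarity for vectors of any length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]

# After normalization the dot product alone gives the cosine similarity.
assert math.isclose(cosine_similarity(a, b), dot(normalize(a), normalize(b)))
```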

The MCP Integration

The Model Context Protocol integration was the piece that made the whole system feel different from what I had before. Instead of a retrieval proxy that sat between my AI client and the model provider — intercepting and enriching requests — the MCP server let the model actively query the knowledge base when it determined that external knowledge was relevant.

This is a meaningful architectural difference. The proxy approach enriches every request, whether or not retrieval is useful. The MCP approach lets the model decide when to call the knowledge tool. In practice, models make reasonable decisions about this — they call search_knowledge when the query involves project-specific facts and skip it when the question is general enough that retrieved context wouldn't help.
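The difference in control flow can be sketched with stand-ins. Everything here is hypothetical: `fake_model` is a toy that imitates a model's decision, the message format is invented for the example, and `search_knowledge` is a stub rather than a real retrieval call. The point is only to show where the retrieval decision sits in each design.

```python
def search_knowledge(query):
    """Stub retrieval tool; a real one would query the store."""
    return f"[retrieved notes for: {query}]"


def fake_model(messages):
    """Toy stand-in for an LLM: requests the tool only for project-specific questions."""
    question = messages[0][1]
    has_tool_result = any(role == "tool" for role, _ in messages)
    if "acme" in question.lower() and not has_tool_result:
        return {"tool": "search_knowledge", "query": question}
    return {"answer": f"answer using {len(messages) - 1} retrieved block(s)"}


def proxy_style(question, model):
    """Proxy approach: context is injected before the model sees the request."""
    messages = [("user", question), ("tool", search_knowledge(question))]
    return model(messages)["answer"], 1  # always exactly one retrieval


def tool_style(question, model, tools, max_steps=5):
    """Tool approach: retrieval happens only when the model asks for it."""
    messages = [("user", question)]
    retrievals = 0
    for _ in range(max_steps):
        reply = model(messages)
        if "tool" not in reply:
            return reply["answer"], retrievals
        messages.append(("tool", tools[reply["tool"]](reply["query"])))
        retrievals += 1
    raise RuntimeError("model kept calling tools without answering")


tools = {"search_knowledge": search_knowledge}
print(proxy_style("What is a B-tree?", fake_model))
print(tool_style("What is a B-tree?", fake_model, tools))
print(tool_style("What is the Acme invoice rule?", fake_model, tools))
```

In the proxy version the general question still pays for a retrieval. In the tool version only the project-specific question does.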

The result was less noise in the context window and better-targeted retrieval. When the model retrieved knowledge, it was because the model had determined that knowledge was relevant — a better signal than my heuristics about when to inject context automatically.
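For readers who want a feel for the tool surface, here is a dependency-free sketch of the four tools named earlier, backed by an in-memory list instead of PostgreSQL. The argument names and return shapes are assumptions, and `toy_embed` is a deliberately crude stand-in (normalized word counts over a four-word vocabulary) for a real embedding model. In a real server these functions would be registered with an MCP SDK rather than dispatched by hand.

```python
import math
from dataclasses import dataclass


@dataclass
class Entry:
    id: int
    project: str
    text: str
    vector: list


def toy_embed(text):
    """Stand-in for a real embedding model: normalized counts over a tiny vocabulary."""
    vocab = ["invoice", "tax", "deploy", "docker"]
    words = text.lower().split()
    counts = [float(words.count(w)) for w in vocab]
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class KnowledgeTools:
    """In-memory version of the four tools; signatures are assumed."""

    def __init__(self, embed=toy_embed):
        self.embed = embed
        self.entries = []

    def add_knowledge(self, project, text):
        entry = Entry(len(self.entries) + 1, project, text, self.embed(text))
        self.entries.append(entry)
        return entry.id

    def search_knowledge(self, project, query, limit=3):
        # Rank only entries from the requested project by similarity to the query.
        q = self.embed(query)
        scoped = [e for e in self.entries if e.project == project]
        ranked = sorted(scoped, key=lambda e: dot(q, e.vector), reverse=True)
        return [e.text for e in ranked[:limit]]

    def list_projects(self):
        return sorted({e.project for e in self.entries})

    def get_related(self, entry_id, limit=3):
        # Nearest neighbours of an existing entry within the same project.
        source = next(e for e in self.entries if e.id == entry_id)
        others = [e for e in self.entries
                  if e.project == source.project and e.id != entry_id]
        ranked = sorted(others, key=lambda e: dot(source.vector, e.vector), reverse=True)
        return [e.text for e in ranked[:limit]]

    def call(self, name, **arguments):
        """Dispatch a tool call by name, as a server would on a model's request."""
        tools = {
            "search_knowledge": self.search_knowledge,
            "add_knowledge": self.add_knowledge,
            "list_projects": self.list_projects,
            "get_related": self.get_related,
        }
        return tools[name](**arguments)


kb = KnowledgeTools()
kb.call("add_knowledge", project="acme", text="invoice tax rules")
kb.call("add_knowledge", project="acme", text="deploy with docker")
kb.call("add_knowledge", project="globex", text="docker deploy checklist")
print(kb.call("search_knowledge", project="acme", query="tax invoice", limit=1))
```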

What the First Month of Production Use Showed

The system performed better than the prototype on the metrics I cared about. Retrieval quality on domain-specific queries was noticeably higher than keyword search. The MCP integration worked reliably across the AI tools I used daily. The ingestion CLI was fast enough that I was actually using it consistently rather than forgetting to update the knowledge base.

The remaining gaps were around coverage: the knowledge base was only as useful as what had been captured in it. Projects where I had been diligent about ingestion had excellent retrieval. Projects where I had slacked on ingestion had thin coverage and correspondingly weaker AI assistance. The tool paid back in proportion to the effort put into ingestion. That's the right tradeoff; it just means the effort has to be made.

If you're building something similar and want to compare notes on the MCP integration or the embedding model choice, I'm happy to dig into the details. As always, I'm here to help.
