Designing a Hybrid Memory Architecture

Designing a system is more interesting than implementing one, which is probably why I spent two weeks in October drawing architecture diagrams before I wrote a single line of code for the new memory layer. The diagrams were useful. They forced decisions that are much harder to make once you have implementation details to argue about.

Here is what the hybrid memory architecture looked like when the design phase was done.

Three Tiers, Three Jobs

The key design decision was separating knowledge into three tiers based on how it would be used, how often it would change, and what retrieval pattern made sense for each type.

Tier 1: Structured facts. Things that have a definite value and can be expressed precisely. The name of a source system. The schema of a table. A configuration value. A date. These go into regular PostgreSQL tables with proper column types. Retrieval is exact: find me the schema for this table, look up the config for this pipeline. No fuzzy matching needed.

Tier 2: Semantic knowledge. Domain explanations, decision reasoning, context notes, gotchas that are hard to express as structured facts. "The fiscal year starts in October because the company was founded in Q4 and the first complete fiscal year began the following January" is not a structured fact — it's an explanation that informs decisions. This goes into the vector store. Retrieval is semantic: find me things related to fiscal period calculations.

Tier 3: Working memory. The current session's accumulated context. What has been discussed. What decisions have been made in this session. What code has been reviewed. This lives only for the duration of a session — in memory, or in a fast key-value store that gets cleared when the session ends. Retrieval is recency-ordered: what did we just establish?

Most memory systems I had seen collapsed tier 1 and tier 2 into a single store, which forced a tradeoff between retrieval precision and retrieval flexibility. Keeping them separate let each tier use the optimal retrieval mechanism for its content type.

The Schema Evolution

The existing knowledge entries table was adequate for tier 2 but needed extensions:

ALTER TABLE knowledge_entries
    ADD COLUMN entry_type   TEXT NOT NULL DEFAULT 'fact',
    ADD COLUMN confidence   NUMERIC(3,2) DEFAULT 1.0,
    ADD COLUMN valid_from   TIMESTAMPTZ,
    ADD COLUMN valid_until  TIMESTAMPTZ,
    ADD COLUMN superseded_by UUID REFERENCES knowledge_entries(id),
    ADD COLUMN source_ref   TEXT;

CREATE INDEX ON knowledge_entries (entry_type, project_id);
CREATE INDEX ON knowledge_entries (valid_until) WHERE valid_until IS NOT NULL;

confidence handled the case where I knew something was probably true but wasn't certain. valid_from and valid_until handled the case where a fact was true for a period and then superseded. superseded_by created a chain — when a new entry replaced an old one, the old one pointed to the new one rather than being deleted. The history was preserved.

source_ref was the field I wished I had added from the start: where did this fact come from? A Slack message, a code comment, a client document, my own discovery? Without source tracking, stale entries were impossible to verify — I had no way to go back and check whether the original source still agreed with what I had documented.

The Ingestion Pipeline

Manual ingestion had been the bottleneck. I wanted to automate more of it, but the judgment about what was worth capturing couldn't be automated reliably. The compromise: make manual ingestion fast enough that it didn't feel like a task.

The interface I built was a small CLI with three entry points:

# Capture a fact from a note
cortex add --project crm-ingest --type fact --tags "source,api" \
  "The CRM API rate limit is 200 req/min despite the docs saying 1000"

# Capture a decision with reasoning
cortex add --project crm-ingest --type decision --tags "pipeline,sequencing" \
  "Raw data always lands before transformation, even for trivial renames.
   Reason: reproducibility — we can always re-derive transformed data from raw."

# Capture a gotcha
cortex add --project crm-ingest --type gotcha --tags "orders,nulls" \
  "orders.total_amount: pre-tax in v1 records (order_date < 2022-06-01),
   post-tax in v2. Check version via the orders.schema_version field."

Each command generated the embedding, stored the entry, and logged it. The whole operation took under thirty seconds. Fast enough that I would actually do it when I discovered something worth capturing, rather than making a mental note to add it later (which meant never adding it).

The Retrieval Proxy

The injection layer took shape as a local FastAPI server — a proxy that accepted OpenAI-compatible API requests, enriched the messages with retrieved context, and forwarded to the actual model provider. Configuring an AI tool to use the local proxy instead of the provider endpoint directly was a one-line config change.

# config.py
UPSTREAM_API_URL = "https://api.openai.com/v1"
LOCAL_PROXY_URL = "http://localhost:8765/v1"

async def enrich_request(messages: list[dict], project_id: str) -> list[dict]:
    query = extract_query_signal(messages)
    relevant_context = await retrieve_relevant(query, project_id, limit=5)
    if relevant_context:
        system_injection = format_context_block(relevant_context)
        return inject_into_system_message(messages, system_injection)
    return messages

extract_query_signal took the last few messages and constructed a retrieval query from them. retrieve_relevant ran the hybrid search against the knowledge base. inject_into_system_message prepended the retrieved facts to the system prompt. The whole enrichment step added under 200ms to request latency — acceptable.

The First Integration Test

Running a session through the proxy with a populated knowledge base and asking about fiscal period calculations produced something I hadn't seen before: the model used my domain-specific terminology without me providing it in the question. The retrieved context had included the fiscal year definition and the agreed vocabulary, and the model's response reflected both.

Small thing. First time it had worked automatically. Worth noting.

The system wasn't finished — there were edge cases in the retrieval, the ingestion CLI needed polish, and the proxy had no authentication. But the core loop was working: capture knowledge, retrieve relevant subset, inject automatically. Month eleven closed with proof of concept and a list of what needed hardening before this could run in production. As always, I'm here to help if you want to compare architecture notes on any of this.

Read more