Building a Personal Knowledge Layer

At some point the context file problem stopped being a tooling complaint and started being an engineering problem. That's when things got interesting.

I'd spent five months learning where a flat markdown file fails as a memory system for AI-assisted development. The failure modes were clear: maintenance overhead, retrieval by keyword rather than relevance, no automatic surfacing of relevant facts. The next step was obvious in principle and nontrivial in practice: build something better.

The Requirements That Came From Experience

Before writing any code, I wrote down what I actually needed — derived from five months of observing where the markdown approach fell short:

Storage that can hold facts at multiple levels of granularity. A domain term definition is different from a data model description, which is different from a specific gotcha about one field in one table. They all need to live somewhere, but they're different kinds of knowledge with different retrieval patterns.

Retrieval by semantic similarity, not keyword. When I'm debugging a pipeline that processes fiscal quarter data, I want the system to surface everything about fiscal periods — including documents that don't use the exact phrase "fiscal quarter" but are clearly relevant. Keyword search misses synonyms, related concepts, and domain-adjacent knowledge that would matter to a human expert.

Automatic injection. The knowledge layer should surface relevant context without me deciding what to include. The decision should be driven by what I'm currently working on, not by my memory of what's documented.

Local and auditable. I'm working on client projects. The knowledge layer will contain proprietary data model details, business rules, and domain-specific information that cannot be sent to a third-party API. Whatever I build has to run locally.

The Technology Choices

PostgreSQL was the obvious foundation. I already run it locally for development work, I know it well, and the Postgres ecosystem covers most of what I needed. What it didn't cover natively was semantic search — the ability to retrieve facts by meaning rather than keyword match.

That's where pgvector came in. The pgvector extension adds vector storage and similarity search to Postgres. The pattern: embed each piece of project knowledge as a vector using an embedding model, store the vector alongside the text, and retrieve by cosine similarity at query time. "Find the five facts most similar to this query" becomes a single SQL query:

SELECT content, metadata
FROM project_knowledge
WHERE project_id = $1
ORDER BY embedding <=> $2
LIMIT 5;

Where $2 is the vector embedding of your query text. The <=> operator is cosine distance in pgvector. Closer to zero means more similar.

Combined with PostgreSQL's native full-text search, this gave me two retrieval modes: keyword-exact for when I know the specific term, semantic for when I'm looking for related concepts. The hybrid approach covers most retrieval scenarios.

The Schema

The initial schema was deliberately minimal:

CREATE TABLE knowledge_entries (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    project_id  UUID NOT NULL REFERENCES projects(id),
    content     TEXT NOT NULL,
    metadata    JSONB DEFAULT '{}',
    tags        TEXT[] DEFAULT '{}',
    embedding   vector(1536),
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX ON knowledge_entries
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);

CREATE INDEX ON knowledge_entries
    USING GIN (to_tsvector('english', content));

The metadata JSONB column holds structured facts about each entry — source, confidence, last-verified date, the type of knowledge it represents. The tags array provides coarse-grained filtering before the vector search. The dual index enables both retrieval modes without a separate search infrastructure.

The Ingestion Problem

Storage and retrieval are the easy parts. The hard part is getting knowledge in. A database with a great schema is useless if maintaining it requires more effort than maintaining the markdown file it replaced.

The ingestion approach I landed on for the initial version: anything I would have written into the markdown file, I write here instead. The interface is simpler than the schema suggests — a small CLI that takes a text snippet and a set of tags, generates the embedding, and inserts the record. The operational overhead is similar to maintaining a markdown file, but the retrieval quality is significantly better.

What I wanted eventually was automatic ingestion — capture facts from conversations, code comments, and documentation without manual intervention. That was a future problem. For the initial version, manual ingestion with semantic retrieval was already a meaningful improvement over what I had.

The First Real Test

I migrated my largest context file into the new system — about eighty discrete facts across three projects. The retrieval quality was immediately better than keyword search. Querying "fiscal period calculation" returned entries tagged with calendar, accounting period, quarter, and fiscal year that a keyword search on "fiscal period" would have missed.

The remaining gap was the injection step: getting the retrieved context into the AI session automatically. That required building a layer between the knowledge system and whatever model I was talking to. That layer was the next piece of work — and it was where things started to get genuinely complex.

If you've built a local knowledge retrieval system for AI-assisted development and run into ingestion challenges I haven't covered here, I'd like to hear about it. As always, I'm here to help.

Read more