Building a Metadata Collection Interface: Capturing What the Catalog Misses

A data catalog tells you what exists. It doesn't tell you what it means, who owns it, how sensitive it is, whether it's safe to join with another dataset, or what business rules were applied in its transformation. That context — the metadata that makes a table useful rather than just present — has to come from somewhere.

The problem is that the people who have it — data owners, subject matter experts, domain analysts — don't live in your catalog tool. They live in email, in meetings, in the business systems they operate every day. Building an interface that makes metadata contribution accessible to non-technical stakeholders is engineering work, and it's work that every serious data catalog implementation eventually needs.

What to Capture

Before building the interface, define the minimum viable metadata set. Every additional field you add is friction for the data owner. I use these as the baseline:

from dataclasses import dataclass
from typing import Optional
from enum import Enum

class SensitivityLevel(str, Enum):
    PUBLIC = "PUBLIC"
    INTERNAL = "INTERNAL"
    CONFIDENTIAL = "CONFIDENTIAL"
    RESTRICTED = "RESTRICTED"

@dataclass
class TableMetadata:
    table_fqn: str                      # catalog.schema.table
    description: str                    # what this table contains in plain language
    data_owner: str                     # email of the person responsible
    sensitivity: SensitivityLevel
    contains_pii: bool
    source_system: str                  # where the data originally came from
    refresh_frequency: str              # how often it's updated
    known_issues: Optional[str] = None  # anything a consumer should know
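
Free-text fields are where intake forms tend to go wrong, so it can help to check a submission before it reaches the catalog. The sketch below is one hypothetical way to do that. The function name `validate_submission`, the loose email pattern, and the three-part name rule are assumptions made for illustration, not part of any catalog API.

```python
import re

# Deliberately loose email check (assumption): one "@", no whitespace,
# and at least one dot in the domain part.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_submission(table_fqn: str, description: str, data_owner: str) -> list[str]:
    """Return human-readable problems; an empty list means the input looks usable."""
    errors = []
    parts = table_fqn.split(".")
    # Expect exactly catalog.schema.table with no empty segment
    if len(parts) != 3 or not all(parts):
        errors.append("table_fqn must look like catalog.schema.table")
    if not description.strip():
        errors.append("description must not be empty")
    if not EMAIL_RE.match(data_owner):
        errors.append("data_owner must be an email address")
    return errors

problems = validate_submission("silver.customer_orders", "  ", "someone")
```

Returning a list of messages rather than raising on the first failure lets the form show every problem in one round trip, which matters when the person filling it in is not going to retry three times.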

A Minimal Intake Form

from flask import Flask, render_template, request, redirect, url_for
import requests

# TableMetadata and SensitivityLevel come from the listing above.
# fetch_existing_metadata and write_to_catalog are catalog helpers not shown here.

app = Flask(__name__)

@app.route('/metadata/table/<table_fqn>')
def table_metadata_form(table_fqn: str):
    # Pre-populate with whatever we already have from the catalog
    existing = fetch_existing_metadata(table_fqn)
    return render_template('table_metadata_form.html',
                           table_fqn=table_fqn,
                           existing=existing)

@app.route('/metadata/table/<table_fqn>/submit', methods=['POST'])
def submit_table_metadata(table_fqn: str):
    metadata = TableMetadata(
        table_fqn=table_fqn,
        description=request.form['description'],
        data_owner=request.form['data_owner'],
        sensitivity=SensitivityLevel(request.form['sensitivity']),
        # An unchecked checkbox is absent from the form data
        contains_pii=request.form.get('contains_pii') == 'on',
        source_system=request.form['source_system'],
        refresh_frequency=request.form['refresh_frequency'],
        known_issues=request.form.get('known_issues') or None
    )
    write_to_catalog(metadata)
    # Redirect back to the form so a page refresh does not resubmit
    return redirect(url_for('table_metadata_form', table_fqn=table_fqn))
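
The handler above hands the submission to `write_to_catalog`, which is not shown. One possible shape for it, assuming a Unity Catalog backend, is to translate the submission into SQL: the description becomes the table comment and the remaining fields become tags. The helper names `sql_literal` and `build_catalog_statements` are hypothetical, and the quoting assumes Spark SQL's default backslash escaping.

```python
def sql_literal(value: str) -> str:
    """Quote a string for SQL, assuming default backslash escaping (an assumption)."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"

def build_catalog_statements(table_fqn: str, description: str,
                             tags: dict[str, str]) -> list[str]:
    """Turn one form submission into the SQL statements that would record it.

    table_fqn is interpolated as an identifier, so it must already be
    validated as catalog.schema.table before it gets here.
    """
    statements = [f"COMMENT ON TABLE {table_fqn} IS {sql_literal(description)}"]
    if tags:
        tag_list = ", ".join(
            f"{sql_literal(key)} = {sql_literal(value)}" for key, value in tags.items()
        )
        statements.append(f"ALTER TABLE {table_fqn} SET TAGS ({tag_list})")
    return statements

statements = build_catalog_statements(
    "prod_analytics.silver.customer_orders",
    "Customer's orders, one row per order",
    {"sensitivity": "RESTRICTED", "data_owner": "owner@example.com"},
)
```

Building the statements as plain strings keeps this part testable without a cluster; a hypothetical `write_to_catalog` would then pass each statement to `spark.sql`. Where the runtime supports parameterized queries, those are a safer choice than hand-rolled escaping.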

Column-Level Metadata

Table-level metadata isn't enough for tables whose columns differ in sensitivity. The customer_orders table might hold purely operational columns alongside PII columns. Column-level sensitivity tagging matters for masking, anonymization, and downstream access decisions.

-- Unity Catalog: apply tags to individual columns
ALTER TABLE prod_analytics.silver.customer_orders
    ALTER COLUMN customer_email SET TAGS ('pii' = 'EMAIL', 'sensitivity' = 'RESTRICTED');

ALTER TABLE prod_analytics.silver.customer_orders
    ALTER COLUMN customer_ssn SET TAGS ('pii' = 'SSN', 'sensitivity' = 'RESTRICTED', 'masked' = 'true');
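
To make the "downstream access decisions" point concrete, here is a minimal sketch of how a consumer of these tags might decide which columns to mask for a given reader. It assumes the sensitivity levels are ordered from PUBLIC up to RESTRICTED, which the tags themselves do not encode. The function `columns_to_mask` and the `order_id` and `order_total` columns are hypothetical.

```python
# Assumed ordering, least to most sensitive; nothing in the catalog enforces it.
LEVELS = ["PUBLIC", "INTERNAL", "CONFIDENTIAL", "RESTRICTED"]

def columns_to_mask(column_sensitivity: dict[str, str], clearance: str) -> list[str]:
    """Return the columns whose sensitivity is above the reader's clearance."""
    allowed = LEVELS.index(clearance)
    return sorted(
        column
        for column, level in column_sensitivity.items()
        if LEVELS.index(level) > allowed
    )

# Column tags as they might be read back from the catalog (illustrative values)
customer_orders = {
    "order_id": "INTERNAL",
    "order_total": "INTERNAL",
    "customer_email": "RESTRICTED",
    "customer_ssn": "RESTRICTED",
}

masked = columns_to_mask(customer_orders, clearance="INTERNAL")
```

With a single table-level label, the only safe choice would be to treat every column in customer_orders as RESTRICTED; column tags let the operational columns stay readable.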

Closing the Loop: Completeness Tracking

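def get_metadata_completeness(catalog: str, schema: str) -> list[dict]:
    """List tables in a schema that still have no description."""
    # `spark` is the active SparkSession (provided automatically in Databricks notebooks).
    # catalog and schema are interpolated into the query, so pass trusted values only.
    tables = spark.sql(f"""
        SELECT table_name, comment
        FROM {catalog}.information_schema.tables
        WHERE table_schema = '{schema}'
    """).collect()

    results = []
    for row in tables:
        results.append({
            'table': row.table_name,
            'has_description': bool(row.comment),  # NULL or empty comment counts as missing
            'form_url': f"/metadata/table/{catalog}.{schema}.{row.table_name}"
        })
    # Report only the tables that still need attention
    return [r for r in results if not r['has_description']]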

Send the completeness report to data owners weekly, with links to the intake forms for their tables. Response rates tend to improve when you replace "go find the catalog, navigate to your table, figure out how to edit it" with a direct link and a short form.
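
The weekly report can be assembled from the completeness results with a small grouping step. The sketch below assumes you already have a mapping from table name to owner email (for example from a previous submission or a team roster). That mapping, the function name `build_owner_digests`, and the base URL are all hypothetical.

```python
from collections import defaultdict

def build_owner_digests(incomplete: list[dict], owners: dict[str, str],
                        base_url: str) -> dict[str, str]:
    """Group tables missing descriptions by owner; render one plain-text message each."""
    by_owner = defaultdict(list)
    for entry in incomplete:
        owner = owners.get(entry["table"])
        if owner is None:
            continue  # no known owner; a real job would route these to a fallback list
        by_owner[owner].append(entry)

    digests = {}
    for owner, entries in by_owner.items():
        lines = [f"Tables still missing a description ({len(entries)}):"]
        for entry in sorted(entries, key=lambda e: e["table"]):
            # Each line carries the direct link to that table's intake form
            lines.append(f"- {entry['table']}: {base_url}{entry['form_url']}")
        digests[owner] = "\n".join(lines)
    return digests

incomplete = [
    {"table": "returns", "has_description": False,
     "form_url": "/metadata/table/prod_analytics.silver.returns"},
    {"table": "customer_orders", "has_description": False,
     "form_url": "/metadata/table/prod_analytics.silver.customer_orders"},
    {"table": "orphaned", "has_description": False,
     "form_url": "/metadata/table/prod_analytics.silver.orphaned"},
]
owners = {"returns": "owner@example.com", "customer_orders": "owner@example.com"}
digests = build_owner_digests(incomplete, owners, "https://catalog.example.com")
```

Sending one message per owner, rather than one per table, keeps the weekly nudge to a single email no matter how many tables someone is responsible for.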
