Data Governance in the GenAI Era: Unity Catalog's Expanding Role
At DAIS 2024, Unity Catalog was extended to govern not just data tables but models, features, vector indexes, and lineage that spans the full machine learning lifecycle. I've spent the last few months integrating this more deeply into real projects, and the expanded UC story is genuinely changing how I approach AI governance — not as a compliance checkbox at the end of a project, but as the first architectural decision at the beginning of one.
Why Governance Has to Come First
The old model: build the AI system, then figure out governance. This fails for the same reason that building a data warehouse and then figuring out access control fails. The decisions you make while building — which tables the model was trained on, which features went into it, which version is in production — are the governance record. If you're not capturing them as you go, you're creating a forensic reconstruction problem later.
The new model: define your governance requirements before you write the first line of training code. What data can this model be trained on? Who has to approve it before it touches production? What auditability do you need for each inference call? These aren't compliance questions — they're architectural requirements that determine whether you use Unity Catalog for the model registry, which serving endpoint configuration you choose, and whether you need row-level security on the underlying training data.
What UC Does Well for AI Governance
The model registry integration is the strongest part. Models registered in Unity Catalog have a complete lineage chain: serving endpoint → model version → training run → feature tables → source data. For any model in production, you can answer "what data was this model trained on" by traversing the UC lineage graph. That answer is required for GDPR right-to-erasure compliance (if the training data contains personal data), for EU AI Act documentation requirements, and for any audit that asks about model provenance.
from mlflow.tracking import MlflowClient
client = MlflowClient()
# Get the production model version and its training run
versions = client.get_latest_versions("RiskClassifier", stages=["Production"])
prod_version = versions[0]
run_id = prod_version.run_id
# Get what data the training run used
run = client.get_run(run_id)
input_tables = run.data.tags.get("mlflow.databricks.cluster.usedContextualDataSources", "")
print(f"Production model: {prod_version.version}")
print(f"Trained in run: {run_id}")
print(f"Training data: {input_tables}")
print(f"Approved by: {prod_version.tags.get('approved_by', 'unknown')}")
Where UC Is Still Maturing
Feature lineage is the weakest link right now. If your features are computed in complex PySpark transformations or involve multi-source joins, the column-level lineage capture is incomplete. UC knows that feature table X was read and model Y was trained, but it doesn't know that column Z in the feature table came from column W in the source table through a specific transformation. For fine-grained data lineage — the kind needed for GDPR data minimization audits — you need to supplement UC with custom lineage logging.
Cross-workspace and cross-account lineage still doesn't exist at the product level. Models trained in a dev workspace and promoted to a prod workspace lose the training run linkage across the workspace boundary unless you've explicitly replicated the MLflow experiment. This is a genuine gap for enterprises with workspace separation requirements.
My Wishlist for UC in 2025
Complete Python feature engineering lineage through instrumentation — even if it requires decorating functions rather than being fully automatic. A native concept of "model card" as a first-class UC object, separate from the MLflow run metadata. And real-time governance alerts: notify me when a production model hasn't been evaluated in 30 days, or when a model's training data source has been modified since training. Governance that requires humans to manually check is governance that doesn't get done. As always, I'm here to help.