MLflow Model Registry: Promoting Models from Experiment to Production
You've trained a model. MLflow captured the run: parameters, metrics, the pickled artifact. It lives in the MLflow tracking server, tagged with the experiment name and a run ID that's 32 hex characters you'll never remember. Now you need to put it in production, and "run it from the training notebook" is not a deployment strategy.
MLflow Model Registry is the answer to this. It's the layer between "experiment artifact" and "production model" — with versioning, stage transitions, and an audit trail.
Registering a Model
import mlflow
import mlflow.sklearn
from mlflow.tracking import MlflowClient

# Register during a training run
with mlflow.start_run() as run:
    # ... training code that produces trained_model ...
    mlflow.sklearn.log_model(
        sk_model=trained_model,
        artifact_path="risk_classifier",
        registered_model_name="RiskClassifier",  # creates the registry entry if missing
    )
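
# Or register an existing run after the fact
from mlflow.exceptions import MlflowException

client = MlflowClient()
run_id = run.info.run_id  # or the ID of any finished run
model_uri = f"runs:/{run_id}/risk_classifier"
try:
    # Raises if "RiskClassifier" is already in the registry
    client.create_registered_model("RiskClassifier")
except MlflowException as exc:
    if exc.error_code != "RESOURCE_ALREADY_EXISTS":
        raise
client.create_model_version(
    name="RiskClassifier",
    source=model_uri,
    run_id=run_id,
)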
After registration, the model exists in the registry as RiskClassifier with version 1. Every subsequent registration of a model with the same name creates a new version. The run ID is linked — you can always trace back to the experiment that produced a specific version.
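To make the traceability point concrete, here is a minimal sketch that maps every registered version back to the run that produced it. The helper name `version_lineage` is my own invention, not an MLflow API; it assumes `client` is an `MlflowClient` (or anything exposing the same `search_model_versions` method).

```python
def version_lineage(client, name):
    """Return {version_number: run_id} for every version of `name`.

    Hypothetical helper: `client` is expected to behave like
    mlflow.tracking.MlflowClient.
    """
    versions = client.search_model_versions(f"name='{name}'")
    # Each ModelVersion records the run it was registered from.
    return {int(mv.version): mv.run_id for mv in versions}
```

With a real tracking server you would call it as `version_lineage(MlflowClient(), "RiskClassifier")`.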
Stage Transitions
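The registry has four stages: None, Staging, Production, and Archived. (Stages are deprecated as of MLflow 2.9 in favor of model version aliases, but the stage API shown here still works on versions that support it.) The workflow I use for most projects: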
client = MlflowClient()

# Move to Staging for validation
client.transition_model_version_stage(
    name="RiskClassifier",
    version="3",
    stage="Staging",
    archive_existing_versions=False,
)

# After validation passes, promote to Production
# archive_existing_versions moves whatever was in Production to Archived
client.transition_model_version_stage(
    name="RiskClassifier",
    version="3",
    stage="Production",
    archive_existing_versions=True,
)

# Record why it was promoted; this updates the version's description field
client.update_model_version(
    name="RiskClassifier",
    version="3",
    description="Promoted after 2-week A/B test on Staging. F1 improved 4.2% vs v2.",
)
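After a transition it is worth checking what the registry actually holds in each stage. The sketch below is one way to do that; `current_stage_versions` is a hypothetical helper name, and it assumes `client` behaves like `MlflowClient`, whose `get_latest_versions` returns the newest version per requested stage.

```python
def current_stage_versions(client, name, stages=("Staging", "Production")):
    """Return {stage: version or None} for the requested stages.

    Hypothetical helper built on MlflowClient.get_latest_versions.
    """
    result = {stage: None for stage in stages}
    for mv in client.get_latest_versions(name, stages=list(stages)):
        # current_stage tells us which stage this version occupies.
        result[mv.current_stage] = mv.version
    return result
```

A stage that maps to `None` has no version assigned, which is a useful thing to catch before consumers try to load from it.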
Loading by Stage in Production Code
Production code should never reference a version number directly — it should reference the stage. This decouples deployment from the consuming code:
import mlflow.pyfunc
# Always loads whatever is currently in Production stage
model = mlflow.pyfunc.load_model("models:/RiskClassifier/Production")
# Use the model
predictions = model.predict(input_df)
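When you promote version 4 to Production, any consumer that calls load_model again picks up version 4: no code changes, no redeployment. A process that loaded the model once at startup keeps the old version in memory until it reloads or restarts. The registry is the deployment mechanism.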
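For long-running services, one way to pick up a promotion without a restart is to check the stage periodically and reload only when the version behind it changes. This is a sketch under assumptions, not an MLflow feature: `StageModelCache` is a hypothetical class, `client` is assumed to behave like `MlflowClient`, and `load_fn` like `mlflow.pyfunc.load_model`. Both are passed in so the logic can be exercised without a tracking server.

```python
class StageModelCache:
    """Reload a model only when the version behind a stage changes.

    Hypothetical helper. `client` should behave like MlflowClient and
    `load_fn` like mlflow.pyfunc.load_model.
    """

    def __init__(self, client, load_fn, name, stage="Production"):
        self._client = client
        self._load_fn = load_fn
        self._name = name
        self._stage = stage
        self._version = None
        self._model = None

    def get(self):
        latest = self._client.get_latest_versions(self._name, stages=[self._stage])
        if not latest:
            raise LookupError(f"No version of {self._name} in stage {self._stage}")
        version = latest[0].version
        if version != self._version:
            # Load the resolved version so the cached model matches
            # the version number we recorded.
            self._model = self._load_fn(f"models:/{self._name}/{version}")
            self._version = version
        return self._model
```

In a service you might build it as `StageModelCache(MlflowClient(), mlflow.pyfunc.load_model, "RiskClassifier")` and call `get()` on a timer rather than per request, since every call hits the registry.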
Rollback
client = MlflowClient()

# Version 4 has a bug: move it to Archived, move v3 back to Production
client.transition_model_version_stage(
    name="RiskClassifier",
    version="4",
    stage="Archived",
)
client.transition_model_version_stage(
    name="RiskClassifier",
    version="3",
    stage="Production",
    archive_existing_versions=False,
)
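The two-step rollback above leaves a brief window with nothing in Production. A possible alternative is a single transition that restores the good version and lets `archive_existing_versions=True` archive whatever currently occupies the stage. `rollback_production` is a hypothetical wrapper, and `client` is assumed to behave like `MlflowClient`.

```python
def rollback_production(client, name, good_version):
    """Put `good_version` back in Production and archive the current occupant.

    Hypothetical helper: a single transition call, relying on
    archive_existing_versions=True to archive the buggy version.
    """
    return client.transition_model_version_stage(
        name=name,
        version=str(good_version),
        stage="Production",
        archive_existing_versions=True,
    )
```

Usage would be `rollback_production(MlflowClient(), "RiskClassifier", 3)`.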
The Stage-Based ACL Model
In Databricks, model registry permissions align with workspace permissions. The pattern that works: data scientists have write access to register new versions and promote to Staging. A separate gate (code review, validation notebook, approval from the model owner) is required before anything moves to Production. That gate can be automated (a notebook that runs validation tests and calls transition_model_version_stage on pass) or manual (a human reviews the Staging test results and executes the promotion). Either way, the registry gives you the checkpoint.
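The automated gate can be sketched in a few lines. This is an illustration under assumptions, not a prescribed workflow: `promote_if_valid`, the `metrics` dict, and the `thresholds` dict are all hypothetical, and the caller is assumed to have already computed the validation metrics for the Staging version.

```python
def promote_if_valid(client, name, version, metrics, thresholds):
    """Promote `version` to Production only if every metric meets its minimum.

    Hypothetical gate. `metrics` and `thresholds` are plain dicts such as
    {"f1": 0.91} and {"f1": 0.85}. Returns True if the version was promoted.
    """
    for metric, minimum in thresholds.items():
        value = metrics.get(metric)
        if value is None or value < minimum:
            # A missing or below-threshold metric blocks the promotion.
            return False
    client.transition_model_version_stage(
        name=name,
        version=str(version),
        stage="Production",
        archive_existing_versions=True,
    )
    return True
```

Run it under a service principal that holds the Production permission, so that passing the gate is the only path to promotion.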