MLflow Model Registry: Promoting Models from Experiment to Production

You've trained a model. MLflow captured the run: parameters, metrics, the pickled artifact. It lives in the MLflow tracking server, filed under the experiment name and a run ID of 32 hexadecimal characters you'll never remember. Now you need to put it in production, and "run it from the training notebook" is not a deployment strategy.

MLflow Model Registry is the answer to this. It's the layer between "experiment artifact" and "production model" — with versioning, stage transitions, and an audit trail.

Registering a Model

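import mlflow
import mlflow.sklearn
from mlflow.exceptions import MlflowException
from mlflow.tracking import MlflowClient

# Register during a training run
with mlflow.start_run() as run:
    # ... training code that produces trained_model ...
    mlflow.sklearn.log_model(
        sk_model=trained_model,
        artifact_path="risk_classifier",
        registered_model_name="RiskClassifier"  # creates the registry entry if needed
    )
    run_id = run.info.run_id  # keep the run ID for later lookups

# Or register an existing run after the fact
client = MlflowClient()
model_uri = f"runs:/{run_id}/risk_classifier"
try:
    # Raises if "RiskClassifier" is already registered
    client.create_registered_model("RiskClassifier")
except MlflowException as exc:
    if exc.error_code != "RESOURCE_ALREADY_EXISTS":
        raise
# Adds a new version under the existing name
client.create_model_version(
    name="RiskClassifier",
    source=model_uri,
    run_id=run_id
)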

After registration, the model exists in the registry as RiskClassifier with version 1. Every subsequent registration of a model with the same name creates a new version. The run ID is linked — you can always trace back to the experiment that produced a specific version.
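That link back to the run can be read straight off the registry. Here is a minimal sketch, assuming a client object that behaves like MlflowClient (it relies on get_model_version and search_model_versions). The helper names trace_version_to_run and list_versions are made up for illustration.

```python
def trace_version_to_run(client, name, version):
    """Return the run ID that produced a given model version."""
    mv = client.get_model_version(name=name, version=str(version))
    return mv.run_id


def list_versions(client, name):
    """Return (version, stage, run_id) tuples, lowest version first."""
    versions = client.search_model_versions(filter_string=f"name='{name}'")
    rows = [(int(mv.version), mv.current_stage, mv.run_id) for mv in versions]
    # Registry versions are strings, so sort on the integer form
    return sorted(rows, key=lambda row: row[0])
```

Against a real tracking server you would pass MlflowClient() as client, for example trace_version_to_run(MlflowClient(), "RiskClassifier", 3).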

Stage Transitions

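The registry has four stages: None, Staging, Production, and Archived. (Stages are deprecated as of MLflow 2.9 in favor of aliases and tags; everything below applies to registries that still use the stage API.) The workflow I use for most projects: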

client = MlflowClient()

# Move to Staging for validation
client.transition_model_version_stage(
    name="RiskClassifier",
    version="3",
    stage="Staging",
    archive_existing_versions=False
)

# After validation passes, promote to Production
# archive_existing_versions moves whatever was in Production to Archived
client.transition_model_version_stage(
    name="RiskClassifier",
    version="3",
    stage="Production",
    archive_existing_versions=True
)

# Record why it was promoted in the version description (this overwrites any existing description)
client.update_model_version(
    name="RiskClassifier",
    version="3",
    description="Promoted after 2-week A/B test on Staging. F1 improved 4.2% vs v2."
)
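After a transition it is worth checking what the registry now reports for each stage. A small sketch, assuming a client that exposes MlflowClient's get_latest_versions. The helper name versions_by_stage is hypothetical.

```python
def versions_by_stage(client, name, stages=("Staging", "Production")):
    """Map each requested stage to the latest version currently in it."""
    latest = client.get_latest_versions(name, stages=list(stages))
    # Stages with no version are simply absent from the result
    return {mv.current_stage: mv.version for mv in latest}
```

With a real registry, versions_by_stage(MlflowClient(), "RiskClassifier") returns a dict keyed by stage name, which is easy to log or assert on in a promotion script.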

Loading by Stage in Production Code

Production code should never reference a version number directly — it should reference the stage. This decouples deployment from the consuming code:

import mlflow.pyfunc

# Always loads whatever is currently in Production stage
model = mlflow.pyfunc.load_model("models:/RiskClassifier/Production")

# Use the model
predictions = model.predict(input_df)

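When you promote version 4 to Production, consuming code picks up version 4 the next time it loads the model, with no code changes and no redeployment. A long-running service that loaded the model once at startup keeps serving the old version until it reloads or restarts. The registry is the deployment mechanism.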
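"On the next load" matters for services that stay up for days. One way to handle it is to reload the stage URI on a timer. This is a sketch of that idea, not something MLflow provides: the ReloadingModel class, its max_age_seconds parameter, and the injectable clock are all my own assumptions, and the class is not thread-safe as written.

```python
import time


class ReloadingModel:
    """Wraps a loader and re-resolves the model URI once the copy is too old."""

    def __init__(self, load_fn, model_uri, max_age_seconds=300.0, clock=time.monotonic):
        self._load_fn = load_fn          # e.g. mlflow.pyfunc.load_model
        self._model_uri = model_uri      # e.g. "models:/RiskClassifier/Production"
        self._max_age = max_age_seconds
        self._clock = clock
        self._model = None
        self._loaded_at = None

    def _current(self):
        now = self._clock()
        stale = self._loaded_at is None or now - self._loaded_at >= self._max_age
        if stale:
            # Re-resolving the stage URI picks up whatever was promoted since last load
            self._model = self._load_fn(self._model_uri)
            self._loaded_at = now
        return self._model

    def predict(self, data):
        return self._current().predict(data)
```

In a service you would build it once, for example ReloadingModel(mlflow.pyfunc.load_model, "models:/RiskClassifier/Production"), and call predict on it as usual. The trade-off is a bounded delay between promotion and pickup, plus a load pause on the request that triggers the refresh.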

Rollback

client = MlflowClient()

# Version 4 has a bug: move v3 back to Production first so the stage is never empty,
# then move v4 to Archived
client.transition_model_version_stage(
    name="RiskClassifier",
    version="3",
    stage="Production",
    archive_existing_versions=False
)
client.transition_model_version_stage(
    name="RiskClassifier",
    version="4",
    stage="Archived"
)
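In an incident you may not remember which version to go back to. The sketch below picks a candidate, assuming the highest-numbered Archived version below the bad one is the right target. That assumption is mine and will not hold if an earlier version was archived because it was itself broken. The helper name find_rollback_target is hypothetical, and client is anything that exposes MlflowClient's search_model_versions.

```python
def find_rollback_target(client, name, bad_version):
    """Return the newest Archived version older than bad_version, or None."""
    versions = client.search_model_versions(filter_string=f"name='{name}'")
    candidates = [
        int(mv.version)
        for mv in versions
        if mv.current_stage == "Archived" and int(mv.version) < int(bad_version)
    ]
    # The registry stores versions as strings, so convert back before returning
    return str(max(candidates)) if candidates else None
```

Treat the result as a suggestion to confirm, then feed it into the transition calls above.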

The Stage-Based ACL Model

In Databricks, model registry permissions align with workspace permissions. The pattern that works: data scientists have write access to register new versions and promote to Staging. A separate gate (code review, a validation notebook, approval from the model owner) is required before anything moves to Production. That gate can be automated (a notebook that runs validation tests and calls transition_model_version_stage on pass) or manual (a human reviews the Staging test results and executes the promotion). Either way, the registry gives you the checkpoint.
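The automated version of that gate can be very small. This is a sketch under stated assumptions: the metrics and thresholds dicts, and the function name promote_if_valid, are hypothetical, and how you compute the metrics is up to your validation notebook. The only registry call used is transition_model_version_stage.

```python
def promote_if_valid(client, name, version, metrics, thresholds):
    """Promote to Production only if every metric meets its minimum.

    Returns (promoted, failures), where failures maps each failing
    metric name to (observed_value, required_minimum).
    """
    failures = {
        key: (metrics.get(key), minimum)
        for key, minimum in thresholds.items()
        if metrics.get(key) is None or metrics[key] < minimum
    }
    if failures:
        return False, failures

    client.transition_model_version_stage(
        name=name,
        version=str(version),
        stage="Production",
        archive_existing_versions=True,  # retire whatever was serving before
    )
    return True, {}
```

A call might look like promote_if_valid(MlflowClient(), "RiskClassifier", 3, {"f1": 0.91}, {"f1": 0.85}). Returning the failures rather than raising keeps the decision visible to whoever reads the notebook output.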
