MLflow on Databricks: Tracking Your Experiments So You Can Reproduce What Actually Worked

If you've ever trained a model, gotten good results, then lost track of which hyperparameters you used, you understand why MLflow exists. It's an experiment tracking system — a log of every training run with the parameters, metrics, and artifacts that produced each result. On Databricks, it's built in and connected to your notebook environment without any additional setup.

Why This Is a Data Engineering Concern

MLflow isn't just for data scientists. Every ML pipeline has two components that need to be tracked together: the model and the data it was trained on. If you retrain a model next month with different data, you need to know which data version produced which model version. MLflow's experiment tracking is where you log not just hyperparameters but also the data snapshot used for training, so you can reproduce results by re-running the same pipeline against the same data version.

Basic Experiment Tracking

import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Start a tracking run
with mlflow.start_run(run_name="churn_model_v3"):
    # Log parameters (hyperparameters, config choices)
    mlflow.log_param("n_estimators", 200)
    mlflow.log_param("max_depth", 5)
    mlflow.log_param("learning_rate", 0.05)
    mlflow.log_param("training_data_version", 47)  # Delta table version used

    # Train
    model = GradientBoostingClassifier(n_estimators=200, max_depth=5, learning_rate=0.05)
    model.fit(X_train, y_train)

    # Log metrics (evaluation results)
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]

    mlflow.log_metric("accuracy", accuracy_score(y_test, y_pred))
    mlflow.log_metric("roc_auc", roc_auc_score(y_test, y_prob))
    mlflow.log_metric("training_rows", len(X_train))

    # Log the model itself
    mlflow.sklearn.log_model(model, "churn_model")

    # Log artifacts (feature importance plot, confusion matrix, etc.)
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots()
    # ... plot feature importances ...
    plt.savefig("/tmp/feature_importance.png")
    mlflow.log_artifact("/tmp/feature_importance.png")

The Databricks MLflow UI

In the Databricks workspace, go to Machine Learning → Experiments. Every mlflow.start_run() call in your notebooks creates an entry here. You can compare runs side-by-side, sort by any metric, filter to runs that beat a threshold, and drill into any run to see all logged parameters, metrics, and artifacts.

This is the answer to "which run had the best AUC?" — instead of searching through notebook outputs or maintaining a manual spreadsheet, every run is in the experiment tracker with its full parameter/metric context.
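The same question can be asked from code rather than the UI. The following is a minimal sketch, assuming runs logged a roc_auc metric as in the tracking example; the helper names build_run_filter and find_best_runs and the experiment path in the usage note are hypothetical, while mlflow.search_runs is the MLflow call doing the work.

```python
def build_run_filter(metric, threshold, params=None):
    """Build an MLflow search filter: runs whose metric beats a threshold.

    Param values are stored as strings in MLflow, so they are quoted.
    """
    clauses = [f"metrics.{metric} > {threshold}"]
    for key, value in (params or {}).items():
        clauses.append(f"params.{key} = '{value}'")
    return " AND ".join(clauses)


def find_best_runs(experiment_name, metric="roc_auc", threshold=0.8, max_results=5):
    """Return the top runs for a metric as a pandas DataFrame, best first."""
    import mlflow  # imported here so build_run_filter works without MLflow

    return mlflow.search_runs(
        experiment_names=[experiment_name],
        filter_string=build_run_filter(metric, threshold),
        order_by=[f"metrics.{metric} DESC"],
        max_results=max_results,
    )
```

Calling find_best_runs("/Users/you@example.com/churn") would return one row per run, with columns such as run_id, metrics.roc_auc, and params.max_depth. The run_id of the first row is the value you would pass on to the Model Registry.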

Auto-Logging

For common ML frameworks, MLflow can automatically log parameters and metrics without explicit log calls:

# Enable autolog for scikit-learn
mlflow.sklearn.autolog()

with mlflow.start_run():
    model = GradientBoostingClassifier(n_estimators=200, max_depth=5)
    model.fit(X_train, y_train)
    # MLflow automatically logs: model parameters, training metrics,
    # and the fitted model artifact (feature importances are logged
    # by the XGBoost and LightGBM autologgers, not scikit-learn's)

Autolog is available for scikit-learn, XGBoost, LightGBM, PyTorch (through PyTorch Lightning), TensorFlow, and Keras. It doesn't log domain-specific context (which version of your training data you used, which Great Expectations suite validated it); you still need manual log_param calls for that.
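One way to combine the two is to let autolog handle the estimator and add the lineage context by hand inside the same run. This is a sketch under assumptions: the helper names lineage_params and train_with_autolog and the validation_suite key are hypothetical, not part of MLflow.

```python
def lineage_params(table, version, validation_suite=None):
    """Collect the data-lineage context that autolog cannot infer."""
    params = {"feature_table": table, "feature_table_version": version}
    if validation_suite is not None:
        # Hypothetical key: name of the data-quality suite that validated the table
        params["validation_suite"] = validation_suite
    return params


def train_with_autolog(model, X_train, y_train, lineage):
    """Fit a scikit-learn estimator with autolog plus manual lineage params."""
    import mlflow
    import mlflow.sklearn

    mlflow.sklearn.autolog()
    with mlflow.start_run():
        # Manual params land in the same run that autolog writes to
        mlflow.log_params(lineage)
        model.fit(X_train, y_train)
    return model
```

For example, train_with_autolog(GradientBoostingClassifier(), X_train, y_train, lineage_params("analytics.customer_features", 47)) would produce a single run holding both the autologged estimator parameters and the table version.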

Model Registry: From Experiment to Production

When a run produces a model good enough for production, register it in the MLflow Model Registry:

# Register a model from a specific run
run_id = "abc123"  # from the Experiments UI or mlflow search
mlflow.register_model(
    model_uri=f"runs:/{run_id}/churn_model",
    name="churn_model"
)

# Then promote a registered version to a stage via the registry client
from mlflow.tracking import MlflowClient

client = MlflowClient()
client.transition_model_version_stage(
    name="churn_model",
    version=3,
    stage="Production"
)

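The registry tracks model versions with stage labels: Staging, Production, Archived. Downstream consumers (inference notebooks, API services, batch prediction jobs) load models by stage rather than run ID, so they automatically pick up new Production versions without code changes. This stage-based workflow applies to the Workspace Model Registry; MLflow 2.9 deprecated stages in favor of model version aliases, and models registered in Unity Catalog do not support stages. Loading by stage looks like this: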

# Load the current production model -- always gets the latest Production version
model = mlflow.sklearn.load_model("models:/churn_model/Production")
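Where aliases are used instead of stages (for example, for models registered in Unity Catalog), the same "consumers follow a label, not a run ID" pattern can be sketched as below. The helper names registry_uri and promote_and_load and the alias name "champion" are assumptions for illustration; set_registered_model_alias and the models:/name@alias URI form are the MLflow pieces involved.

```python
def registry_uri(name, alias):
    """Build a model URI that resolves to whichever version holds the alias."""
    return f"models:/{name}@{alias}"


def promote_and_load(name, version, alias="champion"):
    """Point an alias at a model version, then load the model through the alias."""
    import mlflow.sklearn
    from mlflow.tracking import MlflowClient

    client = MlflowClient()
    # Moving the alias is the promotion step; consumers keep the same URI
    client.set_registered_model_alias(name, alias, version)
    return mlflow.sklearn.load_model(registry_uri(name, alias))
```

In Unity Catalog the model name would be a three-level name such as catalog.schema.churn_model rather than the bare churn_model used above.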

Logging the Data Version

# Track which version of the Delta table trained this model.
# DESCRIBE HISTORY returns the most recent version first.
customers_version = (
    spark.sql("DESCRIBE HISTORY analytics.customer_features LIMIT 1")
    .collect()[0]["version"]
)

with mlflow.start_run():
    mlflow.log_param("feature_table_version", customers_version)
    mlflow.log_param("feature_table", "analytics.customer_features")
    # ... train model ...

With the Delta table version logged, you can reproduce the exact training dataset by reading the feature table at that version: spark.read.format("delta").option("versionAsOf", customers_version).table("analytics.customer_features"). This works only while that version's data files are still retained, since VACUUM removes files older than the retention period. Time travel on the data side plus model versioning on the ML side makes the full pipeline reproducible.
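Going from a past run back to its training data can be sketched as follows, assuming the run logged feature_table and feature_table_version as in the example above. The helper names snapshot_from_params and load_training_snapshot are hypothetical; mlflow.get_run is the call that retrieves the logged parameters.

```python
def snapshot_from_params(params):
    """Extract (table_name, version) from a run's logged params.

    MLflow stores every param value as a string, so the version
    must be converted back to an int before it is used.
    """
    return params["feature_table"], int(params["feature_table_version"])


def load_training_snapshot(spark, run_id):
    """Read the feature table at the exact version a run was trained on."""
    import mlflow

    run = mlflow.get_run(run_id)
    table, version = snapshot_from_params(run.data.params)
    return (
        spark.read.format("delta")
        .option("versionAsOf", version)
        .table(table)
    )
```

Passing the notebook's spark session and a run ID from the Experiments UI to load_training_snapshot would return a DataFrame of the rows that run was trained on.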

Shannon Lowder
