MLflow Model Serving in Databricks: Turning a Registered Model Into a REST Endpoint

The MLflow Model Registry gives you versioning and stage transitions. The next question is serving: how do you turn a registered model into a REST endpoint that a downstream application can actually call? Databricks has a first-party answer for this now, and it's worth understanding both what it does well and where you'll need to supplement it.

Creating a Model Serving Endpoint

In Databricks, model serving is configured through the UI or the REST API. You point it at a registered model version (or a stage like "Production"), choose a workload size (Small, Medium, or Large), and Databricks provisions a real-time serving cluster behind a REST endpoint.

import requests

DATABRICKS_HOST = "https://your-workspace.azuredatabricks.net"
TOKEN = "dapi..."  # from env / secrets

def create_serving_endpoint(model_name: str, model_version: str) -> dict:
    payload = {
        "name": f"{model_name.lower().replace(' ', '-')}-serving",
        "config": {
            "served_models": [{
                "model_name": model_name,
                "model_version": model_version,
                "workload_size": "Small",  # Small | Medium | Large
                "scale_to_zero_enabled": True
            }]
        }
    }
    resp = requests.post(
        f"{DATABRICKS_HOST}/api/2.0/serving-endpoints",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json=payload
    )
    resp.raise_for_status()
    return resp.json()

endpoint = create_serving_endpoint("RiskClassifier", "3")
print(f"Endpoint name: {endpoint['name']}")
print(f"URL: {DATABRICKS_HOST}/serving-endpoints/{endpoint['name']}/invocations")
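
Provisioning takes time, so a caller typically needs to wait before the endpoint can take traffic. A minimal sketch of a polling loop follows. It assumes the endpoint description is available from GET /api/2.0/serving-endpoints/{name} and that it carries a state.ready field equal to "READY" once provisioning finishes; check both against your workspace's API reference. wait_until_ready and fetch_state are hypothetical names, and the fetch is injected as a callable so the loop can be exercised without a workspace.

```python
import time
from typing import Callable

def wait_until_ready(
    fetch_state: Callable[[], dict],
    timeout_s: float = 1200.0,
    poll_interval_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll fetch_state() until the endpoint reports ready or the timeout elapses.

    fetch_state is a hypothetical callable returning the endpoint description
    as a dict, e.g. the JSON body of GET /api/2.0/serving-endpoints/{name}.
    Assumption: readiness is reported as state["ready"] == "READY".
    """
    waited = 0.0
    while True:
        description = fetch_state()
        if description.get("state", {}).get("ready") == "READY":
            return description
        if waited >= timeout_s:
            raise TimeoutError(f"endpoint not ready after {waited:.0f}s")
        sleep(poll_interval_s)
        waited += poll_interval_s
```

In practice fetch_state would wrap a requests.get call against the endpoint URL with the same Authorization header used for creation, returning resp.json().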

Calling the Endpoint

import pandas as pd

def score_records(endpoint_name: str, records: pd.DataFrame) -> list:
    payload = {
        "dataframe_records": records.to_dict(orient='records')
    }
    resp = requests.post(
        f"{DATABRICKS_HOST}/serving-endpoints/{endpoint_name}/invocations",
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/json"
        },
        json=payload,
        timeout=30
    )
    resp.raise_for_status()
    return resp.json()['predictions']

# Score a batch of records
test_records = pd.DataFrame([
    {'order_amount': 1299.99, 'account_age_days': 45, 'region_code': 'WEST', 'prior_disputes': 0},
    {'order_amount': 8750.00, 'account_age_days': 12, 'region_code': 'EAST', 'prior_disputes': 2}
])
predictions = score_records("riskclassifier-serving", test_records)
print(predictions)
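
To make the request format concrete, here is a small sketch of the JSON body that score_records sends. It uses plain dictionaries in place of a DataFrame so it runs without pandas. build_invocation_body is a hypothetical helper name, not part of any Databricks or MLflow API.

```python
import json

def build_invocation_body(records: list[dict]) -> str:
    """Serialize row dictionaries into the "dataframe_records" request format.

    Each dictionary is one row, keyed by column name, which is the same shape
    that DataFrame.to_dict(orient="records") produces.
    """
    if not records:
        raise ValueError("at least one record is required")
    return json.dumps({"dataframe_records": records})

body = build_invocation_body([
    {"order_amount": 1299.99, "account_age_days": 45},
])
print(body)
# {"dataframe_records": [{"order_amount": 1299.99, "account_age_days": 45}]}
```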

Traffic Splitting for A/B Testing

One feature worth knowing about: you can split traffic between model versions on a single endpoint. This is how you run a staged rollout — 10% to the new version, 90% to the current version, monitor metrics, then gradually shift traffic.

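def update_endpoint_traffic_split(endpoint_name: str, model_name: str) -> None:
    # Traffic percentages are set in traffic_config.routes, keyed by served
    # model name, rather than inside the served_models entries themselves.
    # "current" and "challenger" are illustrative names.
    payload = {
        "served_models": [
            {
                "name": "current",
                "model_name": model_name,
                "model_version": "3",
                "workload_size": "Small",
                "scale_to_zero_enabled": True
            },
            {
                "name": "challenger",
                "model_name": model_name,
                "model_version": "4",
                "workload_size": "Small",
                "scale_to_zero_enabled": True
            }
        ],
        "traffic_config": {
            "routes": [
                {"served_model_name": "current", "traffic_percentage": 90},
                {"served_model_name": "challenger", "traffic_percentage": 10}
            ]
        }
    }
    resp = requests.put(
        f"{DATABRICKS_HOST}/api/2.0/serving-endpoints/{endpoint_name}/config",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json=payload
    )
    resp.raise_for_status()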
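
The gradual shift described above can be planned as a sequence of splits before any API call is made. This is a minimal sketch, assuming a rollout between exactly two versions. rollout_steps and its default ramp are hypothetical, and each pair would be applied through an endpoint config update with a monitoring pause between steps.

```python
def rollout_steps(ramp: tuple[int, ...] = (10, 25, 50, 100)) -> list[tuple[int, int]]:
    """Return (current_pct, new_pct) pairs for a staged rollout.

    ramp lists the share of traffic sent to the new version at each step.
    It must be strictly increasing, stay within 1..100, and end at 100 so
    the rollout finishes with all traffic on the new version.
    """
    if not ramp or ramp[-1] != 100:
        raise ValueError("ramp must end at 100")
    if any(not 0 < pct <= 100 for pct in ramp):
        raise ValueError("each step must be between 1 and 100")
    if any(later <= earlier for earlier, later in zip(ramp, ramp[1:])):
        raise ValueError("ramp must be strictly increasing")
    # The two percentages in each pair always sum to 100.
    return [(100 - pct, pct) for pct in ramp]

for current_pct, new_pct in rollout_steps():
    print(f"current={current_pct}%  new={new_pct}%")
```

Validating the ramp up front keeps a typo in the schedule from reaching the endpoint as an invalid or backwards split.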

The Scale-to-Zero Gotcha

Scale-to-zero is useful for non-production endpoints or low-traffic models, because the serving cluster terminates when there are no requests and restarts on the next one. The cold start time is meaningful: expect 2-5 minutes for a Small endpoint to spin up from zero. For production workloads where latency matters, either keep scale-to-zero disabled or build a warmup call into your client that tolerates the first-request delay.
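
One way to build that tolerance into a client is to retry the first call until the endpoint answers. A minimal sketch follows. call_with_warmup is a hypothetical helper, the request is injected as a callable so no particular failure mode of the endpoint is assumed, and the default wait budget simply covers the upper end of the cold-start range quoted above.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def call_with_warmup(
    send: Callable[[], T],
    max_wait_s: float = 300.0,
    retry_interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call send() until it succeeds or max_wait_s of retry delay is used up.

    send is a hypothetical zero-argument callable that performs one request
    and raises on failure (for example, via resp.raise_for_status()).
    """
    waited = 0.0
    while True:
        try:
            return send()
        except Exception:
            if waited >= max_wait_s:
                raise  # out of budget: surface the last error to the caller
            sleep(retry_interval_s)
            waited += retry_interval_s
```

A call such as call_with_warmup(lambda: score_records("riskclassifier-serving", test_records)) would wrap the scoring function from earlier. Catching every exception keeps the sketch short; a real client would likely retry only on timeouts and server-side errors so that a malformed request fails immediately.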
