Query serving endpoints for custom models

In this article, learn how to format scoring requests for your served model, and how to send those requests to the model serving endpoint. The guidance is relevant to serving custom models, which Databricks defines as traditional ML models or customized Python models packaged in the MLflow format. They can be registered either in Unity Catalog or in the workspace model registry. Examples include scikit-learn, XGBoost, PyTorch, and Hugging Face transformer models. See Model serving with Databricks for more information about this functionality and supported model categories.

For query requests for generative AI and LLM workloads, see Query generative AI models.

Requirements

Important

As a security best practice, Databricks recommends that you use machine-to-machine OAuth tokens for authentication in production scenarios.

For testing and development, Databricks recommends using a personal access token that belongs to a service principal rather than to a workspace user. To create tokens for service principals, see Manage tokens for a service principal.

Querying methods and examples

Mosaic AI Model Serving provides the following options for sending scoring requests to served models:

  • Serving UI: Select Query endpoint from the Serving endpoint page in your Databricks workspace. Enter the model input data in JSON format and click Send Request. If the model has an input example logged, use Show Example to load it.

  • REST API: Call and query the model using the REST API. See POST /serving-endpoints/{name}/invocations for details. For scoring requests to endpoints serving multiple models, see Query individual models behind an endpoint.

  • MLflow Deployments SDK: Use the MLflow Deployments SDK’s predict() function to query the model.

  • SQL function: Invoke model inference directly from SQL using the ai_query SQL function. See Query a served model with ai_query.
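
As a rough illustration of the REST API option above, the following Python sketch assembles a scoring request using only the standard library. The helper names build_invocation_request and send_request, the placeholder workspace host, and the bearer-token header are assumptions made for this example; only the /serving-endpoints/{name}/invocations path comes from the text above.

```python
import json
import urllib.request


def build_invocation_request(host, endpoint_name, token, payload):
    """Return (url, headers, body) for a scoring request.

    host, endpoint_name, and token are placeholders supplied by the caller.
    """
    url = f"{host.rstrip('/')}/serving-endpoints/{endpoint_name}/invocations"
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    body = json.dumps(payload).encode("utf-8")
    return url, headers, body


def send_request(url, headers, body):
    """POST the request and decode the JSON response (requires network access)."""
    request = urllib.request.Request(url, data=body, headers=headers, method="POST")
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Build (but do not send) a request for a hypothetical endpoint.
url, headers, body = build_invocation_request(
    "https://<workspace_host>.databricks.com",  # placeholder host
    "test-model-endpoint",                      # placeholder endpoint name
    "dapi-your-databricks-token",               # placeholder token
    {"inputs": [[5.1, 3.5, 1.4, 0.2]]},
)
print(url)
```

Keeping request construction separate from sending makes it possible to inspect the URL, headers, and body before any call reaches the endpoint.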

Pandas DataFrame scoring example

The following example assumes a MODEL_VERSION_URI like https://<databricks-instance>/model/iris-classifier/Production/invocations, where <databricks-instance> is the name of your Databricks instance, and a Databricks REST API token called DATABRICKS_API_TOKEN.

See Supported scoring formats.

Score a model accepting dataframe split input format.

curl -X POST -u token:$DATABRICKS_API_TOKEN $MODEL_VERSION_URI \
  -H 'Content-Type: application/json' \
  -d '{"dataframe_split": {
    "columns": ["sepal length (cm)", "sepal width (cm)", "petal length (cm)", "petal width (cm)"],
    "data": [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2]]
    }
  }'

Score a model accepting tensor inputs. Tensor inputs should be formatted as described in TensorFlow Serving’s API documentation.

curl -X POST -u token:$DATABRICKS_API_TOKEN $MODEL_VERSION_URI \
  -H 'Content-Type: application/json' \
  -d '{"inputs": [[5.1, 3.5, 1.4, 0.2]]}'

Important

The following example uses the predict() API from the MLflow Deployments SDK.


import mlflow.deployments

# Set these environment variables in your shell before starting Python:
#   export DATABRICKS_HOST="https://<workspace_host>.databricks.com"
#   export DATABRICKS_TOKEN="dapi-your-databricks-token"

client = mlflow.deployments.get_deploy_client("databricks")

response = client.predict(
    endpoint="test-model-endpoint",
    inputs={
        "dataframe_split": {
            "index": [0, 1],
            "columns": ["sepal length (cm)", "sepal width (cm)", "petal length (cm)", "petal width (cm)"],
            "data": [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2]],
        }
    },
)

Important

The following example uses the built-in SQL function, ai_query. This function is in Public Preview and the definition might change. See Query a served model with ai_query.

The following example queries the model behind the sentiment-analysis endpoint with the text dataset and specifies the return type of the request.

SELECT text, ai_query(
    "sentiment-analysis",
    text,
    returnType => "STRUCT<label:STRING, score:DOUBLE>"
  ) AS predict
FROM
  catalog.schema.customer_reviews

You can score a dataset in Power BI Desktop using the following steps:

  1. Open the dataset you want to score.

  2. Go to Transform Data.

  3. Right-click in the left panel and select Create New Query.

  4. Go to View > Advanced Editor.

  5. Replace the query body with the code snippet below, after filling in an appropriate DATABRICKS_API_TOKEN and MODEL_VERSION_URI.

    (dataset as table ) as table =>
    let
      call_predict = (dataset as table ) as list =>
      let
        apiToken = DATABRICKS_API_TOKEN,
        modelUri = MODEL_VERSION_URI,
        // Send the batch as {"dataframe_records": [...]} and read the "predictions" field of the response.
        responseList = Json.Document(Web.Contents(modelUri,
          [
            Headers = [
              #"Content-Type" = "application/json",
              #"Authorization" = Text.Format("Bearer #{0}", {apiToken})
            ],
            Content = Json.FromValue([dataframe_records = dataset])
          ]
        ))[predictions]
      in
        responseList,
      predictionList = List.Combine(List.Transform(Table.Split(dataset, 256), (x) => call_predict(x))),
      predictionsTable = Table.FromList(predictionList, (x) => {x}, {"Prediction"}),
      datasetWithPrediction = Table.Join(
        Table.AddIndexColumn(predictionsTable, "index"), "index",
        Table.AddIndexColumn(dataset, "index"), "index")
    in
      datasetWithPrediction
    
  6. Name the query with your desired model name.

  7. Open the advanced query editor for your dataset and apply the model function.
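
The query in step 5 splits the dataset into batches of 256 rows, scores each batch, and joins the predictions back to the rows by index. The same batching logic can be sketched in Python as follows. The names score_in_batches and fake_predict are hypothetical, and fake_predict stands in for a real endpoint call so the sketch runs offline.

```python
def score_in_batches(rows, predict_fn, batch_size=256):
    """Score rows in fixed-size batches and return each row with its prediction."""
    predictions = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        batch_predictions = predict_fn(batch)
        if len(batch_predictions) != len(batch):
            raise ValueError("expected one prediction per input row")
        predictions.extend(batch_predictions)
    # Attach each prediction to its source row, like the index join in the query.
    return [dict(row, Prediction=pred) for row, pred in zip(rows, predictions)]


def fake_predict(batch):
    """Stand-in for an endpoint call: flags rows with a long petal."""
    return [int(row["petal length (cm)"] > 2.0) for row in batch]


rows = [
    {"petal length (cm)": 1.4},
    {"petal length (cm)": 4.7},
    {"petal length (cm)": 1.3},
]
scored = score_in_batches(rows, fake_predict, batch_size=2)
print([row["Prediction"] for row in scored])  # [0, 1, 0]
```

Batching keeps each request small, and the length check catches responses that would otherwise misalign predictions with rows.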

Tensor input example

The following example scores a model accepting tensor inputs. Tensor inputs should be formatted as described in TensorFlow Serving’s API docs. This example assumes a MODEL_VERSION_URI like https://<databricks-instance>/model/iris-classifier/Production/invocations, where <databricks-instance> is the name of your Databricks instance, and a Databricks REST API token called DATABRICKS_API_TOKEN.

curl -X POST -u token:$DATABRICKS_API_TOKEN $MODEL_VERSION_URI \
    -H 'Content-Type: application/json' \
    -d '{"inputs": [[5.1, 3.5, 1.4, 0.2]]}'

Supported scoring formats

For custom models, Model Serving supports scoring requests in Pandas DataFrame or tensor input format.

Pandas DataFrame

To send a request, construct a JSON-serialized Pandas DataFrame: a JSON object with one of the supported keys, whose value matches the corresponding input format.

  • (Recommended) dataframe_split format is a JSON-serialized Pandas DataFrame in the split orientation.

    {
      "dataframe_split": {
        "index": [0, 1],
        "columns": ["sepal length (cm)", "sepal width (cm)", "petal length (cm)", "petal width (cm)"],
        "data": [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2]]
      }
    }
    
  • dataframe_records is a JSON-serialized Pandas DataFrame in the records orientation.

    Note

    This format does not guarantee that column ordering is preserved, so the split format is preferred over the records format.

    {
      "dataframe_records": [
        {
          "sepal length (cm)": 5.1,
          "sepal width (cm)": 3.5,
          "petal length (cm)": 1.4,
          "petal width (cm)": 0.2
        },
        {
          "sepal length (cm)": 4.9,
          "sepal width (cm)": 3,
          "petal length (cm)": 1.4,
          "petal width (cm)": 0.2
        },
        {
          "sepal length (cm)": 4.7,
          "sepal width (cm)": 3.2,
          "petal length (cm)": 1.3,
          "petal width (cm)": 0.2
        }
      ]
    }
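
The two DataFrame formats above can be produced from the same row data. The following is a minimal sketch in plain Python, assuming the rows are available as a list of dictionaries; the helper names to_dataframe_split and to_dataframe_records are hypothetical. If pandas is available, DataFrame.to_dict(orient="split") yields the same index, columns, and data shape.

```python
import json


def to_dataframe_split(rows, columns):
    """Build a dataframe_split payload; column order follows `columns`."""
    return {
        "dataframe_split": {
            "index": list(range(len(rows))),
            "columns": list(columns),
            "data": [[row[col] for col in columns] for row in rows],
        }
    }


def to_dataframe_records(rows):
    """Build a dataframe_records payload with one JSON object per row."""
    return {"dataframe_records": [dict(row) for row in rows]}


columns = ["sepal length (cm)", "sepal width (cm)", "petal length (cm)", "petal width (cm)"]
rows = [
    dict(zip(columns, [5.1, 3.5, 1.4, 0.2])),
    dict(zip(columns, [4.9, 3.0, 1.4, 0.2])),
]

split_payload = to_dataframe_split(rows, columns)
records_payload = to_dataframe_records(rows)
print(json.dumps(split_payload))
```

Passing the column list explicitly is what lets the split format pin down column order, which the records format leaves to the receiver.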
    

The response from the endpoint contains the output from your model, serialized with JSON, wrapped in a predictions key.

{
  "predictions": [0,1,1,1,0]
}
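
A client typically unwraps the predictions key before using the output. The sketch below shows one way to do that defensively. The function name extract_predictions and the optional row-count check are assumptions for illustration, and the sketch assumes the model returns a list-like value under predictions.

```python
import json


def extract_predictions(response_text, expected_rows=None):
    """Parse a scoring response and return the value under 'predictions'."""
    response = json.loads(response_text)
    if "predictions" not in response:
        raise KeyError("response has no 'predictions' key")
    predictions = response["predictions"]
    # Optionally confirm that the endpoint returned one prediction per input row.
    if expected_rows is not None and len(predictions) != expected_rows:
        raise ValueError(
            f"expected {expected_rows} predictions, got {len(predictions)}"
        )
    return predictions


predictions = extract_predictions('{"predictions": [0,1,1,1,0]}', expected_rows=5)
print(predictions)  # [0, 1, 1, 1, 0]
```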

Tensor input

When your model expects tensors, as with a TensorFlow or PyTorch model, there are two supported format options for sending requests: instances and inputs.

If each row has multiple named tensors, every row must include one of each named tensor.

  • instances is a tensors-based format that accepts tensors in row format. Use this format if all the input tensors have the same 0-th dimension. Conceptually, each tensor in the instances list could be joined with the other tensors of the same name in the rest of the list to construct the full input tensor for the model, which would only be possible if all of the tensors have the same 0-th dimension.

    {"instances": [ 1, 2, 3 ]}
    

    The following example shows how to specify multiple named tensors.

    {
     "instances": [
      {
       "t1": "a",
       "t2": [1, 2, 3, 4, 5],
       "t3": [[1, 2], [3, 4], [5, 6]]
      },
      {
       "t1": "b",
       "t2": [6, 7, 8, 9, 10],
       "t3": [[7, 8], [9, 10], [11, 12]]
      }
     ]
    }
    
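  • inputs sends queries with tensors in columnar format. Unlike instances, this format does not require every named tensor to have the same 0-th dimension, so it can also express requests where, for example, t2 has a different number of tensor instances than t1 and t3, which the instances format cannot represent. The following example is the columnar form of the named-tensor request shown for instances.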

    {
     "inputs": {
      "t1": ["a", "b"],
      "t2": [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]],
      "t3": [[[1, 2], [3, 4], [5, 6]], [[7, 8], [9, 10], [11, 12]]]
     }
    }
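
The relationship between the row-oriented instances format and the columnar inputs format can be sketched in Python. The function name instances_to_inputs is hypothetical, and the conversion assumes every row supplies the same set of named tensors, as the instances format requires.

```python
def instances_to_inputs(instances):
    """Convert row-format named tensors ("instances") to columnar format ("inputs")."""
    if not instances:
        raise ValueError("instances must contain at least one row")
    names = list(instances[0].keys())
    for row in instances:
        if set(row) != set(names):
            raise ValueError("every row must supply the same named tensors")
    # Stack the per-row tensors of each name along a new 0-th dimension.
    return {"inputs": {name: [row[name] for row in instances] for name in names}}


instances = [
    {"t1": "a", "t2": [1, 2, 3, 4, 5], "t3": [[1, 2], [3, 4], [5, 6]]},
    {"t1": "b", "t2": [6, 7, 8, 9, 10], "t3": [[7, 8], [9, 10], [11, 12]]},
]
columnar = instances_to_inputs(instances)
print(columnar["inputs"]["t1"])  # ['a', 'b']
```

The reverse conversion is only possible when all columnar tensors share the same 0-th dimension, which is why inputs is the more general of the two formats.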
    

The response from the endpoint is in the following format.

{
  "predictions": [0,1,1,1,0]
}

Notebook example

See the following notebook for an example of how to test your Model Serving endpoint with a Python model:

Test Model Serving endpoint notebook
