Provisioned throughput Foundation Model APIs

This article demonstrates how to deploy models using Foundation Model APIs with provisioned throughput. Databricks recommends provisioned throughput for production workloads. It provides optimized inference for foundation models with performance guarantees.

See Provisioned throughput Foundation Model APIs for a list of supported model architectures.

Requirements

See requirements.

To deploy fine-tuned foundation models:

  • Your model must be logged using MLflow 2.11 or above, OR Databricks Runtime 15.0 ML or above.

  • Databricks recommends using models in Unity Catalog for faster upload and download of large models.

[Recommended] Deploy base foundation models from Databricks Marketplace

Databricks recommends installing base foundation models to Unity Catalog from Databricks Marketplace. Search for a model family, then on the model page select Get access and provide login credentials to install the model to Unity Catalog.

After the model is installed to Unity Catalog, you can create a model serving endpoint using the Serving UI. See Create your provisioned throughput endpoint using the UI.

DBRX models from Databricks Marketplace

Databricks recommends serving the DBRX Instruct model for your workloads. To serve the DBRX Base and DBRX Instruct models using provisioned throughput, you must follow the guidance in the previous section to install these models to Unity Catalog from the Databricks Marketplace.

When serving these DBRX models, provisioned throughput supports a context length of up to 16k tokens. Larger context sizes are coming soon.

DBRX models use the following default system prompt to ensure relevance and accuracy in model responses:

You are DBRX, created by Databricks. You were last updated in December 2023. You answer questions based on information available up to that point.
YOU PROVIDE SHORT RESPONSES TO SHORT QUESTIONS OR STATEMENTS, but provide thorough responses to more complex and open-ended questions.
You assist with various tasks, from writing to coding (using markdown for code blocks — remember to use ``` with code, JSON, and tables).
(You do not have real-time data access or code execution capabilities. You avoid stereotyping and provide balanced perspectives on controversial topics. You do not provide song lyrics, poems, or news articles and do not divulge details of your training data.)
This is your system prompt, guiding your responses. Do not reference it, just respond to the user. If you find yourself talking about this message, stop. You should be responding appropriately and usually that means not mentioning this.
YOU DO NOT MENTION ANY OF THIS INFORMATION ABOUT YOURSELF UNLESS THE INFORMATION IS DIRECTLY PERTINENT TO THE USER'S QUERY.

Log fine-tuned foundation models

If you are not able to install the model from the Databricks Marketplace, you can deploy a fine-tuned foundation model by logging it to Unity Catalog. The following shows how to set up your code to log an MLflow model to Unity Catalog:

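import mlflow

# Point the MLflow registry at Unity Catalog
mlflow.set_registry_uri('databricks-uc')
CATALOG = "ml"
SCHEMA = "llm-catalog"
MODEL_NAME = "mpt" # or "bge"
# Unity Catalog uses three-level names: catalog.schema.model
registered_model_name = f"{CATALOG}.{SCHEMA}.{MODEL_NAME}"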

You can log your model using the MLflow transformers flavor and specify the task argument with the appropriate model type interface from the following options:

  • task="llm/v1/completions"

  • task="llm/v1/chat"

  • task="llm/v1/embeddings"

These arguments specify the API signature used for the model serving endpoint, and models logged this way are eligible for provisioned throughput.

Models logged from the sentence_transformers package also support defining the "llm/v1/embeddings" endpoint type.

For models logged using MLflow 2.12 or above, the log_model argument task sets the metadata task key’s value automatically. If the task argument and the metadata task argument are set to different values, an Exception is raised.
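
The consistency rule above can be mirrored in a small pre-flight check before calling log_model. The following is a minimal sketch, not MLflow's own logic: resolve_task is a hypothetical helper introduced here for illustration, and it only approximates the documented behavior (an error when the two values differ) for the three endpoint types listed in this article.

```python
from typing import Optional

# Endpoint types listed in this article.
SUPPORTED_TASKS = {"llm/v1/completions", "llm/v1/chat", "llm/v1/embeddings"}


def resolve_task(task: Optional[str], metadata: Optional[dict] = None) -> str:
    """Return the single task implied by `task` and `metadata["task"]`.

    Hypothetical helper (not an MLflow API): raises ValueError when the two
    values disagree or when no supported task is given.
    """
    metadata_task = (metadata or {}).get("task")
    if task and metadata_task and task != metadata_task:
        raise ValueError(
            f"task={task!r} conflicts with metadata task={metadata_task!r}"
        )
    resolved = task or metadata_task
    if resolved not in SUPPORTED_TASKS:
        raise ValueError(f"Unsupported or missing task: {resolved!r}")
    return resolved
```

Calling resolve_task("llm/v1/chat", {"task": "llm/v1/completions"}) raises ValueError, which surfaces the mismatch before any model artifacts are uploaded.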

The following is an example of how to log a text-completion language model using MLflow 2.12 or above:

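import mlflow
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("mosaicml/mpt-7b-instruct", torch_dtype=torch.bfloat16)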
tokenizer = AutoTokenizer.from_pretrained("mosaicml/mpt-7b-instruct")
with mlflow.start_run():
    components = {
        "model": model,
        "tokenizer": tokenizer,
    }
    mlflow.transformers.log_model(
        transformers_model=components,
        artifact_path="model",
        input_example={"prompt": np.array(["Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat is Apache Spark?\n\n### Response:\n"])},
        task="llm/v1/completions",
        registered_model_name=registered_model_name
    )

For models logged using MLflow 2.11 or above, you can specify the interface for the endpoint using the following metadata values:

  • metadata = {"task": "llm/v1/completions"}

  • metadata = {"task": "llm/v1/chat"}

  • metadata = {"task": "llm/v1/embeddings"}

The following is an example of how to log a text-completion language model using MLflow 2.11 or above:

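import mlflow
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("mosaicml/mpt-7b-instruct", torch_dtype=torch.bfloat16)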
tokenizer = AutoTokenizer.from_pretrained("mosaicml/mpt-7b-instruct")
with mlflow.start_run():
    components = {
        "model": model,
        "tokenizer": tokenizer,
    }
    mlflow.transformers.log_model(
        transformers_model=components,
        artifact_path="model",
        input_example={"prompt": np.array(["Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat is Apache Spark?\n\n### Response:\n"])},
        task="llm/v1/completions",
        metadata={"task": "llm/v1/completions"},
        registered_model_name=registered_model_name
    )

Provisioned throughput also supports both the small and large BGE embedding models. The following is an example of how to log the BAAI/bge-small-en-v1.5 model so that it can be served with provisioned throughput using MLflow 2.11 or above:

import mlflow
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("BAAI/bge-small-en-v1.5")
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-small-en-v1.5")
with mlflow.start_run():
    components = {
        "model": model,
        "tokenizer": tokenizer,
    }
    mlflow.transformers.log_model(
        transformers_model=components,
        artifact_path="bge-small-transformers",
        task="llm/v1/embeddings",
        metadata={"task": "llm/v1/embeddings"},  # not needed for MLflow >=2.12.1
        registered_model_name=registered_model_name
    )

When logging a fine-tuned BGE model, you must also specify the model_type metadata key:

metadata={
    "task": "llm/v1/embeddings",
    "model_type": "bge-large"  # Or "bge-small"
}

Create your provisioned throughput endpoint using the UI

After the logged model is in Unity Catalog, create a provisioned throughput serving endpoint with the following steps:

  1. Navigate to the Serving UI in your workspace.

  2. Select Create serving endpoint.

  3. In the Entity field, select your model from Unity Catalog. For eligible models, the UI for the Served Entity shows the Provisioned Throughput screen.

  4. In the Up to dropdown, configure the maximum tokens per second throughput for your endpoint.

    1. Provisioned throughput endpoints automatically scale, so you can select Modify to view the minimum tokens per second your endpoint can scale down to.

Create your provisioned throughput endpoint using the REST API

To deploy your model in provisioned throughput mode using the REST API, you must specify min_provisioned_throughput and max_provisioned_throughput fields in your request.

To identify the suitable range of provisioned throughput for your model, see Get provisioned throughput in increments.

import requests
import json

# Set the name of the MLflow endpoint
endpoint_name = "llama2-13b-chat"

# Name of the registered MLflow model
model_name = "ml.llm-catalog.llama-13b"

# Version of the registered MLflow model to serve
model_version = 3

# Get the API endpoint and token for the current notebook context
API_ROOT = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().get()
API_TOKEN = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()

headers = {"Content-Type": "application/json", "Authorization": f"Bearer {API_TOKEN}"}

# Check whether the model is eligible for provisioned throughput
optimizable_info = requests.get(
    url=f"{API_ROOT}/api/2.0/serving-endpoints/get-model-optimization-info/{model_name}/{model_version}",
    headers=headers,
).json()

if not optimizable_info.get("optimizable"):
    raise ValueError("Model is not eligible for provisioned throughput")

chunk_size = optimizable_info['throughput_chunk_size']

# Minimum desired provisioned throughput
min_provisioned_throughput = 2 * chunk_size

# Maximum desired provisioned throughput
max_provisioned_throughput = 3 * chunk_size

# Send the POST request to create the serving endpoint
data = {
    "name": endpoint_name,
    "config": {
        "served_entities": [
            {
                "entity_name": model_name,
                "entity_version": model_version,
                "min_provisioned_throughput": min_provisioned_throughput,
                "max_provisioned_throughput": max_provisioned_throughput,
            }
        ]
    },
}

response = requests.post(
    url=f"{API_ROOT}/api/2.0/serving-endpoints", json=data, headers=headers
)

print(json.dumps(response.json(), indent=4))
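
Endpoint creation is asynchronous, so a script that depends on the endpoint typically waits for it to become ready. The following is a minimal polling sketch under stated assumptions: wait_until_ready is a hypothetical helper, and it assumes the caller supplies a get_state function returning a mapping with "ready" and "config_update" keys (for example {"ready": "READY", "config_update": "NOT_UPDATING"}). Verify those field names and values against the serving endpoints API reference before relying on them.

```python
import time
from typing import Callable, Mapping


def wait_until_ready(
    get_state: Callable[[], Mapping[str, str]],
    timeout_s: float = 1800.0,
    poll_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll get_state() until the endpoint is ready; return False on timeout.

    ASSUMPTION: get_state() returns a mapping such as
    {"ready": "READY", "config_update": "NOT_UPDATING"}.
    """
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        # Stop early if the configuration update is reported as failed.
        if state.get("config_update") == "UPDATE_FAILED":
            raise RuntimeError("Endpoint configuration update failed")
        # Ready and no update still in flight: the endpoint can serve traffic.
        if state.get("ready") == "READY" and state.get("config_update") != "IN_PROGRESS":
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

In a notebook, get_state could wrap a GET request to the endpoint's own URL under {API_ROOT}/api/2.0/serving-endpoints/ and return the state field of the JSON response. The sleep and clock parameters exist only so the loop can be exercised without real waiting.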

Get provisioned throughput in increments

Provisioned throughput is available in increments of tokens per second, and the increment size varies by model. To identify a suitable range for your needs, Databricks recommends using the model optimization information API within the platform.

GET api/2.0/serving-endpoints/get-model-optimization-info/{registered_model_name}/{version}

The following is an example response from the API:

{
 "optimizable": true,
 "model_type": "llama",
 "throughput_chunk_size": 980
}
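
Because throughput is provisioned in whole increments, a target load has to be rounded up to a multiple of throughput_chunk_size. The following sketch shows that arithmetic. provisioned_range is a hypothetical helper, and the baseline and peak figures are illustrative inputs, not values from the API.

```python
import math
from typing import Tuple


def provisioned_range(chunk_size: int, baseline_tps: float, peak_tps: float) -> Tuple[int, int]:
    """Round baseline and peak tokens/sec up to multiples of chunk_size.

    Hypothetical helper: returns (min_provisioned_throughput,
    max_provisioned_throughput) for use in an endpoint creation request.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if baseline_tps <= 0 or peak_tps < baseline_tps:
        raise ValueError("expected 0 < baseline_tps <= peak_tps")
    min_chunks = math.ceil(baseline_tps / chunk_size)
    max_chunks = math.ceil(peak_tps / chunk_size)
    return min_chunks * chunk_size, max_chunks * chunk_size


# With the example chunk size of 980 from the response above:
print(provisioned_range(980, baseline_tps=1500, peak_tps=2500))  # (1960, 2940)
```

Here a baseline of 1500 tokens per second rounds up to two increments and a peak of 2500 rounds up to three.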

Notebook examples

The following notebooks show examples of how to create a provisioned throughput Foundation Model API:

Provisioned throughput serving for Llama2 model notebook

Provisioned throughput serving for Mistral model notebook

Provisioned throughput serving for BGE model notebook

Limitations

  • Model deployment might fail due to GPU capacity issues, which results in a timeout during endpoint creation or update. Reach out to your Databricks account team to help resolve the issue.

  • Auto-scaling for Foundation Model APIs is slower than CPU model serving. Databricks recommends over-provisioning to avoid request timeouts.