Tutorial: Create external model endpoints to query OpenAI models

This article provides step-by-step instructions for configuring and querying an external model endpoint that serves OpenAI models for completions, chat, and embeddings using the MLflow Deployments SDK. Learn more about external models.

Requirements

  • Databricks Runtime 13.0 ML or above.

  • MLflow 2.9 or above.

  • OpenAI API keys.

  • Databricks CLI version 0.205 or above.

Step 1: Store the OpenAI API key using the Databricks Secrets CLI

You can store the OpenAI API key using the Databricks Secrets CLI (version 0.205 and above). You can also use the REST API for secrets.

The following creates the secret scope named my_openai_secret_scope, and then creates the secret openai_api_key in that scope.

databricks secrets create-scope my_openai_secret_scope
databricks secrets put-secret my_openai_secret_scope openai_api_key

Step 2: Install MLflow with external models support

Use the following to install an MLflow version with external models support:

%pip install mlflow[genai]>=2.9.0
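
After the install completes in a Databricks notebook, you typically restart the Python process so the newly installed MLflow version is picked up. A minimal sketch:

dbutils.library.restartPython()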

Step 3: Create and manage an external model endpoint

Important

The code examples in this section demonstrate usage of the Public Preview MLflow Deployments CRUD SDK.

You can create an external model endpoint for each large language model (LLM) you want to use by calling the create_endpoint() method from the MLflow Deployments SDK. The following code snippet creates a completions endpoint for OpenAI gpt-3.5-turbo-instruct, as specified in the served_entities section of the configuration.

import mlflow.deployments

client = mlflow.deployments.get_deploy_client("databricks")
client.create_endpoint(
    name="openai-completions-endpoint",
    config={
        "served_entities": [{
            "name": "openai-completions"
            "external_model": {
                "name": "gpt-3.5-turbo-instruct",
                "provider": "openai",
                "task": "llm/v1/completions",
                "openai_config": {
                    "openai_api_key": "{{secrets/my_openai_secret_scope/openai_api_key}}"
                }
            }
        }]
    }
)
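
After creating the endpoint, you can verify its configuration and state. As a sketch, the MLflow Deployments SDK's get_endpoint() method returns the endpoint details:

# Retrieve and inspect the endpoint you just created
endpoint = client.get_endpoint(endpoint="openai-completions-endpoint")
print(endpoint)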

If you are using Azure OpenAI, you can also specify the Azure OpenAI deployment name, endpoint URL, and API version in the openai_config section of the configuration.

client.create_endpoint(
    name="openai-completions-endpoint",
    config={
        "served_entities": [
          {
            "name": "openai-completions",
            "external_model": {
                "name": "gpt-3.5-turbo-instruct",
                "provider": "openai",
                "task": "llm/v1/completions",
                "openai_config": {
                    "openai_api_type": "azure",
                    "openai_api_key": "{{secrets/my_openai_secret_scope/openai_api_key}}",
                    "openai_api_base": "https://my-azure-openai-endpoint.openai.azure.com",
                    "openai_deployment_name": "my-gpt-35-turbo-deployment",
                    "openai_api_version": "2023-05-15"
                },
            },
          }
        ],
    },
)
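
The same pattern applies to the other supported tasks, such as chat and embeddings. For example, the following sketch creates a chat endpoint for OpenAI gpt-3.5-turbo by setting the task to llm/v1/chat. The endpoint and served entity names here are illustrative.

client.create_endpoint(
    name="openai-chat-endpoint",
    config={
        "served_entities": [{
            "name": "openai-chat",
            "external_model": {
                "name": "gpt-3.5-turbo",
                "provider": "openai",
                "task": "llm/v1/chat",
                "openai_config": {
                    "openai_api_key": "{{secrets/my_openai_secret_scope/openai_api_key}}"
                }
            }
        }]
    }
)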

To update an endpoint, use update_endpoint(). The following code snippet demonstrates how to update an endpoint’s rate limits to 20 calls per minute per user.

client.update_endpoint(
    endpoint="openai-completions-endpoint",
    config={
        "rate_limits": [
            {
                "key": "user",
                "renewal_period": "minute",
                "calls": 20
            }
        ],
    },
)
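
When you no longer need an endpoint, you can remove it with delete_endpoint():

# Delete the endpoint and stop serving the external model
client.delete_endpoint(endpoint="openai-completions-endpoint")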

Step 4: Send requests to an external model endpoint

Important

The code examples in this section demonstrate usage of the MLflow Deployments SDK’s predict() method.

You can send chat, completions, and embeddings requests to an external model endpoint using the MLflow Deployments SDK’s predict() method.

The following sends a completions request to the gpt-3.5-turbo-instruct model hosted by OpenAI.

completions_response = client.predict(
    endpoint="openai-completions-endpoint",
    inputs={
        "prompt": "What is the capital of France?",
        "temperature": 0.1,
        "max_tokens": 10,
        "n": 2
    }
)
completions_response == {
    "id": "cmpl-8QW0hdtUesKmhB3a1Vel6X25j2MDJ",
    "object": "text_completion",
    "created": 1701330267,
    "model": "gpt-3.5-turbo-instruct",
    "choices": [
        {
            "text": "The capital of France is Paris.",
            "index": 0,
            "finish_reason": "stop",
            "logprobs": None
        },
        {
            "text": "Paris is the capital of France",
            "index": 1,
            "finish_reason": "stop",
            "logprobs": None
        },
    ],
    "usage": {
        "prompt_tokens": 7,
        "completion_tokens": 16,
        "total_tokens": 23
    }
}
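
Chat and embeddings requests follow the same pattern. For example, assuming you created a chat endpoint like the openai-chat-endpoint sketched in Step 3, a chat request passes a list of messages instead of a prompt:

chat_response = client.predict(
    endpoint="openai-chat-endpoint",
    inputs={
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "temperature": 0.1,
        "max_tokens": 50
    }
)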

Step 5: Compare models from a different provider

Model serving supports many external model providers, including OpenAI, Anthropic, Cohere, Amazon Bedrock, Google Cloud Vertex AI, and more. You can compare LLMs across providers using the AI Playground, which helps you optimize the accuracy, speed, and cost of your applications.

The following example creates an endpoint for Anthropic claude-2 and compares its response to a question with the response from the OpenAI gpt-3.5-turbo-instruct endpoint. Both responses have the same standard format, which makes them easy to compare.

Create an endpoint for Anthropic claude-2

import mlflow.deployments

client = mlflow.deployments.get_deploy_client("databricks")

client.create_endpoint(
    name="anthropic-completions-endpoint",
    config={
        "served_entities": [
            {
                "name": "claude-completions",
                "external_model": {
                    "name": "claude-2",
                    "provider": "anthropic",
                    "task": "llm/v1/completions",
                    "anthropic_config": {
                        "anthropic_api_key": "{{secrets/my_anthropic_secret_scope/anthropic_api_key}}"
                    },
                },
            }
        ],
    },
)
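
Before comparing, you can confirm that both endpoints exist with list_endpoints(), which returns the serving endpoints visible to you:

# List all serving endpoints to verify both completions endpoints are available
for endpoint in client.list_endpoints():
    print(endpoint)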

Compare the responses from each endpoint

openai_response = client.predict(
    endpoint="openai-completions-endpoint",
    inputs={
        "prompt": "How is Pi calculated? Be very concise."
    }
)
anthropic_response = client.predict(
    endpoint="anthropic-completions-endpoint",
    inputs={
        "prompt": "How is Pi calculated? Be very concise."
    }
)
openai_response["choices"] == [
    {
        "text": "Pi is calculated by dividing the circumference of a circle by its diameter."
                " This constant ratio of 3.14159... is then used to represent the relationship"
                " between a circle's circumference and its diameter, regardless of the size of the"
                " circle.",
        "index": 0,
        "finish_reason": "stop",
        "logprobs": None
    }
]
anthropic_response["choices"] == [
    {
        "text": "Pi is calculated by approximating the ratio of a circle's circumference to"
                " its diameter. Common approximation methods include infinite series, infinite"
                " products, and computing the perimeters of polygons with more and more sides"
                " inscribed in or around a circle.",
        "index": 0,
        "finish_reason": "stop",
        "logprobs": None
    }
]