Tutorial: Create external model endpoints to query OpenAI models
This page provides step-by-step instructions for configuring and querying an external model endpoint that serves OpenAI models for completions, chat, and embeddings. You create the endpoint with the MLflow Deployments SDK and query it with the OpenAI client. For more information, see external models.
After you create an endpoint, Databricks recommends configuring Unity AI Gateway on it to add governance features like usage tracking, payload logging, guardrails, and rate limits. All external models served through Model Serving are queried using the OpenAI-compatible API, so you can use a single client across providers. See Unity AI Gateway.
If you prefer to use the Serving UI to accomplish this task, see Create an external model serving endpoint.
Requirements
- Databricks Runtime 13.0 ML or above.
- MLflow 2.9 or above.
- OpenAI API keys.
- Databricks CLI version 0.205 or above.
(Optional) Step 0: Store the OpenAI API key using the Databricks Secrets CLI
You can provide your API keys either as plaintext strings in Step 2 or by using Databricks Secrets.
To store the OpenAI API key as a secret, you can use the Databricks Secrets CLI (version 0.205 and above). You can also use the REST API for secrets.
The following creates a secret scope named my_openai_secret_scope, and then creates the secret openai_api_key in that scope.
databricks secrets create-scope my_openai_secret_scope
databricks secrets put-secret my_openai_secret_scope openai_api_key
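Step 2 refers to this secret with a reference string of the form {{secrets/<scope>/<key>}}. As a minimal sketch, the helper below builds that string from the scope and key names created above. The function secret_reference is a hypothetical helper for illustration; it is not part of the Databricks CLI or MLflow, and it only formats a string.

```python
def secret_reference(scope: str, key: str) -> str:
    """Return the '{{secrets/<scope>/<key>}}' reference string for a stored secret."""
    # Hypothetical helper: it only formats the string and never contacts Databricks.
    if not scope or not key or "/" in scope or "/" in key:
        raise ValueError("scope and key must be non-empty and must not contain '/'")
    return "{{secrets/" + scope + "/" + key + "}}"


print(secret_reference("my_openai_secret_scope", "openai_api_key"))
```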
Step 1: Install MLflow with external models support
Use the following to install an MLflow version with external models support:
%pip install "mlflow[genai]>=2.9.0"
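After the install, you may want to confirm that the environment meets the MLflow 2.9 minimum from the requirements. The following is a minimal sketch that uses only the Python standard library; meets_minimum is a hypothetical helper, and it compares only the major and minor version numbers.

```python
import re
from importlib import metadata


def meets_minimum(version: str, minimum: tuple = (2, 9)) -> bool:
    """Return True if a 'major.minor...' version string is at least `minimum`."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    # Compare numerically so that 2.10 is treated as newer than 2.9.
    return (int(match.group(1)), int(match.group(2))) >= minimum


try:
    installed = metadata.version("mlflow")
    print(f"mlflow {installed} meets minimum: {meets_minimum(installed)}")
except metadata.PackageNotFoundError:
    print("mlflow is not installed in this environment")
```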
Step 2: Create and manage an external model endpoint
The code examples in this section demonstrate usage of the Public Preview MLflow Deployments CRUD SDK.
To create an external model endpoint for a large language model (LLM), use the create_endpoint() method from the MLflow Deployments SDK. You can also create external model endpoints in the Serving UI.
The following code snippet creates a completions endpoint for OpenAI gpt-3.5-turbo-instruct, as specified in the served_entities section of the configuration. For your endpoint, populate name and openai_api_key with your own values.
import mlflow.deployments

client = mlflow.deployments.get_deploy_client("databricks")
client.create_endpoint(
    name="openai-completions-endpoint",
    config={
        "served_entities": [{
            "name": "openai-completions",
            "external_model": {
                "name": "gpt-3.5-turbo-instruct",
                "provider": "openai",
                "task": "llm/v1/completions",
                "openai_config": {
                    "openai_api_key": "{{secrets/my_openai_secret_scope/openai_api_key}}"
                }
            }
        }]
    }
)
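If you create several endpoints, it can help to build the config dictionary with a small function so that the nesting stays consistent. The sketch below reproduces the structure of the configuration above. The function external_model_config is a hypothetical helper, and deriving the settings key as "<provider>_config" is an assumption that matches the openai and anthropic examples on this page; other providers may name the key differently.

```python
def external_model_config(served_name, model_name, provider, task, provider_config):
    """Build the `config` argument of create_endpoint() for one external model."""
    return {
        "served_entities": [
            {
                "name": served_name,
                "external_model": {
                    "name": model_name,
                    "provider": provider,
                    "task": task,
                    # Assumption: provider settings live under "<provider>_config",
                    # as in "openai_config" and "anthropic_config".
                    f"{provider}_config": provider_config,
                },
            }
        ]
    }


config = external_model_config(
    served_name="openai-completions",
    model_name="gpt-3.5-turbo-instruct",
    provider="openai",
    task="llm/v1/completions",
    provider_config={
        "openai_api_key": "{{secrets/my_openai_secret_scope/openai_api_key}}"
    },
)
print(config)
```

The resulting dictionary can be passed as the config argument of client.create_endpoint().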
Alternatively, you can provide your OpenAI API key as a plaintext string. The following code snippet creates the same completions endpoint as the first example in this step, using openai_api_key_plaintext instead of a secret reference.
import mlflow.deployments

client = mlflow.deployments.get_deploy_client("databricks")
client.create_endpoint(
    name="openai-completions-endpoint",
    config={
        "served_entities": [{
            "name": "openai-completions",
            "external_model": {
                "name": "gpt-3.5-turbo-instruct",
                "provider": "openai",
                "task": "llm/v1/completions",
                "openai_config": {
                    "openai_api_key_plaintext": "sk-yourApiKey"
                }
            }
        }]
    }
)
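When you use the plaintext option, you may prefer not to hardcode the key in notebook source. The following sketch reads the key from an environment variable and returns the openai_config dictionary. The function name plaintext_openai_config and the variable name OPENAI_API_KEY are assumptions for illustration.

```python
import os


def plaintext_openai_config(env_var: str = "OPENAI_API_KEY") -> dict:
    """Build an openai_config dict whose plaintext key comes from an environment variable."""
    api_key = os.environ.get(env_var)
    if not api_key:
        # Fail early instead of creating an endpoint with an empty key.
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return {"openai_api_key_plaintext": api_key}
```

The returned dictionary can be used as the openai_config value in the configuration above.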
If you are using Azure OpenAI, you can also specify the Azure OpenAI deployment name, endpoint URL, and API version in the openai_config section of the configuration.
client.create_endpoint(
    name="openai-completions-endpoint",
    config={
        "served_entities": [
            {
                "name": "openai-completions",
                "external_model": {
                    "name": "gpt-3.5-turbo-instruct",
                    "provider": "openai",
                    "task": "llm/v1/completions",
                    "openai_config": {
                        "openai_api_type": "azure",
                        "openai_api_key": "{{secrets/my_openai_secret_scope/openai_api_key}}",
                        "openai_api_base": "https://my-azure-openai-endpoint.openai.azure.com",
                        "openai_deployment_name": "my-gpt-35-turbo-deployment",
                        "openai_api_version": "2023-05-15"
                    },
                },
            }
        ],
    },
)
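Before you create an Azure OpenAI endpoint, a quick client-side check can catch a forgotten field. The sketch below looks for the three Azure-specific fields used in the example above. The helper missing_azure_fields is hypothetical, and treating all three fields as required is an assumption; the service performs its own validation.

```python
# Azure-specific fields taken from the example configuration above.
AZURE_FIELDS = ("openai_api_base", "openai_deployment_name", "openai_api_version")


def missing_azure_fields(openai_config: dict) -> list:
    """Return the Azure-specific fields that are absent or empty in openai_config."""
    if openai_config.get("openai_api_type") != "azure":
        return []  # Not an Azure configuration, so there is nothing to check.
    return [field for field in AZURE_FIELDS if not openai_config.get(field)]


print(missing_azure_fields({"openai_api_type": "azure"}))
```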
To configure rate limits, usage tracking, payload logging, or guardrails on the endpoint, use Unity AI Gateway. Rate limits configured through Unity AI Gateway can be query-based (QPM) or token-based (TPM), and you can set them per user, per group, or endpoint-wide.
See Configure Unity AI Gateway on model serving endpoints for a programmatic example that updates an endpoint to add rate limits and other Unity AI Gateway features.
The previously documented client.update_endpoint() pattern with a top-level rate_limits field is deprecated. Use the Unity AI Gateway configuration on the endpoint instead.
Step 3: Send requests to an external model endpoint
Databricks recommends querying external model endpoints using the OpenAI client. Model Serving exposes a unified, OpenAI-compatible API across providers, so the same client code works whether the underlying model is from OpenAI, Anthropic, Cohere, Amazon Bedrock, Google Cloud Vertex AI, or a custom provider.
Install the OpenAI client on your compute:
%pip install openai
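The following sends a chat completions request to an endpoint that serves an OpenAI chat model. The example assumes an endpoint named openai-chat-endpoint that is configured for the llm/v1/chat task; Step 2 creates only a completions endpoint. Replace the base_url value with your Databricks workspace URL, and provide a Databricks personal access token for api_key. Set the model parameter to the name of your model serving endpoint.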
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("DATABRICKS_TOKEN"),
    base_url="https://<workspace-name>.cloud.databricks.com/serving-endpoints"
)
response = client.chat.completions.create(
    model="openai-chat-endpoint",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    max_tokens=128,
    temperature=0.1,
)
print(response.choices[0].message.content)
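The base_url in the example above is the workspace URL followed by /serving-endpoints. As a minimal sketch, the helper below derives that value from a workspace host name or URL so that a trailing slash or a missing scheme does not produce a malformed address. The function serving_base_url is a hypothetical helper.

```python
def serving_base_url(workspace_host: str) -> str:
    """Return '<workspace URL>/serving-endpoints' for a workspace host name or URL."""
    host = workspace_host.strip().rstrip("/")
    # Add the scheme when only a bare host name is given.
    if not host.startswith(("https://", "http://")):
        host = "https://" + host
    return host + "/serving-endpoints"


print(serving_base_url("<workspace-name>.cloud.databricks.com/"))
```

The result can be passed as the base_url argument when you construct the OpenAI client.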
To send a completions request to an endpoint configured for the llm/v1/completions task, use client.completions.create():
response = client.completions.create(
    model="openai-completions-endpoint",
    prompt="What is the capital of France?",
    max_tokens=10,
    temperature=0.1,
    n=2,
)
print(response)
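Because the request above sets n=2, the response carries more than one entry in choices. The following sketch collects the text of every choice instead of printing the whole response object. The helper completion_texts is hypothetical, and the stand-in object only mimics the attribute shape of a completions response so that the snippet runs without calling an endpoint.

```python
from types import SimpleNamespace


def completion_texts(response) -> list:
    """Return the generated text of every choice in a completions response."""
    return [choice.text for choice in response.choices]


# Stand-in with the same attribute shape as a completions response (illustrative values).
fake_response = SimpleNamespace(
    choices=[SimpleNamespace(text=" Paris"), SimpleNamespace(text=" Paris.")]
)
print(completion_texts(fake_response))
```

With a real endpoint, pass the object returned by client.completions.create() in place of fake_response.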
To send an embeddings request to an endpoint configured for the llm/v1/embeddings task, use client.embeddings.create():
response = client.embeddings.create(
    model="openai-embeddings-endpoint",
    input="Databricks is a unified analytics platform.",
)
print(response.data[0].embedding)
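An embedding is a list of floats, and a common way to use two of them is to measure how similar the underlying texts are. The following is a minimal sketch of cosine similarity in plain Python; cosine_similarity is a hypothetical helper, and the short vectors in the usage line are illustrative values, not real model output.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

With real responses, pass response.data[0].embedding from two separate embeddings requests.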
If you run the OpenAI client from inside a Databricks notebook, you can use the databricks-openai helper, which automatically configures authentication and the workspace base URL. See Use foundation models for details.
Step 4: Compare models from a different provider
Model Serving supports many external model providers, including OpenAI, Anthropic, Cohere, Amazon Bedrock, Google Cloud Vertex AI, and more. You can compare LLMs across providers using the AI Playground, which helps you optimize the accuracy, speed, and cost of your applications.
The following example creates an endpoint for Anthropic claude-2 and compares its response to a question with the response from the endpoint that uses OpenAI gpt-3.5-turbo-instruct. Both responses have the same standard format, which makes them easy to compare.
Create an endpoint for Anthropic claude-2
import mlflow.deployments

client = mlflow.deployments.get_deploy_client("databricks")
client.create_endpoint(
    name="anthropic-completions-endpoint",
    config={
        "served_entities": [
            {
                "name": "claude-completions",
                "external_model": {
                    "name": "claude-2",
                    "provider": "anthropic",
                    "task": "llm/v1/completions",
                    "anthropic_config": {
                        "anthropic_api_key": "{{secrets/my_anthropic_secret_scope/anthropic_api_key}}"
                    },
                },
            }
        ],
    },
)
Compare the responses from each endpoint
Because all external model endpoints expose an OpenAI-compatible API, you can query both endpoints with the same OpenAI client by switching the model parameter to the corresponding endpoint name.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("DATABRICKS_TOKEN"),
    base_url="https://<workspace-name>.cloud.databricks.com/serving-endpoints"
)
prompt = "How is Pi calculated? Be very concise."
openai_response = client.completions.create(
    model="openai-completions-endpoint",
    prompt=prompt,
)
anthropic_response = client.completions.create(
    model="anthropic-completions-endpoint",
    prompt=prompt,
)
print("OpenAI:", openai_response.choices[0].text)
print("Anthropic:", anthropic_response.choices[0].text)
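To compare more than two endpoints, the same pattern can be wrapped in a loop. The sketch below sends one prompt to each endpoint name and returns the first choice from each response. The function compare_endpoints is a hypothetical helper, and the fake client exists only so that the snippet runs without a workspace; with a real workspace, pass the OpenAI client created above.

```python
from types import SimpleNamespace


def compare_endpoints(client, endpoint_names, prompt, **kwargs) -> dict:
    """Send the same prompt to each endpoint and return {endpoint name: first choice text}."""
    results = {}
    for name in endpoint_names:
        response = client.completions.create(model=name, prompt=prompt, **kwargs)
        results[name] = response.choices[0].text
    return results


class _FakeCompletions:
    """Stand-in for client.completions that echoes the endpoint name and prompt."""

    def create(self, model, prompt, **kwargs):
        return SimpleNamespace(choices=[SimpleNamespace(text=f"[{model}] {prompt}")])


fake_client = SimpleNamespace(completions=_FakeCompletions())
print(compare_endpoints(fake_client, ["endpoint-a", "endpoint-b"], "Hi"))
```

Extra keyword arguments such as max_tokens are forwarded unchanged to each request.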