How to query Foundation Model APIs with the Python SDK

Important

The Foundation Model APIs Python SDK is an Experimental feature and the API definition may change.

This article provides guidance on how to query Databricks Foundation Model APIs with the Python SDK. It includes installation instructions and query and response formats.

The Python SDK is a layer on top of the REST API. It handles low-level details, such as authentication and mapping model IDs to endpoint URLs, making it easier to interact with the models. The SDK is designed to be used from inside Databricks notebooks.

Requirements

See Requirements in the Foundation Model APIs documentation.

Install the Foundation Model APIs Python SDK

You can install the SDK on a cluster attached to a Databricks notebook or in your local environment. After the SDK is installed, you can use it to query models, as the following examples show.

Install the SDK in a Databricks notebook

You can install the SDK on the cluster attached to your Databricks notebook. Run the following commands in a notebook cell:

!pip install databricks-genai-inference

dbutils.library.restartPython()

Install the SDK on your local environment

If you are working outside of a Databricks notebook, you can install the SDK in your local environment. Run the following command in your terminal:

pip install databricks-genai-inference

Databricks native authentication is required to use this SDK.

To authenticate, first generate a personal access token for your application, then set the two environment variables DATABRICKS_HOST and DATABRICKS_TOKEN.

In the following commands, DATABRICKS_HOST is the Databricks host URL for your workspace. This URL typically starts with https:// and includes the workspace instance name. DATABRICKS_TOKEN is the personal access token value that you generated.

export DATABRICKS_HOST=<YOUR HOST NAME>
export DATABRICKS_TOKEN=<YOUR TOKEN>
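If you prefer to configure authentication from Python rather than your shell, a minimal sketch that sets the same variables in-process before calling the SDK (the placeholder values are illustrative):

import os

# Set Databricks native authentication variables for the current process.
# Replace the placeholders with your workspace URL and token value.
os.environ["DATABRICKS_HOST"] = "https://<workspace-instance-name>"
os.environ["DATABRICKS_TOKEN"] = "<your-personal-access-token>"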

Query a chat completion model

To query the llama-2-70b-chat chat completion model, use ChatCompletion.create() to execute a model query. The ChatCompletion.create() function accepts the same arguments as the Chat task request API.

from databricks_genai_inference import ChatCompletion

response = ChatCompletion.create(model="llama-2-70b-chat",
                                 messages=[{"role": "system", "content": "You are a helpful assistant."},
                                           {"role": "user","content": "Knock knock."}],
                                 max_tokens=128)
print(f"response.message:{response.message}")

By default, create() returns a single response object (ChatCompletionObject) after the complete response has been generated. For a large model like llama-2-70b-chat, this can easily take more than 5 seconds.

For a more responsive experience, you can stream text fragments as they are generated by passing stream=True to create(). When streaming is enabled, create() returns a generator that yields a sequence of response fragments (ChatCompletionChunkObject).
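For example, the following sketch streams a chat completion and prints each fragment's message property as it arrives:

from databricks_genai_inference import ChatCompletion

# With stream=True, create() returns a generator of ChatCompletionChunkObject
# fragments; each fragment's message property holds the next piece of text.
for chunk in ChatCompletion.create(model="llama-2-70b-chat",
                                   messages=[{"role": "user", "content": "Knock knock."}],
                                   max_tokens=128,
                                   stream=True):
    print(chunk.message, end="")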

Both kinds of response objects have the same top-level properties:

Property | Type | Description
json | dict | Raw JSON response (see the Chat task API for details)
id | string | Unique request ID
model | string | Model name
message | string | Chat completion
usage | dict | Token usage metadata (cumulative for streaming)
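For example, after a non-streaming call you can read these properties directly off the response object from the earlier example:

# Inspect metadata on the ChatCompletionObject returned above.
print(f"request id: {response.id}")
print(f"model: {response.model}")
print(f"token usage: {response.usage}")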

Chat session

ChatSession is a high-level class for managing multi-round chat conversations. It provides the following functions:

Function | Return | Description
reply(string) | | Takes a new user message
last | string | Last message from the assistant
history | list of dict | Messages in the chat history, including roles
count | int | Number of chat rounds conducted so far

To initialize ChatSession, you use the same set of arguments as ChatCompletion, and those arguments are used throughout the chat session.


from databricks_genai_inference import ChatSession

chat = ChatSession(model="llama-2-70b-chat", system_message="You are a helpful assistant.", max_tokens=128)
chat.reply("Knock, knock!")
chat.last # return "Hello! Who's there?"
chat.reply("Guess who!")
chat.last # return "Okay, I'll play along! Is it a person, a place, or a thing?"

chat.history
# return: [
#     {'role': 'system', 'content': 'You are a helpful assistant.'},
#     {'role': 'user', 'content': 'Knock, knock.'},
#     {'role': 'assistant', 'content': "Hello! Who's there?"},
#     {'role': 'user', 'content': 'Guess who!'},
#     {'role': 'assistant', 'content': "Okay, I'll play along! Is it a person, a place, or a thing?"}
# ]

Query an embedding model

To query the bge-large-en embedding model, use Embedding.create() to execute a model query. The Embedding.create() function accepts the same arguments as the Embedding task request API.

The following example generates embeddings optimized for indexing.

from databricks_genai_inference import Embedding

response = Embedding.create(
    model="bge-large-en",
    input="3D ActionSLAM: wearable person tracking in multi-floor environments")
print(f'embeddings: {response.embeddings}')

You can pass multiple inputs into create() by setting input to a list of strings.

To optimize embeddings for query retrieval in a RAG application, add the parameter instruction="Represent this sentence for searching relevant passages:".
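The following sketch combines both options: it embeds a list of query strings (the example queries are illustrative) and passes the retrieval instruction:

from databricks_genai_inference import Embedding

# Embed multiple inputs in one call by passing a list of strings; the
# instruction parameter optimizes the embeddings for query retrieval.
response = Embedding.create(
    model="bge-large-en",
    input=["wearable person tracking", "multi-floor indoor localization"],
    instruction="Represent this sentence for searching relevant passages:")
print(f"embeddings: {response.embeddings}")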

The Embedding.create() function returns an EmbeddingObject with the following properties:

Property | Type | Description
json | dict | Raw JSON response (see the Embedding task API for details)
id | string | Unique request ID
model | string | Model name
embeddings | list of array | List of embeddings
usage | dict | Token usage metadata

Query a text completion model

To query the mpt-7b-instruct text completions model, use Completion.create() to execute a query. The Completion.create() function accepts the same arguments as the Completion task request API.

from databricks_genai_inference import Completion

response = Completion.create(
    model="mpt-7b-instruct",
    prompt="Write 3 reasons why you should train an AI model on domain specific data sets.",
    max_tokens=128)
print(f"response.text:{response.text:}")

Completion is similar to ChatCompletion: by default it waits for the complete response and returns it, but you can pass stream=True to retrieve response fragments as they are generated.
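For example, the following sketch streams a text completion. Because the text property is a list with one entry per completion, it prints the first (and here only) entry of each fragment:

from databricks_genai_inference import Completion

# With stream=True, create() returns a generator of response fragments;
# text is a list of completions, so print the first (and here only) one.
for chunk in Completion.create(
        model="mpt-7b-instruct",
        prompt="Write 3 reasons why you should train an AI model on domain specific data sets.",
        max_tokens=128,
        stream=True):
    print(chunk.text[0], end="")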

In both cases, the response objects have the following top-level properties:

Property | Type | Description
json | dict | Raw JSON response (see the Completion task API for details)
id | string | Unique request ID
model | string | Model name
text | list of string | List of text completions
usage | dict | Token usage metadata (cumulative for streaming)