Foundation Model APIs reference

Preview

This feature is in Private Preview. To try it, reach out to your Databricks contact.

This article provides general API information for Foundation Model APIs and the models they support. The Foundation Model APIs are designed to be similar to OpenAI’s REST API to make migrating existing projects easier.

To enroll in the Private Preview, please submit the enrollment form.

Endpoints

Each model has a single endpoint. Users can interact with these endpoints using HTTP POST requests. Requests and responses use JSON; the exact JSON structure depends on an endpoint's task type. Chat and completion endpoints support streaming responses.

| Model name | Task | Endpoint URL |
| --- | --- | --- |
| llama-2-70b-chat | Chat | https://{workspace_host}/serving-endpoints/databricks-llama-2-70b-chat/invocations |
| mpt-7b-8k-instruct | Completion | https://{workspace_host}/serving-endpoints/databricks-mpt-7b-instruct/invocations |
| bge-large-en-v1.5 | Embedding | https://{workspace_host}/serving-endpoints/databricks-bge-large-en/invocations |
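All three endpoints accept the same transport: an HTTP POST with a JSON body, authenticated with a Databricks personal access token. A minimal client sketch using only the standard library (the helper names and the 60-second timeout are our assumptions, not part of the API):

```python
import base64
import json
import urllib.request


def endpoint_url(workspace_host: str, endpoint_name: str) -> str:
    """Build the invocation URL for a serving endpoint."""
    return f"https://{workspace_host}/serving-endpoints/{endpoint_name}/invocations"


def invoke(workspace_host: str, token: str, endpoint_name: str, payload: dict) -> dict:
    """POST a JSON payload to a Foundation Model endpoint and return the parsed JSON response."""
    # Same Basic auth scheme as `curl -u token:$DATABRICKS_TOKEN`.
    credentials = base64.b64encode(f"token:{token}".encode()).decode()
    request = urllib.request.Request(
        endpoint_url(workspace_host, endpoint_name),
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {credentials}",
        },
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read())
```

A production client would add retries and error handling; this sketch only shows the URL shape and auth scheme shared by every endpoint in the table.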

Usage

Responses include a usage sub-message which reports the number of tokens in the request and response. The format of this sub-message is the same across all task types.

| Field | Type | Description |
| --- | --- | --- |
| completion_tokens | Integer | Number of generated tokens. Not included in embedding responses. |
| prompt_tokens | Integer | Number of tokens from the input prompt(s). |
| total_tokens | Integer | Total number of tokens: the sum of prompt_tokens and completion_tokens. |

For models like llama-2-70b-chat, a user prompt is transformed using a prompt template before being passed into the model. A system prompt might also be added. prompt_tokens includes all text added by our server.
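The exact template the server applies is not documented here. As a hedged illustration of why prompt_tokens can exceed the token count of your own text, the widely published Llama-2 chat format wraps a conversation roughly like this (a sketch only, not necessarily byte-for-byte what the server does):

```python
def render_llama2_prompt(messages):
    """Approximate the published Llama-2 chat template: an optional system
    prompt is folded into the first user turn, and each user/assistant pair
    is wrapped in [INST] ... [/INST] markers."""
    system = ""
    if messages and messages[0]["role"] == "system":
        system = f"<<SYS>>\n{messages[0]['content']}\n<</SYS>>\n\n"
        messages = messages[1:]
    prompt = ""
    for i, msg in enumerate(messages):
        if msg["role"] == "user":
            # The system block is prepended only to the first user turn.
            prefix = system if i == 0 else ""
            prompt += f"[INST] {prefix}{msg['content']} [/INST]"
        else:  # assistant turn, echoed back between instruction blocks
            prompt += f" {msg['content']} "
    return prompt
```

Every marker and system-block character in the rendered string is tokenized and counted toward prompt_tokens, on top of the user's own text.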

Chat task

Chat tasks are optimized for multi-turn conversations with a model. Each request describes the conversation so far, where the messages field must alternate between user and assistant roles, ending with a user message. The model response provides the next assistant message in the conversation.

Chat request

| Field | Default | Type | Description |
| --- | --- | --- | --- |
| messages | | List[ChatMessage] | A list of messages representing the current conversation. (Required) |
| max_tokens | nil | Integer greater than zero, or nil, which represents infinity | The maximum number of tokens to generate. |
| stream | false | Boolean | Whether to stream the response as it is generated. |
| temperature | 1.0 | Float in [0,2] | The sampling temperature. 0 is deterministic; higher values introduce more randomness. |
| top_p | 1.0 | Float in (0,1] | The probability threshold used for nucleus sampling. |
| top_k | nil | Integer greater than zero, or nil, which represents infinity | The number of most likely tokens to keep for top-k filtering. Set this value to 1 to make outputs deterministic. |
| stop | [] | String or List[String] | The model stops generating further tokens when any one of the sequences in stop is encountered. |

ChatMessage

| Field | Type | Description |
| --- | --- | --- |
| role | String | Required. The role of the author of the message. Can be "system", "user", or "assistant". |
| content | String | Required. The content of the message. |

The system role can only be used once, as the first message in a conversation. It overrides the model’s default system prompt.
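These role rules can be checked client-side before sending a request. A minimal sketch (the helper name is ours, not part of the API):

```python
def validate_messages(messages):
    """Raise ValueError unless messages follow the chat task's rules:
    an optional leading system message, then strictly alternating
    user/assistant roles, ending with a user message."""
    if messages and messages[0]["role"] == "system":
        messages = messages[1:]  # the single allowed system message
    if not messages:
        raise ValueError("at least one user message is required")
    expected = "user"
    for msg in messages:
        if msg["role"] != expected:
            raise ValueError(f"expected role {expected!r}, got {msg['role']!r}")
        expected = "assistant" if expected == "user" else "user"
    if messages[-1]["role"] != "user":
        raise ValueError("the conversation must end with a user message")
```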

Request example

curl \
-u token:$DATABRICKS_TOKEN \
-X POST \
-H "Content-Type: application/json" \
-d '{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant. Keep your responses short and concise."
    },
    {
      "role": "user",
      "content": "Hello! What is a fun fact about llamas?"
    }
  ],
  "max_tokens": 128
}' \
https://<workspace_host>.databricks.com/serving-endpoints/databricks-llama-2-70b-chat/invocations

Chat response

For non-streaming requests, the response is a single chat completion object. For streaming requests, the response is a text/event-stream where each event is a completion chunk object. The top-level structure of completion and chunk objects is almost identical: only choices has a different type.
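A streamed response can be reassembled client-side by concatenating the delta content of each chunk. A sketch that assumes OpenAI-style `data: <json>` event lines (the `[DONE]` terminator is an assumption borrowed from OpenAI's stream format, not confirmed by this document):

```python
import json


def accumulate_stream(event_lines):
    """Reassemble a streamed chat response from text/event-stream lines.
    Assumes each event is a `data: <json>` line holding a completion chunk;
    concatenates the delta content of every chunk's single choice."""
    content = []
    for line in event_lines:
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # assumed OpenAI-style terminator, if present
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        content.append(delta.get("content", ""))
    return "".join(content)
```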

| Field | Type | Description |
| --- | --- | --- |
| id | String | Unique identifier for the chat completion. |
| choices | List[ChatCompletionChoice] or List[ChatCompletionChunk] (streaming) | Always a single-element list containing the chat completion text. |
| object | String | The object type. Equal to either "chat.completion" for non-streaming or "chat.completion.chunk" for streaming. |
| created | Integer | The Unix timestamp, in seconds, when the chat completion was generated. |
| model | String | The model version used to generate the response. |
| usage | Usage | Token usage metadata. May not be present on streaming responses. |

ChatCompletionChoice

| Field | Type | Description |
| --- | --- | --- |
| index | Integer | The index of the choice in the list of generated choices. |
| message | ChatMessage | A chat completion message returned by the model. The role will be assistant. |
| finish_reason | String | The reason the model stopped generating tokens. |

ChatCompletionChunk

| Field | Type | Description |
| --- | --- | --- |
| index | Integer | The index of the choice in the list of generated choices. |
| delta | ChatMessage | One part of the streamed chat completion message generated by the model. Only the first chunk is guaranteed to have role populated. |
| finish_reason | String | The reason the model stopped generating tokens. Only the last chunk has this populated. |

Completion task

Text completion tasks are for generating responses to a single prompt. Unlike Chat, this task supports batched inputs: multiple independent prompts can be sent in one request.

Completion request

| Field | Default | Type | Description |
| --- | --- | --- | --- |
| prompt | | String or List[String] | The prompt(s) for the model. (Required) |
| max_tokens | nil | Integer greater than zero, or nil, which represents infinity | The maximum number of tokens to generate. |
| stream | false | Boolean | Whether to stream the response as it is generated. If this is true, there must be only one prompt. |
| temperature | 1.0 | Float in [0,2] | The sampling temperature. 0 is deterministic; higher values introduce more randomness. |
| top_p | 1.0 | Float in (0,1] | The probability threshold used for nucleus sampling. |
| top_k | nil | Integer greater than zero, or nil, which represents infinity | The number of most likely tokens to keep for top-k filtering. Set this value to 1 to make outputs deterministic. |
| error_behavior | "error" | "truncate" or "error" | For timeouts and context-length-exceeded errors. One of "truncate" (return as many tokens as possible) or "error" (return an error). |
| stop | [] | String or List[String] | The model stops generating further tokens when any one of the sequences in stop is encountered. |
| suffix | "" | String | A string appended to the end of every completion. |
| echo | false | Boolean | If true, returns the prompt along with the completion. |
| use_raw_prompt | false | Boolean | If true, passes the prompt directly into the model without any transformation. |
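Putting several of the parameters above together, a batched deterministic request body might be assembled like this (the builder function is illustrative; only the payload keys come from the table):

```python
def build_completion_request(prompts, max_tokens=64, deterministic=False, stop=None):
    """Build a completion request body for one prompt or a batch of prompts."""
    payload = {
        "prompt": prompts,          # String or List[String]
        "max_tokens": max_tokens,   # integer > 0, or omit for no limit
        "stop": stop or [],
    }
    if deterministic:
        # Per the table above: top_k=1 makes outputs deterministic.
        payload["top_k"] = 1
    return payload


# Two independent prompts sent in a single batched request.
request_body = build_completion_request(
    ["Define overfitting.", "Define regularization."],
    max_tokens=32,
    deterministic=True,
)
```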

Request example

curl \
 -u token:$DATABRICKS_TOKEN \
 -X POST \
 -H "Content-Type: application/json" \
 -d '{"prompt": "Write 3 reasons why you should train an AI model on domain specific data sets"}' \
https://<workspace_host>.databricks.com/serving-endpoints/databricks-mpt-7b-instruct/invocations

Completion response

| Field | Type | Description |
| --- | --- | --- |
| id | String | Unique identifier for the text completion. |
| choices | List[CompletionChoice] | A list of text completions, with one element for every prompt in the request. |
| object | String | The object type. Equal to "text_completion". |
| created | Integer | The Unix timestamp, in seconds, when the completion was generated. |
| usage | Usage | Token usage metadata. |

CompletionChoice

| Field | Type | Description |
| --- | --- | --- |
| index | Integer | The index of the prompt in the request. |
| text | String | The generated completion. |
| finish_reason | String | The reason the model stopped generating tokens. |

Embedding task

Embedding tasks map input strings into embedding vectors. Many inputs can be batched together in each request.

Embedding request

| Field | Type | Description |
| --- | --- | --- |
| input | String or List[String] | The input text to embed. Can be a string or a list of strings. (Required) |
| instruction | String | An optional instruction to pass to the embedding model. |

Instructions are optional and highly model-specific. For instance, the BGE authors recommend no instruction when indexing chunks, and recommend using the instruction "Represent this sentence for searching relevant passages:" for retrieval queries. Other models, like Instructor-XL, support a wide range of instruction strings.
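Following the BGE recommendation above, a client can attach the instruction only on the query side of a retrieval pipeline. A minimal sketch (the helper and constant names are ours):

```python
# The retrieval-query instruction recommended by the BGE authors.
BGE_QUERY_INSTRUCTION = "Represent this sentence for searching relevant passages:"


def embedding_request(texts, for_query=False):
    """Build an embedding request body; attach the BGE retrieval instruction
    only when embedding search queries, not when indexing document chunks."""
    payload = {"input": texts}
    if for_query:
        payload["instruction"] = BGE_QUERY_INSTRUCTION
    return payload
```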

Request example

curl \
-u token:$DATABRICKS_TOKEN \
-X POST \
-H "Content-Type: application/json" \
-d  '{ "input": "Let us generate an embedding!"}' \
https://<workspace_host>.databricks.com/serving-endpoints/databricks-bge-large-en/invocations

Embeddings response

| Field | Type | Description |
| --- | --- | --- |
| id | String | Unique identifier for the embedding. |
| object | String | The object type. Equal to "list". |
| model | String | The name of the embedding model used to create the embedding. |
| data | List[EmbeddingObject] | The list of embedding objects, one for each input. |
| usage | Usage | Token usage metadata. |

EmbeddingObject

| Field | Type | Description |
| --- | --- | --- |
| object | String | The object type. Equal to "embedding". |
| index | Integer | The index of the embedding in the list of embeddings generated by the model. |
| embedding | List[Float] | The embedding vector. Each model returns a fixed-size vector (1024 for BGE-Large). |
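A common use of the returned vectors is similarity search. With synthetic stand-in vectors (a real response would carry 1024-dimensional embeddings), the response can be unpacked and compared like this:

```python
import math


def extract_vectors(response):
    """Pull embedding vectors out of an embeddings response, ordered by the
    index field so they line up with the request's inputs."""
    data = sorted(response["data"], key=lambda obj: obj["index"])
    return [obj["embedding"] for obj in data]


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```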