Foundation model REST API reference
This article provides general API information for Databricks Foundation Model APIs and the models they support. The Foundation Model APIs are designed to be similar to OpenAI’s REST API to make migrating existing projects easier. Both the pay-per-token and provisioned throughput endpoints accept the same REST API request format.
Endpoints
Each pay-per-token model has a single endpoint, and users can interact with these endpoints using HTTP POST requests. Provisioned throughput endpoints can be created using the API or the Serving UI. These endpoints also support multiple models per endpoint for A/B testing, as long as both served models expose the same API format. For example, both models are chat models. See POST /api/2.0/serving-endpoints for endpoint configuration parameters.
Requests and responses use JSON, the exact JSON structure depends on an endpoint’s task type. Chat and completion endpoints support streaming responses.
Pay-per-token workloads support certain models, see Supported models for pay-per-token for those models and accepted API formats.
Usage
Responses include a usage
sub-message which reports the number of tokens in the request and response. The format of this sub-message is the same across all task types.
Field |
Type |
Description |
---|---|---|
|
Integer |
Number of generated tokens. Not included in embedding responses. |
|
Integer |
Number of tokens from the input prompt(s). |
|
Integer |
Number of total tokens. |
For models like llama-2-70b-chat
a user prompt is transformed using a prompt template before being passed into the model. For pay-per-token endpoints, a system prompt might also be added. prompt_tokens
includes all text added by our server.
Chat task
Chat tasks are optimized for multi-turn conversations with a model. Each request describes the conversation so far, where the messages
field must alternate between user
and assistant
roles, ending with a user
message. The model response provides the next assistant
message in the conversation. See POST /serving-endpoints/{name}/invocations for querying endpoint parameters.
Chat request
Field |
Default |
Type |
Description |
---|---|---|---|
|
ChatMessage list |
Required. A list of messages representing the current conversation. |
|
|
|
|
The maximum number of tokens to generate. |
|
|
Boolean |
Stream responses back to a client in order to allow partial results for requests. If this parameter is included in the request, responses are sent using the Server-sent events standard. |
|
|
Float in [0,2] |
The sampling temperature. 0 is deterministic and higher values introduce more randomness. |
|
|
Float in (0,1] |
The probability threshold used for nucleus sampling. |
|
|
|
Defines the number of k most likely tokens to use for top-k-filtering. Set this value to 1 to make outputs deterministic. |
|
[] |
String or List[String] |
Model stops generating further tokens when any one of the sequences in |
|
1 |
Integer greater than zero |
The API returns |
|
|
String or ToolChoiceObject |
Used only in conjunction with the |
|
|
A list of |
|
|
|
An object specifying the format that the model must output. Accepted types are Setting to Setting to |
|
|
|
Boolean |
This parameter indicates whether to provide the log probability of a token being sampled. |
|
|
Integer |
This parameter controls the number of most likely token candidates to return log probabilities for at each sampling step. Can be 0-20. |
ChatMessage
Field |
Type |
Description |
---|---|---|
|
String |
Required. The role of the author of the message. Can be |
|
String |
The content of the message. Required for chat tasks that do not involve tool calls. |
|
ToolCall list |
The list of |
|
String |
When |
The system
role can only be used once, as the first message in a conversation. It overrides the model’s default system prompt.
ToolCall
A tool call action suggestion by the model. See Function calling on Databricks.
Field |
Type |
Description |
---|---|---|
|
String |
Required. A unique identifier for this tool call suggestion. |
|
String |
Required. Only |
|
Required. A function call suggested by the model. |
FunctionCallCompletion
Field |
Type |
Description |
---|---|---|
|
String |
Required. The name of the function the model recommended. |
|
Object |
Required. Arguments to the function as a serialized JSON dictionary. |
ToolChoiceObject
See Function calling on Databricks.
Field |
Type |
Description |
---|---|---|
|
String |
Required. The type of the tool. Currently, only |
|
Object |
Required. An object defining which tool to call of the form |
ToolObject
See Function calling on Databricks.
Field |
Type |
Description |
---|---|---|
|
String |
Required. The type of the tool. Currently, only |
|
Required. The function definition associated with the tool. |
FunctionObject
Field |
Type |
Description |
---|---|---|
|
String |
Required. The name of the function to be called. |
|
Object |
Required. The detailed description of the function. The model uses this description to understand the relevance of the function to the prompt and generate the tool calls with higher accuracy. |
|
Object |
The parameters the function accepts, described as a valid JSON schema object. If the tool is called, then the tool call is fit to the JSON schema provided. Omitting parameters defines a function without any parameters. The number of |
|
Boolean |
Whether to enable strict schema adherence when generating the function call. If set to |
ResponseFormatObject
See Structured outputs on Databricks.
Field |
Type |
Description |
---|---|---|
|
String |
Required. The type of response format being defined. Either |
|
Required. The JSON schema to adhere to if |
JsonSchemaObject
See Structured outputs on Databricks.
Field |
Type |
Description |
---|---|---|
|
String |
Required. The name of the response format. |
|
String |
A description of what the response format is for, used by the model to determine how to respond in the format. |
|
Object |
Required. The schema for the response format, described as a JSON schema object. |
|
Boolean |
Whether to enable strict schema adherence when generating the output. If set to |
Chat response
For non-streaming requests, the response is a single chat completion object. For streaming requests, the response is a text/event-stream
where each event is a completion chunk object. The top-level structure of completion and chunk objects is almost identical: only choices
has a different type.
Field |
Type |
Description |
---|---|---|
|
String |
Unique identifier for the chat completion. |
|
List[ChatCompletionChoice] or List[ChatCompletionChunk] (streaming) |
List of chat completion texts. |
|
String |
The object type. Equal to either |
|
Integer |
The time the chat completion was generated in seconds. |
|
String |
The model version used to generate the response. |
|
Token usage metadata. Might not be present on streaming responses. |
ChatCompletionChoice
Field |
Type |
Description |
---|---|---|
|
Integer |
The index of the choice in the list of generated choices. |
|
A chat completion message returned by the model. The role will be |
|
|
String |
The reason the model stopped generating tokens. |
ChatCompletionChunk
Field |
Type |
Description |
---|---|---|
|
Integer |
The index of the choice in the list of generated choices. |
|
A chat completion message part of generated streamed responses from the model. Only the first chunk is guaranteed to have |
|
|
String |
The reason the model stopped generating tokens. Only the last chunk will have this populated. |
Completion task
Text completion tasks are for generating responses to a single prompt. Unlike Chat, this task supports batched inputs: multiple independent prompts can be sent in one request. See POST /serving-endpoints/{name}/invocations for querying endpoint parameters.
Completion request
Field |
Default |
Type |
Description |
---|---|---|---|
|
String or List[String] |
Required. The prompts for the model. |
|
|
|
|
The maximum number of tokens to generate. |
|
|
Boolean |
Stream responses back to a client in order to allow partial results for requests. If this parameter is included in the request, responses are sent using the Server-sent events standard. |
|
|
Float in [0,2] |
The sampling temperature. 0 is deterministic and higher values introduce more randomness. |
|
|
Float in (0,1] |
The probability threshold used for nucleus sampling. |
|
|
|
Defines the number of k most likely tokens to use for top-k-filtering. Set this value to 1 to make outputs deterministic. |
|
|
|
For timeouts and context-length-exceeded errors. One of: |
|
1 |
Integer greater than zero |
The API returns |
|
[] |
String or List[String] |
Model stops generating further tokens when any one of the sequences in |
|
|
String |
A string that is appended to the end of every completion. |
|
|
Boolean |
Returns the prompt along with the completion. |
|
|
Boolean |
If |
Completion response
Field |
Type |
Description |
---|---|---|
|
String |
Unique identifier for the text completion. |
|
A list of text completions. For every prompt passed in, |
|
|
String |
The object type. Equal to |
|
Integer |
The time the completion was generated in seconds. |
|
Token usage metadata. |
Embedding task
Embedding tasks map input strings into embedding vectors. Many inputs can be batched together in each request. See POST /serving-endpoints/{name}/invocations for querying endpoint parameters.
Embedding request
Field |
Type |
Description |
---|---|---|
|
String or List[String] |
Required. The input text to embed. Can be a string or a list of strings. |
|
String |
An optional instruction to pass to the embedding model. |
Instructions are optional and highly model specific. For instance the The BGE authors recommend no instruction when indexing chunks and recommend using the instruction "Represent this sentence for searching relevant passages:"
for retrieval queries. Other models like Instructor-XL support a wide range of instruction strings.