Databricks Foundation Model APIs


This feature is in Private Preview. To try it, reach out to your Databricks contact.

This article provides an overview of the Foundation Model APIs in Databricks. It includes requirements for use, supported models, and limitations.

What are Databricks Foundation Model APIs?

Databricks Model Serving now supports Foundation Model APIs, which allow you to access and query state-of-the-art open source models from a dedicated serving endpoint. With Foundation Model APIs, developers can quickly and easily build applications that leverage a high-quality generative AI model without maintaining their own model deployment.

The Foundation Model APIs are provided on a pay-per-token basis and the supported models are accessible in your Databricks workspace. See Use Foundation Model APIs.

With the Foundation Model APIs you can:

  • Query a generalized LLM to verify a project’s validity before investing more resources.

  • Query a generalized LLM to create a quick proof-of-concept for an LLM-based application before investing in training and deploying a custom model.

  • Use a foundation model, along with a vector database, to build a chatbot using retrieval augmented generation (RAG).

  • Replace proprietary models with open source alternatives to optimize for cost and performance.

  • Efficiently compare LLMs to see which is the best candidate for your use case, or swap a production model for a better-performing one.

  • Build an LLM application for development or production on top of a scalable, SLA-backed LLM serving solution that can support your production traffic spikes.


Requirements

  • Foundation Model APIs are in Private Preview. To enroll in the Private Preview, submit the enrollment form.

  • Databricks API token to authenticate requests to the endpoints

  • Serverless compute

  • Workspace in an AWS-US region

Supported models

The following table summarizes the supported models for the Private Preview. See Databricks Foundation Model APIs supported models for more information.

Model name

Description

Llama-2-70b-chat

70 billion parameter language model with a context length of 4,096 tokens, trained by Meta. The model was pre-trained on 2T tokens of text and fine-tuned for dialog use cases leveraging over 1 million human annotations.

Llama 2 is licensed under the LLAMA 2 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved. Customers are responsible for ensuring compliance with applicable model licenses.

MPT-7B-Instruct

6.7 billion parameter language model with a context length of 8,192 tokens, trained by MosaicML. The model is pre-trained on 1.5 trillion tokens from a mixture of datasets, then instruction fine-tuned on a dataset derived from the Databricks Dolly-15k and the Anthropic Helpful and Harmless (HH-RLHF) datasets.

BGE-Large (English)

BAAI General Embedding (BGE) model that can map any text to a 1,024-dimension embedding vector, which can be used for tasks like retrieval, classification, clustering, and semantic search. It can also be used in vector databases for LLMs.
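To illustrate how such embedding vectors support retrieval and semantic search, here is a minimal cosine-similarity ranking sketch. The short vectors below are toy placeholders, not real BGE output (a real BGE embedding has 1,024 dimensions):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimension "embeddings"; a real BGE embedding has 1,024 dimensions.
query = [0.1, 0.9, 0.2, 0.0]
documents = {
    "doc_a": [0.1, 0.8, 0.3, 0.1],
    "doc_b": [0.9, 0.0, 0.1, 0.4],
}

# Rank documents by similarity to the query, highest first.
ranked = sorted(documents, key=lambda d: cosine_similarity(query, documents[d]), reverse=True)
print(ranked)  # doc_a ranks first: its vector points closest to the query
```

In a RAG application, the vector database performs this nearest-neighbor search at scale, and the top-ranked documents are passed to the chat model as context.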


Use Foundation Model APIs

You have multiple options for querying the Foundation Model APIs.

You can use the UI, the Python SDK, or curl commands directly from your terminal. Databricks recommends using the Python SDK for extended interactions and the UI for trying out the feature.
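As a sketch of what a programmatic query looks like, the following uses only the Python standard library to POST a chat request to a serving endpoint's invocations URL. The endpoint name, payload fields, and environment variable names are illustrative assumptions, not confirmed by this article:

```python
import json
import os
import urllib.request

def build_chat_payload(prompt, max_tokens=128):
    # Chat-style request body: a list of role/content messages.
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def query_chat_endpoint(host, token, endpoint_name, prompt):
    # POST to the serving endpoint's invocations URL, authenticated
    # with a Databricks API token.
    url = f"{host}/serving-endpoints/{endpoint_name}/invocations"
    req = urllib.request.Request(
        url,
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Example usage (requires a workspace URL and API token; the endpoint
# name below is hypothetical):
# result = query_chat_endpoint(
#     os.environ["DATABRICKS_HOST"],
#     os.environ["DATABRICKS_TOKEN"],
#     "databricks-llama-2-70b-chat",
#     "Summarize retrieval augmented generation in one sentence.",
# )
```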

These models are accessible in your Databricks workspace. To access the Foundation Model APIs in your workspace, navigate to the Serving tab in the left sidebar. The Foundation Model APIs are located at the top of the Endpoints list view.

Serving endpoints list



Limitations

During the Private Preview, Foundation Model APIs are available only in certain Databricks AWS-US regions. Databricks might process your data outside of the region and cloud provider where your data originated.

The following limitations apply during the Private Preview:

  • This functionality is not HIPAA/Shield compliant.

  • Rate limits vary by model type, as noted below. Reach out to your Databricks account team to request higher limits.

    • 2 queries per second by default for chat and completion models.

    • 300 embedding inputs per second by default for embedding models.

  • Only workspace admins can change the governance settings, like the rate limits, for Foundation Model APIs endpoints.
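Given the default limits above, clients that burst past their quota should expect rejected requests and retry with backoff. The helper below is an illustrative sketch (the `RateLimitError` type and function names are mine, not part of any Databricks SDK); in practice you would raise `RateLimitError` when the endpoint returns HTTP 429:

```python
import time

class RateLimitError(Exception):
    """Raised when the endpoint rejects a request as rate limited (HTTP 429)."""

def call_with_backoff(make_request, max_retries=5, base_delay=1.0, sleep=time.sleep):
    # Retry a callable that raises RateLimitError, sleeping with
    # exponential backoff (1s, 2s, 4s, ...) between attempts.
    for attempt in range(max_retries):
        try:
            return make_request()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            sleep(base_delay * (2 ** attempt))

# Simulated endpoint that is rate limited twice before succeeding.
attempts = {"n": 0}
def flaky_request():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimitError()
    return "ok"

result = call_with_backoff(flaky_request, sleep=lambda s: None)
print(result)  # "ok" after two retried attempts
```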