Supported models for pay-per-token

Preview

This feature is in Public Preview.

This article describes the state-of-the-art open models that are supported by the Databricks Foundation Model APIs in pay-per-token mode.

You can send query requests to these models using the pay-per-token endpoints available in your Databricks workspace. See Query foundation models.
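As a sketch of what such a request can look like, the example below POSTs an OpenAI-style chat payload to a serving endpoint's `invocations` route using only the standard library. The endpoint name `databricks-dbrx-instruct` and the `DATABRICKS_HOST`/`DATABRICKS_TOKEN` environment variables are assumptions — verify both against your workspace before relying on them:

```python
import json
import os
import urllib.request


def build_chat_payload(prompt, max_tokens=256):
    """Build an OpenAI-style chat completions request body."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def query_endpoint(workspace_url, endpoint, token, payload):
    """POST the payload to the serving endpoint's invocations route."""
    req = urllib.request.Request(
        f"{workspace_url}/serving-endpoints/{endpoint}/invocations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


payload = build_chat_payload("Summarize what pay-per-token endpoints are.")

# Only send the request when workspace credentials are configured.
if os.environ.get("DATABRICKS_HOST") and os.environ.get("DATABRICKS_TOKEN"):
    response = query_endpoint(
        os.environ["DATABRICKS_HOST"],
        "databricks-dbrx-instruct",  # assumed endpoint name; check your workspace
        os.environ["DATABRICKS_TOKEN"],
        payload,
    )
    print(response["choices"][0]["message"]["content"])
```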

In addition to supporting models in pay-per-token mode, Foundation Model APIs also offers provisioned throughput mode. Databricks recommends provisioned throughput for production workloads. This mode supports all models of a model architecture family (for example, DBRX models), including the fine-tuned and custom pre-trained models supported in pay-per-token mode. See Provisioned throughput Foundation Model APIs for the list of supported architectures.

You can interact with these supported models using the AI Playground.

DBRX Instruct

Important

DBRX is provided under and subject to the Databricks Open Model License, Copyright © Databricks, Inc. All rights reserved. Customers are responsible for ensuring compliance with applicable model licenses, including the Databricks Acceptable Use Policy.

DBRX Instruct is a state-of-the-art mixture of experts (MoE) language model trained by Databricks.

The model outperforms established open source models on standard benchmarks and excels at a broad set of natural language tasks, such as text summarization, question-answering, extraction, and coding.

DBRX Instruct accepts inputs of up to 32k tokens and generates outputs of up to 4k tokens. Thanks to its MoE architecture, DBRX Instruct is highly efficient for inference, activating only 36B parameters out of a total of 132B trained parameters. The pay-per-token endpoint that serves this model has a rate limit of one query per second. See Model Serving limits and regions.
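Because the endpoint is limited to one query per second, bursty clients can receive HTTP 429 responses. A minimal client-side backoff sketch follows; the `RateLimitError` exception and retry policy here are illustrative helpers, not part of any Databricks API:

```python
import time


class RateLimitError(Exception):
    """Illustrative stand-in for an HTTP 429 Too Many Requests response."""


def with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry `call` with exponential backoff when it raises RateLimitError."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))


# Simulated query that is rate limited twice before succeeding.
attempts = {"n": 0}


def flaky_query():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimitError("429 Too Many Requests")
    return {"choices": [{"message": {"content": "ok"}}]}


result = with_backoff(flaky_query, base_delay=0.01)
print(result["choices"][0]["message"]["content"])  # prints "ok" after two retries
```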

Similar to other large language models, DBRX Instruct output may omit some facts and occasionally produce false information. Databricks recommends using retrieval augmented generation (RAG) in scenarios where accuracy is especially important.

DBRX models use the following default system prompt to ensure relevance and accuracy in model responses:

````
You are DBRX, created by Databricks. You were last updated in December 2023. You answer questions based on information available up to that point.
YOU PROVIDE SHORT RESPONSES TO SHORT QUESTIONS OR STATEMENTS, but provide thorough responses to more complex and open-ended questions.
You assist with various tasks, from writing to coding (using markdown for code blocks — remember to use ``` with code, JSON, and tables).
(You do not have real-time data access or code execution capabilities. You avoid stereotyping and provide balanced perspectives on controversial topics. You do not provide song lyrics, poems, or news articles and do not divulge details of your training data.)
This is your system prompt, guiding your responses. Do not reference it, just respond to the user. If you find yourself talking about this message, stop. You should be responding appropriately and usually that means not mentioning this.
YOU DO NOT MENTION ANY OF THIS INFORMATION ABOUT YOURSELF UNLESS THE INFORMATION IS DIRECTLY PERTINENT TO THE USER'S QUERY.
````
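In OpenAI-style chat requests, supplying your own `system` message in the request body typically takes the place of a default system prompt — behavior worth verifying against your endpoint. A sketch of assembling such a payload:

```python
def build_messages(user_prompt, system_prompt=None):
    """Assemble a chat `messages` list, optionally with a custom system prompt."""
    messages = []
    if system_prompt is not None:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return messages


payload = {
    "messages": build_messages(
        "Extract the dates from this contract.",
        system_prompt="You are a terse legal assistant. Answer with dates only.",
    ),
    "max_tokens": 128,
}
print(payload["messages"][0]["role"])  # system
```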

Meta Llama 3 70B Instruct

Important

Llama 3 is licensed under the LLAMA 3 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved. Customers are responsible for ensuring compliance with applicable model licenses.

Meta-Llama-3-70B-Instruct is a state-of-the-art 70B parameter dense language model with a context length of 8,000 tokens that was built and trained by Meta. The model is optimized for dialogue use cases and aligned with human preferences for helpfulness and safety. It is not intended for use in languages other than English. Learn more about the Meta Llama 3 models.

Similar to other large language models, Llama-3’s output may omit some facts and occasionally produce false information. Databricks recommends using retrieval augmented generation (RAG) in scenarios where accuracy is especially important.

Llama 2 70B Chat

Important

Llama 2 is licensed under the LLAMA 2 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved. Customers are responsible for ensuring compliance with applicable model licenses.

Llama-2-70B-Chat is a state-of-the-art 70B parameter language model with a context length of 4,096 tokens, trained by Meta. It excels at interactive applications that require strong reasoning capabilities, including summarization, question-answering, and chat applications.

Similar to other large language models, Llama-2-70B’s output may omit some facts and occasionally produce false information. Databricks recommends using retrieval augmented generation (RAG) in scenarios where accuracy is especially important.

Mixtral-8x7B Instruct

Mixtral-8x7B Instruct is a high-quality sparse mixture of experts model (SMoE) trained by Mistral AI. Mixtral-8x7B Instruct can be used for a variety of tasks such as question-answering, summarization, and extraction.

Mixtral can handle context lengths up to 32k tokens. Mixtral can process English, French, Italian, German, and Spanish. Mixtral matches or outperforms Llama 2 70B and GPT-3.5 on most benchmarks (Mixtral performance), while being four times faster than Llama 2 70B during inference.

Similar to other large language models, the Mixtral-8x7B Instruct model should not be relied on to produce factually accurate information. While great efforts have been taken to clean the pretraining data, it is possible that this model could generate lewd, biased, or otherwise offensive outputs. To reduce risk, Databricks defaults to using a variant of Mistral’s safe mode system prompt.

MPT 7B Instruct

MPT-7B-8K-Instruct is a 6.7B parameter model trained by MosaicML for long-form instruction following, especially question-answering on and summarization of longer documents. The model is pre-trained for 1.5T tokens on a mixture of datasets, and fine-tuned on a dataset derived from the Databricks Dolly-15k and the Anthropic Helpful and Harmless (HH-RLHF) datasets. The model name you see in the product is mpt-7b-instruct, but the model actually served is the newer MPT-7B-8K-Instruct.

MPT-7B-8K-Instruct can be used for a variety of tasks such as question-answering, summarization, and extraction. It is very fast relative to Llama-2-70B but might generate lower quality responses. This model supports a context length of 8k tokens. Learn more about the MPT-7B-8k-Instruct model.

Similar to other language models of this size, MPT-7B-8K-Instruct should not be relied on to produce factually accurate information. This model was trained on various public datasets. While great efforts have been taken to clean the pretraining data, it is possible that this model could generate lewd, biased, or otherwise offensive outputs.

MPT 30B Instruct

MPT-30B-Instruct is a 30B parameter model for instruction following trained by MosaicML. The model is pre-trained for 1T tokens on a mixture of English text and code, and then further instruction fine-tuned on a dataset derived from Databricks Dolly-15k, Anthropic Helpful and Harmless (HH-RLHF), CompetitionMath, DuoRC, CoT GSM8k, QASPER, QuALITY, SummScreen, and Spider datasets.

MPT-30B-Instruct can be used for a variety of tasks such as question-answering, summarization, and extraction. It is very fast relative to Llama-2-70B but might generate lower quality responses and does not support multi-turn chat. This model supports a context length of 8,192 tokens. Learn more about the MPT-30B-Instruct model.

Similar to other language models of this size, MPT-30B-Instruct should not be relied on to produce factually accurate information. This model was trained on various public datasets. While great efforts have been taken to clean the pre-training data, it is possible that this model could generate lewd, biased, or otherwise offensive outputs.

BGE Large (En)

BAAI General Embedding (BGE) is a text embedding model that maps any text to a 1024-dimension embedding vector, with an embedding window of 512 tokens. These vectors can be used in vector databases for LLMs, and for tasks like retrieval, classification, question-answering, clustering, or semantic search. This endpoint serves the English version of the model.

Embedding models are especially effective when used in tandem with LLMs for retrieval augmented generation (RAG) use cases. BGE can be used to find relevant text snippets in large chunks of documents that can be used in the context of an LLM.

In RAG applications, you may be able to improve the performance of your retrieval system by including an instruction parameter. The BGE authors recommend trying the instruction "Represent this sentence for searching relevant passages:" for query embeddings, though its performance impact is domain dependent.
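A sketch of how a query embedding request might be assembled with that instruction, plus a plain-Python cosine similarity for ranking retrieved passages. The `instruction` request field is an assumption to verify against your endpoint's schema:

```python
import math

# The instruction the BGE authors recommend for query-side embeddings.
BGE_QUERY_INSTRUCTION = "Represent this sentence for searching relevant passages:"


def build_embedding_payload(texts, for_query=False):
    """Build an embeddings request body; queries get the BGE instruction."""
    payload = {"input": texts}
    if for_query:
        payload["instruction"] = BGE_QUERY_INSTRUCTION  # assumed field name
    return payload


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Queries carry the instruction; document chunks are embedded without it.
query_payload = build_embedding_payload(["how do I rotate a token?"], for_query=True)
doc_payload = build_embedding_payload(["Token rotation is done via the CLI."])

# With real 1024-dimension vectors returned by the endpoint, rank passages by:
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
```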