Foundation Model Training

Important

This feature is in Public Preview. Reach out to your Databricks account team to enroll in the Public Preview.

With Foundation Model Training, you can use your own data to customize a foundation model to optimize its performance for your specific application. By fine-tuning or continuing training of a foundation model, you can train your own model using significantly less data, time, and compute resources than training a model from scratch.

With Databricks you have everything in a single platform: your own data to use for training, the foundation model to train, checkpoints saved to MLflow, and the model registered in Unity Catalog and ready to deploy.

This article gives an overview of Foundation Model Training on Databricks. For details on how to use it, see Prepare data for Foundation Model Training and Create a training run using the Foundation Model Training API.

What is Foundation Model Training?

Foundation Model Training lets you use the Databricks API or UI to fine-tune or further train a foundation model.

Using Foundation Model Training, you can:

  • Train a model with your custom data, with the checkpoints saved to MLflow. You retain complete control of the trained model.

  • Automatically register the model to Unity Catalog, allowing easy deployment with model serving.

  • Further train a completed, proprietary model by loading the weights of a previously trained model, as shown in the sketch after this list.
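
The sketch below continues training from the weights of a previously trained model. It is a minimal, hedged example: the custom_weights_path parameter and the checkpoint path shown are assumptions (placeholders) not confirmed by this article, so check the API reference for the exact argument name and checkpoint location from your earlier run.

from databricks.model_training import foundation_model as fm

# Minimal sketch: continue training from a previously trained model's weights.
# custom_weights_path and the checkpoint path below are assumptions (placeholders);
# substitute the checkpoint location saved by your earlier run.
run = fm.create(
  model='meta-llama/Meta-Llama-3-8B-Instruct',
  train_data_path='dbfs:/Volumes/main/mydirectory/ift/train.jsonl',
  register_to='main.mydirectory',
  custom_weights_path='dbfs:/databricks/mlflow-tracking/<experiment_id>/<run_id>/artifacts/<checkpoint_folder>',
)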

Databricks recommends that you try Foundation Model Training if:

  • You have tried few-shot learning and want better results.

  • You have tried prompt engineering on an existing model and want better results.

  • You want full ownership over a custom model for data privacy.

  • You are latency-sensitive or cost-sensitive and want to use a smaller, cheaper model with your task-specific data.

Supported tasks

Foundation Model Training supports the following use cases:

  • Supervised fine-tuning: Train your model on structured prompt-response data. Use this to adapt your model to a new task, change its response style, or add instruction-following capabilities.

  • Continued pre-training: Train your model with additional text data. Use this to add new knowledge to a model or focus a model on a specific domain.

  • Chat completion: Train your model on chat logs between a user and an AI assistant. This format can be used both for actual chat logs and as a standard format for question answering and conversational text. The text is automatically formatted into the appropriate chat format for the specific model. Example records follow this list.
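
For reference, the sketch below shows what individual training records could look like for supervised fine-tuning and chat completion. It is a hedged illustration with placeholder field values; see Prepare data for Foundation Model Training for the authoritative schemas.

import json

# Supervised fine-tuning: one prompt-response pair per JSONL line (placeholder values).
sft_record = {"prompt": "What is Unity Catalog?", "response": "Unity Catalog is ..."}

# Chat completion: one conversation per JSONL line, as role-tagged messages (placeholder values).
chat_record = {
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize this support ticket."},
    {"role": "assistant", "content": "The customer reports ..."},
  ]
}

# Each record is written as a single JSON object per line of a .jsonl file.
with open("train.jsonl", "w") as f:
  f.write(json.dumps(sft_record) + "\n")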

Requirements

  • A Databricks workspace in one of the following AWS regions: us-east-1, us-west-2.

  • Foundation Model Training APIs installed using pip install databricks_genai.

  • Databricks Runtime 12.2 LTS ML or above if your data is in a Delta table.

See Prepare data for Foundation Model Training for information about required input data formats.

Supported models

Important

Llama 3 is licensed under the LLAMA 3 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved. Customers are responsible for ensuring compliance with applicable model licenses.

Llama 2 and Code Llama models are licensed under the LLAMA 2 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved. Customers are responsible for ensuring compliance with applicable model licenses.

DBRX is provided under and subject to the Databricks Open Model License, Copyright © Databricks, Inc. All rights reserved. Customers are responsible for ensuring compliance with applicable model licenses, including the Databricks Acceptable Use policy.

| Model | Maximum context length |
| --- | --- |
| databricks/dbrx-base | 4096 |
| databricks/dbrx-instruct | 4096 |
| meta-llama/Meta-Llama-3-70B | 8192 |
| meta-llama/Meta-Llama-3-70B-Instruct | 8192 |
| meta-llama/Meta-Llama-3-8B | 8192 |
| meta-llama/Meta-Llama-3-8B-Instruct | 8192 |
| meta-llama/Llama-2-7b-hf | 4096 |
| meta-llama/Llama-2-13b-hf | 4096 |
| meta-llama/Llama-2-70b-hf | 4096 |
| meta-llama/Llama-2-7b-chat-hf | 4096 |
| meta-llama/Llama-2-13b-chat-hf | 4096 |
| meta-llama/Llama-2-70b-chat-hf | 4096 |
| codellama/CodeLlama-7b-hf | 16384 |
| codellama/CodeLlama-13b-hf | 16384 |
| codellama/CodeLlama-34b-hf | 16384 |
| codellama/CodeLlama-7b-Instruct-hf | 16384 |
| codellama/CodeLlama-13b-Instruct-hf | 16384 |
| codellama/CodeLlama-34b-Instruct-hf | 16384 |
| codellama/CodeLlama-7b-Python-hf | 16384 |
| codellama/CodeLlama-13b-Python-hf | 16384 |
| codellama/CodeLlama-34b-Python-hf | 16384 |
| mistralai/Mistral-7B-v0.1 | 32768 |
| mistralai/Mistral-7B-Instruct-v0.2 | 32768 |
| mistralai/Mixtral-8x7B-v0.1 | 32768 |

Use Foundation Model Training

Foundation Model Training is accessible using the databricks_genai SDK. The following example creates and launches a training run that uses data stored in a Unity Catalog volume. See Create a training run using the Foundation Model Training API for configuration details.

from databricks.model_training import foundation_model as fm

model = 'meta-llama/Llama-2-7b-chat-hf'
# UC Volume with JSONL formatted data
train_data_path = 'dbfs:/Volumes/main/mydirectory/ift/train.jsonl'
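# Unity Catalog location (catalog.schema) where the trained model is registered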
register_to = 'main.mydirectory'
run = fm.create(
  model=model,
  train_data_path=train_data_path,
  register_to=register_to,
)
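
After the run is created, you can inspect it and follow its progress. The snippet below is a minimal sketch; the get_events helper is assumed to be available in your installed databricks_genai version.

# Sketch: print the run identifier and poll the run's events as training proceeds.
print(run.name)

events = fm.get_events(run)  # assumed helper; returns the run's event log
for event in events:
  print(event)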

Limitations

  • Large datasets (10B+ tokens) are not supported due to compute availability.

  • PrivateLink is not supported.

  • For continued pre-training, workloads are limited to files between 60 MB and 256 MB. Files larger than 1 GB may cause longer processing times.

  • Databricks strives to make the latest state-of-the-art models available for customization using Foundation Model Training. As we make new models available, we might remove the ability to access older models from the API or UI, deprecate older models, or update supported models. If a foundation model will be removed from the API or UI or deprecated, Databricks will take the following steps to notify customers at least three months before the removal or deprecation date:

    • Display a warning message in the model card on the Experiments > Foundation Model Training page of your Databricks workspace, indicating that the model is scheduled for deprecation.

    • Update our documentation to include a notice indicating that the model is scheduled for deprecation.