Tutorial: Create and deploy a training run using Foundation Model Training

Important

This feature is in Public Preview. Reach out to your Databricks account team to enroll in the Public Preview.

This article describes how to create and configure a run using the Foundation Model Training API, and then review the results and deploy the model using the Databricks UI and Databricks Model Serving.

Requirements

Step 1: Prepare your data for training

See Prepare data for Foundation Model Training.
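If your training data is not already in JSONL format, the following sketch shows one way to write prompt-response pairs to a Unity Catalog volume from a notebook. The volume path and the prompt and response field names are assumptions; confirm the supported schemas in the data preparation guide above.

import json

# Hypothetical prompt-response examples for instruction fine-tuning. The field
# names and the volume path below are assumptions; see the data preparation guide.
examples = [
    {"prompt": "What is Unity Catalog?",
     "response": "Unity Catalog is the Databricks governance solution for data and AI."},
    {"prompt": "What does MLflow track?",
     "response": "MLflow tracks experiments, including parameters, metrics, and artifacts."},
]

# Write the examples as one JSON object per line (JSONL) to a UC volume.
with open("/Volumes/main/my-directory/ift/train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")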

Step 2: Install the databricks_genai SDK

Use the following to install the databricks_genai SDK.

%pip install databricks_genai

Next, restart Python and import the foundation_model library:

dbutils.library.restartPython()
from databricks.model_training import foundation_model as fm

Step 3: Create a training run

Create a training run using the Foundation Model Training create() function. The following parameters are required:

  • model: the model you want to train.

  • train_data_path: the location of your training data.

  • register_to: the Unity Catalog catalog and schema where you want checkpoints saved.

For example:

run = fm.create(model='meta-llama/Llama-2-7b-chat-hf',
                train_data_path='dbfs:/Volumes/main/my-directory/ift/train.jsonl', # UC Volume with JSONL-formatted data
                register_to='main.my-directory', # UC catalog and schema to register the model to
                training_duration='1ep') # train for one epoch

run

Step 4: View the status of a run

The time it takes to complete a training run depends on the number of tokens, the model, and GPU availability. For faster training, Databricks recommends that you use reserved compute. Reach out to your Databricks account team for details.

After you launch your run, you can monitor its status using get_events().

run.get_events()
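If you prefer not to rerun the cell manually, the following sketch polls the run periodically. It assumes get_events() returns an ordered sequence with the most recent event last; stop the cell once the run reaches a terminal state.

import time

# Poll the run every five minutes and print the most recent event.
# Assumes get_events() returns an ordered sequence (most recent last).
while True:
    events = run.get_events()
    if events:
        print(events[-1])
    time.sleep(300)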

Step 5: View metrics and outputs

Follow these steps to view the results in the Databricks UI:

  1. In the Databricks workspace, click Experiments in the left navigation bar.

  2. Select your experiment from the list.

  3. Review the metrics charts in the Charts tab.

    1. The primary training metric showing progress is loss. You can use the evaluation loss to check whether your model is overfitting to your training data. However, don't rely on loss alone, because in some supervised training tasks the evaluation loss can appear to indicate overfitting while the model continues to improve. You can also retrieve these metrics programmatically; see the sketch after this list.

    2. In this tab, you can also view the output of your evaluation prompts if you specified them.
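To pull the same metrics programmatically, you can use the MLflow client. The following is a minimal sketch; the experiment name is hypothetical, so substitute the experiment you selected above.

import mlflow

# Search the fine-tuning experiment and list the metric columns that were logged.
# The experiment name below is hypothetical; use the experiment you selected above.
runs = mlflow.search_runs(experiment_names=["/Users/me@example.com/llama-finetune"])
metric_columns = [c for c in runs.columns if c.startswith("metrics.")]
print(runs[["run_id"] + metric_columns])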

Step 6: Evaluate multiple customized models with MLflow LLM Evaluate before deploying

See Evaluate large language models with MLflow.
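As a minimal sketch of what that evaluation can look like, the following scores one fine-tuned model against held-out prompts with mlflow.evaluate(); repeat it for each candidate model you want to compare. The model URI, version, and evaluation data are assumptions, so point them at the models your runs registered in Unity Catalog and at your own evaluation set.

import mlflow
import pandas as pd

mlflow.set_registry_uri("databricks-uc")  # load registered models from Unity Catalog

# Hypothetical held-out prompts and reference answers.
eval_data = pd.DataFrame({
    "inputs": ["What is Unity Catalog?"],
    "ground_truth": ["Unity Catalog is the Databricks governance solution for data and AI."],
})

with mlflow.start_run():
    results = mlflow.evaluate(
        model="models:/main.my-directory.my-finetuned-model/1",  # hypothetical UC model URI
        data=eval_data,
        targets="ground_truth",
        model_type="question-answering",
    )

print(results.metrics)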

Step 7: Deploy your model

The training run automatically registers your model in Unity Catalog after it completes. The model is registered to the catalog and schema that you specified in the register_to field of the create() call.

To deploy the model for serving, follow these steps:

  1. Navigate to the model in Unity Catalog.

  2. Click Serve this model.

  3. Click Create serving endpoint.

  4. In the Name field, provide a name for your endpoint.

  5. Click Create.
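After the endpoint is ready, you can query it from a notebook. The following is a minimal sketch; the endpoint name and the request payload are assumptions, so match them to the endpoint you created and to your model's input schema.

from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")

# Query the serving endpoint created above. The endpoint name and chat-style
# payload below are assumptions; adjust them to your endpoint and model.
response = client.predict(
    endpoint="my-finetuned-llama-endpoint",
    inputs={
        "messages": [{"role": "user", "content": "What is Unity Catalog?"}],
        "max_tokens": 128,
    },
)
print(response)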