Train XGBoost model on a single GPU
This notebook demonstrates how to train an XGBoost regression model on a single GPU using Databricks serverless GPU compute. GPU acceleration significantly speeds up model training compared to CPU-based training, especially for large datasets.
Key concepts covered:
- GPU-accelerated training: Uses XGBoost's `hist` tree method with the CUDA device for faster training
- Model checkpointing: Saves model state periodically to Unity Catalog volumes for recovery and incremental training
- California Housing dataset: A regression task predicting median house values
For more information, see XGBoost GPU Support and Unity Catalog volumes.
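The GPU configuration described above comes down to two parameter settings in the training dictionary. A minimal sketch (parameter names match the training cell later in this notebook; the CPU variant is shown only for contrast):

```python
# Parameters that switch XGBoost's histogram tree method onto the GPU.
# The "hist" algorithm itself is the same on CPU and GPU; the "device"
# key controls where it runs.
gpu_params = {"tree_method": "hist", "device": "cuda"}

# On a CPU-only machine, the same algorithm runs without CUDA.
cpu_params = {"tree_method": "hist", "device": "cpu"}
```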
Requirements
This notebook requires a GPU-enabled compute cluster. Databricks serverless GPU compute is automatically selected when running cells.
Install required libraries
Install XGBoost version 2.0.3 and scikit-learn for dataset loading and evaluation metrics.
%pip install xgboost==2.0.3 # due to this issue: https://github.com/ray-project/xgboost_ray/issues/312
%pip install scikit-learn
dbutils.library.restartPython()
Verify that XGBoost 2.0.3 is installed correctly.
%pip show xgboost
Configure Unity Catalog checkpoint location
Define the Unity Catalog volume location where model checkpoints will be saved. The notebook uses query parameters to configure the catalog, schema, volume, and model name.
# You must have `USE CATALOG` privileges on the catalog, and you must have `USE SCHEMA` privileges on the schema.
# If necessary, change the catalog and schema name here.
dbutils.widgets.text("uc_catalog", "main")
dbutils.widgets.text("uc_schema", "default")
dbutils.widgets.text("uc_model_name", "custom_transformer")
dbutils.widgets.text("uc_volume", "checkpoints")
UC_CATALOG = dbutils.widgets.get("uc_catalog")
UC_SCHEMA = dbutils.widgets.get("uc_schema")
UC_VOLUME = dbutils.widgets.get("uc_volume")
MODEL_NAME = dbutils.widgets.get("uc_model_name")
CHECKPOINT_PATH = f"/Volumes/{UC_CATALOG}/{UC_SCHEMA}/{UC_VOLUME}/{MODEL_NAME}"
CHECKPOINT_PREFIX = "checkpoint"
print(f"UC_CATALOG: {UC_CATALOG}")
print(f"UC_SCHEMA: {UC_SCHEMA}")
print(f"UC_VOLUME: {UC_VOLUME}")
print(f"CHECKPOINT_PATH: {CHECKPOINT_PATH}")
Create a checkpoint callback that saves the model state every 50 boosting rounds to the Unity Catalog volume. This enables recovery from failures and incremental training.
import os
from xgboost.callback import TrainingCheckPoint
# Create the UC Volume where the checkpoint will be saved if it doesn't exist already
os.makedirs(CHECKPOINT_PATH, exist_ok=True)
# Create a callback to checkpoint to a UC volume
checkpoint_cb = TrainingCheckPoint(
directory=CHECKPOINT_PATH,
name=CHECKPOINT_PREFIX,
iterations=50, # save every 50 boosting rounds
)
Train XGBoost model on a single GPU
Load the California Housing dataset, configure XGBoost for GPU training, and train a regression model. The model predicts median house values using features like location, number of rooms, and population density.
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_error
# Load California Housing dataset
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Convert to DMatrix
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
# GPU training parameters for regression
params = {
"tree_method": "hist", # Use GPU histogram
"device": "cuda",
"objective": "reg:squarederror", # Regression objective
"eval_metric": "rmse", # Root Mean Squared Error
"max_depth": 6,
"learning_rate": 0.1,
}
# Train the model
bst = xgb.train(
params=params,
dtrain=dtrain,
num_boost_round=200,
evals=[(dtest, "eval"), (dtrain, "train")],
verbose_eval=10,
callbacks=[checkpoint_cb]
)
# Predict
y_pred = bst.predict(dtest)
# Evaluate
rmse = root_mean_squared_error(y_test, y_pred)
print(f"✅ RMSE on test set: {rmse:.4f}")
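As a sanity check, the RMSE metric reported above can be reproduced directly with NumPy. This sketch uses small hard-coded arrays rather than the model's predictions, so it runs standalone:

```python
import numpy as np

# Illustrative values, not outputs from the trained model.
y_true = np.array([2.0, 3.0, 4.0])
y_hat = np.array([2.5, 2.5, 4.0])

# RMSE = sqrt(mean of squared errors)
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))
print(f"RMSE: {rmse:.4f}")  # → RMSE: 0.4082
```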
Load model from checkpoint and evaluate
Load a previously saved checkpoint from the 150th boosting round and evaluate its performance. This demonstrates how to resume training or use intermediate model states.
# Take sample checkpoint from 150th step
checkpoint = f"{CHECKPOINT_PATH}/{CHECKPOINT_PREFIX}_150.json"
# Load the model from a checkpoint
bst = xgb.Booster()
bst.load_model(checkpoint)
dtest = xgb.DMatrix(X_test)
y_pred = bst.predict(dtest)
# Evaluate
rmse = root_mean_squared_error(y_test, y_pred)
print(f"✅ RMSE on test set: {rmse:.4f}")
Next steps
- XGBoost GPU Support documentation
- Best practices for Serverless GPU compute
- Troubleshoot issues on serverless GPU compute
- Multi-GPU and multi-node distributed training
- Unity Catalog volumes