Train XGBoost model on a single GPU
This notebook demonstrates how to train an XGBoost regression model on a single GPU using Databricks serverless GPU compute. GPU acceleration significantly speeds up model training compared to CPU-based training, especially for large datasets.
Key concepts covered:
- GPU-accelerated training: Uses XGBoost's `hist` tree method with the CUDA device for faster training
- Model checkpointing: Saves model state periodically to Unity Catalog volumes for recovery and incremental training
- California Housing dataset: A regression task predicting median house values
For more information, see XGBoost GPU Support and Unity Catalog volumes.
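The GPU configuration described above comes down to two parameter settings in the training dictionary. A minimal sketch (parameter names match the training cell later in this notebook; the CPU variant is shown only for contrast):

```python
# Parameters that switch XGBoost's histogram tree method onto the GPU.
# The "hist" algorithm itself is the same on CPU and GPU; the "device"
# key controls where it runs.
gpu_params = {"tree_method": "hist", "device": "cuda"}

# On a CPU-only machine, the same algorithm runs without CUDA.
cpu_params = {"tree_method": "hist", "device": "cpu"}
```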
Requirements
This notebook requires a GPU-enabled compute cluster. Databricks serverless GPU compute is automatically selected when running cells.
Install required libraries
Install XGBoost version 2.0.3 and scikit-learn for dataset loading and evaluation metrics.
%pip install xgboost==2.0.3 # due to this issue: https://github.com/ray-project/xgboost_ray/issues/312
%pip install scikit-learn
dbutils.library.restartPython()
Verify that XGBoost 2.0.3 is installed correctly.
%pip show xgboost
Configure Unity Catalog checkpoint location
Define the Unity Catalog volume location where model checkpoints will be saved. The notebook uses query parameters to configure the catalog, schema, volume, and model name.
# You must have `USE CATALOG` privileges on the catalog, and you must have `USE SCHEMA` privileges on the schema.
# If necessary, change the catalog and schema name here.
dbutils.widgets.text("uc_catalog", "main")
dbutils.widgets.text("uc_schema", "default")
dbutils.widgets.text("uc_model_name", "custom_transformer")
dbutils.widgets.text("uc_volume", "checkpoints")
UC_CATALOG = dbutils.widgets.get("uc_catalog")
UC_SCHEMA = dbutils.widgets.get("uc_schema")
UC_VOLUME = dbutils.widgets.get("uc_volume")
MODEL_NAME = dbutils.widgets.get("uc_model_name")
CHECKPOINT_PATH = f"/Volumes/{UC_CATALOG}/{UC_SCHEMA}/{UC_VOLUME}/{MODEL_NAME}"
CHECKPOINT_PREFIX = "checkpoint"
print(f"UC_CATALOG: {UC_CATALOG}")
print(f"UC_SCHEMA: {UC_SCHEMA}")
print(f"UC_VOLUME: {UC_VOLUME}")
print(f"CHECKPOINT_PATH: {CHECKPOINT_PATH}")
Create a checkpoint callback that saves the model state every 50 boosting rounds to the Unity Catalog volume. This enables recovery from failures and incremental training.
import os
from xgboost.callback import TrainingCheckPoint
# Create the UC Volume where the checkpoint will be saved if it doesn't exist already
os.makedirs(CHECKPOINT_PATH, exist_ok=True)
# Create a callback to checkpoint to a UC volume
checkpoint_cb = TrainingCheckPoint(
directory=CHECKPOINT_PATH,
name=CHECKPOINT_PREFIX,
iterations=50, # save every 50 boosting rounds
)
Train XGBoost model on a single GPU
Load the California Housing dataset, configure XGBoost for GPU training, and train a regression model. The model predicts median house values using features like location, number of rooms, and population density.
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_error
# Load California Housing dataset
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Convert to DMatrix
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
# GPU training parameters for regression
params = {
"tree_method": "hist", # Use GPU histogram
"device": "cuda",
"objective": "reg:squarederror", # Regression objective
"eval_metric": "rmse", # Root Mean Squared Error
"max_depth": 6,
"learning_rate": 0.1,
}
# Train the model
bst = xgb.train(
params=params,
dtrain=dtrain,
num_boost_round=200,
evals=[(dtest, "eval"), (dtrain, "train")],
verbose_eval=10,
callbacks=[checkpoint_cb]
)
# Predict
y_pred = bst.predict(dtest)
# Evaluate
rmse = root_mean_squared_error(y_test, y_pred)
print(f"✅ RMSE on test set: {rmse:.4f}")
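As a sanity check, the RMSE metric reported above can be reproduced directly with NumPy. This sketch uses small hard-coded arrays rather than the model's predictions, so it runs standalone:

```python
import numpy as np

# Illustrative values, not outputs from the trained model.
y_true = np.array([2.0, 3.0, 4.0])
y_hat = np.array([2.5, 2.5, 4.0])

# RMSE = sqrt(mean of squared errors)
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))
print(f"RMSE: {rmse:.4f}")  # → RMSE: 0.4082
```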
Load model from checkpoint and evaluate
Load a previously saved checkpoint from the 150th boosting round and evaluate its performance. This demonstrates how to resume training or use intermediate model states.
# Take sample checkpoint from 150th step
checkpoint = f"{CHECKPOINT_PATH}/{CHECKPOINT_PREFIX}_150.json"
# Load the model from a checkpoint
bst = xgb.Booster()
bst.load_model(checkpoint)
dtest = xgb.DMatrix(X_test)
y_pred = bst.predict(dtest)
# Evaluate
rmse = root_mean_squared_error(y_test, y_pred)
print(f"✅ RMSE on test set: {rmse:.4f}")
Next steps
- XGBoost GPU Support documentation
- Best practices for Serverless GPU compute
- Troubleshoot issues on serverless GPU compute
- Multi-GPU and multi-node distributed training
- Unity Catalog volumes