Train XGBoost model on a single GPU

This notebook demonstrates how to train an XGBoost regression model on a single GPU using Databricks serverless GPU compute. GPU acceleration significantly speeds up model training compared to CPU-based training, especially for large datasets.

Key concepts covered:

  • GPU-accelerated training: Uses XGBoost's hist tree method with CUDA device for faster training
  • Model checkpointing: Saves model state periodically to Unity Catalog volumes for recovery and incremental training
  • California Housing dataset: A regression task predicting median house values

For more information, see XGBoost GPU Support and Unity Catalog volumes.

Requirements

This notebook requires a GPU-enabled compute cluster. Databricks serverless GPU compute is automatically selected when running cells.

Install required libraries

Install XGBoost version 2.0.3 (pinned because of a known xgboost_ray compatibility issue) and scikit-learn for dataset loading and evaluation metrics. The `root_mean_squared_error` function used later requires scikit-learn 1.4 or later.

Python
%pip install xgboost==2.0.3 # due to this issue: https://github.com/ray-project/xgboost_ray/issues/312
%pip install scikit-learn
dbutils.library.restartPython()

Verify that XGBoost 2.0.3 is installed correctly.

Python
%pip show xgboost

Configure Unity Catalog checkpoint location

Define the Unity Catalog volume location where model checkpoints will be saved. The notebook uses query parameters to configure the catalog, schema, volume, and model name.

Python
# You must have `USE CATALOG` privileges on the catalog, and you must have `USE SCHEMA` privileges on the schema.
# If necessary, change the catalog and schema name here.
dbutils.widgets.text("uc_catalog", "main")
dbutils.widgets.text("uc_schema", "default")
dbutils.widgets.text("uc_model_name", "custom_transformer")
dbutils.widgets.text("uc_volume", "checkpoints")

UC_CATALOG = dbutils.widgets.get("uc_catalog")
UC_SCHEMA = dbutils.widgets.get("uc_schema")
UC_VOLUME = dbutils.widgets.get("uc_volume")
MODEL_NAME = dbutils.widgets.get("uc_model_name")
CHECKPOINT_PATH = f"/Volumes/{UC_CATALOG}/{UC_SCHEMA}/{UC_VOLUME}/{MODEL_NAME}"
CHECKPOINT_PREFIX = "checkpoint"

print(f"UC_CATALOG: {UC_CATALOG}")
print(f"UC_SCHEMA: {UC_SCHEMA}")
print(f"UC_VOLUME: {UC_VOLUME}")
print(f"CHECKPOINT_PATH: {CHECKPOINT_PATH}")

Create a checkpoint callback that saves the model state every 50 boosting rounds to the Unity Catalog volume. This enables recovery from failures and incremental training.

Python
import os
from xgboost.callback import TrainingCheckPoint

# Create the checkpoint directory inside the UC volume if it doesn't already exist
# (the volume itself must already exist in Unity Catalog)
os.makedirs(CHECKPOINT_PATH, exist_ok=True)

# Create a callback to checkpoint to a UC volume
checkpoint_cb = TrainingCheckPoint(
    directory=CHECKPOINT_PATH,
    name=CHECKPOINT_PREFIX,
    iterations=50,  # save a checkpoint every 50 boosting rounds
)

Train XGBoost model on a single GPU

Load the California Housing dataset, configure XGBoost for GPU training, and train a regression model. The model predicts median house values using features like location, number of rooms, and population density.

Python
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_error

# Load California Housing dataset
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert to DMatrix
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# GPU training parameters for regression
params = {
    "tree_method": "hist",           # Histogram-based tree method; runs on the GPU when device="cuda"
    "device": "cuda",                # Train on the GPU
    "objective": "reg:squarederror", # Regression objective
    "eval_metric": "rmse",           # Root mean squared error
    "max_depth": 6,
    "learning_rate": 0.1,
}

# Train the model
bst = xgb.train(
    params=params,
    dtrain=dtrain,
    num_boost_round=200,
    evals=[(dtest, "eval"), (dtrain, "train")],
    verbose_eval=10,
    callbacks=[checkpoint_cb],
)

# Predict
y_pred = bst.predict(dtest)

# Evaluate
rmse = root_mean_squared_error(y_test, y_pred)
print(f"✅ RMSE on test set: {rmse:.4f}")
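For reference, RMSE is simply the square root of the mean squared error. A minimal NumPy equivalent, useful if your scikit-learn version predates `root_mean_squared_error` (the helper name here is illustrative):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```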

Load model from checkpoint and evaluate

Load a previously saved checkpoint from the 150th boosting round and evaluate its performance. This demonstrates how to resume training or use intermediate model states.

Python
# Path to the checkpoint saved at boosting round 150
checkpoint = f"{CHECKPOINT_PATH}/{CHECKPOINT_PREFIX}_150.json"

# Load the model from a checkpoint
bst = xgb.Booster()
bst.load_model(checkpoint)

dtest = xgb.DMatrix(X_test)
y_pred = bst.predict(dtest)

# Evaluate
rmse = root_mean_squared_error(y_test, y_pred)
print(f"✅ RMSE on test set: {rmse:.4f}")
