Distributed training of XGBoost models using sparkdl.xgboost
This feature is in Public Preview.
sparkdl.xgboost is deprecated starting with Databricks Runtime 12.0 ML, and is removed in Databricks Runtime 13.0 ML and above. For information about migrating your workloads to xgboost.spark, see Migration guide for the deprecated sparkdl.xgboost module.
Databricks Runtime ML includes PySpark estimators based on the Python xgboost package: sparkdl.xgboost.XgboostRegressor and sparkdl.xgboost.XgboostClassifier. You can create an ML pipeline based on these estimators. For more information, see XGBoost for PySpark Pipeline.
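For example, you might wire an XgboostClassifier into a pipeline like this (a minimal sketch; the column names f1, f2, and label and the DataFrame train_df are illustrative assumptions, not part of the API):

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from sparkdl.xgboost import XgboostClassifier

# Assemble raw feature columns into a single vector column.
# The column names "f1", "f2", and "label" are illustrative assumptions.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
xgb = XgboostClassifier(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[assembler, xgb])
model = pipeline.fit(train_df)  # train_df: an assumed Spark DataFrame
```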
Databricks strongly recommends that sparkdl.xgboost users use Databricks Runtime 11.3 LTS ML or above. Previous Databricks Runtime versions are affected by bugs in older versions of sparkdl.xgboost.

The sparkdl.xgboost module is deprecated since Databricks Runtime 12.0 ML. Databricks recommends that you migrate your code to use the xgboost.spark module instead. See the migration guide.
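As a rough sketch of what the migration looks like (the exact parameter mapping is covered in the migration guide; note that xgboost.spark, which ships with XGBoost 1.7 and above, uses snake_case parameter names such as label_col):

```python
# Deprecated module:
from sparkdl.xgboost import XgboostClassifier
clf_old = XgboostClassifier(num_workers=4, labelCol="label")

# Replacement module (xgboost.spark, XGBoost >= 1.7):
from xgboost.spark import SparkXGBClassifier
clf_new = SparkXGBClassifier(num_workers=4, label_col="label")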
The following parameters from the xgboost package are not supported or behave differently:

- The parameters sample_weight, eval_set, and sample_weight_eval_set are not supported. Instead, use the parameters weightCol and validationIndicatorCol, as shown in the sketch after this list. See XGBoost for PySpark Pipeline for details.
- The parameters base_margin and base_margin_eval_set are not supported. Use the parameter baseMarginCol instead. See XGBoost for PySpark Pipeline for details.
- The parameter missing has different semantics from the xgboost package. In the xgboost package, the zero values in a SciPy sparse matrix are treated as missing values regardless of the value of missing. For the PySpark estimators in the sparkdl package, zero values in a Spark sparse vector are not treated as missing values unless you set missing=0. If you have a sparse training dataset (most feature values are missing), Databricks recommends setting missing=0 to reduce memory consumption and achieve better performance; see the sketch after this list.
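The following sketch shows these column-based parameters together (the column names weight and isVal and the DataFrame train_df are illustrative assumptions):

```python
from sparkdl.xgboost import XgboostClassifier

classifier = XgboostClassifier(
    featuresCol="features",
    labelCol="label",
    weightCol="weight",              # per-row weights (replaces sample_weight)
    validationIndicatorCol="isVal",  # boolean column; True rows form the eval set
    missing=0,                       # treat zeros in sparse vectors as missing
)
model = classifier.fit(train_df)  # train_df is an assumed Spark DataFrame
```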
Databricks Runtime ML supports distributed XGBoost training using the num_workers parameter. To use distributed training, create a classifier or regressor and set num_workers to a value less than or equal to the number of workers on your cluster.
```python
from sparkdl.xgboost import XgboostClassifier, XgboostRegressor

# N must be <= the number of workers on your cluster.
classifier = XgboostClassifier(num_workers=N)
regressor = XgboostRegressor(num_workers=N)
```
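A fitted estimator behaves like any other Spark ML model, so a typical train-and-predict flow looks like this (a sketch; train_df and test_df are assumed DataFrames with features and label columns):

```python
from sparkdl.xgboost import XgboostClassifier

classifier = XgboostClassifier(num_workers=4, featuresCol="features", labelCol="label")
model = classifier.fit(train_df)        # distributed training across 4 workers
predictions = model.transform(test_df)  # standard Spark ML inference
```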
Limitations of distributed training
- You cannot use mlflow.xgboost.autolog with distributed XGBoost; a manual-logging workaround is sketched after this list.
- You cannot use baseMarginCol with distributed XGBoost.
- You cannot use distributed XGBoost on a cluster with autoscaling enabled. See Cluster size and autoscaling for instructions on disabling autoscaling.
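If you need run tracking with distributed XGBoost, one workaround is to log parameters and the fitted model manually with the core MLflow APIs (a sketch, not a prescribed approach; train_df is an assumed Spark DataFrame):

```python
import mlflow
import mlflow.spark
from sparkdl.xgboost import XgboostClassifier

with mlflow.start_run():
    # Log hyperparameters by hand, since mlflow.xgboost.autolog()
    # is unsupported with distributed XGBoost.
    mlflow.log_param("num_workers", 4)
    classifier = XgboostClassifier(num_workers=4, labelCol="label")
    model = classifier.fit(train_df)
    # Persist the fitted Spark ML model as a run artifact.
    mlflow.spark.log_model(model, "model")
```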
GPU training

Databricks Runtime 11.3 LTS ML includes XGBoost 1.6.1, which does not support GPU clusters with compute capability 5.2 and below.

Databricks Runtime 9.1 LTS ML and above support GPU clusters for XGBoost training. To use a GPU cluster, set use_gpu to True:

```python
from sparkdl.xgboost import XgboostClassifier, XgboostRegressor

classifier = XgboostClassifier(num_workers=N, use_gpu=True)
regressor = XgboostRegressor(num_workers=N, use_gpu=True)
```