Distributed training of XGBoost models using sparkdl.xgboost

Preview

This feature is in Public Preview.

Note

sparkdl.xgboost is deprecated starting with Databricks Runtime 12.0 ML. For information about migrating your workloads to xgboost.spark, see Migration guide for the deprecated sparkdl.xgboost module.

Databricks Runtime 7.6 ML and above include two PySpark estimators based on the Python xgboost package: sparkdl.xgboost.XgboostRegressor and sparkdl.xgboost.XgboostClassifier. You can create an ML pipeline based on these estimators. For more information, see XGBoost for PySpark Pipeline.

Databricks strongly recommends that sparkdl.xgboost users use Databricks Runtime 11.3 ML or above. Earlier Databricks Runtime versions are affected by bugs in older versions of sparkdl.xgboost.

Note

  • The sparkdl.xgboost module has been deprecated since Databricks Runtime 12.0 ML. Databricks recommends that you migrate your code to use the xgboost.spark module instead. See the migration guide.

  • The following parameters from the xgboost package are not supported: gpu_id, output_margin, validate_features. The parameter kwargs is supported in Databricks Runtime 9.0 ML and above.

  • The parameters sample_weight, eval_set, and sample_weight_eval_set are not supported. Instead, use the parameters weightCol and validationIndicatorCol. See XGBoost for PySpark Pipeline for details.

  • The parameters base_margin and base_margin_eval_set are not supported. In Databricks Runtime 9.0 ML and above, you can use the parameter baseMarginCol instead. See XGBoost for PySpark Pipeline for details.

  • The parameter missing has different semantics from the xgboost package. In the xgboost package, the zero values in a SciPy sparse matrix are treated as missing values regardless of the value of missing. For the PySpark estimators in the sparkdl package, zero values in a Spark sparse vector are not treated as missing values unless you set missing=0. If you have a sparse training dataset (most feature values are missing), Databricks recommends setting missing=0 to reduce memory consumption and achieve better performance.
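The parameter rules in the notes above can be encoded as a small pre-flight check for code ported from the plain xgboost package. This is a minimal sketch, not part of sparkdl: the helper name check_xgboost_kwargs and both lookup tables are hypothetical. The pairing of each unsupported parameter with a single replacement column is an assumption drawn from the wording of the notes.

```python
# Hypothetical helper, not part of sparkdl: it only encodes the rules
# listed in the notes above.

# xgboost parameters that the sparkdl.xgboost estimators do not accept.
UNSUPPORTED = {"gpu_id", "output_margin", "validate_features"}

# xgboost parameters replaced by column-name parameters on the estimators.
# The one-to-one pairing is an assumption based on the notes above.
USE_INSTEAD = {
    "sample_weight": "weightCol",
    "eval_set": "validationIndicatorCol",
    "sample_weight_eval_set": "weightCol and validationIndicatorCol",
    "base_margin": "baseMarginCol",
    "base_margin_eval_set": "baseMarginCol",
}


def check_xgboost_kwargs(kwargs):
    """Return one message per problematic parameter name, in sorted order."""
    problems = []
    for name in sorted(kwargs):
        if name in UNSUPPORTED:
            problems.append(f"{name}: not supported")
        elif name in USE_INSTEAD:
            problems.append(f"{name}: not supported, use {USE_INSTEAD[name]}")
    return problems


# Example: keyword arguments taken from code written for the xgboost package.
example = {"max_depth": 5, "sample_weight": [1.0, 2.0], "gpu_id": 0}
for message in check_xgboost_kwargs(example):
    print(message)
```

An empty result means that none of the parameters named in the notes were found. It does not guarantee that every remaining parameter is accepted by the estimators.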

Distributed training

Databricks Runtime 9.0 ML and above support distributed XGBoost training using the num_workers parameter. To use distributed training, create a classifier or regressor and set num_workers to a value less than or equal to the number of workers on your cluster.

For example:

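from sparkdl.xgboost import XgboostClassifier, XgboostRegressor
# N: an integer less than or equal to the number of workers on your cluster
classifier = XgboostClassifier(num_workers=N)
regressor = XgboostRegressor(num_workers=N)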

Limitations of distributed training

  • You cannot use mlflow.xgboost.autolog with distributed XGBoost.

  • You cannot use baseMarginCol with distributed XGBoost.

  • You cannot use distributed XGBoost on a cluster with autoscaling enabled. See Enable and configure autoscaling for instructions to disable autoscaling.
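The limitations above, together with the num_workers rule, can be checked before a training job is submitted. The following is a minimal sketch under stated assumptions: the function distributed_training_problems and its argument names are hypothetical, and the caller must supply the cluster facts, because nothing here queries Spark or Databricks.

```python
# Hypothetical pre-flight check, not part of sparkdl. The caller supplies
# the cluster facts; nothing here queries Spark or Databricks.

def distributed_training_problems(
    num_workers,
    cluster_workers,
    autoscaling_enabled=False,
    base_margin_col=None,
    mlflow_xgboost_autolog=False,
):
    """Return the documented limitations that this configuration violates."""
    problems = []
    if num_workers > cluster_workers:
        problems.append("num_workers exceeds the number of workers on the cluster")
    if autoscaling_enabled:
        problems.append("autoscaling must be disabled")
    if base_margin_col is not None:
        problems.append("baseMarginCol cannot be used")
    if mlflow_xgboost_autolog:
        problems.append("mlflow.xgboost.autolog cannot be used")
    return problems


# Example: 8 training workers requested on a 4-worker autoscaling cluster.
for problem in distributed_training_problems(8, 4, autoscaling_enabled=True):
    print(problem)
```

Returning a list rather than raising on the first violation lets you report every conflict at once.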

GPU training

Note

Databricks Runtime 11.3 ML includes XGBoost 1.6.1, which does not support GPU clusters with compute capability 5.2 and below.

Databricks Runtime 9.1 LTS ML and above support GPU clusters for XGBoost training. To use a GPU cluster, set use_gpu to True.

For example:

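from sparkdl.xgboost import XgboostClassifier, XgboostRegressor
# N: an integer less than or equal to the number of workers on your GPU cluster
classifier = XgboostClassifier(num_workers=N, use_gpu=True)
regressor = XgboostRegressor(num_workers=N, use_gpu=True)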
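If you run Databricks Runtime 11.3 ML, the compute capability restriction in the note above can be checked before the estimator is built. This is a minimal sketch: the helper gpu_estimator_kwargs is hypothetical, and you must look up the compute capability of your GPU type yourself and pass it in as a (major, minor) tuple.

```python
# Hypothetical helper, not part of sparkdl.

# Highest CUDA compute capability that XGBoost 1.6.1 does NOT support,
# according to the note above (applies to Databricks Runtime 11.3 ML).
MAX_UNSUPPORTED_CAPABILITY = (5, 2)


def gpu_estimator_kwargs(num_workers, compute_capability):
    """Build estimator keyword arguments for GPU training.

    compute_capability is a (major, minor) tuple, for example (7, 5).
    Raises ValueError if the capability is 5.2 or below.
    """
    if tuple(compute_capability) <= MAX_UNSUPPORTED_CAPABILITY:
        major, minor = compute_capability
        raise ValueError(
            f"compute capability {major}.{minor} is not supported by XGBoost 1.6.1"
        )
    return {"num_workers": num_workers, "use_gpu": True}


# Example: two workers on GPUs with compute capability 7.5.
kwargs = gpu_estimator_kwargs(num_workers=2, compute_capability=(7, 5))
print(kwargs)
```

On Databricks Runtime ML, you can then pass the resulting dictionary to an estimator, for example XgboostClassifier(**kwargs).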

Example notebook

This notebook shows the use of the Python package sparkdl.xgboost with Spark MLlib. The sparkdl.xgboost package has been deprecated since Databricks Runtime 12.0 ML.

PySpark-XGBoost notebook
