Use XGBoost on Databricks

Databricks Runtime for Machine Learning includes XGBoost libraries for both Python and Scala.

Warning

Versions of XGBoost 1.2.0 and lower have a bug that can cause the shared Spark context to be killed if XGBoost model training fails. The only way to recover is to restart the cluster. Databricks Runtime 7.5 ML and lower include a version of XGBoost that is affected by this bug. To install a different version of XGBoost, see Install XGBoost on Databricks.
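
For example, you can install a specific version with a notebook-scoped %pip command (the version shown here is only illustrative; follow Install XGBoost on Databricks for the recommended approach):

%pip install xgboost==1.3.3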

Use XGBoost with Python

You can train models using the Python xgboost package. This package supports only single-node workloads. To train a PySpark ML pipeline and take advantage of distributed training, see Integration with Spark MLlib (Python).
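
For reference, a minimal single-node run with the xgboost package might look like the following sketch; the dataset and hyperparameter values are placeholders for illustration, not part of this documentation:

import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Placeholder dataset; substitute your own features and labels.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# DMatrix is XGBoost's optimized in-memory data structure.
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

params = {"objective": "binary:logistic", "eval_metric": "auc"}
model = xgb.train(params, dtrain, num_boost_round=50, evals=[(dtest, "test")])
predictions = model.predict(dtest)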

XGBoost Python notebook

Integration with Spark MLlib (Python)

Preview

This feature is in Public Preview.

Databricks Runtime 7.6 ML and above include PySpark estimators based on the Python xgboost package: sparkdl.xgboost.XgboostRegressor and sparkdl.xgboost.XgboostClassifier. You can create an ML pipeline based on these estimators. For more information, see XGBoost for PySpark Pipeline.
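
As a hedged sketch, an ML pipeline built on these estimators might look like the following; the DataFrame df, its column names, and the hyperparameter value are illustrative assumptions:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from sparkdl.xgboost import XgboostRegressor

# Assemble the numeric feature columns into a single vector column.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")

# The estimator accepts xgboost-style hyperparameters such as max_depth.
regressor = XgboostRegressor(featuresCol="features", labelCol="label", max_depth=5)

pipeline = Pipeline(stages=[assembler, regressor])
model = pipeline.fit(df)      # df is an assumed Spark DataFrame
predictions = model.transform(df)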

Note

  • The following parameters from the xgboost package are not supported: gpu_id, output_margin, validate_features. The parameter kwargs is supported in Databricks Runtime 9.0 ML and above.
  • The parameters sample_weight, eval_set, and sample_weight_eval_set are not supported. Instead, use the parameters weightCol and validationIndicatorCol, as shown in the sketch after this note. See XGBoost for PySpark Pipeline for details.
  • The parameters base_margin and base_margin_eval_set are not supported. In Databricks Runtime 9.0 ML and above, you can use the parameter baseMarginCol instead. See XGBoost for PySpark Pipeline for details.
  • The parameter missing has different semantics from the xgboost package. In the xgboost package, the zero values in a SciPy sparse matrix are treated as missing values regardless of the value of missing. For the PySpark estimators in the sparkdl package, zero values in a Spark sparse vector are not treated as missing values unless you set missing=0. If you have a sparse training dataset (most feature values are missing), Databricks recommends setting missing=0 to reduce memory consumption and achieve better performance.
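
A minimal sketch of the Spark-style parameters called out in the note above; the column names ("weight", "isVal") are illustrative assumptions, not fixed names:

from sparkdl.xgboost import XgboostClassifier

classifier = XgboostClassifier(
    featuresCol="features",
    labelCol="label",
    weightCol="weight",              # instead of sample_weight
    validationIndicatorCol="isVal",  # instead of eval_set / sample_weight_eval_set
    missing=0.0,                     # treat zeros in sparse vectors as missing
)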

Distributed training

Databricks Runtime 9.0 ML and above support distributed XGBoost training using the num_workers parameter. To use distributed training, create a classifier or regressor and set num_workers to a value less than or equal to the number of workers on your cluster.

For example:

from sparkdl.xgboost import XgboostClassifier, XgboostRegressor

classifier = XgboostClassifier(num_workers=N, **{other params})
regressor = XgboostRegressor(num_workers=N, **{other params})

Limitations of distributed training

  • You cannot use mlflow.xgboost.autolog with distributed XGBoost.
  • You cannot use baseMarginCol with distributed XGBoost.
  • You cannot use distributed XGBoost on a cluster with autoscaling enabled. See Enable and configure autoscaling for instructions to disable autoscaling.

GPU training

Databricks Runtime 9.0 ML and above support GPU clusters for XGBoost training. To use a GPU cluster, set use_gpu to True.

For example:

from sparkdl.xgboost import XgboostClassifier, XgboostRegressor

classifier = XgboostClassifier(num_workers=N, use_gpu=True, **{other params})
regressor = XgboostRegressor(num_workers=N, use_gpu=True, **{other params})

Example notebook for Python integration with Spark MLlib

PySpark-XGBoost notebook

Integration with Spark MLlib (Scala)

The examples in this section show how you can use XGBoost with MLlib. The first example shows how to embed an XGBoost model into an MLlib ML pipeline. The second example shows how to use MLlib cross-validation to tune an XGBoost model.

XGBoost classification with ML pipeline notebook

XGBoost regression with cross-validation notebook
