Use XGBoost on Databricks

The Python xgboost package supports only single-node training. To perform distributed training, you must use XGBoost’s Scala and Java packages.

Warning

XGBoost versions 1.2.0 and lower have a bug that can kill the shared Spark context if XGBoost model training fails. The only way to recover is to restart the cluster. Databricks Runtime 7.5 ML and lower include a version of XGBoost that is affected by this bug. To install a different version of XGBoost, see Install XGBoost on Databricks.
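As a rough way to check whether an installed XGBoost release falls in the affected range described above, a version string can be compared against 1.2.0. This is a minimal sketch: the helper names parse_version and is_affected are hypothetical and are not part of xgboost or Databricks.

```python
import re

# Highest XGBoost release that the warning describes as affected.
LAST_AFFECTED = (1, 2, 0)


def parse_version(version: str) -> tuple:
    """Extract the leading numeric major.minor.patch parts of a version string.

    Missing parts default to 0, and suffixes such as "-SNAPSHOT" are ignored.
    """
    match = re.match(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return tuple(int(part) if part is not None else 0 for part in match.groups())


def is_affected(version: str) -> bool:
    """Return True if the version is 1.2.0 or lower."""
    return parse_version(version) <= LAST_AFFECTED


print(is_affected("1.2.0"))  # True
print(is_affected("1.3.0"))  # False
```

On a cluster, the installed version string is available as xgboost.__version__ after importing the package, so the check could be called as is_affected(xgboost.__version__).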

Single node training in Python

You can train models using the Python xgboost package. To train a PySpark ML pipeline with xgboost, see Integration with Spark MLlib (Python).

XGBoost Python notebook


Integration with Spark MLlib (Python)

Preview

This feature is in Public Preview.

Databricks Runtime 7.6 ML and above include two PySpark estimators based on the Python xgboost package: sparkdl.xgboost.XgboostRegressor and sparkdl.xgboost.XgboostClassifier. You can create an ML pipeline based on these estimators. For more information, see Xgboost for PySpark Pipeline.

Note

  • These estimators train the model on a single Spark worker.
  • GPU clusters are not supported.
  • The following parameters from the xgboost package are not supported: gpu_id, kwargs, output_margin, base_margin, validate_features.
  • The parameters sample_weight, eval_set, and sample_weight_eval_set are not supported. Instead, use the parameters weightCol and validationIndicatorCol. See Xgboost for PySpark Pipeline for details.
  • The parameter missing has different semantics from the xgboost package. In the xgboost package, zero values in a SciPy sparse matrix are treated as missing values regardless of the value of missing. For the PySpark estimators in the sparkdl package, zero values in a Spark sparse vector are not treated as missing values unless you set missing=0. If you have a sparse training dataset (most feature values are missing), Databricks recommends setting missing=0 to reduce memory consumption and achieve better performance.
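The effect of missing=0 on a sparse vector, as described in the last item above, can be pictured with a small pure-Python model. This is an illustrative sketch only: present_entries is a hypothetical helper, not the sparkdl or xgboost implementation, and it models just the rule stated in the note.

```python
import math


def present_entries(size, indices, values, missing=float("nan")):
    """Return {index: value} for entries NOT treated as missing.

    Illustrative model only: unstored positions of the sparse vector are
    read as 0.0, and an entry counts as missing when it equals `missing`
    (NaN is checked with math.isnan because NaN != NaN).
    """
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value

    def is_missing(value):
        if math.isnan(missing):
            return math.isnan(value)
        return value == missing

    return {i: v for i, v in enumerate(dense) if not is_missing(v)}


# A sparse vector of size 6 with two stored values.
size, indices, values = 6, [1, 4], [2.5, 1.0]

default_view = present_entries(size, indices, values)              # zeros kept as real values
sparse_view = present_entries(size, indices, values, missing=0.0)  # zeros treated as missing

print(len(default_view), len(sparse_view))  # prints: 6 2
```

In this model, leaving missing at its default keeps all six positions as real feature values, while missing=0 keeps only the two stored values, which suggests why that setting can reduce memory consumption on sparse data.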

PySpark-XGBoost notebook


Integration with Spark MLlib (Scala)

The examples in this section show how you can use XGBoost with MLlib. The first example shows how to embed an XGBoost model into an MLlib ML pipeline. The second example shows how to use MLlib cross validation to tune an XGBoost model.

XGBoost classification with ML pipeline notebook


XGBoost regression with cross-validation notebook
