Use XGBoost on Databricks


XGBoost versions below 1.3.0 have a bug that can kill the shared Spark context when XGBoost model training fails. The only way to recover is to restart the cluster. Every Databricks Runtime ML version below 7.6 ML includes an XGBoost version affected by this bug. To install a different version of XGBoost, see Install XGBoost on Databricks.
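To check whether a cluster is exposed to this bug, you can compare the installed version against 1.3.0 before you start training. The following is a minimal sketch, not an official utility: the helper names are hypothetical, and the parser reads only the leading numeric components, so a pre-release such as 1.3.0rc1 is treated the same as 1.3.0.

```python
import re

# First release that is not affected by the Spark-context bug described above.
FIRST_FIXED_VERSION = (1, 3, 0)


def parse_version(version: str) -> tuple:
    """Return the leading numeric components, e.g. '1.2.1' -> (1, 2, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    parts = [int(p) for p in match.group(0).split(".")]
    # Pad short versions so that '1.3' compares equal to '1.3.0'.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_affected(version: str) -> bool:
    """Return True if the given XGBoost version is below 1.3.0."""
    return parse_version(version) < FIRST_FIXED_VERSION


def installed_xgboost_is_affected() -> bool:
    """Check the xgboost package installed in the current environment."""
    import xgboost  # imported lazily so the helpers above work without it

    return is_affected(xgboost.__version__)
```

In a notebook, calling installed_xgboost_is_affected() before training tells you whether a failed training run could take down the Spark context.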

Single-node training in Python

The Python xgboost package supports single-node training only.

Databricks Runtime 7.6 ML and above


This feature is in Public Preview.

Databricks Runtime 7.6 ML and above include two PySpark estimators based on the Python xgboost package: sparkdl.xgboost.XgboostRegressor and sparkdl.xgboost.XgboostClassifier. You can create an ML pipeline based on these estimators. For more information, see Xgboost for PySpark Pipeline.


  • These estimators train the model on a single Spark worker.
  • GPU clusters are not supported.
  • The following parameters from the xgboost package are not supported: gpu_id, kwargs, output_margin, base_margin, validate_features.
  • The parameters sample_weight, eval_set, and sample_weight_eval_set are not supported. Instead, use the parameters weightCol and validationIndicatorCol. See Xgboost for PySpark Pipeline for details.
  • The parameter missing has different semantics from the xgboost package. In the xgboost package, the zero values in a SciPy sparse matrix are treated as missing values regardless of the value of missing. For the PySpark estimators in the sparkdl package, zero values in a Spark sparse vector are not treated as missing values unless you set missing=0. If you have a sparse training dataset (most feature values are missing), Databricks recommends setting missing=0 to reduce memory consumption and achieve better performance.
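The restrictions above can be sketched as a small helper that assembles estimator arguments and rejects the unsupported ones. This is an illustration under assumptions, not the library's own validation: the helper names and the column names (features, label, weight, is_validation) are hypothetical, and featuresCol and labelCol are assumed to follow the usual Spark ML naming.

```python
# Parameters listed above as unsupported by the sparkdl.xgboost estimators.
UNSUPPORTED_PARAMS = frozenset({
    "gpu_id", "kwargs", "output_margin", "base_margin", "validate_features",
    "sample_weight", "eval_set", "sample_weight_eval_set",
})


def estimator_params(sparse_features=False, **xgb_params):
    """Build keyword arguments for XgboostClassifier or XgboostRegressor.

    Column names are hypothetical; replace them with your own.
    """
    rejected = sorted(UNSUPPORTED_PARAMS.intersection(xgb_params))
    if rejected:
        raise ValueError(f"not supported by the PySpark estimators: {rejected}")
    params = {
        "featuresCol": "features",
        "labelCol": "label",
        "weightCol": "weight",                      # used instead of sample_weight
        "validationIndicatorCol": "is_validation",  # used instead of eval_set
    }
    if sparse_features:
        # Treat zeros in Spark sparse vectors as missing values.
        params["missing"] = 0.0
    params.update(xgb_params)
    return params


def make_classifier(**kwargs):
    """Create the estimator; requires Databricks Runtime 7.6 ML or above."""
    from sparkdl.xgboost import XgboostClassifier  # lazy import, cluster only

    return XgboostClassifier(**estimator_params(**kwargs))
```

On a cluster, make_classifier(sparse_features=True, max_depth=5).fit(train_df) would train on a single worker, where train_df is a hypothetical DataFrame that contains the four columns named above.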

PySpark-XGBoost notebook


All Databricks Runtime ML versions

You can train models using the Python xgboost package directly, but you cannot use the package itself as a stage in a PySpark ML pipeline.
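As a rough illustration of training with the Python package directly, the sketch below builds a toy dataset and fits a booster in the notebook's Python process. The function names, the synthetic data, and the hyperparameter values are assumptions chosen for the example, not recommended settings.

```python
import random


def make_synthetic_data(n_rows=200, seed=0):
    """Generate a toy binary-classification set with the standard library."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n_rows):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        rows.append([x1, x2])
        labels.append(1 if x1 + x2 > 0 else 0)
    return rows, labels


def train_single_node(rows, labels, num_boost_round=20):
    """Train a booster in the local Python process, without Spark."""
    import numpy as np      # lazy imports: both ship with Databricks Runtime ML
    import xgboost as xgb

    dtrain = xgb.DMatrix(np.asarray(rows, dtype=float), label=np.asarray(labels))
    params = {"objective": "binary:logistic", "max_depth": 3, "eta": 0.3}
    return xgb.train(params, dtrain, num_boost_round=num_boost_round)
```

Calling train_single_node(*make_synthetic_data()) returns an xgboost Booster. Because all of the work happens in one process, the training data must fit in that machine's memory.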

XGBoost Python notebook


Distributed training in Scala

To perform distributed training, you must use XGBoost’s Scala/Java packages. The example notebooks in this section show how to use XGBoost with MLlib. The first embeds an XGBoost model in an MLlib ML pipeline. The second uses MLlib cross-validation to tune an XGBoost model.

XGBoost classification with ML pipeline notebook


XGBoost regression with cross-validation notebook
