Versions of XGBoost below 1.3.0 have a bug that can cause the shared Spark context to be killed if XGBoost model training fails. The only way to recover is to restart the cluster. All Databricks Runtime ML versions below 7.6 ML include a version of XGBoost that is affected by this bug. To install a different version of XGBoost, see Install XGBoost on Databricks.
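A quick way to find out whether a cluster is exposed is to compare the installed XGBoost version against 1.3.0. The following is a minimal sketch, not an official check: the helper names `parse_version` and `is_affected` are hypothetical, and the parser ignores pre-release suffixes such as `rc1`.

```python
def parse_version(version):
    """Return the first three numeric components of a version string as ints."""
    parts = []
    for token in version.split(".")[:3]:
        digits = ""
        for ch in token:
            if not ch.isdigit():
                break  # stop at suffixes such as "rc1" or "-SNAPSHOT"
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    # Pad so that "1.3" compares equal to "1.3.0".
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def is_affected(version, fixed_in=(1, 3, 0)):
    """True if the version is older than the release that fixed the bug."""
    return parse_version(version) < fixed_in


# The import is guarded so the check also runs where XGBoost is not installed.
try:
    import xgboost
except ImportError:
    xgboost = None

if xgboost is not None:
    print(xgboost.__version__, "affected:", is_affected(xgboost.__version__))
```

Run this in a notebook cell before starting a long training job. If it reports an affected version, install a newer XGBoost as described in Install XGBoost on Databricks.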
The XGBoost Python package supports only single-node training workloads.
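As a rough illustration of single-node training, the sketch below fits a classifier with the scikit-learn style API of the `xgboost` package. The function name `train_single_node` and the hyperparameter values are assumptions chosen for the example, not recommendations from this page.

```python
def train_single_node(features, labels):
    """Fit an XGBoost classifier on in-memory data, using the current node only."""
    # Deferred import: the xgboost package must be installed on the node
    # where this function is called.
    import xgboost as xgb

    # Hypothetical hyperparameters; tune them for your own data.
    model = xgb.XGBClassifier(n_estimators=50, max_depth=3)
    model.fit(features, labels)
    return model
```

Here `features` would be an in-memory array (for example a NumPy array or pandas DataFrame) and `labels` the matching target values. All of the data must fit in the memory of one machine.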
This feature is in Public Preview.
Databricks Runtime 7.6 ML and above include PySpark estimators based on the Python xgboost package, including sparkdl.xgboost.XgboostClassifier. You can create an ML pipeline based on these estimators. For more information, see Xgboost for PySpark Pipeline.
- These estimators train the model on a single Spark worker.
- GPU clusters are not supported.
- The following parameters from the xgboost package are not supported:
- The parameters sample_weight_eval_set are not supported. Instead, use the parameters validationIndicatorCol. See Xgboost for PySpark Pipeline for details.
- The parameter missing has different semantics from the xgboost package. In the xgboost package, the zero values in a SciPy sparse matrix are treated as missing values regardless of the value of missing. For the PySpark estimators in the sparkdl package, zero values in a Spark sparse vector are not treated as missing values unless you set missing=0. If you have a sparse training dataset (most feature values are missing), Databricks recommends setting missing=0 to reduce memory consumption and achieve better performance.
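The points above can be sketched as a small pipeline builder. This is a sketch under stated assumptions rather than a reference implementation: the function name `build_xgboost_pipeline`, the column names `features` and `is_validation`, and the `max_depth` value are all hypothetical. It is meant to run on a Databricks Runtime 7.6 ML or above CPU cluster, where `sparkdl.xgboost` is available.

```python
def build_xgboost_pipeline(feature_cols, label_col="label"):
    """Return an unfitted ML pipeline that ends in a sparkdl XGBoost classifier."""
    # Deferred imports: pyspark and sparkdl are only needed when the
    # pipeline is actually built on a Databricks Runtime ML cluster.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from sparkdl.xgboost import XgboostClassifier

    # Combine the raw feature columns into a single vector column.
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")

    classifier = XgboostClassifier(
        featuresCol="features",
        labelCol=label_col,
        # Treat zeros in Spark sparse vectors as missing values.
        missing=0.0,
        # Boolean column marking validation rows (hypothetical column name).
        validationIndicatorCol="is_validation",
        max_depth=4,  # hypothetical value
    )
    return Pipeline(stages=[assembler, classifier])
```

On a cluster, a call such as `build_xgboost_pipeline(["f1", "f2"]).fit(train_df)` would train the model on a single Spark worker. Here `train_df` is a hypothetical DataFrame containing the feature columns, a `label` column, and a boolean `is_validation` column.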
To perform distributed training, you must use XGBoost’s Scala/Java packages. The examples in this section show how you can use XGBoost with MLlib. The first example shows how to embed an XGBoost model into an MLlib ML pipeline. The second example shows how to use MLlib cross validation to tune an XGBoost model.