With the Python `xgboost` package, you can train only single node workloads. To perform distributed training, you must use XGBoost’s Scala and Java packages.
XGBoost versions 1.2.0 and lower have a bug that can cause the shared Spark context to be killed if XGBoost model training fails. The only way to recover is to restart the cluster. Databricks Runtime 7.5 ML and lower include a version of XGBoost affected by this bug. To install a different version of XGBoost, see Install XGBoost on Databricks.
You can train models using the Python `xgboost` package. To train a PySpark ML pipeline with `xgboost`, see Integration with Spark MLlib (Python).
This feature is in Public Preview.
Databricks Runtime 7.6 ML and above include PySpark estimators based on the Python `xgboost` package, such as `sparkdl.xgboost.XgboostClassifier`. You can create an ML pipeline based on these estimators. For more information, see Xgboost for PySpark Pipeline.
- These estimators train the model on a single Spark worker.
- GPU clusters are not supported.
- The following parameters from the `xgboost` package are not supported:
  - The parameter `sample_weight_eval_set` is not supported. Instead, use the parameter `validationIndicatorCol`. See Xgboost for PySpark Pipeline for details.
  - The parameter `missing` has different semantics from the `xgboost` package. In the `xgboost` package, the zero values in a SciPy sparse matrix are treated as missing values regardless of the value of `missing`. For the PySpark estimators in the `sparkdl` package, zero values in a Spark sparse vector are not treated as missing values unless you set `missing=0`. If you have a sparse training dataset (most feature values are missing), Databricks recommends setting `missing=0` to reduce memory consumption and achieve better performance.