Problem: Fitting an Apache SparkML Model Throws Error


Databricks throws an error when fitting a SparkML model or Pipeline:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 162.0 failed 4 times, most recent failure: Lost task 0.3 in stage 162.0 (TID 168,, executor 1): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$9: (string) => double)


An error when fitting a SparkML model or Pipeline is often caused by a problem in the training data rather than in the model itself.


Check for the following issues:

  1. Identify and address NULL values in the dataset. Spark needs to be told how to handle missing values.
    • Drop rows that contain missing values with dropna().
    • Impute a replacement value, such as zero or the column average. Which value is appropriate depends on what is meaningful for the dataset.
  2. Ensure that all training data is transformed to a numeric format. Spark needs to be told how to handle categorical and string variables. Feature transformers such as StringIndexer and OneHotEncoder are available for these data-specific cases.

  3. Check for collinearity. Highly correlated or duplicate features may cause issues with model fitting. This is rare, but you should rule it out.