How Databricks AutoML works

This article details how Databricks AutoML works and its implementation of concepts like missing value imputation and large data sampling.

Databricks AutoML performs the following:

  1. Prepares the dataset for model training. For example, AutoML carries out imbalanced data detection for classification problems prior to model training.

  2. Iterates to train and tune multiple models, where each model is constructed from open source components and can easily be edited and integrated into your machine learning pipelines.

    • AutoML automatically distributes hyperparameter tuning trials across the worker nodes of a cluster.

    • With Databricks Runtime 9.1 LTS ML or above, AutoML automatically samples your dataset if it is too large to fit into the memory of a single worker node. See Sampling large datasets.

  3. Evaluates models based on algorithms from the scikit-learn, xgboost, LightGBM, Prophet, and ARIMA packages.

  4. Displays the results and provides a Python notebook with the source code for each trial run so you can review, reproduce, and modify the code. AutoML also calculates summary statistics on your dataset and saves this information in a notebook that you can review later.

AutoML algorithms

Databricks AutoML trains and evaluates models based on the algorithms in the following table.

Note

For classification and regression models, the decision tree, random forests, logistic regression and linear regression with stochastic gradient descent algorithms are based on scikit-learn.

Classification models

Regression models

Forecasting models

Decision trees

Decision trees

Prophet

Random forests

Random forests

Auto-ARIMA (Available in Databricks Runtime 10.3 ML and above.)

Logistic regression

Linear regression with stochastic gradient descent

XGBoost

XGBoost

LightGBM

LightGBM

Supported data feature types

Feature types not listed below are not supported. For example, images are not supported.

The following feature types are supported:

  • Numeric (ByteType, ShortType, IntegerType, LongType, FloatType, and DoubleType)

  • Boolean

  • String (categorical or English text)

  • Timestamps (TimestampType, DateType)

  • ArrayType[Numeric] (Databricks Runtime 10.4 LTS ML and above)

  • DecimalType (Databricks Runtime 11.3 LTS ML and above)

Split data into train/validation/test sets

With Databricks Runtime 10.4 LTS ML and above, you can specify a time column to use for the training/validation/testing data split for classification and regression problems. If you specify this column, the dataset is split into training, validation, and test sets by time. The earliest points are used for training, the next earliest for validation, and the latest points are used as a test set. The time column must be a timestamp, string, or integer column.

Sampling large datasets

Note

Sampling is not applied to forecasting problems.

Although AutoML distributes hyperparameter tuning trials across the worker nodes of a cluster, each model is trained on a single worker node.

AutoML automatically estimates the memory required to load and train your dataset and samples the dataset if necessary.

In Databricks Runtime 9.1 LTS ML through Databricks Runtime 10.4 LTS ML, the sampling fraction does not depend on the cluster’s node type or the amount of memory on each node.

In Databricks Runtime 11.x ML:

  • The sampling fraction increases for worker nodes that have more memory per core. You can increase the sample size by choosing a memory optimized instance type.

  • You can further increase the sample size by choosing a larger value for spark.task.cpus in the Spark configuration for the cluster. The default setting is 1; the maximum value is the number of CPUs on the worker node. When you increase this value, the sample size is larger, but fewer trials run in parallel. For example, in a machine with 4 cores and 64GB total RAM, the default spark.task.cpus=1 runs 4 trials per worker with each trial limited to 16GB RAM. If you set spark.task.cpus=4, each worker runs only one trial but that trial can use 64GB RAM.

In Databricks Runtime 12.2 LTS ML and above, AutoML can train on larger datasets by allocating more CPU cores per training task. You can increase the sample size by choosing an instance size with larger total memory.

In Databricks Runtime 11.3 LTS ML and above, if AutoML sampled the dataset, the sampling fraction is shown in the Overview tab in the UI.

For classification problems, AutoML uses the PySpark sampleBy method for stratified sampling to preserve the target label distribution.

For regression problems, AutoML uses the PySpark sample method.

Imbalanced dataset support for classification problems

In Databricks Runtime 11.3 LTS ML and above, if AutoML detects that a dataset is imbalanced, it tries to reduce the imbalance of the training dataset by downsampling the major class(es) and adding class weights. AutoML only balances the training dataset and does not balance the test and validation datasets. Doing so ensures that the model performance is always evaluated on the non-enriched dataset with the true input class distribution.

To balance an imbalanced training dataset, AutoML uses class weights that are inversely related to the degree by which a given class is downsampled. For example, if a training dataset with 100 samples has 95 samples belonging to class A and 5 samples belonging to class B, AutoML reduces this imbalance by downsampling class A to 70 samples, that is downsampling class A by a ratio of 70/95 or 0.736, while keeping the number of samples in class B at 5. To ensure that the final model is correctly calibrated and the probability distribution of the model output is the same as that of the input, AutoML scales up the class weight for class A by the ratio 1/0.736, or 1.358, while keeping the weight of class B as 1. AutoML then uses these class weights in model training as a parameter to ensure that the samples from each class are weighted appropriately when training the model.

Semantic type detection

Note

  • Semantic type detection is not applied to forecasting problems.

  • AutoML does not perform semantic type detection for columns that have custom imputation methods specified.

With Databricks Runtime 9.1 LTS ML and above, AutoML tries to detect whether columns have a semantic type that is different from the Spark or pandas data type in the table schema. AutoML treats these columns as the detected semantic type. These detections are best effort and might miss the existence of semantic types in some cases. You can also manually set the semantic type of a column or tell AutoML not to apply semantic type detection to a column using annotations.

Specifically, AutoML makes these adjustments:

  • String and integer columns that represent date or timestamp data are treated as a timestamp type.

  • String columns that represent numeric data are treated as a numeric type.

With Databricks Runtime 10.1 ML and above, AutoML also makes these adjustments:

  • Numeric columns that contain categorical IDs are treated as a categorical feature.

  • String columns that contain English text are treated as a text feature.

Semantic type annotations

With Databricks Runtime 10.1 ML and above, you can manually control the assigned semantic type by placing a semantic type annotation on a column. To manually annotate the semantic type of column <column-name> as <semantic-type>, use the following syntax:

metadata_dict = df.schema["<column-name>"].metadata
metadata_dict["spark.contentAnnotation.semanticType"] = "<semantic-type>"
df = df.withMetadata("<column-name>", metadata_dict)

<semantic-type> can be one of the following:

  • categorical: The column contains categorical values (for example, numerical values that should be treated as IDs).

  • numeric: The column contains numeric values (for example, string values that can be parsed into numbers).

  • datetime: The column contains timestamp values (string, numerical, or date values that can be converted into timestamps).

  • text: The string column contains English text.

To disable semantic type detection on a column, use the special keyword annotation native.

Shapley values (SHAP) for model explainability

Note

For MLR 11.1 and below, SHAP plots are not generated, if the dataset contains a datetime column.

The notebooks produced by AutoML regression and classification runs include code to calculate Shapley values. Shapley values are based in game theory and estimate the importance of each feature to a model’s predictions.

AutoML notebooks use the SHAP package to calculate Shapley values. Because these calculations are very memory-intensive, the calculations are not performed by default.

To calculate and display Shapley values:

  1. Go to the Feature importance section in an AutoML generated trial notebook.

  2. Set shap_enabled = True.

  3. Re-run the notebook.

Time series aggregation

For forecasting problems, when there are multiple values for a timestamp in a time series, AutoML uses the average of the values.

To use the sum instead, edit the source code notebook. In the Aggregate data by … cell, change .agg(y=(target_col, "avg")) to .agg(y=(target_col, "sum")), as shown:

group_cols = [time_col] + id_cols
df_aggregation = df_loaded \
  .groupby(group_cols) \
  .agg(y=(target_col, "sum")) \
  .reset_index() \
  .rename(columns={ time_col : "ds" })

Feature Store integration

With Databricks Runtime 11.3 LTS ML and above, you can use existing feature tables in Feature Store to augment the original input dataset for your classification and regression problems.

With Databricks Runtime 12.2 LTS ML and above, you can use existing feature tables in Feature Store to augment the original input dataset for all of your AutoML problems: classification, regression, and forecasting.

To create a feature table, see What is a feature store?.

To use existing feature tables, you can select feature tables with the AutoML UI or set the feature_store_lookups parameter in your AutoML run specification.

feature_store_lookups = [
  {
     "table_name": "example.trip_pickup_features",
     "lookup_key": ["pickup_zip", "rounded_pickup_datetime"],
  },
  {
      "table_name": "example.trip_dropoff_features",
     "lookup_key": ["dropoff_zip", "rounded_dropoff_datetime"],
  }
]

Trial notebook generation

For forecasting experiments, AutoML generated notebooks are automatically imported to your workspace for all trials of your experiment.

For classification and regression experiments, AutoML generated notebooks for data exploration and the best trial in your experiment are automatically imported to your workspace. Generated notebooks for other experiment trials are saved as MLflow artifacts on DBFS, instead of auto-imported into your workspace. For all trials besides the best trial, the notebook_path and notebook_url in the TrialInfo Python API are not set. If you need to use these notebooks, you can manually import them into your workspace with the AutoML experiment UI or the databricks.automl.import_notebook Python API.

If you only use the data exploration notebook or best trial notebook generated by AutoML, the Source column in the AutoML experiment UI contains the link to the generated notebook for the best trial.

If you use other generated notebooks in the AutoML experiment UI, these are not automatically imported into the workspace. You can find the notebooks by clicking into each MLflow run. The IPython notebook is saved in the Artifacts section of the run page. You can download this notebook and import it into the workspace, if downloading artifacts is enabled by your workspace administrators.

Notebook example: AutoML experiment with Feature Store

The following notebook shows how to train an ML model with AutoML and Feature Store feature tables.

AutoML experiment with Feature Store example notebook

Open notebook in new tab