Databricks AutoML data preparation and processing

This article describes how Databricks AutoML prepares data for machine learning training and the data settings you can configure. You can adjust these options during experiment setup in the AutoML UI. For information about configuring these settings using the AutoML API, see the AutoML Python API reference.

Supported data feature types

Feature types not listed below are not supported. For example, images are not supported.

The following feature types are supported:

  • Numeric (ByteType, ShortType, IntegerType, LongType, FloatType, and DoubleType)

  • Boolean

  • String (categorical or English text)

  • Timestamps (TimestampType, DateType)

  • ArrayType[Numeric] (Databricks Runtime 10.4 LTS ML and above)

  • DecimalType (Databricks Runtime 11.3 LTS ML and above)

Column selection

Note

This functionality is only available for classification and regression problems.

In Databricks Runtime 10.3 ML and above, you can specify which columns AutoML should use for training. To exclude a column in the UI, uncheck it in the Include column. In the API, use the exclude_cols parameter. For more information, see Databricks AutoML Python API reference.

You cannot drop the column selected as the prediction target or as the time column to split the data.

By default, all columns are included.
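
For example, the following API sketch excludes two columns; the dataset, target column, and excluded column names are placeholders:

from databricks import automl

# Sketch: train a classification model while excluding two columns.
# `df`, "churn", and the excluded column names are hypothetical.
summary = automl.classify(
  dataset=df,
  target_col="churn",
  exclude_cols=["customer_notes", "internal_id"],
  timeout_minutes=60,
)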

Impute missing values

In Databricks Runtime 10.4 LTS ML and above, you can specify how null values are imputed. In the UI, select a method from the drop-down in the Impute with column in the table schema. In the API, use the imputers parameter. For more information, see Databricks AutoML Python API reference.

By default, AutoML selects an imputation method based on the column type and content.

Note

If you specify a non-default imputation method, AutoML does not perform semantic type detection.
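
For example, a minimal API sketch, assuming the dictionary form of the imputers parameter described in the API reference; the dataset and column names are placeholders:

from databricks import automl

# Sketch: impute the hypothetical `age` column with its mean and the
# hypothetical `state` column with a constant value.
summary = automl.regress(
  dataset=df,
  target_col="price",
  imputers={
    "age": "mean",
    "state": {"strategy": "constant", "fill_value": "unknown"},
  },
  timeout_minutes=60,
)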

Split data into train, validation, and test sets

AutoML splits your data into three sets for training, validation, and testing. Depending on the type of ML problem, you have different options for splitting the data.

Split data for regression and classification

Use the following methods to divide data into training, validation, and test sets for regression and classification tasks:

(Default) Random split: If a data split strategy isn’t specified, the dataset is randomly split into 60% for training, 20% for validation, and 20% for testing. For classification, a stratified random split ensures that each class is adequately represented in the training, validation, and test sets.

Chronological split: In Databricks Runtime 10.4 LTS ML and above, you can select a time column to create chronological train, validate, and test splits. Chronological splits use the earliest data points for training, the next earliest for validation, and the latest points for testing. The time column can be a timestamp, integer, or string column.

Manual split: In Databricks Runtime 15.3 ML and above, you can use the API to set up a manual split. Specify a split column and use the values train, validate, or test to identify rows you want to use for training, validation, and testing datasets. Any rows with split column values other than train, test, or validate are ignored and a corresponding alert is raised.
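
For example, minimal API sketches of the chronological and manual strategies; the dataset and column names are placeholders, and the parameter names are assumed to follow the AutoML Python API reference:

from databricks import automl

# Chronological split: rows are ordered by `event_date` before splitting.
summary = automl.regress(
  dataset=df,
  target_col="sales",
  time_col="event_date",
)

# Manual split (Databricks Runtime 15.3 ML and above): the `data_split`
# column should contain only the values "train", "validate", or "test".
summary = automl.regress(
  dataset=df,
  target_col="sales",
  split_col="data_split",
)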

Split data for forecasting

For forecasting tasks, AutoML uses time series cross-validation. This method incrementally extends the training dataset chronologically and performs validation on subsequent time points. Cross-validation provides a robust evaluation of a model’s performance over different temporal segments. It ensures that the forecasting model is rigorously tested against unseen future data, maintaining the relevance and accuracy of predictions.

The number of cross-validation folds depends on input table characteristics such as the number of time series, the presence of covariates, and the time series length.
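
AutoML’s exact fold construction is internal, but the expanding-window idea can be illustrated with scikit-learn’s TimeSeriesSplit:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Illustrative only: each fold trains on an expanding chronological prefix
# and validates on the time points that immediately follow it.
rows = np.arange(12)  # stand-in for 12 chronologically ordered rows
for fold, (train_idx, val_idx) in enumerate(TimeSeriesSplit(n_splits=3).split(rows)):
  print(f"fold {fold}: train={train_idx.tolist()} validate={val_idx.tolist()}")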

Sampling large datasets

Note

Sampling is not applied to forecasting problems.

Although AutoML distributes hyperparameter tuning trials across the worker nodes of a cluster, each model is trained on a single worker node.

AutoML automatically estimates the memory required to load and train your dataset and samples the dataset if necessary.

In Databricks Runtime 9.1 LTS ML through Databricks Runtime 10.4 LTS ML, the sampling fraction does not depend on the cluster’s node type or the amount of memory on each node.

In Databricks Runtime 11.x ML:

  • The sampling fraction increases for worker nodes that have more memory per core. You can increase the sample size by choosing a memory-optimized instance type.

  • You can further increase the sample size by choosing a larger value for spark.task.cpus in the Spark configuration for the cluster. The default setting is 1; the maximum value is the number of CPUs on the worker node. When you increase this value, the sample size is larger, but fewer trials run in parallel. For example, in a machine with four cores and 64GB total RAM, the default spark.task.cpus=1 runs four trials per worker, with each trial limited to 16GB RAM. If you set spark.task.cpus=4, each worker runs only one trial, but that trial can use 64GB RAM.
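
For example, to give each trial four CPUs (and, proportionally, more memory), set the following in the cluster’s Spark configuration before starting the cluster; the right value depends on the worker’s core count:

spark.task.cpus 4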

In Databricks Runtime 12.2 LTS ML and above, AutoML can train on larger datasets by allocating more CPU cores per training task. You can increase the sample size by choosing an instance size with more total memory.

In Databricks Runtime 11.3 LTS ML and above, if AutoML sampled the dataset, the sampling fraction is shown in the Overview tab in the UI.

For classification problems, AutoML uses the PySpark sampleBy method for stratified sampling to preserve the target label distribution.

For regression problems, AutoML uses the PySpark sample method.
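
The following sketch shows the two PySpark methods in isolation; it is not the internal AutoML code, and the class labels and fractions are arbitrary:

# Stratified sampling (classification): keep 50% of each target class.
fractions = {0: 0.5, 1: 0.5}
stratified = df.sampleBy("target", fractions=fractions, seed=42)

# Simple random sampling (regression): keep 50% of all rows.
random_sample = df.sample(fraction=0.5, seed=42)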

Imbalanced dataset support for classification problems

In Databricks Runtime 11.3 LTS ML and above, if AutoML detects that a dataset is imbalanced, it tries to reduce the imbalance of the training dataset by downsampling the major class(es) and adding class weights. AutoML only balances the training dataset and does not balance the test and validation datasets. Doing so ensures that the model performance is always evaluated on the non-enriched dataset with the true input class distribution.

To balance an imbalanced training dataset, AutoML uses class weights that are inversely related to the degree by which a given class is downsampled. For example, suppose a training dataset with 100 samples has 95 samples in class A and 5 samples in class B. AutoML reduces this imbalance by downsampling class A to 70 samples, that is, by a ratio of 70/95, or 0.736, while keeping the 5 samples in class B. To ensure that the final model is correctly calibrated and that the probability distribution of the model output matches that of the input, AutoML scales up the class weight for class A by the ratio 1/0.736, or 1.358, while keeping the weight of class B at 1. AutoML then passes these class weights as a parameter during model training to ensure that the samples from each class are weighted appropriately.
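
The arithmetic from this example can be reproduced in a few lines; this is illustrative only, as AutoML computes the weights internally:

# Class A: 95 samples downsampled to 70; class B: 5 samples, unchanged.
downsample_ratio_a = 70 / 95              # ≈ 0.736
class_weight_a = 1 / downsample_ratio_a   # ≈ 1.36, offsets the downsampling
class_weight_b = 1.0                      # class B was not downsampled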

Time series aggregation

For forecasting problems, when there are multiple values for a timestamp in a time series, AutoML uses the average of the values.

To use the sum instead, edit the source code notebook generated by the trial runs. In the Aggregate data by … cell, change .agg(y=(target_col, "avg")) to .agg(y=(target_col, "sum")), as shown:

# Group rows by the time column (plus any time series identifier columns) and
# aggregate duplicate timestamps by summing the target values.
group_cols = [time_col] + id_cols
df_aggregation = df_loaded \
  .groupby(group_cols) \
  .agg(y=(target_col, "sum")) \
  .reset_index() \
  .rename(columns={time_col: "ds"})  # the generated notebook expects the time column to be named "ds"

Semantic type detection

Note

  • Semantic type detection is not applied to forecasting problems.

  • AutoML does not perform semantic type detection for columns that have custom imputation methods specified.

With Databricks Runtime 9.1 LTS ML and above, AutoML tries to detect whether columns have a semantic type different from the Spark or pandas data type in the table schema. AutoML treats these columns as the detected semantic type. These detections are best effort and might sometimes miss the existence of semantic types. You can also manually set the semantic type of a column or tell AutoML not to apply semantic type detection to a column using annotations.

Specifically, AutoML makes these adjustments:

  • String and integer columns representing date or timestamp data are treated as a timestamp type.

  • String columns that represent numeric data are treated as a numeric type.

With Databricks Runtime 10.1 ML and above, AutoML also makes these adjustments:

  • Numeric columns that contain categorical IDs are treated as a categorical feature.

  • String columns that contain English text are treated as a text feature.

Semantic type annotations

With Databricks Runtime 10.1 ML and above, you can manually control the assigned semantic type by placing a semantic type annotation on a column. To manually annotate the semantic type of column <column-name> as <semantic-type>, use the following syntax:

# Read the column's current metadata, add the semantic type annotation,
# and attach the updated metadata back to the column.
metadata_dict = df.schema["<column-name>"].metadata
metadata_dict["spark.contentAnnotation.semanticType"] = "<semantic-type>"
df = df.withMetadata("<column-name>", metadata_dict)

<semantic-type> can be one of the following:

  • categorical: The column contains categorical values (for example, numerical values that should be treated as IDs).

  • numeric: The column contains numeric values (for example, string values that can be parsed into numbers).

  • datetime: The column contains timestamp values (string, numerical, or date values that can be converted into timestamps).

  • text: The string column contains English text.

To disable semantic type detection on a column, use the special keyword annotation native.
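
For example, using the same annotation mechanism as above:

metadata_dict = df.schema["<column-name>"].metadata
metadata_dict["spark.contentAnnotation.semanticType"] = "native"  # disables semantic type detection for this column
df = df.withMetadata("<column-name>", metadata_dict)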

Feature Store integration

With Databricks Runtime 11.3 LTS ML and above, you can use existing feature tables in Feature Store to augment the original input dataset for your classification and regression problems.

With Databricks Runtime 12.2 LTS ML and above, you can use existing feature tables in Feature Store to augment the original input dataset for all of your AutoML problems: classification, regression, and forecasting.

To create a feature table, see What is a feature store?

To use existing feature tables, you can select feature tables with the AutoML UI or set the feature_store_lookups parameter in your AutoML run specification.

feature_store_lookups = [
  {
    "table_name": "example.trip_pickup_features",
    "lookup_key": ["pickup_zip", "rounded_pickup_datetime"],
  },
  {
    "table_name": "example.trip_dropoff_features",
    "lookup_key": ["dropoff_zip", "rounded_dropoff_datetime"],
  },
]
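
You can then pass the lookups to an AutoML run. The following is a minimal sketch; the dataset and target column are placeholders:

from databricks import automl

summary = automl.classify(
  dataset=df,
  target_col="fare_class",
  feature_store_lookups=feature_store_lookups,
  timeout_minutes=120,
)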

Notebook example: AutoML experiment with Feature Store

The following notebook shows how to train an ML model with AutoML and Feature Store feature tables.

AutoML experiment with Feature Store example notebook
