Train ML models with the Databricks AutoML Python API

This article demonstrates how to train a model with Databricks AutoML using the Python API. To learn more about AutoML, see "What is AutoML?". The Python API provides functions to start classification, regression, and forecasting AutoML runs. Each function call trains a set of models and generates a trial notebook for each model.

The following steps describe generally how to set up an AutoML experiment using the API:

  1. Create a notebook and attach it to a cluster running Databricks Runtime 8.3 ML or above.

  2. Load a Spark or pandas DataFrame from an existing data source, or upload a data file to DBFS and load the data into the notebook. Note: Datasets that have multiple columns with the same name are not supported.

     df = spark.read.format("parquet").load("<folder-path>")
    
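  3. To start an AutoML run, pass the DataFrame to the function that matches your problem type: databricks.automl.classify, databricks.automl.regress, or databricks.automl.forecast. Each function is described in its specification section below.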

  4. When the AutoML run begins, an MLflow experiment URL appears in the console. Use this URL to monitor the progress of the run. Refresh the MLflow experiment to see the trials as they are completed.

  5. After the AutoML run completes:

    • Use the links in the output summary to navigate to the MLflow experiment or to the notebook that generated the best results.

    • Use the link to the data exploration notebook to get some insights into the data passed to AutoML. You can also attach this notebook to the same cluster and re-run the notebook to reproduce the results or do additional data analysis.

    • Use the summary object returned from the AutoML call to explore more details about the trials or to load a model trained by a given trial. Learn more about the AutoMLSummary object.

    • Clone any generated notebook from the trials and re-run the notebook by attaching it to the same cluster to reproduce the results. You can also make necessary edits and re-run them to train additional models and log them to the same experiment.
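The steps above can be sketched in code as follows. This is a minimal sketch rather than a definitive implementation: `train_classifier` is a hypothetical helper name, its `automl` argument is assumed to be the `databricks.automl` module (which is importable only on Databricks Runtime ML), and the path and column name in the trailing comment are unfilled placeholders.

```python
def train_classifier(automl, df, target_col, timeout_minutes=30):
    """Run AutoML classification and return (summary, best_trial).

    `automl` is assumed to be the `databricks.automl` module. Passing it in as
    an argument keeps this helper importable outside a Databricks notebook.
    """
    # Step 3: start the run by passing the DataFrame to the classification API.
    summary = automl.classify(
        dataset=df,
        target_col=target_col,
        timeout_minutes=timeout_minutes,
    )
    # Step 5: the returned AutoMLSummary describes every trial.
    best = summary.best_trial
    print("Best trial notebook:", best.notebook_path)
    print("Validation score:", best.evaluation_metric_score)
    return summary, best


# Intended use inside a Databricks notebook (placeholders left unfilled):
#   from databricks import automl
#   df = spark.read.format("parquet").load("<folder-path>")
#   summary, best = train_classifier(automl, df, target_col="<target-column>")
```

Taking the module as a parameter is only a convenience for this sketch. In a notebook you would normally call `automl.classify` directly.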

Requirements

See Requirements for AutoML experiments.

Classification specification

The following signature shows the parameters for an AutoML run that trains a classification model. For a description of each parameter, see Classification and regression parameters.

Note

The max_trials parameter is deprecated in Databricks Runtime 10.3 ML - 10.5 ML and is not supported in Databricks Runtime 11.0 ML and above. Use timeout_minutes to control the duration of an AutoML run.

databricks.automl.classify(
  dataset: Union[pyspark.DataFrame, pandas.DataFrame],
  *,
  target_col: str,
  data_dir: Optional[str] = None,
  exclude_columns: Optional[List[str]] = None,                      # <DBR> 10.3 ML and above
  exclude_frameworks: Optional[List[str]] = None,                   # <DBR> 10.3 ML and above
  experiment_dir: Optional[str] = None,                             # <DBR> 10.4 LTS ML and above
  feature_store_lookups: Optional[List[Dict]] = None,               # <DBR> 11.3 ML and above
  imputers: Optional[Dict[str, Union[str, Dict[str, Any]]]] = None, # <DBR> 10.4 LTS ML and above
  max_trials: Optional[int] = None,                                 # <DBR> 10.5 ML and below
  pos_label: Optional[Union[int, bool, str]] = None,                # <DBR> 11.1 ML and above
  primary_metric: str = "f1",
  time_col: Optional[str] = None,
  timeout_minutes: Optional[int] = None,
) -> AutoMLSummary
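Because several parameters depend on the runtime version, it can help to check a set of keyword arguments before starting a run. The sketch below copies the version bounds from the comments in the signature above. `unsupported_params` is a hypothetical helper, not part of the AutoML API, and representing the runtime as a `(major, minor)` tuple is an assumption made for this example.

```python
# Minimum Databricks Runtime ML version for each optional parameter,
# copied from the comments in the signature above.
MIN_RUNTIME = {
    "exclude_columns": (10, 3),
    "exclude_frameworks": (10, 3),
    "experiment_dir": (10, 4),
    "imputers": (10, 4),
    "pos_label": (11, 1),
    "feature_store_lookups": (11, 3),
}
# max_trials is accepted only in Databricks Runtime 10.5 ML and below.
MAX_TRIALS_LAST_RUNTIME = (10, 5)


def unsupported_params(runtime, kwargs):
    """Return the names in `kwargs` that `runtime` (major, minor) does not accept."""
    rejected = []
    for name in kwargs:
        if name == "max_trials":
            if runtime > MAX_TRIALS_LAST_RUNTIME:
                rejected.append(name)
        elif runtime < MIN_RUNTIME.get(name, (0, 0)):
            rejected.append(name)
    return rejected


# On 11.0 ML, max_trials is no longer supported and pos_label is not yet available.
print(unsupported_params((11, 0), {"max_trials": 5, "pos_label": 1, "timeout_minutes": 30}))
```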

Regression specification

The following signature shows the parameters for an AutoML run that trains a regression model. For a description of each parameter, see Classification and regression parameters.

Note

The max_trials parameter is deprecated in Databricks Runtime 10.3 ML - 10.5 ML and is not supported in Databricks Runtime 11.0 ML and above. Use timeout_minutes to control the duration of an AutoML run.

databricks.automl.regress(
  dataset: Union[pyspark.DataFrame, pandas.DataFrame],
  *,
  target_col: str,
  data_dir: Optional[str] = None,
  exclude_columns: Optional[List[str]] = None,                      # <DBR> 10.3 ML and above
  exclude_frameworks: Optional[List[str]] = None,                   # <DBR> 10.3 ML and above
  experiment_dir: Optional[str] = None,                             # <DBR> 10.4 LTS ML and above
  feature_store_lookups: Optional[List[Dict]] = None,               # <DBR> 11.3 ML and above
  imputers: Optional[Dict[str, Union[str, Dict[str, Any]]]] = None, # <DBR> 10.4 LTS ML and above
  max_trials: Optional[int] = None,                                 # <DBR> 10.5 ML and below
  primary_metric: str = "r2",
  time_col: Optional[str] = None,
  timeout_minutes: Optional[int] = None,
) -> AutoMLSummary
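A call to the regression function might look like the sketch below. It checks `primary_metric` against the regression metrics listed in this article before starting the run. `train_regressor` is a hypothetical wrapper, and its `automl` argument is assumed to be the `databricks.automl` module.

```python
# Regression metrics listed under "Classification and regression parameters".
REGRESSION_METRICS = ("r2", "mae", "rmse", "mse")


def train_regressor(automl, df, target_col, primary_metric="r2", timeout_minutes=30):
    """Validate the metric name, then start an AutoML regression run."""
    if primary_metric not in REGRESSION_METRICS:
        raise ValueError(
            f"primary_metric must be one of {REGRESSION_METRICS}, got {primary_metric!r}"
        )
    # All parameters after `dataset` are keyword-only in the signature above.
    return automl.regress(
        dataset=df,
        target_col=target_col,
        primary_metric=primary_metric,
        timeout_minutes=timeout_minutes,
    )
```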

Forecasting specification

The following signature shows the parameters for an AutoML run that trains a forecasting model. For a description of each parameter, see Forecasting parameters. To use Auto-ARIMA, the time series must have a regular frequency (that is, the interval between any two consecutive points must be the same throughout the time series). The frequency must match the frequency unit specified in the API call. AutoML handles missing time steps by filling in those values with the previous value.

databricks.automl.forecast(
  dataset: Union[pyspark.sql.dataframe.DataFrame, pandas.core.frame.DataFrame, pyspark.pandas.DataFrame],
  *,
  target_col: str,
  time_col: str,
  data_dir: Optional[str] = None,
  exclude_frameworks: Optional[List[str]] = None,
  experiment_dir: Optional[str] = None,
  frequency: str = "D",
  horizon: int = 1,
  identity_col: Optional[Union[str, List[str]]] = None,
  output_database: Optional[str] = None,                            # <DBR> 10.5 ML and above
  primary_metric: str = "smape",
  timeout_minutes: Optional[int] = None,
) -> AutoMLSummary
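The regular-frequency requirement and the fill behavior described above can be illustrated with plain Python. This sketch only demonstrates the two ideas and is not AutoML's internal implementation. The function names are hypothetical, and it assumes fixed-length steps (such as days or hours) that a `timedelta` can represent.

```python
from datetime import datetime, timedelta


def has_regular_frequency(timestamps):
    """Return True if the gap between consecutive timestamps is constant."""
    ordered = sorted(timestamps)
    gaps = {later - earlier for earlier, later in zip(ordered, ordered[1:])}
    return len(gaps) <= 1


def fill_missing_steps(series, step):
    """Give each missing time step the previous observed value.

    `series` maps datetime -> value and `step` is the expected timedelta.
    """
    if not series:
        return {}
    ordered = sorted(series)
    filled = {}
    current, last_value = ordered[0], series[ordered[0]]
    while current <= ordered[-1]:
        # Keep the previous value whenever this time step has no observation.
        last_value = series.get(current, last_value)
        filled[current] = last_value
        current += step
    return filled


start = datetime(2023, 1, 1)
daily = {start: 1.0, start + timedelta(days=2): 3.0}  # January 2 is missing
print(has_regular_frequency(daily))   # the single two-day gap counts as regular
print(fill_missing_steps(daily, timedelta(days=1)))
```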

Classification and regression parameters

Note

For classification and regression problems only, you can:

  • Specify which columns to include in training.

  • Select custom imputation methods.

  • Pass existing feature tables from Feature Store.

Field Name

Type

Description

dataset

pyspark.DataFrame or pandas.DataFrame

Input DataFrame that contains training features and target.

target_col

str

Column name for the target label.

data_dir

str of format dbfs:/<folder-name>

(Optional) DBFS path used to store the training dataset. This path is visible to both driver and worker nodes. If empty, AutoML saves the training dataset as an MLflow artifact.

exclude_columns

List[str]

(Optional) List of columns to ignore during AutoML calculations.

Default: []

exclude_frameworks

List[str]

(Optional) List of algorithm frameworks that AutoML should not consider as it develops models. Possible values: empty list, or one or more of “sklearn”, “lightgbm”, “xgboost”.

Default: [] (all frameworks are considered)

experiment_dir

str

(Optional) Path to the directory in the workspace to save the generated notebooks and experiments.

Default: /Users/<username>/databricks_automl/

feature_store_lookups

List[Dict]

(Optional) List of dictionaries that represent features from Feature Store for data augmentation. Valid keys in each dictionary are:

  • table_name (str): Required. Name of the feature table.

  • lookup_key (list or str): Required. Column name(s) to use as key when joining the feature table with the data passed in the dataset param. The order of the column names must match the order of the primary keys of the feature table.

  • timestamp_lookup_key (str): Required if the specified table is a time series feature table. The column name to use when performing point-in-time lookup on the feature table with the data passed in the dataset param.

Default: []

imputers

Dict[str, Union[str, Dict[str, Any]]]

(Optional) Dictionary where each key is a column name, and each value is a string or dictionary describing the imputation strategy. If specified as a string, the value must be one of "mean", "median", or "most_frequent". To impute with a known value, specify the value as a dictionary {"strategy": "constant", "value": <desired value>}. You can also specify string options as dictionaries, for example {"strategy": "mean"}.

If no imputation strategy is provided for a column, AutoML selects a default strategy based on column type and content. If you specify a non-default imputation method, AutoML does not perform semantic type detection.

Default: {}

max_trials

int

(Optional) Maximum number of trials to run.

This parameter is available in Databricks Runtime 10.5 ML and below, but is deprecated starting in Databricks Runtime 10.3 ML. In Databricks Runtime 11.0 ML and above, this parameter is not supported.

Default: 20

If timeout_minutes=None, AutoML runs the maximum number of trials.

pos_label

Union[int, bool, str]

(Classification only) The positive class. This is useful for calculating metrics such as precision and recall. Should only be specified for binary classification problems.

primary_metric

str

Metric used to evaluate and rank model performance.

Supported metrics for regression: “r2” (default), “mae”, “rmse”, “mse”

Supported metrics for classification: “f1” (default), “log_loss”, “precision”, “accuracy”, “roc_auc”

time_col

str

Available in Databricks Runtime 10.1 ML and above.

(Optional) Column name for a time column.

If provided, AutoML tries to split the dataset into training, validation, and test sets chronologically, using the earliest points as training data and the latest points as a test set.

Accepted column types are timestamp and integer. With Databricks Runtime 10.2 ML and above, string columns are also supported. If column type is string, AutoML tries to convert it to timestamp using semantic detection. If the conversion fails, the AutoML run fails.

timeout_minutes

int

(Optional) Maximum time to wait for AutoML trials to complete. Longer timeouts allow AutoML to run more trials and identify a model with better accuracy.

Default: None (no time limit)

Minimum value: 5 minutes

An error is reported if the timeout is too short to allow at least one trial to complete.
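The `imputers` and `feature_store_lookups` values described in the table above are ordinary Python dictionaries. The sketch below shows one way to build and sanity-check them before passing them to `classify` or `regress`. Both helper functions are hypothetical, the column and table names are made up for illustration, and the `"value"` key follows the description of the constant strategy given above.

```python
STRING_STRATEGIES = ("mean", "median", "most_frequent")


def normalize_imputers(imputers):
    """Convert every imputer spec to its dictionary form and validate it."""
    normalized = {}
    for column, spec in imputers.items():
        if isinstance(spec, str):
            spec = {"strategy": spec}  # "mean" is shorthand for {"strategy": "mean"}
        spec = dict(spec)
        strategy = spec.get("strategy")
        if strategy == "constant":
            if "value" not in spec:
                raise ValueError(f"{column}: the 'constant' strategy needs a 'value'")
        elif strategy not in STRING_STRATEGIES:
            raise ValueError(f"{column}: unknown strategy {strategy!r}")
        normalized[column] = spec
    return normalized


def check_lookup(lookup, is_time_series=False):
    """Check that one feature_store_lookups entry has its required keys."""
    required = {"table_name", "lookup_key"}
    if is_time_series:
        required.add("timestamp_lookup_key")
    missing = sorted(required - set(lookup))
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return lookup


# Illustrative values only; the names below are not from a real workspace.
imputers = normalize_imputers(
    {"age": "median", "income": {"strategy": "constant", "value": 0}}
)
lookups = [check_lookup({"table_name": "example_db.customer_features", "lookup_key": "customer_id"})]
```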

Forecasting parameters

Field Name

Type

Description

dataset

pyspark.DataFrame or pandas.DataFrame

Input DataFrame that contains training features and target.

target_col

str

Column name for the target label.

time_col

str

Name of the time column for forecasting.

frequency

str

Frequency of the time series for forecasting. This is the period with which events are expected to occur. The default setting is “D” or daily data. Be sure to change the setting if your data has a different frequency.

Possible values:

“W” (weeks)

“D” / “days” / “day”

“hours” / “hour” / “hr” / “h”

“m” / “minute” / “min” / “minutes” / “T”

“S” / “seconds” / “sec” / “second”

The following are only available with Databricks Runtime 12.0 ML and above:

“M” / “month” / “months”

“Q” / “quarter” / “quarters”

“Y” / “year” / “years”

Default: “D”

horizon

int

Number of periods into the future for which forecasts should be returned. The units are the time series frequency. Default: 1

data_dir

str of format dbfs:/<folder-name>

(Optional) DBFS path used to store the training dataset. This path is visible to both driver and worker nodes. If empty, AutoML saves the training dataset as an MLflow artifact.

exclude_frameworks

List[str]

(Optional) List of algorithm frameworks that AutoML should not consider as it develops models. Possible values: empty list, or one or more of “prophet”, “arima”. Default: [] (all frameworks are considered)

experiment_dir

str

(Optional) Path to the directory in the workspace to save the generated notebooks and experiments.

Default: /Users/<username>/databricks_automl/

identity_col

Union[str, list]

(Optional) Column(s) that identify the time series for multi-series forecasting. AutoML groups by these column(s) and the time column for forecasting.

output_database

str

(Optional) If provided, AutoML saves predictions of the best model to a new table in the specified database.

Default: Predictions are not saved.

primary_metric

str

Metric used to evaluate and rank model performance. Supported metrics: “smape” (default), “mse”, “rmse”, “mae”, or “mdape”.

timeout_minutes

int

(Optional) Maximum time to wait for AutoML trials to complete. Longer timeouts allow AutoML to run more trials and identify a model with better accuracy.

Default: None (no time limit)

Minimum value: 5 minutes

An error is reported if the timeout is too short to allow at least one trial to complete.

country_code

str

Available in Databricks Runtime 12.0 ML and above. Only supported by the Prophet forecasting model.

(Optional) Two-letter country code that indicates which country’s holidays the forecasting model should use. To ignore holidays, set this parameter to an empty string (“”). For the accepted codes, see Supported countries.

Default: US (United States holidays).
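The `frequency` strings listed in the table above can be grouped by unit, which makes it easy to validate a value before calling `forecast`. This is a sketch only: the unit labels used as dictionary keys and the `frequency_unit` helper are inventions for this example, and matching is treated as case-sensitive because the table lists both "m" (minutes) and "M" (months).

```python
# Accepted `frequency` strings grouped by unit, copied from the table above.
FREQUENCY_ALIASES = {
    "week": ("W",),
    "day": ("D", "days", "day"),
    "hour": ("hours", "hour", "hr", "h"),
    "minute": ("m", "minute", "min", "minutes", "T"),
    "second": ("S", "seconds", "sec", "second"),
    "month": ("M", "month", "months"),
    "quarter": ("Q", "quarter", "quarters"),
    "year": ("Y", "year", "years"),
}
# Units that are available only with Databricks Runtime 12.0 ML and above.
NEEDS_12_0 = {"month", "quarter", "year"}


def frequency_unit(frequency, runtime=(12, 0)):
    """Return the unit label for a `frequency` string, or raise ValueError."""
    for unit, aliases in FREQUENCY_ALIASES.items():
        if frequency in aliases:
            if unit in NEEDS_12_0 and runtime < (12, 0):
                raise ValueError(
                    f"{frequency!r} requires Databricks Runtime 12.0 ML or above"
                )
            return unit
    raise ValueError(f"unsupported frequency: {frequency!r}")


# With frequency="W" and horizon=4, forecasts cover the next four weeks.
print(frequency_unit("W"))
```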

Returns

AutoMLSummary

Summary object for an AutoML run that describes the metrics, parameters, and other details for each of the trials. You also use this object to load the model trained by a specific trial.

| Property | Type | Description |
| --- | --- | --- |
| experiment | mlflow.entities.Experiment | The MLflow experiment used to log the trials. |
| trials | List[TrialInfo] | A list containing information about all the trials that were run. |
| best_trial | TrialInfo | Info about the trial that resulted in the best weighted score for the primary metric. |
| metric_distribution | str | The distribution of weighted scores for the primary metric across all trials. |
| output_table_name | str | Used with forecasting only and only if output_database is provided. Name of the table in output_database containing the model’s predictions. |
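As a small illustration of these properties, the sketch below collects a few of them into a dictionary. `describe_run` is a hypothetical helper. It relies on the `experiment_id` attribute of `mlflow.entities.Experiment` and on the `notebook_path` property of TrialInfo, and it assumes that `output_table_name` may be absent or empty for runs other than forecasting.

```python
def describe_run(summary):
    """Collect a few AutoMLSummary properties into a plain dictionary."""
    return {
        "experiment_id": summary.experiment.experiment_id,
        "num_trials": len(summary.trials),
        "best_notebook": summary.best_trial.notebook_path,
        # Only meaningful for forecasting runs started with output_database.
        "output_table": getattr(summary, "output_table_name", None),
    }
```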

TrialInfo

Summary object for each individual trial.

| Property | Type | Description |
| --- | --- | --- |
| notebook_path | str | The path to the generated notebook for this trial in the workspace. |
| notebook_url | str | The URL of the generated notebook for this trial. |
| mlflow_run_id | str | The MLflow run ID associated with this trial run. |
| metrics | Dict[str, float] | The metrics logged in MLflow for this trial. |
| params | Dict[str, str] | The parameters logged in MLflow that were used for this trial. |
| model_path | str | The MLflow artifact URL of the model trained in this trial. |
| model_description | str | Short description of the model and the hyperparameters used for training this model. |
| duration | str | Training duration in minutes. |
| preprocessors | str | Description of the preprocessors run before training the model. |
| evaluation_metric_score | float | Score of primary metric, evaluated for the validation dataset. |

| Method | Description |
| --- | --- |
| load_model() | Load the model generated in this trial, logged as an MLflow artifact. |
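The TrialInfo properties and the `load_model()` method can be combined to compare trials and retrieve a trained model. The following is a minimal sketch with hypothetical helper names. Whether a higher score is better depends on the primary metric (for example, higher for "f1" or "r2", lower for error metrics such as "rmse"), so the caller states it explicitly here.

```python
def rank_trials(trials, higher_is_better=True):
    """Sort TrialInfo objects by their validation score, best first."""
    return sorted(
        trials,
        key=lambda trial: trial.evaluation_metric_score,
        reverse=higher_is_better,
    )


def load_top_model(summary, higher_is_better=True):
    """Load the model from the top-ranked trial with TrialInfo.load_model()."""
    top = rank_trials(summary.trials, higher_is_better)[0]
    print(top.model_description, "->", top.model_path)
    return top.load_model()
```

The `best_trial` property of the summary already identifies the best trial. Ranking the full list is useful mainly when you want to inspect the runners-up as well.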

Notebook examples

Review these notebooks to get started with AutoML.

AutoML classification example notebook

AutoML regression example notebook

AutoML forecasting example notebook

AutoML experiment with Feature Store example notebook
