Train ML models with Databricks AutoML Python API
This article demonstrates how to train a model with Databricks AutoML using the Python API. For more information, see What is AutoML?. The API provides functions to start classification, regression, and forecasting AutoML runs. Each function call trains a set of models and generates a trial notebook for each model.
The following steps describe, in general, how to set up an AutoML experiment using the API; a minimal end-to-end sketch follows the list:

1. Create a notebook and attach it to a cluster running Databricks Runtime ML.
2. Identify which table you want to use from your existing data source, or upload a data file to DBFS and create a table.
3. To start an AutoML run, pass the table name to the appropriate API specification: classification, regression, or forecasting.
4. When the AutoML run begins, an MLflow experiment URL appears in the console. Use this URL to monitor the progress of the run. Refresh the MLflow experiment to see the trials as they complete.
5. After the AutoML run completes:
   - Use the links in the output summary to navigate to the MLflow experiment or to the notebook that generated the best results.
   - Use the link to the data exploration notebook to get some insights into the data passed to AutoML. You can also attach this notebook to the same cluster and re-run it to reproduce the results or do additional data analysis.
   - Use the summary object returned from the AutoML call to explore more details about the trials or to load a model trained by a given trial. Learn more about the AutoMLSummary object.
   - Clone any generated notebook from the trials and re-run it by attaching it to the same cluster to reproduce the results. You can also make necessary edits, re-run the notebooks to train additional models, and log them to the same experiment.
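For example, a minimal end-to-end run might look like the following sketch. The table name `default.loans` and target column `default_flag` are placeholders for your own data:

```python
from databricks import automl

# Start an AutoML classification run on an existing table.
# "default.loans" and "default_flag" are placeholder names for illustration.
summary = automl.classify(
    dataset="default.loans",
    target_col="default_flag",
    timeout_minutes=30,
)

# The returned AutoMLSummary links to the MLflow experiment and the best trial.
print(summary.experiment.experiment_id)
print(summary.best_trial.notebook_url)
```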
Requirements
See Requirements for AutoML experiments.
Classification specification
The following code example configures an AutoML run for training a classification model. For additional parameters to further customize your AutoML run, see Classification and regression parameters.
Note

The `max_trials` parameter is deprecated in Databricks Runtime 10.4 ML and is not supported in Databricks Runtime 11.0 ML and above. Use `timeout_minutes` to control the duration of an AutoML run.
```python
databricks.automl.classify(
    dataset: Union[pyspark.sql.DataFrame, pandas.DataFrame, pyspark.pandas.DataFrame, str],
    *,
    target_col: str,
    data_dir: Optional[str] = None,
    exclude_cols: Optional[List[str]] = None,                          # Databricks Runtime 10.3 ML and above
    exclude_frameworks: Optional[List[str]] = None,                    # Databricks Runtime 10.3 ML and above
    experiment_dir: Optional[str] = None,                              # Databricks Runtime 10.4 LTS ML and above
    experiment_name: Optional[str] = None,                             # Databricks Runtime 12.1 ML and above
    feature_store_lookups: Optional[List[Dict]] = None,                # Databricks Runtime 11.3 LTS ML and above
    imputers: Optional[Dict[str, Union[str, Dict[str, Any]]]] = None,  # Databricks Runtime 10.4 LTS ML and above
    max_trials: Optional[int] = None,                                  # Databricks Runtime 10.5 ML and below
    pos_label: Optional[Union[int, bool, str]] = None,                 # Databricks Runtime 11.1 ML and above
    primary_metric: str = "f1",
    time_col: Optional[str] = None,
    timeout_minutes: Optional[int] = None,
) -> AutoMLSummary
```
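As a usage sketch, the call below trains classifiers on an in-memory pandas DataFrame. All column names and values here are hypothetical, and a real dataset would need far more rows:

```python
import pandas as pd
from databricks import automl

# Hypothetical training data: two features, an identifier, and a binary label.
df = pd.DataFrame({
    "age": [34, 51, 29, 44],
    "plan": ["basic", "pro", "basic", "pro"],
    "signup_id": [101, 102, 103, 104],
    "churned": ["yes", "no", "yes", "no"],
})

summary = automl.classify(
    dataset=df,
    target_col="churned",
    exclude_cols=["signup_id"],  # identifier column, not a useful feature
    pos_label="yes",             # positive class for precision/recall metrics
    primary_metric="roc_auc",
    timeout_minutes=10,
)
```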
Regression specification
The following code example configures an AutoML run for training a regression model. For additional parameters to further customize your AutoML run, see Classification and regression parameters.
Note

The `max_trials` parameter is deprecated in Databricks Runtime 10.4 ML and is not supported in Databricks Runtime 11.0 ML and above. Use `timeout_minutes` to control the duration of an AutoML run.
```python
databricks.automl.regress(
    dataset: Union[pyspark.sql.DataFrame, pandas.DataFrame, pyspark.pandas.DataFrame, str],
    *,
    target_col: str,
    data_dir: Optional[str] = None,
    exclude_cols: Optional[List[str]] = None,                          # Databricks Runtime 10.3 ML and above
    exclude_frameworks: Optional[List[str]] = None,                    # Databricks Runtime 10.3 ML and above
    experiment_dir: Optional[str] = None,                              # Databricks Runtime 10.4 LTS ML and above
    experiment_name: Optional[str] = None,                             # Databricks Runtime 12.1 ML and above
    feature_store_lookups: Optional[List[Dict]] = None,                # Databricks Runtime 11.3 LTS ML and above
    imputers: Optional[Dict[str, Union[str, Dict[str, Any]]]] = None,  # Databricks Runtime 10.4 LTS ML and above
    max_trials: Optional[int] = None,                                  # Databricks Runtime 10.5 ML and below
    primary_metric: str = "r2",
    time_col: Optional[str] = None,
    timeout_minutes: Optional[int] = None,
) -> AutoMLSummary
```
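A hedged usage sketch follows; the table name `main.housing.sales` and the column names are placeholders:

```python
from databricks import automl

summary = automl.regress(
    dataset="main.housing.sales",    # placeholder table name
    target_col="sale_price",
    primary_metric="rmse",
    exclude_frameworks=["xgboost"],  # train only sklearn and lightgbm models
    time_col="listed_at",            # chronological train/validation/test split
    timeout_minutes=60,
)
print(summary.best_trial.metrics)
```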
Forecasting specification
The following code example configures an AutoML run for training a forecasting model. For additional detail about parameters for your AutoML run, see Forecasting parameters. To use Auto-ARIMA, the time series must have a regular frequency (that is, the interval between any two points must be the same throughout the time series). The frequency must match the frequency unit specified in the API call. AutoML handles missing time steps by filling in those values with the previous value.
```python
databricks.automl.forecast(
    dataset: Union[pyspark.sql.DataFrame, pandas.DataFrame, pyspark.pandas.DataFrame, str],
    *,
    target_col: str,
    time_col: str,
    country_code: str = "US",                            # Databricks Runtime 12.0 ML and above
    data_dir: Optional[str] = None,
    exclude_frameworks: Optional[List[str]] = None,
    experiment_dir: Optional[str] = None,
    experiment_name: Optional[str] = None,               # Databricks Runtime 12.1 ML and above
    feature_store_lookups: Optional[List[Dict]] = None,  # Databricks Runtime 12.2 LTS ML and above
    frequency: str = "D",
    horizon: int = 1,
    identity_col: Optional[Union[str, List[str]]] = None,
    output_database: Optional[str] = None,               # Databricks Runtime 10.5 ML and above
    primary_metric: str = "smape",
    timeout_minutes: Optional[int] = None,
) -> AutoMLSummary
```
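A hedged usage sketch for multi-series forecasting; the table, column, and database names are placeholders:

```python
from databricks import automl

# Hypothetical daily sales table with one row per (store_id, date).
summary = automl.forecast(
    dataset="main.sales.daily",     # placeholder table name
    target_col="revenue",
    time_col="date",
    identity_col="store_id",        # one forecast series per store
    frequency="D",                  # daily data
    horizon=30,                     # forecast 30 days ahead
    output_database="forecast_db",  # placeholder; best model's predictions are saved here
    timeout_minutes=30,
)
print(summary.output_table_name)    # table holding the saved predictions
```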
Classification and regression parameters
Note
For classification and regression problems only, you can:

- Specify which columns to include in training.
- Select custom imputation methods.
| Field Name | Type | Description |
| --- | --- | --- |
| `dataset` | `str`, `pandas.DataFrame`, `pyspark.DataFrame`, `pyspark.sql.DataFrame` | Input table name or DataFrame that contains training features and target. Table name can be in format `<database_name>.<table_name>` or `<schema_name>.<table_name>` for non Unity Catalog tables. |
| `target_col` | `str` | Column name for the target label. |
| `data_dir` | `str` of format `dbfs:/<folder-name>` | (Optional) DBFS path used to store the training dataset. This path is visible to both driver and worker nodes. If empty, AutoML saves the training dataset as an MLflow artifact. |
| `exclude_cols` | `List[str]` | (Optional) List of columns to ignore during AutoML calculations. Default: `[]` |
| `exclude_frameworks` | `List[str]` | (Optional) List of algorithm frameworks that AutoML should not consider as it develops models. Possible values: empty list, or one or more of `"sklearn"`, `"lightgbm"`, `"xgboost"`. Default: `[]` (all frameworks are considered) |
| `experiment_dir` | `str` | (Optional) Path to the directory in the workspace to save the generated notebooks and experiments. Default: `/Users/<username>/databricks_automl/` |
| `experiment_name` | `str` | (Optional) Name for the MLflow experiment that AutoML creates. Default: Name is automatically generated. |
| `feature_store_lookups` | `List[Dict]` | (Optional) List of dictionaries that represent features from Feature Store for data augmentation. Valid keys in each dictionary are `table_name` (required), `lookup_key` (required), and `timestamp_lookup_key` (optional). Default: `[]` |
| `imputers` | `Dict[str, Union[str, Dict[str, Any]]]` | (Optional) Dictionary where each key is a column name, and each value is a string or dictionary describing the imputation strategy. If specified as a string, the value must be one of `"mean"`, `"median"`, or `"most_frequent"`. To impute with a known value, specify the value as a dictionary. If no imputation strategy is provided for a column, AutoML selects a default strategy based on column type and content. If you specify a non-default imputation method, AutoML does not perform semantic type detection. Default: `{}` |
| `max_trials` | `int` | (Optional) Maximum number of trials to run. This parameter is available in Databricks Runtime 10.5 ML and below, but is deprecated starting in Databricks Runtime 10.3 ML. In Databricks Runtime 11.0 ML and above, this parameter is not supported. Default: `20`. If `timeout_minutes=None`, AutoML runs the maximum number of trials. |
| `pos_label` | `Union[int, bool, str]` | (Classification only) The positive class. This is useful for calculating metrics such as precision and recall. Should only be specified for binary classification problems. |
| `primary_metric` | `str` | Metric used to evaluate and rank model performance. Supported metrics for regression: `"r2"` (default), `"mae"`, `"rmse"`, `"mse"`. Supported metrics for classification: `"f1"` (default), `"log_loss"`, `"precision"`, `"accuracy"`, `"roc_auc"`. |
| `time_col` | `str` | Available in Databricks Runtime 10.1 ML and above. (Optional) Column name for a time column. If provided, AutoML tries to split the dataset into training, validation, and test sets chronologically, using the earliest points as training data and the latest points as a test set. Accepted column types are timestamp and integer. With Databricks Runtime 10.2 ML and above, string columns are also supported; if the column type is string, AutoML tries to convert it to timestamp using semantic detection, and the AutoML run fails if the conversion fails. |
| `timeout_minutes` | `int` | (Optional) Maximum time to wait for AutoML trials to complete. Longer timeouts allow AutoML to run more trials and identify a model with better accuracy. Default: `None` (no time limit). Minimum value: 5 minutes. An error is reported if the timeout is too short to allow at least one trial to complete. |
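For example, the `imputers` and `feature_store_lookups` parameters might be combined as in this sketch. The table, column, and feature-table names are hypothetical:

```python
from databricks import automl

summary = automl.classify(
    dataset="default.transactions",  # placeholder table name
    target_col="is_fraud",
    # Per-column imputation: a string for a built-in strategy,
    # or a dictionary to impute with a constant value.
    imputers={
        "amount": "median",
        "channel": "most_frequent",
        "account_age_days": {"strategy": "constant", "fill_value": 0},
    },
    # Augment the training data with features joined from a Feature Store table.
    feature_store_lookups=[
        {
            "table_name": "feature_store.user_features",  # hypothetical feature table
            "lookup_key": "user_id",                      # join key in the dataset
        }
    ],
    timeout_minutes=30,
)
```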
Forecasting parameters
| Field Name | Type | Description |
| --- | --- | --- |
| `dataset` | `str`, `pandas.DataFrame`, `pyspark.DataFrame`, `pyspark.sql.DataFrame` | Input table name or DataFrame that contains training features and target. Table name can be in format `<database_name>.<table_name>` or `<schema_name>.<table_name>` for non Unity Catalog tables. |
| `target_col` | `str` | Column name for the target label. |
| `time_col` | `str` | Name of the time column for forecasting. |
| `frequency` | `str` | Frequency of the time series for forecasting; that is, the period with which events are expected to occur. The default setting is `"D"`, or daily data. Be sure to change the setting if your data has a different frequency. Possible values: `"W"` (weeks); `"D"` / `"days"` / `"day"`; `"hours"` / `"hour"` / `"hr"` / `"h"`; `"m"` / `"minute"` / `"min"` / `"minutes"` / `"T"`; `"S"` / `"seconds"` / `"sec"` / `"second"`. The following are available only with Databricks Runtime 12.0 ML and above: `"M"` / `"month"` / `"months"`; `"Q"` / `"quarter"` / `"quarters"`; `"Y"` / `"year"` / `"years"`. Default: `"D"` |
| `horizon` | `int` | Number of periods into the future for which forecasts should be returned. The units are the time series frequency. Default: `1` |
| `data_dir` | `str` of format `dbfs:/<folder-name>` | (Optional) DBFS path used to store the training dataset. This path is visible to both driver and worker nodes. If empty, AutoML saves the training dataset as an MLflow artifact. |
| `exclude_frameworks` | `List[str]` | (Optional) List of algorithm frameworks that AutoML should not consider as it develops models. Possible values: empty list, or one or more of `"prophet"`, `"arima"`. Default: `[]` (all frameworks are considered) |
| `experiment_dir` | `str` | (Optional) Path to the directory in the workspace to save the generated notebooks and experiments. Default: `/Users/<username>/databricks_automl/` |
| `experiment_name` | `str` | (Optional) Name for the MLflow experiment that AutoML creates. Default: Name is automatically generated. |
| `feature_store_lookups` | `List[Dict]` | (Optional) List of dictionaries that represent features from Feature Store for data augmentation. Valid keys in each dictionary are `table_name` (required), `lookup_key` (required), and `timestamp_lookup_key` (optional). Default: `[]` |
| `identity_col` | `Union[str, list]` | (Optional) Column(s) that identify the time series for multi-series forecasting. AutoML groups by these column(s) and the time column for forecasting. |
| `output_database` | `str` | (Optional) If provided, AutoML saves predictions of the best model to a new table in the specified database. Default: Predictions are not saved. |
| `primary_metric` | `str` | Metric used to evaluate and rank model performance. Supported metrics: `"smape"` (default), `"mse"`, `"rmse"`, `"mae"`, or `"mdape"`. |
| `timeout_minutes` | `int` | (Optional) Maximum time to wait for AutoML trials to complete. Longer timeouts allow AutoML to run more trials and identify a model with better accuracy. Default: `None` (no time limit). Minimum value: 5 minutes. An error is reported if the timeout is too short to allow at least one trial to complete. |
| `country_code` | `str` | Available in Databricks Runtime 12.0 ML and above. Only supported by the Prophet forecasting model. (Optional) Two-letter country code that indicates which country's holidays the forecasting model should use. To ignore holidays, set this parameter to an empty string (`""`). Default: `US` (United States holidays). |
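For example, a monthly forecast that uses holiday features might look like the following sketch. The names are placeholders, and monthly frequency requires Databricks Runtime 12.0 ML and above:

```python
from databricks import automl

summary = automl.forecast(
    dataset="main.finance.monthly_revenue",  # placeholder table name
    target_col="revenue",
    time_col="month",
    frequency="M",                 # monthly data (Databricks Runtime 12.0 ML and above)
    horizon=12,                    # forecast 12 months ahead
    country_code="DE",             # German holidays; used by Prophet models only
    exclude_frameworks=["arima"],  # train only Prophet models
    timeout_minutes=30,
)
```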
Returns
AutoMLSummary
Summary object for an AutoML run that describes the metrics, parameters, and other details for each of the trials. You can also use this object to load the model trained by a specific trial.
| Property | Type | Description |
| --- | --- | --- |
| `experiment` | `mlflow.entities.Experiment` | The MLflow experiment used to log the trials. |
| `trials` | `List[TrialInfo]` | A list containing information about all the trials that were run. |
| `best_trial` | `TrialInfo` | Info about the trial that resulted in the best weighted score for the primary metric. |
| `metric_distribution` | `str` | The distribution of weighted scores for the primary metric across all trials. |
| `output_table_name` | `str` | Used with forecasting only, and only if `output_database` is provided. Name of the table in `output_database` containing the model's predictions. |
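For example, given a `summary` returned by any of the AutoML functions, you might inspect the run as in this sketch. The metric key name in the loop is an assumption; the keys actually logged depend on the problem type and primary metric:

```python
# Overall results
print(summary.metric_distribution)
print(summary.best_trial.evaluation_metric_score)

# Per-trial details
for trial in summary.trials:
    # "val_f1_score" is an assumed metric key for a classification run;
    # inspect trial.metrics to see the keys actually logged.
    print(trial.model_description, trial.metrics.get("val_f1_score"))
```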
TrialInfo
Summary object for each individual trial.
| Property | Type | Description |
| --- | --- | --- |
| `notebook_path` | `str` | The path to the generated notebook for this trial in the workspace. |
| `notebook_url` | `str` | The URL of the generated notebook for this trial. |
| `mlflow_run_id` | `str` | The MLflow run ID associated with this trial run. |
| `metrics` | `Dict[str, float]` | The metrics logged in MLflow for this trial. |
| `params` | `Dict[str, str]` | The parameters logged in MLflow that were used for this trial. |
| `model_path` | `str` | The MLflow artifact URL of the model trained in this trial. |
| `model_description` | `str` | Short description of the model and the hyperparameters used for training this model. |
| `duration` | `str` | Training duration in minutes. |
| `preprocessors` | `str` | Description of the preprocessors run before training the model. |
| `evaluation_metric_score` | `float` | Score of the primary metric, evaluated for the validation dataset. |

| Method | Description |
| --- | --- |
| `load_model()` | Load the model generated in this trial, logged as an MLflow artifact. |
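For example, to load and score the best trial's model (a sketch; the scoring columns are hypothetical and must match the features the model was trained on):

```python
import pandas as pd

# Load the best trial's trained model from its MLflow artifact.
model = summary.best_trial.load_model()

# Hypothetical scoring data; columns must match the training features.
new_data = pd.DataFrame({"age": [40], "plan": ["pro"]})
print(model.predict(new_data))
```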
Register and deploy a model
You can register and deploy your AutoML-trained model just like any registered model in the MLflow Model Registry. See Log, load, register, and deploy MLflow models.
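As a sketch, you can register the best trial's model using its MLflow artifact URL; the registered model name `automl_churn_model` is a placeholder:

```python
import mlflow

# model_path is the MLflow artifact URL of the best trial's model.
model_uri = summary.best_trial.model_path
registered = mlflow.register_model(model_uri, "automl_churn_model")
print(registered.name, registered.version)
```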
No module named 'pandas.core.indexes.numeric'

When serving a model built using AutoML with Model Serving, you may get the error: `No module named 'pandas.core.indexes.numeric'`.

This is due to an incompatible `pandas` version between AutoML and the model serving endpoint environment. You can resolve this error by running the add-pandas-dependency.py script. The script edits the `requirements.txt` and `conda.yaml` for your logged model to include the appropriate `pandas` dependency version: `pandas==1.5.3`.

1. Modify the script to include the `run_id` of the MLflow run where your model was logged.
2. Re-register the model to the MLflow model registry.
3. Try serving the new version of the MLflow model.
Notebook examples
Review these notebooks to get started with AutoML.
The following notebook shows how to do classification with AutoML.
The following notebook shows how to do regression with AutoML.
The following notebook shows how to do forecasting with AutoML.
The following notebook shows how to train an ML model with AutoML and Feature Store feature tables.