This feature is in Public Preview.
Databricks AutoML helps you automatically apply machine learning to a dataset. It prepares the dataset for model training and then performs and records a set of trials, creating, tuning, and evaluating multiple models. It displays the results and provides a Python notebook with the source code for each trial run so you can review, reproduce, and modify the code. AutoML also calculates summary statistics on your dataset and saves this information in a notebook that you can review later.
AutoML automatically distributes hyperparameter tuning trials across the worker nodes of a cluster.
Each model is constructed from open source components and can easily be edited and integrated into your machine learning pipelines. You can use Databricks AutoML for regression or classification problems. It evaluates models based on algorithms from the scikit-learn, xgboost, and LightGBM packages.
- Databricks Runtime 8.3 ML or above.
- No libraries other than those provided with Databricks Runtime ML can be installed on the cluster.
Databricks AutoML creates and evaluates models based on these algorithms:
- Classification models
- Regression models
While AutoML distributes hyperparameter tuning trials across the worker nodes of a cluster, each model is trained on a single worker node. With Databricks Runtime 9.1 ML and above, AutoML automatically samples your dataset if it is too large to fit into the memory of a single worker node. AutoML estimates the memory required to load and train your dataset and automatically determines the sampling fraction if sampling is required. The sampled dataset is used for model training.
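The fraction-selection step described above amounts to simple arithmetic. The sketch below is illustrative only: the function name, the 50% memory headroom, and the byte figures are assumptions for the example, not AutoML's actual heuristic.

```python
def sampling_fraction(estimated_dataset_bytes: float,
                      worker_memory_bytes: float,
                      headroom: float = 0.5) -> float:
    """Illustrative sketch: choose the fraction of rows to keep so the
    sampled dataset fits within a budgeted share of one worker's memory."""
    budget = worker_memory_bytes * headroom  # leave room for training itself
    if estimated_dataset_bytes <= budget:
        return 1.0  # dataset already fits; no sampling needed
    return budget / estimated_dataset_bytes

# Example: a 64 GiB dataset on a 32 GiB worker with 50% headroom
# keeps a quarter of the rows.
frac = sampling_fraction(64 * 2**30, 32 * 2**30)  # 0.25
```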
For classification problems, AutoML uses the PySpark `sampleBy` method for stratified sampling to preserve the target label distribution. For regression problems, AutoML uses the PySpark `sample` method.
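To see why stratified sampling matters for classification, here is a small sketch in pandas. AutoML itself uses PySpark's sampleBy; the pandas groupby-based version below only illustrates the idea of sampling each label at the same rate so the class ratio survives.

```python
import pandas as pd

# An imbalanced dataset: 90 "good" rows, 10 "bad" rows.
df = pd.DataFrame({
    "feature": range(100),
    "label": ["good"] * 90 + ["bad"] * 10,
})

# Sample 50% of each label group separately, analogous to PySpark's
# sampleBy with an equal fraction per label: the 90/10 ratio is preserved.
sampled = df.groupby("label").sample(frac=0.5, random_state=0)
print(sampled["label"].value_counts().to_dict())  # {'good': 45, 'bad': 5}
```

A plain 50% random sample would preserve the ratio only in expectation; stratified sampling guarantees it exactly per group.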
With Databricks Runtime 9.1 ML and above, AutoML detects certain columns that have a semantic type that differs from their Spark or pandas data type. AutoML then converts and applies data preprocessing steps based on the detected semantic type. Specifically, AutoML performs the following conversions:
- String and integer columns that represent date or timestamp data are converted to a timestamp type.
- String columns that represent numeric data are converted to a numeric type.
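The same two conversions can be reproduced in pandas. This is a sketch of the idea only; AutoML's own detection logic is internal, and the column names below are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "event_date": ["2021-01-01", "2021-06-15"],  # strings holding dates
    "event_ts": [1609459200, 1623715200],        # integers holding Unix timestamps
    "amount": ["19.99", "5.00"],                 # strings holding numbers
})

# String and integer columns representing dates/timestamps -> timestamp type
df["event_date"] = pd.to_datetime(df["event_date"])
df["event_ts"] = pd.to_datetime(df["event_ts"], unit="s")

# String columns representing numeric data -> numeric type
df["amount"] = pd.to_numeric(df["amount"])
```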
The AutoML UI steps you through the process of training a model on a dataset. To access the UI:
Select Machine Learning from the persona switcher at the top of the left sidebar.
In the sidebar, click Create > AutoML.
You can also create a new AutoML experiment from the Experiments page.
The Configure AutoML experiment page displays. On this page, you configure the AutoML process, specifying the dataset, problem type, target or label column to predict, metric to use to evaluate and score the experiment runs, and stopping conditions.
In the Cluster field, select a cluster running Databricks Runtime 8.3 ML or above.
From the ML problem type drop-down menu, select Regression or Classification. If you are trying to predict a continuous numeric value for each observation, such as annual income, select Regression. If you are trying to assign each observation to one of a discrete set of classes, such as good credit risk or bad credit risk, select Classification.
Under Dataset, click Browse Tables. A dialog appears listing the available databases and tables. Navigate to the table you want to use and click Select. The table schema appears.
Click in the Prediction target field. A drop-down appears listing the columns shown in the schema. Select the column you want the model to predict.
The Experiment name field shows the default name. To change it, type the new name in the field.
You can specify additional configuration options under Advanced configuration (optional).
- The evaluation metric is the primary metric used to score the runs.
- You can edit the default stopping conditions. By default, the experiment stops after 60 minutes or when it has completed 200 runs, whichever comes first.
- In the Data directory field, you can enter a DBFS location where intermediate data generated during the AutoML process is saved. If you leave the field blank, intermediate data is saved as MLflow artifacts.
Click Start AutoML. The experiment starts to run, and the AutoML training page appears. To refresh the runs table, click the refresh button.
From this page, you can:
- Stop the experiment at any time
- Open the data exploration notebook
- Monitor runs
- Navigate to the run page for any run
When the experiment completes, you can:
- Register and deploy one of the models with MLflow.
- Click Edit best model to review and edit the notebook that created the best model.
- Open the data exploration notebook.
- Search, filter, and sort the runs in the runs table.
- See details for any run:
- To open the notebook containing the source code for a trial run, click in the Source column.
- To view the run page with details about a trial run, click in the Start Time column.
- To see information about the model that was created, including code snippets to make predictions, click in the Models column.
To return to this AutoML experiment later, find it in the table on the Experiments page.
When a run completes, the best model (based on the primary metric) appears in the top row of the runs table. Click the link in the Models column for the model you want to register.
The run page for the run that created the model appears, with the artifacts section displayed.
Click the Register Model button to register the model in Model Registry.
Click Models in the sidebar to navigate to the Model Registry.
Create a notebook and attach it to a cluster running Databricks Runtime 8.3 ML or above.
Load a Spark or pandas DataFrame from an existing data source or upload a data file to DBFS and load the data into the notebook.
df = spark.read.parquet("<folder-path>")
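If your data lives in a file rather than a table, a pandas DataFrame works just as well as a Spark one. The sketch below writes a tiny CSV to a temporary location to stand in for an existing data file; the column names are made up for the example.

```python
import os
import tempfile

import pandas as pd

# Create a small CSV to stand in for an existing data file (illustrative only).
path = os.path.join(tempfile.mkdtemp(), "income.csv")
pd.DataFrame({"age": [25, 40], "income": [40000, 85000]}).to_csv(path, index=False)

# Load it as a pandas DataFrame; AutoML accepts pandas or Spark DataFrames.
df = pd.read_csv(path)
```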
To start an AutoML run, pass the DataFrame to AutoML. See the API docs for details.
When the AutoML run begins, an MLflow experiment URL appears in the console. Use this URL to monitor the progress of the run. Refresh the MLflow experiment to see the trials as they are completed.
After the AutoML run completes:
- Use the links in the output summary to navigate to the MLflow experiment or to the notebook that generated the best results.
- Use the link to the data exploration notebook to get some insights into the data passed to AutoML. You can also attach this notebook to the same cluster and re-run the notebook to reproduce the results or do additional data analysis.
- Use the summary object returned from the AutoML call to explore more details about the trials or to load a model trained by a given trial. See the API docs for details.
- Clone any generated notebook from the trials and re-run the notebook by attaching it to the same cluster to reproduce the results. You can also make necessary edits and re-run them to train additional models and log them to the same experiment.
The Python API provides functions to start classification and regression AutoML runs. Each function call trains a set of models and generates a trial notebook for each model.
databricks.automl.classify(
    dataset: Union[pyspark.DataFrame, pandas.DataFrame],
    *,
    target_col: str,
    primary_metric: Optional[str],
    data_dir: Optional[str],
    timeout_minutes: Optional[int],
    max_trials: Optional[int],
) -> AutoMLSummary

databricks.automl.regress(
    dataset: Union[pyspark.DataFrame, pandas.DataFrame],
    *,
    target_col: str,
    primary_metric: Optional[str],
    data_dir: Optional[str],
    timeout_minutes: Optional[int],
    max_trials: Optional[int],
) -> AutoMLSummary
| Field | Type | Description |
| --- | --- | --- |
| dataset | pyspark.DataFrame or pandas.DataFrame | Input DataFrame that contains training features and target. |
| target_col | str | Column name for the target label. |
| primary_metric | str | Metric used to evaluate and rank model performance. Supported metrics for regression: "r2" (default), "mae", "rmse", "mse". Supported metrics for classification: "f1" (default), "log_loss", "precision", "accuracy", "roc_auc". |
| data_dir | str | DBFS path used to store intermediate data. This path is visible to both driver and worker nodes. If empty, AutoML saves intermediate data as MLflow artifacts. |
| timeout_minutes | int | Optional. Maximum time to wait for AutoML trials to complete. If omitted, AutoML runs trials without any time restriction (the default). AutoML throws an exception if the timeout is less than 5 minutes or is not long enough to run at least one trial. Longer timeouts allow AutoML to run more trials and produce a more accurate model. |
| max_trials | int | Optional. Maximum number of trials to run. The default value is 20. When timeout_minutes=None, the maximum number of trials runs to completion. |
Summary object for an AutoML run that describes the metrics, parameters, and other details for each of the trials. You can also use this object to load the model trained by a specific trial.
| Field | Type | Description |
| --- | --- | --- |
| experiment | mlflow.entities.Experiment | The MLflow experiment used to log the trials. |
| trials | List[TrialInfo] | A list containing information about all the trials that were run. |
| best_trial | TrialInfo | Information about the trial that produced the best weighted score for the primary metric. |
| metric_distribution | str | The distribution of weighted scores for the primary metric across all trials. |
Summary object for each individual trial.
| Field | Type | Description |
| --- | --- | --- |
| notebook_path | str | The path in the workspace to the notebook generated for this trial. |
| notebook_url | str | The URL of the notebook generated for this trial. |
| mlflow_run_id | str | The MLflow run ID associated with this trial run. |
| metrics | Dict[str, float] | The metrics logged in MLflow for this trial. |
| params | Dict[str, str] | The parameters logged in MLflow that were used for this trial. |
| model_path | str | The MLflow artifact URL of the model trained in this trial. |
| model_description | str | Short description of the model and the hyperparameters used to train it. |
| duration | str | Training duration in minutes. |
| preprocessors | str | Description of the preprocessors run before training the model. |
| evaluation_metric_score | float | Score of the primary metric, evaluated on the validation dataset. |

| Method | Description |
| --- | --- |
| load_model() | Load the model generated in this trial, logged as an MLflow artifact. |
With Databricks Runtime 9.1 ML and above, AutoML depends on the databricks-automl-runtime package, which contains components that are useful outside of AutoML and also helps simplify the notebooks generated by AutoML training.
databricks-automl-runtime is available on PyPI.
Only classification and regression problems are supported.
Only the following feature types are supported:
- Numeric
- String (only categorical)
- Timestamps
Feature types not listed above are not supported. For example, images and text are not supported.
With Databricks Runtime 9.0 ML and below, AutoML training uses the full training dataset on a single node. The training dataset must fit into the memory of a single worker node. If you run into out-of-memory issues, try using a worker node with more memory. See Create a cluster.
Alternatively, if possible, use Databricks Runtime 9.1 ML or above, where AutoML automatically samples your dataset if it is too large to fit into the memory of a single worker node.