Train ML models with Databricks AutoML Python API
This article demonstrates how to train a model with Databricks AutoML using the AutoML Python API. See Databricks AutoML Python API reference for more details.
The API provides functions to start classification, regression, and forecasting AutoML runs. Each function call trains a set of models and generates a trial notebook for each model.
See Requirements for AutoML experiments.
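As a rough illustration of the three entry points, the sketch below dispatches to automl.classify, automl.regress, or automl.forecast by task name. The helper start_automl_run and its task argument are hypothetical names introduced here, not part of the AutoML API. The import is deferred on the assumption that databricks.automl is only available on a cluster running Databricks Runtime ML.

```python
# Hypothetical helper for illustration; not part of the AutoML API.
_TASKS = ("classify", "regress", "forecast")


def start_automl_run(task, dataset, target_col, **params):
    """Start an AutoML run for the given task and return its summary."""
    if task not in _TASKS:
        raise ValueError(f"task must be one of {_TASKS}, got {task!r}")
    # Deferred import: assumes a cluster running Databricks Runtime ML.
    from databricks import automl

    run = getattr(automl, task)
    # Any extra training parameters (for example, time_col for forecasting)
    # are passed through unchanged.
    return run(dataset=dataset, target_col=target_col, **params)
```

Under these assumptions, start_automl_run("regress", train_pdf, "col_to_predict") makes the same call as the automl.regress example later in this article.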
Set up an experiment using the AutoML API
The following steps generally describe how to set up an AutoML experiment using the API:
Create a notebook and attach it to a cluster running Databricks Runtime ML.
Identify which table you want to use from your existing data source, or upload a data file to DBFS and create a table.
To start an AutoML run, use the automl.regress() or automl.classify() function and pass the table, along with any other training parameters. To see all functions and parameters, see Databricks AutoML Python API reference. For example:
from databricks import automl
summary = automl.regress(dataset=train_pdf, target_col="col_to_predict")
When the AutoML run begins, an MLflow experiment URL appears in the console. Use this URL to monitor the run’s progress. Refresh the MLflow experiment to see the trials as they are completed.
After the AutoML run completes:
Use the links in the output summary to navigate to the MLflow experiment or the notebook that generated the best results.
Use the link to the data exploration notebook to gain insights into the data passed to AutoML. You can also attach this notebook to the same cluster and re-run it to reproduce the results or do additional data analysis.
Use the summary object returned from the AutoML call to explore more details about the trials or to load a model trained by a given trial. Learn more about the AutoMLSummary object.
Clone any generated notebook from the trials and re-run it by attaching it to the same cluster to reproduce the results. You can also edit the cloned notebooks and re-run them to train additional models and log them to the same experiment.
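The sketch below shows one way to pull a few details out of the summary object returned by an AutoML call. The function name describe_best_trial is hypothetical. The attribute names (experiment, trials, best_trial, mlflow_run_id, notebook_path, model_path, metrics) are assumed to match the AutoMLSummary and TrialInfo objects described in the API reference, so check them against your Databricks Runtime ML version.

```python
def describe_best_trial(summary):
    """Collect a few details about the best trial of an AutoML run.

    `summary` is assumed to be the object returned by automl.classify(),
    automl.regress(), or automl.forecast().
    """
    best = summary.best_trial
    return {
        "experiment_id": summary.experiment.experiment_id,
        "num_trials": len(summary.trials),
        "mlflow_run_id": best.mlflow_run_id,
        "notebook_path": best.notebook_path,
        "model_path": best.model_path,
        "metrics": dict(best.metrics),
    }
```

To work with the trained model itself, the trial object is also documented to expose a load_model() method, for example summary.best_trial.load_model().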
Import a notebook
To import a notebook saved as an MLflow artifact, use the databricks.automl.import_notebook Python API. For more information, see Import notebook.
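A minimal sketch of such an import follows. The wrapper import_trial_notebook and its argument names are hypothetical. The call assumes that import_notebook takes an artifact URI, a destination workspace path, and an optional overwrite flag, and that it returns an object with path and url attributes. Confirm the signature in the API reference before relying on it.

```python
def import_trial_notebook(artifact_uri, workspace_path, overwrite=False):
    """Import a trial notebook stored as an MLflow artifact into the workspace.

    Returns the (path, url) of the imported notebook.
    """
    # Deferred import: assumes a cluster running Databricks Runtime ML.
    from databricks import automl

    result = automl.import_notebook(
        artifact_uri, workspace_path, overwrite=overwrite
    )
    return result.path, result.url
```

The artifact URI would typically come from a trial in the summary object, such as summary.best_trial.artifact_uri, assuming that attribute is available in your runtime version.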
Register and deploy a model
You can register and deploy your AutoML-trained model just like any registered model in the MLflow model registry; see Log, load, register, and deploy MLflow models.
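As a hedged example of the registration step, the sketch below registers the model logged by a trial's MLflow run using mlflow.register_model. The helper name register_automl_model is hypothetical, and the default artifact path "model" is an assumption about where the trial notebook logged the model. Adjust it if your run uses a different path.

```python
def register_automl_model(run_id, model_name, artifact_path="model"):
    """Register the model logged by an MLflow run under `model_name`.

    `artifact_path` defaults to "model", which is an assumption about
    where the trial notebook logged the model.
    """
    import mlflow

    model_uri = f"runs:/{run_id}/{artifact_path}"
    return mlflow.register_model(model_uri, model_name)
```

The run ID can be read from the summary object, for example summary.best_trial.mlflow_run_id.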
No module named pandas.core.indexes.numeric
When serving a model built using AutoML with Model Serving, you may get the error: No module named 'pandas.core.indexes.numeric'.
This is due to an incompatible pandas version between AutoML and the model serving endpoint environment. You can resolve this error by running the add-pandas-dependency.py script. The script edits the requirements.txt and conda.yaml for your logged model to include the appropriate pandas dependency version: pandas==1.5.3.
Modify the script to include the run_id of the MLflow run where your model was logged.
Re-register the model to the MLflow model registry.
Try serving the new version of the MLflow model.
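To make the requirements.txt change concrete, the sketch below pins pandas in the text of a requirements file. It is not the add-pandas-dependency.py script and does not touch conda.yaml or MLflow artifacts. The function name pin_pandas is hypothetical, and the code only illustrates the kind of edit described above.

```python
import re

PANDAS_PIN = "pandas==1.5.3"


def pin_pandas(requirements_text, pin=PANDAS_PIN):
    """Return requirements.txt content with pandas pinned to `pin`."""
    kept = []
    for line in requirements_text.splitlines():
        # Extract the package name before any version specifier or extras.
        name = re.split(r"[<>=!~\[; ]", line.strip(), maxsplit=1)[0]
        if name.lower() == "pandas":
            # Drop any existing pandas requirement, pinned or not.
            continue
        kept.append(line)
    kept.append(pin)
    return "\n".join(kept) + "\n"
```

Packages whose names merely start with "pandas", such as pandas-profiling, are left untouched because only an exact name match is replaced.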