Use scikit-learn on Databricks

This page provides examples of using the scikit-learn package to train machine learning models on Databricks. scikit-learn is one of the most popular Python libraries for single-node machine learning and is included in both Databricks Runtime and Databricks Runtime ML. To find the scikit-learn version included with your cluster’s runtime, see the Databricks Runtime release notes.

You can import these notebooks and run them in your Databricks workspace.

For additional example notebooks to get started quickly on Databricks, see Tutorials: Get started with ML.

Basic example using scikit-learn

This notebook provides a quick overview of machine learning model training on Databricks. It uses the scikit-learn package to train a simple classification model. It also illustrates the use of MLflow to track the model development process, and Hyperopt to automate hyperparameter tuning.
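The following is a minimal sketch of the kind of workflow the notebook covers, not the notebook's actual code. It assumes scikit-learn's built-in wine dataset and a random forest classifier as illustrative choices, and it omits Hyperopt tuning. The `log_run` helper is a hypothetical name; it assumes the `mlflow` package is importable in your environment and uses only the standard `start_run`, `log_params`, and `log_metric` tracking calls.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Illustrative data: scikit-learn's built-in wine dataset (an assumption,
# not necessarily the data used in the notebook).
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Hyperparameters are fixed here; the notebook tunes them with Hyperopt.
params = {"n_estimators": 100, "max_depth": 5, "random_state": 0}
model = RandomForestClassifier(**params).fit(X_train, y_train)

# Evaluate on the held-out split.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.3f}")


def log_run(params, accuracy):
    """Record one training run in MLflow (hypothetical helper).

    Assumes the mlflow package is installed; the import is deferred so the
    training code above runs without it.
    """
    import mlflow

    with mlflow.start_run():
        mlflow.log_params(params)
        mlflow.log_metric("test_accuracy", accuracy)
```

In a notebook, calling `log_run(params, accuracy)` after training would record the hyperparameters and the metric as a single MLflow run.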

If your workspace is enabled for Unity Catalog, use this version of the notebook:

scikit-learn classification notebook (Unity Catalog)


If your workspace is not enabled for Unity Catalog, use this version of the notebook:

scikit-learn classification notebook


End-to-end example using scikit-learn on Databricks

This notebook uses scikit-learn to walk through a complete end-to-end example: data loading, model training, distributed hyperparameter tuning, and model inference. It also illustrates model lifecycle management, using MLflow Model Registry to log and register your model.
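The stages listed above can be sketched on a single node as follows. This is an illustration under stated assumptions, not the notebook's code: it uses scikit-learn's built-in breast cancer dataset, a logistic regression pipeline, and scikit-learn's `GridSearchCV` as a single-node stand-in for the distributed tuning the notebook performs. The `register_model` helper and the model name passed to it are hypothetical; it assumes `mlflow` is installed and a model registry is available.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 1. Load data (illustrative built-in dataset, an assumption).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 2. Define the model: feature scaling followed by logistic regression.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 3. Tune hyperparameters. GridSearchCV runs on one node; it stands in
#    here for the distributed search used in the notebook.
search = GridSearchCV(
    pipeline,
    param_grid={"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# 4. Run inference with the best model found by the search.
best_model = search.best_estimator_
predictions = best_model.predict(X_test)
test_accuracy = best_model.score(X_test, y_test)
print(f"best params: {search.best_params_}, test accuracy: {test_accuracy:.3f}")


def register_model(model, registered_name):
    """Log the model to MLflow and register it (hypothetical helper).

    Assumes mlflow is installed and a model registry is reachable;
    registered_name is a placeholder chosen by the caller.
    """
    import mlflow
    import mlflow.sklearn

    with mlflow.start_run():
        mlflow.sklearn.log_model(
            model, "model", registered_model_name=registered_name
        )
```

Keeping registration in a separate function means the training and inference steps run without MLflow; in a workspace, you would call `register_model(best_model, ...)` with a name that fits your registry.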

If your workspace is enabled for Unity Catalog, use this version of the notebook:

Use scikit-learn with MLflow integration on Databricks (Unity Catalog)


If your workspace is not enabled for Unity Catalog, use this version of the notebook:

Use scikit-learn with MLflow integration on Databricks


Track scikit-learn model training with MLflow