Model training examples

This section includes examples showing how to train machine learning and deep learning models on Databricks using many popular open-source libraries.

You can also use AutoML, which automatically prepares a dataset for model training, performs a set of trials using open-source libraries such as scikit-learn and XGBoost, and creates a Python notebook with the source code for each trial run so you can review, reproduce, and modify the code.
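
To give a sense of what an AutoML call looks like in a notebook, here is a minimal sketch, assuming an existing Spark DataFrame `df` with a target column named `label`; the `primary_metric` and `timeout_minutes` values are illustrative placeholders, not defaults:

```python
from databricks import automl

# df is an existing Spark (or pandas) DataFrame; "label", "f1", and 30
# are placeholder values chosen for this sketch.
summary = automl.classify(
    dataset=df,
    target_col="label",
    primary_metric="f1",
    timeout_minutes=30,
)

# The returned summary references the generated trial notebooks and the best
# run, which you can open, review, and modify like any other notebook.
print(summary.best_trial.mlflow_run_id)
```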

For an example notebook that shows how to train a machine learning model that uses data in Unity Catalog and write predictions back to Unity Catalog, see Train and register machine learning models with Unity Catalog.
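
A condensed sketch of that workflow follows; the three-level table names (`main.default.*`), the registered model name, and the scikit-learn model are placeholders rather than the contents of the linked notebook, and `spark` is the SparkSession provided in Databricks notebooks:

```python
import mlflow
from sklearn.ensemble import RandomForestClassifier

# Read training data from a Unity Catalog table (catalog.schema.table).
train_pdf = spark.table("main.default.training_data").toPandas()
X_train = train_pdf.drop(columns=["label"])
y_train = train_pdf["label"]

# Register the trained model in Unity Catalog rather than the workspace registry.
mlflow.set_registry_uri("databricks-uc")

with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="main.default.example_classifier",
    )

# Score a table with the same feature columns and write predictions back
# to Unity Catalog.
score_pdf = spark.table("main.default.scoring_data").toPandas()
score_pdf["prediction"] = model.predict(score_pdf)
spark.createDataFrame(score_pdf).write.mode("overwrite").saveAsTable(
    "main.default.predictions"
)
```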

Machine learning examples

| Package | Notebook(s) | Features |
| --- | --- | --- |
| scikit-learn | Machine learning quickstart | Classification model, MLflow, automated hyperparameter tuning with Hyperopt and MLflow |
| scikit-learn | End-to-end example | Classification model, MLflow, automated hyperparameter tuning with Hyperopt and MLflow, XGBoost, Model Registry, Model Serving |
| MLlib | MLlib examples | Binary classification, decision trees, GBT regression, Structured Streaming, custom transformer |
| xgboost | XGBoost examples | Python, PySpark, and Scala; single-node workloads and distributed training |
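
The scikit-learn quickstart and end-to-end notebooks pair a classifier with Hyperopt for tuning and MLflow for tracking. A stripped-down sketch of that pattern, using a toy dataset and an illustrative search space rather than the notebooks' own code, looks roughly like this:

```python
import mlflow
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

def objective(params):
    # Each evaluation is logged as a nested MLflow run.
    with mlflow.start_run(nested=True):
        clf = RandomForestClassifier(
            n_estimators=int(params["n_estimators"]),
            max_depth=int(params["max_depth"]),
        )
        score = cross_val_score(clf, X, y, cv=3).mean()
        mlflow.log_params(params)
        mlflow.log_metric("cv_accuracy", score)
    # Hyperopt minimizes, so return the negative accuracy as the loss.
    return {"loss": -score, "status": STATUS_OK}

search_space = {
    "n_estimators": hp.quniform("n_estimators", 50, 300, 50),
    "max_depth": hp.quniform("max_depth", 2, 10, 1),
}

with mlflow.start_run():
    best = fmin(fn=objective, space=search_space, algo=tpe.suggest,
                max_evals=16, trials=Trials())
print(best)
```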

Deep learning examples

Also see Best practices for deep learning on Databricks.

| Package | Notebook | Features |
| --- | --- | --- |
| TensorFlow Keras | Deep learning quickstart | TensorFlow Keras, TensorBoard, Hyperopt, MLflow |
| TensorFlow (single node) | TensorFlow tutorial with MNIST dataset | TensorFlow, TensorBoard |
| PyTorch (single node) | PyTorch tutorial with MNIST dataset | PyTorch |
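
The deep learning quickstart combines a small Keras model with MLflow tracking; a single-node sketch of that idea, with a placeholder architecture and MLflow autologging standing in for the notebook's full workflow, might look like this:

```python
import mlflow
import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

# Log parameters, metrics, and the model automatically to MLflow.
mlflow.tensorflow.autolog()

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

with mlflow.start_run():
    model.fit(x_train, y_train, epochs=2, batch_size=128, validation_split=0.1)
```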

For distributed deep learning training, see:

| Package | Notebook | Features |
| --- | --- | --- |
| HorovodRunner (TensorFlow Keras) | TensorFlow Keras MNIST example | TensorFlow Keras single node to distributed training |
| HorovodRunner (PyTorch) | PyTorch MNIST example | PyTorch single node to distributed training |
| HorovodRunner | Horovod timeline | Horovod timeline |
| horovod.spark (PyTorch and Keras) | horovod.spark package | horovod.spark estimator API for use in ML pipelines with Keras and PyTorch |
| spark-tensorflow-distributor | Distributed Training with TensorFlow | Distributed training with TensorFlow on Apache Spark clusters |
| TorchDistributor | Distributed training with TorchDistributor | Distributed training with PyTorch on Apache Spark clusters |
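
As an illustration of the TorchDistributor row above, here is a minimal sketch of wrapping a single-process PyTorch training function with `pyspark.ml.torch.distributor.TorchDistributor`; the process count, backend, and training body are placeholders, not the linked notebook's code:

```python
from pyspark.ml.torch.distributor import TorchDistributor

def train_fn(learning_rate):
    # Ordinary single-process PyTorch training code goes here; TorchDistributor
    # launches one copy per process and sets up the environment variables that
    # torch.distributed needs (rank, world size, master address).
    import torch
    import torch.distributed as dist

    dist.init_process_group("gloo")  # use "nccl" on GPU clusters
    model = torch.nn.Linear(10, 1)
    # ... build the DataLoader, wrap the model in DistributedDataParallel,
    # and run the training loop with the given learning_rate ...
    dist.destroy_process_group()
    return "done"

# num_processes, local_mode, and use_gpu are illustrative values.
distributor = TorchDistributor(num_processes=2, local_mode=False, use_gpu=False)
result = distributor.run(train_fn, 1e-3)
```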

Hyperparameter tuning examples

For general information about hyperparameter tuning in Databricks, see Hyperparameter tuning.

| Package | Notebook | Features |
| --- | --- | --- |
| Hyperopt | Distributed hyperopt | Distributed Hyperopt, scikit-learn, MLflow |
| Hyperopt | Compare models | Use distributed Hyperopt to search the hyperparameter space for different model types simultaneously |
| Hyperopt | Distributed training algorithms and hyperopt | Hyperopt, MLlib |
| Hyperopt | Hyperopt best practices | Best practices for datasets of different sizes |
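
The distributed Hyperopt notebooks rely on `SparkTrials`, which evaluates trials as Spark tasks across the cluster. A minimal sketch of that setup, with a toy model and an illustrative search space and parallelism setting, looks roughly like this:

```python
from hyperopt import SparkTrials, fmin, hp, tpe
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

def objective(C):
    # Each trial runs as a Spark task on a worker node.
    score = cross_val_score(SVC(C=C), X, y, cv=3).mean()
    return -score  # Hyperopt minimizes the returned loss

# parallelism controls how many trials run concurrently; 4 is illustrative.
spark_trials = SparkTrials(parallelism=4)

best = fmin(
    fn=objective,
    space=hp.loguniform("C", -2, 2),
    algo=tpe.suggest,
    max_evals=32,
    trials=spark_trials,
)
print(best)
```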