Tuning distributed training algorithms: Hyperopt and Apache Spark MLlib

Databricks Runtime for Machine Learning includes Hyperopt, a library for ML hyperparameter tuning in Python, and Apache Spark MLlib, a library of distributed algorithms for training ML models (also often called "Spark ML"). This example notebook shows how to use them together.

Use case

Distributed machine learning workloads in Python for which you want to tune hyperparameters.

In this example notebook

The demo shows how to tune hyperparameters for an example machine learning workflow in MLlib. You can follow this example to tune other distributed machine learning algorithms from MLlib or from other libraries.

This guide includes two sections to illustrate the process you can follow to develop your own workflow:
- Part 1. Run distributed training using MLlib
- Part 2. Use Hyperopt to tune hyperparameters

Requirements

This notebook requires Databricks Runtime for Machine Learning.

MLflow autologging
This notebook demonstrates how to track model training and tuning with MLflow. Starting with MLflow version 1.17.0, you can use MLflow autologging with pyspark.ml. If your cluster is running Databricks Runtime for ML 8.2 or below, you can upgrade the MLflow client to add this pyspark.ml support. Upgrading is not required to run the notebook.
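If you need to enable pyspark.ml autologging explicitly, a minimal sketch (assuming MLflow 1.17.0 or above is installed):

```python
import mlflow.pyspark.ml

# Enable MLflow autologging for pyspark.ml: parameters, metrics, and models
# are logged automatically when an MLlib estimator's fit() is called.
mlflow.pyspark.ml.autolog()
```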
To upgrade MLflow to a version that supports pyspark.ml autologging, uncomment and run the following cell.
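The original upgrade cell is not shown here; a minimal sketch, assuming a notebook-scoped %pip install on Databricks (the pinned version is an assumption, any release at or above 1.17.0 adds the support):

```python
# Uncomment the line below to upgrade the MLflow client,
# then re-run the notebook so the new version is picked up.
# %pip install --upgrade mlflow==1.17.0
```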
Part 1. Run distributed training using MLlib
This section shows a simple example of distributed training using MLlib. For more information and examples, see these resources:
- Databricks documentation on MLlib (AWS|Azure|GCP)
- Apache Spark MLlib programming guide
- Apache Spark MLlib Python API documentation
Load data
This notebook uses the classic MNIST handwritten digit recognition dataset. Each example is a vector of pixels representing an image of a handwritten digit.
These datasets are stored in the popular LibSVM dataset format. The following cell shows how to load them using MLlib's LibSVM dataset reader utility.
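A minimal sketch of the load step, assuming the notebook's predefined spark session and the Databricks-hosted copy of MNIST (the /databricks-datasets path is an assumption; adjust it to where your files live):

```python
# Load the MNIST training and test sets from LibSVM files.
# MNIST images are 28x28, so each example has 784 features.
train_df = (spark.read.format("libsvm")
            .option("numFeatures", "784")
            .load("/databricks-datasets/mnist-digits/data-001/mnist-digits-train.txt")
            .cache())
test_df = (spark.read.format("libsvm")
           .option("numFeatures", "784")
           .load("/databricks-datasets/mnist-digits/data-001/mnist-digits-test.txt")
           .cache())
```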
Create a function to train a model
In this section, you define a function to train a decision tree. Wrapping the training code in a function is important because it lets you pass the function to Hyperopt for tuning later.
Details: The tree algorithm needs to know that the labels are categories 0-9, rather than continuous values. This example uses the StringIndexer class to do this. A Pipeline ties this feature preprocessing together with the tree algorithm. ML Pipelines are tools Spark provides for piecing together machine learning algorithms into workflows. To learn more about Pipelines, check out other ML example notebooks in Databricks and the ML Pipelines user guide.
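A minimal sketch of such a function; the name train_tree, the F1 evaluation metric, and the train_df/test_df DataFrames from the load step above are assumptions, not the notebook's exact code:

```python
import mlflow
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import StringIndexer

def train_tree(minInstancesPerNode, maxBins):
    """Train a decision tree pipeline and return (model, f1_score) on the test set."""
    # nested=True lets each training call log its own run under an outer tuning run.
    with mlflow.start_run(nested=True):
        # Index the 0-9 labels as categories rather than continuous values.
        indexer = StringIndexer(inputCol="label", outputCol="indexedLabel")
        dtc = DecisionTreeClassifier(labelCol="indexedLabel",
                                     minInstancesPerNode=minInstancesPerNode,
                                     maxBins=maxBins)
        # The Pipeline ties the label preprocessing to the tree algorithm.
        pipeline = Pipeline(stages=[indexer, dtc])
        model = pipeline.fit(train_df)

        predictions = model.transform(test_df)
        evaluator = MulticlassClassificationEvaluator(
            labelCol="indexedLabel", predictionCol="prediction", metricName="f1")
        f1 = evaluator.evaluate(predictions)
        mlflow.log_metric("f1_score", f1)
    return model, f1
```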
Part 2. Use Hyperopt to tune hyperparameters
In this section, you create the Hyperopt workflow:
- Define a function to minimize
- Define a search space over hyperparameters
- Specify the search algorithm and use fmin() to tune the model
For more information about the Hyperopt APIs, see the Hyperopt documentation.
Define the search space over hyperparameters
This example tunes two hyperparameters: minInstancesPerNode and maxBins. See the Hyperopt documentation for details on defining a search space and parameter expressions.
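A minimal sketch of such a search space; the ranges below are assumptions for illustration, not tuned recommendations:

```python
from hyperopt import hp

# Both hyperparameters are integers, so draw from quantized uniform
# distributions (hp.quniform returns floats; cast to int in the objective).
space = {
    'minInstancesPerNode': hp.quniform('minInstancesPerNode', 10, 200, 1),
    'maxBins': hp.quniform('maxBins', 2, 32, 1),
}
```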
Tune the model using Hyperopt fmin()
- Set max_evals to the maximum number of points in hyperparameter space to test (the maximum number of models to fit and evaluate). Because this command evaluates many models, it can take several minutes to execute.
- You must also specify which search algorithm to use. The two main choices are:
  - hyperopt.tpe.suggest: Tree of Parzen Estimators, a Bayesian approach that iteratively and adaptively selects new hyperparameter settings to explore based on previous results
  - hyperopt.rand.suggest: Random search, a non-adaptive approach that samples the search space randomly

A sketch that puts these pieces together follows the note below.
Important:
When using Hyperopt with MLlib and other distributed training algorithms, do not pass a trials argument to fmin(). When you do not include the trials argument, Hyperopt uses the default Trials class, which runs on the cluster driver. Hyperopt needs to evaluate each trial on the driver node so that each trial can initiate distributed training jobs.

Do not use the SparkTrials class with MLlib. SparkTrials is designed to distribute trials for algorithms that are not themselves distributed. MLlib uses distributed computing already and is not compatible with SparkTrials.
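A minimal sketch of the tuning call, assuming the train_tree function and space dictionary defined above; the objective name and the max_evals value are illustrative choices:

```python
import mlflow
from hyperopt import STATUS_OK, fmin, tpe

def objective(params):
    # hp.quniform yields floats, so cast back to ints before training.
    model, f1 = train_tree(int(params['minInstancesPerNode']),
                           int(params['maxBins']))
    # fmin() minimizes the loss, so negate the score we want to maximize.
    return {'loss': -f1, 'status': STATUS_OK}

# Note: no `trials` argument, so the default Trials class runs on the
# driver, letting each trial launch its own distributed MLlib job.
with mlflow.start_run():
    best_params = fmin(
        fn=objective,
        space=space,
        algo=tpe.suggest,
        max_evals=8,  # small value so the example finishes quickly
    )
print(best_params)
```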