Hyperparameter tuning and automated machine learning

Databricks Runtime for Machine Learning incorporates MLflow and Hyperopt, two open source tools that automate the process of model selection and hyperparameter tuning.

Automated MLflow tracking

MLflow is an open source platform for managing the end-to-end machine learning lifecycle. MLflow provides automated tracking for model tuning with Apache Spark MLlib. With automated MLflow tracking, when you run tuning code using CrossValidator or TrainValidationSplit, the specified hyperparameters and evaluation metrics are automatically logged, making it easy to identify the optimal model. Automated MLflow tracking is available for Python notebooks only.

Hyperparameter tuning with Hyperopt

Databricks Runtime ML includes Hyperopt, a Python library that facilitates distributed hyperparameter tuning and model selection. With Hyperopt, you can scan a set of Python models while varying algorithms and hyperparameters across search spaces that you define. Hyperopt works both with distributed ML algorithms such as Apache Spark MLlib and Horovod, and with single-machine ML libraries such as scikit-learn and TensorFlow.

The basic steps when using Hyperopt are:

  1. Define an objective function to minimize. Typically this is the training or validation loss.
  2. Define the hyperparameter search space. Hyperopt provides a conditional search space, which lets you compare different ML algorithms in the same run.
  3. Specify the search algorithm. Hyperopt uses stochastic tuning algorithms that perform a more efficient search of hyperparameter space than a deterministic grid search.
  4. Run the Hyperopt function fmin(). fmin() takes the items you defined in the previous steps and identifies the set of hyperparameters that minimizes the objective function.

To get started quickly using Hyperopt with scikit-learn algorithms, see the example notebooks. For more details about how Hyperopt works, and for additional examples, see the Hyperopt documentation.