hyperopt-spark-mlflow(Python)

Distributed Hyperopt and automated MLflow tracking

Hyperopt is a Python library for hyperparameter tuning. Databricks Runtime for Machine Learning includes an optimized and enhanced version of Hyperopt, including automated MLflow tracking and the SparkTrials class for distributed tuning.

This notebook illustrates how to scale up hyperparameter tuning for a single-machine Python ML algorithm and track the results using MLflow. In part 1, you create a single-machine Hyperopt workflow. In part 2, you learn to use the SparkTrials class to distribute the workflow calculations across the Spark cluster.

Import required packages and load dataset

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
 
from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK, Trials
 
# `mlflow` comes preinstalled on Databricks Runtime for Machine Learning. If you are running a
# different environment, install it first (for example, with `%pip install mlflow`).
import mlflow
# Load the iris dataset from scikit-learn
iris = load_iris()
X = iris.data
y = iris.target

Part 1. Single-machine Hyperopt workflow

Here are the steps in a Hyperopt workflow:

  1. Define a function to minimize.
  2. Define a search space over hyperparameters.
  3. Select a search algorithm.
  4. Run the tuning algorithm with Hyperopt fmin().

For more information, see the Hyperopt documentation.
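
The same four steps in a minimal, self-contained sketch. The toy objective and the example_ names are illustrative only; the rest of this notebook builds the real workflow.

# Illustrative sketch of the four steps; names prefixed with example_ are not used elsewhere.
from hyperopt import fmin, tpe, hp, STATUS_OK  # already imported above

# 1. Define a function to minimize.
def example_objective(x):
    return {'loss': (x - 3) ** 2, 'status': STATUS_OK}

# 2. Define a search space over hyperparameters.
example_space = hp.uniform('x', -10, 10)

# 3. Select a search algorithm.
example_algo = tpe.suggest

# 4. Run the tuning algorithm with fmin().
example_best = fmin(fn=example_objective, space=example_space, algo=example_algo, max_evals=10)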

Define a function to minimize

In this example, we use a support vector machine classifier. The objective is to find the best value for the regularization parameter C.

Most of the code for a Hyperopt workflow is in the objective function. This example uses the support vector classifier from scikit-learn.

If your cluster uses Databricks Runtime 11.3 ML, edit the support vector classifier to take a positional argument, clf = SVC(C).

def objective(C):
    # Create a support vector classifier model
    clf = SVC(C=C)
    
    # Use the cross-validation accuracy to compare the models' performance
    accuracy = cross_val_score(clf, X, y).mean()
    
    # Hyperopt tries to minimize the objective function. A higher accuracy value means a better model, so you must return the negative accuracy.
    return {'loss': -accuracy, 'status': STATUS_OK}

Define the search space over hyperparameters

See the Hyperopt docs for details on defining a search space and parameter expressions.

search_space = hp.lognormal('C', 0, 1.0)
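
The search space does not have to be a single expression; parameter expressions can be nested in a dictionary. Here is a sketch that also tunes the kernel; example_space and the kernel choices are illustrative and are not used elsewhere in this notebook.

# Illustrative only: a dictionary search space that tunes both C and the kernel.
# With a dictionary space, the objective function receives a dict of sampled values.
example_space = {
    'C': hp.lognormal('C', 0, 1.0),
    'kernel': hp.choice('kernel', ['linear', 'rbf'])
}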

Select a search algorithm

The two main choices are:

  • hyperopt.tpe.suggest: Tree of Parzen Estimators, a Bayesian approach which iteratively and adaptively selects new hyperparameter settings to explore based on past results
  • hyperopt.rand.suggest: Random search, a non-adaptive approach that samples over the search space

algo = tpe.suggest
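
To try random search instead, import rand and use rand.suggest. The sketch below uses a separate variable name so it does not affect the TPE run in this notebook.

# Alternative (illustrative; not used in this notebook): non-adaptive random search.
from hyperopt import rand
random_algo = rand.suggest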

Run the tuning algorithm with Hyperopt fmin()

Set max_evals to the maximum number of points in hyperparameter space to test, that is, the maximum number of models to fit and evaluate.

argmin = fmin(
  fn=objective,
  space=search_space,
  algo=algo,
  max_evals=16)
# Print the best value found for C
print("Best value found: ", argmin)

Part 2. Distributed tuning using Apache Spark and MLflow

To distribute tuning, add one more argument to fmin(): an instance of the SparkTrials class, passed as trials.

SparkTrials takes two optional arguments:

  • parallelism: Number of models to fit and evaluate concurrently. The default is the number of available Spark task slots.
  • timeout: Maximum time (in seconds) that fmin() can run. The default is no maximum time limit.
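
For example, a sketch that caps concurrency and total runtime; the values 4 and 3600 and the variable name are illustrative only.

# Illustrative only: run at most 4 trials concurrently and stop after one hour.
example_spark_trials = SparkTrials(parallelism=4, timeout=3600)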

This example uses the simple objective function defined above in Part 1. In this case, the function runs quickly and the overhead of starting Spark jobs dominates the calculation time, so the distributed run takes longer than the single-machine run. For typical real-world problems, the objective function is more complex, and using SparkTrials to distribute the calculations is faster than single-machine tuning.

Automated MLflow tracking is enabled by default. To use it, call mlflow.start_run() before calling fmin() as shown in the example.

from hyperopt import SparkTrials

# To display the API documentation for the SparkTrials class, uncomment the following line.
# help(SparkTrials)
spark_trials = SparkTrials()

with mlflow.start_run():
  argmin = fmin(
    fn=objective,
    space=search_space,
    algo=algo,
    max_evals=16,
    trials=spark_trials)
# Print the best value found for C
print("Best value found: ", argmin)

To view the MLflow experiment associated with the notebook, click the Experiment icon in the notebook context bar on the upper right. There, you can view all runs. To view runs in the MLflow UI, click the icon at the far right next to Experiment Runs.

To examine the effect of tuning C:

  1. Select the resulting runs and click Compare.
  2. In the Scatter Plot, select C for X-axis and loss for Y-axis.
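
You can also examine the runs programmatically with mlflow.search_runs(). The sketch below assumes the automated tracking logged a parameter named C and a metric named loss; the exact column names depend on what was logged.

# Illustrative: load the runs of the active experiment into a pandas DataFrame.
runs = mlflow.search_runs()
# Column names depend on what was logged; 'params.C' and 'metrics.loss' are assumed here.
runs[['params.C', 'metrics.loss']].sort_values('metrics.loss').head()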