
Model selection using scikit-learn, Hyperopt, and MLflow

Hyperopt is a Python library for hyperparameter tuning. Databricks Runtime for Machine Learning includes an optimized and enhanced version of Hyperopt, including automated MLflow tracking and the SparkTrials class for distributed tuning.

This notebook shows how to use Hyperopt to identify the best model from among several different scikit-learn algorithms and sets of hyperparameters for each model. It also shows how to use MLflow to track Hyperopt runs so you can examine them later.

This tutorial covers the following steps:

  1. Prepare the dataset.
  2. Define the function to minimize.
  3. Define the search space over hyperparameters.
  4. Select the search algorithm.
  5. Use Hyperopt's fmin() function to find the best combination of hyperparameters.
import numpy as np
 
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
 
from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK, Trials
 
import mlflow

Prepare the dataset

This notebook uses the California housing dataset included with scikit-learn. The dataset is based on data from the 1990 US census. It includes the median house value in over 20,000 census blocks in California along with information about the block such as the income, number of people per household, number of rooms and bedrooms per house, and so on.

X, y = fetch_california_housing(return_X_y=True)
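
As a quick sanity check, you can confirm the dimensions of the dataset: 20,640 census blocks, each with 8 predictor columns.

# One row per census block, one column per predictor.
print(X.shape, y.shape)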

Scale the predictor values

The predictor columns are median income, house age, average number of rooms in a house, average number of bedrooms, block population, average house occupancy, latitude, and longitude. The ranges of these predictors vary significantly. Block population is in the thousands, but the average number of rooms in a house is around 5. To prevent the predictors with large values from dominating the calculations, it's a good idea to normalize the predictor values so they are all on the same scale. To do this, you can use scikit-learn's StandardScaler class.

# Review the mean value of each column in the dataset. You can see that they vary by several orders of magnitude, from 1425 for block population to 1.1 for average number of bedrooms. 
X.mean(axis=0)
Out[3]: array([ 3.87067100e+00, 2.86394864e+01, 5.42899974e+00, 1.09667515e+00, 1.42547674e+03, 3.07065516e+00, 3.56318614e+01, -1.19569704e+02])
from sklearn.preprocessing import StandardScaler
 
scaler = StandardScaler()
X = scaler.fit_transform(X)
# After scaling, the mean value for each column is close to 0. 
X.mean(axis=0)
Out[5]: array([ 6.60969987e-17, 5.50808322e-18, 6.60969987e-17, -1.06030602e-16, -1.10161664e-17, 3.44255201e-18, -1.07958431e-15, -8.52651283e-15])

Convert the numeric target column to discrete values

The target value in this dataset is the value of the house, a continuous or numeric value. This notebook illustrates the use of classification functions, so the first step is to convert the target value to a categorical value. The next cell converts the original target values into two discrete levels: 0 if the value of the house is below the median, or 1 if it is at or above the median.

y_discrete = np.where(y < np.median(y), 0, 1)
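
Because the split is at the median, the two classes should be close to balanced. A quick check with np.bincount confirms the class counts.

# Count the blocks assigned to each class; splitting at the median
# should produce roughly equal class sizes.
print(np.bincount(y_discrete))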

Define the function to minimize

In this notebook, you examine three algorithms available in scikit-learn: support vector machines (SVM), random forest, and logistic regression.

In the following cell, you define an objective function that reads the model name from params['type'], constructs the corresponding classifier, runs the training, and calculates the cross-validation accuracy.

def objective(params):
    classifier_type = params['type']
    del params['type']
    if classifier_type == 'svm':
        clf = SVC(**params)
    elif classifier_type == 'rf':
        clf = RandomForestClassifier(**params)
    elif classifier_type == 'logreg':
        clf = LogisticRegression(**params)
    else:
        return 0
    accuracy = cross_val_score(clf, X, y_discrete).mean()
    
    # Because fmin() tries to minimize the objective, this function must return the negative accuracy. 
    return {'loss': -accuracy, 'status': STATUS_OK}
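
Before handing the function to Hyperopt, you can call it directly with a hand-picked configuration to verify that it returns a loss dictionary. The hyperparameter values below are arbitrary, chosen only for this smoke test; note that objective() deletes the 'type' key, so pass a fresh dict each time.

# Smoke test with arbitrary hyperparameters. The loss is the negative
# cross-validation accuracy, so it should fall between -1 and 0.
print(objective({'type': 'rf', 'max_depth': 3, 'criterion': 'gini'}))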

Define the search space over hyperparameters

See the Hyperopt documentation for details on defining a search space and parameter expressions.

Use hp.choice to select different models.

from hyperopt.pyll import scope

search_space = hp.choice('classifier_type', [
    {
        'type': 'svm',
        'C': hp.lognormal('SVM_C', 0, 1.0),
        'kernel': hp.choice('kernel', ['linear', 'rbf'])
    },
    {
        'type': 'rf',
        # hp.quniform returns floats; scope.int casts the value to an
        # integer because newer versions of scikit-learn require an
        # integer max_depth.
        'max_depth': scope.int(hp.quniform('max_depth', 2, 5, 1)),
        'criterion': hp.choice('criterion', ['gini', 'entropy'])
    },
    {
        'type': 'logreg',
        'C': hp.lognormal('LR_C', 0, 1.0),
        'solver': hp.choice('solver', ['liblinear', 'lbfgs'])
    },
])
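
To verify that the space is wired up as intended, you can draw a random configuration from it with hyperopt.pyll.stochastic.sample. Each draw picks one of the three model types along with a sampled set of its hyperparameters, in the same dict format that objective() receives.

import hyperopt.pyll.stochastic

# Each call returns one sampled configuration, for example
# {'type': 'rf', 'max_depth': 4, 'criterion': 'gini'}.
print(hyperopt.pyll.stochastic.sample(search_space))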

Select the search algorithm

The two main choices are:

  • hyperopt.tpe.suggest: Tree of Parzen Estimators, a Bayesian approach that iteratively and adaptively selects new hyperparameter settings to explore based on previous results
  • hyperopt.rand.suggest: Random search, a non-adaptive approach that samples over the search space

algo = tpe.suggest
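
This notebook uses TPE. To compare against random search, you would set the algorithm as follows (left commented out so the TPE setting above stays in effect):

# Non-adaptive baseline: sample configurations uniformly at random.
# from hyperopt import rand
# algo = rand.suggest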

Use Hyperopt's fmin() function to find the best combination of hyperparameters

SparkTrials takes two optional arguments:

  • parallelism: Number of models to fit and evaluate concurrently. The default is the number of available Spark task slots.
  • timeout: Maximum time (in seconds) that fmin() can run. The default is no maximum time limit.

spark_trials = SparkTrials()
Because the requested parallelism was None or a non-positive value, parallelism will be set to (8), which is Spark's default parallelism (8), or the current total of Spark task slots (8), or 1, whichever is greater. We recommend setting parallelism explicitly to a positive value because the total of Spark task slots is subject to cluster sizing.
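
As the warning recommends, it's better to set parallelism explicitly. An illustrative call, capped at 8 concurrent trials and a 10-minute total runtime, is shown below; the values are placeholders to adapt to your cluster, and the line is commented out so the default settings above stay in effect.

# Illustrative only: run up to 8 trials concurrently and stop fmin()
# after at most 600 seconds.
# spark_trials = SparkTrials(parallelism=8, timeout=600)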

When you call mlflow.start_run() before calling fmin() as shown in the example below, the Hyperopt runs are automatically tracked with MLflow.

max_evals is the maximum number of points in hyperparameter space to test. This is the maximum number of models Hyperopt fits and evaluates.

with mlflow.start_run():
  best_result = fmin(
    fn=objective, 
    space=search_space,
    algo=algo,
    max_evals=32,
    trials=spark_trials)
Hyperopt with SparkTrials will automatically track trials in MLflow. To view the MLflow experiment associated with the notebook, click the 'Runs' icon in the notebook context bar on the upper right. There, you can view all runs. To view logs from trials, please check the Spark executor logs. To view executor logs, expand 'Spark Jobs' above until you see the (i) icon next to the stage from the trial job. Click it and find the list of tasks; Task 0 is the first trial attempt, and subsequent Tasks are retries. Click the 'stderr' link for a task to view trial logs.
100%|██████████| 32/32 [01:34<00:00, 2.96s/trial, best loss: -0.8352713178294575]
Total Trials: 32: 32 succeeded, 0 failed, 0 cancelled.

Print the hyperparameters that produced the best result

import hyperopt
print(hyperopt.space_eval(search_space, best_result))
{'C': 3.093586085186027, 'kernel': 'rbf', 'type': 'svm'}
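
fmin() only searches; it does not return a fitted model. As a follow-up sketch, you can refit the winning configuration on the full dataset, mirroring the dispatch in objective().

# Decode the winning configuration and rebuild the classifier.
best_params = hyperopt.space_eval(search_space, best_result)
model_type = best_params.pop('type')
constructors = {'svm': SVC, 'rf': RandomForestClassifier, 'logreg': LogisticRegression}

# Fit the best model on the full scaled dataset.
final_model = constructors[model_type](**best_params)
final_model.fit(X, y_discrete)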

To view the MLflow experiment associated with the notebook, click the Experiment icon in the notebook context bar on the upper right. There, you can view all runs. To view runs in the MLflow UI, click the icon at the far right next to Experiment Runs.

To examine the effect of tuning a specific hyperparameter:

  1. Select the resulting runs and click Compare.
  2. In the Scatter Plot, select the hyperparameter from the X-axis drop-down menu and select loss from the Y-axis drop-down menu.
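
You can also examine the runs programmatically with mlflow.search_runs, which returns the runs of the notebook's experiment as a pandas DataFrame. The sketch below assumes the integration logged the trial loss under the metric name loss; inspect runs.columns if the column is missing.

# Load the tracked runs, ordered by the logged loss (lowest first,
# i.e., highest cross-validation accuracy first).
runs = mlflow.search_runs(order_by=["metrics.loss ASC"])
print(runs.head())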