
Model selection using scikit-learn, Hyperopt, and MLflow

Hyperopt is a Python library for hyperparameter tuning. Databricks Runtime for Machine Learning includes an optimized and enhanced version of Hyperopt, including automated MLflow tracking and the SparkTrials class for distributed tuning.

This notebook shows how to use Hyperopt to identify the best model from among several different scikit-learn algorithms and sets of hyperparameters for each model. It also shows how to use MLflow to track Hyperopt runs so you can examine them later.

This tutorial covers the following steps:

  1. Prepare the dataset.
  2. Define the function to minimize.
  3. Define the search space over hyperparameters.
  4. Select the search algorithm.
  5. Use Hyperopt's fmin() function to find the best combination of hyperparameters.
import numpy as np
 
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
 
from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK, Trials
 
import mlflow

Prepare the dataset

This notebook uses the California housing dataset included with scikit-learn. The dataset is based on data from the 1990 US census. It includes the median house value in over 20,000 census blocks in California along with information about the block such as the income, number of people per household, number of rooms and bedrooms per house, and so on.

X, y = fetch_california_housing(return_X_y=True)
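
As a quick sanity check, you can confirm the dimensions of the dataset: 20,640 census blocks, each with 8 predictor columns.

# One row per census block, one column per predictor.
print(X.shape, y.shape)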

Scale the predictor values

The predictor columns are median income, house age, average number of rooms in a house, average number of bedrooms, block population, average house occupancy, latitude, and longitude. The ranges of these predictors vary significantly. Block population is in the thousands, but the average number of rooms in a house is around 5. To prevent the predictors with large values from dominating the calculations, it's a good idea to normalize the predictor values so they are all on the same scale. To do this, you can use scikit-learn's StandardScaler class.

# Review the mean value of each column in the dataset. You can see that they vary by several orders of magnitude, from 1425 for block population to 1.1 for average number of bedrooms. 
X.mean(axis=0)
Out[3]: array([ 3.87067100e+00, 2.86394864e+01, 5.42899974e+00, 1.09667515e+00, 1.42547674e+03, 3.07065516e+00, 3.56318614e+01, -1.19569704e+02])
from sklearn.preprocessing import StandardScaler
 
scaler = StandardScaler()
X = scaler.fit_transform(X)
# After scaling, the mean value for each column is close to 0. 
X.mean(axis=0)
Out[5]: array([ 6.60969987e-17, 5.50808322e-18, 6.60969987e-17, -1.06030602e-16, -1.10161664e-17, 3.44255201e-18, -1.07958431e-15, -8.52651283e-15])

Convert the numeric target column to discrete values

The target value in this dataset is the value of the house, a continuous or numeric value. This notebook illustrates the use of classification functions, so the first step is to convert the target value to a categorical value. The next cell converts the original target values into two discrete levels: 0 if the value of the house is below the median, or 1 if it is at or above the median.

y_discrete = np.where(y < np.median(y), 0, 1)
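
Because the split is at the median, the two classes should be close to balanced. A quick check with np.bincount confirms the class counts.

# Count the blocks assigned to each class; splitting at the median
# should produce roughly equal class sizes.
print(np.bincount(y_discrete))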

Define the function to minimize

In this notebook, you examine three algorithms available in scikit-learn: support vector machines (SVM), random forest, and logistic regression.

In the following cell, you define an objective function that reads the model name from params['type'], constructs the corresponding classifier, runs the training, and calculates the cross-validation accuracy.

def objective(params):
    classifier_type = params['type']
    del params['type']
    if classifier_type == 'svm':
        clf = SVC(**params)
    elif classifier_type == 'rf':
        clf = RandomForestClassifier(**params)
    elif classifier_type == 'logreg':
        clf = LogisticRegression(**params)
    else:
        return 0
    accuracy = cross_val_score(clf, X, y_discrete).mean()
    
    # Because fmin() tries to minimize the objective, this function must return the negative accuracy. 
    return {'loss': -accuracy, 'status': STATUS_OK}
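
Before handing the function to Hyperopt, you can call it directly with a hand-picked configuration to verify that it returns a loss dictionary. The hyperparameter values below are arbitrary, chosen only for this smoke test; note that objective() deletes the 'type' key, so pass a fresh dict each time.

# Smoke test with arbitrary hyperparameters. The loss is the negative
# cross-validation accuracy, so it should fall between -1 and 0.
print(objective({'type': 'rf', 'max_depth': 3, 'criterion': 'gini'}))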

Define the search space over hyperparameters

See the Hyperopt documentation for details on defining a search space and parameter expressions.

Use hp.choice to select different models.

from hyperopt.pyll import scope

search_space = hp.choice('classifier_type', [
    {
        'type': 'svm',
        'C': hp.lognormal('SVM_C', 0, 1.0),
        'kernel': hp.choice('kernel', ['linear', 'rbf'])
    },
    {
        'type': 'rf',
        # hp.quniform returns floats; scope.int casts the value to an
        # integer because newer versions of scikit-learn require an
        # integer max_depth.
        'max_depth': scope.int(hp.quniform('max_depth', 2, 5, 1)),
        'criterion': hp.choice('criterion', ['gini', 'entropy'])
    },
    {
        'type': 'logreg',
        'C': hp.lognormal('LR_C', 0, 1.0),
        'solver': hp.choice('solver', ['liblinear', 'lbfgs'])
    },
])
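
To verify that the space is wired up as intended, you can draw a random configuration from it with hyperopt.pyll.stochastic.sample. Each draw picks one of the three model types along with a sampled set of its hyperparameters, in the same dict format that objective() receives.

import hyperopt.pyll.stochastic

# Each call returns one sampled configuration, for example
# {'type': 'rf', 'max_depth': 4, 'criterion': 'gini'}.
print(hyperopt.pyll.stochastic.sample(search_space))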

Select the search algorithm

The two main choices are:

  • hyperopt.tpe.suggest: Tree of Parzen Estimators, a Bayesian approach that iteratively and adaptively selects new hyperparameter settings to explore based on previous results
  • hyperopt.rand.suggest: Random search, a non-adaptive approach that samples over the search space

algo = tpe.suggest
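
This notebook uses TPE. To compare against random search, you would set the algorithm as follows (left commented out so the TPE setting above stays in effect):

# Non-adaptive baseline: sample configurations uniformly at random.
# from hyperopt import rand
# algo = rand.suggest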

Use Hyperopt's fmin() function to find the best combination of hyperparameters

SparkTrials takes two optional arguments:

  • parallelism: Number of models to fit and evaluate concurrently. The default is the number of available Spark task slots.
  • timeout: Maximum time (in seconds) that fmin() can run. The default is no maximum time limit.

spark_trials = SparkTrials()
Because the requested parallelism was None or a non-positive value, parallelism will be set to (8), which is Spark's default parallelism (8), or the current total of Spark task slots (8), or 1, whichever is greater. We recommend setting parallelism explicitly to a positive value because the total of Spark task slots is subject to cluster sizing.
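
As the warning recommends, it's better to set parallelism explicitly. An illustrative call, capped at 8 concurrent trials and a 10-minute total runtime, is shown below; the values are placeholders to adapt to your cluster, and the line is commented out so the default settings above stay in effect.

# Illustrative only: run up to 8 trials concurrently and stop fmin()
# after at most 600 seconds.
# spark_trials = SparkTrials(parallelism=8, timeout=600)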

When you call mlflow.start_run() before calling fmin() as shown in the example below, the Hyperopt runs are automatically tracked with MLflow.

max_evals is the maximum number of points in hyperparameter space to test. This is the maximum number of models Hyperopt fits and evaluates.

with mlflow.start_run():
  best_result = fmin(
    fn=objective, 
    space=search_space,
    algo=algo,
    max_evals=32,
    trials=spark_trials)
Hyperopt with SparkTrials will automatically track trials in MLflow. To view the MLflow experiment associated with the notebook, click the 'Runs' icon in the notebook context bar on the upper right. There, you can view all runs. To view logs from trials, please check the Spark executor logs. To view executor logs, expand 'Spark Jobs' above until you see the (i) icon next to the stage from the trial job. Click it and find the list of tasks; Task 0 is the first trial attempt, and subsequent Tasks are retries. Click the 'stderr' link for a task to view trial logs.
100%|██████████| 32/32 [01:34<00:00, 2.96s/trial, best loss: -0.8352713178294575]
Total Trials: 32: 32 succeeded, 0 failed, 0 cancelled.

Print the hyperparameters that produced the best result

import hyperopt
print(hyperopt.space_eval(search_space, best_result))
{'C': 3.093586085186027, 'kernel': 'rbf', 'type': 'svm'}
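
fmin() only searches; it does not return a fitted model. As a follow-up sketch, you can refit the winning configuration on the full dataset, mirroring the dispatch in objective().

# Decode the winning configuration and rebuild the classifier.
best_params = hyperopt.space_eval(search_space, best_result)
model_type = best_params.pop('type')
constructors = {'svm': SVC, 'rf': RandomForestClassifier, 'logreg': LogisticRegression}

# Fit the best model on the full scaled dataset.
final_model = constructors[model_type](**best_params)
final_model.fit(X, y_discrete)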

To view the MLflow experiment associated with the notebook, click the Experiment icon in the notebook context bar on the upper right. There, you can view all runs. To view runs in the MLflow UI, click the icon at the far right next to Experiment Runs.

To examine the effect of tuning a specific hyperparameter:

  1. Select the resulting runs and click Compare.
  2. In the Scatter Plot, select the hyperparameter from the X-axis drop-down menu and select loss from the Y-axis drop-down menu.
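
You can also examine the runs programmatically with mlflow.search_runs, which returns the runs of the notebook's experiment as a pandas DataFrame. The sketch below assumes the integration logged the trial loss under the metric name loss; inspect runs.columns if the column is missing.

# Load the tracked runs, ordered by the logged loss (lowest first,
# i.e., highest cross-validation accuracy first).
runs = mlflow.search_runs(order_by=["metrics.loss ASC"])
print(runs.head())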