import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK, Trials
import mlflow
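The next cell inspects a feature matrix X that is not loaded in the code shown here; a minimal sketch of how it could be obtained with the imported fetch_california_housing loader:
# Assumption: load the California housing features and continuous target as NumPy arrays.
X, y = fetch_california_housing(return_X_y=True)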
# Review the mean value of each column in the dataset. You can see that they vary by several orders of magnitude, from 1425 for block population to 1.1 for average number of bedrooms.
X.mean(axis=0)
Out[3]: array([ 3.87067100e+00,  2.86394864e+01,  5.42899974e+00,  1.09667515e+00,
                1.42547674e+03,  3.07065516e+00,  3.56318614e+01, -1.19569704e+02])
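The objective function below cross-validates classifiers against a discrete label, y_discrete, which is not defined in the cells shown; a minimal sketch of one possible construction (a median split of the continuous target is an assumption for illustration):
# Assumption: turn the continuous house-value target into a binary label via a median split.
y_discrete = np.where(y < np.median(y), 0, 1)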
def objective(params):
    classifier_type = params['type']
    del params['type']
    if classifier_type == 'svm':
        clf = SVC(**params)
    elif classifier_type == 'rf':
        clf = RandomForestClassifier(**params)
    elif classifier_type == 'logreg':
        clf = LogisticRegression(**params)
    else:
        return 0
    accuracy = cross_val_score(clf, X, y_discrete).mean()

    # Because fmin() tries to minimize the objective, this function must return the negative accuracy.
    return {'loss': -accuracy, 'status': STATUS_OK}
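As an optional sanity check (not part of the original cells), the objective can be evaluated once on a hand-picked configuration before launching the full search:
# Hypothetical single evaluation; the 'C' and 'solver' values are illustrative.
print(objective({'type': 'logreg', 'C': 1.0, 'solver': 'lbfgs'}))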
search_space = hp.choice('classifier_type', [
    {
        'type': 'svm',
        'C': hp.lognormal('SVM_C', 0, 1.0),
        'kernel': hp.choice('kernel', ['linear', 'rbf'])
    },
    {
        'type': 'rf',
        'max_depth': hp.quniform('max_depth', 2, 5, 1),
        'criterion': hp.choice('criterion', ['gini', 'entropy'])
    },
    {
        'type': 'logreg',
        'C': hp.lognormal('LR_C', 0, 1.0),
        'solver': hp.choice('solver', ['liblinear', 'lbfgs'])
    },
])
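To see what a single draw from this search space looks like, Hyperopt's stochastic sampler can be used; a minimal sketch (note that hp.quniform yields float values such as max_depth=3.0):
from hyperopt.pyll.stochastic import sample

# Draw one random configuration from the search space to inspect its structure.
print(sample(search_space))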
spark_trials = SparkTrials()
Because the requested parallelism was None or a non-positive value, parallelism will be set to (8), which is Spark's default parallelism (8), or the current total of Spark task slots (8), or 1, whichever is greater. We recommend setting parallelism explicitly to a positive value because the total of Spark task slots is subject to cluster sizing.
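Per the recommendation above, parallelism can be set explicitly when constructing SparkTrials; a minimal sketch (the value 8 is illustrative and should match the cluster's available task slots):
# Explicitly cap the number of concurrent trials instead of relying on the default.
spark_trials = SparkTrials(parallelism=8)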
algo = tpe.suggest  # Use the Tree of Parzen Estimators (TPE) search algorithm.

with mlflow.start_run():
    best_result = fmin(
        fn=objective,
        space=search_space,
        algo=algo,
        max_evals=32,
        trials=spark_trials)
Hyperopt with SparkTrials will automatically track trials in MLflow. To view the MLflow experiment associated with the notebook, click the 'Runs' icon in the notebook context bar on the upper right. There, you can view all runs.
To view logs from trials, please check the Spark executor logs. To view executor logs, expand 'Spark Jobs' above until you see the (i) icon next to the stage from the trial job. Click it and find the list of tasks; Task 0 is the first trial attempt, and subsequent Tasks are retries. Click the 'stderr' link for a task to view trial logs.
0%| | 0/32 [00:00<?, ?trial/s, best loss=?]
3%|▎ | 1/32 [00:03<01:35, 3.08s/trial, best loss: -0.8149709302325581]
6%|▋ | 2/32 [00:05<01:23, 2.77s/trial, best loss: -0.8150678294573643]
9%|▉ | 3/32 [00:09<01:31, 3.15s/trial, best loss: -0.8150678294573643]
12%|█▎ | 4/32 [00:12<01:27, 3.11s/trial, best loss: -0.8150678294573643]
16%|█▌ | 5/32 [00:15<01:23, 3.09s/trial, best loss: -0.8150678294573643]
19%|█▉ | 6/32 [00:16<01:03, 2.46s/trial, best loss: -0.8150678294573643]
31%|███▏ | 10/32 [00:21<00:46, 2.10s/trial, best loss: -0.8150678294573643]
34%|███▍ | 11/32 [00:25<00:56, 2.68s/trial, best loss: -0.8150678294573643]
38%|███▊ | 12/32 [00:27<00:49, 2.48s/trial, best loss: -0.8150678294573643]
41%|████ | 13/32 [00:35<01:18, 4.14s/trial, best loss: -0.8150678294573643]
44%|████▍ | 14/32 [00:38<01:08, 3.80s/trial, best loss: -0.8150678294573643]
47%|████▋ | 15/32 [00:39<00:50, 2.97s/trial, best loss: -0.8150678294573643]
50%|█████ | 16/32 [00:41<00:42, 2.68s/trial, best loss: -0.8150678294573643]
53%|█████▎ | 17/32 [00:43<00:37, 2.48s/trial, best loss: -0.8150678294573643]
56%|█████▋ | 18/32 [00:44<00:28, 2.04s/trial, best loss: -0.8150678294573643]
59%|█████▉ | 19/32 [00:47<00:30, 2.33s/trial, best loss: -0.8150678294573643]
62%|██████▎ | 20/32 [00:48<00:23, 1.93s/trial, best loss: -0.8150678294573643]
66%|██████▌ | 21/32 [00:51<00:24, 2.26s/trial, best loss: -0.8150678294573643]
69%|██████▉ | 22/32 [00:52<00:18, 1.89s/trial, best loss: -0.8150678294573643]
75%|███████▌ | 24/32 [00:54<00:13, 1.63s/trial, best loss: -0.8180717054263565]
78%|███████▊ | 25/32 [00:55<00:10, 1.45s/trial, best loss: -0.8180717054263565]
81%|████████▏ | 26/32 [00:57<00:09, 1.61s/trial, best loss: -0.8180717054263565]
88%|████████▊ | 28/32 [00:58<00:05, 1.29s/trial, best loss: -0.8261627906976745]
91%|█████████ | 29/32 [01:02<00:06, 2.12s/trial, best loss: -0.8352713178294575]
94%|█████████▍| 30/32 [01:09<00:07, 3.59s/trial, best loss: -0.8352713178294575]
97%|█████████▋| 31/32 [01:14<00:04, 4.01s/trial, best loss: -0.8352713178294575]
100%|██████████| 32/32 [01:34<00:00, 8.81s/trial, best loss: -0.8352713178294575]
100%|██████████| 32/32 [01:34<00:00, 2.96s/trial, best loss: -0.8352713178294575]
Total Trials: 32: 32 succeeded, 0 failed, 0 cancelled.
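Because hp.choice reports selections as indices, the raw best_result holds index and numeric values rather than readable parameter settings; Hyperopt's space_eval helper maps them back onto the search space. A minimal sketch:
import hyperopt

# Resolve the index-encoded result of fmin() into concrete hyperparameter values.
print(hyperopt.space_eval(search_space, best_result))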
Model selection using scikit-learn, Hyperopt, and MLflow

Hyperopt is a Python library for hyperparameter tuning. Databricks Runtime for Machine Learning includes an optimized and enhanced version of Hyperopt, including automated MLflow tracking and the SparkTrials class for distributed tuning. This notebook shows how to use Hyperopt to identify the best model from among several different scikit-learn algorithms and sets of hyperparameters for each model. It also shows how to use MLflow to track Hyperopt runs so you can examine them later.

This tutorial covers the following steps:
1. Prepare the dataset.
2. Define the objective function to minimize.
3. Define the search space over hyperparameters.
4. Run the tuning algorithm with the Hyperopt fmin() function to find the best combination of hyperparameters.