Hyperopt concepts

This article describes some of the concepts you need to know to use distributed Hyperopt.

For examples illustrating how to use Hyperopt in Databricks, see Hyperparameter tuning with Hyperopt.

fmin()

You use fmin() to execute a Hyperopt run. The arguments for fmin() are shown in the table; see the Hyperopt documentation for more information. For examples of how to use each argument, see the example notebooks.

Argument name Description
fn Objective function. Hyperopt calls this function with values generated from the hyperparameter space provided in the space argument. This function can return the loss as a scalar value or in a dictionary (see Hyperopt docs for details). This function typically contains code for model training and loss calculation.
space Defines the hyperparameter space to search. Hyperopt provides great flexibility in how this space is defined. You can choose a categorical option such as algorithm, or probabilistic distribution for numeric values such as uniform and log.
algo Hyperopt search algorithm to use to search hyperparameter space. Most commonly used are hyperopt.rand.suggest for Random Search and hyperopt.tpe.suggest for TPE.
max_evals Number of hyperparameter settings to try (the number of models to fit).
max_queue_len Number of hyperparameter settings Hyperopt should generate ahead of time. Because the Hyperopt TPE generation algorithm can take some time, it can be helpful to increase this beyond the default value of 1, but generally no larger than the SparkTrials setting parallelism.
trials A Trials or SparkTrials object. Use SparkTrials when you call single-machine algorithms such as scikit-learn methods in the objective function. Use Trials when you call distributed training algorithms such as MLlib methods or Horovod in the objective function.

The SparkTrials class

SparkTrials is an API developed by Databricks that allows you to distribute a Hyperopt run without making other changes to your Hyperopt code. SparkTrials accelerates single-machine tuning by distributing trials to Spark workers.

Note

SparkTrials is designed to parallelize computations for single-machine ML models such as scikit-learn. For models created with distributed ML algorithms such as MLlib or Horovod, do not use SparkTrials. In this case the model building process is automatically parallelized on the cluster and you should use the default Hyperopt class Trials.

This section describes how to configure the arguments you pass to SparkTrials and implementation aspects of SparkTrials.

Arguments

SparkTrials takes two optional arguments:

  • parallelism: Maximum number of trials to evaluate concurrently. A higher number lets you scale-out testing of more hyperparameter settings. Because Hyperopt proposes new trials based on past results, there is a trade-off between parallelism and adaptivity. For a fixed max_evals, greater parallelism speeds up calculations, but lower parallelism may lead to better results since each iteration has access to more past results.

    Default: Number of Spark executors available. Maximum: 128. If the value is greater than the number of concurrent tasks allowed by the cluster configuration, SparkTrials reduces parallelism to this value.

  • timeout: Maximum number of seconds an fmin() call can take. When this number is exceeded, all runs are terminated and fmin() exits. Information about completed runs is saved.

Implementation

When defining the objective function fn passed to fmin(), and when selecting a cluster setup, it is helpful to understand how SparkTrials distributes tuning tasks.

In Hyperopt, a trial generally corresponds to fitting one model on one setting of hyperparameters. Hyperopt iteratively generates trials, evaluates them, and repeats.

With SparkTrials, the driver node of your cluster generates new trials, and worker nodes evaluate those trials. Each trial is generated with a Spark job which has one task, and is evaluated in the task on a worker machine. If your cluster is set up to run multiple tasks per worker, then multiple trials may be evaluated at once on that worker.

SparkTrials and MLflow

The level of support for MLflow depends on the Databricks Runtime version:

  • Databricks Runtime 5.5 LTS ML

    Supports automated MLflow tracking for hyperparameter tuning with Hyperopt and SparkTrials in Python. When automated MLflow tracking is enabled and you run fmin() with SparkTrials, hyperparameters and evaluation metrics are automatically logged in MLflow. Without automated MLflow tracking, you must make explicit API calls to log to MLflow. Automated MLflow tracking is enabled by default. To disable it, set the Spark configuration spark.databricks.mlflow.trackHyperopt.enabled to false. You can still use SparkTrials to distribute tuning even without automated MLflow tracking.

  • Databricks Runtime 6.3 ML and above support logging to MLflow from workers. You can add custom logging code in the objective function you pass to Hyperopt.

SparkTrials logs tuning results as nested MLflow runs as follows:

  • Main or parent run: The call to fmin() is logged as the main run. If there is an active run, SparkTrials logs to this active run and does not end the run when fmin() returns. If there is no active run, SparkTrials creates a new run, logs to it, and ends the run before fmin() returns.
  • Child runs: Each hyperparameter setting tested (a “trial”) is logged as a child run under the main run. For clusters running on Databricks Runtime 6.0 ML and above, MLflow log records from workers are also stored under the corresponding child runs.

When calling fmin(), Databricks recommends active MLflow run management; that is, wrap the call to fmin() inside a with mlflow.start_run(): statement. This ensures that each fmin() call is logged to a separate MLflow main run, and makes it easier to log extra tags, parameters, or metrics to that run.

Note

When you call fmin() multiple times within the same active MLflow run, MLflow logs those calls to the same main run. To resolve name conflicts for logged parameters and tags, MLflow appends a UUID to names with conflicts.

When logging from workers in Databricks Runtime 6.3 ML and above, you do not need to manage runs explicitly in the objective function. Call mlflow.log_param("param_from_worker", x) in the objective function to log a parameter to the child run. You can log parameters, metrics, tags, and artifacts in the objective function.