In addition to single-machine training algorithms such as those in scikit-learn, you can use Hyperopt with distributed training algorithms. In this scenario, Hyperopt generates trials with different hyperparameter settings on the driver node, and each trial is evaluated with a distributed training algorithm so that it can take advantage of the full cluster. This setup works with any distributed machine learning algorithm or library, including Apache Spark MLlib and HorovodRunner.
HorovodRunner is a general API used to run distributed deep learning workloads on Databricks. HorovodRunner integrates Horovod with Spark’s barrier mode to provide higher stability for long-running deep learning training jobs on Spark.
With Hyperopt's default Trials class, trials are evaluated sequentially on the Spark driver node. For each trial, HorovodRunner is launched on the driver node and distributes the training job to the Spark worker nodes. HorovodRunner then collects the return values on the driver node and passes them to Hyperopt.
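The flow described above can be sketched roughly as follows. This is a minimal illustration, not the official example: `train_fn`, `make_objective`, and `run_search` are hypothetical names, the toy loss stands in for a real Horovod training loop, and the worker count and search space are arbitrary assumptions. It relies on `HorovodRunner.run` invoking the training function with keyword arguments and returning its result to the driver.

```python
import math


def train_fn(learning_rate):
    """Hypothetical stand-in for a Horovod training function.

    A real version would call hvd.init(), build and fit a model, and
    return a validation loss. A toy loss keeps this sketch self-contained.
    """
    return (math.log10(learning_rate) + 2.0) ** 2


def make_objective(launch):
    """Build a Hyperopt objective that hands each trial to `launch`.

    `launch` is expected to behave like HorovodRunner.run: it calls the
    training function with keyword arguments and returns its result.
    """
    def objective(params):
        loss = launch(train_fn, learning_rate=params["learning_rate"])
        # "ok" is the value of hyperopt.STATUS_OK.
        return {"loss": loss, "status": "ok"}
    return objective


def run_search(max_evals=8, num_workers=2):
    """Run the search on the driver; each trial trains on the workers."""
    # Imported here because sparkdl ships with Databricks Runtime ML
    # and is usually not installed elsewhere.
    from hyperopt import Trials, fmin, hp, tpe
    from sparkdl import HorovodRunner

    hr = HorovodRunner(np=num_workers)  # assumed number of worker processes
    space = {
        "learning_rate": hp.loguniform(
            "learning_rate", math.log(1e-4), math.log(1e-1)
        )
    }
    # The default Trials class (not SparkTrials) evaluates one trial at a time.
    return fmin(
        fn=make_objective(hr.run),
        space=space,
        algo=tpe.suggest,
        max_evals=max_evals,
        trials=Trials(),
    )
```

On a cluster, calling `run_search()` from a notebook would return the best hyperparameters that Hyperopt found. Because the launcher is passed in as an argument, the objective can also be checked locally with a plain function call in place of `hr.run`.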
Databricks does not support automatic logging to MLflow with the Trials class. With HorovodRunner, you do not use the SparkTrials class, and you must manually call MLflow to log trials for Hyperopt.
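One possible way to log trials manually is to wrap the objective so that each trial records its hyperparameters and loss in a nested MLflow run. The sketch below is an assumption about how this could be organized, not the documented approach: `make_logged_objective` and `tune_with_logging` are hypothetical helpers, and the `mlflow` module is passed in as an argument only so that the wrapper can be exercised with a stub outside Databricks.

```python
def make_logged_objective(evaluate, mlflow):
    """Wrap `evaluate` so each Hyperopt trial is logged to MLflow.

    `evaluate` takes the sampled hyperparameters and returns a loss, for
    example by launching HorovodRunner. `mlflow` is the imported mlflow
    module (or any object with the same three calls used below).
    """
    def objective(params):
        # One nested run per trial, grouped under the enclosing parent run.
        with mlflow.start_run(nested=True):
            mlflow.log_params(params)
            loss = evaluate(params)
            mlflow.log_metric("loss", loss)
        # "ok" is the value of hyperopt.STATUS_OK.
        return {"loss": loss, "status": "ok"}
    return objective


def tune_with_logging(evaluate, space, max_evals=8):
    """Run the search under a single parent MLflow run."""
    import mlflow
    from hyperopt import Trials, fmin, tpe

    with mlflow.start_run(run_name="hyperopt_search"):  # assumed run name
        return fmin(
            fn=make_logged_objective(evaluate, mlflow),
            space=space,
            algo=tpe.suggest,
            max_evals=max_evals,
            trials=Trials(),
        )
```

Grouping the per-trial runs under one parent run is a design choice that keeps a search together in the MLflow UI. Logging each trial as a separate top-level run would also work.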
For an example of hyperparameter tuning for distributed training using Hyperopt with HorovodRunner, see: