Medium datasets (~100MB)
Referencing a medium dataset directly from the objective function can be inefficient, because the data is shipped together with the function.
If you change the objective function code, the data has to be broadcast again.
Databricks recommends broadcasting the data explicitly using Spark and reading its value back from the broadcast variable on the workers.
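The recommendation above can be sketched as follows. This is a minimal illustration, not the notebook's own code: the helper names make_objective and tune_alpha, the search range for alpha, and the layout of the data as an (X_train, X_test, y_train, y_test) tuple are all assumptions made for the example. The objective captures only the broadcast handle, so the dataset itself is not serialized with the function.

```python
def make_objective(bc_data):
    """Build a Hyperopt objective that reads its data from a Spark broadcast variable.

    bc_data is assumed to wrap a tuple (X_train, X_test, y_train, y_test).
    """
    def objective(alpha):
        # Imported here so the imports resolve on the worker that runs the trial.
        from hyperopt import STATUS_OK
        from sklearn.linear_model import Lasso

        # Fetch the dataset on the worker from the broadcast handle.
        X_train, X_test, y_train, y_test = bc_data.value
        model = Lasso(alpha=alpha)
        model.fit(X_train, y_train)
        # Hyperopt minimizes the loss, so negate the R^2 score.
        loss = -model.score(X_test, y_test)
        return {"loss": loss, "status": STATUS_OK}

    return objective


def tune_alpha(sc, data, max_evals=8, parallelism=2):
    """Broadcast data once, then tune alpha with SparkTrials.

    sc is a SparkContext; max_evals and parallelism are illustrative defaults.
    """
    from hyperopt import fmin, hp, tpe, SparkTrials

    # Explicit broadcast: the data is sent to the workers once and can be
    # reused even if the objective function code changes later.
    bc_data = sc.broadcast(data)
    try:
        best = fmin(
            fn=make_objective(bc_data),
            space=hp.uniform("alpha", 0.01, 10.0),  # assumed search range
            algo=tpe.suggest,
            max_evals=max_evals,
            trials=SparkTrials(parallelism=parallelism),
        )
    finally:
        # Release the cached copies on the executors when tuning is done.
        bc_data.unpersist()
    return best["alpha"]
```

In a Databricks notebook, where sc is predefined, calling tune_alpha(sc, data_medium) with a hypothetical data_medium tuple would return the best alpha found. If the broadcast variable is kept in a notebook variable instead, the objective can be edited and rerun without sending the data again.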
Hyperopt: best practices for datasets of different sizes
This notebook provides guidelines for using the Hyperopt class SparkTrials when working with datasets of different sizes.
The notebook uses randomly generated datasets. The goal is to tune the regularization parameter alpha in a LASSO model.
Requirements: