hyperopt-spark-data (Python)


Hyperopt: best practices for datasets of different sizes

This notebook provides guidelines for using the Hyperopt class SparkTrials when working with datasets of different sizes:

  • small (~10MB or less)
  • medium (~100MB)
  • large (~1GB or more)

The notebook uses randomly generated datasets. The goal is to tune the regularization parameter alpha in a LASSO model.
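
As a concrete reference for the sections below, here is a minimal sketch of that kind of setup: a randomly generated regression dataset split into train and test sets. The array names, sizes, and noise level are illustrative assumptions, not values taken from the original notebook.

import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative random regression data; increase n_samples to reach the
# medium (~100MB) and large (~1GB) regimes discussed below.
n_samples, n_features = 10_000, 10
X = np.random.rand(n_samples, n_features)
coef = np.random.rand(n_features)
y = X @ coef + 0.1 * np.random.randn(n_samples)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)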

Requirements:

  • Databricks Runtime for Machine Learning
  • Two workers

Small datasets (~10MB or less)

When a dataset is small, you can load it on the driver and reference it directly from the objective function.
SparkTrials automatically broadcasts the data together with the objective function to the workers.
The overhead of doing this is negligible. A sketch of the pattern follows.
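
This sketch assumes the X_train/X_test/y_train/y_test arrays from the setup above and uses scikit-learn's Lasso; the search range for alpha, max_evals, and parallelism are illustrative choices, not prescribed values.

from hyperopt import fmin, tpe, hp, SparkTrials
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

def objective(alpha):
    # The small dataset is referenced directly; SparkTrials serializes it
    # together with the objective function and ships both to the workers.
    model = Lasso(alpha=alpha)
    model.fit(X_train, y_train)
    return mean_squared_error(y_test, model.predict(X_test))

best = fmin(
    fn=objective,
    space=hp.loguniform("alpha", -5, 0),  # illustrative search range for alpha
    algo=tpe.suggest,
    max_evals=16,
    trials=SparkTrials(parallelism=2),
)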

Medium datasets (~100MB)

Referencing a medium-sized dataset directly from the objective function can be inefficient:
each time you change the objective function code, the data must be broadcast again.
Databricks recommends broadcasting the data explicitly with Spark and reading its value from the broadcast variable on the workers, as sketched below.
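
A minimal sketch of explicit broadcasting, assuming the same arrays and LASSO objective as above; sc is the SparkContext that Databricks notebooks predefine, and the tuning parameters remain illustrative.

from hyperopt import fmin, tpe, hp, SparkTrials
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

# Broadcast the data once; editing and re-running the objective function
# afterwards reuses this broadcast instead of re-shipping the dataset.
bc_data = sc.broadcast((X_train, X_test, y_train, y_test))

def objective(alpha):
    # Read the data from the broadcast variable on the worker.
    X_tr, X_te, y_tr, y_te = bc_data.value
    model = Lasso(alpha=alpha)
    model.fit(X_tr, y_tr)
    return mean_squared_error(y_te, model.predict(X_te))

best = fmin(
    fn=objective,
    space=hp.loguniform("alpha", -5, 0),
    algo=tpe.suggest,
    max_evals=16,
    trials=SparkTrials(parallelism=2),
)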

Large datasets (~1GB or more)

Broadcasting a large dataset requires significant cluster resources.
Consider storing the data on DBFS and loading it on the workers using the DBFS local file interface, as sketched below.
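
A minimal sketch of the DBFS pattern, again assuming the arrays and LASSO objective from above. The /dbfs/tmp/hyperopt_data directory is a hypothetical path, and the sketch assumes the /dbfs local file interface is available on the workers as well as the driver.

import os
import numpy as np
from hyperopt import fmin, tpe, hp, SparkTrials
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

# Write the dataset to DBFS once from the driver via the local file interface.
data_dir = "/dbfs/tmp/hyperopt_data"  # hypothetical path
os.makedirs(data_dir, exist_ok=True)
for name, arr in [("X_train", X_train), ("X_test", X_test),
                  ("y_train", y_train), ("y_test", y_test)]:
    np.save(os.path.join(data_dir, f"{name}.npy"), arr)

def objective(alpha):
    # Each trial loads the data from DBFS on the worker,
    # avoiding a large broadcast from the driver.
    X_tr = np.load(os.path.join(data_dir, "X_train.npy"))
    X_te = np.load(os.path.join(data_dir, "X_test.npy"))
    y_tr = np.load(os.path.join(data_dir, "y_train.npy"))
    y_te = np.load(os.path.join(data_dir, "y_test.npy"))
    model = Lasso(alpha=alpha)
    model.fit(X_tr, y_tr)
    return mean_squared_error(y_te, model.predict(X_te))

best = fmin(
    fn=objective,
    space=hp.loguniform("alpha", -5, 0),
    algo=tpe.suggest,
    max_evals=16,
    trials=SparkTrials(parallelism=2),
)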