petastorm-spark-converter-tensorflow(Python)


Simplify data conversion from Spark to TensorFlow

This notebook demonstrates the following workflow on Databricks:

  1. Load data using Spark.
  2. Convert the Spark DataFrame to a TensorFlow Dataset using petastorm spark_dataset_converter.
  3. Feed the data into a single-node TensorFlow model for training.
  4. Feed the data into a distributed hyperparameter tuning function.
  5. Feed the data into a distributed TensorFlow model for training.

The example in this notebook is based on the transfer learning tutorial from TensorFlow. It applies the pre-trained MobileNetV2 model to the flowers dataset.

Requirements

  1. Databricks Runtime ML.
  2. Node type: one driver and two workers. Databricks recommends using GPU instances.

1. Load data using Spark

This example uses the flowers dataset from the TensorFlow team, which contains flower photos stored under five subdirectories, one per class. The dataset is available under Databricks Datasets at dbfs:/databricks-datasets/flowers.

The example loads the flowers table, which contains the flowers dataset preprocessed with the binary file data source. To reduce running time, this notebook uses a small subset of the dataset: about 90 training images and 10 validation images. When you run this notebook, you can increase the number of images to improve model accuracy.
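
The cell below is a minimal sketch of this step. The Delta path `/databricks-datasets/flowers/delta` and the column names `content` (raw image bytes) and `label` are assumptions about the table layout; adjust them to match your environment.

```python
from pyspark.sql.functions import col, udf
from pyspark.sql.types import LongType

# Load a small subset of the preprocessed flowers table.
# Path and column names are assumptions about the dataset layout.
df = (spark.read.format("delta")
      .load("/databricks-datasets/flowers/delta")
      .select(col("content"), col("label"))
      .limit(100))

# Encode the string labels as integer indices for training.
labels = sorted(r.label for r in df.select("label").distinct().collect())
label_to_idx = {label: i for i, label in enumerate(labels)}
to_index = udf(lambda label: label_to_idx[label], LongType())
df = df.withColumn("label_index", to_index(col("label"))).select("content", "label_index")

# Split into ~90 training and ~10 validation images.
df_train, df_val = df.randomSplit([0.9, 0.1], seed=12345)
```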

2. Cache the Spark DataFrame using Petastorm Spark converter
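
A sketch of the caching step, assuming the `df_train` and `df_val` DataFrames from the sketch above. `make_spark_converter` materializes each DataFrame as Parquet files under the configured cache directory; the DBFS cache path used here is only an example.

```python
from petastorm.spark import SparkDatasetConverter, make_spark_converter

# Set the parent directory where Petastorm caches the materialized DataFrames.
# Any file:// or dbfs:// URL the cluster can write to works; this path is an example.
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               "file:///dbfs/tmp/petastorm/cache")

# Cache the training and validation DataFrames as Parquet and wrap them in converters.
converter_train = make_spark_converter(df_train)
converter_val = make_spark_converter(df_val)

print(f"train: {len(converter_train)}, val: {len(converter_val)}")
```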

3. Feed the data into a single-node TensorFlow model for training

Get the MobileNetV2 model from tensorflow.keras
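
A possible sketch of the model definition. `NUM_CLASSES` and the SGD hyperparameters are illustrative choices, not fixed requirements.

```python
from tensorflow import keras

NUM_CLASSES = 5  # the full flowers dataset has five classes

def get_model(lr=0.001):
    # MobileNetV2 pre-trained on ImageNet, without its classification head.
    base_model = keras.applications.MobileNetV2(
        include_top=False, input_shape=(224, 224, 3), weights="imagenet")
    # Freeze the base so only the new classification head is trained.
    base_model.trainable = False
    return keras.Sequential([
        base_model,
        keras.layers.GlobalAveragePooling2D(),
        keras.layers.Dense(NUM_CLASSES),
    ])

def get_compiled_model(lr=0.001):
    model = get_model(lr)
    model.compile(optimizer=keras.optimizers.SGD(learning_rate=lr, momentum=0.9),
                  loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=["accuracy"])
    return model
```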

Preprocess images

Before feeding the dataset into the model, you need to decode the raw image bytes and apply standard ImageNet transforms. Databricks recommends not doing this transformation on the Spark DataFrame since that substantially increases the size of the intermediate files and might decrease performance. Instead, do this transformation in a TransformSpec function in petastorm.

Alternatively, you can apply the transformation to the TensorFlow Dataset returned by the converter using dataset.map() with tf.map_fn().
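
The sketch below shows one way to define such a TransformSpec. It assumes the `content` and `label_index` columns from the earlier sketches and MobileNetV2's 224x224 input size.

```python
from io import BytesIO

import numpy as np
from PIL import Image
from petastorm import TransformSpec
from tensorflow import keras

def preprocess(content):
    # Decode the raw image bytes, resize to the MobileNetV2 input size,
    # and apply the standard ImageNet preprocessing.
    image = Image.open(BytesIO(content)).resize([224, 224])
    return keras.applications.mobilenet_v2.preprocess_input(
        np.asarray(image, dtype=np.float32))

def transform_row(pd_batch):
    # pd_batch is a pandas DataFrame holding one batch of rows from the cached Parquet files.
    pd_batch["features"] = pd_batch["content"].map(preprocess)
    return pd_batch.drop(columns=["content"])

# Declare the new "features" field and keep only the columns the model needs.
transform_spec_fn = TransformSpec(
    transform_row,
    edit_fields=[("features", np.float32, (224, 224, 3), False)],
    selected_fields=["features", "label_index"],
)
```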

Train and evaluate the model on the local machine

Use converter.make_tf_dataset(...) to create the dataset.
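
A sketch of local training and evaluation, reusing `get_compiled_model`, `transform_spec_fn`, and the converters from the sketches above. Because the dataset is infinite by default, `steps_per_epoch` and `validation_steps` must be set explicitly.

```python
def train_and_evaluate(lr=0.001, batch_size=32, epochs=5):
    model = get_compiled_model(lr)

    # make_tf_dataset yields named tuples of batches; convert them to (x, y) pairs.
    with converter_train.make_tf_dataset(transform_spec=transform_spec_fn,
                                         batch_size=batch_size) as train_ds, \
         converter_val.make_tf_dataset(transform_spec=transform_spec_fn,
                                       batch_size=batch_size) as val_ds:

        train_ds = train_ds.map(lambda x: (x.features, x.label_index))
        val_ds = val_ds.map(lambda x: (x.features, x.label_index))

        # With num_epochs=None (the default), the dataset is infinite, so the
        # number of steps per epoch must be given explicitly.
        steps_per_epoch = max(1, len(converter_train) // batch_size)
        validation_steps = max(1, len(converter_val) // batch_size)

        hist = model.fit(train_ds,
                         steps_per_epoch=steps_per_epoch,
                         epochs=epochs,
                         validation_data=val_ds,
                         validation_steps=validation_steps,
                         verbose=2)
    return hist.history["val_loss"][-1], hist.history["val_accuracy"][-1]

loss, accuracy = train_and_evaluate()
```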

4. Feed the data into a distributed hyperparameter tuning function

Use Hyperopt SparkTrials for hyperparameter tuning.
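
A sketch of the tuning step, assuming the `train_and_evaluate` helper defined above. The search space and `max_evals` are illustrative.

```python
from hyperopt import STATUS_OK, SparkTrials, fmin, hp, tpe

def hyperopt_objective(lr):
    # Each trial trains a model with the given learning rate on a Spark worker
    # and reports the validation loss to minimize.
    loss, accuracy = train_and_evaluate(lr)
    return {"loss": loss, "status": STATUS_OK}

# Search the learning rate on a log scale (roughly 5e-5 to 2e-2).
search_space = hp.loguniform("lr", -10, -4)

argmin = fmin(fn=hyperopt_objective,
              space=search_space,
              algo=tpe.suggest,
              max_evals=8,
              trials=SparkTrials(parallelism=2))
print(argmin)
```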

5. Feed the data into a distributed TensorFlow model for training

Use HorovodRunner for distributed training.

Use the default value of the num_epochs parameter (None) to generate an infinite stream of batches and avoid handling the last incomplete batch. This is particularly useful for distributed training, where the number of data records each worker sees per step must be identical. Because the data shards may differ in length, setting num_epochs to a specific number cannot provide that guarantee.
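
The sketch below shows one way to wire this together with HorovodRunner, reusing `get_model`, `transform_spec_fn`, and `converter_train` from the earlier sketches. Each worker reads its own shard via `cur_shard` and `shard_count`, and `num_epochs` is left at its default of None.

```python
import horovod.tensorflow.keras as hvd
import tensorflow as tf
from sparkdl import HorovodRunner
from tensorflow import keras

def train_hvd(lr=0.001, batch_size=32, epochs=5):
    # Initialize Horovod: one training process per worker slot.
    hvd.init()

    # Pin each process to a single GPU if GPUs are available.
    gpus = tf.config.experimental.list_physical_devices("GPU")
    if gpus:
        tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

    model = get_model(lr)
    # Scale the learning rate by the number of workers and wrap the optimizer
    # so gradients are averaged across workers.
    optimizer = hvd.DistributedOptimizer(
        keras.optimizers.SGD(learning_rate=lr * hvd.size(), momentum=0.9))
    model.compile(optimizer=optimizer,
                  loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=["accuracy"])

    # num_epochs=None (the default) yields an infinite stream, so every worker
    # consumes the same number of batches per step regardless of shard size.
    with converter_train.make_tf_dataset(transform_spec=transform_spec_fn,
                                         cur_shard=hvd.rank(),
                                         shard_count=hvd.size(),
                                         batch_size=batch_size) as train_ds:
        train_ds = train_ds.map(lambda x: (x.features, x.label_index))
        steps_per_epoch = max(1, len(converter_train) // (batch_size * hvd.size()))
        model.fit(train_ds,
                  steps_per_epoch=steps_per_epoch,
                  epochs=epochs,
                  callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
                  verbose=2)

# np=2 launches one training process on each of the two workers.
hr = HorovodRunner(np=2)
hr.run(train_hvd)
```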