petastorm-spark-converter-pytorch(Python)


Simplify data conversion from Spark to PyTorch

This notebook demonstrates the following workflow on Databricks:

  1. Load data using Spark.
  2. Convert the Spark DataFrame to a PyTorch DataLoader using the Petastorm spark_dataset_converter.
  3. Feed the data into a single-node PyTorch model for training.
  4. Feed the data into a distributed hyperparameter tuning function.
  5. Feed the data into a distributed PyTorch model for training.

The example in this notebook is based on the transfer learning tutorial from PyTorch. It applies the pre-trained MobileNetV2 model to the flowers dataset.

Requirements

  1. Databricks Runtime ML.
  2. Node type: one driver and two workers. Databricks recommends using GPU instances.

1. Load data using Spark

This example uses the flowers dataset from the TensorFlow team, which contains flower photos stored under five subdirectories, one per class. The dataset is available under Databricks Datasets at dbfs:/databricks-datasets/flowers.

The example loads the flowers table, which contains the preprocessed flowers dataset, using the binary file data source. To reduce running time, this notebook uses a small subset of the flowers dataset, including ~90 training images and ~10 validation images. When you run this notebook, you can increase the number of images used for better model accuracy.
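The sketch below illustrates one way this step could look. The dbfs:/databricks-datasets/flower_photos path, the label-extraction regex, the column names, and the subset size are illustrative assumptions, not the notebook's exact code.

```python
from pyspark.sql.functions import regexp_extract, col, udf
from pyspark.sql.types import LongType

# Read the raw JPEG files with the binary file data source (path is an assumption).
df = (spark.read.format("binaryFile")
      .option("recursiveFileLookup", "true")
      .option("pathGlobFilter", "*.jpg")
      .load("dbfs:/databricks-datasets/flower_photos"))

# Derive the class name from the parent directory, then map it to an integer index.
df = df.withColumn("label", regexp_extract(col("path"), "flower_photos/([^/]+)/", 1))
classes = sorted(r.label for r in df.select("label").distinct().collect())
to_index = udf(lambda name: classes.index(name), LongType())

# Keep a small subset to reduce running time, then split into train/validation sets.
df_train, df_val = (df.select(col("content"), to_index(col("label")).alias("label_index"))
                    .limit(100)
                    .randomSplit([0.9, 0.1], seed=12345))
```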

2. Cache the Spark DataFrame using Petastorm Spark converter
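A minimal sketch of this step, assuming the df_train and df_val DataFrames from step 1; the cache directory path is an example location, not a requirement.

```python
from petastorm.spark import SparkDatasetConverter, make_spark_converter

# Set a parent cache directory for the intermediate Parquet files (example path).
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               "file:///dbfs/tmp/petastorm/cache")

# Materialize the Spark DataFrames and wrap them in picklable converters.
converter_train = make_spark_converter(df_train)
converter_val = make_spark_converter(df_val)

print(f"train rows: {len(converter_train)}, validation rows: {len(converter_val)}")
```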

3. Feed the data into a single-node PyTorch model for training

Get the MobileNetV2 model from torchvision
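A minimal sketch, assuming five flower classes and a frozen feature extractor for transfer learning.

```python
import torch
from torchvision import models

def get_model(num_classes=5):
    # Load MobileNetV2 pre-trained on ImageNet.
    model = models.mobilenet_v2(pretrained=True)
    # Freeze the feature extractor; only the classifier head is trained.
    for param in model.parameters():
        param.requires_grad = False
    # Replace the classifier head with one sized for the flower classes.
    model.classifier[1] = torch.nn.Linear(model.classifier[1].in_features, num_classes)
    return model
```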

Define the train and evaluate functions for the model
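A compressed sketch of those functions. It assumes the Petastorm dataloader yields dict-like batches with features and label_index fields (the field names must match the TransformSpec defined below) and that the iterators are infinite, so each epoch runs a fixed number of steps.

```python
import torch
import torch.nn.functional as F

def train_one_epoch(model, optimizer, train_iter, steps_per_epoch, device):
    """Run one epoch's worth of steps over an infinite Petastorm dataloader iterator."""
    model.train()
    for _ in range(steps_per_epoch):
        batch = next(train_iter)  # dict-like batch; field names are assumptions
        inputs = batch["features"].to(device)
        labels = batch["label_index"].to(device)
        optimizer.zero_grad()
        loss = F.cross_entropy(model(inputs), labels)
        loss.backward()
        optimizer.step()

def evaluate(model, val_iter, validation_steps, device):
    """Compute accuracy over a fixed number of validation batches."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for _ in range(validation_steps):
            batch = next(val_iter)
            inputs = batch["features"].to(device)
            labels = batch["label_index"].to(device)
            preds = model(inputs).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    return correct / max(total, 1)
```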

Preprocess images

Before feeding the dataset into the model, you need to decode the raw image bytes and apply the standard ImageNet transforms. Databricks recommends not performing this transformation on the Spark DataFrame, because it substantially increases the size of the intermediate files and can degrade performance. Instead, perform the transformation in a TransformSpec function in Petastorm.
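A sketch of such a TransformSpec, assuming the DataFrame has a content column of raw JPEG bytes and a label_index column; the features field name and its (3, 224, 224) shape are chosen to match MobileNetV2's input.

```python
import io
import numpy as np
from PIL import Image
from petastorm import TransformSpec
from torchvision import transforms

def transform_row(pd_batch):
    # Decode raw JPEG bytes and apply the standard ImageNet transforms.
    transformers = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])
    pd_batch["features"] = pd_batch["content"].map(
        lambda raw: transformers(Image.open(io.BytesIO(raw))).numpy())
    return pd_batch.drop(labels=["content"], axis=1)

# Declare the new 'features' field and the columns to keep after the transform.
transform_spec_fn = TransformSpec(
    transform_row,
    edit_fields=[("features", np.float32, (3, 224, 224), False)],
    selected_fields=["features", "label_index"],
)
```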

Train and evaluate the model on the local machine

Use converter.make_torch_dataloader(...) to create the dataloader.
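One way to wire the converters, TransformSpec, and train/evaluate helpers together on the driver, under the assumptions above; the batch size and epoch count are arbitrary.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = get_model(num_classes=5).to(device)
optimizer = torch.optim.SGD(model.classifier[1].parameters(), lr=0.001, momentum=0.9)

BATCH_SIZE = 32
NUM_EPOCHS = 3

with converter_train.make_torch_dataloader(transform_spec=transform_spec_fn,
                                           batch_size=BATCH_SIZE) as train_loader, \
     converter_val.make_torch_dataloader(transform_spec=transform_spec_fn,
                                         batch_size=BATCH_SIZE) as val_loader:
    # num_epochs defaults to None, so the loaders yield infinite batches;
    # create one iterator per loader and advance it with next() across epochs.
    train_iter, val_iter = iter(train_loader), iter(val_loader)
    steps_per_epoch = max(1, len(converter_train) // BATCH_SIZE)
    validation_steps = max(1, len(converter_val) // BATCH_SIZE)
    for epoch in range(NUM_EPOCHS):
        train_one_epoch(model, optimizer, train_iter, steps_per_epoch, device)
        accuracy = evaluate(model, val_iter, validation_steps, device)
        print(f"epoch {epoch}: validation accuracy {accuracy:.3f}")
```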

4. Feed the data into a distributed hyperparameter tuning function

Use Hyperopt with SparkTrials for hyperparameter tuning. Because the converter is picklable, it can be shipped to remote workers and used to generate a PyTorch DataLoader on the driver node and on the worker nodes.
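A sketch of the tuning step. It assumes a hypothetical train_and_evaluate(lr) helper that re-creates the dataloaders from the converters, runs the local training loop, and returns the validation loss; the search space, parallelism, and max_evals are arbitrary.

```python
from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK

def objective(lr):
    # train_and_evaluate is an assumed helper wrapping the local training loop above;
    # the converters it uses are pickled and re-opened on each Spark worker.
    loss = train_and_evaluate(lr)
    return {"loss": loss, "status": STATUS_OK}

search_space = hp.loguniform("lr", -10, -4)

best_params = fmin(fn=objective,
                   space=search_space,
                   algo=tpe.suggest,
                   max_evals=8,
                   trials=SparkTrials(parallelism=2))
```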

5. Feed the data into a distributed PyTorch model for training

Use HorovodRunner for distributed training.

The example uses the default value of the num_epochs parameter (None) to generate infinite batches of data, which avoids having to handle the last incomplete batch. This is particularly useful in the distributed training scenario, where you must guarantee that all workers see the same number of data records per step. Because the data shards may not all be the same length, setting num_epochs to a specific number could violate that guarantee.
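A condensed sketch of the Horovod training function and its launch through HorovodRunner. It reuses the converters, TransformSpec, and helper functions assumed earlier, shards the data by hvd.rank(), and omits checkpointing, learning-rate scheduling, and metric averaging.

```python
import horovod.torch as hvd
import torch
from sparkdl import HorovodRunner

BATCH_SIZE = 32
NUM_EPOCHS = 3

def train_and_evaluate_hvd(lr=0.001):
    hvd.init()
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    if torch.cuda.is_available():
        torch.cuda.set_device(hvd.local_rank())

    model = get_model(num_classes=5).to(device)
    # Scale the learning rate by the number of workers and wrap the optimizer.
    optimizer = torch.optim.SGD(model.classifier[1].parameters(),
                                lr=lr * hvd.size(), momentum=0.9)
    optimizer = hvd.DistributedOptimizer(optimizer,
                                         named_parameters=model.named_parameters())
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

    # Each worker reads its own shard; num_epochs stays None (the default), so the
    # loader yields infinite batches and every worker runs the same number of steps.
    with converter_train.make_torch_dataloader(transform_spec=transform_spec_fn,
                                               cur_shard=hvd.rank(),
                                               shard_count=hvd.size(),
                                               batch_size=BATCH_SIZE) as train_loader:
        train_iter = iter(train_loader)
        steps_per_epoch = max(1, len(converter_train) // (BATCH_SIZE * hvd.size()))
        for epoch in range(NUM_EPOCHS):
            train_one_epoch(model, optimizer, train_iter, steps_per_epoch, device)

# Launch the training function on two Horovod processes (one per worker).
hr = HorovodRunner(np=2)
hr.run(train_and_evaluate_hvd, lr=0.001)
```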