Load data using Petastorm

This article describes how to use Petastorm convert data from Apache Spark to TensorFlow or PyTorch. It also provides an example showing how to use Petastorm to prepare data for ML.

Note

The petastorm package is deprecated. Mosaic Streaming is the recommended replacement for loading large datasets from cloud storage.

Petastorm is an open source data access library. It enables single-node or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format and datasets that are already loaded as Apache Spark DataFrames. Petastorm supports popular Python-based machine learning (ML) frameworks such as TensorFlow, PyTorch, and PySpark. For more information about Petastorm, see the Petastorm API documentation.

Load data from Spark DataFrames using Petastorm

The Petastorm Spark converter API simplifies data conversion from Spark to TensorFlow or PyTorch. The input Spark DataFrame is first materialized in Parquet format and then loaded as a tf.data.Dataset or torch.utils.data.DataLoader. See the Spark Dataset Converter API section in the Petastorm API documentation.

The recommended workflow is:

Use Apache Spark to load and optionally preprocess data.
Use the Petastorm spark_dataset_converter method to convert data from a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader.
Feed data into a DL framework for training or inference.

Configure cache directory

The Petastorm Spark converter caches the input Spark DataFrame in Parquet format in a user-specified cache directory location. The cache directory must be a DBFS path starting with file:///dbfs/, for example, file:///dbfs/tmp/foo/ which refers to the same location as dbfs:/tmp/foo/. You can configure the cache directory in two ways:

In the cluster Spark config add the line: petastorm.spark.converter.parentCacheDirUrl file:///dbfs/...

In your notebook, call spark.conf.set():

from petastorm.spark import SparkDatasetConverter, make_spark_converter

spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, 'file:///dbfs/...')

You can either explicitly delete the cache after using it by calling converter.delete() or manage the cache implicitly by configuring the lifecycle rules in your object storage.

Databricks supports DL training in three scenarios: