Load data using Petastorm

Petastorm is an open source data access library. It enables single-node or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format and from datasets that are already loaded as Apache Spark DataFrames. Petastorm supports popular Python-based machine learning (ML) frameworks such as TensorFlow, PyTorch, and PySpark. For more information about Petastorm, refer to the Petastorm GitHub page and Petastorm API documentation.

Load data from Spark DataFrames using Petastorm

The Petastorm Spark converter API simplifies data conversion from Spark to TensorFlow or PyTorch. The input Spark DataFrame is first materialized in Parquet format and then loaded as a tf.data.Dataset or torch.utils.data.DataLoader. See the Spark Dataset Converter API section in the Petastorm API documentation.

The recommended workflow is:

  1. Use Apache Spark to load and optionally preprocess data.
  2. Use the make_spark_converter() function from the Petastorm spark_dataset_converter module to convert data from a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader.
  3. Feed data into a deep learning (DL) framework for training or inference.
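
The steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: it assumes Petastorm and PyTorch are installed and that the cache directory is already configured. The function name count_rows_via_petastorm and its make_converter parameter are hypothetical and exist only so the flow can be exercised without a cluster.

```python
def count_rows_via_petastorm(df, make_converter=None, batch_size=32):
    """Convert a Spark DataFrame with Petastorm and count the rows in one pass.

    `df` is a Spark DataFrame that step 1 has already loaded and preprocessed.
    `make_converter` is a hypothetical injection point for illustration; by
    default it is petastorm.spark.make_spark_converter.
    """
    if make_converter is None:
        # Imported lazily so this module loads even where Petastorm is absent.
        from petastorm.spark import make_spark_converter
        make_converter = make_spark_converter

    # Step 2: materialize the DataFrame as Parquet in the cache directory.
    converter = make_converter(df)

    rows_seen = 0
    # Step 2 (continued): expose the cached data as a PyTorch DataLoader.
    # num_epochs=1 stops after a single pass over the data.
    with converter.make_torch_dataloader(batch_size=batch_size, num_epochs=1) as loader:
        # Step 3: a real training loop would feed each batch to a model here.
        for batch in loader:
            # Each batch maps column names to batched values.
            first_column = next(iter(batch.values()))
            rows_seen += len(first_column)

    # The cached Parquet files remain until converter.delete() is called.
    return rows_seen
```

For TensorFlow, converter.make_tf_dataset(batch_size=...) plays the same role and yields a tf.data.Dataset inside the with block.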

Configure cache directory

The Petastorm Spark converter caches the input Spark DataFrame in Parquet format in a user-specified cache directory location. The cache directory must be a DBFS FUSE path starting with file:///dbfs/, for example, file:///dbfs/tmp/foo/ which refers to the same location as dbfs:/tmp/foo/. You can configure the cache directory in two ways:

  • In the cluster Spark config, add the line: petastorm.spark.converter.parentCacheDirUrl file:///dbfs/...

  • In your notebook, call spark.conf.set():

    from petastorm.spark import SparkDatasetConverter, make_spark_converter
    
    spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, 'file:///dbfs/...')
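
Because the cache directory must use the file:///dbfs/ form, it can help to derive it from the equivalent dbfs:/ path instead of typing it by hand. The helpers below are a small sketch of that mapping; both function names are hypothetical and are not part of Petastorm or Databricks.

```python
def dbfs_uri_to_fuse_url(dbfs_uri):
    """Map a DBFS URI such as 'dbfs:/tmp/foo/' to its FUSE URL form."""
    prefix = "dbfs:/"
    if not dbfs_uri.startswith(prefix):
        raise ValueError(f"expected a path starting with {prefix!r}, got {dbfs_uri!r}")
    # Drop the scheme and any extra leading slashes, then re-root under /dbfs/.
    relative = dbfs_uri[len(prefix):].lstrip("/")
    return "file:///dbfs/" + relative


def is_valid_cache_dir_url(url):
    """Return True if the URL has the prefix required for the cache directory."""
    return url.startswith("file:///dbfs/")
```

The result can then be passed as the second argument of spark.conf.set(), for example dbfs_uri_to_fuse_url('dbfs:/tmp/foo/').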
    

You can either explicitly delete the cache after using it by calling converter.delete() or manage the cache implicitly by configuring the lifecycle rules in your object storage.
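
One way to make the explicit deletion reliable is to wrap the converter in a context manager, so that converter.delete() runs even when training raises an exception. This is a sketch of that pattern, not part of the Petastorm API: temporary_converter and its make_converter parameter are hypothetical names introduced for illustration.

```python
from contextlib import contextmanager


@contextmanager
def temporary_converter(df, make_converter=None):
    """Yield a Petastorm converter for `df` and delete its cache on exit."""
    if make_converter is None:
        # Imported lazily so this module loads even where Petastorm is absent.
        from petastorm.spark import make_spark_converter
        make_converter = make_spark_converter

    converter = make_converter(df)
    try:
        yield converter
    finally:
        # Runs on normal exit and on exceptions, so the cache is not leaked.
        converter.delete()
```

A caller would write with temporary_converter(df) as converter: and build the dataset or data loader inside that block.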

Databricks supports DL training in three scenarios:

  • Single-node training
  • Distributed hyperparameter tuning
  • Distributed training
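
In the distributed training scenario, each worker usually reads its own shard of the cached data. The converter methods forward extra keyword arguments to the Petastorm reader, which accepts cur_shard and shard_count. The helper below is a hedged sketch of how those arguments might be built; shard_reader_kwargs is a hypothetical name, and obtaining the rank and worker count from a library such as Horovod is an assumption, not something this page prescribes.

```python
def shard_reader_kwargs(rank, world_size):
    """Build Petastorm reader kwargs so each worker reads a distinct shard.

    `rank` is this worker's zero-based index and `world_size` is the total
    number of workers. For a single worker no sharding arguments are needed.
    """
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    if world_size == 1:
        # Single-node training: read the whole dataset.
        return {}
    return {"cur_shard": rank, "shard_count": world_size}
```

A worker could then call converter.make_tf_dataset(batch_size=64, **shard_reader_kwargs(rank, world_size)), where the batch size of 64 is only an example value.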

For end-to-end examples, see the notebooks in the Examples section.

Load Parquet files directly using Petastorm

Prefer the Petastorm Spark converter API over this method where possible.

The recommended workflow is:

  1. Use Apache Spark to load and optionally preprocess data.
  2. Save data in Parquet format into a DBFS path that has a companion FUSE mount.
  3. Load data in Petastorm format via the FUSE mount point.
  4. Use data in a DL framework for training or inference.
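
Steps 3 and 4 above can be sketched as follows, assuming steps 1 and 2 have already written Parquet files (for example with df.write.parquet()) to a DBFS path. This is an illustrative sketch: count_parquet_rows and its reader_factory parameter are hypothetical, and the default reader is petastorm.make_batch_reader, which reads plain Parquet stores.

```python
def count_parquet_rows(parquet_url, reader_factory=None):
    """Read a Parquet dataset through the FUSE mount and count its rows.

    `parquet_url` is the FUSE form of the DBFS path, such as a URL that
    starts with file:///dbfs/. `reader_factory` is a hypothetical injection
    point; by default it is petastorm.make_batch_reader.
    """
    if reader_factory is None:
        # Imported lazily so this module loads even where Petastorm is absent.
        from petastorm import make_batch_reader
        reader_factory = make_batch_reader

    total = 0
    # Step 3: open the dataset for a single pass via the FUSE mount point.
    with reader_factory(parquet_url, num_epochs=1) as reader:
        # Step 4: a DL framework would consume each batch here.
        for batch in reader:
            # Each batch is a namedtuple with one array per column.
            columns = batch._asdict()
            first_column = next(iter(columns.values()))
            total += len(first_column)
    return total
```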

For an end-to-end example, see the Use Spark and Petastorm to prepare data for deep learning notebook.

Examples

Simplify data conversion from Spark to TensorFlow notebook


Simplify data conversion from Spark to PyTorch notebook


Use Spark and Petastorm to prepare data for deep learning notebook
