Load data using Petastorm
Petastorm is an open source data access library. This library enables single-node or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format and datasets that are already loaded as Apache Spark DataFrames. Petastorm supports popular Python-based machine learning (ML) frameworks such as TensorFlow, PyTorch, and PySpark. For more information about Petastorm, see the Petastorm GitHub page and Petastorm API documentation.
Load data from Spark DataFrames using Petastorm
The Petastorm Spark converter API simplifies data conversion from Spark to TensorFlow or PyTorch. The input Spark DataFrame is first materialized in Parquet format and then loaded as a tf.data.Dataset or torch.utils.data.DataLoader.
See the Spark Dataset Converter API section in the Petastorm API documentation.
The recommended workflow is:
Use Apache Spark to load and optionally preprocess data.
Use the Petastorm make_spark_converter method to convert data from a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader (see the sketch after this list).
Feed data into a DL framework for training or inference.
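For illustration, here is a minimal sketch of this workflow for the TensorFlow case. The source path dbfs:/tmp/foo/ is hypothetical, and a cache directory is assumed to be configured as described in the next section:

from petastorm.spark import make_spark_converter

# Assumes an active SparkSession named `spark` and a configured
# Petastorm cache directory (see "Configure cache directory" below).
df = spark.read.parquet("dbfs:/tmp/foo/")  # hypothetical source data

# Materialize the DataFrame as Parquet files in the cache directory.
converter = make_spark_converter(df)

# Load the cached data as a tf.data.Dataset.
with converter.make_tf_dataset() as dataset:
    for batch in dataset.take(1):
        print(batch)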
Configure cache directory
The Petastorm Spark converter caches the input Spark DataFrame in Parquet format in a user-specified cache directory location. The cache directory must be a DBFS path starting with file:///dbfs/, for example, file:///dbfs/tmp/foo/, which refers to the same location as dbfs:/tmp/foo/. You can configure the cache directory in two ways:
In the cluster Spark config, add the line:
petastorm.spark.converter.parentCacheDirUrl file:///dbfs/...
In your notebook, call spark.conf.set():

from petastorm.spark import SparkDatasetConverter, make_spark_converter

spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, 'file:///dbfs/...')
You can either explicitly delete the cache after using it by calling converter.delete(), or manage the cache implicitly by configuring lifecycle rules in your object storage.
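For example, a try/finally block keeps the cleanup explicit even if training fails. This is a minimal sketch that assumes df and the converter workflow from the earlier example:

from petastorm.spark import make_spark_converter

converter = make_spark_converter(df)
try:
    with converter.make_tf_dataset() as dataset:
        ...  # train or evaluate with the dataset
finally:
    # Remove this converter's cached Parquet files.
    converter.delete()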
Databricks supports DL training in three scenarios:
Single-node training
Distributed hyperparameter tuning
Distributed training
For end-to-end examples of these scenarios, see the accompanying example notebooks.
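As an illustration of the single-node case, the same converter can also produce a PyTorch DataLoader; the batch size and the single-batch peek below are placeholder assumptions:

# Load the cached data as a torch.utils.data.DataLoader.
with converter.make_torch_dataloader(batch_size=32) as dataloader:
    batch = next(iter(dataloader))  # peek at one batch
    print(batch)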
Load Parquet files directly using Petastorm
This method is less preferred than the Petastorm Spark converter API.
The recommended workflow is:
Use Apache Spark to load and optionally preprocess data.
Save data in Parquet format into a DBFS path that has a companion DBFS mount.
Load data in Petastorm format via the DBFS mount point.
Use data in a DL framework for training or inference.
See the example notebook for an end-to-end example.
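As a rough sketch of steps 2 through 4 for the TensorFlow case (the paths below are hypothetical; the dbfs:/... and file:///dbfs/... URLs refer to the same mounted location):

from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

# Step 2: save the preprocessed DataFrame as Parquet (hypothetical path).
df.write.mode("overwrite").parquet("dbfs:/tmp/petastorm/parquet")

# Step 3: read the Parquet files through the DBFS mount point.
with make_batch_reader("file:///dbfs/tmp/petastorm/parquet") as reader:
    # Step 4: wrap the reader as a tf.data.Dataset for training.
    dataset = make_petastorm_dataset(reader)
    for batch in dataset.take(1):
        print(batch)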