Use Spark and Petastorm to prepare data for deep learning
This notebook demonstrates the following workflow on Databricks:
Use Spark to load and preprocess data.
Save the data in Parquet format under dbfs:/ml.
Load data using Petastorm via the optimized FUSE mount file:/dbfs/ml.
Feed data into a DL framework for training or inference.
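The workflow above writes with the dbfs:/ml scheme and reads back through the file:/dbfs/ml FUSE mount, so the same directory is addressed in two ways. As a minimal sketch of that mapping, the helper below converts one form to the other. The function name dbfs_to_fuse_path and the subdirectory in the usage line are hypothetical, chosen only for illustration.

```python
def dbfs_to_fuse_path(dbfs_uri):
    """Map a dbfs:/ URI to the matching file:/dbfs/ FUSE path.

    Hypothetical helper: it only rewrites the string prefix and does
    not check that the path exists.
    """
    prefix = "dbfs:/"
    if not dbfs_uri.startswith(prefix):
        raise ValueError("expected a URI starting with 'dbfs:/'")
    # Drop the scheme and re-root the remainder under the FUSE mount.
    return "file:/dbfs/" + dbfs_uri[len(prefix):].lstrip("/")


# Usage: the directory used in this notebook.
print(dbfs_to_fuse_path("dbfs:/ml"))
```

Deriving the read path from the write path keeps the two from drifting apart when the directory is renamed.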
Requirements
Databricks Runtime ML
Load, preprocess, and save data using Spark
Spark can load data from many sources.
This notebook downloads the MNIST dataset in LIBSVM format and loads it using Spark's built-in LIBSVM data source.
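To make the LIBSVM format concrete, here is a rough sketch. Each line holds a label followed by index:value pairs with one-based indices. parse_libsvm_line is a hypothetical pure-Python illustration of that layout, not what Spark runs internally. load_libsvm shows how the built-in data source might be called, assuming an active SparkSession is passed in and that each MNIST image has 784 (28 x 28) features.

```python
def parse_libsvm_line(line):
    """Parse one LIBSVM line into (label, zero_based_indices, values)."""
    parts = line.split()
    label = float(parts[0])
    indices, values = [], []
    for item in parts[1:]:
        idx, val = item.split(":")
        indices.append(int(idx) - 1)  # LIBSVM indices are one-based
        values.append(float(val))
    return label, indices, values


def load_libsvm(spark, path, num_features=784):
    """Load a LIBSVM file with Spark's built-in data source.

    Assumes `spark` is an active SparkSession. The result has a double
    `label` column and an MLlib vector `features` column.
    """
    return (spark.read.format("libsvm")
            .option("numFeatures", str(num_features))
            .load(path))


# Usage: a made-up two-feature line, for illustration only.
print(parse_libsvm_line("5 153:3 154:18"))
```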
Petastorm supports scalar and array columns in Spark DataFrames.
An MLlib vector is a user-defined type (UDT), which requires special handling.
Register a user-defined function (UDF) that converts MLlib vectors into dense arrays.
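The conversion step might look like the sketch below. sparse_to_dense is a hypothetical pure-Python illustration of what densifying a sparse vector means. make_to_array_udf builds the actual UDF from the vector's toArray() method, and imports PySpark lazily so the snippet also loads outside a Spark environment. The column name features in the usage comment is an assumption based on the LIBSVM data source's output.

```python
def sparse_to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = float(v)
    return dense


def make_to_array_udf():
    """Return a Spark UDF that turns an MLlib vector into array<double>."""
    # Lazy imports: PySpark is only needed when the UDF is built.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, DoubleType

    # toArray() gives a NumPy array; tolist() yields plain Python floats,
    # which Spark can serialize as array<double>.
    return udf(lambda v: v.toArray().tolist(), ArrayType(DoubleType()))


# Usage on a DataFrame `df` with a vector column named "features":
#   to_array = make_to_array_udf()
#   df = df.withColumn("features", to_array("features"))
print(sparse_to_dense(5, [1, 3], [2.0, 4.5]))
```

After the conversion, the column is a plain array that can be written to Parquet and read back by Petastorm.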
Load data using Petastorm and feed data into a DL framework
Use Petastorm to load the Parquet data and create a tf.data.Dataset.
Then fit a simple neural network model using tf.keras.
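A minimal sketch of this step follows, assuming the Parquet data has an array column features with 784 values per row and a numeric column label. The function name train_on_parquet, the network architecture, and the default step counts are illustrative choices, not the notebook's own. TensorFlow and Petastorm are imported inside the function so the definition loads even where they are not installed.

```python
def train_on_parquet(dataset_url, epochs=1, steps_per_epoch=10):
    """Read Parquet with Petastorm and fit a small Keras model (sketch)."""
    import tensorflow as tf
    from petastorm import make_batch_reader
    from petastorm.tf_utils import make_petastorm_dataset

    # Illustrative architecture only.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28, 1)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])

    def to_example(row):
        # Reshape each batch of flat pixel arrays into 28x28x1 images
        # and one-hot encode the labels.
        images = tf.cast(tf.reshape(row.features, [-1, 28, 28, 1]), tf.float32)
        labels = tf.one_hot(tf.cast(row.label, tf.int32), 10)
        return images, labels

    # num_epochs=None repeats the data indefinitely, so steps_per_epoch
    # bounds each training epoch.
    with make_batch_reader(dataset_url, num_epochs=None) as reader:
        dataset = make_petastorm_dataset(reader).map(to_example)
        model.fit(dataset, steps_per_epoch=steps_per_epoch, epochs=epochs)
    return model


# Usage, with the FUSE path from this notebook:
#   model = train_on_parquet("file:/dbfs/ml")
```

Keeping the reader in a with block closes Petastorm's worker pool once training finishes.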