petastorm(Python)

Use Spark and Petastorm to prepare data for deep learning

This notebook demonstrates the following workflow on Databricks:

1. Use Spark to load and preprocess data.
2. Save data using Parquet under dbfs:/ml.
3. Load data using Petastorm via the optimized FUSE mount file:/dbfs/ml.
4. Feed data into a DL framework for training or inference.

Requirements

Databricks Runtime ML

Load, preprocess, and save data using Spark

Spark can load data from many sources. This notebook downloads the MNIST dataset in LIBSVM format and loads it using Spark's built-in LIBSVM data source.
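As a minimal sketch of the loading step, the helper below reads a LIBSVM file into a DataFrame with `label` and `features` columns. The function name `load_mnist_libsvm` and the example path are hypothetical, and the default of 784 features assumes 28x28 MNIST images; `spark` is the session that Databricks notebooks provide.

```python
def load_mnist_libsvm(spark, libsvm_path, num_features=784):
    """Read a LIBSVM file into a DataFrame with `label` and `features` columns.

    `num_features` is passed explicitly so Spark does not need an extra pass
    over the file to infer the vector size (784 assumes 28x28 MNIST images).
    """
    return (
        spark.read.format("libsvm")
        .option("numFeatures", str(num_features))
        .load(libsvm_path)
    )


# Example usage in a notebook (the path is a placeholder, not a real location):
# df = load_mnist_libsvm(spark, "/ml/tmp/petastorm/mnist.bz2")
# df.printSchema()
```

The `features` column produced this way is an MLlib vector, which is why the conversion described next is needed before Petastorm can read the data.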

Petastorm supports scalar and array columns in Spark DataFrames. An MLlib vector is a user-defined type (UDT), which requires special handling. Register a user-defined function (UDF) that converts MLlib vectors into dense arrays.
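One way to do the conversion is with a Python UDF, sketched below. The names `vector_to_list` and `write_parquet_for_petastorm` are hypothetical, as is the example output path; the sketch assumes the DataFrame has a vector column `features` and a numeric column `label`. The Spark imports are deferred into the function so the pure-Python helper can be used on its own.

```python
def vector_to_list(vector):
    """Convert an MLlib vector (dense or sparse) to a plain list of floats."""
    return [float(x) for x in vector.toArray()]


def write_parquet_for_petastorm(df, parquet_path):
    """Replace the vector column with an array column and save as Parquet."""
    # Deferred imports: only needed when a Spark DataFrame is actually written.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import ArrayType, DoubleType

    # Wrap the helper as a UDF that returns array<double>.
    to_array = udf(vector_to_list, ArrayType(DoubleType()))

    (
        df.select(
            to_array(col("features")).alias("features"),
            col("label").cast("int").alias("label"),
        )
        .write.mode("overwrite")
        .parquet(parquet_path)
    )


# Example usage (placeholder path under dbfs:/ml):
# write_parquet_for_petastorm(df, "dbfs:/ml/tmp/petastorm/parquet")
```

On Spark 3.0 and later, the built-in `pyspark.ml.functions.vector_to_array` performs the same conversion without a Python UDF and is likely faster.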

Load data using Petastorm and feed data into a DL framework

Use Petastorm to load the Parquet data and create a tf.data.Dataset. Then fit a simple neural network model using tf.keras.
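The following is a rough sketch of this step, not a tuned training setup. It uses Petastorm's `make_batch_reader` and `make_petastorm_dataset`; the helper names `to_petastorm_url` and `train_on_parquet`, the model architecture, the step counts, and the division by 255 (which assumes pixel values in the range 0 to 255) are all illustrative assumptions. It also assumes the Parquet data has an array column `features` of length 784 and an integer column `label`.

```python
def to_petastorm_url(dbfs_path):
    """Map dbfs:/ml/x (or /ml/x) to the FUSE-mount URL file:///dbfs/ml/x."""
    path = dbfs_path[len("dbfs:"):] if dbfs_path.startswith("dbfs:") else dbfs_path
    return "file:///dbfs/" + path.lstrip("/")


def train_on_parquet(parquet_path, steps_per_epoch=10, epochs=10):
    """Fit a small dense network on Parquet data read through Petastorm."""
    # Deferred imports: TensorFlow and Petastorm are only needed for training.
    import tensorflow as tf
    from petastorm import make_batch_reader
    from petastorm.tf_utils import make_petastorm_dataset

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(784,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )

    # num_epochs=None lets the reader cycle over the data indefinitely;
    # Keras stops after steps_per_epoch * epochs batches.
    with make_batch_reader(to_petastorm_url(parquet_path), num_epochs=None) as reader:
        # Each element is a batch (one row group) exposed as a namedtuple.
        dataset = make_petastorm_dataset(reader).map(
            lambda batch: (
                tf.cast(tf.reshape(batch.features, [-1, 784]), tf.float32) / 255.0,
                batch.label,
            )
        )
        model.fit(dataset, steps_per_epoch=steps_per_epoch, epochs=epochs)
    return model


# Example usage (placeholder path under dbfs:/ml):
# model = train_on_parquet("dbfs:/ml/tmp/petastorm/parquet")
```

Training happens inside the `with` block because the dataset is only valid while the reader is open.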