import tensorflow as tf

def decode_and_normalize(serialized_example):
    """
    Decode and normalize an image and label from the given `serialized_example`.
    It is used as a map function for `dataset.map`.
    """
    IMAGE_SIZE = 224
    feature_dataset = tf.io.parse_single_example(
        serialized_example,
        features={
            'content': tf.io.FixedLenFeature([], tf.string),
            'label_index': tf.io.FixedLenFeature([], tf.int64),
        })
    # Force 3 channels so the occasional grayscale JPEG doesn't break batching.
    image = tf.io.decode_jpeg(feature_dataset['content'], channels=3)
    label = tf.cast(feature_dataset['label_index'], tf.int32)
    image = tf.image.resize(image, [IMAGE_SIZE, IMAGE_SIZE])
    # Scale pixel values from [0, 255] into [-0.5, 0.5].
    image = tf.cast(image, tf.float32) * (1. / 255) - 0.5
    return image, label

parsed_dataset = dataset.map(decode_and_normalize)
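The cell above assumes `dataset` already exists as a `tf.data.TFRecordDataset`. A minimal sketch of how it might be constructed and consumed, assuming the TFRecord files were written to a hypothetical directory /dbfs/tmp/flower_photos_tfrecords (not part of the original notebook):

import tensorflow as tf

# Hypothetical output directory from the Spark write step; Spark part files have
# no extension, so globbing 'part-*' also skips the _SUCCESS marker file.
filenames = tf.io.gfile.glob('/dbfs/tmp/flower_photos_tfrecords/part-*')
dataset = tf.data.TFRecordDataset(filenames)

# Decode in parallel, then batch and prefetch for training throughput.
parsed_dataset = (dataset
                  .map(decode_and_normalize, num_parallel_calls=tf.data.AUTOTUNE)
                  .batch(32)
                  .prefetch(tf.data.AUTOTUNE))

for images, labels in parsed_dataset.take(1):
    print(images.shape, labels.shape)  # (32, 224, 224, 3) (32,)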
Preparing image data for Distributed DL
This notebook uses the flowers dataset from the TensorFlow team as an example to show how to save image data from a Spark DataFrame to TFRecord files and load them back using TensorFlow; a sketch of the write step follows below.
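As an illustration of the write path, here is a minimal sketch using the spark-tensorflow-connector library, which registers the tfrecords data source. The connector must be attached to the cluster, and the column names and output path are assumptions for illustration, not part of the original notebook:

# Minimal sketch, assuming `df` has a binary `content` column and an
# integer `label_index` column matching the features parsed above.
(df.select('content', 'label_index')
   .write.format('tfrecords')           # data source from spark-tensorflow-connector
   .option('recordType', 'Example')     # serialize each row as a tf.train.Example
   .mode('overwrite')
   .save('dbfs:/tmp/flower_photos_tfrecords'))  # hypothetical output path; readable
                                                # from TensorFlow via the /dbfs FUSE mount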
The dataset contains flower photos stored in five subdirectories, one per class, and is hosted under Databricks Datasets at dbfs:/databricks-datasets/flower_photos for easy access. This notebook loads the flowers table, which contains the preprocessed flowers dataset, using the binary file data source, sketched below.