Save data from Spark DataFrames to TFRecords and load it using TensorFlow (Python)


Preparing image data for Distributed DL

This notebook uses the flowers dataset from the TensorFlow team as an example to show how to save image data from Spark DataFrames to TFRecords and load it using TensorFlow.

The dataset contains flower photos stored in five subdirectories, one per class. It is hosted under Databricks Datasets at dbfs:/databricks-datasets/flower_photos for easy access.

This notebook loads the flowers Delta table, which contains the flowers dataset preprocessed with the binary file data source.
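
For context, here is a minimal sketch of how the raw flower photos could be read with the binary file data source. This is an illustrative assumption about how the preprocessed table was built; the notebook itself loads the ready-made Delta table in the next step and does not run this cell.

# Sketch (assumption): read the raw flower photos with the binary file data source.
# The preprocessed Delta table used below already contains the result of a step like this.
raw_df = (
  spark.read.format("binaryFile")
  .option("recursiveFileLookup", "true")
  .option("pathGlobFilter", "*.jpg")
  .load("dbfs:/databricks-datasets/flower_photos")
)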

from pyspark.sql.functions import col, pandas_udf
 
import os
import uuid
import tensorflow as tf

Save data from Spark DataFrames to TFRecords

Step 1: Load data using Spark.

df = spark.read.format("delta").load("/databricks-datasets/flowers/delta")
 
labels = df.select(col("label")).distinct().collect()
label_to_idx = {label: index for index, (label, ) in enumerate(sorted(labels))}
 
@pandas_udf("long")
def get_label_idx(labels):
  return labels.map(lambda label: label_to_idx[label])
 
df = df.withColumn("label_index", get_label_idx(col("label"))) \
  .select(col("content"), col("label_index")) \
  .limit(100)
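
Optionally, sanity-check the result of this step. The snippet below is a small illustrative addition, not part of the original notebook.

# Optional check (sketch): the mapping covers the five flower classes, and the
# DataFrame now holds raw JPEG bytes plus an integer label index.
print(label_to_idx)
df.printSchema()  # expect: content (binary), label_index (long)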

Step 2: Save the data to TFRecord files.

name_uuid = str(uuid.uuid4())
path = '/ml/flowersData/df-{}.tfrecord'.format(name_uuid)
df.limit(100).write.format("tfrecords").mode("overwrite").save(path)
display(dbutils.fs.ls(path))
 
path                                                                                  name           size
dbfs:/ml/flowersData/df-5d42cd4d-0fe5-4932-9fca-cd36477bd088.tfrecord/_SUCCESS       _SUCCESS       0
dbfs:/ml/flowersData/df-5d42cd4d-0fe5-4932-9fca-cd36477bd088.tfrecord/part-r-00000   part-r-00000   12490605

Load TFRecords using TensorFlow

Step 1: Create a TFRecordDataset as an input pipeline.

filenames = [("/dbfs" + path + "/" + name) for name in os.listdir("/dbfs" + path) if name.startswith("part")]
dataset = tf.data.TFRecordDataset(filenames)

Step 2: Define a decoder to read, parse, and normalize the data.

Parse the records into tensors with map, which takes a Python function and applies it to every element of the dataset.

def decode_and_normalize(serialized_example):
  """
  Decode and normalize an image and label from the given `serialized_example`.
  It is used as a map function for `dataset.map`
  """
  IMAGE_SIZE = 224
  
  # 1. define a parser
  feature_dataset = tf.io.parse_single_example(
      serialized_example,
      # Defaults are not specified since both keys are required.
      features={
          'content': tf.io.FixedLenFeature([], tf.string),
          'label_index': tf.io.FixedLenFeature([], tf.int64),
      })
  # 2. decode the data
  image = tf.io.decode_jpeg(feature_dataset['content'])
  label = tf.cast(feature_dataset['label_index'], tf.int32)
  # 3. resize
  image = tf.image.resize(image, [IMAGE_SIZE, IMAGE_SIZE])
  # 4. normalize the data
  image = tf.cast(image, tf.float32) * (1. / 255) - 0.5
  return image, label
 
parsed_dataset = dataset.map(decode_and_normalize)
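
To confirm the decoder works end to end, you can pull a single example. This check is an illustrative addition, not part of the original notebook.

# Optional check (sketch): take one parsed example and confirm its shape and label.
for image, label in parsed_dataset.take(1):
  print(image.shape, label.numpy())  # expect (224, 224, 3) and a label index in 0-4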

Use the dataset as an input to train the model.

For a full example that uses TFRecord files as input for DL training, see tf.data: Build TensorFlow input pipelines.

batch_size = 4
parsed_dataset = parsed_dataset.shuffle(40)
parsed_dataset = parsed_dataset.repeat(2)
parsed_dataset = parsed_dataset.batch(batch_size)
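
As an illustration of that last step, here is a minimal training sketch. The small Keras model, its layer sizes, and the single epoch are assumptions made for demonstration; they are not part of the original notebook and are not tuned for the flowers data.

# Sketch (assumption): a tiny Keras CNN trained on the batched dataset.
model = tf.keras.Sequential([
  tf.keras.layers.Input(shape=(224, 224, 3)),        # matches IMAGE_SIZE in the decoder
  tf.keras.layers.Conv2D(32, 3, activation="relu"),
  tf.keras.layers.MaxPooling2D(),
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(5, activation="softmax"),    # one output per flower class
])
model.compile(
  optimizer="adam",
  loss="sparse_categorical_crossentropy",
  metrics=["accuracy"],
)
model.fit(parsed_dataset, epochs=1)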

Finally, remove the temporary TFRecord files from DBFS.

dbutils.fs.rm(path, True)