Save Apache Spark DataFrames as TFRecord files

The TFRecord file format is a simple record-oriented binary format for ML training data. The tf.data.TFRecordDataset class enables you to stream over the contents of one or more TFRecord files as part of an input pipeline.

Note

This article is not a comprehensive guide to importing data with TensorFlow. See the TensorFlow API Guide for details.

Save Apache Spark DataFrames to TFRecord files

You can use spark-tensorflow-connector to save Apache Spark DataFrames to TFRecord files.

spark-tensorflow-connector is a library within the TensorFlow ecosystem that enables conversion between Spark DataFrames and TFRecords (a popular format for storing data for TensorFlow). With spark-tensorflow-connector, you can use Spark DataFrame APIs to read TFRecord files into DataFrames and write DataFrames as TFRecords.
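The round trip looks roughly like the following PySpark sketch, assuming a cluster with the spark-tensorflow-connector library installed; the output path and column names are hypothetical, and the "tfrecords" format name and recordType option follow the connector's documented API.

```python
# A minimal sketch, assuming spark-tensorflow-connector is installed on the cluster.
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [Row(image_id=1, label=0.0), Row(image_id=2, label=1.0)]
)

path = "/tmp/example_tfrecords"  # hypothetical output path

# Write the DataFrame as TFRecord files of tf.train.Example records.
(df.write.format("tfrecords")
    .option("recordType", "Example")
    .mode("overwrite")
    .save(path))

# Read the TFRecord files back into a Spark DataFrame.
df2 = (spark.read.format("tfrecords")
    .option("recordType", "Example")
    .load(path))
df2.show()
```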

Note

The spark-tensorflow-connector library is included in Databricks Runtime for Machine Learning. If you create a cluster using Databricks Runtime for Machine Learning, no installation is needed. To use spark-tensorflow-connector on standard Databricks Runtime, install the library from Maven. See Maven or Spark package for details.

Load data from TFRecord files with TensorFlow

You can load the TFRecord files using the tf.data.TFRecordDataset class. See Reading a TFRecord file from TensorFlow for details.
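On the TensorFlow side, an input pipeline over the written files might look like the following sketch, assuming TensorFlow 2.x and TFRecord files of tf.train.Example records; the file path and feature names are hypothetical and must match the schema of the DataFrame you saved.

```python
# A minimal sketch, assuming TensorFlow 2.x is installed.
import tensorflow as tf

path = "/tmp/example_tfrecords/part-r-00000"  # hypothetical file

# Feature spec mirroring the columns of the saved DataFrame (hypothetical names).
feature_spec = {
    "image_id": tf.io.FixedLenFeature([], tf.int64),
    "label": tf.io.FixedLenFeature([], tf.float32),
}


def parse(serialized):
    # Deserialize one tf.train.Example record into a dict of tensors.
    return tf.io.parse_single_example(serialized, feature_spec)


dataset = (
    tf.data.TFRecordDataset([path])
    .map(parse)
    .batch(32)
)
```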

The following example notebook demonstrates how to save data from Apache Spark DataFrames to TFRecord files and load TFRecord files for ML training.

Prepare image data for Distributed DL
