Save Apache Spark DataFrames as TFRecord files
This article shows you how to use spark-tensorflow-connector to save Apache Spark DataFrames to TFRecord files and load TFRecord files with TensorFlow.
The TFRecord file format is a simple record-oriented binary format for ML training data. The tf.data.TFRecordDataset class enables you to stream over the contents of one or more TFRecord files as part of an input pipeline.
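For illustration, each record in a TFRecord file is typically a serialized tf.train.Example protocol buffer. The following is a minimal sketch (the /tmp/example.tfrecord path and the feature name "x" are placeholders, assuming TensorFlow 2) that writes two such records with tf.io.TFRecordWriter:

```python
import tensorflow as tf

# Each record written to the file is a serialized tf.train.Example protocol buffer.
path = "/tmp/example.tfrecord"  # hypothetical output path
with tf.io.TFRecordWriter(path) as writer:
    for i in range(2):
        example = tf.train.Example(
            features=tf.train.Features(
                feature={
                    "x": tf.train.Feature(
                        int64_list=tf.train.Int64List(value=[i])
                    )
                }
            )
        )
        writer.write(example.SerializeToString())
```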
Use the spark-tensorflow-connector library
You can use spark-tensorflow-connector to save Apache Spark DataFrames to TFRecord files. spark-tensorflow-connector is a library within the TensorFlow ecosystem that enables conversion between Spark DataFrames and TFRecord files (a popular format for storing data for TensorFlow). With spark-tensorflow-connector, you can use Spark DataFrame APIs to read TFRecord files into DataFrames and write DataFrames as TFRecord files.
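As a sketch of what this looks like (assuming the connector is attached to the cluster; the output path and column names below are placeholders), you can go through the standard DataFrame reader and writer with the "tfrecords" format name registered by the connector:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical output location; on Databricks this would typically be a DBFS path.
path = "/tmp/spark-tfrecords"

df = spark.createDataFrame(
    [(1, 0.5, "a"), (2, 1.5, "b")], ["id", "value", "label"]
)

# Write the DataFrame as TFRecord files (one serialized tf.train.Example per row).
(
    df.write.format("tfrecords")
    .option("recordType", "Example")
    .mode("overwrite")
    .save(path)
)

# Read the TFRecord files back into a DataFrame.
df2 = spark.read.format("tfrecords").option("recordType", "Example").load(path)
df2.show()
```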
Note
The spark-tensorflow-connector library is included in Databricks Runtime for Machine Learning. To use spark-tensorflow-connector on Databricks Runtime, you need to install the library from Maven. See Maven or Spark package for details.
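Outside Databricks Runtime ML, one way to pull the connector from Maven is to add it as a package when building the Spark session. The coordinate below is an example only; check Maven Central for the artifact matching your cluster's Spark and Scala versions:

```python
from pyspark.sql import SparkSession

# Example Maven coordinate; verify the version and Scala suffix for your cluster.
spark = (
    SparkSession.builder
    .config(
        "spark.jars.packages",
        "org.tensorflow:spark-tensorflow-connector_2.11:1.15.0",
    )
    .getOrCreate()
)
```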
Example: Load data from TFRecord files with TensorFlow
The example notebook demonstrates how to save data from Apache Spark DataFrames to TFRecord files and load TFRecord files for ML training.
You can load the TFRecord files using the tf.data.TFRecordDataset class. See Reading a TFRecord file from TensorFlow for details.
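Continuing the sketch above (the path and feature spec are placeholders that mirror the hypothetical DataFrame written earlier), a typical input pipeline streams the part files, parses each serialized tf.train.Example, and batches the results:

```python
import tensorflow as tf

# spark-tensorflow-connector writes sharded part files; glob them for the pipeline.
files = tf.data.Dataset.list_files("/tmp/spark-tfrecords/part-*")
dataset = tf.data.TFRecordDataset(files)

# Feature spec matching the columns of the hypothetical DataFrame above.
feature_spec = {
    "id": tf.io.FixedLenFeature([], tf.int64),
    "value": tf.io.FixedLenFeature([], tf.float32),
    "label": tf.io.FixedLenFeature([], tf.string),
}

def parse(serialized):
    return tf.io.parse_single_example(serialized, feature_spec)

dataset = dataset.map(parse).shuffle(1024).batch(32)

for batch in dataset.take(1):
    print(batch["id"], batch["label"])
```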