Downloading data from S3

The notebook below is the first of six notebooks demonstrating how to perform distributed training with TensorFlowOnSpark on the MNIST dataset. The full set of notebooks is available for download; see the TensorFlowOnSpark guide for more information.

This notebook demonstrates how to download data from S3 and build a data ingest pipeline (loading training data from disk into in-memory tensors) using the tf.data APIs in TensorFlow. In this example, training data is loaded from TFRecord files. If you would instead like to work with data processed in Spark, consider using the Spark-TensorFlow Connector to persist your DataFrames as TFRecord files, then load them into TensorFlow using a workflow similar to the one described below.
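A minimal sketch of such an ingest pipeline is shown below. It is not the notebook's exact code: for illustration it writes a small synthetic MNIST-style TFRecord file locally instead of downloading real data from S3 (in practice you might fetch the files first, e.g. with boto3's download_file), and the feature names "image" and "label" are assumptions about the record schema.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf


def write_sample_tfrecord(path, num_examples=4):
    """Write a few synthetic MNIST-style records for illustration.

    In the real pipeline these TFRecord files would be downloaded from S3
    (e.g. via boto3) rather than generated locally.
    """
    with tf.io.TFRecordWriter(path) as writer:
        for _ in range(num_examples):
            image = np.random.randint(0, 256, size=(28, 28), dtype=np.uint8)
            label = int(np.random.randint(0, 10))
            example = tf.train.Example(features=tf.train.Features(feature={
                # Feature names are hypothetical; match them to your schema.
                "image": tf.train.Feature(
                    bytes_list=tf.train.BytesList(value=[image.tobytes()])),
                "label": tf.train.Feature(
                    int64_list=tf.train.Int64List(value=[label])),
            }))
            writer.write(example.SerializeToString())


def parse_example(serialized):
    """Deserialize one record into a normalized image tensor and a label."""
    features = tf.io.parse_single_example(serialized, {
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    })
    image = tf.io.decode_raw(features["image"], tf.uint8)
    image = tf.cast(tf.reshape(image, [28, 28]), tf.float32) / 255.0
    return image, features["label"]


# Create a temporary TFRecord file standing in for data fetched from S3.
fd, path = tempfile.mkstemp(suffix=".tfrecord")
os.close(fd)
write_sample_tfrecord(path)

# Build the tf.data pipeline: read records, parse, shuffle, and batch.
dataset = (tf.data.TFRecordDataset([path])
           .map(parse_example)
           .shuffle(buffer_size=4)
           .batch(2))

for images, labels in dataset:
    # Each batch yields images of shape (2, 28, 28) and labels of shape (2,).
    print(images.shape, labels.shape)
```

The same `TFRecordDataset` → `map` → `shuffle` → `batch` structure applies unchanged once the file list points at records downloaded from S3 or produced by the Spark-TensorFlow Connector.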

Next Steps

The next stage of the TensorFlowOnSpark training pipeline is to construct a TensorFlow graph for distributed model training. For more information, see Constructing the Model Graph.