Downloading Data from Unstructured Cloud Storage

The notebook below is the first of six that demonstrate distributed training with TensorFlowOnSpark on the MNIST dataset. Download the full set of notebooks or see the TensorFlowOnSpark guide for more information.

This notebook demonstrates how to download data from S3 and build a data ingest pipeline that loads training data from disk into in-memory tensors using TensorFlow APIs. In this example, training data is loaded from TFRecord files. If you would rather work with data processed in Spark, consider using the Spark-TensorFlow connector to persist your DataFrames as TFRecord files, then load them into TensorFlow with a workflow similar to the one described here.
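The following is a minimal sketch of the two steps, not the notebook's actual code. It assumes the boto3 client for the S3 download and the tf.data API for the ingest pipeline. The helper names (download_from_s3, parse_example, make_dataset), the feature keys "image_raw" and "label", and the flattened 784-pixel image shape are illustrative assumptions; substitute whatever schema your TFRecord files were written with.

```python
import os

import tensorflow as tf


def download_from_s3(bucket, key, local_path):
    """Copy one S3 object to local disk. bucket and key are placeholders."""
    # Imported lazily so the parsing code below can run without AWS access.
    import boto3

    os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
    boto3.client("s3").download_file(bucket, key, local_path)
    return local_path


def parse_example(serialized):
    """Decode one serialized tf.train.Example into (image, label) tensors.

    Assumed schema: "image_raw" holds 784 raw uint8 pixel bytes and
    "label" holds an int64 class id.
    """
    features = tf.io.parse_single_example(
        serialized,
        {
            "image_raw": tf.io.FixedLenFeature([], tf.string),
            "label": tf.io.FixedLenFeature([], tf.int64),
        },
    )
    image = tf.io.decode_raw(features["image_raw"], tf.uint8)
    image = tf.reshape(image, [784])
    image = tf.cast(image, tf.float32) / 255.0  # scale pixels to [0, 1]
    label = tf.cast(features["label"], tf.int32)
    return image, label


def make_dataset(filenames, batch_size=32, shuffle_buffer=1000):
    """Build a shuffled, batched dataset from local TFRecord files."""
    dataset = tf.data.TFRecordDataset(filenames)
    dataset = dataset.map(parse_example)
    dataset = dataset.shuffle(shuffle_buffer)
    dataset = dataset.batch(batch_size)
    return dataset
```

A typical call sequence would download each file with download_from_s3 and then pass the resulting local paths to make_dataset. Keeping the download separate from the parsing means the same pipeline works for files that are already on local disk.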

Next steps

The next stage of the TensorFlowOnSpark training pipeline is to construct a TensorFlow graph for distributed model training. For more information, see Constructing the Model Graph.