The image data source in Apache Spark 2.4, included in Databricks Runtime 5.0, abstracts away the details of image representations and provides a standard API to load image data:

df = spark.read.format("image").load("...")

Similar APIs exist for Scala, Java, and R.

You can import a nested directory structure (for example, use a path like /path/to/dir/) and you can use partition discovery by specifying a path with a partition directory (that is, a path like /path/to/dir/date=2018-01-02/category=automobile).

Image structure

Image files are loaded as a DataFrame containing a single struct-type column called image with the following fields:

image: struct containing all the image data
  |-- origin: string representing the source URI
  |-- height: integer, image height in pixels
  |-- width: integer, image width in pixels
  |-- nChannels
  |-- mode
  |-- data

where the fields are:

  • nChannels: The number of color channels. Typical values are 1 for grayscale images, 3 for colored images (for example, RGB), and 4 for colored images with alpha channel.

  • mode: Integer flag that provides information on how to interpret the data field. It specifies the data type and channel order the data is stored in. The value of the field is expected (but not enforced) to map to one of the OpenCV types displayed below. OpenCV types are defined for 1, 2, 3, or 4 channels and several data types for the pixel values. Channel order specifies the order in which the colors are stored. For example, if you have a typical three channel image with red, blue, and green components, there are six possible orderings. Most libraries use either RGB or BGR. Three (four) channel OpenCV types are expected to be in BGR(A) order.

    Mapping of type to numbers in OpenCV (data types x number of channels)

              C1  C2  C3  C4
    CV_8U      0   8  16  24
    CV_8S      1   9  17  25
    CV_16U     2  10  18  26
    CV_16S     3  11  19  27
    CV_32S     4  12  20  28
    CV_32F     5  13  21  29
    CV_64F     6  14  22  30

  • data: Image data stored in a binary format. Image data is represented as a 3-dimensional array with the dimension shape (height, width, nChannels) and array values of type t specified by the mode field. The array is stored in row-major order.
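
To illustrate how the mode and data fields fit together, here is a minimal Python sketch. It splits a mode value into a data type and a channel count using OpenCV's numbering (data type code plus 8 times the number of channels minus one), then computes where a value sits in the row-major data buffer. The helper names decode_mode and value_offset are hypothetical; they are not part of Spark or OpenCV.

```python
# OpenCV data type codes (mode & 7) with the size of one value in bytes.
DEPTHS = [
    ("CV_8U", 1), ("CV_8S", 1), ("CV_16U", 2), ("CV_16S", 2),
    ("CV_32S", 4), ("CV_32F", 4), ("CV_64F", 8),
]


def decode_mode(mode):
    """Split a mode value into (type name, bytes per value, channel count).

    Hypothetical helper for illustration only.
    """
    depth_code = mode & 7          # low three bits select the data type
    n_channels = (mode >> 3) + 1   # remaining bits hold (channels - 1)
    if mode < 0 or depth_code >= len(DEPTHS) or n_channels > 4:
        raise ValueError("not a 1-4 channel OpenCV type: %r" % (mode,))
    name, size = DEPTHS[depth_code]
    return name, size, n_channels


def value_offset(row, col, channel, width, n_channels, value_size=1):
    """Byte offset of one value in a row-major (height, width, nChannels) array."""
    return ((row * width + col) * n_channels + channel) * value_size


# A 2x2, three-channel CV_8U image: 12 bytes, stored row by row in BGR order.
data = bytes(range(12))
name, size, channels = decode_mode(16)
start = value_offset(1, 0, 0, width=2, n_channels=channels, value_size=size)
blue, green, red = data[start:start + channels]
```

In this sketch the pixel at row 1, column 0 starts at byte 6 because one full row of two three-byte pixels precedes it.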

Display image data

The Databricks display function supports displaying image data. See Images types in DataFrames.


Deep Learning Pipelines provides an easy way to get started with ML using images. The example in the following notebook uses transfer learning to build a custom image classifier. For information on installing the Deep Learning Pipelines library and its dependencies, see Deep Learning Pipelines.

For Deep Learning Pipelines developers, the new image schema changes the ordering of the color channels from RGB to BGR. To minimize confusion, some of the internal APIs now require you to specify the ordering explicitly.
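
If downstream code expects RGB, the channel order has to be swapped explicitly. The following is a minimal Python sketch of that swap for 8-bit pixel data, assuming the bytes are laid out as described for the data field. The function name bgr_to_rgb is hypothetical and is not a Deep Learning Pipelines or Spark API.

```python
def bgr_to_rgb(data, n_channels):
    """Return a copy of 8-bit pixel bytes with the first and third channels swapped.

    Hypothetical helper for illustration. Works for BGR (3 channels) and
    BGRA (4 channels); the alpha channel, if present, stays in place.
    """
    if n_channels not in (3, 4):
        raise ValueError("expected 3 or 4 channels")
    if len(data) % n_channels != 0:
        raise ValueError("data length is not a multiple of n_channels")
    out = bytearray(data)
    out[0::n_channels] = data[2::n_channels]  # red moves to the first slot
    out[2::n_channels] = data[0::n_channels]  # blue moves to the third slot
    return bytes(out)


# Two BGR pixels: (B=1, G=2, R=3) and (B=4, G=5, R=6).
rgb = bgr_to_rgb(bytes([1, 2, 3, 4, 5, 6]), n_channels=3)
```

Applying the same function to its own output restores the original order, so it also serves for RGB to BGR.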