Image
The image data source abstracts away the details of image representations and provides a standard API to load image data. To read image files, specify the data source format as image.
df = spark.read.format("image").load("<path-to-image-data>")
Similar APIs exist for Scala, Java, and R.
You can import a nested directory structure (for example, use a path like /path/to/dir/) and you can use partition discovery by specifying a path with a partition directory (that is, a path like /path/to/dir/date=2018-01-02/category=automobile).
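To illustrate the directory naming convention that partition discovery relies on, the following sketch pulls the key=value segments out of a path. The helper name partition_values is hypothetical and is not a Spark API. Spark does this work itself when it loads the path, so the sketch only shows which column names and values the example path would yield.

```python
from pathlib import PurePosixPath


def partition_values(path: str) -> dict:
    """Collect key=value directory segments from a path.

    Hypothetical helper for illustration only; not part of Spark.
    """
    values = {}
    for segment in PurePosixPath(path).parts:
        key, sep, value = segment.partition("=")
        # Keep only segments of the form key=value with a non-empty key.
        if sep and key:
            values[key] = value
    return values


print(partition_values("/path/to/dir/date=2018-01-02/category=automobile"))
# {'date': '2018-01-02', 'category': 'automobile'}
```

The sketch keeps every value as a string and ignores directories that are not in key=value form.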
Note
If you do not want to decode images, Databricks recommends that you use the binary file data source.
Image structure
Image files are loaded as a DataFrame containing a single struct-type column called image with the following fields:
image: struct containing all the image data
|-- origin: string representing the source URI
|-- height: integer, image height in pixels
|-- width: integer, image width in pixels
|-- nChannels
|-- mode
|-- data
where the fields are:
nChannels: The number of color channels. Typical values are 1 for grayscale images, 3 for colored images (for example, RGB), and 4 for colored images with alpha channel.
mode: Integer flag that indicates how to interpret the data field. It specifies the data type and channel order the data is stored in. The value of the field is expected (but not enforced) to map to one of the OpenCV types displayed in the following table. OpenCV types are defined for 1, 2, 3, or 4 channels and several data types for the pixel values. Channel order specifies the order in which the colors are stored. For example, if you have a typical three channel image with red, blue, and green components, there are six possible orderings. Most libraries use either RGB or BGR. Three-channel and four-channel OpenCV types are expected to be in BGR and BGRA order, respectively.

Map of Type to Numbers in OpenCV (data types x number of channels)
Type    C1  C2  C3  C4
CV_8U    0   8  16  24
CV_8S    1   9  17  25
CV_16U   2  10  18  26
CV_16S   3  11  19  27
CV_32S   4  12  20  28
CV_32F   5  13  21  29
CV_64F   6  14  22  30

data: Image data stored in a binary format. Image data is represented as a 3-dimensional array with the dimension shape (height, width, nChannels) and array values of type t specified by the mode field. The array is stored in row-major order.
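The relationship between mode, nChannels, and the layout of data can be sketched in plain Python. This is a minimal illustration, not the data source's own code: the helper names decode_mode and pixel_at are hypothetical, the depth names follow OpenCV's type definitions (depth code 5 is CV_32F), and native byte order is assumed for multi-byte pixel values.

```python
import struct

# Depth codes 0-6 (the C1 column of the type table), named as in OpenCV.
DEPTH_NAMES = ["CV_8U", "CV_8S", "CV_16U", "CV_16S", "CV_32S", "CV_32F", "CV_64F"]
# struct format character for each depth, in the same order.
DEPTH_FORMATS = ["B", "b", "H", "h", "i", "f", "d"]


def decode_mode(mode: int):
    """Split a mode value into (depth name, number of channels). Hypothetical helper."""
    depth = mode % 8          # row of the type table
    channels = mode // 8 + 1  # column of the type table (C1..C4)
    if depth >= len(DEPTH_NAMES) or not 1 <= channels <= 4:
        raise ValueError(f"mode {mode} is not in the type table")
    return DEPTH_NAMES[depth], channels


def pixel_at(data: bytes, width: int, n_channels: int, mode: int, row: int, col: int):
    """Read one pixel from a row-major (height, width, nChannels) buffer. Hypothetical helper."""
    decode_mode(mode)  # reject modes that are not in the table
    fmt = DEPTH_FORMATS[mode % 8]
    size = struct.calcsize("=" + fmt)
    # Row-major layout: rows first, then columns, then channels.
    offset = (row * width + col) * n_channels * size
    # "=" selects native byte order with standard sizes (an assumption of this sketch).
    return struct.unpack_from(f"={n_channels}{fmt}", data, offset)


# A 2x2, 3-channel, 8-bit image stored in BGR order: blue, green / red, white.
data = bytes([255, 0, 0,  0, 255, 0,  0, 0, 255,  255, 255, 255])
print(decode_mode(16))  # ('CV_8U', 3)
b, g, r = pixel_at(data, width=2, n_channels=3, mode=16, row=1, col=0)
print((r, g, b))        # (255, 0, 0)
```

Reversing the channel order, as in the last two lines, is what converts a BGR pixel to the RGB order that many imaging libraries expect.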
Display image data
The Databricks display function supports displaying image data. See Images.