image-data-source(Python)

Image Data Source Sample

This sample notebook illustrates how to use the image data source.

Attribution

These images come from a video reenactment of a fight scene by CAVIAR members (EC Funded CAVIAR project/IST 2001 37540). The code used to generate them can be found in Identify Suspicious Behavior in Video with Databricks Runtime for Machine Learning.

Setup

# Configure image paths
sample_img_dir = "/databricks-datasets/cctvVideos/train_images/"

Create Image DataFrame

Create a DataFrame using the image data source included in Apache Spark. The image data source supports Hive-style partitioning, so if you upload images in the following structure:

  • root_dir
    • label=0
      • image_001.jpg
      • image_002.jpg
      • ...
    • label=1
      • image_101.jpg
      • image_102.jpg
      • ...

then the generated schema, as printed by the image_df.printSchema() command, will be the following:

root
 |-- image: struct (nullable = true)
 |    |-- origin: string (nullable = true)
 |    |-- height: integer (nullable = true)
 |    |-- width: integer (nullable = true)
 |    |-- nChannels: integer (nullable = true)
 |    |-- mode: integer (nullable = true)
 |    |-- data: binary (nullable = true)
 |-- label: integer (nullable = true)
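
The mapping from directory names to the label column can be illustrated without Spark. The sketch below uses a hypothetical helper, partition_values (not part of Spark's API), that pulls key=value segments out of a file path, which is conceptually what Hive-style partition discovery does. Spark also infers column types (integer for label here); this sketch leaves the values as strings.

```python
def partition_values(path):
    """Return {column: value} for every key=value directory in a path.

    Hypothetical helper for illustration only; Spark performs partition
    discovery internally and also infers column types.
    """
    # Drop the file name: only directory segments can act as partitions.
    directories = path.split("/")[:-1]
    values = {}
    for segment in directories:
        key, sep, value = segment.partition("=")
        # Keep only well-formed key=value segments such as "label=0".
        if sep and key and value:
            values[key] = value
    return values


example = "dbfs:/databricks-datasets/cctvVideos/train_images/label=0/LeftBagframe0004.jpg"
print(partition_values(example))  # {'label': '0'}
```
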
# Create image DataFrame using image data source
image_df = spark.read.format("image").load(sample_img_dir)

display(image_df) 
 
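      image     label
 1    [image]   0
 2    [image]   1
 3    [image]   0
 4    [image]   0
 5    [image]   0
 6    [image]   0
 7    [image]   0
 8    [image]   0
 9    [image]   1
10    [image]   1
11    [image]   0
12    [image]   0
13    [image]   0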

Showing the first 36 rows.

# Test image_df.image.origin
display(image_df.select("image.origin"))
 
      origin
 1    dbfs:/databricks-datasets/cctvVideos/train_images/label=0/LeftBagframe0004.jpg
 2    dbfs:/databricks-datasets/cctvVideos/train_images/label=1/LeftBagframe0040.jpg
 3    dbfs:/databricks-datasets/cctvVideos/train_images/label=0/LeftBagframe0005.jpg
 4    dbfs:/databricks-datasets/cctvVideos/train_images/label=0/LeftBagframe0015.jpg
 5    dbfs:/databricks-datasets/cctvVideos/train_images/label=0/LeftBagframe0017.jpg
 6    dbfs:/databricks-datasets/cctvVideos/train_images/label=0/LeftBagframe0003.jpg
 7    dbfs:/databricks-datasets/cctvVideos/train_images/label=0/LeftBoxframe0016.jpg
 8    dbfs:/databricks-datasets/cctvVideos/train_images/label=0/LeftBag_AtChairframe0003.jpg
 9    dbfs:/databricks-datasets/cctvVideos/train_images/label=1/LeftBagframe0053.jpg
10    dbfs:/databricks-datasets/cctvVideos/train_images/label=1/LeftBagframe0041.jpg
11    dbfs:/databricks-datasets/cctvVideos/train_images/label=0/LeftBagframe0002.jpg
12    dbfs:/databricks-datasets/cctvVideos/train_images/label=0/LeftBoxframe0005.jpg
13    dbfs:/databricks-datasets/cctvVideos/train_images/label=0/LeftBoxframe0015.jpg
14    dbfs:/databricks-datasets/cctvVideos/train_images/label=0/Walk2frame0005.jpg
15    dbfs:/databricks-datasets/cctvVideos/train_images/label=1/LeftBagframe0051.jpg
16    dbfs:/databricks-datasets/cctvVideos/train_images/label=0/LeftBagframe0016.jpg
17    dbfs:/databricks-datasets/cctvVideos/train_images/label=1/LeftBoxframe0025.jpg
18    dbfs:/databricks-datasets/cctvVideos/train_images/label=0/Walk2frame0004.jpg

Showing all 451 rows.

# Print schema of image_df
# Note: the label column is derived from the label=0 and label=1 directory names
image_df.printSchema()
root
 |-- image: struct (nullable = true)
 |    |-- origin: string (nullable = true)
 |    |-- height: integer (nullable = true)
 |    |-- width: integer (nullable = true)
 |    |-- nChannels: integer (nullable = true)
 |    |-- mode: integer (nullable = true)
 |    |-- data: binary (nullable = true)
 |-- label: integer (nullable = true)
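
To make the fields of the image struct more concrete, here is a minimal sketch of how height, width, nChannels, and data relate to each other. It assumes one unsigned byte per channel, stored row by row in BGR order, which is how the Spark documentation describes the data field in most cases. The helper pixel_at is hypothetical, and the 2x2 image is made-up data, not one of the sample frames.

```python
def pixel_at(data, width, n_channels, row, col):
    """Return the channel values of one pixel as a tuple.

    Assumes one byte per channel and row-major layout (an assumption
    that matches 8-bit modes such as OpenCV's CV_8UC3).
    """
    start = (row * width + col) * n_channels
    return tuple(data[start:start + n_channels])


# Made-up 2x2, 3-channel image; each pixel is stored as (B, G, R).
height, width, n_channels = 2, 2, 3
data = bytes([
    255, 0, 0,    0, 255, 0,    # row 0: blue pixel, green pixel
    0, 0, 255,    10, 20, 30,   # row 1: red pixel, arbitrary pixel
])

# The byte count must match the dimensions under the stated assumption.
assert len(data) == height * width * n_channels
print(pixel_at(data, width, n_channels, 1, 0))  # (0, 0, 255)
```

With a row collected from the DataFrame (for example row = image_df.first().image), the same helper could presumably be called as pixel_at(row.data, row.width, row.nChannels, 0, 0), provided the mode field indicates an 8-bit format.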