Object metadata column

Preview

This feature is in Public Preview.

The _object_metadata column is a hidden metadata column that exposes cloud object-level properties for each file read by a file-based data source. Unlike _metadata (which contains information like file path, size, and modification time), _object_metadata provides richer storage-layer properties fetched via cloud APIs — including MIME type, ETag, user-defined key-value metadata, system-defined metadata, and object tags.

The _object_metadata column is available for all input file formats when reading from cloud object storage. To include the _object_metadata column in the returned DataFrame, you must explicitly select it in the read query where you specify the source.

If the data source contains a column named _object_metadata, queries against _object_metadata return the data source column, not the cloud object metadata. To access the cloud object metadata column in this case, prepend an additional underscore (__object_metadata). If __object_metadata also collides with a data source column, keep prepending underscores until the name is unique.
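
For example, suppose the source files themselves include a column named _object_metadata. A minimal sketch (with a placeholder path) that reads both the source column and the shifted metadata column:

Python
path = "<path-to-load-from>"

# The source data has its own "_object_metadata" column, so the hidden
# metadata column is accessed as "__object_metadata" instead.
df = spark.read.format("csv").load(path)
display(df.select(
    "_object_metadata",   # column from the data source itself
    "__object_metadata"   # cloud object metadata, shifted by one underscore
))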

Common file metadata like the file path or size can be queried using the _metadata column. For more information about the _metadata column, see File metadata column.

warning

New fields might be added to the _object_metadata column in future releases. To prevent schema evolution errors if the _object_metadata column is updated, you can select specific fields from the column in your queries. See Examples.

Schema

The _object_metadata column is a STRUCT containing the following fields, available starting from Databricks Runtime 18.1. All fields are nullable.

| Name | Type | Description | Example |
| --- | --- | --- | --- |
| mime_type | STRING | MIME type (content type) of the object, for example application/parquet or text/csv. | application/parquet |
| etag | STRING | ETag of the object. ETags are useful for detecting changes or versioning (see the example after this table). | "abc123def456" |
| user_metadata | VARIANT | User-defined metadata key-value pairs stored on the object. In S3, these are user-defined metadata headers; see User-defined metadata headers in the AWS documentation. In Azure Blob, these are user-defined metadata; see Manage blob properties and metadata with .NET in the Azure documentation. | {"my_key":"my_value"} |
| system_metadata | VARIANT | System-defined key-value pairs set by the cloud storage provider. | {"Content-Length":"1024", ...} |
| tags | VARIANT | User-defined object tag key-value pairs stored on the object. In S3, these are object tags; see Categorizing your objects using tags in the AWS documentation. Not all cloud storage services support object tags; see Notes for per-provider behavior. | {"my_tag":"my_value"} |

Examples

The following examples show how to read and query the _object_metadata column using different ingestion methods.

Read a batch of files

The following example reads a CSV file and selects both the _metadata and _object_metadata columns.

Python
path = "<path-to-load-from>"

df = spark.read.format("csv").load(path)
display(df.select("*", "_metadata", "_object_metadata"))

Stream files with Auto Loader

The following example uses Auto Loader to stream files from cloud storage and writes the _object_metadata column to a Delta table.

Python
path = "<path-to-load-from>"
checkpoint = "<checkpoint-path>"
schema_location = "<schema-location-path>"
table = "<output-table-path>"

dsw = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "text")
    .option("cloudFiles.schemaLocation", schema_location)
    .load(path)
    .selectExpr("*", "_metadata as md", "_object_metadata as obj_md")
    .writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint)
    .trigger(once=True)
    .start(table)
)

dsw.awaitTermination()

df = spark.read.format("delta").load(table).select("value", "md", "obj_md")
display(df)

Select specific fields

To avoid schema evolution errors from future changes to _object_metadata, select only the specific fields you need.

Python
path = "<path-to-load-from>"
schema = "<schema>"

(spark.read
    .format("csv")
    .schema(schema)
    .load(path)
    .select("_object_metadata.user_metadata", "_object_metadata.tags", "_object_metadata.etag"))

Use with COPY INTO

The following example uses COPY INTO to load files into a Delta table while selecting the _object_metadata column.

SQL
COPY INTO my_delta_table
FROM (
  SELECT *, _object_metadata FROM '<path-to-load-from>'
)
FILEFORMAT = CSV

Extract values from VARIANT fields

The user_metadata, system_metadata, and tags fields are VARIANT type. You can extract specific values using the :: cast operator or VARIANT functions; see VARIANT type. The following example uses the :: cast operator.

Python
path = "<path-to-load-from>"
schema = "<schema>"

(spark.read
    .format("csv")
    .schema(schema)
    .load(path)
    .selectExpr(
        "*",
        "_object_metadata.user_metadata:my_key::string as my_key",
        "_object_metadata.tags:environment::string as env_tag"
    ))
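
As an alternative to the :: cast operator, the same values can be extracted with VARIANT functions. The following minimal sketch uses variant_get (which takes a variant, a JSON path, and a target type) and its null-returning counterpart try_variant_get over the same hypothetical keys:

Python
path = "<path-to-load-from>"
schema = "<schema>"

(spark.read
    .format("csv")
    .schema(schema)
    .load(path)
    .selectExpr(
        "*",
        # variant_get extracts the value at the path and casts it to the type.
        "variant_get(_object_metadata.user_metadata, '$.my_key', 'string') as my_key",
        # try_variant_get returns NULL instead of failing on an invalid cast.
        "try_variant_get(_object_metadata.tags, '$.environment', 'string') as env_tag"
    ))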

Notes

Keep the following in mind when using _object_metadata.

  • The _object_metadata column works with Amazon S3, Azure DFS, Azure Blob, and GCP.
  • Selecting any field from _object_metadata triggers up to two additional cloud API calls per file, so queries over a large number of small files may experience some latency increase.
  • _object_metadata.tags is supported for S3 and Azure Blob Storage (non-HNS, blob.core.windows.net). On all other providers (Azure DFS, WASB, GCP), tags returns {}.
  • For S3, the credential must have s3:GetObjectTagging permission. If the credential lacks this permission, tags returns null.
  • If Databricks encounters an error fetching tags from a supported provider, tags returns null (see the defensive sketch after this list).
  • System metadata, user metadata, and tags are not available for Databricks-managed storage and are set to null.
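
Because tags can come back as null (missing permission or a fetch error) or as an empty object {} (unsupported provider), code that filters on a tag value should tolerate both. A minimal sketch, assuming a hypothetical environment tag:

Python
path = "<path-to-load-from>"

# try_variant_get returns NULL for a missing key or a NULL variant, so the
# filter below safely skips files without the tag.
df = (spark.read
    .format("csv")
    .load(path)
    .selectExpr(
        "*",
        "try_variant_get(_object_metadata.tags, '$.environment', 'string') as env_tag"
    ))

display(df.where("env_tag = 'prod'"))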