Object metadata column

Preview

This feature is in Public Preview.

The _object_metadata column is a hidden metadata column that exposes cloud object-level properties for each file read by a file-based data source. Unlike _metadata (which contains information like file path, size, and modification time), _object_metadata provides richer storage-layer properties fetched via cloud APIs — including MIME type, ETag, user-defined key-value metadata, system-defined metadata, and object tags.

The _object_metadata column is available for all input file formats when reading from cloud object storage. To include the _object_metadata column in the returned DataFrame, you must explicitly select it in the read query where you specify the source.

If the data source contains a column named _object_metadata, queries against _object_metadata return the data source column, not the cloud object metadata. To access the cloud object metadata column in this case, prepend an additional underscore (__object_metadata). If __object_metadata also collides with a data source column, keep prepending underscores until the name is unique.
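
For example, suppose the source files themselves include a column named _object_metadata. A minimal sketch (with a placeholder path) that reads both the source column and the shifted metadata column:

Python
path = "<path-to-load-from>"

# The source data has its own "_object_metadata" column, so the hidden
# metadata column is accessed as "__object_metadata" instead.
df = spark.read.format("csv").load(path)
display(df.select(
    "_object_metadata",   # column from the data source itself
    "__object_metadata"   # cloud object metadata, shifted by one underscore
))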

Common file metadata like the file path or size can be queried using the _metadata column. For more information about the _metadata column, see File metadata column.

warning

New fields might be added to the _object_metadata column in future releases. To prevent schema evolution errors if the _object_metadata column is updated, you can select specific fields from the column in your queries. See Examples.

Schema

The _object_metadata column is a STRUCT containing the following fields, available starting from Databricks Runtime 18.1. All fields are nullable.

| Name | Type | Description | Example |
| --- | --- | --- | --- |
| mime_type | STRING | MIME type (content type) of the object, for example application/parquet or text/csv. | application/parquet |
| etag | STRING | ETag of the object. ETags are useful for detecting changes or versioning (see the example after this table). | "abc123def456" |
| user_metadata | VARIANT | User-defined metadata key-value pairs stored on the object. In S3, these are user-defined metadata headers; see User-defined metadata headers in the AWS documentation. In Azure Blob, these are user-defined metadata; see Manage blob properties and metadata with .NET in the Azure documentation. | {"my_key":"my_value"} |
| system_metadata | VARIANT | System-defined key-value pairs set by the cloud storage provider. | {"Content-Length":"1024", ...} |
| tags | VARIANT | User-defined object tag key-value pairs stored on the object. In S3, these are object tags; see Categorizing your objects using tags in the AWS documentation. Not all cloud storage services support object tags; see Notes for per-provider behavior. | {"my_tag":"my_value"} |

Examples

The following examples show how to read and query the _object_metadata column using different ingestion methods.

Read a batch of files

The following example reads a CSV file and selects both the _metadata and _object_metadata columns.

Python
path = "<path-to-load-from>"

df = spark.read.format("csv").load(path)
display(df.select("*", "_metadata", "_object_metadata"))

Stream files with Auto Loader

The following example uses Auto Loader to stream files from cloud storage and writes the _object_metadata column to a Delta table.

Python
path = "<path-to-load-from>"
checkpoint = "<checkpoint-path>"
schema_location = "<schema-location-path>"
table = "<output-table-path>"

dsw = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "text")
    .option("cloudFiles.schemaLocation", schema_location)
    .load(path)
    .selectExpr("*", "_metadata as md", "_object_metadata as obj_md")
    .writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint)
    .trigger(once=True)
    .start(table)
)

dsw.awaitTermination()

df = spark.read.format("delta").load(table).select("value", "md", "obj_md")
display(df)

Select specific fields

To avoid schema evolution errors from future changes to _object_metadata, select only the specific fields you need.

Python
path = "<path-to-load-from>"
schema = "<schema>"

(spark.read
    .format("csv")
    .schema(schema)
    .load(path)
    .select("_object_metadata.user_metadata", "_object_metadata.tags", "_object_metadata.etag"))

Use with COPY INTO

The following example uses COPY INTO to load files into a Delta table while selecting the _object_metadata column.

SQL
COPY INTO my_delta_table
FROM (
  SELECT *, _object_metadata FROM '<path-to-load-from>'
)
FILEFORMAT = CSV

Extract values from VARIANT fields

The user_metadata, system_metadata, and tags fields are VARIANT type. You can extract specific values using the :: cast operator or VARIANT functions; see VARIANT type. The following example uses the :: cast operator.

Python
path = "<path-to-load-from>"
schema = "<schema>"

(spark.read
    .format("csv")
    .schema(schema)
    .load(path)
    .selectExpr(
        "*",
        "_object_metadata.user_metadata:my_key::string as my_key",
        "_object_metadata.tags:environment::string as env_tag"
    ))
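
As an alternative to the :: cast operator, the same values can be extracted with VARIANT functions. The following minimal sketch uses variant_get (which takes a variant, a JSON path, and a target type) and its null-returning counterpart try_variant_get over the same hypothetical keys:

Python
path = "<path-to-load-from>"
schema = "<schema>"

(spark.read
    .format("csv")
    .schema(schema)
    .load(path)
    .selectExpr(
        "*",
        # variant_get extracts the value at the path and casts it to the type.
        "variant_get(_object_metadata.user_metadata, '$.my_key', 'string') as my_key",
        # try_variant_get returns NULL instead of failing on an invalid cast.
        "try_variant_get(_object_metadata.tags, '$.environment', 'string') as env_tag"
    ))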

Notes

Keep the following in mind when using _object_metadata.

  • The _object_metadata column works with Amazon S3, Azure DFS, Azure Blob, and GCP.
  • Selecting any field from _object_metadata triggers up to two additional cloud API calls per file, so queries over a large number of small files may experience some latency increase.
  • _object_metadata.tags is supported for S3 and Azure Blob Storage (non-HNS, blob.core.windows.net). On all other providers (Azure DFS, WASB, GCP), tags returns {}.
  • For S3, the credential must have s3:GetObjectTagging permission. If the credential lacks this permission, tags returns null.
  • If Databricks encounters an error fetching tags from a supported provider, tags returns null (see the defensive sketch after this list).
  • System metadata, user metadata, and tags are not available for Databricks-managed storage and are set to null.
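
Because tags can come back as null (missing permission or a fetch error) or as an empty object {} (unsupported provider), code that filters on a tag value should tolerate both. A minimal sketch, assuming a hypothetical environment tag:

Python
path = "<path-to-load-from>"

# try_variant_get returns NULL for a missing key or a NULL variant, so the
# filter below safely skips files without the tag.
df = (spark.read
    .format("csv")
    .load(path)
    .selectExpr(
        "*",
        "try_variant_get(_object_metadata.tags, '$.environment', 'string') as env_tag"
    ))

display(df.where("env_tag = 'prod'"))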