Object metadata column
This feature is in Public Preview.
The _object_metadata column is a hidden metadata column that exposes cloud object-level properties for each file read by a file-based data source. Unlike _metadata (which contains information like file path, size, and modification time), _object_metadata provides richer storage-layer properties fetched via cloud APIs — including MIME type, ETag, user-defined key-value metadata, system-defined metadata, and object tags.
The _object_metadata column is available for all input file formats when reading from cloud object storage. To include the _object_metadata column in the returned DataFrame, you must explicitly select it in the read query where you specify the source.
If the data source contains a column named _object_metadata, queries against _object_metadata return the data source column, not the cloud object metadata. To access the cloud object metadata column in this case, prepend an additional underscore (__object_metadata). If that name also collides with a data source column, prepend another underscore.
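As an illustration of this resolution order, the following sketch assumes source files that themselves contain a column named _object_metadata:

```sql
-- Sketch: assumes the CSV files contain their own _object_metadata column.
-- _object_metadata resolves to the data column stored in the files;
-- __object_metadata reaches the hidden cloud object metadata instead.
SELECT
  _object_metadata AS data_column,
  __object_metadata.etag AS object_etag
FROM csv.`<path-to-load-from>`
```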
Common file metadata like the file path or size can be queried using the _metadata column. For more information about the _metadata column, see File metadata column.
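For example, a quick way to inspect common file properties through the _metadata column (the fields shown are standard _metadata fields):

```sql
-- Query common file properties via the _metadata column.
SELECT
  _metadata.file_path,
  _metadata.file_size,
  _metadata.file_modification_time
FROM csv.`<path-to-load-from>`
```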
New fields might be added to the _object_metadata column in future releases. To prevent schema evolution errors if the _object_metadata column is updated, you can select specific fields from the column in your queries. See Examples.
Schema
The _object_metadata column is a STRUCT containing the following fields, available starting from Databricks Runtime 18.1. All fields are nullable.
| Name | Type | Description | Example |
|---|---|---|---|
| mime_type | STRING | MIME type (content type) of the object. | text/csv |
| etag | STRING | ETag of the object. ETags are useful for detecting changes or versioning. | "a1b2c3d4e5f6" |
| user_metadata | VARIANT | User-defined metadata key-value pairs stored on the object. For example, in S3 these are user-defined metadata headers. See User-defined metadata headers in the AWS documentation. In Azure Blob, these are user-defined metadata. See Manage blob properties and metadata with .NET in the Azure documentation. | {"owner": "data-team"} |
| system_metadata | VARIANT | System-defined key-value pairs set by the cloud storage provider. | {"Content-Encoding": "gzip"} |
| tags | VARIANT | User-defined object tag key-value pairs stored on the object. For example, in S3 these are object tags. See Categorizing your objects using tags in the AWS documentation. Not all cloud storage services support object tags. See Notes for per-provider behavior. | {"environment": "prod"} |
Examples
The following examples show how to read and query the _object_metadata column using different ingestion methods.
Read a batch of files
The following example reads a CSV file and selects both the _metadata and _object_metadata columns.
- Python
- Scala
path = "<path-to-load-from>"
df = spark.read.format("csv").load(path)
display(df.select("*", "_metadata", "_object_metadata"))
val path = "<path-to-load-from>"
val df = spark.read.format("csv").load(path)
display(df.select("*", "_metadata", "_object_metadata"))
Stream files with Auto Loader
The following example uses Auto Loader to stream files from cloud storage and writes the _object_metadata column to a Delta table.
- Python
- Scala
path = "<path-to-load-from>"
checkpoint = "<checkpoint-path>"
schema_location = "<schema-location-path>"
table = "<output-table-path>"
dsw = (spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "text")
.option("cloudFiles.schemaLocation", schema_location)
.option("header", "true")
.load(path)
.selectExpr("*", "_metadata as md", "_object_metadata as obj_md")
.writeStream
.format("delta")
.option("checkpointLocation", checkpoint)
.trigger(once=True)
.start(table)
)
dsw.awaitTermination()
df = spark.read.format("delta").load(table).select("value", "md", "obj_md")
display(df)
import org.apache.spark.sql.streaming.Trigger

val path = "<path-to-load-from>"
val checkpoint = "<checkpoint-path>"
val schemaLocation = "<schema-location-path>"
val table = "<output-table-path>"
val dsw = spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "text")
.option("cloudFiles.schemaLocation", schemaLocation)
.option("header", "true")
.load(path)
.selectExpr("*", "_metadata as md", "_object_metadata as obj_md")
.writeStream
.format("delta")
.option("checkpointLocation", checkpoint)
.trigger(Trigger.Once)
.start(table)
dsw.awaitTermination()
val df = spark.read.format("delta").load(table).select("value", "md", "obj_md")
display(df)
Select specific fields
To avoid schema evolution errors from future changes to _object_metadata, select only the specific fields you need.
- Python
- Scala
path = "<path-to-load-from>"
schema = "<schema-definition>"
(spark.read
.format("csv")
.schema(schema)
.load(path)
.select("_object_metadata.user_metadata", "_object_metadata.tags", "_object_metadata.etag"))
val path = "<path-to-load-from>"
val schema = "<schema-definition>"
spark.read
.format("csv")
.schema(schema)
.load(path)
.select("_object_metadata.user_metadata", "_object_metadata.tags", "_object_metadata.etag")
Use with COPY INTO
The following example uses COPY INTO to load files into a Delta table while selecting the _object_metadata column.
COPY INTO my_delta_table
FROM (
SELECT *, _object_metadata FROM '<path-to-load-from>'
)
FILEFORMAT = CSV
Extract values from VARIANT fields
The user_metadata, system_metadata, and tags fields are VARIANT type. You can extract specific values using the :: cast operator or VARIANT functions; the following example uses the :: cast operator. See VARIANT type.
- Python
- SQL
path = "<path-to-load-from>"
schema = "<schema-definition>"
(spark.read
.format("csv")
.schema(schema)
.load(path)
.selectExpr(
"*",
"_object_metadata.user_metadata:my_key::string as my_key",
"_object_metadata.tags:environment::string as env_tag"
))
SELECT
*,
_object_metadata.user_metadata:my_key::STRING AS my_key,
_object_metadata.tags:environment::STRING AS env_tag
FROM csv.`<path-to-load-from>`
Notes
Keep the following in mind when using _object_metadata.
- The _object_metadata column works with Amazon S3, Azure DFS, Azure Blob, and GCP.
- Selecting any field from _object_metadata triggers up to two additional cloud API calls per file, so queries over a large number of small files may experience some latency increase.
- _object_metadata.tags is supported for S3 and Azure Blob Storage (non-HNS, blob.core.windows.net). On all other providers (Azure DFS, WASB, GCP), tags returns {}.
- For S3, the credential must have s3:GetObjectTagging permission. If unavailable, tags returns null.
- If Databricks encounters an error fetching tags from a supported provider, tags returns null.
- System metadata, user metadata, and tags are not available for Databricks-managed storage and are set to null.
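Because tags can come back as null (missing permission, fetch error, managed storage) or as an empty map on unsupported providers, a defensive query can supply a fallback when extracting a tag value. A minimal sketch, assuming a hypothetical tag key named environment:

```sql
-- tags may be null or {} depending on provider and permissions,
-- so coalesce the extracted value with a fallback.
SELECT
  *,
  coalesce(_object_metadata.tags:environment::STRING, 'unknown') AS env_tag
FROM csv.`<path-to-load-from>`
```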