# Enrich Delta Lake tables with custom metadata
In addition to table comments, you can use fields in the Delta Lake transaction log to add custom tags to a table or messages for an individual commit.
## Set user-defined commit metadata
You can specify user-defined strings as metadata in commits made by these operations, either with the DataFrameWriter option `userMetadata` or the SparkSession configuration `spark.databricks.delta.commitInfo.userMetadata`. If both are specified, the option takes precedence. This user-defined metadata is readable in the `DESCRIBE HISTORY` output.
```sql
SET spark.databricks.delta.commitInfo.userMetadata=overwritten-for-fixing-incorrect-data
INSERT OVERWRITE default.people10m SELECT * FROM morePeople
```

```python
df.write.format("delta") \
  .mode("overwrite") \
  .option("userMetadata", "overwritten-for-fixing-incorrect-data") \
  .save("/tmp/delta/people10m")
```

```scala
df.write.format("delta")
  .mode("overwrite")
  .option("userMetadata", "overwritten-for-fixing-incorrect-data")
  .save("/tmp/delta/people10m")
```
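To confirm that the message was recorded, you can read it back from the table history; `userMetadata` appears as a column in the `DESCRIBE HISTORY` output. A minimal sketch, assuming the `default.people10m` table from the example above:

```sql
-- The userMetadata column of the most recent commit should contain
-- the string set via the option or configuration above.
DESCRIBE HISTORY default.people10m LIMIT 1
```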
## Store custom tags in table properties
You can store your own metadata as a table property using `TBLPROPERTIES` in `CREATE` and `ALTER` statements, and then view that metadata with `SHOW TBLPROPERTIES`. For example:
```sql
ALTER TABLE default.people10m SET TBLPROPERTIES ('department' = 'accounting', 'delta.appendOnly' = 'true');

-- Show the table's properties.
SHOW TBLPROPERTIES default.people10m;

-- Show just the 'department' table property.
SHOW TBLPROPERTIES default.people10m ('department');
```
`TBLPROPERTIES` are stored as part of Delta table metadata. You cannot define new `TBLPROPERTIES` in a `CREATE` statement if a Delta table already exists in a given location.
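For a brand-new table, custom properties can be supplied directly in the `CREATE` statement. A minimal sketch; the table name and two-column schema here are hypothetical, chosen only for illustration:

```sql
-- Hypothetical table and schema; any custom key-value pairs can be
-- set in TBLPROPERTIES at creation time.
CREATE TABLE default.people10m_copy (id BIGINT, firstName STRING)
USING DELTA
TBLPROPERTIES ('department' = 'accounting')
```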
In addition, to tailor behavior and performance, Delta Lake supports certain Delta table properties:

- Block deletes and updates in a Delta table: `delta.appendOnly=true`.
- Configure log and data retention properties for time travel: `delta.logRetentionDuration=<interval-string>` and `delta.deletedFileRetentionDuration=<interval-string>`. For details, see Configure data retention for time travel queries.
- Configure the number of columns for which statistics are collected: `delta.dataSkippingNumIndexedCols=n`. This property indicates to the writer that statistics are to be collected only for the first `n` columns in the table; data skipping also ignores statistics for any column beyond this column index. This property takes effect only for new data that is written out.
- Randomize file prefixes to avoid hot spots in S3 metadata: `delta.randomizeFilePrefixes=true`. For tables that require many fast read/write operations (thousands of requests per second), we strongly recommend dedicating an S3 bucket to the table (locating the table at the root of the bucket) and enabling randomized file prefixes for the best experience.
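As one illustration, the retention and statistics properties above can be set together on an existing table with a single `ALTER TABLE`. The interval and column-count values below are placeholders for illustration, not recommendations:

```sql
-- Placeholder values; choose intervals and column counts
-- appropriate for your workload.
ALTER TABLE default.people10m SET TBLPROPERTIES (
  'delta.logRetentionDuration' = 'interval 30 days',
  'delta.deletedFileRetentionDuration' = 'interval 7 days',
  'delta.dataSkippingNumIndexedCols' = '5'
)
```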
Note
Modifying a Delta table property is a write operation that conflicts with other concurrent write operations, causing them to fail. Databricks recommends that you modify a table property only when there are no concurrent write operations on the table.
You can also set `delta.`-prefixed properties during the first commit to a Delta table using Spark configurations. For example, to initialize a Delta table with the property `delta.appendOnly=true`, set the Spark configuration `spark.databricks.delta.properties.defaults.appendOnly` to `true`:
```python
spark.sql("SET spark.databricks.delta.properties.defaults.appendOnly = true")
```

```python
spark.conf.set("spark.databricks.delta.properties.defaults.appendOnly", "true")
```

```scala
spark.conf.set("spark.databricks.delta.properties.defaults.appendOnly", "true")
```
See also the Delta table properties reference.
## Explore table metadata
Delta Lake supports `SHOW [PARTITIONS | COLUMNS]` and `DESCRIBE TABLE`. See SHOW PARTITIONS, SHOW COLUMNS, and DESCRIBE TABLE.
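As a quick sketch, continuing with the `default.people10m` table used in the earlier examples:

```sql
-- List the table's columns.
SHOW COLUMNS IN default.people10m;

-- Show column names, data types, and comments.
DESCRIBE TABLE default.people10m;
```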