Collation support for Delta Lake
You can specify collation for string fields in Delta tables in Databricks Runtime 16.1 and above.
Enabling collation for a table adds the collations-preview writer table feature. You can read tables with collation enabled in Databricks Runtime 15.4 and above. See How does Databricks manage Delta Lake feature compatibility?.
Note
By default, Delta Lake sets the collation for string fields to UTF8_BINARY.
Create a table with collation at the column level
You can create a new table with collation at the column level using the following command:
CREATE TABLE $tableName (
nonCollatedColName STRING,
collatedColName STRING COLLATE UNICODE,
structColName STRUCT<nestedFieldName: STRING COLLATE UNICODE>,
mapColName MAP<STRING, STRING COLLATE UNICODE>,
arrayColName ARRAY<STRING COLLATE UNICODE>
) USING delta
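A hedged illustration of the effect (the sample values are hypothetical; this assumes the table created above):

```sql
-- Hypothetical rows; the remaining columns default to NULL.
INSERT INTO $tableName (nonCollatedColName, collatedColName)
VALUES ('apple', 'apple'), ('Banana', 'Banana');

-- UTF8_BINARY sorts by code point, so 'Banana' precedes 'apple';
-- UNICODE applies linguistic ordering, so 'apple' precedes 'Banana'.
SELECT nonCollatedColName FROM $tableName ORDER BY nonCollatedColName;
SELECT collatedColName FROM $tableName ORDER BY collatedColName;
```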
Alter a table column to specify collation
You can update an existing column to use collation using the following commands:
ALTER TABLE tableName ALTER COLUMN columnName TYPE newType
To remove a non-default collation (if one was present):
ALTER TABLE tableName ALTER COLUMN columnName TYPE STRING COLLATE UTF8_BINARY
To change the column collation to utf8_lcase:
ALTER TABLE tableName ALTER COLUMN columnName TYPE STRING COLLATE UTF8_LCASE
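A hedged sketch of what this change means for queries (the table and column names are placeholders carried over from the commands above):

```sql
-- With UTF8_LCASE, string comparison is case-insensitive,
-- so this predicate also matches 'Alice' and 'ALICE'.
SELECT count(*) FROM tableName WHERE columnName = 'alice';
```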
Altering the collation for a table does not automatically update statistics or data layout for previously written data. To improve file skipping over historical data under the new collation, Databricks recommends the following:
Run ANALYZE table_name COMPUTE DELTA STATISTICS to update file skipping statistics for existing data files.
For tables with liquid clustering enabled, run OPTIMIZE FULL table_name to update liquid clustering.
For tables that use ZORDER, do the following:
Disable incremental optimization in the Spark session by overriding the default Spark configuration with the following command:
SET spark.databricks.optimize.incremental=false
Run OPTIMIZE table_name ZORDER BY zorder_column to rewrite all existing data files.
Databricks always respects collation in query results; recomputing statistics and rewriting data affect only file skipping performance.
Disable collation for a table
You must explicitly disable collation for each string column in a table before dropping the collation feature.
Use the following syntax to set the collation for a column to UTF8_BINARY:
ALTER TABLE table_name
ALTER COLUMN column_name
TYPE STRING
COLLATE UTF8_BINARY
To drop the table feature, run the following command:
ALTER TABLE table_name
DROP FEATURE collations-preview
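To confirm the feature is gone, one option (a sketch; this assumes the Delta protocol convention that supported table features surface as delta.feature.<name> table properties) is to inspect the table properties:

```sql
-- After DROP FEATURE succeeds, delta.feature.collations-preview
-- should no longer appear in the output.
SHOW TBLPROPERTIES table_name;
```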
Schema evolution and collation
Collation interacts with schema evolution using the following rules:
If a source column already exists in the target table, the collation of the column in the target table remains unchanged.
If a source column has collation specified, the column added to the target table uses the specified collation.
If the target table does not have collation enabled when a column with collation is added, the collations-preview table feature is enabled.
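The rules above can be sketched with a merge that evolves the target schema (hypothetical tables; assumes MERGE schema evolution is available in your runtime):

```sql
-- If source has a column declared STRING COLLATE UNICODE that
-- target lacks, the evolved column keeps the UNICODE collation
-- and the collations-preview feature is enabled on target.
MERGE WITH SCHEMA EVOLUTION INTO target t
USING source s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```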
Limitations
The following limitations exist for tables with collation enabled:
Delta tables created externally with a collation not recognized by the Databricks Runtime throw an exception when queried.
There is no support for Delta Sharing.
Collated columns cannot be used with CHECK constraints.
Generated columns cannot use collation.
Collated columns cannot be used with bloom filter index columns.
There is no support for collation in OSS Delta Lake APIs for Scala or Python. You must use Spark SQL or DataFrame APIs to enable collation.
Dynamic partition overwrite is not supported on collated columns.
Collated columns cannot be referenced in Structured Streaming stateful queries.
External readers that do not respect the collations-preview table feature fall back to the default collation of UTF8_BINARY.
A MAP cannot have a key that is a collated string.
UniForm does not work with collations.