Collation support for Delta Lake
You can specify collations on string fields and read Delta tables that use collations in Databricks Runtime 16.4 LTS and above.
Enabling collation for a table adds the collations-preview writer table feature. See Delta Lake feature compatibility and protocols.
By default, Delta Lake sets the collation for string fields to UTF8_BINARY.
Create a table with collation at the column level
You can create a new table with collation at the column level using the following command:
CREATE TABLE $tableName (
nonCollatedColName STRING,
collatedColName STRING COLLATE UNICODE,
structColName STRUCT<nestedFieldName: STRING COLLATE UNICODE>,
mapColName MAP<STRING, STRING COLLATE UNICODE>,
arrayColName ARRAY<STRING COLLATE UNICODE>
) USING delta
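As a sketch of what column-level collation changes, the hypothetical table below (the name names and the sample values are illustrative, not from the original) contrasts default UTF8_BINARY ordering with UNICODE ordering:

```sql
-- Hypothetical table using the same shape as the example above.
CREATE TABLE names (
  binaryName STRING,                  -- default UTF8_BINARY collation
  unicodeName STRING COLLATE UNICODE
) USING delta;

INSERT INTO names VALUES ('Ärzte', 'Ärzte'), ('Zoo', 'Zoo');

-- UTF8_BINARY sorts by code point, so 'Ärzte' (Ä = U+00C4) sorts after 'Zoo';
-- UNICODE sorts linguistically, so 'Ärzte' sorts before 'Zoo'.
SELECT binaryName FROM names ORDER BY binaryName;
SELECT unicodeName FROM names ORDER BY unicodeName;
```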
Alter a table column to specify collation
You can update an existing column to use collation using the following commands:
ALTER TABLE tableName ALTER COLUMN columnName TYPE newType
To remove a non-default collation (if one was present):
ALTER TABLE tableName ALTER COLUMN columnName TYPE STRING COLLATE UTF8_BINARY
To change the column collation to UTF8_LCASE:
ALTER TABLE tableName ALTER COLUMN columnName TYPE STRING COLLATE UTF8_LCASE
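As a sketch of the behavior this change produces (the table customers and column city are hypothetical), UTF8_LCASE makes string comparisons case-insensitive:

```sql
-- Hypothetical table and column; after this, comparisons on city
-- ignore case.
ALTER TABLE customers ALTER COLUMN city TYPE STRING COLLATE UTF8_LCASE;

-- With UTF8_LCASE, equality matches 'Paris' and 'PARIS' as well.
SELECT * FROM customers WHERE city = 'paris';
```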
Altering the collation for a table does not automatically update statistics or data layout for previously written data. To improve file skipping over historical data under the new collation, Databricks recommends the following:
- Run ANALYZE table_name COMPUTE DELTA STATISTICS to update file skipping statistics for existing data files.
- For tables with liquid clustering enabled, run OPTIMIZE FULL table_name to update liquid clustering.
- For tables that use ZORDER, do the following:
  - Disable incremental optimization in the Spark session by overriding the default Spark configuration with the following command:
    SET spark.databricks.optimize.incremental=false
  - Run OPTIMIZE table_name ZORDER BY zorder_column to rewrite all existing data files.
Collation is always respected by Databricks in query results; the steps above only improve file skipping over historical data.
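Taken together, the recommended maintenance steps can be sketched as follows (the table name sales and column customer_name are placeholders; run only the step that matches your table's layout):

```sql
-- 1. Refresh file skipping statistics for existing data files.
ANALYZE sales COMPUTE DELTA STATISTICS;

-- 2a. For tables with liquid clustering: recluster all existing data.
OPTIMIZE FULL sales;

-- 2b. For ZORDER tables: disable incremental optimization,
--     then rewrite all existing data files.
SET spark.databricks.optimize.incremental=false;
OPTIMIZE sales ZORDER BY (customer_name);
```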
Disable collation for a table
You must explicitly disable collation for each string column in a table before dropping the collation feature.
Use the following syntax to set the collation for a column to UTF8_BINARY:
ALTER TABLE table_name
ALTER COLUMN column_name
TYPE STRING
COLLATE UTF8_BINARY
To drop the table feature, run the following command:
ALTER TABLE table_name
DROP FEATURE collations-preview
See Drop a Delta Lake table feature and downgrade table protocol.
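One way to confirm the downgrade took effect is to inspect the table properties, where enabled table features appear as delta.feature entries (the table name is a placeholder):

```sql
SHOW TBLPROPERTIES table_name;
-- After a successful drop, no delta.feature.collations-preview entry
-- should appear in the output.
```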
Schema evolution and collation
Collation interacts with schema evolution using the following rules:
- If a source column already exists in the target table, the collation of the column in the target table remains unchanged.
- If a source column has collation specified, the column added to the target table uses the specified collation.
- If the target table does not have collation enabled when a column with collation is added, the collations-preview table feature is enabled.
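For example, a merge with schema evolution can carry a collated column from a source into a target table. The sketch below assumes hypothetical target and source tables and a runtime that supports the WITH SCHEMA EVOLUTION clause:

```sql
-- Hypothetical: source has a column declared STRING COLLATE UNICODE
-- that does not yet exist in target. After the merge, target gains the
-- column with UNICODE collation (second rule above), while columns that
-- already existed in target keep their current collation (first rule).
MERGE WITH SCHEMA EVOLUTION INTO target AS t
USING source AS s
  ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```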
Limitations
The following limitations exist for tables with collation enabled:
- Delta tables created externally with a collation not recognized by the Databricks Runtime throw an exception when queried.
- There is no support for Delta Sharing.
- Collated columns cannot be used with CHECK constraints.
- Generated columns cannot use collation.
- Collated columns cannot be used as bloom filter index columns.
- There is no support for collation in OSS Delta Lake APIs for Scala or Python. You must use Spark SQL or DataFrame APIs to enable collation.
- Dynamic partition overwrite is not supported on collated columns.
- Collated columns cannot be referenced in Structured Streaming stateful queries.
- External readers that do not respect the collations-preview table feature fall back to the default collation of UTF8_BINARY.
- A MAP cannot have a key that is a collated string.
- UniForm does not work with collations.