Collation support for Delta Lake
You can specify collations on string fields and read Delta tables that use collations in Databricks Runtime 16.4 LTS and above.
Enabling collation for a table adds the collations-preview writer table feature. See Delta Lake feature compatibility and protocols.
By default, Delta Lake sets the collation for string fields to UTF8_BINARY.
Create a table with collation at the column level
You can create a new table with collation at the column level using the following command:
CREATE TABLE $tableName (
 nonCollatedColName STRING,
 collatedColName STRING COLLATE UNICODE,
 structColName STRUCT<nestedFieldName: STRING COLLATE UNICODE>,
 mapColName MAP<STRING, STRING COLLATE UNICODE>,
 arrayColName ARRAY<STRING COLLATE UNICODE>
) USING delta
Alter a table column to specify collation
You can update an existing column to use collation using the following commands:
ALTER TABLE tableName ALTER COLUMN columnName TYPE newType
To remove a non-default collation (if one was present):
ALTER TABLE tableName ALTER COLUMN columnName TYPE STRING COLLATE UTF8_BINARY
To change the column collation to utf8_lcase:
ALTER TABLE tableName ALTER COLUMN columnName TYPE STRING COLLATE UTF8_LCASE
Altering the collation for a table does not automatically update statistics or data layout for previously written data. To improve file skipping over historical data under the new collation, Databricks recommends the following:
- 
Run ANALYZE table_name COMPUTE DELTA STATISTICSto update file skipping statistics for existing data files.
- 
For tables with liquid clustering enabled, run OPTIMIZE FULL table_nameto update liquid clustering.
- 
For tables that use ZORDER, do the following:- 
Disable incremental optimization in the Spark Session by overriding the default Spark configuration with the following command: SQLSET spark.databricks.optimize.incremental=false
- 
Run OPTIMIZE table_name ZORDER BY zorder_columnto rewrite all existing data files.
 
- 
Collation will always be respected by Databricks in the query's results.
Disable collation for a table
You must explicitly disable collation for each string column in a table before dropping the collation feature.
Use the following syntax to set the collation for a column to UTF8_BINARY:
ALTER TABLE table_name
ALTER COLUMN column_name
TYPE STRING
COLLATE UTF8_BINARY
To drop the table feature, run the following command:
ALTER TABLE table_name
DROP FEATURE collations-preview
See Drop a Delta Lake table feature and downgrade table protocol.
Schema evolution and collation
Collation interacts with schema evolution using the following rules:
- If a source column already exists in the target table, the collation of the column in the target table remains unchanged.
- If a source column has collation specified, the column added to the target table uses the specified collation.
- If the target table does not have collation enabled when a column with collation is added, the collations-previewtable feature is enabled.
Limitations
The following limitations exist for tables with collation enabled:
- Delta tables created externally with a collation not recognized by the Databricks Runtime throw an exception when queried.
- There is no support for Delta Sharing.
- Collated columns cannot be used with CHECKconstraints.
- Generated columns cannot use collation.
- Collated columns cannot be used with bloom filter index columns.
- There is no support for collation in OSS Delta Lake APIs for Scala or Python. You must use Spark SQL or DataFrame APIs to enable collation.
- Dynamic partition overwrite is not supported on collated columns.
- Collated columns cannot be referenced in Structured Streaming stateful queries.
- External readers that do not respect the collations-previewtable feature fallback to default collation ofUTF8_BINARY.
- A MAPcannot have a key that is a collated string.
- UniForm does not work with collations.