Collation support for Delta Lake

You can specify collation for string fields in Delta tables in Databricks Runtime 16.1 and above.

Enabling collation for a table adds the collations-preview writer table feature. You can read tables with collation enabled in Databricks Runtime 15.4 and above. See How does Databricks manage Delta Lake feature compatibility?.
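One way to confirm that the feature has been enabled on a table is to inspect its properties. This is a sketch, assuming the feature appears as a delta.feature entry in the table properties, as other Delta writer features do:

```sql
SHOW TBLPROPERTIES table_name;
-- Enabled table features are listed as delta.feature.<name> entries,
-- for example: delta.feature.collations-preview
```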

Note

By default, Delta Lake sets the collation for string fields to UTF8_BINARY.
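For example, the default UTF8_BINARY collation compares strings byte by byte, while UTF8_LCASE ignores case. A quick way to see the difference, sketched with the COLLATE expression syntax available in Databricks Runtime 16.1 and above:

```sql
-- Byte-wise comparison under the default collation: false
SELECT 'Spark' = 'spark';

-- Case-insensitive comparison under UTF8_LCASE: true
SELECT 'Spark' COLLATE UTF8_LCASE = 'spark';
```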

Create a table with collation at the column level

You can create a new table with collation at the column level using the following command:

CREATE TABLE table_name (
 nonCollatedColName STRING,
 collatedColName STRING COLLATE UNICODE,
 structColName STRUCT<nestedFieldName: STRING COLLATE UNICODE>,
 mapColName MAP<STRING, STRING COLLATE UNICODE>,
 arrayColName ARRAY<STRING COLLATE UNICODE>
) USING delta

Alter a table column to specify collation

You can update an existing column to use collation using the following commands:

ALTER TABLE table_name ALTER COLUMN column_name TYPE new_type

To remove a non-default collation (if one was present):

ALTER TABLE table_name ALTER COLUMN column_name TYPE STRING COLLATE UTF8_BINARY

To change the column collation to UTF8_LCASE:

ALTER TABLE table_name ALTER COLUMN column_name TYPE STRING COLLATE UTF8_LCASE

Altering the collation for a table does not automatically update statistics or data layout for previously written data. To improve file skipping over historical data under the new collation, Databricks recommends the following:

  • Run ANALYZE TABLE table_name COMPUTE DELTA STATISTICS to update file skipping statistics for existing data files.

  • For tables with liquid clustering enabled, run OPTIMIZE FULL table_name to update liquid clustering.

  • For tables that use ZORDER, do the following:

    • Disable incremental optimization in the Spark Session by overriding the default Spark configuration with the following command:

      SET spark.databricks.optimize.incremental=false
      
    • Run OPTIMIZE table_name ZORDER BY zorder_column to rewrite all existing data files.

Databricks always respects the specified collation in query results, even if statistics and data layout have not yet been updated.

Disable collation for a table

You must explicitly disable collation for each string column in a table before dropping the collation feature.

Use the following syntax to set the collation for a column to UTF8_BINARY:

ALTER TABLE table_name
ALTER COLUMN column_name
TYPE STRING
COLLATE UTF8_BINARY

To drop the table feature, run the following command:

ALTER TABLE table_name
DROP FEATURE collations-preview

See Drop Delta table features.

Schema evolution and collation

Collation interacts with schema evolution using the following rules:

  • If a source column already exists in the target table, the collation of the column in the target table remains unchanged.

  • If a source column has collation specified, the column added to the target table uses the specified collation.

  • If the target table does not have collation enabled when a column with collation is added, the collations-preview table feature is enabled.
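A sketch of how these rules can play out with automatic schema evolution during a MERGE (the table and column names here are illustrative):

```sql
-- Target table without collation enabled
CREATE TABLE target (id INT, name STRING) USING delta;

-- Source table with a collated column not present in the target
CREATE TABLE source (id INT, name STRING, city STRING COLLATE UNICODE) USING delta;

MERGE WITH SCHEMA EVOLUTION INTO target t
USING source s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- target.name keeps its existing collation (UTF8_BINARY).
-- target.city is added with COLLATE UNICODE, which enables the
-- collations-preview table feature on the target table.
```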

Limitations

The following limitations exist for tables with collation enabled:

  • Delta tables created externally with a collation not recognized by the Databricks Runtime throw an exception when queried.

  • There is no support for Delta Sharing.

  • Collated columns cannot be used with CHECK constraints.

  • Generated columns cannot use collation.

  • Collated columns cannot be used with bloom filter index columns.

  • There is no support for collation in OSS Delta Lake APIs for Scala or Python. You must use Spark SQL or DataFrame APIs to enable collation.

  • Dynamic partition overwrite is not supported on collated columns.

  • Collated columns cannot be referenced in Structured Streaming stateful queries.

  • External readers that do not support the collations-preview table feature fall back to the default UTF8_BINARY collation.

  • A MAP cannot have a key that is a collated string.

  • UniForm does not work with collations.