Collation
This feature is in Public Preview.
Applies to:  Databricks SQL 
 Databricks Runtime 16.1 and above
A collation is a set of rules that determines how string comparisons are performed. Collations are used to compare strings in a case-insensitive, accent-insensitive, or trailing space insensitive manner, or to sort strings in a specific language-aware order.
Strings in Databricks are represented as UTF-8 encoded Unicode characters.
By default Databricks compares strings by their binary UTF8 representation. This is known as UTF8_BINARY collation.
UTF8_BINARY comparisons are fast and appropriate in many cases, but may not be suitable for all applications, especially those that require language-aware sorting or comparisons.
Aside from language-aware comparisons, a common use case is enforcing case-insensitivity.
Databricks has the UTF8_LCASE collation specifically for this purpose.
It converts strings to lowercase before comparing them using the fast UTF8_BINARY collation.
For language-aware comparisons, Databricks employs the following technologies:
- International Components for Unicode (ICU) library to compute collation
- Common Locale Data Repository (CLDR) tables for specific locale-aware collation.
- Unicode Locale Data Markup Language (LDML) to encode collations internally.
These technologies are encapsulated in a set of named collations that can be used in SQL statements.
Collation names
Because identifying collations by their LDML specification can be complex and challenging to read, Databricks has a set of easier-to-use named system collations.
Syntax
{ UTF8_BINARY |
  UTF8_LCASE |
  { UNICODE | locale } [ _ modifier [...] ] }
locale
  language_code [ _ script_code ] [ _ country_code ]
modifier
  { CS | CI | AS | AI | RTRIM }
- 
UTF8_BINARY A meta-locale binary collation that compares strings byte by byte based on the UTF-8 byte representation. UTF8_BINARYis the default and most lightweight collation for string comparison in Databricks.In this collation 'A' (x'65') < 'B' (x'66') < … < 'Z' (x'90'). However, 'Z' (x'90') < 'a' (x'97'), and 'A' (x'65') <> 'a' (x'97'). Further, characters such as 'Ä' (x'C384') are greater than 'Z' and 'z' in this collation. 
- 
UTF8_LCASE A lightweight meta-locale case-insensitive collation that compares strings using their UTF-8 byte representation after converting the strings to lower case. UTF8_LCASEis the collation used for identifiers in Databricks.For example: SQLORDER BY col COLLATE UTF8_LCASEis equivalent to SQLORDER BY LOWER(col) COLLATE UTF8_BINARY
- 
UNICODE The ICU root locale. This collation, known in CLDR as the 'root' locale (LDML specification: 'und-u') imposes a language agnostic order, which tries to be intuitive overall. In this collation, like characters are grouped. For example: 'a' < 'A' < 'Ä' < 'b'. 'A' is not considered equivalent to 'a'. Therefore, the collation is case-sensitive. 'a' is not considered equivalent to 'ä'. Therefore, the collation is accent-sensitive. 
- 
locale A locale-aware collation based on the CLDR tables. The locale is specified as a language code, an optional script code, and an optional country code. localeis case-insensitive.- language_code: A two-letter ISO 639-1 language code.
- script_code: A four-letter ISO 15924 script code.
- country_code: A three-letter ISO 3166-1 alpha-3 country code.
 
- 
modifier Specifies the collation behavior regarding case sensitivity and accent sensitivity. - CS: Case-sensitive. The default behavior.
- CI: Case-insensitive.
- AS: Accent-sensitive. The default behavior.
- AI: Accent-insensitive.
 Applies to: Databricks SQL Databricks Runtime 16.2 and above - RTRIM: Trailing space insensitive. Trims trailing spaces ('u0020') before comparison.
 Applies to: Databricks SQL Databricks Runtime 16.2 and above You can specify RTRIM, eitherCSorCI, and eitherASorAIat most once and in any order. The modifiers themselves are case-insensitive.
When processing a collation, Databricks normalizes collation names by removing defaults.
For example, SR_CYR_SRN_CS_AS is normalized to SR.
For a list of supported collations, see Supported collations.
Examples
-- You can fully qualify collations, and case doesn't matter.
system.builtin.unicode
-- Since all collations are system defined you don't need to qualify them
unicode
-- Using 2-letter language code only for german collation
DE
-- Using 2-letter language code and 3-letter country code for french-canadian collation
-- It is common to use lower case 2-letter language codes and upper case 3-letter country codes
-- But collation names are case insensitive
fr_CAN
-- Using 2-letter language code and 4-letter script code and 3-letter country code for traditional chinese in Macao
zh_Hant_MAC
-- Using a 2 letter german language code and 2 modifiers for case insensitive and accent insensitive
-- So 'Ä', 'A', and 'a' are all considered equal
de_CI_AI
-- Using back ticks is allowed, but unnecessary for builtin collations
`UTF8_BINARY`
Default collation
The default collation applies when using STRING literals, parameter markers, functions without STRING parameters producing strings, and when defining column, field or variable types without a COLLATE clause.
The default collation is derived in one of the following ways:
- 
For DDL statements such as ALTER TABLE,CREATE VIEW,CREATE TABLE, andCREATE FUNCTION:- The default collation is the default collation of the object being created or altered.
- If no DEFAULT COLLATIONclause is specified, the default collation isUTF8_BINARY.
 
- 
For DML (UPDATE, DELETE, INSERT, MERGE), and query statements the default collation is UTF8_BINARY.
Collation precedence
To decide which collation to use for a given string Databricks defines collation precedence rules.
The rules assign 4 levels of precedence to collations:
- 
Explicit The collation has been explicitly assigned to a string using COLLATE expression. Examples SQL-- Force fast binary collation to check whether a vin matches a Ferrari
 vin COLLATE UTF8_BINARY LIKE 'ZFF%'
 -- Force German collation to order German first names
 ORDER BY vorname COLLATE DE
- 
Implicit The collation is implicitly assigned by the column, field, column-alias, variable, or routine parameter reference. This includes the result of a subquery as long as the collation is not None. Examples SQL-- Use the collation of the column as it was defined
 employee.name LIKE 'Mc%'
 -- Use the collation of the variable as it was defined.
 translate(session.tempvar, 'Z', ',')
- 
Default A STRINGliteral, named or unnamed parameter marker, or aSTRINGproduced by a function from another type.Examples SQL-- A literal string has the default collation
 'Hello'
 -- :parm1 is a parameter marker using session default collation
 EXECUTE IMMEDIATE 'SELECT :parm1' USING 'Hello' AS parm1;
 -- ? is a parameter marker using session default collation
 EXECUTE IMMEDIATE 'SELECT ?' USING 'Hello';
 -- The result of a cast of a non-STRING to a STRING is a STRING with the default collation
 CAST(5 AS STRING)
 -- The date is converted to a string using the default collation
 to_char(DATE'2016-04-08', 'y')
 -- The collation of the session_user STRING is the default collation
 session_user()The assigned collation is the Default Collation. 
- 
None A STRINGresult of a function, operator or set operation (e.g.UNION) that takes more than oneSTRINGargument which have different implicit collations.Examples SQL-- Concatenating two strings with different explicit collations results in no collation
 SELECT fr || de AS freutsch FROM VALUES('Voulez vous ' COLLATE FR), 'Kartoffelsupp...' COLLATE DE) AS T(fr, de)
 -- A union of two strings with different excplicit collations results in no collation
 SELECT 'Voulez vous ' COLLATE FR UNION ALL SELECT 'Kartoffelsupp...' COLLATE DE
Collation derivation
When deriving the collation for a STRING result, the collation precedence rules are applied in the following ways:
If the expression:
- 
matches the definitions above The collation and precedence is as defined. 
- 
is a function or operator with a single STRINGparameter, returning aSTRINGThe collation and precedence is that of the STRINGparameter.
- 
is a function or operator with two or more STRINGparameters- 
with the same collations and precedence The collation and precedence is that of the STRINGparameters.
- 
with different collations or precedence Let C1andC2be distinct collations and letDbe the default collation. The precedence and the collation is determined by the following table:Collation and Precedence C1 Explicit C1 Implicit D Default None C2 Explicit Error C2 Explicit C2 Explicit C2 Explicit C2 Implicit Explicit C1 None C2 Implicit None D Default C1 Explicit C1 Implicit D Default None None C1 Explicit None None None 
 
- 
Examples
> SELECT 'hello' = 'hello   ' COLLATE UNICODE_RTRIM;
  true
> CREATE TABLE words(fr STRING COLLATE FR, de STRING COLLATE DE, en STRING COLLATE EN);
> INSERT INTO words VALUES ('Salut', 'Hallo', 'Hello');
-- A literal string has the default collation
> SELECT collation('Ciao');
  UTF8_BINARY
-- A function producing a STRING has the default collation
> SELECT collation(user());
  UTF8_BINARY
-- Function modifying a STRING passes the collation through
> SELECT collation(upper('Ciao'));
  UTF8_BINARY
-- Implicit collation (French) wins over default collation
> SELECT collation(fr || 'Ciao') FROM words;
  FR
-- Explicit collation (French) wins over implicit collation (German)
> SELECT collation('Salut' COLLATE FR || de) FROM words;
  FR
-- Implicit collation German collides with implicit collation French
-- The result is no collation
> SELECT collation(de || fr) FROM words;
  null
-- Explicit collation (French) wins over default collation (Italian)
> SELECT collation('Salut' COLLATE FR || 'Ciao');
  FR
-- Explicit collation (French) collides with explicit collation (German)
> SELECT collation('Salut' COLLATE FR || 'Hallo' COLLATE DE);
  COLLATION_MISMATCH.EXPLICIT
-- The explicit collation wins over no collation
> SELECT collation('Ciao' COLLATE IT || (fr || de)) FROM words;
  IT
-- The implict collation (English) does not win over None
> SELECT collation(en || (fr || de)) FROM words;
  null
-- The explicit collation (English) wins over Implicit collation anywhere in the expression
> SELECT collation((fr || ltrim('H' COLLATE EN, fr)) || fr) FROM words;
  EN