メインコンテンツまでスキップ

Regular expressions

Applies to: check marked yes Databricks SQL check marked yes Databricks Runtime

A regular expression (regex) is a pattern that describes a set of strings. Several Databricks SQL functions and operators take a regular expression to match, locate, count, extract, replace, or split text:

  • rlike operator and regexp operator: test whether a string matches a pattern.
  • regexp_like function: test whether a string matches a pattern.
  • regexp_count function: count how many times a pattern matches.
  • regexp_instr function: return the position of the first match.
  • regexp_substr function: return the first matching substring.
  • regexp_extract and regexp_extract_all functions: return a capture group from the first match or from all matches.
  • regexp_replace function: replace matches with a replacement string.
  • split function: split a string around matches of a pattern.

The like and ilike operators do not use regular expressions. They use a simpler pattern language based on _ and %.

Regex engine

Databricks evaluates regular expressions using the Java java.util.regex engine. Patterns must follow Java regular expression syntax, which is broadly compatible with Perl-style regular expressions but differs in some details. For the complete and authoritative reference, see the Java Pattern class documentation.

The following sections describe the most commonly used constructs.

Character classes

Construct

Matches

.

Any single character except a line terminator (unless the s flag is set).

[abc]

Any one of the listed characters: a, b, or c.

[^abc]

Any single character that is not a, b, or c.

[a-z]

Any character in the range a through z.

\d, \D

A digit ([0-9]); \D matches any non-digit.

\w, \W

A word character ([a-zA-Z_0-9]); \W matches any non-word character.

\s, \S

A whitespace character; \S matches any non-whitespace character.

Anchors and boundaries

Construct

Matches

^

The start of the input (or of a line with the m flag).

$

The end of the input (or of a line with the m flag).

\b, \B

A word boundary; \B matches a non-word boundary.

Quantifiers

Quantifiers specify how many times the preceding element must match. By default, quantifiers are greedy and match as much as possible. Append ? to make a quantifier reluctant (match as little as possible), for example .*?.

Construct

Matches the preceding element…

*

Zero or more times.

+

One or more times.

?

Zero or one time.

{n}

Exactly n times.

{n,}

At least n times.

{n,m}

Between n and m times.

Groups and alternation

Construct

Description

(...)

A capturing group. Use the group index with regexp_extract.

(?:...)

A non-capturing group. Groups without capturing the match.

a|b

Alternation. Matches either a or b.

Inline flags

Place a flag at the start of the pattern to change how the whole pattern is matched.

Flag

Effect

(?i)

Case-insensitive matching.

(?s)

Dotall mode. . matches line terminators.

(?m)

Multiline mode. ^ and $ match at line breaks.

(?x)

Comments mode. Ignores unescaped whitespace in the pattern.

For example, (?i)ste(v\|ph)en matches Steven, steven, and STEPHEN regardless of case.

Escaping and string literals

To match a character that is also a regex metacharacter (such as ., *, (, [, or \), precede it with a backslash. For example, \. matches a literal period, and \\ matches a literal backslash.

Because a backslash is also the escape character in ordinary SQL string literals, you must double each backslash that you intend to pass to the regex engine. For example, to pass the pattern \d+ to a function, write it as '\\d+' in a regular string literal.

To avoid this double escaping, use a raw-literal (r prefix), which disables escape-character preprocessing. In a raw literal, write the pattern exactly as the regex engine expects it:

SQL
-- Regular string literal: backslashes must be doubled.
> SELECT regexp_substr('item 42 in stock', '\\d+');
42

-- Raw literal: write the pattern as-is.
> SELECT regexp_substr('item 42 in stock', r'\d+');
42

Common patterns

The following patterns cover frequent tasks. They are written as raw literals so that backslashes pass through to the regex engine unchanged. The email, URL, and IP patterns are intentionally simple illustrations, not strict validators.

Goal

Pattern

Integer, optionally signed

r'-?\d+'

Decimal number

r'-?\d+(\.\d+)?'

A word

r'\w+'

The whole word cat only

r'\bcat\b'

Email address (simple)

r'[\w.%+-]+@[\w.-]+\.\w{2,}'

Host in a URL

r'https?://([^/]+)'

IPv4 address (does not check 0–255)

r'\d{1,3}(\.\d{1,3}){3}'

ISO date (YYYY-MM-DD)

r'\d{4}-\d{2}-\d{2}'

Leading or trailing whitespace

r'^\s+|\s+$'

US phone number (simple)

r'\d{3}-\d{3}-\d{4}'

Word boundaries (\b) match a whole word rather than a substring:

SQL
-- 'cat' as a standalone word
> SELECT 'the cat sat on the mat' rlike r'\bcat\b';
true

-- 'cat' only as part of a larger word
> SELECT 'category' rlike r'\bcat\b';
false

Combine a literal prefix with a quantifier to extract a structured token, such as an order ID:

SQL
> SELECT regexp_substr('Ref: ORD-12345 shipped on 2024-03-15', r'ORD-\d+');
ORD-12345

Choose the right function

Use this table to pick the function that matches your goal.

You want to…

Use

Test whether a string matches.

rlike / regexp / regexp_like

Return the first matching substring.

regexp_substr

Return a capture group.

regexp_extract

Return all matches.

regexp_extract_all

Return the position of a match.

regexp_instr

Count the number of matches.

regexp_count

Replace matches with other text.

regexp_replace

Split a string around matches.

split

Match a simple _ / % pattern.

like / ilike