in practice

Regexes in practice #

Concrete regexes deviate from theoretical ones in several ways. In particular:

  1. Concrete syntaxes include a wide range of syntactic sugar, i.e. operators that could in theory be expressed in terms of |, * and concatenation.

  2. Many regex engines support expressions with features (such as backreferences) that go beyond the expressivity of theoretical regexes. Technically, such expressions do not qualify as as regexes (they may describe a language that is not regular). However, in practice, they are still referred to as “regexes”. This is also what we will do in this section.

Flavours #

Multiple syntaxes for regexes coexist, which have a lot in common. Among these:

  • Regexes defined by the IEEE POSIX standards are supported by a variety of command-line utilities, scripting languages or database engines.

  • The regex engine of the programming language Perl has been very influential. A widely used variant is the PCRE (Perl Compatible Regular Expressions) library, written in C (and natively used by PHP and R). Besides, many programming languages (Java, Javascript, Python, C#, etc.) have adopted a regex syntax and engine inspired by Perl’s.

A detailed comparison of some the main flavours of regexes can be found here.

Warning. The patterns used in .gitigore files are not regular expressions.

Java regexes #

Java regexes are very similar to PCRE regexes (albeit less expressive).

We list below some of the main constructs of Java regexes, and refer to the Javadoc for an exhaustive list. Most of these are identical in other regex languages inspired by Perl’s.

Characters #

A character in a Java regex can alternatively be written using its Unicode hexadecimal identifier, preceded with \u.

Examples.

  • the character A can be written either in its normal form, or \u0041
  • the carriage return character can be written either \r or \u000D

Special characters (like | or *) need to be escaped to be treated as standard ones (except in some specific contexts, see below).

Examples. In order to be treated as a standard character:

  • * can be either escaped (\*) or written \u002A
  • \ can be either escaped (\\) or written \u005C

Syntactic sugar #

We have seen in previous sections that in theory, regular expressions only admit three operators:

  • concatenation ($e_1e_2$),
  • union ($e_1$|$e_2$) and
  • closure ($e$*).

However, this is impractical for most implication.

Example. In order to match a character that is not a digit, with these three operators only, one would need to write an expression that enumerates all other characters in the underlying alphabet, e.g.:

$\qquad\qquad$ a | b | c | .. | A | B | C | .. | & | ! | > | ..

For the simple ASCII alphabet, this is already more than a hundred characters. For the Unicode alphabet, this is more than 100 000.

In practice, regexes include additional operators, most of which are syntactic sugar (i.e. could in theory be expressed with only concatenation, union, and closure).

Character class #

Square brackets ([ and ]) are used to denote a set of characters, called a character class.

Examples.

  • [abcd] is equivalent to a|b|c|d

  • gr[ae]y is equivalent to gr(a|e)y

A character class can also be defined with ranges of characters. If $c_1$ and $c_2$ are two characters, then $c_1$-$c_2$ (within square brackets still) denotes any character between $c_1$ and $c_2$ in the Unicode alphabet.

Examples.

  • [A-Z] matches any character between A and Z
  • [A-Za-z0-9] matches any character between (A and Z) or between (a and z) or between (0 and 9)
  • [A-Za-z@] matches @ or any character between (A and Z) or between (a and z)

The complement (in the Unicode alphabet) of a character class is described by adding a ^ after the opening square bracket [.

Examples.

  • [^a] matches any character different from a
  • [^abc] matches any character that is neither a, b or c
  • [^0-9] matches any character that is not a digit
  • [^A-Za-z0-9] matches any character that is not an ASCII letter or digit

Most special characters (like * or |) can be used unescaped inside square brackets to refer to a character (some exceptions to this rule are -, ^ or \).

Examples.

  • [|*] matches either a | or a *
  • [^|] matches any character that is not |

The special character . stands for any (Unicode) character

Examples.

  • .. matches any string of two characters
  • .* matches any Unicode string

List all matches returned by a regex engine for:

  1. the regex gr[ae]y in the words grey and gray are homonyms
  2. the regex [A-Z][^;]*; in Alice:12;Bob,35;
  3. the regex <.*> in <span class="title">My Title</span>
  1. $[10 .. 14)$: grey, $[19 .. 23)$: gray
  2. $[0 .. 8)$: Alice:12;, $[19 .. 23)$: Bob:35;
  3. $[0 .. 35)$: <span class="title">My Title</span>

Use the regex validation tool regex101 to write a regex whose evaluation matches HTML tags. For instance, in the word

$\qquad\qquad$<span class="title">My Title</span>

the regex should match <span class="title"> and </span>.

<[^>]*>

Predefined character classes #

Some escaped characters denote common character classes:

  • \d is equivalent to [0-9]
  • \w is equivalent to [A-Za-z0-9]
  • \s stands for any whitespace character. It is equivalent to

$\qquad\qquad$[ \t\n\x0B\f\r]

  • \R stands for any Unicode linebreak sequence. It is equivalent to

    $\qquad$\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]

The complements of some of these classes are also available:

  • \D is equivalent to [^\d]
  • \S is equivalent to [^\s]
  • \W is equivalent to [^\w]

Quantifiers #

If $i \le j \in \mathbb{N}$ then $e${$i$,$j$} concatenates $e$ with itself at least $i$ times and at most $j$ times, and is evaluated in a greedy way.

Examples.

  • a{1,3} is equivalent to aaa|aa|a
  • [A-Z]{1,3} greedily matches a sequence of 1 to 3 capital letters
  • [A-Z]{0,3} greedily matches a (possibly empty) sequence of at most 3 capital letters

$e${$i$} is equivalent to $e${$i$,$i$}.

Examples.

  • a{3} is equivalent to aaa
  • [A-Z]{4} matches a sequence of 4 capital letters

$e$+ is equivalent to $ee$*.

Examples.

  • a+ greedily matches a sequence of at least one a
  • \d+ greedily matches a nonempty sequence of digits
  • [^\d]+ greedily matches a nonempty sequence of non-digits

If $e$ does not end with a quantifier, then $e$? is equivalent to $e$|$\varepsilon$.

Examples.

  • a? greedily matches at most one occurrence of a
  • (ab)? greedily matches at most one occurrence of ab
  • [A-Z]? greedily matches at most one capital letter

Lazy quantification #

If $e$ ends with a quantifier (e.g. with * or +) then $e$? forces a lazy evaluation of this quantifier.

Examples.

  • In the word $w$ = ab:

    • ab* matches $w$
    • ab*? matches $w[0 .. 1)$: a
  • In the word $w$ = abbb:

    • ab{1,3} matches $w$
    • ab{1,3}? matches $w[0 .. 2)$: ab

Boundary matchers #

Some special characters match empty segments (i.e. segments with word $\varepsilon$).

In particular:

  • ^ (outside of a character class definition) matches the beginning of a string,
  • $ matches the end of a string.

Examples. Consider the string $w$ = a123b

  • the regex \d+ matches $w[1 .. 3)$: 123
  • the regex ^\d+ has no match in $w$
  • the regex \d+$ has no match in $w$

Hint. These two special characters are widely used, in particular for string validation: an input word $w$ belongs to the language described by a regex $e$ iff there is a match for the regex ^$e$$ in $w$.

Besides, if this is the case, then the best first match is also the longest possible match.

Example. In the string $w$ = ab:

  • the best first match for a|ab is $w[a .. 1)$: a
  • the best first match for ^(a|ab)$ is $w[a .. 2)$: ab

The special character \b matches any “natural language word” boundary (i.e. beginning or end).

Example. Consider the string $w$ = User Alice797 is 55 years old

  • The best first match for \d+ in $w$ is $w[10 .. 13)$: 797
  • The best first match for \d+\b in $w$ is $w[10 .. 13)$: 797
  • The only match for \b\d+ in $w$ is $w[17 .. 19)$: 55
  • The only match for \b\d+\b in $w$ is $w[17 .. 19)$: 55

Group #

A group in a regex is the content of a pair of parentheses.

Groups allow capturing subsegments of a matched segment.

Example. Consider the regex: $e$ = a(b|c)d and the word $w = $abd

There is only one match for $e$ in $w$ (namely $w$ itself).

Within this match, the subsegment captured by the group (a|b) is $w[1 .. 2)$: b.

Groups in a regex are (totally) ordered based on the position of their opening parenthesis.

Example. In the regex a(([\d]*)(a|b))

  • Group number 1 is (([\d]*)(a|b))
  • Group number 2 is ([\d]*)
  • Group number 3 is (a|b)

Use the regex validation tool regex101 to write a regex with a group that matches what is inside a pair of opening and a closing HTML tags. For instance, in the word

$\qquad\qquad$<span class="title">My Title</span> <span class="content">Hello</span>

the group should match My Title and Hello

>([^<]*)</

Backreference #

A group can be referenced within a regex, with \$n$, where $n$ is the group number. If the reference appears after the group, this is called a backreference.

Example.

  • In the regex a(b|c)d\1q, the backreference \1 refers to the group (b|c)
  • In the regex (a(b|m))d\2m, the backreference \2 refers to the group (b|m)

Evaluation. Let $e$ be a regex with backreferences.

Consider the expression $e’$ identical to $e$, but where each backreference is replaced with the group that it references. Then a match for $e$ is a match for $e’$ where each group and its copy capture segments with identical words.

Example. The regex (a|b)\1 is equivalent to aa|bb

Which words does the following regex match?

$\qquad\qquad$((\d)_\2)@\1

.

  • 1_1@1_1,
  • 2_2@2_2,
  • 3_3@3_3,
  • etc.

Note. Regexes with backreferences go beyond the expressivity of traditional regular expressions, meaning that they can describe languages that are not regular.