Pattern matching

Pattern matching #

Two of the main uses of regexes are:

  • string validation, i.e. check whether a string satisfies certain constraints. For instance:
    • has at least one digit and one special character,
    • is a valid address,
    • etc.
  • string search, i.e. identify occurrences of a pattern in a line or in a file. This can be useful for:
    • replacing strings in a file or a folder (e.g. within a codebase),
    • splitting a sentence into (natural language) words,
    • splitting a program into meaningful substrings, e.g. split the instruction int a=5; into int, a, =, 5 and ;,
    • filtering (e.g. with the grep command) of modifying (e.g. with the sed command) lines in a file,
    • etc.

In both scenarios, a regex is viewed as a pattern that should be matched against a string.

Segment #

To make the notion of pattern matching more precise, we will represent a word $w$ as an array of characters, and a segment of $w$ as a (possibly empty) subarray of this array.

Notation. If $w$ is a word, we use $w[i .. j)$ for the subsegment of $w$ that starts at index $i$ included, and ends at index $j$ excluded.

Example. Let $w$ = abca. Then:

  • $w[0 .. 1)$ is a segment of $w$ with length 1 and word a,
  • $w[3 .. 4)$ is another segment of $w$ with length 1 and word a
  • $w[0 .. 3)$ is a segment of $w$ with length 3 and word abc
  • $w[0 .. 0)$ is a segment of $w$ with length 0 and word $\varepsilon$
  • $w[1 .. 1)$ is another segment of $w$ with length 0 and word $\varepsilon$

Match #

We can now define what a match is:

Definition. Let $w$ be a word (viewed as an array), and let $e$ be a regex that describes the language $L$.

A match for $e$ in $w$ is a (possibly empty) segment of $w$ whose word belongs to $L$.

Example. Let $e$ be the regex ab*, and let $w$ be the word aba.

There are 3 matches for $e$ in $w$:

  • $w[0 .. 1)$ with word a,
  • $w[0 .. 2)$ with word ab,
  • $w[2 .. 3)$ with word a

Warning. Several matches may carry the same word (e.g. the first and third matches in the example above).

Warning. Some matches may overlap (e.g. the first and second matches in the example above).

Consider the regex $e$ = a* and the word $w$ = aa.

How many matches are there for $e$ in $w$?

There are 6 matches for the regex a* in the word aa:

  • $w[0 .. 0)$: $\varepsilon$
  • $w[0 .. 1)$: a
  • $w[0 .. 2)$: aa
  • $w[1 .. 1)$: $\varepsilon$
  • $w[1 .. 2)$: a
  • $w[2 .. 2)$: $\varepsilon$

Best first match #

In practice, (most) regex engines do not identify all matches for a regex in a word. Instead, they rely on the notion of best first match (explained below). Reasons include:

  • performance,
  • avoiding overlapping matches (e.g. when performing a global “search and replace”).

Let $e$ be a regex, let $w$ be a word, and let $m_1$ and $m_2$ be two matches for $e$ in $w$.

$m_1$ is preferred to $m_2$ if:

  • $m_1$ starts before $m_2$: for instance $w[1 .. 4)$ is preferred to $w[3 .. 8)$, or
  • they start at the same index and the regex engine favors $m_1$ over $m_2$.

Preference is a total order over the matches for for $e$ in $w$, meaning that if there is a match, then there can be at most one best first match.

Note. When $m_1$ and $m_2$ start at the same index (second case above), whether $m_1$ is favored over $m_2$ can vary in subtle ways from one regex engine to the other. These preferences are usually explained in algorithmic terms (and half-informally).

For a reasonably detailed tutorial about the behavior(s) of regex search engine, we refer to this website. For Java, the Oracle tutorial on regexes can also be a good entry point (although less precise that the previous reference).

In this section, we only introduce basic behaviors of regex engines.

Left-to-right #

Commutative operations are evaluated from left-to-right.

Example. Consider the word $w$ = ab.

  • the best first match for the regex ab|a in $w$ is $w[0 .. 2) (with word ab)
  • the best first match for the regex a|ab in $w$ is $w[0 .. 1) (with word a)

Exception. In this example, A POSIX-compliant regex engine would produce $w[0 .. 2) (with word ab) as best first match for both expressions.

Warning. As illustrated with the above example, two regexes that describe the same language may have different best first matches (in the same word).

Greedyness #

The * operator is evaluated (by default) in a greedy way. This means that the engine tries to matches as many characters as possible for this operator. More precisely, when encountering a subexpression of the form $e$*:

  • the engine first considers the longest possible match for $e$*,
  • if this match does not result in a match for the whole expression, then the engine backtracks by one character (i.e. reduces the length of the match for $e$* by one), and tries again to find a match for the whole expression,
  • if this fails again, then the engine backtracks again by one character,
  • etc.

Examples.

  • The best first match for the regex a* in aa is the whole word.
  • The best first match for the regex a* in aab is $[0 ..2)$ (with word aa).
  • The best first match for the regex a* in aabaa is $[0 ..2)$ (with word aa).
  • The best first match for the regex (a|b)*b in abab is the whole word. In this case,
    • the engine first tries to match (a|b)* against the whole word,
    • this does not result in a match for the whole regex, because the trailing b in the regex is not matched,
    • so the engine backtracks, and matches (a|b)* against $[0 .. 3)$ (with word aba),
    • this results in a match for the whole expression.

Warning. The interaction of left-to-right evaluation and greedy matching can be hard to predict. This is why we highly recommend writing and debugging regexes with a validation engine, such as regex101.

Warning. Nested * operators can be costly, due to a combinatorial explosion of the number of attempts to find a match. This is sometimes referred to as catastrophic backtracking.

Successive best first matches #

Most regex engines can return best first matches in an iterative way. Intuitively, the engine “consumes” the best first match. Then the next match is the best first match in the remaining string, etc.

More precisely:

Definition. Let $e$ be regex and let $w$ be a word with lenght $n$.

Then:

  • the first match $w[i_1 .. j_1)$ for $e$ in $w$ is the best first match (if any) for $e$ in $w$,
  • the second match $w[i_2 .. j_2)$ for $e$ in $w$ is the best first match (if any) for $e$ in $w[j_1 .. n)$,
  • the third match $w[i_3 .. j_3)$ for $e$ in $w$ is the best first match (if any) for $e$ in $w[j_2 .. n)$,
  • etc.

Example. Let $e$ be the regex ab*, and let $w$ be the word abac.

  • The best first match for $e$ in $w$ is $w[0 ..2)$, with word ab.

  • The second match is the best first match in the rest of $w$, i.e. in $w[2 .. 4)$. This match is $w[2 .. 3)$, with word a.

  • There is no third match.

In each of the cases below, find all successive best first matches for the regex $e$ in word $w$:

regex $e$ word $w$
a aba
a*b aba
a* a
a* aba
(ab)* aba
(ab)* abab
regex $e$ word $w$ matches
a aba $[0 .. 1)$:a, $[2 .. 3)$:a
a*b aba $[0 .. 2)$:ab
a* a $[0 .. 1)$:a, $[1 .. 1)$: $\varepsilon$
a* aba $[0 .. 1)$:a, $[1 .. 1)$: $\varepsilon$, $[2 .. 3)$:a, $[3 .. 3)$: $\varepsilon$
(ab)* aba $[0 .. 1)$:ab, $[2 .. 2)$: $\varepsilon$, $[3 .. 3)$: $\varepsilon$
(ab)* abab $[0 .. 3)$:abab, $[3 .. 3)$: $\varepsilon$

Lazy closure #

Some regex engines (notably the ones used by default in Java, Javascript, Python, etc.) support alternative ways to evaluate * operators.

One of them is lazy evaluation (also called reluctant). In this case, the star operator matches the shortest possible segment.

Examples.

regex $e$ word $w$ best first match with lazy evaluation of star operators
a* aa $[0 .. 0)$:$\varepsilon$
a*ac ac $[0 .. 2)$:ac
a*ac aac $[0 .. 3)$:aac

Note. In a same regular expression, it is possible to specify that some * operators must be evaluated in a lazy way, and others in a greedy way.

In each of the cases below, find all successive best first matches for the regex $e$ in word $w$, where * operators are evaluated in a lazy way.

regex $e$ word $w$
ab* ab
a*b aba
a*ab* ab
a* a
a* aa
b* aa
regex $e$ word $w$ lazy matches
ab* ab $[0 .. 1)$:a
a*b aba $[0 .. 2)$:ab
a*ab* ab $[0 .. 1)$:a
a* a $[0 .. 0)$:$\varepsilon$, $[1 .. 1)$:$\varepsilon$
a* aa $[0 .. 0)$:$\varepsilon$, $[1 .. 1)$:$\varepsilon$, $[2 .. 2)$:$\varepsilon$
b* aa $[0 .. 0)$:$\varepsilon$, $[1 .. 1)$:$\varepsilon$, $[2 .. 2)$:$\varepsilon$

Warning. Lazy * operators can be hard to predict.