Pattern matching #
Two of the main uses of regexes are:
- string validation, i.e. check whether a string satisfies certain constraints. For instance:
- has at least one digit and one special character,
- is a valid address,
- etc.
- string search, i.e. identify occurrences of a pattern in a line or in a file. This can be useful for:
- replacing strings in a file or a folder (e.g. within a codebase),
- splitting a sentence into (natural language) words,
- splitting a program into meaningful substrings, e.g. split the instruction
int a=5;
intoint
,a
,=
,5
and;
, - filtering (e.g. with the
grep
command) of modifying (e.g. with thesed
command) lines in a file, - etc.
In both scenarios, a regex is viewed as a pattern that should be matched against a string.
Segment #
To make the notion of pattern matching more precise, we will represent a word $w$ as an array of characters, and a segment of $w$ as a (possibly empty) subarray of this array.
Notation. If $w$ is a word, we use $w[i .. j)$ for the subsegment of $w$ that starts at index $i$ included, and ends at index $j$ excluded.
Example. Let $w$ =
abca
. Then:
- $w[0 .. 1)$ is a segment of $w$ with length 1 and word
a
,- $w[3 .. 4)$ is another segment of $w$ with length 1 and word
a
- $w[0 .. 3)$ is a segment of $w$ with length 3 and word
abc
- $w[0 .. 0)$ is a segment of $w$ with length 0 and word $\varepsilon$
- $w[1 .. 1)$ is another segment of $w$ with length 0 and word $\varepsilon$
Match #
We can now define what a match is:
Definition. Let $w$ be a word (viewed as an array), and let $e$ be a regex that describes the language $L$.
A match for $e$ in $w$ is a (possibly empty) segment of $w$ whose word belongs to $L$.
Example. Let $e$ be the regex
ab*
, and let $w$ be the wordaba
.There are 3 matches for $e$ in $w$:
- $w[0 .. 1)$ with word
a
,- $w[0 .. 2)$ with word
ab
,- $w[2 .. 3)$ with word
a
Warning. Several matches may carry the same word (e.g. the first and third matches in the example above).
Warning. Some matches may overlap (e.g. the first and second matches in the example above).
Consider the regex $e$ = a*
and the word $w$ = aa
.
How many matches are there for $e$ in $w$?
There are 6 matches for the regex a*
in the word aa
:
- $w[0 .. 0)$: $\varepsilon$
- $w[0 .. 1)$:
a
- $w[0 .. 2)$:
aa
- $w[1 .. 1)$: $\varepsilon$
- $w[1 .. 2)$:
a
- $w[2 .. 2)$: $\varepsilon$
Best first match #
In practice, (most) regex engines do not identify all matches for a regex in a word. Instead, they rely on the notion of best first match (explained below). Reasons include:
- performance,
- avoiding overlapping matches (e.g. when performing a global “search and replace”).
Let $e$ be a regex, let $w$ be a word, and let $m_1$ and $m_2$ be two matches for $e$ in $w$.
$m_1$ is preferred to $m_2$ if:
- $m_1$ starts before $m_2$: for instance $w[1 .. 4)$ is preferred to $w[3 .. 8)$, or
- they start at the same index and the regex engine favors $m_1$ over $m_2$.
Preference is a total order over the matches for for $e$ in $w$, meaning that if there is a match, then there can be at most one best first match.
Note. When $m_1$ and $m_2$ start at the same index (second case above), whether $m_1$ is favored over $m_2$ can vary in subtle ways from one regex engine to the other. These preferences are usually explained in algorithmic terms (and half-informally).
For a reasonably detailed tutorial about the behavior(s) of regex search engine, we refer to this website. For Java, the Oracle tutorial on regexes can also be a good entry point (although less precise that the previous reference).
In this section, we only introduce basic behaviors of regex engines.
Left-to-right #
Commutative operations are evaluated from left-to-right.
Example. Consider the word $w$ =
ab
.
- the best first match for the regex
ab|a
in $w$ is $w[0 .. 2) (with wordab
)- the best first match for the regex
a|ab
in $w$ is $w[0 .. 1) (with worda
)
Exception. In this example, A POSIX-compliant regex engine would produce $w[0 .. 2) (with word
ab
) as best first match for both expressions.
Warning. As illustrated with the above example, two regexes that describe the same language may have different best first matches (in the same word).
Greedyness #
The *
operator is evaluated (by default) in a greedy way.
This means that the engine
tries to matches as many characters as possible for this operator.
More precisely, when encountering a subexpression of the form $e$*
:
- the engine first considers the longest possible match for $e$
*
, - if this match does not result in a match for the whole expression, then the engine backtracks by one character (i.e. reduces the length of the match for $e$
*
by one), and tries again to find a match for the whole expression, - if this fails again, then the engine backtracks again by one character,
- etc.
Examples.
- The best first match for the regex
a*
inaa
is the whole word.- The best first match for the regex
a*
inaab
is $[0 ..2)$ (with wordaa
).- The best first match for the regex
a*
inaabaa
is $[0 ..2)$ (with wordaa
).- The best first match for the regex
(a|b)*b
inabab
is the whole word. In this case,
- the engine first tries to match
(a|b)*
against the whole word,- this does not result in a match for the whole regex, because the trailing
b
in the regex is not matched,- so the engine backtracks, and matches
(a|b)*
against $[0 .. 3)$ (with wordaba
),- this results in a match for the whole expression.
Warning. The interaction of left-to-right evaluation and greedy matching can be hard to predict. This is why we highly recommend writing and debugging regexes with a validation engine, such as regex101.
Warning. Nested
*
operators can be costly, due to a combinatorial explosion of the number of attempts to find a match. This is sometimes referred to as catastrophic backtracking.
Successive best first matches #
Most regex engines can return best first matches in an iterative way. Intuitively, the engine “consumes” the best first match. Then the next match is the best first match in the remaining string, etc.
More precisely:
Definition. Let $e$ be regex and let $w$ be a word with lenght $n$.
Then:
- the first match $w[i_1 .. j_1)$ for $e$ in $w$ is the best first match (if any) for $e$ in $w$,
- the second match $w[i_2 .. j_2)$ for $e$ in $w$ is the best first match (if any) for $e$ in $w[j_1 .. n)$,
- the third match $w[i_3 .. j_3)$ for $e$ in $w$ is the best first match (if any) for $e$ in $w[j_2 .. n)$,
- etc.
Example. Let $e$ be the regex
ab*
, and let $w$ be the wordabac
.
The best first match for $e$ in $w$ is $w[0 ..2)$, with word
ab
.The second match is the best first match in the rest of $w$, i.e. in $w[2 .. 4)$. This match is $w[2 .. 3)$, with word
a
.There is no third match.
In each of the cases below, find all successive best first matches for the regex $e$ in word $w$:
regex $e$ | word $w$ |
---|---|
a |
aba |
a*b |
aba |
a* |
a |
a* |
aba |
(ab)* |
aba |
(ab)* |
abab |
regex $e$ | word $w$ | matches |
---|---|---|
a |
aba |
$[0 .. 1)$:a , $[2 .. 3)$:a |
a*b |
aba |
$[0 .. 2)$:ab |
a* |
a |
$[0 .. 1)$:a , $[1 .. 1)$: $\varepsilon$ |
a* |
aba |
$[0 .. 1)$:a , $[1 .. 1)$: $\varepsilon$, $[2 .. 3)$:a , $[3 .. 3)$: $\varepsilon$ |
(ab)* |
aba |
$[0 .. 1)$:ab , $[2 .. 2)$: $\varepsilon$, $[3 .. 3)$: $\varepsilon$ |
(ab)* |
abab |
$[0 .. 3)$:abab , $[3 .. 3)$: $\varepsilon$ |
Lazy closure #
Some regex engines (notably the ones used by default in Java, Javascript, Python, etc.) support alternative ways to evaluate *
operators.
One of them is lazy evaluation (also called reluctant). In this case, the star operator matches the shortest possible segment.
Examples.
regex $e$ word $w$ best first match with lazy evaluation of star operators a*
aa
$[0 .. 0)$:$\varepsilon$ a*ac
ac
$[0 .. 2)$: ac
a*ac
aac
$[0 .. 3)$: aac
Note. In a same regular expression, it is possible to specify that some
*
operators must be evaluated in a lazy way, and others in a greedy way.
In each of the cases below, find all successive best first matches for the regex $e$ in word $w$, where *
operators are evaluated in a lazy way.
regex $e$ | word $w$ |
---|---|
ab* |
ab |
a*b |
aba |
a*ab* |
ab |
a* |
a |
a* |
aa |
b* |
aa |
regex $e$ | word $w$ | lazy matches |
---|---|---|
ab* |
ab |
$[0 .. 1)$:a |
a*b |
aba |
$[0 .. 2)$:ab |
a*ab* |
ab |
$[0 .. 1)$:a |
a* |
a |
$[0 .. 0)$:$\varepsilon$, $[1 .. 1)$:$\varepsilon$ |
a* |
aa |
$[0 .. 0)$:$\varepsilon$, $[1 .. 1)$:$\varepsilon$, $[2 .. 2)$:$\varepsilon$ |
b* |
aa |
$[0 .. 0)$:$\varepsilon$, $[1 .. 1)$:$\varepsilon$, $[2 .. 2)$:$\varepsilon$ |
Warning. Lazy
*
operators can be hard to predict.