Regular expressions #

The theory of regular expressions (a.k.a. regexes) was developed in the 1950’s (notably by Stephen Cole Kleene), and regexes have been used since the 1960’s for variety of computational tasks. Common applications include:

input string validation (e.g. checking whether a date is valid, or whether a password contains characters of certain types),
searching and replacing strings (e.g. within an IDE),
lexical analysis in a compiler.

Definition #

A regular expression is a (finite) expression that describes a (possibly infinite) regular language.

Syntax. A regular expression is either:

$\emptyset$, or

$\varepsilon$, or

a single character (e.g. a), or

an expression of the form:

$e_1 | e_2$, or

$e_1e_2$, or

$e*$

where $e, e_1, e_2$ are regular expressions.

Examples.

b is a regular expression

b|c is another regular expression

((b\n)|c)* is yet another regular expression

Precedence. By convention:

$*$ has precedence over concatenation,

concatenation has precedence over $|$

Precedence allows omitting some parentheses. For instance:

a|bc is equivalent to a|(bc)
ab*|c* is equivalent to (a(b*))|(c*)

Semantics. If $e$ is a regular expression, we use $\llbracket{e}\rrbracket$ to denote the regular language described by $e$.

Let a be a character, and let $e, e_1, e_2$ be regular expressions. Then:

$\emptyset$ describes the empty language:

$\llbracket{\emptyset}\rrbracket = $ { }

$\varepsilon$ describes the language that contains only the empty word:

$\llbracket \varepsilon \rrbracket$ = { $\varepsilon$ }

a describes the language that contains only the word a:

$\llbracket$a$\rrbracket$ = { a }

$e_1|e_2$ describes the union of the languages described by $e_1$ and $e_2$:

$\llbracket{ e_1 | e_2 }\rrbracket = \llbracket{e_1}\rrbracket \cup \llbracket{e_2}\rrbracket$

$e_1 e_2$ describes the element-wise concatenation of the languages described by $e_1$ and $e_2$:

$\llbracket{e_1e_2}\rrbracket = \llbracket{e_1}\rrbracket \circ \llbracket{e_2}\rrbracket$

$e*$ describes the closure of the language described by $e$:

$\llbracket e * \rrbracket$ = $\llbracket e \rrbracket^*$

Illustrations #

Example.

The expression a describes the language { a }

The expression b describes the language { b }

The expression ab describes the language { a } $\circ$ { b } = { ab }

The expression a|b describes the language { a } $\cup$ { b } = {a,b}

The expression (ab)* describes the language { $\varepsilon$, ab, abab, ababab, … }

The expression a(b|c|d)* describes the language of all words over the alphabet { a, b, c, d} that start with an a and contain no other a

The expression (b|c)(a|b|c)* describes the language of all nonempty words over the alphabet { a, b, c} where a cannot appear as the first character.

Exercise

What is the language described by each of the following regexes?

ab|a
a*a
a*ba*
a(a|b)*a
(a|b)*|(a|b)*c(a|b)*
(a|b)*
(b*a*)*
(b*a)*

Solutions

ab|a describes { a, ab }
a*a describes the set of all nonempty words that contain only the character a
a*ba* describes the set of words over the alphabet {a,b} that contain exactly one b
a(a|b)*a describes the set of words over the alphabet {a,b} that start with an a and end with a different a
(a|b)*|(a|b)*c(a|b)* describes the set of words over the alphabet {a,b,c} that contain at most one c
(a|b)* describes the closure of {a,b}
(b*a*)* also describes the closure of {a,b}
(b*a)* describes the set of all words over the alphabet {a,b} that do not end with a b

Exercise

As we saw in the exercise above, two regular expressions can describe the same language. Can you find more examples?

Exercise

For each of the following regular languages, write a regex that describes this language.

The language of all words over the alphabet {a,b,c} whose characters (if any) are in alphabetical order (from left to right)
The language of all words over the alphabet {a,b,c} such that every c is immediately followed by an a
The language of all words over the alphabet {a,b} that start with an a, end with a b, and do not contain two identical consecutive characters
The language of all words over the alphabet {a,b} that contain at least two consecutive b’s
The language of all words over the alphabet {a,b} that do not contain two consecutive b’s
The language of all words over the alphabet {a,b} that do not contain three consecutive b’s

Possible Solutions

a*b*c*
(a|b|ca)*
a(ba)*b
(a|b)*bb(a|b)*
(a|ba)*(a*|b)
(a|ba|bba)*(a*|b|bb)