Regular expressions #
The theory of regular expressions (a.k.a. regexes) was developed in the 1950’s (notably by Stephen Cole Kleene), and regexes have been used since the 1960’s for variety of computational tasks. Common applications include:
- input string validation (e.g. checking whether a date is valid, or whether a password contains characters of certain types),
- searching and replacing strings (e.g. within an IDE),
- lexical analysis in a compiler.
Definition #
A regular expression is a (finite) expression that describes a (possibly infinite) regular language.
Syntax. A regular expression is either:
- $\emptyset$, or
- $\varepsilon$, or
- a single character (e.g.
a
), or- an expression of the form:
- $e_1 | e_2$, or
- $e_1e_2$, or
- $e*$
where $e, e_1, e_2$ are regular expressions.
Examples.
b
is a regular expression
b|c
is another regular expression
((b\n)|c)*
is yet another regular expression
Precedence. By convention:
- $*$ has precedence over concatenation,
- concatenation has precedence over $|$
Precedence allows omitting some parentheses. For instance:
a|bc
is equivalent toa|(bc)
ab*|c*
is equivalent to(a(b*))|(c*)
Semantics. If $e$ is a regular expression, we use $\llbracket{e}\rrbracket$ to denote the regular language described by $e$.
Let
a
be a character, and let $e, e_1, e_2$ be regular expressions. Then:
- $\emptyset$ describes the empty language:
- $\llbracket{\emptyset}\rrbracket = $ { }
- $\varepsilon$ describes the language that contains only the empty word:
- $\llbracket \varepsilon \rrbracket$ = { $\varepsilon$ }
a
describes the language that contains only the worda
:
- $\llbracket$
a
$\rrbracket$ = {a
}- $e_1|e_2$ describes the union of the languages described by $e_1$ and $e_2$:
- $\llbracket{ e_1 | e_2 }\rrbracket = \llbracket{e_1}\rrbracket \cup \llbracket{e_2}\rrbracket$
- $e_1 e_2$ describes the element-wise concatenation of the languages described by $e_1$ and $e_2$:
- $\llbracket{e_1e_2}\rrbracket = \llbracket{e_1}\rrbracket \circ \llbracket{e_2}\rrbracket$
- $e*$ describes the closure of the language described by $e$:
- $\llbracket e * \rrbracket$ = $\llbracket e \rrbracket^*$
Illustrations #
Example.
The expression
a
describes the language {a
}The expression
b
describes the language {b
}The expression
ab
describes the language {a
} $\circ$ {b
} = {ab
}The expression
a|b
describes the language {a
} $\cup$ {b
} = {a
,b
}The expression
(ab)*
describes the language { $\varepsilon$,ab
,abab
,ababab
, … }The expression
a(b|c|d)*
describes the language of all words over the alphabet {a
,b
,c
,d
} that start with ana
and contain no othera
The expression
(b|c)(a|b|c)*
describes the language of all nonempty words over the alphabet {a
,b
,c
} wherea
cannot appear as the first character.
What is the language described by each of the following regexes?
ab|a
a*a
a*ba*
a(a|b)*a
(a|b)*|(a|b)*c(a|b)*
(a|b)*
(b*a*)*
(b*a)*
ab|a
describes {a
,ab
}a*a
describes the set of all nonempty words that contain only the charactera
a*ba*
describes the set of words over the alphabet {a
,b
} that contain exactly oneb
a(a|b)*a
describes the set of words over the alphabet {a
,b
} that start with ana
and end with a differenta
(a|b)*|(a|b)*c(a|b)*
describes the set of words over the alphabet {a
,b
,c
} that contain at most onec
(a|b)*
describes the closure of {a
,b
}(b*a*)*
also describes the closure of {a
,b
}(b*a)*
describes the set of all words over the alphabet {a
,b
} that do not end with ab
As we saw in the exercise above, two regular expressions can describe the same language. Can you find more examples?
For each of the following regular languages, write a regex that describes this language.
- The language of all words over the alphabet {
a
,b
,c
} whose characters (if any) are in alphabetical order (from left to right) - The language of all words over the alphabet {
a
,b
,c
} such that everyc
is immediately followed by ana
- The language of all words over the alphabet {
a
,b
} that start with ana
, end with ab
, and do not contain two identical consecutive characters - The language of all words over the alphabet {
a
,b
} that contain at least two consecutive b’s - The language of all words over the alphabet {
a
,b
} that do not contain two consecutive b’s - The language of all words over the alphabet {
a
,b
} that do not contain three consecutive b’s
a*b*c*
(a|b|ca)*
a(ba)*b
(a|b)*bb(a|b)*
(a|ba)*(a*|b)
(a|ba|bba)*(a*|b|bb)