Regular expressions #
The theory of regular expressions (a.k.a. regexes) was developed in the 1950’s (notably by Stephen Cole Kleene), and regexes have been used since the 1960’s for variety of computational tasks. Common applications include:
- input string validation (e.g. checking whether a date is valid, or whether a password contains characters of certain types),
- searching and replacing strings (e.g. within an IDE),
- lexical analysis in a compiler.
Definition #
A regular expression is a (finite) expression that describes a (possibly infinite) regular language.
Syntax. A regular expression is either:
- $\emptyset$, or
- $\varepsilon$, or
- a single character (e.g.
a), or- an expression of the form:
- $e_1 | e_2$, or
- $e_1e_2$, or
- $e*$
where $e, e_1, e_2$ are regular expressions.
Examples.
bis a regular expression
b|cis another regular expression
((b\n)|c)*is yet another regular expression
Precedence. By convention:
- $*$ has precedence over concatenation,
- concatenation has precedence over $|$
Precedence allows omitting some parentheses. For instance:
a|bcis equivalent toa|(bc)ab*|c*is equivalent to(a(b*))|(c*)
Semantics. If $e$ is a regular expression, we use $\llbracket{e}\rrbracket$ to denote the regular language described by $e$.
Let
abe a character, and let $e, e_1, e_2$ be regular expressions. Then:
- $\emptyset$ describes the empty language:
- $\llbracket{\emptyset}\rrbracket = $ { }
- $\varepsilon$ describes the language that contains only the empty word:
- $\llbracket \varepsilon \rrbracket$ = { $\varepsilon$ }
adescribes the language that contains only the worda:
- $\llbracket$
a$\rrbracket$ = {a}- $e_1|e_2$ describes the union of the languages described by $e_1$ and $e_2$:
- $\llbracket{ e_1 | e_2 }\rrbracket = \llbracket{e_1}\rrbracket \cup \llbracket{e_2}\rrbracket$
- $e_1 e_2$ describes the element-wise concatenation of the languages described by $e_1$ and $e_2$:
- $\llbracket{e_1e_2}\rrbracket = \llbracket{e_1}\rrbracket \circ \llbracket{e_2}\rrbracket$
- $e*$ describes the closure of the language described by $e$:
- $\llbracket e * \rrbracket$ = $\llbracket e \rrbracket^*$
Illustrations #
Example.
The expression
adescribes the language {a}The expression
bdescribes the language {b}The expression
abdescribes the language {a} $\circ$ {b} = {ab}The expression
a|bdescribes the language {a} $\cup$ {b} = {a,b}The expression
(ab)*describes the language { $\varepsilon$,ab,abab,ababab, … }The expression
a(b|c|d)*describes the language of all words over the alphabet {a,b,c,d} that start with anaand contain no otheraThe expression
(b|c)(a|b|c)*describes the language of all nonempty words over the alphabet {a,b,c} whereacannot appear as the first character.
What is the language described by each of the following regexes?
ab|aa*aa*ba*a(a|b)*a(a|b)*|(a|b)*c(a|b)*(a|b)*(b*a*)*(b*a)*
ab|adescribes {a,ab}a*adescribes the set of all nonempty words that contain only the characteraa*ba*describes the set of words over the alphabet {a,b} that contain exactly oneba(a|b)*adescribes the set of words over the alphabet {a,b} that start with anaand end with a differenta(a|b)*|(a|b)*c(a|b)*describes the set of words over the alphabet {a,b,c} that contain at most onec(a|b)*describes the closure of {a,b}(b*a*)*also describes the closure of {a,b}(b*a)*describes the set of all words over the alphabet {a,b} that do not end with ab
As we saw in the exercise above, two regular expressions can describe the same language. Can you find more examples?
For each of the following regular languages, write a regex that describes this language.
- The language of all words over the alphabet {
a,b,c} whose characters (if any) are in alphabetical order (from left to right) - The language of all words over the alphabet {
a,b,c} such that everycis immediately followed by ana - The language of all words over the alphabet {
a,b} that start with ana, end with ab, and do not contain two identical consecutive characters - The language of all words over the alphabet {
a,b} that contain at least two consecutive b’s - The language of all words over the alphabet {
a,b} that do not contain two consecutive b’s - The language of all words over the alphabet {
a,b} that do not contain three consecutive b’s
a*b*c*(a|b|ca)*a(ba)*b(a|b)*bb(a|b)*(a|ba)*(a*|b)(a|ba|bba)*(a*|b|bb)