Regular languages #

Definitions #

Alphabet #

Definition.

An alphabet is a finite set of characters.

Examples.

{a} is a alphabet,

{a,b} is another alphabet,

{a, b, 3, 5,\n} is yet another alphabet.

Warning. A blank space is is a character.

The letter $\Sigma$ is often used to refer to an alphabet.

Word #

Definition. A word over an alphabet $\Sigma$ is a (possibly empty) finite sequence of character in $\Sigma$.

The empty word (a.k.a. “empty string”) is traditionally denoted with $\varepsilon$.

Examples. Let $\Sigma$ be the alphabet {a, b, 3, 5, \n}. Then possible words over $\Sigma$ are:

$\varepsilon$

a

aa

ba\n

\n\nba353\n\nb

Language #

Definition. A formal language over an alphabet $\Sigma$ is a (possibly infinite) set of words over $\Sigma$.

Examples. Let $\Sigma$ be the alphabet { a, b }.

Then possible languages over $\Sigma$ are:

{ }

{ a }

{ a, aa, aaa, b, aba }

{ $\varepsilon$, a, ba }

the set of all words that contain only the character a

the set of all words over $\Sigma$

the set of all words over $\Sigma$ that do not contain 2 consecutive b’s

the set of all words over $\Sigma$ that contain as many a’s as b’s

Operations on languages #

Element-wise concatenation #

If $L_1$ and $L_2$ are languages, then we use $L_1 \circ L_2$ for the language that consists of the element-wise concatenation of all words in $L_1$ and $L_2$ (in that order).

Examples.

If $L_1$ = {ab, c} and $L_2$ = {d, ef}, then $L_1 \circ L_2$ = { abd, abef, cd, cef}

If $L_3$ = {ab, b} and $L_4$ = { $\varepsilon$, ac }, then $L_3 \circ L_4$ = { ab, abac, b, bac }

Closure #

If $L$ is a language, then the closure (also called Kleene closure) $L^*$ of $L$ is the language

$L^*$ = { $\varepsilon$ } $\cup\ L \cup\ (L \circ L) \cup\ (L \circ L \circ L) \cup (L \circ L \circ L \circ L) \cup …$

Equivalently, $L^*$ can be defined as:

the set that consists of $L$, the empty word $\varepsilon$, and all possible concatenations of words in $L$.
the smallest superset of $L\ \cup$ { $\varepsilon$ } that is closed under word concatenation (meaning that if $L^*$ contains two words $w_1$ and $w_2$, then it also contains the word $w_1 w_2$).

Regular language #

Definition. A language is regular if it is:

the empty set, or

the language { $\varepsilon$ }, or

the language { a } for some character a, or

the union of two regular languages, or

the element-wise concatenation of two regular languages, or

the closure of a regular language.

Examples

$L_1$ = { a } is regular

$L_2$ = { b } is regular

$L_3$ = $L_1 \cup L_2$ = { a, b } is regular

$L_4 = L_1 \circ L_3$ = { aa, ab } is regular

$(L_4)^*$ = { $\varepsilon$, aa, ab, aaaa, aaab, abaa, abab, aaaaaa, aaaaab, … } is regular

Observation. If $\Sigma$ is an alphabet, then the language $\Sigma^*$ of all possible words over $\Sigma$ is regular.

Counter-examples.

If $\Sigma$ = { a,b }, then the set of all words over $\Sigma$ that contain as many a’s as b’s is not regular.

If $\Sigma$ = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} $\cup$ { +, -, ( , )} , then the set of well-formed algebraic expressions over $\Sigma$ (in particular, where each opening parenthesis is closed) is not regular.

More generally, most programming languages are not regular.

Properties #

The intersection of two regular languages is a regular language.

The difference between two regular languages is a regular language.

Regular expressions #

A language is regular iff it can be described by a regular expression.

Regular expressions are covered in the dedicated chapter.

Finite automata #

Regular languages have a close connection to so-called finite automata.

A finite automaton can be viewed as a simple program that recognizes a regular language $L$, meaning that the automaton takes as input a word $w$ and returns true iff $w$ belongs to $L$.

Finite automata are one of the simplest (theoretical) models of computation. They (approximately) correspond to computers that cannot use memory.

Automata theory is beyond the scope of this course.