Regular languages #
Definitions #
Alphabet #
Definition.
An alphabet is a finite set of characters.
Examples.
{
a} is a alphabet,{
a,b} is another alphabet,{
a,b,3,5,\n} is yet another alphabet.
Warning. A blank space is is a character.
The letter $\Sigma$ is often used to refer to an alphabet.
Word #
Definition. A word over an alphabet $\Sigma$ is a (possibly empty) finite sequence of character in $\Sigma$.
The empty word (a.k.a. “empty string”) is traditionally denoted with $\varepsilon$.
Examples. Let $\Sigma$ be the alphabet {
a,b,3,5,\n}. Then possible words over $\Sigma$ are:
- $\varepsilon$
aaaba\n\n\nba353\n\nb
Language #
Definition. A formal language over an alphabet $\Sigma$ is a (possibly infinite) set of words over $\Sigma$.
Examples. Let $\Sigma$ be the alphabet {
a,b}.Then possible languages over $\Sigma$ are:
- { }
- {
a}- {
a,aa,aaa,b,aba}- { $\varepsilon$,
a,ba}- the set of all words that contain only the character
a- the set of all words over $\Sigma$
- the set of all words over $\Sigma$ that do not contain 2 consecutive
b’s- the set of all words over $\Sigma$ that contain as many
a’s asb’s
Operations on languages #
Element-wise concatenation #
If $L_1$ and $L_2$ are languages, then we use $L_1 \circ L_2$ for the language that consists of the element-wise concatenation of all words in $L_1$ and $L_2$ (in that order).
Examples.
If $L_1$ = {
ab,c} and $L_2$ = {d,ef}, then $L_1 \circ L_2$ = {abd,abef,cd,cef}If $L_3$ = {
ab,b} and $L_4$ = { $\varepsilon$,ac}, then $L_3 \circ L_4$ = {ab,abac,b,bac}
Closure #
If $L$ is a language, then the closure (also called Kleene closure) $L^*$ of $L$ is the language
$L^*$ = { $\varepsilon$ } $\cup\ L \cup\ (L \circ L) \cup\ (L \circ L \circ L) \cup (L \circ L \circ L \circ L) \cup …$
Equivalently, $L^*$ can be defined as:
- the set that consists of $L$, the empty word $\varepsilon$, and all possible concatenations of words in $L$.
- the smallest superset of $L\ \cup$ { $\varepsilon$ } that is closed under word concatenation (meaning that if $L^*$ contains two words $w_1$ and $w_2$, then it also contains the word $w_1 w_2$).
Regular language #
Definition. A language is regular if it is:
- the empty set, or
- the language { $\varepsilon$ }, or
- the language {
a} for some charactera, or- the union of two regular languages, or
- the element-wise concatenation of two regular languages, or
- the closure of a regular language.
Examples
- $L_1$ = {
a} is regular- $L_2$ = {
b} is regular- $L_3$ = $L_1 \cup L_2$ = {
a,b} is regular- $L_4 = L_1 \circ L_3$ = {
aa,ab} is regular- $(L_4)^*$ = { $\varepsilon$,
aa,ab,aaaa,aaab,abaa,abab,aaaaaa,aaaaab, … } is regular
Observation. If $\Sigma$ is an alphabet, then the language $\Sigma^*$ of all possible words over $\Sigma$ is regular.
Counter-examples.
If $\Sigma$ = {
a,b}, then the set of all words over $\Sigma$ that contain as manya’s asb’s is not regular.If $\Sigma$ = {
0,1,2,3,4,5,6,7,8,9} $\cup$ {+,-,(,)} , then the set of well-formed algebraic expressions over $\Sigma$ (in particular, where each opening parenthesis is closed) is not regular.More generally, most programming languages are not regular.
Properties #
- The intersection of two regular languages is a regular language.
- The difference between two regular languages is a regular language.
Regular expressions #
A language is regular iff it can be described by a regular expression.
Regular expressions are covered in the dedicated chapter.
Finite automata #
Regular languages have a close connection to so-called finite automata.
A finite automaton can be viewed as a simple program that recognizes a regular language $L$, meaning that the automaton takes as input a word $w$ and returns true iff $w$ belongs to $L$.
Finite automata are one of the simplest (theoretical) models of computation. They (approximately) correspond to computers that cannot use memory.
Automata theory is beyond the scope of this course.