In the last two Tech Tuesdays, I introduced Finite State Machines (FSMs) and showed examples of both their usefulness and their limitations. Just as a reminder, we study computer science theory to get a handle on two fundamental questions: what can be computed at all (computability)? And how hard is it to compute something that can be computed (complexity)? The upshot on FSMs is that they cannot compute a whole lot, but what they can compute they compute very efficiently.
Let me flesh out both parts of that assertion a bit more. Last week I introduced the idea of a formal language. Essentially such a language consists of an alphabet and a set of rules describing which strings (drawn from letters of the alphabet) belong to the language and which don’t. Our two simple examples both worked with the alphabet {“a”, “b”}. We saw that an FSM can easily be constructed for the language “all strings that contain exactly one b” but that there is no FSM which can be used to determine whether or not a string belongs to the language of palindromes (i.e. strings that read the same backwards and forwards).
This raises two interesting questions: is there a way to categorize different types of languages? And are there machines other than FSMs that can be used to deal with languages such as palindromes? The answers to these questions involve a classification that ties computer science to the world of linguistics and in particular to Noam Chomsky’s work on generative grammars. A generative grammar is a set of rules that “generates” the strings that belong to the language from the alphabet (or sentences from words).
Let’s take a look at our example of strings that contain exactly one “b”. Here is a generative grammar that produces all the strings in this language:
S → aS
S → bA
A → aA
A → ε
Each rule describes a substitution: the left-hand side contains the pattern to match and the right-hand side says what to substitute for that pattern. Capital letters stand for so-called non-terminal symbols, which must be replaced, whereas lower case letters are terminals (or literals) from the alphabet (in our case either “a” or “b”).
To generate a string in the language you start with the non-terminal symbol S and then apply the rules until you have no more non-terminals left. At that point you have a string that only contains letters from the alphabet {“a”, “b”} – again, these appear in the rules in lower case with ε denoting the empty string (no letter).
Here are some possible substitution sequences and the strings that they generate:
S, aS, aaS, aabA, aabaA, aaba
S, bA, b
S, bA, baA, baaA, baaaA, baaa
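If you want to play with this yourself, here is a minimal Python sketch of the derivation procedure (the rule encoding and the derive helper are my own illustration, not something from the original post). It picks rules at random and stops once only terminals remain:

```python
import random

# The grammar from above: keys are non-terminals, values are the
# possible right-hand sides. "" stands for the empty string ε.
RULES = {
    "S": ["aS", "bA"],
    "A": ["aA", ""],
}

def derive(start="S", max_steps=50):
    """Apply rules at random until no non-terminals remain."""
    string = start
    for _ in range(max_steps):
        # Find the leftmost non-terminal, if there is one.
        for i, symbol in enumerate(string):
            if symbol in RULES:
                string = string[:i] + random.choice(RULES[symbol]) + string[i + 1:]
                break
        else:
            return string  # only terminals left: a string in the language
    return None  # derivation did not finish within max_steps

for _ in range(5):
    print(derive())  # e.g. "aab", "b", "baaa" -- each has exactly one b
```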
As long as you apply the rules until you have only terminal symbols, you will generate a string in the “has exactly one b” language. This means we now have two ways of describing this language: a generative grammar for producing strings and, from last week’s post, a finite state machine for consuming (or “deciding”) them.
Now if you look closely at the rules for this language you will notice a clear pattern. The left-hand side of each rule is a single non-terminal symbol, and the right-hand side contains at most one non-terminal symbol, which appears only at the very end, to the right of any terminals. That is not by accident: such a grammar is known as a right regular grammar, and it gives us a regular language, which is exactly the kind of language an FSM can be constructed for.
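For completeness, here is what the consuming side can look like in code: a small Python sketch of a finite state machine for the same language (the state names here are illustrative, not taken from last week’s diagram). It tracks whether we have seen zero, one, or more than one b:

```python
# Transition table: (current state, input letter) -> next state.
# "zero" = no b seen yet, "one" = exactly one b seen,
# "dead" = more than one b seen (no way to recover).
TRANSITIONS = {
    ("zero", "a"): "zero",
    ("zero", "b"): "one",
    ("one",  "a"): "one",
    ("one",  "b"): "dead",
    ("dead", "a"): "dead",
    ("dead", "b"): "dead",
}

def accepts(string):
    """Return True if string (over {a, b}) contains exactly one b."""
    state = "zero"
    for letter in string:
        state = TRANSITIONS[(state, letter)]
    return state == "one"  # "one" is the only accepting state

print(accepts("aabaa"))  # True
print(accepts("abab"))   # False
print(accepts("aaaa"))   # False
```

Note that the machine needs only a single variable holding one of three states, no matter how long the input is. That constant memory is what makes FSMs so efficient, and also what limits them.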
But what if we allow more general rules? Then it turns out we need more powerful machines, and more importantly, there is a precise correspondence between constraints on rules and types of machines, captured in the following table (based on the Wikipedia entry on the Chomsky Hierarchy; in the Grammar column A, B are non-terminal symbols, a is any terminal, and α, β and γ are strings of terminals and non-terminals combined):

| Type | Language class | Grammar | Automaton |
| --- | --- | --- | --- |
| Type-0 | Recursively enumerable | α → β (unrestricted) | Turing machine |
| Type-1 | Context-sensitive | αAβ → αγβ | Linear bounded automaton |
| Type-2 | Context-free | A → γ | Pushdown automaton |
| Type-3 | Regular | A → a or A → aB | Finite state machine |
These language classes form a strict inclusion hierarchy: every regular language is also context-free, every context-free language is context-sensitive, and every context-sensitive language is recursively enumerable, but each inclusion is strict. For instance, there are context-free languages that are not regular. For each class of languages we now have a nice duality: a set of constraints on production rules for the generative grammar and a type of automaton (abstract computer) that can recognize the language.
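To make that strictness concrete, consider the classic example of a context-free language that is not regular: all strings of the form aⁿbⁿ, i.e. some number of a’s followed by the same number of b’s (this example and the function name are mine, not from the post). An FSM cannot recognize it, because counting an unbounded number of a’s exceeds any fixed number of states. A single stack, the defining feature of a pushdown automaton, is enough, as this minimal sketch suggests:

```python
def accepts_anbn(string):
    """Check membership in {a^n b^n : n >= 0} using a stack."""
    stack = []
    seen_b = False
    for letter in string:
        if letter == "a":
            if seen_b:
                return False  # an a after a b can never match
            stack.append("a")  # remember each a on the stack
        elif letter == "b":
            seen_b = True
            if not stack:
                return False  # more b's than a's
            stack.pop()  # each b consumes one remembered a
        else:
            return False  # letter outside the alphabet
    return not stack  # accept only if every a was matched by a b

print(accepts_anbn("aabb"))  # True
print(accepts_anbn("aab"))   # False
print(accepts_anbn(""))      # True
```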
In upcoming posts we will get to know the most powerful of these automata, the so-called Turing machine. At some later point we may come back and revisit some of the intermediate machines, such as the pushdown automaton.
Now for an interesting exercise for the reader (who has made it this far): what does the generative grammar for the palindrome language over {“a”, “b”} look like? Which class does that place the palindrome language in? And based on that, what kind of automaton can be used to recognize palindromes?