Regular Language

A regular language belongs to the simplest class in the Chomsky hierarchy: Type 3, at the bottom of the hierarchy. A language is regular exactly when some finite automaton recognizes it, that is, a machine with a fixed, finite number of states and no auxiliary memory such as a stack or tape. Regular expressions describe exactly the same class of languages (Kleene's theorem), so any regular expression can be converted into an equivalent finite automaton and vice versa.
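To make the equivalence concrete, here is a minimal sketch in Python. The example language (strings over {a, b} containing an even number of a's), the state names, and the helper functions `dfa_accepts` and `regex_accepts` are all chosen for illustration and are not taken from any source. The same language is recognized once by an explicit two-state automaton and once by a regular expression.

```python
import re

# Hypothetical example language: strings over {a, b} with an even number of 'a's.
# The automaton needs only two states, because all it must remember is the
# parity of the 'a's seen so far.
TRANSITIONS = {
    ("even", "a"): "odd",
    ("even", "b"): "even",
    ("odd", "a"): "even",
    ("odd", "b"): "odd",
}
START = "even"
ACCEPTING = {"even"}


def dfa_accepts(text: str) -> bool:
    """Run the deterministic finite automaton over the input, one symbol at a time."""
    state = START
    for ch in text:
        key = (state, ch)
        if key not in TRANSITIONS:
            return False  # symbol outside the alphabet {a, b}
        state = TRANSITIONS[key]
    return state in ACCEPTING


# A regular expression for the same language: 'a's always arrive in pairs,
# with any number of 'b's around them.
EVEN_A = re.compile(r"(b*ab*a)*b*")


def regex_accepts(text: str) -> bool:
    """Accept only if the whole string matches the pattern."""
    return EVEN_A.fullmatch(text) is not None
```

Both functions should agree on every input: for instance, each accepts "abba" and the empty string, and each rejects "ab".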

Because a finite automaton has only finitely many states, it cannot keep an unbounded count of how many times something has occurred. Regular languages are therefore well suited to patterns that can be checked with a fixed amount of memory, however long the input, such as identifiers, numbers, and keywords. In a compiler, this is the lexing (tokenizing) stage: the raw character stream is split into tokens using rules that a finite automaton can match in a single left-to-right pass.
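The following is a minimal sketch of such a lexing stage, assuming a made-up token set: the keyword list, the token names, and the `tokenize` function are hypothetical, not drawn from any particular compiler. Note also that Python's `re` module is a backtracking matcher rather than a compiled finite automaton; the patterns used here are nevertheless all regular in the formal sense.

```python
import re

KEYWORDS = {"if", "else", "while"}  # hypothetical keyword set

# Each token class is described by a regular expression.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("IDENT", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("OP", r"[+\-*/=()]"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source: str) -> list[tuple[str, str]]:
    """Split the character stream into (kind, text) tokens, left to right."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise ValueError(f"unexpected character {source[pos]!r} at offset {pos}")
        kind = m.lastgroup  # name of the alternative that matched
        text = m.group()
        pos = m.end()
        if kind == "SKIP":
            continue  # whitespace separates tokens but is not itself a token
        if kind == "IDENT" and text in KEYWORDS:
            kind = "KEYWORD"  # keywords are identifiers drawn from a fixed list
        tokens.append((kind, text))
    return tokens
```

With these assumed rules, `tokenize("while x1 = 42")` returns `[("KEYWORD", "while"), ("IDENT", "x1"), ("OP", "="), ("NUMBER", "42")]`.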

The limitation is equally sharp. Regular languages cannot describe arbitrarily deep nesting, such as balanced parentheses, because matching every closing bracket to an opening one requires tracking an unbounded number of open brackets, and a finite set of states can distinguish only a bounded number of nesting depths. The ALGOL 60 report shows why this matters in practice: it notes that the definition of expressions “is necessarily recursive.” Recursive, arbitrarily nested structure of this kind is precisely what regular languages cannot capture.
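A small Python sketch can illustrate the gap. Both function names and the `max_depth` parameter are hypothetical. The first recognizer uses an integer counter that can grow without limit, which is memory a finite automaton does not have. The second caps the counter, so it has only finitely many states and behaves like a finite automaton, but it wrongly rejects any input nested more deeply than its cap.

```python
def balanced(text: str) -> bool:
    """Recognize balanced parentheses using an unbounded counter."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # a ')' with no matching '('
        else:
            return False  # only parentheses are in the alphabet
    return depth == 0


def balanced_bounded(text: str, max_depth: int) -> bool:
    """Finite-state approximation: the states are the depths 0..max_depth.

    Any input that would need a deeper state is rejected, even when it is
    in fact balanced.
    """
    state = 0
    for ch in text:
        if ch == "(" and state < max_depth:
            state += 1
        elif ch == ")" and state > 0:
            state -= 1
        else:
            return False  # no transition defined: reject
    return state == 0
```

For example, `balanced("((()))")` is true, while `balanced_bounded("((()))", 2)` is false because the third opening bracket has no state to move to. Raising the cap only postpones the failure; no fixed cap covers every depth.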

For this reason, regular languages handle only the surface tokens of a programming language, while the nested structure built from those tokens is described by a context-free grammar. The two layers together, regular lexing followed by context-free parsing, form the standard front end of a compiler.
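As a rough sketch of the two layers working together, the Python code below pairs a regex-based lexer with a recursive-descent parser for a toy arithmetic grammar. The grammar, the tuple-based tree shape, and the names `lex` and `parse` are all assumptions made for illustration; real compiler front ends are considerably more elaborate.

```python
import re

# Layer 1 (regular): tokens are integers, operators, and parentheses.
TOKEN = re.compile(r"\s*(\d+|[-+*/()])")


def lex(source: str) -> list[str]:
    """Split the source text into a flat list of token strings."""
    tokens, pos = [], 0
    source = source.rstrip()
    while pos < len(source):
        m = TOKEN.match(source, pos)
        if m is None:
            raise SyntaxError(f"bad character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


# Layer 2 (context-free), for the assumed toy grammar:
#   expr   := term (('+' | '-') term)*
#   term   := factor (('*' | '/') factor)*
#   factor := NUMBER | '(' expr ')'
def parse(tokens: list[str]):
    """Build a nested-tuple syntax tree from the token list."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def expr():
        node = term()
        while peek() in ("+", "-"):
            op = advance()
            node = (op, node, term())
        return node

    def term():
        node = factor()
        while peek() in ("*", "/"):
            op = advance()
            node = (op, node, factor())
        return node

    def factor():
        tok = advance()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        if tok.isdigit():
            return int(tok)
        if tok == "(":
            node = expr()  # recursion is what handles arbitrary nesting
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {tok!r}")

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree
```

Under these assumptions, `parse(lex("2 * (3 + 4)"))` returns `("*", 2, ("+", 3, 4))`. The lexer never looks at nesting; the recursive call inside `factor` is the only place where it is handled.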
