Grammars as Systems of Equations
This article is a continuation of the series now titled "The Untold
Story of Formal Languages". The previous 8 parts (revised with minor errors
in the originals corrected) will be available soon by FTP. In the future,
this will likely be compiled and published in the form of a monograph or
a text.
(1) History
One of the earliest notations used for depicting grammars was BNF, or
Backus-Naur Form. In it, a rule was written as:
Variable ::= Terms
and it was explicitly noted that these were to be thought of as equations
of some kind. However, for some curious and unknown reason the comment was
always qualified and later formalisms never really took on the idea, even
though it remains latent in most presentations of formal language theory.
The traditional formalism for a Context Free Grammar reads like this:
A Context Free Grammar consists of the following:
N: a set of non-terminals (sometimes called variables,
reflecting the older tradition of grammars as equations)
X: a set of terminals (i.e. the input alphabet, also
called tokens)
S: an element in N denoting the START (or SENTENCE) symbol
P: a subset of N x (N + X)*. Write n -> w if and only if (n, w) is
an element of P and call n -> w a PRODUCTION or RULE.
Similar to what was done to arrive at the correspondence (Automaton =
System of Equations), we will adopt the axiom that --
Non-Terminal = Expression
and then arrive at the conclusion that a grammar, too, is a system of
equations whose least fixed point solution in the main variable, S, is
the language represented by the grammar.
Associated with a grammar is a definition for DERIVABILITY. One
states that
* w1 n w2 |- w1 w w2 if n -> w
where w1, w2 are in (N + X)*
and then defines the reflexive-transitive closure as follows:
* (R) A |= A
* (T) if A |- B, B |= C then A |= C
* (C) |= is the smallest relation defined by (R) and (T).
where I'm using the symbol |= in lieu of |-^* because it's easier to type.
For each word w in (N + X)*, define:
|w| = { w' in X*: w |= w' }
then the language represented by the grammar is defined to be |S|.
Using properties R and T of |=, one can arrive at a system of equations
that |n| will satisfy. The property C will guarantee that |n| will be the
least fixed point solution of its system of equations. By applying the
axiom (Non-Terminal = Expression) in the form:
n = |n|
one arrives at the desired correspondence: Grammar <-> System of Equations.
This is achieved by the following lemmas:
Lemma 1: For each non-terminal n in N: |n| = union(|w|: n -> w)
Proof:
|n| = { w' in X*: n |= w' }
= { w' in X*: n = w' } + { w' in X*: n |- w, w |= w' }
= 0 + union ( {w' in X*: w |= w'} : n -> w)
= union { |w|: n -> w }
Lemma 2: For each word x in X*: |x| = x
Proof:
|x| = { w' in X*: x |= w' }
= { w' in X*: x = w' } + { w' in X*: x |- w, w |= w' }
= {x} + 0
The latter reduction is due to the fact that relation x |- w requires x to
have at least one non-terminal in it, by definition of |-. Therefore x |- w
is impossible when x is a word in X*.
By notational convention when writing language expressions a singleton set
is denoted by its element, thus |x| = {x} = x.
Lemma 3: If w1 |- w2, then w' w1 w'' |- w' w2 w''
Proof:
The condition w1 |- w2 can only occur in the form:
w3 n w4 |- w3 w w4 with w3, w4 in (N + X)* and n -> w
then
w' w1 w'' = w' (w3 n w4) w''
= (w' w3) n (w4 w'')
|- (w' w3) w (w4 w'')
= w' (w3 w w4) w'' = w' w2 w''
Lemma 4: |w1 w2| = |w1| |w2|
Proof:
Suppose x1 x2 lies in |w1| |w2|. Then x1, x2 are both words in X* with
w1 = a0 |- a1 |- a2 |- ... |- am = x1
w2 = b0 |- b1 |- b2 |- ... |- bn = x2
for some m, n >= 0 and a's and b's in (N + X)*. Then by Lemma 3:
w1 w2 = a0 b0 |- a1 b0 |- a2 b0 |- ... |- am b0 = x1 b0
|- x1 b1 |- x1 b2 |- ... |- x1 bn = x1 x2
or
w1 w2 |= x1 x2,
or
x1 x2 is an element of |w1 w2|.
thus |w1| |w2| is a subset of |w1 w2|.
Suppose x lies in |w1 w2|. Then x is a word in X* with
w1 w2 = c0 |- c1 |- c2 |- ... |- cp = x
for some p >= 0 and c's in (N + X)*. We may inductively define ci = ai bi
as follows:
* c0 = a0 b0 = w1 w2
* if ci = ai bi and ai bi |- c{i+1}
then either
-- ai = u n v, n -> w and
ai bi = u n v bi |- u w v bi = c{i+1}
in which case define a{i+1} = u w v, b{i+1} = bi
or -- bi = u n v, n -> w and
ai bi = ai u n v |- ai u w v = c{i+1}
in which case define a{i+1} = ai, b{i+1} = u w v
Thus one arrives at x = ap bp, where w1 = a0 |= ap, and w2 = b0 |= bp.
Therefore, x is an element of |w1| |w2|.
This establishes the equality of |w1 w2| and |w1| |w2|.
Lemma 5: If w1 |= w2 then |w1| is a superset of |w2|.
Proof:
Let x be an element of w2. Then since |= is transitive, it follows
that
w1 |= w2 |= x, therefore w1 |= x, or x is in |w1|.
Thus |w1| contains |w2| as a subset. If we apply the axiom |w| = w, then
this lemma establishes a direct relation between inclusion and |=:
w1 |= w2 --> w1 contains w2 <--> w1 + w2 = w1
From these basic relations, one arrives at the following property:
Theorem: Corresponding to every grammar is a system of equations of the
form:
n = sum (w: n -> w)
with one equation for each non-terminal, n, in N.
whose least fixed point solution is |n|.
Proof:
We've shown that the |n|'s are the solutions to the equations comprising
the system. The closure property guarantees that |n| must then be a subset of
every other solution to the equation in n. In particular, if w in |n| then:
n |= w
|n| contains |w| = {w}, by Lemmas 5 and 2.
w is an element of |n|.
By lemmas 1, 2 and 4, we may write:
|n| = sum ( |y1| |y2| ... |ym|: yi in N + X, n -> y1 y2 ... ym }
which after application of the axiom |n| = n and Lemma 2 becomes:
n = sum (w: n -> w)
Thus each grammar is equivalent to a system of equations with the
correspondences:
S <---> language represented by the grammar
n -> w <---> n = ... + w + ...
So now the start/sentence symbol is not just related to the language
represented by the grammar, it IS the language!
Example:
For the context free grammar given by
N = { S, T }
X = { a, b }
P = { (S, 1), (S, a T), (T, S b) }
one has the corresponding system:
S = 1 + a T
T = S b
Its least fixed point solution in terms of context free expressions is:
S = <0| (<1| a)* (b |1>)* |0>
T = <0| (<1| a)* (b |1>)* b |0>
(see part 8 of the series), and in fact, one can write
S = <0| (1 + <1| a)* <1| a (|1> b)*) |0>
= <0| |0> + <0| (<1| a)* <1| a (|1> b)* |0>
= 1 + <0| (<1| a)* <1| a (|0> + |1> b (|1> b)* |0>)
= 1 + <0| (<1| a)* a (<1||0> + <1||1> b (|1> b)*) |0>
= 1 + <0| (<1| a)* a ( 0 + 1 b (|1> b)*) |0>
= 1 + <0| (a <1|)* a b (|1> b)* |0>
= 1 + <0| a (<1| a)* (b |1>)* b |0>
= 1 + a <0| (<1| a)* (|1> b)* b |0>
= 1 + a T
and
T = <0| (<1| a)* (b |1>)* b |0>
= <0| (<1| a)* (b |1>)* |0> b
= S b
(2) Systems of Equations
The previous discussion leads to the concept of a system of equations,
which can be formalised as follows:
A System of Equations over an alphabet X consists of the following:
V: a set of variables.
S: an element of V denoting the main variable.
P: V -> Power((V + X)*), where Power(S) denotes the power set of S.
Each variable, v, is associated with P(v) which denotes a
subset of (V + X)*, for which we'll write v = P(v).
Examples:
BNF -- V is finite
P(v) is a finite sum of products of terms in V + X.
EBNF -- Extended BNF
V is finite
P(v) is a regular expression over V + X.
HBNF -- "Hyper" BNF
V is finite
P(v) is a context free expression over V + X.
Exercise: Which class of languages can be represented by HBNF's, EBNF's?
Hint: Part 8 ("Solving Context Free Equations") was actually about deriving a
context-free expression from a system of equations in EBNF form. But
the method, in fact, was formulated with respect to systems of equations
in HBNF form.
ISA -- Infinite State Automaton (see part 3 of this series)
V = Q, the automaton's state set, possibly infinite
S = the start state
P(q) = \q + sum(x q': q |- x q') + sum(q': q |- q')
where \q = 1 if q is a final state, 0 else
The last example (which a Turing Machine is a special case of) shows the
degree of generality that systems of equations can achieve.
The notion of derivation may be defined analogously:
Let v be a variable in V, and w, w1, w2 words in (V + X)*:
* v -> w if w is an element of P(v).
* w1 v w2 |- w1 w w2 if v -> w
* |= is defined as the reflexive-transitive closure of |-.
* |w| = { x in X*: w |= x }
then the language represented by the system of equations will be defined
as |S|. By completely analogous proofs, the following lemmas apply:
Lemma 1': For each variable v in V: |v| = union(|w|: v -> w) = |P(v)|.
Lemma 2': For each word x in X*: |x| = x
Lemma 3': If w1 |- w2, then w' w1 w'' |- w' w2 w''
Lemma 4': |w1 w2| = |w1| |w2|
Lemma 5': If w1 |= w2 then |w1| is a superset of |w2|.
as well as:
The Fundamental Theorem of Systems of Equations
Let (V, S, P) be a system of equations given as above
then the least fixed point solution to the equations
{ v = P(v): v in V } is |v|.
This includes the result arrived at in part 3 of the series for Infinite
State Automata as a special case.
(3) Derivations as Systems of Inequalities
Recalling the definition of inequality from part 5 (A -> B iff A = A + B),
we can express the fundamental theorem in the form --
Corollary:
v = { w in X*: v -> w }, for v in V
is the least fixed point solution to the system
{ v = P(v): v in V }.
Proof:
Let w be a word in X*. If v -> w then v = v + w = v union {w}, thus
w is an element of v. By our axiom, v = |v|, it follows that w is an
element of |v| or that v |= w.
Conversely, by Lemma 5' if v |= w, then v is a superset of {w}, therefore
v = v + w, and v -> w. Thus
{ w in X*: v -> w } = { w in X*: v |= w } = |v|.
Thus, a derivation is seen to be a special kind of system of inequalities.
(4) Parse Trees as Finite Systems of Equations
In dealing with context free grammars one also encounters the idea of
the parse tree. A parse tree is the depiction of a finite set of relations
of the form N -> x1 ... xn, where N is the label of a node in the tree and
x1, ..., xn are the labels of the nodes that branch from it.
Consider the following grammar given by the equation:
S = x + a S b + S S
Suppose we were to parse the input a x x b x. Then the following derivation
could be constructed:
S |- S S |- a S b S |- a S S b S
|- a x S b S |- a x x b S |- a x x b x
This can also be presented as a system of equations, where each equation
essentially represents a "restriction" of the corresponding equation given
by the grammar:
S0 = S1 S5
S1 = a S2 b
S2 = S3 S4
S3 = x
S4 = x
S5 = x
When you solve this system for S0, the result is a one-element set
{ a x x b x }. The equations themselves capture the set of branchings in
the corresponding parse tree along the following lines:
S0 <----> root node
S1 = a S2 b <----> S1 branches to leaf a, node S2 and leaf b
The process of parsing can therefore be represented as the process of writing
down a series of equations, each a "subset" of the original equation, whose
solution is a one-element set.
More generally, if you want to include multiple parse trees, you have to
allow the use of the + operator as well. In the most general case, this
might even result in an infinite set. For example, with the grammar:
S = 1 + x + S S
the set of parse trees corresponding to the input x x can be represented by
the set of equations:
S02 = S00 S02 + S01 S12 + S02 S22
S01 = S00 S01 + S01 S11 + x
S00 = S00 S00 + 1
S12 = S11 S12 + S12 S22 + x
S11 = S11 S11 + 1
S22 = S22 S22 + 1
This method of representing sets of parse trees has been noted often in the
literature. It corresponds well to the idea of AND-OR trees, and proof
trees in Horn Clause evaluators and has even led to analogous methods
being applied in those areas.
But this generalization, or perspective if you will, is not general enough
and doesn't really capture the flavor of what's going on here fully. In
the most general case, what the parsing process is really giving you is
not just a AND-OR tree (which can be thought of as a finite grammar), but
more generally the output of the parsing process will be a grammar that
generates a one-element set. The grammar itself may not even be a
regular grammar, much less finite. For example, the set of parse trees
corresponding to the grammar:
S = 1 + S + S S + x
with input
x x
is not just infinite, it's a context free set in its own right.
Actually, the general problem of parsing on a CFL must also take into
account that different structures will be built or actions taken depending
on what parse tree is generated. This will be discussed in a later section
on Translation Expressions and Translation Grammars. There, we will also
see that if we allow for this generalised method of representing sets of
parse trees, then CFG parsing will become an O(n) process. The drawback:
the structures are over-packed, i.e., testing a parse structure for emptiness
will have equal complexity as testing for CFG emptiness.