Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

The Untold Story of Formal Languages, part 10

9 views
Skip to first unread message

Mark Hopkins

unread,
Aug 12, 1995, 3:00:00 AM8/12/95
to
Properties of the Context Free Languages

This is part 10 of the series "The Untold Story of Formal Languages". The
previous parts will be available soon by FTP. In the future, this will likely
be compiled and published in the form of a monograph or a text.

In this part, we'll show the remarkable ease of establishing some basic
results concerning Context Free Languages, applying the insights that have
been discussed in the previous sections. This includes derivations and proof
outlines of the following:

-- Direct proof (outline) of the relation D-CFL < N-CFL.
-- A CFE -> CFG conversion algorithm which has, as a corollary, an
efficient PDA -> CFG conversion algorithm.
-- Closure properties for CFL's, and the CCFL's.
-- An efficient CFE, RE intersection algorithm, which leads to an O(n)
algorithm for intersecting a PDA and NFA, where n is the size of the
NFA. In contrast, the algorithm usually cited is O(n^2).

(1) The PDA - ISA correspondence in detail
The traditional definition of a Push Down Automaton runs like this:

A Push Down Automaton consists of the following:
X: an input alphabet
S: a stack alphabet = { 1, 2, ..., n }
where n is the number of stack symbols.
Q: a finite set of states
q0: an element of Q (the start state)
F: a subset of Q (the final states)
D: a subset of ((S + {0}) x Q x (X + 1)) x (S* x Q)
where s q x |- w q' iff ((s, q, x), (w, q')) is in D.

A configuration takes on the form -- (S*, Q, X*) and denotes respectively
the current stack state, the current state and the remaining word. One
writes:
w' s q x y |- w' w q' y if s q x |- w q'
q x y |- w q' y if 0 q x |- w q'

where w, w' are stack words in S*, y is a word in X*, x is in X + 1, and
q and q' are states.

As was noted in part 3 ("The Equational Representation of Formal
Languages"), this is a special case of an infinite state automaton where

* State set = Q x S*, Q finite.
Each element (q, w) of Q x S* is written as q(w).
* The transitions take on the form:
q(w' s) x |- q'(w' w) if s q x |- w q'
q(1) x |- q'(w) if 0 q x |- w q'

Which implies the symmetry property --

For all stack words w1, w2 in S*:
q(w1 s) x |- q'(w1 w) <==> q(w2 s) x |- q'(w2 w)

From this, a correspondence to a system of equations was derived where:

q = ... + x us v(w') q' + ... if s q x |- w q'
q = ... + p0 v(w') q' + ... if 0 q x |- w q'
q = 1 + ... if q is a final state

which is a right-linear system over the Context Free Algebra.

The set of states Q x S* may be depicted as an infinite number of copies
of Q arranged in an infinite n-ary tree with the relations --

{ q(1): q in Q } lies at the root level of the tree
q(w m) lies along mth branch down from state q(w).

This leads to an interpretation for the generators of the Context Free
Algebra:
vm: Go down branch m.
um: Test for branch m,
followed by move up to parent node
p0: Test for root level.

Example:
Corresponding to the language defined by the equation

S = 1 + p S q S + b S d S
we have
S() = 1 + a S(1) + c S(2)
S(w 1) = q S(w) + p S(w 1 1) + b S(w 1 2)
S(w 2) = d S(w) + p S(w 2 1) + b S(w 2 2)
where
S(n1 n2 ... nk) = S xk S ... S x2 S x1 S
where xi = q if ni = 1
xi = d if ni = 2

From this, we could derive the word a c d b either by a set of inequalities:

S() -> a S(1)
-> a c S(1 2)
-> a c d S(1)
-> a c d b S()
-> a c d b
or by a derivation sequence

S() a c d b |- S(1) c d b |- S(1 2) d b |- S(1) b |- S()

In terms of the Context Free Algebra the system of equations take the form:

start: <0| S
S = |0> + v1 p S + v2 b S + u1 q S + u2 d S

The states that comprise this automaton will have the topology of an infinite
binary tree:
|
v
[[S]]
*-------- ^ ^ --------*
| | | |
p | q | | d | b
v | | v
S --------* *-------- S
*---- ^ ^ ----* *---- ^ ^ ----*
| | | | | | | |
p | q | | d | b p | q | | d | b
v | | v v | | v
S ----* *---- S S ----* *---- S
... ... ... ...

The symmetry property alluded to above is what gives the automaton its
topological feature.

(2) D-CFL < N-CFL
Call a context free language deterministic if it is recognized by a
(lambda-free) deterministic PDA. A deterministic PDA will be a PDA in
which no more than one transition (s, q, x) |- (w, q') will occur for any
(s, q, x) in S x Q x X.

A deterministic ISA may be defined analogously as one in which no more
than one transition (q, x) |- q' occurs for any (q, x) in Q x X. Then there
will be a direct correspondence between deterministic PDA's and deterministic
ISA's which have the form of an infinite n-ary tree.

There are context free languages whose ISA's when made deterministic will
not take the form of an infinite n-ary tree. These languages therefore cannot
be represented by any deterministic PDA and so are not deterministic. The
following language is an example:

L = { a^i b^j c^k: i = j or j = k }

which can be specified by the following system of equations:

S = U c* + a* V
U = 1 + a U b
V = 1 + b V c

To arrive at an Infinite State Automaton, define the following expressions:

A(n) = U b^n c* C(n) = b^n c* E = c*
B = a* V D(n) = V c^n F(n) = c^n

then by direct substitution one can verify that they satisfy the system of
equations:
S = A(0) + B
A(n) = C(n) + a A(n+1)
B = a B + D(0)
C(0) = E, C(n+1) = b C(n)
D(n) = F(n) + b D(n+1)
E = 1 + c E
F(0) = 1, F(n+1) = c F(n)

which corresponds to the following Infinite State Automaton:

[[F(0)]] <-- F(1) <--- F(2) <--- F(3) ...
^ c ^ c ^ c ^
| | | |
*----> B -----> D(1) ---> D(2) ---> D(3) ...
| /^ b b b
| / |
--> S ---* *->-* a
|
| a a a
*---> A(0) --> A(1) --> A(2) --> A(3) ...
| | | |
v b v b v b v
[[E]] <-- C(1) <-- C(2) <-- C(3) ...
/^
/ |
*->-* c

Because S branches off into both A(0) and B with lambda transitions, then
the automaton is non-deterministic. Even when the lambda transitions are
removed it will remain a non-deterministic ISA since S will branch off to
A(1) and B on the input a.

However, it can be made deterministic by combining states. Let:

G(n) = A(n) + B (= U b^n c* + a* V)
H(n, m) = C(n) + D(m) (= b^n c* + V c^m)

and note that E + F(m) = c* + c^m = c*. Then one can write the following
system of equations:

S = G(0)
G(n) = a G(n+1) + H(n, 0)
H(0, m) = b D(m+1) + E = b D(m+1) + 1 + c E
H(n+1, m) = b H(n, m+1) + F(m)

which leads to the deterministic infinite state automaton:
Not shown: each of the G and H states under in the first row has a lambda
transition to E and each of the G and H states in the rows underneath has
a lambda transition to the F state in the same column. When the lambda
transitions are removed, the automaton becomes deterministic.

c c c c
[[F(0)]] <- F(1) <--- F(2) <--- F(3) <--- ...
^ ^ ^
c | b | b | b
+--+ D(1) ---> D(2) ---> D(3) ---> ...
v | ^ ^ ^
[[E]] <---+ / / /
| /b / b / b
| / / /
---> S = G(0) H(0,1) H(0,2) H(0,3) ...
| ^ ^ ^
a | / b / b / b / b
v / / / /
G(1) H(1,1) H(1,2) H(1,3) ...
| ^ ^ ^
a | / b / b / b / b
v / / / /
G(2) H(2,1) H(2,2) H(2,3) ...
| ^ ^ ^
a | / b / b / b / b
v / / / /
G(3) H(3,1) H(3,2) H(3,3) ...
|
...

which basically has the form of an 2-dimensional grid. Even more, this
automaton (once the lambda transitions are removed) is in minimal form: i.e.,
no two states are equivalent:

|q1| = |q2| <--> q1 = q2.

So from each state G(n), there are an infinite number of distinct diamond
shaped paths:
G(n)
/ \
a1 b1
/ \
a2 b2
... ...
an bn
\ /
Q

where all the a's are distinct from all the b's and from each other.

The only conversions possible between deterministic automata are equivalence
conversions where either one state is split into two or more equivalent
states, or a set of equivalent states are merged. Therefore, no tree-shaped
deterministic automaton can be converted to or from an automaton which has
this property.

Therefore the language, L, is an inherently non-deterministic CFL.

(3) Converting CFE -> CFG
In part 8 of this series ("Solving Context Free Equations") a process
was described which found a solution of the form

<0| R |0>

to a system of equations comprising a Context Free Grammar, where R is a
regular expression over the input alphabet X and operator set

D = { <1|, ..., <s|, |1>, ..., |s> }

Note, in particular, the absence of <0| and |0> inside the expression R.
Using either of the RE -> FA algorithms presented in part 5 of the
series ("The Algebraic Representation of Formal Languages"), the expression
<0| R |0> can be converted into a right-linear system of equations of the
form:
S = q0
qi = sum of terms of the forms
|j> qk, |0>, <j| qk and x qk

Collect all the terms of the former two forms and call it qi'. Then we
can write
S = q0
qi = qi' + sum of terms of the form <j| qk, x qk

Visually, the latter two types of terms represent transitions to states
that are at or below the current level in the tree-shaped infinite state
automaton. All transitions on a word w in |qi(L)| (where L is the stack
word in S* representing the current level of state qi), must have the
form --
* A sequence of transitions qi(L) w1 |= qj(L), where all
intermediate transitions take place at or above level L,
and where w1 is a predix of w.
* A transition involving one of the terms in qj', either
to a lower level or (if |0> is a term in qj' and w = w1)
to completion.

Then defining

vij = { w in X*: qi w |= qj, where qj is at the same level as qi and no
intermediate transitions pass below the current level of qi }.

we may write a system of equations of the form:

qi = sum (vij qj')

The result of reconciling these two sets of equations for qi is a FINITE
system of recursive equations in the v's. That's the context free grammar.

First, the initial equation may be rewritten as:

S = <0| q0 = sum (<0| v0j qj')
= sum (v0j <0| qj')
= sum (v0j dj)
where dj = 1 if |0> is a term in qj'
0 otherwise

Therefore S is a sum of terms of the form (v). That's the first equation.

To resolve the equation for qi, first note the following reductions,
indicated schematically:

<j| q = sum (<j| v q')
= sum (<j| v |m> q) + sum (<j| v |0>)

Since j > 0, the second set of terms will cancel and the operators in the
first set of terms will cancel to either 0 or 1, thus yielding a sum of
the form
<j| q = sum (v q)
= sum (v v q')

Where the second set of equations is used to reduce each of the q's to a
sum of the form v q'.
The second set of reductions, also indicated schematically, takes the
form:
x q = sum (x v q')

Finally, on the right hand side of the equation for each q in the first
system is a term of the form q' = 1 q'. Therefore, combining terms, we get:

qi = sum (Pij qj')
where Pij is a sum of terms of the forms v v, x v and 1.

Comparing with the second system of equations, we may then write

vij = Pij = sum of terms of the forms v v, x v and 1.

Furthermore, after substituting for each of the w's in the first equation
for S the result will be an equation of a similar form for S. The final
result is therefore a system of equations of the form:

v = P(v) = ... + v v + ... + x v + ... + 1

This corresponds to a Context Free Grammar in which every rule takes on
one of the following forms:

N -> N'' N'
N -> x N'
N -> 1

where N, N' and N'' are non-terminals and x is a terminal. This is a
normal form which is not too far removed from Chomsky Normal Form, in
which rules take on the forms:

N -> N'' N'
N -> x
and possibly
S -> 1

Example:
Consider the context free expression <0| (<1| p)* (q |1>)* |0>.
This can be written as a right-linear system:

S = <0| A
A = <1| B + q C + |0>
B = p A
C = |1> D
D = q C + |0>

From this we can define:

A' = D' = |0>, C' = |1> D, B' = 0

and then write a second system of equations:

A = wAA A' + wAC C' + wAD D'
B = wBA A' + wBC C' + wBD D'
C = wCA A' + wCC C' + wCD D'
D = wDA A' + wDC C' + wDD D'

Reconciling the equation for S yields:

S = <0| A = <0| wAA A' + <0| wAC C' + <0| wAD D'
= wAA + 0 + wAD

Reconciling the term of the form <j| q in the equations for A, B, C and D
yield:
<1| B = <1| wBA A' + <1| wBC C' + <1| wBD D'
= 0 + wBC D + 0

which when substituted in the first system yields:

A = wBC D + q C + A'
= wBC (wDA A' + wDC C' + wDD D')
+ q (wCA A' + wCC C' + wCD D') + A'
= (wBC wDA + q WCA + 1) A'
+ (wBC wDC + q wCC) C' + (wBC wDD + q wCD) D'
B = p A
= p wAA A' + p wAC C' + p wAD D'
C = C'
= 0 A' + 1 C' + 0 D'
D = q C + D'
= q wCA A' + q wCC C' + (q wCD + 1) D'

Comparing this with the second system yields the following result:

S = wAA + wAD
wAA = wBC wDA + q wCA + 1 wBA = p wAA wCA = 0 wDA = q wCA
wAC = wBC wDC + q wCC wBC = p wAC wCC = 1 wDC = q wCC
wAD = wBC wDD + q wCD wBD = p wAD wCD = 0 wDD = q wCD + 1

Of course, this can be optimized by putting in some simple checks to detect
those sets of equations with least fixed points of 0, eliminating the unused
variables and substituting to reduce the equations further. This is done
below as an illustration.

A system of equations will have 0 as a solution if every term on the right
hand side goes to 0 when 0 is substituted for each of the variables in the
system. There is such a system in this case:

wCA = 0 wDA = q wCA
wCD = 0

Substituting wCA = wCD = wDA = 0 yields 0 on the right hand side of each
equation. Therefore, all these variables may be eliminated from the original
system to yield:

S = wAA + wAD
wAA = 1 wBA = p wAA
wAC = wBC wDC + q wCC wBC = p wAC wCC = 1 wDC = q wCC
wAD = wBC wDD wBD = p wAD wDD = 1

To eliminate unused variables note that S uses wAA and wAD, wAD uses
wBC and wDD, wBC uses wAC, which uses wBC, wDC and WCC. Nowhere are wBA and
wBD used, so they can be eliminated:

S = wAA + wAD
wAA = 1
wAC = wBC wDC + q wCC wBC = p wAC wCC = 1 wDC = q wCC
wAD = wBC wDD wDD = 1

Third, after substituting wAA = wCC = wDD = 1, this will yield the system:

S = 1 + wAD
wAC = wBC wDC + q wBC = p wAC wDC = q
wAD = wBC

and after substituting wBA = p, wDC = q and wAD = wBC, this further reduces
to:
S = 1 + wBC
wAC = wBC q + q
wBC = p wAC

Finally, since wAC = (wBC + 1) q = S q, this system can be reduced to a single
equation:
S = 1 + p S q

which recovers the equation where the context free expression came from.

(4) Closure properties for CFL's.
It's not too hard to establish that context free languages are closed
under concatenation, reversal, union and star. Let L and M be context
free languages with context free expressions <0| R |0> and <0| S |0>.
Then
L M = <0| R |0><0| S |0> = <0| R p0 S |0>
where p0 = |0><0|

Alternatively,
L M = <0| <n| R |n><n| S |n> |0>
where <n| and |n> are operators that do not occur in R or S.

Also, we can write:

L + M = <0| (R + S) |0>
L* = (<0| R |0>)*
= 1 + <0| R |0> (<0| R |0>)*
= <0| |0> + <0| R (p0 R)* |0>
L* = <0| (1 + R (p0 R)*) |0>
where p0 = |0><0|

and defining L' as the reversal of L, we can write:

L' = <0| R' |0>

where R' is the adjoint of the context free expression R, defined inductively
by:
(A B)' = B' A'
(A*)' = A'*
(A + B)' = A' + B'
x' = x, for any input word in X*.
<i|' = |i>, and |i>' = <i|
1' = 1, 0' = 0

A proof by induction then shows that <0| R' |0> = (<0| R |0>)' is the reversal
of L = <0| R |0>.

In part 8 of the series, the language D(p, q) ^ D(b, d) was presented with

D(p, q) ^ D(b, d) = (D(p, q) ^ (b + d)*) & ((p + q)* ^ D(b, d))

where D(u, v) is the Dyck language of nested pairs of u's and v's and A & B
denotes the intersection of A and B. This is the intersection of two
context free languages given by:

D(p, q) ^ (b + d)* -- S = 1 + p S q S + b S + d S
(p + q)* ^ D(b, d) -- S = 1 + p S + q S + b S d S

The automaton for this language took on the form of an infinite two
dimensional grid. In view of our example above in which a tree-like
non-deterministic infinite state automaton was converted into a grid-like
deterministic automaton, it's not obvious at first sight that the automaton
for this language can't be converted into a tree-like form. However,
there is lurking somewhere in this automaton a purely topological property
that is invariant with respect to transformations of automata and it will
correspond to the property "dimension".

Defining C_n as the class of languages formed by the intersection of n
context free languages, it is indeed possible to prove that these languages
form a proper hierarchy:

CFL = C_1 < C_2 < C_3 < ... < CCFL = union(C_n: n > 0)

where CCFL is the class of Concurrent Context Free Languages, defined as the
union of all the C_n. In fact the language,

D(p0, q0) ^ D(p1, q1) ^ ... ^ D(pn, qn)

will be a member of C_{n+1}, not in C_n. This can be shown as an indirect
result of the proof listed under:

"An Infinite Hierarchy of Intersections of Context-Free Languages"
Leonard Y. Liu & Peter Weiner
Mathematical Systems Theory (1973), pp. 185-192

In this proof, dimension does indeed play a critical role. Unfortunately,
it is not written in terms of infinite state automata, but in terms of a
subclass of CFL's known as the bounded CFL's. It's not actually too hard
to generalise the proof and show that C_n < C_{n+1} along the following
lines:
* Define N = { 0, 1, 2, ... }
* A linear set is defined to be a set of the form
v + v1 N + v2 N + ... + vm N
where the v's are elements of N^k.
* Define the dimension of a linear set
L = v + v1 N + v2 N + ... + vm N
as the dimension of the smallest vector space containing the
set { v1, v2, ..., vm }.
* For any word w over X = { x1, ..., xk }, define
|w| = (n1, n2, ..., nk) if w contains ni xi's in it.
* Define |L| = { |w|: w in L )
* If |L| is a linear set of dimension n + 1 then L is an element of
C_{n+1} not in C_n.
* |D(p0, q0) ^ D(p1, q1) ^ ... ^ D(pn, qn)| is a linear set of
dimension n + 1.

It then follows that CFL's are not closed under intersection, but that CCFL's
are.

(5) CFL - RL Intersection
One can extend the hierarchy above by defining C_0 to be the class of
regular languages -- or defining a regular language to be a CCFL of dimension
0. Then the following intersection property will apply:

{ L1 & L2: L1 in C_m, L2 in C_n } = C_{m + n} for all m, n >= 0

In particular,
{ R & L: R in RL, C in CFL } = CFL

It has already been established that the language D(p, q) is a context free
language that is not regular, in virtue of the fact that its minimal
deterministic automaton is of the form:

-- p --> -- p --> -- p --> ...
--> [[0]] [1] [2] [3]
<-- q -- <-- q -- <-- q -- ...

which has an infinite number of inequivalent states. The minimal automata
for all regular languages are finite. Therefore the C_n hierarchy can be
extended to:
RL = C_0 < CFL = C_1 < C_2 < ... < CCFL

Let L and R be respectively a context free language and regular language.
Corresponding to these languages one can write the following systems of
right-linear equations:

L = <0| q0 R = r0
qi = sum of terms of the form ri = sum of terms of the form
x qk, <j| qk, |j> qk, |0> x rk, 1

We may apply exactly the same process used in part 5 to arrive at a
DFA for a regular expression. Define the following:

\E = 1 if 1 is an element of E
0 else
x\E = { w: x w is a word in E }
and note that
\E = E & 1, and x x\E = E & x X*

Then for the intersection A & B we have the following properties:

\(A & B) = \A \B
x\(A & B) = x\A & x\B

will allow us to calculate the intersections.

From this we can calculate L & R. Define qim(w) = qi(w) & rm. Then the
initial equation will be:

L & R = (<0| q0) & r0 = q0() & r0 = q00() = <0| q00

Schematically, we may then write

qim(w) = qi(w) & rm
= sum (x qk(w), <j| qk(w), |j> qk(w), \qi(w)) & rm

The term x qk(w) will intersect with rm to yield x (qk(w) & x\rm). The
remaining terms will intersect in a straightforward manner --

<j| qk(w) & rm = qk(w j) & rm = qkm(w j) = <j| qkm(w)
|j> qk(w j) & rm = qk(w) & rm = qkm(w) = |j> qkm(w j)
and
\qi(w) & rm = 1 & rm = \rm if \qi(w) = 1
= 0 & rm = 0 if \qi(w) = 0
thus
\qi(w) & rm = \qi(w) \rm = \(qi(w) & rm) = \qim(w)

Thus, we arrive at a sum of the form

qim = sum (x (qk & x\rm), <j| qkm, |j> qkm, |0>)

Since x\rm will be expressible as a sum over the r's, we can
express qk & x\rm) as a sum over terms qkn.

Therefore, in carrying out the intersection of the two systems of
equations we simply intersect the q states of the automaton representing
the context free language with the r states of the automaton representing
the regular language and keep the arrangement of the operator symbols
unchanged.

Example: D(p, q) & (p* q*).
For L = D(p, q) and R = p* q* write --

L = <0| A R = D = p D + q E + 1
A = <1| B + |1> C + |0> E = q E + 1
B = p A
C = q A

Then define states AD = A&D, AE = A&E, etc. We can then write:

L = <0| A & D = <0| AD
AD = (<1| B + |1> C + |0>) & D
= <1| BD + |1> CD + |0> \D
= <1| BD + |1> CD + |0>, since \D = 1
BD = p A & D = p (A & p\D) = p AD, since p\D = D
CD = q A & D = q (A & q\D) = q AE, since q\D = E
AE = (<1| B + |1> C + |0>) & E
= <1| BE + |1> CE + |0> \E
= <1| BE + |1> CE + |0>, since \E = 1
BE = p A & E = p (A & p\E) = 0, since p\E = 0
CE = q A & E = q (A & q\E) = q AE, since q\E = 1

Thus we arrive at the following system (after eliminating BE and
equating CE to CD):

L = <0| AD
AD = <1| BD + |1> CD + |0>
BD = p AD
CD = q AE
AE = |1> CD + |0>

If you're curious what the corresponding context free expression looks like,
we can reduce the system and find out --

L = <0| AD
AD = <1| p AD + |1> q AE + |0>
AE = |1> q AE + |0>
or
AD = <1| p AD + AE = (<1| p)* AE
AE = q |1> AE + |0> = (q |1>)* |0>
Thus
L = <0| (<1| p)* (q |1>)* |0>

which bring us full circle back to the previous example which we ended up
showing was given by the equation

S = 1 + p S q

Thus, the least fixed point solution to S = 1 + p S q is: D(p, q) & (p* q*).

One should note that the algorithm for converting RE to NFA is O(n) in
the size of the regular expression. The intersection algorithm above yields
O(mn) equations, where m and n are the number of equations respectively in
the systems defining the context free language L and the regular language R.

Therefore, the intersection algorithm yields a system consisting of O(n)
right-linear equations, where n is the size of the regular expression
defining the regular language. In virtue of the direct correspondence between
systems of right-linear equations and push-down automata, this algorithm
also applies to finding the intersection of a PDA and NFA, and is O(n) in
the number of states in the NFA.
In contrast, the algorithm customarily cited in formal language sources is
O(n^2) in the number of states of the NFA.

0 new messages