Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Direct proof of Parikh's Theorem

4 views
Skip to first unread message

Mark Hopkins

unread,
Aug 16, 1995, 3:00:00 AM8/16/95
to
This is an application of some of the results listed in the series
"The Untold Story of Formal Languages". In particular, use is made
of the correspondence:

Grammar <--> System of Equations

and of the fact that every regular language is represented by a
system of eqyations of a particular form.

The following is a direct proof of Parikh's Theorem for Context Free
Languages. The Theorem states that every context free language is a
regular language when the underlying word algebra is commutative. To prove
this for a context free language, first represent the language as a system
of equations, then solve the equations on the premise that the algebra is
commutative.

First some basic properties.

THEOREM 1: Right-Linear systems of equations with regular expressions
as coefficients have regular expressions as least fixed
point solutions. This solution can be arrived at by an
O(n^3) process.
Proof:
Let X be a finite alphabet, R(X) the set of regular expressions over
X, and let the system of equations be given by:

qi = ri + ri0 q0 + ri1 q1 + ... rin qn
for i = 0, 1, ..., n
where the r's are in R(X).

then the equation for q0 can be written as:

q0 = r00 q0 + (r0 + r01 q1 + ... + r0n qn)

Its least fixed point solution must be that of the equation:

q0 = r00* (r0 + r01 q1 + ... + r0n qn)

Substituting this in the remaining equations yields:

qi = ri' + ri1' q1 + ... + rin' qn
for i = 1, ..., n
where ri' = ri + s0 r0
rij' = rij + s0 r0j
s0 = r00*

After n more similar sets one will arrive at the equation which will
express qn in terms of a regular expression not containing any of the
q-variables. Back substituting yields similar equations for the other
q-variables. The total number of subterms defined at every stage is
proportional to n^3.

A solution to a system of equations, in the following, will be
understood as the least fixed point solution to the system. Two equations
are called equivalent when they share the same least fixed point solution.

THEOREM 2: x = a + b x + x c + x d x is equivalent to x = b* a c* (d b* a c*)*
Proof:
That the expression b* a c* (d b* a c*)* (which may contain x's in it) is
a solution to the first equation is verified by direct substitution. Let
X be a language set which is a solution to the first equation. Every word
in X will have the form x1 d x2 d ... d xn, where none of the xi's contain
any d's. Each xi must then be of the form a, b x, or x c where the subterm
does not contain any d's. The subterm itself must be of a similar form
and so on. Therefore xi will have the form b^m a c^n.
The collection of all the possible terms yields the language set given
by the regular expression in the second equation.


THEOREM 3: Some basic properties of regular expressions:
(a) (A + B)* = A* (B A*)* = B* (A B*)*
(b) (A + 1)* = A*
(c) A** = A* = A* A*
(d) (A B*)* = 1 + A (A + B)*
Proof:
(a) The expression (A + B)* is the solution to the equation:

x = 1 + x A + x B

which is equivalent to:

x = (1 + x A) B* = B* + x A B*
and x = (1 + x B) A* = A* + x B A*

whose solutions are B* (A B*)* and A* (B A*)* respectively.
(b) The expression (A + 1)* is the solution to the equation:

x = 1 + x A + x 1 = 1 + x A + x
which is equivalent to
x = 1 + x A

whose solution is A*.
(c) By properties (a) and (b): A* = (A + 1)* = A* (1 A*)* = A* A**.
But then A** = 1 + A* A** = 1 + A* = 1 + (1 + A A*) = 1 + A A* by
idempotency, and 1 + A A* = A* by Theorem 2. It also follows then that
A* A* = A* A** = A*.
(d) In general, one may write:

(A B*)* = 1 + A B* (A B*)* = 1 + A (A + B)*

by Theorem 2.

THEOREM 4: Over a commutative algebra, regular expressions have the
properties:
A* B* = (A + B)*
(A B*)^n = A^n B*
(A* B*)* = A* B*
Proof:
(a) The expressions (A + B)* and A* B* are the solutions respectively
to the equations:
x = 1 + A x + B x
and x = 1 + A x + x B

In a commutative algebra, these equations are equivalent.

(b) One has (A B*)^n = A B* A B* ... A B* = (A A ... A) (B* B* ... B*)

where each sequence has n copies of the repeated terms. But this is
A^n B* by Theorem 1c.

(c) (A* B*)* = (A + B)** = (A + B)* = A* B*.

THEOREM 5: The solution to the equation

x = a0 + a1 x + a2 x^2 + ... an x^n

over a commutative algebra is the regular expression

x = a0 (a1 + a0 a2 + a0^2 a3 + ... a0^{n-1} an)*
Proof:
By Theorem 2, we may write:

x = (a1 + a2 x + ... + an x^{n-1})* a0
and
= a0 (a1 + a2 x + ... + an x^{n-1})*

since the algebra is commutative. The least fixed point solution can
be arrived at in only a few steps by starting with x = 0 and substituting
the current value of x on the right hand side to obtain the next value:

x = 0
--> x = a0 (a1)* = a0 a1*
--> x = a0 (a1 + sum ai (a0 a1*)^{i-1})*
= a0 (a1 + sum ai a0^{i-1} a1*)*

by theorem 4. This is a regular expression of the form

x = a0 (a1 + S a1*)*
= a0 a1* (S a1* a1*)*
= a0 a1* (S a1*)*
- a0 (a1 + S)*

thus we arrive at the indicated solution for x. Substituting it back
in the equation will yield x again.

Since the algebra is commutative, regular expressions will be expressible
in a special form. These forms are defined as follows:

DEFINITION: * A LINEAR expression is one of the form

p q*

where p and q are both finite sums of products.

* A SEMI-LINEAR expression is a finite sum of linear expressions.

THEOREM 6: Every regular expression over a commutative algebra is semi-linear.
Proof:
It's only necessary to prove that semi-linear expressions are closed under
the basic regular expression operators since the regular expressions x, 1
and 0 (where x is in the alphabet x) are all semi-linear.

Closure under sums: trivial.
Closure under products:
(sum pi qi*) (sum rj sj*) = sum (pi rj qi* sj*)
= sum (pi rj (qi + sj)*)
by Theorem 4.
Closure under star:
(sum pi qi*)* = product (pi qi*)* = product (1 + pi (pi + qi)*)

by Theorem 4. By distributivity the product can be broken down into
a finite sum of linear terms.

THEOREM 7: Every context free language over a commutative algebra is regular.
Proof:
A context free language is the solution to a system of equations of the
form:
xi = P(xi), i = 0, 1, 2, ..., v.

where P(xi) is a regular expression over the set X union {xi: 0 <= i <= n}.
This is essentially an EBNF grammar for the language in question.

Let x = P(x) be one of the equations. The regular expression can be
written in the form:

P(x) = L1 + L2 + ... + Lm
Li = pi(x) qi(x)*, i = 1, 2, ..., m

where pi and qi are finite sums of products, which therefore makes them
polynomials in x. Thus, the equation x = P(x) can be written in the
form:
x = p1(x) T1 + p2(x) T2 + ... + pm(x) Tm
Ti = 1 + Ti qi(x), i = 1, 2, ..., m

The expression in the first equation can be written in the form:

x = sum( pij x^i Tj )

where the pij's are free of the T variables and of x. By Theorem 5, its
solution is of the form:

x = r0 (r1 + r0 r2 + ... + r0^k r{k+1})*
for some k,
with ri = sum(pij Tj)

Note that Ti = qi(x)*, so that Ti Ti = qi(x)* qi(x)* = qi(x)* = Ti, by
Theorem 3c. Therefore, the number of possible products that can be
formed from the T's is finite. Define:

T_A = product(Ti: i in A)

for each subset A of { 1, 2, ..., m }. These comprise all the expressions
that can be formed by products of the T's. Then

r1 + r0 r2 + ... + r0^k r{k+1}

must have the form:

sum(pA T_A: A a subset of {1, 2, ..., m})

where all of the pA's are free of the T variables and of x. Thus, the
first equation becomes

x = sum(p0j Tj) ( sum(pA T_A) )*

By Theorem 4
( sum(pA T_A) )* = product( (pA T_A)* )
= product( 1 + pA pA* T_A* )
But
T_A* = (product qi*: i in A)*
= product qi*: i in A, by Theorem 4c
= T_A
therefore
( sum(pA T_A) )* = product(1 + pA pA* T_A)

Using distributivity, this product can then be converted into a finite sum of
the form:
sum PA T_A

Thus the first equation is converted to the form:

x = (sum p0j Tj) (sum PA T_A)

which when multiplied out will take on the form:

x = sum rA T_A

Thus x is a linear polynomial in the terms T_A.

Likewise, since the T_A are closed under products, all powers of x must
be expressible in this form. Therefore, each polynomal qi(x) must be
expressible in this form too:

qi(x) = sum (qi_A T_A)

where the q's are free of any of the T variables or x. Thus the second set of
equations become:
Ti = 1 + sum(qi_A Ti T_A)

which of the form
Ti = 1 + sum(ti_A T_A)

where the ti_A's are free of any occurrences of the T-variables or x. Since
the ti_A's are constructed ultimately from the subterms of the original
regular expression P(x) by a finite number of combinations involving
products, sums and stars, then each of the q's is a regular expression.

The equations for the other T_A's can be obtained by multiplying the
equations for the Ti's which are the components of T_A. The result will
be a set of equations of the form:

T_A = 1 + sum(tA_B T_B)

where the tA_B's are all regular expressions free of the T-variables and
of x. This is a right-linear system. Therefore its solution for the
variable T_A, and for x, must be regular expressions whose only variables
are the remaining variables from the set { xi: 1 = 1, ..., v } - {x}.

After substituting x in the remaining equations { xi = P(xi) }, then
repeat the process until all the equations have been eliminated. Then
after back-substituting, the result will be solutions for each of the
xi's that are regular expressions.

EXAMPLE: S = N V a + S P b
V = v N c
P = p N d
N = n e + N P f + y S g

Solving for V and P first is easiest since the right hand sides of their
equations don't involve either V or P:

V = c v N, P = d p N

Substituting in the remaining equations yields:

S = N c v N a + S d p N b
= a c v N^2 + b d p N S
and
N = n e + N d p N f + y S g
= n e + g y S + d f p N^2

Solving for N first yields:

N = (n e + g y S) (d f p (n e + g y S))*
or
N = N' (d f p N')*
where N' = n e + g y S

Solving for S yields:

S = a c v N^2 (b d p N)*

which, after substituting for N becomes:

S = a c v (N' (d f p N')*)^2 (b d p N' (d f p N')*)*
= a c v N' N' (d f p N')* (b d p N' (d f p N')*)*
= a c v N' N' (b d p N' + d f p N')*

by Theorem 4. Thus

S = a c v N'^2 ((b + f) d p N')*

which is equivalent to the equation:

S = a c v N'^2 + (b + f) d p S N'
= a c v (n e)^2 + a c v (n e) (g y) S + a c v (g y)^2 S^2
+ (b + f) d p n e S + (b + f) d p g y S^2

= (a c e^2 n^2 v) + (a c g v y + (b + f) d p) e n S
+ (a c g^2 v + (b + f) d p) y^2 S^2

whose solution by Theorem 5 is:

S = aceennv
(acgvyen + bdeny + dfeny + aceeinnvyy (acggv + bdp+fdp))*

which happens to be a linear set.

0 new messages