Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.

Dismiss

CEX - Context-Free Expression Filter

10 views

Skip to first unread message

Rock Brentwood

unread,

Apr 4, 2021, 5:43:45 PM4/4/21

This is an update and conversion to UTF-8 UNICODE of the 2007 comp.compilers article at:

CEX - Context-Free Expression Filter
2007 February 25

https://compilers.iecc.com/comparch/article/07-02-067

In 1993, the regular expression filter REX was introduced. Notable amongst its features was that processing was carried out entirely within an algebraic paradigm. The utility and the algorithms underlying it have eventually found their way into other applications (e.g. XML-Query).

The latest revision is on GitHub at:
https://github.com/RockBrentwood/RegEx
and is now in active maintenance with expansion in the works to a much larger formal language theory library that will subsume everything in POSIX related to regular expressions or grammars.

The following is a preliminary description of a planned utility called CEX; and of the syntax, algebra and language underlying the utility.

CEX is the long-planned successor to REX. It will not handle the full range of REX functionality, but will process language expressions at the next stage up the Chomsky hierarchy: Context-Free Expressions.

The Untold Story of Formal Languages:
http://federation.g3z.com/CompSci/index.htm#Untold
Currently linked at:
https://web.archive.org/web/20080106150352/http://federation.g3z.com/CompSci/index.htm#Untold
Descriptions only are available, not the PDF's; this is largely subsumed by:

The numerous other references that may be found on this page, both above and below this link have been subsumed both by the following and publications recent and pending listed in the references below

The Algebraic Approach II-V
https://www.scribd.com/document/235317956/The-Algebraic-Approach-II-V

The new framework for context-free expressions, which is part of a new algebraic foundation for formal language and automata theory laid out in recent publications [2,3,4,5,6] and elsewhere, is a descendant of the algebraic formulation published in 1963 by Chomsky and Schützenberger [1].

A follow-up note to the 2018 papers that I sent both directly to Chomsky and as an open letter on the recent developments is here:

https://www.scribd.com/document/392094748/The-evolution-of-recent-developments-in-formal-language-theory-from-the-1960-s

An early draft of the presentation for [3] may be found here:

Co Equalizers and Tensor Products for Idempotent Semirings
https://www.scribd.com/document/392094657/Co-Equalizers-and-Tensor-Products-for-Idempotent-Semirings

References:
[1] Chomsky, N., Schützenberger, M.: "The algebraic theory of context
free languages". In: Braffort, P., Hirschberg, D. (eds.) Computer
Programming and Formal Systems, pp. 118–161. North-Holland, Amsterdam
(1963)
[2] H. Leiß et al: "C-dioids and μ-continuous Chomsky algebras". In:
Desharnais, J., et al. (eds.) RAMiCS 2018. LNCS, vol. 11194, pp.
21–36. Springer, Cham (2018)
[3] M. Hopkins et al: "Coequalizers and Tensor Products for Continuous
Idempotent Semirings". In: Desharnais, J., et al. (eds.) RAMiCS 2018.
LNCS, vol. 11194, pp. 37-52. Springer, Cham (2018)
[4] M.Hopkins: "The algebraic approach I: the algebraization of the
Chomsky hierarchy". In: Berghammer, R., Möller, B., Struth, G. (eds.)
RelMiCS 2008. LNCS, vol. 4988, pp. 155–172. Springer, Heidelberg
(2008).
[5] M. Hopkins. "The algebraic approach II: dioids, quantales and monads." In: Berghammer, R., Möller, B., Struth, G. (eds.)
RelMiCS 2008. LNCS, vol. 4988, pp. 173-190. Springer, Heidelberg
(2008).
[6] N.B.B. Grathwohl et al: "Infinitary axiomatization of the
equational theory of context-free languages". In: Baelde, D., Carayol,
A. (eds.) Fixed Points in Computer Science (FICS 2013). EPTCS, vol.
126, pp. 44–55 (2013)

1. Syntax
─────────
As mentioned in the supplementary material to the RegEx software, REX was originally meant to be the precursor to an even more expanded GREP superset that would also have a syntax for processing context-free expressions.
The biggest obstacle had been finding a suitable embodiment of the underlying algebra.

The following elements are novel to CEX:
∙ (x = e, f) ― substitution expression
∙ (x → e, f) ― a "rule" expression
∙ <x:e> ― least-fixed-point abstraction
∙ <e> ― expectation value
∙ <e| ― bra
∙ |e> ― ket
∙ [e] ― projection

In addition, the following standard regular expression operators are incorporated
∙ e* ― Kleene star
∙ e + f ― alteration
∙ e f ― concatenation
∙ 1 ― success
∙ 0 ― failure
∙ e+ = e e*
∙ e? = | e

The two central features of the context-free expression algebra are its fixed-point calculus and its use of the Bra-Ket algebra.
The latter algebra is defined by the identities
<x| |x> = 1; <x| |y> = 0, if x != y
as well as a "completeness" relation
|x0><x0| + ... + |xn><xn| = 1
summed over the entire set of bra-ket operator pairs.

An example of a CEX syntax context-free expression is the context-free language given by the grammar
S → a S b; S → c
which may be represented by any of the following
S → a S b, S → c, S
S → a S b + c, S
<S: a S b + c>
<(a <1|)* c (|1> b)*>
<0| (a <1|)* c (|1> b)* |0>.
Raise = a <1|, Lower = |1> b, <0| Raise* c Lower* |0>.

2. Semantics
────────────
The value of an expression is that obtained by first applying all the substitutions in its first.
Thus, the (x = e, f) is equal to the expression that results from substituting the expression e for all free occurrences of the variable x in the expression f.
Substitutions may also be done within the bra, ket and projection symbols.
Substitutions do not take place in contexts where the variable x is bound.
Binding contexts include the following
(x = e, f), (x → e, f), <x:e>

The remainder of the semantics is therefore defined, assuming the substitutions have all been carried out.

First, the value of the expectation <e> is the same as <u| e |u>, where the operator pair <u|, |u> is otherwise unused in e.
The expectation value is implicitly taken in the following contexts: * for the top-level expression * for the expression e in the rule (x → e, f).
Thus, (x → e, f) = (x → <e>, f).

A limited subset of the expression syntax may be applied within the bra, ket and projection operators.
Thus,
<e*| = <e|*, |e*> = |e>*
<e+f| = <e|+<f|, |e+f> = |e>+|f>
<ef| = <f|<e|, |ef> = |e>|f>
<e+| = <e|+, |e+> = |e>+
<e?| = <e|?, |e?> = |e>?
The symbols 0 and 1 do not carry the meaning of failure and success within a bra and ket, but are just used as ordinary labels.
Note the reversal of order in <ef|.
This is to ensure the property that
<e| <-> |e>
be obtainable from one another by the process of reversal, for arbitrary expressions e.

For the projections, the equivalences are
[e*] = [e?] = [] = 1
[e+] = [e]
[e,f] = [e+f] = [e] + [f]
[e* f] = [e? f] = [f]
[(e+) f] = [e f]
[(e f) g] = [e (f g)]
[(e + f) g] = [e g] + [f g]
[x e] = |x> [e] <x|,
where x is a label.

The operations a* and a+ are, respectively,
a* = <x : 1 + a x>; a+ = <x : a + a x>.

The least-fixed-point expression <x:e> is equivalent to the rule expression (x → e, x).
If the variable x does not occur freely in e, then <x:e> = <e>.

A "complete" rule expression is a subexpression e of the form
e = (x1 → e1, x2 → e2, ..., xn → en, f)
such that (1) e is not contained within a larger expression of the form (x0 → e0, e), and f is not a rule expression.
Its value is the equivalent of that obtained by collecting all the rules for a given variable and summing the expressions in the right-hand side.
Thus, for instance
(S → S T, T → x, S → 1, T → u S v, S b S)
is equivalent to
(S → S T + 1, T → x + u S v, S b S)

The resulting expression is then said to be in fixed-point form.
This can be equivalently expressed in terms of the least-fixed-point operator.
Thus, if
x0 → e0, x1 → e1, x2 → e2, ..., xn → en, f
is in fixed-point form, then its equivalent value will be
x0 = <x0 : e0>, (x1 → e1, x2 → e2, ..., xn → en, f).
The rules for fixed-point expressions commute.
Thus, for instance,
(S → S T + 1, T → x + u S v, S b S)
and
(T → x + u S v, S → S T + 1, S b S)
denote the same value.

Neither the least-fixed-point operator, nor the rule expression is taken as primitive.
Both may be reduced in terms of the bra-ket algebra.
The result is analogous to writing out a decimal expansion for a solution to a non-linear equation.

The processing may be done on a rule expression in its entirety, or on a fixed-point expression to eliminate one variable at a time.
A variety of methods may be employed.
The one described here may be taken as the normative reference.
This is also described in section 4 of the Untold Story Of Formal Languages reference.
Other methods exist.
Section 7 describes the resolution of the fixed-point operator.
The article "Algebraic LR Parsing" contains notes on implementing the LR paradigm in algebraic form.

For a given fixed point expression, let Q denote its set of variables, H(q) the expression on the right hand side of the rule for variable q, and S the main expression.
This process assumes that there are no further "rule expressions" contained in any H(q) or S (otherwise, we'd apply the reduction to these first).
Add in an auxillary variable s to denote the "main variable" of the system and set H(s) = S.

Each expression H(q) and S is written as the least-fixed point solution in its "main variable".
Thus, we have
H(q) = q_0
q_i → F_i
q_i → H^j_i q_j
where the F's are expressions free of the variables in Q and the H's are either variables in q or expressions free of variables from Q.

This system is converted as follows.
First, for the main variable q of the subsidary system, we have
q → q_0.
Second, for each rule of the form q_i → F_i, we write
q_i → F_i q_F.
Each rule of the form
q_i → e q_j
where e is free of variables is kept intact.
Finally, each rule of the form
q_i → r q_j
where r is another variable in Q is transformed into the pair of rules
q_i → <u| r_0
r_F → |u> q_j,
where <u| and |u> are a fresh operator pair.
For the main variable s, we write
s → <0| s_0
and
s_F → |0>.

The result is a system that is right-linear in its new set of variables.
The same process for converting a right-linear system to a regular expression can be used to find the actual context-free expression that results, though it will prove to be more useful to keep the system in right-linear form.

3. Processing Input: The Normal Form Theorem, Bra-Ket Reduction & Shortest-Paths
────────────────────────────────────────────────────────────────────────────────
I won't go into too much detail here, other than to mention the main features of the processing algorithm.
Given a context-free expression e (or its equivalent right-linear system), and an input x, there is a linear-time algorithm that will find the intersection of e and {x}.

The right-linear system corresponding to this result is a context-free expression either for the set {x} or {}.
The latter case occurs if the word x is not a member of the language given by e.
Once the intersection, say e_x, is found, the symbols comprising the word x can be erased (i.e. mapped to 1).
As a result, e' becomes a pure operator expression.

A context-free expression e in right-linear form can be written as a system in matrix-vector form
e → S Q
Q → C Q
Q → U_i <i| Q
Q → |i> V_i Q
Q → F
where C, U_i, V_i are matrices with coefficients consisting of 0's and input symbols; S is a row vector, Q and F are column vectors.
The least fixed-point of this solution is
e = S (U + V + C)* F
where
U = ∑ U_i <i|
V = ∑ |i> V_i.
The normal form theorem states that this can be expressed equivalently as
e = S (N V)* N (U N)* F
where N is a matrix of context-free expressions that comprise the least fixed-point solution to
N → (∑ (U_i N V_i) + C) N + I.

In the special case, where the expression e' is a pure operator expression, the C matrix is 0 and the result reduces to the equivalent of a "shortest paths" problem.
This, of course, leads directly to the equivalent of Valiant's O(n^(log 7)) context-free recognition algorithm.

0 new messages