
TeX syntax?


Russell Shaw

Feb 8, 2007, 1:24:44 PM
Hi,
I've looked high and low without success. Where can I find
something resembling the BNF of Knuth's TeX typesetting syntax?

What symbols are fundamental, and what ones are derived?

Knuth's TeX book mumbles something about "registers" that
hold things like page numbers or whatever. Where is there
a complete list of these registers, their uses, and limitations?

Knuth's TeX book is an abomination, describing lexing and parsing
as mouth, gullet and stomach nonsense.
[Well, he invented most of what we know about parsing, he gets to
explain it any way he wants. Chapters 7 and 8 describe the syntax
operationally. -John]

glen herrmannsfeldt

Feb 9, 2007, 9:00:30 AM
Russell Shaw wrote:

> I've looked high and low without success. Where can I find
> something resembling the BNF of Knuth's TeX typesetting syntax?

Most of what is described in the TeXbook is plain TeX, which uses many
macros defined in plain.tex. I believe the book describes as primitives
those that are actually built into the processor before any macros are
loaded (virtex).
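
One can ask TeX itself which is which (a small illustrative check):

\show\hskip   % a primitive: TeX responds "> \hskip=\hskip."
\show\TeX     % a plain.tex macro: TeX shows its replacement text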

> What symbols are fundamental, and what ones are derived?

As I mentioned earlier today in another newsgroup, TeX input can't
really be compiled. Among the things that can be changed are which
characters are letters and which are not, and that can be changed
just about up to the point where the characters are read in and
processed. You can't even say what a symbol is until you have
described which characters are letters.
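
For instance (a small plain TeX sketch; \a is just an arbitrary name):

% By default, digits are not letters, so \a1 scans as the control
% word \a followed by the character 1.
\catcode`\1=11   % make the character "1" a letter from here on
% ...and the very same input \a1 now scans as one control word, \a1.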

> Knuth's TeX book mumbles something about "registers" that hold things
> like page numbers or whatever. Where is there a complete list of
> these registers, uses, and limitations?

I believe the fundamental integer registers are \count0 through
\count255, but most that are actually used are defined through macros.
The TeXbook describes them pretty well.
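
For instance (plain TeX; \chapterno is just a made-up name):

\count0=7                 % \count0 holds the page number in plain TeX
\newcount\chapterno       % \newcount (a plain.tex macro) allocates a free register
\chapterno=3
\advance\chapterno by 1   % \chapterno is now 4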

> Knuth's TeX book is an abomination, describing lexing and parsing as
> mouth, gullet and stomach nonsense.

That is pretty important, as in some cases macros change things
just before they are used. If you get it wrong, they are changed
too late. Consider \def\x{\y}\x: is \x defined before it is
expanded, or not? How about \def\x{\y} \x ?
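
A small plain TeX demonstration of that timing:

\def\y{old}
\def\x{\y}    % the body of \x is the single unexpanded token \y
\def\y{new}
\x            % typesets "new": \y is only looked up when \x is expanded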

In some places white space is significant, and in others it is not.
That is not going to be easy to capture in BNF.

-- glen

A Johnstone

Feb 9, 2007, 9:02:31 AM
Russell Shaw <rjs...@netspace.net.au> wrote:
> Hi,
> I've looked high and low without success. Where can I find
> something resembling the BNF of Knuth's TeX typesetting syntax?

Well, the stuff is in the TeXbook, but for most practical applications
that won't get you very far because TeX is a macro-processing
language, which has a tendency to blur (in practice although not in
theory) the syntax and the semantics of the language. Typical
user-visible features in plain TeX or LaTeX involve many layers of macros.
When I try to parse LaTeX I usually just write a crude parser that
recognises LaTeX's special characters and gobbles up {} and []
delimited parameters. The BNF is trivial.
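
Roughly (a sketch only; the names are illustrative):

  document ::= item*
  item     ::= command | group | optional | text-char
  command  ::= "\" name
  group    ::= "{" item* "}"
  optional ::= "[" item* "]"

with the argument structure of each command (how many groups and
optionals it takes) looked up in a table rather than encoded in the
grammar.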

Adrian

Allan Adler

Feb 9, 2007, 9:48:51 PM
glen herrmannsfeldt <g...@ugcs.caltech.edu> writes:

> Russell Shaw wrote:
> > Knuth's TeX book mumbles something about "registers" that hold things
> > like page numbers or whatever. Where is there a complete list of
> > these registers, uses, and limitations?
>
> I believe the fundamental integer registers are \count0 through
> \count255, but most that are actually used are defined through macros.
> The TeXbook describes them pretty well.
>
> > Knuth's TeX book is an abomination, describing lexing and parsing as
> > mouth, gullet and stomach nonsense.
>
> That is pretty important, as in some cases macros change things
> just before they are used. If you get it wrong, they are changed
> too late. Consider \def\x{\y}\x is x defined before it is
> expanded, or not? How about \def\x{\y} \x ?

Knuth's TeX Book is a manual for using TeX. It is a wonderful book.
It is volume A of a 5 volume work whose volumes are:
A. The TeX Book
B. TeX the Program
C. The METAFONT Book
D. METAFONT the Program
E. Computer Modern Typefaces

Volumes A and C are manuals for TeX and METAFONT, respectively. Volumes
B and D are literate documentation of these two programs. TeX and METAFONT
were written in Knuth's WEB system of literate programming. The sources
were therefore WEB files. A WEB file can be processed in two ways. One
processor for a WEB file is called weave. When one applies weave to a
WEB file, one gets a TeX file which is a literate account of the workings
of the program in all detail. When one applies tangle to a WEB file, one
gets a Pascal program, namely the program being documented by the TeX
file. One gets the program to work by running the Pascal source through
a Pascal compiler. (Of course, to get around the limitations of Pascal,
one can instead run the Pascal program through p2c to convert it to a C
program and then compile the C program.)

The book "TeX the Program" is the result of taking the WEB source for
TeX, running it through weave, and then running the resulting TeX file
through TeX. This method of documenting programs is so good that the
resulting documentation is publishable, in this case as the book, "TeX
the Program". Similarly, the book "METAFONT the Program" is the result
of giving the WEB source of METAFONT the same treatment.

So, to make a long story short, if you want details about the actual
workings of TeX and METAFONT, at a level that one can't really expect
a manual such as The TeX Book or The METAFONT Book to provide, you should
look at volumes B and D of Knuth's 5 volume masterpiece.

Volume E explains how all of Knuth's Computer Modern fonts were designed,
in case you want to learn the art of designing professional-quality fonts.

When they first appeared in the 1980's, I purchased all 5 volumes for $150
and I have never regretted it, nor have I ever stopped learning from them.

If you want to learn more about the concept of literate programming, Knuth
published an article about it and I think there is also a book on it. You
can also follow the newsgroup comp.programming.literate. There are now many
programs for literate documentation of programs. For C programs, Knuth and
Levy developed CWEB, with processors cweave and ctangle instead of weave and
tangle, and these programs have themselves been literately documented using
CWEB. One of the more popular literate programming programs was developed by
Norman Ramsey and is called noweb. It can be used to document just about any
programming language but loses the automatic formatting of code that one gets
with CWEB, although some people have developed packages to compensate for this.
--
Ignorantly,
Allan Adler <a...@zurich.csail.mit.edu>

Philipp Lucas

Feb 12, 2007, 12:32:38 PM
On Thu, 8 Feb 2007, Russell Shaw wrote:

> Where can I find something resembling the BNF of Knuth's TeX typesetting
> syntax?
>

> Knuth's TeX book is an abomination, describing lexing and parsing as
> mouth, gullet and stomach nonsense.

You have already been referred to "TeX the Program" for a detailed
explanation of the internal workings. Victor Eijkhout's "TeX by Topic"
(<http://www.eijkhout.net/tbt/>) provides another, quite readable,
introduction to TeX's parsing process (see Chapters 1--3 and 11--14).

--
Philipp Lucas
phl...@f-m.fm

Joseph H Allen

Feb 16, 2007, 2:05:20 AM
Allan Adler <a...@nestle.csail.mit.edu> wrote:
>glen herrmannsfeldt <g...@ugcs.caltech.edu> writes:

>> > Knuth's TeX book is an abomination, describing lexing and parsing as
>> > mouth, gullet and stomach nonsense.

>Knuth's TeX Book is a manual for using TeX. It is a wonderful book.


>It is volume A of a 5 volume work whose volumes are:
>A. The TeX Book
>B. TeX the Program
>C. The METAFONT Book
>D. METAFONT the Program
>E. Computer Modern Typefaces

I had no patience for any of these books. On the other hand, The Art of
Computer Programming is excellent.

Anyway, I learned TeX with these:

Raymond Seroul's and Silvio Levy's "A Beginner's Book of TeX"
(ISBN 0-387-97562-4 and 3-540-97562-4) and David Salomon's "The
Advanced TeXbook" (ISBN 0-387-94556-3).

I made my own preprocessor/format for TeX based on these books. I
remember playing a lot of games with the macro name space. For
example, only a limited character set can be used in macro names, but
I wanted to use any character. I ended up escaping: AA maps to A, AB
maps to backslash, AC to {, etc.

--
/* jha...@world.std.com AB1GO */ /* Joseph H. Allen */
int a[1817];main(z,p,q,r){for(p=80;q+p-80;p-=2*a[p])for(z=9;z--;)q=3&(r=time(0)
+r*57)/7,q=q?q-1?q-2?1-p%79?-1:0:p%79-77?1:0:p<1659?79:0:p>158?-79:0,q?!a[p+q*2
]?a[p+=a[p+=q]=q]=q:0:0;for(;q++-1817;)printf(q%79?"%c":"%c\n"," #"[!a[q-1]]);}

glen herrmannsfeldt

Feb 16, 2007, 11:43:38 PM
Joseph H Allen wrote:
(snip)

> I made my own preprocessor/format for TeX based on these books. I
> remember playing a lot of games with the macro name space. For
> example, only a limited character set can be used in macro names, but
> I wanted to use any character. I ended up escaping: AA maps to A, AB
> maps to backslash, AC to {, etc.

Just about any character can be used in macro names if one really
wants to do it. LaTeX uses @ in many of its internal names, by
first changing the catcode of @ to letter, then defining macros
with @ in the names, or references to those macros. After
defining all the internal macros, it changes @ back to other.

You can also change the escape character used for macro references
from \ to something else. I believe catcode is described fairly
early in the TeXbook.
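
For example (plain TeX; this is roughly what LaTeX's \makeatletter and
\makeatother do with @):

\catcode`\@=11           % make @ a letter
\def\my@internal{guts}   % @ may now appear in control-word names
\catcode`\@=12           % back to "other": \my@internal can no longer
                         % be typed as a single name

\catcode`\|=0            % the escape character itself can be changed:
|def|foo{bar}            % | now introduces control sequences, as \ did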

-- glen

Joseph H Allen

Feb 25, 2007, 12:46:48 PM
glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:
>Joseph H Allen wrote:

>> I made my own preprocessor/format for TeX based on these books. I
>> remember playing a lot of games with the macro name space. For
>> example, only a limited character set can be used in macro names, but
>> I wanted to use any character. I ended up escaping: AA maps to A, AB
>> maps to backslash, AC to {, etc.

>Just about any character can be used in macro names if one really
>wants to do it. LaTeX uses @ in many of its internal names, by
>first changing the catcode of @ to letter, then defining macros
>with @ in the names, or references to those macros. After
>defining all the internal macros, it changes @ back to other.

This is getting off-topic, but may be useful information for those
wanting to use TeX as a back end - certainly there is a good
possibility that compiler writers would be consulted for such a task
:-)

I wanted to allow any user-defined string in a macro name, so it was
not practical to play evil catcode games. Incidentally, this is for
writing cross-reference data as a set of macro definitions to an
auxiliary file.

I was writing the pre-processor, so it was very easy to just remap
characters in the way I describe above. Later, I learned a more
conventional way to do this using \meaning (which is more or less how
\label{} and \ref{} in LaTeX work):

Create a macro called "\a \_$" without messing with the catcodes:

\expandafter\def\csname \meaning a\meaning \meaning \\meaning _\meaning $\endcsname{test}

Expand it:

This is a \csname \meaning a\meaning \meaning \\meaning _\meaning $\endcsname.

The TeX Book would say something like:

Exercise for the reader: create macro \label{foo_} which
writes "\expandafter\def\csname \meaning f\meaning o\meaning o\meaning _\endcsname{12}"
(where 12 is the current page number) to the .aux file.

Also write the matching \ref{}.

Extra points: have \meaning appear only on characters which are not letters.
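
For orientation, here is the \csname/\write skeleton that such a
\label/\ref pair sits on. This is only a rough plain TeX sketch: it
skips the \meaning escaping described above as well as the
deferred-\write subtlety LaTeX uses to get the shipped-out page
number, and the names (\auxfile, lbl:) are just illustrative.

\newwrite\auxfile
\immediate\openout\auxfile=\jobname.aux

% \label{key}: writes "\expandafter\gdef\csname lbl:key\endcsname{<page>}"
\def\label#1{\immediate\write\auxfile{%
  \string\expandafter\string\gdef
  \string\csname\space lbl:#1\string\endcsname{\the\pageno}}}

% \ref{key}: expands to the recorded value once the .aux file is \input again
\def\ref#1{\csname lbl:#1\endcsname}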

Jim Hill

Feb 25, 2007, 12:49:39 PM
Russell Shaw asked for:

> something resembling the BNF of Knuth's TeX typesetting syntax?
> What symbols are fundamental, and what ones are derived?

It doesn't really parse at all, and its lexical entities are defined at
runtime by the very input it is lexing: spellings and boundaries are
defined, and quite commonly redefined several times (even for
fundamental symbols), in the input files themselves.

Most people would agree these would be bad things in a programming language.

Rock Brentwood

Apr 4, 2021, 9:12:06 PM
[ This is a followup to a thread from 2007. ]


>I've looked high and low without success. Where can I find
>something resembling the BNF of Knuth's TeX typesetting syntax?

It's in the file weave.web, section 14.

The syntax for TeX was written as a context-sensitive translation
grammar, suitable for a streaming translator, rather than a
context-free grammar. It may be possible to convert it to one (either
directly or as a context-free enveloping grammar with semantic
constraints). That's a matter that may be worth looking into. But in
its present form, there is no tree-building required or involved: it
can stream. The distinction is analogous to that between SSDTs and
SDTs ([S]SDT = [simple-]syntax-directed translation). SSDTs can be
streamed; SDTs require stacking or treeing values and, in effect, SDT
= SSDT + value-stacking/treeing.

TeX is written in WEB, which is essentially Pascal + hyper-linked
comments. It is also available as CWEB on the main TeX distribution
site, which is C + hyper-linked comments. Both *can* be converted
directly into more normal programs with the comments embedded. I did
so in the local versions of my older MiKTeX distribution, but haven't
regression-tested it yet, since I haven't yet established a working
baseline to work from.

The syntax is in, and is an essential part of, the weave.web file. In detail:
Section 14.1 describes the framework used for the syntax
Section 14.2 lists the "category codes" used
Section 14.3 lists additional post-processed lexical units used
Section 14.4 describes a processing layer from lexical units to "scraps"
Section 14.5 contains the productions for the context-sensitive grammar

Section 15 implements the parser, the most important routine being
translate_cases() (its name in the CWEB file), written as a master "switch"
statement (or "case" statement in Pascal) in section 15.7.

By the way, the "open" case (its subcases are in 15.19), "math" subcase
(its sub-subcases are in 15.20), "var_head" sub-subcase has a bug in
it: the "intro" sub-sub-subcase listed a transition to rule 31
instead of to rule 33. (I want my money, Knuth! :))

I believe it's possible to convert it all to a context-free grammar,
albeit with the possible inclusion of a few semantic constraints. Why
Knuth chose to write everything this way - as borderline obfuscated
code that cannot be scaled upwards or sideways or integrated into other
existing code - is beyond me. But it is not maintainable, and it is
heavily laden with technical debt; notably, its *critical* reliance on
the dated assumption that the future would be Pascal, along with all
the other assumptions and - more importantly - the now-unnecessary
restrictions that came out of that.

Much of the WEB framework's very conception and design was premised
on the assumed necessity of those restrictions; and the whole thing
can be done on a much simpler foundation when remade in more
up-to-date terms (relatively speaking), *natively* in C or C++. Among
other things, there isn't a need for any WEB-like framework: you can
simply use ordinary comments. I know, because I did so: I rewrote the
entire set of WEB files in my local copy doing just that. When a
baseline is established and regression-validated, I'll put a copy up
on GitHub.

A follow-up to the additional comments at the end of the article:
>Knuth's TeX book is an abomination, describing lexing and parsing
>as mouth, gullet and stomach nonsense.

I know. It's literally a woven and tangled mess - both the book and the code.

>[Well, he invented most of what we know about parsing, he gets to
>explain it any way he wants. Chapters 7 and 8 describe the syntax
>operationally. -John]

Discovery. Not invention. Mathematics is not invented, it is
discovered (and in this case: only a partial and incomplete discovery).

And that, too, is a complete tangle that we had to remake from the
bottom up. Recent publications [2-5] establish the foundations for
the new algebraic framework, and another paper, currently in
submission and possibly out in 2021, carries out the remaking alluded
to in [3] of the 1963 algebraic formulation by Chomsky and
Schützenberger [1] that lies at the foundation of all this. We're now
finally in a position to refactor both the theory itself and
everything that's based on it or is an application of it; literally
remaking the entire stack from the bottom up.

References:
[1] Chomsky, N., Schützenberger, M.: "The algebraic theory of context-free
languages". In: Braffort, P., Hirschberg, D. (eds.) Computer
Programming and Formal Systems, pp. 118–161. North-Holland, Amsterdam
(1963)
[2] H. Leiß et al.: "C-dioids and μ-continuous Chomsky algebras". In:
Desharnais, J., et al. (eds.) RAMiCS 2018. LNCS, vol. 11194, pp.
21–36. Springer, Cham (2018)
[3] M. Hopkins et al.: "Coequalizers and Tensor Products for Continuous
Idempotent Semirings". In: Desharnais, J., et al. (eds.) RAMiCS 2018.
LNCS, vol. 11194, pp. 37–52. Springer, Cham (2018)
[4] M. Hopkins: "The algebraic approach I: the algebraization of the
Chomsky hierarchy". In: Berghammer, R., Möller, B., Struth, G. (eds.)
RelMiCS 2008. LNCS, vol. 4988, pp. 155–172. Springer, Heidelberg
(2008)
[5] N.B.B. Grathwohl et al.: "Infinitary axiomatization of the
equational theory of context-free languages". In: Baelde, D., Carayol,
A. (eds.) Fixed Points in Computer Science (FICS 2013). EPTCS, vol.
126, pp. 44–55 (2013)

gah4

Apr 5, 2021, 10:55:35 AM
On Sunday, April 4, 2021 at 6:12:06 PM UTC-7, Rock Brentwood wrote:
> [ This is a followup to a thread from 2007. ]
> >I've looked high and low without success. Where can I find
> >something resembling the BNF of Knuth's TeX typesetting syntax?
> It's in the file weave.web, section 14.
>
> The syntax for TeX was written as a context-sensitive translation
> grammar, suitable for a streaming translator, rather than a
> context-free grammar. It may be possible to convert it to one (either
> directly or as a context-free enveloping grammar with semantic
> constraints). That's a matter that may be worth looking into. But in
> its present form, there is no tree-building required or involved: it
> can stream. The distinction is analogous to that between SSDT's versus
> SDT's ([S]SDT = [simple-]syntax-directed translations). SSDT's can be
> streamed, SDT's require stacking or treeing values and, in effect, SDT
> = SSDT + value-stacking/treeing.

> TeX is written in Web which is essentially Pascal + hyper-linked
> comments. It is also in C-Web on the main TeX distribution site, which
> is C + hyper-linked comments. They *can* be converted directly to more
> normal programs with the comments embedded. I did so in the local
> versions of my older MiKTeX distribution, but haven't
> regression-tested it yet - since I haven't established a working
> baseline yet to work off of.

(snip)

It seems to me that macro processors were popular in the 1970's.

Some years ago, I used to work with Mortran, which is a macro processor
meant to process an improved Fortran-like language, which it then
converts into Fortran. The language is free-form with semicolon-terminated
statements, and with block-structured if-then-else and for loops.
It is all done with a relatively simple macro processor written in as
close to standard Fortran 66 as possible. The only extension needed
is the ability to compare values read in as characters.

The macro format is very simple, except that macros have the ability
to define other macros. Then, as is usual with compilers, the whole
processor is written in Mortran and translated to Fortran.

So, TeX is also written as a macro processor, with relatively few
built-in operations, and many more defined through the macro facility.

The input sequences recognized by macros likely can't be captured by a
general grammar: a macro can specify just about any combination of
characters to appear between its arguments. Not many macros use that,
but they can.

There are, then, three ways to name macros. Most common is the
escape character, usually \, followed by some number of letters.
They can also be the escape character followed by one non-letter.
Which characters count as letters can be changed at any time, but
most of the time, for user-level macros, letters are letters.

The third possibility is that a macro can be any one character with
the catcode of active. No escape character is needed.
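
For instance (plain TeX; the replacement texts are arbitrary):

\def\foo{A}      % escape character followed by one or more letters
\def\?{B}        % escape character followed by a single non-letter
\catcode`\!=13   % 13 = "active": the character acts as a macro by itself...
\def!{C}         % ...so ! now expands with no escape character at all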

It seems that there are now some programs that accept the TeX
way of writing mathematical equations, including the standard
macros for special characters. But not the general ability to
write macros, such as defining new characters or renaming
old ones, or any of the more strange things you can do. I suppose
BNF can be written for that, but likely not for the more general
syntax.
[The m4 macro processor tokenizes its input and then checks to see
if a token is a named macro. It's still used for autoconf
configuration scripts. -John]

gah4

Apr 6, 2021, 1:38:37 PM
On Monday, April 5, 2021 at 7:55:35 AM UTC-7, gah4 wrote:

(snip, John wrote)
> [The m4 macro processor tokenizes its input and then checks to see
> if a token is a named macro. It's still used for autoconf
> configuration scripts. -John]

I know about m4, but have managed never to use it.

Things that I might have done with it, I usually do with awk.

Continuing with TeX syntax: white space is ignored after a macro named
with letters (but not after the other kinds). That is useful if you use a
macro before actual text words. It also means that you can ignore newlines
that come after a macro named with letters, but not the other kinds. You
can make TeX ignore a newline otherwise by ending the line with a comment,
introduced (usually) by %.

(As with all characters, you can change which character has the comment
catcode, but it is usually %.)
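
For example (plain TeX; \ny is just an illustrative name):

\def\ny{New York}
\ny is big     % gives "New Yorkis big": the space after \ny is skipped
\ny{} is big   % gives "New York is big"
two%
lines          % gives "twolines": the % hid the end of the first line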

One other funny thing, though. TeX has glue specifications, such that
one might say:

\hskip 1cm plus 0.5cm minus 0.5cm

for horizontal space that can stretch or shrink in line justification.

The plus and minus parts are optional. Often macros expand to glue:

\def\somespace{\hskip 1cm}

and I actually knew someone who used such a macro followed by the word
"plus" that wasn't part of a glue specification, resulting in a very
strange error message.

I suppose one should put a \relax (the no-op primitive) after every glue,
but that is pretty rare. What is the chance that one might actually
say 'plus' or 'minus' at that point?
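
For example (plain TeX):

\def\somespace{\hskip 1cm\relax}  % the \relax stops the glue scanner
Prices are \somespace plus tax.   % "plus" is typeset, not read as stretch
% Without the \relax, TeX would try to read "plus tax" as a stretch
% component and complain about a missing number.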

But white space that is sometimes significant and sometimes not is a
real complication in actual use, and will also complicate any BNF
representation.

Rock Brentwood

Apr 9, 2021, 7:20:27 PM
I remade {tangle,weave}.web and c{tangle,weave}.w, both in more C-like form as
ordinary contiguous code+comments (+ Unicode, because I'd like to keep some of
the original spirit of making the program+documentation more
math-/typeset-friendly) ... and might post what I have on GitHub or somewhere,
even though it hasn't yet been baseline tested. Among other things, I want to
remake my C compiler to directly accept UNICODE for the operators. The
glossary I use is (↑,↓,→,≠,≡,≤,≥,∷,∧,∨,¬) for
(++,--,->,!=,==,<=,>=,::,&&,||,!), with the latter treated as legacy digraph
forms of the former. I'll post a note on the link, when I get it up. The
context-sensitive grammar is a switch statement that runs up to 4 levels deep.
It *might* be yacc-convertible, as is, since some versions of yacc may allow
for stack-hacking, e.g. an $-1 argument in a rule.

On Monday, April 5, 2021 at 9:55:35 AM UTC-5, gah4 wrote:
> It seems to me that macro processors were popular in the 1970's.
>
> Some years ago, I used to work with Mortran, which is a macro processor
> meant to be used to process an improved language like Fortran, which is
> then converted into Fortran...

You hit the nail on the head here.

Partially in defense of Knuth, it looks like what he was trying to do (in the
background of all this) was establish an entirely new framework for parsing,
one that got out from under the deficiencies of LR and that was (a) streaming
and (b) potentially even parallelizable. By "streaming" I mean that you can
literally start and end a partial parse in mid-stream and *still* get
meaningful results, even if the chunk that was processed doesn't form any kind
of grammatical unit at all ... which has obvious utility for intelligently
processing macros.

That's CYK-on-steroids.

The lack of such a facility also stood as a roadblock to making cfront
directly into a C++ compiler. The old cfront code (e.g. the first version,
release E, which we transcribed and which is currently here

http://www.softwarepreservation.org/projects/c_plus_plus/index.html#cfront
...

an updated/corrected version will be out on GitHub hopefully soon)
shows indications that Stroustrup was trying to make it into a *direct*
C++-to-C processor. He had an option in the build script in release E or the
other early releases to allow for this, but never implemented it. The macro
layer stands in the way of everything. So, instead, he set up the C++-to-C
conversion as a Rube Goldberg device that got rid of macros first ... thus
totally destroying the transparency of the translation process; when what we
actually wanted was a translation that kept the macros intact but
intelligently processed them.

(His original aim is much more doable now, by the way, than in the 1980's,
because the *forward* advancement of C from the 1980's to C99 actually entails
a dramatic *simplification* of what's needed for a cfront-like utility, and
much of what went into the original is no longer needed on that account. You
can almost even write such a utility as a simple editor script now! ... Except
for exception handling. Most people forget that translation is a *contravariant*
function of the source language - it simplifies when the source language grows
more complex. So things become *more* doable, not less!)

At the time of LR's inception, you had two major results which were never fully
developed or exploited: (a) the Chomsky-Schützenberger Theorem (CST), which
*very nearly* gives you the ability to make a 100% streaming representation for
translators, with one "frame" per word -- except the CST "kernels" have word
boundary markers; and (b) its successor of sorts, Greibach's "Hardest CFL"
construction, which successfully incorporates word-boundary markers into the
word frames.

I'll show you how it can be done, now, with the newer formalism for
context-free expressions.

This:
L → α | L S β,
S → x γ | u L v δ,
S w L ω
is a translation grammar containing input words from the set {u,v,w,x}, output
actions from the set {α,β,γ,δ,ω} which have an obvious interpretation in
terms of "reduce" (α,β,γ,δ) and "accept" (ω) actions; plus a starting
configuration S w L ω. That's typical of what yacc assumes, except for the
loosening of restrictions on what's allowed for the "start".

It is possible, by the way, to hack yacc to allow for *any* rule body under
%start and to allow "%start" to be issued anywhere, and multiple times; by
just treating each "%start α" declaration as being synonymous with a hidden
rule of the form "$accept: α $end;" ... because that's actually already how
it is processed and even displayed under the "-v" option in some
implementations. With the loosening, you can even add in start actions under
%start (thereby eliminating the need for some of bison's kludges).

This translation grammar is in "canonical bottom-up" form - all actions occur
at the end, no in-between actions. (In yacc, in-between actions are hacked by
turning them into actions for special single-use non-terminals that have empty
transitions ... which you can see on display when you run the "-v" option on a
source file containing rules with in-between actions.)

The translation grammar has this as its CST-kernel:
<0| (u<u| + v<v| + w<w| + x<x| + (α + |S>|L>β)<L| + (|x>γ +
|v>|L>|u>δ)<S|)* |L>|w>|S>|0> ω

In the 1960's version of the theorem, the labelled brackets <u|,<v|,...;
|u>,|v>,... were interpreted as just extra word symbols (and actually Chomsky
& Schützenberger replaced words with brackets too). In the newer algebraic
form ... which may finally see publication soon this year ... the brackets
form an algebra with <i| |j> = 0 if i ≠ j, <i| |i> = 1, which commute with
the other symbols. In addition, for translator grammars, input and output
symbols commute (e.g. αx = xα, the conversion of αx to xα is, itself, the
very essence of a "look-ahead"). One can also conceive of cases where there
are *multiple* input and output channels, each having its own alphabet, with
the different channels' alphabets commuting with one another.

The kernel has the form
I (X + YR)* YF
with
I = <0|
X = u<u| + v<v| + w<w| + x<x|
YR = (α + |S>|L>β)<L| + (|x>γ + |v>|L>|u>δ)<S|
YF = |L>|w>|S>|0> ω
and, in fact, YR and YF both factor into
Y = α<α| + β<β| + γ<γ| + δ<δ| + ω<ω|
R = (|α> + |β>|S>|L>)<L| + (|γ>|x> + |δ>|v>|L>|u>)<S|
F = |ω>|L>|w>|S>|0>
and the original kernel is actually equal to I (X + Y + R)* F ... a form which
is suited for translation grammars that are *not* in bottom-up canonical form.
The reduction I (X + Y + R)* F = I (X + YR)* YF only applies if the
translation grammar is in canonical form.

By Kleene algebra, the kernel is equal to: I (YR)* (X (YR)*)* YF which
contains the "renormalized" forms of:
Start action: I' = I (YR)* = <0| (YR)*
Inputs: X' = X (YR)* = Σ_x x p'(x), composed of
Word frames: p'(x) = <x| (YR)*
Final action: F' = F = |L>|w>|S>|0>ω

That's a form suitable for both (a) mid-stream parsing, e.g. a word "vxu"
would be processed as p'(v) p'(x) p'(u) ... for instance, if vxu were a macro
body; and (b) a parallelized form of CYK. Except: instead of assembling chunks
bottom-up in dynamic programming from length 1, then length 2, then length 3,
etc. in a "CYK pyramid", you could do it both in parallel *and* exponentially,
from length 1, length 2, length 4, length 8, etc.

So, there are your "word frames" and you have a means to produce both
in-stream and even parallel translators ... which is ideally suited also for
intelligently handling macros.

It is also suitable for both deterministic and non-deterministic parsing; and
it ultimately subsumes the stack-based framework - devised originally by
Tomita - for GLR parsing; as well as generalizing to translation grammars
which are not canonical.

The role that Knuth's LR theory plays is that it provides a means to construct
a shift-reduce parser. Seen in this context, it's easy to explain both what it
is and does. A shift-reduce parser provides a set of transitions z: q→q'
between states q, q' on a [non-]terminal z, which generates a morphism
<z| ↦ p(z) ≡ Σ_{z:q→q'} |q><q|<q'| for all [non-]terminals z; and <0| ↦ <0|
|z> ↦ q(z) ≡ Σ_{z:q→q'} |q'>|q><q| for all [non-]terminals z; and |0> ↦ |0>
that (a) yields the same result when substituted into the CST kernel and (b)
provides for a more nearly-deterministic push-down transducer when the bracket
symbols are subject to the obvious stack interpretation with |0><0| playing
the role of the "empty stack" projection operator.

It can be written more succinctly in graphical form by defining [q]
≡ |q><q|, q⇐ ≡ |q>, ⇒q ≡ <q|, [q,⋯,q'] = [q] + ⋯ + [q'] (with
other comma-separated subgraphs understood to be added, e.g.
[0]⇒2,[1]⇒3 = |0><0|<2| + |1><1|<3|). In that case, a rule such as S → u
L v corresponds to |v>|L>|u><S|, which translates into q(v)q(L)q(u)p(S), which
takes the form of a sum of graphs that looks like q⇐q⇐q⇐[q]⇒q.

That's the expression for the "roll-back" action. The absence of the explicit
account of the roll-back actions from LR theory is the chief deficiency of the
entire formalism! It's normally imposed externally, either as a narrative
description or else (in an implementation) as a cookie-cutter code skeleton
(like in BSD yacc) or in the system library (like UNIX yacc) or
user-specifiable (like in bison).

The rightward-pointing segment on the left corresponds to what is actually
called a "lane" in LR theory, while [q] is called a "laneahead" state. You can
read the LALR "lookaheads" directly off the graph - they are whatever lookaheads
the states q to the right of the laneahead state have.

The current state of LR(k), for k > 0, by the way, runs pretty much like this.
Pager's classical theory, which came out shortly after Knuth's, has been
largely superseded by IELR (the IELR paper is here:
https://core.ac.uk/download/pdf/82047055.pdf), which is what bison uses; and to
a limited extent by Pager and Chen's newer formulation in "hyacc",
which better handles LR(1) and a limited subset of LR(k) for k > 1. Pager and
Chen, however, haven't figured out how to deal with "cyclic" lookaheads -
which I think correspond to the cases where the (YR)* roll-back expression has
star-height 1. Nor do I know how their newer method compares with IELR, other
than that IELR is limited to LR(1).

You might be able to handle the situation within the algebraic framework just
described here and, thereby, extend their approach to all LR(k) grammars for k
> 1.

gah4

Apr 9, 2021, 9:57:44 PM
On Friday, April 9, 2021 at 4:20:27 PM UTC-7, Rock Brentwood wrote:
> (His original aim is much more doable now, by the way, than in the 1980's,
> because the *forward* advancement of C from the 1980's to C99 actually entails
> a dramatic *simplification* of what's needed for a "cfront" like utility, and
> much of what went into the original is no longer needed, on that account. You
> can almost even write up a utility as a simple editor script, now! ... Except
> for exception-handling. Most people forget, translation is a *contravariant*
> function of source language - it simplifies when source language complexifies.
> So things become *more* doable, not less!)

I have recently been working with another macro processor from the 1970's,
which is written using a set of macros, and converted to standard Fortran 66.

Among others, the macros implement IF-THEN-ELSE structures and the usual
DO-WHILE and DO-UNTIL loops.

I recently realized, though haven't done it yet, that I can rewrite the macros
to generate Fortran 90 structured if and looping constructs.

I did that some years ago with the MORTRAN macros, which also are designed
to generate standard Fortran 66 code.

But yes, it is interesting to look at what can be done with old programs
and modern tools.