
Supporting multiple input syntaxes


luser droog
Aug 12, 2020, 6:32:56 PM
I've got my project successfully parsing the circa-1975 C syntax
from that old manual. I'd like to add parsers for K&R1 and c90
syntaxes.

How separate should these be? Should they be completely
separate grammars, or more piecewise selection?

My feeling is that separating them will be less of a headache, but maybe
there's some advantage to changing out smaller pieces of the grammar,
in that it might be easier to make sure that they produce the same
structure compatible with the backend.

Any guidance in this area?

https://github.com/luser-dr00g/pcomb/blob/master/pc9syn.c

[Really, it's up to you. My inclination would be to make them
separate but use some sort of macro setup so you can insert
common pieces into each of the grammars. -John]

Kaz Kylheku
Aug 13, 2020, 6:22:34 PM
On 2020-08-12, luser droog <mij...@yahoo.com.dmarc.email> wrote:
> I've got my project successfully parsing the circa-1975 C syntax
> from that old manual. I'd like to add parsers for K&R1 and c90
> syntaxes.
>
> How separate should these be? Should they be completely
> separate grammars, or more piecewise selection?
>
> My feeling is that separating them will be less of a headache, but maybe
> there's some advantage to changing out smaller pieces of the grammar,
> in that it might be easier to make sure that they produce the same
> structure compatible with the backend.
>
> Any guidance in this area?
>
> https://github.com/luser-dr00g/pcomb/blob/master/pc9syn.c

I'd say that since you're not using a parser generator, but using code
statements to construct the grammar objects at initialization time, you
have the flexibility to merge the implementations: you can check the
value of some dialect-selecting variable and construct the parser
accordingly, and elsewhere check that same variable to do whatever else
needs to be done conditionally.

The trick is to find a way to embed the *semantics* of the older dialects
into the new one, so that everything after the parsing can be shared.

Similar remarks would apply to recursive descent.
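
For instance, here is a minimal sketch of that construction-time switch,
assuming a hypothetical pcomb-like API (parser, seq, opt, term, and the
rule names are stand-ins, not the project's actual functions):

typedef struct parser *parser;     /* opaque grammar object              */
parser seq(parser a, parser b);    /* match a, then b                    */
parser opt(parser a);              /* match zero or one a                */
parser term(int c);                /* match one literal character        */
parser declarator(void);
parser initializer(void);

enum dialect { C75, KR1, C90 };
extern enum dialect dialect;       /* chosen before grammar construction */

parser declaration(void)
{
    /* The 1975 dialect wrote initializers without '='; K&R1 and C90
       require it.  The flag is consulted once, while the grammar
       objects are being built. */
    parser init = (dialect == C75)
                ? initializer()
                : seq(term('='), initializer());
    return seq(declarator(), opt(init));
}

The shared pieces (declarator, initializer) are written once; only the
glue consults the dialect flag.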

If you were using something clunky like a Yacc, there are still ways
to combine everything into a single grammar. The input stream can be
primed with one of several "secret internal token" objects that has no
lexeme. (Primed, meaning that the first call to the lexer yields this
secret token instead of processing actual input.) Each token indicates
a dialect to parse. The top-level grammar production can then pick
one of several subordinate production rules corresponding to the entry
points for the respective dialects. Those can then share common rules
as much as possible.

translation_unit : C75_TOKEN c75_translation_unit /* orig flavor */
                 | C79_TOKEN c79_translation_unit /* "K&R" */
                 | C90_TOKEN c90_translation_unit /* ANSI/ISO */
                 ;
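
A sketch of that priming, reusing the dialect tokens above; real_yylex()
is a hypothetical name for the ordinary scanner:

static int primed = C90_TOKEN;  /* dialect selected at startup, e.g. from argv */

int yylex(void)
{
    if (primed) {
        int t = primed;
        primed = 0;             /* yield the secret token exactly once */
        return t;
    }
    return real_yylex();        /* thereafter, defer to the real scanner */
}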

--
TXR Programming Language: http://nongnu.org/txr
Music DIY Mailing List: http://www.kylheku.com/diy
ADA MP-1 Mailing List: http://www.kylheku.com/mp1

Hans-Peter Diettrich
Aug 13, 2020, 6:22:51 PM
On 13.08.2020 at 00:20, luser droog wrote:
> I've got my project successfully parsing the circa-1975 C syntax
> from that old manual. I'd like to add parsers for K&R1 and c90
> syntaxes.
>
> How separate should these be? Should they be completely
> separate grammars, or more piecewise selection?

IMO this depends largely on the use made of the parser output (diagnostics,
backend...). C90 is much stricter than K&R and requires more checks. Do you
need extensive error diagnostics, or do you assume that all source code
is free of errors?


> https://github.com/luser-dr00g/pcomb/blob/master/pc9syn.c

You seem to be implementing an LL(1) parser? My C98 parser is LL(2), i.e. an
LL(1) parser with one or two locations where more lookahead is required.
Also, identifiers are classified as typenames or other identifiers prior
to their use.
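
The classic example of why that classification is needed in C (not
specific to any one parser; the token sequence is identical, and only the
status of the identifier decides the parse):

typedef int T1;                  /* T1 names a type    */
int T2;                          /* T2 names an object */

void f(int *p)
{
    int a = (T1)*p;              /* cast: *p converted to T1    */
    int b = (T2)*p;              /* multiplication: T2 times *p */
}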

For real-world testing (recommended!) a preprocessor is required, as well as
a copy of the standard libraries of existing compiler(s).

Your test_syntax() source is missing "=" in the variable declarations
(initializers). What about pointer syntax/semantics? If you add these
(and other) syntax differences conditionally (version-specific) to your
code, which way would look better to you? Which way will be safer to
maintain?


Nice code BTW :-)

DoDi

minf...@arcor.de
Aug 13, 2020, 6:24:36 PM
On Thursday, August 13, 2020 at 00:32:56 UTC+2, luser droog wrote:
> I've got my project successfully parsing the circa-1975 C syntax
> from that old manual. I'd like to add parsers for K&R1 and c90
> syntaxes.
>
> How separate should these be? Should they be completely
> separate grammars, or more piecewise selection? ...

Why not settle for one master dialect and use awk to translate between dialects?

[Probably because there is a great deal of C code written to comply with
the various versions of the standard, users want error messages that match
the code they wrote rather than some intermediate code, and it would be quite
an awk program that could reconcile all the differences among C flavors. -John]

luser droog
Aug 15, 2020, 10:44:35 AM
On Thursday, August 13, 2020 at 5:22:51 PM UTC-5, Hans-Peter Diettrich wrote:
> On 13.08.2020 at 00:20, luser droog wrote:
> > I've got my project successfully parsing the circa-1975 C syntax
> > from that old manual. I'd like to add parsers for K&R1 and c90
> > syntaxes.
> >
> > How separate should these be? Should they be completely
> > separate grammars, or more piecewise selection?
>
> IMO this depends largely on the use made of the parser output (diagnostics,
> backend...). C90 is much stricter than K&R and requires more checks. Do you
> need extensive error diagnostics, or do you assume that all source code
> is free of errors?
>
>
> > https://github.com/luser-dr00g/pcomb/blob/master/pc9syn.c
>
> You seem to be implementing an LL(1) parser? My C98 parser is LL(2), i.e. an
> LL(1) parser with one or two locations where more lookahead is required.
> Also, identifiers are classified as typenames or other identifiers prior
> to their use.
>

Yes, it's basically LL(1) with backtracking. There's one part of the
grammar I'm using that's left-recursive and I still need to work that
out.

> For real-world testing (recommended!) a preprocessor is required, as well as
> a copy of the standard libraries of existing compiler(s).
>
> Your test_syntax() source is missing "=" in the variable declarations
> (initializers). What about pointer syntax/semantics? If you add these
> (and other) syntax differences conditionally (version-specific) to your
> code, which way would look better to you? Which way will be safer to
> maintain?
>

That's actually correct for the 1975 dialect: no '=' to initialize
variables. I think it's pretty ugly without it, but the difference could
be normalized away in the AST anyway.

>
> Nice code BTW :-)
>

Thanks! I think I need to sidetrack a bit and work up some primitives
for pattern matching and decomposition to make the backend easier.
I'll report back if/when it can do more tricks.

luser droog
Aug 15, 2020, 10:45:06 AM
One of the possible goals for this project is exactly such a translator
that can downgrade or upgrade code from one standard version to another.

Another possible application is a source code formatter. Currently the
CST produced by the parser keeps all the original whitespace attached
to each token.

Christopher F Clark
Aug 15, 2020, 10:46:29 AM
We did something similar with Yacc++. We used inheritance of grammars
(a feature of our product) to do so. In fact, the point of the
exercise was to demonstrate that feature. I would presume something
similar would work in a hand-written recursive descent parser.

--
******************************************************************************
Chris Clark email: christoph...@compiler-resources.com
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris
------------------------------------------------------------------------------

David Lovemore
Aug 15, 2020, 10:46:47 AM
It may be useful to consider what you would like to happen if you encounter
syntax that is ambiguous, or that works differently, or that belongs to
another dialect than the one you are parsing: produce a warning or an error,
handle it quietly, fall over, or don't care.

luser droog
Aug 15, 2020, 7:00:33 PM
Very good points. The parser is backtracking, returning a list of
results. That could conceivably be useful for dealing with ambiguity by
looking at more than just the first result.

Warnings and error messages are going to be trickier, I fear. The parser
is built around the idea of monadic combinators, so my research suggests
that I'll need Monad Transformers to add the state needed for good
messages. There are a bunch of lectures I've found about using these
in Haskell, but not much about how to build them from scratch.

As it is, any error in parsing will just produce no results.

I started a prototype where the input list of characters was instead
a list of (character, line-number, line-position) triples. But it got really
confusing at the time, so I stopped. And the few times I've tried to
look at it since, I can't figure out what I was thinking.

David Lovemore
Aug 16, 2020, 11:53:24 AM
My friend, reporting the furthest position examined by the parser is
something I have found useful in error cases as a simple stop-gap when
using a combinator approach.

Thinking about it, you kind of want to see the furthest failed position and
the stack of rules above it. That requires meta-information when the code is
written in the most natural way. For this reason and others, I believe it is
good to represent your grammar in data structures, which is further in the
direction of a compiler-compiler tool (or compiler-interpreter tool).
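
A minimal sketch of that stop-gap in C, assuming a pcomb-style character
stream (the names here are illustrative, not pcomb's):

#include <stddef.h>

struct input { const char *s; size_t pos, len; };

static size_t furthest;          /* high-water mark of the scan */

/* Base parser: consume one character.  However much backtracking happens
   later, 'furthest' ends up at the deepest position any alternative
   examined, which is where the error message should point. */
static int item(struct input *in)
{
    if (in->pos > furthest)
        furthest = in->pos;
    if (in->pos >= in->len)
        return -1;               /* fail */
    return (unsigned char)in->s[in->pos++];
}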

luser droog
Aug 23, 2020, 2:39:30 PM
On Sunday, August 16, 2020 at 10:53:24 AM UTC-5, davidl...@gmail.com wrote:
> My friend, reporting the furthest position examined by the parser is
> something I have found useful in error cases as a simple stop-gap when
> using a combinator approach.
>
> Thinking about it, you kind of want to see the furthest failed position and
> the stack of rules above it. That requires meta-information when the code is
> written in the most natural way. For this reason and others, I believe it is
> good to represent your grammar in data structures, which is further in the
> direction of a compiler-compiler tool (or compiler-interpreter tool).

Thanks. I've done some further investigating. I built my parsers following
two papers: Hutton and Meijer, Monadic Parser Combinators
https://www.cs.nott.ac.uk/~pszgmh/monparsing.pdf
and Hutton, Higher-Order Functions for Parsing
https://pdfs.semanticscholar.org/6669/f223fba59edaeed7fabe02b667809a5744d9.pdf

The first adds error reporting using Monad Transformers. I'm thinking about
how to move in this direction, but first I'd need to reformulate the code to
make the monad more explicit. It should be something like an interface or
a mixin, like a base class with all virtual member functions. That could be
done by modelling my objects more like OO objects and having 'bind' and
'result' in a vtable in the Parser object.

But the second paper does it differently, and maybe in a way I can adopt
more easily. It redefines the parsers to no longer produce a list of results,
so there's no longer support for ambiguity. Then it defines them to
return a maybe,

maybe * ::= Fail [char] | Error [char] | OK *

where the OK branch has the parse tree, and Fail and Error both contain an error
message. It describes how a Fail can be transformed into an Error. But it isn't
entirely clear where the messages get injected.

Still need to do some thinking on it, but I think I can rewrite the parsers
to follow this model, and then decorate my grammar with possible errors
at each node.
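
For what it's worth, a sketch of that three-way result as a C tagged union
(the representation here is invented, not pcomb's):

enum tag { FAIL, ERROR, OK };

struct result {
    enum tag    tag;
    const char *msg;    /* for FAIL/ERROR: what was expected */
    void       *tree;   /* for OK: the parse result          */
};

/* Promote a recoverable Fail into a hard Error, freezing the message
   at the point where backtracking must stop (Hutton's 'nofail'). */
struct result nofail(struct result r)
{
    if (r.tag == FAIL)
        r.tag = ERROR;
    return r;
}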

Thanks for the encouragement. My classes start on Monday so I'm hoping
to accomplish something on this before then.

gah4
Aug 23, 2020, 11:13:45 PM
On Wednesday, August 12, 2020 at 3:32:56 PM UTC-7, luser droog wrote:
> I've got my project successfully parsing the circa-1975 C syntax
> from that old manual. I'd like to add parsers for K&R1 and c90
> syntaxes.

(snip)

I remember using a later C compiler that accepted
the older form of the assignment operators, such as =+, =-, etc.,
presumably for those with older code.

I then had to put an extra space when assigning negative values:

i= -4;

I don't remember now how I found out about that one.
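
Concretely (a reconstruction of the ambiguity, not from any particular
compiler):

i=-4;      /* an old lexer could scan '=-' as one token: i = i - 4 */
i= -4;     /* the space separates them: plain assignment of -4     */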

That is the only old C syntax I remember.
(And much later than 1975.)

luser droog
Aug 23, 2020, 11:15:06 PM
On Sunday, August 23, 2020 at 1:39:30 PM UTC-5, luser droog wrote:
> On Sunday, August 16, 2020 at 10:53:24 AM UTC-5, davidl...@gmail.com wrote:
> > My friend, reporting the furthest position examined by the parser is
> > something I have found useful in error cases as a simple stop-gap when
> > using a combinator approach.
> >
> > Thinking about it, you kind of want to see the furthest failed position and
> > the stack of rules above it. That requires meta-information when the code is
> > written in the most natural way. For this reason and others, I believe it is
> > good to represent your grammar in data structures, which is further in the
> > direction of a compiler-compiler tool (or compiler-interpreter tool).
>
> Thanks. I've done some further investigating. I built my parsers following
> two papers: Hutton and Meijer, Monadic Parser Combinators
> https://www.cs.nott.ac.uk/~pszgmh/monparsing.pdf
> and Hutton, Higher-Order Functions for Parsing
> https://pdfs.semanticscholar.org/6669/f223fba59edaeed7fabe02b667809a5744d9.pdf
>
> The first adds error reporting using Monad Transformers. [...]
> But the second paper does it differently, and maybe in a way I can adopt
> more easily. It redefines the parsers to no longer produce a list of results,
> so there's no longer support for ambiguity. Then it defines them to
> return a maybe,
>
> maybe * ::= Fail [char] | Error [char] | OK *
>
> where the OK branch has the parse tree, and Fail and Error both contain an
> error message. It describes how a Fail can be transformed into an Error. But
> it isn't entirely clear where the messages get injected.
>
> Still need to do some thinking on it, but I think I can rewrite the parsers
> to follow this model, and then decorate my grammar with possible errors
> at each node.

I've made some progress. I wrote a new prototype following Hutton and
modified it to add position information to the character stream.
I then rewrote the parsers to produce the maybe structure and
to collect rudimentary error messages.

For these, a positive and negative case in PostScript,

0 0 (abcd\ne) input-pos
(abc) str exec
pc
0 0 (abcd\ne) input-pos
(abd) str nofail exec
pq

I get this output:

$ gsnd -q -dNOSAFER pc11.ps
stack:
[/OK [[(a) [(b) [(c) []]]] [[(d) [0 3]] [[(\n) [0 4]] [[(e) [1 0]] null]]]]]
stack:
[/Error [[(after) (a)] [[(after) (b)] [[{(d) eq} (not satisfied)] [[(c) [0 2]] [[(d) [0 3]] [[(\n) [0 4]] [[(e) [1 0]] null]]]]]]]]

So, this indicates that I *can* modify my C parsers to produce error
messages. The remaining input list has the position information for
where the error occurred, [(c) [0 2]].

Following the prototype, I modified the input functions to add positions
for each character and modified the base parser item() to detect and
remove the position stuff before it passes into the rest of the machinery.

The next hurdle was making the extra position information work with
the Unicode filters ucs4_from_utf8() and utf8_from_ucs4(). And that
all appears to be working now.

But that's probably the end of the story for now. Got to gear up for
Operating Systems and Advanced Web Dev with Java.

Thanks to everyone for the help, esp. Kaz with the brilliant suggestion
to pass a language id token between tokenizer and parser.


P.S. The prototype is written in PostScript extended with function syntax.
https://github.com/luser-dr00g/pcomb/blob/master/ps/pc11.ps
https://codereview.stackexchange.com/questions/193520/an-enhanced-syntax-for-defining-functions-in-postscript

--
l droog
[Why PostScript? I realize it's Turing complete, but it seems odd to run one's parser on a printer. -John]

luser droog
Aug 24, 2020, 11:42:33 AM
On Sunday, August 23, 2020 at 10:15:06 PM UTC-5, luser droog wrote:

> Thanks to everyone for the help, esp. Kaz with the brilliant suggestion
> to pass a language id token between tokenizer and parser.
>
>
> P.S. The prototype is written in PostScript extended with function syntax.
> https://github.com/luser-dr00g/pcomb/blob/master/ps/pc11.ps
> https://codereview.stackexchange.com/questions/193520/an-enhanced-syntax-for-defining-functions-in-postscript
>
> --
> l droog
> [Why PostScript? I realize it's Turing complete, but it seems odd to run one's parser on a printer. -John]

I discovered PostScript around '97 or '98. I was taking Computer Graphics
and it was in an Appendix to the textbook (Salman). At the same time
I was editor of the Honors College student magazine so it really piqued
my interest as a graphics and typography language.

But the language itself I just really enjoy. It's my "Lego blocks"
language. The RPN syntax removes all ambiguity about precedence and
sequencing. It has the same code=data properties as Lisp. Application
code can read from the program stream. It has strings, arrays and
dictionaries. It has first class procedures which can be constructed
on the fly. I've found it a nice playpen for syntax extension.

I was also on a decades-long crusade to never use MS Word after that
/first/ time they screwed everyone by changing the interface. And
PostScript has slowly become my tool for that as my programming skill grew.
https://github.com/luser-dr00g/ibis.ps

On another front, I wanted to have parsers in PostScript so I could
evaluate infix math expressions. And I wanted regular expression
matching in PS thinking it would help to implement algorithmic
hyphenation of text.
[Take a look at Forth. Many of the same advantages, runs a lot more places. -John]

luser droog
Aug 24, 2020, 11:43:07 AM
On Thursday, August 13, 2020 at 5:22:51 PM UTC-5, Hans-Peter Diettrich wrote:
> On 13.08.2020 at 00:20, luser droog wrote:
> > I've got my project successfully parsing the circa-1975 C syntax
> > from that old manual. I'd like to add parsers for K&R1 and c90
> > syntaxes.
> >
> > How separate should these be? Should they be completely
> > separate grammars, or more piecewise selection?
>
> IMO this depends largely on the use made of the parser output (diagnostics,
> backend...). C90 is much stricter than K&R and requires more checks. Do you
> need extensive error diagnostics, or do you assume that all source code
> is free of errors?
>
>
> > https://github.com/luser-dr00g/pcomb/blob/master/pc9syn.c
>
> You seem to be implementing an LL(1) parser? My C98 parser is LL(2), i.e. an
> LL(1) parser with one or two locations where more lookahead is required.

In which places do you need more lookahead? BTW, some of my reading
describes my parsers as LL(infinity) because of the backtracking.

Thomas Koenig
Aug 24, 2020, 3:12:13 PM
luser droog <mij...@yahoo.com.dmarc.email> wrote:

[PostScript]

> But the language itself I just really enjoy. It's my "Lego blocks"
> language. The RPN syntax removes all ambiguity about precedence and
> sequencing.

I recently had the doubtful pleasure of evaluating the formula

x = ((a-b)*c^2+(-d^2+e^2-a^2+b^2)*c+a^2*b+(f^2-e^2-b^2)*a
+(-f^2+d^2)*b)/((-2*d+2*e)*c+(2*f-2*e)*a-2*b*(f-d))

in Postscript. (Yes, really. Don't ask.)

It was the first time in more than a decade that I wrote a
flex/bison grammar (mostly copied from the bison manual). It was
faster and less error-prone than trying to do it directly.

The grammar actually generated fairly unidiomatic PostScript
because I made it give names to all the variables, so

a-b

became

a b sub

I'm sure a real PostScript aficionado would have done it all
on the stack :-)
[Turning infix into RPN is a pretty basic intro compiler course exercise. Conceptually,
you make the parse tree and then do a postorder tree walk. Or, if you're using yacc or
bison, in the expression grammar you just print the operator or token in the code for each
rule, because they are recognized in the right order, e.g.:

expr: VARIABLE        { printf(" %s ", $1); }
    | expr '+' expr   { printf(" add "); }
    | expr '-' expr   { printf(" sub "); }
    | '(' expr ')'    { /* print nothing */ }

-John]

luser droog
Aug 24, 2020, 3:20:10 PM
On Monday, August 24, 2020 at 10:42:33 AM UTC-5, luser droog wrote:
> > [Why PostScript? I realize it's Turing complete, but it seems odd to run one's parser on a printer. -John]
>
> I discovered PostScript around '97 or '98. I was taking Computer Graphics
> and it was in an Appendix to the textbook (Salman). At the same time
> I was editor of the Honors College student magazine so it really piqued
> my interest as a graphics and typography language. ...

> [Take a look at Forth. Many of the same advantages, runs a lot more places. -John]

Good suggestion. I have looked at Forth quite a bit. I lurked in
comp.lang.forth for a number of years. I've got a half-written
interpreter that stalled because my vm doesn't have any I/O.

https://groups.google.com/d/topic/comp.lang.forth/Y1XlX8wD3RQ/discussion
https://retrocomputing.stackexchange.com/questions/6610/how-to-do-i-o-with-emulation-code

I went down a wild rabbit hole after discovering the document "X86 is an
octal machine" and tried to recode the assembler macros using more octal.
But I kind of stalled on that whole area, since the first thing I would
want in Forth is tagged objects like PostScript has. There's Oforth and
8th, which both supply that sort of thing, but then I'd probably miss
the PS graphics functions... :)

I've also played with APL and tried writing a few interpreters for it.
But the common thread among all these interpreters was coding them all
in C. So I turned my attention to compiling and analyzing C code.
A friend of mine wanted a really customizable C formatter, so I
thought I might be able to make a tool to accommodate lots of different
backends for doing something with the parse tree or syntax tree.
I want to be able to write C99 code and transpile it automatically to
something that will work with the MS compiler without having to maintain
any MS business in the "master" source.

luser droog
Aug 29, 2020, 12:59:48 PM
On Monday, August 24, 2020 at 2:12:13 PM UTC-5, Thomas Koenig wrote:
> luser droog <mij...@yahoo.com.dmarc.email> wrote:
>
> [PostScript]
>
> > But the language itself I just really enjoy. It's my "Lego blocks"
> > language. The RPN syntax removes all ambiguity about precedence and
> > sequencing.
>
> I recently had the doubtful pleasure of evaluating the formula
>
> x = ((a-b)*c^2+(-d^2+e^2-a^2+b^2)*c+a^2*b+(f^2-e^2-b^2)*a
> +(-f^2+d^2)*b)/((-2*d+2*e)*c+(2*f-2*e)*a-2*b*(f-d))
>
> in Postscript. (Yes, really. Don't ask.)
>

In case you need it, I've got a PostScript debugger that can single-step
into loops and procedures.

https://github.com/luser-dr00g/debug.ps

$ gsnd db5.ps
GPL Ghostscript 9.52 (2020-03-19)
Copyright (C) 2020 Artifex Software, Inc. All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
GS>{ 0 0 1 5 { add } for } stepon traceon debug
%|-
0 %|- 0
0 %|- 0 0
1 %|- 0 0 1
5 %|- 0 0 1 5
{add} %|- 0 0 1 5 {add}
for nametype object for step: (continue|next|bypass|step|prompt|quit)?s
%|- 0 0
add nametype object add step: (continue|next|bypass|step|prompt|quit)?
%|- 0
1 %|- 0 1
1 %|- 0 1 1
5 %|- 0 1 1 5
{add} %|- 0 1 1 5 {add}
for nametype object for step: (continue|next|bypass|step|prompt|quit)?
%|- 0 1
add nametype object add step: (continue|next|bypass|step|prompt|quit)?
%|- 1
2 %|- 1 2
1 %|- 1 2 1
5 %|- 1 2 1 5
{add} %|- 1 2 1 5 {add}
for nametype object for step: (continue|next|bypass|step|prompt|quit)?
%|- 1 2
add nametype object add step: (continue|next|bypass|step|prompt|quit)?
%|- 3
3 %|- 3 3
1 %|- 3 3 1
5 %|- 3 3 1 5
{add} %|- 3 3 1 5 {add}
for nametype object for step: (continue|next|bypass|step|prompt|quit)?
%|- 3 3
add nametype object add step: (continue|next|bypass|step|prompt|quit)?
%|- 6
4 %|- 6 4
1 %|- 6 4 1
5 %|- 6 4 1 5
{add} %|- 6 4 1 5 {add}
for nametype object for step: (continue|next|bypass|step|prompt|quit)?
%|- 6 4
add nametype object add step: (continue|next|bypass|step|prompt|quit)?
%|- 10
5 %|- 10 5
1 %|- 10 5 1
5 %|- 10 5 1 5
{add} %|- 10 5 1 5 {add}
for nametype object for step: (continue|next|bypass|step|prompt|quit)?
%|- 10 5
add nametype object add step: (continue|next|bypass|step|prompt|quit)?
%|- 15
6 %|- 15 6
1 %|- 15 6 1
5 %|- 15 6 1 5
{add} %|- 15 6 1 5 {add}
for nametype object for step: (continue|next|bypass|step|prompt|quit)?
GS<1>==
15
GS>quit

[This is drifting rather far from compilers now. -John]

anti...@math.uni.wroc.pl
Feb 11, 2021, 7:05:54 PM
GNU Pascal supports several Pascal dialects. GNU Pascal uses a
unified parser for all dialects. Some ideas used:
- flags in the scanner decide whether dialect-specific tokens are
  recognized
- superset parsing: several constructs are generalized so
  that a single construct represents things that otherwise
  would lead to conflicts. A later semantic stage looks at
  the dialect flags and prunes things not allowed in the given
  dialect. An example of a superset construction is the rule
  'call_or_cast'; it handles several syntactically similar
  constructs that are usually given by separate syntax
  rules (see the sketch after this list). The semantic rules,
  beside dialect flags, use types to decide the meaning.
- even after using the two tricks above, the grammar still has
  LALR conflicts; they are resolved using the GLR option
  of Bison. All conflicts are resolvable using lookahead,
  and AFAICS some are resolvable only with lookahead.
  Parser lookahead means that the traditional trick of
  passing semantic info back to the scanner does not work
  (parser actions are delayed, so the scanner may be forced
  to produce a token before the semantic info is available).
  Still, it seems that GLR leads to a cleaner parser.
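
A hypothetical yacc-style sketch of such a superset rule (the rule and
token names here are invented, not GNU Pascal's actual grammar): calls,
casts, and value conversions all reduce to one production, and the
semantic stage sorts them out afterwards:

call_or_cast : name '(' expression_list ')'
                 { /* semantic stage: if 'name' denotes a type in the
                      selected dialect, treat this as a cast/conversion;
                      otherwise it is a function call */ }
             ;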

My impression is that the variation among Pascal dialects is larger
than among C dialects, so the case for a unified parser in C IMHO
is much stronger. OTOH GNU Pascal is a full compiler with
semantic actions invoked from the grammar rules. The semantic code
embedded in the parser changed much more than the grammar rules,
so maintaining separate parsers probably would have been a
nightmare.

--
Waldek Hebisch

Elijah Stone
Feb 21, 2021, 5:04:17 PM
On Thu, 11 Feb 2021, anti...@math.uni.wroc.pl wrote:

> My impression is that the variation among Pascal dialects is larger than
> among C dialects, so the case for a unified parser in C IMHO

Pascal is more fragmented, but it's also much easier to parse than C. I
think it's a wash.

(I also think the whole idea is horrifying and ought not to be pursued;
but.)

-E

anti...@math.uni.wroc.pl
Feb 23, 2021, 8:47:55 PM
Elijah Stone <elr...@elronnd.net> wrote:
> On Thu, 11 Feb 2021, anti...@math.uni.wroc.pl wrote:
>
> > My impression is that the variation among Pascal dialects is larger than
> > among C dialects, so the case for a unified parser in C IMHO
>
> Pascal is more fragmented, but it's also much easier to parse than C. I
> think it's a wash.

I did a C parser; it was not hard at all. In C (as in standard
Pascal) there are conflicts, but those conflicts can be resolved
easily using semantic info. Alternatively, for C one can use 2-token
lookahead. The Turbo Pascal folks introduced an "interesting"
difficulty with caret constants. Frank Heckenbach worked out
how to handle them, and his analysis indicates that correct
handling of Turbo Pascal needs IIRC 6 tokens of lookahead.

Note that for both Pascal and C, with 1 token of lookahead
semantic info is available when needed to disambiguate
parsing; once you have more than 1 token of lookahead,
semantic info is sometimes too late, and in effect the parser
must work purely syntactically.

> (I also think the whole idea is horrifying and ought not to be pursued;
> but.)

What do you mean by "whole idea"? Do you think that creating a
compiler that can correctly handle multiple dialects (of Pascal
or another language) is wrong?

--
Waldek Hebisch

ltc...@gmail.com
Mar 14, 2021, 9:08:36 PM
> anti...@math.uni.wroc.pl wrote:
> I did a C parser; it was not hard at all. In C (as in standard
> Pascal) there are conflicts, but those conflicts can be resolved
> easily using semantic info. Alternatively, for C one can use 2-token
> lookahead.

I'm not sure whether I captured the full context of your statement, but, if I
did, I don't think it's 100% correct:

- In regard to a lookahead of 2:
This isn't enough to disambiguate, e.g., between a cast-expression in 6.5.4,
`( type-name ) cast-expression`, and a compound literal in 6.5.2, `( type-name
) { initializer-list }` (see the sketch after this list).

- In regard to using semantic info:
Yes, with semantic info you can disambiguate things like `x * y;`, so I'd say
that, from a pragmatic/practical standpoint, this affirmation is right.
However, from a more theoretical perspective, a parser (thinking of it as a
program that "simply" validates a sentence against a grammar) isn't expected,
arguably, to rely on anything other than syntax. Whether or not the
theoretical aspect is relevant depends on the application of the parser, I
guess. For instance, for the implementation of a static analysis tool, not
depending, as much as possible, on semantic information to guide parsing
is an advantage.
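
A small self-contained illustration of the first point (T and S are
hypothetical names): both constructs begin with an identical "( type-name )"
prefix, and the decision arrives only with the token that follows the ")":

typedef int T;
struct S { int y; };

int main(void)
{
    T a = (T)42;                        /* cast-expression (6.5.4)  */
    struct S s = (struct S){ .y = 1 };  /* compound literal (6.5.2) */
    return a + s.y;
}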

This is a table (only for expressions) that I recently put up when rewriting
the C parser of my project:
https://docs.google.com/spreadsheets/d/1oGjtFaqLzSoBEp2aGNgHrbEHxSi4Ijv57mXMPymZEcQ/edit?usp=sharing

--
Leandro T. C. Melo

Rock Brentwood
Mar 14, 2021, 9:09:48 PM
On Wednesday, August 12, 2020 at 5:32:56 PM UTC-5, luser droog wrote:
> I've got my project successfully parsing the circa-1975 C syntax
> from that old manual. I'd like to add parsers for K&R1 and c90
> syntaxes.
>
> How separate should these be? Should they be completely
> separate grammars, or more piecewise selection?

I'm in a similar situation, with a utility for which I want to grandfather in
the old syntax but write in a new and better syntax. My recommendation is
this: stick to C99, since that's already in POSIX. Write a separate utility to
convert legacy syntax to C99 (and to call out any
irregularities/inconsistencies in the program being converted). That's, like,
"lint" on steroids.

The other syntaxes would be used in the other utilities only, one per
utility. It can also be hybridized with "indent", and a driver routine can
control the conversion, so that all the conversion utilities can be combined
into one. So, on input, the source syntax is selected, and on output the
format is driven in much the same way that it is with indent. It's an
excellent exercise in Text-To-AST-To-Text programming.

Each program, upon upward conversion to C99, would replace the original, once
it passes the consistency checks provided by the utility; so there isn't a
question of cueing error messages to the format of the older program, because
the older program would be replaced. Doing all of this is an example of
"refactoring" used to pay off "code debt". And there's a lot of code debt out
there that needs to be paid up.

Technical Debt (Wikipedia): https://en.wikipedia.org/wiki/Technical_debt
Code Refactoring (Wikipedia): https://en.wikipedia.org/wiki/Code_refactoring