[erlang-questions] Parsing C with leex and yecc

Joe Armstrong

unread,

Jul 20, 2010, 5:28:38 AM7/20/10

to Erlang

I'm trying to parser ANSI C with leex and yecc and have run into
two problems.

1) /* ... */ comments. Leex is (as I understand things) greedy
thus I can't just write a regexp to match comments, since it
will consume no-only the current comment, but all comments until
the last comment in the file.

To solve this I have just written a simple pre-processor to remove comments
from the original source.

2) The typedef problem.

How can I parse typedefs using leex and yecc?

The traditional yacc/lex approach solves this with a context
sensitive lex and feedback between yacc and lex. But the Erlang
leex and yacc seem to be written as separated passes, are there some
internal functions that allow feedback between leex and yacc?

/Joe

________________________________________________________________
erlang-questions (at) erlang.org mailing list.
See http://www.erlang.org/faq.html
To unsubscribe; mailto:erlang-questio...@erlang.org

Mikael Pettersson

unread,

Jul 20, 2010, 6:13:23 AM7/20/10

to Joe Armstrong, Erlang

Joe Armstrong writes:
> I'm trying to parser ANSI C with leex and yecc and have run into
> two problems.

...

> 2) The typedef problem.
>
> How can I parse typedefs using leex and yecc?
>
> The traditional yacc/lex approach solves this with a context
> sensitive lex and feedback between yacc and lex. But the Erlang
> leex and yacc seem to be written as separated passes, are there some
> internal functions that allow feedback between leex and yacc?

The simplest solution in my experience is to make the scanner
proper ignore the typedef problem and just return identifier
for any identifier-like token that isn't a reserved word. Then
you place a context-sensitive shim between then parser and the
scanner proper, where you maintain needed state to resolve
identifier-vs-typename. Depending on implementation language
and parser technology, this may require tweaking the interface
between then parser and the shim (e.g. for state) or even embedding
the shim in the parser, but at least the scanner proper can remain clean.

Sverker Eriksson

unread,

Jul 20, 2010, 6:32:36 AM7/20/10

to Joe Armstrong, Erlang

Joe Armstrong wrote:
> I'm trying to parser ANSI C with leex and yecc and have run into
> two problems.
>
> 1) /* ... */ comments. Leex is (as I understand things) greedy
> thus I can't just write a regexp to match comments, since it
> will consume no-only the current comment, but all comments until
> the last comment in the file.
>
> To solve this I have just written a simple pre-processor to remove comments
> from the original source.
>
>

re:run("/***first comment***/ /* next comment */",
"/\\*([^*]|(\\*+([^*/])))*\\*+/").

http://ostermiller.org/findcomment.html

/Sverker

Joe Armstrong

unread,

Jul 20, 2010, 9:00:09 AM7/20/10

to Sverker Eriksson, Erlang

On Tue, Jul 20, 2010 at 12:32 PM, Sverker Eriksson
<sve...@erix.ericsson.se> wrote:
> Joe Armstrong wrote:
>>
>> I'm trying to parser ANSI C with leex and yecc and have run into
>> two problems.
>>
>> 1) /* ... */ comments. Leex is (as I understand things) greedy
>> thus I can't just write a regexp to match comments, since it
>> will consume no-only the current comment, but all comments until
>> the last comment in the file.
>>
>> To solve this I have just written a simple pre-processor to remove
>> comments
>> from the original source.
>>
>>
>
> re:run("/***first comment***/ /* next comment */",
> "/\\*([^*]|(\\*+([^*/])))*\\*+/").
>
> http://ostermiller.org/findcomment.html

Ummm ... this will incorrectly match a literal string, containing a
comment ... for example:

char *p = "hi /* not a comment */ how are you?";

Which is not what I want ..

Easiest seems to be bit of pure Erlang:

On Tue, Jul 20, 2010 at 12:32 PM, Sverker Eriksson
<sve...@erix.ericsson.se> wrote:
> Joe Armstrong wrote:
>>
>> I'm trying to parser ANSI C with leex and yecc and have run into
>> two problems.
>>
>> 1) /* ... */ comments. Leex is (as I understand things) greedy
>> thus I can't just write a regexp to match comments, since it
>> will consume no-only the current comment, but all comments until
>> the last comment in the file.
>>
>> To solve this I have just written a simple pre-processor to remove
>> comments
>> from the original source.
>>
>>
>
> re:run("/***first comment***/ /* next comment */",
> "/\\*([^*]|(\\*+([^*/])))*\\*+/").
>
> http://ostermiller.org/findcomment.html

Ummm ... this will incorrectly match a literal string, containing a
comment, for example:

char *p = "hi /* not a comment */ how are you?";

Which is not what I want ..

Easiest seems to be bit of pure Erlang:

%% remove_comments(Str) -> Str'
%% remove C'style comments from a string
%% note1: We retain any embedded NLs in the comment
%% this is so that line number calculations in the tokenizer
%% will still be correct
%% note2. We copy literal strings. Since a literal string might
%% contain a comment we have to parse the string
%% note3: Comments must be replaced by at least one space
%% Since otherwise "123/* comment */456" would be
%% transformed into 123456 (a single integer) instead
%% of two integers 123 and 456. This is why we add a
%% space in the last line of skip_comment/2.

remove_comments(Str) -> remove_comments(Str, []).

remove_comments("/*" ++ T, L) -> skip_comment(T, L);
remove_comments([$"|T], L) -> copy_string_literal(T, [$"|L]);
remove_comments([H|T], L) -> remove_comments(T, [H|L]);
remove_comments([], L) -> reverse(L).

skip_comment("*/" ++ T, L) -> remove_comments(T, L);
skip_comment("\n" ++ T, L) -> skip_comment(T, [$\n|L]);
skip_comment([_|T], L) -> skip_comment(T, L);
skip_comment([], L) -> remove_comments([], [$\s|L]).

/Joe

Robert Virding

unread,

Jul 20, 2010, 10:01:29 AM7/20/10

to Joe Armstrong, Sverker Eriksson, Erlang

If you use the comment regexp Sverker sent for a leex comment token
(which you then ignore at will) together with a C string regexp for a
string token then you should not have any problems. The comment regexp
will not match inside the string regexp and vice versa. You will only
get problems if you try to use it remove comments in a pre-parse.

The typedef problem is difficult. Anyway "real" C didn't have typedef's. :-)

Robert

Tony Arcieri

unread,

Jul 20, 2010, 1:34:14 PM7/20/10

to Joe Armstrong, Erlang

On Tue, Jul 20, 2010 at 3:28 AM, Joe Armstrong <erl...@gmail.com> wrote:

> The traditional yacc/lex approach solves this with a context
> sensitive lex and feedback between yacc and lex. But the Erlang
> leex and yacc seem to be written as separated passes, are there some
> internal functions that allow feedback between leex and yacc?

I've run into the need for lexer/parser feedback myself, specifically
processing interpolated strings (that contain expressions from the outer
language) in Reia. I haven't found a good solution. Interpolation works
but only on a subset of expressions... for example you can't nest strings
within strings.

Have you thought about using neotoma?

--
Tony Arcieri
Medioh! A Kudelski Brand

Joe Armstrong

unread,

Jul 20, 2010, 2:39:47 PM7/20/10

to Tony Arcieri, Erlang

Yes - I've also been playing with my own peg combinator library :-).
The problem is that LALR(1) parsers can give you pretty good automatic error
diagnostics on faulty input. Peg grammars just fail - period - and you
have to write
your own error recovery in the grammar - on the other hand the peg grammar
is much nicer than the yacc grammar

/Joe

>
> --
> Tony Arcieri
> Medioh! A Kudelski Brand
>

________________________________________________________________

Tony Arcieri

unread,

Jul 20, 2010, 2:43:27 PM7/20/10

to Joe Armstrong, Erlang

On Tue, Jul 20, 2010 at 12:39 PM, Joe Armstrong <erl...@gmail.com> wrote:

> Yes - I've also been playing with my own peg combinator library :-).
> The problem is that LALR(1) parsers can give you pretty good automatic
> error
> diagnostics on faulty input. Peg grammars just fail - period - and you
> have to write
> your own error recovery in the grammar - on the other hand the peg grammar
> is much nicer than the yacc grammar
>

That's been my observation as well, and why I've continued to use leex and
yecc for Reia.

Richard O'Keefe

unread,

Jul 20, 2010, 6:20:10 PM7/20/10

to Joe Armstrong, Erlang

On Jul 20, 2010, at 9:28 PM, Joe Armstrong wrote:

> I'm trying to parser ANSI C with leex and yecc and have run into
> two problems.
>
> 1) /* ... */ comments. Leex is (as I understand things) greedy
> thus I can't just write a regexp to match comments, since it
> will consume no-only the current comment, but all comments until
> the last comment in the file.

Actually you CAN write a regular expression which matches
C comments. The trivial /[/][*].*[*][/]/ is not going to work,
but it isn't particularly difficult to write a regular expression
that WILL work. In fact, some books about Lex and Yacc give it
to you. Oh heck. I'm not going to leave it as an exercise for
the reader after all. Think of a C comment as
"/*"
zero or more blocks of (not star)* (star)+ (not star or slash)
one block of (not star)* (star)+ /

/\/\*([^*]*\*+[^/*])*[^*]*\*+\//

Lex books recommend NOT doing this, not because there's any great
difficulty in constructing a regular expression, but because
recognising a comment that way means *storing* the comment as if it
were a token. If you want to keep the comments, that's a great
thing to do. If you don't, you have to allocate a huge token
buffer you wouldn't otherwise need.

There's another approach in (f)lex, which is to use states.
Does Leex support those?

However, there's another approach that might be worth considering.
Run the C files through the preprocessor first, and let *it*
strip out the comments.

Toby Thain

unread,

Jul 20, 2010, 8:00:13 PM7/20/10

to Robert Virding, Erlang

On 20-Jul-10, at 10:01 AM, Robert Virding wrote:

> ...

>
> The typedef problem is difficult. Anyway "real" C didn't have
> typedef's. :-)

^^ what "real" C are you referring to?

--Toby

French, Mike

unread,

Jul 21, 2010, 4:21:36 AM7/21/10

to Joe Armstrong, Tony Arcieri, Erlang

> -----Original Message-----
> From: erlang-q...@erlang.org
> [mailto:erlang-q...@erlang.org]On
> Behalf Of Joe Armstrong
> Sent: 20 July 2010 19:40
> To: Tony Arcieri
> Cc: Erlang
> Subject: Re: [erlang-questions] Parsing C with leex and yecc
...
> Yes - I've also been playing with my own peg combinator library :-).
> The problem is that LALR(1) parsers can give you pretty good automatic
error
> diagnostics on faulty input. Peg grammars just fail - period - and you

> have to write your own error recovery in the grammar ...

I think PEGs can be written to handle errors elegantly.
Each combinator applies an exploration strategy to the input top-down,
and passes the remaining input on to subsequent analysis steps.
When an error occurs, it should pass back the reason for the failure.
Each error ultimately originates from a failed character lexing rule.

Just as input is reduced by moving forwards, so errors are accumulated
flowing backwards. Each combinator is then responsible for merging the
error reports coming back into it. The error handling process is in
some sense dual to the original input processing, so in some sense,
each parser combinator must contain an error co-combinator.

Failure occurs when all available paths have been explored.
Depending how errors have been filtered and discarded,
it is possible for full tree of errors to be returned to the original rule.
In practice, the only interesting co-combinator is in the choice rule.
A simple co-combinator policy is to keep the deepest (rightmost) error,
so the final report is a stack trace of failed rules and input positions.
This may or may not correspond to the 'real' error.

Mike

Thales UK Ltd (Wells) DISCLAIMER: The information contained in this e-mail
is confidential. It may also be legally privileged. It is intended only for
the stated addressee(s) and access to it by any other person is
unauthorised. If you are not an addressee, you must not disclose, copy,
circulate or in any other way use or rely on the information contained in
this e-mail. Such unauthorised use may be unlawful. We may monitor all
e-mail communications through our networks. If you have received this e-mail
in error, please inform us immediately on sender's telephone number above
and delete it and all copies from your system. We accept no responsibility
for changes to any e-mail which occur after it has been sent. Attachments
to this e-mail may contain software viruses which could damage your system.
We therefore recommend you virus-check all attachments before opening.
Thales UK Ltd. Registered Office: 2 Dashwood Lang Road, The Bourne Business
Park, Addlestone, Weybridge, Surrey KT15 2NX Registered in England No.
868273

French, Mike

unread,

Jul 21, 2010, 7:35:29 AM7/21/10

to Joe Armstrong, Tony Arcieri, Erlang

This duality becomes clearer with a slight change of terminology.
It's not really an 'error' until all rules have not matched.
So if you replace 'parser' with 'match' and 'error' with 'non-match', then:

"Each match combinator must contain a non-match co-combinator".

Tony Arcieri

unread,

Jul 21, 2010, 6:54:33 PM7/21/10

to Joe Armstrong, Erlang

On Tue, Jul 20, 2010 at 3:28 AM, Joe Armstrong <erl...@gmail.com> wrote:

> How can I parse typedefs using leex and yecc?
>
> The traditional yacc/lex approach solves this with a context
> sensitive lex and feedback between yacc and lex. But the Erlang
> leex and yacc seem to be written as separated passes, are there some
> internal functions that allow feedback between leex and yacc?
>

The bic C(++) parser from jungerl seems to use leex and yecc. No idea if it
runs afoul of the cases you're running into:

http://gist.github.com/gebi/jungerl/tree/master/lib/bic

Richard O'Keefe

unread,

Jul 22, 2010, 7:44:55 PM7/22/10

to Tony Finch, Joe Armstrong, Erlang

On Jul 23, 2010, at 6:00 AM, Tony Finch wrote:
>> /\/\*([^*]*\*+[^/*])*[^*]*\*+\//
>

> This is sadly incomplete ("sadly" because it's a good example of C's
> syntactic unpleasantness) unless you have already performed translation
> phases 1 and 2 on the source. In phase one you translate trigraphs (in
> particular ??/ -> \) and in phase two you delete backslash-newline
> sequences (which might not be visible until after trigraph substitution).
> Comments are recognized in phase three.

Exactly so. Which is why I suggested using a C preprocessor to do
the heavy lifting. My "favourite" C comment is

/??/ => /\ => /**/
*??/ *\
*??/ *\
/ /
trigraphs backslash-newline removal

gcc actually gets this completely wrong unless you pass -trigraph
on the command line.

>
> 2. Each instance of a backslash character (\) immediately followed by
> a new-line character is deleted, splicing physical source lines to
> form logical source lines. Only the last backslash on any physical
> source line shall be eligible for being part of such a splice. A
> source file that is not empty shall end in a new-line character,
> which shall not be immediately preceded by a backslash character
> before any such splicing takes place.

It's amazing how many Windows C programs violate this rule,
the file ending with say
}<EOF>
and it's even more amazing that some compilers fail to diagnose this.

To adapt something Padlipski wrote,
"There's a rumour that they're moving to 17 phases in the
next standard, because 17 is a sacred number in Bali."

Reply all

Reply to author

Forward