Re: [treetop] There is a easy way to ignore whitespace? (#24)

Clifford Heath

Jan 19, 2012, 5:14:41 PM
to Geovani de Souza, treet...@googlegroups.com

On 20/01/2012, at 2:01 AM, Geovani de Souza wrote:
> I'm writing my grammar using treetop, for a new programming language (a prototype for now), but I cannot find a way to ignore whitespace/comments reliably.
>
> Is there such a feature in the tool, or will one be implemented soon?

PEG parsers do not (usually) have such a feature, because they do not
separate lexing from parsing. You need to implement white-space skipping
along with your lexical rules.

For an example of how to do this, you could view my parser for CQL at
<https://github.com/cjheath/activefacts/tree/master/lib/activefacts/cql>
Note that CQLParser.treetop includes multiple other grammars from the
associated files, including LexicalRules.treetop, in which I define S for
mandatory whitespace/comments and s for optional whitespace. You'll
see these rules used widely to skip whitespace and comments.
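
As a rough illustration of the idea (not the CQL grammar itself, and not Treetop syntax), here is a hand-written sketch in plain Ruby using the standard StringScanner. The method names opt_space and req_space stand in for the s and S rules; the class name, the "let" statement and the '#' comment syntax are all assumptions made up for the example.

```ruby
require 'strscan'

# Hypothetical mini-parser for statements such as "let x = 42 # note".
# Whitespace skipping is written into the rules themselves, because a
# PEG-style parser has no separate lexer to discard it.
class TinyParser
  # One or more runs of whitespace or '#' line comments (assumed syntax).
  BLANK = /(?:\s+|#[^\n]*)+/

  def initialize(text)
    @scanner = StringScanner.new(text)
  end

  # Plays the role of "s": optional whitespace/comments, always succeeds.
  def opt_space
    @scanner.skip(BLANK)
    true
  end

  # Plays the role of "S": mandatory whitespace/comments.
  def req_space
    !@scanner.skip(BLANK).nil?
  end

  # let S name s '=' s number s end-of-input
  def parse_let
    return nil unless @scanner.skip(/let/) && req_space
    name = @scanner.scan(/[A-Za-z_]\w*/)
    return nil unless name
    opt_space
    return nil unless @scanner.skip(/=/)
    opt_space
    value = @scanner.scan(/\d+/)
    return nil unless value
    opt_space
    return nil unless @scanner.eos?
    [name, value.to_i]
  end
end
```

Calling TinyParser.new("let  x=42 # answer").parse_let accepts the input, while "letx = 1" is rejected because the mandatory whitespace after the keyword is missing.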

When parsing keywords, be careful not to omit the trailing look-ahead
that checks the next character is non-alphanumeric. The look-ahead prevents
the parser from recognising the first characters of "foobar" as the keyword "foo".
See the bottom of this file for an example:
<https://github.com/cjheath/activefacts/blob/master/lib/activefacts/cql/Language/English.treetop>
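
The keyword trap can be shown with a small plain-Ruby sketch. This is not taken from the linked grammar; the helper names are hypothetical, and "alphanumeric" is assumed here to mean letters, digits and underscore. In a Treetop grammar the same guard would typically be expressed with the ! negative look-ahead predicate after the keyword.

```ruby
require 'strscan'

# Naive keyword match: accepts the keyword even when it is only the
# prefix of a longer identifier.
def keyword_naive(text, word)
  StringScanner.new(text).scan(/#{Regexp.escape(word)}/)
end

# Guarded keyword match: the negative look-ahead (?!...) refuses the match
# when the next character could continue an identifier.
def keyword_guarded(text, word)
  StringScanner.new(text).scan(/#{Regexp.escape(word)}(?![A-Za-z0-9_])/)
end
```

With these helpers, keyword_naive("foobar", "foo") returns "foo", whereas keyword_guarded("foobar", "foo") returns nil and keyword_guarded("foo bar", "foo") still returns "foo".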

Please send any further requests to the mailing list at <treet...@googlegroups.com>
Best of luck,

Clifford Heath.

Mark Wilden

Jan 19, 2012, 6:18:09 PM
to treet...@googlegroups.com, Geovani de Souza
On Thu, Jan 19, 2012 at 2:14 PM, Clifford Heath
<cliffor...@gmail.com> wrote:
>
> associated files, including LexicalRules.treetop, in which I define S for
> mandatory whitespace/comments and s for optional whitespace. You'll
> see these rules used widely to skip whitespace and comments.

GMTA, at least for rule s. :)

I found a huge productivity gain in doing this work by preprocessing
input. Instead of making the grammar handle vagaries of whitespace and
comments, have the preprocessor do it.

I found I used 's' a lot less, and ' ' a lot more. My grammars are
cleaner and have to deal with fewer edge cases. It also leads to a
cleaner separation of concerns, I feel.
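
A minimal sketch of what such a preprocessor might look like follows. The method name preprocess and the '#' line-comment syntax are assumptions for illustration, and the sketch presumes that no construct in the language (a string literal, say) carries significant whitespace.

```ruby
# Hypothetical preprocessor run before the text reaches the parser.
# Assumes '#' starts a line comment and that whitespace is never significant.
def preprocess(source)
  source
    .gsub(/#[^\n]*/, '')  # drop line comments
    .gsub(/\s+/, ' ')     # collapse tabs, newlines and space runs to one space
    .strip                # no leading or trailing whitespace
end
```

After this step the grammar can match a single literal ' ' between tokens instead of a general whitespace rule.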

Dmitry Mozzherin

Jan 20, 2012, 10:22:35 AM
to treet...@googlegroups.com
Interesting approach Mark

Dmitry

Mark Wilden

Jan 20, 2012, 1:47:16 PM
to treet...@googlegroups.com
On Fri, Jan 20, 2012 at 7:22 AM, Dmitry Mozzherin <dmozz...@eol.org> wrote:

> On Thu, Jan 19, 2012 at 6:18 PM, Mark Wilden <ma...@mwilden.com> wrote:
>> I found a huge productivity gain in doing this work by preprocessing
>> input. Instead of making the grammar handle vagaries of whitespace and
>> comments, have the preprocessor do it.

> Interesting approach Mark

Yeah, it was amazing how a few gsubs before parsing made my grammars
so much simpler and easier to test.

///ark
Web Applications Developer
California Academy of Sciences

markus

Jan 20, 2012, 2:32:33 PM
to treet...@googlegroups.com
> >> Instead of making the grammar handle vagaries of whitespace and
> >> comments, have the preprocessor do it.
>
> > Interesting approach Mark
>
> Yeah, it was amazing how a few gsubs before parsing made my grammars
> so much simpler and easier to test.

That works as long as your language doesn't have embedded constructs
(e.g. string literals) that have significant whitespace. Thus the
technique is only effective for a limited subset of the grammars that a
PEG can parse.

In the worst case, your preprocessor would need to be much more
complicated than a regular expression based gsub and would have to do
all the work that the parser would have to do in order to determine how
whitespace should be treated...and then throw the results away after the
change was made, and let the parser do it all again. In such cases,
it's clearly better to just let treetop deal with the whitespace. :)
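
The limitation is easy to reproduce. In the sketch below (hypothetical method names, and double-quoted string literals with backslash escapes assumed), the naive gsub damages the contents of a string literal, and the repaired version only works because it already has to recognise string literals, which is a small piece of the parser's job done twice.

```ruby
# Naive: collapses every whitespace run, including those inside literals.
def squish_naive(source)
  source.gsub(/\s+/, ' ')
end

# Literal-aware: the alternation tries a whole double-quoted literal first
# and passes it through untouched; only whitespace outside literals is
# collapsed.
STRING_OR_SPACE = /"(?:\\.|[^"\\])*"|\s+/

def squish_outside_strings(source)
  source.gsub(STRING_OR_SPACE) { |m| m.start_with?('"') ? m : ' ' }
end
```

For the input print   "a   b"  ; the naive version rewrites the literal to "a b", while the literal-aware version leaves "a   b" intact. Each further construct with significant whitespace would need its own alternative in the pattern.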

-- M


Mark Wilden

Jan 20, 2012, 2:45:02 PM
to treet...@googlegroups.com
On Fri, Jan 20, 2012 at 11:32 AM, markus <mar...@reality.com> wrote:
>>
>> Yeah, it was amazing how a few gsubs before parsing made my grammars
>> so much simpler and easier to test.
>
> That works as long as your language doesn't have embedded constructs
> (e.g. string literals) that have significant whitespace.  Thus the
> technique is only effective for a limited subset of the grammars that a
> PEG can parse.

I added preprocessor directives to the proprietary OOP language we
used at Sierra On-Line back in the day. So yeah, I know the limits of
gsub. :)

However, in my current project, which is parsing a catalog of ant
taxonomy, it's quite useful to collapse whitespace, convert tabs to
spaces, etc.

But there's more. Sometimes the guy who wrote the catalog would
italicize certain symbols and sometimes he wouldn't. Instead of
dealing with it in the grammar, this can be normalized by a (simple)
preprocessor.

Here's my #normalize method, showing the kinds of things I'm talking about:

def normalize(string)
  fix_ending_punctuation(
  fix_et_al(
  squish_spaces(
  fix_no_space_after_semicolon(
  fix_double_periods(
  fix_space_before_period(
  normalize_italics(
  remove_bold(
  remove_spans(
  remove_inner_paragraphs(
  remove_mismatched_brackets(
  replace_character_entities(
  fix_utf_characters(
    string)))))))))))))
end
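
The same pipeline shape can also be written as a list of steps folded over the input, which avoids the closing-parenthesis pile-up. This is only a sketch: the thread does not show how the individual steps are implemented, so the four lambdas below are hypothetical stand-ins, listed in the order they would run (innermost call first).

```ruby
# Hypothetical stand-ins for four of the steps named in #normalize.
NORMALIZERS = [
  ->(s) { s.gsub(/\s+\./, '.') },      # fix_space_before_period
  ->(s) { s.gsub(/\.{2,}/, '.') },     # fix_double_periods
  ->(s) { s.gsub(/;(?=\S)/, '; ') },   # fix_no_space_after_semicolon
  ->(s) { s.gsub(/\s+/, ' ').strip },  # squish_spaces
].freeze

# Feed the string through each step in turn; the output of one step is
# the input of the next.
def normalize(string)
  NORMALIZERS.reduce(string) { |text, step| step.call(text) }
end
```

Each step stays small enough to test on its own, and reordering or adding a step is a one-line change to the list.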

markus

Jan 20, 2012, 3:06:37 PM
to treet...@googlegroups.com

> But there's more. Sometimes the guy who wrote the catalog would
> italicize certain symbols and sometimes he wouldn't. Instead of
> dealing with it in the grammar, this can be normalized by a (simple)
> preprocessor.
>
> Here's my #normalize method, showing the kinds of things I'm talking about:
>
> def normalize(string)
>   fix_ending_punctuation(
>   fix_et_al(
>   squish_spaces(
>   fix_no_space_after_semicolon(
>   fix_double_periods(
>   fix_space_before_period(
>   normalize_italics(
>   remove_bold(
>   remove_spans(
>   remove_inner_paragraphs(
>   remove_mismatched_brackets(
>   replace_character_entities(
>   fix_utf_characters(
>     string)))))))))))))
> end

Sweet. I'm a big fan of the composable transform pattern, though it
looks like you're just a gsub(/([a-z_]+)\(/,'(\1') away from lisp
there. :)

On one project that I remember fondly, we wound up writing a little
micro-language (based on Miller's lightweight structures,
http://www.cs.cmu.edu/~rcm/papers/thesis/ ) to express the
transformations, and drove it with a repurposed peephole optimizer. Fun
times.

-- M
