Separating Identifiers from Reserved Words


lhebble...@btinternet.com

Nov 23, 2012, 9:35:35 AM
to modgr...@googlegroups.com
Hi, I have a question about identifying words in a language and whether they are valid identifiers. In this language an identifier cannot be a reserved word.

I attach a test script that includes the grammar definition.

The interesting parts, as far as this topic is concerned, are:

class Word (Grammar):
    grammar = (WORD("A-Za-z", "A-Za-z0-9_"))

class ReservedWord (Grammar):
    grammar = (L("Form") | L("Data") | L("End"))

class Identifier (Grammar):
    grammar = (EXCEPT(Word, ReservedWord))

class FormHeader (Grammar):
    grammar = (L("Form"), Identifier)

class Form (Grammar):
    grammar = (FormHeader, FormData, FormEnd)

Here's the test input:

Form End
    Form Data
    End Data
End Form

This input is incorrect and should not parse correctly. The reason is that after the word "Form" on the 1st line we should get an identifier. However, "End" is not a valid identifier because it is a reserved word.

When the test script is run with this input the parse error I see is: ParseError: [line 2, column 8] Expected 'Form': Found 'd\n    Form Data\n'               

I think this means that modgrammar matches the 'Form' (line 1) and matches the "En" of "End" as the identifier. It cannot match "End" as an identifier because it knows that "End" is a reserved word. However, it then backtracks and matches "En" instead, even though that is not a complete word.

Is that thinking correct?

My main question is: How can I resolve the problem? The Word grammar element should commit to the longest possible word it can see and should not ever consider leading sub-words. Is that possible?

Many thanks for your attention.
Leigh.
modgrammar_test.py

Alex Stewart

Dec 14, 2012, 3:12:17 PM
to modgr...@googlegroups.com, lhebble...@btinternet.com
Hi there.. sorry I didn't get back to you earlier (I just recently discovered that Google was not notifying me when people posted to this group like it was supposed to)..

Your thinking is indeed correct.  That is exactly what it is doing.  Unfortunately, the EXCEPT operation often isn't as useful as it seems like it would be, because (as you've found) it can make it very hard for the parser to figure out the right error message to return on a parse failure..
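
To make that concrete, here is a rough sketch in plain Python (standard `re` only; this is not modgrammar's actual implementation) of what a backtracking word match combined with an exclusion ends up doing. The function name and the RESERVED set are invented purely for illustration:

```python
import re

# Assumed for illustration: the three reserved words from the question.
RESERVED = {"Form", "Data", "End"}
WORD = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def backtracking_identifier(text, pos=0):
    """Model a word match that gives characters back when the exclusion rejects it."""
    m = WORD.match(text, pos)
    if m is None:
        return None
    word = m.group()
    # Try the full word first, then ever-shorter leading sub-words.
    for end in range(len(word), 0, -1):
        candidate = word[:end]
        if candidate not in RESERVED:
            return candidate
    return None


print(backtracking_identifier("End\n    Form Data\n"))  # prints: En
```

The full word "End" is rejected as reserved, so the loop falls back to "En", which leaves the stray 'd' that shows up in the error message.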

Note that another part of why this ends up with such a strange result is that you are using modgrammar in whitespace-consuming mode, which you probably don't want for this sort of grammar either.  In this mode, the grammar will happily accept any amount of whitespace between your tokens (good), but will also happily accept no whitespace at all between them (bad), so you'll find the following input text will also parse perfectly happily:

FormFoo FormDataEndDataEndForm

..which is probably not something you want.
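
The difference between "whitespace allowed" and "whitespace required" can be sketched with two regular expressions. These are only stand-ins for the token sequence "Form <identifier>" (they are not how modgrammar works internally), and the pattern names are made up:

```python
import re

# Whitespace-consuming behaviour: zero or more whitespace characters between tokens.
OPTIONAL_WS = re.compile(r"Form\s*([A-Za-z][A-Za-z0-9_]*)")
# What a wordy language usually wants: at least one whitespace character.
REQUIRED_WS = re.compile(r"Form\s+([A-Za-z][A-Za-z0-9_]*)")

for text in ("Form Foo", "FormFoo"):
    optional_ok = OPTIONAL_WS.fullmatch(text) is not None
    required_ok = REQUIRED_WS.fullmatch(text) is not None
    print(repr(text), "optional:", optional_ok, "required:", required_ok)
```

With optional whitespace both inputs are accepted; with required whitespace only "Form Foo" is.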

Whitespace-consuming mode is mainly intended for languages like C, where most constructs are delineated by specific operator characters, so the structure is clear regardless of any whitespace in there.  For languages which are mostly text words, you'll probably want to set "grammar_whitespace = False" and then explicitly require spaces between words (I'm toying around with adding a "whitespace required" mode which might make this easier in future, but unfortunately I haven't had a chance to implement it yet)..

I was going to suggest that you could fix this issue by modifying the definition of Word to be a "WORD(...), NOT_FOLLOWED_BY(WORD(...))", which when combined with "grammar_whitespace = False" would theoretically sort everything out, but unfortunately when I did that I found a bug in the current version of Modgrammar involving EXCEPT parsing which breaks it all horribly, so I guess I'm going to have to fix that first.. (sigh)
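
For what it's worth, the intent of that NOT_FOLLOWED_BY change can be sketched outside modgrammar in plain Python. This is only an illustration of the idea (a candidate word is rejected if the next character would continue it), not a workaround for the bug, and all names here are hypothetical:

```python
import re

RESERVED = {"Form", "Data", "End"}          # assumed reserved words
WORD = re.compile(r"[A-Za-z][A-Za-z0-9_]*")
WORD_CHAR = re.compile(r"[A-Za-z0-9_]")


def identifier(text, pos=0):
    """Like a backtracking word match, but a candidate may not be followed by a word character."""
    m = WORD.match(text, pos)
    if m is None:
        return None
    word = m.group()
    for end in range(len(word), 0, -1):
        # The "not followed by" check: a leading sub-word such as "En" is
        # followed by "d", so it is never accepted as a complete word.
        if WORD_CHAR.match(text, pos + end):
            continue
        candidate = word[:end]
        if candidate not in RESERVED:
            return candidate
    return None


print(identifier("End\n    Form Data\n"))  # prints: None
print(identifier("Ending soon"))           # prints: Ending
```

Only the complete word survives the lookahead, so "End" is rejected outright instead of being shortened to "En".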

I'm going to see if I can't sort this all out and get you an example of a parser that works the way you want it to in the next day or two, and will let you know..

--Alex

lhebble...@btinternet.com

Dec 17, 2012, 7:01:07 AM
to modgr...@googlegroups.com, lhebble...@btinternet.com
Thanks for the reply.

You were not targeting people writing COBOL parsers then. ;-)

(Not denigrating COBOL. It does what it does in a simple but not very exciting way.)

The language I have to parse is very wordy. I have taken over a parser for it that is in Python 3 and has what I call an ad-hoc parser for the entire language. The whole thing is done in code like so:

def parse_record(record, data=None):
    while not match(END):
        if match(GROUP):
            name = require_id()
            group = Record(GROUP, name, parent=record)
            record[name] = group
            match(UNTRACKED)  # ignore
            if match(OCCURS):
                record.occurs = require_num()
            else:
                record.occurs = 1
            if match(BASE):
                record.base = require_num()
            if match(CURRENT):
                record.current = require_id()
            parse_record(group, data)
            continue
            ... etc. etc. ...

Here match(), require_id(), require_num() are functions and upper case things are tokens.

This is fine; it works. But it is not the way I like to work. (I am a long-time Perl Parse::RecDescent user.)

I found that there is a limited choice of grammar based parsers for Python 3, so was looking at modgrammar. It works well apart from this issue.

I have to admit that there is probably not the impetus to rewrite the existing parser. It'll take a while; it's a big language.

I look forward to seeing what improvements you come up with.

Thanks again,
Leigh.

Alex Stewart

Dec 17, 2012, 2:24:04 PM
to modgr...@googlegroups.com, lhebble...@btinternet.com
Heh..

To be honest, I'm still debating somewhat whether whitespace-consuming should be the default mode or not.  On the one hand, it is convenient for many quick-and-dirty symbol-based grammars, and it also seems to be the presumed default in many academic/formal language circles, which is why I did Modgrammar that way to begin with, but on the other hand, it can result in unexpected (sometimes subtle) strange behaviors if people aren't expecting it to work that way (and I do generally agree with the Pythonism that "explicit is better than implicit").  It is quite possible that in later versions I may change it to be non-whitespace-consuming by default..

In any case, your example is very much the sort of thing that Modgrammar should be able to handle, and if nothing else I'm glad you brought it up because it brought up a rather obscure bug I hadn't previously found, as well as highlighting a couple of enhancements I can make which would make this sort of thing a lot easier..

Once I get all the current round of bugfixes and quick enhancements in (including the ones I mentioned above), I'll put out Modgrammar 0.9 (hopefully within the week) which should make the sort of thing you want to do a lot easier..

--Alex

Alex Stewart

Jan 4, 2013, 1:46:43 PM
to modgr...@googlegroups.com, lhebble...@btinternet.com
Ok, so Modgrammar 0.9 has now been released (took a bit longer than I expected due to the holidays and everything)..

Attached is a modified version of your test script which takes advantage of a couple of new features in 0.9 to make things work a lot more smoothly:
  • The new grammar_whitespace_mode = 'required' setting should make this sort of language a lot easier.  It automatically requires some whitespace between each token, so you don't have to specify it all explicitly yourself.
  • I've added "longest=True" to the WORD grammar, which will only match the longest possible string of letters, and not try to backtrack to shorter words.  This is technically enforced by the grammar_whitespace_mode = 'required' anyway, since none of the shorter words would have whitespace following them, so they'd fail that check, but this both (a) makes the parsing more efficient and (b) avoids the sort of confusing ParseError message you encountered before.
I also added a grammar_desc to the Identifier class, to make the error message more clear as well, so now I think the error we get with your sample text would make a lot more sense to a potential user:

modgrammar.ParseError: [line 2, column 6] Expected an identifier: Found 'End'
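
As a rough picture of how those three pieces fit together (longest-word matching, required whitespace, and a descriptive error), here is a small hand-written sketch in plain Python. It is not the attached script and does not use modgrammar; SketchParseError and parse_form_header are hypothetical names, and the message format only imitates the one above:

```python
import re

RESERVED = {"Form", "Data", "End"}            # assumed reserved words
WORD = re.compile(r"[A-Za-z][A-Za-z0-9_]*")   # greedy, so always the longest word
SPACE = re.compile(r"\s+")                    # whitespace is required, not optional


class SketchParseError(Exception):
    """Hypothetical error type for this sketch."""


def parse_form_header(text):
    """Parse 'Form <identifier>' and return the identifier."""
    keyword = WORD.match(text)
    if keyword is None or keyword.group() != "Form":
        raise SketchParseError("Expected 'Form'")
    gap = SPACE.match(text, keyword.end())
    if gap is None:
        raise SketchParseError("Expected whitespace after 'Form'")
    word = WORD.match(text, gap.end())
    if word is None or word.group() in RESERVED:
        found = word.group() if word else text[gap.end():gap.end() + 10]
        raise SketchParseError("Expected an identifier: Found %r" % found)
    return word.group()


for sample in ("Form Customer", "Form End"):
    try:
        print(parse_form_header(sample))
    except SketchParseError as exc:
        print("error:", exc)
```

Because the word is always taken whole, the error names the complete offending word rather than a fragment of it.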

Let me know what you think and if you have any more questions..

--Alex
modgrammar_test-0.9.py