No context shared between the grammar and the lexer


humbol

Jul 18, 2006, 9:58:09 AM
to pyggy
Hi all,

I want to use this tool to generate a parser from an ABNF grammar, and I am
running into some problems. I chose it because it and DParser are the only
GLR parser generators I could find.

So far I have seen that rules specified earlier take precedence. My
problem is that the lexer gets no context from the grammar. To be more
specific, this is the subset of rules that led me to the problem.

The .pyl is something like:

***************************************
INITIAL :
"[bB]" : return "BINLETTER"
"[dD]" : return "DECLETTER"
"[xX]" : return "HEXLETTER"

"[0-1]+": return "BIN"
"[0-9]+": return "DEC"
"[a-fA-F0-9]+": return "HEX"

"\n+" : return
***************************************

and the .pyg:

***************************************
statement ->
BINLETTER BIN :
return "bin"
| DECLETTER DEC :
return "dec"
| HEXLETTER HEX :
return "hex"
;
***************************************

The ambiguity in the lexer comes from BIN, DEC and HEX, which overlap and
are listed in that order; DECLETTER also overlaps with HEX.

But the grammar itself shouldn't be ambiguous, because a letter precedes
the numbers.

Now these are the outputs I get:

BNF > xA1
hex
BNF > x12
ParseError in Token DEC with string : 12
BNF > x11
ParseError in Token BIN with string : 11

The error is obvious: the lexer simply follows the precedence of each
token in the lexer specification.
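
For illustration, here is a standalone sketch (not pyggy's code) of a lexer that takes the longest match and breaks ties in favor of the earlier rule; it reproduces exactly the results above:

import re

# Standalone sketch: longest match wins, ties go to the earlier rule.
RULES = [("BINLETTER", r"[bB]"), ("DECLETTER", r"[dD]"), ("HEXLETTER", r"[xX]"),
         ("BIN", r"[0-1]+"), ("DEC", r"[0-9]+"), ("HEX", r"[a-fA-F0-9]+")]

def next_token(text, pos):
    best = None
    for name, pat in RULES:
        m = re.match(pat, text[pos:])
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

print(next_token("x12", 1))   # ('DEC', '12') -- the grammar wanted HEX, hence the error
print(next_token("x11", 1))   # ('BIN', '11') -- the grammar wanted HEX, hence the error
print(next_token("xA1", 1))   # ('HEX', 'A1') -- only HEX matches, so this one works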

Taking a look at the code, I found that when glr.py calls the token
function, the latter only returns the first token even if it finds more
than one, and nowhere does it take into account which tokens could come
next.

What I am planning to do, unless you have better suggestions, is the
following: before calling the token function in glr.py, follow the approach
of YAPPS, which passes in the tokens that are allowed (i.e. the ones that
fit the grammar). This step will solve this problem; I don't know whether
it will introduce new ones.
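
Roughly, the idea would look like this (only a sketch of the plan with a made-up rule table, not pyggy's actual token function):

import re

# The parser passes in the set of token names it can accept next, and the
# lexer only considers those rules.  RULES and the calling convention here
# are assumptions, not pyggy's API.
RULES = [("BINLETTER", r"[bB]"), ("DECLETTER", r"[dD]"), ("HEXLETTER", r"[xX]"),
         ("BIN", r"[0-1]+"), ("DEC", r"[0-9]+"), ("HEX", r"[a-fA-F0-9]+")]

def token(text, pos, allowed):
    best = None
    for name, pat in RULES:
        if name not in allowed:
            continue                      # skip rules the grammar cannot use here
        m = re.match(pat, text[pos:])
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

# After HEXLETTER the grammar only allows HEX, so "11" now lexes as expected:
print(token("x11", 1, {"HEX"}))           # ('HEX', '11')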

Still, that won't be enough. The second approach I was thinking of is to
return the list of tokens, and if the grammar ends up with a token error,
that would mean I should move on to the next token.

I was already trying to do this, but it was more complicated than I
thought, which is why I am asking for suggestions. Does this sound okay?
Any better ideas?

Thanks in advance.

Humberto

Tim Newsham

Jul 18, 2006, 2:00:47 PM
to pyggy
> the .pyl is something like :
>
> ***************************************
> INITIAL :
> "[bB]" : return "BINLETTER"
> "[dD]" : return "DECLETTER"
> "[xX]" : return "HEXLETTER"
>
> "[0-1]+": return "BIN"
> "[0-9]+": return "DEC"
> "[a-fA-F0-9]+": return "HEX"
>
> "\n+" : return
> ***************************************
>
> and the .pyg :
>
> ***************************************
> statement ->
> BINLETTER BIN :
> return "bin"
> | DECLETTER DEC :
> return "dec"
> | HEXLETTER HEX :
> return "hex"
> ;
> ***************************************

> Taking a look at the code, I found that when glr.py calls the token
> function, the latter only returns the first token even if it finds more
> than one, and nowhere does it take into account which tokens could come
> next.

Yes. The parser will handle ambiguous grammars, but the lexer
does not handle ambiguous lexing rules. However, in your case
you can disambiguate entirely in the lexer:

> INITIAL :
> "[bB][01]+" : return "BIN"
> "[dD][0-9]+" : return "DEC"
> "[xX][a-fA-F0-9]+" : return "HEX"
> "\n+" : return
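
With the letter folded into the number token, the grammar only has to select on the token kind; a sketch in the same .pyg style as above (actions kept trivial) would be:

statement ->
BIN :
return "bin"
| DEC :
return "dec"
| HEX :
return "hex"
;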


Alternatively, you could separate the tokens into non-overlapping
tokens and then use those in your grammar. For this problem it's
not ideal, but it may be useful in other situations.
For example:

[01] bin
[2-9] dec
[a-fA-F] hex

then your grammar would be

-> BINLETTER bin+
| DECLETTER (bin | dec)+
| HEXLETTER (bin | dec | hex)+
;
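
If .pyg productions are plain BNF (an assumption here; the + and ( | ) above are shorthand), the repetition would be spelled out with recursive rules along these lines, semantic actions omitted:

bindigits -> bin | bindigits bin ;
decdigits -> bin | dec | decdigits bin | decdigits dec ;
hexdigits -> bin | dec | hex | hexdigits bin | hexdigits dec | hexdigits hex ;

statement -> BINLETTER bindigits
| DECLETTER decdigits
| HEXLETTER hexdigits
;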


> Humberto

Tim Newsham
http://www.lava.net/~newsham/

Humberto Abdelnur

Jul 18, 2006, 3:09:10 PM
to py...@googlegroups.com
Thanks for the quick reply.
What you say is right, but I guess it does not apply to the use I want to make of this tool. To explain my problem: what I want to do is take the ABNF from an RFC and automatically generate the parser (among other things) for the protocol in question. In that case I imagine there will be problems like the one I mentioned, especially because the RFCs define really ambiguous grammars, and in some cases they are really long.
Modifying the ABNF is possible, but I was expecting to make only minor modifications, so I was trying to see whether the problem was fixable at the root.
Again, thanks a lot.

Humberto

Humberto Abdelnur

Aug 1, 2006, 4:18:04 AM
to py...@googlegroups.com
Hi back,
As I told you earlier, I was planning to modify the code to solve the problems I had.
While doing that, I found and fixed what I think was a bug.
It shows up when we have an ambiguous grammar and we ask to process the tree with the ambiguous flag. The problem is that it calls the grammar function with a list of symnode or rulenode objects (I don't remember which) rather than with what should be the kids. The issue was in the order in which the lengths of the possibilities of the tree are checked. Since it is a tree, what I did was define a new function that returns the next result, processing the tree in a depth-first manner. I didn't want to return the list of trees, because some initialization probably has to be done before the next one starts. I know that could be a function in the code section of the grammar, but I kept it simple.
Also, since I was parsing a huge message with a huge grammar, proctree reached the recursion limit, so I implemented an iterative version of it. To my surprise it runs a little slower than the recursive one; I don't know if I messed something up or if it is because of how I am managing the stack. Anyway, I stopped looking into it; I'll probably come back to it later. Here is my little contribution. I'll copy it into the message, but if you want me to send you the file, let me know. This class should go in the helpers.py file (just in case).


class TreeProc :
    """Iterative replacement for proctree() that can enumerate the
    interpretations of an ambiguous parse one at a time.  Intended for
    helpers.py, where glr, AmbigParseError and ApiError are in scope."""
    def __init__(self, gram, allowambig=0) :
        self.gram = gram
        self.allowambig = allowambig
        # one (choice taken, number of possibilities) entry per ambiguous node
        self.pathschoose = []
        self.exploringitem = 0   # index into pathschoose during the current walk

    # Advance to the next combination of choices (rightmost entry first, like
    # an odometer) and re-run proctree(); returns None once every combination
    # has been tried.
    def nextproctree(self,t):
        while len(self.pathschoose):
            curpath,curposs = self.pathschoose.pop()
            curpath += 1
            if curpath != curposs:
                self.pathschoose.append( (curpath,curposs) )
                break
               
        self.exploringitem = 0
        if len(self.pathschoose):
            return self.proctree(t)
        else:
            return None
       
    # Iterative equivalent of the recursive proctree(): walks the tree with
    # explicit stacks instead of recursion.
    def proctree(self,t):
       
        rulestack = []
        stack = []
       
        value = []
        rule = None
       
        stack.append(t)
       
        while len(rulestack) or len(stack):
            if len(stack):
                curr = stack.pop(0)
            else: #elif len(rulestack):
                newvalue = rule(value)
                rule, value, stack = rulestack.pop()
               
                value.append(newvalue)
                continue
               
            if isinstance(curr, glr.symnode) :
                if len(curr.possibilities) == 0 :
                    # for terminals, return the token value
                    value.append( curr.sym[1] )
                elif len(curr.possibilities) == 1 :
                    stack.insert(0,curr.possibilities[0])
                elif self.allowambig :
                    # Save  the path where is going if this area was not explored earlier or
                    #  continue by the path

                    if len(self.pathschoose) <= self.exploringitem:
                        self.pathschoose.append( (0,len(curr.possibilities)) )
               
                    nextpath = self.pathschoose[self.exploringitem][0]
                    self.exploringitem += 1
                   
                    stack.insert(0,curr.possibilities[nextpath])
                else: #elif len( t.possibilities) > 1 :
                    raise AmbigParseError("ambiguous parse of %s" % curr.sym)
           
            elif isinstance(curr, glr.rulenode) :
                # return the result of the semantic action performed on the
                # children.
               
                rulestack.append( (rule, value, stack) )
                sym,redlen,prodno = curr.rule
                rule = self.gram.semactions[prodno]
                stack = curr.elements[:]
                value = []
            else :
                raise ApiError("Illegal node in tree: %s" % curr.__class__)

        if len(value) > 1 :
            raise ApiError("more than one value left after processing the tree: %r" % (value,))
           
        return value[0]
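
To make the intended use clear, this is how the class is driven (based only on the code above; gram and tree stand for whatever grammar object and parse tree you already have):

# Usage sketch, derived from the TreeProc class above.  'gram' and 'tree'
# are placeholders for the grammar object normally handed to proctree() and
# the parse tree returned by the GLR parser.
tp = TreeProc(gram, allowambig=1)

result = tp.proctree(tree)          # first interpretation, depth-first choices
while result is not None:
    print(result)                   # ...or whatever you do with one interpretation
    result = tp.nextproctree(tree)  # next combination of choices, None when done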


A second thing, as I told you earlier, concerns the following discussion:


> Taking a look at the code, I found that when glr.py calls the token
> function, the latter only returns the first token even if it finds more
> than one, and nowhere does it take into account which tokens could come
> next.

Yes.  The parser will handle ambiguous grammars, but the lexer
does not handle ambiguous lexing rules.  However, in your case
you can disambiguate entirely in the lexer:


What I did was call the token function with the list of allowed tokens (as in the YAPPS approach), and it improves things a lot for ambiguous lexers and grammars. It is still not enough, because of what I'll tell you later, and what I did is not really clean for the moment.
The thing is that the only way to know the name of a token (in the lexer) is from the return value of its action. So in order to check which token a lexer rule produces, I have to run the action to get the returned value. But that means I am running the action, and since arbitrary code can be put there, I get side effects, of course. For my examples, in which I don't put any code other than "return TOKEN", it works perfectly. What I was thinking of doing is extending the .pyl spec grammar to be:

deflist  ::=  IDENT ID : " regexp " \n
deflist  ::=  deflist IDENT ID : " regexp " EOL

or something like it rather than:

rulelist  ::=  " rulepat " SRCCODE
rulelist  ::=  rulelist " rulepat " SRCCODE


That way, when the lexer loads a new spec file, it can precompute the token name of each expression, so I don't have to run the action in order to know it, and it keeps the full functionality you designed it for.
If you are interested in that approach, let me know.

Now, the second problem is that even if I pass the lexer the list of allowed tokens, the lexer can still find more than one. For this problem I don't see any better solution than to start a new active parser for each token returned, but I haven't worked on it yet.

Well, that's it.

Humberto

Tim Newsham

Aug 1, 2006, 1:37:51 PM
to py...@googlegroups.com
I don't have time for a full reply right now; I'll try to give
a more complete reply in about a week, but I wanted to mention
a few points...

> While doing that, I found and fixed what I think was a bug.
> It shows up when we have an ambiguous grammar and we ask to process the tree
> with the ambiguous flag. The problem is that it calls the grammar function
> with a list of symnode or rulenode objects (I don't remember which) rather
> than with what should be the kids.

This is the intended behavior. If the parse can be ambiguous, the
grammar rules will have to be written differently. The number of
possible trees can grow quite large, and I didn't want to separate
them out into all possible trees and eliminate the node sharing.

> Also, since I was parsing a huge message with a huge grammar, proctree
> reached the recursion limit, so I implemented an iterative version of it.

Others have reported this as well. I am not sure what a good solution
would be. I don't think moving the recursion stack into a variable is
a very elegant solution, and I haven't really investigated alternatives
much. My preference would be to get a larger Python stack...
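
For what it's worth, the quick way to get more headroom is just to raise the interpreter's recursion limit before calling proctree (standard library, nothing pyggy-specific):

import sys

# Lets the existing recursive proctree() go deeper before hitting the
# recursion limit; the underlying C stack can still overflow if pushed
# too far.
sys.setrecursionlimit(20000)   # the default is typically 1000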

>>> Yes. The parser will handle ambiguous grammars, but the lexer
>>> does not handle ambiguous lexing rules. However, in your case
>>> you can disambiguate entirely in the lexer:
>
> What I did was call the token function with the list of allowed tokens
> (as in the YAPPS approach), and it improves things a lot for ambiguous lexers and grammars.

[...]

> Now, the second problem is that even if I pass the lexer the list of allowed
> tokens, the lexer can still find more than one. For this problem I don't see
> any better solution than to start a new active parser for each token
> returned, but I haven't worked on it yet.

I don't think this is a very elegant solution. If you want an ambiguous
lexer and an ambiguous parser, a better solution is to do everything
inside of the grammar and do away with the separation between parser and
lexer. This is called scannerless parsing. My original goal was to start
off with a GLR parser and move towards a scannerless parser, but I haven't
had time or the need to work on the parser in a long time and this goal
has sort of been dropped on the floor...

In the meantime, DParser has been made to work with Python. It is
a scannerless GLR parser. I don't like the fact that they put
grammar productions in docstrings (like many Python-based parsers do),
but it shouldn't be too hard to write a spec-file front end for the
thing if someone was so inclined:

http://staff.washington.edu/sabbey/py_dparser/

If pyggy is not quite up to speed for your project, you might want
to investigate DParser.

> Humberto

Tim Newsham
http://www.thenewsh.com/~newsham/
