Web Images Videos Maps News Shopping Gmail more »
Recently Visited Groups | Help | Sign in
Google Groups Home
Contextless between the grammar and the lexer
There are currently too many topics in this group that display first. To make this topic appear first, remove this option from another topic.
There was an error processing your request. Please try again.
flag
  5 messages - Collapse all  -  Translate all to Translated (View all originals)
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
 
From:
To:
Cc:
Followup To:
Add Cc | Add Followup-to | Edit Subject
Subject:
Validation:
For verification purposes please type the characters you see in the picture below or the numbers you hear by clicking the accessibility icon. Listen and type the numbers you hear
 
humbol  
View profile  
 More options Jul 18 2006, 9:58 am
From: "humbol" <hum...@gmail.com>
Date: Tue, 18 Jul 2006 13:58:09 -0000
Local: Tues, Jul 18 2006 9:58 am
Subject: Contextless between the grammar and the lexer
Hi all,

I want to use this tool in order to generate from an ABNF grammar a
parser for it and i m running to some problems. My election was due
that together with DParser are the only GLR parser generators.

So far i saw that the rules specified earlier have precedence. My
problem is that there is no context with the grammar. To be more
specific, this is the subset of rules that  had drove me to the
problem.

the .pyl is something like :

***************************************
INITIAL :
    "[bB]" :    return "BINLETTER"
    "[dD]" :    return "DECLETTER"
    "[xX]" :    return "HEXLETTER"

    "[0-1]+":    return "BIN"
    "[0-9]+":    return "DEC"
    "[a-fA-F0-9]+":    return "HEX"

    "\n+" :        return
***************************************

and the .pyg :

***************************************
statement ->
    BINLETTER BIN :
        return "bin"
    | DECLETTER DEC :
        return "dec"
    | HEXLETTER HEX :
        return "hex"
    ;
***************************************

The ambiguity in the lexer comes from BIN , DEC and HEX that are
included in that order, also DECLETTER is included in HEX.

But the grammar shouldnt be ambiguous itself because there is a letter
that precede the numbers.

Now this are the outputs that i get:

BNF > xA1
hex
BNF > x12
ParseError in Token DEC with string : 12
BNF > x11
ParseError in Token BIN with string : 11

The error is obvious and it is because it follows the precedence of
each token in the lexer specification.

Taking a look in the code i found that when the glr.py call the token
function, the latter even if it founds more than one token it only
returns the first one. And in no place it cares about which are the
possible next tokens.

What  i m planning to do, if you dont ve any better suggestions is:

Before calling the function token in the glr.py follow the approach
from YAPPS which sends the tokens that are allowed (i.e. the ones that
fit in the grammar). This step will solve this problem. I dont know if
it ll introduce more.

Still, that wont be enought. The second approach that i was thinking is
to return the list of tokens and if the grammar end up in a token error
will mean that  i should pass to the next token.

I was already trying to do it, but it was more complicated than i
thought, it is why i ask for suggestions. does it sound okey, any
better ideas?

thx in advance.

Humberto


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Tim Newsham  
View profile  
 More options Jul 18 2006, 2:00 pm
From: Tim Newsham <news...@lava.net>
Date: Tue, 18 Jul 2006 08:00:47 -1000 (HST)
Local: Tues, Jul 18 2006 2:00 pm
Subject: Re: Contextless between the grammar and the lexer

Yes.  The parser will handle ambiguous grammars, but the lexer
does not handle ambiguous lexing rules.  However, in your case
you can disambiguate entirely in the lexer:

> INITIAL :
>    "[bB][01]+" :           return "BIN"
>    "[dD][0-9]+" :          return "DEC"
>    "[xX][a-fA-F0-9]+" :    return "HEX"
>    "\n+" :                 return

Alternately you could separate the tokens into non-overlapping
tokens and then use those in your grammar.  For this problem its
not very ideal, but it may be useful in other situations.
For example:

     [01]      bin
     [2-9]     dec
     [a-fA-F]  hex

then your grammar would be

     -> BINLETTER bin+
     |  DECLETTER (bin | dec)+
     |  HEXLETTER (bin | dec | hex)+
     ;

> Humberto

Tim Newsham
http://www.lava.net/~newsham/

    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Humberto Abdelnur  
View profile  
 More options Jul 18 2006, 3:09 pm
From: "Humberto Abdelnur" <hum...@gmail.com>
Date: Tue, 18 Jul 2006 21:09:10 +0200
Local: Tues, Jul 18 2006 3:09 pm
Subject: Re: Contextless between the grammar and the lexer

thanks for the quick reply.
What you say is right but I guess it does not apply to the use i want to
give to my tool. To let you understand my problem, what i want to do is to
take the ABNF of a RFC and automatically generate the parser (among with
other stuff) for the protocol in question. In such case i imagine that it
will be problems as the one i mentioned, specially because the RFCs define
really ambiguous grammars and they are really long in some cases.
Modifying the ABNF is  possible, but i was expecting to do only minor
modification. So i was trying to see if the problem was fixable from the
root.
Again, thanks a lot.

Humberto

On 7/18/06, Tim Newsham <news...@lava.net> wrote:


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Humberto Abdelnur  
View profile  
 More options Aug 1 2006, 4:18 am
From: "Humberto Abdelnur" <hum...@gmail.com>
Date: Tue, 1 Aug 2006 10:18:04 +0200
Local: Tues, Aug 1 2006 4:18 am
Subject: Re: Contextless between the grammar and the lexer

Hi back,
As i told you early i was planning to modify the code in order to solve the
problems that i had.
During that i found and fix  what i think was a bug.
It is when we have an ambiguous grammar and it is asked to process the tree
with the ambiguous flag. The problem is that it call the grammar function
with  list of symnode or rulenode (i dont remember) rather than with what
should be the kids. The problem was in the order that you checked the
lengths of the possibilities of the tree. As it is a tree now, what i did
was defined a new function to return the next grammar  processed in a  Depth
first tree manner. I didnt want to return the list of trees, because some
initialization probably has to be done before the next one start. I know
that that could be a function in the code section of the grammar, but i
keeped it simple.
Also, as i was parsing a huge message with a huge grammar, the proctree
reach the recursion limit, so i implement an iterative function for it. To
my surprise it runs a little bit slower than the recursive, i dont know if i
mess it up somewhere or is because how i m managing the stack. Anyways, i
stop taking a look on it, probably i ll do it later. but here is my little
contribution. I ll copy in the message but if you want me to send u the
file, let me know. This function should be in the helpers.py file (just in
case)

class TreeProc :
    def __init__(self, gram, allowambig=0) :
        self.gram = gram
        self.allowambig = allowambig
        self.pathschoose = []
        self.exploringitem = 0

    def nextproctree(self,t):
        while len(self.pathschoose):
            curpath,curposs = self.pathschoose.pop()
            curpath += 1
            if curpath != curposs:
                self.pathschoose.append( (curpath,curposs) )
                break

        self.exploringitem = 0
        if len(self.pathschoose):
            return self.proctree(t)
        else:
            return None

    def proctree(self,t):

        rulestack = []
        stack = []

        value = []
        rule = None

        stack.append(t)

        while len(rulestack) or len(stack):
            if len(stack):
                curr = stack.pop(0)
            else: #elif len(rulestack):
                newvalue = rule(value)
                rule, value, stack = rulestack.pop()

                value.append(newvalue)
                continue

            if isinstance(curr, glr.symnode) :
                if len(curr.possibilities) == 0 :
                    # for terminals, return the token value
                    value.append( curr.sym[1] )
                elif len(curr.possibilities) == 1 :
                    stack.insert(0,curr.possibilities[0])
                elif self.allowambig :
                    # Save  the path where is going if this area was not
explored earlier or
                    #  continue by the path

                    if len(self.pathschoose) <= self.exploringitem:
                        self.pathschoose.append( (0,len(curr.possibilities))
)

                    nextpath = self.pathschoose[self.exploringitem][0]
                    self.exploringitem += 1

                    stack.insert(0,curr.possibilities[nextpath])
                else: #elif len(t.possibilities) > 1 :
                    raise AmbigParseError("ambiguous parse of %s" % curr.sym
)

            elif isinstance(curr, glr.rulenode) :
                # return the result of the semantic action performed on the
                # children.

                rulestack.append( (rule, value, stack) )
                sym,redlen,prodno = curr.rule
                rule = self.gram.semactions[prodno]
                stack = curr.elements[:]
                value = []
            else :
                raise ApiError("Illegal node in tree: %s" % curr.__class__)

        if len(value) > 1 :
            raise error("Something wrong in the production of the rule :s
%s" % curr.__class__)

        return value[0]

A second thing was, as i told you earlier, was for the following discussion:

> Taking a look in the code i found that when the glr.py call the token
> > > function, the latter even if it founds more than one token it only
> > > returns the first one. And in no place it cares about which are the
> > > possible next tokens.

> > Yes.  The parser will handle ambiguous grammars, but the lexer
> > does not handle ambiguous lexing rules.  However, in your case
> > you can disambiguate entirely in the lexer:

What i did was to call the token function with the list of allowed tokens
(as YAPPS approach) and it improves a lot in ambiguous lexers and grammars.
Still is not enough because of what i ll tell you later, but what  i did is
not really clean for the moment. The thing is that the only way for a token
(in the lexer) to know the name of it is from the return value. So in order
to check which is the name of the token in the lexer i ve to run the action
to get the returned value. But the problem is that i m running the action
(some code can be set there and I ve side effects of course). So, for my
examples in which i dont put code except of the "return TOKEN", it works
perfectly. What i was thinking to do is to extend the grammar to be:

deflist  ::=  IDENT<http://users.lava.net/%7Enewsham/pyggy/html/%3CpyGrammarToken%3E%3Cpy...>ID
:
¨<http://users.lava.net/%7Enewsham/pyggy/html/%3CpyGrammarToken%3E%3Cpy...>regexp
¨<http://users.lava.net/%7Enewsham/pyggy/html/%3CpyGrammarToken%3E%3Cpy...>
\n<http://users.lava.net/%7Enewsham/pyggy/html/%3CpyGrammarToken%3E%3Cpy...>
deflist  ::=  deflist
IDENT<http://users.lava.net/%7Enewsham/pyggy/html/%3CpyGrammarToken%3E%3Cpy...>
ID
: ¨<http://users.lava.net/%7Enewsham/pyggy/html/%3CpyGrammarToken%3E%3Cpy...>regexp
¨<http://users.lava.net/%7Enewsham/pyggy/html/%3CpyGrammarToken%3E%3Cpy...>
EOL<http://users.lava.net/%7Enewsham/pyggy/html/%3CpyGrammarToken%3E%3Cpy...>
or something like it rather than:

rulelist  ::=  ¨<http://users.lava.net/%7Enewsham/pyggy/html/%3CpyGrammarToken%3E%3Cpy...>rulepat
¨<http://users.lava.net/%7Enewsham/pyggy/html/%3CpyGrammarToken%3E%3Cpy...>
SRCCODE<http://users.lava.net/%7Enewsham/pyggy/html/%3CpyGrammarToken%3E%3Cpy...>
rulelist  ::=  rulelist
¨<http://users.lava.net/%7Enewsham/pyggy/html/%3CpyGrammarToken%3E%3Cpy...>rulepat
¨<http://users.lava.net/%7Enewsham/pyggy/html/%3CpyGrammarToken%3E%3Cpy...>
SRCCODE<http://users.lava.net/%7Enewsham/pyggy/html/%3CpyGrammarToken%3E%3Cpy...>

An in that way when the lexer runs in a new file it can pre calculate the
name of the token of each expression and i dont ve to run the action in
order to know it and it keeps the full functionality u did for.
If you are interest in that approach let me know.

Now the second problem is even if i pass to the lexer the allowed list of
token, the lexer can still find more than one. For this problem i dont see
any better solution than to start a new active parser for each token
returned. But i didnt work on it yet.

Well, that it is.

Humberto


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Tim Newsham  
View profile  
 More options Aug 1 2006, 1:37 pm
From: Tim Newsham <news...@lava.net>
Date: Tue, 1 Aug 2006 07:37:51 -1000 (HST)
Local: Tues, Aug 1 2006 1:37 pm
Subject: Re: Contextless between the grammar and the lexer
I don't have time for a full reply right now,  I'll try to give
a more complete reply in about a week, but I wanted to mention
a few points...

> During that i found and fix  what i think was a bug.
> It is when we have an ambiguous grammar and it is asked to process the tree
> with the ambiguous flag. The problem is that it call the grammar function
> with  list of symnode or rulenode (i dont remember) rather than with what
> should be the kids.

This is the intended behavior.  If the parse can be ambiguous the
grammar rules will have to be written differently.  The number of
possible trees can grow quite large and I didn't want to separate
them out into all possible trees and eliminate the node sharing.

> Also, as i was parsing a huge message with a huge grammar, the proctree
> reach the recursion limit, so i implement an iterative function for it.

Others have reported this as well.  I am not sure what a good solution
would be.  I don't think moving the recursion stack to a variable is
a very elegant solution.  I haven't really investigated alternatives
much.  My preference would be to get a larger python stack...

>>> Yes.  The parser will handle ambiguous grammars, but the lexer
>>> does not handle ambiguous lexing rules.  However, in your case
>>> you can disambiguate entirely in the lexer:

> What i did was to call the token function with the list of allowed tokens
> (as YAPPS approach) and it improves a lot in ambiguous lexers and grammars.

[...]

> Now the second problem is even if i pass to the lexer the allowed list of
> token, the lexer can still find more than one. For this problem i dont see
> any better solution than to start a new active parser for each token
> returned. But i didnt work on it yet.

I don't think this is a very elegant solution.  If you want an ambiguous
lexer and an ambiguous parser, a better solution is to do everything
inside of the grammar and do away with the separation between parser and
lexer.  This is called scannerless parsing. My original goal was to start
off with a GLR parser and move towards a scannerless parser, but I haven't
had time or the need to work on the parser in a long time and this goal
has sort of been dropped on the floor...

In the meantime, dparser has been made to work with python.  It is
a scannerless GLR parser.  I don't like the fact that they put
grammar productions in docstrings (like many python based parsers do)
but it shouldn't be too hard to write a spec-file front end for the
thing if someone was so inclined:

    http://staff.washington.edu/sabbey/py_dparser/

If pyggy is not quite up to speed for your project, you might want
to investigate that dparser.

> Humberto

Tim Newsham
http://www.thenewsh.com/~newsham/

    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
End of messages
« Back to Discussions « Newer topic     Older topic »

Create a group - Google Groups - Google Home - Terms of Service - Privacy Policy
©2009 Google