I want to use this tool to generate a parser from an ABNF grammar,
and I'm running into some problems. I chose it because, together with
DParser, it is one of the only GLR parser generators.
So far I've seen that rules specified earlier take precedence. My
problem is that the lexer has no context from the grammar. To be more
specific, this is the subset of rules that led me to the problem.
The .pyl file is something like:
***************************************
INITIAL :
"[bB]" : return "BINLETTER"
"[dD]" : return "DECLETTER"
"[xX]" : return "HEXLETTER"
"[0-1]+": return "BIN"
"[0-9]+": return "DEC"
"[a-fA-F0-9]+": return "HEX"
"\n+" : return
***************************************
and the .pyg file:
***************************************
statement ->
BINLETTER BIN :
return "bin"
| DECLETTER DEC :
return "dec"
| HEXLETTER HEX :
return "hex"
;
***************************************
The ambiguity in the lexer comes from BIN, DEC and HEX, whose
patterns overlap and are tried in that order; the DECLETTER pattern
is also a subset of the HEX pattern. But the grammar itself shouldn't
be ambiguous, because a letter precedes the digits.
These are the outputs I get:
BNF > xA1
hex
BNF > x12
ParseError in Token DEC with string : 12
BNF > x11
ParseError in Token BIN with string : 11
The error is obvious: the lexer simply follows the precedence of the
tokens in its specification.
Taking a look at the code, I found that when glr.py calls the token
function, the latter returns only the first token even if it finds
more than one, and nowhere does it take into account which tokens
could legally come next.
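To make the failure mode concrete, here is a toy model of that
lexing behavior in Python. It is a sketch, not pyggy's actual code:
it assumes the lexer takes the longest match and breaks ties by rule
order, which is consistent with the outputs above.
***************************************
import re

# Ordered (pattern, token) pairs mirroring the .pyl file above.
RULES = [
    (re.compile(r"[bB]"), "BINLETTER"),
    (re.compile(r"[dD]"), "DECLETTER"),
    (re.compile(r"[xX]"), "HEXLETTER"),
    (re.compile(r"[0-1]+"), "BIN"),
    (re.compile(r"[0-9]+"), "DEC"),
    (re.compile(r"[a-fA-F0-9]+"), "HEX"),
]

def tokenize(s, rules):
    pos, out = 0, []
    while pos < len(s):
        best = None
        for pat, name in rules:
            m = pat.match(s, pos)
            # longest match wins; ties go to the earlier rule
            if m and (best is None or m.end() > best[0].end()):
                best = (m, name)
        if best is None:
            raise ValueError("no rule matches at offset %d" % pos)
        out.append((best[1], best[0].group()))
        pos = best[0].end()
    return out

print(tokenize("xA1", RULES))  # [('HEXLETTER', 'x'), ('HEX', 'A1')]
print(tokenize("x12", RULES))  # [('HEXLETTER', 'x'), ('DEC', '12')]
print(tokenize("x11", RULES))  # [('HEXLETTER', 'x'), ('BIN', '11')]
***************************************
The letter context is simply never consulted, so "12" and "11" can
never come back as HEX.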
What I'm planning to do, unless you have better suggestions, is this:
before calling the token function in glr.py, follow the approach of
YAPPS, which passes in the tokens that are allowed (i.e. the ones
that fit the grammar at that point). This step will solve this
problem; I don't know if it will introduce others.
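A rough sketch of that first step, reusing the toy tokenize() above
(the allowed set, standing in for whatever the parser can report
about its current state, is hypothetical):
***************************************
# Keep only the rules whose token the grammar can accept here.
def token_restricted(s, pos, rules, allowed):
    usable = [(p, n) for p, n in rules if n in allowed]
    return tokenize(s[pos:], usable)[0]

# After HEXLETTER the grammar only allows HEX, so "12" now lexes right:
print(token_restricted("x12", 1, RULES, {"HEX"}))  # ('HEX', '12')
***************************************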
Still, that won't be enough. The second approach I was thinking of is
to return the list of matching tokens; if the grammar then ends up in
a token error, that means I should move on to the next token in the
list.
I already tried to implement this, but it was more complicated than I
thought, which is why I'm asking for suggestions. Does this sound
okay? Any better ideas?
Thanks in advance.
Humberto
> Taking a look at the code, I found that when glr.py calls the token
> function, the latter returns only the first token even if it finds
> more than one, and nowhere does it take into account which tokens
> could legally come next.
Yes. The parser will handle ambiguous grammars, but the lexer
does not handle ambiguous lexing rules. However, in your case
you can disambiguate entirely in the lexer:
> INITIAL :
> "[bB][01]+" : return "BIN"
> "[dD][0-9]+" : return "DEC"
> "[xX][a-fA-F0-9]+" : return "HEX"
> "\n+" : return
Alternatively, you could separate the tokens into non-overlapping
tokens and then use those in your grammar. For this problem it's
not ideal, but it may be useful in other situations.
For example:
[01] bin
[2-9] dec
[a-fA-F] hex
Then your grammar would be:
-> BINLETTER bin+
| DECLETTER (bin | dec)+
| HEXLETTER (bin | dec | hex)+
;
> Humberto
Tim Newsham
http://www.lava.net/~newsham/
> While doing that, I found and fixed what I think was a bug.
> It shows up when we have an ambiguous grammar and the tree is processed
> with the ambiguous flag set. The problem is that the grammar function
> is called with a list of symnode or rulenode objects (I don't remember
> which) rather than with what should be the kids.
This is the intended behavior. If the parse can be ambiguous, the
grammar rules will have to be written differently. The number of
possible trees can grow quite large, and I didn't want to separate
them out into all possible trees and eliminate the node sharing.
> Also, as I was parsing a huge message with a huge grammar, proctree
> hit the recursion limit, so I implemented an iterative version of it.
Others have reported this as well. I am not sure what a good solution
would be. I don't think moving the recursion stack to a variable is
a very elegant solution. I haven't really investigated alternatives
much. My preference would be to get a larger Python stack...
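For what it's worth, the standard library already has two knobs for
exactly that. The helper below is just an illustrative sketch (the
name run_with_big_stack is mine, not pyggy API):
***************************************
import sys, threading

def run_with_big_stack(fn, *args):
    # Illustrative helper: run fn in a thread created after enlarging
    # the C stack, with a higher Python recursion limit in effect.
    result = []
    def worker():
        sys.setrecursionlimit(100000)
        result.append(fn(*args))
    threading.stack_size(64 * 1024 * 1024)  # 64 MB for new threads
    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return result[0]
***************************************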
>>> Yes. The parser will handle ambiguous grammars, but the lexer
>>> does not handle ambiguous lexing rules. However, in your case
>>> you can disambiguate entirely in the lexer:
>
> What I did was to call the token function with the list of allowed
> tokens (as in the YAPPS approach), and it improves things a lot with
> ambiguous lexers and grammars.
[...]
> Now the second problem is that even if I pass the lexer the list of
> allowed tokens, it can still find more than one. For this problem I
> don't see any better solution than to start a new active parser for
> each token returned. But I haven't worked on it yet.
I don't think this is a very elegant solution. If you want an ambiguous
lexer and an ambiguous parser, a better solution is to do everything
inside of the grammar and do away with the separation between parser and
lexer. This is called scannerless parsing. My original goal was to start
off with a GLR parser and move towards a scannerless parser, but I haven't
had time or the need to work on the parser in a long time and this goal
has sort of been dropped on the floor...
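To make that concrete: scannerless, the grammar from the top of the
thread would absorb the character classes into its productions,
roughly like this (illustrative EBNF, not pyggy or DParser syntax):
***************************************
statement -> [bB] bindigit+
           | [dD] decdigit+
           | [xX] hexdigit+ ;
bindigit  -> [01] ;
decdigit  -> [0-9] ;
hexdigit  -> [a-fA-F0-9] ;
***************************************
Because the GLR machinery tracks every interpretation of each
character, the letter context does the disambiguation that the
standalone lexer couldn't.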
In the meantime, DParser has been made to work with Python. It is
a scannerless GLR parser. I don't like the fact that they put
grammar productions in docstrings (like many Python-based parsers do),
but it shouldn't be too hard to write a spec-file front end for the
thing if someone were so inclined:
http://staff.washington.edu/sabbey/py_dparser/
If pyggy is not quite up to speed for your project, you might want
to investigate DParser.
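If I remember the py_dparser conventions right (treat the exact API
as an assumption and check the page above), the docstring style looks
roughly like this, with terminals written straight into the
production since the parser is scannerless:
***************************************
from dparser import Parser   # assumed py_dparser module/class names

def d_statement(t):
    # py_dparser reads the production from the docstring
    'statement : "[bB]" "[01]+" '
    return "bin"

print(Parser().parse("b101"))
***************************************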
> Humberto
Tim Newsham
http://www.thenewsh.com/~newsham/