Ignoring tokens in the parser

Dave Benjamin

Oct 11, 2010, 11:01:26 AM
to ply-hack
Hi all,

I'm writing a PHP parser with PLY, which you can find here:
http://github.com/ramen/phply

The lexer is designed to be as close as possible to the one built into
PHP (http://php.net/token_get_all), which means that there are tokens
for WHITESPACE, OPEN_TAG, CLOSE_TAG, and a few other syntactical
elements that are ignored by the parser, but still available in case
someone wants to use the lexer for color syntax highlighting, etc.

I don't want these tokens to produce any values in the parser output,
not even None, so the technique I've been using is to call errok() for
these tokens in the error handler:

def p_error(t):
    if t:
        if t.type in ('WHITESPACE', 'OPEN_TAG', 'CLOSE_TAG',
                      'COMMENT', 'DOC_COMMENT'):
            yacc.errok()
        else:
            raise SyntaxError('invalid syntax',
                              (None, t.lineno, None, t.value))
    else:
        raise SyntaxError('unexpected EOF while parsing',
                          (None, None, None, None))

http://github.com/ramen/phply/blob/master/phply/phpparse.py#L1297

I wonder if this is the right way to do it, or if there's a better
way. For one thing, when I start up my parser, I get the following
warnings:

WARNING: Token 'DOC_COMMENT' defined, but not used
WARNING: Token 'COMMENT' defined, but not used
WARNING: Token 'WHITESPACE' defined, but not used
WARNING: Token 'OPEN_TAG' defined, but not used
WARNING: There are 4 unused tokens

I can make these warnings go away by adding a rule that accepts these
tokens, but then I start producing values for them as well, which I
don't want. They can appear anywhere, so the error handler seems like
a convenient place to ignore them, but I wonder if this is an abuse of
this feature of PLY. I also wonder if it is thread-safe, since
yacc.errok() is module-level.

I'd appreciate any advice on the topic, or comments or suggestions
about the project in general. Thanks for your time!

Dave

Oldřich Jedlička

Oct 13, 2010, 2:25:20 AM
to ply-...@googlegroups.com, Dave Benjamin
Hi Dave,

On Monday 11 October 2010 17:01:26 Dave Benjamin wrote:
> Hi all,
>
> I'm writing a PHP parser with PLY, which you can find here:
> http://github.com/ramen/phply
>
> The lexer is designed to be as close as possible to the one built into
> PHP (http://php.net/token_get_all), which means that there are tokens
> for WHITESPACE, OPEN_TAG, CLOSE_TAG, and a few other syntactical
> elements that are ignored by the parser, but still available in case
> someone wants to use the lexer for color syntax highlighting, etc.

I would write another lexer that calls the full lexer (the one used for
syntax highlighting) but ignores the mentioned tokens. The parser should not
care about things like whitespace or comments.

Oldřich.
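
A minimal sketch of the wrapping-lexer idea, for illustration only. It assumes the full lexer follows PLY's usual interface (input() to load text, token() returning None at end of input). FilteredLexer, StubLexer, and the ECHO, STRING, and SEMI token names are hypothetical and not taken from phply; the stub stands in for the real lexer so the example runs without PLY installed.

```python
from collections import namedtuple

# Token types the parser should never see (the list from p_error above).
IGNORED = frozenset(['WHITESPACE', 'OPEN_TAG', 'CLOSE_TAG',
                     'COMMENT', 'DOC_COMMENT'])


class FilteredLexer(object):
    """Hypothetical wrapper: delegates to a full lexer, drops ignored tokens."""

    def __init__(self, lexer, ignored=IGNORED):
        self.lexer = lexer
        self.ignored = ignored

    def input(self, data):
        self.lexer.input(data)

    def token(self):
        # Keep pulling tokens until one is not ignored; None means end of input.
        tok = self.lexer.token()
        while tok is not None and tok.type in self.ignored:
            tok = self.lexer.token()
        return tok


# Stand-in for the real lexer (assumption): it "lexes" a prebuilt token list.
Tok = namedtuple('Tok', 'type value lineno')


class StubLexer(object):
    def __init__(self):
        self._tokens = iter(())

    def input(self, data):
        self._tokens = iter(data)

    def token(self):
        return next(self._tokens, None)


stream = [Tok('OPEN_TAG', '<?php', 1), Tok('WHITESPACE', ' ', 1),
          Tok('ECHO', 'echo', 1), Tok('WHITESPACE', ' ', 1),
          Tok('STRING', '"hi"', 1), Tok('SEMI', ';', 1),
          Tok('CLOSE_TAG', '?>', 1)]

lexer = FilteredLexer(StubLexer())
lexer.input(stream)

kept = []
tok = lexer.token()
while tok is not None:
    kept.append(tok.type)
    tok = lexer.token()

print(kept)  # ['ECHO', 'STRING', 'SEMI']
```

The unfiltered lexer stays available for syntax highlighting, while the parser is handed the wrapper. A real wrapper would probably also need to forward attributes such as lineno if the parser or its error handler reads them from the lexer object.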

David Beazley

Oct 13, 2010, 8:51:09 AM
to ply-...@googlegroups.com, David Beazley, Dave Benjamin
I agree. Add an extra lexing layer that strips the unwanted tokens from the stream before passing them to the parser. The parse() function has a lexer argument to specify an alternative lexer. There is also a tokenfunc argument that specifies the function to use for getting tokens. Using either of those, you could inject extra processing to discard tokens.

Cheers,
Dave
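
For the tokenfunc variant, something along these lines might work. This is a sketch under assumptions, not code from PLY or phply: make_tokenfunc and StubLexer are hypothetical names, the VARIABLE, EQUALS, LNUMBER, and SEMI token types are made up for the demo, and the stub only mimics the token() method of a PLY lexer so the example runs on its own.

```python
from collections import namedtuple

# Token types to discard before the parser sees them.
IGNORED = frozenset(['WHITESPACE', 'OPEN_TAG', 'CLOSE_TAG',
                     'COMMENT', 'DOC_COMMENT'])


def make_tokenfunc(lexer, ignored=IGNORED):
    """Return a zero-argument function that yields the next wanted token."""
    def tokenfunc():
        while True:
            tok = lexer.token()
            # None signals end of input and must be passed through.
            if tok is None or tok.type not in ignored:
                return tok
    return tokenfunc


# Stand-in for the real lexer (assumption): replays a prebuilt token list.
Tok = namedtuple('Tok', 'type value lineno')


class StubLexer(object):
    def __init__(self, tokens):
        self._tokens = iter(tokens)

    def token(self):
        return next(self._tokens, None)


stub = StubLexer([Tok('COMMENT', '// set x', 1), Tok('VARIABLE', '$x', 2),
                  Tok('WHITESPACE', ' ', 2), Tok('EQUALS', '=', 2),
                  Tok('WHITESPACE', ' ', 2), Tok('LNUMBER', '1', 2),
                  Tok('SEMI', ';', 2), Tok('DOC_COMMENT', '/** end */', 3)])

next_token = make_tokenfunc(stub)

types = []
tok = next_token()
while tok is not None:
    types.append(tok.type)
    tok = next_token()

print(types)  # ['VARIABLE', 'EQUALS', 'LNUMBER', 'SEMI']
```

With a real parser the call would presumably look like parser.parse(source, lexer=full_lexer, tokenfunc=make_tokenfunc(full_lexer)), where full_lexer is the unfiltered lexer; check the exact keyword names against the PLY version in use.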
