Possible bug in ply.lex?

Paul Miller

Nov 30, 2009, 9:07:14 AM
to ply-hack
I've written a rather minimal s-expression parser with PLY, but I'm
experiencing a strange bug. Since the code is rather short, I'll post
it here:

--==BEGIN lexer.py==--
import ply.lex as lex

tokens = ('INTEGER', 'FLOAT', 'STRING', 'LPAREN', 'RPAREN',
          'IDENTIFIER', 'NEWLINE', 'RATIONAL')

t_FLOAT = r'((\d*\.\d+)(E[\+-]?\d+)?|([1-9]\d*E[\+-]?\d+))'
t_STRING = r'\".*?\"'
t_LPAREN = r'\('
t_RPAREN = r'\)'
t_IDENTIFIER = r'[^0-9()][^()\ \t\n]*'
t_INTEGER = r'(-)?\d+'
t_RATIONAL = r'(-)?\d+/\d+'

t_ignore = ' \t'

def t_NEWLINE(t):
    r'\n'
    t.lexer.lineno += 1

def t_error(t):
    '''
    Houston, we have a problem.
    '''
    print("Illegal character %s" % t.value[0])
    t.lexer.skip(1)

lexer = lex.lex (optimize = 0)

--==END lexer.py==--

Now, when I do this:

>>> from lexer import lexer
>>>
>>> lexer.input (' (+ 7abc 3 "xyz") ')
>>> for token in lexer:
...     print(token)

I get:

LexToken(LPAREN,'(',1,1)
LexToken(IDENTIFIER,'+',1,2)
LexToken(INTEGER,'7',1,4)
LexToken(IDENTIFIER,'abc',1,5)
LexToken(INTEGER,'3',1,9)
LexToken(IDENTIFIER,'"xyz"',1,11)
LexToken(RPAREN,')',1,16)
>>>

What I'd expect is an error when matching 7abc, since it's not a valid
identifier. The thing that makes me suspect this is a PLY bug rather
than a bug in my code is that pyscheme
(http://hkn.eecs.berkeley.edu/~dyoo/python/pyscheme/) builds its lexer
and parser using PLY and has the same bug. Can anyone confirm whether
this is a bug in PLY, or am I doing something subtly wrong?

Thanks!

David Beazley

Nov 30, 2009, 9:17:57 AM
to ply-...@googlegroups.com, David Beazley
To me, this looks like more of a whitespace issue. If you ask PLY to parse something like "foo(bar)", there is no requirement that whitespace appear between "foo" and "(". Similarly, if you have tokens for integers and identifiers, then something like "45foo" is going to parse as "45" (int) and "foo" (identifier). I'm not exactly sure how you might want to fix it. Perhaps you can define a special illegal token to handle that case:

t_ILLEGALID = t_INTEGER + t_IDENTIFIER
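
For illustration, here is a minimal sketch of why "7abc" splits in two and how an extra rule changes that. It uses only the standard `re` module to imitate the way PLY combines string rules into a single alternation, longest regex first; it is not PLY's actual code. The patterns are simplified stand-ins for the ones in lexer.py, and `tokenize`, `BASE_RULES`, `STRICT_RULES`, and the tightened `ILLEGALID` pattern are assumptions made up for this example.

```python
import re

# Simplified stand-ins for a few of the rules in lexer.py (assumptions).
INTEGER = r'-?\d+'
IDENTIFIER = r'[^0-9()\s][^()\s]*'
# Hypothetical rule: digits immediately followed by identifier characters.
# Unlike a plain concatenation, the first character may not be whitespace.
ILLEGALID = r'-?\d+[^0-9()\s][^()\s]*'
# The literal concatenation of the original INTEGER and IDENTIFIER patterns.
NAIVE = r'(-)?\d+' + r'[^0-9()][^()\ \t\n]*'

BASE_RULES = {'INTEGER': INTEGER, 'IDENTIFIER': IDENTIFIER,
              'LPAREN': r'\(', 'RPAREN': r'\)'}
STRICT_RULES = dict(BASE_RULES, ILLEGALID=ILLEGALID)

def tokenize(text, rules):
    """Imitate PLY's ordering of string rules: longest regex tried first."""
    ordered = sorted(rules.items(), key=lambda kv: len(kv[1]), reverse=True)
    master = re.compile('|'.join('(?P<%s>%s)' % kv for kv in ordered))
    pos, out = 0, []
    while pos < len(text):
        if text[pos] in ' \t':          # plays the role of t_ignore
            pos += 1
            continue
        m = master.match(text, pos)
        if m is None:                   # plays the role of t_error
            out.append(('ERROR', text[pos]))
            pos += 1
            continue
        out.append((m.lastgroup, m.group()))
        pos = m.end()
    return out

# No rule requires a separator, so the digits and the letters match separately.
print(tokenize('(+ 7abc 3)', BASE_RULES))
# ... ('INTEGER', '7'), ('IDENTIFIER', 'abc'), ('INTEGER', '3') ...

# With the extra rule, '7abc' is claimed whole; '3' is still an INTEGER.
print(tokenize('(+ 7abc 3)', STRICT_RULES))
# ... ('ILLEGALID', '7abc'), ('INTEGER', '3') ...

# The plain concatenation accepts a space after the digits.
print(re.match(NAIVE, '3 "xyz")').group())
```

In a real PLY lexer the new name would also have to be listed in `tokens`. The plain concatenation lets whitespace be the first character after the digits, so it would swallow `3 "xyz"` as one token, which is why the sketch excludes whitespace there. As far as I can tell, even the tightened pattern would still claim a rational such as `1/2`, because the shorter RATIONAL regex is tried later, so that case would need its own handling.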

Cheers,
Dave