Lexing numeric literals

Austin Hastings

Jul 7, 2015, 5:38:01 PM
to ply-...@googlegroups.com
Consider the lowly number:

   0
   01
   0b010101
   0xdeadbeef
   0755
   0o123
   0d299
   0.
   0.0
   1
   1.0e+0
   0xcafe.babep-2
   .17
   1.7
   1..7
   17.
   17...
   0xfastf00d

There is an ANTLR snippet that shows a way to deal with various kinds of numeric literals in the presence of '.' and '..' as language tokens. (http://bit.ly/1HDwCX5)

My question is, can someone point me at a fairly performant PLY version of this? Ideally, it would be as robust as the ANTLR version above in the face of malformed constructs or range errors. Ideally, it would be well documented. But I'll settle for one that works and is fast. I'm hoping for either the C or Perl6 number formats, but I've got to deal with double-dot and triple-dot tokens, so the usual parsing-101 examples won't do.
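To show what I mean about the double-dot problem, here's a toy sketch (plain `re`, the same regexp syntax PLY uses in token-rule docstrings; the `lex` function and token names are just made up for illustration). A negative lookahead keeps the fractional dot from firing when a second dot follows, so `1..7` comes out as three tokens instead of `1.` `.7`:

```python
import re

# The '.' only joins the number when it is NOT followed by another '.',
# so '1..7' lexes as NUMBER DOTS NUMBER and '17...' as NUMBER DOTS.
NUMBER = re.compile(r'\d+(\.(?!\.)\d*)?|\.\d+')
DOTDOT = re.compile(r'\.\.\.|\.\.')  # triple-dot first, so it wins

def lex(text):
    """Toy scanner: yield (kind, lexeme) pairs for numbers and dot tokens."""
    pos, out = 0, []
    while pos < len(text):
        m = NUMBER.match(text, pos)
        if m:
            out.append(('NUMBER', m.group()))
            pos = m.end()
            continue
        m = DOTDOT.match(text, pos)
        if m:
            out.append(('DOTS', m.group()))
            pos = m.end()
            continue
        pos += 1  # toy example: silently skip anything else
    return out
```

With that, `lex('1..7')` gives `NUMBER(1) DOTS(..) NUMBER(7)`, while `17.` still lexes as a single float literal.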

Right now, I'm using a fairly large regexp. I'm kind of hating it, because there's so much backfilling that I have to do - Python's re engine insists on unique group names within a pattern, so I can't have, for example, "(?P<exponent>...)" in more than one location. (Or "range_error".) That in turn leads to lots of separate code checks for different spellings of the same thing.
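What I'd rather write is something like the sketch below (names like `classify` are hypothetical): one permissive pattern grabs the whole candidate lexeme, and then each kind gets its own small, separately compiled check. Since the per-kind patterns are separate regexps, each one is free to reuse the group name "exponent":

```python
import re

# In PLY this broad pattern would be the docstring of a t_NUMBER function;
# it deliberately over-matches so malformed literals become lexemes too.
CANDIDATE = re.compile(r'[0-9][0-9a-fA-FxXoObB.]*|\.[0-9]+')

# Each kind is a separate compiled pattern, so 'exponent' can appear in
# more than one of them - no unique-group-name backfilling needed.
KINDS = [
    ('HEXFLOAT', re.compile(r'0[xX][0-9a-fA-F]*\.[0-9a-fA-F]*'
                            r'(?P<exponent>[pP][-+]?\d+)\Z')),
    ('HEX',      re.compile(r'0[xX][0-9a-fA-F]+\Z')),
    ('FLOAT',    re.compile(r'(?:\d+\.\d*|\.\d+)'
                            r'(?P<exponent>[eE][-+]?\d+)?\Z')),
    ('INT',      re.compile(r'\d+\Z')),
]

def classify(lexeme):
    """Return (kind, exponent-or-None); 'ERROR' for malformed literals."""
    for kind, pat in KINDS:
        m = pat.match(lexeme)
        if m:
            return kind, m.groupdict().get('exponent')
    return 'ERROR', None
```

So `classify('0xcafe.babep-2')` reports a hex float with exponent `p-2`, and `classify('0xfastf00d')` falls through to `'ERROR'` instead of silently matching a prefix.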

I have wondered if a lexer state would be the right way to deal with this, but I don't think it feels quite right. (The state would let me break the regexp into separate pieces, and I could then reassemble them in the parser. But whitespace and a missing end signal make me leery of this approach.)

I have also wondered if there is an efficient way to chew through the input text by hand. But I keep thinking this is PLY's job, so there should be a way for PLY to do it!
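By "chew through the input text by hand" I mean something like this toy character-level scanner (pure Python; `scan_number` is a made-up name, and presumably the same loop could live inside a single PLY token function that advances the lexer position itself). It stops a fractional part before `..` and reports dangling exponents as a distinct error kind:

```python
def scan_number(text, pos):
    """Toy hand-rolled number scanner: returns (kind, lexeme, new_pos)."""
    start, n = pos, len(text)
    while pos < n and text[pos].isdigit():
        pos += 1
    kind = 'INT'
    # a '.' joins the number only when it is NOT the start of '..'
    if pos < n and text[pos] == '.' and text[pos:pos + 2] != '..':
        pos += 1
        kind = 'FLOAT'
        while pos < n and text[pos].isdigit():
            pos += 1
    if pos < n and text[pos] in 'eE':
        mark = pos + 1
        if mark < n and text[mark] in '+-':
            mark += 1
        if mark < n and text[mark].isdigit():
            pos = mark
            while pos < n and text[pos].isdigit():
                pos += 1
            kind = 'FLOAT'
        else:
            # '1e' with no digits: report it instead of backtracking
            return ('ERROR', text[start:pos + 1], pos + 1)
    return (kind, text[start:pos], pos)
```

It handles `1..7` (stops at the first dot), `17.` and `.17` (both floats), and flags `1e` as malformed rather than re-lexing the `e` as an identifier.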

Any advice or links appreciated.

=Austin
