Re: 100 groups in python re limitation slows down PLY performance

A.T.Hofkamp

Aug 20, 2012, 11:18:40 AM
to ply-...@googlegroups.com
On 08/17/2012 02:52 PM, viy wrote:
> Hi all, jfyi
>
> I've added just one token to my lexer rules and hit the 100-group limit in
> Python's re module:
> http://stackoverflow.com/questions/478458/python-regular-expressions-with-more-than-100-groups
>
>
> PLY has a workaround in its code: when your master re exceeds 100 groups, PLY
> catches the AssertionError from Python, splits the master re into parts, and retries.
>
> Everything works smoothly, but in my case my unit test suite became 10x slower.
> A single parse is about 1.5x slower.
>
> The obvious solution is to get rid of the Python limitation.
> Does anyone know the best way to do so?
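
For reference, the workaround described above is roughly this shape (a
simplified sketch for illustration, not PLY's actual code):

import re

def form_master_re(rules):
    # rules: a list of (name, pattern) pairs, e.g. [("NUMBER", r"\d+"), ...]
    if not rules:
        return []
    pattern = "|".join("(?P<%s>%s)" % (name, pat) for name, pat in rules)
    try:
        return [re.compile(pattern)]
    except AssertionError:        # "only supports 100 named groups"
        if len(rules) == 1:
            raise                 # a single oversized rule cannot be split
        mid = len(rules) // 2
        return form_master_re(rules[:mid]) + form_master_re(rules[mid:])

Matching then has to try each compiled part in turn, which is where the
extra cost of a split pattern comes from.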

Re-implement RE? :D
Much happiness would spread throughout the Python community, I am sure :)


Other solutions include:

0. Live with it. Other solutions may cost more time than you are ever going to save.

1. DIY: you can easily define your own scanner, using arbitrary Python code. Just make sure
you match the interface PLY expects (see the sketch after this list). String scanning is
relatively easy; it just takes a lot of code.

2. A long time ago (several years at least), someone wrote a Lex framework. I forget the
details, but the mailing list archive or Google can probably help you. IIRC, it was a true
lex and took a different approach than using REs.

3. More exotic solutions, like writing the scanner as a C extension (generated with lex/flex), are also possible.

4. Even more exotic stuff, like generating a DFA somehow and implementing that in Python, can also be done.

5. Other Python parser generators may have better solutions (I somewhat doubt it, but it
should be easy enough to scan through them to see how their scanners work).
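
For option 1, here is a minimal sketch of a hand-written scanner matching the
interface PLY's parser expects: an input() method plus a token() method that
returns objects carrying type, value, lineno and lexpos (or None at end of
input). The token set (NUMBER, PLUS) is made up purely for illustration.

class Token(object):
    def __init__(self, type, value, lineno, lexpos):
        self.type = type        # token name, e.g. "NUMBER"
        self.value = value      # matched value
        self.lineno = lineno    # line number, used in error reporting
        self.lexpos = lexpos    # offset into the input string

    def __repr__(self):
        return "Token(%s, %r)" % (self.type, self.value)

class MyLexer(object):
    def input(self, data):
        self.data = data
        self.pos = 0
        self.lineno = 1

    def token(self):
        data, n = self.data, len(self.data)
        # Skip whitespace, tracking line numbers.
        while self.pos < n and data[self.pos] in " \t\n":
            if data[self.pos] == "\n":
                self.lineno += 1
            self.pos += 1
        if self.pos >= n:
            return None           # end of input
        start = self.pos
        if data[start].isdigit():
            while self.pos < n and data[self.pos].isdigit():
                self.pos += 1
            return Token("NUMBER", int(data[start:self.pos]), self.lineno, start)
        if data[start] == "+":
            self.pos += 1
            return Token("PLUS", "+", self.lineno, start)
        raise SyntaxError("illegal character %r at position %d" % (data[start], start))

You would then hand an instance to the parser, as in
parser.parse(data, lexer=MyLexer()).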


Good luck

Albert

Eugene Voytitsky

Aug 20, 2012, 11:45:09 AM
to ply-...@googlegroups.com, A.T.Hofkamp
Thanks, Albert.

It seems that I really have only 2 options:
0. Live with it, or
1. Recompile Python without that limit check. (And you are right: I should
ask the Python community, but I decided to share the info here.)

The other options don't suit, because a lot of work has already been done and
the code is already in production.
--
Best regards,
Eugene Voytitsky

Joseph S. Tate

Aug 20, 2012, 12:00:53 PM
to ply-...@googlegroups.com
Have you tried either of the non-stdlib re replacements? I note there
are at least two: re2 and regex. It seems like a monkey-patching wrapper
around either of those would be easier to distribute than a patched
version of Python; a sketch of the idea is below.
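
An untested sketch of the idea, using the third-party regex module from PyPI
(largely API-compatible with the stdlib re, and without the 100-group limit):

# Install 'regex' under the name 're' before PLY gets a chance to import it.
import sys
import regex
sys.modules['re'] = regex   # must run before the first 'import ply.lex'

import ply.lex as lex       # PLY now builds its master pattern with 'regex'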

Besides the 100-group limit, re also has limits on total regex length,
depending on the compiled character width.

--
Joseph Tate
Personal e-mail: jtate AT dragonstrider DOT com
Web: http://www.dragonstrider.com

David Beazley

Aug 20, 2012, 12:06:20 PM
to ply-...@googlegroups.com, David Beazley, A.T.Hofkamp
Having just looked at the Python source, the 100-group limit is literally just hard-coded into Modules/_sre.c and a few other places. Without looking more closely, it is hard to imagine why this is needed. Plus, if you're going to hard-code a limit, surely it could be bumped up to something substantially larger, like 1000 or even 10000, on a modern machine (assuming the limit was there for memory reasons).
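
For reference, the limit is easy to reproduce (a minimal demonstration on a
current 2.7-series interpreter; the AssertionError is the same one PLY's
workaround catches):

import re

pattern = "|".join("(?P<T%d>x)" % i for i in range(101))   # 101 named groups
re.compile(pattern)
# AssertionError: sorry, but this version only supports 100 named groups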

Wonder if this could be pushed for Python 3.3 (still in beta).

Cheers,
Dave