
Faster Regular Expressions


nk...@vt.edu

Mar 9, 2000, 3:00:00 AM
to pytho...@python.org

"""

Food for thought...

My work depends on fast regular expressions but I also enjoy Python's
ease and speed of development.

I ran the following regular expression speed test (P166 workstation
running Linux). The results are below. Function "fastMatch" can be
six times (6x) faster.

It seems to me that Tatu Ylonen (apparent author of regexp.c) did his
job well and that the re/mo wrapper in re.py slows everything down.

-- Neill

[Discuss if you like. Flames---as always---to /dev/null. I post here
because these results should go in the searchable archive.]

"""

import re

TEST = '"coconuts NI! coconuts NI! coconuts NI! coconuts NI! coconuts"'
SLOWQUOTE = re.compile( r"\"(?:(?:\\.)|[^\"\\])*\"")
FASTQUOTE = re.compile( r"\"(?:(?:\\.)|[^\"\\])*\"").code.match

def slowMatch( pattern, string) :
    # documented route: match() builds a MatchObject in Python,
    # then group() slices out the matched text
    mo = pattern.match( string)
    return mo.group()

def fastMatch( pattern, string) :
    # undocumented route: call the compiled pattern's C-level match
    # directly; it returns a sequence of (start, end) spans, with
    # index 0 covering the whole match
    groups = pattern( string)
    start, end = groups[ 0]
    return string[ start:end]

def doit() :
    for trial in range( 1000) :
        x = slowMatch( SLOWQUOTE, TEST)
        x = fastMatch( FASTQUOTE, TEST)

import profile
out = 'tmp.prof'
profile.run( 'doit()', out)
import pstats
profObj = pstats.Stats( out)
profObj.sort_stats('cumulative').print_stats()

"""
Thu Mar 9 14:01:35 2000 tmp.prof

5003 function calls in 2.510 CPU seconds

Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.010    0.010    2.510    2.510 profile:0(doit())
        1    0.000    0.000    2.500    2.500 <string>:1(?)
        1    0.220    0.220    2.500    2.500 testre.py:34(doit)
     1000    0.410    0.000    1.960    0.002 testre.py:25(slowMatch)
     1000    0.640    0.001    0.890    0.001 /usr/local/lib/python1.5/re.py:112(match)
     1000    0.660    0.001    0.660    0.001 /usr/local/lib/python1.5/re.py:335(group)
     1000    0.320    0.000    0.320    0.000 testre.py:29(fastMatch)
     1000    0.250    0.000    0.250    0.000 /usr/local/lib/python1.5/re.py:290(__init__)
        0    0.000             0.000          profile:0(profiler)

"""


Tim Peters

Mar 9, 2000, 3:00:00 AM
to pytho...@python.org
[nk...@vt.edu]

> Food for thought...
>
> My work depends on fast regular expressions but I also enjoy Python's
> ease and speed of development.
>
> I ran the following regular expression speed test (P166 workstation
> running Linux). The results are below. Function "fastMatch" can be
> six times (6x) faster.

In case that still isn't clear, he meant VI tick-tock tick-tock <wink>.

> It seems to me that Tatu Ylonen (apparent author of regexp.c) did his
> job well and that the re/mo wrapper in re.py slows everything down.

/F-redrik Lundh is working on a new (Unicode-aware) regexp pkg for Python
1.6. His speed tests confirmed the conventional wisdom here: regex has
much lower fixed overhead than re so generally wins on short strings; but re
is very often very much faster (even unboundedly so) than regex on long
strings (where the fixed overhead gets lost to the noise); and /F's pkg
appears to beat both on both counts -- at least according to his tests so
far.

btw-if-your-work-depends-on-fast-regexps-you-should-consider-
getting-a-real-job<wink>-ly y'rs - tim


Fredrik Lundh

Mar 10, 2000, 3:00:00 AM
to
nk...@vt.edu wrote:

> I post here because these results should go in the searchable
> archive.

no, they shouldn't ;-)

despite your good intentions, your benchmark is flawed. read on.

> I ran the following regular expression speed test (P166 workstation
> running Linux). The results are below. Function "fastMatch" can be
> six times (6x) faster.

You cannot use the profiler to calculate relative performance
in this way. Since it's an instrumenting profiler, it slows down
Python code (like the glue code in 'pattern.match'). But it
doesn't slow down C functions (like 'pattern.code.match') at
all!

A more correct benchmark gives a difference of just over 2x.

re: 100
re-fast: 43

(times are normalized -- original 're' is 100)
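
A wall-clock loop along the following lines sidesteps the profiler
bias entirely. This is only a sketch in the spirit of the original
script (the helper names, the loop count, and the use of 'time.time'
are illustrative; it's not the exact benchmark behind the numbers
above):

import re
import time

TEST = '"coconuts NI! coconuts NI! coconuts NI! coconuts NI! coconuts"'
SLOWQUOTE = re.compile( r"\"(?:(?:\\.)|[^\"\\])*\"")
FASTQUOTE = re.compile( r"\"(?:(?:\\.)|[^\"\\])*\"").code.match

def slowMatch( pattern, string) :
    return pattern.match( string).group()

def fastMatch( pattern, string) :
    start, end = pattern( string)[ 0]
    return string[ start:end]

def clock( func, pattern, string) :
    # plain wall-clock timing; no per-call profiler bookkeeping
    t0 = time.time()
    for i in range( 10000) :
        func( pattern, string)
    return time.time() - t0

slow = clock( slowMatch, SLOWQUOTE, TEST)
fast = clock( fastMatch, FASTQUOTE, TEST)
print "re:      100"
print "re-fast: %.0f" % (100.0 * fast / slow)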

> It seems to me that Tatu Ylonen (apparent author of regexp.c) did
> his job well and that the re/mo wrapper in re.py slows everything
> down.

But you're not testing Tatu Ylonen's regexp package (used by the
'regex' module) -- you're short-circuiting the 're' interface layer
built on top of Philip Hazel's PCRE library.

The 'regex' module provides fewer features, is much slower on complex
patterns, and isn't thread safe. But it has lower calling overhead,
since it doesn't go through an extra Python layer.
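
For reference, the 'regex' calling convention looks roughly like this
(a minimal sketch only; the default syntax is emacs-style, so groups
are written with \( \), and the pattern here is deliberately trivial
rather than the tweaked quote pattern used for the timing below):

import regex    # the old pre-'re' engine

# emacs-style syntax: \( \) delimit a group
pat = regex.compile( 'coconuts \(NI!\)')

# match() returns the number of characters matched (-1 on failure);
# there is no MatchObject layer in between
n = pat.match( 'coconuts NI! coconuts')
if n >= 0 :
    print 'matched', n, 'chars; group 1 spans', pat.regs[ 1]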

After tweaking your pattern to work with 'regex', I get the following
result:

regex: 40

Slightly faster than your 're' hack, and this one doesn't use any
undocumented features.

And note that they're not only undocumented; they will also be
gone in 1.6. The new engine will implement the full 're' syntax
and the same interface, will be fast even on relatively complex
patterns and large strings, and will have a very tight Python
interface:

sre: 26

That's just under 4x. Not too bad, imho.

For additional benchmarks, see

http://www.deja.com/=dnc/getdoc.xp?AN=588925502

</F>
