New implementation of re module

MRAB

unread,

Jul 27, 2009, 12:34:03 PM7/27/09

to Python List

Hi all,

I've been working on a new implementation of the re module. The details
are at http://bugs.python.org/issue2636, specifically from
http://bugs.python.org/issue2636#msg90954. I've included a .pyd file for
Python 2.6 on Windows if you want to try it out.

I'm interested in how fast it is generally, compared with the current re
module, but especially when faced with those 'pathological' regular
expressions which seem to take a long time to finish, for example:

re.search(r"^(.+|D)*A$", "x" * 25 + "B")

which on my PC (1.8GHz) takes 18.98secs with the re module but <0.01secs
with this new implementation.

TIA

William Dode

unread,

Jul 27, 2009, 3:04:59 PM7/27/09

to

On 27-07-2009, MRAB wrote:
> Hi all,
>
> I've been working on a new implementation of the re module. The details
> are at http://bugs.python.org/issue2636, specifically from
> http://bugs.python.org/issue2636#msg90954. I've included a .pyd file for
> Python 2.6 on Windows if you want to try it out.
>

Someone can remember me how to compile it (on debian lenny), if possible
with python2.5. I've also python3.1 that i build alone...

I could test it with pytextile, i've a bunch of texts to bench and
compare.

Did you announce it on the unladen-swallow list ? They wanted to hack on
RE also...

--
William Dodé - http://flibuste.net
Informaticien Indépendant

Wolfgang Rohdewald

unread,

Jul 27, 2009, 3:27:55 PM7/27/09

to pytho...@python.org

On Monday 27 July 2009, MRAB wrote:
> I've been working on a new implementation of the re module. The
> details are at http://bugs.python.org/issue2636, specifically from
> http://bugs.python.org/issue2636#msg90954. I've included a .pyd
> file for Python 2.6 on Windows if you want to try it out.

how do I compile _regex.c on Linux?

--
Wolfgang

MRAB

unread,

Jul 27, 2009, 4:00:48 PM7/27/09

to pytho...@python.org

William Dode wrote:
> On 27-07-2009, MRAB wrote:
>> Hi all,
>>
>> I've been working on a new implementation of the re module. The details
>> are at http://bugs.python.org/issue2636, specifically from
>> http://bugs.python.org/issue2636#msg90954. I've included a .pyd file for
>> Python 2.6 on Windows if you want to try it out.
>>
>
> Someone can remember me how to compile it (on debian lenny), if possible
> with python2.5. I've also python3.1 that i build alone...
>
> I could test it with pytextile, i've a bunch of texts to bench and
> compare.
>

All I can do is point you to
http://docs.python.org/extending/extending.html.

For Linux (which I don't have) you'll need to use _regex.h and _regex.c
to compile to _regex.so instead of _regex.pyd.

> Did you announce it on the unladen-swallow list ? They wanted to hack on
> RE also...
>

No. I haven't subscribed to it.

Aahz

unread,

Jul 27, 2009, 10:52:26 PM7/27/09

to

In article <mailman.3787.1248712...@python.org>,

MRAB <pyt...@mrabarnett.plus.com> wrote:
>
>I've been working on a new implementation of the re module. The details
>are at http://bugs.python.org/issue2636, specifically from
>http://bugs.python.org/issue2636#msg90954. I've included a .pyd file for
>Python 2.6 on Windows if you want to try it out.

How does it handle the re module's unit tests?
--
Aahz (aa...@pythoncraft.com) <*> http://www.pythoncraft.com/

"Many customs in this life persist because they ease friction and promote
productivity as a result of universal agreement, and whether they are
precisely the optimal choices is much less important." --Henry Spencer

OKB (not okblacke)

unread,

Jul 28, 2009, 12:41:34 AM7/28/09

to

MRAB wrote:

> http://bugs.python.org/issue2636#msg90954

Variable-length lookbehind! My hero!

--
--OKB (not okblacke)
Brendan Barnwell
"Do not follow where the path may lead. Go, instead, where there is
no path, and leave a trail."
--author unknown

MRAB

unread,

Jul 28, 2009, 10:59:06 AM7/28/09

to pytho...@python.org

Aahz wrote:
> In article <mailman.3787.1248712...@python.org>,
> MRAB <pyt...@mrabarnett.plus.com> wrote:
>> I've been working on a new implementation of the re module. The details
>> are at http://bugs.python.org/issue2636, specifically from
>> http://bugs.python.org/issue2636#msg90954. I've included a .pyd file for
>> Python 2.6 on Windows if you want to try it out.
>
> How does it handle the re module's unit tests?

Basically, it passes all those tests I expect it to pass. :-)

It fails those where the intended behaviour has changed, such as re.sub
treating unmatched groups as empty strings, as requested in
http://bugs.python.org/issue1519638.

Christopher Arndt

unread,

Jul 28, 2009, 1:07:26 PM7/28/09

to

On 27 Jul., 21:27, Wolfgang Rohdewald <wolfg...@rohdewald.de> wrote:
> how do I compile _regex.c on Linux?

This simple setup.py file should do the trick:

from distutils.core import setup, Extension

setup(name='regex',
version='1.0',
py_modules = ['regex'],
ext_modules=[Extension('_regex', ['_regex.c'])],
)

Also, you need to copy "unicodedata_db.h" from the "Modules" directory
of the Python source tree to your working directory, since this file
apparently is not installed into the include directory of a Python
installation.

I get an error for Python 2.5 on Mac OS X 10.4 and Linux though. Seems
that the module is not compatible with Python 2.5.

_regex.c: In function 'getstring':
_regex.c:2425: error: invalid type argument of '->'
_regex.c: In function 'getstring':
_regex.c:2425: error: invalid type argument of '->'
lipo: can't figure out the architecture type of: /var/tmp//
ccT3oDXD.out
error: command 'gcc' failed with exit status 1

resp. on Linux:

_regex.c: In function 'getstring':
_regex.c:2425: warning: implicit declaration of function 'Py_TYPE'
_regex.c:2425: error: invalid type argument of '->'
_regex.c: In function 'get_match_replacement':
_regex.c:2900: warning: implicit declaration of function
'PyLong_AsSsize_t'
error: command 'gcc' failed with exit status 1

With the official Python 2.6 distribution for Mac OS X it works.

Chris

MRAB

unread,

Jul 28, 2009, 1:25:43 PM7/28/09

to pytho...@python.org

The source code is intended to replace the current 're' module in Python
2.7 (and I'll be porting it to Python 3.2), so I'm not that worried
about Python versions earlier than 2.6 for testing, although if there's
sufficient need then I could tweak the sources for 2.5.

William Dode

unread,

Jul 28, 2009, 4:05:35 PM7/28/09

to

On 28-07-2009, MRAB wrote:

> With the official Python 2.6 distribution for Mac OS X it works.
>>
> The source code is intended to replace the current 're' module in Python
> 2.7 (and I'll be porting it to Python 3.2), so I'm not that worried
> about Python versions earlier than 2.6 for testing, although if there's
> sufficient need then I could tweak the sources for 2.5.

I understand now why i could'nt compile it !

So, i would like if it's not too much work for you.

Aahz

unread,

Jul 28, 2009, 4:57:37 PM7/28/09

to

In article <mailman.3843.1248793...@python.org>,

Then you should definitely publish to PyPI and post a message to
c.l.py.announce to get more users.

Mark Lawrence

unread,

Jul 28, 2009, 6:40:02 PM7/28/09

to pytho...@python.org

MRAB wrote:
> Hi all,

>
> I've been working on a new implementation of the re module. The details
> are at http://bugs.python.org/issue2636, specifically from
> http://bugs.python.org/issue2636#msg90954. I've included a .pyd file for
> Python 2.6 on Windows if you want to try it out.
>

> I'm interested in how fast it is generally, compared with the current re
> module, but especially when faced with those 'pathological' regular
> expressions which seem to take a long time to finish, for example:
>
> re.search(r"^(.+|D)*A$", "x" * 25 + "B")
>
> which on my PC (1.8GHz) takes 18.98secs with the re module but <0.01secs
> with this new implementation.

I tried this on my 3GHz PC timings pretty much the same.

From here http://bugs.python.org/issue1721518 I knocked up this.

import time
import re
import regex

s = "Add.1, 2020 and Add.1, 2021-2023, 2025, 2028 and 2029 and Add.1) R"
r = "(?:\s|,|and|Add\S*?|Parts?|$[^$]*\)|[IV\-\d]+)*$"
t0 = time.clock()
print regex.search(r, s)
t1 = time.clock()
print "time", t1 - t0

print "It's going to crash"
t0 = time.clock()
print re.search(r, s)
t1 = time.clock()
print "It hasn't crashed time", t1 - t0

Output shows a slight change in timing:).

<_regex.RE_Match object at 0x0243A1A0>
time 0.00279001940191
It's going to crash
<_sre.SRE_Match object at 0x024396B0>
It hasn't crashed time 98.4238155967

>
> TIA

I also got the files bm_regex_effbot.py and bm_regex_v8.py from
http://code.google.com/p/unladen-swallow/source/browse/#svn/tests/performance
and ran them, then reran them having substituted regex for re. Output
timings were roughly effbot re 0.14secs, effbot regex 1.16secs, v8 re
0.17secs and v8 regex 0.67secs.

HTH.

--
Kindest regards.

Mark Lawrence.

MRAB

unread,

Jul 28, 2009, 8:58:50 PM7/28/09

to pytho...@python.org

William Dode wrote:
> On 28-07-2009, MRAB wrote:
>
>> With the official Python 2.6 distribution for Mac OS X it works.
>> The source code is intended to replace the current 're' module in Python
>> 2.7 (and I'll be porting it to Python 3.2), so I'm not that worried
>> about Python versions earlier than 2.6 for testing, although if there's
>> sufficient need then I could tweak the sources for 2.5.
>
> I understand now why i could'nt compile it !
>
> So, i would like if it's not too much work for you.
>

There's a new version which should compile with Python 2.5 now.

Mike

unread,

Jul 29, 2009, 11:24:52 AM7/29/09

to

On Jul 27, 11:34 am, MRAB <pyt...@mrabarnett.plus.com> wrote:
> I've been working on a new implementation of the re module.

Fabulous!

If you're extending/changing the interface, there are a couple of sore
points in the current implementation I'd love to see addressed:

- findall/finditer doesn't find overlapping matches. Sometimes you
really *do* want to know all possible matches, even if they overlap.
This comes up in bioinformatics, for example.

- split won't split on empty patterns, e.g. empty lookahead patterns.
This means that it can't be used for a whole class of interesting
cases. This has been discussed previously:

http://bugs.python.org/issue3262
http://bugs.python.org/issue852532
http://bugs.python.org/issue988761

- It'd be nice to have a version of split that generates the parts
(one by one) rather than returning the whole list.

- Repeated subgroup match information is not available. That is, for
a match like this

re.match('(.){3}', 'xyz')

there's no way to discover that the subgroup first matched 'x', then
matched 'y', and finally matched 'z'. Here is one past proposal
(mine), perhaps over-complex, to address this problem:

http://mail.python.org/pipermail/python-dev/2004-August/047238.html

Mike

MRAB

unread,

Jul 29, 2009, 11:45:55 AM7/29/09

to pytho...@python.org

Mike wrote:
> On Jul 27, 11:34 am, MRAB <pyt...@mrabarnett.plus.com> wrote:
>> I've been working on a new implementation of the re module.
>
> Fabulous!
>
> If you're extending/changing the interface, there are a couple of sore
> points in the current implementation I'd love to see addressed:
>
> - findall/finditer doesn't find overlapping matches. Sometimes you
> really *do* want to know all possible matches, even if they overlap.
> This comes up in bioinformatics, for example.
>

Perhaps by adding "overlapped=True"?

> - split won't split on empty patterns, e.g. empty lookahead patterns.
> This means that it can't be used for a whole class of interesting
> cases. This has been discussed previously:
>
> http://bugs.python.org/issue3262
> http://bugs.python.org/issue852532
> http://bugs.python.org/issue988761
>

Already addressed (see issue2636 for the full details).

> - It'd be nice to have a version of split that generates the parts
> (one by one) rather than returning the whole list.
>

Hmm, re.splititer() perhaps.

> - Repeated subgroup match information is not available. That is, for
> a match like this
>
> re.match('(.){3}', 'xyz')
>
> there's no way to discover that the subgroup first matched 'x', then
> matched 'y', and finally matched 'z'. Here is one past proposal
> (mine), perhaps over-complex, to address this problem:
>
> http://mail.python.org/pipermail/python-dev/2004-August/047238.html
>

Yikes! I think I'll let you code that... :-)

Mike

unread,

Jul 29, 2009, 1:21:39 PM7/29/09

to

On Jul 29, 10:45 am, MRAB <pyt...@mrabarnett.plus.com> wrote:

> Mike wrote:
> > - findall/finditer doesn't find overlapping matches. Sometimes you
> > really *do* want to know all possible matches, even if they overlap.
>

> Perhaps by adding "overlapped=True"?

Something like that would be great, yes.

> > - split won't split on empty patterns, e.g. empty lookahead patterns.

> Already addressed (see issue2636 for the full details).

Glad to hear it.

> > - Repeated subgroup match information is not available. That is, for
> > a match like this
>
> > re.match('(.){3}', 'xyz')
>
> > there's no way to discover that the subgroup first matched 'x', then
> > matched 'y', and finally matched 'z'. Here is one past proposal
> > (mine), perhaps over-complex, to address this problem:
>
> > http://mail.python.org/pipermail/python-dev/2004-August/047238.html
>
> Yikes! I think I'll let you code that... :-)

I agree that that document looks a little scary--maybe I was trying to
bite off too much at once.

My intuition, though, is that the basic idea should be fairly simple
to implement, at least for a depth-first matcher. The repeated match
subgroups are already being discovered, it's just that they're not
being saved, so there's no way to report them out once a complete
match is found. If some trail of breadcrumbs were pushed onto a stack
during the DFS, it could be traced at the end. And the whole thing
might not even been that expensive to do.

The hardest parts about this, in my mind, are figuring out how to
report the repeated matches out in a useful form (hence all that
detail in the proposal), and getting users to understand that using
this feature *could* suck up a lot of memory, if they're not careful.

As always, it's possible that my intuition is totally wrong. Plus I'm
not sure how this would work out in the breadth-first case.

Details aside, I would really, really, really like to have a way to
get at the repeated subgroup matches. I write a lot of code that
would be one-liners if this capability existed. Plus, it just plain
burns me that Python is discovering this information but impudently
refuses to tell me what it's found! ;-)

Wolfgang Rohdewald

unread,

Jul 30, 2009, 6:35:22 AM7/30/09

to pytho...@python.org

On Tuesday 28 July 2009, Christopher Arndt wrote:
> setup(name='regex',
> version='1.0',
> py_modules = ['regex'],
> ext_modules=[Extension('_regex', ['_regex.c'])],
> )
>
> Also, you need to copy "unicodedata_db.h" from the "Modules"
> directory of the Python source tree to your working directory,
> since this file apparently is not installed into the include
> directory of a Python installation.

using issue2636-20090729.zip

I have Python 2.6.2 on ubuntu 9.04

with ggc-4.3:
gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -
Wstrict-prototypes -fPIC -I/usr/include/python2.6 -c _regex.c -o
build/temp.linux-i686-2.6/_regex.o
_regex.c: In Funktion »bmatch_context«:
_regex.c:1462: Fehler: Als Erhöhungsoperand wird L-Wert erfordert
_regex.c:1470: Fehler: Als Erhöhungsoperand wird L-Wert erfordert
_regex.c:1478: Fehler: Als Verringerungsoperand wird L-Wert erfordert

with gcc-4.4:
gcc-4.4 -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-
prototypes -fPIC -I/usr/include/python2.6 -c _regex.c -o
build/temp.linux-i686-2.6/_regex.o
_regex.c: In function ‘bmatch_context’:
_regex.c:1462: error: lvalue required as increment operand
_regex.c:1470: error: lvalue required as increment operand
_regex.c:1478: error: lvalue required as decrement operand

--
Wolfgang

MRAB

unread,

Jul 30, 2009, 7:05:58 AM7/30/09

to pytho...@python.org

Wolfgang Rohdewald wrote:
> On Tuesday 28 July 2009, Christopher Arndt wrote:

>> setup(name='regex',
>> version='1.0',
>> py_modules = ['regex'],
>> ext_modules=[Extension('_regex', ['_regex.c'])],
>> )
>>
>> Also, you need to copy "unicodedata_db.h" from the "Modules"
>> directory of the Python source tree to your working directory,
>> since this file apparently is not installed into the include
>> directory of a Python installation.
>

> using issue2636-20090729.zip
>
> I have Python 2.6.2 on ubuntu 9.04
>
> with ggc-4.3:
> gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -
> Wstrict-prototypes -fPIC -I/usr/include/python2.6 -c _regex.c -o
> build/temp.linux-i686-2.6/_regex.o
> _regex.c: In Funktion »bmatch_context«:
> _regex.c:1462: Fehler: Als Erhöhungsoperand wird L-Wert erfordert
> _regex.c:1470: Fehler: Als Erhöhungsoperand wird L-Wert erfordert
> _regex.c:1478: Fehler: Als Verringerungsoperand wird L-Wert erfordert
>
> with gcc-4.4:
> gcc-4.4 -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-
> prototypes -fPIC -I/usr/include/python2.6 -c _regex.c -o
> build/temp.linux-i686-2.6/_regex.o
> _regex.c: In function ‘bmatch_context’:
> _regex.c:1462: error: lvalue required as increment operand
> _regex.c:1470: error: lvalue required as increment operand
> _regex.c:1478: error: lvalue required as decrement operand
>

There are other lines which are similar, eg line 1487. Do they all give
the same/similar error with your compiler?

Wolfgang Rohdewald

unread,

Jul 30, 2009, 8:29:07 AM7/30/09

to pytho...@python.org

On Thursday 30 July 2009, MRAB wrote:
> There are other lines which are similar, eg line 1487. Do they all
> give the same/similar error with your compiler?

yes. The full output with gcc-4.3:

notebook:~/kmj/src$ LANG=C python setup.py build
running build
running build_py
running build_ext
building '_regex' extension

gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -
Wstrict-prototypes -fPIC -I/usr/include/python2.6 -c _regex.c -o
build/temp.linux-i686-2.6/_regex.o

_regex.c: In function 'bmatch_context':
_regex.c:1462: error: lvalue required as increment operand
_regex.c:1470: error: lvalue required as increment operand
_regex.c:1478: error: lvalue required as decrement operand

_regex.c:1487: error: lvalue required as decrement operand
_regex.c:1593: error: lvalue required as increment operand
_regex.c:1606: error: lvalue required as decrement operand
_regex.c:1616: error: lvalue required as increment operand
_regex.c:1625: error: lvalue required as increment operand
_regex.c:1634: error: lvalue required as decrement operand
_regex.c:1643: error: lvalue required as decrement operand
_regex.c:2036: error: lvalue required as increment operand
_regex.c:2047: error: lvalue required as increment operand
_regex.c:2059: error: lvalue required as decrement operand
_regex.c:2070: error: lvalue required as decrement operand
_regex.c:2316: error: lvalue required as increment operand
In file included from _regex.c:2431:
_regex.c: In function 'umatch_context':

_regex.c:1462: error: lvalue required as increment operand
_regex.c:1470: error: lvalue required as increment operand
_regex.c:1478: error: lvalue required as decrement operand

_regex.c:1487: error: lvalue required as decrement operand
_regex.c:1593: error: lvalue required as increment operand
_regex.c:1606: error: lvalue required as decrement operand
_regex.c:1616: error: lvalue required as increment operand
_regex.c:1625: error: lvalue required as increment operand
_regex.c:1634: error: lvalue required as decrement operand
_regex.c:1643: error: lvalue required as decrement operand
_regex.c:2036: error: lvalue required as increment operand
_regex.c:2047: error: lvalue required as increment operand
_regex.c:2059: error: lvalue required as decrement operand
_regex.c:2070: error: lvalue required as decrement operand
_regex.c:2316: error: lvalue required as increment operand

error: command 'gcc' failed with exit status 1

--
Wolfgang

MRAB

unread,

Jul 30, 2009, 8:56:32 AM7/30/09

to pytho...@python.org

Wolfgang Rohdewald wrote:
> On Thursday 30 July 2009, MRAB wrote:
>> There are other lines which are similar, eg line 1487. Do they all
>> give the same/similar error with your compiler?
>
> yes. The full output with gcc-4.3:
>
>
> notebook:~/kmj/src$ LANG=C python setup.py build
> running build
> running build_py
> running build_ext
> building '_regex' extension
> gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -
> Wstrict-prototypes -fPIC -I/usr/include/python2.6 -c _regex.c -o
> build/temp.linux-i686-2.6/_regex.o
> _regex.c: In function 'bmatch_context':
> _regex.c:1462: error: lvalue required as increment operand

[snip]

> error: command 'gcc' failed with exit status 1
>

So it complains about:

++(RE_CHAR*)context->text_ptr

but not about:

++info->repeat.count

Does this mean that the gcc compiler thinks that the cast makes it an
rvalue? I'm using Visual C++ 2008 Express Edition, which doesn't
complain. What does the C standard say?

Wolfgang Rohdewald

unread,

Jul 30, 2009, 9:18:27 AM7/30/09

to pytho...@python.org

I am not really a C expert but I found some links. Most helpful:
http://developer.apple.com/DOCUMENTATION/DeveloperTools/gcc-4.0.1/gcc/C-Dialect-Options.html

(search -fnon-lvalue-assign)

so I did the conversion mentioned there. This works:

--- _regex.c 2009-07-29 11:34:00.000000000 +0200
+++ n 2009-07-30 15:15:22.000000000 +0200
@@ -1459,7 +1459,7 @@
if (text_ptr < (RE_CHAR*)context->slice_end && text_ptr[0] != '\n')
{
context->node = node->next_1;
- ++(RE_CHAR*)context->text_ptr;
+ ++*(RE_CHAR**)&context->text_ptr;
} else
context = reject_context(state, context);
break;

--
Wolfgang

Wolfgang Rohdewald

unread,

Jul 30, 2009, 9:24:28 AM7/30/09

to pytho...@python.org

On Thursday 30 July 2009, Wolfgang Rohdewald wrote:
> so I did the conversion mentioned there. This works:

I actually do not know if it works - but it compiles.

--
Wolfgang

Piet van Oostrum

unread,

Jul 30, 2009, 9:39:35 AM7/30/09

to

>>>>> MRAB <pyt...@mrabarnett.plus.com> (M) wrote:

>M> Hi all,
>M> I've been working on a new implementation of the re module. The details
>M> are at http://bugs.python.org/issue2636, specifically from
>M> http://bugs.python.org/issue2636#msg90954. I've included a .pyd file for
>M> Python 2.6 on Windows if you want to try it out.

>M> I'm interested in how fast it is generally, compared with the current re
>M> module, but especially when faced with those 'pathological' regular
>M> expressions which seem to take a long time to finish, for example:

>M> re.search(r"^(.+|D)*A$", "x" * 25 + "B")

>M> which on my PC (1.8GHz) takes 18.98secs with the re module but <0.01secs
>M> with this new implementation.

Is this version also going to use the Thompson approach?
--
Piet van Oostrum <pi...@cs.uu.nl>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: pi...@vanoostrum.org

Hrvoje Niksic

unread,

Jul 30, 2009, 2:32:10 PM7/30/09

to

MRAB <pyt...@mrabarnett.plus.com> writes:

> So it complains about:
>
> ++(RE_CHAR*)context->text_ptr
>
> but not about:
>
> ++info->repeat.count
>
> Does this mean that the gcc compiler thinks that the cast makes it an
> rvalue?

The cast operator does return an rvalue, treating it otherwise used to
be an extension to popular compilers, including ironically gcc. The
standard-compliant way of writing the above would be:

++ *(RE_CHAR **) &context->text_ptr

MRAB

unread,

Jul 30, 2009, 5:44:31 PM7/30/09

to pytho...@python.org

Yes, it works. I've updated my code accordingly and it'll be in the next
release.

MRAB

unread,

Jul 30, 2009, 6:06:05 PM7/30/09

to pytho...@python.org

Piet van Oostrum wrote:
>>>>>> MRAB <pyt...@mrabarnett.plus.com> (M) wrote:
>
>> M> Hi all,
>> M> I've been working on a new implementation of the re module. The details
>> M> are at http://bugs.python.org/issue2636, specifically from
>> M> http://bugs.python.org/issue2636#msg90954. I've included a .pyd file for
>> M> Python 2.6 on Windows if you want to try it out.
>
>> M> I'm interested in how fast it is generally, compared with the current re
>> M> module, but especially when faced with those 'pathological' regular
>> M> expressions which seem to take a long time to finish, for example:
>
>> M> re.search(r"^(.+|D)*A$", "x" * 25 + "B")
>
>> M> which on my PC (1.8GHz) takes 18.98secs with the re module but <0.01secs
>> M> with this new implementation.
>
> Is this version also going to use the Thompson approach?

That approach is what inspired the method in the new implementation, but
its speed comes from asking _whether_ there's a match, not _where_
there's a match, so it can ignore the route taken to get the match.

"grep", for example, selects those lines which match a pattern, whereas
the typical Python (and Perl) usage is to extract part of a string or
which part of the string matches.

Trying to support lazy and possessive quantifiers in Thompson's approach
is tricky (at least, I haven't been able to figure it out), and
back-references are right out! :-)

There's always the chance that I could come up with some sort of hybrid
scheme, applying different approaches depending on certain conditions.
It already switches from depth-first to breadth-first when it's taken
too many iterations without advancing and merges similar states, which
is why it's so much better with 'pathological' patterns; I just have to
get the depth-first phase to be as fast as the current 're' module.

John Machin

unread,

Aug 3, 2009, 9:00:33 AM8/3/09

to

On Jul 28, 2:34 am, MRAB <pyt...@mrabarnett.plus.com> wrote:
> Hi all,

>
> I've been working on a new implementation of the re module. The details

> are athttp://bugs.python.org/issue2636, specifically fromhttp://bugs.python.org/issue2636#msg90954. I've included a .pyd file for

> Python 2.6 on Windows if you want to try it out.
>

Where/how should we report perceived bugs: On that bugs.python.org
issue? Here? Private e-mail?

MRAB

unread,

Aug 3, 2009, 12:00:05 PM8/3/09

to pytho...@python.org

Probably on http://bugs.python.org/issue2636.

Alex Willmer

unread,

Aug 4, 2009, 8:30:05 PM8/4/09

to

On Jul 27, 5:34 pm, MRAB <pyt...@mrabarnett.plus.com> wrote:
> Hi all,
>
> I've been working on a new implementation of the re module. The details
> are athttp://bugs.python.org/issue2636, specifically fromhttp://bugs.python.org/issue2636#msg90954. I've included a .pyd file for
> Python 2.6 on Windows if you want to try it out.

Firstly Matthew, thank you for all your work on this. It brings some
very nice regex features to Python.

I've used Christopher Arndt's post as a basis and created a package
from you latest upload (issue2636-20090804.zip), which builds for
Python 2.5 and 2.6. I'd like to see this on PyPI, so it's easier to
install the module and your work gets wider exposure. Would this be
alright and would you prefer to have control of the upload, as this is
your work?

Below is the setup.py, the unicodedata_db.h files are taken from the
appropriate branches on svn.python.org

#!/usr/bin/env python

import shutil
import sys

from distutils.core import setup, Extension

MAJOR, MINOR = sys.version_info[:2]

# Copy appropriate unicodedata_db.h, not found in published includes
if (MAJOR, MINOR) == (2, 6):
shutil.copy('Python26/unicodedata_db.h', './')
elif (MAJOR, MINOR) == (2, 5):
shutil.copy('Python25/unicodedata_db.h', './')
else:
print "WARNING: No unicodedata_db.h prepared."

setup(
name='regex',
version='20080804',
description='Alternate regular expression module, to replace re.',
author='Matthew Barnett',
author_email='pyt...@mrabarnett.nospam.plus.com', # Obsfucated
url='http://bugs.python.org/issue2636',

py_modules = ['regex'],
ext_modules=[Extension('_regex', ['_regex.c'])],
)

Sincerely, Alex Willmer