I'm interested in how fast it is generally, compared with the current re module, but especially when faced with those 'pathological' regular expressions which seem to take a long time to finish, for example:
re.search(r"^(.+|D)*A$", "x" * 25 + "B")
which on my PC (1.8GHz) takes 18.98secs with the re module but <0.01secs with this new implementation.
"Many customs in this life persist because they ease friction and promote productivity as a result of universal agreement, and whether they are precisely the optimal choices is much less important." --Henry Spencer
-- --OKB (not okblacke) Brendan Barnwell "Do not follow where the path may lead. Go, instead, where there is no path, and leave a trail." --author unknown
Aahz wrote: > In article <mailman.3787.1248712420.8015.python-l...@python.org>, > MRAB <pyt...@mrabarnett.plus.com> wrote: >> I've been working on a new implementation of the re module. The details >> are at http://bugs.python.org/issue2636, specifically from >> http://bugs.python.org/issue2636#msg90954. I've included a .pyd file for >> Python 2.6 on Windows if you want to try it out.
> How does it handle the re module's unit tests?
Basically, it passes all those tests I expect it to pass. :-)
It fails those where the intended behaviour has changed, such as re.sub treating unmatched groups as empty strings, as requested in http://bugs.python.org/issue1519638.
Also, you need to copy "unicodedata_db.h" from the "Modules" directory of the Python source tree to your working directory, since this file apparently is not installed into the include directory of a Python installation.
I get an error for Python 2.5 on Mac OS X 10.4 and Linux though. Seems that the module is not compatible with Python 2.5.
_regex.c: In function 'getstring': _regex.c:2425: error: invalid type argument of '->' _regex.c: In function 'getstring': _regex.c:2425: error: invalid type argument of '->' lipo: can't figure out the architecture type of: /var/tmp// ccT3oDXD.out error: command 'gcc' failed with exit status 1
resp. on Linux:
_regex.c: In function 'getstring': _regex.c:2425: warning: implicit declaration of function 'Py_TYPE' _regex.c:2425: error: invalid type argument of '->' _regex.c: In function 'get_match_replacement': _regex.c:2900: warning: implicit declaration of function 'PyLong_AsSsize_t' error: command 'gcc' failed with exit status 1
With the official Python 2.6 distribution for Mac OS X it works.
> Also, you need to copy "unicodedata_db.h" from the "Modules" directory > of the Python source tree to your working directory, since this file > apparently is not installed into the include directory of a Python > installation.
> I get an error for Python 2.5 on Mac OS X 10.4 and Linux though. Seems > that the module is not compatible with Python 2.5.
> _regex.c: In function 'getstring': > _regex.c:2425: error: invalid type argument of '->' > _regex.c: In function 'getstring': > _regex.c:2425: error: invalid type argument of '->' > lipo: can't figure out the architecture type of: /var/tmp// > ccT3oDXD.out > error: command 'gcc' failed with exit status 1
> resp. on Linux:
> _regex.c: In function 'getstring': > _regex.c:2425: warning: implicit declaration of function 'Py_TYPE' > _regex.c:2425: error: invalid type argument of '->' > _regex.c: In function 'get_match_replacement': > _regex.c:2900: warning: implicit declaration of function > 'PyLong_AsSsize_t' > error: command 'gcc' failed with exit status 1
> With the official Python 2.6 distribution for Mac OS X it works.
The source code is intended to replace the current 're' module in Python 2.7 (and I'll be porting it to Python 3.2), so I'm not that worried about Python versions earlier than 2.6 for testing, although if there's sufficient need then I could tweak the sources for 2.5.
On 28-07-2009, MRAB wrote: > With the official Python 2.6 distribution for Mac OS X it works.
> The source code is intended to replace the current 're' module in Python > 2.7 (and I'll be porting it to Python 3.2), so I'm not that worried > about Python versions earlier than 2.6 for testing, although if there's > sufficient need then I could tweak the sources for 2.5.
I understand now why i could'nt compile it !
So, i would like if it's not too much work for you.
MRAB <pyt...@mrabarnett.plus.com> wrote: >Aahz wrote: >> In article <mailman.3787.1248712420.8015.python-l...@python.org>, >> MRAB <pyt...@mrabarnett.plus.com> wrote: >>> I've been working on a new implementation of the re module. The details >>> are at http://bugs.python.org/issue2636, specifically from >>> http://bugs.python.org/issue2636#msg90954. I've included a .pyd file for >>> Python 2.6 on Windows if you want to try it out.
>> How does it handle the re module's unit tests?
>Basically, it passes all those tests I expect it to pass. :-)
>It fails those where the intended behaviour has changed, such as re.sub >treating unmatched groups as empty strings, as requested in >http://bugs.python.org/issue1519638.
Then you should definitely publish to PyPI and post a message to c.l.py.announce to get more users. -- Aahz (a...@pythoncraft.com) <*> http://www.pythoncraft.com/
"Many customs in this life persist because they ease friction and promote productivity as a result of universal agreement, and whether they are precisely the optimal choices is much less important." --Henry Spencer
> I'm interested in how fast it is generally, compared with the current re > module, but especially when faced with those 'pathological' regular > expressions which seem to take a long time to finish, for example:
> re.search(r"^(.+|D)*A$", "x" * 25 + "B")
> which on my PC (1.8GHz) takes 18.98secs with the re module but <0.01secs > with this new implementation.
I tried this on my 3GHz PC timings pretty much the same.
<_regex.RE_Match object at 0x0243A1A0> time 0.00279001940191 It's going to crash <_sre.SRE_Match object at 0x024396B0> It hasn't crashed time 98.4238155967
> TIA
I also got the files bm_regex_effbot.py and bm_regex_v8.py from http://code.google.com/p/unladen-swallow/source/browse/#svn/tests/per... and ran them, then reran them having substituted regex for re. Output timings were roughly effbot re 0.14secs, effbot regex 1.16secs, v8 re 0.17secs and v8 regex 0.67secs.
>> With the official Python 2.6 distribution for Mac OS X it works. >> The source code is intended to replace the current 're' module in Python >> 2.7 (and I'll be porting it to Python 3.2), so I'm not that worried >> about Python versions earlier than 2.6 for testing, although if there's >> sufficient need then I could tweak the sources for 2.5.
> I understand now why i could'nt compile it !
> So, i would like if it's not too much work for you.
There's a new version which should compile with Python 2.5 now.
On Jul 27, 11:34 am, MRAB <pyt...@mrabarnett.plus.com> wrote:
> I've been working on a new implementation of the re module.
Fabulous!
If you're extending/changing the interface, there are a couple of sore points in the current implementation I'd love to see addressed:
- findall/finditer doesn't find overlapping matches. Sometimes you really *do* want to know all possible matches, even if they overlap. This comes up in bioinformatics, for example.
- split won't split on empty patterns, e.g. empty lookahead patterns. This means that it can't be used for a whole class of interesting cases. This has been discussed previously:
- It'd be nice to have a version of split that generates the parts (one by one) rather than returning the whole list.
- Repeated subgroup match information is not available. That is, for a match like this
re.match('(.){3}', 'xyz')
there's no way to discover that the subgroup first matched 'x', then matched 'y', and finally matched 'z'. Here is one past proposal (mine), perhaps over-complex, to address this problem:
Mike wrote: > On Jul 27, 11:34 am, MRAB <pyt...@mrabarnett.plus.com> wrote: >> I've been working on a new implementation of the re module.
> Fabulous!
> If you're extending/changing the interface, there are a couple of sore > points in the current implementation I'd love to see addressed:
> - findall/finditer doesn't find overlapping matches. Sometimes you > really *do* want to know all possible matches, even if they overlap. > This comes up in bioinformatics, for example.
Perhaps by adding "overlapped=True"?
> - split won't split on empty patterns, e.g. empty lookahead patterns. > This means that it can't be used for a whole class of interesting > cases. This has been discussed previously:
Already addressed (see issue2636 for the full details).
> - It'd be nice to have a version of split that generates the parts > (one by one) rather than returning the whole list.
Hmm, re.splititer() perhaps.
> - Repeated subgroup match information is not available. That is, for > a match like this
> re.match('(.){3}', 'xyz')
> there's no way to discover that the subgroup first matched 'x', then > matched 'y', and finally matched 'z'. Here is one past proposal > (mine), perhaps over-complex, to address this problem:
On Jul 29, 10:45 am, MRAB <pyt...@mrabarnett.plus.com> wrote:
> Mike wrote: > > - findall/finditer doesn't find overlapping matches. Sometimes you > > really *do* want to know all possible matches, even if they overlap.
> Perhaps by adding "overlapped=True"?
Something like that would be great, yes.
> > - split won't split on empty patterns, e.g. empty lookahead patterns. > Already addressed (see issue2636 for the full details).
Glad to hear it.
> > - Repeated subgroup match information is not available. That is, for > > a match like this
> > re.match('(.){3}', 'xyz')
> > there's no way to discover that the subgroup first matched 'x', then > > matched 'y', and finally matched 'z'. Here is one past proposal > > (mine), perhaps over-complex, to address this problem:
I agree that that document looks a little scary--maybe I was trying to bite off too much at once.
My intuition, though, is that the basic idea should be fairly simple to implement, at least for a depth-first matcher. The repeated match subgroups are already being discovered, it's just that they're not being saved, so there's no way to report them out once a complete match is found. If some trail of breadcrumbs were pushed onto a stack during the DFS, it could be traced at the end. And the whole thing might not even been that expensive to do.
The hardest parts about this, in my mind, are figuring out how to report the repeated matches out in a useful form (hence all that detail in the proposal), and getting users to understand that using this feature *could* suck up a lot of memory, if they're not careful.
As always, it's possible that my intuition is totally wrong. Plus I'm not sure how this would work out in the breadth-first case.
Details aside, I would really, really, really like to have a way to get at the repeated subgroup matches. I write a lot of code that would be one-liners if this capability existed. Plus, it just plain burns me that Python is discovering this information but impudently refuses to tell me what it's found! ;-)
> Also, you need to copy "unicodedata_db.h" from the "Modules" > directory of the Python source tree to your working directory, > since this file apparently is not installed into the include > directory of a Python installation.
using issue2636-20090729.zip
I have Python 2.6.2 on ubuntu 9.04
with ggc-4.3: gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall - Wstrict-prototypes -fPIC -I/usr/include/python2.6 -c _regex.c -o build/temp.linux-i686-2.6/_regex.o _regex.c: In Funktion »bmatch_context«: _regex.c:1462: Fehler: Als Erhöhungsoperand wird L-Wert erfordert _regex.c:1470: Fehler: Als Erhöhungsoperand wird L-Wert erfordert _regex.c:1478: Fehler: Als Verringerungsoperand wird L-Wert erfordert
with gcc-4.4: gcc-4.4 -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict- prototypes -fPIC -I/usr/include/python2.6 -c _regex.c -o build/temp.linux-i686-2.6/_regex.o _regex.c: In function ‘bmatch_context’: _regex.c:1462: error: lvalue required as increment operand _regex.c:1470: error: lvalue required as increment operand _regex.c:1478: error: lvalue required as decrement operand
Wolfgang Rohdewald wrote: > On Tuesday 28 July 2009, Christopher Arndt wrote: >> setup(name='regex', >> version='1.0', >> py_modules = ['regex'], >> ext_modules=[Extension('_regex', ['_regex.c'])], >> )
>> Also, you need to copy "unicodedata_db.h" from the "Modules" >> directory of the Python source tree to your working directory, >> since this file apparently is not installed into the include >> directory of a Python installation.
> using issue2636-20090729.zip
> I have Python 2.6.2 on ubuntu 9.04
> with ggc-4.3: > gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall - > Wstrict-prototypes -fPIC -I/usr/include/python2.6 -c _regex.c -o > build/temp.linux-i686-2.6/_regex.o > _regex.c: In Funktion »bmatch_context«: > _regex.c:1462: Fehler: Als Erhöhungsoperand wird L-Wert erfordert > _regex.c:1470: Fehler: Als Erhöhungsoperand wird L-Wert erfordert > _regex.c:1478: Fehler: Als Verringerungsoperand wird L-Wert erfordert
> with gcc-4.4: > gcc-4.4 -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict- > prototypes -fPIC -I/usr/include/python2.6 -c _regex.c -o > build/temp.linux-i686-2.6/_regex.o > _regex.c: In function ‘bmatch_context’: > _regex.c:1462: error: lvalue required as increment operand > _regex.c:1470: error: lvalue required as increment operand > _regex.c:1478: error: lvalue required as decrement operand
There are other lines which are similar, eg line 1487. Do they all give the same/similar error with your compiler?
Wolfgang Rohdewald wrote: > On Thursday 30 July 2009, MRAB wrote: >> There are other lines which are similar, eg line 1487. Do they all >> give the same/similar error with your compiler?
Does this mean that the gcc compiler thinks that the cast makes it an rvalue? I'm using Visual C++ 2008 Express Edition, which doesn't complain. What does the C standard say?
> Does this mean that the gcc compiler thinks that the cast makes it > an rvalue? I'm using Visual C++ 2008 Express Edition, which doesn't > complain. What does the C standard say?
>>>>> MRAB <pyt...@mrabarnett.plus.com> (M) wrote: >M> Hi all, >M> I've been working on a new implementation of the re module. The details >M> are at http://bugs.python.org/issue2636, specifically from >M> http://bugs.python.org/issue2636#msg90954. I've included a .pyd file for >M> Python 2.6 on Windows if you want to try it out. >M> I'm interested in how fast it is generally, compared with the current re >M> module, but especially when faced with those 'pathological' regular >M> expressions which seem to take a long time to finish, for example: >M> re.search(r"^(.+|D)*A$", "x" * 25 + "B") >M> which on my PC (1.8GHz) takes 18.98secs with the re module but <0.01secs >M> with this new implementation.
Is this version also going to use the Thompson approach? -- Piet van Oostrum <p...@cs.uu.nl> URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4] Private email: p...@vanoostrum.org
MRAB <pyt...@mrabarnett.plus.com> writes: > So it complains about:
> ++(RE_CHAR*)context->text_ptr
> but not about:
> ++info->repeat.count
> Does this mean that the gcc compiler thinks that the cast makes it an > rvalue?
The cast operator does return an rvalue, treating it otherwise used to be an extension to popular compilers, including ironically gcc. The standard-compliant way of writing the above would be: