Utility to locate errors in regular expressions

Malte Forkel

May 24, 2013, 8:58:24 AM

to pytho...@python.org

Finding out why a regular expression does not match a given string can
be very tedious. I would like to write a utility that identifies the
sub-expression causing the non-match. My idea is to use a parser to
create a tree representing the complete regular expression. Then I could
simplify the expression by dropping sub-expressions one by one from
right to left and from bottom to top until the remaining regex matches.
The last sub-expression dropped should be (part of) the problem.

As a first step, I am looking for a parser for Python regular
expressions, or a Python regex grammar to create a parser from.

But maybe my idea is flawed? Or a similar (or better) tool already
exists? Any advice will be highly appreciated!

Malte

Roy Smith

May 24, 2013, 9:12:16 AM

In article <mailman.2062.1369400...@python.org>,

I think this would be a really cool tool. The debugging process I've
always used is essentially what you describe. I start by trying
progressively shorter sub-patterns until I get a match, then try to
incrementally add back little bits of the original pattern until it no
longer matches. With luck, the problem will become obvious at that point.

Having a tool which automated this would be really useful.
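
A minimal sketch of that manual process, for illustration only. The helper name longest_matching_prefix is made up here, and the approach is deliberately crude: it cuts the pattern string character by character instead of working on a parse tree, so a shortened prefix can compile and still mean something different from the original (a truncated alternation, for instance).

```python
import re


def longest_matching_prefix(pattern, text):
    """Return (prefix, rest): the longest prefix of `pattern` that compiles
    and matches at the start of `text`, plus the part that was dropped."""
    for end in range(len(pattern), 0, -1):
        prefix = pattern[:end]
        try:
            compiled = re.compile(prefix)
        except re.error:
            # The cut fell inside a construct (an open "[" or "(", a
            # trailing backslash, ...), so this prefix is not a valid regex.
            continue
        if compiled.match(text):
            return prefix, pattern[end:]
    # Not even a one-character prefix matched.
    return "", pattern


prefix, rest = longest_matching_prefix(r"\d+-\d+-[a-z]+", "123-456-XYZ")
print("matches up to:", repr(prefix))
print("dropped tail: ", repr(rest))
```

In this example the dropped tail is the character class `[a-z]+`, which cannot match the upper-case `XYZ`. The start of the dropped tail is only a hint at where to look, not a diagnosis.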

Of course, most of the Python user community are wimps and shy away from
big hairy regexes [ducking and running].

Devin Jeanpierre

May 24, 2013, 9:13:41 AM

to Malte Forkel, pytho...@python.org

On Fri, May 24, 2013 at 8:58 AM, Malte Forkel <malte....@berlin.de> wrote:
> As a first step, I am looking for a parser for Python regular
> expressions, or a Python regex grammar to create a parser from.

The sre_parse module is undocumented, but very usable.
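
A small sketch of what the parser gives you, with the caveat that this is an undocumented internal module whose output format may differ between Python versions. The fallback import assumes that newer Pythons (3.11 and later) expose the same parser as re._parser, and top_level_items is just an illustrative helper name.

```python
try:
    from re import _parser as sre_parse  # Python 3.11 and later (assumption)
except ImportError:
    import sre_parse  # earlier versions


def top_level_items(pattern):
    """Parse `pattern` and return its top-level (opcode, argument) pairs."""
    # parse() returns a SubPattern, which behaves like a sequence of pairs.
    return list(sre_parse.parse(pattern))


for op, arg in top_level_items(r"\d+-[a-z]+"):
    # Nested constructs (repeats, groups, branches) carry their own
    # sub-patterns inside `arg`.
    print(op, arg)
```

The top level of this pattern has three items: a repeat, a literal "-", and another repeat. A tool that drops sub-expressions would need to walk the nested arguments as well.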

> But maybe my idea is flawed? Or a similar (or better) tool already
> exists? Any advice will be highly appreciated!

I think your task is made problematic by the possibility that no
single part of the regexp causes a match failure. What causes failure
depends on what branches are chosen with the |, *, +, ?, etc.
operators -- it might be a different character/subexpression for each
branch. And then there are exponentially many possible branches.

-- Devin

Roy Smith

May 24, 2013, 9:40:12 AM

In article <mailman.2065.1369401...@python.org>,

That's certainly true. The full power of regex makes stuff like this
very hard to do in the general case. That being said, people tend to
write regexen which match hunks of text from left to right.

So, in theory, it's probably an intractable problem. But, in practice,
such a tool would actually be useful in a large set of real-life cases.

Neil Cerutti

May 24, 2013, 9:58:37 AM

On 2013-05-24, Roy Smith <r...@panix.com> wrote:
> Of course, most of the Python user community are wimps and shy away
> from big hairy regexes [ducking and running].

I prefer the simple, lumbering regular expressions like those in
the original Night of the Regular Expressions. The fast, powerful
ones from programs like the remake of Dawn of the GREP, just
aren't as scary.

--
Neil Cerutti

rusi

May 24, 2013, 10:09:41 AM

python-specific: http://kodos.sourceforge.net/
Online: http://gskinner.com/RegExr/
emacs-specific: re-builder and regex-tool http://bc.tech.coop/blog/071103.html

Christian Gollwitzer

May 24, 2013, 2:21:39 PM

On 24.05.13 14:58, Malte Forkel wrote:

> Finding out why a regular expression does not match a given string can
> be very tedious. I would like to write a utility that identifies the
> sub-expression causing the non-match.

Try

http://laurent.riesterer.free.fr/regexp/

it shows the subexpressions which cause the match by coloring the parts.
Not exactly what you want, but very intuitive and powerful. Beware this
is Tcl and there might be subtle differences in RE syntax, but largely
it's the same.

Christian