
Making regex suck less


Gerson Kurz
Sep 1, 2002, 2:31:29 PM

I wrote a small wrapper module around "struct" today that will allow
me to write C declarations

...
print p.declare("""

struct test1
{
    int a,b;
    float this,and,that;
};

struct test2
{
    int count;
    struct test1 data[80];
};

""")
...
and have it create instances I can easily assign data to
...
t2 = p.createInstance("test2")
t2.data[0].a = 42
...
and pack for C extensions by utilizing the struct module
...
data = p.pack(t2)
...

all because I was fed up with that pack("qh34s>id",...) stuff. (I'll
post it to my website once it's in a state where I can let somebody
else see it ;)
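A minimal sketch of how such a wrapper might hang together (the names loosely follow the snippet above, but the parser here is a toy assumption that handles only scalar int/float fields; the real module surely does more):

```python
import struct

# toy sketch: parse a (very) small subset of C declarations and pack
# instances with the struct module; only scalar int/float fields work
TYPE_CODES = {"int": "i", "float": "f"}

class StructDef:
    def __init__(self, fields):
        self.fields = fields                        # [(name, typecode), ...]
        self.format = "".join(c for _, c in fields)

class Instance:
    def __init__(self, sdef):
        self._def = sdef
        for name, _ in sdef.fields:
            setattr(self, name, 0)                  # zero-initialise fields

def declare(text):
    """Parse one 'struct name { ... };' body into a StructDef."""
    body = text[text.index("{") + 1 : text.index("}")]
    fields = []
    for decl in body.split(";"):
        decl = decl.strip()
        if not decl:
            continue
        ctype, names = decl.split(None, 1)
        for name in names.split(","):
            fields.append((name.strip(), TYPE_CODES[ctype]))
    return StructDef(fields)

def pack(inst):
    values = [getattr(inst, name) for name, _ in inst._def.fields]
    return struct.pack(inst._def.format, *values)

t1 = declare("struct test1 { int a,b; float x; };")
inst = Instance(t1)
inst.a = 42
data = pack(inst)    # packs (42, 0, 0) with format "iif"
```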

Anyway, that got me thinking: why do we have to deal with regular
expressions like r"((?:a|b)*)", when in most cases the code will look
something like this:

r = re.compile("<some cryptic re-string here>")
...
r.match(this) or r.find(that)

which means the real time is not spent in the compile() function, but
in the match or find function. So basically, couldn't one come up with
a *human readable* syntax for re, and compile that instead? Python
prides itself on its clean syntax and human readability, and bang -
import re, and you get perl-ish code instantly!

Also, I think it would already be an improvement if the syntax
provided for clear and easy-to-understand special cases, like

re.compile("anything that starts with 'abc'")

and if you cannot find something in the special cases for you, you can
always go back to

re.compile("<some cryptic re-string here>")

After all, *everyone* starting with re thinks the syntax is cryptic
and mind-boggling, and only once you get yourself into the "re mindset"
do you understand things like r"\s*\w+\s*=\s*['\"].*?['\"]" instantly. If
we had an easier syntax, more people would be using re ;)
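(For comparison: the re.VERBOSE flag, which later posts in this thread mention, already lets that last example be spread out and commented, which is a half-step toward readability:)

```python
import re

# the cryptic one-liner from above, annotated with re.VERBOSE
ASSIGNMENT = re.compile(r"""
    \s* \w+ \s*        # key
    =                  # equals sign
    \s*
    ['"] .*? ['"]      # quoted value (non-greedy)
""", re.VERBOSE)

m = ASSIGNMENT.match("name = 'Gerson'")
```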

Is the idea utterly foolish?

Gerhard Häring
Sep 1, 2002, 3:13:56 PM

* Gerson Kurz <gerso...@t-online.de> [2002-09-01 18:31 +0000]:
> [...] Anyway, that got me thinking on why do we have to deal with

> regular expressions like r"((?:a|b)*)", when in most cases the code
> will look something like this:
>
> r = re.compile("<some cryptic re-string here>")
> ...
> r.match(this) or r.find(that)

If you only use the RE once, you can use the module-level functions ;-)

> which means the real time is not spent in the compile() function, but
> in the match or find function. So basically, couldn't one come up with
> a *human readable* syntax for re, and compile that instead?

That's equally powerful? Most probably not.

> Also, I think it would already be an improvement if the syntax
> provided for clear and easy-to-understand special cases, like
>
> re.compile("anything that starts with 'abc'")

s.startswith("abc")
s.lower().startswith("abc")

> and if you cannot find something in the special cases for you, you can
> always go back to
>
> re.compile("<some cryptinc re-string here>")
>
> After all, *everyone* starting with re thinks the syntax is cryptic
> and mind-boggling, and only if you get yourself into the "re mindset",
> you understand things like r"\s*\w+\s*=\s*['\"].*?['\"]" instantly. If
> we had an easier syntax, more people would be using re ;)
>
> Is the idea utterly foolish?

I don't really know. IMO if you have very simple string-searching, then
you can probably get away with the string methods, and if you have very
complex stuff, then you'll probably be better off with a parser generator
(like SimpleParse, which is very readable, IMO).

I don't find regular expressions that unreadable, especially when I
consider that I'd have to write many lines of error-prone Python code
instead. Stuff like this is just too convenient:

# working around zxDateTime limitations:
if JYTHON:
    import re

    ISO_DATE_RE = re.compile(r"(\d\d\d\d)-(\d\d)-(\d\d)")

    def DateFrom(s):
        match = ISO_DATE_RE.match(s)
        if match is None:
            raise ValueError
        return DateTime(*map(int, match.groups()))

Gerhard
--
mail: gerhard <at> bigfoot <dot> de registered Linux user #64239
web: http://www.cs.fhm.edu/~ifw00065/ OpenPGP public key id AD24C930
public key fingerprint: 3FCC 8700 3012 0A9E B0C9 3667 814B 9CAA AD24 C930
reduce(lambda x,y:x+y,map(lambda x:chr(ord(x)^42),tuple('zS^BED\nX_FOY\x0b')))

Henrik Motakef
Sep 1, 2002, 4:37:34 PM

gerso...@t-online.de (Gerson Kurz) writes:

> So basically, couldn't one come up with a *human readable* syntax
> for re, and compile that instead? Python prides itself on its clean
> syntax, and human readability, an bang - import re, get perl-ish
> code instantly!

There is an Emacs Lisp package called "symbolic regexps" (or
sregexp.el) that lets you write regular expressions in standard lisp
syntax, like

(sregexq bol (or "abc" "def"))

instead of "^(abc|def)" (bol meaning "beginning of line"). I don't see
how that maps elegantly to Python syntax, however.
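(One hypothetical mapping -- the helpers below are invented for illustration, not an existing package -- would use plain functions and values that build ordinary re-syntax strings:)

```python
import re

# toy combinators returning ordinary re-syntax strings
bol = "^"                                 # beginning of line/string

def either(*alts):
    # non-capturing alternation over literal alternatives
    return "(?:%s)" % "|".join(re.escape(a) for a in alts)

def seq(*parts):
    return "".join(parts)

pattern = seq(bol, either("abc", "def"))  # -> "^(?:abc|def)"
rx = re.compile(pattern)
```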

Regards
Henrik

A.M. Kuchling
Sep 1, 2002, 5:07:24 PM

In article <3d72588...@news.t-online.de>, Gerson Kurz wrote:
> in the match or find function. So basically, couldn't one come up with
> a *human readable* syntax for re, and compile that instead? Python

Maybe Ka-Ping Yee's rxb? http://web.lfw.org/python/rxb15.py

The problem with a new syntax is that no one else would be using it, so
you'd still need to learn the existing syntax for use with grep, vi, Perl,
&c. (It wouldn't surprise me if Perl 6's revised regexes run into this very
difficulty and don't gain much adoption.)

--amk

François Pinard
Sep 1, 2002, 5:16:46 PM

[Henrik Motakef]

> There is an Emacs Lisp package called "symbolic regexps" (or sregexp.el)
> that lets you write regular expressions in standard lisp syntax, like
> (sregexq bol (or "abc" "def")) instead of "^(abc|def)" (bol meaning
> "beginning of line"). I don't see how that maps elegantly to Python syntax,
> however.

PLEX has a Python way to describe regular expressions. It is likely available
stand-alone. I found the copy I use within the Pyrex distribution, see:

http://www.cosc.canterbury.ac.nz/~greg/python/Pyrex/

--
François Pinard http://www.iro.umontreal.ca/~pinard

John Roth
Sep 1, 2002, 7:40:36 PM

"A.M. Kuchling" <a...@amk.ca> wrote in message
news:un50cco...@corp.supernews.com...

I see you've already gotten to my suggestion - look at Apocalypse 5
and Exegesis 5 on the O'Reilly Perl page and see what Larry Wall
has done.

I tend to agree that Larry is going out on a limb. However, since he's
done so, if we are thinking about a new regex syntax, we really
should consider doing it the same way. As you point out, there's going
to be enough difficulty getting one radical revision accepted. Getting
two different ones accepted is likely to sink everyone's effort.

Also, doing it the same way will simplify the Python effort in Parrot
(assuming anyone cares about that any more...)

And as far as grep, vi and so forth are concerned (Perl isn't
an issue - that's the way it's going), doing something to fix them
isn't all that hard - at least for the GNU versions of the programs.



Carl Banks
Sep 1, 2002, 7:17:19 PM

Gerson Kurz wrote:
> So basically, couldn't one come up with
> a *human readable* syntax for re, and compile that instead? Python
> prides itself on its clean syntax, and human readability, an bang -
> import re, get perl-ish code instantly!


You might want to look at the Plex package. It defines patterns by
constructing data structures. Something like this:

symbol = Range("A-Za-z") + Any(Range("0-9A-Za-z") | Char("_"))
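(If Any here is read as "zero or more repetitions" -- Plex itself spells repetition differently, so that's an assumption -- the plain re equivalent is roughly:)

```python
import re

# re equivalent of the Plex-style token pattern above, assuming the
# second part means "zero or more of" the digit/letter/underscore set
symbol = re.compile(r"[A-Za-z][0-9A-Za-z_]*")

m = symbol.match("foo_99 bar")   # matches the leading identifier
```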

However, three points:

First, this will certainly be slower than regular expressions, since
there are many Python calls needed to build the structure. (Of course,
after you've compiled it, it can be as fast as regexps.)

Second, even if you use re module, it is still nowhere near Perl-ish
ugliness. You still have Python's clean syntax outside of the
pattern.

Third, readability is not a unilateral good thing; conciseness is also
important, and sometimes opposed to readability. Sacrificing a little
readability to get a lot of conciseness is usually a good thing. I
think, as long as the regexp is not too obnoxious, it is probably
better to keep it concise. (Of course, this depends a lot on what
you're doing and how flexible you need to be.)

--
CARL BANKS
http://www.aerojockey.com

François Pinard
Sep 1, 2002, 8:45:58 PM

[Carl Banks]

> You might want to look at the Plex package. [...]


>
> First, this will certainly be slower than regular expressions, since
> there are many Python calls needed to build the structure. (Of course,
> after you've compiled it, it can be as fast as regexps.)

The Plex matching engine does not backtrack, which might be an advantage over
Python regexps at matching time, at least theoretically for some regexps.
Building the matching tables, however, consumes a lot of time. So I guess
that for most usual regexps, Python regexps are quite OK.

> Second, even if you use re module, it is still nowhere near Perl-ish
> ugliness. You still have Python's clean syntax outside of the pattern.

A clear and definite advantage! :-) :-)

> Third, readability is not a unilateral good thing; conciseness is also

> important [...]

Python regexps, given some flag at compilation time, may allow for embedded
whitespace and comments. With proper care, difficult regexps could be made
less compact and more readable, without changing at all how they behave at
run-time. Conciseness is a quality for short regexps. Horrid regexps are
more advantageously written non-compactly.

Carl Banks
Sep 1, 2002, 7:45:55 PM

Gerhard Häring wrote:
>> which means the real time is not spent in the compile() function, but
>> in the match or find function. So basically, couldn't one come up with
>> a *human readable* syntax for re, and compile that instead?
>
> That's equally powerful? Most probably not.

Why not? It won't be as fast, but it should be able to do anything a
regexp can do, and would be much more versatile.

Gerson Kurz
Sep 2, 2002, 1:02:23 AM

On Sun, 01 Sep 2002 21:07:24 -0000, "A.M. Kuchling" <a...@amk.ca>
wrote:

>The problem with a new syntax is that no one else would be using it, so
>you'd still need to learn the existing syntax for use with grep, vi, Perl,
>&c. (It wouldn't surprise me if Perl 6's revised regexes run into this very
>difficulty and don't gain much adoption.)

A tongue-in-cheek answer: I don't use grep, I use python. I don't use
vi, I use python. I don't use perl, I use python :)

Seriously, my thinking was, the re.compile function is there to
compile an expression to a binary representation for optimized
searching. So maybe, a "clean syntax" -> "ugly re syntax" compiler
would be good?

Besides, you could use the output ("ugly re syntax") in grep, vi, perl
if you really so intended.
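A toy version of such a compiler (the phrase vocabulary here is invented for illustration) could translate a small fixed set of English phrases into re syntax, and hand back the "ugly" string for reuse elsewhere:

```python
import re

# toy "clean syntax" -> re-string compiler; the phrase list is made up
PHRASES = [
    ("anything that starts with ", lambda arg: "^" + re.escape(arg)),
    ("anything that ends with ",   lambda arg: re.escape(arg) + "$"),
]

def compile_clean(spec):
    for prefix, build in PHRASES:
        if spec.startswith(prefix):
            ugly = build(spec[len(prefix):].strip("'"))
            return re.compile(ugly), ugly   # re object plus re-syntax string
    raise ValueError("phrase not understood: %r" % spec)

rx, ugly = compile_clean("anything that starts with 'abc'")
```

The returned `ugly` string is exactly what you could paste into grep or vi.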


Gerhard Häring
Sep 2, 2002, 1:21:35 AM

Gerson Kurz wrote in comp.lang.python:

> "A.M. Kuchling" <a...@amk.ca> wrote:
>>The problem with a new syntax is that no one else would be using it, so
>>you'd still need to learn the existing syntax for use with grep, vi, Perl,
>>&c. (It wouldn't surprise me if Perl 6's revised regexes run into this very
>>difficulty and don't gain much adoption.)
>
> A tongue-in-cheek answer: I don't use grep, I use python. I don't use
> vi,

vi? Yuck. vim or die :-P

> I use python. I don't use perl, I use python :)
>
> Seriously, my thinking was, the re.compile function is there to
> compile an expression to a binary representation for optimized
> searching. So maybe, a "clean syntax" -> "ugly re syntax" compiler
> would be good?

I think it would. Though the other way round wouldn't be that bad,
either.

-- Gerhard

Greg Ewing
Sep 2, 2002, 1:55:05 AM

François Pinard wrote:

> [Henrik Motakef]


>
> PLEX has a Python way to describe regular expressions. It is likely available
> stand-alone.

http://www.cosc.canterbury.ac.nz/~greg/python/Plex/

You get documentation as well if you get it from there. :-)

Note that Plex's RE implementation is very special
purpose -- you couldn't use it as a direct replacement
for the re module. But a wrapper for the RE module
which uses the same syntax could easily be made.

--
Greg Ewing, Computer Science Dept,
University of Canterbury,
Christchurch, New Zealand
http://www.cosc.canterbury.ac.nz/~greg

John La Rooy
Sep 1, 2002, 6:21:32 PM

I think the main problem is that *human readable* doesn't map really
well onto regular expressions.

What would the equivalent of r"(.)(.)(.)\3\2\1" be?
It means "a palindrome of 6 characters", but it is unlikely that the
human-readable processor would understand that (isn't it?)

It would be more likely to look like this (I haven't put too much
thought into this)
"anything,anything,anything,same_as_3rd,same_as_2nd,same_as_1st"
or would you like to suggest something else?

palindrome_6 = re.compile(r"(.)(.)(.)\3\2\1")
palindrome_6 = re.compile("anything,anything,anything,same_as_3rd,same_as_2nd,same_as_1st")

Sure there are some cases where the re is loaded with meta characters...

Hmmm. OK, is this about writing maintainable code, or about people not
wanting to learn all the ins and outs of REs?

John

Harvey Thomas
Sep 2, 2002, 7:06:08 AM

John La Rooy wrote

I used to use OmniMark a lot when it was free. With OmniMark's
equivalent of REs, the palindrome would be

any => char1 any => char2 any => char3 char3 char2 char1

I selected a non-trivial OmniMark RE from old code at random and came up with

('<!DOCTYPE' white-space+ [any except white-space]+ white-space+ "PUBLIC" white-space+ '"'
upto-inc('"')) => a.whole (white-space+ "SYSTEM"? white-space* '"'
upto-inc('"'))?

The (untested) Python RE equivalent is something like this (note that a
Python group name can't contain a dot, so a.whole becomes a_whole):

"""(?P<a_whole><!DOCTYPE\s+[^\n]+\s+"PUBLIC"\s+"[^"]+")(?:\s+(?:"SYSTEM")?\s+"[^"]*")?"""

or, more readably if compiled with the re.VERBOSE flag

"""
(?P<a_whole>
    <!DOCTYPE
    \s+[^\n]+
    \s+"PUBLIC"
    \s+"[^"]+")
(?:\s+(?:"SYSTEM")?
    \s+"[^"]*")?
"""


Which is the easiest to understand?

I'm used to REs, so I don't find the verbose Python RE too difficult to read. When I was learning
OmniMark, however, it was nice to be able to use "digit" and "letter" rather than
"\d" and [a-zA-Z] as creating non-trivial effective search patterns is never easy.



Carl Banks
Sep 2, 2002, 3:08:47 PM

John La Rooy wrote:
> Carl Banks wrote:
>> Gerhard Häring wrote:
>>
>>>>which means the real time is not spent in the compile() function, but
>>>>in the match or find function. So basically, couldn't one come up with
>>>>a *human readable* syntax for re, and compile that instead?
>>>
>>>That's equally powerful? Most probably not.
>>
>> Why not? It won't be as fast, but it should be able to do anything a
>> regexp can do, and would be much more versatile.
>
> I think the main problem is that *human readable* doesn't map really
> well onto regular expressions.

Ridiculous. If you can map human readable code into machine language,
then you can map human readable code into regular expressions.

Cryptic as they are, regular expressions are still systematic; thus it
is possible to systematically convert the cryptic regexp syntax into
more readable and consistent syntax.


> What would the equivalent of r"(.)(.)(.)\3\2\1" be?
> This means a "palindrome of 6 characters"
> But it is unlikely that the human readable processor would understand
> that (isn't it??)

Nope. You don't appear to appreciate the power of computers to
translate human-readable text into complicated internal data, and you
are evidently forgetting that interpreters such as Python do a much
more difficult translation of readable text.


> It would be more likely to look like this (I haven't put too much
> thought into this)

No kidding.


> "anything,anything,anything,same_as_3rd,same_as_2nd,same_as_1st"
> or would you like to suggest something else?

How about:

pattern = Group(Any()) + Group(Any()) + Group(Any()) \
          + GroupRef(3) + GroupRef(2) + GroupRef(1)

There's no reason it has to be re.compile with a string.
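A minimal sketch of such combinators (hypothetical helpers following the names above, implemented as plain string builders on top of re):

```python
import re

# toy pattern combinators; names follow the example above
def Any():
    return "."                 # any single character

def Group(pat):
    return "(%s)" % pat        # numbered capturing group

def GroupRef(n):
    return "\\%d" % n          # backreference to group n

pattern = (Group(Any()) + Group(Any()) + Group(Any())
           + GroupRef(3) + GroupRef(2) + GroupRef(1))

palindrome_6 = re.compile(pattern)
```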


[snip]


>
> Sure there are some cases where the re is loaded with meta characters...

That's the idea, chief. For a simple regexp like you gave above, it
would be overkill to use a human readable syntax. And it would still
be overkill for many regexps more complicated than that.

But eventually, the regexps will become complicated enough that a more
human readable syntax is preferable. Not to mention that a human
readable syntax will be more versatile, when that is needed.


> hmmm
> OK is this about writing maintainable code or people not wanting to
> learn all the ins and outs of re's?

Nope. For me, this is about understanding that complicated regexps
could benefit from a more readable and consistent syntax, and that the
more consistent syntax could add a lot of power and versatility to
regexps.

Bengt Richter
Sep 2, 2002, 3:30:46 PM

On Sun, 01 Sep 2002 21:07:24 -0000, "A.M. Kuchling" <a...@amk.ca> wrote:

I agree about new syntax, but I wouldn't mind having a re.help(regexp) function
for interactive use that would just explain in 'English' what the regexp
stands for. It would be a nice easy double-check on whether I wrote what I
meant, and helpful for understanding someone else's magic. It shouldn't be
that hard to do.
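(A rough cut at such a helper -- covering only a handful of tokens, purely to illustrate the idea, not a complete regex parser:)

```python
# a very partial re.help(): explains only a few common tokens,
# falling back to "the literal ..." for everything else
TOKENS = [
    ("\\d", "a digit"),
    ("\\s", "whitespace"),
    ("\\w", "a word character"),
    ("*",   "zero or more of the preceding"),
    ("+",   "one or more of the preceding"),
    ("?",   "the preceding, optionally"),
    (".",   "any character"),
    ("^",   "start of string"),
    ("$",   "end of string"),
]

def explain(regexp):
    out, i = [], 0
    while i < len(regexp):
        for tok, meaning in TOKENS:
            if regexp.startswith(tok, i):
                out.append(meaning)
                i += len(tok)
                break
        else:
            out.append("the literal %r" % regexp[i])
            i += 1
    return "; ".join(out)
```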

Regards,
Bengt Richter

John Hall
Sep 2, 2002, 3:51:44 PM

On 2 Sep 2002 19:30:46 GMT, bo...@oz.net (Bengt Richter) wrote:

>I agree about new syntax, but I wouldn't mind having a re.help(regexp) function
>for interactive use that would just explain in 'English' what the regexp expression
>stands for. It would be a nice easy double check on whether I wrote what I meant,
>and helpful for understanding someone else's magic. It shouldn't be that hard to do.
>

Many years ago, in my PL/I & IBM Mainframe days, I wrote a gizmo to
check the result of some tricky JCL (Job Control Language) parameters.
(Required because if it did not do what the user intended, it could
fail in an overnight run and the user wouldn't know until morning)

Not only did it take an expression and show the result, but it gave a
decision-by-decision commentary on why the result was what it was, so
provided a live tutorial by way of a simulator. It was moderately
difficult to write, IIRC, but most difficult was checking that the
simulator was in accord with reality.

However, I don't understand regexes well enough to attempt this yet.

--
John W Hall <wweexxss...@telusplanet.net>
Calgary, Alberta, Canada.
"Helping People Prosper in the Information Age"

John Roth
Sep 2, 2002, 5:49:23 PM

"Gerson Kurz" <gerso...@t-online.de> wrote in message
news:3d72588...@news.t-online.de...

No, it's not utterly foolish. You might be surprised to learn that
Larry Wall agrees with you that the Perl regex syntax is much
too obtuse, and in need of a basic, ground-up redesign. Even
current Perl syntax allows a special form where you can
insert blanks for readability.

http://www.perl.com/pub/a/2002/06/04/apo5.html

http://www.perl.com/pub/a/2002/08/22/exegesis5.html

It's an interesting redesign of basic regex functionality.
Some of the things you can do with it are very, very
interesting indeed.

John Roth


John La Rooy
Sep 2, 2002, 7:10:39 AM

Bengt Richter wrote:

>
> I agree about new syntax, but I wouldn't mind having a re.help(regexp) function
> for interactive use that would just explain in 'English' what the regexp expression
> stands for. It would be a nice easy double check on whether I wrote what I meant,
> and helpful for understanding someone else's magic. It shouldn't be that hard to do.
>
> Regards,
> Bengt Richter

I think that's an excellent idea (probably has already been done,
anybody know?)

It is unfortunate that often a re that starts off as a simple idea
turns out to be a big mess. I think some sort of graphical tool
would be even better so you could see what the re does. Especially
if it could convert both ways and allow you to edit the re.

John


John La Rooy
Sep 2, 2002, 7:23:18 AM

Carl Banks wrote:

>>It would be more likely to look like this (I haven't put too much
>>thought into this)
>
>
> No kidding.
>
>
>
>>"anything,anything,anything,same_as_3rd,same_as_2nd,same_as_1st"
>>or would you like to suggest something else?
>
>
> How about:
>
> pattern = Group(Any()) + Group(Any()) + Group(Any()) \
> + GroupRef(3) + GroupRef(2) + GroupRef(1)
>

Err, semantically that's exactly the same as the re and my suggestion;
only the syntax is different. It's still nothing like saying

pattern = "6 character palindrome"

John

Carl Banks
Sep 2, 2002, 7:34:29 PM

John La Rooy wrote:
> Carl Banks wrote:
>
>>>"anything,anything,anything,same_as_3rd,same_as_2nd,same_as_1st"
>>>or would you like to suggest something else?
>>
>>
>> How about:
>>
>> pattern = Group(Any()) + Group(Any()) + Group(Any()) \
>> + GroupRef(3) + GroupRef(2) + GroupRef(1)
>>
> Err, semantically that's exactly the same as the re and my suggestion
> only the syntax is different. It's still nothing like saying
>
> pattern = "6 character palindrome"


Oh. Well, methinks you don't give humans enough credit. I'm a human,
and I can read the verbose regexp. Of course, humans can read regular
expressions, too. Maybe I don't give humans enough credit, either.
:-)

The point is, I was arguing for less-cryptic, more-verbose regular
expressions as a way to make complicated patterns more transparent. I
certainly wasn't arguing for "6 character palindrome".


Sorry-for-the-confusion-ly y'rs

jep...@unpythonic.net
Sep 2, 2002, 8:37:40 PM

Do you mean something like this?

def palindrome_re(n):
    pat = ["(.)" * ((n+1)/2)]
    for i in range(n/2, 0, -1):
        pat.append("\\%d" % i)
    return "".join(pat)

With a little work, you can extend this to use named groups and named
backrefs as well, so that you can use it as a building block for larger
patterns:

def Any(): return "."
def Group(s, g): return "(?P<%s>%s)" % (g, s)
def Backref(g): return "(?P=%s)" % g
def Or(*args): return "|".join(args)

def palindrome_re(n, p):
    pat = [Group(Any(), "%s%d") % (p, i+1) for i in range((n+1)/2)]
    for i in range(n/2, 0, -1):
        pat.append(Backref("%s%d" % (p, i)))
    return "".join(pat)

I think that building REs in functions is a great approach for more
complex REs.

>>> q = re.compile(palindrome_re(7, "a") + palindrome_re(6, "b"))
>>> q.match("abcdcbaxyzzyx")
<_sre.SRE_Match object at 0x401c4f00>
>>> _.groupdict()
{'l4': 'd', 'l2': 'b', 'l3': 'c', 'l1': 'a', 'i1': 'x', 'i3': 'z', 'i2': 'y'}
>>> q = re.compile(Or(palindrome_re(7, "a"), palindrome_re(6, "b")))
>>> q.match('abccbb')
>>> q.match("abcdcba")
<_sre.SRE_Match object at 0x401c4f00>
>>> q.match("abccba")
<_sre.SRE_Match object at 0x402e5020>

Jeff

Greg Ewing
Sep 2, 2002, 9:18:52 PM

John Roth wrote:

> http://www.perl.com/pub/a/2002/06/04/apo5.html
>
> http://www.perl.com/pub/a/2002/08/22/exegesis5.html

I just had a brief look at this, and the underlying
ideas seem to be a lot like the way Snobol patterns
work.

Maybe it's time for me to resurrect the Snobol-style
pattern matching module that I started on a while
back and never got around to releasing.

Would anyone be interested in this? Its
syntax is similar to that of Plex REs, except that
the primitives are Snobol-like, and it uses a
backtracking matching algorithm that's much more
powerful than a DFA (you can write entire parsers
in it, for example).

Terry Hancock
Sep 3, 2002, 1:06:00 AM

From: Greg Ewing <see_repl...@something.invalid>

> Maybe it's time for me to resurrect the Snobol-style
> pattern matching module that I started on a while
> back and never got around to releasing.
>
> Would anyone be interested in this? Its
> syntax is similar to that of Plex REs, except that
> the primitives are Snobol-like, and it uses a
> backtracking matching algorithm that's much more
> powerful than a DFA (you can write entire parsers
> in it, for example).

Some interesting references, possibly? --

http://sourceforge.net/projects/pystemmer/
http://snowball.sourceforge.net

These may be more specialized -- Snowball is
specifically for algorithmically stemming words,
and PyStemmer is an interface to it. I haven't
really looked into how it works. The name is
related to SNOBOL, but I'm not sure how much
Snowball actually resembles it (if at all).

I'm using pystemmer as part of a function which
converts object titles to (hopefully) mnemonic
file names (ids) in Zope.

But I thought it might be relevant to you.

Cheers,
Terry

--
------------------------------------------------------
Terry Hancock
han...@anansispaceworks.com
Anansi Spaceworks
http://www.anansispaceworks.com
P.O. Box 60583
Pasadena, CA 91116-6583
------------------------------------------------------

Max M
Sep 3, 2002, 3:39:25 AM

John La Rooy wrote:

> Bengt Richter wrote:
>>
> It is unfortunate that often a re that starts off as a simple idea
> turns out to be a big mess. I think some sort of graphical tool
> would be even better so you could see what the re does. Especially
> if it could convert both ways and allow you to edit the re.


There used to be a Tkinter tool like this in Python 1.5.2, written
by Guido. I don't know what happened to it, though.

regards Max m

Max M
Sep 3, 2002, 3:58:54 AM

Carl Banks wrote:

>>What would the equivalent of r"(.)(.)(.)\3\2\1"
>>This means a "palindrome of 6 characters"
>>But it is unlikely that the human readable processor would understand
>>that (isn't it??)

Why not just write the parts of the regex as named strings?

like::

name = '[a-zA-Z0-9_.]+'
at = '@'
dot = '\.'
topDomain = 'com|org|dk'
email = name + at + name + dot + topDomain

instead of::

email = '[a-zA-Z0-9_.]+@[a-zA-Z0-9_.]+\.com|org|dk'

Well, my syntax is most likely wrong as I suck at regex, but the meaning
should be clear enough.
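(The snippet nearly works; the one real problem is that the alternation needs grouping, since a bare | binds across the whole pattern rather than just the top-level domains. A corrected, runnable version:)

```python
import re

# same named-pieces idea, with the alternation grouped so that |
# binds only over the top-level domains, and the match anchored
name      = '[a-zA-Z0-9_.]+'
at        = '@'
dot       = r'\.'
topDomain = '(?:com|org|dk)'
email     = re.compile(name + at + name + dot + topDomain + '$')

m = email.match("max@mxm.dk")
```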

A.M. Kuchling
Sep 3, 2002, 8:23:58 AM

In article <3D74672D...@mxm.dk>, Max M wrote:
> There used to be a tkinter tool like this in Python under 1.52 written
> by Guido. I don't know what happened to it though.

It's still around as Tools/scripts/redemo.py, I think.

--amk

John Roth
Sep 3, 2002, 8:47:40 AM

"Greg Ewing" <see_repl...@something.invalid> wrote in message
news:3D740DFC...@something.invalid...

> John Roth wrote:
>
> > http://www.perl.com/pub/a/2002/06/04/apo5.html
> >
> > http://www.perl.com/pub/a/2002/08/22/exegesis5.html
>
>
> I just had a brief look at this, and the underlying
> ideas seem to be a lot like the way Snobol patterns
> work.
>
> Maybe it's time for me to resurrect the Snobol-style
> pattern matching module that I started on a while
> back and never got around to releasing.
>
> Would anyone be interested in this? Its
> syntax is similar to that of Plex REs, except that
> the primitives are Snobol-like, and it uses a
> backtracking matching algorithm that's much more
> powerful than a DFA (you can write entire parsers
> in it, for example).

I don't know much about Snobol, unfortunately.
I think my biggest issue here is that we shouldn't
reinvent the wheel unless there is a good reason.
In other words, Larry is taking Perl in a specific
direction. Assuming we want to make major
changes to regex, is there any _good_ reason
for doing something conceptually different and
consequently adding to the cacophony?

John Roth

Huaiyu Zhu
Sep 3, 2002, 1:42:27 PM

jep...@unpythonic.net <jep...@unpythonic.net> wrote:
>On Mon, Sep 02, 2002 at 09:23:18PM +1000, John La Rooy wrote:
>> Carl Banks wrote:
>> >
>> >pattern = Group(Any()) + Group(Any()) + Group(Any()) \
>> > + GroupRef(3) + GroupRef(2) + GroupRef(1)
>> >
>> Err symantically that's exactly the same as the re and my suggestion
>> only the syntax is different. It's still nothing like saying
>>
>> pattern = "6 character palindrome"
>
>Do you mean something like this?
>
> def palindrome_re(n):
> pat = ["(.)" * ((n+1)/2)]
> for i in range(n/2, 0, -1):
> pat.append("\\%d" % i)
> return "".join(pat)
[snip]

>
>I think that building REs in functions is a great approach for more
>complex REs.

It would also be useful if patterns can be built up as structured objects:

def palindrome_re(n):
    pat = Empty()
    for i in range(n): pat += Group(Any())
    for i in range(n, 0, -1): pat += GroupRef(i)
    return pat

class Palindrom:
    def __init__(self, n): self.n = n
    def __call__(self): return palindrome_re(self.n)

pattern = Palindrom(6)

Once the tree structure is revealed it could be mapped to whatever syntax
that is convenient for the situation. The perl-like syntax would just be
one of them.

Huaiyu

Carl Banks
Sep 3, 2002, 2:13:32 PM

John Roth wrote:
> I don't know much about Snobol, unfortunately.
> I think my biggest issue here is that we shouldn't
> reinvent the wheel unless there is a good reason.
> In other words, Larry is taking Perl in a specific
> direction. Assuming we want to make major
> changes to regex,

I don't think anyone wants to make changes to regular expressions, per
se. I think this discussion is about adding a higher level syntax to
regular expressions, that makes them more readable and versatile, for
cases where that's important.


> is there any _good_ reason
> for doing something conceptually different and
> consequently adding to the cacophony?

I think there is a good reason to have a higher level syntax. It
serves a purpose heretofore unserved: that complex regular expressions
no longer have to be unbearably unreadable.

I could live without it, though.

John Roth
Sep 3, 2002, 6:20:57 PM

"Carl Banks" <imb...@vt.edu> wrote in message
news:al30tq$aud$1...@solaris.cc.vt.edu...

> John Roth wrote:
> > I don't know much about Snobol, unfortunately.
> > I think my biggest issue here is that we shouldn't
> > reinvent the wheel unless there is a good reason.
> > In other words, Larry is taking Perl in a specific
> > direction. Assuming we want to make major
> > changes to regex,
>
> I don't think anyone wants to make changes to regular expressions, per
> se. I think this discussion is about adding a higher level syntax to
> regular expressions, that makes them more readable and versatile, for
> cases where that's important.

One of the things Perl has right now is the ability to use spaces
within regular expressions via a suffix flag. In Perl 6, this will
become
standard, and will include the ability to include comments!

Apocalypse 5 is well worth reading. Larry sets out why he
decided to redesign regular expression syntax. Much of the
reason was exactly what you specify: it's hard to read, hard
to understand, and has picked up lots of cruft over the years.

Making regular expressions "more versatile" is a great
idea. There's lots of things I'd like to do. However,
everything you load onto the poor thing makes it that
much more complex, hard to understand, hard to read
and error-prone. At some point, the mess has to be
redesigned. Since Python's re syntax isn't currently
as complex as Perl's (I think) it might not be there yet.

Skip Montanaro
Sep 3, 2002, 6:36:41 PM

John> One of the things Perl has right now is the ability to use spaces
John> within regular expressions via a suffix flag. In Perl 6, this will
John> become standard, and will include the ability to include comments!

Python also supports that with the re.VERBOSE flag.

Skip

Fernando Pereira
Sep 3, 2002, 8:09:33 PM

On 9/3/02 2:13 PM, in article al30tq$aud$1...@solaris.cc.vt.edu, "Carl Banks"

<imb...@vt.edu> wrote:
> I think there is a good reason to have a higher level syntax. It
> serves a purpose heretofore unserved: that complex regular expressions
> no longer have to be unbearably unreadable.
The stronger version of this point is that at the moment there are no
methods for composing complex expressions from simpler ones that do not
involve manipulating strings, like building complex Python programs by
taking their pieces as strings, gluing them together and passing the result
to eval. Clearly a mess. For a good example of an alternative approach (in
Scheme) see

http://www.ai.mit.edu/people/shivers/sre.txt

The beauty of this kind of approach is that the expressive power of the host
language is fully available in constructing complex expressions.
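
A minimal sketch of that idea in Python (the helper names here are made
up for illustration, not part of any library):

```python
import re

# Compose patterns with ordinary Python functions instead of
# splicing strings by hand.
def either(*parts):
    return "(?:%s)" % "|".join(parts)

def star(part):
    return "(?:%s)*" % part

def group(part):
    return "(%s)" % part

# the r"((?:a|b)*)" example from the start of the thread, built up:
pattern = re.compile(group(star(either("a", "b"))))
m = pattern.match("abba")
```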

-- F

Fredrik Lundh

unread,
Sep 4, 2002, 3:14:06 AM9/4/02
to
Gerson Kurz wrote:

> Seriously, my thinking was, the re.compile function is there to
> compile an expression to a binary representation for optimized
> searching. So maybe, a "clean syntax" -> "ugly re syntax" compiler
> would be good?

note that the SRE engine contains an "ugly syntax" to "internal
data structure" parser, and an "internal data structure" to "engine
code" compiler.

it's probably easier (and definitely more efficient) to turn a clean
syntax into an "internal data structure" than into an ugly syntax.

(the next step is to use python's own parse tree instead of SRE's
internal structure, and use an extension to python's compiler for
the final step. make it all pluggable, and you have perl6...)

</F>


Fredrik Lundh

unread,
Sep 4, 2002, 3:14:08 AM9/4/02
to
Bengt Richter wrote:

> I agree about new syntax, but I wouldn't mind having a re.help(regexp) function
> for interactive use that would just explain in 'English' what the regexp expression
> stands for.

you can ask SRE to dump the internal parse tree
to stdout:

>>> sre.compile("[a-z]\d*", sre.DEBUG)
in
range (97, 122)
max_repeat 0 65535
in
category category_digit

turning this into 'English' is left as an exercise etc.

</F>


Bengt Richter

unread,
Sep 4, 2002, 3:15:42 PM9/4/02
to

Interesting, thanks. Does the above mean that sre can't fully match
'a'+'9'*65537
?

Regards,
Bengt Richter

Fredrik Lundh

unread,
Sep 5, 2002, 2:41:14 AM9/5/02
to
Bengt Richter wrote:

> >you can ask SRE to dump the internal parse tree
> >to stdout:
> >
> >>>> sre.compile("[a-z]\d*", sre.DEBUG)
> >in
> > range (97, 122)
> >max_repeat 0 65535
> > in
> > category category_digit
> >
> >turning this into 'English' is left as an exercise etc.
>
> Interesting, thanks. Does the above mean that sre can't fully match
> 'a'+'9'*65537
> ?

in this context, 65535 represents any number:

>>> import re
>>> p = re.compile("[a-z]\d*")
>>> s = "a"+"9"*65537
>>> len(s)
65538
>>> m = p.match(s)
>>> len(m.group(0))
65538

</F>


Ben Wolfson

unread,
Sep 5, 2002, 2:56:24 AM9/5/02
to
On Thu, 05 Sep 2002 06:41:14 GMT, "Fredrik Lundh" <fre...@pythonware.com>
wrote:

>Bengt Richter wrote:


>
>> >you can ask SRE to dump the internal parse tree
>> >to stdout:
>> >
>> >>>> sre.compile("[a-z]\d*", sre.DEBUG)
>> >in
>> > range (97, 122)
>> >max_repeat 0 65535
>> > in
>> > category category_digit
>> >
>> >turning this into 'English' is left as an exercise etc.
>>
>> Interesting, thanks. Does the above mean that sre can't fully match
>> 'a'+'9'*65537
>> ?
>
>in this context, 65535 represents any number:

Doesn't that cause problems for something like this?

>>> m=re.compile(r'\d{0,65535}a').match(('9'*1000000)+'a')
>>> len(m.group(0))
1000001

--
BTR
You're going to set me up as a kind of slovenly attached pig that
Jack Kornfeld can slice down in his violent zen compassion?
-- Larry Block

nbe...@fred.net

unread,
Sep 9, 2002, 8:53:56 PM9/9/02
to
IMHO, the thing I really miss for making regexes understandable
is that you can't include one regexp in another. If you could, it
would be much easier to write clear expressions. Flex, for example,
allows this.
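
A sketch of Flex-style inclusion via plain string interpolation (the
sub-pattern names below are invented for illustration):

```python
import re

# Name the pieces, then build larger patterns out of them.
DIGIT = r"\d"
IDENT = r"[A-Za-z_]\w*"
ASSIGNMENT = r"%s\s*=\s*%s+" % (IDENT, DIGIT)

m = re.match(ASSIGNMENT, "count = 42")
```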

Bengt Richter

unread,
Sep 9, 2002, 9:40:08 PM9/9/02
to
On Thu, 05 Sep 2002 06:56:24 GMT, Ben Wolfson <wol...@midway.uchicago.edu> wrote:

>On Thu, 05 Sep 2002 06:41:14 GMT, "Fredrik Lundh" <fre...@pythonware.com>
>wrote:
>
>>Bengt Richter wrote:
>>
>>> >you can ask SRE to dump the internal parse tree
>>> >to stdout:
>>> >
>>> >>>> sre.compile("[a-z]\d*", sre.DEBUG)
>>> >in
>>> > range (97, 122)
>>> >max_repeat 0 65535
>>> > in
>>> > category category_digit
>>> >
>>> >turning this into 'English' is left as an exercise etc.
>>>
>>> Interesting, thanks. Does the above mean that sre can't fully match
>>> 'a'+'9'*65537
>>> ?
>>
>>in this context, 65535 represents any number:
>
>Doesn't that cause problems for something like this?
>
>>>> m=re.compile(r'\d{0,65535}a').match(('9'*1000000)+'a')
>>>> len(m.group(0))
>1000001
>

Looks like a bug to me if {0,65535} acts like {0,}

BTW, a search for \d{0,65534} seems to mean it, and compiles
so slowly that I lost patience waiting. Not very optimized, I guess.

>>> import re
>>> m=re.compile(r'\d{0,65535}a').search(('9'*1000000)+'a')
>>> len(m.group(0))
1000001

That went reasonably in time (though it's wrong), but this snoozed.
It must be brute-forcing something.

>>> m=re.compile(r'\d{0,65534}a').search(('9'*1000000)+'a')
^C
[18:50] C:\pywk\junk>

Regards,
Bengt Richter
