
Re: how to avoid leading white spaces


Chris Rebert

Jun 1, 2011, 1:11:01 PM
to rakesh kumar, pytho...@python.org
On Wed, Jun 1, 2011 at 12:31 AM, rakesh kumar
<rakeshkum...@gmail.com> wrote:
>
> Hi
>
> i have a file which contains data
>
> //ACCDJ         EXEC DB2UNLDC,DFLID=&DFLID,PARMLIB=&PARMLIB,
> //         UNLDSYST=&UNLDSYST,DATABAS=MBQV1D0A,TABLE='ACCDJ       '
> //ACCT          EXEC DB2UNLDC,DFLID=&DFLID,PARMLIB=&PARMLIB,
> //         UNLDSYST=&UNLDSYST,DATABAS=MBQV1D0A,TABLE='ACCT        '
> //ACCUM         EXEC DB2UNLDC,DFLID=&DFLID,PARMLIB=&PARMLIB,
> //         UNLDSYST=&UNLDSYST,DATABAS=MBQV1D0A,TABLE='ACCUM       '
> //ACCUM1        EXEC DB2UNLDC,DFLID=&DFLID,PARMLIB=&PARMLIB,
> //         UNLDSYST=&UNLDSYST,DATABAS=MBQV1D0A,TABLE='ACCUM1      '
>
> i want to cut the white spaces which are in between single quotes after TABLE=.
>
> for example :
>                                'ACCT[spaces] '
>                                'ACCUM           '
>                                'ACCUM1         '
> the above is the output of another python script but its having a leading spaces.

Er, you mean trailing spaces. Since this is easy enough to be
homework, I will only give an outline:

1. Use str.index() and str.rindex() to find the positions of the
starting and ending single-quotes in the line.
2. Use slicing to extract the inside of the quoted string.
3. Use str.rstrip() to remove the trailing spaces from the extracted string.
4. Use slicing and concatenation to join together the rest of the line
with the now-stripped inner string.

Relevant docs: http://docs.python.org/library/stdtypes.html#string-methods
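
For readers arriving later, a minimal sketch of that outline (the
function name is mine, purely illustrative):

def strip_quoted_field(line):
    # 1. Find the opening and closing single quotes.
    start = line.index("'")
    end = line.rindex("'")
    # 2-3. Extract the quoted text and strip its trailing spaces.
    inner = line[start + 1:end].rstrip()
    # 4. Rejoin the rest of the line around the stripped string.
    return line[:start + 1] + inner + line[end:]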

Cheers,
Chris
--
http://rebertia.com

ru...@yahoo.com

Jun 1, 2011, 3:39:47 PM
to
On Jun 1, 11:11 am, Chris Rebert <c...@rebertia.com> wrote:
> On Wed, Jun 1, 2011 at 12:31 AM, rakesh kumar

For some odd reason (perhaps because they are used a lot in Perl),
this group seems to have a great aversion to regular expressions.
Too bad because this is a typical problem where their use is the
best solution.

import re
f = open("your file")
for line in f:
    fixed = re.sub(r"(TABLE='\S+)\s+'$", r"\1'", line)
    print fixed,

(The above is for Python-2, adjust as needed for Python-3)
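
For instance, one possible Python 3 rendering:

import re

with open("your file") as f:
    for line in f:
        # end='' plays the role of the trailing comma above: the line
        # already carries its own newline.
        print(re.sub(r"(TABLE='\S+)\s+'$", r"\1'", line), end='')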

Karim

Jun 1, 2011, 4:34:19 PM
to ru...@yahoo.com, pytho...@python.org
Rurpy,
Your solution is neat.
Simple is better than complicated... (at least for this simple issue)

Neil Cerutti

Jun 2, 2011, 9:21:06 AM
to
On 2011-06-01, ru...@yahoo.com <ru...@yahoo.com> wrote:
> For some odd reason (perhaps because they are used a lot in
> Perl), this group seems to have a great aversion to regular
> expressions. Too bad because this is a typical problem where
> their use is the best solution.

Python's str methods, when they're sufficient, are usually more
efficient.

Perl integrated regular expressions, while Python relegated them
to a library.

There is thus a large class of problems that are best solved with
regular expressions in Perl, but with str methods in Python.

--
Neil Cerutti

Roy Smith

Jun 2, 2011, 9:57:16 PM
to
In article <94ph22...@mid.individual.net>,
Neil Cerutti <ne...@norwich.edu> wrote:

> On 2011-06-01, ru...@yahoo.com <ru...@yahoo.com> wrote:
> > For some odd reason (perhaps because they are used a lot in
> > Perl), this group seems to have a great aversion to regular
> > expressions. Too bad because this is a typical problem where
> > their use is the best solution.
>
> Python's str methods, when they're sufficient, are usually more
> efficient.

I was all set to say, "prove it!" when I decided to try an experiment.
Much to my surprise, for at least one common case, this is indeed
correct.

-------------------------------------------------
#!/usr/bin/env python

import timeit

text = '''Lorem ipsum dolor sit amet, consectetur adipiscing
elit. Mauris congue risus et purus lobortis facilisis. In
nec quam dolor, non blandit tellus. Suspendisse tempus,
sapien ac mattis volutpat, lectus elit auctor lacus, vitae
accumsan nunc elit in ligula. Curabitur quis mauris
neque. Etiam auctor eleifend arcu in egestas. Pellentesque
non mauris sit amet nulla aliquam hendrerit pretium id
arcu. Ut fringilla tempor lorem eget tincidunt. Duis nibh
nisi, iaculis sed scelerisque in, facilisis quis
dui. Aliquam varius diam in turpis auctor dapibus. Fusce
aliquet erat vestibulum mauris volutpat id laoreet enim
fermentum. Nam at justo nibh, ut vulputate dui.
libero. Nunc ac risus justo, in sodales erat.
'''
text = ' '.join(text.split())

t1 = timeit.Timer("'laoreet' in text",
                  "text = '%s'" % text)
t2 = timeit.Timer("pattern.search(text)",
                  "import re; pattern = re.compile('laoreet'); text = '%s'" % text)
print t1.timeit()
print t2.timeit()
-------------------------------------------------
./contains.py
0.990975856781
1.91417002678
-------------------------------------------------

> Perl integrated regular expressions, while Python relegated them
> to a library.

The same way Python relegates most everything to a library :-)

MRAB

Jun 2, 2011, 10:41:47 PM
to pytho...@python.org
On 03/06/2011 02:57, Roy Smith wrote:
> In article<94ph22...@mid.individual.net>,
> Neil Cerutti<ne...@norwich.edu> wrote:
>
>> On 2011-06-01, ru...@yahoo.com<ru...@yahoo.com> wrote:
>>> For some odd reason (perhaps because they are used a lot in
>>> Perl), this group seems to have a great aversion to regular
>>> expressions. Too bad because this is a typical problem where
>>> their use is the best solution.
>>
>> Python's str methods, when they're sufficient, are usually more
>> efficient.
>
> I was all set to say, "prove it!" when I decided to try an experiment.
> Much to my surprise, for at least one common case, this is indeed
> correct.
>
[snip]

I've tested it on my PC for Python 2.7 (bytestring) and Python 3.1
(Unicode) and included the "regex" module on PyPI:

Python 2.7:
0.949936333562
4.31320052965
1.14035334748

Python 3.1:
1.27268308633
4.2509511537
1.16866839819

Chris Torek

Jun 2, 2011, 10:58:24 PM
to
In article <94ph22...@mid.individual.net>,
Neil Cerutti <ne...@norwich.edu> wrote:
>> Python's str methods, when they're sufficient, are usually more
>> efficient.

In article <roy-E2FA6F.2...@news.panix.com>
Roy Smith <r...@panix.com> replied:


>I was all set to say, "prove it!" when I decided to try an experiment.
>Much to my surprise, for at least one common case, this is indeed
>correct.

[big snip]


>t1 = timeit.Timer("'laoreet' in text",
> "text = '%s'" % text)
>t2 = timeit.Timer("pattern.search(text)",
> "import re; pattern = re.compile('laoreet'); text =
>'%s'" % text)
>print t1.timeit()
>print t2.timeit()
>-------------------------------------------------
>./contains.py
>0.990975856781
>1.91417002678
>-------------------------------------------------

This is a bit surprising, since both "s1 in s2" and re.search()
could use a Boyer-Moore-based algorithm for a sufficiently-long
fixed string, and the time required should be proportional to that
needed to set up the skip table. The re.compile() gets to re-use
the table every time. (I suppose "in" could as well, with some sort
of cache of recently-built tables.)

Boyer-Moore search is roughly O(M/N) where M is the length of the
text being searched and N is the length of the string being sought.
(However, it depends on the form of the string, e.g., searching
for "ababa" is not as good as searching for "abcde".)

Python might be penalized by its use of Unicode here, since a
Boyer-Moore table for a full 16-bit Unicode string would need
65536 entries (one per possible ord() value). However, if the
string being sought is all single-byte values, a 256-element
table suffices; re.compile(), at least, could scan the pattern
and choose an appropriate underlying search algorithm.

There is an interesting article here as well:
http://effbot.org/zone/stringlib.htm
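
For the curious, a rough Python sketch of the Horspool variant of
Boyer-Moore, using a dict for the skip table (one way to sidestep the
65536-entry problem; purely illustrative, not what the actual string
implementations do):

def horspool_find(text, pattern):
    # Return the lowest index of pattern in text, or -1 if absent.
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    # Skip distance for each character of the pattern except the last.
    skip = dict((ch, m - 1 - i) for i, ch in enumerate(pattern[:-1]))
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            return pos
        # Shift by the skip of the text character under the window's end;
        # characters not in the pattern allow a full-length shift.
        pos += skip.get(text[pos + m - 1], m)
    return -1
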
--
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
email: gmail (figure it out) http://web.torek.net/torek/index.html

Roy Smith

Jun 2, 2011, 11:44:40 PM
to
In article <is9ik...@news1.newsguy.com>,
Chris Torek <nos...@torek.net> wrote:

> Python might be penalized by its use of Unicode here, since a
> Boyer-Moore table for a full 16-bit Unicode string would need
> 65536 entries (one per possible ord() value).

I'm not sure what you mean by "full 16-bit Unicode string"? Isn't
unicode inherently 32 bit? Or at least 20-something bit? Things like
UTF-16 are just one way to encode it.

In any case, while I could imagine building a 2^16 entry jump table,
clearly it's infeasible (with today's hardware) to build a 2^32 entry
table. But, there's nothing that really requires you to build a table at
all. If I understand the algorithm right, all that's really required is
that you can map a character to a shift value.

For an 8 bit character set, an indexed jump table makes sense. For a
larger character set, I would imagine you would do some heuristic
pre-processing to see if your search string consisted only of characters
in one unicode plane and use that fact to build a table which only
indexes that plane. Or, maybe use a hash table instead of a regular
indexed table. Not as fast, but only slower by a small constant factor,
which is not a horrendous price to pay in a fully i18n world :-)

Chris Angelico

Jun 2, 2011, 11:52:03 PM
to pytho...@python.org
On Fri, Jun 3, 2011 at 1:44 PM, Roy Smith <r...@panix.com> wrote:
> In article <is9ik...@news1.newsguy.com>,
>  Chris Torek <nos...@torek.net> wrote:
>
>> Python might be penalized by its use of Unicode here, since a
>> Boyer-Moore table for a full 16-bit Unicode string would need
>> 65536 entries (one per possible ord() value).
>
> I'm not sure what you mean by "full 16-bit Unicode string"?  Isn't
> unicode inherently 32 bit?  Or at least 20-something bit?  Things like
> UTF-16 are just one way to encode it.

The size of a Unicode character is like the size of a number. It's not
defined in terms of a maximum. However, Unicode planes 0-2 have all
the defined printable characters, and there are only 17 planes in
total (0 through 16), so (since each plane is 2^16 characters) that
kinda makes Unicode an 18- to 21-bit code. UTF-16 / UCS-2, therefore,
uses two 16-bit code units to store a 20-bit number. Why do I get the feeling I've met
that before...
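
Concretely, the surrogate arithmetic (a worked example of the UTF-16
scheme, nothing Python-specific):

c = 0x1F600                      # a code point above the first plane
offset = c - 0x10000             # the 20-bit offset
hi = 0xD800 + (offset >> 10)     # high surrogate: top 10 bits
lo = 0xDC00 + (offset & 0x3FF)   # low surrogate: bottom 10 bits
assert (hi, lo) == (0xD83D, 0xDE00)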

Chris Angelico
136E:0100 CD 20 INT 20

Chris Angelico

Jun 2, 2011, 11:54:23 PM
to pytho...@python.org
On Fri, Jun 3, 2011 at 1:52 PM, Chris Angelico <ros...@gmail.com> wrote:
> However, Unicode planes 0-2 have all
> the defined printable characters

PS. I'm fully aware that there are ranges defined in plane 14 (hex E).
They're non-printing characters, and unlikely to be part of a text
string, although it is possible. So you can't shortcut things and
treat Unicode as 18-bit numbers; it has to be 21-bit. Doesn't have to
be 32-bit unless that's really convenient.

Chris Angelico

Chris Torek

Jun 3, 2011, 12:30:46 AM
to
>In article <is9ik...@news1.newsguy.com>,
> Chris Torek <nos...@torek.net> wrote:
>> Python might be penalized by its use of Unicode here, since a
>> Boyer-Moore table for a full 16-bit Unicode string would need
>> 65536 entries (one per possible ord() value).

In article <roy-751FAC.2...@news.panix.com>,
Roy Smith <r...@panix.com> wrote:
>I'm not sure what you mean by "full 16-bit Unicode string"? Isn't
>unicode inherently 32 bit?

Well, not exactly. As I understand it, Python is normally built
with a 16-bit "unicode character" type though (using either UCS-2
or UTF-16 internally; but I admit I have been far too lazy to look
up stuff like surrogates here :-) ).

>In any case, while I could imagine building a 2^16 entry jump table,
>clearly it's infeasible (with today's hardware) to build a 2^32 entry
>table. But, there's nothing that really requires you to build a table at
>all. If I understand the algorithm right, all that's really required is
>that you can map a character to a shift value.

Right. See the URL I included for an example. The point here,
though, is ... well:

>For an 8 bit character set, an indexed jump table makes sense. For a
>larger character set, I would imagine you would do some heuristic
>pre-processing to see if your search string consisted only of characters
>in one unicode plane and use that fact to build a table which only
>indexes that plane. Or, maybe use a hash table instead of a regular
>indexed table.

Just so. You have to pay for one scan through the string to build
a hash-table of offsets -- an expense similar to that for building
the 256-entry 8-bit table, perhaps, depending on string length --
but then you pay again for each character looked-at, since:

skip = hashed_lookup(table, this_char);

is a more complex operation than:

skip = table[this_char];

(where table is a simple array, hence the C-style semicolons: this
is not Python pseudo-code :-) ). Hence, a "penalty".

>Not as fast, but only slower by a small constant factor,
>which is not a horrendous price to pay in a fully i18n world :-)

Indeed.

Thorsten Kampe

Jun 3, 2011, 4:32:12 AM
to
* Roy Smith (Thu, 02 Jun 2011 21:57:16 -0400)

> In article <94ph22...@mid.individual.net>,
> Neil Cerutti <ne...@norwich.edu> wrote:
> > On 2011-06-01, ru...@yahoo.com <ru...@yahoo.com> wrote:
> > > For some odd reason (perhaps because they are used a lot in
> > > Perl), this group seems to have a great aversion to regular
> > > expressions. Too bad because this is a typical problem where
> > > their use is the best solution.
> >
> > Python's str methods, when they're sufficient, are usually more
> > efficient.
>
> I was all set to say, "prove it!" when I decided to try an experiment.
> Much to my surprise, for at least one common case, this is indeed
> correct.
> [...]

> t1 = timeit.Timer("'laoreet' in text",
> "text = '%s'" % text)
> t2 = timeit.Timer("pattern.search(text)",
> "import re; pattern = re.compile('laoreet'); text =
> '%s'" % text)
> print t1.timeit()
> print t2.timeit()
> -------------------------------------------------
> ./contains.py
> 0.990975856781
> 1.91417002678
> -------------------------------------------------

Strange that a lot of people (still) automatically associate
"efficiency" with "takes two seconds to run instead of one" (which I
guess no one really cares about).

Efficiency is much better measured by the time it saves you in writing
and maintaining the code in a readable way.

Thorsten

ru...@yahoo.com

Jun 3, 2011, 8:51:18 AM
to
On 06/02/2011 07:21 AM, Neil Cerutti wrote:
> > On 2011-06-01, ru...@yahoo.com <ru...@yahoo.com> wrote:
>> >> For some odd reason (perhaps because they are used a lot in
>> >> Perl), this group seems to have a great aversion to regular
>> >> expressions. Too bad because this is a typical problem where
>> >> their use is the best solution.
> >
> > Python's str methods, when they're sufficient, are usually more
> > efficient.

Unfortunately, except for the very simplest cases, they are often
not sufficient. I often find myself changing, for example, a
startswith() to an RE when I realize that the input can contain mixed
case or that I have to treat commas as well as spaces as delimiters.
After doing this a number of times, one starts to use an RE right
from the get-go unless one is VERY sure that there will be no
requirements creep.

And to regurgitate the mantra frequently used to defend Python when
it is criticized for being slow, the real question should be, are
REs fast enough? The answer almost always is yes.

> > Perl integrated regular expressions, while Python relegated them
> > to a library.

Which means that one needs one extra "import re" line that is
not required in Perl.

Since RE strings are compiled and cached, one often need not compile
them explicitly. Using match results often requires more lines
than in Perl:
    m = re.match (...)
    if m: do something with m
rather than Perl's
    if m/.../ {do something with capture group globals}
Any true Python fan should not find this a problem, the stock
response being, "what's the matter, your Enter key broken?"

> > There is thus a large class of problems that are best solved with
> > regular expressions in Perl, but with str methods in Python.

Guess that depends on what one's definition of "large" is.

There are a few simple things, admittedly common, that Python
provides functions for that Perl uses REs for: replace(), for
example. But so what? I don't know if Perl does it or not but
there is no reason why functions called with string arguments or
REs with no "magic" characters can't be optimized to something
about as efficient as a corresponding Python function. Such uses
are likely to be naively counted as "using an RE in Perl".

I would agree though that the selection of string manipulation
functions in Perl is not as nice or orthogonal as in Python, and
that this contributes to a tendency to use REs in Perl when one
doesn't need to. But that is a programmer tradeoff (as in Python)
between fast-coding/slow-execution and slow-coding/fast-execution.
I for one would use Perl's index() and substr() to identify and
manipulate fixed patterns when performance was an issue.
One runs into the same tradeoff in Python pretty quickly too
so I'm not sure I'd call that space between the two languages
"large".

The other tradeoff, applying both to Perl and Python is with
maintenance. As mentioned above, even when today's requirements
can be solved with some code involving several string functions,
indexes, and conditionals, when those requirements change, it is
usually a lot harder to modify that code than a RE.

In short, although your observations are true to some extent, they
are not sufficient to justify the anti-RE attitude often seen here.

Nobody

Jun 3, 2011, 9:11:57 AM
to
On Fri, 03 Jun 2011 04:30:46 +0000, Chris Torek wrote:

>>I'm not sure what you mean by "full 16-bit Unicode string"? Isn't
>>unicode inherently 32 bit?
>
> Well, not exactly. As I understand it, Python is normally built
> with a 16-bit "unicode character" type though

It's normally 32-bit on platforms where wchar_t is 32-bit (e.g. Linux).

Neil Cerutti

Jun 3, 2011, 9:17:59 AM
to
On 2011-06-03, ru...@yahoo.com <ru...@yahoo.com> wrote:
> The other tradeoff, applying both to Perl and Python is with
> maintenance. As mentioned above, even when today's
> requirements can be solved with some code involving several
> string functions, indexes, and conditionals, when those
> requirements change, it is usually a lot harder to modify that
> code than a RE.
>
> In short, although your observations are true to some extent,
> they are not sufficient to justify the anti-RE attitude often
> seen here.

Very good article. Thanks. I mostly wanted to combat the notion
that the alleged anti-RE attitude here might be caused by an
opposition to Perl culture.

I contend that the anti-RE attitude sometimes seen here is caused
by dissatisfaction with regexes in general combined with an
aversion to the re module. I agree that it's not that bad, but
it's clunky enough that it does contribute to making it my last
resort.

--
Neil Cerutti

Nobody

Jun 3, 2011, 9:18:40 AM
to
On Fri, 03 Jun 2011 02:58:24 +0000, Chris Torek wrote:

> Python might be penalized by its use of Unicode here, since a
> Boyer-Moore table for a full 16-bit Unicode string would need
> 65536 entries (one per possible ord() value). However, if the
> string being sought is all single-byte values, a 256-element
> table suffices; re.compile(), at least, could scan the pattern
> and choose an appropriate underlying search algorithm.

The table can be truncated or compressed at the cost of having to map
codepoints to table indices. Or use a hash table instead of an array.

Steven D'Aprano

Jun 3, 2011, 10:25:53 AM
to
On Fri, 03 Jun 2011 05:51:18 -0700, ru...@yahoo.com wrote:

> On 06/02/2011 07:21 AM, Neil Cerutti wrote:

>> > Python's str methods, when they're sufficent, are usually more
>> > efficient.
>
> Unfortunately, except for the very simplest cases, they are often not
> sufficient.

Maybe so, but the very simplest cases occur very frequently.


> I often find myself changing, for example, a startswith() to
> an RE when I realize that the input can contain mixed case

Why wouldn't you just normalise the case?

source.lower().startswith(prefix.lower())

Particularly if the two strings are short, this is likely to be much
faster than a regex.

Admittedly, normalising the case in this fashion is not strictly correct.
It works well enough for ASCII text, and probably Latin-1, but for
general Unicode, not so much. But neither will a regex solution. If you
need to support true case normalisation for arbitrary character sets,
Python isn't going to be much help for you. But for the rest of us, a
simple str.lower() or str.upper() might be technically broken but it will
do the job.
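
For what it's worth, str.casefold(), added in Python 3.3, applies the
full Unicode case-folding algorithm rather than a simple lowercase
mapping, and handles more of these cases:

print("straße".lower() == "strasse".lower())        # False
print("straße".casefold() == "strasse".casefold())  # True: 'ß' folds to 'ss'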


> or that I have
> to treat commas as well as spaces as delimiters.

source.replace(",", " ").split(" ")

[steve@sylar ~]$ python -m timeit -s "source = 'a b c,d,e,f,g h i j k'" "source.replace(',', ' ').split(' ')"
100000 loops, best of 3: 2.69 usec per loop

[steve@sylar ~]$ python -m timeit -s "source = 'a b c,d,e,f,g h i j k'" -s "import re" "re.split(',| ', source)"
100000 loops, best of 3: 11.8 usec per loop

re.split is about four times slower than the simple solution.


> After doing this a
> number of times, one starts to use an RE right from the get go unless
> one is VERY sure that there will be no requirements creep.

YAGNI.

There's no need to use a regex just because you think that you *might*,
someday, possibly need a regex. That's just silly. If and when
requirements change, then use a regex. Until then, write the simplest
code that will solve the problem you have to solve now, not the problem
you think you might have to solve later.


> And to regurgitate the mantra frequently used to defend Python when it
> is criticized for being slow, the real question should be, are REs fast
> enough? The answer almost always is yes.

Well, perhaps so.

[...]


> In short, although your observations are true to some extent, they
> are not sufficient to justify the anti-RE attitude often seen here.

I don't think that there's really an *anti* RE attitude here. It's more a
skeptical, cautious attitude to them, as a reaction to the Perl "when all
you have is a hammer, everything looks like a nail" love affair with
regexes.

There are a few problems with regexes:

- they are another language to learn, a very cryptic and terse language;
- hence code using many regexes tends to be obfuscated and brittle;
- they're over-kill for many simple tasks;
- and underpowered for complex jobs, and even some simple ones;
- debugging regexes is a nightmare;
- they're relatively slow;
- and thanks in part to Perl's over-reliance on them, there's a tendency
among many coders (especially those coming from Perl) to abuse and/or
misuse regexes; people react to that misuse by treating any use of
regexes with suspicion.

But they have their role to play as a tool in the programmers toolbox.

Regarding their syntax, I'd like to point out that even Larry Wall is
dissatisfied with regex culture in the Perl community:

http://www.perl.com/pub/2002/06/04/apo5.html

--
Steven

D'Arcy J.M. Cain

Jun 3, 2011, 10:58:57 AM
to Steven D'Aprano, pytho...@python.org
On 03 Jun 2011 14:25:53 GMT,
Steven D'Aprano <steve+comp....@pearwood.info> wrote:
> source.replace(",", " ").split(" ")

I would do;

source.replace(",", " ").split()

> [steve@sylar ~]$ python -m timeit -s "source = 'a b c,d,e,f,g h i j k'"

What if the string is 'a b c, d, e,f,g h i j k'?

>>> source.replace(",", " ").split(" ")
['a', 'b', 'c', '', 'd', '', 'e', 'f', 'g', 'h', 'i', 'j', 'k']
>>> source.replace(",", " ").split()
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k']

Of course, it may be that the former is what you want but I think that
the latter would be more common.

> There's no need to use a regex just because you think that you *might*,
> someday, possibly need a regex. That's just silly. If and when
> requirements change, then use a regex. Until then, write the simplest
> code that will solve the problem you have to solve now, not the problem
> you think you might have to solve later.

I'm not sure if this should be rule #1 for programmers but it
definitely needs to be one of the very low numbers. Trying to guess
the client's future requests is always a losing game.

--
D'Arcy J.M. Cain <da...@druid.net> | Democracy is three wolves
http://www.druid.net/darcy/ | and a sheep voting on
+1 416 425 1212 (DoD#0082) (eNTP) | what's for dinner.

ru...@yahoo.com

Jun 3, 2011, 11:14:30 AM
to

But I questioned the reasons given (not as efficient, not built
in, not often needed) for dissatisfaction with REs.[*] If those
reasons are not strong, then is not their Perl-smell still a leading
candidate for explaining the anti-RE attitude here?

Of course the whole question, lacking some serious group-psychological
investigation, is pure speculation anyway.

----
[*] A reason for not using REs not mentioned yet is that REs take
some time to learn. Thus, although most people will know how to use
Python string methods, only a subset of those will be familiar with
REs. But that doesn't seem like a reason for RE bashing either
since REs are easier to learn than SQL and one frequently sees
recommendations here to use sqlite.

ru...@yahoo.com

Jun 3, 2011, 3:29:52 PM
to
On 06/03/2011 08:25 AM, Steven D'Aprano wrote:
> On Fri, 03 Jun 2011 05:51:18 -0700, ru...@yahoo.com wrote:
>
>> On 06/02/2011 07:21 AM, Neil Cerutti wrote:
>
>>> > Python's str methods, when they're sufficient, are usually more
>>> > efficient.
>>
>> Unfortunately, except for the very simplest cases, they are often not
>> sufficient.
>
> Maybe so, but the very simplest cases occur very frequently.

Right, and I stated that.

>> I often find myself changing, for example, a startswith() to
>> an RE when I realize that the input can contain mixed case
>
> Why wouldn't you just normalise the case?

Because some of the text may be case-sensitive.

>[...]


>> or that I have
>> to treat commas as well as spaces as delimiters.
>
> source.replace(",", " ").split(" ")

Ugh. Create a whole new string just so you can split it on one
character rather than two? Sorry, but I find

re.split ('[ ,]', source)

states much more clearly exactly what is being done with no
obfuscation. Obviously this is a simple enough case that the
difference is minor but when the pattern gets only a little
more complex, the clarity difference becomes greater.

>[...]


> re.split is about four times slower than the simple solution.

If this processing is a bottleneck, by all means use a more
complex hard-coded replacement for a regex. In most cases
that won't be necessary.

>> After doing this a
>> number of times, one starts to use an RE right from the get go unless
>> one is VERY sure that there will be no requirements creep.
>
> YAGNI.

IAHNI. (I actually have needed it.)

> There's no need to use a regex just because you think that you *might*,
> someday, possibly need a regex. That's just silly. If and when
> requirements change, then use a regex. Until then, write the simplest
> code that will solve the problem you have to solve now, not the problem
> you think you might have to solve later.

I would not recommend you use a regex instead of a string method
solely because you might need a regex later. But when you have
to spend 10 minutes writing a half-dozen lines of python versus
1 minute writing a regex, your evaluation of the possibility of
requirements changing should factor into your decision.

> [...]
>> In short, although your observations are true to some extent, they
>> are not sufficient to justify the anti-RE attitude often seen here.
>
> I don't think that there's really an *anti* RE attitude here. It's more a
> skeptical, cautious attitude to them, as a reaction to the Perl "when all
> you have is a hammer, everything looks like a nail" love affair with
> regexes.

Yes, as I said, the regex attitude here seems in large part to
be a reaction to their frequent use in Perl. It seems anti- to
me in that I often see cautions about their use but seldom see
anyone pointing out that they are often a better solution than
a mass of twisty little string methods and associated plumbing.

> There are a few problems with regexes:
>
> - they are another language to learn, a very cryptic a terse language;

Chinese is cryptic too but there are a few billion people who
don't seem to be bothered by that.

> - hence code using many regexes tends to be obfuscated and brittle;

No. With regexes the code is likely to be less brittle than
a dozen or more lines of mixed string functions, indexes, and
conditionals.

> - they're over-kill for many simple tasks;
> - and underpowered for complex jobs, and even some simple ones;

Right, like all tools (including Python itself) they are suited
best for a specific range of problems. That range is quite wide.

> - debugging regexes is a nightmare;

Very complex ones, perhaps. "Nightmare" seems an overstatement.

> - they're relatively slow;

So is Python. In both cases, if it is a bottleneck then
choosing another tool is appropriate.

> - and thanks in part to Perl's over-reliance on them, there's a tendency
> among many coders (especially those coming from Perl) to abuse and/or
> misuse regexes; people react to that misuse by treating any use of
> regexes with suspicion.

So you claim. I have seen more postings in here where
REs were not used when they would have simplified the code,
than I have seen regexes used when a string method or two
would have done the same thing.

> But they have their role to play as a tool in the programmers toolbox.

We agree.

> Regarding their syntax, I'd like to point out that even Larry Wall is
> dissatisfied with regex culture in the Perl community:
>
> http://www.perl.com/pub/2002/06/04/apo5.html

You did see the very first sentence in this, right?

"Editor's Note: this Apocalypse is out of date and remains here
for historic reasons. See Synopsis 05 for the latest information."

(Note that "Apocalypse" is referring to a series of Perl design
documents and has nothing to do with regexes in particular.)

Synopsis 05 is (AFAICT with a quick scan) a proposal for revising
regex syntax. I didn't see anything about de-emphasizing them in
Perl. (But I have no idea what is going on for Perl 6 so I could
be wrong about that.)

As for the original reference, Wall points out a number of
problems with regexes, mostly details of their syntax. For
example that more frequently used non-capturing groups require
more characters than less-frequently used capturing groups.
Most of these criticisms seem irrelevant to the question of
whether hard-wired string manipulation code or regexes should
be preferred in a Python program.

And for the few criticisms that are relevant, nobody ever said
regexes were perfect. The problems are well known, especially on
this list where we've all been told about them a million times.

The fact that REs are not perfect does not make them not useful.
We also know about Python's problems (slow, the GIL, excessively
terse and poorly organized documentation, etc) but that hardly
makes Python not useful.

Finally he is talking about *revising* regex syntax (in part by
replacing some magic character sequences with other "better" ones)
beyond the core CS textbook forms. He was *not* AFAICT advocating
using hard-wired string manipulation code in place of regexes.
So it is hardly a condemnation of the concept of regexes, rather
just the opposite.

Perhaps you stopped reading after seeing his "regular expression
culture is a mess" comment without trying to see what he meant
by "culture" or "mess"?

Neil Cerutti

Jun 3, 2011, 4:49:08 PM
to
On 2011-06-03, ru...@yahoo.com <ru...@yahoo.com> wrote:
>>> or that I have to treat commas as well as spaces as
>>> delimiters.
>>
>> source.replace(",", " ").split(" ")
>
> Ugh. Create a whole new string just so you can split it on one
> character rather than two? Sorry, but I find
>
> re.split ('[ ,]', source)

It's quibbling to complain about creating one more string in an
operation that already creates N strings.

Here's another alternative:

import itertools
list(itertools.chain.from_iterable(elem.split(" ")
                                   for elem in source.split(",")))

It's weird looking, but delimiting text with two different
delimiters is weird.

> states much more clearly exactly what is being done with no
> obfuscation. Obviously this is a simple enough case that the
> difference is minor but when the pattern gets only a little
> more complex, the clarity difference becomes greater.

re.split is a nice improvement over str.split. I use it often.
It's a happy place in the re module. Using a single capture group
it can perhaps also be used for applications of str.partition.
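
For instance (a toy comparison; the capture group keeps the separator,
much as str.partition does):

import re

print('key: value: extra'.partition(':'))
# ('key', ':', ' value: extra')
print(re.split('(:)', 'key: value: extra', maxsplit=1))
# ['key', ':', ' value: extra']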

> I would not recommend you use a regex instead of a string
> method solely because you might need a regex later. But when
> you have to spend 10 minutes writing a half-dozen lines of
> python versus 1 minute writing a regex, your evaluation of the
> possibility of requirements changing should factor into your
> decision.

Most of the simplest and clearest applications of the re module
are simply not necessary at all. If I'm inspecting a string with
what amounts to more than a couple of lines of basic Python, then I
break out the re module.

Of course often times that means I've got a context sensitive
parsing job on my hands, and I have to put it away again. ;)

> Yes, as I said, the regex attitude here seems in large part to
> be a reaction to their frequent use in Perl. It seems anti- to
> me in that I often see cautions about their use but seldom see
> anyone pointing out that they are often a better solution than
> a mass of twisty little string methods and associated plumbing.

That doesn't seem to apply to the problem that prompted your
complaint, at least.

>> There are a few problems with regexes:
>>
>> - they are another language to learn, a very cryptic and terse
>> language;
>
> Chinese is cryptic too but there are a few billion people who
> don't seem to be bothered by that.

Chinese *would* be a problem if you proposed it as the solution
to a problem that could be solved by using a persons native
tongue instead.

>> - hence code using many regexes tends to be obfuscated and
>> brittle;
>
> No. With regexes the code is likely to be less brittle than a
> dozen or more lines of mixed string functions, indexes, and
> conditionals.

That is the opposite of my experience, but YMMV.

>> - they're over-kill for many simple tasks;
>> - and underpowered for complex jobs, and even some simple ones;
>
> Right, like all tools (including Python itself) they are suited
> best for a specific range of problems. That range is quite
> wide.
>
>> - debugging regexes is a nightmare;
>
> Very complex ones, perhaps. "Nightmare" seems an
> overstatement.

I will abandon a re based solution long before the nightmare.

>> - they're relatively slow;
>
> So is Python. In both cases, if it is a bottleneck then
> choosing another tool is appropriate.

It's not a problem at all until it is.

>> - and thanks in part to Perl's over-reliance on them, there's
>> a tendency among many coders (especially those coming from
>> Perl) to abuse and/or misuse regexes; people react to that
>> misuse by treating any use of regexes with suspicion.
>
> So you claim. I have seen more postings in here where
> REs were not used when they would have simplified the code,
> than I have seen regexes used when a string method or two
> would have done the same thing.

Can you find an example or invent one? I simply don't remember
such problems coming up, but I admit it's possible.

--
Neil Cerutti

Chris Torek

Jun 3, 2011, 5:45:07 PM
to
>On 2011-06-03, ru...@yahoo.com <ru...@yahoo.com> wrote:
[prefers]

>> re.split ('[ ,]', source)

This is probably not what you want in dealing with
human-created text:

>>> re.split('[ ,]', 'foo bar, spam,maps')
['foo', '', 'bar', '', 'spam', 'maps']

Instead, you probably want "a comma followed by zero or
more spaces; or, one or more spaces":

>>> re.split(r',\s*|\s+', 'foo bar, spam,maps')
['foo', 'bar', 'spam', 'maps']

or perhaps (depending on how you want to treat multiple
adjacent commas) even this:

>>> re.split(r',+\s*|\s+', 'foo bar, spam,maps,, eggs')
['foo', 'bar', 'spam', 'maps', 'eggs']

although eventually you might want to just give in and use the
csv module. :-) (Especially if you want to be able to quote
commas, for instance.)
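
For instance, a small sketch of the csv route (skipinitialspace
swallows the blank after each comma, and quoted fields may contain
commas):

import csv

row = next(csv.reader(['foo bar, "spam, with comma",maps'],
                      skipinitialspace=True))
print(row)  # ['foo bar', 'spam, with comma', 'maps']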

>> ... With regexes the code is likely to be less brittle than a


>> dozen or more lines of mixed string functions, indexes, and
>> conditionals.

In article <94svm4...@mid.individual.net>
Neil Cerutti <ne...@norwich.edu> wrote:
[lots of snippage]


>That is the opposite of my experience, but YMMV.

I suspect it depends on how familiar the user is with regular
expressions, their abilities, and their limitations.

People relatively new to REs always seem to want to use them
to count (to balance parentheses, for instance). People who
have gone through the compiler course know better. :-)

Ethan Furman

Jun 3, 2011, 6:11:24 PM
to pytho...@python.org
Chris Torek wrote:
>> On 2011-06-03, ru...@yahoo.com <ru...@yahoo.com> wrote:
> [prefers]
>>> re.split ('[ ,]', source)
>
> This is probably not what you want in dealing with
> human-created text:
>
> >>> re.split('[ ,]', 'foo bar, spam,maps')
> ['foo', '', 'bar', '', 'spam', 'maps']

I think you've got a typo in there... this is what I get:

--> re.split('[ ,]', 'foo bar, spam,maps')
['foo', 'bar', '', 'spam', 'maps']

I would add a * to get rid of that empty element, myself:
--> re.split('[ ,]*', 'foo bar, spam,maps')
['foo', 'bar', 'spam', 'maps']

~Ethan~

MRAB

Jun 3, 2011, 6:38:50 PM
to pytho...@python.org
On 03/06/2011 23:11, Ethan Furman wrote:

> Chris Torek wrote:
>>> On 2011-06-03, ru...@yahoo.com <ru...@yahoo.com> wrote:
>> [prefers]
>>>> re.split ('[ ,]', source)
>>
>> This is probably not what you want in dealing with
>> human-created text:
>>
>> >>> re.split('[ ,]', 'foo bar, spam,maps')
>> ['foo', '', 'bar', '', 'spam', 'maps']
>
> I think you've got a typo in there... this is what I get:
>
> --> re.split('[ ,]', 'foo bar, spam,maps')
> ['foo', 'bar', '', 'spam', 'maps']
>
> I would add a * to get rid of that empty element, myself:
> --> re.split('[ ,]*', 'foo bar, spam,maps')
> ['foo', 'bar', 'spam', 'maps']
>
It's better to use + instead of * because you don't want it to be a
zero-width separator. The fact that it works should be treated as an
idiosyncrasy of the current re module, which can't split on a
zero-width match.
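
For example:

import re

print(re.split('[ ,]+', 'foo bar, spam,maps'))
# ['foo', 'bar', 'spam', 'maps'] -- with '+' the separator can never be empty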

Gregory Ewing

Jun 3, 2011, 9:41:33 PM
to
Chris Torek wrote:
> Python might be penalized by its use of Unicode here, since a
> Boyer-Moore table for a full 16-bit Unicode string would need
> 65536 entries

But is there any need for the Boyer-Moore algorithm to
operate on characters?

Seems to me you could just as well chop the UTF-16 up
into bytes and apply Boyer-Moore to them, and it would
work about as well.

--
Greg

Steven D'Aprano

Jun 3, 2011, 10:05:12 PM
to
On Fri, 03 Jun 2011 12:29:52 -0700, ru...@yahoo.com wrote:

>>> I often find myself changing, for example, a startswith() to an RE when
>>> I realize that the input can contain mixed case
>>
>> Why wouldn't you just normalise the case?
>
> Because some of the text may be case-sensitive.

Perhaps you misunderstood me. You don't have to throw away the
unnormalised text, merely use the normalized text in the expression you
need.

Of course, if you include both case-sensitive and insensitive tests in
the same calculation, that's a good candidate for a regex... or at least
it would be if regexes supported that :)


>>[...]
>>> or that I have
>>> to treat commas as well as spaces as delimiters.
>>
>> source.replace(",", " ").split(" ")
>
> Ugh. Create a whole new string just so you can split it on one
> character rather than two?

You say that like it's expensive.

And how do you know what the regex engine is doing under the hood? For all you
know, it could be making hundreds of temporary copies and throwing them
away. Or something. It's a black box.

The fact that creating a whole new string to split on is faster than
*running* the regex (never mind compiling it, loading the regex engine,
and anything else that needs to be done) should tell you which does more
work. Copying is cheap. Parsing is expensive.


> Sorry, but I find
>
> re.split ('[ ,]', source)
>
> states much more clearly exactly what is being done with no obfuscation.

That's because you know regex syntax. And I'd hardly call the version
with replace obfuscated.

Certainly the regex is shorter, and I suppose it's reasonable to expect
any reader to know at least enough regex to read that, so I'll grant you
that this is a small win for clarity. A micro-optimization for
readability, at the expense of performance.


> Obviously this is a simple enough case that the difference is minor but
> when the pattern gets only a little more complex, the clarity difference
> becomes greater.

Perhaps. But complicated tasks require complicated regexes, which are
anything but clear.

[...]


>>> After doing this a
>>> number of times, one starts to use an RE right from the get go unless
>>> one is VERY sure that there will be no requirements creep.
>>
>> YAGNI.
>
> IAHNI. (I actually have needed it.)

I'm sure you have, and when you need it, it's entirely appropriate to use
a regex solution. But you stated that you used regexes as insurance *just
in case* the requirements changed. Why, is your text editor broken? You
can't change a call to str.startswith(prefix) to re.match(prefix, str) if
and when you need to? That's what I mean by YAGNI -- don't solve the
problem you think you might have tomorrow.


>> There's no need to use a regex just because you think that you *might*,
>> someday, possibly need a regex. That's just silly. If and when
>> requirements change, then use a regex. Until then, write the simplest
>> code that will solve the problem you have to solve now, not the problem
>> you think you might have to solve later.
>
> I would not recommend you use a regex instead of a string method solely
> because you might need a regex later. But when you have to spend 10
> minutes writing a half-dozen lines of python versus 1 minute writing a
> regex, your evaluation of the possibility of requirements changing
> should factor into your decision.

Ah, but if your requirements are complicated enough that it takes you ten
minutes and six lines of string method calls, that sounds to me like a
situation that probably calls for a regex!

Of course it depends on what the code actually does... if it counts the
number of nested ( ) pairs, and you're trying to do that with a regex,
you're sacked! *wink*

[...]


>> There are a few problems with regexes:
>>
>> - they are another language to learn, a very cryptic and terse language;
>
> Chinese is cryptic too but there are a few billion people who don't seem
> to be bothered by that.

Chinese isn't cryptic to the Chinese, because they've learned it from
childhood.

But has anyone done any studies comparing reading comprehension speed
between native Chinese readers and native European readers? For all I
know, Europeans learn to read twice as quickly as Chinese, and once
learned, read text twice as fast. Or possibly the other way around. Who
knows? Not me.

But I do know that English typists typing 26 letters of the alphabet
leave Asian typists and their thousands of ideograms in the dust. There's
no comparison -- it's like quicksort vs bubblesort *wink*.


[...]


>> - debugging regexes is a nightmare;
>
> Very complex ones, perhaps. "Nightmare" seems an overstatement.

You *can't* debug regexes in Python, since there are no tools for (e.g.)
single-stepping through the regex, displaying intermediate calculations,
or anything other than making changes to the regex and running it again,
hoping that it will do the right thing this time.

I suppose you can use external tools, like Regex Buddy, if you're on a
supported platform and if they support your language's regex engine.


[...]


>> Regarding their syntax, I'd like to point out that even Larry Wall is
>> dissatisfied with regex culture in the Perl community:
>>
>> http://www.perl.com/pub/2002/06/04/apo5.html
>
> You did see the very first sentence in this, right?
>
> "Editor's Note: this Apocalypse is out of date and remains here for
> historic reasons. See Synopsis 05 for the latest information."

Yes. And did you click through to see the Synopsis? It is a bare
technical document with all the motivation removed. Since I was pointing
to Larry Wall's motivation, it was appropriate to link to the Apocalypse
document, not the Synopsis.


> (Note that "Apocalypse" is referring to a series of Perl design
> documents and has nothing to do with regexes in particular.)

But Apocalypse 5 specifically has everything to do with regexes. That's
why I linked to that, and not (say) Apocalypse 2.


> Synopsis 05 is (AFAICT with a quick scan) a proposal for revising regex
> syntax. I didn't see anything about de-emphasizing them in Perl. (But
> I have no idea what is going on for Perl 6 so I could be wrong about
> that.)

I never said anything about de-emphasizing them. I said that Larry Wall
was dissatisfied with Perl's culture of regexes -- his own words were:

"regular expression culture is a mess"

and he is also extremely critical of current (i.e. Perl 5) regex syntax.
Since Python's regex syntax borrows heavily from Perl 5, that's extremely
pertinent to the issue. When even the champion of regex culture says
there is much broken about regex culture, we should all listen.

> As for the original reference, Wall points out a number of problems with
> regexes, mostly details of their syntax. For example that more
> frequently used non-capturing groups require more characters than
> less-frequently used capturing groups. Most of these criticisms seem
> irrelevant to the question of whether hard-wired string manipulation
> code or regexes should be preferred in a Python program.

It is only relevant in so far as the readability and relative obfuscation
of regex syntax is relevant. No further.

You keep throwing out the term "hard-wired string manipulation", but I
don't understand what point you're making. I don't understand what you
see as "hard-wired", or why you think

source.startswith(prefix)

is more hard-wired than

re.match(prefix, source)


[...]


> Perhaps you stopped reading after seeing his "regular expression culture
> is a mess" comment without trying to see what he meant by "culture" or
> "mess"?

Perhaps you are being over-sensitive and reading *far* too much into what
I said. If regexes were more readable, as proposed by Wall, that would go
a long way to reducing my suspicion of them.

--
Steven

MRAB

Jun 3, 2011, 10:24:50 PM
to pytho...@python.org
On 04/06/2011 03:05, Steven D'Aprano wrote:
> On Fri, 03 Jun 2011 12:29:52 -0700, ru...@yahoo.com wrote:
>
>>>> I often find myself changing, for example, a startswith() to an RE when
>>>> I realize that the input can contain mixed case
>>>
>>> Why wouldn't you just normalise the case?
>>
>> Because some of the text may be case-sensitive.
>
> Perhaps you misunderstood me. You don't have to throw away the
> unnormalised text, merely use the normalized text in the expression you
> need.
>
> Of course, if you include both case-sensitive and insensitive tests in
> the same calculation, that's a good candidate for a regex... or at least
> it would be if regexes supported that :)
>
[snip]
Some regex implementations support scoped case sensitivity. :-)
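
A sketch of what that looks like; scoped inline flags such as (?i:...)
are supported by the regex module and, later, by the standard re module
(Python 3.6+):

import re

pat = re.compile(r"(?i:select) FROM")
print(bool(pat.match("SeLeCt FROM")))   # True: the group ignores case
print(bool(pat.match("SeLeCt from")))   # False: 'FROM' stays case-sensitive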

I have at times thought that it would be useful if .startswith offered
the option of case insensitivity, and if there were also a str.equal
which offered it.

Roy Smith

Jun 3, 2011, 10:30:59 PM
to
In article <4de992d7$0$29996$c3e8da3$5496...@news.astraweb.com>,
Steven D'Aprano <steve+comp....@pearwood.info> wrote:

> Of course, if you include both case-sensitive and insensitive tests in
> the same calculation, that's a good candidate for a regex... or at least
> it would be if regexes supported that :)

Of course they support that.

r'([A-Z]+) ([a-zA-Z]+) ([a-z]+)'

matches a word in upper case followed by a word in either (or mixed)
case, followed by a word in lower case (for some narrow definition of
"word").

Another nice thing about regexes (as compared to string methods) is that
they're both portable and serializable. You can use the same regex in
Perl, Python, Ruby, PHP, etc. You can transmit them over a network
connection to a cooperating process. You can store them in a database
or a config file, or allow users to enter them on the fly.

Steven D'Aprano

Jun 4, 2011, 12:59:44 AM
to
On Sat, 04 Jun 2011 03:24:50 +0100, MRAB wrote:


> [snip]
> Some regex implementations support scoped case sensitivity. :-)

Yes, you should link to your regex library :)

Have you considered the suggested Perl 6 syntax? Much of it looks good to
me.

> I have at times thought that it would be useful if .startswith offered
> the option of case insensitivity and there were also str.equal which
> offered it.

I agree. I wish string methods in general would support a case_sensitive
flag. I think that's a common enough task to count as a exception to the
rule "don't include function boolean arguments that merely swap between
two different actions".

--
Steven

Steven D'Aprano

Jun 4, 2011, 1:14:56 AM
to
On Fri, 03 Jun 2011 22:30:59 -0400, Roy Smith wrote:

> In article <4de992d7$0$29996$c3e8da3$5496...@news.astraweb.com>,
> Steven D'Aprano <steve+comp....@pearwood.info> wrote:
>
>> Of course, if you include both case-sensitive and insensitive tests in
>> the same calculation, that's a good candidate for a regex... or at
>> least it would be if regexes supported that :)
>
> Of course they support that.
>
> r'([A-Z]+) ([a-zA-Z]+) ([a-z]+)'
>
> matches a word in upper case followed by a word in either (or mixed)
> case, followed by a word in lower case (for some narrow definition of
> "word").

This fails to support non-ASCII letters, and you know quite well that
having to spell out by hand regexes in both upper and lower (or mixed)
case is not support for case-insensitive matching. That's why Python's re
has a case insensitive flag.


> Another nice thing about regexes (as compared to string methods) is that
> they're both portable and serializable. You can use the same regex in
> Perl, Python, Ruby, PHP, etc.

Say what?

Regexes are anything but portable. Sure, if you limit yourself to some
subset of regex syntax, you might find that many different languages and
engines support your regex, but general regexes are not guaranteed to run
in multiple engines.

The POSIX standard defines two different regexes; Tcl supports three;
Grep supports the two POSIX syntaxes, plus Perl syntax; Python has two
(regex and re modules); Perl 5 and Perl 6 have completely different
syntax. Subtle differences, such as when hyphens in character classes
count as a literal, abound. See, for example:

http://www.regular-expressions.info/refflavors.html


> You can transmit them over a network
> connection to a cooperating process. You can store them in a database
> or a config file, or allow users to enter them on the fly.

Sure, but if those sorts of things are important to you, there's no
reason why you can't create your own string-processing language. Apart
from the time and effort required :)


--
Steven

Chris Angelico

Jun 4, 2011, 2:04:18 AM
to pytho...@python.org
On Sat, Jun 4, 2011 at 12:30 PM, Roy Smith <r...@panix.com> wrote:
> Another nice thing about regexes (as compared to string methods) is that
> they're both portable and serializable.  You can use the same regex in
> Perl, Python, Ruby, PHP, etc.  You can transmit them over a network
> connection to a cooperating process.  You can store them in a database
> or a config file, or allow users to enter them on the fly.
>

I wouldn't ever be transmitting them around the place, unless also
allowing users to enter them. But I have done exactly that - a
validator system that lets you put a header with a regex, and then
string content below that. That IS one advantage of the regex.

However, that's a very VERY specific situation. If I'm not asking a
third party to provide the match condition, then that's not a reason
to go regex.

Chris Angelico

Roy Smith

Jun 4, 2011, 9:39:24 AM
to

I wrote:
>> Another nice thing about regexes (as compared to string methods) is
>> that they're both portable and serializable. You can use the same
>> regex in Perl, Python, Ruby, PHP, etc.

In article <4de9bf50$0$29996$c3e8da3$5496...@news.astraweb.com>,
Steven D'Aprano <steve+comp....@pearwood.info> wrote:
> Regexes are anything but portable. Sure, if you limit yourself to some
> subset of regex syntax, you might find that many different languages and
> engines support your regex, but general regexes are not guaranteed to run
> in multiple engines.

To be sure, if you explore the edges of the regex syntax space, you can
write non-portable expressions. You don't even have to get very far out
to the edge. But, as you say, if you limit yourself to a subset, you
can write portable ones. I have a high level of confidence that I can
execute:

^foo/bar

on any regex engine in the world and have it match the same thing that

my_string.startswith('foo/bar')

does. The fact that not all regexes are portable doesn't negate the
fact that many are portable and that this is useful in real life.

> > You can transmit them over a network
> > connection to a cooperating process. You can store them in a database
> > or a config file, or allow users to enter them on the fly.
>
> Sure, but if those sorts of things are important to you, there's no
> reason why you can't create your own string-processing language. Apart
> from the time and effort required :)

The time and effort required to write (and debug, and document) the
language is just part of it. The bigger part is that you've now got to
teach this new language to all your users (i.e. another barrier to
adoption of your system).

For example, I'm working with MongoDB on my current project. It
supports regex matching. Pretty much everything I need to know is
documented by the Mongo folks saying, "MongoDB uses PCRE for regular
expressions" (with a link to the PCRE man page). This lets me leverage
my existing knowledge of regexes to perform sophisticated queries
immediately. Had they invented their own string processing language, I
would have to invest time to learn that.

As another example, a project I used to work on was very much into NIH
(Not Invented Here). They wrote their own pattern matching language,
loosely based on snobol. Customers signed up for three-day classes to
come learn this language so they could use the product. Ugh.

rusi

Jun 4, 2011, 12:36:49 PM
to
The efficiency argument is specious. [This is a python list not a C
or assembly list.]

The real issue is that complex regexes are hard to get right -- even
if one is experienced.
This is analogous to the fact that knotty programs can be hard to get
right even for experienced programmers.

The analogy stems from the fact that both programs in general and
regexes in particular are code.
Regexes in particular are a notation for an interesting class of
languages -- the so-called regular languages. And like all involved
coding, they can be helped by a debugger.

And just as having a C debugger is a clincher for effective C
programming, whereas it matters less for Python, the effective use of
regexes needs good debugger(s).

I sometimes use regex-tool but there are better I guess (see
http://bc.tech.coop/blog/071103.html )
Most recently there was mention of a python specific tool: kodos
http://kodos.sourceforge.net/about.html
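
Python's re module also has a small built-in aid: compiling with the
re.DEBUG flag dumps the parsed pattern, which at least shows what the
regex engine thinks you wrote:

import re

# Prints the parsed pattern tree to stdout at compile time.
re.compile(r"(TABLE='\S+)\s+'$", re.DEBUG)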

In short I would reword rurpy's complaint to: Regexes should be
recommended along with (the idea of) regex tools.

Nobody

Jun 4, 2011, 3:44:56 PM
to

> Seems to me you could just as well chop the UTF-16 up
> into bytes and apply Boyer-Moore to them, and it would
> work about as well.

No, because that won't care about alignment. E.g. on a big-endian
architecture, if you search for '\u2345' in the string '\u0123\u4567', it
will find a match (at an offset of 1 byte).

Nobody

Jun 4, 2011, 4:02:32 PM
to
On Sat, 04 Jun 2011 05:14:56 +0000, Steven D'Aprano wrote:

> This fails to support non-ASCII letters, and you know quite well that
> having to spell out by hand regexes in both upper and lower (or mixed)
> case is not support for case-insensitive matching. That's why Python's re
> has a case insensitive flag.

I find it slightly ironic that you pointed out the ASCII limitation while
overlooking the arbitrariness of upper/lower-case equivalence. Case isn't
the only type of equivalence; it's just the only one which affects ASCII.
Should we also have flags to treat half-width and full-width characters as
equivalent? What about traditional and simplified Chinese, hiragana and
katakana, or the various stylistic variants of the Latin and Greek
alphabets in the mathematical symbols block (U+1D400..U+1D7FF)?

Steven D'Aprano

unread,
Jun 4, 2011, 8:44:16 PM6/4/11
to
On Sat, 04 Jun 2011 09:39:24 -0400, Roy Smith wrote:

> To be sure, if you explore the edges of the regex syntax space, you can
> write non-portable expressions. You don't even have to get very far out
> to the edge. But, as you say, if you limit yourself to a subset, you
> can write portable ones. I have a high level of confidence that I can
> execute:
>
> ^foo/bar
>
> on any regex engine in the world and have it match the same thing that
>
> my_string.startswith('foo/bar')
>
> does.

Not the best choice you could have made, for two reasons:

(1) ^ can match at the start of each line, not just the start of the
string. Although this doesn't occur by default in Python, do you know
whether all other engines default the same way?

(2) There is at least one major regex engine that doesn't support ^ for
start of string matching at all, namely the W3C XML Schema pattern
matcher.

As you say... not very far out to the edges at all.


[...]


> As another example, a project I used to work on was very much into NIH
> (Not Invented Here). They wrote their own pattern matching language,
> loosely based on snobol. Customers signed up for three-day classes to
> come learn this language so they could use the product. Ugh.

And you think that having customers sign up for a two-week class to learn
regexes would be an improvement? *wink*

I don't know a lot about SNOBOL pattern matching, but I know that they're
more powerful than regexes, and it seems to me that they're also easier
to read and learn. I suspect that the programming world would have been
much better off if SNOBOL pattern matching had won the popularity battle
against regexes.

--
Steven

Steven D'Aprano

unread,
Jun 4, 2011, 9:01:18 PM6/4/11
to
On Sat, 04 Jun 2011 21:02:32 +0100, Nobody wrote:

> On Sat, 04 Jun 2011 05:14:56 +0000, Steven D'Aprano wrote:
>
>> This fails to support non-ASCII letters, and you know quite well that
>> having to spell out by hand regexes in both upper and lower (or mixed)
>> case is not support for case-insensitive matching. That's why Python's
>> re has a case insensitive flag.
>
> I find it slightly ironic that you pointed out the ASCII limitation
> while overlooking the arbitrariness of upper/lower-case equivalence.

Case is hardly arbitrary. It's *extremely* common, at least in Western
languages, which you may have noticed we're writing in :-P


> Case isn't the only type of equivalence; it's just the only one which
> affects ASCII. Should we also have flags to treat half-width and
> full-width characters as equivalent? What about traditional and
> simplified Chinese, hiragana and katakana, or the various stylistic
> variants of the Latin and Greek alphabets in the mathematical symbols
> block (U+1D400..U+1D7FF)?

Perhaps we should. But since Python regexes don't support such flags
either, I fail to see your point.

--
Steven

rusi

unread,
Jun 5, 2011, 7:17:20 AM6/5/11
to
On Jun 3, 7:25 pm, Steven D'Aprano <steve
+comp.lang.pyt...@pearwood.info> wrote:

> Regarding their syntax, I'd like to point out that even Larry Wall is
> dissatisfied with regex culture in the Perl community:
>
> http://www.perl.com/pub/2002/06/04/apo5.html

This is a very good link.
And it can be a starting point for python to leapfrog over perl.
After all for perl changing the regex syntax/semantics means deep
surgery to the language. For python its just another module.

In particular, there is something that is possible and easy today that
was not conceivable 20 years ago -- using unicode.
Much of the regex problem stems from what LW calls 'poor Huffman
coding', and much of that is due to the fact that regexes need
different kinds of grouping, but the hegemony of ASCII has forced a
multicharacter rendering for most of those.

A snip from the article:

----------------------------------
Consider these constructs:

(??{...})
(?{...})
(?#...)
(?:...)
(?i:...)
(?=...)
(?!...)
(?<=...)
(?<!...)
(?>...)
(?(...)...|...)

These all look quite similar, but some of them do radically different
things. In particular, the (?<...) does not mean the opposite of the (?
>...). The underlying visual problem is the overuse of parentheses, as
in Lisp. Programs are more readable if different things look
different.

----------------------------------
Some parenthesis usage shown here
http://xahlee.blogspot.com/2011/05/use-of-unicode-matching-brackets-as.html

ru...@yahoo.com

unread,
Jun 6, 2011, 1:44:01 AM6/6/11
to
On 06/03/2011 02:49 PM, Neil Cerutti wrote:
> > On 2011-06-03, ru...@yahoo.com <ru...@yahoo.com> wrote:
>>>> >>>> or that I have to treat commas as well as spaces as
>>>> >>>> delimiters.
>>> >>>
>>> >>> source.replace(",", " ").split(" ")
>> >>
>> >> Uhgg. create a whole new string just so you can split it on one
>> >> rather than two characters? Sorry, but I find
>> >>
>> >> re.split ('[ ,]', source)
> >
> > It's quibbling to complain about creating one more string in an
> > operation that already creates N strings.

It's not the time it takes to create the string, it's the doing
of things that aren't really needed to accomplish the task:
the re.split says directly and with no extraneous actions,
"split 'source' on either spaces or commas". This of course
is a trivial example, but used thoughtfully, REs allow you to
be very precise about what you are doing, versus using "tricks"
like substituting individual characters first so you can split
on a single character afterwards.

> > Here's another alternative:
> >
> > list(itertools.chain.from_iterable(elem.split(" ")
> > for elem in source.split(",")))

You seriously find that clearer than re.split('[ ,]') above?
I have no further comment. :-)

> > It's weird looking, but delimiting text with two different
> > delimiters is weird.

Perhaps, but real-world input data is often very weird.
Try parsing a text "database" of a circa 1980 telephone
company phone directory sometime. :-)

> >[...]


>>> >>> - they are another language to learn, a very cryptic a terse
>>> >>> language;
>> >>
>> >> Chinese is cryptic too but there are a few billion people who
>> >> don't seem to be bothered by that.
> >
> > Chinese *would* be a problem if you proposed it as the solution
> > to a problem that could be solved by using a persons native
> > tongue instead.

My point was that "cryptic" is in large part an inverse function
of knowledge. If I always go out of my way to avoid regexes, then
likely I will never become comfortable with them and they will
always seem cryptic. To someone who uses them more often, they
will seem less cryptic. They may never have the clarity of Python
but neither is Python code a very clear way to describe text
patterns.

As for needing to learn them (S D'A comment), shrug. Programmers
are expected to learn new things all the time, many even do so
for fun. REs (practical use that is) in the grand scheme of things
are not that hard.

They are I think a lot easier to learn than SQL, yet it is common
here to see recommendations to use sqlite rather than an ad-hoc
concoction of Python dicts.

> >[...]


>>> >>> - and thanks in part to Perl's over-reliance on them, there's
>>> >>> a tendency among many coders (especially those coming from
>>> >>> Perl) to abuse and/or misuse regexes; people react to that
>>> >>> misuse by treating any use of regexes with suspicion.
>> >>
>> >> So you claim. I have seen more postings in here where
>> >> REs were not used when they would have simplified the code,
>> >> then I have seen regexes used when a string method or two
>> >> would have done the same thing.
> >
> > Can you find an example or invent one? I simply don't remember
> > such problems coming up, but I admit it's possible.

Sure, the response to the OP of this thread.

ru...@yahoo.com

unread,
Jun 6, 2011, 1:47:13 AM6/6/11
to
On 06/03/2011 03:45 PM, Chris Torek wrote:
>>On 2011-06-03, ru...@yahoo.com <ru...@yahoo.com> wrote:
> [prefers]
>>> re.split ('[ ,]', source)
>
> This is probably not what you want in dealing with
> human-created text:
>
> >>> re.split('[ ,]', 'foo bar, spam,maps')
> ['foo', '', 'bar', '', 'spam', 'maps']
>
> Instead, you probably want "a comma followed by zero or
> more spaces; or, one or more spaces":
>
> >>> re.split(r',\s*|\s+', 'foo bar, spam,maps')
> ['foo', 'bar', 'spam', 'maps']
>
> or perhaps (depending on how you want to treat multiple
> adjacent commas) even this:
>
> >>> re.split(r',+\s*|\s+', 'foo bar, spam,maps,, eggs')
> ['foo', 'bar', 'spam', 'maps', 'eggs']

Which to me, illustrates nicely the power of a regex to concisely
localize the specification of an input format and adapt easily
to changes in that specification.

> although eventually you might want to just give in and use the
> csv module. :-) (Especially if you want to be able to quote
> commas, for instance.)

Which internally uses regexes, at least for the sniffer function.
(The main parser is in C presumably for speed, this being a
library module and all.)

>>> ... With regexes the code is likely to be less brittle than a
>>> dozen or more lines of mixed string functions, indexes, and
>>> conditionals.
>
> In article <94svm4...@mid.individual.net>
> Neil Cerutti <ne...@norwich.edu> wrote:
> [lots of snippage]
>>That is the opposite of my experience, but YMMV.
>
> I suspect it depends on how familiar the user is with regular
> expressions, their abilities, and their limitations.

I suspect so too at least in part.

> People relatively new to REs always seem to want to use them
> to count (to balance parentheses, for instance). People who
> have gone through the compiler course know better. :-)

But also, a thing I think sometimes gets forgotten: if the
max nesting depth is finite, parens can be balanced with a
regex. This is nice for the particularly common case of a
nest depth of 1 (balanced but non-nested parens).
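
A sketch of the depth-1 case (the same pattern shows up in a timing
example later in the thread):

    import re

    # Matches strings whose parens are balanced and non-nested: runs of
    # non-paren text, optionally interleaved with '(...)' groups that
    # themselves contain no parens.
    depth1 = re.compile(r"^[^()]*(\([^()]*\)[^()]*)*$")

    print(bool(depth1.match("a (b) c (d)")))   # True
    print(bool(depth1.match("a (b (c)) d")))   # False: nested
    print(bool(depth1.match("a (b c")))        # False: unbalanced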

ru...@yahoo.com

unread,
Jun 6, 2011, 2:03:39 AM6/6/11
to
On 06/03/2011 08:05 PM, Steven D'Aprano wrote:
> On Fri, 03 Jun 2011 12:29:52 -0700, ru...@yahoo.com wrote:
>
>>>> I often find myself changing, for example, a startwith() to a RE when
>>>> I realize that the input can contain mixed case
>>>
>>> Why wouldn't you just normalise the case?
>>
>> Because some of the text may be case-sensitive.
>
> Perhaps you misunderstood me. You don't have to throw away the
> unnormalised text, merely use the normalized text in the expression you
> need.
>
> Of course, if you include both case-sensitive and insensitive tests in
> the same calculation, that's a good candidate for a regex... or at least
> it would be if regexes supported that :)

I did not choose a good example to illustrate what I find often
motivates my use of regexes.

You are right that for a simple .startwith() using a regex "just
in case" is not a good choice, and in fact I would not do that.

The process that I find often occurs is that I write (or am about
to write string method solution and when I think more about the
input data (which is seldom well-specified), I realize that using
a regex I can get better error checking, do more of the "parsing"
in one place, and adapt to changes in input format better than I
could with a .startswith and a couple other such methods.

Thus what starts as

    if line.startswith('CUSTOMER '):
        try:
            kw, first_initial, last_name, code, rest = line.split(None, 4)
        ...

often turns into (sometimes before it is written) something like

    m = re.match(r'CUSTOMER (\w+) (\w+) ([A-Z]\d{3})', line)
    if m:
        first_initial, last_name, code = m.group(...)

>>>[...]
>>>> or that I have
>>>> to treat commas as well as spaces as delimiters.
>>>
>>> source.replace(",", " ").split(" ")
>>
>> Uhgg. create a whole new string just so you can split it on one rather
>> than two characters?
>
> You say that like it's expensive.

No, I said it like it was ugly. Doing things unrelated to the
task at hand is ugly. And not very adaptable -- see my reply
to Chris Torek's post. I understand it is a common idiom and
I use it myself, but in this case there is a cleaner alternative
with re.split that expresses exactly what one is doing.

> And how do you what the regex engine is doing under the hood? For all you
> know, it could be making hundreds of temporary copies and throwing them
> away. Or something. It's a black box.

That's a silly argument.
And how do you know what replace is doing under the hood?
I would expect any regex processor to compile the regex into
an FSM. As usual, I would expect to pay a small performance
price for the generality, but that is reasonable tradeoff in
many cases. If it were a potential problem, I would test it.
What I wouldn't do is throw away a useful tool because, "golly,
I don't know, maybe it'll be slow" -- that's just a form of
cargo cult programming.

> The fact that creating a whole new string to split on is faster than
> *running* the regex (never mind compiling it, loading the regex engine,
> and anything else that needs to be done) should tell you which does more
> work. Copying is cheap. Parsing is expensive.

In addition to being wrong (loading is done once, compilation is
typically done once or a few times, while the regex is used many
times inside a loop so the overhead cost is usually trivial compared
with the cost of starting Python or reading a file), this is another
micro-optimization argument.

I'm not sure why you've suddenly developed this obsession with
wringing every last nanosecond out of your code. Usually it
is not necessary. Have you thought of buying a faster computer?
Or using C? *wink*

>> Sorry, but I find
>>
>> re.split ('[ ,]', source)
>>
>> states much more clearly exactly what is being done with no obfuscation.
>
> That's because you know regex syntax. And I'd hardly call the version
> with replace obfuscated.
>
> Certainly the regex is shorter, and I suppose it's reasonable to expect
> any reader to know at least enough regex to read that, so I'll grant you
> that this is a small win for clarity. A micro-optimization for
> readability, at the expense of performance.
>
>
>> Obviously this is a simple enough case that the difference is minor but
>> when the pattern gets only a little more complex, the clarity difference
>> becomes greater.
>
> Perhaps. But complicated tasks require complicated regexes, which are
> anything but clear.

Complicated tasks require complicated code as well.

As another post pointed out, there are ways to improve the
clarity of a regex, such as the re.VERBOSE flag.
There is no doubt that a regex encapsulates information much more
densely than Python string manipulation code. One should not
be surprised that it might take as much time and effort to understand
a one-line regex as a dozen (or whatever) lines of Python code that
do the same thing. In most cases, I'll bet, given equal fluency
in regexes and Python, the regex will take less.

> [...]
>>>> After doing this a
>>>> number of times, one starts to use an RE right from the get go unless
>>>> one is VERY sure that there will be no requirements creep.
>>>
>>> YAGNI.
>>
>> IAHNI. (I actually have needed it.)
>
> I'm sure you have, and when you need it, it's entirely appropriate to use
> a regex solution. But you stated that you used regexes as insurance *just
> in case* the requirements changed. Why, is your text editor broken? You
> can't change a call to str.startswith(prefix) to re.match(prefix, str) if
> and when you need to? That's what I mean by YAGNI -- don't solve the
> problem you think you might have tomorrow.

Retracted above.

>>> There's no need to use a regex just because you think that you *might*,
>>> someday, possibly need a regex. That's just silly. If and when
>>> requirements change, then use a regex. Until then, write the simplest
>>> code that will solve the problem you have to solve now, not the problem
>>> you think you might have to solve later.
>>
>> I would not recommend you use a regex instead of a string method solely
>> because you might need a regex later. But when you have to spend 10
>> minutes writing a half-dozen lines of python versus 1 minute writing a
>> regex, your evaluation of the possibility of requirements changing
>> should factor into your decision.
>
> Ah, but if your requirements are complicated enough that it takes you ten
> minutes and six lines of string method calls, that sounds to me like a
> situation that probably calls for a regex!

Recall that the post that started this discussion presented
a problem that took me six lines of code (actually spread out
over a few more for readability) to do without regexes versus
one line with.

So you do agree that a regex was a better solution in
that case? I ask because we both seem to agree that
regexes are useful tools and preferable when the corresponding
Python code is "too" complex. We also agree that when the
need can be handled by very simple Python code, Python may be
preferable. So I'm trying to calibrate your switch-over point
a little better.

> Of course it depends on what the code actually does... if it counts the
> number of nested ( ) pairs, and you're trying to do that with a regex,
> you're sacked! *wink*

Right. And again repeating what I said before, regexes
aren't a universal solution to every problem. *wink*

> [...]
>>> There are a few problems with regexes:
>>>
>>> - they are another language to learn, a very cryptic a terse language;
>>
>> Chinese is cryptic too but there are a few billion people who don't seem
>> to be bothered by that.
>
> Chinese isn't cryptic to the Chinese, because they've learned it from
> childhood.
>
> But has anyone done any studies comparing reading comprehension speed
> between native Chinese readers and native European readers? For all I
> know, Europeans learn to read twice as quickly as Chinese, and once
> learned, read text twice as fast. Or possibly the other way around. Who
> knows? Not me.
>
> But I do know that English typists typing 26 letters of the alphabet
> leave Asian typists and their thousands of ideograms in the dust. There's
> no comparison -- it's like quicksort vs bubblesort *wink*.

70 years ago there was all sorts of scientific evidence
that showed white, Western-European culture did lots of
things better than everyone else, especially non-whites,
in the world. Let's not go there. *wink*

> [...]
>>> - debugging regexes is a nightmare;
>>
>> Very complex ones, perhaps. "Nightmare" seems an overstatement.
>
> You *can't* debug regexes in Python, since there are no tools for (e.g.)
> single-stepping through the regex, displaying intermediate calculations,
> or anything other than making changes to the regex and running it again,
> hoping that it will do the right thing this time.

Thinking in addition to hoping will help quite a bit.

There are two factors that mitigate the lack of debuggers.

1) REs are not a Turing complete language so in some sense
are simpler than Python.

2) The vast majority of REs that I have had to fix or write
are not complex enough to require a debugger. Often they simply
look complex due to all the parens and backslashes -- once you
reformat them (permanently with the re.VERBOSE flag, or
temporarily in a text editor), they don't look so bad.
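
For example, here is a sketch reformatting the CUSTOMER pattern from
earlier in the thread (under re.VERBOSE, literal spaces must be escaped):

    import re

    dense = re.compile(r"CUSTOMER (\w+) (\w+) ([A-Z]\d{3})")

    verbose = re.compile(r"""
        CUSTOMER\ (\w+)     # first initial
        \ (\w+)             # last name
        \ ([A-Z]\d{3})      # code: one capital letter, three digits
        """, re.VERBOSE)

    line = "CUSTOMER J Smith A123"
    assert dense.match(line).groups() == verbose.match(line).groups()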

> I suppose you can use external tools, like Regex Buddy, if you're on a
> supported platform and if they support your language's regex engine.
>
> [...]
>>> Regarding their syntax, I'd like to point out that even Larry Wall is
>>> dissatisfied with regex culture in the Perl community:
>>>
>>> http://www.perl.com/pub/2002/06/04/apo5.html
>>
>> You did see the very first sentence in this, right?
>>
>> "Editor's Note: this Apocalypse is out of date and remains here for
>> historic reasons. See Synopsis 05 for the latest information."
>
> Yes. And did you click through to see the Synopsis? It is a bare
> technical document with all the motivation removed. Since I was pointing
> to Larry Wall's motivation, it was appropriate to link to the Apocalypse
> document, not the Synopsis.

OK, fair enough.

>> (Note that "Apocalypse" is referring to a series of Perl design
>> documents and has nothing to do with regexes in particular.)
>
> But Apocalypse 5 specifically has everything to do with regexes. That's
> why I linked to that, and not (say) Apocalypse 2.

Where did I suggest that you should have linked to Apocalypse 2?
I wrote what I wrote to point out that the "Apocalypse" title was
not a pejorative comment on regexes. I don't see how I could have
been clearer.

>> Synopsis 05 is (AFAICT with a quick scan) a proposal for revising regex
>> syntax. I didn't see anything about de-emphasizing them in Perl. (But
>> I have no idea what is going on for Perl 6 so I could be wrong about
>> that.)
>
> I never said anything about de-emphasizing them. I said that Larry Wall
> was dissatisfied with Perl's culture of regexes -- his own words were:
>
> "regular expression culture is a mess"

Right, and I quoted that. But I don't know what he meant
by "culture of regexes". Their syntax? Their extensive use
in Perl? Something else? If you don't care about their
de-emphasis in Perl, then presumably their extensive use
there is not part of what you consider "culture of regexes",
yes? So to you, "culture of regexes" refers only to the
syntax of Perl regexes?

I pointed out that the use of regexs in Perl 6 (AFAICT from
the Synopsis 05 document) are still as widely used as in
Perl 5. However the document also describes changes in *how*
they are used within Perl (e.g, the production of Match objects)
So I conclude the *use* of regexes is part of Larry Wall concept
of "regex culture".

Further, my guess is that the term means something else again
to many Python programmers -- something more akin to the
LW concept but with a much greater negative valuation.

> and he is also extremely critical of current (i.e. Perl 5) regex syntax.
> Since Python's regex syntax borrows heavily from Perl 5, that's extremely
> pertinent to the issue. When even the champion of regex culture says
> there is much broken about regex culture, we should all listen.

I'll just note that "extremely" is a description you have chosen
to apply. He identified problems (some of which have developed
since regexes started being widely used) and changes to improve
them. One could say GvR was "extremely" critical of the
str/unicode situation in Python 2. It would be a bit much to use
that to say that one should avoid the use of text in Python 2
programs.

The Larry Wall who you claim is "extremely critical of current
regex syntax" proposed the following in the new "fixed" regex
syntax (from the Synopsis 05 doc):

Unchanged syntactic features
The following regex features use the same syntax as in Perl 5:
Capturing: (...)
Repetition quantifiers: *, +, and ?
Alternatives: |
Backslash escape: \
Minimal matching suffix: ??, *?, +?

Those, with character classes (including "\"-named ones) and non-
capturing ()'s, constitute about 99+% of my regex uses and the
overwhelming majority of regexes I have had to work with.

Nobody here has claimed that regexes are perfect. No doubt the
Perl 6 changes are an improvement but I doubt that they change
the nature of regexes anywhere near enough to overcome the complaints
against them voiced in this group. Further, those changes will
likely take years or decades to make their way into the Python
standard library if at all. (Perl is no longer the thought-leader
it once was, and the new syntax is competing against innumerable
established uses of the old syntax outside of Perl.) Thus, although
I look forward to the new syntax, I don't see it as any kind of
justification not to use the existing syntax in the meantime.

>> As for the original reference, Wall points out a number of problems with
>> regexes, mostly details of their syntax. For example that more
>> frequently used non-capturing groups require more characters than
>> less-frequently used capturing groups. Most of these criticisms seem
>> irrelevant to the question of whether hard-wired string manipulation
>> code or regexes should be preferred in a Python program.
>
> It is only relevant in so far as the readability and relative obfuscation
> of regex syntax is relevant. No further.

OK, again you are confirming it is only the syntax of regexes
that bothers you?

> You keep throwing out the term "hard-wired string manipulation", but I
> don't understand what point you're making. I don't understand what you
> see as "hard-wired", or why you think
>
> source.startswith(prefix)
>
> is more hard-wired than
>
> re.match(prefix, source)

What I mean is that I see regexes as being an extremely small,
highly restricted, domain-specific language targeted specifically
at describing text patterns. Thus they do that job better than
trying to describe patterns implicitly with Python code.

> [...]
>> Perhaps you stopped reading after seeing his "regular expression culture
>> is a mess" comment without trying to see what he meant by "culture" or
>> "mess"?
>
> Perhaps you are being over-sensitive and reading *far* too much into what
> I said.

Not sensitive at all. I expressed an opinion that I thought
is under-represented here and could help some get over their
regex-phobia. Since it doesn't have a provably right or wrong
answer, I expected it would be contested and have no problem
with that.

As for reading too much into what you said, possibly. I look
forward to your clarifications.

> If regexes were more readable, as proposed by Wall, that would go
> a long way to reducing my suspicion of them.

I am delighted to read that you find the new syntax more
acceptable. I guess that means that although you would
object to the Perl 5 regex

/(?mi)^(?:[a-z]|\d){1,2}(?=\s)/

you find its Perl 6 form

/ :i ^^ [ <[a..z]> || \d ] ** 1..2 <?before \s> /

a big improvement?

And I presume, based on your lack of comment, the size of the
document required to describe the new syntax does not raise
any concerns for you? Or the many additional new "line-noise"
meta-characters ("too few metacharacters" was one of the
problems LW described in the Apocalypse document you referred
us to)? Again, I wonder if you and Larry Wall are really on
the same page with the faults you find in the Perl 5 syntax.

And again with the qualifier that I have not spent much time
reading about the changes, and further my regex-fu is at
a low enough level that I am probably unable to fully
appreciate many of the improvements, the syntax doesn't
really look different enough that I see it overcoming the
objections that I often read here. Consequently I don't
find the argument, avoid using what is currently available,
very convincing.

Chris Torek

unread,
Jun 6, 2011, 3:11:21 AM6/6/11
to
In article <ef48ad50-da06-47a8...@d28g2000yqf.googlegroups.com>
ru...@yahoo.com <ru...@yahoo.com> wrote (in part):
[mass snippage]
>What I mean is that I see regexes as being an extremely small,
>highly restricted, domain-specific language targeted specifically
>at describing text patterns. Thus they do that job better than
>trying to describe patterns implicitly with Python code.

Indeed.

Kernighan has often used / supported the idea of "little languages";
see:

http://www.princeton.edu/~hos/frs122/precis/kernighan.htm

In this case, regular expressions form a "little language" that is
quite well suited to some lexical analysis problems. Since the
language is (modulo various concerns) targeted at the "right level",
as it were, it becomes easy (modulo various concerns :-) ) to
express the desired algorithm precisely yet concisely.

On the whole, this is a good thing.

The trick lies in knowing when it *is* the right level, and how to
use the language of REs.

>On 06/03/2011 08:05 PM, Steven D'Aprano wrote:
>> If regexes were more readable, as proposed by Wall, that would go
>> a long way to reducing my suspicion of them.

"Suspicion" seems like an odd term here.

Still, it is true that something (whether it be use of re.VERBOSE,
and whitespace-and-comments, or some New and Improved Syntax) could
help. Dense and complex REs are quite powerful, but may also contain
and hide programming mistakes. The ability to describe what is
intended -- which may differ from what is written -- is useful.

As an interesting aside, even without the re.VERBOSE flag, one can
build complex, yet reasonably-understandable, REs in Python, by
breaking them into individual parts and giving them appropriate
names. (This is also possible in perl, although the perl syntax
makes it less obvious, I think.)
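
A minimal sketch of that style (the part names are mine):

    import re

    initial   = r"(\w+)"
    last_name = r"(\w+)"
    code      = r"([A-Z]\d{3})"   # one capital letter, three digits

    customer_re = re.compile(r"CUSTOMER %s %s %s"
                             % (initial, last_name, code))

    m = customer_re.match("CUSTOMER J Smith A123")
    if m:
        print(m.groups())   # ('J', 'Smith', 'A123')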

Octavian Rasnita

unread,
Jun 6, 2011, 4:51:46 AM6/6/11
to pytho...@python.org
It is not so hard to decide whether using RE is a good thing or not.

When speed is important and every millisecond counts, an RE should be used
only when there is no other, faster way, because an RE is usually slower
than the core Perl/Python functions that can do matching and
replacing.

When speed is not such a big issue, an RE should be used only if it is
easier to understand and maintain than the core functions. And of
course, an RE should be used when the core functions cannot do what an
RE can do.

In Python, the RE syntax is not as short and simple as in Perl, so using
an RE even for very simple things requires longer code, and other core
functions may then look like the better solution, because the RE version
of the code is almost never as easy to read as the version that uses
other core functions (or, for very simple REs, the two are probably
equally readable).

In Perl, RE syntax is very short and simple, and in some cases code that
uses an RE is easier to understand and maintain than code that uses
other core functions.

For example, if somebody wants to check if the $var variable contains the
letter "x", a solution without RE in Perl is:

if ( index( $var, 'x' ) >= 0 ) {
print "ok";
}

while the solution with RE is:

if ( $var =~ /x/ ) {
print "ok";
}

And it is obvious that the solution that uses the RE is shorter and easier
to read and maintain, while also being much more flexible.
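
For comparison, a sketch of the same check in Python, where the non-RE
form is the shorter one:

    import re

    var = "some text with x in it"

    if 'x' in var:            # without an RE
        print("ok")

    if re.search('x', var):   # with an RE
        print("ok")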

Of course, sometimes an even better alternative is to use a module from CPAN
like Regexp::Common, which can use REs in a simpler and more readable way
for matching numbers, profanity, balanced parens, programming-language
comments, IP and MAC addresses, zip codes... or a module like Email::Valid
for verifying whether an email address is correct, because it may be very
hard to create an RE for matching an email address.

So... just like with Python, there are more ways to do it, but depending on
the situation, some of them are better than others. :-)

--Octavian



Chris Angelico

unread,
Jun 6, 2011, 5:01:05 AM6/6/11
to pytho...@python.org
On Mon, Jun 6, 2011 at 6:51 PM, Octavian Rasnita <oras...@gmail.com> wrote:
> It is not so hard to decide whether using RE is a good thing or not.
>
> When the speed is important and every millisecond counts, RE should be used
> only when there is no other faster way, because usually RE is less faster
> than using other core Perl/Python functions that can do matching and
> replacing.
>
> When the speed is not such a big issue, RE should be used only if it is
> easier to understand and maintain than using the core functions. And of
> course, RE should be used when the core functions cannot do what RE can do.

for X in features:
    "When speed is important and every millisecond counts, X should be
    used only when there is no other faster way."
    "When speed is not such a big issue, X should be used only if it is
    easier to understand and maintain than other ways."

I think that's fairly obvious. :)

Chris Angelico

rusi

unread,
Jun 6, 2011, 10:33:33 AM6/6/11
to
For any significant language feature (take recursion for example)
there are these issues:

1. Ease of reading/skimming (other's) code
2. Ease of writing/designing one's own
3. Learning curve
4. Costs/payoffs (eg efficiency, succinctness) of use
5. Debug-ability

I'll start with 3.
When someone of Kernighan's calibre (thanks for the link BTW) says
that he found recursion difficult it could mean either that Kernighan
is a stupid guy -- unlikely considering his other achievements. Or
that C is not optimal (as compared to lisp say) for learning
recursion.

Evidently for syntactic, implementation and cultural reasons, Perl
programmers are likely to get (and then overuse) regexes faster than
python programmers.

1 is related but not the same as 3. Someone with courses in automata,
compilers etc -- standard CS stuff -- is unlikely to find regexes a
problem. Conversely an intelligent programmer without a CS background
may find them more forbidding.

On Jun 6, 12:11 pm, Chris Torek <nos...@torek.net> wrote:
>
> >On 06/03/2011 08:05 PM, Steven D'Aprano wrote:
> >> If regexes were more readable, as proposed by Wall, that would go
> >> a long way to reducing my suspicion of them.
>
> "Suspicion" seems like an odd term here.

When I was in school my mother warned me that in college I would have
to undergo a most terrifying course called 'calculus'.

Steven's 'suspicions' make me recall my mother's warning :-)

Steven D'Aprano

unread,
Jun 6, 2011, 11:29:23 AM6/6/11
to
On Sun, 05 Jun 2011 23:03:39 -0700, ru...@yahoo.com wrote:

> Thus what starts as
>     if line.startswith('CUSTOMER '):
>         try:
>             kw, first_initial, last_name, code, rest = line.split(None, 4)
>         ...
> often turns into (sometimes before it is written) something like
>     m = re.match(r'CUSTOMER (\w+) (\w+) ([A-Z]\d{3})', line)
>     if m:
>         first_initial, last_name, code = m.group(...)


I would argue that the first, non-regex solution is superior, as it
clearly distinguishes the multiple steps of the solution:

* filter lines that start with "CUSTOMER"
* extract fields in that line
* validate fields (not shown in your code snippet)

while the regex tries to do all of these in a single command. This makes
the regex an "all or nothing" solution: it matches *everything* or
*nothing*. This means that your opportunity for giving meaningful error
messages is much reduced. E.g. I'd like to give an error message like:

found digit in customer name (field 2)

but with your regex, if it fails to match, I have no idea why it failed,
so can't give any more meaningful error than:

invalid customer line

and leave it to the caller to determine what makes it invalid. (Did I
misspell "CUSTOMER"? Put a dot after the initial? Forget the code? Use
two spaces between fields instead of one?)
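
To make that concrete, a sketch of the step-by-step version (the
function name and messages are invented for illustration):

    def parse_customer(line):
        if not line.startswith('CUSTOMER '):
            raise ValueError("not a CUSTOMER line")
        try:
            kw, first_initial, last_name, code, rest = line.split(None, 4)
        except ValueError:
            raise ValueError("wrong number of fields in customer line")
        if any(c.isdigit() for c in last_name):
            raise ValueError("found digit in customer name (field 2)")
        return first_initial, last_name, code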

[...]


> I would expect
> any regex processor to compile the regex into an FSM.

Flying Spaghetti Monster?

I have been Touched by His Noodly Appendage!!!


[...]


>> The fact that creating a whole new string to split on is faster than
>> *running* the regex (never mind compiling it, loading the regex engine,
>> and anything else that needs to be done) should tell you which does
>> more work. Copying is cheap. Parsing is expensive.
>
> In addition to being wrong (loading is done once, compilation is
> typically done once or a few times, while the regex is used many times
> inside a loop so the overhead cost is usually trivial compared with the
> cost of starting Python or reading a file), this is another
> micro-optimization argument.

Yes, but you have to pay the cost of loading the re engine, even if it is
a one off cost, it's still a cost, and sometimes (not always!) it can be
significant. It's quite hard to write fast, tiny Python scripts, because
the initialization costs of the Python environment are so high. (Not as
high as for, say, VB or Java, but much higher than, say, shell scripts.)
In a tiny script, you may be better off avoiding regexes because it takes
longer to load the engine than to run the rest of your script!

But yes, you are right that this is a micro-optimization argument. In a
big application, it's less likely to be important.


> I'm not sure why you've suddenly developed this obsession with wringing
> every last nanosecond out of your code. Usually it is not necessary.
> Have you thought of buying a faster computer? Or using C? *wink*

It's hardly an obsession. I'm just stating it as a relevant factor: for
simple text parsing tasks, string methods are often *much* faster than
regexes.


[...]


>> Ah, but if your requirements are complicated enough that it takes you
>> ten minutes and six lines of string method calls, that sounds to me
>> like a situation that probably calls for a regex!
>
> Recall that the post that started this discussion presented a problem
> that took me six lines of code (actually spread out over a few more for
> readability) to do without regexes versus one line with.
>
> So you do agree that that a regex was a better solution in that case?

I don't know... I'm afraid I can't find your six lines of code, and so
can't judge it in comparison to your regex solution:

    for line in f:
        fixed = re.sub(r"(TABLE='\S+)\s+'$", r"\1'", line)

My solution would probably be something like this:

    for line in lines:
        if line.endswith("'"):
            line = line[:-1].rstrip() + "'"


[...]


>>> (Note that "Apocalypse" is referring to a series of Perl design
>>> documents and has nothing to do with regexes in particular.)
>>
>> But Apocalypse 5 specifically has everything to do with regexes. That's
>> why I linked to that, and not (say) Apocalypse 2.
>
> Where did I suggest that you should have linked to Apocalypse 2? I wrote
> what I wrote to point out that the "Apocalypse" title was not a
> pejorative comment on regexes. I don't see how I could have been
> clearer.

Possibly by saying what you just said here?

I never suggested, or implied, or thought, that "Apocalypse" was a
pejorative comment on *regexes*. The fact that I referenced Apocalypse
FIVE suggests strongly that there are at least four others, presumably
not about regexes.


[...]


>> It is only relevant in so far as the readability and relative
>> obfuscation of regex syntax is relevant. No further.
>
> OK, again you are confirming it is only the syntax of regexes that
> bothers you?

The syntax of regexes is a big part of it. I won't say the only part.


[...]


>> If regexes were more readable, as proposed by Wall, that would go a
>> long way to reducing my suspicion of them.
>
> I am delighted to read that you find the new syntax more acceptable.

Perhaps I wasn't as clear as I could have been. I don't know what the new
syntax is. I was referring to the design principle of improving the
readability of regexes. Whether Wall's new syntax actually does improve
readability and ease of maintenance is a separate issue, one on which I
don't have an opinion on. I applaud his *intention* to reform regex
syntax, without necessarily agreeing that he has done so.

--
Steven

Ian Kelly

unread,
Jun 6, 2011, 12:06:20 PM6/6/11
to Python
On Mon, Jun 6, 2011 at 9:29 AM, Steven D'Aprano
<steve+comp....@pearwood.info> wrote:
> [...]
>> I would expect
>> any regex processor to compile the regex into an FSM.
>
> Flying Spaghetti Monster?
>
> I have been Touched by His Noodly Appendage!!!

Finite State Machine.

Neil Cerutti

unread,
Jun 6, 2011, 12:08:05 PM6/6/11
to

Here's a recap, along with two candidate solutions, one based on
your recommendation, and one using str functions and slicing.

(I fixed a specification problem in your original regex, as one
of the lines of data contained a space after the closing ',
making the $ inappropriate)

data.txt:
//ACCDJ EXEC DB2UNLDC,DFLID=&DFLID,PARMLIB=&PARMLIB,
// UNLDSYST=&UNLDSYST,DATABAS=MBQV1D0A,TABLE='ACCDJ '
//ACCT EXEC DB2UNLDC,DFLID=&DFLID,PARMLIB=&PARMLIB,
// UNLDSYST=&UNLDSYST,DATABAS=MBQV1D0A,TABLE='ACCT '
//ACCUM EXEC DB2UNLDC,DFLID=&DFLID,PARMLIB=&PARMLIB,
// UNLDSYST=&UNLDSYST,DATABAS=MBQV1D0A,TABLE='ACCUM '
//ACCUM1 EXEC DB2UNLDC,DFLID=&DFLID,PARMLIB=&PARMLIB,
// UNLDSYST=&UNLDSYST,DATABAS=MBQV1D0A,TABLE='ACCUM1 '
^Z

import re

print("re solution")
with open("data.txt") as f:
    for line in f:
        fixed = re.sub(r"(TABLE='\S+)\s+'", r"\1'", line)
        print(fixed, end='')

print("non-re solution")
with open("data.txt") as f:
    for line in f:
        i = line.find("TABLE='")
        if i != -1:
            begin = line.index("'", i) + 1
            end = line.index("'", begin)
            field = line[begin: end].rstrip()
            print(line[:i] + line[i:begin] + field + line[end:], end='')
        else:
            print(line, end='')

These two solutions print identical output processing the sample
data. Slight changes in the data would reveal divergence in the
assumptions each solution made.

I agree with you that this is a very tempting candidate for
re.sub, and it probably would have been my first try as well.

--
Neil Cerutti

Ian Kelly

unread,
Jun 6, 2011, 12:29:15 PM6/6/11
to Python
On Mon, Jun 6, 2011 at 10:08 AM, Neil Cerutti <ne...@norwich.edu> wrote:
> import re
>
> print("re solution")
> with open("data.txt") as f:
>    for line in f:
>        fixed = re.sub(r"(TABLE='\S+)\s+'", r"\1'", line)
>        print(fixed, end='')
>
> print("non-re solution")
> with open("data.txt") as f:
>    for line in f:
>        i = line.find("TABLE='")
>        if i != -1:
>            begin = line.index("'", i) + 1
>            end = line.index("'", begin)
>            field = line[begin: end].rstrip()
>            print(line[:i] + line[i:begin] + field + line[end:], end='')
>        else:
>            print(line, end='')

print("non-re solution")
with open("data.txt") as f:
for line in f:

try:
start = line.index("TABLE='") + 7
end = line.index("'", start)
except ValueError:
pass
else:
line = line[:start] + line[start:end].rstrip() + line[end:]
print(line, end='')

Neil Cerutti

unread,
Jun 6, 2011, 1:17:28 PM6/6/11
to
On 2011-06-06, Ian Kelly <ian.g...@gmail.com> wrote:
> On Mon, Jun 6, 2011 at 10:08 AM, Neil Cerutti <ne...@norwich.edu> wrote:
>> import re
>>
>> print("re solution")
>> with open("data.txt") as f:
>>     for line in f:
>>         fixed = re.sub(r"(TABLE='\S+)\s+'", r"\1'", line)
>>         print(fixed, end='')

>>
>> print("non-re solution")
>> with open("data.txt") as f:
>>     for line in f:
>>         i = line.find("TABLE='")
>>         if i != -1:
>>             begin = line.index("'", i) + 1
>>             end = line.index("'", begin)
>>             field = line[begin: end].rstrip()
>>             print(line[:i] + line[i:begin] + field + line[end:], end='')
>>         else:
>>             print(line, end='')

>
> print("non-re solution")
> with open("data.txt") as f:
>     for line in f:
>         try:
>             start = line.index("TABLE='") + 7

I wrestled with using addition like that, and decided against it.
The 7 is a magic number and repeats/hides information. I wanted
something like:

prefix = "TABLE='"
start = line.index(prefix) + len(prefix)

But decided I searching for the opening ' was a bit better.

--
Neil Cerutti

Ethan Furman

unread,
Jun 6, 2011, 1:48:15 PM6/6/11
to Python
Ian Kelly wrote:
> On Mon, Jun 6, 2011 at 10:08 AM, Neil Cerutti <ne...@norwich.edu> wrote:
>> import re
>>
>> print("re solution")
>> with open("data.txt") as f:
>>     for line in f:
>>         fixed = re.sub(r"(TABLE='\S+)\s+'", r"\1'", line)
>>         print(fixed, end='')
>>
>> print("non-re solution")
>> with open("data.txt") as f:
>>     for line in f:
>>         i = line.find("TABLE='")
>>         if i != -1:
>>             begin = line.index("'", i) + 1
>>             end = line.index("'", begin)
>>             field = line[begin: end].rstrip()
>>             print(line[:i] + line[i:begin] + field + line[end:], end='')
>>         else:
>>             print(line, end='')
>
> print("non-re solution")
> with open("data.txt") as f:
>     for line in f:
>         try:
>             start = line.index("TABLE='") + 7
>             end = line.index("'", start)
>         except ValueError:
>             pass
>         else:
>             line = line[:start] + line[start:end].rstrip() + line[end:]
>         print(line, end='')

I like the readability of this version, but isn't generating an
exception on every other line going to kill performance?

~Ethan~

Ian Kelly

unread,
Jun 6, 2011, 1:40:51 PM6/6/11
to Neil Cerutti, pytho...@python.org
On Mon, Jun 6, 2011 at 11:17 AM, Neil Cerutti <ne...@norwich.edu> wrote:
> I wrestled with using addition like that, and decided against it.
> The 7 is a magic number and repeats/hides information. I wanted
> something like:
>
>   prefix = "TABLE='"
>   start = line.index(prefix) + len(prefix)
>
> But decided I searching for the opening ' was a bit better.

Fair enough, although if you ask me the + 1 is just as magical as the
+ 7 (it's still the length of the string that you're searching for).
Also, re-finding the opening ' still repeats information.

The main thing I wanted to fix was that the second .index() call had
the possibility of raising an unhandled ValueError. There are really
two things we have to search for in the line, either of which could be
missing, and catching them both with the same except: clause feels
better to me than checking both of them for -1.

Cheers,
Ian

Ian Kelly

unread,
Jun 6, 2011, 1:42:27 PM6/6/11
to Ethan Furman, Python
On Mon, Jun 6, 2011 at 11:48 AM, Ethan Furman <et...@stoneleaf.us> wrote:
> I like the readability of this version, but isn't generating an exception on
> every other line going to kill performance?

I timed it on the example data before I posted and found that it was
still 10 times as fast as the regex version. I didn't time the
version without the exceptions.
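
A sketch of the kind of harness I mean (my reconstruction, not the code
actually timed):

    import re, timeit

    line = "//   UNLDSYST=&UNLDSYST,DATABAS=MBQV1D0A,TABLE='ACCT    '\n"
    pat = re.compile(r"(TABLE='\S+)\s+'")

    def with_re():
        return pat.sub(r"\1'", line)

    def without_re():
        try:
            start = line.index("TABLE='") + 7
            end = line.index("'", start)
        except ValueError:
            return line
        return line[:start] + line[start:end].rstrip() + line[end:]

    print(timeit.timeit(with_re, number=100000))
    print(timeit.timeit(without_re, number=100000))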

Neil Cerutti

unread,
Jun 6, 2011, 1:56:45 PM6/6/11
to
On 2011-06-06, Ian Kelly <ian.g...@gmail.com> wrote:
> Fair enough, although if you ask me the + 1 is just as magical
> as the + 7 (it's still the length of the string that you're
> searching for). Also, re-finding the opening ' still repeats
> information.

Heh, true. It doesn't really repeat information, though, as in my
version there could be intervening garbage after the TABLE=,
which probably isn't desirable.

> The main thing I wanted to fix was that the second .index()
> call had the possibility of raising an unhandled ValueError.
> There are really two things we have to search for in the line,
> either of which could be missing, and catching them both with
> the same except: clause feels better to me than checking both
> of them for -1.

I thought an unhandled ValueError was a good idea in that case. I
knew that TABLE= may not exist, but I assumed if it did, that the
quotes are supposed to be there.

--
Neil Cerutti

Ian

unread,
Jun 6, 2011, 5:04:06 PM6/6/11
to pytho...@python.org
On 03/06/2011 03:58, Chris Torek wrote:
>
>> -------------------------------------------------
> This is a bit surprising, since both "s1 in s2" and re.search()
> could use a Boyer-Moore-based algorithm for a sufficiently-long
> fixed string, and the time required should be proportional to that
> needed to set up the skip table. The re.compile() gets to re-use
> the table every time.
Is that true? My immediate thought is that Boyer-Moore would quickly give
the number of characters to skip, but skipping them would be slow because
UTF-8-encoded characters are variable sized, and the string would have to
be walked anyway.

Or am I misunderstanding something.
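
For reference, a minimal Boyer-Moore-Horspool sketch over bytes (my own
illustration of the skip table under discussion; note the skips are
counted in bytes, which is exactly where the variable-width worry bites):

    def horspool_find(haystack, needle):
        m = len(needle)
        if m == 0:
            return 0
        # Bad-character table: how far the window may shift when its
        # last byte mismatches.
        skip = {b: m - i - 1 for i, b in enumerate(needle[:-1])}
        i = 0
        while i + m <= len(haystack):
            if haystack[i:i + m] == needle:
                return i
            i += skip.get(haystack[i + m - 1], m)
        return -1

    print(horspool_find(b"some haystack text", b"stack"))   # 8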

Ian

ru...@yahoo.com

unread,
Jun 7, 2011, 12:00:34 PM6/7/11
to
On 06/06/2011 09:29 AM, Steven D'Aprano wrote:
> On Sun, 05 Jun 2011 23:03:39 -0700, ru...@yahoo.com wrote:
[...]

> I would argue that the first, non-regex solution is superior, as it
> clearly distinguishes the multiple steps of the solution:
>
> * filter lines that start with "CUSTOMER"
> * extract fields in that line
> * validate fields (not shown in your code snippet)
>
> while the regex tries to do all of these in a single command. This makes
> the regex an "all or nothing" solution: it matches *everything* or
> *nothing*. This means that your opportunity for giving meaningful error
> messages is much reduced. E.g. I'd like to give an error message like:
>
> found digit in customer name (field 2)
>
> but with your regex, if it fails to match, I have no idea why it failed,
> so can't give any more meaningful error than:
>
> invalid customer line
>
> and leave it to the caller to determine what makes it invalid. (Did I
> misspell "CUSTOMER"? Put a dot after the initial? Forget the code? Use
> two spaces between fields instead of one?)

I agree that is a legitimate criticism. Its importance depends
greatly on the purpose and consumers of the code. While such
detailed error messages might be appropriate in a fully polished
product, in my case, I often have to process files personally
to extract information, or to provide code to others (who typically
have at least some degree of technical sophistication) to do the
same.

In this case, being able to code something quickly, and adapt it
quickly to changes, is more important than providing highly detailed
error messages. The format is simple enough that "invalid customer
line" and the line number are perfectly adequate. YMMV.

As I said, regexes are a tool, like any tool, to be used
appropriately.

[...]


>> In addition to being wrong (loading is done once, compilation is
>> typically done once or a few times, while the regex is used many times
>> inside a loop so the overhead cost is usually trivial compared with the
>> cost of starting Python or reading a file), this is another
>> micro-optimization argument.
>
> Yes, but you have to pay the cost of loading the re engine, even if it is
> a one off cost, it's still a cost,

~$ time python -c 'pass'
real 0m0.015s
user 0m0.011s
sys 0m0.003s

~$ time python -c 'import re'
real 0m0.015s
user 0m0.011s
sys 0m0.003s

Or do you mean something else by "loading the re engine"?

> and sometimes (not always!) it can be
> significant. It's quite hard to write fast, tiny Python scripts, because
> the initialization costs of the Python environment are so high. (Not as
> high as for, say, VB or Java, but much higher than, say, shell scripts.)
> In a tiny script, you may be better off avoiding regexes because it takes
> longer to load the engine than to run the rest of your script!

Do you have an example? I am having a hard time imagining one.
Perhaps you are thinking of the time required to compile an RE?

~$ time python -c 'import re; re.compile(r"^[^()]*(\([^()]*\)[^()]*)*$")'
real 0m0.017s
user 0m0.014s
sys 0m0.003s

Hard to imagine a case where 15 ms is fast enough but
17 ms is too slow. And that's without the diluting effect
of actually doing some real work in the script. Of course
a more complex regex would likely take longer.

(The times vary greatly on my machine; I am quoting the most
common low results, not the absolute lowest.)

>>>> (Note that "Apocalypse" is referring to a series of Perl design
>>>> documents and has nothing to do with regexes in particular.)
>>>
>>> But Apocalypse 5 specifically has everything to do with regexes. That's
>>> why I linked to that, and not (say) Apocalypse 2.
>>
>> Where did I suggest that you should have linked to Apocalypse 2? I wrote
>> what I wrote to point out that the "Apocalypse" title was not a
>> pejorative comment on regexes. I don't see how I could have been
>> clearer.
>
> Possibly by saying what you just said here?
>
> I never suggested, or implied, or thought, that "Apocalypse" was a
> pejorative comment on *regexes*. The fact that I referenced Apocalypse
> FIVE suggests strongly that there are at least four others, presumably
> not about regexes.

Nor did I ever suggest you did. Don't forget that you are
not the only person reading this list. The comment was for
the benefit of others. Perhaps you are being overly sensitive?

> [...]
>>> If regexes were more readable, as proposed by Wall, that would go a
>>> long way to reducing my suspicion of them.
>>
>> I am delighted to read that you find the new syntax more acceptable.
>
> Perhaps I wasn't as clear as I could have been. I don't know what the new
> syntax is. I was referring to the design principle of improving the
> readability of regexes. Whether Wall's new syntax actually does improve
> readability and ease of maintenance is a separate issue, one on which I
> don't have an opinion on. I applaud his *intention* to reform regex
> syntax, without necessarily agreeing that he has done so.

Thanks for clarifying. But since you earlier wrote in response
to MRAB,
http://groups.google.com/group/comp.lang.python/msg/43f3a81d9cc75217?

"Have you considered the suggested Perl 6 syntax? Much of
it looks good to me."

I'm sure you can understand my confusion.

ru...@yahoo.com

unread,
Jun 7, 2011, 2:37:15 PM6/7/11
to
On 06/06/2011 08:33 AM, rusi wrote:
> For any significant language feature (take recursion for example)
> there are these issues:
>
> 1. Ease of reading/skimming (other's) code
> 2. Ease of writing/designing one's own
> 3. Learning curve
> 4. Costs/payoffs (eg efficiency, succinctness) of use
> 5. Debug-ability
>
> I'll start with 3.
> When someone of Kernighan's calibre (thanks for the link BTW) says
> that he found recursion difficult it could mean either that Kernighan
> is a stupid guy -- unlikely considering his other achievements. Or
> that C is not optimal (as compared to lisp say) for learning
> recursion.

Just as a side comment, I didn't see anything in the link
Chris Torek posted (repeated here since it got snipped:
http://www.princeton.edu/~hos/frs122/precis/kernighan.htm)
that said Kernighan found recursion difficult, just that it
was perceived as expensive. Nor that the expense had anything
to do with programming language but rather was due to hardware
constraints of the time.
But maybe you are referring to some other source?

> Evidently for syntactic, implementation and cultural reasons, Perl
> programmers are likely to get (and then overuse) regexes faster than
> python programmers.

If by "get", you mean "understand", then I'm not sure why
the reasons you give should make a big difference. Regex
syntax is pretty similar in both Python and Perl, and
virtually identical in terms of learning their basics.
There are some differences in the how regexes are used
between Perl and Python that I mentioned in
http://groups.google.com/group/comp.lang.python/msg/39fca0d4589f4720?,
but as I said there, that wouldn't, particularly in light
of Python culture where one-liners and terseness are not
highly valued, seem very important. And I don't see how
the different Perl and Python cultures themselves would
make learning regexes harder for Python programmers. At
most I can see the Perl culture encouraging their use and
the Python culture discouraging it, but that doesn't change
the ease or difficulty of learning.

And why do you say "overuse" regexes?  Why isn't it the case
that Perl programmers use regexes appropriately in Perl? Are
you not arbitrarily applying a Python-centric standard to a
different culture? What if a Perl programmer says that Python
programmers under-use regexes?

> 1 is related but not the same as 3. Someone with courses in automata,
> compilers etc -- standard CS stuff -- is unlikely to find regexes a
> problem. Conversely an intelligent programmer without a CS background
> may find them more forbidding.

I'm not sure of that. (Not sure it should be that way,
perhaps it may be that way in practice.) I suspect that
a good theoretical understanding of automata theory would
be essential in writing a regex compiler, but I'm not sure
it is necessary in order to use regexes.

It does I'm sure give one a solid understanding of the
limitations of regexes but a practical understanding of
those can be achieved without the full course I think.

Roy Smith

unread,
Jun 7, 2011, 8:30:07 PM6/7/11
to
On 06/06/2011 08:33 AM, rusi wrote:
>> Evidently for syntactic, implementation and cultural reasons, Perl
>> programmers are likely to get (and then overuse) regexes faster than
>> python programmers.

"ru...@yahoo.com" <ru...@yahoo.com> wrote:
> I don't see how the different Perl and Python cultures themselves
> would make learning regexes harder for Python programmers.

Oh, that part's obvious. People don't learn things in a vacuum. They
read about something, try it, fail, and ask for help. If, in one
community, the response they get is, "I see what's wrong with your
regex, you need to ...", and in another they get, "You shouldn't be
using a regex there, you should use this string method instead...", it
should not be a surprise that it's easier to learn about regexes in the
first community.

rusi

unread,
Jun 8, 2011, 4:27:54 AM6/8/11
to
On Jun 7, 11:37 pm, "ru...@yahoo.com" <ru...@yahoo.com> wrote:
> On 06/06/2011 08:33 AM, rusi wrote:
>
> > For any significant language feature (take recursion for example)
> > there are these issues:
>
> > 1. Ease of reading/skimming (other's) code
> > 2. Ease of writing/designing one's own
> > 3. Learning curve
> > 4. Costs/payoffs (eg efficiency, succinctness) of use
> > 5. Debug-ability
>
> > I'll start with 3.
> > When someone of Kernighan's calibre (thanks for the link BTW) says
> > that he found recursion difficult it could mean either that Kernighan
> > is a stupid guy -- unlikely considering his other achievements. Or
> > that C is not optimal (as compared to lisp say) for learning
> > recursion.
>
> Just as a side comment, I didn't see anything in the link
> Chris Torek posted (repeated here since it got snipped: http://www.princeton.edu/~hos/frs122/precis/kernighan.htm)
> that said Kernighan found recursion difficult, just that it
> was perceived as expensive.  Nor that the expense had anything
> to do with programming language but rather was due to hardware
> constraints of the time.
> But maybe you are referring to some other source?

No, the same source; see:

> In his work Kernighan also experimented with writing structured and unstructured programs.
> He found writing structured programs (programs that did not use goto's) difficult at first,
> but now he cannot imagine writing programs in any other manner. The idea of recursion
> in programs also seemed to develop slowly; the advantage to the programmer was clear,
> but recursion statements were generally perceived as expensive, and thus were discouraged.

Note the "also" -- it suggests that recursion and structured
programming went together for Kernighan.

>
> > Evidently for syntactic, implementation and cultural reasons, Perl
> > programmers are likely to get (and then overuse) regexes faster than
> > python programmers.
>
> If by "get", you mean "understand", then I'm not sure why
> the reasons you give should make a big difference.  Regex
> syntax is pretty similar in both Python and Perl, and
> virtually identical in terms of learning their basics.

Having it as part of the language (rather than in an imported module)
makes for a certain 'smoothness', I imagine.

> There are some differences in how regexes are used
> between Perl and Python that I mentioned in http://groups.google.com/group/comp.lang.python/msg/39fca0d4589f4720?,
> but as I said there, that wouldn't, particularly in light
> of Python culture where one-liners and terseness are not
> highly valued, seem very important.  And I don't see how
> the different Perl and Python cultures themselves would
> make learning regexes harder for Python programmers.  At
> most I can see the Perl culture encouraging their use and
> the Python culture discouraging it, but that doesn't change
> the ease or difficulty of learning.

See Roy's answer.


> What if a Perl programmer says that Python programmers under-use regexes?

That's what I gather they would say (and I guess you and I agree it's
true?)

>
> And why do you say "overuse" regexes?  Why isn't it the case
> that Perl programmers use regexes appropriately in Perl?  Are
> you not arbitrarily applying a Python-centric standard to a
> different culture?

> > 1 is related but not the same as 3.  Someone with courses in automata,
> > compilers etc -- standard CS stuff -- is unlikely to find regexes a
> > problem.  Conversely an intelligent programmer without a CS background
> > may find them more forbidding.
>
> I'm not sure of that.  (Not sure it should be that way,
> perhaps it may be that way in practice.)  I suspect that
> a good theoretical understanding of automata theory would
> be essential in writing a regex compiler but I'm not sure
> it is necessary to use regexes.
>
> It does I'm sure give one a solid understanding of the
> limitations of regexes but a practical understanding of
> those can be achieved without the full course I think.

How do you answer when a regex-happy but CS-illiterate programmer asks
for a regex to match balanced parentheses?
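
(The stock answer being, of course, that you can't: a finite
automaton cannot count nesting depth.  What one writes instead is a
counter -- a minimal sketch, function name purely illustrative:)

def is_balanced(s):
    # Classical regexes cannot match arbitrarily nested parentheses,
    # since a finite automaton has no memory for nesting depth; a
    # running counter handles it in one linear pass.
    depth = 0
    for ch in s:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:          # a ')' with no matching '('
                return False
    return depth == 0              # every '(' was closed

print(is_balanced("(a (b) c)"))    # True
print(is_balanced("(a (b c"))      # False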

Anyway you may be right and this is quite far from my main points --
which I would like to iterate:

1. Regexes were invented by automata theorists and used mostly
unchanged by early Unix hackers.
This works up to a point and fails badly when they get too large.
2. Larry Wall's suggestions are on the whole good, and Python can
leapfrog over Perl by implementing them, especially given that it is
much easier for Python to add a new regex module than for Perl 5 to
become Perl 6.
3. A big problem with regexes is the lack of regex debuggers. Things
like Emacs' re-builder, regex-tool and the Python-based Kodos need
more visibility.
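
(For what it's worth, the stdlib already ships two small aids here,
easy to miss though they are: re.DEBUG dumps the parsed structure of
a pattern at compile time, and re.VERBOSE lets a pattern carry its
own comments.  A quick illustration:)

import re

# Compiling with re.DEBUG prints the pattern's parse tree to stdout
# -- crude, but handy when a regex doesn't match what you expected.
re.compile(r'code (blue)|(red)', re.DEBUG)

# re.VERBOSE (a.k.a. re.X) ignores insignificant whitespace, so a
# pattern can be laid out and commented like ordinary code.
pat = re.compile(r"""
    code\s+         # the word 'code' plus whitespace
    (blue|red)      # the colour we care about
""", re.VERBOSE | re.I)
print(pat.search("Paging: CODE BLUE, floor 3").group(1))   # 'BLUE'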

Duncan Booth

unread,
Jun 8, 2011, 5:01:48 AM6/8/11
to
"ru...@yahoo.com" <ru...@yahoo.com> wrote:
> On 06/06/2011 09:29 AM, Steven D'Aprano wrote:
>> Yes, but you have to pay the cost of loading the re engine, even if
>> it is a one off cost, it's still a cost,
>
> ~$ time python -c 'pass'
> real 0m0.015s
> user 0m0.011s
> sys 0m0.003s
>
> ~$ time python -c 'import re'
> real 0m0.015s
> user 0m0.011s
> sys 0m0.003s
>
> Or do you mean something else by "loading the re engine"?

At least part of the reason that there's no difference there is that the
're' module was imported in both cases:

C:\Python27>python -c "import sys; print('re' in sys.modules)"
True

C:\Python32>python -c "import sys; print('re' in sys.modules)"
True

Steven is right to assert that there's a cost to loading it, but unless you
jump through hoops it's not a cost you can avoid paying and still use
Python.
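
(A sketch of said hoops, for anyone curious about the raw number:
evict the cached modules, then re-import.  The helper-module names
are the ones used by 2.x and early 3.x; treat them as an assumption
elsewhere.)

import sys, time

# 're' and its helpers are normally pulled in during interpreter
# startup, so a later 'import re' merely hits the sys.modules cache.
# Evicting the cached entries forces the load to happen again.
for name in ('re', 'sre_compile', 'sre_parse', 'sre_constants'):
    if name in sys.modules:
        del sys.modules[name]

t0 = time.time()
import re
print("re import took %.5f seconds" % (time.time() - t0))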

--
Duncan Booth http://kupuguy.blogspot.com

ru...@yahoo.com

unread,
Jun 8, 2011, 10:39:24 AM6/8/11
to
On 06/08/2011 03:01 AM, Duncan Booth wrote:
> "ru...@yahoo.com" <ru...@yahoo.com> wrote:
>> On 06/06/2011 09:29 AM, Steven D'Aprano wrote:
>>> Yes, but you have to pay the cost of loading the re engine, even if
>>> it is a one off cost, it's still a cost,
[...]

> At least part of the reason that there's no difference there is that the
> 're' module was imported in both cases:

Quite right. I should have thought of that.

[...]


> Steven is right to assert that there's a cost to loading it, but unless you
> jump through hoops it's not a cost you can avoid paying and still use
> Python.

I would say that it is effectively zero cost then.

ru...@yahoo.com

unread,
Jun 8, 2011, 10:38:05 AM6/8/11
to

I think we are just using different definitions of "harder".

I said, immediately after the sentence you quoted,

>> At
>> most I can see the Perl culture encouraging their use and
>> the Python culture discouraging it, but that doesn't change
>> the ease or difficulty of learning.

Constantly being told not to use regexes certainly discourages
one from learning them, but I don't think that's the same as
being *harder* to learn in Python. The syntax of regexes is,
at least at the basic level, pretty universal, and it is in
learning to understand that syntax that most of any difficulty
lies.  Whether to express a regex as "/code (blue)|(red)/i" in
Perl or "re.search(r'code (blue)|(red)', s, re.I)" in Python is a
superficial difference, as is, say, using match results: "$alert = $1"
vs "alert = m.group(1)".

A Google search for "python regular expression tutorial" produces
lots of results including the Python docs HOWTO. And because
the syntax is pretty universal, leaving the "python" off that
search string will yield many, many more that are applicable.
Although one does get some "don't do that" responses to regex
questions on this list (and some are good advice), there are
also usually answers too.

So I think of it as more of a Python culture thing rather
than regexes actually being harder to learn in Python,
although I see how one can view it your way too.

rusi

unread,
Jun 8, 2011, 12:14:08 PM6/8/11
to


... this is the old nature vs nurture debate: http://en.wikipedia.org/wiki/Nature_versus_nurture

Chris Torek

unread,
Jun 8, 2011, 10:32:08 PM6/8/11
to
>On 03/06/2011 03:58, Chris Torek wrote:
>>> -------------------------------------------------
>> This is a bit surprising, since both "s1 in s2" and re.search()
>> could use a Boyer-Moore-based algorithm for a sufficiently-long
>> fixed string, and the time required should be proportional to that
>> needed to set up the skip table. The re.compile() gets to re-use
>> the table every time.

In article <mailman.2508.1307394...@python.org>


Ian <hobs...@gmail.com> wrote:
>Is that true? My immediate thought is that Boyer-Moore would quickly give
>the number of characters to skip, but skipping them would be slow because
>UTF8 encoded characters are variable sized, and the string would have to be
>walked anyway.

As I understand it, strings in python 3 are Unicode internally and
(apparently) use wchar_t. Byte strings in python 3 are of course
byte strings, not UTF-8 encoded.
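
(A quick way to check which width a given pre-3.3 CPython build uses
-- a narrow build reports 65535, a wide build 1114111:)

import sys
print(sys.maxunicode)   # 65535 = narrow (UCS-2), 1114111 = wide (UCS-4)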

>Or am I misunderstanding something.

Here's python 2.7 on a Linux box:

>>> print sys.getsizeof('a'), sys.getsizeof('ab'), sys.getsizeof('abc')
38 39 40
>>> print sys.getsizeof(u'a'), sys.getsizeof(u'ab'), sys.getsizeof(u'abc')
56 60 64

This implies that strings in Python 2.x are just byte strings (same
as b"..." in Python 3.x) and never actually contain unicode; and
unicode strings (same as "..." in Python 3.x) use 4-byte "characters"
per that box's wchar_t.
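
(For reference, the skip table mentioned in the quoted text takes
only a few lines when every position in the text is exactly one code
unit -- a Boyer-Moore-Horspool sketch, function name illustrative:)

def horspool_find(haystack, needle):
    # Boyer-Moore-Horspool: precompute, for each character of the
    # pattern except its last position, how far the search window may
    # shift on a mismatch -- this is the skip table whose setup cost
    # dominates for long fixed strings.
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    skip = {}
    for i, ch in enumerate(needle[:-1]):
        skip[ch] = m - 1 - i        # distance from ch to pattern end
    i = 0
    while i <= n - m:
        if haystack[i:i + m] == needle:
            return i
        # Shift by the entry for the text character aligned with the
        # pattern's last position; characters absent from the pattern
        # allow a shift of the full pattern length.
        i += skip.get(haystack[i + m - 1], m)
    return -1                       # not found

print(horspool_find("the quick brown fox", "brown"))   # 10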
