Regex driving me crazy...

J

Apr 7, 2010, 5:40:32 PM

to Python List

Can someone make me un-crazy?

I have a bit of code that right now, looks like this:

status = getoutput('smartctl -l selftest /dev/sda').splitlines()[6]
status = re.sub(' (?= )(?=([^"]*"[^"]*")*[^"]*$)', ":",status)
print status

Basically, it pulls the first actual line of data from the return you
get when you use smartctl to look at a hard disk's selftest log.

The raw data looks like this:

# 1  Short offline       Completed without error       00%       679         -

Unfortunately, all that whitespace is arbitrary single space
characters. And I am interested in the string that appears in the
third column, which changes as the test runs and then completes. So
in the example, "Completed without error"

The regex I have up there doesn't quite work, as it seems to be
subbing EVERY space (or at least in instances of more than one space)
to a ':' like this:

# 1: Short offline:::::: Completed without error:::::: 00%:::::: 679:::::::: -

Ultimately, what I'm trying to do is either replace any run of more
than one space with a delimiter, then split the result into a list and
get the third item.

OR, if there's a smarter, shorter, or better way of doing it, I'd love to know.

The end result should pull the whole string in the middle of that
output line, and then I can use that to compare to a list of possible
output strings to determine if the test is still running, has
completed successfully, or failed.

Unfortunately, my google-fu fails right now, and my Regex powers were
always rather weak anyway...

So any ideas on what the best way to proceed with this would be?
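
For reference, the approach described above (treat any run of two or more spaces as a column separator, split, take the third field) could be sketched roughly like this in Python 3. The sample line is reconstructed from the colon-delimited output shown in the post, and the names `line`, `fields` and `status` are illustrative rather than taken from the original script.

```python
import re

# Sample line reconstructed from the post; real smartctl output may differ.
line = "# 1  Short offline       Completed without error       00%       679         -"

# Any run of two or more spaces separates columns; the single spaces
# inside a field such as "Completed without error" are left alone.
fields = re.split(r" {2,}", line.strip())

# The status text is the third column (index 2).
status = fields[2]
print(status)
```

This only works as long as no field ever contains two consecutive spaces of its own.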

Grant Edwards

Apr 7, 2010, 5:47:23 PM

On 2010-04-07, J <dreadpi...@gmail.com> wrote:

> Can someone make me un-crazy?

Definitely. Regex is driving you crazy, so don't use a regex.

inputString = "# 1  Short offline       Completed without error       00%       679         -"

print ' '.join(inputString.split()[4:-3])

> So any ideas on what the best way to proceed with this would be?

Anytime you have a problem with a regex, the first thing you should
ask yourself: "do I really, _really_ need a regex?"

Hint: the answer is usually "no".
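
To spell out what the slice in that one-liner relies on, here is a small Python 3 sketch. It assumes the line always starts with exactly four whitespace-separated words and ends with exactly three. The second sample line is hypothetical and is only there to show what happens when that assumption does not hold.

```python
line = "# 1  Short offline       Completed without error       00%       679         -"

# split() with no argument splits on any run of whitespace.
words = line.split()

# Assumed layout: 4 leading words ("#", "1", "Short", "offline") and
# 3 trailing words ("00%", "679", "-"); whatever is in between is the status.
status = " ".join(words[4:-3])
print(status)

# Hypothetical line whose second column has three words instead of two:
# the fixed slice now swallows part of the neighbouring column.
other = "# 2  Some longer test       Aborted by user       90%       680         -"
shifted = " ".join(other.split()[4:-3])
print(shifted)
```

So the slice is fine when only the status column varies in word count, and it goes wrong as soon as another column does.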

--
Grant Edwards (grant.b.edwards at gmail.com)
Yow! I'm continually AMAZED at th'breathtaking effects of WIND EROSION!!

Patrick Maupin

Apr 7, 2010, 8:49:27 PM

You mean like this?

>>> import re
>>> re.split(' {2,}', '# 1  Short offline       Completed without error       00%')
['# 1', 'Short offline', 'Completed without error', '00%']
>>>

Regards,
Pat

Patrick Maupin

Apr 7, 2010, 8:50:12 PM

On Apr 7, 4:47 pm, Grant Edwards <inva...@invalid.invalid> wrote:

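OK, fine. Post a better solution to this problem than:

>>> import re
>>> re.split(' {2,}', '# 1  Short offline       Completed without error       00%')
['# 1', 'Short offline', 'Completed without error', '00%']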

Patrick Maupin

Apr 7, 2010, 9:03:47 PM

BTW, although I find it annoying when people say "don't do that" when
"that" is a perfectly good thing to do, and although I also find it
annoying when people tell you what not to do without telling you what
*to* do, and although I find the regex solution to this problem to be
quite clean, the equivalent non-regex solution is not terrible, so I
will present it as well, for your viewing pleasure:

>>> [x for x in '# 1  Short offline       Completed without error       00%'.split('  ') if x.strip()]
['# 1', 'Short offline', ' Completed without error', ' 00%']

James Stroud

Apr 7, 2010, 10:02:39 PM

Patrick Maupin wrote:
> BTW, although I find it annoying when people say "don't do that" when
> "that" is a perfectly good thing to do, and although I also find it
> annoying when people tell you what not to do without telling you what
> *to* do, and although I find the regex solution to this problem to be
> quite clean, the equivalent non-regex solution is not terrible

I propose a new way to answer questions on c.l.python that will (1) give respondents the pleasure of vague admonishment and (2) actually answer the question. The way I propose utilizes the double negative. For example:

"You are doing it wrong! Don't not do <code>re.split('\s{2,}', s[2])</code>."

Please answer this way in the future.

Thank you,
James

Patrick Maupin

Apr 7, 2010, 10:10:36 PM

On Apr 7, 9:02 pm, James Stroud <nospamjstroudmap...@mbi.ucla.edu>
wrote:

I most certainly will not consider when that isn't warranted!

OTOH, in general I am more interested in admonishing the authors of
the pseudo-answers than I am the authors of the questions, despite the
fact that I find this hilarious:

http://despair.com/cluelessness.html

Regards,
Pat

Grant Edwards

Apr 7, 2010, 10:36:49 PM

On 2010-04-08, Patrick Maupin <pma...@gmail.com> wrote:

> On Apr 7, 4:47 pm, Grant Edwards <inva...@invalid.invalid> wrote:
>> On 2010-04-07, J <dreadpiratej...@gmail.com> wrote:
>>
>> > Can someone make me un-crazy?
>>
>> Definitely. Regex is driving you crazy, so don't use a regex.
>>
>>   inputString = "# 1  Short offline       Completed without error       00%       679         -"
>>
>>   print ' '.join(inputString.split()[4:-3])
[...]

> OK, fine. Post a better solution to this problem than:
>
> >>> import re
> >>> re.split(' {2,}', '# 1  Short offline       Completed without error       00%')
> ['# 1', 'Short offline', 'Completed without error', '00%']

OK, I'll bite: what's wrong with the solution I already posted?

--
Grant

Grant Edwards

Apr 7, 2010, 10:38:30 PM

I will certainly try to avoid not answering in a manner not unlike that.

--
Grant

Patrick Maupin

Apr 7, 2010, 10:45:21 PM

On Apr 7, 9:36 pm, Grant Edwards <inva...@invalid.invalid> wrote:

Sorry, my eyes completely missed your one-liner, so my criticism about
not posting a solution was unwarranted. I don't think you and I read
the problem the same way (which is probably why I didn't notice your
solution -- because it wasn't solving the problem I thought I saw).

When I saw "And I am interested in the string that appears in the
third column, which changes as the test runs and then completes" I
assumed that, not only could that string change, but so could the one
before it.

I guess my base assumption was that anything with words in it could
change. I was looking at the OP's attempt at a solution, and he
obviously felt he needed to see two or more spaces as an item
delimiter.

(And I got testy because of seeing other IMO unwarranted denigration
of re on the list lately.)

Regards,
Pat

Steven D'Aprano

Apr 7, 2010, 10:51:53 PM

On Wed, 07 Apr 2010 18:03:47 -0700, Patrick Maupin wrote:

> BTW, although I find it annoying when people say "don't do that" when
> "that" is a perfectly good thing to do, and although I also find it
> annoying when people tell you what not to do without telling you what
> *to* do,

Grant did give a perfectly good solution.

> and although I find the regex solution to this problem to be
> quite clean, the equivalent non-regex solution is not terrible, so I
> will present it as well, for your viewing pleasure:
>
> >>> [x for x in '# 1  Short offline       Completed without error       00%'.split('  ') if x.strip()]
> ['# 1', 'Short offline', ' Completed without error', ' 00%']

This is one of the reasons we're so often suspicious of re solutions:

>>> import re
>>> from timeit import Timer
>>> s = '# 1 Short offline Completed without error 00%'
>>> tre = Timer("re.split(' {2,}', s)",
... "import re; from __main__ import s")
>>> tsplit = Timer("[x for x in s.split('  ') if x.strip()]",
... "from __main__ import s")
>>>
>>> re.split(' {2,}', s) == [x for x in s.split('  ') if x.strip()]
True
>>>
>>>
>>> min(tre.repeat(repeat=5))
6.1224789619445801
>>> min(tsplit.repeat(repeat=5))
1.8338048458099365

Even when they are correct and not unreadable line-noise, regexes tend to
be slow. And they get worse as the size of the input increases:

>>> s *= 1000
>>> min(tre.repeat(repeat=5, number=1000))
2.3496899604797363
>>> min(tsplit.repeat(repeat=5, number=1000))
0.41538596153259277
>>>
>>> s *= 10
>>> min(tre.repeat(repeat=5, number=1000))
23.739185094833374
>>> min(tsplit.repeat(repeat=5, number=1000))
4.6444299221038818

And this isn't even one of the pathological O(N**2) or O(2**N) regexes.

Don't get me wrong -- regexes are a useful tool. But if your first
instinct is to write a regex, you're doing it wrong.

[quote]
A related problem is Perl's over-reliance on regular expressions
that is exaggerated by advocating regex-based solution in almost
all O'Reilly books. The latter until recently were the most
authoritative source of published information about Perl.

While simple regular expression is a beautiful thing and can
simplify operations with string considerably, overcomplexity in
regular expressions is extremly dangerous: it cannot serve a basis
for serious, professional programming, it is fraught with pitfalls,
a big semantic mess as a result of outgrowing its primary purpose.
Diagnostic for errors in regular expressions is even weaker then
for the language itself and here many things are just go unnoticed.
[end quote]

http://www.softpanorama.org/Scripting/Perlbook/Ch01/place_of_perl_among_other_lang.shtml

Even Larry Wall has criticised Perl's regex culture:

http://dev.perl.org/perl6/doc/design/apo/A05.html

--
Steven

J

Apr 7, 2010, 11:01:08 PM

to Patrick Maupin, pytho...@python.org

On Wed, Apr 7, 2010 at 22:45, Patrick Maupin <pma...@gmail.com> wrote:

> When I saw "And I am interested in the string that appears in the
> third column, which changes as the test runs and then completes" I
> assumed that, not only could that string change, but so could the one
> before it.
>
> I guess my base assumption was that anything with words in it could
> change. I was looking at the OP's attempt at a solution, and he
> obviously felt he needed to see two or more spaces as an item
> delimiter.

I apologize for the confusion, Pat...

I could have worded that better, but at that point I was A:
Frustrated, B: starving, and C: had my wife nagging me to stop working
to come get something to eat ;-)

What I meant was, in that output string, the phrase in the middle
could change in length...
After looking at the source code for smartctl (part of the
smartmontools package for you linux people) I found the switch that
creates those status messages.... they vary in character length, some
with non-text characters like ( and ) and /, and have either 3 or 4
words...

The spaces between each column, instead of being a fixed number of
spaces each, were seemingly arbitrarily created... there may be 4
spaces between two columns or there may be 9, or 7 or who knows what,
and since they were all being treated as individual spaces instead of
tabs or something, I was having trouble splitting the output into
something that was easy to parse (at least in my mind it seemed that
way).

Anyway, that's that... and I do apologize if my original post was
confusing at all...

Cheers
Jeff

Patrick Maupin

Apr 7, 2010, 11:04:58 PM

On Apr 7, 9:51 pm, Steven D'Aprano
<ste...@REMOVE.THIS.cybersource.com.au> wrote:
> On Wed, 07 Apr 2010 18:03:47 -0700, Patrick Maupin wrote:
> > BTW, although I find it annoying when people say "don't do that" when
> > "that" is a perfectly good thing to do, and although I also find it
> > annoying when people tell you what not to do without telling you what
> > *to* do,
>
> Grant did give a perfectly good solution.

Yeah, I noticed later and apologized for that. What he gave will work
perfectly if the only data that changes the number of words is the
data the OP is looking for. This may or may not be true. I don't
know anything about the program generating the data, but I did notice
that the OP's attempt at an answer indicated that the OP felt (rightly
or wrongly) he needed to split on two or more spaces.

Bravo!!! Good data, quotes, references, all good stuff!

I absolutely agree that regex shouldn't always be the first thing you
reach for, but I was reading way too much unsubstantiated "this is
bad. Don't do it." on the subject recently. In particular, when
people say "Don't use regex. Use PyParsing!" It may be good advice
in the right context, but it's a bit disingenuous not to mention that
PyParsing will use regex under the covers...

Regards,
Pat

Grant Edwards

Apr 7, 2010, 11:10:03 PM

On 2010-04-08, Patrick Maupin <pma...@gmail.com> wrote:

> Sorry, my eyes completely missed your one-liner, so my criticism about
> not posting a solution was unwarranted. I don't think you and I read
> the problem the same way (which is probably why I didn't notice your
> solution -- because it wasn't solving the problem I thought I saw).

No worries.

> When I saw "And I am interested in the string that appears in the
> third column, which changes as the test runs and then completes" I
> assumed that, not only could that string change, but so could the one
> before it.

If that's the case, my solution won't work right.

> I guess my base assumption was that anything with words in it could
> change. I was looking at the OP's attempt at a solution, and he
> obviously felt he needed to see two or more spaces as an item
> delimiter.

If the requirement is indeed two or more spaces as a delimiter with
spaces allowed in any field, then a regular expression split is
probably the best solution.

--
Grant

Patrick Maupin

Apr 7, 2010, 11:26:02 PM

On Apr 7, 9:51 pm, Steven D'Aprano
<ste...@REMOVE.THIS.cybersource.com.au> wrote:

> This is one of the reasons we're so often suspicious of re solutions:
>
> >>> s = '# 1 Short offline Completed without error 00%'
> >>> tre = Timer("re.split(' {2,}', s)",
> ... "import re; from __main__ import s")
> >>> tsplit = Timer("[x for x in s.split('  ') if x.strip()]",
> ... "from __main__ import s")
>
> >>> re.split(' {2,}', s) == [x for x in s.split('  ') if x.strip()]
> True
>
> >>> min(tre.repeat(repeat=5))
> 6.1224789619445801
> >>> min(tsplit.repeat(repeat=5))
> 1.8338048458099365

I will confess that, in my zeal to defend re, I gave a simple one-
liner, rather than the more optimized version:

>>> from timeit import Timer

>>> s = '# 1  Short offline       Completed without error       00%'

>>> tre = Timer("splitter(s)",
... "import re; from __main__ import s; splitter = re.compile(' {2,}').split")

>>> tsplit = Timer("[x for x in s.split('  ') if x.strip()]",
... "from __main__ import s")

>>> min(tre.repeat(repeat=5))
1.893190860748291
>>> min(tsplit.repeat(repeat=5))
2.0661051273345947

You're right that if you have an 800K byte string, re doesn't perform
as well as split, but the delta is only a few percent.

>>> s *= 10000
>>> min(tre.repeat(repeat=5, number=1000))
15.331652164459229
>>> min(tsplit.repeat(repeat=5, number=1000))
14.596404075622559
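
For anyone wanting to repeat this kind of comparison on a current interpreter, a rough Python 3 equivalent might look like the sketch below. The function names `by_regex` and `by_split` and the repeat/number counts are arbitrary choices, not from the thread, and the timings will differ from machine to machine.

```python
import re
from timeit import repeat

s = "# 1  Short offline       Completed without error       00%"

# Compile once, outside the timed code, as in the optimized version above.
splitter = re.compile(r" {2,}").split

def by_regex():
    return splitter(s)

def by_split():
    # Split on two-space pairs, strip stray single spaces, drop empty pieces.
    return [x.strip() for x in s.split("  ") if x.strip()]

# Take the minimum of several runs to reduce noise from other processes.
t_regex = min(repeat(by_regex, repeat=3, number=10000))
t_split = min(repeat(by_split, repeat=3, number=10000))
print(f"regex: {t_regex:.4f}s  split: {t_split:.4f}s")
```

Checking that both functions return the same list before comparing their speed is worth the extra line.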

Regards,
Pat

Lie Ryan

Apr 8, 2010, 1:05:20 AM

On 04/08/10 12:45, Patrick Maupin wrote:
> (And I got testy because of seeing other IMO unwarranted denigration
> of re on the list lately.)

Why am I seeing a lot of this pattern lately:

OP: Got problem with string
+- A: Suggested a regex-based solution
+- B: Quoted "Some people ... regex ... two problems."

or

OP: Writes some regex, found problem
+- A: Quoted "Some people ... regex ... two problems."
+- B: Supplied regex-based solution, clean one
+- A: Suggested PyParsing (or similar)

Patrick Maupin

Apr 8, 2010, 12:57:31 AM

On Apr 7, 9:51 pm, Steven D'Aprano
<ste...@REMOVE.THIS.cybersource.com.au> wrote:

BTW, I don't know how you got 'True' here.

> >>> re.split(' {2,}', s) == [x for x in s.split('  ') if x.strip()]
> True

You must not have s set up to be the string given by the OP. I just
realized there was an error in my non-regexp example, that actually
manifests itself with the test data:

>>> import re

>>> s = '# 1  Short offline       Completed without error       00%'

>>> re.split(' {2,}', s)
['# 1', 'Short offline', 'Completed without error', '00%']

>>> [x for x in s.split('  ') if x.strip()]
['# 1', 'Short offline', ' Completed without error', ' 00%']

>>> re.split(' {2,}', s) == [x for x in s.split('  ') if x.strip()]
False

To fix it requires something like:

[x.strip() for x in s.split('  ') if x.strip()]

or:

[x for x in [x.strip() for x in s.split('  ')] if x]

I haven't timed either one of these, but given that the broken
original one was slower than the simpler:

splitter = re.compile(' {2,}').split

splitter(s)

on strings of "normal" length, and given that nobody noticed this bug
right away (even though it was in the printout on my first message,
heh), I think that this shows that (here, let me qualify this
carefully), at least in some cases, the first regexp that comes to my
mind can be prettier, shorter, faster, less bug-prone, etc. than the
first non-regexp that comes to my mind...
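
As a quick check of the bug and the fix described above, here is a Python 3 sketch that puts the three variants side by side. The variable names are illustrative. The sample string uses runs of 2, 7 and 7 spaces, which is an assumption based on the outputs quoted in this thread.

```python
import re

s = "# 1  Short offline       Completed without error       00%"

# Regex version: one split per run of two or more spaces.
via_regex = re.compile(r" {2,}").split(s)

# Original non-regex version: splitting on two-space pairs leaves one
# stray space behind whenever a run has an odd number of spaces.
unstripped = [x for x in s.split("  ") if x.strip()]

# Fixed non-regex version: strip each piece before keeping it.
via_split = [x.strip() for x in s.split("  ") if x.strip()]

print(unstripped)
print(via_regex == unstripped, via_regex == via_split)
```

The stray leading space only appears for odd-length runs, which is why it is easy to miss on other test strings.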

Regards,
Pat

Kushal Kumaran

Apr 8, 2010, 1:16:04 AM

to J, Python List

On Thu, Apr 8, 2010 at 3:10 AM, J <dreadpi...@gmail.com> wrote:
> Can someone make me un-crazy?
>

> I have a bit of code that right now, looks like this:
>
> status = getoutput('smartctl -l selftest /dev/sda').splitlines()[6]
> status = re.sub(' (?= )(?=([^"]*"[^"]*")*[^"]*$)', ":",status)
> print status
>
> Basically, it pulls the first actual line of data from the return you
> get when you use smartctl to look at a hard disk's selftest log.
>
> The raw data looks like this:
>
> # 1  Short offline       Completed without error       00%       679         -
>
> Unfortunately, all that whitespace is arbitrary single space
> characters. And I am interested in the string that appears in the
> third column, which changes as the test runs and then completes. So
> in the example, "Completed without error"
>
> The regex I have up there doesn't quite work, as it seems to be
> subbing EVERY space (or at least in instances of more than one space)
> to a ':' like this:
>
> # 1: Short offline:::::: Completed without error:::::: 00%:::::: 679:::::::: -
>
> Ultimately, what I'm trying to do is either replace any run of more
> than one space with a delimiter, then split the result into a list and
> get the third item.
>
> OR, if there's a smarter, shorter, or better way of doing it, I'd love to know.
>
> The end result should pull the whole string in the middle of that
> output line, and then I can use that to compare to a list of possible
> output strings to determine if the test is still running, has
> completed successfully, or failed.
>

Is there any particular reason you absolutely must extract the status
message? If you already have a list of possible status messages, you
could just test which one of those is present in the line...
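
A minimal Python 3 sketch of that idea, for illustration only: `KNOWN_STATUSES` and `find_status` are hypothetical names, and the list holds just two phrases mentioned in this thread, not the full set smartctl can emit.

```python
# Hypothetical list of known status phrases; extend it with whatever
# messages the tool can actually produce.
KNOWN_STATUSES = ["Completed without error", "Aborted by user"]

def find_status(line, known=KNOWN_STATUSES):
    """Return the first known status phrase contained in line, else None."""
    for phrase in known:
        if phrase in line:
            return phrase
    return None

line = "# 1  Short offline       Completed without error       00%       679         -"
print(find_status(line))
```

This avoids parsing the columns at all, but it returns None for any status that is not in the list.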

> Unfortunately, my google-fu fails right now, and my Regex powers were
> always rather weak anyway...
>
> So any ideas on what the best way to proceed with this would be?

--
regards,
kushal

Steven D'Aprano

Apr 8, 2010, 3:07:26 AM

On Wed, 07 Apr 2010 21:57:31 -0700, Patrick Maupin wrote:

> On Apr 7, 9:51 pm, Steven D'Aprano
> <ste...@REMOVE.THIS.cybersource.com.au> wrote:
>
> BTW, I don't know how you got 'True' here.
>
>> >>> re.split(' {2,}', s) == [x for x in s.split('  ') if x.strip()]
>> True

It was a copy and paste from the interactive interpreter. Here it is, in
a fresh session:

[steve@wow-wow ~]$ python
Python 2.5 (r25:51908, Nov 6 2007, 16:54:01)
[GCC 4.1.2 20070925 (Red Hat 4.1.2-27)] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> import re
>>> s = '# 1 Short offline Completed without error 00%'

>>> re.split(' {2,}', s) == [x for x in s.split('  ') if x.strip()]
True
>>>

Now I copy-and-paste from your latest post to do it again:

>>> s = '# 1 Short offline Completed without error 00%'

>>> re.split(' {2,}', s) == [x for x in s.split(' ') if x.strip()]
False

Weird, huh?

And here's the answer: somewhere along the line, something changed the
whitespace in the string into non-spaces:

>>> s
'# 1 \xc2\xa0Short offline \xc2\xa0 \xc2\xa0 \xc2\xa0 Completed without error \xc2\xa0 \xc2\xa0 \xc2\xa0 00%'

I blame Google. I don't know how they did it, but I'm sure it was them!
*wink*
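
For anyone bitten by the same thing on Python 3, one possible way to guard against it is sketched below. In a Python 3 str the no-break space is the single character "\xa0" (the \xc2\xa0 pairs above are its UTF-8 bytes). The variable names are illustrative, and the sample string mirrors the repr shown above.

```python
import re

# "\xa0" is U+00A0, the no-break space: it looks like a space but is not one.
pasted = "# 1 \xa0Short offline \xa0 \xa0 \xa0 Completed without error \xa0 \xa0 \xa0 00%"

# There are never two plain spaces in a row, so this finds nothing to split on.
unsplit = re.split(r" {2,}", pasted)
print(unsplit == [pasted])

# Option 1: normalise no-break spaces to plain spaces first.
cleaned = pasted.replace("\xa0", " ")
fields = re.split(r" {2,}", cleaned)
print(fields)

# Option 2: with str patterns, \s also matches the no-break space.
fields_ws = re.split(r"\s{2,}", pasted)
print(fields_ws)
```

Option 2 also treats tabs and newlines as separators, which may or may not be what is wanted.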

By the way, let's not forget that the string could be fixed-width fields
padded with spaces, in which case the right solution almost certainly
will be:

s = '# 1  Short offline       Completed without error       00%'

result = s[25:55].rstrip()

Even in 2010, there are plenty of programs that export data using fixed
width fields.
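
If the fields really are fixed-width, the same slicing can be applied to every column at once. The Python 3 sketch below assumes column boundaries read off this one sample line (0, 5, 25 and 55); they are not taken from any smartctl documentation, and the names `COLUMNS` and `record` are made up for the example.

```python
s = "# 1  Short offline       Completed without error       00%"

# Assumed (start, end) offsets for each column, measured on the sample above.
COLUMNS = {"num": (0, 5), "test": (5, 25), "status": (25, 55)}

# Slice each column out and trim the space padding.
record = {name: s[start:end].strip() for name, (start, end) in COLUMNS.items()}
print(record["status"])
```

Slicing is only safe if the offsets stay the same for every line the tool prints.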

--
Steven

J

Apr 8, 2010, 9:49:46 AM

to Kushal Kumaran, Python List

On Thu, Apr 8, 2010 at 01:16, Kushal Kumaran
<kushal.kum...@gmail.com> wrote:
>
> Is there any particular reason you absolutely must extract the status
> message? If you already have a list of possible status messages, you
> could just test which one of those is present in the line...

Yes and no...

Mostly, it's for the future. Right now, this particular test script
(and I mean test script in the sense it's part of a testing framework,
not in the sense that I'm being tested on it ;-) ) is fully
automated.

Once the self-test on the HDD is complete, the script will return
either a 0 or 1 for PASS or FAIL respectively.

However, in the future, it may need to be changed to be handled
manually instead of (or as well as) automatically. And if we end up wanting it to be
automatic, then having that phrase would be important for logging or
problem determination. We don't so much care about the rest of the
string I want to parse as the data it gives is mostly meaningless, but
having it pull things like:

Completed: Electrical error

or

Completed: Bad Sectors Found

could be as useful as

Completed without error

or

Aborted by user

So that's why I was focusing on just extracting that phrase from the
output. I could just pull the entire string and do a search for the
phrases in question, and that's probably the simplest thing to do:

re.search("Search Phrase",outputString)

but I do have a tendency to overthink things sometimes and besides
which, having just that phrase for the logs, or for use in a future
change would be cool, and this way, I've already got that much of it
done for later on.

Grant Edwards

Apr 8, 2010, 10:13:32 AM

On 2010-04-08, Steven D'Aprano <ste...@REMOVE.THIS.cybersource.com.au> wrote:

> Even in 2010, there are plenty of programs that export data using fixed
> width fields.

If you want the columns to line up as the data changes, that's pretty
much the only way to go.

--
Grant Edwards (grant.b.edwards at gmail.com)
Yow! But was he mature enough last night at the lesbian masquerade?

Tim Chase

Apr 8, 2010, 10:23:39 AM

to Lie Ryan, pytho...@python.org

Lie Ryan wrote:
> Why am I seeing a lot of this pattern lately:
>
> OP: Got problem with string
> +- A: Suggested a regex-based solution
> +- B: Quoted "Some people ... regex ... two problems."
>
> or
>
> OP: Writes some regex, found problem
> +- A: Quoted "Some people ... regex ... two problems."
> +- B: Supplied regex-based solution, clean one
> +- A: Suggested PyParsing (or similar)

There's a spectrum of parsing solutions:

- string.split() or string[slice] notations handle simple cases
and are built-in

- regexps handle more complex parsing tasks and are also built in

- pyparsing handles far more complex parsing tasks (nesting, etc)
but isn't built-in

The above dialog tends to appear when the task isn't in the
sweet-spot of regexps. Either it's sufficiently simple that
simple split/slice notation will do, or (at the other end of the
spectrum) the effort to get it working with a regexp is hairy and
convoluted, worthy of a more readable solution implemented with
pyparsing. The problem comes from people thinking that regexps
are the right solution to *every* problem...often demonstrated by
the OP writing "how do I write a regexp to solve this
<non-regexp-optimal> problem" assuming regexps are the right tool
for everything.

There are some problem-classes for which regexps are the *right*
solution, and I don't see as much of your example dialog in those
cases.

-tkc

Lie Ryan

Apr 8, 2010, 5:24:34 PM

to Tim Chase, pytho...@python.org

On 4/9/10, Tim Chase <pytho...@tim.thechases.com> wrote:

> Lie Ryan wrote:
>> Why am I seeing a lot of this pattern lately:
>>
>> OP: Got problem with string
>> +- A: Suggested a regex-based solution
>> +- B: Quoted "Some people ... regex ... two problems."
>>
>> or
>>
>> OP: Writes some regex, found problem
>> +- A: Quoted "Some people ... regex ... two problems."
>> +- B: Supplied regex-based solution, clean one
>> +- A: Suggested PyParsing (or similar)
>

> There's a spectrum of parsing solutions:
>

<snip>

>
> The above dialog tends to appear when the task isn't in the
> sweet-spot of regexps. Either it's sufficiently simple that
> simple split/slice notation will do, or (at the other end of the
> spectrum) the effort to get it working with a regexp is hairy and
> convoluted, worthy of a more readable solution implemented with
> pyparsing. The problem comes from people thinking that regexps
> are the right solution to *every* problem...often demonstrated by
> the OP writing "how do I write a regexp to solve this
> <non-regexp-optimal> problem" assuming regexps are the right tool
> for everything.
>
> There are some problem-classes for which regexps are the *right*
> solution, and I don't see as much of your example dialog in those
> cases.

I would have agreed with you had someone made that statement up
until a few weeks ago; somehow in the last week or so, the mood about
regex seems to have shifted to a "regex is not suitable for anything"
type of mood. As soon as someone (OP or not) proposes a regex
solution, someone else retorts with "don't use regex, use
string builtins or pyparsing". It appears that the group has developed
some sense of regexphobia; some people push string builtins
for moderately complex requirements and suggest pyparsing for
not-so-complex needs, and that keeps shrinking regex's sweet spot. But
that's just my inherently subjective observation.

Dotan Cohen

Apr 8, 2010, 9:29:01 PM

to Lie Ryan, pytho...@python.org

> I would have agreed with you had someone made that statement up
> until a few weeks ago; somehow in the last week or so, the mood about
> regex seems to have shifted to a "regex is not suitable for anything"
> type of mood. As soon as someone (OP or not) proposes a regex
> solution, someone else retorts with "don't use regex, use
> string builtins or pyparsing". It appears that the group has developed
> some sense of regexphobia; some people push string builtins
> for moderately complex requirements and suggest pyparsing for
> not-so-complex needs, and that keeps shrinking regex's sweet spot. But
> that's just my inherently subjective observation.
>

Isn't that a core feature of a high-level language such as Python?
Providing the tools to perform common or difficult tasks easily
through built-in functions?

I am hard pressed to think of a situation in which a regex is
preferable to a built-in function.

--
Dotan Cohen

http://bido.com
http://what-is-what.com

Please CC me if you want to be sure that I read your message. I do not
read all list mail.

MRAB

Apr 8, 2010, 10:27:31 PM

to pytho...@python.org

Dotan Cohen wrote:
>> I would have agreed with you had someone made that statement up
>> until a few weeks ago; somehow in the last week or so, the mood about
>> regex seems to have shifted to a "regex is not suitable for anything"
>> type of mood. As soon as someone (OP or not) proposes a regex
>> solution, someone else retorts with "don't use regex, use
>> string builtins or pyparsing". It appears that the group has developed
>> some sense of regexphobia; some people push string builtins
>> for moderately complex requirements and suggest pyparsing for
>> not-so-complex needs, and that keeps shrinking regex's sweet spot. But
>> that's just my inherently subjective observation.
>>
>
> Isn't that a core feature of a high-level language such as Python?
> Providing the tools to perform common or difficult tasks easily
> through built-in functions?
>
> I am hard pressed to think of a situation in which a regex is
> preferable to a built-in function.
>

Regexes do have their uses. It's a case of knowing when they are the
best approach and when they aren't.

Dotan Cohen

Apr 8, 2010, 10:32:22 PM

to MRAB, pytho...@python.org

> Regexes do have their uses. It's a case of knowing when they are the
> best approach and when they aren't.
>

Agreed. The problems begin when the "when they aren't" is not recognised.

Lie Ryan

Apr 9, 2010, 12:48:22 AM

On 04/09/10 12:32, Dotan Cohen wrote:
>> Regexes do have their uses. It's a case of knowing when they are the
>> best approach and when they aren't.
>
> Agreed. The problems begin when the "when they aren't" is not recognised.

But problems also arise when people suggest an overly complex
series of built-in functions for what is better handled by regex.

Using built-in functions (to me at least) is not a natural way to match
strings, and makes things less understandable for anything but very
simple manipulations. Regex is like Query-by-Example (QBE) in databases:
you give an example and you get a result; you give the general pattern
and you get a match. Regex is declarative, similar to a full-blown parser,
instead of procedural like built-in functions. Regex's unsuitability for
complex parsing stems from its terseness and its inability to handle
arbitrary nesting.

People need to recognize when built-in functions aren't suitable and when
bringing forth pyparsing to parse one or two things is just overkill.

Unreasonable phobia of regex is just as harmful as overuse of it.

Patrick Maupin

Apr 9, 2010, 12:16:19 AM

On Apr 8, 9:32 pm, Dotan Cohen <dotanco...@gmail.com> wrote:
> > Regexes do have their uses. It's a case of knowing when they are the
> > best approach and when they aren't.
>
> Agreed. The problems begin when the "when they aren't" is not recognised.

Arguing against this is like arguing against motherhood and apple
pie. The same argument can validly be made for any Python construct,
any C construct, etc. This argument is so broad and vague that it's
completely meaningless. Platitudes don't help people learn how to
code. Even constant measuring of speed doesn't really help people
start learning how to code -- it just shows them that there are a lot
of OCD people in this profession.

The great thing about Python is that a lot of people, with differing
ambitions, capabilities, amounts of time to invest, and backgrounds
can pick it up and just start using it.

If somebody asks "how do I use re for this" then IMO the *best*
possible response is to tell them how to use re for this (unless
"this" is *difficult* or *impossible* to do with re, in which case you
shouldn't answer the question unless you've had your coffee and you're
in a good mood). You might also gently explain that there are other
techniques that might, in some cases, be easier to code or read. But
performance? It's all fine and dandy for the experienced coders to
discuss the finer points of different techniques (which, BTW, are
usually all predicated on using the current CPython implementation,
and might in some cases be completely wrong with one of the new JITs
under development), but you have to trust people to know their own
needs! If somebody says "this is too slow -- how do I speed it up?"
then that's really the time to strut your stuff and show that you know
how to milk the language for all it's worth. Until then, just tell
them what they want to know, perhaps with a small disclaimer that it's
probably not the most efficient or elegant or whatever way to solve
their problem. The process of learning a computer language is one of
breaking through a series of brick walls, and in many cases people
will learn things faster if you help give them the tools to get past
their mental roadblocks.

The thing that Lie and I were reacting to was the visceral "don't do
that" that seems to crop up whenever somebody asks how to do something
with re. There are a lot of good use cases for re. Arguably,
something like mxtexttools or some other low-level text processor
would be better for a few of the cases, but they're not part of the
standard library and re is.

One of the great things about Python is that a lot of useful programs
can be written just using Python and the standard library. No C, no
third-party binary libraries, etc. It's not just batteries included
-- it's everything included!

I've written C extensions, both bare, and wrapped with Pyrex, and I've
used third-party extension modules, and while that's OK, it's much
better to have some Python source code in a repository that you can
pull down to any kind of system and just RUN. And look at. And learn
from.

Many useful programs need to do text processing. Often, the built-in
string functions are sufficient. But sometimes they are not.
Discouraging somebody from learning re is doing them a disservice,
because, for the things it is really good at, it is the *only* thing
in the standard library that IS really good.
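As a minimal sketch of the kind of job re handles well, here is the original poster's problem done with re.split (Python 3 syntax). The sample line and its column spacing are assumptions, since the thread only shows the line with its whitespace collapsed, and the function name is made up for illustration.

```python
import re

# Hypothetical sample: real smartctl output pads columns with runs of
# spaces; the exact widths used here are assumed for illustration.
SAMPLE = "# 1  Short offline       Completed without error       00%       679         -"

def selftest_status(line):
    """Return the third column of a smartctl self-test log line."""
    # Split wherever two or more spaces appear; single spaces stay inside a field.
    fields = re.split(r" {2,}", line.strip())
    return fields[2]

print(selftest_status(SAMPLE))
```

Splitting on runs of two or more spaces keeps multi-word fields such as "Short offline" intact, which a plain str.split() cannot do.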

Yes, you can construct regular expressions and example texts that will
exhibit horrible worst-case performance. But there are a lot of ways
to shoot yourself in the foot performance-wise in Python (as in any
language), and most of them don't require you to use *any* library
functions, much less the dreaded re module.
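For anyone who has not seen the worst case, a small sketch (Python 3, CPython's backtracking re engine assumed). The pattern and input are textbook examples chosen for illustration, not taken from this thread, and the helper name is hypothetical.

```python
import re
import time

# Nested quantifiers: on a failing input the engine can divide a run of
# 'a's between the inner and outer '+' in exponentially many ways.
SLOW = re.compile(r"(a+)+$")
# Matches the same strings with a single quantifier, so failure is cheap.
FAST = re.compile(r"a+$")

def time_match(pattern, text):
    """Return (match_result, seconds) for pattern.match(text)."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start

# A short input keeps the demo quick; each extra 'a' roughly doubles
# the time SLOW needs before it gives up.
text = "a" * 18 + "b"
slow_result, slow_secs = time_match(SLOW, text)
fast_result, fast_secs = time_match(FAST, text)
print(slow_result, fast_result)
print("slow: %.4fs  fast: %.6fs" % (slow_secs, fast_secs))
```

Both patterns reject the input; only the time taken to find that out differs.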

Often, when I see people give advice that is (I don't want to say
"knee-jerk" because the advice usually has a good foundation) so let's
say "terse" and "unexplained" or maybe even that it is an
"admonishment", it makes me feel that perhaps the person giving the
advice doesn't really trust Python.

I don't remember where I first read it, or heard it, but one of the
core strengths of Python is how easy it is to throw away code and
replace it with something better. So, trust Python to help people get
something going, and then (if they need or want to!) to make it
better.

Just my 2 cents worth.

Pat

Steven D'Aprano

Apr 9, 2010, 4:59:43 AM4/9/10

to

On Fri, 09 Apr 2010 14:48:22 +1000, Lie Ryan wrote:

> On 04/09/10 12:32, Dotan Cohen wrote:
>>> Regexes do have their uses. It's a case of knowing when they are the
>>> best approach and when they aren't.
>>
>> Agreed. The problems begin when the "when they aren't" is not
>> recognised.
>
> But problems also arise when people suggest an overly complex
> series of built-in functions for what is better handled by regex.

What defines "overly complex"?

For some reason, people seem to have the idea that pattern matching of
strings must be a single expression, no matter how complicated the
pattern they're trying to match. If we have a complicated task to do in
almost any other field, we don't hesitate to write a function to do it,
or even multiple functions: we break our code up into small,
understandable, testable pieces. We recognise that a five-line function
may very well be less complex than a one-line expression that does the
same thing. But if it's a string pattern matching task, we somehow become
resistant to the idea of writing a function and treat one-line
expressions as "simpler", no matter how convoluted they become.
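A minimal sketch of that idea, applied to the smartctl line from the start of this thread (Python 3). It wraps the split-and-slice approach suggested earlier in a named function; the function name and the assumed column layout are illustrative, and the slice only works if the test description is exactly two words.

```python
def status_by_position(line):
    """Return the status text from one smartctl self-test log line."""
    words = line.split()  # any run of whitespace separates words
    # Assumed layout: '#', entry number, a two-word test description,
    # then the status, then three trailing columns.
    leading, trailing = 4, 3
    status_words = words[leading:len(words) - trailing]
    return " ".join(status_words)

line = "# 1 Short offline Completed without error 00% 679 -"
print(status_by_position(line))
```

Each step has a name and can be tested on its own, which is the trade being described: more lines, less to decode.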

It's as if we decided that every maths problem had to be solved by a
single expression, no matter how complex, and invented a painfully terse
language unrelated to normal maths syntax for doing so:

# Calculate the roots of sin**2(3*x-y):
result = me.compile("{^g.?+*y:h}|\Y^r&(?P:2+)|\w+(x&y)|[?#\s]").solve()

That's not to say that regexes aren't useful, or that they don't have
advantages. They are well-studied from a theoretical basis. You don't
have to re-invent the wheel: the re module provides useful pattern
matching functionality with quite good performance.

One disadvantage is that you have to learn an entire new language, a
language which is painfully terse and obfuscated, with virtually no
support for debugging. Larry Wall has criticised the Perl regex syntax on
a number of grounds:

* things which look similar often are very different;
* things which are commonly needed are long and verbose, while things
  which are rarely needed are short;
* too much reliance on too few metacharacters;
* the default is to treat whitespace around tokens as significant,
  instead of defaulting to verbose-mode for readability;
* overuse of parentheses;
* difficulty working with non-ASCII data;
* insufficient abstraction;
* even though regexes are source code in a regular expression language,
  they're treated as mere strings, even in Perl;

and many others.

http://dev.perl.org/perl6/doc/design/apo/A05.html
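On the verbose-mode point, Python's re module offers the re.VERBOSE flag, although it is not the default. A short sketch (Python 3) of the same pattern written both ways; the sample line, its spacing, and the pattern itself are assumptions made up for illustration.

```python
import re

# Hypothetical sample line with columns separated by runs of spaces.
SAMPLE = "# 1  Short offline       Completed without error       00%       679         -"

# Compact form: whitespace is significant, so nothing can be spaced out
# or annotated inside the pattern.
TERSE = re.compile(r"^#\s*(\d+)\s{2,}(.+?)\s{2,}(.+?)\s{2,}(\d+)%")

# Same pattern under re.VERBOSE: whitespace in the pattern is ignored and
# '#' starts a comment, so the literal '#' has to be escaped.
VERBOSE = re.compile(r"""
    ^\#\s*      (\d+)     # entry number
    \s{2,}      (.+?)     # test description
    \s{2,}      (.+?)     # status text
    \s{2,}      (\d+)%    # percent remaining
""", re.VERBOSE)

print(VERBOSE.match(SAMPLE).groups())
```

Both compiled patterns match the same text; the verbose one simply carries its own documentation.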

As programming languages go, regular expressions -- even Perl's regular
expressions on steroids -- are particularly low-level. They're the assembly
language of pattern matching, compared to languages like Prolog, SNOBOL
and Icon. These languages use patterns equivalent in power to Backus-Naur
Form grammars, or context-free grammars, much more powerful and readable
than regular expressions.

But in any case, not all text processing problems are pattern-matching
problems, and even those that are don't necessarily require the 30lb
sledgehammer of regular expressions.

I find it interesting to note that there is such a thing as "regex
culture", as Larry Wall describes it. There seems to be a sort of
programmers' machismo about solving problems via regexes, even when
they're not the right tool for the job, and in the fewest number of
characters possible. I think regexes have a bad reputation because of
regex culture, and not just within Python circles either:

http://echochamber.me/viewtopic.php?f=11&t=57405

For the record, I'm not talking about "Because It's There" regexes like
this 6343-character monster:

http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html

or these:

http://mail.pm.org/pipermail/athens-pm/2003-January/000033.html
http://blog.sigfpe.com/2007/02/modular-arithmetic-with-regular.html

The fact that these exist at all is amazing and wonderful. And yes, I
admire the Obfuscated C and Underhanded C contests too :)

--
Steven

Alf P. Steinbach

Apr 9, 2010, 5:12:37 AM4/9/10

to

* Steven D'Aprano:

>
> For some reason, people seem to have the idea that pattern matching of
> strings must be a single expression, no matter how complicated the
> pattern they're trying to match. If we have a complicated task to do in
> almost any other field, we don't hesitate to write a function to do it,
> or even multiple functions: we break our code up into small,
> understandable, testable pieces. We recognise that a five-line function
> may very well be less complex than a one-line expression that does the
> same thing. But if it's a string pattern matching task, we somehow become
> resistant to the idea of writing a function and treat one-line
> expressions as "simpler", no matter how convoluted they become.
>
> It's as if we decided that every maths problem had to be solved by a
> single expression, no matter how complex, and invented a painfully terse
> language unrelated to normal maths syntax for doing so:
>
> # Calculate the roots of sin**2(3*x-y):
> result = me.compile("{^g.?+*y:h}|\Y^r&(?P:2+)|\w+(x&y)|[?#\s]").solve()

http://www.youtube.com/watch?v=a9xAKttWgP4

Cheers,

- Alf

Paul Rubin

Apr 9, 2010, 6:00:36 AM4/9/10

to

Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> writes:
> One disadvantage is that you have to learn an entire new language, a
> language which is painfully terse and obfuscated, with virtually no
> support for debugging. Larry Wall has criticised the Perl regex syntax on
> a number of grounds: ...

There is a parser combinator library for Python called Pyparsing but it
is apparently dog slow. Maybe someone can do a faster one sometime.
See: http://pyparsing.wikispaces.com/ for info. I haven't used it,
but it is apparently similar in style to Parsec (a Haskell library):

http://research.microsoft.com/users/daan/download/papers/parsec-paper.pdf

I use Parsec sometimes, and it's much nicer than complicated regexps.
There is a version called Attoparsec now that is slightly less powerful
but very fast.
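To give a feel for the combinator style, here is a hand-rolled sketch in plain Python 3. It is not the API of Pyparsing, Parsec, or Attoparsec; every name in it (token, seq, many, parse_numbers) is made up for illustration, and the grammar is a toy one.

```python
import re

# Each parser is a function: (text, pos) -> (value, new_pos), or None on failure.

def token(pattern):
    """Parser that matches a regex at the current position."""
    compiled = re.compile(pattern)
    def parse(text, pos):
        m = compiled.match(text, pos)
        return (m.group(), m.end()) if m else None
    return parse

def seq(*parsers):
    """Run parsers one after another; succeed only if all of them succeed."""
    def parse(text, pos):
        values = []
        for p in parsers:
            result = p(text, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos
    return parse

def many(parser):
    """Apply parser zero or more times, collecting the results."""
    def parse(text, pos):
        values = []
        while True:
            result = parser(text, pos)
            if result is None or result[1] == pos:  # stop on failure or empty match
                return values, pos
            value, pos = result
            values.append(value)
    return parse

# Toy grammar: a comma-separated list of integers, e.g. "1, 22,333".
number = token(r"\d+")
comma = token(r"\s*,\s*")
number_list = seq(number, many(seq(comma, number)))

def parse_numbers(text):
    """Parse the whole string as a number list or raise ValueError."""
    result = number_list(text, 0)
    if result is None or result[1] != len(text):
        raise ValueError("not a number list: %r" % text)
    (first, rest), _end = result
    return [int(first)] + [int(n) for _sep, n in rest]

print(parse_numbers("1, 22,333"))
```

The grammar reads as ordinary Python built from small named pieces, with regexes used only for the individual tokens.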

Lie Ryan

Apr 9, 2010, 7:24:42 AM4/9/10

to

On 04/09/10 18:59, Steven D'Aprano wrote:
> On Fri, 09 Apr 2010 14:48:22 +1000, Lie Ryan wrote:
>
>> On 04/09/10 12:32, Dotan Cohen wrote:
>>>> Regexes do have their uses. It's a case of knowing when they are the
>>>> best approach and when they aren't.
>>>
>>> Agreed. The problems begin when the "when they aren't" is not
>>> recognised.
>>
>> But problems also arise when people suggest an overly complex
>> series of built-in functions for what is better handled by regex.
>
> What defines "overly complex"?

These discussions about the readability and suitability of regex are
orthogonal to the sub-topic I started. We are all fully aware of the
limitations of each approach. What I am complaining about is the recent
trend of people just saying no to regex when the problem is in fact in
regex's very sweet spot. We have all seen people abusing regex, but
nowadays I'm starting to see people abusing built-ins as well.

We don't like it when a regex gets convoluted, but that doesn't mean
built-ins fare much better.

Stefan Behnel

Apr 9, 2010, 7:18:56 AM4/9/10

to pytho...@python.org

Tim Chase, 08.04.2010 16:23:

> Lie Ryan wrote:
>> Why am I seeing a lot of this pattern lately:
>>
>> OP: Got problem with string
>> +- A: Suggested a regex-based solution
>> +- B: Quoted "Some people ... regex ... two problems."
>>
>> or
>>
>> OP: Writes some regex, found problem
>> +- A: Quoted "Some people ... regex ... two problems."
>> +- B: Supplied regex-based solution, clean one
>> +- A: Suggested PyParsing (or similar)
>

> There are some problem-classes for which regexps are the *right*
> solution, and I don't see as much of your example dialog in those cases.

Obviously. People rarely complain about problems that are easy to solve
with the solution at hand.

Stefan

Stefan Behnel

Apr 9, 2010, 7:27:14 AM4/9/10

to pytho...@python.org

Steven D'Aprano, 09.04.2010 10:59:

> It's as if we decided that every maths problem had to be solved by a
> single expression, no matter how complex, and invented a painfully terse
> language unrelated to normal maths syntax for doing so:
>
> # Calculate the roots of sin**2(3*x-y):
> result = me.compile("{^g.?+*y:h}|\Y^r&(?P:2+)|\w+(x&y)|[?#\s]").solve()

Actually, I would expect that the result of any mathematical calculation
can be found by applying a suitable regular expression to pi.

Stefan

Tim Chase

Apr 9, 2010, 10:54:03 AM4/9/10

to Stefan Behnel, pytho...@python.org

On 04/09/2010 06:18 AM, Stefan Behnel wrote:
> Tim Chase, 08.04.2010 16:23:
>> Lie Ryan wrote:

>>> OP: Got problem with string
>>> +- A: Suggested a regex-based solution
>>> +- B: Quoted "Some people ... regex ... two problems."
>>>
>>> or
>>>
>>> OP: Writes some regex, found problem
>>> +- A: Quoted "Some people ... regex ... two problems."
>>> +- B: Supplied regex-based solution, clean one
>>> +- A: Suggested PyParsing (or similar)
>>

>> There are some problem-classes for which regexps are the *right*
>> solution, and I don't see as much of your example dialog in those cases.
>
> Obviously. People rarely complain about problems that are easy to solve
> with the solution at hand.

Well, you still see the "Got a problem with a string" and the
"having a problem with this regex" questions, but you don't see
the remainder of the "now you have two problems" dialog.
Granted, some folks give that as a knee-jerk reaction so we just
learn to ignore their input because sometimes a regexp is exactly
the right solution ;-)

-tkc

Dotan Cohen

Apr 10, 2010, 4:52:28 PM4/10/10

to Lie Ryan, pytho...@python.org

> An unreasonable phobia of regex is just as harmful as overuse of it.
>

Agreed. I did not mean to sound as if I am against the use of regular
expressions.