Mastering Regular Expressions 2nd Ed.

David LeBlanc

Jul 19, 2002, 7:34:37 PM

Book. It's out. From O'Reilly. See
http://www.oreillynet.com/pub/a/network/2002/07/15/regexp.html

Claims extensive Python re coverage. In addition to the other usual
suspects, also covers Java, Ruby, php and .net.

Hope it's as good as the first one was/is!

David LeBlanc
Seattle, WA USA

Jeff Sandys

Jul 22, 2002, 11:11:51 AM

When will O'Reilly give us _Regular Expression Pocket Reference_?

Thanks,
Jeff Sandys

Tim Roberts

Jul 23, 2002, 1:23:02 AM

"David LeBlanc" <whi...@oz.net> wrote:

>Book. It's out. From O'Reilly. See
>http://www.oreillynet.com/pub/a/network/2002/07/15/regexp.html
>
>Claims extensive Python re coverage. In addition to the other usual
>suspects, also covers Java, Ruby, php and .net.
>
>Hope it's as good as the first one was/is!

Hear, hear. My colleagues scoffed when I bought it, thinking that the
subject surely couldn't be worthy of an entire book. They were wrong. It
is a valuable reference that I keep close by.

The multi-thousand byte full expression to match a valid RFC822 e-mail
address was practically worth the price all by itself!
--
- Tim Roberts, ti...@probo.com
Providenza & Boekelheide, Inc.

Skip Montanaro

Jul 23, 2002, 7:16:52 AM

Tim> The multi-thousand byte full expression to match a valid RFC822
Tim> e-mail address was practically worth the price all by itself!

And then they released RFC 2822 and made the book obsolete. How
inconsiderate. <wink>

S

Paul Rubin

Jul 23, 2002, 4:38:55 PM

Tim Roberts <ti...@probo.com> writes:
> The multi-thousand byte full expression to match a valid RFC822 e-mail
> address was practically worth the price all by itself!

Argggh! That sounds like regular expressions aren't really the best
way to match RFC822 addresses.

Neil Schemenauer

Jul 23, 2002, 8:05:45 PM

Paul Rubin wrote:
> Argggh! That sounds like regular expressions aren't really the best
> way to match RFC822 addresses.

Regular expressions work much better if you use them for lexical
analysis rather than for parsing.

Neil

Peter Hansen

Jul 23, 2002, 10:11:58 PM

Would you please expand on that for those of us who are not computer
scientists and/or who do not understand the implications of your
statement?

Thanks,
-Peter

Tom Harris

Jul 23, 2002, 9:24:36 PM

>Regular expressions work much better if you use them for lexical
>analysis rather than for parsing.

> Neil

Massive regular expressions are certainly difficult to maintain, and I
sometimes wonder if they are the best solution to some problems. Your
comments above seem to bear on the correct usage of them. Could you expand a
bit? Lexical analysis is tokenisation, parsing is making sense of the
tokens, is that correct? Is the moral to leave logic to the programming
language, not try to use regexes to do it?

Tom Harris, Software Engineer
Optiscan Imaging, 15-17 Normanby Rd, Notting Hill, Melbourne, Vic 3168,
Australia
email to...@optiscan.com ph +61 3 9538 3333 fax +61 3 9562 7742

Andrae Muys

Jul 24, 2002, 1:10:20 AM

Paul Rubin <phr-n...@NOSPAMnightsong.com> wrote in message news:<7xn0si9...@ruckus.brouhaha.com>...

More like matching RFC822 addresses isn't worth the price of
admission. Still, you missed the point: 822 addresses are complex
beasts, and not all valid emails are of the form \S+@\w(\.\w)*. The point
of the RFC822 regex is that, assuming the address points to a valid MX
record, the only practical way to test if an email address is valid is
to try sending email to it.
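
To make the "complex beasts" point concrete, here is a minimal sketch. It assumes a slightly loosened version of the pattern above (with `\w+` in place of `\w`, which is a guess at what was intended) and tests it against the quoted local part from the example Tim Roberts posts in this thread. The name `looks_valid` is made up for illustration.

```python
import re

# Hypothetical "obvious" pattern; the \w+ quantifiers are an assumption.
NAIVE = re.compile(r"\S+@\w+(\.\w+)*")

def looks_valid(address):
    """Return True if the whole address fits the naive pattern."""
    return NAIVE.fullmatch(address) is not None

ordinary = "tim@example.com"
quoted = '"RFC822 e-mail"@address.com'   # quoted local part containing a space

print(looks_valid(ordinary))  # True
print(looks_valid(quoted))    # False: the space defeats \S+
```

The naive pattern accepts the everyday form but rejects a quoted local part, which is one reason the full expression grows so large.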

Andrae Muys

DIG

Jul 24, 2002, 12:35:18 AM

Hi, Peter Hansen !

On Tue, Jul 23, 2002 at 10:11:58PM -0400, Peter Hansen wrote:

> Neil Schemenauer wrote:
> >
> > Regular expressions work much better if you use them for lexical
> > analysis rather than for parsing.
>
> Would you please expand on that for those of us who are not computer
> scientists and/or who do not understand the implications of your
> statement?

I am not Neil Schemenauer <n...@python.ca>, and I am not a computer scientist (sorry ! :-)), but in my opinion Tom Harris (to...@optiscan.com, Software Engineer) already answered you (Wed, Jul 24, 2002 at 11:24:36AM +1000) by asking the right questions:

* On Wed, Jul 24, 2002 at 11:24:36AM +1000, Tom Harris wrote:
*
* [...] Lexical analysis is tokenisation, parsing is making sense of the
* tokens, is that correct? Is the moral to leave logic to the programming
* language, not try to use regexes to do it? [...]

Yes, it is.

Regards,

--
DIG (Dmitri I GOULIAEV)

Roy Smith

Jul 24, 2002, 8:21:42 AM

Peter Hansen <pe...@engcorp.com> wrote:
>> Regular expressions work much better if you use them for lexical
>> analysis rather than for parsing.
>
> Would you please expand on that for those of us who are not computer
> scientists and/or who do not understand the implications of your
> statement?

In a nutshell, lexical analysis is figuring out how to break a file up
into words and symbols (generically called "tokens"), and parsing is
figuring out what those words mean. So, if I were to write:

"Quick,defenestrate him!" :-)

lexical analysis would figure out that I've got the following tokens:

1) a quotation mark
2) the word "Quick"
3) a comma
4) the word "defenestrate"
5) the word "him"
6) an exclamation mark
7) a quotation mark
8) a smiley

At this point, I still have no idea what that line means, but at least
I've broken it up into tokens that I can start to try to organize into
higher level constructs like sentences and understand what those
sentences mean. That's parsing.
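
The tokenising step above could be sketched in Python roughly as follows. The token names (SMILEY, WORD, PUNCT, SPACE) and the `tokenize` function are invented for this illustration, and the patterns only cover what the sample line needs.

```python
import re

# One alternative per token kind; order matters because the first match wins.
TOKEN_SPEC = [
    ("SMILEY", r":-\)"),
    ("WORD", r"[A-Za-z]+"),
    ("PUNCT", r'[",!]'),
    ("SPACE", r"\s+"),
]
TOKEN_RE = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))

def tokenize(text):
    """Return (kind, text) pairs in order, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError("unexpected character %r at %d" % (text[pos], pos))
        if match.lastgroup != "SPACE":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

tokens = tokenize('"Quick,defenestrate him!" :-)')
for kind, value in tokens:
    print(kind, value)
```

The regular expressions only decide where one token ends and the next begins. Working out what the eight tokens mean together is left to ordinary code that consumes the list.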

Fredrik Lundh

Jul 24, 2002, 11:07:20 AM

Tom Harris wrote:

> Massive regular expressions are certainly difficult to maintain, and I
> sometimes wonder if they are the best solution to some problems.

when in doubt, the answer is no.

for further discussion, see:

http://www.google.com/search?q=zawinski+two+problems

> Lexical analysis is tokenisation, parsing is making sense of the
> tokens, is that correct?

exactly.

> Is the moral to leave logic to the programming language, not try to
> use regexes to do it?

exactly. (why do you ask when you know the answer ;-)

for some discussion on using REs for lexical analysis, and python
to do (simple) parsing, see:

http://effbot.org/guides/xml-scanner.htm

</F>

Tim Roberts

Jul 25, 2002, 1:27:16 AM

am...@shortech.com.au (Andrae Muys) wrote:

>Paul Rubin <phr-n...@NOSPAMnightsong.com> wrote in message news:<7xn0si9...@ruckus.brouhaha.com>...
>> Tim Roberts <ti...@probo.com> writes:
>> > The multi-thousand byte full expression to match a valid RFC822 e-mail
>> > address was practically worth the price all by itself!
>>
>> Argggh! That sounds like regular expressions aren't really the best
>> way to match RFC822 addresses.
>
>More like matching RFC822 addresses isn't worth the price of
>admission.

Exactly.

"This multi-line ""behemoth""
is a perfectly valid"
<"RFC822 e-mail"@address.com(really and truly)>

I'll bet Outlook doesn't handle it...

Kristian Ovaska

Jul 25, 2002, 4:46:26 AM

Tom Harris <To...@optiscan.com>:

>Massive regular expressions are certainly difficult to maintain, and I
>sometimes wonder if they are the best solution to some problems.

It's strange that while regular languages are a small subset of real,
Turing-complete languages, it's very hard to read or write any
non-trivial regexp. The syntax is straight out of computer science
mathematical notation (with some extensions) and is not suitable for
anything complex. It's a bit like programming for the Turing machine. On
the other hand, the syntax is compact and IS suitable for simple
tasks.

I'm sure there are alternative regular languages that are more
readable, although I've never come across one. The problem of such
languages is, I guess, that since they are more verbose than this very
compact CS notation, you can't just wrap them in a string, but would
need to expand the underlying language (Python in this case).
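
One partial answer that stays inside a string is the `re.VERBOSE` flag of Python's `re` module, which lets a pattern carry whitespace and comments. The sketch below is only an illustration: the ISO-style date format and the group names are assumptions chosen for the example, not something from this thread.

```python
import re

# With re.VERBOSE, unescaped whitespace is ignored and "#" starts a comment.
ISO_DATE = re.compile(r"""
    (?P<year>  [0-9]{4} ) -    # four-digit year
    (?P<month> [0-9]{2} ) -    # two-digit month
    (?P<day>   [0-9]{2} )      # two-digit day
""", re.VERBOSE)

match = ISO_DATE.fullmatch("2002-07-25")
print(match.group("year"), match.group("month"), match.group("day"))
```

The notation is still the compact one, but each piece can at least be laid out and annotated on its own line.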

--
Kristian Ovaska <kristia...@helsinki.fi>

Fredrik Lundh

Jul 25, 2002, 1:13:27 PM

Kristian Ovaska wrote:

> I'm sure there are alternative regular languages that are more
> readable, although I've never come across one. The problem of such
> languages is, I guess, that since they are more verbose than this very
> compact CS notation, you can't just wrap them in a string, but would
> need to expand the underlying language

or you could just use the language as is, instead of forcing people
to write stuff in a really ugly sublanguage.

for some examples, see ping's rxb:

http://web.lfw.org/python/

and greg ewing's plex:

http://www.cosc.canterbury.ac.nz/~greg/python/Plex/
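
As a rough idea of what "using the language as is" can look like, here is a small sketch that builds a pattern out of plain Python functions. These helpers (`literal`, `either`, `some`, `maybe`) are hypothetical and are not the rxb or Plex API; they only show the general approach of composing named pieces.

```python
import re

# Hypothetical helpers, not the rxb or Plex API: each returns a pattern string.
def literal(text):
    return re.escape(text)

def either(*patterns):
    return "(?:" + "|".join(patterns) + ")"

def some(pattern):
    return "(?:" + pattern + ")+"

def maybe(pattern):
    return "(?:" + pattern + ")?"

digit = "[0-9]"
sign = either(literal("+"), literal("-"))
# An optionally signed decimal such as "-12.5"
number = maybe(sign) + some(digit) + maybe(literal(".") + some(digit))

NUMBER_RE = re.compile(number)
print(NUMBER_RE.fullmatch("-12.5") is not None)
print(NUMBER_RE.fullmatch("12.") is not None)
```

The result is more verbose than the raw pattern, but every part has a name and can be reused or tested separately.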

</F>

David LeBlanc

Jul 25, 2002, 1:44:05 PM

> It's strange that while regular languages are a small subset of real,
> Turing-complete languages, it's very hard to read or write any
> non-trivial regexp. The syntax is straight out of computer science
> mathematical notation (with some extensions) and is not suitable for
> anything complex. It's a bit like programming for the Turing machine. On
> the other hand, the syntax is compact and IS suitable for simple
> tasks.

<snip>

> Kristian Ovaska <kristia...@helsinki.fi>

I may be wrong about this, but I don't think regular expressions qualify as
turing complete. No branching for one thing...

Personally, I think RE's are great and ought to be fully integrated into
Python, along with a few improvements.

Dave LeBlanc
Seattle, WA USA

Michael Hudson

Jul 26, 2002, 6:06:38 AM

"David LeBlanc" <whi...@oz.net> writes:

> I may be wrong about this, but I don't think regular expressions
> qualify as turing complete. No branching for one thing...

Somewhere you can find a (perl) regexp that matches prime but not
composite numbers, which suggests Turing completeness -- and rather
takes the mickey out of the word "regular".

> Personally, I think RE's are great and ought to be fully integrated
> into Python, along with a few improvements.

I don't.

Cheers,
M.

--
ARTHUR: Yes. It was on display in the bottom of a locked filing
cabinet stuck in a disused lavatory with a sign on the door
saying "Beware of the Leopard".
-- The Hitch-Hikers Guide to the Galaxy, Episode 1

Michael Hudson

Jul 26, 2002, 6:27:50 AM

Michael Hudson <m...@python.net> writes:

> "David LeBlanc" <whi...@oz.net> writes:
>
> > I may be wrong about this, but I don't think regular expressions
> > qualify as turing complete. No branching for one thing...
>
> Somewhere you can find a (perl) regexp that matches prime but not
> composite numbers,

I found it here:

http://montreal.pm.org/tech/neil_kandalgaonkar.shtml

and think it's sufficiently clever to post a link to.

Here it is at work in Python:

>>> import re
>>> def isprime(num, prog=re.compile(r"^1?$|^(11+?)\1+$")):
...     return prog.match('1'*num) is None
...
>>> isprime(10)
0
>>> isprime(13)
1

(you can see I got the above description slightly wrong)
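
For anyone wondering why this works, here is the same trick as a standalone script, as a sketch. The number is written in unary (a string of n ones). The first alternative matches lengths 0 and 1. The second captures a block of two or more ones and requires it to repeat at least once more, so it matches exactly when n = d * k with d >= 2 and k >= 2. The name `NOT_PRIME` is made up here. On a current Python the function returns True/False rather than the 1/0 shown in the session above.

```python
import re

# Matches unary strings ("1" repeated n times) whose length n is NOT prime.
NOT_PRIME = re.compile(r"^1?$|^(11+?)\1+$")

def isprime(num):
    """Return True if num is prime, using the unary backreference trick."""
    return NOT_PRIME.match("1" * num) is None

primes = [n for n in range(30) if isprime(n)]
print(primes)  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

The backreference \1 is what takes this outside the textbook regular languages, and the backtracking it forces makes it slow for anything but small numbers.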

Cheers,
M.

--
Well, you pretty much need Microsoft stuff to get misbehaviours
bad enough to actually tear the time-space continuum. Luckily
for you, MS Internet Explorer is available for Solaris.
-- Calle Dybedahl, alt.sysadmin.recovery

Paul Rubin

Jul 26, 2002, 6:55:36 AM

Michael Hudson <m...@python.net> writes:
> > I may be wrong about this, but I don't think regular expressions
> > qualify as turing complete. No branching for one thing...
>
> Somewhere you can find a (perl) regexp that matches prime but not
> composite numbers, which suggests Turing completeness -- and rather
> takes the mickey out of the word "regular".

Formally, the languages that regular expressions describe can be
recognized by finite state machines, so there is no hope of recognizing
primes.

Perl regexps aren't really regexps because of the tricks you can play
with back-substitution etc. They're a superset of traditional regexps.
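
To show what "recognized by a finite state machine" means in practice, here is a small sketch of a three-state machine that accepts binary strings encoding multiples of 3. The table and the `accepts` function are written for this illustration. The only memory the machine has is which of the three states it is in (the value read so far, modulo 3).

```python
# Transition table of a three-state machine: state = value mod 3 so far.
# Reading a bit b turns value v into 2*v + b, so the state becomes (2*s + b) % 3.
TRANSITIONS = {
    (0, "0"): 0, (0, "1"): 1,
    (1, "0"): 2, (1, "1"): 0,
    (2, "0"): 1, (2, "1"): 2,
}

def accepts(bits):
    """Return True if the binary string encodes a multiple of 3."""
    state = 0
    for bit in bits:
        state = TRANSITIONS[(state, bit)]
    return state == 0

results = [accepts(format(n, "b")) == (n % 3 == 0) for n in range(200)]
print(all(results))
```

A fixed number of states is enough here. Deciding primality of a unary string needs more bookkeeping than any fixed table can hold, which is why it takes an extension such as backreferences.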

Jeff Epler

Jul 26, 2002, 2:25:59 PM

Michael Hudson <m...@python.net> writes:
> > Somewhere you can find a (perl) regexp that matches prime but not
> > composite numbers,
>

Michael Hudson <m...@python.net> continues:

> I found it here:
>
> http://montreal.pm.org/tech/neil_kandalgaonkar.shtml

[...]

> (you can see I got the above description slightly wrong)
>

> and think it's sufficiently clever to post a link to.
>
> Here it is at work in Python:
>
> >>> def isprime(num, prog=re.compile(r"^1?$|^(11+?)\1+$")):
> ...     return prog.match('1'*num) is None

You can use the zero-length negative lookahead assertion (?!...) to make
the RE match on primes (and not match on composites), of course.

The expression becomes
r"^(?!1?$|^(11+?)\1+$)"

This is clearly cooler than the RE that matches multiples of 3 written
in binary... I'm not sure this is it, but I think it may be.
(0|1(01*0)*1)+
of course, this RE is actually a good old-fashioned RE, so it still has
some allure.
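
Both expressions can be checked by brute force. The sketch below assumes the multiples-of-3 pattern is meant to cover the whole string (hence `fullmatch`) and compares it with ordinary arithmetic. The names `MULT3` and `PRIME` are made up for the example.

```python
import re

# The multiples-of-3 pattern; fullmatch requires the whole string to fit.
MULT3 = re.compile(r"(0|1(01*0)*1)+")
# The negative-lookahead form: matches (an empty prefix) only for primes.
PRIME = re.compile(r"^(?!1?$|^(11+?)\1+$)")

def is_multiple_of_3(n):
    return MULT3.fullmatch(format(n, "b")) is not None

def isprime(n):
    return PRIME.match("1" * n) is not None

# Any n where the regex and arithmetic disagree would show up here.
mismatches = [n for n in range(1000) if is_multiple_of_3(n) != (n % 3 == 0)]
print(mismatches)
print([n for n in range(20) if isprime(n)])
```

An empty mismatch list for 0 to 999 is evidence, not proof, that the binary pattern is the right one.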

Jeff

Kristian Ovaska

Jul 27, 2002, 5:10:00 AM

"David LeBlanc" <whi...@oz.net>:

>> It's strange that while regular languages are a small subset of real,
>> Turing-complete languages,

>I may be wrong about this, but I don't think regular expressions qualify as
>turing complete. No branching for one thing...

You're right and that's what I meant, too. "Language A is a subset of
B" means that everything you can express in A you can also express in
B, but the reverse is not necessarily true.

Regular languages are rather limited: they can't even recognize
balanced parentheses: (), (()), ((())), etc.
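
By contrast, a few lines of ordinary code handle this easily, because they can keep a counter that grows without bound. A minimal sketch (the function name `balanced` is just for illustration):

```python
def balanced(text):
    """Return True if every '(' in text is closed by a later ')'."""
    depth = 0
    for char in text:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:      # a ')' with nothing open
                return False
    return depth == 0

print(balanced("((()))"))  # True
print(balanced("(()"))     # False
```

The `depth` counter is exactly what a finite state machine lacks: it would need a separate state for every possible nesting level.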

--
Kristian Ovaska <kristia...@helsinki.fi>