
regex of the month (decade?)


Uri Guttman

Jan 7, 2008, 5:06:02 PM
to Fun with Perl

^([Pp]([Oo][Ss][Tt])?[.\s]*[Oo]([Ff][Ff][Ii][Cc][Ee])?[.\s]*[Bb][Oo]
[Xx])|[Pp][Oo]([Bb]|[Xx]|[Dd][Rr][Aa][Ww][Ee][Rr]|[Ss][Tt][Oo][Ff][Ff]
[Ii][Cc][Ee]|[ ][Bb][Xx]|[Bb][Oo][Xx])|[Pp][/][Oo]|[Bb]([Xx]|[Oo][Xx]|
[Uu][Zz][Oo][Nn])|[Aa]([Pp][Aa][Rr][Tt][Aa][Dd][Oo]|[Pp][Tt][Dd][Oo])


the challenge: itemize the stupidities. the case issue is only 1! i
don't want to even post the 'spec' unless asked for it. i saw this on
usenet today.

enjoy!!

uri

--
Uri Guttman ------ u...@stemsystems.com -------- http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org

Thomas L. Shinnick

Jan 7, 2008, 5:51:56 PM
to Fun with Perl
At 04:06 PM 1/7/2008, Uri Guttman wrote:

>^([Pp]([Oo][Ss][Tt])?[.\s]*[Oo]([Ff][Ff][Ii][Cc][Ee])?[.\s]*[Bb][Oo]
>[Xx])|[Pp][Oo]([Bb]|[Xx]|[Dd][Rr][Aa][Ww][Ee][Rr]|[Ss][Tt][Oo][Ff][Ff]
>[Ii][Cc][Ee]|[ ][Bb][Xx]|[Bb][Oo][Xx])|[Pp][/][Oo]|[Bb]([Xx]|[Oo][Xx]|
>[Uu][Zz][Oo][Nn])|[Aa]([Pp][Aa][Rr][Tt][Aa][Dd][Oo]|[Pp][Tt][Dd][Oo])
>
>
>the challenge: itemize the stupidities. the case issue is only 1! i
>don't want to even post the 'spec' unless asked for it. i saw this on
>usenet today.

Why use [ ] in one place when \s is known and used previously?

And ^ (i.e. \A) doesn't distribute across the alternatives, so
only the first alternative must match at beginning of string.
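
A minimal sketch of the anchoring point, assuming Perl 5. The two patterns are shortened stand-ins for the posted one, and the variable names and sample addresses are invented for illustration:

```perl
use strict;
use warnings;

# Alternation binds more loosely than ^, so /^foo|bar/ means /(?:^foo)|bar/.
# Shortened stand-ins for the posted pattern (hypothetical names).
my $as_posted = qr/^(p\.?o\.?\s*box)|b(x|ox|uzon)/i;
my $anchored  = qr/^(?:(p\.?o\.?\s*box)|b(x|ox|uzon))/i;

for my $addr ('PO Box 12', 'Mailbox 7', 'Inbox zero') {
    printf "%-12s posted=%d anchored=%d\n",
        $addr,
        ($addr =~ $as_posted ? 1 : 0),
        ($addr =~ $anchored  ? 1 : 0);
}
```

With the anchor applied to only the first branch, "Mailbox 7" and "Inbox zero" still match through the unanchored `b(x|ox|uzon)` branch. Wrapping the whole alternation in `(?:...)` is one way to anchor every branch, if that was the intent.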

Assuming use of (?ix) then is this the de-obfuscated equivalent?
^( P ( OST)? [.\s]* O (FFICE)? [.\s]* BOX )
|
PO ( B | X | DRAWER | STOFFICE | [ ]BX | BOX )
|
P[/]O
|
B ( X | OX | UZON )
|
A ( PARTADO | PTDO )
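
One way to check the de-obfuscated form is to run both patterns over a few inputs and compare. This is only a sketch: the sample strings are made up rather than taken from any spec, and agreement on a handful of ASCII samples is not a proof of equivalence.

```perl
use strict;
use warnings;

# The pattern as posted, with the mail line-wraps joined back together.
my $posted = qr{^([Pp]([Oo][Ss][Tt])?[.\s]*[Oo]([Ff][Ff][Ii][Cc][Ee])?[.\s]*[Bb][Oo][Xx])|[Pp][Oo]([Bb]|[Xx]|[Dd][Rr][Aa][Ww][Ee][Rr]|[Ss][Tt][Oo][Ff][Ff][Ii][Cc][Ee]|[ ][Bb][Xx]|[Bb][Oo][Xx])|[Pp][/][Oo]|[Bb]([Xx]|[Oo][Xx]|[Uu][Zz][Oo][Nn])|[Aa]([Pp][Aa][Rr][Tt][Aa][Dd][Oo]|[Pp][Tt][Dd][Oo])};

# The /ix rendering from the message above. Under /x a space inside
# a bracketed class is still literal, so "[ ]BX" keeps its meaning.
my $readable = qr{
      ^( P (OST)? [.\s]* O (FFICE)? [.\s]* BOX )
    | PO ( B | X | DRAWER | STOFFICE | [ ]BX | BOX )
    | P[/]O
    | B ( X | OX | UZON )
    | A ( PARTADO | PTDO )
}ix;

# Hypothetical sample inputs, not taken from any spec.
my @samples = ('P.O. Box 12', 'post office box', 'PO BX 7', 'p/o',
               'Apartado 5', 'inbox', 'hello', '12 Main St');

my @disagree = grep { ($_ =~ $posted ? 1 : 0) != ($_ =~ $readable ? 1 : 0) } @samples;
print @disagree ? "differs on: @disagree\n" : "both patterns agree on all samples\n";
```

Using `qr{...}` instead of slashes as the delimiter means the `/` in `P[/]O` needs no special handling.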

--
I'm a pessimist about probabilities; I'm an optimist about possibilities.
Lewis Mumford (1895-1990)

Dmitry Karasik

Jan 7, 2008, 5:43:16 PM
to Uri Guttman, Fun with Perl
Hi Uri!

On 07 Jan 08 at 23:06, "Uri" (Uri Guttman) wrote:

Uri> ^([Pp]([Oo][Ss][Tt])?[.\s]*[Oo]([Ff][Ff][Ii][Cc][Ee])?[.\s]*[Bb][Oo]
Uri> [Xx])|[Pp][Oo]([Bb]|[Xx]|[Dd][Rr][Aa][Ww][Ee][Rr]|[Ss][Tt][Oo][Ff][Ff]
Uri> [Ii][Cc][Ee]|[ ][Bb][Xx]|[Bb][Oo][Xx])|[Pp][/][Oo]|[Bb]([Xx]|[Oo][Xx]|
Uri> [Uu][Zz][Oo][Nn])|[Aa]([Pp][Aa][Rr][Tt][Aa][Dd][Oo]|[Pp][Tt][Dd][Oo])


I've observed this pattern in FreeBSD startup (shell) scripts.
I do realize that shell is not perl, but nevertheless the style
still baffles me: I cannot decide whether this is extraordinary
stupidity or extraordinary wisdom :)
For example, /etc/rc.firewall on 6.2-stable has this:

case ${firewall_type} in
[Oo][Pp][Ee][Nn]|[Cc][Ll][Ii][Ee][Nn][Tt])
case ${natd_enable} in
[Yy][Ee][Ss])


--
Sincerely,
Dmitry Karasik

Dan Collins

Jan 7, 2008, 5:52:04 PM
to f...@perl.org

I don't even want to know what that's supposed to do.
First, and most obviously, that should use /i. That gives us:
^(p(ost)?[.\s]*o(ffice)?[.\s]*box)|po(b|x|drawer|stoffice|\sbx|box)|p/o|b(x|ox|uzon)|a(partado|ptdo)


Last I checked, '.' matches \s. Oh, and '/' is a pretty important
character, and should really be escaped.
^(p(ost)?.*o(ffice)?.*box)|po(b|x|drawer|stoffice|\sbx|box)|p\/o|b(x|ox|uzon)|a(partado|ptdo)


Unless I'm horribly mistaken, that can be simplified incredibly to
^(p(ost)?.*o(ffice)?.*box)|po(b|x|drawer|stoffice|\sbx)|p\/o|b(o?x|uzon)|a(partado|ptdo)
or
^(p(ost)?.*o(ffice)?.*)(b|x|drawer)|p\/o|b(o?x|uzon)|a(partado|ptdo)

In total, I count 9 errors and 213 characters removed. Though, I can't
count, so that may be wrong. Did I miss anything?

Keith Ivey

Jan 7, 2008, 6:26:22 PM
to f...@perl.org
Dan Collins wrote:
> Last I checked, '.' matches \s. Oh, and '/' is a pretty important
> character, and should really be escaped.

No, '.' inside [] matches '.'. The [.\s] is looking for a period or
whitespace.

And there's no reason to escape '/' if you're not using slashes for the
regex delimiters. If you want to escape it because you make a habit of
doing so in regexes, that's fine, but it's not an error not to.
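
Both points can be sketched in a few lines of Perl. The patterns are cut down to the first branch of the original, and the sample strings are invented for illustration:

```perl
use strict;
use warnings;

my $class_version = qr/^p(ost)?[.\s]*o(ffice)?[.\s]*box/i;  # periods or whitespace only
my $dot_version   = qr/^p(ost)?.*o(ffice)?.*box/i;          # any characters except newline

for my $s ('P.O. Box 9', 'post-office box', 'pick out a box') {
    printf "%-16s class=%d dot=%d\n", $s,
        ($s =~ $class_version ? 1 : 0), ($s =~ $dot_version ? 1 : 0);
}

# With a delimiter other than "/", a bare slash needs no escaping.
print "slash ok\n" if 'P/O' =~ m{p/o}i;
```

Only the `.*` version accepts "post-office box" and "pick out a box", which is why swapping `[.\s]*` for `.*` changes what the pattern means rather than simplifying it.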

--
Keith C. Ivey <ke...@iveys.org>
Washington, DC

Uri Guttman

Jan 7, 2008, 6:33:43 PM
to Dan Collins, f...@perl.org
>>>>> "DC" == Dan Collins <en.wp...@gmail.com> writes:


DC> Last I checked, '.' matches \s. Oh, and '/' is a pretty important
DC> character, and should really be escaped.

nope. . doesn't match \n which is part of \s. you need the /s modifier
to make . match \n (and then of course \s).
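
A small sketch of the newline behaviour, assuming Perl 5 defaults (the test string and hash name are made up here):

```perl
use strict;
use warnings;

my $text = "a\nb";

my %result = (
    'a.b'      => ($text =~ /a.b/      ? 1 : 0),  # . skips newline by default
    'a.b /s'   => ($text =~ /a.b/s     ? 1 : 0),  # /s lets . match newline
    'a\sb'     => ($text =~ /a\sb/     ? 1 : 0),  # \s includes newline
    'a[.\s]b'  => ($text =~ /a[.\s]b/  ? 1 : 0),  # class: literal period or whitespace
);
printf "%-8s %d\n", $_, $result{$_} for sort keys %result;
```

Only the plain `/a.b/` fails to match here.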

DC> In total, I count 9 errors and 213 characters removed. Though, I can't
DC> count, so that may be wrong. Did I miss anything?

i dunno. i can't figure it out either. that is why i posted it here! :)

Meryll Larkin

Jan 7, 2008, 8:40:24 PM
to Uri Guttman, Fun with Perl
I think I would rewrite
[Oo]([Ff][Ff][Ii][Cc][Ee])?
as
([Oo][Ff][Ff][Ii][Cc][Ee])?

but, pardon my ignorance, won't an "i" at the end of the regex make it catch
either case?
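
On the /i question: yes, the modifier makes the whole pattern case-insensitive. The proposed rewrite is a separate matter, and the sketch below suggests it is not equivalent, because the original requires an "o" while the rewrite makes the entire word optional. Patterns are cut down to the first branch and written with /i; the names and sample strings are invented for illustration.

```perl
use strict;
use warnings;

my $original = qr/^p(ost)?[.\s]*o(ffice)?[.\s]*box/i;   # "o" is required
my $proposed = qr/^p(ost)?[.\s]*(office)?[.\s]*box/i;   # whole word optional

for my $s ('Post Office Box', 'P.O. Box', 'P. Box') {
    printf "%-16s original=%d proposed=%d\n", $s,
        ($s =~ $original ? 1 : 0), ($s =~ $proposed ? 1 : 0);
}
```

The two differ in both directions: "P.O. Box" matches only the original, and "P. Box" matches only the rewrite.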

Meryll

Steven R. Stoll

Jan 8, 2008, 9:59:04 AM
to Uri Guttman, Fun with Perl
After solving the case sensitivity issue, separating the alternations, and
solving the un-escaped /, here is what we are left with.

(p(ost)?[.\s]*o(ffice)?[.\s]*box)

po(b|x|drawer|stoffice|[ ]bx|box)
p[\/]o


b(x|ox|uzon)
a(partado|ptdo)

Which matches:
(p(ost)?.*o(ffice)?.*box)

post(anynumberofanythingexceptnewline)office(anynumberofanythingexceptnewline)box
p(anynumberofanythingexceptnewline)office(anynumberofanythingexceptnewline)box
post(anynumberofanythingexceptnewline)o(anynumberofanythingexceptnewline)box
p(anynumberofanythingexceptnewline)o(anynumberofanythingexceptnewline)box


po(b|x|drawer|stoffice|[ ]bx|box)

pob
pox
podrawer
postoffice
po bx
pobox


p[\/]o

p/o


b(x|ox|uzon)

bx
box
buzon


a(partado|ptdo)
apartado
aptdo


I can't imagine what the original specs were, but it looks like a patch job
gone awry.

Steve

Keith Ivey

Jan 8, 2008, 10:26:40 AM
to Fun with Perl
Stoll, Steven R. wrote:
> (p(ost)?[.\s]*o(ffice)?[.\s]*box)
> po(b|x|drawer|stoffice|[ ]bx|box)
> p[\/]o
> b(x|ox|uzon)
> a(partado|ptdo)
>
> Which matches:
> (p(ost)?.*o(ffice)?.*box)
>
> post(anynumberofanythingexceptnewline)office(anynumberofanythingexceptnewline)box

'[.\s]*' matches any number of periods or whitespace characters, since
'.' is not special inside a character class. It's not the same as '.*'.
Also, even if '.' were special, '\s' matches newline along with other
whitespace characters.

Steven R. Stoll

Jan 8, 2008, 11:25:05 AM
to Keith Ivey, Fun with Perl
You're right. Mistook it for (.\s) for some reason. My description of .*
still stands however.

But the following should be:


(p(ost)?[.\s]*o(ffice)?[.\s]*box)

post(anynumberofperiodsorspacecharacterclassitems)office(anynumberofperiodsorspacecharacterclassitems)box

p(anynumberofperiodsorspacecharacterclassitems)o(anynumberofperiodsorspacecharacterclassitems)box

p(anynumberofperiodsorspacecharacterclassitems)office(anynumberofperiodsorspacecharacterclassitems)box

post(anynumberofperiodsorspacecharacterclassitems)o(anynumberofperiodsorspacecharacterclassitems)box

Steve



David Landgren

Jan 8, 2008, 5:08:50 PM
to Stoll, Steven R., Uri Guttman, Fun with Perl
Stoll, Steven R. wrote:
> After solving the case sensitivity issue, separating the alternations, and
> solving the un-escaped /, here is what we are left with.
>
> (p(ost)?[.\s]*o(ffice)?[.\s]*box)
> po(b|x|drawer|stoffice|[ ]bx|box)
> p[\/]o
> b(x|ox|uzon)
> a(partado|ptdo)

If we unroll that to

post[.\s]*o(ffice)?[.\s]*box
p[.\s]*o(ffice)?[.\s]*box


pob
pox
podrawer
postoffice
po[ ]bx
pobox
p[\/]o

bx
box
buzon
apartado
aptdo

reassembling it, we obtain

(?:p(?:o(?:st(?:[.\s]*o(ffice)?[.\s]*box|office)|(?: b)?x|b(?:ox)?|drawer)|[.\s]*o(ffice)?[.\s]*box|\/o)|ap(?:arta|t)do|b(?:uzon|o?x))

I'm happy that they thought to check for 'pox' as a shorthand for
postoffice box. I'll remember to use that next time I need to address
such a letter.
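
A quick sanity check of the reassembled pattern, as a sketch only. It is written here on one line with /i added, and each literal is one sample spelling per unrolled alternative, chosen for illustration:

```perl
use strict;
use warnings;

# The reassembled pattern from the message above, on one line, with /i added.
my $reassembled = qr{(?:p(?:o(?:st(?:[.\s]*o(ffice)?[.\s]*box|office)|(?: b)?x|b(?:ox)?|drawer)|[.\s]*o(ffice)?[.\s]*box|\/o)|ap(?:arta|t)do|b(?:uzon|o?x))}i;

# One literal string per unrolled alternative (sample spellings chosen here).
my @literals = ('post office box', 'p.o. box', 'pob', 'pox', 'podrawer',
                'postoffice', 'po bx', 'pobox', 'p/o', 'bx', 'box',
                'buzon', 'apartado', 'aptdo');

# Require a whole-string match so a short branch cannot hide a broken long one.
my @missed = grep { $_ !~ /\A$reassembled\z/ } @literals;
print @missed ? "missed: @missed\n" : "all unrolled forms match\n";
```

This checks only that nothing from the unrolled list was lost. It does not show that the reassembled pattern rejects everything the original rejected, and the `^` anchor from the first branch of the original is not part of the unrolled list.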

David

madd...@free.fr

Jan 10, 2008, 9:32:19 PM
to f...@perl.org
Dan Collins wrote:

> On Jan 7, 2008 5:06 PM, Uri Guttman <u...@stemsystems.com> wrote:
>
>> ^([Pp]([Oo][Ss][Tt])?[.\s]*[Oo]([Ff][Ff][Ii][Cc][Ee])?[.\s]*[Bb][Oo]
>> [Xx])|[Pp][Oo]([Bb]|[Xx]|[Dd][Rr][Aa][Ww][Ee][Rr]|[Ss][Tt][Oo][Ff][Ff]
>> [Ii][Cc][Ee]|[ ][Bb][Xx]|[Bb][Oo][Xx])|[Pp][/][Oo]|[Bb]([Xx]|[Oo][Xx]|
>> [Uu][Zz][Oo][Nn])|[Aa]([Pp][Aa][Rr][Tt][Aa][Dd][Oo]|[Pp][Tt][Dd][Oo])
>>
>> the challenge: itemize the stupidities. the case issue is only 1! i
>> don't want to even post the 'spec' unless asked for it. i saw this on
>> usenet today.
>

> I don't even want to know what that's supposed to do.
> First, and most obviously, that should use /i.

You can also find code like this in HTML::Template. Sam Tregar
explained the reason in the FAQ:

Q: Why do you use /[Tt]/ instead of /t/i? It's so ugly!

A: Simple - the case-insensitive match switch is very
inefficient. According to "Mastering Regular Expressions"
from O'Reilly Press, /[Tt]/ is faster and more space
efficient than /t/i - by as much as double against long
strings. //i essentially does a lc() on the string and
keeps a temporary copy in memory.

When this changes, and it is in the 5.6 development series,
I will gladly use //i. Believe me, I realize [Tt] is hideously
ugly.

» http://search.cpan.org/dist/HTML-Template/Template.pm#FREQUENTLY_ASKED_QUESTIONS


--
Sébastien Aperghis-Tramoni

Close the world, txEn eht nepO.

Michael G Schwern

Jan 10, 2008, 10:50:26 PM
to Sébastien Aperghis-Tramoni, f...@perl.org
Sébastien Aperghis-Tramoni wrote:
> Q: Why do you use /[Tt]/ instead of /t/i? It's so ugly!
>
> A: Simple - the case-insensitive match switch is very
> inefficient. According to "Mastering Regular Expressions"
> from O'Reilly Press, /[Tt]/ is faster and more space
> efficient than /t/i - by as much as double against long
> strings. //i essentially does a lc() on the string and
> keeps a temporary copy in memory.
>
> When this changes, and it is in the 5.6 development series,
> I will gladly use //i. Believe me, I realize [Tt] is hideously
> ugly.

Looks like it was painfully true for 5.5...

$ time perl5.5.5 -wle '$foo = "x" x 10000; $foo .= "T"; $foo =~ /[Tt]/ for 1..100000'

real 0m4.882s
user 0m4.761s
sys 0m0.026s

$ time perl5.5.5 -wle '$foo = "x" x 10000; $foo .= "T"; $foo =~ /t/i for 1..100000'

real 0m40.656s
user 0m39.587s
sys 0m0.149s


And the reverse is now true in this highly inaccurate test...

$ time perl5.8.8 -wle '$foo = "x" x 10000; $foo .= "T"; $foo =~ /[Tt]/ for 1..100000'

real 0m5.732s
user 0m5.565s
sys 0m0.027s

$ time perl5.8.8 -wle '$foo = "x" x 10000; $foo .= "T"; $foo =~ /t/i for 1..100000'

real 0m2.589s
user 0m2.544s
sys 0m0.015s
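
For anyone wanting to repeat the comparison on their own perl, here is a sketch using the core Benchmark module instead of `time`. The labels are arbitrary, and results will vary by perl version and machine, so no numbers are claimed:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Same shape as the one-liners above: 10,000 "x" characters, then a "T".
my $foo = ('x' x 10_000) . 'T';

# Run each match for at least one CPU second and print a comparison table.
cmpthese(-1, {
    'class [Tt]' => sub { $foo =~ /[Tt]/ },
    'flag  /t/i' => sub { $foo =~ /t/i },
});
```

A negative count tells `cmpthese` to run each sub for at least that many CPU seconds, which avoids guessing an iteration count.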

--
<Schwern> What we learned was if you get confused, grab someone and swing
them around a few times
-- Life's lessons from square dancing

David Landgren

Jan 11, 2008, 9:01:58 AM
to Michael G Schwern, Sébastien Aperghis-Tramoni, f...@perl.org
Michael G Schwern wrote:

> And the reverse is now true in this highly inaccurate test...
>
> $ time perl5.8.8 -wle '$foo = "x" x 10000; $foo .= "T"; $foo =~ /[Tt]/ for 1..100000'
>
> real 0m5.732s
> user 0m5.565s
> sys 0m0.027s
>
> $ time perl5.8.8 -wle '$foo = "x" x 10000; $foo .= "T"; $foo =~ /t/i for 1..100000'
>
> real 0m2.589s
> user 0m2.544s
> sys 0m0.015s

And if I recall my perl510delta correctly, /i should be even faster on
5.10.0. No, hang on, it's when UTF-8 strings are involved.

% time perl5.8.8 -Mutf8 -Mcharnames=:full -wle '$foo = "e\N{GREEK SMALL LETTER BETA}" x 5000; $foo .= "T"; $foo =~ /t/i for 1..1000'

real 0m22.855s
user 0m22.827s
sys 0m0.016s

% ./perl -v
This is perl, v5.10.0 DEVEL32604 built for i386-freebsd-thread-multi

% time ./perl -Ilib -Mutf8 -Mcharnames=:full -wle '$foo = "e\N{GREEK SMALL LETTER BETA}" x 5000; $foo .= "T"; $foo =~ /t/i for 1..1000'

real 0m22.957s
user 0m22.948s
sys 0m0.001s

Well, look on the bright side. It's no worse.

The benchmark may be flawed, since my appreciation of Unicode is little
more than "things went downhill after 7-bit ASCII".

David

Chris Dolan

Jan 12, 2008, 5:51:05 PM
to David Landgren, Fun With Perl
On Jan 11, 2008, at 8:01 AM, David Landgren wrote:

> The benchmark may be flawed, since my appreciation of Unicode is
> little more than "things went downhill after 7-bit ASCII".

Haven't I read that you live in Paris? I figured that anyone who
lives in a country whose dominant language was not fully expressible
in ASCII would love Unicode.

On a major tangent, have others noticed the resurgence of the umlaut
in printed English? I keep seeing things like coöperation or
coördinates -- particularly in Technology Review, but in other
publications on occasion too. Is that because it's *supposed* to be
spelled that way, but ASCII and the typewriter have suppressed that
spelling for my lifetime?

Chris

Yanick Champoux

Jan 12, 2008, 6:50:52 PM
to Chris Dolan, Fun With Perl
Chris Dolan wrote:
> On a major tangent, have others noticed the resurgence of the umlaut
> in printed English? I keep seeing things like coöperation or
> coördinates -- particularly in Technology Review, but in other
> publications on occasion too. Is that because it's *supposed* to be
> spelled that way, but ASCII and the typewriter have suppressed that
> spelling for my lifetime?
>

A quick use of Google-fu unearthed a blog entry
http://www.dwelle.org/archives/2007/01/05/whats-with-all-the-umlauts/,
which in turn pointed to the page
http://ourworld.compuserve.com/homepages/profirst/d.htm that says:

*dieresis* or *diæresis *A diacritical mark (* ¨ *) optionally used in
English, oftentimes replaced by a hyphen. In English, the dieresis is
used on a second identical vowel to indicate a change in pronunciation
of that vowel or indicate it is pronounced in a separate syllable. It is
sometimes referred to as an « umlaut » when used with a single character
or in a « diphthong. » Examples: reëlecting, reëncoding, coöperation,
coördination.

Well I, for one, never knew that such a thing existed. Neato! Too
bad about the name of the mark, though, which is definitely unfortunate.


Joÿ,
`/anick

Georg Moritz

Jan 12, 2008, 6:23:35 PM
to Chris Dolan, Fun With Perl
From the keyboard of Chris Dolan [12.01.08,16:51]:

Well, that's sort of quotemeta for the double o - differentiating e.g.
double-o usage in cool vs. cooperation. I haven't seen that usage in
english yet, but it's used in spanish to mark a vowel as literal, e.g. in
"Parque Güell".

0--gg-

--
_($_=" "x(1<<5)."?\n".q·/)Oo. G°\ /
/\_¯/(q /
---------------------------- \__(m.====·.(_("always off the crowd"))."·
");sub _{s,/,($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e,e && print}

Dave Mitchell

Jan 12, 2008, 6:44:48 PM
to Georg Moritz, Chris Dolan, Fun With Perl
On Sun, Jan 13, 2008 at 12:23:35AM +0100, Georg Moritz wrote:
> Well, that's sort of quotemeta for the double o - differentiating e.g.
> double-o usage in cool vs. cooperation. I haven't seen that usage in
> english yet, but it's used in spanish to mark a vowel as literal, e.g. in
> "Parque Güell".

The only English word I think it's commonly seen with is naïve,
to indicate that the ai isn't a digraph.


--
"But Sidley Park is already a picture, and a most amiable picture too.
The slopes are green and gentle. The trees are companionably grouped at
intervals that show them to advantage. The rill is a serpentine ribbon
unwound from the lake peaceably contained by meadows on which the right
amount of sheep are tastefully arranged." -- Lady Croom, "Arcadia"

Georg Moritz

Jan 12, 2008, 6:50:11 PM
to Yanick Champoux, Fun With Perl
From the keyboard of Yanick Champoux [12.01.08,18:50]:

> Chris Dolan wrote:
> > On a major tangent, have others noticed the resurgence of the umlaut in
> > printed English? I keep seeing things like coöperation or coördinates --
> > particularly in Technology Review, but in other publications on occasion
> > too. Is that because it's *supposed* to be spelled that way, but ASCII and
> > the typewriter have suppressed that spelling for my lifetime?
> >
>
> A quick use of Google-fu unearthed a blog entry
> http://www.dwelle.org/archives/2007/01/05/whats-with-all-the-umlauts/, which
> in turn pointed to the page
> http://ourworld.compuserve.com/homepages/profirst/d.htm that says:
>
> *dieresis* or *diæresis *A diacritical mark (* ¨ *) optionally used in
> English, oftentimes replaced by a hyphen. In English, the dieresis is used on
> a second identical vowel to indicate a change in pronunciation of that vowel
> or indicate it is pronounced in a separate syllable. It is sometimes referred
> to as an « umlaut » when used with a single character or in a « diphthong. »
> Examples: reëlecting, reëncoding, coöperation, coördination.

Actually the term "umlaut" in german denotes a "shifted" vowel. If you do
a transition from "u" -> "e" biased towards "i" and stopping in the middle,
you have the "ü", which can be written as diphthong also: "ue". The "e" in
"ue" was often placed above the "u" in old german writing (where the "e"
was written like "n", but with a sharp bend instead of a curve before the
last falling stroke). The four strokes necessary for that "e" were reduced
to two, and those to dots, hence the two points above the "ü".

So, the umlaut is a shortened form of a "diphthongy" denoting a shifted vowel,
and *not* a diaeresis ("ue" is not a diphthong, but an umlaut ;-)

Ö--gg-

Keith Ivey

Jan 12, 2008, 8:05:06 PM
to Fun With Perl
Chris Dolan wrote:
> Haven't I read that you live in Paris? I figured that anyone who lives
> in a country whose dominant language was not fully expressible in ASCII
> would love Unicode.

"Not fully expressible" seems mild to apply to writing French in ASCII
(which after all has no diacritics). The phrase seems more appropriate
for writing French in ISO-8859-1 (because of the lack of "oe" ligature).

Aristotle Pagaltzis

Jan 13, 2008, 2:21:36 AM
to f...@perl.org
* Chris Dolan <ch...@chrisdolan.net> [2008-01-12 23:55]:

> I figured that anyone who lives in a country whose dominant
> language was not fully expressible in ASCII would love Unicode.

For bonus points, try writing, say, German (fully expressible
with an ISO-8859 charset) and Greek (fully expressible[1] with
an ISO-8859 charset) in the same document.

[1]: Well, Modern Greek anyway.

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>

Eugene van der Pijll

Jan 13, 2008, 5:12:22 AM
to Yanick Champoux, Chris Dolan, Fun With Perl
Yanick Champoux schreef:
> *dieresis* or *diæresis

>
> Well I, for one, never knew that such a thing existed. Neato! Too
> bad about the name of the mark, though, which is definitely unfortunate.

According to the infallible Wikipedia, this diacritic is also called a
trema, but only when used as a separation mark, not as an umlaut.

HTH

Eugene

Michael G Schwern

Jan 13, 2008, 12:16:21 PM
to Yanick Champoux, Chris Dolan, Fun With Perl
Yanick Champoux wrote:
> Chris Dolan wrote:
>> On a major tangent, have others noticed the resurgence of the umlaut
>> in printed English? I keep seeing things like coöperation or
>> coördinates -- particularly in Technology Review, but in other
>> publications on occasion too. Is that because it's *supposed* to be
>> spelled that way, but ASCII and the typewriter have suppressed that
>> spelling for my lifetime?
>>
>
> A quick use of Google-fu unearthed a blog entry
> http://www.dwelle.org/archives/2007/01/05/whats-with-all-the-umlauts/,
> which in turn pointed to the page
> http://ourworld.compuserve.com/homepages/profirst/d.htm that says:
>
> *dieresis* or *diæresis *A diacritical mark (* ¨ *) optionally used in
> English, oftentimes replaced by a hyphen. In English, the dieresis is
> used on a second identical vowel to indicate a change in pronunciation
> of that vowel or indicate it is pronounced in a separate syllable. It is
> sometimes referred to as an « umlaut » when used with a single character
> or in a « diphthong. » Examples: reëlecting, reëncoding, coöperation,
> coördination.

Because, ya know, I always get confused about how to pronounce "reelecting".
Thank god they cleared that right up! And with an easy to understand symbol
that everyone knows about! :P

Really they just want to be more metal. Soon it will be ¡KömpUsërV.DøøM!


--
There will be snacks.

Yanick Champoux

Jan 13, 2008, 7:53:56 PM
to Georg Moritz, Fun With Perl
Georg Moritz wrote:
> From the keyboard of Yanick Champoux [12.01.08,18:50]:
>
>> *dieresis* or *diæresis *A diacritical mark (* ¨ *) optionally used in
>> English, oftentimes replaced by a hyphen. In English, the dieresis is used on
>> a second identical vowel to indicate a change in pronunciation of that vowel
>> or indicate it is pronounced in a separate syllable. It is sometimes referred
>> to as an « umlaut » when used with a single character or in a « diphthong. »
>> Examples: reëlecting, reëncoding, coöperation, coördination.
>>
>
> Actually the term "umlaut" in german denotes a "shifted" vowel. If you do
> a transition from "u" -> "e" biased towards "i" and stopping in the middle,
> you have the "ü", which can be written as diphthong also: "ue". The "e" in
> "ue" was often placed above the "u" in old german writing (where the "e"
> was written like "n", but with a sharp bend instead of a curve before the
> last falling stroke). The four strokes necessary for that "e" were reduced
> to two, and those to dots, hence the two points above the "ü".
>
> So, the umlaut is a shortened form of a "diphthongy" denoting a shifted vowel,
> and *not* a diaeresis ("ue" is not a diphthong, but an umlaut ;-)
>

Yup, I became aware of that yesterday when I shared my new golden
nugget of trivia with my wife. Being both German *and* a linguist, she
promptly corrected my inaccuracies. But she left out the part about the
'e' mutating into the two little dots that we know and love nowadays,
which I think is the coolest part of the whole story, so thanks for
that! :-)


Joy,
`/anick

Yanick Champoux

Jan 13, 2008, 8:15:52 PM
to Michael G Schwern, Chris Dolan, Fun With Perl
Michael G Schwern wrote:
> > Yanick Champoux wrote:
>
> > *dieresis* or *diæresis [..]

>
> Really they just want to be more metal. Soon it will be ¡KömpUsërV.DøøM!
>

And to think that, all those years, I laughed at Spinal Tap's claim
to be avant-garde and visionary geniuses. They were right. All that
time, they were right...

Joÿ,
`/anick

John Douglas Porter

Jan 13, 2008, 10:12:06 PM
to Fun With Perl

From the keyboard of Yanick Champoux [12.01.08,18:50]:
> *dieresis* or *diæresis *A diacritical mark (* ¨ *) optionally used in
> English, oftentimes replaced by a hyphen. In English, the dieresis is used on
> a second identical vowel to indicate a change in pronunciation of that vowel
> or indicate it is pronounced in a separate syllable. It is sometimes referred
> to as an « umlaut » when used with a single character or in a « diphthong. »
> Examples: reëlecting, reëncoding, coöperation, coördination.

I want to clarify (only because I myself was confused at first)
that an umlaut can be used IN a diphthong, but does not have
any function in MAKING a diphthong. For example, the
German diphthong "au" becomes "äu" due to umlaut, (or "vowel
shifting").

Unless I am mistaken.

--
John Douglas Porter



Sebb

Jan 14, 2008, 7:21:56 AM
to Fun With Perl
On 14/01/2008, John Douglas Porter <johnd...@yahoo.com> wrote:
>
> From the keyboard of Yanick Champoux [12.01.08,18:50]:
> > *dieresis* or *diæresis *A diacritical mark (* ¨ *) optionally used in
> > English, oftentimes replaced by a hyphen. In English, the dieresis is used on
> > a second identical vowel to indicate a change in pronunciation of that vowel
> > or indicate it is pronounced in a separate syllable. It is sometimes referred
> > to as an « umlaut » when used with a single character or in a « diphthong. »
> > Examples: reëlecting, reëncoding, coöperation, coördination.
>

Also naïf and naïve - non-identical vowels.

> I want to clarify (only because I myself was confused at first)
> that an umlaut can be used IN a diphthong, but does not have
> any function in MAKING a diphthong. For example, the
> German diphthong "au" becomes "äu" due to umlaut, (or "vowel
> shifting").
>

In summary, umlaut and dieresis/diaeresis (also trema) are both
diacritical marks.

Same symbol, but different meaning, where the meaning of the symbol
depends on the context.

The rule seems to be: second vowel of a pair=dieresis, otherwise umlaut.

Any counter-examples?

> Unless I am mistaken.

Same here...

Keith Ivey

Jan 14, 2008, 11:16:23 AM
to sebb, Fun With Perl
sebb wrote:
> The rule seems to be: second vowel of a pair=dieresis, otherwise umlaut.

I'd call the symbol in "Brontë" a dieresis, not an umlaut. Maybe: when
the symbol indicates the vowel is to be pronounced further forward in
the mouth, it's an umlaut; when it indicates the vowel is to be
pronounced on its own in a normal way, it's a dieresis.

Georg Moritz

Jan 14, 2008, 11:16:42 AM
to sebb, Fun With Perl
From the keyboard of sebb [14.01.08,12:21]:

> On 14/01/2008, John Douglas Porter <johnd...@yahoo.com> wrote:
> >
> > From the keyboard of Yanick Champoux [12.01.08,18:50]:
> > > *dieresis* or *diæresis *A diacritical mark (* ¨ *) optionally used in
> > > English, oftentimes replaced by a hyphen. In English, the dieresis is used on
> > > a second identical vowel to indicate a change in pronunciation of that vowel
> > > or indicate it is pronounced in a separate syllable. It is sometimes referred
> > > to as an « umlaut » when used with a single character or in a « diphthong. »
> > > Examples: reëlecting, reëncoding, coöperation, coördination.
> >
>
> Also naïf and naïve - non-identical vowels.
>
> > I want to clarify (only because I myself was confused at first)
> > that an umlaut can be used IN a diphthong, but does not have
> > any function in MAKING a diphthong. For example, the
> > German diphthong "au" becomes "äu" due to umlaut, (or "vowel
> > shifting").

Correct (unless I'm mistaken too ;-)

> In summary, umlaut and dieresis/diaeresis (also trema) are both
> diacritical marks.
>
> Same symbol, but different meaning, where the meaning of the symbol
> depends on the context.
>
> The rule seems to be: second vowel of a pair=dieresis, otherwise umlaut.
>
> Any counter-examples?

yup, two examples:

German:
"geärgert" (been angry)
- here the second vowel is an umlaut

Quenya - the elven-tongue:
"ámen anta síra ilaurëa massamma" (give us today the daily our-bread)
- here the diaeresis is noted on the first vowel at ëa

;-)

0--gg-

> > Unless I am mistaken.
>
> Same here...
>

Paul Johnson

Jan 14, 2008, 2:15:13 PM
to Georg Moritz, sebb, Fun With Perl
On Mon, Jan 14, 2008 at 05:16:42PM +0100, Georg Moritz wrote:

> From the keyboard of sebb [14.01.08,12:21]:

> > The rule seems to be: second vowel of a pair=dieresis, otherwise umlaut.
> >
> > Any counter-examples?
>
> yup, two examples:
>
> German:
> "geärgert" (been angry)
> - here the second vowel is an umlaut
>
> Quenya - the elven-tongue:
> "ámen anta síra ilaurëa massamma" (give us today the daily our-bread)
> - here the diaeresis is noted on the first vowel at ëa

Don't forget the burning of the Böögg in Zürich at Sächsilüüte.
http://de.wikipedia.org/wiki/Sechsel%C3%A4uten

--
Paul Johnson - pa...@pjcj.net
http://www.pjcj.net

David Landgren

Jan 14, 2008, 5:42:08 PM
to Chris Dolan, Fun With Perl
Chris Dolan wrote:
> On Jan 11, 2008, at 8:01 AM, David Landgren wrote:
>
>> The benchmark may be flawed, since my appreciation of Unicode is
>> little more than "things went downhill after 7-bit ASCII".
>
> Haven't I read that you live in Paris? I figured that anyone who lives
> in a country whose dominant language was not fully expressible in ASCII
> would love Unicode.

I do, but then again French is fully expressible in Latin-1... except
for the oe ligature (œ).

I worked with a French programmer a few years back who spent much of his
career working in government computing circles. Apparently when the
national European computer organisations were thrashing out what
characters should go where in the 128..255 high ASCII slots, his
colleague at the time, who was representing France in one of the
discussions, was out of the meeting having a coffee.

At that point, a vote was taken, and the result was that some other
accented character like ý or something made it in at the expense of œ.
What did get in were the decidedly less useful Æ and æ ligatures.

> On a major tangent, have others noticed the resurgence of the umlaut in
> printed English? I keep seeing things like coöperation or coördinates
> -- particularly in Technology Review, but in other publications on
> occasion too. Is that because it's *supposed* to be spelled that way,
> but ASCII and the typewriter have suppressed that spelling for my lifetime?

Funny you should mention that. I read about this first maybe twenty,
twenty-five years ago proposed as "the way things should really be" but
never saw it in use. Then last week I read two articles on two different
web sites that used this convention. I found it quite jarring. What's
next, "welcome to the reäl world?"

David

David Landgren

Jan 14, 2008, 5:47:30 PM
to sebb, Fun With Perl
sebb wrote:
> On 14/01/2008, John Douglas Porter <johnd...@yahoo.com> wrote:
>> From the keyboard of Yanick Champoux [12.01.08,18:50]:
>>> *dieresis* or *diæresis *A diacritical mark (* ¨ *) optionally used in
>>> English, oftentimes replaced by a hyphen. In English, the dieresis is used on
>>> a second identical vowel to indicate a change in pronunciation of that vowel
>>> or indicate it is pronounced in a separate syllable. It is sometimes referred
>>> to as an « umlaut » when used with a single character or in a « diphthong. »
>>> Examples: reëlecting, reëncoding, coöperation, coördination.
>
> Also naïf and naïve - non-identical vowels.

But isn't that because these two words are straight lifts from French,
where they have this exact spelling? It's not an English thing in this case.

David

Peter Makholm

Jan 15, 2008, 12:29:27 AM
to Fun With Perl
David Landgren <da...@landgren.net> writes:

> At that point, a vote was taken, and the result was that some other
> accented character like ý or something made it in at the expense of
> œ. What did get in were the decidedly less useful Æ and æ ligatures.

Quite an interesting discussion overall, but also quite off topic, so I
hadn't joined until now. Which is more useful, the French oe-ligature or
the Scandinavian letter æ, I don't know.

But part of the explanation should be that the Scandinavian delegates
pushed for 'æ' to be accepted as a full letter and not just a
ligature. This succeeded and therefore 'æ' got included in iso-8859-1,
and I'm quite sure that iso-8859-15 makes the same distinction.

In English 'æ' is still considered a ligature.


But as a non-ascii using European I prefer iso-8859-1(5) to
unicode. Much easier to work with, but I haven't really had the need to
mix different alphabets.


> but never saw it in use. Then last week I read two articles on two
> different web sites that used this convention. I found it quite
> jarring. What's next, "welcome to the reäl world?"

But only if the pronunciation 're-al' were to become widespread.

//Makholm

Craig S. Cottingham

Jan 14, 2008, 7:02:00 PM
to Fun With Perl
On Jan 14, 2008, at 16:42, David Landgren wrote:

> What's next, "welcome to the reäl world?"

Well, no, because the "a" isn't pronounced as a separate vowel. On
the other hand, we may start seeing references to El Camino Reäl in
Silicon Valley.

--
Craig S. Cottingham
craig.co...@gmail.com
