the challenge: itemize the stupidities. the case issue is only 1! i
don't want to even post the 'spec' unless asked for it. i saw this on
usenet today.
enjoy!!
uri
--
Uri Guttman ------ u...@stemsystems.com -------- http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org
>^([Pp]([Oo][Ss][Tt])?[.\s]*[Oo]([Ff][Ff][Ii][Cc][Ee])?[.\s]*[Bb][Oo]
>[Xx])|[Pp][Oo]([Bb]|[Xx]|[Dd][Rr][Aa][Ww][Ee][Rr]|[Ss][Tt][Oo][Ff][Ff]
>[Ii][Cc][Ee]|[ ][Bb][Xx]|[Bb][Oo][Xx])|[Pp][/][Oo]|[Bb]([Xx]|[Oo][Xx]|
>[Uu][Zz][Oo][Nn])|[Aa]([Pp][Aa][Rr][Tt][Aa][Dd][Oo]|[Pp][Tt][Dd][Oo])
Why use [ ] in one place when \s is known and used previously?
And ^ (i.e. \A) doesn't distribute across the alternatives, so
only the first alternative must match at beginning of string.
Assuming use of (?ix) then is this the de-obfuscated equivalent?
^( P ( OST)? [.\s]* O (FFICE)? [.\s]* BOX )
|
PO ( B | X | DRAWER | STOFFICE | [ ]BX | BOX )
|
P[/]O
|
B ( X | OX | UZON )
|
A ( PARTADO | PTDO )
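
A minimal sketch of the anchoring point, using a hypothetical cut-down pattern rather than the original: `$loose` leaves `^` attached to the first alternative only, while `$tight` groups the whole alternation so that `^` applies to every alternative. The names `$loose`, `$tight` and `hits` are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical cut-down pattern: ^ binds only to the first alternative.
my $loose = qr/^(p\.?o\.?\s*box)|apartado/i;
# Grouping the whole alternation makes ^ apply to every alternative.
my $tight = qr/^(?:p\.?o\.?\s*box|apartado)/i;

# Return 1 if the pattern matches the string, 0 otherwise.
sub hits { my ($re, $s) = @_; return $s =~ $re ? 1 : 0 }

for my $s ('PO Box 12', '12 Calle Apartado', 'see PO Box 12') {
    printf "%-20s loose=%d tight=%d\n", $s, hits($loose, $s), hits($tight, $s);
}
```

Only the middle string should separate the two patterns: the unanchored `apartado` alternative is free to match anywhere in the string.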
--
I'm a pessimist about probabilities; I'm an optimist about possibilities.
Lewis Mumford (1895-1990)
On 07 Jan 08 at 23:06, "Uri" (Uri Guttman) wrote:
Uri> ^([Pp]([Oo][Ss][Tt])?[.\s]*[Oo]([Ff][Ff][Ii][Cc][Ee])?[.\s]*[Bb][Oo]
Uri> [Xx])|[Pp][Oo]([Bb]|[Xx]|[Dd][Rr][Aa][Ww][Ee][Rr]|[Ss][Tt][Oo][Ff][Ff]
Uri> [Ii][Cc][Ee]|[ ][Bb][Xx]|[Bb][Oo][Xx])|[Pp][/][Oo]|[Bb]([Xx]|[Oo][Xx]|
Uri> [Uu][Zz][Oo][Nn])|[Aa]([Pp][Aa][Rr][Tt][Aa][Dd][Oo]|[Pp][Tt][Dd][Oo])
I've observed this pattern in FreeBSD startup (shell) scripts.
I do realize that shell is not perl, but nevertheless the style
still baffles me; I cannot decide whether this is extraordinary
stupidity or extraordinary wisdom :)
F.ex., /etc/rc.firewall on 6.2-stable has this:
case ${firewall_type} in
[Oo][Pp][Ee][Nn]|[Cc][Ll][Ii][Ee][Nn][Tt])
case ${natd_enable} in
[Yy][Ee][Ss])
--
Sincerely,
Dmitry Karasik
I don't even want to know what that's supposed to do.
First, and most obviously, that should use /i. That gives us:
^(p(ost)?[.\s]*o(ffice)?[.\s]*box)|po(b|x|drawer|stoffice|\sbx|box)|p/o|b(x|ox|uzon)|a(partado|ptdo)
Last I checked, '.' matches \s. Oh, and '/' is a pretty important
character, and should really be escaped.
^(p(ost)?.*o(ffice)?.*box)|po(b|x|drawer|stoffice|\sbx|box)|p\/o|b(x|ox|uzon)|a(partado|ptdo)
Unless I'm horribly mistaken, that can be simplified incredibly to
^(p(ost)?.*o(ffice)?.*box)|po(b|x|drawer|stoffice|\sbx)|p\/o|b(o?x|uzon)|a(partado|ptdo)
or
^(p(ost)?.*o(ffice)?.*)(b|x|drawer)|p\/o|b(o?x|uzon)|a(partado|ptdo)
In total, I count 9 errors and 213 characters removed. Though, I can't
count, so that may be wrong. Did I miss anything?
No, '.' inside [] matches '.'. The [.\s] is looking for a period or
whitespace.
And there's no reason to escape '/' if you're not using slashes for the
regex delimiters. If you want to escape it because you make a habit to
do so in regexes, that's fine, but it's not an error not to.
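
A small sketch of both points, assuming shortened stand-in patterns (not the original regex): inside a character class '.' is a literal period, so `[.\s]*` is much narrower than `.*`, and with a non-slash delimiter such as `qr{...}` the '/' needs no escape. The names `$class`, `$dot`, `$slash` and `hits` are hypothetical.

```perl
use strict;
use warnings;

# Inside a class '.' is a literal period: only periods and whitespace may intervene.
my $class = qr/^p[.\s]*o[.\s]*box/i;
# Outside a class '.' is any character except newline, so much more slips through.
my $dot   = qr/^p.*o.*box/i;
# With {} as delimiters the slash is an ordinary character and needs no escape.
my $slash = qr{p/o}i;

# Return 1 if the pattern matches the string, 0 otherwise.
sub hits { my ($re, $s) = @_; return $s =~ $re ? 1 : 0 }

for my $s ('P.O. Box 7', 'P O BOX 7', 'pandora box') {
    printf "%-12s class=%d dot=%d\n", $s, hits($class, $s), hits($dot, $s);
}
printf "%-12s slash=%d\n", 'P/O 12', hits($slash, 'P/O 12');
```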
--
Keith C. Ivey <ke...@iveys.org>
Washington, DC
DC> Last I checked, '.' matches \s. Oh, and '/' is a pretty important
DC> character, and should really be escaped.
nope. . doesn't match \n which is part of \s. you need the /s modifier
to make . match \n (and then of course \s).
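
For illustration, a tiny sketch of that behaviour on a two-line string; the variable `$nl` and the helper `hits` are made up for the example.

```perl
use strict;
use warnings;

my $nl = "a\nb";

# Return 1 if the pattern matches the string, 0 otherwise.
sub hits { my ($re, $s) = @_; return $s =~ $re ? 1 : 0 }

printf "a.b      => %d\n", hits(qr/a.b/,      $nl);  # '.' stops at newline
printf "a.b /s   => %d\n", hits(qr/a.b/s,     $nl);  # /s lets '.' match newline
printf "a\\sb     => %d\n", hits(qr/a\sb/,     $nl);  # \s includes newline
printf "a[.\\s]b  => %d\n", hits(qr/a[.\s]b/,  $nl);  # class: period or whitespace
```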
DC> In total, I count 9 errors and 213 characters removed. Though, I can't
DC> count, so that may be wrong. Did I miss anything?
i dunno. i can't figure it out either. that is why i posted it here! :)
but, pardon my ignorance, won't an "i" at the end of the regex make it catch
either case?
Meryll
(p(ost)?[.\s]*o(ffice)?[.\s]*box)
po(b|x|drawer|stoffice|[ ]bx|box)
p[\/]o
b(x|ox|uzon)
a(partado|ptdo)
Which matches:
(p(ost)?.*o(ffice)?.*box)
post(anynumberofanythingexceptnewline)office(anynumberofanythingexceptnewline)box
p(anynumberofanythingexceptnewline)office(anynumberofanythingexceptnewline)box
post(anynumberofanythingexceptnewline)o(anynumberofanythingexceptnewline)box
p(anynumberofanythingexceptnewline)o(anynumberofanythingexceptnewline)box
po(b|x|drawer|stoffice|[ ]bx|box)
pob
pox
podrawer
postoffice
po bx
pobox
p[\/]o
p/o
b(x|ox|uzon)
bx
box
buzon
a(partado|ptdo)
apartado
aptdo
I can't imagine what the original specs were, but it looks like a patch job
gone awry.
Steve
'[.\s]*' matches any number of periods or whitespace characters, since
'.' is not special inside a character class. It's not the same as '.*'.
Also, even if '.' were special, '\s' matches newline along with other
whitespace characters.
But the following should be:
(p(ost)?[.\s]*o(ffice)?[.\s]*box)
post(anynumberofperiodsorspacecharacterclassitems)office(anynumberofperiodsorspacecharacterclassitems)box
p(anynumberofperiodsorspacecharacterclassitems)o(anynumberofperiodsorspacecharacterclassitems)box
p(anynumberofperiodsorspacecharacterclassitems)office(anynumberofperiodsorspacecharacterclassitems)box
post(anynumberofperiodsorspacecharacterclassitems)o(anynumberofperiodsorspacecharacterclassitems)box
Steve
-----Original Message-----
From: Keith Ivey [mailto:ke...@iveys.org]
Sent: Tuesday, January 08, 2008 10:27 AM
To: Fun with Perl
If we unroll that to
post[.\s]*o(ffice)?[.\s]*box
p[.\s]*o(ffice)?[.\s]*box
pob
pox
podrawer
postoffice
po[ ]bx
pobox
p[\/]o
bx
box
buzon
apartado
aptdo
reassembling it, we obtain
(?:p(?:o(?:st(?:[.\s]*o(ffice)?[.\s]*box|office)|(?: b)?x|b(?:ox)?|drawer)|[.\s]*o(ffice)?[.\s]*box|\/o)|ap(?:arta|t)do|b(?:uzon|o?x))
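
A rough way to sanity-check the reassembly is to compare it against the unrolled list on a handful of strings. This sketch assumes the `(?: b)?x` branch contains a single space (it has to cover "po bx"), anchors both patterns at start and end, and uses /i; the sample strings and the helper `disagreements` are hypothetical.

```perl
use strict;
use warnings;

# The unrolled alternatives from the list above, joined into one alternation.
my @unrolled = (
    'post[.\s]*o(ffice)?[.\s]*box', 'p[.\s]*o(ffice)?[.\s]*box',
    qw(pob pox podrawer postoffice), 'po[ ]bx', 'pobox', 'p[\/]o',
    qw(bx box buzon apartado aptdo),
);
my $alt  = join '|', @unrolled;
my $list = qr/^(?:$alt)$/i;

# The reassembled pattern, assuming a single space in the "(?: b)?x" branch.
my $assembled = qr/^(?:p(?:o(?:st(?:[.\s]*o(ffice)?[.\s]*box|office)|(?: b)?x|b(?:ox)?|drawer)|[.\s]*o(ffice)?[.\s]*box|\/o)|ap(?:arta|t)do|b(?:uzon|o?x))$/i;

# Count the sample strings on which the two patterns disagree.
sub disagreements {
    my ($x, $y, @samples) = @_;
    return scalar grep { ($_ =~ $x ? 1 : 0) != ($_ =~ $y ? 1 : 0) } @samples;
}

my @samples = ('P.O. Box', 'post office box', 'pox', 'po bx', 'pobx', 'p/o',
               'buzon', 'aptdo', 'postoffice', 'drawer', 'po box', 'bx');
print "disagreements: ", disagreements($list, $assembled, @samples), "\n";
```

Agreement on a few samples is not a proof of equivalence, only a quick smoke test.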
I'm happy that they thought to check for 'pox' as a shorthand to
postoffice box. I'll remember to use that next time I need to address
such a letter.
David
> On Jan 7, 2008 5:06 PM, Uri Guttman <u...@stemsystems.com> wrote:
>
>> ^([Pp]([Oo][Ss][Tt])?[.\s]*[Oo]([Ff][Ff][Ii][Cc][Ee])?[.\s]*[Bb][Oo]
>> [Xx])|[Pp][Oo]([Bb]|[Xx]|[Dd][Rr][Aa][Ww][Ee][Rr]|[Ss][Tt][Oo][Ff][Ff]
>> [Ii][Cc][Ee]|[ ][Bb][Xx]|[Bb][Oo][Xx])|[Pp][/][Oo]|[Bb]([Xx]|[Oo][Xx]|
>> [Uu][Zz][Oo][Nn])|[Aa]([Pp][Aa][Rr][Tt][Aa][Dd][Oo]|[Pp][Tt][Dd][Oo])
>>
>> the challenge: itemize the stupidities. the case issue is only 1! i
>> don't want to even post the 'spec' unless asked for it. i saw this on
>> usenet today.
>
> I don't even want to know what that's supposed to do.
> First, and most obviously, that should use /i.
You can also find code like this in HTML::Template. Sam Tregar
explained the reason in the FAQ:
Q: Why do you use /[Tt]/ instead of /t/i? It's so ugly!
A: Simple - the case-insensitive match switch is very
inefficient. According to "Mastering Regular Expressions"
from O'Reilly Press, /[Tt]/ is faster and more space
efficient than /t/i - by as much as double against long
strings. //i essentially does a lc() on the string and
keeps a temporary copy in memory.
When this changes, and it is in the 5.6 development series,
I will gladly use //i. Believe me, I realize [Tt] is hideously
ugly.
» http://search.cpan.org/dist/HTML-Template/Template.pm#FREQUENTLY_ASKED_QUESTIONS
--
Sébastien Aperghis-Tramoni
Close the world, txEn eht nepO.
Looks like it was painfully true for 5.5...
$ time perl5.5.5 -wle '$foo = "x" x 10000; $foo .= "T"; $foo =~ /[Tt]/ for 1..100000'
real 0m4.882s
user 0m4.761s
sys 0m0.026s
$ time perl5.5.5 -wle '$foo = "x" x 10000; $foo .= "T"; $foo =~ /t/i for 1..100000'
real 0m40.656s
user 0m39.587s
sys 0m0.149s
And the reverse is now true in this highly inaccurate test...
$ time perl5.8.8 -wle '$foo = "x" x 10000; $foo .= "T"; $foo =~ /[Tt]/ for 1..100000'
real 0m5.732s
user 0m5.565s
sys 0m0.027s
$ time perl5.8.8 -wle '$foo = "x" x 10000; $foo .= "T"; $foo =~ /t/i for 1..100000'
real 0m2.589s
user 0m2.544s
sys 0m0.015s
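
The same comparison can be sketched with the core Benchmark module instead of `time`, assuming the same 10,000-character string with a trailing "T". The sub names and the iteration count are arbitrary choices for illustration, and the numbers will vary by perl version and machine.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Same shape of input as the one-liners: a long run of "x" with a final "T".
my $foo = ('x' x 10_000) . 'T';

sub class_match { return $foo =~ /[Tt]/ ? 1 : 0 }
sub flag_match  { return $foo =~ /t/i   ? 1 : 0 }

# A small fixed iteration count keeps the run short; raise it for steadier numbers.
cmpthese(20_000, {
    '[Tt]' => \&class_match,
    '/t/i' => \&flag_match,
});
```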
--
<Schwern> What we learned was if you get confused, grab someone and swing
them around a few times
-- Life's lessons from square dancing
> And the reverse is now true in this highly inaccurate test...
>
> $ time perl5.8.8 -wle '$foo = "x" x 10000; $foo .= "T"; $foo =~ /[Tt]/ for 1..100000'
>
> real 0m5.732s
> user 0m5.565s
> sys 0m0.027s
>
> $ time perl5.8.8 -wle '$foo = "x" x 10000; $foo .= "T"; $foo =~ /t/i for 1..100000'
>
> real 0m2.589s
> user 0m2.544s
> sys 0m0.015s
And if I recall my perl510delta correctly, /i should be even faster on
5.10.0. No, hang on, it's when UTF-8 strings are involved.
% time perl5.8.8 -Mutf8 -Mcharnames=:full -wle '$foo = "e\N{GREEK SMALL LETTER BETA}" x 5000; $foo .= "T"; $foo =~ /t/i for 1..1000'
real 0m22.855s
user 0m22.827s
sys 0m0.016s
% ./perl -v
This is perl, v5.10.0 DEVEL32604 built for i386-freebsd-thread-multi
% time ./perl -Ilib -Mutf8 -Mcharnames=:full -wle '$foo = "e\N{GREEK SMALL LETTER BETA}" x 5000; $foo .= "T"; $foo =~ /t/i for 1..1000'
real 0m22.957s
user 0m22.948s
sys 0m0.001s
Well, look on the bright side. It's no worse.
The benchmark may be flawed, since my appreciation of Unicode is little
more than "things went downhill after 7-bit ASCII".
David
> The benchmark may be flawed, since my appreciation of Unicode is
> little more than "things went downhill after 7-bit ASCII".
Haven't I read that you live in Paris? I figured that anyone who
lives in a country whose dominant language was not fully expressible
in ASCII would love Unicode.
On a major tangent, have others noticed the resurgence of the umlaut
in printed English? I keep seeing things like coöperation or
coördinates -- particularly in Technology Review, but in other
publications on occasion too. Is that because it's *supposed* to be
spelled that way, but ASCII and the typewriter have suppressed that
spelling for my lifetime?
Chris
A quick use of Google-fu unearthed a blog entry
http://www.dwelle.org/archives/2007/01/05/whats-with-all-the-umlauts/,
which in turn pointed to the page
http://ourworld.compuserve.com/homepages/profirst/d.htm that says:
*dieresis* or *diæresis* A diacritical mark (¨) optionally used in
English, oftentimes replaced by a hyphen. In English, the dieresis is
used on a second identical vowel to indicate a change in pronunciation
of that vowel or indicate it is pronounced in a separate syllable. It is
sometimes referred to as an « umlaut » when used with a single character
or in a « diphthong. » Examples: reëlecting, reëncoding, coöperation,
coördination.
Well I, for one, never knew that such a thing existed. Neato! Too
bad about the name of the mark, though, which is definitely unfortunate.
Joÿ,
`/anick
Well, that's sort of quotemeta for the double o - differentiating e.g.
double-o usage in cool vs. cooperation. I haven't seen that usage in
English yet, but it's used in Spanish to mark a vowel as literal, e.g. in
"Parque Güell".
0--gg-
--
_($_=" "x(1<<5)."?\n".q·/)Oo. G°\ /
/\_¯/(q /
---------------------------- \__(m.====·.(_("always off the crowd"))."·
");sub _{s,/,($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e,e && print}
The only English word I think it's commonly seen with is naïve,
to indicate that the ai isn't a digraph.
--
"But Sidley Park is already a picture, and a most amiable picture too.
The slopes are green and gentle. The trees are companionably grouped at
intervals that show them to advantage. The rill is a serpentine ribbon
unwound from the lake peaceably contained by meadows on which the right
amount of sheep are tastefully arranged." -- Lady Croom, "Arcadia"
> Chris Dolan wrote:
> > On a major tangent, have others noticed the resurgence of the umlaut in
> > printed English? I keep seeing things like coöperation or coördinates --
> > particularly in Technology Review, but in other publications on occasion
> > too. Is that because it's *supposed* to be spelled that way, but ASCII and
> > the typewriter have suppressed that spelling for my lifetime?
> >
>
> A quick use of Google-fu unearthed a blog entry
> http://www.dwelle.org/archives/2007/01/05/whats-with-all-the-umlauts/, which
> in turn pointed to the page
> http://ourworld.compuserve.com/homepages/profirst/d.htm that says:
>
> *dieresis* or *diæresis* A diacritical mark (¨) optionally used in
> English, oftentimes replaced by a hyphen. In English, the dieresis is used on
> a second identical vowel to indicate a change in pronunciation of that vowel
> or indicate it is pronounced in a separate syllable. It is sometimes referred
> to as an « umlaut » when used with a single character or in a « diphthong. »
> Examples: reëlecting, reëncoding, coöperation, coördination.
Actually the term "umlaut" in German denotes a "shifted" vowel. If you do
a transition from "u" -> "e" biased towards "i" and stop in the middle,
you have the "ü", which can also be written as a diphthong: "ue". The "e" in
"ue" was often placed above the "u" in old German writing (where the "e"
was written like "n", but with a sharp bend instead of a curve before the
last falling stroke). The four strokes necessary for that "e" were reduced
to two, and those to dots, hence the two points above the "ü".
So, the umlaut is a shortened form of a "diphthongy" denoting a shifted vowel,
and *not* a diaeresis ("ue" is not a diphthong, but an umlaut ;-)
Ö--gg-
"Not fully expressible" seems mild to apply to writing French in ASCII
(which after all has no diacritics). The phrase seems more appropriate
for writing French in ISO-8859-1 (because of the lack of "oe" ligature).
For bonus points, try writing, say, German (fully expressible
with an ISO-8859 charset) and Greek (fully expressible[1] with
an ISO-8859 charset) in the same document.
[1]: Well, Modern Greek anyway.
Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>
According to the infallible Wikipedia, this diacritic is also called a
trema. Only if used as a separation mark, not as an umlaut.
HTH
Eugene
Because, ya know, I always get confused about how to pronounce "reelecting".
Thank god they cleared that right up! And with an easy to understand symbol
that everyone knows about! :P
Really they just want to be more metal. Soon it will be ¡KömpUsërV.DøøM!
--
There will be snacks.
Yup, I became aware of that yesterday when I shared my new golden
nugget of trivia with my wife. Being both German *and* a linguist, she
promptly corrected my inaccuracies. But she left out the part about the
'e' mutating into the two little dots that we know and love nowadays,
which I think is the coolest part of the whole story, so thanks for
that! :-)
Joy,
`/anick
And to think that, all those years, I laughed at Spinal Tap's claim
to be avant-garde and visionary geniuses. They were right. All that
time, they were right...
Joÿ,
`/anick
I want to clarify (only because I myself was confused at first)
that an umlaut can be used IN a diphthong, but does not have
any function in MAKING a diphthong. For example, the
German diphthong "au" becomes "äu" due to umlaut, (or "vowel
shifting").
Unless I am mistaken.
--
John Douglas Porter
Also naïf and naïve - non-identical vowels.
> I want to clarify (only because I myself was confused at first)
> that an umlaut can be used IN a diphthong, but does not have
> any function in MAKING a diphthong. For example, the
> German diphthong "au" becomes "äu" due to umlaut, (or "vowel
> shifting").
>
In summary, umlaut and dieresis/diaeresis (also trema) are both
diacritical marks.
Same symbol, but different meaning, where the meaning of the symbol
depends on the context.
The rule seems to be: second vowel of a pair=dieresis, otherwise umlaut.
Any counter-examples?
> Unless I am mistaken.
Same here...
I'd call the symbol in "Brontë" a dieresis, not an umlaut. Maybe: when
the symbol indicates the vowel is to be pronounced further forward in
the mouth, it's an umlaut; when it indicates the vowel is to be
pronounced on its own in a normal way, it's a dieresis.
> On 14/01/2008, John Douglas Porter <johnd...@yahoo.com> wrote:
> >
> > From the keyboard of Yanick Champoux [12.01.08,18:50]:
> > > *dieresis* or *diæresis* A diacritical mark (¨) optionally used in
> > > English, oftentimes replaced by a hyphen. In English, the dieresis is
> > > used on a second identical vowel to indicate a change in pronunciation
> > > of that vowel or indicate it is pronounced in a separate syllable. It
> > > is sometimes referred to as an « umlaut » when used with a single
> > > character or in a « diphthong. » Examples: reëlecting, reëncoding,
> > > coöperation, coördination.
> >
>
> Also naïf and naïve - non-identical vowels.
>
> > I want to clarify (only because I myself was confused at first)
> > that an umlaut can be used IN a diphthong, but does not have
> > any function in MAKING a diphthong. For example, the
> > German diphthong "au" becomes "äu" due to umlaut, (or "vowel
> > shifting").
Correct (unless I'm mistaken too ;-)
> In summary, umlaut and dieresis/diaeresis (also trema) are both
> diacritical marks.
>
> Same symbol, but different meaning, where the meaning of the symbol
> depends on the context.
>
> The rule seems to be: second vowel of a pair=dieresis, otherwise umlaut.
>
> Any counter-examples?
yup, two examples:
German:
"geärgert" (been angry)
- here the second vowel is an umlaut
Quenya - the elven-tongue:
"ámen anta síra ilaurëa massamma" (give us today the daily our-bread)
- here the diaeresis is noted on the first vowel at ëa
;-)
0--gg-
> > Unless I am mistaken.
>
> Same here...
>
--
> From the keyboard of sebb [14.01.08,12:21]:
> > The rule seems to be: second vowel of a pair=dieresis, otherwise umlaut.
> >
> > Any counter-examples?
>
> yup, two examples:
>
> German:
> "geärgert" (been angry)
> - here the second vowel is an umlaut
>
> Quenya - the elven-tongue:
> "ámen anta síra ilaurëa massamma" (give us today the daily our-bread)
> - here the diaeresis is noted on the first vowel at ëa
Don't forget the burning of the Böögg in Zürich at Sächsilüüte.
http://de.wikipedia.org/wiki/Sechsel%C3%A4uten
--
Paul Johnson - pa...@pjcj.net
http://www.pjcj.net
I do, but then again French is fully expressible in Latin-1... except
for the oe ligature (œ).
I worked with a French programmer a few years back who spent much of his
career working in government computing circles. Apparently when the
national European computer organisations were thrashing out what
characters should go where in the 128..255 high ASCII slots, his
colleague at the time, who was representing France in one of the
discussions, was out of the meeting having a coffee.
At that point, a vote was taken, and the result was that some other
accented character like ý or something made it in at the expense of œ.
What did get in were the decidedly less useful Æ and æ ligatures.
> On a major tangent, have others noticed the resurgence of the umlaut in
> printed English? I keep seeing things like coöperation or coördinates
> -- particularly in Technology Review, but in other publications on
> occasion too. Is that because it's *supposed* to be spelled that way,
> but ASCII and the typewriter have suppressed that spelling for my lifetime?
Funny you should mention that. I read about this first maybe twenty,
twenty-five years ago proposed as "the way things should really be" but
never saw it in use. Then last week I read two articles on two different
web sites that used this convention. I found it quite jarring. What's
next, "welcome to the reäl world?"
David
But isn't that because these two words are straight lifts from French,
where they have this exact spelling? It's not an English thing in this case.
David
> At that point, a vote was taken, and the result was that some other
> accented character like ý or something made it in at the expense of
> œ. What did get in were the decidedly less useful Æ and æ ligatures.
Quite an interesting discussion overall, but also quite off topic, so I
hadn't joined until now. Whether the French oe-ligature or the
Scandinavian letter æ is more useful, I don't know.
But part of the explanation should be that the Scandinavian delegates
pushed for 'æ' to be accepted as a full letter and not just a
ligature. This succeeded and therefore 'æ' got included in iso-8859-1,
and I'm quite sure that iso-8859-15 makes the same distinction.
In English 'æ' is still considered a ligature.
But as a non-ASCII-using European I prefer iso-8859-1(5) to
unicode. Much easier to work with, but I haven't really had the need to
mix different alphabets.
> but never saw it in use. Then last week I read two articles on two
> different web sites that used this convention. I found it quite
> jarring. What's next, "welcome to the reäl world?"
But only if the pronunciation 're-al' would become widespread.
//Makholm
> What's next, "welcome to the reäl world?"
Well, no, because the "a" isn't pronounced as a separate vowel. On
the other hand, we may start seeing references to El Camino Reäl in
Silicon Valley.
--
Craig S. Cottingham
craig.co...@gmail.com