Why I no longer use Perl

strenhol...@gmail.com

Oct 23, 2006, 9:40:46 AM

I have been a Perl programmer for over ten years. Recently, I found a
bug in Perl which made me stop using Perl altogether.

In these examples, the accented character is a 2-byte UTF-8 sequence.

$ /usr/bin/perl --version

This is perl, v5.8.0 built for i386-linux-thread-multi
(with 1 registered patch, see perl -V for more detail)

[Full Perl license text removed for brevity]

$ /usr/local/bin/perl --version

This is perl, v5.8.8 built for i686-linux

[Full Perl license text removed for brevity]

$ echo á | /usr/bin/perl -pe 's/á/aye/'
á
$ echo á | /usr/local/bin/perl -pe 's/á/aye/'
aye

So, is there any way to work around this problem? Nope. You might think
"use utf8" will fix this issue. It doesn't.

$ cat unicode.char
á
$ cat unicode.script
use utf8;

open(A,"< unicode.char");

while(<A>) {

$_ =~ s/á/aye/;
print;

}
$ /usr/bin/perl unicode.script
aye
$ /usr/local/bin/perl unicode.script
á

As you can see, "use utf8" just causes Perl 5.8.0 to do the right
thing, yet breaks Perl 5.8.8. So maybe we can fix this with a
conditional statement.

$ cat unicode.script.2
$vers=sprintf("%vd",$^V);

if($vers =~ /5.8.0/) {
use utf8;
}

open(A,"< unicode.char");

while(<A>) {

$_ =~ s/á/aye/;
print;

}
$ /usr/bin/perl unicode.script.2
á
$ /usr/local/bin/perl unicode.script.2
aye

At this point, I gave up. These days, I write in either awk (for simple
stuff) or Python (for complicated stuff). For example, none of the four
freely downloadable awk interpreters have this problem:

$ echo á | busybox awk '{gsub(/á/,"aye");print}'
aye
$ echo á | gawk '{gsub(/á/,"aye");print}'
aye
$ echo á | mawk '{gsub(/á/,"aye");print}'
aye
$ echo á | bwk-awk '{gsub(/á/,"aye");print}'
aye

The nice thing about awk is that there is a POSIX standard out there;
this guarantees that I can write my awk scripts in a manner that will
work on any modern system with an awk interpreter.

The nice thing about Python is that there is a strong commitment from
the Python community to not arbitrarily break things or make changes
which break scripts between bugfix releases.

Perl 6 looks promising; it seems that there is a lot of work being done
to document anything and everything Perl 6 does. It looks like Perl 6
will have a standard so a given script which follows the standard is
guaranteed to work with any version of Perl 6. I might return to Perl
once Perl 6 has a stable release and a well-defined standard.

- Sam

Aukjan van Belkum

Oct 23, 2006, 10:21:58 AM

strenhol...@gmail.com wrote:

> $ cat unicode.script.2
> $vers=sprintf("%vd",$^V);
>
> if($vers =~ /5.8.0/) {
> use utf8;
> }
>

Actually this will not work, since 'use' statements are executed at
compile time, so the 'if' statement has no effect on them (AFAIK).
You could accomplish this with a BEGIN block and a 'require' statement
instead. That should solve your problem!
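A minimal sketch of that suggestion, assuming the goal is to enable the pragma only on perls older than 5.8.1 (the version test is illustrative, not taken from the original script). The first part shows why the original attempt fails: a 'use' inside a branch that never runs still takes effect. Whether conditionally enabling utf8 is the right fix at all is a separate question.

```perl
use strict;
use warnings;

# 'use' is executed while the file is being compiled, so a run-time 'if'
# around it changes nothing: POSIX::floor is imported even from a dead branch.
if (0) {
    use POSIX qw(floor);
}
print "floor was imported anyway\n" if defined &floor;

# A BEGIN block also runs at compile time, and inside it an ordinary 'if'
# can decide whether the pragma is loaded.
BEGIN {
    if ($] < 5.008001) {    # numeric version; true for 5.8.0 and older
        require utf8;       # load utf8.pm
        utf8->import;       # the step 'use utf8' performs after loading
    }
}

printf "running perl %vd\n", $^V;
```

The require/import pair inside BEGIN is what 'use' does internally, which is why it can be wrapped in a condition.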

So please don't quit on Perl :-)

Aukjan

Matt Garrish

Oct 23, 2006, 10:26:14 AM

strenhol...@gmail.com wrote:
> I have been a Perl programmer for over ten years. Recently, I found a
> bug in Perl which made me stop using Perl altogether.
>

In ten years you've never learned how to set the locale?

perldoc perllocale

Matt

Aukjan van Belkum

Oct 23, 2006, 10:30:18 AM

Aukjan van Belkum wrote:
> strenhol...@gmail.com wrote:
>
>> $ cat unicode.script.2
>> $vers=sprintf("%vd",$^V);
>>
>> if($vers =~ /5.8.0/) {
>> use utf8;
>> }
>>
>
> Actually this will not work... since use statements are done as the
> first thing,

hmm .. just read 'perldoc -q require', which explains the difference
between 'use' and 'require' better... :-)

Aukjan

Dr.Ruud

Oct 23, 2006, 10:27:26 AM

strenholme schreef:

> $ echo á | /usr/bin/perl -pe 's/á/aye/'
> á

perl 5.8.8 on Windows 2000:

C:\>echo á | perl -MO=Deparse -pe "s/á/aye/"
LINE: while (defined($_ = <ARGV>)) {
s/\341/aye/;
}
continue {
print $_;
}
-e syntax OK

"\341" is "\xE1" is chr(225).

C:\> echo á | perl -MO=Deparse -Mencoding=latin1 -pe "s/á/aye/"
use encoding (split(/,/, 'latin1', 0));
LINE: while (defined($_ = <ARGV>)) {
s/\303\241/aye/;
}
continue {
print $_;
}
-e syntax OK

"\303\241" (or "\xC3\xA1") is the UTF-8 encoding of chr(225).
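The relationship between the two Deparse outputs can be checked directly. This is a small sketch using the core Encode module; the variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = chr(225);                 # U+00E1, LATIN SMALL LETTER A WITH ACUTE
my $utf8 = encode('UTF-8', $char);   # byte string holding its UTF-8 encoding

# Print each byte in the octal and hex notations used above.
printf "%d byte(s):", length $utf8;
printf " \\%03o (\\x%02X)", $_, $_ for unpack 'C*', $utf8;
print "\n";
# prints: 2 byte(s): \303 (\xC3) \241 (\xA1)
```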

In Python you are in a special environment. Let's do the same with perl:

$ perl -de1
[...]

DB<1> x $_="á"; s/á/aye/; print
aye0 1
DB<2>

So I think your problem is with locales.

--
Affijn, Ruud

"Gewoon is een tijger."

Peter J. Holzer

Oct 23, 2006, 3:57:48 PM

On 2006-10-23 13:40, strenhol...@gmail.com <strenhol...@gmail.com> wrote:
> I have been a Perl programmer for over ten years. Recently, I found a
> bug in Perl which made me stop using Perl altogether.
>
> In these examples, the accented character is a 2-byte UTF-8 sequence.
>
> $ /usr/bin/perl --version
>
> This is perl, v5.8.0 built for i386-linux-thread-multi
> (with 1 registered patch, see perl -V for more detail)
>
> [Full Perl license text removed for brevity]
>
> $ /usr/local/bin/perl --version
>
> This is perl, v5.8.8 built for i686-linux
>
> [Full Perl license text removed for brevity]
>
> $ echo á | /usr/bin/perl -pe 's/á/aye/'
> á
> $ echo á | /usr/local/bin/perl -pe 's/á/aye/'
> aye
>
> So, is there any way to work around this problem? Nope.

You don't include enough information about your environment to be sure
where the bug is, but it is probably in perl 5.8.0 and was later fixed.
There were quite a few unicode-related bugs in 5.8.0. However, perl
5.8.0 is now more than 4 years old and was superseded by perl 5.8.1 more
than three years ago.

> You might think "use utf8" will fix this issue. It doesn't.

I'm not sure what "the issue" is, but I'm quite sure that you haven't
understood what "use utf8" does: It tells the perl compiler that the
source code is in UTF-8. It doesn't say anything about the encoding of
the files your script reads.

> $ cat unicode.char
> á
> $ cat unicode.script
> use utf8;
>
> open(A,"< unicode.char");

open(A, '<:utf8', 'unicode.char');

> while(<A>) {
>
> $_ =~ s/á/aye/;
> print;
>
> }
> $ /usr/bin/perl unicode.script
> aye
> $ /usr/local/bin/perl unicode.script
> á
>
> As you can see, "use utf8" just causes Perl 5.8.0 to do the right
> thing,

Er, no. Perl 5.8.0 still does the wrong thing, and Perl 5.8.8 does the
right thing. You just made another error which makes it look like it is
the other way round (sometimes two wrongs do make a right).
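To make the distinction concrete, here is a rough sketch of what the substitution sees in each case. It assumes the pattern is the single character U+00E1 (which is what a literal á becomes under "use utf8") and uses escapes instead of a literal á so that it does not depend on how the file is saved.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $pattern = "\x{E1}";        # the character á, as in s/á/aye/ under 'use utf8'
my $bytes   = "\xC3\xA1\n";    # the line as read from unicode.char with no I/O layer
my $chars   = decode('UTF-8', $bytes);   # the same line as seen through '<:utf8'

(my $from_bytes = $bytes) =~ s/$pattern/aye/;   # no match: two bytes vs. one character
(my $from_chars = $chars) =~ s/$pattern/aye/;   # match: one character vs. one character

print "raw bytes: ", ($from_bytes eq $bytes   ? "unchanged" : "replaced"), "\n";
print "decoded:   ", ($from_chars eq "aye\n" ? "replaced"  : "unchanged"), "\n";
```

Both sides have to be in the same form, either bytes or characters, for the match to succeed.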

> The nice thing about Python is that there is a strong commitment from
> the Python community to not arbitrarily break things or make changes
> which break scripts between bugfix releases.

If the scripts are buggy and depend on the bug, I'm sure they will
break in Python, too, once the bug is fixed. If a bug fix doesn't change
the behaviour, the bug isn't fixed.

hp

brian d foy

Oct 23, 2006, 6:36:35 PM

In article <1161610846.3...@m73g2000cwd.googlegroups.com>,
<strenhol...@gmail.com> wrote:

> I have been a Perl programmer for over ten years. Recently, I found a
> bug in Perl which made me stop using Perl altogether.

> $ /usr/bin/perl --version

> This is perl, v5.8.0 built for i386-linux-thread-multi

There's your problem. As an experienced programmer you probably already
know that you shouldn't use versions that have a 0 in them. That's just
the usual wisdom of software, not just Perl.

> Perl 6 looks promising; it seems that there is a lot of work being done
> to document anything and everything Perl 6 does.

Wait for 6.1, for the same reason :)


Michele Dondi

Oct 24, 2006, 5:14:39 AM

On Mon, 23 Oct 2006 21:57:48 +0200, "Peter J. Holzer"
<hjp-u...@hjp.at> wrote:

>You don't include enough information about your environment to be sure
>where the bug is, but it is probably in perl 5.8.0 and was later fixed.
>There were quite a few unicode-related bugs in 5.8.0. However, perl
>5.8.0 is now more than 4 years old and was superseded by perl 5.8.1 more
>than three years ago.

However AIUI the point is that for some reason the OP seems to find
5.8.0's buggy behaviour to be the "right" one...

Michele
--
{$_=pack'B8'x25,unpack'A8'x32,$a^=sub{pop^pop}->(map substr
(($a||=join'',map--$|x$_,(unpack'w',unpack'u','G^<R<Y]*YB='
.'KYU;*EVH[.FHF2W+#"\Z*5TI/ER<Z`S(G.DZZ9OX0Z')=~/./g)x2,$_,
256),7,249);s/[^\w,]/ /g;$ \=/^J/?$/:"\r";print,redo}#JAPH,

Peter J. Holzer

Oct 24, 2006, 10:34:28 AM

On 2006-10-24 09:14, Michele Dondi <bik....@tiscalinet.it> wrote:
> On Mon, 23 Oct 2006 21:57:48 +0200, "Peter J. Holzer"
><hjp-u...@hjp.at> wrote:
>
>>You don't include enough information about your environment to be sure
>>where the bug is, but it is probably in perl 5.8.0 and was later
>>fixed. There were quite a few unicode-related bugs in 5.8.0. However,
>>perl 5.8.0 is now more than 4 years old and was superseded by perl
>>5.8.1 more than three years ago.
>
> However AIUI the point is that for some reason the OP seems to find
> 5.8.0's buggy behaviour to be the "right" one...

Not quite. AIUI he found perl 5.8.8 behaviour in the first example
correct but the 5.8.0 behaviour in the second example. But that was
because he had an error in the second example.

Unfortunately the Perl behaviour with UTF-8 is not very intuitive. While
it is backward compatible and allows you to work with both byte-oriented
and character-oriented data with relatively little trouble *after*
you've grokked the concepts, the learning curve is rather steep. You
really need to understand the difference between byte strings and
character strings as well as the influence of I/O layers, environment
variables and conversion functions to make sense of the behaviour.

I doubt that it is much simpler in Python, though.

It might be simpler in a strongly typed language such as Java
(somebody's going to point out now that Java isn't really strongly
typed, either), where you have different types for character strings and
byte strings, but in reality it isn't: The difference isn't always
crystal-clear, and you will have some standard library which got it
wrong and which can't be fixed because that would break every program
which uses it.

Joe Smith

Oct 25, 2006, 2:32:39 AM

strenhol...@gmail.com wrote:
> I have been a Perl programmer for over ten years. Recently, I found a
> bug in Perl which made me stop using Perl altogether.

It's not a bug. It's a misunderstanding on the programmer's part.

> So, is there any way to work around this problem? Nope. You might think
> "use utf8" will fix this issue. It doesn't.

The 'use utf8' pragma is used to tell the perl interpreter that
the script itself is written in UTF8. It affects things like
$_ = "á";
where string literals inside the Perl program are encoded in UTF8.
It does _not_ affect data being read or written.

If you are concerned about UTF8 versus non-UTF data in the input
stream, then you need to 'use locale' and/or use I/O layers.
See 'perldoc -q binmode' on the newer version of perl.
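As a rough sketch of the I/O-layer approach: decode on input, encode on output, and let the code in between work on characters. An in-memory filehandle stands in for unicode.char here so the example is self-contained; with a real file, the file name would go in place of \$data.

```perl
use strict;
use warnings;

my $data = "\xC3\xA1\n";   # stand-in for the contents of unicode.char (UTF-8 bytes)

# The layer on the input handle decodes UTF-8 into characters as lines are read.
open(my $in, '<:encoding(UTF-8)', \$data) or die "cannot open: $!";

# The layer on STDOUT encodes characters back to UTF-8 when printing.
binmode(STDOUT, ':encoding(UTF-8)');

my @lines;
while (my $line = <$in>) {
    $line =~ s/\x{E1}/aye/;   # \x{E1} is á; a literal á here would also need 'use utf8'
    push @lines, $line;
    print $line;
}
close($in);
```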

-Joe

strenhol...@gmail.com

Oct 25, 2006, 12:57:58 PM

First of all, I would like to thank Peter for taking time to help with
this issue. The open(A,'<:utf8','unicode.char'); line fixed the issue
for Perl 5.8.0 and Perl 5.8.8.

In reply to Peter's request for more information:

$ env | grep LANG
LANG=en_US.UTF-8

I wish I had more time to study the dark mysteries of how Perl
determines the encoding for the script, the input Perl receives, and
the output Perl prints. I assumed "use utf8" told Perl "the script and
all input and output will be in UTF8"; I was wrong. Unfortunately, I am
no longer a computer professional (the dot-com bubble exploding hit me
hard) and am a full-time ESL English teacher these days who programs
open-source projects as a hobby.

I have the highest of respect for the Perl language and the Perl
community. I have had the pleasure of meeting Larry Wall at a SVLUG
meeting during my days as a dot-com worker and am still very impressed
with how nice and charming Larry is.

The reason why I have to put up with !@#$ Perl 5.8.0 bugs is because
some actively maintained Linux distributions (such as Red Hat Enterprise
Linux 3 and derivatives) still distribute this fossil.

Again, I thank everyone for their responses and their help.

- Sam

Peter J. Holzer

Oct 25, 2006, 3:47:55 PM

On 2006-10-25 16:57, strenhol...@gmail.com <strenhol...@gmail.com> wrote:
> I wish I had more time to study the dark mysteries of how Perl
> determines the encoding for the script, the input Perl receives, and
> the output Perl prints. I assumed "use utf8" told Perl "the script and
> all input and output will be in UTF8"; I was wrong.

You can use the PERL_UNICODE environment variable or the -C option for
that. But there are still some areas where explicit conversion is
necessary.
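A script can check which of these settings is in effect through the read-only ${^UNICODE} variable (available from 5.8.2 on, as far as I know). The sketch below decodes it using the flag letters and bit values listed for -C in perlrun; describe_unicode_flags is a made-up helper name, not something perl provides.

```perl
use strict;
use warnings;

# Flag letters and bit values as documented for -C in perlrun.
my @flags = (
    [ I => 1,  'STDIN is assumed to be UTF-8' ],
    [ O => 2,  'STDOUT will be UTF-8' ],
    [ E => 4,  'STDERR will be UTF-8' ],
    [ i => 8,  'UTF-8 is the default layer for input streams' ],
    [ o => 16, 'UTF-8 is the default layer for output streams' ],
    [ A => 32, '@ARGV elements are expected to be UTF-8' ],
    [ L => 64, 'the above apply only under a UTF-8 locale' ],
);

# Hypothetical helper: return the flag letters set in a ${^UNICODE}-style value.
sub describe_unicode_flags {
    my ($bits) = @_;
    return map { $_->[0] } grep { $bits & $_->[1] } @flags;
}

print "active -C flags: ",
      (join('', describe_unicode_flags(${^UNICODE})) || '(none)'), "\n";
```

Run as "perl -CSD script.pl" or with PERL_UNICODE=SD in the environment, this should report the standard-stream and default-layer flags; run with neither, it should report none.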

> The reason why I have to put up with !@#$ Perl 5.8.0 bugs is because
> some actively maintained Linux distributions (Such as RedHat Enterprise
> Linux 3 and derivatives) still distribute this fossil.

Yep. Most of our RHEL servers are running RHEL 3, too. Fortunately I can
simply install a newer perl if I run into problems. (There's an
advantage if the sysadmin and the perl programmer are the same person
:-).