
utf8 pragma - strange behavior


ryang

Mar 16, 2005, 9:55:45 PM
I am trying to understand how to work with Unicode in Perl. I have
read the relevant man pages (perluniintro, perlunicode, etc.) and have
written several scripts to test and verify my understanding. However,
one of those scripts produces output I did not expect. The script is
below. It contains UTF-8 encoded characters representing all five
Spanish accented vowels plus the enye (n with a tilde over it), in
upper and lower case. I hope this post comes through UTF-8 encoded, as
the source code is. I am posting from Google Groups, which does use
UTF-8 encoding.

BEGIN CODE >>
#!/usr/bin/perl

use warnings;
use strict;
#use utf8;
use Encode;

# enabling 'use utf8' above causes the literal characters below to be
# printed in Latin-1 encoding

my %table = (
    # Spanish
    # hexadecimal UTF-8 bytes => actual UTF-8 character
    '0xc381' => chr(hex('c3')) . chr(hex('81')), # 'Á', built byte by byte
    '0xc389' => encode("utf8", "\x{00c9}"),      # 'É', built with encode()
    '0xc38d' => 'Í',
    '0xc393' => 'Ó',
    '0xc391' => 'Ñ',
    '0xc39a' => 'Ú',
    '0xc3a1' => 'á',
    '0xc3a9' => 'é',
    '0xc3ad' => 'í',
    '0xc3b3' => 'ó',
    '0xc3b1' => 'ñ',
    '0xc3ba' => 'ú',
);

# print each entry, sorted by its hexadecimal key
foreach (sort keys %table) {
    print "$_ = $table{$_}\n";
}
<< END CODE

When the 'use utf8' line is commented out, the script outputs the UTF-8
characters correctly. However, when the utf8 pragma is enabled, the
characters that are hard coded into the hash as UTF-8 literals (that
is, all of them except Á and É, which are built with chr() and
encode()) are printed in Latin-1. To my understanding, in Perl 5.8.x
the only effect of the utf8 pragma is to tell the parser that literals
and variable names may contain UTF-8 encoded characters. In practice,
however, the utf8 pragma is affecting the script's output.
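
A smaller test may help isolate this. The sketch below is only an illustration and assumes the file is saved as UTF-8; the names $bytes and $chars are mine and do not appear in the script above. It parses the same literal once under 'no utf8' and once under 'use utf8' (the pragma is lexically scoped, so each do block gets its own setting) and compares the results.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use Encode;

# Assumes this file is saved as UTF-8, so the literal below is the
# two source bytes 0xC3 0xB1 in both blocks.
my $bytes = do { no utf8;  "ñ" };   # parsed as two separate bytes
my $chars = do { use utf8; "ñ" };   # parsed as one character, U+00F1

printf "no utf8:  length=%d\n", length $bytes;   # length=2
printf "use utf8: length=%d\n", length $chars;   # length=1

# Encoding the character string by hand gives back the original bytes.
print "encode() matches source bytes: ",
    (encode("UTF-8", $chars) eq $bytes ? "yes" : "no"), "\n";   # yes
```

If the lengths differ this way, the pragma changes what the literal itself is (one character instead of two bytes), and a plain print to a STDOUT with no encoding layer would presumably write that one character as a single Latin-1 byte. Encoding explicitly as above, or calling binmode(STDOUT, ":utf8") before printing, would be one way to check that.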

I have tested the script on Mac OS X 10.3.8 with Perl 5.8.1 and on
Fedora Core (not sure which version) with Perl 5.8.3.

Can anyone explain why the utf8 pragma affects the output of the script?
