Now I find that they seem to be byte references, not character
references. Consider the following test script:
use strict;
use warnings;
use utf8; # source code in UTF-8 ("Zurück")
use open OUT => ':encoding(UTF-8)', ':std';
my $str1 = "<<\xa0Zurück\n"; # byte -> bad
my $str2 = "<<\x{a0}Zurück\n"; # should be character, but isn't
my $str3 = "<<\x{00a0}Zurück\n"; # ditto
my $str4 = "<<\xa0" . "Zurück\n"; # upgrading hack, works
print $str1, $str2, $str3, $str4;
$str1 ne $str2 and die "won't die";
$str1 ne $str3 and die "won't die";
$str1 ne $str4 and die 'die now, somewhat counter-intuitively';
The correct version of the string uses implicit upgrading of
the byte escape "\xa0" to a Unicode character. I've read upgrading
should rather be avoided, but here it does the job.
Am I mistaken in my expectation that while "\xa0" should be a byte,
"\x{a0}" and "\x{00a0}" should be characters? Note that perlretut(1)
seems to support this assumption:
Unicode characters in the range of 128-255 use two hexadecimal
digits with braces: \x{ab}. Note that this is different than \xab,
which is just a hexadecimal byte with no Unicode significance.
http://perl.active-venture.com/pod/perlretut-morecharacter.html
But maybe this only refers to these escapes inside regular expressions.
Or maybe the utf8 pragma breaks things here? Don't think so, though.
If I comment it out, I have to recode my script to Latin1 in order for
the strings to be valid.
Note that the reason I use the utf8 pragma is so I can write "Zurück"
in my source code and automatically have Perl informed that these are
characters, not bytes - which is a great convenience.
Yeah, it would also work in Latin1, and our editors handle various
encodings just fine - but we have a good UTF-8 development environment
and there might be characters not representable in Latin1 that I'd like
to add to the script source.
What's your advice for handling this situation more elegantly?
--
Michael.Ludwig (#) XING.com
[ perlbug readers, you will find the nut of the issue in the
section marked BUG ]
* Michael Ludwig <michael...@xing.com> [2010-03-03 14:05]:
> For convenience, I have test script source code in UTF-8. The
> test also deals with non-breaking spaces, which I prefer to
> keep as character references since they are not visible and
> might be mistaken by the casual onlooker for ordinary spaces.
> So I write them as "\xa0". Or "\x{a0}", or "\x{00a0}".
>
> Now I find that they seem to be byte references, not character
> references.
Perl does not distinguish between bytes and characters. It does
distinguish between scalars that use a packed byte buffer for
storage vs strings that use a variable-width integer sequence for
storage, but this is an implementation detail and does not mean
anything in terms of semantics. Strings are simply strings in
Perl. You cannot tell what kind of data they contain just by
looking at them and the UTF8 flag doesn’t tell you either.
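
A minimal sketch of that point, assuming a perl without the literal-parsing bug discussed further down (the variable names are made up for illustration): the same string compares equal whichever internal storage format it happens to use.

```perl
use strict;
use warnings;

# The same one-character string, held in two different internal formats.
my $packed   = "\xa0";       # stored as a packed byte buffer
my $upgraded = "\xa0";
utf8::upgrade($upgraded);    # force the variable-width internal format

# The internal flag differs ...
my $flag_packed   = utf8::is_utf8($packed)   ? 1 : 0;
my $flag_upgraded = utf8::is_utf8($upgraded) ? 1 : 0;

# ... but the strings are semantically identical.
my $same = $packed eq $upgraded ? 1 : 0;

printf "flags: %d vs %d, equal: %d, lengths: %d and %d\n",
    $flag_packed, $flag_upgraded, $same,
    length($packed), length($upgraded);
```

The flag is only inspected here to show that it carries no meaning; real code should not branch on it.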
> Consider the following test script:
>
> use strict;
> use warnings;
> use utf8; # source code in UTF-8 ("Zurück")
> use open OUT => ':encoding(UTF-8)', ':std';
>
> my $str1 = "<<\xa0Zurück\n"; # byte -> bad
> my $str2 = "<<\x{a0}Zurück\n"; # should be character, but isn't
> my $str3 = "<<\x{00a0}Zurück\n"; # ditto
> my $str4 = "<<\xa0" . "Zurück\n"; # upgrading hack, works
>
> print $str1, $str2, $str3, $str4;
>
> $str1 ne $str2 and die "won't die";
> $str1 ne $str3 and die "won't die";
> $str1 ne $str4 and die 'die now, somewhat counter-intuitively';
"\x{00a0}" does not map to utf8 at t.pl line 11.
<<\xA0Zurück
"\x{00a0}" does not map to utf8 at t.pl line 11.
<<\xA0Zurück
"\x{00a0}" does not map to utf8 at t.pl line 11.
<<\xA0Zurück
<< Zurück
die now, somewhat counter-intuitively at t.pl line 15.
This is definitely a bug.
> The correct version of the string uses implicit upgrading of
> the byte escape "\xa0" to a Unicode character. I've read
> upgrading should rather be avoided, but here it does the job.
No, upgrading is perfectly fine. Mixing byte and character data
is what should be avoided, because then Perl will assume it’s all
characters, which will result in mangling of one of the two kinds
of data. Usually the byte data is encoded text, in which case the
problem becomes apparent as double-encoded text. But it’s really
a problem both ways.
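
To make the double-encoding case concrete, here is a small sketch using the core Encode module (the variable names are illustrative only): bytes that are already encoded get treated as characters and encoded a second time.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text   = "Zur\x{fc}ck";             # character string, 6 characters
my $octets = encode('UTF-8', $text);    # byte string, 7 bytes (the u-umlaut becomes c3 bc)

# Mistake: the encoded bytes are handled as if they were characters
# and encoded again, so c3 and bc are each expanded to two bytes.
my $double = encode('UTF-8', $octets);  # 9 bytes

# Decoding once no longer gives back the original text.
my $garbled = decode('UTF-8', $double);

printf "%d characters, %d bytes, %d bytes after double encoding\n",
    length($text), length($octets), length($double);
print "round trip broken\n" if $garbled ne $text;
```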
> Am I mistaken in my expectation that while "\xa0" should be
> a byte, "\x{a0}" and "\x{00a0}" should be characters? Note that
> perlretut(1) seems to support this assumption:
>
> Unicode characters in the range of 128-255 use two hexadecimal
> digits with braces: \x{ab}. Note that this is different than
> \xab, which is just a hexadecimal byte with no Unicode
> significance.
>
> http://perl.active-venture.com/pod/perlretut-morecharacter.html
>
> But maybe this only refers to these escapes inside regular expressions.
The documentation appears to be wrong. Unfortunately a lot of the
documentation of Perl itself is wrong or confused about Perl’s
string model.
> Or maybe the utf8 pragma breaks things here? Don't think so,
> though. If I comment it out, I have to recode my script to
> Latin1 in order for the strings to be valid.
Yes. This appears to be a utf8 pragma bug or a bug in the parser
that shows up in interaction with the utf8 pragma.
====================== BUG ======================
What happens is that the presence of the ü under the utf8 pragma
triggers using the variable-width integer sequence format for the
string, but the 0xA0 byte from the \x escape gets written into
that buffer verbatim, as if it were a packed byte array string.
This is wrong and completely broken.
====================== BUG ======================
> Note that the reason I use the utf8 pragma is so I can write
> "Zurück" in my source code and automatically have Perl informed
> that these are characters, not bytes - which is a great
> convenience.
>
> Yeah, it would also work in Latin1, and our editors handle
> various encodings just fine - but we have a good UTF-8
> development environment and there might be characters not
> representable in Latin1 that I'd like to add to the script
> source.
Writing source in UTF-8 is a perfectly sane practice. No need to
justify it.
> What's your advice for handling this situation more elegantly?
Use the \U escape to indicate that you always mean a Unicode code
point. Due to other quirks in how \U is implemented, it ends up
not triggering the bug that \x would.
Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>
Thanks for your answer - much appreciated! Please see my comments
inline.
On 07.03.2010 at 07:39, Aristotle Pagaltzis wrote:
> Perl does not distinguish between bytes and characters. It does
> distinguish between scalars that use a packed byte buffer for
> storage vs strings that use a variable-width integer sequence for
> storage, but this is an implementation detail and does not mean
> anything in terms of semantics. Strings are simply strings in
> Perl. You cannot tell what kind of data they contain just by
> looking at them and the UTF8 flag doesn’t tell you either.
Okay. But unless I'm completely misled, you can tell whether a
string is supposed to contain characters (<- Encode::decode) or
bytes (<- Encode::encode). With the utf8 pragma in scope, it seems
to me that my literal strings are supposed to contain characters,
not bytes.
> "\x{00a0}" does not map to utf8 at t.pl line 11.
> <<\xA0Zurück
> "\x{00a0}" does not map to utf8 at t.pl line 11.
> <<\xA0Zurück
> "\x{00a0}" does not map to utf8 at t.pl line 11.
> <<\xA0Zurück
> << Zurück
> die now, somewhat counter-intuitively at t.pl line 15.
>
> This is definitely a bug.
Good. It looked like one to me. Thanks for logging it with the
Perl maintainers.
However, it might already have been fixed for Perl 5.10.1 - at
least, ActiveState v5.10.1 produces what I think is a correct
result:
michael.ludwig@nb-mludwig: ~/MiLu/dev/perl/Unicode > aperl nbsp.pl
<< Zurück
<< Zurück
<< Zurück
<< Zurück
michael.ludwig@nb-mludwig: ~/MiLu/dev/perl/Unicode > aperl -v
This is perl, v5.10.1 built for darwin-thread-multi-2level
(with 2 registered patches, see perl -V for more detail)
>> Am I mistaken in my expectation that while "\xa0" should be
>> a byte, "\x{a0}" and "\x{00a0}" should be characters? Note that
>> perlretut(1) seems to support this assumption:
>>
>> Unicode characters in the range of 128-255 use two hexadecimal
>> digits with braces: \x{ab}. Note that this is different than
>> \xab, which is just a hexadecimal byte with no Unicode
>> significance.
>>
>> http://perl.active-venture.com/pod/perlretut-morecharacter.html
>>
>> But maybe this only refers to these escapes inside regular expressions.
>
> The documentation appears to be wrong. Unfortunately a lot of the
> documentation of Perl itself is wrong or confused about Perl’s
> string model.
The documentation I referred to is outdated. Sorry for that.
>> What's your advice for handling this situation more elegantly?
>
> Use the \U escape to indicate that you always mean a Unicode code
> point. Due to other quirks in how \U is implemented, it ends up
> not triggering the bug that \x would.
How would I use that? I only know about the U specifier for pack:
my $smiley = pack 'U', 0x263a;
--
Michael.Ludwig (#) XING.com
The result of decode is a character string.
The result of encode is a byte string.
However, apart from looking at the source code and deducing the
intentions of the programmer, there is no way to tell whether a given
string is meant as a character or byte string, simply because there is
no technical representation of this intent in the string or its
metadata.
Note that "characters" are the general case: a string is made of
characters. When every character value fits in a single byte, the string
can be used as a byte string.
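
A short sketch of why the intent cannot be read off the string itself (illustrative names, core Encode module only): the same two values are valid both as text and as encoded bytes.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $s = "\xc3\xa4";    # two values: 0xC3 and 0xA4

# Reading 1: it is text, namely the two characters U+00C3 and U+00A4.
my $length_as_text = length($s);

# Reading 2: it is the UTF-8 encoding of the single character U+00E4.
my $decoded = decode('UTF-8', $s);

printf "as text: %d characters; decoded: %d character (U+%04X)\n",
    $length_as_text, length($decoded), ord($decoded);
```

Nothing in `$s` says which reading is the right one; only the surrounding program knows.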
> > This is definitely a bug.
> Good. It looked like one to me. Thanks for logging it with the
> Perl maintainers.
This bug forces us to look at the internal encoding and flags to come to
the conclusion that it is indeed a bug. Don't mistake this as a sign
that looking at the internal encoding or flags should ever happen in
actual code. Even if you work around the bug, make sure that you don't
make anything conditional on the current formatting of the string.
Instead, coerce it to whatever you need by using utf8::downgrade or
utf8::upgrade. In your specific case, concatenation of two separate
parts is probably the most sane thing to do.
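
As a sketch of that coercion (the strings are made up for the example): utf8::upgrade always succeeds, while utf8::downgrade with a true second argument reports failure instead of dying when a character does not fit in a byte.

```perl
use strict;
use warnings;

# Every character here is below U+0100, so either format can hold it.
my $text = "so\xa0ein";
utf8::upgrade($text);                              # variable-width format
my $fits = utf8::downgrade($text, 1) ? 1 : 0;      # back to packed bytes

# U+263A does not fit in a byte, so the downgrade must fail.
my $wide      = "smiley \x{263a}";
my $wide_fits = utf8::downgrade($wide, 1) ? 1 : 0; # 1 = return false, do not die

print "text fits in bytes: $fits, wide fits in bytes: $wide_fits\n";
```

Coercing unconditionally like this avoids any test on the current internal format.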
> >> Am I mistaken in my expectation that while "\xa0" should be
> >> a byte, "\x{a0}" and "\x{00a0}" should be characters?
Yes. These three escapes are supposed to be exactly the same. They
create a U+00A0 character, which happens to be perfectly usable as the
A0 byte when used as such, in a string that doesn't contain any
character greater than U+00FF.
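
A quick check of that equivalence, as a sketch for a perl without the bug (the strings contain no other non-ASCII characters, so the bug would not be triggered here anyway):

```perl
use strict;
use warnings;

# The three spellings of the same escape.
my @forms = ("\xa0", "\x{a0}", "\x{00a0}");

my $all_equal = ($forms[0] eq $forms[1] && $forms[1] eq $forms[2]) ? 1 : 0;
my @ords      = map { ord } @forms;

printf "U+%04X U+%04X U+%04X, all equal: %d\n", @ords, $all_equal;
```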
> >> [perlre:]
> >> Unicode characters in the range of 128-255 use two hexadecimal
> >> digits with braces: \x{ab}. Note that this is different than
> >> \xab, which is just a hexadecimal byte with no Unicode
> >> significance.
> The documentation I referred to is outdated. Sorry for that.
Indeed this documentation is wrong. Current documentation, as of Perl
version 5.8.9 (December 2008), no longer has this paragraph.
--
Met vriendelijke groet, Kind regards, Korajn salutojn,
Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
Convolution: ICT solutions and consultancy <sa...@convolution.nl>
On 08.03.2010 at 16:15, Juerd Waalboer wrote:
> Michael Ludwig wrote 2010-03-08 15:55 (+0100):
>> Okay. But unless I'm completely misled, you can tell whether a
>> string is supposed to contain characters (<- Encode::decode) or
>> bytes (<- Encode::encode)
>
> The result of decode is a character string.
>
> The result of encode is a byte string.
Thanks for confirming.
> However, apart from looking at the source code and deducing the
> intentions of the programmer, there is no way to tell whether a given
> string is meant as a character or byte string, simply because there is
> no technical representation of this intent in the string or its
> metadata.
>
> Note that "characters" are the general case: a string is made of
> characters. When every character value fits in a single byte, the string
> can be used as a byte string.
And clarifying further.
> This bug forces us to look at the internal encoding and flags to come to
> the conclusion that it is indeed a bug. Don't mistake this as a sign
> that looking at the internal encoding or flags should ever happen in
> actual code. Even if you work around the bug, make sure that you don't
> make anything conditional on the current formatting of the string.
>
> Instead, coerce it to whatever you need by using utf8::downgrade or
> utf8::upgrade. In your specific case, concatenation of two separate
> parts is probably the most sane thing to do.
Good.
>>>> Am I mistaken in my expectation that while "\xa0" should be
>>>> a byte, "\x{a0}" and "\x{00a0}" should be characters?
>
> Yes. These three escapes are supposed to be exactly the same. They
> create a U+00A0 character, which happens to be perfectly usable as the
> A0 byte when used as such, in a string that doesn't contain any
> character greater than U+00FF.
Okay. Let me try to see if I have understood correctly. Without the utf8
pragma in scope, "so\xa0ein\xa0Käse" with a-Umlaut stored as a sequence
of two bytes in my source code will be stored internally as a sequence
of 12 integers. With the utf8 pragma in scope, only 11 integers.
I know I shouldn't care about the internals, but sometimes grokking the
internals is helpful as an aide-mémoire, because it puts things into
perspective that otherwise seem more arbitrary.
--
Michael.Ludwig (#) XING.com
> Michael Ludwig wrote 2010-03-10 10:34 (+0100):
>> Okay. Let me try to see if I have understood correctly. Without the utf8
>> pragma in scope, "so\xa0ein\xa0Käse" with a-Umlaut stored as a sequence
>> of two bytes in my source code will be stored internally as a sequence
>> of 12 integers. With the utf8 pragma in scope, only 11 integers.
I think I got confused about bytes and integers now, because I misread
an earlier post by Aristotle. What I meant is:
With the utf8 pragma in scope, "so\xa0ein\xa0Käse" with a-Umlaut stored
as a sequence of two bytes in my source code will be stored internally as
a sequence of 11 integers. (But I shouldn't care about the integers, that's
an implementation detail.) Without the utf8 pragma in scope, the string will
be stored as a sequence of 12 bytes; and 11 bytes if I convert the source to
Latin-1.
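
The 11 versus 12 count can be checked with a sketch that does not depend on how the script file itself is saved. The assumption here is that the a-umlaut is written with escapes to mimic what perl sees in each case: two separate values without the pragma, one character with it.

```perl
use strict;
use warnings;

# Without "use utf8": the two UTF-8 source bytes of the a-umlaut (c3 a4)
# stay two separate values in the string.
my $without_pragma = "so\xa0ein\xa0K\xc3\xa4se";

# With "use utf8": the same source bytes are decoded to the single
# character U+00E4.
my $with_pragma = "so\xa0ein\xa0K\x{e4}se";

printf "without pragma: %d, with pragma: %d\n",
    length($without_pragma), length($with_pragma);
```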
In the broken perl versions, like 5.8.9 and 5.10.0, with the utf8 pragma
in scope I get the wrong sequence of 11 integers, as per your illustration
quoted below: I get a0 where I should get c2-a0, because those perl versions
don't handle character escapes correctly.
> "so\xa0ein\xa0Käse" must be stored as either:
>
> l1: 73 6f a0 65 69 6e a0 4b e4 73 65 (UTF8 flag off)
>
> or:
>
> u8: 73 67 c2-a0 65 69 6e c2-a0 4b c3-a4 73 65 (UTF8 flag on)
Yes (modulo typo):
so ein Käse: 73 6f c2-a0 65 69 6e c2-a0 4b c3-a4 73 65
so?ein?Käse: 73 6f c2-a0 65 69 6e c2-a0 4b c3-83 c2-a4 73 65
----
use common::sense; # includes utf8 pragma
use open OUT => qw/:encoding(UTF-8) :std/;
use Encode;
sub show_bytes {
    my $str = shift;
    my $out = '';
    for ( split '', $str ) {
        my $octets = Encode::encode( 'UTF-8', $_ );
        $out .= join '-', map sprintf( '%x', ord), split '', $octets;
        $out .= ' ';
    }
    return $out;
}
print STDERR "Broken in Perl 5.8.9 and 5.10.0!\n"; # fine in 5.10.1
my $sok = "so\xa0ein\xa0Käse";
print $_, ":\t", show_bytes( $_ ), "\n" for $sok;
----
> Both strings should be semantically equal, and have 11 characters, each
> of which has an integer ordinal value.
>
> What happens is the following:
>
> 73 6f a0 65 69 6e a0 4b c3-a4 73 65 (UTF8 flag on)
> l1 l1 u8
>
> This is wrong. It is a bug.
Very graphical and palpable exposition, thanks!
--
Michael.Ludwig (#) XING.com
"so\xa0ein\xa0Käse" must be stored as either:
l1: 73 6f a0 65 69 6e a0 4b e4 73 65 (UTF8 flag off)
or:
u8: 73 67 c2-a0 65 69 6e c2-a0 4b c3-a4 73 65 (UTF8 flag on)
Both strings should be semantically equal, and have 11 characters, each
of which has an integer ordinal value.
What happens is the following:
73 6f a0 65 69 6e a0 4b c3-a4 73 65 (UTF8 flag on)
l1 l1 u8
This is wrong. It is a bug.
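
For illustration only, the two valid storage forms can be compared on a perl without the bug. This sketch peeks at the internal buffer size with bytes::length, which is meant for debugging and not for real code; the a-umlaut is written as an escape so the file encoding does not matter.

```perl
use strict;
use warnings;
use bytes ();    # load without importing, so only bytes::length() is affected

my $l1 = "so\xa0ein\xa0K\x{e4}se";
utf8::downgrade($l1);            # packed bytes, UTF8 flag off

my $u8 = $l1;
utf8::upgrade($u8);              # variable-width format, UTF8 flag on

my $equal = $l1 eq $u8 ? 1 : 0;  # semantically the same string

printf "characters: %d and %d, internal bytes: %d and %d, equal: %d\n",
    length($l1), length($u8),
    bytes::length($l1), bytes::length($u8), $equal;
```

The three extra internal bytes in the second form are the c2 and c3 lead bytes shown in the dumps above.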
--
Met vriendelijke groet, // Kind regards, // Korajn salutojn,
Juerd Waalboer <ju...@tnx.nl>
TNX
I just noticed I never replied to this…
* Michael Ludwig <michael...@xing.com> [2010-03-08 15:50]:
Sorry – I meant \N. E.g. in that case,
my $smiley = "\N{U+263A}";
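
Applied to the string from the original test script, that would look roughly like the sketch below. It assumes the file is saved as UTF-8, as in the original report, and the variable names are illustrative.

```perl
use strict;
use warnings;
use utf8;                               # this source file is saved as UTF-8

binmode STDOUT, ':encoding(UTF-8)';

# \N{U+00A0} always means the code point U+00A0, next to the literal u-umlaut.
my $str = "<<\N{U+00A0}Zurück\n";

my $nbsp_ord = ord substr($str, 2, 1);  # the character right after "<<"

printf "length %d, third character U+%04X\n", length($str), $nbsp_ord;
print $str;
```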