Quoth Manfred Lotz <manfre...@arcor.de>:
> On Tue, 14 May 2013 21:27:49 +0100
> Ben Morrow <b...@morrow.me.uk> wrote:
> >
> > That is exactly what Peter was trying to explain. Because of the 'use
> > utf8', perl has already decoded the UTF-8 in the source code file into
> > Unicode characters, so $string does *not* contain "\x48\xc3\xa4":
> > instead it contains "\x48\xe4". The e4 is because 'ä', as a Unicode
> > character, has ordinal 0xe4. This string, which happens to contain
> > only bytes though it could easily not have done, is not valid UTF-8,
> > so decode croaks.
> >
>
> Ok, I agree that perl decodes 'ä' (which is utf8 x'c3a4' in the file) to
> unicode \x{e4}.
>
> Nevertheless the ä is a valid utf8 char.

No, you're confused about the difference between 'UTF-8' and 'Unicode'.
Unicode is a big list of characters, with names and associated semantics
(like 'the lowercase of character 'A' is character 'a''). Each of these
characters has been given a number; some of these numbers are >255, so
it isn't possible to represent a string of Unicode characters directly
with a string of bytes, the way you can with ASCII or Latin-1.

This is a problem, given that files (on most systems) and TCP
connections and so on are defined as strings of bytes. To solve it,
various 'Unicode Transformation Formats' have been invented. The one
usually used on Unix systems and in Internet protocols is called
'UTF-8'; if you feed a string of Unicode characters into a UTF-8 encoder
you get a string of bytes out, and if you feed a string of bytes into a
UTF-8 decoder you either get a string of Unicode characters or you get
an error, if the string of bytes wasn't valid UTF-8.
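
As a minimal sketch of the encoder/decoder pair described above, using
the core Encode module (the variable names are only for illustration,
and LEAVE_SRC is added because decode with a CHECK argument otherwise
modifies its input in place):

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Two Unicode characters: 'H' (0x48) and a-diaeresis (0xe4).
my $chars = "H\x{e4}";

# Encoder: characters in, bytes out. The 0xe4 becomes the pair c3 a4.
my $bytes = encode('UTF-8', $chars);
printf "%d characters -> %d bytes: %s\n",
    length($chars), length($bytes),
    join ' ', map { sprintf '%02x', ord } split //, $bytes;

# Decoder: bytes in, characters out (or an exception if the bytes are
# not valid UTF-8, because of FB_CROAK).
my $back = decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
print "round trip ok\n" if $back eq $chars;
```

This should print "2 characters -> 3 bytes: 48 c3 a4" followed by
"round trip ok".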

Perl strings are always strings of Unicode characters[0]. If you want to
represent a string of bytes in Perl, you do so by using a string of
characters all of which happen to have an ordinal value less than 256.
Perl does not make any attempt to keep track of whether a given string
was supposed to be 'a string of bytes' or not: you have to do this
yourself[1].

If you read a string from a file (without doing anything special to the
filehandle first), you will always get a string of bytes, because the
Unix file-reading APIs only support files that consist of strings of
bytes. If that string of bytes was supposed to be UTF-8, and you want to
manipulate it as a string of Unicode characters, you have to pass it
through Encode::decode. Since not all strings of bytes are valid UTF-8
this function can fail; this is what Peter posted.

If you write a string to a file (without...), the characters in the
string are written out directly as bytes. If they all have ordinals
below 256 this will effectively leave the file encoded in ISO8859-1,
since the first 256 Unicode characters have the same numbers as the 256
ISO8859-1 characters. If you try to write a character with ordinal 256
or greater, you will get a warning and stupid behaviour, because there
simply isn't any way to write a byte to a file with a value greater than
255[2]. If you want to write UTF-8 to a file, you have to encode your
string of characters (which may have ordinals >255) using
Encode::encode, which will return a string with all ordinals <256 which
you can write to the file.
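
Put together, the write and read halves might look something like this.
It is only a sketch: the temporary file from File::Temp and the sample
string are my own choices for the sake of a self-contained example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use File::Temp qw(tempfile);

# Three characters; the last one (U+263A) has an ordinal above 255.
my $chars = "H\x{e4}\x{263a}";

# Writing: encode first, so that only ordinals below 256 reach the file.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} encode('UTF-8', $chars);
close $out or die "close: $!";

# Reading: with no layers on the filehandle we get the bytes back.
open my $in, '<', $path or die "open: $!";
binmode $in;
my $bytes = do { local $/; <$in> };
close $in;

printf "%d characters written as %d bytes\n",
    length($chars), length($bytes);

# Decoding the bytes recovers the original characters.
my $again = decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
print "same string after decoding\n" if $again eq $chars;
```

Printing $chars directly instead of the encoded bytes is what produces
the warning mentioned above.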

So "\x48\xc3\xa4" is valid UTF-8. If you decode it into Unicode
characters, you get the string "\x48\xe4", which is *not* valid UTF-8.
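
To see the difference between the two strings concretely, here is a
small sketch. The helper is_valid_utf8 is a name I have made up for
this example; it is not part of Encode, it just wraps decode in an eval.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if the argument, treated as a string of
# bytes, decodes cleanly as UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $ok = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    return $ok ? 1 : 0;
}

for my $s ("\x48\xc3\xa4", "\x48\xe4") {
    my $hex = join ' ', map { sprintf '%02x', ord } split //, $s;
    printf "%-8s %s\n", $hex,
        is_valid_utf8($s) ? 'valid UTF-8' : 'not valid UTF-8';
}
```

The first string should be reported as valid and the second as not
valid, since a lone e4 byte starts a multi-byte sequence that never
finishes.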

What are you actually trying to do here? That is, why do you think you
need to check if a string is valid UTF-8?

Ben

[0, 1] Historical footnotes: Perl's Unicode support was started in
perl 5.6, and first became usable in 5.8. In the beginning the intention
was that Perl should keep track of whether a given string was a string
of bytes or a string of Unicode characters, and treat the string
differently (for some operations) in each case. This turned out to be a
nightmare, because Perl's dynamic typing system meant that strings kept
being unexpectedly converted from one type to the other, making it very
difficult to predict which behaviour a given operator would actually
use.

After a great deal of argument, the design was eventually changed to the
one I described above, and any remnants of the old design were
designated 'The Unicode Bug'. I believe the first version of perl which
properly fixed the Unicode Bug is 5.14, though there are still functions
in the API which shouldn't really be there. As a rule of thumb, any
function which mentions 'the UTF8 flag' is not a function you should be
using, unless you're trying to work around bugs in an XS module.

[2] The behaviour is stupider than it ought to be: what in fact happens
is that Perl encodes the character as UTF-8 and writes that out. This
will almost certainly make the file unreadable, since some parts will be
in UTF-8 and some parts will not. Properly perl ought to either give a
fatal error or write nothing at all.