utf8

George Mpouras

May 13, 2013, 7:05:00 AM

Is there any easy way to decide if a string is valid UTF-8?

Manfred Lotz

May 13, 2013, 8:51:46 AM

On Mon, 13 May 2013 14:05:00 +0300
George Mpouras <nospam.gravital...@hotmail.noads.com> wrote:

> Is there any easy way to decide if a string is valid UTF-8?

Minimal example:

#! /usr/bin/perl

use strict;
use warnings;

use utf8;
use Encode;

my $string = 'Hä';

Encode::is_utf8($string) or die "bad string";

my $bad_string = 0x123456;
Encode::is_utf8($bad_string) or die "bad string";

--
Manfred

George Mpouras

May 13, 2013, 9:22:36 AM

Thanks, it is working.
I had tried the same thing, but my mistake was that I had not used the
line "use utf8;"!

Manfred Lotz

May 13, 2013, 9:43:52 AM

On Mon, 13 May 2013 16:22:36 +0300

George Mpouras <nospam.gravital...@hotmail.noads.com> wrote:

> On 13/5/2013 15:51, Manfred Lotz wrote:
> > On Mon, 13 May 2013 14:05:00 +0300
> > George Mpouras <nospam.gravital...@hotmail.noads.com>
> > wrote:
> >
> >> Is there any easy way to decide if a string is valid UTF-8?
> >
> > Minimal example:
> >
> > #! /usr/bin/perl
> >
> > use strict;
> > use warnings;
> >
> > use utf8;
> > use Encode;
> >
> > my $string = 'Hä';
> >
> > Encode::is_utf8($string) or die "bad string";
> >
> > my $bad_string = 0x123456;
> > Encode::is_utf8($bad_string) or die "bad string";
> >
> >
>
>
>
> thanks, it is working.
> I have tried the same thing, but my mistake was, I have not used the
> line "use utf8;" !
>
>

Yes, that is important.

--
Manfred

Peter J. Holzer

May 13, 2013, 7:10:59 PM

On 2013-05-13 12:51, Manfred Lotz <manfre...@arcor.de> wrote:
> On Mon, 13 May 2013 14:05:00 +0300
> George Mpouras <nospam.gravital...@hotmail.noads.com> wrote:
>> Is there any easy way to decide if a string is valid UTF-8?
>
> Minimal example:
>
> #! /usr/bin/perl
>
> use strict;
> use warnings;
>
> use utf8;
> use Encode;
>

> my $string = 'Hä';

This string is not UTF-8 in any useful sense. It consists of two
characters, U+0048 LATIN CAPITAL LETTER H and U+00e4 LATIN SMALL
LETTER A WITH DIAERESIS. The same string encoded in UTF-8 would consist
of three bytes, "\x{48}\x{C3}\x{A4}". Note that the former string has
length 2, the latter has length 3.

> Encode::is_utf8($string) or die "bad string";

This tests whether the internal representation of the string is
utf-8-like, which you almost never want to know in a Perl program. It
also tells you whether the string has character semantics (unless you
use a rather new version of perl with the unicode_strings feature),
which is sometimes useful.

If you want to know whether a string is a correctly encoded UTF-8
sequence, try to decode it:

$decoded = eval { decode('UTF-8', $string, FB_CROAK) };

(decode(..., FB_CROAK) will die if $string is not UTF-8, so you need to
catch that. All other check parameters are even less convenient).
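
A minimal sketch of that check wrapped in a helper, assuming the input is a byte string (for example, something read from a file without an :encoding layer). The name is_valid_utf8 is made up for illustration. LEAVE_SRC is OR-ed into the check value because decode may otherwise overwrite its input when a check is given.

```perl
use strict;
use warnings;
use Encode qw( decode FB_CROAK LEAVE_SRC );

# Hypothetical helper: true if $bytes is a well-formed UTF-8 byte sequence.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # decode() dies on malformed input under FB_CROAK; eval catches that
    # and yields undef. LEAVE_SRC keeps decode() from modifying $bytes.
    return defined eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
}

# The UTF-8 bytes of "Hä": accepted.
print is_valid_utf8("\x48\xc3\xa4") ? "valid\n" : "invalid\n";
# 'H' followed by a lone 0xE4 byte: rejected.
print is_valid_utf8("\x48\xe4") ? "valid\n" : "invalid\n";
```

Note that the helper only makes sense for byte strings. Handing it a string that has already been decoded into characters asks a different question.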

hp

--
_ | Peter J. Holzer | Fluch der elektronischen Textverarbeitung:
|_|_) | Sysadmin WSR | Man feilt solange an seinen Text um, bis
| | | h...@hjp.at | die Satzbestandteile des Satzes nicht mehr
__/ | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel

George Mpouras

May 14, 2013, 2:58:51 AM

>
> If you want to know whether a string is a correctly encoded UTF-8
> sequence, try to decode it:
>
> $decoded = eval { decode('UTF-8', $string, FB_CROAK) };
>
> (decode(..., FB_CROAK) will die if $string is not UTF-8, so you need to
> catch that. All other check parameters are even less convenient).
>

nice !

Manfred Lotz

May 14, 2013, 3:31:26 PM

On Tue, 14 May 2013 01:10:59 +0200
"Peter J. Holzer" <hjp-u...@hjp.at> wrote:

> On 2013-05-13 12:51, Manfred Lotz <manfre...@arcor.de> wrote:
> > On Mon, 13 May 2013 14:05:00 +0300
> > George Mpouras <nospam.gravital...@hotmail.noads.com>
> > wrote:
> >> Is there any easy way to decide if a string is valid UTF-8?
> >
> > Minimal example:
> >
> > #! /usr/bin/perl
> >
> > use strict;
> > use warnings;
> >
> > use utf8;
> > use Encode;
> >

> > my $string = 'Hä';

>
> This string is not UTF-8 in any useful sense. It consists of two
> characters, U+0048 LATIN CAPITAL LETTER H and U+00e4 LATIN SMALL
> LETTER A WITH DIAERESIS. The same string encoded in UTF-8 would
> consist of three bytes, "\x{48}\x{C3}\x{A4}". Note that the former
> string has length 2, the latter has length 3.
>

This is only the email. In my test script it is this:

00000050  20 27 48 c3 a4 27 3b 0a  0a 45 6e 63 6f 64 65 3a  | 'H..';..Encode:|

> > Encode::is_utf8($string) or die "bad string";
>
> This tests whether the internal representation of the string is
> utf-8-like, which you almost never want to know in a Perl program. It
> also tells you whether the string has character semantics (unless you
> use a rather new version of perl with the unicode_strings feature),
> which is sometimes useful.
>
> If you want to know whether a string is a correctly encoded UTF-8
> sequence, try to decode it:
>
> $decoded = eval { decode('UTF-8', $string, FB_CROAK) };
>
> (decode(..., FB_CROAK) will die if $string is not UTF-8, so you need
> to catch that. All other check parameters are even less convenient).
>

Aaah, thanks. Didn't know that.

#! /usr/bin/perl
use strict;
use warnings;

use utf8;

use 5.010;

use Encode qw( decode FB_CROAK );

my $string = 'Hä'; # = 0x48c3a4

my $decoded = decode('utf8', $string, FB_CROAK);

Nevertheless, I'm confused. The script above, where 'Hä' is definitely
0x48c3a4 (verified by hexdump), croaks. Why?

At any rate I have to read perlunitut, perluniintro etc. to understand
what's going on.

--
Manfred

Ben Morrow

May 14, 2013, 4:27:49 PM

Quoth Manfred Lotz <manfre...@arcor.de>:

> On Tue, 14 May 2013 01:10:59 +0200
> "Peter J. Holzer" <hjp-u...@hjp.at> wrote:
> > On 2013-05-13 12:51, Manfred Lotz <manfre...@arcor.de> wrote:
> > >
> > > use utf8;
> > > use Encode;
> > >
> > > my $string = 'Hä';
> >
> > This string is not UTF-8 in any useful sense. It consists of two
> > characters, U+0048 LATIN CAPITAL LETTER H and U+00e4 LATIN SMALL
> > LETTER A WITH DIAERESIS. The same string encoded in UTF-8 would
> > consist of three bytes, "\x{48}\x{C3}\x{A4}". Note that the former
> > string has length 2, the latter has length 3.

[...]

>
> use utf8;
> use 5.010;
>
> use Encode qw( decode FB_CROAK );
>
> my $string = 'Hä'; # = 0x48c3a4
>
>
> my $decoded = decode('utf8', $string, FB_CROAK);
>
>
> Nevertheless, I'm confused. Above script where 'Hä' is definitely
> 0x48c3a4 (verified by hexdump) croaks. Why?

That is exactly what Peter was trying to explain. Because of the 'use
utf8', perl has already decoded the UTF-8 in the source code file into
Unicode characters, so $string does *not* contain "\x48\xc3\xa4":
instead it contains "\x48\xe4". The e4 is because 'ä', as a Unicode
character, has ordinal 0xe4. This string, which happens to contain only
bytes though it could easily not have done, is not valid UTF-8, so
decode croaks.

If you had read the same string from a file, this would not have
happened (unless you asked for it with an :encoding layer), nor would it
have happened if you hadn't had 'use utf8'.

Try running these both with and without 'use utf8':

"\x48\xc3\xa4" eq "Hä" or warn "unequal";
"\x48\xe4" eq "Hä" and warn "equal";
warn length "Hä";

"\x48\xc4\x81" eq "Hā" or warn "unequal";
"\x48\x{101}" eq "Hā" and warn "equal";
warn length "Hā";

(that character is a-macron).
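
The same effect can be sketched without depending on how the source file itself is saved. This assumes an ASCII-only script and uses an explicit decode to stand in for what 'use utf8' does to the literal; the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw( decode );

my $bytes = "\x48\xc3\xa4";            # what is in the file: 'H', 0xC3, 0xA4
my $chars = decode('UTF-8', $bytes);   # what 'use utf8' makes of that literal

printf "bytes: length %d\n", length $bytes;                  # length 3
printf "chars: length %d, last ordinal 0x%x\n",
    length $chars, ord substr($chars, -1);                   # length 2, 0xe4
print "chars is the two-character string H, U+00E4\n"
    if $chars eq "\x48\xe4";
```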

Ben

Manfred Lotz

May 15, 2013, 12:18:52 AM

My mistake was that I believed that perl's internal representation is
utf8 instead of unicode code point. I thought I had read this in some
perl man page.

--
Manfred

Manfred Lotz

May 15, 2013, 4:29:57 AM

On Tue, 14 May 2013 21:27:49 +0100
Ben Morrow <b...@morrow.me.uk> wrote:

>

Ok, I agree that perl decodes 'ä' (which is utf8 x'c3a4' in the file) to
unicode \x{e4}.

Nevertheless the ä is a valid utf8 char.

This means that the test to check for valid utf8 which Peter proposed
is wrong as it croaks.

The following snippet:

#!/usr/bin/perl

use strict;
use warnings;

use utf8;

use Test::utf8;

binmode STDOUT, ":utf8";

my $ae = 'ä';

show_char($ae);

sub show_char {
    my $ch = shift;

    print '-' x 80;
    print "\n";
    print "Char: $ch\n";
    is_valid_string($ch);   # check the string is valid
    is_sane_utf8($ch);      # check not double encoded

    # check the string has certain attributes
    is_flagged_utf8($ch);   # has utf8 flag set
    is_within_ascii($ch);   # only has ascii chars in it
    is_within_latin_1($ch); # only has latin-1 chars in it

}

yields:
--------------------------------------------------------------------------------
Char: ä
ok 1 - valid string test
ok 2 - sane utf8
ok 3 - flagged as utf8
not ok 4 - within ascii
# Failed test 'within ascii'
# at ./unicode04.pl line 27.
# Char 1 not ASCII (it's 228 dec / e4 hex)
ok 5 - within latin-1
# Tests were run but no plan was declared and done_testing() was not seen.

which is what I would have assumed.

--
Manfred

Ben Morrow

May 15, 2013, 7:10:22 AM

If you're writing XS (that is, C) then perl's internal representation is
(sometimes) UTF-8. However, if you're writing Perl, you can't see that
(that's what 'internal' means), since Perl presents all strings,
regardless of their internal representation, as sequences of Unicode
characters. Perl's Unicode support wouldn't be much use if it didn't.

Ben

Rainer Weikusat

May 15, 2013, 8:03:35 AM

Manfred Lotz <manfre...@arcor.de> writes:
> On Tue, 14 May 2013 21:27:49 +0100

[...]

> My mistake was that I believed that perl's internal representation is
> utf8 instead of unicode code point.

perl's internal representation is utf8 which is supposed to be decoded
on demand as necessary. That's not an uncommon implementation choice
for software supposed to interact with 'the real world' (here supposed
to mean 'everything out there on the internet', have a look at the
Mozilla Rust FAQ for a cogent and succinct explanation why this makes
sense) but that's an implementation choice the people who presently
work on this code strongly disagree with: They would prefer a model
where, prior to each internal processing step, a pass over the
complete input data has to be made in order to transform it into "the
super-secret internal perl encoding" and after any internal processing
has been completed, a second pass over all of the data has to be made
in order to decode the 'super-secret internal perl encoding' into
something which is useful for anything except being 'super secret' and
'internal to Perl'.

This sort-of makes sense when assuming that perl is an island located
in strange waters and that it will usually keep mostly to itself
(figuratively speaking) and it makes absolutely no sense when 'some perl
code' performs one step of a multi-stage processing pipeline which may
possibly even include other perl code (since not even 'output of perl'
is supposed to be suitable to become 'input of perl').

Ben Morrow

May 15, 2013, 8:27:05 AM

Quoth Manfred Lotz <manfre...@arcor.de>:
> On Tue, 14 May 2013 21:27:49 +0100
> Ben Morrow <b...@morrow.me.uk> wrote:
> >
> > That is exactly what Peter was trying to explain. Because of the 'use
> > utf8', perl has already decoded the UTF-8 in the source code file into
> > Unicode characters, so $string does *not* contain "\x48\xc3\xa4":
> > instead it contains "\x48\xe4". The e4 is because 'ä', as a Unicode
> > character, has ordinal 0xe4. This string, which happens to contain
> > only bytes though it could easily not have done, is not valid UTF-8,
> > so decode croaks.
> >
>
> Ok, I agree that perl decodes 'ä' (which is utf8 x'c3a4' in the file) to
> unicode \x{e4}.
>
> Nevertheless the ä is a valid utf8 char.

No, you're confused about the difference between 'UTF-8' and 'Unicode'.

Unicode is a big list of characters, with names and associated semantics
(like 'the lowercase of character 'A' is character 'a''). Each of these
characters has been given a number; some of these numbers are >255, so
it isn't possible to represent a string of Unicode characters directly
with a string of bytes, the way you can with ASCII or Latin-1.

This is a problem, given that files (on most systems) and TCP
connections and so on are defined as strings of bytes. To solve it,
various 'Unicode Transformation Formats' have been invented. The one
usually used on Unix systems and in Internet protocols is called
'UTF-8'; if you feed a string of Unicode characters into a UTF-8 encoder
you get a string of bytes out, and if you feed a string of bytes into a
UTF-8 decoder you either get a string of Unicode characters or you get
an error, if the string of bytes wasn't valid UTF-8.

Perl strings are always strings of Unicode characters[0]. If you want to
represent a string of bytes in Perl, you do so by using a string of
characters all of which happen to have an ordinal value less than 256.
Perl does not make any attempt to keep track of whether a given string
was supposed to be 'a string of bytes' or not: you have to do this
yourself[1].

If you read a string from a file (without doing anything special to the
filehandle first), you will always get a string of bytes, because the
Unix file-reading APIs only support files that consist of strings of
bytes. If that string of bytes was supposed to be UTF-8, and you want to
manipulate it as a string of Unicode characters, you have to pass it
through Encode::decode. Since not all strings of bytes are valid UTF-8
this function can fail; this is what Peter posted.

If you write a string to a file (without...), the characters in the
string are written out directly as bytes. If they all have ordinals
below 256 this will effectively leave the file encoded in ISO8859-1,
since the first 256 Unicode characters have the same numbers as the 256
ISO8859-1 characters. If you try to write a character with ordinal 256
or greater, you will get a warning and stupid behaviour, because there
simply isn't any way to write a byte to a file with a value greater than
255[2]. If you want to write UTF-8 to a file, you have to encode your
string of characters (which may have ordinals >255) using
Encode::encode, which will return a string with all ordinals <256 which
you can write to the file.

So "\x48\xc3\xa4" is valid UTF-8. If you decode it into Unicode
characters, you get the string "\x48\xe4", which is *not* valid UTF-8.

What are you actually trying to do here? That is, why do you think you
need to check if a string is valid UTF-8?

Ben

[0, 1] Historical footnotes: Perl's Unicode support was started in
perl 5.6, and first became usable in 5.8. In the beginning the intention
was that Perl should keep track of whether a given string was a string
of bytes or a string of Unicode characters, and treat the string
differently (for some operations) in each case. This turned out to be a
nightmare, because Perl's dynamic typing system meant that strings kept
being unexpectedly converted from one type to the other, making it very
difficult to predict which behaviour a given operator would actually
use.

After a great deal of argument, the design was eventually changed to the
one I described above, and any remnants of the old design were
designated 'The Unicode Bug'. I believe the first version of perl which
properly fixed the Unicode Bug is 5.14, though there are still functions
in the API which shouldn't really be there. As a rule of thumb, any
function which mentions 'the UTF8 flag' is not a function you should be
using, unless you're trying to work around bugs in an XS module.

[2] The behaviour is stupider than it ought to be: what in fact happens
is that Perl encodes the character as UTF-8 and writes that out. This
will almost certainly make the file unreadable, since some parts will be
in UTF-8 and some parts will not. Properly perl ought to either give a
fatal error or write nothing at all.

Manfred Lotz

May 15, 2013, 9:24:33 AM

On Wed, 15 May 2013 13:27:05 +0100

I did not decode it.

> What are you actually trying to do here? That is, why do you think you
> need to check if a string is valid UTF-8?
>

I'm not trying anything. However, the OP asked if there is any easy way
to decide if a string is valid UTF-8. I answered him pointing to
Encode::is_utf8() which, as Peter rightly told me, is the wrong way.

Peter said that $decoded = eval { decode('UTF-8', $string, FB_CROAK) };
is correct which I don't believe.

Let me repeat from my last example. 'ä' is Unicode code point 0xe4 and
UTF-8 0xc3a4. In the script file (which itself is a UTF-8 encoded file)
ä is 0xc3a4. Why should perl kill this when I have specified 'use
utf8;'? My only statement is that $ae in the script below is a valid
utf8 string.

#!/usr/bin/perl

use strict;
use warnings;

use utf8;

use Test::utf8;

use Devel::Peek;

binmode STDOUT, ":utf8";

my $ae = 'ä';

show_char($ae);

sub show_char {
    my $ch = shift;

    print '-' x 80;
    print "\n";

    Dump $ch;

    print "Char: $ch\n";
    is_valid_string($ch);   # check the string is valid
    is_sane_utf8($ch);      # check not double encoded

    # check the string has certain attributes
    is_flagged_utf8($ch);   # has utf8 flag set
    is_within_ascii($ch);   # only has ascii chars in it
    is_within_latin_1($ch); # only has latin-1 chars in it

}

then I get:

--------------------------------------------------------------------------------
SV = PV(0x1b86dd0) at 0x1bd7470
REFCNT = 1
FLAGS = (PADMY,POK,pPOK,UTF8)
PV = 0x1bb0ec0 "\303\244"\0 [UTF8 "\x{e4}"]
CUR = 2
LEN = 16

Char: ä
ok 1 - valid string test
ok 2 - sane utf8
ok 3 - flagged as utf8
not ok 4 - within ascii
# Failed test 'within ascii'

# at ./unicode05.pl line 29.

# Char 1 not ASCII (it's 228 dec / e4 hex)
ok 5 - within latin-1
# Tests were run but no plan was declared and done_testing() was not seen.

This IMHO shows that $ae in above script is a valid utf8 string.
This is the only thing I state.

What is your argument for saying $ae is not utf8? You should tell me
where the above script is wrong, or tell me how to interpret the
output of the script in a different way than I did.

--
Manfred

Ben Morrow

May 15, 2013, 9:28:15 AM

Quoth Rainer Weikusat <rwei...@mssgmbh.com>:

> Manfred Lotz <manfre...@arcor.de> writes:
> > On Tue, 14 May 2013 21:27:49 +0100
>
> [...]
>
> > My mistake was that I believed that perl's internal representation is
> > utf8 instead of unicode code point.
>
> perl's internal representation is utf8 which is supposed to be decoded
> on demand as necessary. That's not an uncommon implementation choice
> for software supposed to interact with 'the real world' (here supposed
> to mean 'everything out there on the internet', have a look at the
> Mozilla Rust FAQ for a cogent and succinct explanation why this makes
> sense) but that's an implementation choice the people who presently
> work on this code strongly disagree with: They would prefer a model
> where, prior to each internal processing step, a pass over the
> complete input data has to be made in order to transform it into "the
> super-secret internal perl encoding" and after any internal processing
> has been completed, a second pass over all of the data has to be made
> in order to decode the 'super-secret internal perl encoding' into
> something which is useful for anything except being 'super secret' and
> 'internal to Perl'.

You are confusing semantics with internal representation. Encode is
privy to perl's internal representation; it knows that if you are
encoding into (loose) "utf8" and the string is internally represented as
SvUTF8 then all it has to do is flip the flag, and similarly that if you
are encoding into "ISO8859-1" and the string is not internally SvUTF8
that it doesn't need to do anything. Decoding is not quite so simple,
since it isn't safe to assume input which was supposed to be in UTF-8 is
actually valid, but decoding a non-SvUTF8 string from "utf8" still
doesn't do any actual decoding, it just validates the string and copies
it out.

If you are concerned about the copying overhead implied by the 'encode'
and 'decode' API, utf8::encode and utf8::decode will encode or decode in
place, without doing any copying unless they have to. Unlike ::upgrade
and ::downgrade, these are perfectly sensible functions to use if you
only need to encode or decode "utf8".
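
A small sketch of the in-place variant. utf8::decode returns false when its argument is not well-formed, so it can double as a check. Two assumptions to keep in mind: it works on the variable itself, hence the copy, and it accepts Perl's lax "utf8" rather than the strict 'UTF-8' encoding.

```perl
use strict;
use warnings;

my $good = "\x48\xc3\xa4";   # well-formed UTF-8 bytes
my $bad  = "\x48\xe4";       # 0xE4 starts a sequence that never completes

for my $bytes ($good, $bad) {
    my $copy = $bytes;       # utf8::decode works in place, so keep the original
    if (utf8::decode($copy)) {
        printf "decoded to %d character(s)\n", length $copy;
    }
    else {
        printf "not well-formed, still %d byte(s)\n", length $copy;
    }
}
```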

> This sort-of makes sense when assuming that perl is an island located
> in strange waters and that it will usually keep mostly to itself
> (figuratively speaking) and it makes absolutely no sense when 'some perl
> code' performs one step of a multi-stage processing pipeline which may
> possibly even include other perl code (since not even 'output of perl'
> is supposed to be suitable to become 'input of perl').

Unix IPC is defined in terms of bytes. There is no way to represent an
arbitrary Unicode character as a sequence of bytes without some sort of
encoding step. This is no different from the fact that you can't pass a
hash from one perl process to another without encoding it in some way
(for instance, with Storable).

Ben

Ben Morrow

May 15, 2013, 10:37:14 AM

Quoth Manfred Lotz <manfre...@arcor.de>:
> On Wed, 15 May 2013 13:27:05 +0100
> Ben Morrow <b...@morrow.me.uk> wrote:
>
> > So "\x48\xc3\xa4" is valid UTF-8. If you decode it into Unicode
> > characters, you get the string "\x48\xe4", which is *not* valid UTF-8.
>
> I did not decode it.

Yes you did. You passed Perl a file containing the bytes 0x22 0x48 0xc3
0xa4 0x22 (that is, "Hä", encoded in UTF-8), and you also said 'use
utf8;' which asks Perl to decode the rest of the file from UTF-8. Perl
did so, and so you ended up with the string "\x48\xe4" which, though it
happens to still be a string of bytes, is not valid UTF-8.

Until you understand this a bit better you should probably stay away
from the 'utf8' pragma. Write your source files in ASCII-only (that is,
don't use 8-bit ISO8859-1 characters either), and if you need strings
with Unicode in stick to "\x{...}" or "\N{...}".

> > What are you actually trying to do here? That is, why do you think you
> > need to check if a string is valid UTF-8?
>
> I'm not trying anything. However, the OP asked if there is any easy way
> to decide if a string is valid UTF-8. I answered him pointing to
> Encode::is_utf8() which, as Peter rightly told me, is the wrong way.

I thought you were the OP... oh God, this is a George Mpouras thread.
He's in my killfile for a reason...

> Peter said that $decoded = eval { decode('UTF-8', $string, FB_CROAK) };
> is correct which I don't believe.
>
> Let me repeat from my last example. 'ä' is Unicode code point 0xe4 and
> UTF-8 0xc3a4. In the script file (which itself is a UTF-8 encoded file)
> ä is 0xc3a4. Why should perl kill this when I have specified 'use
> utf8;'? My only statement is that $ae in the script below is a valid
> utf8 string.

Take out the 'use utf8;' and run the program again. Does that give you
the result you expected?

Now write the source file out in ISO8859-1 and run it again. Barring
bugs in perl, a source file written in ISO8859-1 *without* 'use utf8'
and the equivalent source file written in UTF-8 *with* 'use utf8' will
have exactly the same effect.

(In principle you can rewrite the file in any encoding you like, add an
equivalent 'use encoding' directive, and get the same effect. In
practice the implementation of 'encoding' is rather buggy, so that
doesn't entirely work.)

Perl does not remember that the string happened to come from a file
which happened to have been in UTF-8. All it knows is that the string
has two characters, "\x48\xe4", and that that string is *not* valid
UTF-8.

> SV = PV(0x1b86dd0) at 0x1bd7470
> REFCNT = 1
> FLAGS = (PADMY,POK,pPOK,UTF8)
> PV = 0x1bb0ec0 "\303\244"\0 [UTF8 "\x{e4}"]
> CUR = 2
> LEN = 16

[...]

>
> This IMHO shows that $ae in above script is a valid utf8 string.
> This is the only thing I state.

Which of these questions are you trying to answer?

If I write this string to a file, will that file be valid UTF-8?
Is the perl-internal SvUTF8 flag set?

Peter's answer is the correct answer to the first question, which is a
useful question to be able to answer. The correct answer to the second
is 'utf8::is_utf8', but the Right answer is 'except under exceptional
circumstances you don't need to know that, and in any case the answer is
not something you can rely on'.
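
To illustrate why the flag says nothing about the content, here is a sketch in which the same two-character string is held once in each internal form. utf8::upgrade is used here only to force the other representation; the variable names are illustrative.

```perl
use strict;
use warnings;

my $plain    = "\x48\xe4";   # two characters, stored without the UTF8 flag
my $upgraded = $plain;
utf8::upgrade($upgraded);    # same characters, internal storage now UTF-8

printf "flag set: %s / %s\n",
    utf8::is_utf8($plain)    ? "yes" : "no",
    utf8::is_utf8($upgraded) ? "yes" : "no";        # no / yes
printf "same string: %s\n", $plain eq $upgraded ? "yes" : "no";   # yes
printf "lengths: %d / %d\n", length $plain, length $upgraded;     # 2 / 2
```

Neither copy is a valid UTF-8 byte sequence, yet one of them has the flag set, so the flag cannot answer the first question.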

Ben

Manfred Lotz

May 15, 2013, 11:48:52 AM

On Wed, 15 May 2013 15:37:14 +0100

In my opinion it makes no sense to leave out 'use utf8;' if I have utf8
stuff in my script which is outside of ASCII. The only requirement I
have is that 'ä' won't change whatever perl does with it internally.
This works fine so I have no complaints.

> Now write the source file out in ISO8859-1 and run it again. Barring
> bugs in perl, a source file written in ISO8859-1 *without* 'use utf8'
> and the equivalent source file written in UTF-8 *with* 'use utf8' will
> have exactly the same effect.
>
> (In principle you can rewrite the file in any encoding you like, add
> an equivalent 'use encoding' directive, and get the same effect. In
> practice the implementation of 'encoding' is rather buggy, so that
> doesn't entirely work.)
>
> Perl does not remember that the string happened to come from a file
> which happened to have been in UTF-8. All it knows is that the string
> has two characters, "\x48\xe4", and that that string is *not* valid
> UTF-8.
>
> > SV = PV(0x1b86dd0) at 0x1bd7470
> > REFCNT = 1
> > FLAGS = (PADMY,POK,pPOK,UTF8)
> > PV = 0x1bb0ec0 "\303\244"\0 [UTF8 "\x{e4}"]
> > CUR = 2
> > LEN = 16
> [...]
> >
> > This IMHO shows that $ae in above script is a valid utf8 string.
> > This is the only thing I state.
>
> Which of these questions are you trying to answer?
>
> If I write this string to a file, will that file be valid UTF-8?

This was not asked by the OP. But if I write $ae to stdout using
binmode STDOUT, ":utf8" then I'm fine.

> Is the perl-internal SvUTF8 flag set?
>

I only tried to answer the question of whether a string is valid utf8.
After the discussions we had, the new question seems to be whether the
former is a meaningful question at all, because if the string contained
stuff which is invalid utf8 (which can happen when there is some hex
garbage), then Emacs would have complained, at the latest when I tried
to save the buffer.

--
Manfred

Rainer Weikusat

May 15, 2013, 2:01:53 PM

Ben Morrow <b...@morrow.me.uk> writes:
> Quoth Rainer Weikusat <rwei...@mssgmbh.com>:
>> Manfred Lotz <manfre...@arcor.de> writes:
>> > On Tue, 14 May 2013 21:27:49 +0100
>>
>> [...]
>>
>> > My mistake was that I believed that perl's internal representation is
>> > utf8 instead of unicode code point.
>>
>> perl's internal representation is utf8 which is supposed to be decoded
>> on demand as necessary. That's not an uncommon implementation choice
>> for software supposed to interact with 'the real world' (here supposed
>> to mean 'everything out there on the internet', have a look at the
>> Mozilla Rust FAQ for a cogent and succinct explanation why this makes
>> sense) but that's an implementation choice the people who presently
>> work on this code strongly disagree with: They would prefer a model
>> where, prior to each internal processing step, a pass over the
>> complete input data has to be made in order to transform it into "the
>> super-secret internal perl encoding" and after any internal processing
>> has been completed, a second pass over all of the data has to be made
>> in order to decode the 'super-secret internal perl encoding' into
>> something which is useful for anything except being 'super secret' and
>> 'internal to Perl'.
>
> You are confusing semantics with internal representation.

I'm not 'confusing' anything. I described this (AFAICT) correctly from
the abstract viewpoint a 'language user' is supposed to assume.

BTW: This 'stock reply' to any kind of justified criticism, attack the person
who wrote it as 'clueless' by substituting an alternate, more-or-less
related topic, is really getting long in the tooth.

> Encode is privy to perl's internal representation; it knows that if
> you are encoding into (loose) "utf8" and the string is internally
> represented as SvUTF8 then all it has to do is flip the flag, and
> similarly that if you are encoding into "ISO8859-1" and the string
> is not internally SvUTF8 that it doesn't need to do
> anything. Decoding is not quite so simple, since it isn't safe to
> assume input which was supposed to be in UTF-8 is actually valid,
> but decoding a non-SvUTF8 string from "utf8" still doesn't do any
> actual decoding, it just validates the string and copies it out.

The idea that the programmer should be forced to do useless stuff but
that otherwise useless code can be used to detect that the computer
can skip this useless request doesn't exactly make sense: Despite
being useless, the useless request code (uselessly) needs to be
written, debugged and maintained and human time is much more expensive
than computer time.

[...]

>> This sort-of makes sense when assuming that perl is an island located
>> in strange waters and that it will usually keep mostly to itself
>> (figuratively speaking) and it makes absolutely no sense when 'some perl
>> code' performs one step of a multi-stage processing pipeline which may
>> possibly even include other perl code (since not even 'output of perl'
>> is supposed to be suitable to become 'input of perl').
>
> Unix IPC is defined in terms of bytes. There is no way to represent an
> arbitrary Unicode character as a sequence of bytes without some sort of
> encoding step.

Quoting the document I already mentioned in the original posting:

Why are strings UTF-8 by default? Why not UCS2 or UCS4?

The str type is UTF-8 because we observe more text in the wild
in this encoding -- particularly in network transmissions,
which are endian-agnostic -- and we think it's best that the
default treatment of I/O not involve having to recode
codepoints in each direction.
https://github.com/mozilla/rust/wiki/Doc-language-FAQ#why-are-strings-utf-8-by-default-why-not-ucs2-or-ucs4

NB: That's the exact argument I made and I guess the correct 'open
source response' should be that 'the Perl5 tribe' goes on the warpath
in order to exterminate 'the Mozilla Rust tribe' and thus, rid the
world of these "fundamentally mistaken" dissenting opinions ...

Helmut Richter

May 15, 2013, 3:52:52 PM

On Wed, 15 May 2013, Rainer Weikusat wrote:

> The idea that the programmer should be forced to do useless stuff but
> that otherwise useless code can be used to detect that the computer
> can skip this useless request doesn't exactly make sense: Despite
> being useless, the useless request code (uselessly) needs to be
> written, debugged and maintained and human time is much more expensive
> than computer time.

The idea is to separate things that belong to the interface from those
that do not. The latter things may change at any time or from one
implementation to another without doing any harm to people who have only
used the documented interface and not arbitrary implementation decisions
of one particular implementation. This is a wise way to proceed.

The internal representation of character strings in perl does *not* belong
to the interface. If you happen to know how it is done (in particular that
the same character string may have different representations in the same
implementation), don't use it because it may change at any time without
warning. This is so in all programming languages. If you try to exploit
your knowledge of the bitwise representation of a Fortran real number your
code may break when you go from one implementation to another.

By the way, this kind of defined interface made it possible to expand perl
strings beyond ISO-8859-1 without breaking existing applications.

--
Helmut Richter

Rainer Weikusat

May 15, 2013, 4:39:29 PM

Helmut Richter <hh...@web.de> writes:
> On Wed, 15 May 2013, Rainer Weikusat wrote:
>> The idea that the programmer should be forced to do useless stuff but
>> that otherwise useless code can be used to detect that the computer
>> can skip this useless request doesn't exactly make sense: Despite
>> being useless, the useless request code (uselessly) needs to be
>> written, debugged and maintained and human time is much more expensive
>> than computer time.
>
> The idea is to separate things that belong to the interface from those
> that do not. The latter things may change at any time or from one
> implementation to another without doing any harm to people who have only
> used the documented interface and not arbitrary implementation decisions
> of one particular implementation. This is a wise way to proceed.

That's a completely general statement about "good programming
practices". The sole purpose it is supposed to fulfil here is to
suggest that an opinion about something which happens to conflict with
some other opinion would somehow conflict with the mentioned 'good
programming practice' without detailing how exactly.

> The internal representation of character strings in perl does *not* belong
> to the interface.

The people who are presently concerned with this think that perl
should have a 'super-secret internal character representation' which
isn't useful for anything except 'perl-internal processing' (and not
compatible with anything, including different instances of perl
itself). As far as I know, the reason why they think this is that
'implementation convenience' trumps 'real-world usability'. Other
people working on similar stuff in other programming languages
(including older versions of Perl) think that the character string
representation used by $language should be documented and follow a
'sensibly chosen existing convention' even if this might cause
'implementation inconveniences'.

[...]

> If you try to exploit your knowledge of the bitwise representation
> of a Fortran real number your code may break when you go from one
> implementation to another.

I have no knowledge about 'bitwise representation of
Fortran-anything' and 'Fortran floating-point data types' and
'representation of unicode strings' are two very much different things
(in particular, I doubt that many web pages or other existing 'text
files' contain 'Fortran floating point numbers' represented in
binary). Apart from that, there are standards for representing
'floating point values'.

Helmut Richter

May 16, 2013, 5:26:34 AM

On Wed, 15 May 2013, Rainer Weikusat wrote:

> Helmut Richter <hh...@web.de> writes:

> > The idea is to separate things that belong to the interface from those
> > that do not. The latter things may change at any time or from one
> > implementation to another without doing any harm to people who have only
> > used the documented interface and not arbitrary implementation decisions
> > of one particular implementation. This is a wise way to proceed.

> That's a completely general statement about "good programming
> practices".

Indeed. And it is meant as such.

Implementing something in a way that the arbitrary choice of implementation
details becomes part of the interface and thus can never again be changed
would be a major blunder, and I am glad the perl implementers have not done
so.

> As far as I know, the reason why they think this is that
> 'implementation convenience' trumps 'real-world usability'. Other
> people working on similar stuff in other programming languages
> (including older versions of Perl) think that the character string
> representation used by $language should be documented and follow a
> 'sensibly chosen existing convention' even if this might cause
> 'implementation inconveniences'.

Would you have found it better programming practice if, decades ago, perl had
decided to publish as an interface that iso-8859-1 (the most advanced
character standard then), one byte per character, is the internal
representation for all time? Or should they have taken such a decision at
the time when character code points were restricted to 16 bits? Why should
they do it just now?

It is by no means mandatory to do it the way the perl people did. They could
have chosen a *more* strict separation between character strings and byte
strings so that all input/output is to and from byte strings, only byte
strings can be decoded and only character strings can be encoded. This would
have prevented some programming mistakes people are now making. I, too, have
doubts that they chose the best solution. But allowing the programmer access
to the internal representation would have been a major design blunder.
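
The stricter discipline described here can be approximated today with the
Encode module. The following is only a sketch: the helper names
bytes_to_chars and chars_to_bytes are invented for illustration, and the
strictness comes from passing Encode::FB_CROAK, so malformed input dies
instead of being silently replaced:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: bytes from the outside world become characters.
# The argument is copied, so Encode may consume the copy freely.
sub bytes_to_chars {
    my ($bytes) = @_;
    return decode('UTF-8', $bytes, Encode::FB_CROAK);
}

# Hypothetical helper: characters become UTF-8 bytes for output.
sub chars_to_bytes {
    my ($chars) = @_;
    return encode('UTF-8', $chars, Encode::FB_CROAK);
}

my $input = "H\xc3\xa4";              # UTF-8 bytes for 'H' plus U+00E4
my $chars = bytes_to_chars($input);   # character string
my $out   = chars_to_bytes($chars);   # byte string again

# A lone 0xE4 byte is Latin-1, not valid UTF-8, so decoding should die.
my $ok = eval { bytes_to_chars("H\xe4"); 1 };

print length($chars), " characters, ", length($out), " bytes\n";
print "malformed input ", ($ok ? "accepted" : "rejected"), "\n";
```

Nothing in perl enforces that every string passes through such helpers;
that part remains a convention the programmer has to follow.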

And what do you positively get from direct access to the internal
representation? You talked about efficiency. Is it really a major efficiency
issue to let perl decide by inspection of one bit whether the internal
representation of a particular string happens to be already utf-8 so that the
encoding/decoding is practically a null operation?

--
Helmut Richter

Dr.Ruud

May 16, 2013, 5:34:15 AM

On 15/05/2013 17:48, Manfred Lotz wrote:

> In my opinion it makes no sense to leave out 'use utf8;' if I have utf8
> stuff in my script which is outside of ASCII.

Sure, if your source file is "in 'utf8' format" (and of course a fully
ASCII file is 'utf8' (and 'UTF-8') as well), then it shouldn't harm.

But still be aware of the consequences. If you save the file as latin1
at some point, you break it, exactly because of the "use utf8;".

I prefer my source files to be ASCII, so I use code like "\x{1234}".
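
A small sketch of that ASCII-only style (the variable names are just for
illustration). Because the file contains no bytes above 0x7F, it means the
same thing whether an editor saves it as latin1 or UTF-8, and no
"use utf8;" is involved:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $umlaut = "\x{e4}";       # U+00E4, written as an escape
my $wide   = "\x{1234}";     # U+1234, the example code point used above
my $named  = "\N{U+00E4}";   # same character as $umlaut, other notation

# Each escape yields exactly one character.
print length($umlaut), " ", length($wide), "\n";

# The byte count only matters once the string is encoded for output.
my $wide_bytes = encode('UTF-8', $wide);
print length($wide_bytes), "\n";
```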

Now read what the module's documentation states:

utf8 - Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source
code [...]

The "use utf8" pragma tells the Perl parser to allow UTF-8 in the
program text in the current lexical scope [...]

Do not use this pragma for anything else than telling Perl that your
script is written in UTF-8.

--
Ruud

Rainer Weikusat

May 16, 2013, 12:01:40 PM

Helmut Richter <hh...@web.de> writes:
> On Wed, 15 May 2013, Rainer Weikusat wrote:
>
>> Helmut Richter <hh...@web.de> writes:
>
>> > The idea is to separate things that belong to the interface from those
>> > that do not. The latter things may change at any time or from one
>> > implementation to another without doing any harm to people who have only
>> > used the documented interface and not arbitrary implementation decisions
>> > of one particular implementation. This is a wise way to proceed.
>
>> That's a completely general statement about "good programming
>> practices".
>
> Indeed. And it is meant as such.

Doesn't this 'delete content and reply to a more convenient
fabrication' trick become boring over time?

,----
| The sole purpose it is supposed to fulfil here is to
| suggest that an opinion about something which happens to conflict with
| some other opinion would somehow conflict with the mentioned 'good
| programming practice' without detailing how exactly.
`----

What you should realize here is that this is not a dogma, ie, a statement
detailing an unquestionable truth made by a (by definition) infallible
entity, but a generalized guideline supposed to be of _demonstrable_
practical usefulness in 'certain situations'. Consequently, quoting it
as if it was akin to "Thou shalt not bear false witness against thy
neighbour" is not sufficient as argument in favor of or against
anything, even more so when actual existence of a 'violation of the
principle' is just implied but not described. Yet more so, when the
statement is demonstrably wrong: The Perl programming language is only
a part of the 'interface' to perl, the other is the extension
facility which has direct access to everything inside the Perl core,
including the mechanics of character handling.

> Implementing something in a way that the arbitrary choice of implementation
> details becomes part of the interface and thus can never again be changed
> would be a major blunder,

'The interface' itself is nothing but the cumulative effect of a set
of perfectly arbitrary implementation choices: Every perl operator
could have been implemented in a different way or not implemented at
all.

I'm going to ignore the rest of this text because you aren't telling
the truth, you know that, I know that, and you know that I know that.

Rainer Weikusat

May 16, 2013, 12:05:16 PM

Rainer Weikusat <rwei...@mssgmbh.com> writes:

[...]

> I'm going to ignore the rest of this text because you aren't telling
> the truth, you know that, I know that, and you know that I know that.

Addition: A discussion of the relative merits of either approach for
handling 'extended characters' could be interesting. However, I'm not
interested in trying to argue for both sides, ie, against my own
standpoint, and these "the Gods have chosen wisely and now it is for
the mortals to obey" declarations of faith (or fandom) are pointless.

Manfred Lotz

May 16, 2013, 12:26:37 PM

On Thu, 16 May 2013 11:34:15 +0200
"Dr.Ruud" <rvtol+...@xs4all.nl> wrote:

> On 15/05/2013 17:48, Manfred Lotz wrote:
>
> > In my opinion it makes no sense to leave out 'use utf8;' if I have
> > utf8 stuff in my script which is outside of ASCII.
>
> Sure, if your source file is "in 'utf8' format" (and of course a
> fully ASCII file is 'utf8' (and 'UTF-8') as well), then it shouldn't
> harm.
>
> But still be aware of the consequences. If you save the file as
> latin1 at some point, you break it, exactly because of the "use
> utf8;".
>

Yep, this is true. However, Emacs wouldn't do this. :-)

>
> I prefer my source files to be ASCII, so I use code like "\x{1234}".
>
>
> Now read what the module's documentation states:
>
> utf8 - Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source
> code [...]
>
> The "use utf8" pragma tells the Perl parser to allow UTF-8 in the
> program text in the current lexical scope [...]
>
> Do not use this pragma for anything else than telling Perl that your
> script is written in UTF-8.
>

In any case, I would use the utf8 pragma only if I really need it.

--
Manfred

Helmut Richter

May 16, 2013, 12:32:27 PM

On Thu, 16 May 2013, Rainer Weikusat wrote:

> Helmut Richter <hh...@web.de> writes:
> > On Wed, 15 May 2013, Rainer Weikusat wrote:
> >
> >> Helmut Richter <hh...@web.de> writes:
> >
> >> > The idea is to separate things that belong to the interface from those
> >> > that do not. The latter things may change at any time or from one
> >> > implementation to another without doing any harm to people who have only
> >> > used the documented interface and not arbitrary implementation decisions
> >> > of one particular implementation. This is a wise way to proceed.
> >
> >> That's a completely general statement about "good programming
> >> practices".
> >
> > Indeed. And it is meant as such.
>
> Doesn't this 'delete content and reply to a more convenient
> fabrication' trick become boring over time?
>
> ,----
> | The sole purpose it is supposed to fulfil here is to
> | suggest that an opinion about something which happens to conflict with
> | some other opinion would somehow conflict with the mentioned 'good
> | programming practice' without detailing how exactly.
> `----

I did not like to answer your allegation of motives which are not mine.

> I'm going to ignore the rest of this text because you aren't telling
> the truth, you know that, I know that, and you know that I know that.

Interesting twist.

--
Helmut Richter

Helmut Richter

May 16, 2013, 12:39:05 PM

My wording of "the Gods have chosen wisely and now it is for the mortals
to obey" was "I, too, have doubts that they chose the best solution."

I have only much more serious doubts that your idea to publish
implementation details as interface would have been better.

Another example: I am mostly using Emacs as text editor. I do know that
when I type the character "�" or "�" when entering text, exactly this
character will appear in the file in the encoding I choose when saving the
file. I have no idea how this character is stored internally while emacs
is underway. And that's absolutely fine with me. Why should perl not do
likewise?

--
Helmut Richter

Rainer Weikusat

May 16, 2013, 1:21:08 PM

Helmut Richter <hh...@web.de> writes:
> On Thu, 16 May 2013, Rainer Weikusat wrote:
>
>> Rainer Weikusat <rwei...@mssgmbh.com> writes:
>>
>> [...]
>>
>> > I'm going to ignore the rest of this text because you aren't telling
>> > the truth, you know that, I know that, and you know that I know that.
>>
>> Addition: A discussion of the relative merits of either approach for
>> handling 'extended characters' could be interesting. However, I'm not
>> interested in trying to argue for both sides, ie, against my own
>> standpoint, and these "the Gods have chosen wisely and now it is for
>> the mortals to obey" declarations of faith (or fandom) are pointless.
>
> My wording of "the Gods have chosen wisely and now it is for the mortals
> to obey" was "I, too, have doubts that they chose the best solution."
>
> I have only much more serious doubts that your idea to publish
> implementation details as interface would have been better.

But this isn't my idea; that's just a totally generic label you have
chosen to attach to a certain standpoint regarding how 'unicode
strings' should be handled. It is also wrong to refer to this as 'my
idea', since it isn't my idea, and to refer to it as not published,
because it *is* part of the published documentation of perl. For
instance, to this day, the perlguts manpage contains the following
text:

To fix this, some people formed Unicode, Inc. and produced a
new character set containing all the characters you can
possibly think of and more. There are several ways of
representing these characters, and the one Perl uses is called
UTF-8. UTF-8 uses a variable number of bytes to represent a
character.
http://perldoc.perl.org/perlguts.html#Unicode-Support
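
What "a variable number of bytes" means can be seen without looking at any
perl internals, by explicitly encoding a few characters. This is only an
illustration of UTF-8 as an encoding; the code points are arbitrary
examples and the variable names are made up:

```perl
use strict;
use warnings;
use Encode qw(encode);

# One code point from each UTF-8 length class.
my @code_points = (0x41, 0xE4, 0x20AC, 0x1F600);
my @byte_counts;

for my $cp (@code_points) {
    my $bytes = encode('UTF-8', chr($cp));       # character -> UTF-8 bytes
    my $hex   = join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
    push @byte_counts, length($bytes);
    printf "U+%04X -> %d byte(s): %s\n", $cp, length($bytes), $hex;
}
```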

> Another example: I am mostly using Emacs as text editor. I do know that
> when I type the character "�" or "�" when entering text, exactly this
> character will appear in the file in the encoding I choose when saving the
> file. I have no idea how this character is stored internally while emacs
> is underway. And that's absolutely fine with me. Why should perl not do
> likewise?

Because Perl is a programming language and not a text editor and
depending on the kind of program, different strategies for UTF-8
decoding might make sense. A nice discussion of this is available in
the 'Converting the tools' section of this paper:

http://plan9.bell-labs.com/sys/doc/utf.html