
Unicode cheatsheet for Perl


Tom Christiansen

Feb 20, 2012, 12:27:33 PM
to Perl5 Porters Mailing List, Karl Williamson, Jarkko Hietaniemi
Inspired by how scandalously Unicode-deficient the
otherwise fine 4-way polyglot table comparing PHP, Perl,
Python, and Ruby is at

http://hyperpolyglot.org/scripting

I created a quick Unicode cheatsheet for Perl, mostly by
mining the examples in the new 4th edition of the Camel.

Gee, I foresee a *whole* lot of "impossibles" in the
other three languages' columns, don't you? :)

Hm, have I left anything out that Perl is especially cool with?

I almost wonder whether this sort of thing oughtn't be a manpage,
something like perluni{ref,cheat,quick}?

--tom

=Characters and their numbers

# ASCII
ord("A")
chr(65)

# BMP
ord("Σ")
chr(0x3A3)

# beyond the BMP
ord("𝑛")
chr(0x1D45B)

# beyond Unicode (up to MAXINT)
ord("\x{20_0000}")
chr(0x20_0000)

=Unicode literals by character number

String: "\x{3a3}"
Regex: /\x{3a3}/

String: "\x{1d45b}"
Regex: /\x{1d45b}/

# even non-BMP ranges in regex work fine
/[\x{1D434}-\x{1D467}]/

=Get character name by number

use charnames ();
my $name = charnames::viacode(0x03A3);

=Get character number by name

use charnames ();
my $number = charnames::vianame("GREEK CAPITAL LETTER SIGMA");

=Unicode named characters

use charnames qw(:full :short greek);

"\N{MATHEMATICAL ITALIC SMALL N}"
"\N{GREEK CAPITAL LETTER SIGMA}"
"\N{Greek:Sigma}"
"\N{epsilon}"

=Unicode named sequences

use charnames qw(:full);
my $seq = "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}";
printf "U+%v04X\n", $seq;
U+0100.0300

=Custom named characters

use charnames ":full", ":alias" => {
ecute => "LATIN SMALL LETTER E WITH ACUTE",
"APPLE LOGO" => 0xF8FF, # private use character
};

"\N{ecute}"
"\N{APPLE LOGO}"

=Declare source in utf8 for identifiers and literals

use utf8;

my $measure = "Ångström";
my @μsoft = qw( cp852 cp1251 cp1252 );
my @ὑπέρμεγας = qw( ὑπέρ μεγας );
my @鯉 = qw( koi8-f koi8-u koi8-r );

=Unicode casing

uc("henry ⅷ") # "HENRY Ⅷ"
uc("tschüß") # "TSHUESS"

# both are true:
"tschüß" =~ /TSHUESS/i
"Σίσυφος" =~ /ΣΊΣΥΦΟΣ/i

=Unicode case-insensitive comparisons

use utf8;
use feature "fc"; # fc() function is from v5.16

# sort case-insensitively
my @sorted = sort { fc($a) cmp fc($b) } @list;

# both are true:
fc("tschüß") eq fc("TSHUESS")
fc("Σίσυφος") eq fc("ΣΊΣΥΦΟΣ")

=Match Unicode linebreak sequence in regex

\R

s/\R/\n/g; # normalize all linebreaks to \n

=Match Unicode properties in regex with \p, \P

\pL, \pN, \pS, \pP, \pM, \pZ, \pC
\p{Sk}, \p{Ps}, \p{Lt}
\p{alpha}, \p{upper}, \p{lower}
\p{Latin}, \p{Greek}
\p{script=Latin}, \p{script=Greek}
\p{East_Asian_Width=Wide}, \p{EA=W}
\p{Line_Break=Hyphen}, \p{LB=HY}
\p{Numeric_Value=4}, \p{NV=4}

=Custom character properties

# using private-use characters
sub In_Tengwar { "E000\tE07F\n" }

if (/\p{In_Tengwar}/) { ... }

# blending existing properties
sub Is_GraecoRoman_Title {<<'END_OF_SET'}
+utf8::IsLatin
+utf8::IsGreek
&utf8::IsTitle
END_OF_SET

if (/\p{Is_GraecoRoman_Title}/) { ... }

=Get character category

use Unicode::UCD qw(charinfo);
my $cat = charinfo(0x3A3)->{category}; # "Lu"

=Convert non-ASCII Unicode numerics

# from v5.12
use Unicode::UCD qw(num);
if (/(\d+|\N)/) { # not just ASCII!
$nv = num($1);
}

use charnames qw(:full);
my $nv = num("\N{RUMI DIGIT ONE}\N{RUMI DIGIT TWO}");

=Match Unicode grapheme cluster in regex

\X

# match and grab five first graphemes
my($first_five) = /^(\X{5})/;

# Find vowel plus any diacritics
use Unicode::Normalize;
my $nfd = NFD($orig);
$nfd =~ /(?=[aeiou])\X/i

=Reverse string by grapheme

$str = join("", reverse $str =~ /\X/g);

# OR: cpan -i Unicode::GCString
use Unicode::GCString;
$str = reverse Unicode::GCString->new($str);

=String length in graphemes

my $count = 0;
while ($str =~ /\X/g) { $count++ } # /g advances the match; without it the loop never ends

# OR: cpan -i Unicode::GCString
use Unicode::GCString;
$gcs = Unicode::GCString->new($str);
my $count = $gcs->length;

=Substring by grapheme

# cpan -i Unicode::GCString
use Unicode::GCString;
$gcs = Unicode::GCString->new($str);
my $piece = $gcs->substr(5, 5);

=Unicode column-width for printing

# cpan -i Unicode::GCString
use Unicode::GCString;
$gcs = Unicode::GCString->new($str);
my $cols = $gcs->columns;
printf "%*s\n", $cols, $str;

=Unicode normalization

use Unicode::Normalize;
my $nfd = NFD($orig);
my $nfc = NFC($orig);
my $nfkd = NFKD($orig);
my $nfkc = NFKC($orig);

=Unicode collation

use Unicode::Collate;
my $col = Unicode::Collate->new();
my @list = $col->sort(@old_list);

=Case- *and* accent-insensitive Unicode sort

use Unicode::Collate;
my $col = Unicode::Collate->new(level => 1);
my @list = $col->sort(@old_list);

=Unicode locale collation

# either use v5.12, OR: cpan -i Unicode::Collate::Locale
use Unicode::Collate::Locale;
my $col = Unicode::Collate::Locale->new(locale => "de__phonebook");
my @list = $col->sort(@old_list);

=Case- *and* accent-insensitive comparisons

use utf8;
use Unicode::Collate;
my $coll = Unicode::Collate->new(
level => 1,
normalization => undef
);

# now both are true:
$coll->eq("García", "GARCIA" );
$coll->eq("Márquez", "MARQUEZ");

=Unicode linebreaking

# cpan -i Unicode::LineBreak
use Unicode::LineBreak;
use charnames qw(:full);

my $para = "This is a super\N{HYPHEN}long string. " x 20;
my $fmt = Unicode::LineBreak->new;
print $fmt->break($para), "\n";

=Declare std streams to be utf8

$ perl -CS ...
or
$ export PERL_UNICODE=S
or
use open qw(:std :utf8);
or
binmode(STDIN, ":utf8");
binmode(STDOUT, ":utf8");
binmode(STDERR, ":utf8");

=Make I/O default to utf8

$ perl -CSD ...
or
$ export PERL_UNICODE=SD
or
use open qw(:std :utf8);

=Open file with implicit encode/decode

# input file
open(my $in_file, "< :encoding(UTF-16)", "wintext");
OR
open(my $in_file, "<", "wintext");
binmode($in_file, ":encoding(UTF-16)");
THEN
my $line = <$in_file>;

# output file
open(my $out_file, "> :encoding(cp1252)", "wintext");
OR
open(my $out_file, ">", "wintext");
binmode($out_file, ":encoding(cp1252)");
THEN
print $out_file "some text\n";

=Explicit encode/decode [rarely needed, see previous]

use Encode qw(encode decode);

my $chars = decode("shiftjis", $bytes);
OR
my $bytes = encode("MIME-Header-ISO_2022_JP", $chars);

H.Merijn Brand

Feb 20, 2012, 12:35:49 PM
to perl5-...@perl.org
On Mon, 20 Feb 2012 10:27:33 -0700, Tom Christiansen <tch...@perl.com>
wrote:

> Inspired by how scandalously Unicode-deficient the
> otherwise fine 4-way polyglot table comparing PHP, Perl,
> Python, and Ruby is at
>
> http://hyperpolyglot.org/scripting
>
> I created a quick Unicode cheatsheet for Perl, mostly by
> mining the examples in the new 4th edition of the Camel.

Useful! Thanks. TomC++
--
H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/
using perl5.00307 .. 5.14 porting perl5 on HP-UX, AIX, and openSUSE
http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/
http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/

Alberto Simões

Feb 20, 2012, 1:06:01 PM
to perl5-...@perl.org
Am I missed anything?

search by Unicode gave no results :S

Alberto Simões

Feb 20, 2012, 1:35:30 PM
to perl5-...@perl.org


On 20/02/12 18:31, Dilbert wrote:
> On 20 fév, 18:27, tchr...@perl.com (Tom Christiansen) wrote:
>> =Unicode casing
>> uc("tschüß") # "TSHUESS"
>
> I think there is the letter "C" missing in "TSHUESS"
>
>> "tschüß" =~ /TSHUESS/i
>
> same here, the letter "C" is missing in /TSHUESS/i
>
> ...or is there something that I don't understand how Unicode is
> translated into uppercase ?

Dilbert++

Dilbert

Feb 20, 2012, 1:31:11 PM
to perl5-...@perl.org
On 20 fév, 18:27, tchr...@perl.com (Tom Christiansen) wrote:
> =Unicode casing
>     uc("tschüß")   # "TSHUESS"

Tom Christiansen

Feb 20, 2012, 1:48:22 PM
to Dilbert, brian d foy, perl5-...@perl.org
Dilbert <dilbe...@gmail.com> wrote
on Mon, 20 Feb 2012 10:31:11 PST:

>> =Unicode casing
>>     uc("tschüß")   # "TSHUESS"

> I think there is the letter "C" missing in "TSHUESS"

>>     "tschüß"  =~ /TSHUESS/i

> same here, the letter "C" is missing in /TSHUESS/i

> ...or is there something that I don't understand how Unicode is
> translated into uppercase ?

No, I just typo'd it; you're right. Thanks.

--tom

Tom Christiansen

Feb 20, 2012, 1:47:04 PM
to al...@alfarrabio.di.uminho.pt, perl5-...@perl.org
Alberto Simões <al...@alfarrabio.di.uminho.pt> wrote
on Mon, 20 Feb 2012 18:06:01 GMT:

> Am I missed anything?

> search by Unicode gave no results :S

I don't understand what you mean.

The normal reason that Unicode searches fail is that one or both
of the source code or the input should have been specified as being
encoded in UTF-8 but weren't.

For example, replace the ellipsis with your search:

use utf8;
use v5.14;
use strict;
use warnings;
use open qw(:std :utf8);
use warnings "FATAL" => "utf8";
use charnames qw(:full :short latin greek);

use Unicode::Normalize;

while (<>) {
$_ = NFD($_);
...
} continue {
print NFC($_);
}


And if you're running blead, you may add :loose to charnames.

--tom

Alberto Simões

Feb 20, 2012, 1:49:50 PM
to Tom Christiansen, perl5-...@perl.org


On 20/02/12 18:47, Tom Christiansen wrote:
> Alberto Simões<al...@alfarrabio.di.uminho.pt> wrote
> on Mon, 20 Feb 2012 18:06:01 GMT:
>
>> Am I missed anything?
>
>> search by Unicode gave no results :S

Sorry.
I meant that in http://hyperpolyglot.org/scripting I didn't find any
mention to unicode (tried to search for it without a result).

Tom Christiansen

Feb 20, 2012, 1:53:58 PM
to al...@alfarrabio.di.uminho.pt, perl5-...@perl.org
>I meant that in http://hyperpolyglot.org/scripting I didn't find any
>mention to unicode (tried to search for it without a result).

Yes, *exactly*: it's missing any and all mention of any Unicode smarts.

The thingy I mailed out is eventually intended to fix that, once I get
somebody to add a section for Unicode stuff. I'm trying to highlight
how much more convenient and comprehensive Perl's Unicode support actually
is with respect to the rest of the cyberlinguae.

I'm thinking it "might-should" be a bit of the std docset. Opinions?

--tom

Tom Christiansen

Feb 20, 2012, 2:32:00 PM
to Karl Williamson, Perl5 Porters Mailing List, Jarkko Hietaniemi
Karl Williamson <pub...@khwilliamson.com> wrote
on Mon, 20 Feb 2012 12:27:18 MST:

Thanks for all that, Karl.

>> =Declare std streams to be utf8

> I hate to publicize this until :utf8 by default checks for
> malformedness. I put this as a blocker to 5.16, but I suspect
> I will be overruled.

Are you saying that

use warnings FATAL => "utf8";

doesn't fix the ":utf8" ''bug''?

I feel it a mistake to make people use the :encoding(UTF-8)
style, amongst other reasons because then they lose all control
over the three utf8 warning subcategories. We cannot do anything
about the Encode module, and they do not keep up with Perl.

Perl should certainly handle utf8 properly in all cases.

--tom

Karl Williamson

Feb 20, 2012, 2:27:18 PM
to Tom Christiansen, Perl5 Porters Mailing List, Jarkko Hietaniemi
On 02/20/2012 10:27 AM, Tom Christiansen wrote:
> Inspired by how scandalously Unicode-deficient the
> otherwise fine 4-way polyglot table comparing PHP, Perl,
> Python, and Ruby is at
>
> http://hyperpolyglot.org/scripting
>
> I created a quick Unicode cheatsheet for Perl, mostly by
> mining the examples in the new 4th edition of the Camel.
>
> Gee, I foresee a *whole* lot of "impossibles" in the
> other three languages' columns, don't you? :)
>
> Hm, have I left anything out that Perl is especially cool with?
>
> I almost wonder whether this sort of thing oughtn't be a manpage,
> something like perluni{ref,cheat,quick}?

+1

Here are some small corrections to it.
>
> --tom
>
> =Characters and their numbers
>
> # ASCII
> ord("A")
> chr(65)
>
> # BMP
> ord("Σ")
> chr(0x3A3)
>
> # beyond the BMP
> ord("𝑛")
> chr(0x1D45B)
>
> # beyond Unicode (up to MAXINT)

Actually you can go up to UV_MAX, but many operations only allow up to
MAXINT, so it might be best to leave it as is.

> ord("\x{20_0000}")
> chr(0x20_0000)
>
> =Unicode literals by character number
>
> String: "\x{3a3}"
> Regex: /\x{3a3}/
>
> String: "\x{1d45b}"
> Regex: /\x{1d45b}/
>
> # even non-BMP ranges in regex work fine
> /[\x{1D434}-\x{1D467}]/
>
> =Get character name by number
>
> use charnames ();
> my $name = charnames::viacode(0x03A3);
>
> =Get character number by name
>
> use charnames ();
> my $number = charnames::vianame("GREEK CAPITAL LETTER SIGMA");
>
> =Unicode named characters
>
> use charnames qw(:full :short greek);

The use charnames is optional starting in 5.16 if only :full and :short
are desired.
>
> "\N{MATHEMATICAL ITALIC SMALL N}"
> "\N{GREEK CAPITAL LETTER SIGMA}"
> "\N{Greek:Sigma}"
> "\N{epsilon}"
>
> =Unicode named sequences
>
> use charnames qw(:full);

Same, in 5.16 the 'use' is optional here
No. from v5.14
I hate to publicize this until :utf8 by default checks for
malformedness. I put this as a blocker to 5.16, but I suspect I will be
overruled.

>

Karl Williamson

Feb 20, 2012, 2:40:04 PM
to Tom Christiansen, Perl5 Porters Mailing List, Jarkko Hietaniemi
On 02/20/2012 12:32 PM, Tom Christiansen wrote:
> Karl Williamson<pub...@khwilliamson.com> wrote
> on Mon, 20 Feb 2012 12:27:18 MST:
>
> Thanks for all that, Karl.
>
>>> =Declare std streams to be utf8
>
>> I hate to publicize this until :utf8 by default checks for
>> malformedness. I put this as a blocker to 5.16, but I suspect
>> I will be overruled.
>
> Are you saying that
>
> use warnings FATAL => "utf8";
>
> doesn't fix the ":utf8" ''bug''?

No, I'm not saying that, but I didn't see your cheat sheet mentioning
the addition of the FATAL, which I think it should before every use of :utf8

Tom Christiansen

Feb 20, 2012, 3:13:02 PM
to Karl Williamson, Perl5 Porters Mailing List, Jarkko Hietaniemi
Karl Williamson <pub...@khwilliamson.com> wrote
on Mon, 20 Feb 2012 12:40:04 MST:

> No, I'm not saying that, but I didn't see your cheat sheet mentioning
> the addition of the FATAL, which I think it should before every use of :utf8

And every use of "use utf8".

And every use of perl -CSAD.

And every use of PERL_UNICODE=SAD.

If there's something extra and special that everyone everywhere
has to do everytime for correct behavior, then this is too important
to be left up to the programmer to forget.

And if it's a security issue, how can it not be a release stopper?

--tom

Leon Timmermans

Feb 20, 2012, 6:56:05 PM
to Tom Christiansen, Karl Williamson, Perl5 Porters Mailing List, Jarkko Hietaniemi, cha...@cpan.org
On Mon, Feb 20, 2012 at 9:13 PM, Tom Christiansen <tch...@perl.com> wrote:
>> No, I'm not saying that, but I didn't see your cheat sheet mentioning
>> the addition of the FATAL, which I think it should before every use of :utf8
>
> And every use of "use utf8".
>
> And every use of perl -CSAD.
>
> And every use of PERL_UNICODE=SAD.
>
> If there's something extra and special that everyone everywhere
> has to do everytime for correct behavior, then this is too important
> to be left up to the programmer to forget.
>
> And if it's a security issue, how can it not be a release stopper?

Actually, I've been writing such a layer[1] together with Christian
Hansen, but it seems it wasn't quite finished on time (it's almost at
0.001 now, I guess). Also, there are some complications to making this
:utf8 that may not seem obvious to an outsider: the current
implementation has the side-effect that :encoding appears to have a
utf8 layer on top of it, that would have to change too. I'm very much
in favor of that change, but it will break some code.

Leon

[1]: https://github.com/Leont/perlio-utf8_strict

Tom Christiansen

Feb 20, 2012, 6:58:51 PM
to Leon Timmermans, Karl Williamson, Perl5 Porters Mailing List, Jarkko Hietaniemi, cha...@cpan.org
Why does it take a new layer? Why not just make the things
that get fatalized by

use warnings FATAL => "utf8";

fatal without saying that?

--tom

Christian Hansen

Feb 20, 2012, 7:21:30 PM
to Tom Christiansen, Leon Timmermans, Karl Williamson, Perl5 Porters Mailing List, Jarkko Hietaniemi, cha...@cpan.org
I would love for this to happen. I have advocated this on #p5p several times, but there is always the battle of "backwards compatibility disease". About 10 months ago I reported a security issue regarding the relaxed UTF-8 implementation (still undisclosed and still exploitable) on the perl security mailing list.

What you state above was the reason I implemented Unicode::UTF8, but it only decodes strings, not I/O (good enough for me and my clients, as most of our data is small, a few MBytes).

If there would be a consensus in this matter I would happily devote time to see this implemented and tested [1]

[1] I will not provide a UTF-EBCDIC implementation, as I believe that is an ancient encoding not used by or endorsed by vendors.

--
chansen



Tom Christiansen

Feb 20, 2012, 7:44:08 PM
to Christian Hansen, Leon Timmermans, Karl Williamson, Perl5 Porters Mailing List, Jarkko Hietaniemi, cha...@cpan.org
Christian Hansen <christia...@mac.com> wrote
on Tue, 21 Feb 2012 01:21:30 +0100:

> 21 feb 2012 kl. 00:58 skrev Tom Christiansen:

> Why does it take a new layer? Why not just make the things
> that get fatalized by
>
> use warnings FATAL => "utf8";
>
> fatal without saying that?

> I would love for this to happen, I have advocated this on #p5p several
> times, but there is always the battle of "backwards compatibility
> disease". About 10 months ago I reported a security issue regarding the
> relaxed UTF-8 implementation (still undisclosed and still exploitable)
> on the perl security mailing list.

There is absolutely no need to remain compatible with security-related
bugs, and every reason not to. Indeed, security is the only thing that
we ever issue patches to releases that are past their end-of-life support.

--tom

Christian Hansen

Feb 20, 2012, 8:07:08 PM
to Tom Christiansen, Leon Timmermans, Karl Williamson, Perl5 Porters Mailing List, Jarkko Hietaniemi, cha...@cpan.org
I lack the political skills to make this happen, but I'm more than willing to provide the proper UTF-8 implementation for this (as defined by Unicode/ISO/IEC 10646:2011); we could always discuss the need for the invented meaning of relaxed. During my years as a professional programmer for several high-profile financial institutions in Sweden, I have only encountered ill-formed UTF-8 through malicious attempts or clients that thought they were sending UTF-8 but were using ISO-8859-1. That's my experience; perhaps yours looks different?

--
chansen

Dr.Ruud

Feb 20, 2012, 8:16:28 PM
to perl5-...@perl.org
On 2012-02-20 18:27, Tom Christiansen wrote:

> I almost wonder whether this sort of thing oughtn't be a manpage,
> something like perluni{ref,cheat,quick}?

"man unicode" is probably still up for grabs.

--
Ruud

Leon Timmermans

Feb 21, 2012, 3:44:15 AM
to Tom Christiansen, Karl Williamson, Perl5 Porters Mailing List, Jarkko Hietaniemi, cha...@cpan.org
I'm not entirely sure what gets fixed by that and what doesn't, it
isn't documented at all. Looking at the source makes me feel it's a
hack IMO, and I strongly suspect it is not quite a complete fix: there
are just too many places that would need to get fixed. I believe the
utf8 layer would be the right place to do it because that's the only
place that almost all Input passes through. Fixing it there fixes it
almost everywhere (except sysread I suppose, but that can be fixed
too).

Leon

Tom Christiansen

Feb 21, 2012, 11:36:12 AM
to Karl Williamson, Perl5 Porters Mailing List, Jarkko Hietaniemi, brian d foy
Karl Williamson <pub...@khwilliamson.com> wrote
on Mon, 20 Feb 2012 12:27:18 MST:

>> I almost wonder whether this sort of thing oughtn't be a manpage,
>> something like perluni{ref,cheat,quick}?

> +1

This is pre-release. Would it be suitable?

Gosh, this is three dozen recipes, so too many for just one chapter in PCB; hm.
Maybe I should just do a little 300-page Unicode recipe book. :)

--tom

=encoding utf8

=head1 NAME

perlunicook - cookbookish examples of handling Unicode in Perl

=head1 DESCRIPTION

Unless otherwise noted, all examples below assume this preamble,
with the C<#!> adjusted to work on your system:

#!/usr/bin/env perl

use utf8;
use v5.12; # or later
use strict;
use warnings;
use warnings qw(FATAL utf8);
use open qw(:std :utf8);
use charnames qw(:full :short); # unneeded in v5.16

=head1 EXAMPLES

=head2 Sample filter

Always decompose on the way in, then recompose on the way out.

use Unicode::Normalize;

while (<>) {
$_ = NFD($_);
...
} continue {
print NFC($_);
}

=head2 Fine-tuning Unicode warnings

As of v5.14, Perl distinguishes three subclasses of UTF‑8 warnings.

use v5.14;
no warnings "nonchars"; # the 66 forbidden characters
no warnings "surrogates"; # UTF-16/CESU-8 nonsense
no warnings "non_unicode"; # for codepoints over 0x10_FFFF

=head2 Characters and their numbers

The C<ord> and C<chr> functions work transparently on all codepoints.

# ASCII characters
ord("A")
chr(65)

# characters from the Basic Multilingual Plane
ord("Σ")
chr(0x3A3)

# beyond the BMP
ord("𝑛")
chr(0x1D45B)

# beyond Unicode (up to MAXINT)
ord("\x{20_0000}")
chr(0x20_0000)
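
A minimal runnable sketch of the claim above, assuming Perl v5.14 or later (needed for the "non_unicode" warnings category). The variable names are only illustrative. It round-trips one code point from each of the four ranges through chr and ord and checks that each result is a single character.

```perl
use strict;
use warnings;
no warnings "non_unicode";   # code points above 0x10_FFFF are deliberate here

# One code point from each range: ASCII, BMP, astral, beyond Unicode.
my @points = (0x41, 0x3A3, 0x1D45B, 0x20_0000);

# ord(chr($n)) should hand back the number we started with.
my @round = map { ord(chr($_)) } @points;

# Each chr() result is one character, whatever its internal byte length.
my @lengths = map { length(chr($_)) } @points;

printf "U+%04X -> %d character(s)\n", $points[$_], $lengths[$_]
    for 0 .. $#points;
```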

=head2 Unicode literals by character number

In a literal, you may specify a character by its number
using the C<\x{I<HHHHHH>}> escape.

String: "\x{3a3}"
Regex: /\x{3a3}/

String: "\x{1d45b}"
Regex: /\x{1d45b}/

# even non-BMP ranges in regex work fine
/[\x{1D434}-\x{1D467}]/

=head2 Get character name by number

use charnames ();
my $name = charnames::viacode(0x03A3);

=head2 Get character number by name

use charnames ();
my $number = charnames::vianame("GREEK CAPITAL LETTER SIGMA");

=head2 Unicode named characters

In v5.16, there is an implicit

use charnames qw(:full :short);

But prior to that release, you must be explicit about which charnames you
want. You should still specify a script if you want short names that are
script-specific.

use charnames qw(:full :short greek);

"\N{MATHEMATICAL ITALIC SMALL N}" # :full
"\N{GREEK CAPITAL LETTER SIGMA}" # :full
"\N{Greek:Sigma}" # :short
"\N{epsilon}" # greek

The v5.16 release also supports a C<:loose> import for loose matching of
character names.
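
As a small illustration of how names and numbers relate, the sketch below imports :full explicitly so that it should also run on releases before v5.16. It only uses charnames::vianame and charnames::viacode, both shown earlier; the variable names are illustrative.

```perl
use strict;
use warnings;
use charnames ":full";   # explicit import, so this does not rely on v5.16

my $sigma  = "\N{GREEK CAPITAL LETTER SIGMA}";                  # compile-time lookup
my $number = charnames::vianame("GREEK CAPITAL LETTER SIGMA"); # run-time lookup, numeric code point
my $name   = charnames::viacode($number);                       # and back to the name

printf "U+%04X %s\n", $number, $name;
```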

=head2 Unicode named sequences

These look just like character names but return multiple code points.
Notice the C<%vx> vector-print functionality in C<printf>.

use charnames qw(:full);
my $seq = "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}";
printf "U+%v04X\n", $seq;
U+0100.0300

=head2 Custom named characters

Give your own nicknames to existing characters, or to unnamed
private-use characters.

use charnames ":full", ":alias" => {
ecute => "LATIN SMALL LETTER E WITH ACUTE",
"APPLE LOGO" => 0xF8FF, # private use character
};

"\N{ecute}"
"\N{APPLE LOGO}"

=head2 Declare source in utf8 for identifiers and literals

Without this declaration, putting UTF‑8 in your literals
and identifiers won’t work right. If you used the standard
preamble, this already happened.

use utf8;

my $measure = "Ångström";
my @μsoft = qw( cp852 cp1251 cp1252 );
my @ὑπέρμεγας = qw( ὑπέρ μεγας );
my @鯉 = qw( koi8-f koi8-u koi8-r );

=head2 Unicode casing

Unicode casing is very different from ASCII casing.

uc("henry ⅷ") # "HENRY Ⅷ"
uc("tschüß") # "TSCHÜSS" notice ß => SS

# both are true:
"tschüß" =~ /TSCHÜSS/i # notice ß => SS
"Σίσυφος" =~ /ΣΊΣΥΦΟΣ/i # notice Σ,σ,ς sameness
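
One way to see the difference from ASCII casing is that a case change can alter the length of a string and need not round-trip. The following is a sketch under the assumption of Perl v5.12 or later (for the unicode_strings feature); the variable names are illustrative.

```perl
use strict;
use warnings;
use utf8;                       # the literal below is UTF-8 in the source
use feature "unicode_strings";  # Unicode rules even for code points below 256

my $lower = "tschüß";
my $upper = uc($lower);         # ß uppercases to the two letters SS
my $again = lc($upper);         # SS lowercases to ss, not back to ß

binmode(STDOUT, ":encoding(UTF-8)");
printf "%s (%d) -> %s (%d) -> %s (%d)\n",
    $lower, length($lower), $upper, length($upper), $again, length($again);
```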

=head2 Unicode case-insensitive comparisons

Also available in the CPAN L<Unicode::CaseFold> module,
the new C<fc> “foldcase” function from v5.16 grants
access to the same Unicode casefolding as the C</i>
pattern modifier has always used:

use feature "fc"; # fc() function is from v5.16

# sort case-insensitively
my @sorted = sort { fc($a) cmp fc($b) } @list;

# both are true:
fc("tschüß") eq fc("TSCHÜSS")
fc("Σίσυφος") eq fc("ΣΊΣΥΦΟΣ")

=head2 Match Unicode linebreak sequence in regex

A Unicode linebreak matches the two-character CRLF
grapheme or any of seven vertical whitespace characters.
Good for dealing with intransigent Microsoft systems.

\R

s/\R/\n/g; # normalize all linebreaks to \n
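
A short sketch of that normalization on a string with mixed line endings. The sample text is made up for illustration; the point is that CRLF is consumed as a single linebreak rather than two.

```perl
use strict;
use warnings;

# Mixed line endings: CRLF, LINE SEPARATOR (U+2028), a bare CR, and LF.
my $text = "one\r\ntwo\x{2028}three\rfour\n";

(my $clean = $text) =~ s/\R/\n/g;   # every linebreak sequence becomes one \n

my @lines = split /\n/, $clean;
print scalar(@lines), " lines\n";
```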

=head2 Get character category

Find the general category of a numeric codepoint.

use Unicode::UCD qw(charinfo);
my $cat = charinfo(0x3A3)->{category}; # "Lu"

=head2 Disabling Unicode-awareness in builtin charclasses

Disable C<\w>, C<\b>, C<\s>, C<\d>, and the POSIX
classes from working correctly on Unicode.

use v5.14;
use re "/a";

# OR

my($num) = $str =~ /(\d+)/a;

Or just use specific un-Unicode properties, like C<\p{ahex}>
and C<\p{posix_digit}>. Properties still work normally
no matter what.
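
To make the effect concrete, here is a sketch assuming Perl v5.14 or later. It matches the same string of Devanagari digits three ways: with the default rules, with /a, and with a property under /a. The variable names are illustrative.

```perl
use strict;
use warnings;
use v5.14;     # the /a modifier first appeared in v5.14
use utf8;

my $devanagari = "४५६७";   # DEVANAGARI DIGIT FOUR through SEVEN

my $unicode_match  = $devanagari =~ /\A\d+\z/       ? 1 : 0;  # Unicode-aware \d
my $ascii_match    = $devanagari =~ /\A\d+\z/a      ? 1 : 0;  # \d restricted to 0-9
my $property_match = $devanagari =~ /\A\p{Nd}+\z/a  ? 1 : 0;  # \p{} is unaffected by /a

say "unicode=$unicode_match ascii=$ascii_match property=$property_match";
```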

=head2 Match Unicode properties in regex with \p, \P

These all match a single codepoint with the given
property. Use C<\P> in place of C<\p> to match
one lacking that property.

\pL, \pN, \pS, \pP, \pM, \pZ, \pC
\p{Sk}, \p{Ps}, \p{Lt}
\p{alpha}, \p{upper}, \p{lower}
\p{Latin}, \p{Greek}
\p{script=Latin}, \p{script=Greek}
\p{East_Asian_Width=Wide}, \p{EA=W}
\p{Line_Break=Hyphen}, \p{LB=HY}
\p{Numeric_Value=4}, \p{NV=4}

=head2 Custom character properties

Define at compile-time your own custom character
properties for use in regexes.

# using private-use characters
sub In_Tengwar { "E000\tE07F\n" }

if (/\p{In_Tengwar}/) { ... }

# blending existing properties
sub Is_GraecoRoman_Title {<<'END_OF_SET'}
+utf8::IsLatin
+utf8::IsGreek
&utf8::IsTitle
END_OF_SET

if (/\p{Is_GraecoRoman_Title}/) { ... }

=head2 Convert non-ASCII Unicode numerics

Unless you’ve used C</a>, C<\d> matches more than ASCII digits.

use v5.14;
use Unicode::UCD qw(num);
my $str = "got Ⅻ and ४५६७ and ⅞ and here";
my @nums = ();
while ($str =~ /(\d+|\N)/g) { # not just ASCII!
push @nums, num($1);
}
say "@nums"; # 12 4567 0.875

use charnames qw(:full);
my $nv = num("\N{RUMI DIGIT ONE}\N{RUMI DIGIT TWO}");

=head2 Match Unicode grapheme cluster in regex

Programmer-visible “characters” are codepoints matched by C</./s>,
but user-visible “characters” are graphemes matched by C</\X/>.

# Find vowel plus any diacritics,underlining,etc.
use Unicode::Normalize;
my $nfd = NFD($orig);
$nfd =~ /(?=[aeiou])\X/i

=head2 Extract by grapheme instead of by codepoint (regex)

# match and grab five first graphemes
my($first_five) = $str =~ /^(\X{5})/;

=head2 Extract by grapheme instead of by codepoint (substr)

# cpan -i Unicode::GCString
use Unicode::GCString;
my $gcs = Unicode::GCString->new($str);
my $first_five = $gcs->substr(0, 5);

=head2 Reverse string by grapheme

Reversing by codepoint messes up diacritics.

$str = join("", reverse $str =~ /\X/g);

# OR: cpan -i Unicode::GCString
use Unicode::GCString;
$str = reverse Unicode::GCString->new($str);
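
The difference shows up on any string with a combining mark. A minimal sketch using only core Perl; the sample string is illustrative.

```perl
use strict;
use warnings;

my $str = "cafe\x{301}";   # "café" spelled as e + COMBINING ACUTE ACCENT

my $by_codepoint = reverse $str;                     # accent ends up first, on nothing
my $by_grapheme  = join "", reverse $str =~ /\X/g;   # accent stays attached to its e

printf "by codepoint: U+%vX\n", $by_codepoint;
printf "by grapheme:  U+%vX\n", $by_grapheme;
```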

=head2 String length in graphemes

Count by grapheme, not by codepoint.

my $count = 0;
while ($str =~ /\X/g) { $count++ } # /g advances the match; without it the loop never ends

# OR: cpan -i Unicode::GCString
use Unicode::GCString;
$gcs = Unicode::GCString->new($str);
my $count = $gcs->length;

=head2 Unicode column-width for printing

Perl’s C<printf>, C<sprintf>, and C<format> think all
code points take up 1 print column, but many take 0 or 2.

# cpan -i Unicode::GCString
use Unicode::GCString;
$gcs = Unicode::GCString->new($str);
my $cols = $gcs->columns;
printf "%*s\n", $cols, $str;

=head2 Unicode normalization

Typically render into NFD on input and NFC on output.
Using either of the NFK functions improves recall on searches.
Note that this is about much more than just pre-combined compatibility glyphs.

use Unicode::Normalize;
my $nfd = NFD($orig);
my $nfc = NFC($orig);
my $nfkd = NFKD($orig);
my $nfkc = NFKC($orig);
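
A brief sketch of what the four forms do to concrete input, using the core Unicode::Normalize module. The sample characters are chosen only for illustration: a precomposed é for the canonical forms, and the fi ligature for the compatibility forms.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC NFKC);

my $precomposed = "\x{E9}";            # é as a single code point
my $decomposed  = NFD($precomposed);   # e followed by U+0301
my $recomposed  = NFC($decomposed);    # back to the single code point

my $ligature = "\x{FB01}";             # LATIN SMALL LIGATURE FI
my $canonical     = NFC($ligature);    # canonical forms leave the ligature alone
my $compatibility = NFKC($ligature);   # compatibility forms map it to plain "fi"

printf "NFD:  U+%v04X\n", $decomposed;
printf "NFC:  U+%v04X\n", $recomposed;
printf "NFKC: %s\n",      $compatibility;
```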

=head2 Unicode collation

Text sorted by numeric codepoint follows no reasonable order;
use the UCA for sorting text.

use Unicode::Collate;
my $col = Unicode::Collate->new();
my @list = $col->sort(@old_list);
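
To illustrate the difference, the sketch below sorts the same three words by codepoint and with the default collator. The word list is made up for the example, and no locale tailoring is assumed.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

my @words = ("zebra", "Ärger", "apple");

my @by_codepoint = sort @words;                          # Ä (U+00C4) lands after z
my @by_uca       = Unicode::Collate->new->sort(@words);  # Ä sorts with the other a's

binmode(STDOUT, ":encoding(UTF-8)");
print "codepoint: @by_codepoint\n";
print "UCA:       @by_uca\n";
```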

=head2 Case- I<and> accent-insensitive Unicode sort

Specify a collation strength of level 1 to ignore case and
diacritics, only looking at the basic character.

use Unicode::Collate;
my $col = Unicode::Collate->new(level => 1);
my @list = $col->sort(@old_list);

=head2 Unicode locale collation

Some locales have special sorting rules.

# either use v5.12, OR: cpan -i Unicode::Collate::Locale
use Unicode::Collate::Locale;
my $col = Unicode::Collate::Locale->new(locale => "de__phonebook");
my @list = $col->sort(@old_list);

=head2 Making C<cmp> work on text instead of codepoints

Instead of this:

@srecs = sort {
$b->{AGE} <=> $a->{AGE}
||
$a->{NAME} cmp $b->{NAME}
} @recs;

Use this:

my $coll = Unicode::Collate->new();
for my $rec (@recs) {
$rec->{NAME_key} = $coll->getSortKey( $rec->{NAME} );
}
@srecs = sort {
$b->{AGE} <=> $a->{AGE}
||
$a->{NAME_key} cmp $b->{NAME_key}
} @recs;

=head2 Case- I<and> accent-insensitive comparisons

Use a collator object to compare Unicode text by character
instead of by codepoint.

use Unicode::Collate;
my $es = Unicode::Collate->new(
level => 1,
normalization => undef
);

# now both are true:
$es->eq("García", "GARCIA" );
$es->eq("Márquez", "MARQUEZ");

=head2 Case- I<and> accent-insensitive locale comparisons

Same, but in a specific locale.

use Unicode::Collate::Locale;
my $de = Unicode::Collate::Locale->new(
locale => "de__phonebook",
);

# now this is true:
$de->eq("tschüß", "TSCHUESS"); # notice ü => UE

=head2 Unicode linebreaking

Break up text into lines according to Unicode rules.

# cpan -i Unicode::LineBreak
use Unicode::LineBreak;
use charnames qw(:full);

my $para = "This is a super\N{HYPHEN}long string. " x 20;
my $fmt = Unicode::LineBreak->new;
print $fmt->break($para), "\n";

=head2 Declare all three standard I/O streams to be utf8

Use a command-line option, an environment variable, or else
call C<binmode> explicitly:

$ perl -CS ...
or
$ export PERL_UNICODE=S
or
use open qw(:std :utf8);
or
binmode(STDIN, ":utf8");
binmode(STDOUT, ":utf8");
binmode(STDERR, ":utf8");

=head2 Make I/O default to utf8

Include files opened without an encoding argument.

$ perl -CSD ...
or
$ export PERL_UNICODE=SD
or
use open qw(:std :utf8);

=head2 Open file with implicit encode/decode

Specify stream encoding. This is the normal way
to deal with encoded text, not by calling low-level
functions.

# input file
open(my $in_file, "< :encoding(UTF-16)", "wintext");
OR
open(my $in_file, "<", "wintext");
binmode($in_file, ":encoding(UTF-16)");
THEN
my $line = <$in_file>;

# output file
open(my $out_file, "> :encoding(cp1252)", "wintext");
OR
open(my $out_file, ">", "wintext");
binmode($out_file, ":encoding(cp1252)");
THEN
print $out_file "some text\n";

The incantation C<":raw :encoding(UTF-16LE) :crlf">
includes implicit CRLF handling.
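
A self-contained round trip may make the layer approach easier to try out. This sketch assumes a Unix-like system (so no implicit CRLF translation) and uses a temporary file from the core File::Temp module in place of "wintext"; the file and variable names are illustrative.

```perl
use strict;
use warnings;
use utf8;
use File::Temp qw(tempfile);

# Scratch file standing in for "wintext"; removed when the program exits.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close($tmp_fh);

my $text = "Ångström\n";   # 9 characters including the newline

# Encode on the way out: the layer turns characters into UTF-16LE bytes.
open(my $out, ">:raw:encoding(UTF-16LE)", $path) or die "open for write: $!";
print {$out} $text;
close($out) or die "close: $!";

my $bytes_on_disk = -s $path;   # two bytes per character in this text

# Decode on the way in: the layer turns bytes back into characters.
open(my $in, "<:raw:encoding(UTF-16LE)", $path) or die "open for read: $!";
my $line = <$in>;
close($in);

print "bytes on disk: $bytes_on_disk, characters read: ", length($line), "\n";
```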

=head2 Explicit encode/decode [rarely needed, see previous]

On very rare occasion, such as a database read, you may be
given encoded text you need to decode.

use Encode qw(encode decode);

my $chars = decode("shiftjis", $bytes);
# OR
my $bytes = encode("MIME-Header-ISO_2022_JP", $chars);

But see L<DBM_Filter::utf8> for easy implicit handling of UTF‑8
in DBM databases.
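
For the occasions when explicit calls are needed, this sketch shows the characters-versus-bytes distinction with the core Encode module. UTF-8 and cp1252 are used here only as convenient example encodings.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode decode);

my $chars = "Ångström";                   # 8 characters

my $utf8_bytes   = encode("UTF-8",  $chars);  # Å and ö take two bytes each
my $cp1252_bytes = encode("cp1252", $chars);  # one byte per character

my $back = decode("UTF-8", $utf8_bytes);      # bytes back to characters

printf "chars=%d utf8=%d cp1252=%d\n",
    length($chars), length($utf8_bytes), length($cp1252_bytes);
```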

=head1 SEE ALSO

L<perlunicode>,
L<perluniprops>,
L<perlre>,
L<perlrecharclass>,
L<perluniintro>,
L<perlunitut>,
L<perlunifaq>,
L<PerlIO>,
L<DBM_Filter::utf8>,
and
L<Encode>.

=head1 AUTHOR

Tom Christiansen E<lt>tch...@perl.comE<gt>

=head1 COPYRIGHT AND LICENCE

Copyright © 2012 Tom Christiansen.
All rights reversed. Perl licence, blah blah.

Some code excerpts taken from the 4th Edition of I<Programming Perl>,
Copyright © 2012blah blah


=head1 REVISION HISTORY

zi1ch0

Tom Christiansen

Feb 25, 2012, 5:28:42 PM
to Leon Timmermans, Ricardo Signes, Karl Williamson, Perl5 Porters Mailing List, Jarkko Hietaniemi, Christian Hansen
Maybe that's too strong a subject line, but I *really* want to know *exactly*
why we can't tell people they can use Perl with Unicode in anything but the
most excruciating of all possible ways -- if this is indeed true.

Leon Timmermans <faw...@gmail.com> on Tue, 21 Feb 2012 09:44:15 +0100 wrote:

> I'm not entirely sure what gets fixed by that and what doesn't, it isn't
> documented at all. Looking at the source makes me feel it's a hack IMO, and
> I strongly suspect it is not quite a complete fix: there are just too many
> places that would need to get fixed. I believe the utf8 layer would be the
> right place to do it because that's the only place that almost all Input
> passes through. Fixing it there fixes it almost everywhere (except sysread I
> suppose, but that can be fixed too).

Some folks claim the only "safe" way to use Unicode in Perl is to always make
explicit calls to encode/decode with a bonus FB_CROAK argument. They claim
that all nine of these perfectly reasonable and common-to-the-99th-percentile
operations...

#1. $ perl -C...
#2. $ export PERL_UNICODE=...

#3. use utf8;

#4. use open qw[ :std :utf8 ];
#5. use open qw[ :std :encoding(UTF-8) ];

#6. binmode(FH, ":utf8");
#7. binmode(FH, ":encoding(UTF-8)");

#8. open(FH, "< :utf8", $path);
#9. open(FH, "< :encoding(UTF-8)", $path);

...are all of them flawed in their not raising exceptions on UTF-8
encoding errors of one sort or another, and that somehow not even...

#0. use warnings qw(FATAL utf8);

...is good enough to fix it.

I do not know whether these claims are true. My own tests suggest this may
not be the whole story, because this behaves as I think it should:

darwin$ perl -C0 -E 'say for "caf\xE9", "stuff"' |
perl -CS -Mwarnings=FATAL,utf8 -pe 'print "$. "'
utf8 "\xE9" does not map to Unicode, <> line 1.
Exit 255

darwin$

Which seems to say that #0 makes at least #1 safe. Again, I'm fuzzy on what
the perceived problem actually is. Maybe they're using autodie or something,
which is known broken. I'm trying to get more info.

I also do not know the precise details of these so-called "security" bugs that
Christian references. I've no reason to disbelieve Christian; I just don't
know the details myself, nor am I asking that they be splatted all over.

What I do know is that telling people that the only "right" or "safe" or
"acceptable" way to use Unicode in Perl is via myriad explicit calls to
&Encode::{utf8_,}{en,de}code(..., FB_CROAK) just doesn't cut it—full stop.

If there's something so important that it must be done every time to ensure
correct behavior, then that is too important to be left up to the programmer
to forget to do. It needs to be done for him.

We should not have to endure five more tedious years of people getting
tonguelashed and flamenagged into writing horribly complicated code
all because something deep down in Perl's dwimmer is flawed.

I say five years, not one year, because of how long it takes to get vendors
to get themselves updated. If this is a legit issue, and we push it off till
2013's v5.18, then it will be a further 2–4 years past that until people can
have reasonable Unicode processing in their vendor Perl. That puts us into
the 2015–2017 (!) time frame, and... that's just not acceptable, eh?

By that time, one or both of two things will have happened. Either untold
zillions of lines of code will have been written that conform to
this ridiculous amount of monkeywork, entrenching a bad pattern forever, or
else untold zillions of lines of code will have been written that ignore it
and are themselves open to the kind of spooky catastrophic failure that
people allude to, thereby rendering all Perl a security hole of Chicken
Little proportion.

Since we won't let either of those happen, I figure that either

—— all #1 .. #9 of my numbered points above are completely safe and
proper, preferably always but minimally along with the #0 fatalization

—— or else they need to be made so before we dare release v5.16.

Why? Simple: remove those 9 simple ways to approach implicit Unicode
processing in Perl, and you so gut Perl's Unicode dwimmer that nobody
save the very most diligent and élite [sic] of Perl gurus will ever dare
use Perl with Unicode. That would be tragic, maybe disastrous even.

Possibly this is all well known, and I just haven't been listening. Maybe
it's even been fixed, assuming it was ever broken in the first place, which
I'm highly fuzzy on and can neither prove nor disprove with my own meagre
poking at the problem. If so, I apologize for making much ado about nothing.

I do have notions about what should be happening with encoding layers,
including backwards compatibility concerns versus security concerns;
something along the lines that we're under no obligation to leave our
backdoors standing open for eternity. Also, I strongly feel that all
encoding errors should croak by default. I hate garbage in files as
things silently fill them with manglings. Those should croak if you haven't
explicitly asked to get garbage out. That shouldn't be a default.

I'm still troubled that Encode is *not* one of "our" modules, yet a whole lot
of what we do seems dependent on it. We can't usefully create new warnings
and classes, errors, encoding names, and exceptions for encodings used
internally if they don't sync up with Encode. But we have and they don't:
already the utf8 warnings subclasses are broken with Encode. That makes it
even more important that we get the internal stuff "right". There's a bunch
more where that came from, but I'll save the rest till someone tells me where
we actually stand.

Karl, Leon, and Christian, thank you for your time and insights, both past
and future. Nothing would please me more than to learn that I've had my cage
rattled for nothing, and that these are all non-issues.

--tom

PS: Christian Hansen, if you say your name fast enough,
it rather sounds like my surname. :)

Aristotle Pagaltzis

Feb 26, 2012, 12:32:41 AM
to perl5-...@perl.org
I shall address your concern here in roughly reverse order:

* Tom Christiansen <tch...@perl.com> [2012-02-25 23:35]:
> Since we won't let either of those happen, I figure that either
>
> —— all #1 .. #9 of my numbered points above are completely safe and
> proper, preferably always but minimally along with the #0 fatalization
>
> —— or else they need to be made so before we dare release v5.16.

5.16 *must* be released whatever the state of this issue. To not do so
is to fall into the thinking and behavioural pattern that stifled the
release of 5.10 by several years. Perl 5 switched to a timeboxed release
cycle because “this one more thing has to be polished before we can ship
it” meant it never shipped at all.

It has been a rousing success.

The key to said success is that no feature, no matter how exceptional it
may appear, is special. If it isn’t ready in time, it can wait – because
it will not have to wait long, because when time arrives for it to ship
it will not have to wait for any *other* feature in turn, because none
are exceptional.

No matter how important any one particular issue may appear: for the new
regime to work, none must be considered more important than the release
schedule. The trains must run on time.

> Why? Simple: remove those 9 simple ways to approach implicit Unicode
> processing in Perl, and you so gut Perl's Unicode dwimmer that nobody
> save the very most diligent and élite [sic] of Perl gurus will ever dare
> use Perl with Unicode. That would be tragic, maybe disastrous even.

I am rather more sanguine about that. Let me help you relax by giving
you the reason – two of them, in fact. Firstly:

> We should not have to endure five more tedious years of people getting
> tonguelashed and flamenagged into writing horribly complicated code
> all because something deep down in Perl's dwimmer is flawed.
>
> I say five years, not one year, because of how long it takes to get
> vendors to get themselves updated. If this is a legit issue, and we
> push it off till 2013's v5.18, then it will be a further 2–4 years
> past that until people can have reasonable Unicode processing in their
> vendor Perl. That puts us into the 2015–2017 (!) time frame, and...
> that's just not acceptable, eh?

You ignore here that by this calculation, the question is *only* whether
the 4-year trickle-down process starts now or next year. Neither is
there any (known) non-linear process at work that would make a delivery
now trickle down much quicker than a delivery next year – the 4-year
process is identical either way.

So your worst-case scenario has to be compared against a parallel better
case of 4 years. It is not “ship it this time and get it now or wait
5 years”: it is “ship it this time and get it in 4 years or leave it to
be shipped next year and get it in 5, counting from now”.

Accounting for constant factors, the difference between the scenarios is
just 1 year, no matter what else the full process involves.

Then the gravity of the decision to consider is the gravity of one year.

Secondly:

> By that time, one or both of two things will have happened. Either
> untold zillions of lines of code will have been written that either
> conform to this ridiculous amount of monkeywork, entrenching a bad
> pattern forever, or else untold zillions of line of code will have
> been written that ignore it and are themselves open to the kind of
> spooky catastrophic failure that people allude to, thereby rendering
> all Perl a security hole of Chicken Little proportion.

This ignores a third option à la “use Try::Tiny until we have sane
exceptions in core”.

We have the CPAN.

It is no alternative to fixing the core language – but *is* an
alternative to copy-pasting workaround patterns all over code. And it
can trickle out *much* faster than an update to the core ever will be
able to.

Since according to your premises we are looking at 4 years in the
best-worst case, no matter how this plays out, then if the situation
is as grave and perilous as it appears to you, maybe the thing to do is
start to think *now* about how to create an interim DWIM solution that
can live on CPAN.

And incidentally, maybe it is also time to consider whether and where
the points that utf8::all implicitly argues (by virtue of its existence)
mean the design of the core language should be moving toward regarding
Unicode.


So with all that said, let us take a breath and focus on strategy, and
leave the release schedule to care for itself.

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>

Jarkko Hietaniemi

Feb 26, 2012, 9:04:13 AM
to Tom Christiansen, Leon Timmermans, Ricardo Signes, Karl Williamson, Perl5 Porters Mailing List, Christian Hansen
FWIW, as one of the main persons responsible for the design mistakes
made 10 years (and more) ago, please do not perpetuate them anymore.
If someone still is relying on, say, ISO 8859-1 being the default,
or on half-baked stupidities of years gone by, I say it's time.


Leon Timmermans

Feb 26, 2012, 10:26:29 AM
to Tom Christiansen, Ricardo Signes, Karl Williamson, Perl5 Porters Mailing List, Jarkko Hietaniemi, Christian Hansen
On Sat, Feb 25, 2012 at 11:28 PM, Tom Christiansen <tch...@perl.com> wrote:
> Some folks claim the only "safe" way to use Unicode in Perl is to always make
> explicit calls to encode/decode with a bonus FB_CROAK argument.  They claim
> that all nine of these perfectly reasonable and common-to-the-99th-percentile
> operations...
>
>    #1.   $ perl -C...
>    #2.   $ export PERL_UNICODE=...
>
>    #3.     use utf8;
>
>    #4.     use open qw[ :std :utf8            ];
>    #5.     use open qw[ :std :encoding(UTF-8) ];
>
>    #6.     binmode(FH, ":utf8");
>    #7.     binmode(FH, ":encoding(UTF-8)");
>
>    #8.     open(FH,  "< :utf8",            $path);
>    #9.     open(FH,  "< :encoding(UTF-8)", $path);
>
> ...are all of them flawed in their not raising exceptions on UTF-8
> encoding errors of one sort of another, and that somehow not even...
>
>    #0.     use warnings qw(FATAL utf8);
>
> ...is good enough to fix it.
>
> I do not know whether these claims are true.  My own tests suggest this may
> not be the whole story, because this behaves as I think it should:
>
>  darwin$ perl -C0 -E 'say for "caf\xE9", "stuff"' |
>          perl -CS -Mwarnings=FATAL,utf8 -pe 'print "$. "'
>  utf8 "\xE9" does not map to Unicode, <> line 1.
>  Exit 255

I'm having the impression that only high-level readline (e.g. not what
the parser uses) actually checks input for invalid characters. Most
other operations only seem to check for well-formedness if they check
at all. I may be mistaken though: I haven't tested this, just
read source.

Leon

Tom Christiansen

Feb 26, 2012, 11:19:18 AM
to Leon Timmermans, Ricardo Signes, Karl Williamson, Perl5 Porters Mailing List, Jarkko Hietaniemi, Christian Hansen

Leon Timmermans <faw...@gmail.com> wrote
on Sun, 26 Feb 2012 16:26:29 +0100:

> I'm having the impression that only high-level readline (e.g. not what
> the parser uses) actually checks input for invalid characters. Most
> other operations only seem to check for well-formedness if they check
> at all. I may be mistaken though: I haven't tested tested this, just
> read source.


1$ perl -C0 -E 'say "caf\x{e9}"' | perl -Mwarnings=FATAL,utf8 -CS -E '$var = readline(STDIN); say $var'
utf8 "\xE9" does not map to Unicode at -e line 1, <STDIN> line 1.
Exit 255

2$ perl -C0 -E 'say "caf\x{e9}"' | perl -Mwarnings=FATAL,utf8 -CS -E 'read(STDIN, $var, 4); say $var'
café

3$ perl -C0 -E 'say "caf\x{e9}"' | perl -Mwarnings=FATAL,utf8 -CS -E 'sysread(STDIN, $var, 4); say $var'
caf?

4$ perl -CS -E 'say "caf\x{e9}"' | perl -Mwarnings=FATAL,utf8 -CS -E 'sysread(STDIN, $var, 4); say $var'
café

That's not right, is it? Pretty sure that both 2 and 3 should have died.
I haven't tried it with recv() yet, but I bet it's the same thing.
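
Until read() and sysread() check their input themselves, one stopgap is to pull in raw bytes and decode them strictly by hand. The sketch below is only that: read_utf8_strict is a hypothetical helper, and in-memory handles stand in for STDIN.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: read up to $want bytes with no decoding layer,
# then decode strictly. Dies if the bytes are not well-formed UTF-8.
sub read_utf8_strict {
    my ($fh, $want) = @_;
    binmode($fh, ":raw");                       # plain bytes, no :utf8
    my $got = read($fh, my $bytes, $want);
    die "read failed: $!" unless defined $got;
    return decode("UTF-8", $bytes, FB_CROAK);   # throws on malformed input
}

# Latin-1 input, as produced by the -C0 writer in the tests above.
my $latin1 = "caf\xE9\n";
open(my $bad_fh, "<", \$latin1) or die "can't open in-memory handle: $!";
my $bad_result = eval { read_utf8_strict($bad_fh, 4) };
my $bad_error  = $@;

# Well-formed UTF-8 input: five bytes hold the four characters.
my $utf8 = "caf\xC3\xA9\n";
open(my $good_fh, "<", \$utf8) or die "can't open in-memory handle: $!";
my $good_result = read_utf8_strict($good_fh, 5);
```

A fixed byte count can cut a multibyte character in half, and this sketch would treat that as an error too, so it is an illustration of the check rather than a drop-in replacement for a proper layer.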

How else can UTF-8 get into a program? Through "use utf8", right?

5$ perl -C0 -E 'say q(my $x = ),"q(caf\x{e9});";say q(say length $x)' | perl -C0 -Mv5.14 -Mutf8
Malformed UTF-8 character (unexpected end of string) in length at - line 2.
3

Eh? How can that be? If the compiler finds malformed UTF-8 in the source,
why in the world does it go ahead and run the interpreter anyway?

Also, I'm pretty sure I disagree with these:

6$ perl -C0 -E 'say "caf\x{e9}"' | perl -CS -E '$var = readline(STDIN); say $var'
caf?

7$ perl -C0 -E 'say "caf\x{e9}"' | perl -M-warnings=utf8 -CS -E '$var = readline(STDIN); say $var'
caf?

8$ perl -C0 -E 'say "caf\x{e9}"' | perl -M-warnings=FATAL,utf8 -CS -E '$var = readline(STDIN); say $var'
caf?

Why in the world is it allowed to generate illegal UTF-8?

And here it's even worse:

9$ perl -C0 -E 'say "caf\x{e9}"' | perl -M-warnings=utf8 -CS -E '$var = readline(STDIN); say $var =~ /(\w+)/'
panic: pp_match start/end pointers at -e line 1, <STDIN> line 1.
Exit 255

10$ blead -C0 -E 'say "caf\x{e9}"' | blead -M-warnings=utf8 -CS -E '$var = readline(STDIN); say $var =~ /(\w+)/'
panic: pp_match start/end pointers, i=1, start=0, end=6, s=208f00, strend=208f05, len=6 at -e line 1, <STDIN> line 1.
Exit 255

Whoa! That's this version of blead:

This is perl 5, version 15, subversion 7 (v5.15.7-275-g2872641) built for darwin-2level

so maybe it's fixed in newer blead?


--tom

Tom Christiansen

Feb 26, 2012, 12:10:10 PM
to Aristotle Pagaltzis, perl5-...@perl.org
Aristotle Pagaltzis <paga...@gmx.de> wrote
on Sun, 26 Feb 2012 06:32:41 +0100:

> 5.16 *must* be released whatever the state of this issue.

Apparently I wasn't paying attention when it was Decided to stop using
releases as actual milestones, and start using them to rubberstamp some
meaningless periodic snapshot of the development cycle. Good thing you
got rid of the idea of release blockers, too: wouldn't want to risk
anything interfering with Real™ Progress®, now would we?

--tom

Tom Christiansen

Feb 26, 2012, 2:22:27 PM
to Christian Hansen, Leon Timmermans, Karl Williamson, Perl5 Porters Mailing List, Jarkko Hietaniemi, cha...@cpan.org
Christian Hansen <christia...@mac.com> wrote
on Tue, 21 Feb 2012 02:07:08 +0100:

>>> I would love for this to happen, I have advocated this on #p5p several
>>> times, but there is always the battle of "backwards compatibility
>>> disease". About 10 months ago I reported a security issue reading the
>>> relaxed UTF-8 implementation (still undisclosed and still exploitable)
>>> on the perl security mailing list.

Then we are currently in a security-through-obscurity situation, wherein
only overall ignorance of an exploit "protects" us. That's not protection;
it's a vulnerability. Would you estimate the vulnerability is severe
enough for us to consider whether in this particular case we should
consider issuing patches for old releases, like make a 5.12.5 or 5.10.2?

>> There is absolutely no need to remain compatible with security-related
>> bugs, and every reason not to. Indeed, security is the only thing that
>> we ever issue patches to releases that are past their end-of-life support.

> I lack the political skills to make this happen, but I'm more than willing
> to provide the proper UTF-8 implementation for this (as defined by
> Unicode/ISO/IEC 10646:2011) we could always discuss the need for the
> invented meaning of relaxed. During my years as a professional programmer
> for several high profile financial institutions in Sweden, I have only
> encountered Ill-formed UTF8 through malicious attempts or clients that
> thought that they where sending UTF-8 but using ISO-8959-1, thats my
> experience, perhaps yours looks different?

My own experiences are finding the wrong encoding used by accident, not by
malicious intent. The situation you mention is therefore outside of my own
experiences, which makes me all the more concerned about it. I have gigabytes
of corrupt data because of Java having the wrong defaults for what to do
with wrong encodings. It was a design mistake, but they locked themselves
into it forever and everyone keeps paying for that blunder. Let's not
mimic their bad decisions. Let's fix ours.

The thing I don't want is to have to tell people that they cannot trust
perl -C, that they cannot trust PERL_UNICODE, that they cannot trust use
utf8, that they cannot trust use open, that they cannot trust binmode,
that they cannot trust :encoding(UTF-8), and that the only thing they can
trust is laborious and error-prone manual encoding/decoding with FB_CROAK.

If that position is nonetheless correct, it drastically needs to be fixed.
Christian, I don't know what political skills you allude to as needed to
make this happen. Political skills to achieve a consensus that backwards
compatibility with previous behavior known to be wrong is undesirable?

It seems to me that Python went through a transition where encoding-decoding
errors changed from some sort of non-fatal to proper exceptions. I don't know
what sort of conniptions they experience there, since it's not a backwards-
contemptible change. But it doesn't have to be b-c, and probably shouldn't be.
Jarkko is right.

It's better to fix bugs than to document them, and it's better to document them
than not. Right now I'm very hazy on the real status of all this stuff, and I
am very uncomfortable with the idea of relentlessly charging ahead toward a
release like a freight train with no brakes.

Absolutely nothing depends upon any particular release date, but quite a bit
depends on correct behavior, especially if it is security-related. I know which
one of those *I* consider immeasurably more important, but Aristotle appears to
be of the opposite opinion. Is this the "political will" problem you mention?

--tom

Jarkko Hietaniemi

Feb 26, 2012, 2:54:57 PM
to Tom Christiansen, Christian Hansen, Leon Timmermans, Karl Williamson, Perl5 Porters Mailing List, cha...@cpan.org
Further ammunition, in case people haven't seen this:

http://googleblog.blogspot.com/2012/02/unicode-over-60-percent-of-web.html

Perl 5.8 was one of the trail blazers on this. Let's not chain
ourselves to the altar of the false god of backward compatibility,
even worse, compatibility with policy/implementation/usage bugs. If I
have any remorse for the 5.8 choices I made, it was not going fully
Unicode, without a hint of Latin-1, or any legacy 8-bit, bias.
--
There is this special biologist word we use for 'stable'. It is
'dead'. -- Jack Cohen

Aristotle Pagaltzis

Feb 26, 2012, 7:45:10 PM
to perl5-...@perl.org
* Tom Christiansen <tch...@perl.com> [2012-02-26 18:15]:
Sorry I disagreed with your assessment of the situation.

I actually concur that that complex of issues you brought up is gravely
important and in need of fixing. I disagree *only* about whether it
needs to hold up 5.16: it is not a regression from any previous perl.

5.12 shipped with the excision of UTF8-flag-dependent Unicode semantics
half finished, and left the rest for 5.14 to clean up. That was not the
end of the world then either.

Because 5.14 is *here*!

It’s not a snapshot of the entire development cycle that gets shipped
either. It’s a snapshot of the *stabilised* (and landed) features from
this development cycle. That is the primary instrument for not shipping
something with blocking bugs while at the same time not having them
delay it until their resolution. And patting down the projected release
product for blockers begins long before the ship date. That leaves an
infinitesimal chance for the release to be delayed, which ensures that
the work of porters and contributors can reach the hands of users in
a timely manner, instead of getting mired in a tar pit for years.

Do you want to go back to how things used to be?

Your characterisation of what was “Decided” assumes the most malicious
interpretation possible. Did you really think what you wrote is how it
works? Your reactive sarcasm does not become you.

Aristotle Pagaltzis

Feb 26, 2012, 8:23:45 PM
to perl5-...@perl.org
* Tom Christiansen <tch...@perl.com> [2012-02-26 20:25]:
> The thing I don't want is to have to tell people that they cannot
> trust perl -C, that they cannot trust PERL_UNICODE, that they cannot
> trust use utf8, that that they cannot trust use open, that they cannot
> trust binmode, that they cannot trust :encoding(UTF-8), and that the
> only thing they can trust is laborious and error-prone manual
> encoding/decoding with FB_CROAK.

No one else here wants that either. Seriously! You do not have to argue
that this should be fixed: everyone agrees with you. Everyone wants this
corrected.

So the only issue is whether to tarpit the release for what has been
busted for 10 years already and will be busted another few years after
it is fixed until end users have the fix. Can it be so urgent that it
has to be done in a last-minute rush, and is it so much more important
than all other things that have happened in 5.16 that they deserve to be
delayed in the meantime?

I believe the answer to that, and to that only, is a firm no.

In fact this is too important a matter to be left for a last minute
rush. It deserves to be fixed with deliberation and proper care.

Maybe it will be fixed in 5.18, or maybe it will not be fixed completely
until 5.20 but we will see the situation substantially improved in 5.18.

However long it takes, though: yes, it is worth fixing. And it will be
fixed. Yes, people should be able to trust -C and PERL_UNICODE and
binmode and, even, :utf8 (and not just the :encoding(UTF-8) longhand!).
All stand in agreement on this. Do not cry for Perl just yet.

> Absolutely nothing depends upon any particular release date

The *date* is irrelevant.

The *rhythm* is paramount.

What depends on the rhythm, what is at stake, is whether the work of
p5p reaches users in a sane time frame. Disregard the rhythm at risk of
leaving users in a drought, ironically due to best intentions. We know
this to be true because we saw it happen. And we know the catastrophe
from sticking to the rhythm to not be true because after 5.12 we saw it
pass us by.

> I am very uncomfortable with the idea of relentlessly charging ahead
> toward a release like a freight train with no brakes.

Hey, I can see the next one from here already! Up there at the horizon!
Get ready to jump on!

Christian Hansen

Mar 1, 2012, 8:59:01 PM
to Tom Christiansen, Leon Timmermans, Karl Williamson, Perl5 Porters Mailing List, Jarkko Hietaniemi, cha...@cpan.org
26 feb 2012 kl. 20:22 skrev Tom Christiansen:

> Christian Hansen <christia...@mac.com> wrote
>   on Tue, 21 Feb 2012 02:07:08 +0100:
>
>>>> I would love for this to happen, I have advocated this on #p5p several
>>>> times, but there is always the battle of "backwards compatibility
>>>> disease". About 10 months ago I reported a security issue reading the
>>>> relaxed UTF-8 implementation (still undisclosed and still exploitable)
>>>> on the perl security mailing list.
>
> Then we are currently in a security-through-obscurity situation, wherein
> only overall ignorance of an exploit "protects" us. That's not protection;
> it's a vulnerability. Would you estimate the vulnerability is severe
> enough for us to consider whether in this particular case we should
> consider issuing patches for old releases, like make a 5.12.5 or 5.10.2?

The vulnerability is present from early releases of 5.8.X (I haven't confirmed all perl releases, but the implementation is the same). The vulnerability makes it possible to smuggle through character strings (specially crafted for malicious purposes) using the :utf8 layer, which (in this case) bypass Perl's regex engine (which fails the match/validation).

Whether or not this is severe enough to patch older releases, I'll leave unsaid.

>>> There is absolutely no need to remain compatible with security-related
>>> bugs, and every reason not to. Indeed, security is the only thing that
>>> we ever issue patches to releases that are past their end-of-life support.

I agree!

>> I lack the political skills to make this happen, but I'm more than willing
>> to provide the proper UTF-8 implementation for this (as defined by
>> Unicode/ISO/IEC 10646:2011) we could always discuss the need for the
>> invented meaning of relaxed. During my years as a professional programmer
>> for several high profile financial institutions in Sweden, I have only
>> encountered Ill-formed UTF8 through malicious attempts or clients that
>> thought that they were sending UTF-8 but using ISO-8859-1, that's my
>> experience, perhaps yours looks different?

> My own experiences are finding the wrong encoding used by accident, not by
> malicious intent. The situation you mention is therefore outside of my own
> experiences, which makes me all the more concerned about it. I have gigabytes
> of corrupt data because of Java having the wrong defaults for what to do
> with wrong encodings. It was a design mistake, but they locked themselves
> into it forever and everyone keeps paying for that blunder. Let's not
> mimic their bad decisions. Let's fix ours.

Sounds like the CESU-8 issue, been there ;)

> The thing I don't want is to have to tell people that they cannot trust
> perl -C, that they cannot trust PERL_UNICODE, that they cannot trust use
> utf8, that they cannot trust use open, that they cannot trust binmode,
> that they cannot trust :encoding(UTF-8), and that the only thing they can
> trust is laborious and error-prone manual encoding/decoding with FB_CROAK.

":encoding(UTF-8)" is currently what's offered by "core" perl. PerlIO::encoding provides a global to alter the behaviour of Encode, $PerlIO::encoding::fallback. I have not tested altering this global (using FB_CROAK), but I guess from looking at the internals that exceptions aren't expected.

> If that position is nonetheless correct, it drastically needs to be fixed.
> Christian, I don't know what political skills you allude to as needed to
> make this happen. Political skills to achieve a consensus that backwards
> compatibility with previous behavior known to be wrong is undesirable?

It's quite easy, we need a Benevolent Dictator, such as Larry Wall. Someone who can make the tough calls. Personally I think we should just implement Unicode as most people expect it to work (according to the Unicode standard).

What happened to the Perl mantra "Making Easy Things Easy and Hard Things Possible"?


> It seems to me that Python went through a transition where encoding-decoding
> errors changed from some sort of non-fatal to proper exceptions. I don't know
> what sort of conniptions they experience there, since it's not a backwards-
> contemptible change. But it doesn't have to be b-c, and probably shouldn't be.
> Jarkko is right.

What you are saying is correct, Python supports two different compile options, UCS2 and UCS4. Our case is worse, we support two different internal encodings depending on platform: on EBCDIC we use UTF-EBCDIC and on US-ASCII platforms we use a relaxed UTF-8 encoding.

Just to cut to the shit, there seems to be a group of people that likes EBCDIC, but so far we haven't heard from anyone with such facilities.

Why are we trying to support two different encodings when we can barely support the proper one?


> It's better to fix bugs than to document them, and it's better to document them
> than not. Right now I'm very hazy on the real status of all this stuff, and I
> am very uncomfortable with the idea of relentlessly charging ahead toward a
> release like a freight train with no brakes.

We should  


> Absolutely nothing depends upon any particular release date, but quite a bit
> depends on correct behavior, especially if it is security-related. I know which
> one of those *I* consider immeasurably more important, but Aristotle appears to
> be of the opposite opinion. Is this the "political will" problem you mention?

Partly. 
But you have the power to change the track!

MvH
chansen

Nicholas Clark

Mar 2, 2012, 12:26:01 PM
to perl5-...@perl.org
On Fri, Mar 02, 2012 at 02:59:01AM +0100, Christian Hansen wrote:
>
> 26 feb 2012 kl. 20:22 skrev Tom Christiansen:
>
> > Christian Hansen <christia...@mac.com> wrote
> > on Tue, 21 Feb 2012 02:07:08 +0100:
> >
> >>>> I would love for this to happen, I have advocated this on #p5p several
> >>>> times, but there is always the battle of "backwards compatibility
> >>>> disease". About 10 months ago I reported a security issue reading the
> >>>> relaxed UTF-8 implementation (still undisclosed and still exploitable)
> >>>> on the perl security mailing list.
> >
> > Then we are currently in a security-through-obscurity situation, wherein
> > only overall ignorance of an exploit "protects" us. That's not protection;
> > it's a vulnerability. Would you estimate the vulnerability is severe
> > enough for us to consider whether in this particular case we should
> > consider issuing patches for old releases, like make a 5.12.5 or 5.10.2?
>
> The vulnerability is present from early realises of 5.8.X (I haven't confirmed all perl releases, but the implementation is the same). The vulnerability makes it possible to smuggle through character strings (specially crafted for malicious purposes) using the :utf8 layer, which (in this case) bypass the perls regex engine (which fails the match/validation).
>
> Wheter or not this is severe enough to patch older releases or not, I'll leave unsaid.

Yes. This is real, and needs fixing. It shouldn't be conflated with the
general documented problem that :utf8 is lax.

I believe (but haven't checked) that all the code here (the specific, and the
general) has been functionally unchanged since 5.8.0. ie - all 20 stable
releases in the past 10 years have the same behaviour.


As to the security reporting - this is my take:

With volunteers, when someone says that they will do something but they don't
get on with it, it's problematic. There's a fine line between nagging someone
enough to get them to do it, and nagging too far and they resign, leaving no-one
to do it.

In this case, "ownership" of security was person A, who had delegated it to
person B. Person B wasn't *doing* it, and person A wasn't chasing them up.
Because two things failed, it dropped on the floor. As I don't know B well,
I asked A about it. But, personally, there's only so long I am going to
point out that it was on the floor, before I consider it pointless banging
my head. Also, specifically *I* am pointing out things on the floor, rather
than picking them up and pretending that there was never a problem, as

a) I've done most of these things before, and it's someone else's turn now.
   Security fixes need new maint releases - I'm done with doing releases.
b) I see it as more useful to Perl 5 long term to cause shorter term pain
   to fix the problems, than to pretend that they don't exist.

I can report that there is progress generally here - one of A and B above has
handed over their position to a new individual, who is being more active.

> >>> There is absolutely no need to remain compatible with security-related
> >>> bugs, and every reason not to. Indeed, security is the only thing that
> >>> we ever issue patches to releases that are past their end-of-life support.
>
> I agree!

But not forever. The support policy states:

We "officially" support the two most recent stable release series.
5.10.1 and earlier are now out of support. As of the release of 5.16.0,
we will "officially" end support for Perl 5.12.4, other than providing
security updates as described below.

To the best of our ability, we will attempt to fix critical issues in
the two most recent stable 5.x release series. Fixes for the current
release series take precedence over fixes for the previous release
series.

To the best of our ability, we will provide "critical" security patches
/ releases for any major version of Perl whose 5.x.0 release was within
the past three years. We can only commit to providing these for the
most recent .y release in any 5.x.y series.

We will not provide security updates or bug fixes for development
releases of Perl.

We encourage vendors to ship the most recent supported release of Perl
at the time of their code freeze.

As a vendor, you may have a requirement to backport security fixes
beyond our 3 year support commitment. We can provide limited support
and advice to you as you do so and, where possible will try to apply
those patches to the relevant -maint branches in git, though we may or
may not choose to make numbered releases or "official" patches
available. Contact us at <perl5-secu...@perl.org> to begin that
process.

http://perl5.git.perl.org/perl.git/blob/HEAD:/pod/perlpolicy.pod#l85



> It's quite easy, we need a Benevolent Dictator, such as Larry
> Wall. Someone who can make the tough calls. Personally I think we should
> just implement Unicode as most people expect it to work (according to the
> Unicode standard).

Tom's conclusion from his talk at OSCON last year was that Perl 5's Unicode
support is generally better than that of *every* other language. You seem to
be choosing words that make it sound as if this is not the case.

And yes, Ricardo is acting as dictator and *has* made a decision. The problem
is that people don't like it when decisions go against them. And decisions
can't be in everyone's preferred direction, as if everyone agreed, a decision
wouldn't have been needed.

Please don't confuse the specific security issue with the general documented
:utf8 laxness. The specific security issue damn well should be a release
blocker.

Ricardo's decision is that delaying 5.16 to fix the general *known*
*documented* *ten year old* laxness of :utf8 doesn't actually help get the
fix into the hands of end users any faster, but does deny them every other
bug fix currently in blead. As Ricardo stated in his e-mails, particularly
this one which no-one commented on, and no-one disputed:

http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2012-02/msg01160.html

* If Ricardo declares that something *is* a release blocker, then it blocks
  the release
* There isn't a pool of programmers he can direct to fix it
* He isn't able to fix it himself
* No-one was able to promise to deliver a fix in a timely fashion

result - the release would block, potentially forever.

Right now, the current stable version of perl is 5.14.2. It has lax :utf8.
If Ricardo chooses to *delay* 5.16.0, the current stable version still has
the bug.

Shipping a 5.18.0 as soon as we have it fixed gets the fix out there in
the *same* timeframe as delaying 5.16.0 until the fix is done.


I think that a lot of people reading this list don't realise how *few* people
are actually contributing *code* to the repository. Most commits are from
about half a dozen people. Which 6 people changes from month to month, and it
might be fairer to express it as 4 + 4 * 0.5, but it's surprisingly *low*,
given all the traffic on this list.

It's been like that for about 10 years, as can be seen from the graphs on
Ohloh:

https://www.ohloh.net/p/perl/analyses/latest


The graphs are noisy, but even so there's no obvious change point around the
switch from perforce to git, or around the start of regular blead releases.
Both *were* useful changes for other reasons, but haven't changed the
contributor makeup.


> What happened to the Perl mantra "Making Easy Things Easy and Hard Things Possible"?

I don't see what relevance that has here.

> > It seems to me that Python went through a transition where encoding-decoding
> > errors changed from some sort of non-fatal to proper exceptions. I don't know
> > what sort of conniptions they experience there, since it's not a backwards-
> > contemptible change. But it doesn't have to be b-c, and probably shouldn't be.
> > Jarkko is right.

I think that yes, we should make various things fatal. Including "wide
character in print" and UTF-8 laxness in the parser. But the codebase is
messy, and it all takes time.
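
Until such things are fatal by default, a program can already opt in per
lexical scope. A minimal sketch, assuming the "Wide character" warning
belongs to the utf8 warnings category (as perldiag lists it); the
in-memory filehandle is only there to keep the example self-contained:

```perl
use strict;
use warnings;
use warnings FATAL => 'utf8';   # promote utf8-category warnings to exceptions

# A byte-oriented handle with no encoding layer pushed on it.
open my $out, '>', \my $buffer
    or die "cannot open in-memory handle: $!";

# Printing a character above 0xFF to a handle without an encoding layer
# normally just warns "Wide character in print"; here it dies instead.
my $ok  = eval { print {$out} "\x{3A3}"; 1 };
my $err = $ok ? '' : $@;

print $ok ? "printed without complaint\n" : "died: $err";
```

Pushing an explicit layer, for example with binmode($out, ":encoding(UTF-8)"),
is the usual way to avoid the error in the first place.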

> What you are saying is correct: Python supports two different compile options, UCS2 and UCS4. Our case is worse: we support two different internal encodings depending on platform. On EBCDIC we use UTF-EBCDIC, and on US-ASCII platforms we use a relaxed UTF-8 encoding.
>
> To cut to the chase, there seems to be a group of people that likes EBCDIC, but so far we haven't heard from anyone with such facilities.
>
> Why are we trying to support two different encodings when we can barely support the proper one?


It is politically awkward to kill EBCDIC support. Given that (so far) neither
the EBCDIC user base nor IBM have actually come good on delivering help, they
are not doing themselves any favours. (There was one individual who volunteered
to test build, but he's gone cold. Ricardo is chasing him). At some point,
soon*er* rather than later we will lose all patience with them.

However, *Jarkko* is also against removing EBCDIC support. He rightly points
out that it would be very hard to add it back once it's removed. I can also
understand his personal concern here - he put a lot of personal effort into
getting it working, and we're wanting to throw it away. I have a lot more time
for Jarkko than for all the talk-but-no-action EBCDIC user base, as even now
Jarkko is still responsible for 23% of the Perl 5 distribution, as the
ohloh link above shows.

But I believe that even Jarkko only had access to an EBCDIC system for a short
while whilst he was working on things, and no longer has access, so I don't
think that he could help *code* here, even if he had the time.


We're not going to kill EBCDIC support for 5.16.0. But at the current rate of
non-responsiveness from its userbase, its days *are* numbered.

Nicholas Clark