Karl Williamson <pub...@khwilliamson.com> wrote
on Mon, 20 Feb 2012 12:27:18 MST:
>> I almost wonder whether this sort of thing oughtn't be a manpage,
>> something like perluni{ref,cheat,quick}?
> +1
This is pre-release. Would it be suitable?
Gosh, this is three dozen recipes, so too many for just one chapter in PCB; hm.
Maybe I should just do a little 300-page Unicode recipe book. :)
--tom
=encoding utf8
=head1 NAME
perlunicook - cookbookish examples of handling Unicode in Perl
=head1 DESCRIPTION
Unless otherwise noted, all examples below assume this preamble,
with the C<#!> adjusted to work on your system:
#!/usr/bin/env perl
use utf8;
use v5.12; # or later
use strict;
use warnings;
use warnings qw(FATAL utf8);
use open qw(:std :utf8);
use charnames qw(:full :short); # unneeded in v5.16
=head1 EXAMPLES
=head2 Sample filter
Always decompose on the way in, then recompose on the way out.
use Unicode::Normalize;
while (<>) {
$_ = NFD($_);
...
} continue {
print NFC($_);
}
=head2 Fine-tuning Unicode warnings
As of v5.14, Perl distinguishes three subclasses of UTF‑8 warnings.
use v5.14;
no warnings "nonchars"; # the 66 forbidden characters
no warnings "surrogates"; # UTF-16/CESU-8 nonsense
no warnings "non_unicode"; # for codepoints over 0x10_FFFF
=head2 Characters and their numbers
The C<ord> and C<chr> functions work transparently on all codepoints.
# ASCII characters
ord("A")
chr(65)
# characters from the Basic Multilingual Plane
ord("Σ")
chr(0x3A3)
# beyond the BMP
ord("𝑛")
chr(0x1D45B)
# beyond Unicode (up to MAXINT)
ord("\x{20_0000}")
chr(0x20_0000)
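As a quick check of the claim above, here is a minimal sketch that sends one code point from each range through C<chr> and back through C<ord>. The variable names are only illustrative, and the characters are written as numbers so the source stays pure ASCII.

```perl
use strict;
use warnings;

# One code point each from ASCII, the BMP, and beyond the BMP:
# "A", GREEK CAPITAL LETTER SIGMA, MATHEMATICAL ITALIC SMALL N.
my @codepoints = (0x41, 0x3A3, 0x1D45B);

my @report;
for my $cp (@codepoints) {
    my $char = chr($cp);    # number -> one-character string
    push @report, sprintf("U+%04X -> chr -> ord gives U+%04X (length %d)",
        $cp, ord($char), length($char));
}
print "$_\n" for @report;
```

Each string has length 1 regardless of how many bytes the character needs in UTF-8, which is the sense in which the functions are transparent.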
=head2 Unicode literals by character number
In a literal, you may specify a character by its number
using the C<\x{I<HHHHHH>}> escape.
String: "\x{3a3}"
Regex: /\x{3a3}/
String: "\x{1d45b}"
Regex: /\x{1d45b}/
# even non-BMP ranges in regex work fine
/[\x{1D434}-\x{1D467}]/
=head2 Get character name by number
use charnames ();
my $name = charnames::viacode(0x03A3);
=head2 Get character number by name
use charnames ();
my $number = charnames::vianame("GREEK CAPITAL LETTER SIGMA");
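The two lookups above are inverses of each other. A small sketch of the round trip, assuming nothing beyond the core C<charnames> module:

```perl
use strict;
use warnings;
use charnames ();

# name -> code point number
my $code = charnames::vianame("GREEK CAPITAL LETTER SIGMA");

# code point number -> name
my $name = charnames::viacode($code);

printf "U+%04X is %s\n", $code, $name;   # U+03A3 is GREEK CAPITAL LETTER SIGMA
```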
=head2 Unicode named characters
In v5.16, there is an implicit
use charnames qw(:full :short);
But prior to that release, you must be explicit about which charnames you
want. You should still specify a script if you want short names that are
script-specific.
use charnames qw(:full :short greek);
"\N{MATHEMATICAL ITALIC SMALL N}" # :full
"\N{GREEK CAPITAL LETTER SIGMA}" # :full
"\N{Greek:Sigma}" # :short
"\N{epsilon}" # greek
The v5.16 release also supports a C<:loose> import for loose matching of
character names.
=head2 Unicode named sequences
These look just like character names but return multiple code points.
Notice the C<%vx> vector-print functionality in C<printf>.
use charnames qw(:full);
my $seq = "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}";
printf "U+%v04X\n", $seq;
U+0100.0300
=head2 Custom named characters
Give your own nicknames to existing characters, or to unnamed
private-use characters.
use charnames ":full", ":alias" => {
ecute => "LATIN SMALL LETTER E WITH ACUTE",
"APPLE LOGO" => 0xF8FF, # private use character
};
"\N{ecute}"
"\N{APPLE LOGO}"
=head2 Declare source in utf8 for identifiers and literals
Without this declaration, putting UTF‑8 in your literals
and identifiers won’t work right. If you used the standard
preamble, this already happened.
use utf8;
my $measure = "Ångström";
my @μsoft = qw( cp852 cp1251 cp1252 );
my @ὑπέρμεγας = qw( ὑπέρ μεγας );
my @鯉 = qw( koi8-f koi8-u koi8-r );
=head2 Unicode casing
Unicode casing is very different from ASCII casing.
uc("henry ⅷ") # "HENRY Ⅷ"
uc("tschüß") # "TSCHÜSS" notice ß => SS
# both are true:
"tschüß" =~ /TSCHÜSS/i # notice ß => SS
"Σίσυφος" =~ /ΣΊΣΥΦΟΣ/i # notice Σ,σ,ς sameness
=head2 Unicode case-insensitive comparisons
Also available in the CPAN L<Unicode::CaseFold> module,
the new C<fc> “foldcase” function from v5.16 grants
access to the same Unicode casefolding as the C</i>
pattern modifier has always used:
use feature "fc"; # fc() function is from v5.16
# sort case-insensitively
my @sorted = sort { fc($a) cmp fc($b) } @list;
# both are true:
fc("tschüß") eq fc("TSCHÜSS")
fc("Σίσυφος") eq fc("ΣΊΣΥΦΟΣ")
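To illustrate why C<fc> rather than C<lc> is the tool for caseless comparison, here is a hedged sketch. It assumes v5.16 or later (C<use v5.16> turns on both C<fc> and Unicode string semantics), and it spells the non-ASCII letters as escapes so that no C<use utf8> is needed.

```perl
use v5.16;      # enables fc(), say, strict, and unicode_strings
use warnings;

my $lower = "tsch\x{FC}\x{DF}";   # "tschuess" spelled with U+00FC and U+00DF
my $upper = "TSCH\x{DC}SS";       # its uppercase form, where sharp s became SS

# lc() leaves the sharp s alone, so the two strings stay different;
# fc() folds the sharp s to "ss", so they compare equal.
my $lc_equal = lc($lower) eq lc($upper) ? 1 : 0;
my $fc_equal = fc($lower) eq fc($upper) ? 1 : 0;
say "lc equal: $lc_equal, fc equal: $fc_equal";   # lc equal: 0, fc equal: 1

# Sorting on folded keys ignores case; a plain sort would put "Cherry" first.
my @sorted = sort { fc($a) cmp fc($b) } qw(banana apple Cherry);
say "@sorted";                                    # apple banana Cherry
```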
=head2 Match Unicode linebreak sequence in regex
A Unicode linebreak matches the two-character CRLF
grapheme or any of seven vertical whitespace characters.
Good for dealing with intransigent Microsoft systems.
\R
s/\R/\n/g; # normalize all linebreaks to \n
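A short sketch of that normalization on a string that mixes several linebreak conventions. The sample text and variable names are made up for illustration.

```perl
use strict;
use warnings;

# CRLF, bare CR, U+2028 LINE SEPARATOR, and LF, all in one string.
my $text = "one\r\ntwo\rthree\x{2028}four\n";

# \R treats CRLF as a single linebreak, so it yields one "\n", not two.
(my $clean = $text) =~ s/\R/\n/g;

my @lines = split /\n/, $clean;
print scalar(@lines), " lines: ", join("|", @lines), "\n";   # 4 lines: one|two|three|four
```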
=head2 Get character category
Find the general category of a numeric codepoint.
use Unicode::UCD qw(charinfo);
my $cat = charinfo(0x3A3)->{category}; # "Lu"
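The hash returned by C<charinfo> carries more than the category. A minimal sketch that reads a few of its documented fields for the same code point:

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

my $info = charinfo(0x3A3);   # hash reference describing U+03A3

# The case mappings come back as hex strings, not as numbers.
printf "%s: category %s, lowercase U+%s, block %s\n",
    $info->{name}, $info->{category}, $info->{lower}, $info->{block};
```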
=head2 Disabling Unicode-awareness in builtin charclasses
Disable C<\w>, C<\b>, C<\s>, C<\d>, and the POSIX
classes from working correctly on Unicode, either in
this scope or in just one regex.
use v5.14;
use re "/a";
# OR
my($num) = $str =~ /(\d+)/a;
Or just use specific un-Unicode properties, like C<\p{ahex}>
and C<\p{posix_digit}>. Properties still work normally
no matter what.
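A small sketch of the difference C</a> makes, using Arabic-Indic digits as the test input (the choice of digits is only an example):

```perl
use v5.14;      # the /a modifier arrived in v5.14
use warnings;

my $str = "\x{664}\x{662}";   # ARABIC-INDIC DIGIT FOUR, ARABIC-INDIC DIGIT TWO

my $unicode_match = $str =~ /^\d+$/  ? 1 : 0;   # \d is Unicode-aware here
my $ascii_match   = $str =~ /^\d+$/a ? 1 : 0;   # under /a, \d means [0-9] only
my $plain_match   = "42" =~ /^\d+$/a ? 1 : 0;   # ASCII digits still match

say "without /a: $unicode_match, with /a: $ascii_match, ASCII input: $plain_match";
```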
=head2 Match Unicode properties in regex with \p, \P
These all match a single codepoint with the given
property. Use C<\P> in place of C<\p> to match
one lacking that property.
\pL, \pN, \pS, \pP, \pM, \pZ, \pC
\p{Sk}, \p{Ps}, \p{Lt}
\p{alpha}, \p{upper}, \p{lower}
\p{Latin}, \p{Greek}
\p{script=Latin}, \p{script=Greek}
\p{East_Asian_Width=Wide}, \p{EA=W}
\p{Line_Break=Hyphen}, \p{LB=HY}
\p{Numeric_Value=4}, \p{NV=4}
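As an illustration, this sketch counts matches for a few of the properties above in a short mixed-script string. The sample string is made up.

```perl
use strict;
use warnings;

# Greek Sigma, Latin x, Cyrillic Zhe, ASCII digit 9.
my $text = "\x{3A3}x\x{416}9";

my @greek       = $text =~ /\p{Greek}/g;   # codepoints in the Greek script
my @letters     = $text =~ /\pL/g;         # any letter, whatever the script
my @non_letters = $text =~ /\PL/g;         # \P negates: anything but a letter

printf "Greek: %d, letters: %d, non-letters: %d\n",
    scalar(@greek), scalar(@letters), scalar(@non_letters);   # Greek: 1, letters: 3, non-letters: 1
```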
=head2 Custom character properties
Define at compile-time your own custom character
properties for use in regexes.
# using private-use characters
sub In_Tengwar { "E000\tE07F\n" }
if (/\p{In_Tengwar}/) { ... }
# blending existing properties
sub Is_GraecoRoman_Title {<<'END_OF_SET'}
+utf8::IsLatin
+utf8::IsGreek
&utf8::IsTitle
END_OF_SET
if (/\p{Is_GraecoRoman_Title}/) { ... }
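Here is the first property above in a form that can be run as is. It is only a sketch: the test characters are arbitrary, and the range is the same private-use range used above. Note that the name of a user-defined property must begin with C<In> or C<Is>.

```perl
use strict;
use warnings;

# Returns the set as text: hex start, a tab, hex end, one range per line.
sub In_Tengwar { "E000\tE07F\n" }

my $inside  = "\x{E010}" =~ /\p{In_Tengwar}/ ? 1 : 0;   # within the range
my $outside = "A"        =~ /\p{In_Tengwar}/ ? 1 : 0;   # outside the range
my $negated = "A"        =~ /\P{In_Tengwar}/ ? 1 : 0;   # \P inverts the test

print "inside: $inside, outside: $outside, negated: $negated\n";   # inside: 1, outside: 0, negated: 1
```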
=head2 Convert non-ASCII Unicode numerics
Unless you’ve used C</a>, C<\d> matches more than ASCII digits.
use v5.14;
use Unicode::UCD qw(num);
my $str = "got Ⅻ and ४५६७ and ⅞ and here";
my @nums = ();
while ($str =~ /(\d+|\N)/g) { # not just ASCII!
push @nums, num($1);
}
say "@nums"; # 12 4567 0.875
use charnames qw(:full);
my $nv = num("\N{RUMI DIGIT ONE}\N{RUMI DIGIT TWO}");
=head2 Match Unicode grapheme cluster in regex
Programmer-visible “characters” are codepoints matched by C</./s>,
but user-visible “characters” are graphemes matched by C</\X/>.
# Find vowel plus any diacritics, underlining, etc.
use Unicode::Normalize;
my $nfd = NFD($orig);
$nfd =~ /(?=[aeiou])\X/i
=head2 Extract by grapheme instead of by codepoint (regex)
# match and grab the first five graphemes
my($first_five) = $str =~ /^(\X{5})/;
=head2 Extract by grapheme instead of by codepoint (substr)
# cpan -i Unicode::GCString
use Unicode::GCString;
my $gcs = Unicode::GCString->new($str);
my $first_five = $gcs->substr(0, 5);
=head2 Reverse string by grapheme
Reversing by codepoint messes up diacritics.
$str = join("", reverse $str =~ /\X/g);
# OR: cpan -i Unicode::GCString
use Unicode::GCString;
$str = reverse Unicode::GCString->new($str);
=head2 String length in graphemes
Count by grapheme, not by codepoint.
my $count = 0;
while ($str =~ /\X/g) { $count++ }
# OR: cpan -i Unicode::GCString
use Unicode::GCString;
my $gcs = Unicode::GCString->new($str);
my $count = $gcs->length;
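The regex-based grapheme recipes above need only core Perl, so they are easy to try together. A sketch under the assumption that the input is already decomposed (written with combining marks as escapes); the sample text is illustrative.

```perl
use strict;
use warnings;

# "creme brulee" with its three accents as separate combining marks.
my $str = "cre\x{300}me bru\x{302}le\x{301}e";

my $codepoints = length($str);             # counts every combining mark too
my $graphemes  = () = $str =~ /\X/g;       # counts user-visible characters

my ($first_three) = $str =~ /^(\X{3})/;    # "c", "r", and "e" plus its accent

# Reverse by grapheme so each accent stays attached to its base letter.
my $reversed = join("", reverse $str =~ /\X/g);

printf "%d codepoints, %d graphemes, first three graphemes use %d codepoints\n",
    $codepoints, $graphemes, length($first_three);
```

A plain C<reverse> of the same string would leave each combining mark in front of the wrong letter, which is the damage the grapheme-wise version avoids.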
=head2 Unicode column-width for printing
Perl’s C<printf>, C<sprintf>, and C<format> think all
code points take up 1 print column, but many take 0 or 2.
# cpan -i Unicode::GCString
use Unicode::GCString;
my $gcs = Unicode::GCString->new($str);
my $cols = $gcs->columns;
printf "%*s\n", $cols, $str;
=head2 Unicode normalization
Typically render into NFD on input and NFC on output.
Using either of the NFK functions improves recall on searches.
Note that this is about much more than just pre-combined compatibility glyphs.
use Unicode::Normalize;
my $nfd = NFD($orig);
my $nfc = NFC($orig);
my $nfkd = NFKD($orig);
my $nfkc = NFKC($orig);
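To make the four forms concrete, here is a minimal sketch using one accented letter and one compatibility ligature. The inputs are chosen only for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC NFKD NFKC);

my $precomposed = "\x{E9}";      # LATIN SMALL LETTER E WITH ACUTE
my $decomposed  = "e\x{301}";    # "e" plus COMBINING ACUTE ACCENT
my $ligature    = "\x{FB01}";    # LATIN SMALL LIGATURE FI

# Canonical forms convert between the two spellings of the accented e.
my $nfd_ok = NFD($precomposed) eq $decomposed  ? 1 : 0;
my $nfc_ok = NFC($decomposed)  eq $precomposed ? 1 : 0;

# Canonical forms leave the ligature alone; compatibility forms expand it,
# which is why a search for "fi" finds it only after NFKD or NFKC.
my $nfc_keeps_ligature  = NFC($ligature)  eq $ligature ? 1 : 0;
my $nfkc_expands_to_fi  = NFKC($ligature) eq "fi"      ? 1 : 0;

print "NFD: $nfd_ok, NFC: $nfc_ok, ",
      "NFC keeps ligature: $nfc_keeps_ligature, NFKC gives fi: $nfkc_expands_to_fi\n";
```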
=head2 Unicode collation
Text sorted by numeric codepoint follows no reasonable order;
use the UCA for sorting text.
use Unicode::Collate;
my $col = Unicode::Collate->new();
my @list = $col->sort(@old_list);
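Even plain ASCII shows the difference between the two orderings. A sketch with a made-up word list:

```perl
use strict;
use warnings;
use Unicode::Collate;

my @old_list = qw(banana Zebra apple cherry);

# Codepoint order: every capital letter sorts before every small letter.
my @by_codepoint = sort @old_list;

# UCA order: alphabetic first, with case only as a late tie-breaker.
my $col    = Unicode::Collate->new();
my @by_uca = $col->sort(@old_list);

print "codepoint: @by_codepoint\n";   # codepoint: Zebra apple banana cherry
print "UCA:       @by_uca\n";         # UCA:       apple banana cherry Zebra
```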
=head2 Case- I<and> accent-insensitive Unicode sort
Specify a collation strength of level 1 to ignore case and
diacritics, only looking at the basic character.
use Unicode::Collate;
my $col = Unicode::Collate->new(level => 1);
my @list = $col->sort(@old_list);
=head2 Unicode locale collation
Some locales have special sorting rules.
# either use v5.12, OR: cpan -i Unicode::Collate::Locale
use Unicode::Collate::Locale;
my $col = Unicode::Collate::Locale->new(locale => "de__phonebook");
my @list = $col->sort(@old_list);
=head2 Making C<cmp> work on text instead of codepoints
Instead of this:
@srecs = sort {
$b->{AGE} <=> $a->{AGE}
||
$a->{NAME} cmp $b->{NAME}
} @recs;
Use this:
use Unicode::Collate;
my $coll = Unicode::Collate->new();
for my $rec (@recs) {
$rec->{NAME_key} = $coll->getSortKey( $rec->{NAME} );
}
@srecs = sort {
$b->{AGE} <=> $a->{AGE}
||
$a->{NAME_key} cmp $b->{NAME_key}
} @recs;
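A runnable version of the same idea with a few invented records. The field names follow the snippet above; the data is purely illustrative.

```perl
use strict;
use warnings;
use Unicode::Collate;

my @recs = (
    { NAME => "Zoe",  AGE => 30 },
    { NAME => "adam", AGE => 30 },
    { NAME => "Bob",  AGE => 41 },
);

# Compute each sort key once, so the comparator can use plain cmp.
my $coll = Unicode::Collate->new();
$_->{NAME_key} = $coll->getSortKey($_->{NAME}) for @recs;

# Oldest first; within one age, names in UCA order.
my @srecs = sort {
    $b->{AGE} <=> $a->{AGE}
        ||
    $a->{NAME_key} cmp $b->{NAME_key}
} @recs;

my @names = map { $_->{NAME} } @srecs;
print "@names\n";   # Bob adam Zoe
```

Comparing C<NAME> directly with C<cmp> would put "Zoe" ahead of "adam", because capital letters have lower codepoints than small ones.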
=head2 Case- I<and> accent-insensitive comparisons
Use a collator object to compare Unicode text by character
instead of by codepoint.
use Unicode::Collate;
my $es = Unicode::Collate->new(
level => 1,
normalization => undef
);
# now both are true:
$es->eq("García", "GARCIA" );
$es->eq("Márquez", "MARQUEZ");
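For contrast, this sketch runs the same comparison through a default collator and a level 1 collator. It leaves normalization at its default rather than disabling it, which is an assumption made for simplicity here, not a recommendation.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $strict = Unicode::Collate->new();             # all levels: accents and case count
my $loose  = Unicode::Collate->new(level => 1);   # base letters only

my $accented = "Garc\x{ED}a";   # "Garcia" with U+00ED, i with acute

my $strict_eq = $strict->eq($accented, "GARCIA") ? 1 : 0;
my $loose_eq  = $loose->eq($accented, "GARCIA")  ? 1 : 0;
my $different = $loose->eq("Garcia", "Gomez")    ? 1 : 0;   # really different words

print "default: $strict_eq, level 1: $loose_eq, other word: $different\n";   # default: 0, level 1: 1, other word: 0
```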
=head2 Case- I<and> accent-insensitive locale comparisons
Same, but in a specific locale.
use Unicode::Collate::Locale;
my $de = Unicode::Collate::Locale->new(
locale => "de__phonebook",
);
# now this is true:
$de->eq("tschüß", "TSCHUESS"); # notice ü => UE
=head2 Unicode linebreaking
Break up text into lines according to Unicode rules.
# cpan -i Unicode::LineBreak
use Unicode::LineBreak;
use charnames qw(:full);
my $para = "This is a super\N{HYPHEN}long string. " x 20;
my $fmt = Unicode::LineBreak->new;
print $fmt->break($para), "\n";
=head2 Declare all three standard I/O streams to be utf8
Use a command-line option, an environment variable, or else
call C<binmode> explicitly:
$ perl -CS ...
or
$ export PERL_UNICODE=S
or
use open qw(:std :utf8);
or
binmode(STDIN, ":utf8");
binmode(STDOUT, ":utf8");
binmode(STDERR, ":utf8");
=head2 Make I/O default to utf8
Include files opened without an encoding argument.
$ perl -CSD ...
or
$ export PERL_UNICODE=SD
or
use open qw(:std :utf8);
=head2 Open file with implicit encode/decode
Specify stream encoding. This is the normal way
to deal with encoded text, not by calling low-level
functions.
# input file
open(my $in_file, "< :encoding(UTF-16)", "wintext");
OR
open(my $in_file, "<", "wintext");
binmode($in_file, ":encoding(UTF-16)");
THEN
my $line = <$in_file>;
# output file
open(my $out_file, "> :encoding(cp1252)", "wintext");
OR
open(my $out_file, ">", "wintext");
binmode($out_file, ":encoding(cp1252)");
THEN
print $out_file "some text\n";
The incantation C<":raw :encoding(UTF-16LE) :crlf">
includes implicit CRLF handling.
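The encoding layer can be tried without touching the disk. This sketch writes to and reads from an in-memory filehandle, so C<$bytes> stands in for the file; that substitution is an assumption made to keep the example self-contained.

```perl
use strict;
use warnings;

my $text  = "na\x{EF}ve \x{3A3}\n";   # 8 characters, two of them non-ASCII
my $bytes = "";                       # plays the role of the file on disk

# Writing: the layer encodes characters to UTF-16LE bytes on the way out.
open(my $out, ">:encoding(UTF-16LE)", \$bytes) or die "cannot open for writing: $!";
print {$out} $text;
close($out) or die "cannot close: $!";

# Reading: the same layer decodes the bytes back into characters.
open(my $in, "<:encoding(UTF-16LE)", \$bytes) or die "cannot open for reading: $!";
my $line = <$in>;
close($in);

printf "%d characters became %d bytes; round trip %s\n",
    length($text), length($bytes), ($line eq $text ? "ok" : "failed");
```

The program itself only ever sees characters; the bytes exist only inside C<$bytes>.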
=head2 Explicit encode/decode [rarely needed, see previous]
On very rare occasion, such as a database read, you may be
given encoded text you need to decode.
use Encode qw(encode decode);
my $chars = decode("shiftjis", $bytes);
# OR
my $bytes = encode("MIME-Header-ISO_2022_JP", $chars);
But see L<DBM_Filter::utf8> for easy implicit handling of UTF‑8
in DBM databases.
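For the rare case where explicit conversion is needed, here is a minimal round trip through UTF-8. The encoding is chosen only because it is easy to check by eye, and the sample string is illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $chars = "\x{3A3}\x{3AF}";           # two characters: Sigma, iota with tonos
my $bytes = encode("UTF-8", $chars);    # characters -> bytes
my $back  = decode("UTF-8", $bytes);    # bytes -> characters

# Show the bytes as hex so nothing non-ASCII is printed.
my $hex = join(" ", map { sprintf "%02X", ord } split(//, $bytes));

printf "%d characters, %d bytes (%s), round trip %s\n",
    length($chars), length($bytes), $hex, ($back eq $chars ? "ok" : "failed");
```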
=head1 SEE ALSO
L<perlunicode>,
L<perluniprops>,
L<perlre>,
L<perlrecharclass>,
L<perluniintro>,
L<perlunitut>,
L<perlunifaq>,
L<PerlIO>,
L<DBM_Filter::utf8>,
and
L<Encode>.
=head1 AUTHOR
Tom Christiansen E<lt>tch...@perl.comE<gt>
=head1 COPYRIGHT AND LICENCE
Copyright © 2012 Tom Christiansen.
All rights reversed. Perl licence, blah blah.
Some code excerpts taken from the 4th Edition of I<Programming Perl>,
Copyright © 2012 blah blah
=head1 REVISION HISTORY
zi1ch0