
Extended Unicode materials now available


Tom Christiansen

Jul 8, 2011, 11:41:30 AM
to Perl5 Porters Mailing List
(ALSO: Please feel free to pass this message along to
whomever you think it might help.)

Anyone interested is welcome to fetch the materials, including
both slides and scripts, for two of my OSCON talks on Perl and
Unicode from here:

http://training.perl.com/OSCON2011-PUE_and_UPR.tar.gz

My two talks are:

ⅰ. Perl Unicode Essentials

Nᴏᴛᴀ Bᴇɴᴇ: I’d proposed a 6‐hour talk,
but they gave me just 3 hours.

ⅱ. Unicode in Perl Regexes

Nᴏᴛᴀ Bᴇɴᴇ: I’d proposed a 6‐hour talk,
but they gave me a mere 40 minutes.

Both come in three equivalent forms, with the first
canonical and the other two derived from the first:

⒜ source doc in regular old Perl pod.
⒝ HTML slideshow derived from ⒜.
⒞ PDF to print multislides per page, also derived from ⒜.

Also included is a directory loaded with scripts. All are about Unicode.
Many I use every day, as some of you have seen. Most have Unicode in them,
and some even use verboten Unicode identifiers:

hypertest: my @ὑπέρμεγας = (
leo my $ʇndʇno = uʍopəpᴉƨdn($input);
nunez my $SI_IMPORTAN_MARCAS_DIACRÍTICAS = 0;
nunez next unless @resultados || $INCLUÍR_NINGUNOS;
nunez $cmáx => !$déjà_imprimée++ && encomillar($aldea),
uniquote sub commaʼd_list {

Please note that the last one is not “comma'd_list”!!
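
For anyone who has not tried this: identifiers like these only compile under "use utf8". The sketch below is not taken from the scripts directory. Only the two identifier names are borrowed from the excerpts above; the sub body and the list contents are made up for illustration.

```perl
use strict;
use warnings;
use utf8;                              # the source text itself is UTF-8
binmode(STDOUT, ":encoding(UTF-8)");   # encode non-ASCII output

# U+02BC MODIFIER LETTER APOSTROPHE is a letter, not punctuation,
# which is why it is allowed inside a sub name.
sub commaʼd_list {
    my @items = @_;
    return join(", ", @items);
}

# A Greek array name, as in the hypertest excerpt (contents invented here).
my @ὑπέρμεγας = ("α", "β", "γ");
print commaʼd_list(@ὑπέρμεγας), "\n";
```

With an ASCII apostrophe, Perl would instead read the name as the old-style package separator, which is presumably the reason for the warning above.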

A whatis(1)‐style description of the contents of the scripts directory follows.
It’s divided into 9 sections, with the more important sections toward the top.

Comments, corrections, kvetches, complaints, catcalls, cris‐de‐cœur,
cool‐beans, krikeys, carambas, kyries, and copriloquies all welcome. :)

--tom

Contents of tchrist-unicode-scripts directory, grouped:

1. HOLY TRIO OF INDISPENSABLE TOOLS FOR UNDERSTANDING THE UCD & UNICODE IN GENERAL
unichars – show which code points match arbitrary criteria
uniprops – show which props a code point has (by number or name, etc)
uninames – intelligrep the now‐excised NameList.txt (included)
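
To give a flavour of what a property lookup involves, here is a minimal sketch using only core Perl. It is not how uniprops is implemented; the helper name describe() and the short property list are assumptions made for this example.

```perl
use strict;
use warnings;
use charnames ();    # for charnames::viacode(); imports nothing

# Hypothetical helper (not part of uniprops): name a code point and list
# which of a handful of general properties it matches.
sub describe {
    my ($cp) = @_;
    my $char = chr $cp;
    my $name = charnames::viacode($cp) // "<no name>";
    my @candidates = qw(Letter Mark Number Punctuation Symbol Uppercase Lowercase);
    my @props;
    for my $prop (@candidates) {
        my $pattern = "\\p{$prop}";          # e.g. \p{Letter}
        push @props, $prop if $char =~ /$pattern/;
    }
    return sprintf("U+%04X %s: %s", $cp, $name, join(" ", @props));
}

binmode(STDOUT, ":encoding(UTF-8)");
print describe($_), "\n" for 0x41, 0xE9, 0x301, 0x2167;
```

The real tools cover far more properties than the seven tried here; this only shows that \p{...} and charnames::viacode are enough to get started.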

2. REWRITES OF CRITICAL UNIX PROGRAMS:
uniquote – replacement for od(1) or -v option to cat(1), but for Unicode
tcgrep – very ancient grep(1) replacement, needs rewrite but now supports named character
unilook – look(1) rewrite but with grep and agrep support; requires included words.utf8 file
ucsort – sort(1) rewrite using the UCA, includes Unicode locales, and intelligent --pre stuff
unifmt – fmt(1) rewrite
rename – ancient rewrite of Larry’s old rename(1) rewrite; might help Unicode filesyssues
uniwc – wc(1) rewrite for Unicode, includes \R support, graphemes, etc; needs refactoring
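
As an illustration of why a Unicode-aware wc(1) has more to count than bytes, the following sketch counts code points, graphemes (\X), and line breaks (\R) in a decoded string. It is not uniwc's code; text_counts() is a hypothetical helper, and it assumes a perl recent enough (5.12 or later) that \X matches extended grapheme clusters.

```perl
use strict;
use warnings;

# Hypothetical helper: count code points, graphemes, and line breaks in an
# already-decoded character string.
sub text_counts {
    my ($text) = @_;
    my $code_points = length $text;
    my $graphemes   = () = $text =~ /\X/g;   # extended grapheme clusters
    my $linebreaks  = () = $text =~ /\R/g;   # CRLF, LF, CR, U+2028, ...
    return ($code_points, $graphemes, $linebreaks);
}

# e + COMBINING ACUTE ACCENT, a, CRLF, b
my $sample = "e\x{301}a\r\nb";
printf "%d code points, %d graphemes, %d line breaks\n", text_counts($sample);
```

The three numbers differ because the accented e is two code points but one grapheme, and CRLF is two code points but one grapheme and one line break.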

3. PROGRAMS FOR NORMALIZATION FILTERS, CHECKER
nfd, nfc, nfkd, nfkc – Unicode normalization filters
nfcheck – report which of NF{,K}[DC] apply to any given file
% nfcheck leo hantest nunez tc macroman
leo: NFC NFD
hantest: NFC
nunez: NFC NFKC
tc: NFC NFKC NFD NFKD
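
The idea behind such a check can be sketched with the core Unicode::Normalize module: a string is in a given normalization form if normalizing it changes nothing. This is an assumption about the approach, not nfcheck's actual code, and forms_of() is a hypothetical name.

```perl
use strict;
use warnings;
use Unicode::Normalize;   # exports NFC, NFD, NFKC, NFKD by default

my %normalizer = (
    NFC  => \&NFC,
    NFKC => \&NFKC,
    NFD  => \&NFD,
    NFKD => \&NFKD,
);

# Hypothetical helper: return the names of the normalization forms that
# the (already decoded) string is in, i.e. those that leave it unchanged.
sub forms_of {
    my ($string) = @_;
    return grep { $normalizer{$_}->($string) eq $string } qw(NFC NFKC NFD NFKD);
}

for my $sample ("abc", "\x{E9}", "e\x{301}", "\x{FB01}") {
    printf "%-12s %s\n",
        join("+", map { sprintf "U+%04X", ord } split //, $sample),
        join(" ", forms_of($sample));
}
```

Pure ASCII is in all four forms, a precomposed é only in the composed ones, a decomposed é only in the decomposed ones, and a compatibility character such as the fi ligature only in NFC and NFD.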

4. (RE)CASING FILTER PROGRAMS:
lc – filter to do the Unicode toLower casemapping
% echo "Filter to Convert a Title's Words to the Right Case" | lc
filter to convert a title's words to the right case
tc – filter to do the Unicode toTitle casemapping (intelligently)
% echo "filter to convert a title's words to the right case" | tc
Filter To Convert A Title's Words To The Right Case
titulate – converts string args to English **HEADLINE** case (NB: headline != titlecase)
% titulate "filter to convert a title's words to the right case"
Filter to Convert a Title's Words to the Right Case
uc – filter to do the Unicode toUpper casemapping
% echo "filter to convert a title's words to the right case" | uc
FILTER TO CONVERT A TITLE'S WORDS TO THE RIGHT CASE
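
The three casemappings correspond to Perl's lc, uc, and ucfirst (which uses the titlecase mapping). Below is a deliberately naive sketch of per-word titlecasing; it is not the tc script, which is described above as doing the job intelligently, and naive_titlecase() is a made-up name.

```perl
use strict;
use warnings;

# Hypothetical helper: titlecase the first letter of each word and
# lowercase the rest. Apostrophes (ASCII or U+2019) and combining marks
# count as part of a word, so "title's" does not become "Title'S".
sub naive_titlecase {
    my ($s) = @_;
    $s =~ s/(\p{Alphabetic}[\p{Alphabetic}\p{Mark}'\x{2019}]*)/\u\L$1/g;
    return $s;
}

my $title = "filter to convert a title's words to the right case";
print lc($title), "\n";
print uc($title), "\n";
print naive_titlecase($title), "\n";
```

Titlecase differs from uppercase for a few characters: ucfirst turns U+01C6 (ǆ) into U+01C5 (ǅ), not U+01C4 (Ǆ).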


5. FONT GAME PROGRAMS:
leo – uʍopəpᴉsdn sƃuᴉɥʇ əʇᴉɹʍ oʇ ɹəʇlᴉɟ
unifont – filter for showing all Unicode “alternate font” letters
% echo hic sunt data unicodica | unifont
Double‐Struck: 𝕙𝕚𝕔 𝕤𝕦𝕟𝕥 𝕕𝕒𝕥𝕒 𝕦𝕟𝕚𝕔𝕠𝕕𝕚𝕔𝕒
Monospace: 𝚑𝚒𝚌 𝚜𝚞𝚗𝚝 𝚍𝚊𝚝𝚊 𝚞𝚗𝚒𝚌𝚘𝚍𝚒𝚌𝚊
Sans‐Serif: 𝗁𝗂𝖼 𝗌𝗎𝗇𝗍 𝖽𝖺𝗍𝖺 𝗎𝗇𝗂𝖼𝗈𝖽𝗂𝖼𝖺
Sans‐Serif Italic: 𝘩𝘪𝘤 𝘴𝘶𝘯𝘵 𝘥𝘢𝘵𝘢 𝘶𝘯𝘪𝘤𝘰𝘥𝘪𝘤𝘢
Sans‐Serif Bold: 𝗵𝗶𝗰 𝘀𝘂𝗻𝘁 𝗱𝗮𝘁𝗮 𝘂𝗻𝗶𝗰𝗼𝗱𝗶𝗰𝗮
Sans‐Serif Bold Italic: 𝙝𝙞𝙘 𝙨𝙪𝙣𝙩 𝙙𝙖𝙩𝙖 𝙪𝙣𝙞𝙘𝙤𝙙𝙞𝙘𝙖
Script: 𝒽𝒾𝒸 𝓈𝓊𝓃𝓉 𝒹𝒶𝓉𝒶 𝓊𝓃𝒾𝒸ℴ𝒹𝒾𝒸𝒶
Italic: h𝑖𝑐 𝑠𝑢𝑛𝑡 𝑑𝑎𝑡𝑎 𝑢𝑛𝑖𝑐𝑜𝑑𝑖𝑐𝑎
Bold: 𝐡𝐢𝐜 𝐬𝐮𝐧𝐭 𝐝𝐚𝐭𝐚 𝐮𝐧𝐢𝐜𝐨𝐝𝐢𝐜𝐚
Bold Italic: 𝒉𝒊𝒄 𝒔𝒖𝒏𝒕 𝒅𝒂𝒕𝒂 𝒖𝒏𝒊𝒄𝒐𝒅𝒊𝒄𝒂
Fraktur: 𝔥𝔦𝔠 𝔰𝔲𝔫𝔱 𝔡𝔞𝔱𝔞 𝔲𝔫𝔦𝔠𝔬𝔡𝔦𝔠𝔞
Bold Fraktur: 𝖍𝖎𝖈 𝖘𝖚𝖓𝖙 𝖉𝖆𝖙𝖆 𝖚𝖓𝖎𝖈𝖔𝖉𝖎𝖈𝖆
unicaps – Fɪʟᴛᴇʀ ᴛᴏ ᴄᴏɴᴠᴇʀᴛ ᴛᴏ sᴍᴀʟʟ ᴄᴀᴘs
unisubs, unisupers – filter to show subscripted₁₉₈₇ and ˢᵘᵖᵉʳˢᶜʳⁱᵖᵗᵉᵈ versions
unititle – prototype to over/underline things (real version in progress)
uniwide, uninarrow – reversible filters for converting to FULLWIDTH equivs

6. TEST AND DEMO PROGRAMS:
macroman – show mapping between MacRoman and Unicode
byte2uni – early prototype of general‐purpose version of the macroman
DEMO: byte2uni -a -ecp1252
es-sort – how to do fancy UCA sorts, using Spanish city names
hantest – demo of Unihan stuff and Unicode::{LineBreak, GCString}
havshpx – vs lbh unir gb nfx, lbh qb abg jnag gb xabj
hypertest – demo of support for trans‐Unicode code points
nunez – demo accent‐insensitive searches; very well commented
vowel-sigs – show how to create your own properties; also, regex subroutines
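
For a taste of the UCA-based sorting and accent-insensitive comparison that es-sort and nunez demonstrate, here is a minimal sketch using the core Unicode::Collate module with the default collation table. It is an assumption that this resembles what those scripts do; the city list and the helper same_letters() are invented for the example.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Level 1 compares base letters only, ignoring accents and case.
my $loose  = Unicode::Collate->new(level => 1);
# The default collator also uses accents and case as tie-breakers.
my $strict = Unicode::Collate->new;

# Hypothetical helper: true if two strings have the same base letters.
sub same_letters {
    my ($x, $y) = @_;
    return $loose->eq($x, $y) ? 1 : 0;
}

my @cities = ("Zaragoza", "\N{U+C1}vila", "C\N{U+E1}diz", "Burgos");
my @sorted = $strict->sort(@cities);

binmode(STDOUT, ":encoding(UTF-8)");
print join(", ", @sorted), "\n";
print same_letters("N\N{U+FA}\N{U+F1}ez", "nunez") ? "match\n" : "no match\n";
```

A plain sort() would put Ávila after Zaragoza, because it compares code points. Language-specific rules, such as treating ñ as a letter of its own in Spanish, need a tailored collator; Unicode::Collate::Locale provides those.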

7. MODULES
Underscore.pm – C<no Underscore> forbids unlocalized $_ access
FixString.pm – tries to sort text items with numbers, including Roman, intelligently,
includes support for Unicode Romans, and for Romans written in Latin
script, but requires Roman.pm module for the latter. Falls back to the UCA.
tchrist-unicode-charclasses__alpha.java – EGAD! I talked them into making most of
this functionality part of JDK7.

8. LIBRARIES:
unicore/{all,html,uwords}_alias.pl – a forgotten charnames facility

9. FILES:
words.utf8 – sorted dictionary list of UTF-8 words for unilook

WARNING: I *always* have my PERL_UNICODE envariable set to "S" (and only
turn that off on a rare, one‐shot basis), so many of these may
malfunction otherwise. Some may sometimes also need "A",
and tcgrep may sometimes also need "D" if not reading
from a pipe.
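
If you would rather not depend on the environment, a script can set up roughly the same thing itself. The sketch below approximates "S" and "A" for handles or arguments not already covered by PERL_UNICODE or -C. It is an assumption about a reasonable workaround, not something the scripts do, and force_unicode_io() is a hypothetical name. Note that it uses the stricter :encoding(UTF-8) layer where -CS uses :utf8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Bit values of ${^UNICODE}, as documented for the -C switch in perlrun.
my %BIT = (I => 1, O => 2, E => 4, A => 32);

# Hypothetical helper: add UTF-8 layers to whichever standard handles the
# given ${^UNICODE} bits do not cover, and return a decoded copy of @ARGV.
sub force_unicode_io {
    my ($bits) = @_;
    binmode(STDIN,  ":encoding(UTF-8)") unless $bits & $BIT{I};
    binmode(STDOUT, ":encoding(UTF-8)") unless $bits & $BIT{O};
    binmode(STDERR, ":encoding(UTF-8)") unless $bits & $BIT{E};
    return map { $bits & $BIT{A} ? $_ : decode("UTF-8", $_) } @ARGV;
}

@ARGV = force_unicode_io(${^UNICODE});
print "caf\N{U+E9}\n";    # prints correctly with or without PERL_UNICODE=S
```

The "D" setting, which affects the default layers of newly opened files, has a rough in-script counterpart in the open pragma, for example use open ':encoding(UTF-8)'.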

===============================

On Licences

I basically want to get all this stuff out there so people understand
all these things better. It’s really important. So please consider all
the tchrist_unicode_scripts/ files to carry, even if not stated:

=head1 COPYRIGHT AND LICENCE

Copyright 2011 Tom Christiansen.

This program is free software; you may redistribute it and/or
modify it under the same terms as Perl itself.

The slides licence I haven’t thought about. I think online
redistribution is pretty much fine so long as you don’t pretend you
wrote them instead of me. :)

*However*, be warned that the imminent 4ᵗʰ edition of the Camel
contains some of these specific examples and wordings, so you
probably should please ask before inserting them verbatim and
uncredited into your own books.

In other words: Just be cool, ok?
