
Unicode-AGE of a character?


Ilya Zakharevich

Jan 10, 2012, 1:47:56 AM
I looked through the docs I could find, and can't find any way to
determine the "Unicode AGE" of a particular codepoint except for:

a) running /\p{Present_in: FOO}/ for all foreseeable values of FOO;

b) manually parsing $out = do 'unicore/To/Age.pl';.

Do I miss anything?

Thanks,
Ilya

Ben Morrow

Jan 10, 2012, 7:02:43 AM

Quoth Ilya Zakharevich <nospam...@ilyaz.org>:
I don't think so. Note that before (I think) 5.14 unicore/To/Age.pl
doesn't exist, and before (I think) 5.12 unicore/DAge.txt doesn't exist
either. You may be better off just grabbing a copy of DerivedAge.txt
from the Unicode Consortium directly, and using that.

Ben

Ilya Zakharevich

Jan 10, 2012, 2:28:38 PM
What would be the best fix? (Myself, so far I do not use Perl's
digested data, and parse Unicode Consortium files directly - so I
do not qualify to judge.) Put the stuff into Unicode::UCD::age?

BTW, why Unicode::UCD has so bizarre interface? Why not have
Unicode::UCD::Name, for example? (The most important piece of data of
those not available via Perl4 interfaces...)

Ilya

P.S. Is unicore/NamesList.txt included with latest distributions of
Perl? My module relies on parsing this file, and... Aha, found it on

http://cpansearch.perl.org/src/FLORA/perl-5.14.2/lib/unicore/

, good!

Ben Morrow

Jan 10, 2012, 5:02:04 PM

Quoth Ilya Zakharevich <nospam...@ilyaz.org>:
> On 2012-01-10, Ben Morrow <b...@morrow.me.uk> wrote:
> > Quoth Ilya Zakharevich <nospam...@ilyaz.org>:
> >> I looked through the docs I could find, and can't find any way to
> >> determine the "Unicode AGE" of a particular codepoint except for:
> >>
> >> a) running /\p{Present_in: FOO}/ for all foreseeable values of FOO;
> >>
> >> b) manually parsing $out = do 'unicore/To/Age.pl';.
> >>
> >> Do I miss anything?
> >
> > I don't think so. Note that before (I think) 5.14 unicore/To/Age.pl
> > doesn't exist, and before (I think) 5.12 unicore/DAge.txt doesn't exist
> > either. You may be better off just grabbing a copy of DerivedAge.txt
> > from the Unicode Consortium directly, and using that.
>
> What would be the best fix? (Myself, so far I do not use Perl's
> digested data, and parse Unicode Consortium files directly - so I
> do not qualify to judge.) Put the stuff into Unicode::UCD::age?

For you, or for perl? For your purposes I'd've thought the best thing to
do is ship DerivedAge.txt from Unicode 6.0.0 with your application, then
parse unicore/DAge.txt if it's there and fall back to the shipped copy
if not. That way you've got data for at least the version of Unicode
this perl supports.
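
A minimal sketch of that fallback, assuming the usual DerivedAge.txt line layout ("0526..0527 ; 6.0 # comment"). The helper names (find_age_file, parse_derived_age, age_from_ranges) and the path share/DerivedAge.txt are made up for illustration and are not part of any module:

```perl
use strict;
use warnings;

# Look for perl's own unicore/DAge.txt in @INC; otherwise use the copy
# shipped with the application (hypothetical path passed by the caller).
sub find_age_file {
    my ($fallback) = @_;
    for my $dir (@INC) {
        next if ref $dir;                       # skip @INC hooks
        my $path = "$dir/unicore/DAge.txt";
        return $path if -f $path;
    }
    return $fallback;
}

# Parse DerivedAge.txt-style lines into a list of [start, end, age].
sub parse_derived_age {
    my ($fh) = @_;
    my @ranges;
    while (my $line = <$fh>) {
        $line =~ s/#.*//;                       # strip comments
        next unless $line =~
            /^\s*([0-9A-Fa-f]+)(?:\.\.([0-9A-Fa-f]+))?\s*;\s*(\S+)/;
        my ($start, $end, $age) = (hex $1, defined $2 ? hex $2 : hex $1, $3);
        push @ranges, [ $start, $end, $age ];
    }
    return @ranges;
}

# Linear scan; returns undef for code points not listed (unassigned).
sub age_from_ranges {
    my ($cp, @ranges) = @_;
    for my $r (@ranges) {
        return $r->[2] if $cp >= $r->[0] && $cp <= $r->[1];
    }
    return;
}

my $path = find_age_file('share/DerivedAge.txt');   # hypothetical shipped copy
if (-f $path) {
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my @ranges = parse_derived_age($fh);
    close $fh;
    my $age = age_from_ranges(0x0526, @ranges);
    print "U+0526: ", (defined $age ? $age : 'unassigned'), "\n";
}
```

The linear scan is fine for a handful of lookups; for bulk work one would sort the ranges and use a binary search instead.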

For perl, yes, I would have thought Unicode::UCD is the right place, but
I don't really do any serious Unicode work so my opinion doesn't count
for much :).

> BTW, why Unicode::UCD has so bizarre interface? Why not have
> Unicode::UCD::Name, for example?

I don't know, you'd have to ask Jarkko... I certainly agree there's a
lot missing. I suspect that the main effort for 5.8 was to get enough
implemented to make the regex engine work, and since then things have
only been added when someone has been sufficiently motivated to send in
a patch.

> (The most important piece of data of those not available via Perl4
> interfaces...)

Well, they're not strictly 'Perl4' interfaces, of course, since none of
this existed before 5.8...

Ben

Ilya Zakharevich

Jan 11, 2012, 1:58:01 AM
On 2012-01-10, Ben Morrow <b...@morrow.me.uk> wrote:
>> BTW, why Unicode::UCD has so bizarre interface? Why not have
>> Unicode::UCD::Name, for example?
>> (The most important piece of data of those not available via Perl4
>> interfaces...)
>
> Well, they're not strictly 'Perl4' interfaces, of course, since none of
> this existed before 5.8...

As far as my memory serves me, lc, /\w/ etc were very well supported
in Perl4. ;-)

Thanks for the other [omitted] input,
Ilya

Ben Morrow

Jan 11, 2012, 6:01:44 AM

Quoth Ilya Zakharevich <nospam...@ilyaz.org>:
Oh, sorry, I misunderstood you. I thought you meant the unicore/*.pl
files, which look a little like Perl 4-era libraries.

Yes, you're quite right: generally speaking, the only bits of the UCD
interface which are finished are the bits the rest of Perl needs.

Ben

brian d foy

Jan 16, 2012, 9:52:54 AM
In article <slrnjgnnos.4i...@panda.math.berkeley.edu>, Ilya
Zakharevich <nospam...@ilyaz.org> wrote:

> I looked through the docs I could find, and can't find any way to
> determine the "Unicode AGE" of a particular codepoint except for:
>
> a) running /\p{Present_in: FOO}/ for all foreseeable values of FOO;

Tom C. and I talked about this for a bit. You're stuck with testing
each of the age properties until you find the earliest that matches.
Or, you can go through all characters, determine their age, and create
your own properties.

I had to do this when we were tracking down some font issues for
Programming Perl. It turned out that all the problems were related to
Unicode 5 characters.

Helmut Wollmersdorfer

Jan 18, 2012, 4:11:48 PM
On 01/10/2012 07:47 AM, Ilya Zakharevich wrote:
> I looked through the docs I could find, and can't find any way to
> determine the "Unicode AGE" of a particular codepoint except for:
>
> a) running /\p{Present_in: FOO}/ for all foreseeable values of FOO;

If you want to know the AGE then you should match the Age property;-)

$ perl -E 'say "matches" if ("\x{0514}" =~ m/\p{Age=5.1}/)'
matches

> b) manually parsing $out = do 'unicore/To/Age.pl';.

Or write a better Unicode::UCD module.
I really wanted to do this, because Unicode::UCD does not use the
original UCD--it also uses unicore.

> Do I miss anything?

You can install Unicode::Tussle from CPAN. It provides some scripts.

Examples:

$ perl /usr/local/bin/unichars '\p{Age=6.0}' '\p{Cyrillic}' | cat
Ԧ U+0526 CYRILLIC CAPITAL LETTER SHHA WITH DESCENDER
ԧ U+0527 CYRILLIC SMALL LETTER SHHA WITH DESCENDER
Ꙡ U+A660 CYRILLIC CAPITAL LETTER REVERSED TSE
ꙡ U+A661 CYRILLIC SMALL LETTER REVERSED TSE

$ time perl /usr/local/bin/uniprops -au U+0526 | grep -P '(Age|Pre)'
Age=6.0 Bidi_Class=L Bidi_Class=Left_To_Right BC=L
Block=Cyrillic_Supplement Block=Cyrillic_Supplementary
Numeric_Value=NaN NV=NaN Present_In=6.0 IN=6.0 SC=Cyrl
Script=Cyrl Sentence_Break=UP Sentence_Break=Upper SB=UP

real 0m1.380s
user 0m1.352s
sys 0m0.040s

You see, that's very nice information, but it's very slow.

Another disadvantage of uniprops is that it also uses the unicore files and
thus depends on perl-5.14 (more or less). 5.10 lacks many properties in
unicore.

IMHO you want something that I am also missing:

use Unicode::Properties;

my $u = Unicode::Properties->new();

my $age = $u->get_property($char, 'Age');
my $script = $u->get_property($char, 'Script');

Helmut Wollmersdorfer

tch...@perl.com

Feb 15, 2012, 4:20:40 PM
I don’t think so. When preparing the 4th Edition of Programming Perl
for printing, we needed to run an analysis of code point use by age. I
ended up doing this:

# given/when needs "use feature 'switch';" and is experimental since 5.18
$char_info->{Age} = do { given ( $char ) {

when( /\p{Age=1.1}/ ) { '1.1' }

when( /\p{Age=2.0}/ ) { '2.0' }
when( /\p{Age=2.1}/ ) { '2.1' }

when( /\p{Age=3.0}/ ) { '3.0' }
when( /\p{Age=3.1}/ ) { '3.1' }
when( /\p{Age=3.2}/ ) { '3.2' }

when( /\p{Age=4.0}/ ) { '4.0' }
when( /\p{Age=4.1}/ ) { '4.1' }

when( /\p{Age=5.0}/ ) { '5.0' }
when( /\p{Age=5.1}/ ) { '5.1' }
when( /\p{Age=5.2}/ ) { '5.2' }

when( /\p{Age=6.0}/ ) { '6.0' }

default { 'N/A' }
} };

Which of course is suboptimal to say the least. I can criticize
it in quite a few directions. But it's what we used anyway.
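
For comparison, here is a sketch of the same test-each-age idea written as a loop over precompiled patterns instead of given/when. The age_of name is illustrative, and the list of ages is just the one from the code above (up to Unicode 6.0), so anything assigned later comes back as 'N/A':

```perl
use strict;
use warnings;

# Age values known as of Unicode 6.0, in ascending order.
my @ages = qw(1.1 2.0 2.1 3.0 3.1 3.2 4.0 4.1 5.0 5.1 5.2 6.0);

# Compile one \p{Age=...} pattern per version, once.
my %age_re;
for my $age (@ages) {
    my $pattern = "\\p{Age=$age}";
    $age_re{$age} = qr/$pattern/;
}

# Return the Age of a single character, or 'N/A' if none of the
# listed versions match (unassigned, or newer than the list).
sub age_of {
    my ($char) = @_;
    for my $age (@ages) {
        return $age if $char =~ $age_re{$age};
    }
    return 'N/A';
}

printf "U+%04X: %s\n", ord($_), age_of($_) for "A", "\x{0514}", "\x{0526}";
```

It still runs up to a dozen regex matches per character, so it shares the main weakness of the original; it only removes the repetition.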

I believe that Karl has some new stuff in the current blead that
exposes some of the character maps so you don't have to parse
the .pl files yourself. You might check into that.
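
A sketch of what that looks like with the functions that later appeared in Unicode::UCD, prop_invmap and search_invlist. This assumes a perl recent enough to ship them (check the Unicode::UCD documentation for your version); the age_via_invmap wrapper is just an illustrative name, and the exact spelling of the returned values (for example "V6_0" versus "6.0") should be checked against your perl rather than taken from here:

```perl
use strict;
use warnings;
use Unicode::UCD qw(prop_invmap search_invlist);

# Inversion map for the Age property: @$list_ref holds the starting code
# point of each range, @$map_ref the Age value for the matching range.
my ($list_ref, $map_ref, $format, $default) = prop_invmap('Age');

# Look up the Age value for a code point with a binary search.
sub age_via_invmap {
    my ($cp) = @_;
    my $i = search_invlist($list_ref, $cp);
    return defined $i ? $map_ref->[$i] : $default;
}

printf "U+%04X: %s\n", $_, age_via_invmap($_) for 0x0041, 0x0514, 0x0526;
```

This needs one binary search per code point instead of one regex match per Unicode version.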

--tom
