New API available to access Unicode DB, and RFC on changes to it.

Karl Williamson

Nov 21, 2011, 3:42:22 PM
to perl-u...@perl.org, BobH
Perl 5.15.5, now available, has additions to Unicode::UCD that allow
unfettered programmatic access to the Unicode Character Database. The
API is quite similar to what was sent out for comment on this list
several months ago; several changes were required as a result of lessons
learned during implementation. Attached to this email is an HTML file
showing the additions to the pod since 5.14, highlighted with a yellow
background.

As a result of this API, reading the files in lib/unicore directly is
now deprecated. Those files may change, whereas the API will be stable
as of 5.16. In the meantime, I'd be happy to have people use this and
give me feedback on any problems with the API or bugs in the code.

And, I do wish to change the API already for certain of the outputs in
prop_invmap() in order to make them more compact. For example, take the
uc() property. What it currently returns is this (taken from the
attached pod):

@$uppers_ranges_ref   @$uppers_maps_ref     Note
0                     "<code point>"
97                    65                    'a' maps to 'A'
98                    66                    'b' => 'B'
99                    67                    'c' => 'C'
...
120                   88                    'x' => 'X'
121                   89                    'y' => 'Y'
122                   90                    'z' => 'Z'
123                   "<code point>"
181                   924                   MICRO SIGN => Greek Cap MU
182                   "<code point>"
...
0x0149                [ 0x02BC 0x004E ]
0x014A                "<code point>"
0x014B                0x014A
...


That could be more compactly represented as:
@$uppers_ranges_ref   @$uppers_maps_ref     Note
0                     0
97                    -32                   'a'-'z' maps to 'A'-'Z'
123                   0
181                   743                   MICRO SIGN => Greek Cap MU
182                   0
...
0x0149                [ 0x02BC 0x004E ]
0x014A                0
0x014B                -1
...

where the map is to be added to the code point to get the final result.
Thus only one entry is needed to represent all 26 ASCII lower case
character mappings, instead of 26 entries. This makes such tables
significantly smaller. The Perl core currently does a linear search
through them looking for mappings, so using the more compact versions
would speed that up significantly. The gain is 30-40%, and for the
decimal digit mapping the result is a full order of magnitude smaller,
making the search much faster.
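To make the "add the map to the code point" rule concrete, here is a
minimal sketch of a lookup against the delta form. It is not the code
in the Perl core: the arrays are just the first rows of the table above
typed in by hand, the helper names range_index() and upper_cp() are made
up for illustration, and it uses a binary search rather than the linear
search the core currently does.

```perl
use strict;
use warnings;

# Hand-copied excerpt of the proposed delta format (first rows of the table
# above). Range $i starts at $ranges[$i] and ends just before $ranges[$i+1].
# The excerpt is truncated, so results above 182 are not meaningful.
my @ranges = (0, 97, 123, 181, 182);
my @deltas = (0, -32, 0, 743, 0);

# Hypothetical helper: binary search for the last range whose start is <= $cp.
sub range_index {
    my ($ranges, $cp) = @_;
    my ($lo, $hi) = (0, $#$ranges);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi + 1) / 2);
        if ($ranges->[$mid] <= $cp) { $lo = $mid }
        else                        { $hi = $mid - 1 }
    }
    return $lo;
}

# Hypothetical helper: the result is the code point plus its range's delta.
sub upper_cp {
    my ($cp) = @_;
    return $cp + $deltas[ range_index(\@ranges, $cp) ];
}

printf "%s => %s\n", chr(0x71), chr(upper_cp(0x71));   # prints "q => Q"
```

Every letter from 'a' to 'z' lands in the same range and shares the one
-32 entry, which is where the table shrinkage comes from.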

Returning the delta only makes sense for a few tables: those whose maps
are code points, or the decimal digits.

As you can see in the example for 0x0149, I wouldn't propose to make
deltas of the lists, even though that is inconsistent. They generally
require special handling.
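
As a rough illustration of the conversion, the sketch below turns the
current absolute form into the delta form and merges neighbouring ranges
that end up with the same delta, while passing list entries through
untouched. The function name to_delta_form() is hypothetical, and the
input is a toy excerpt: it assumes every numeric map sits on a
single-code-point range (as in the uc() example), and it treats the
parts of the table elided above as unmapped, which is not true of the
real data.

```perl
use strict;
use warnings;

# Hypothetical helper: convert absolute maps to deltas, merging adjacent
# ranges with equal numeric deltas. Array-ref (list) entries are left alone.
sub to_delta_form {
    my ($ranges, $maps) = @_;
    my (@out_ranges, @out_maps);
    for my $i (0 .. $#$ranges) {
        my $map = $maps->[$i];
        my $new;
        if (ref $map) {
            $new = $map;                      # list: no delta is taken
        }
        elsif ($map eq '<code point>') {
            $new = 0;                         # maps to itself
        }
        else {
            # Assumption: a numeric map covers exactly one code point.
            die "numeric map on a multi-code-point range at index $i\n"
                unless $i < $#$ranges
                    && $ranges->[$i + 1] == $ranges->[$i] + 1;
            $new = $map - $ranges->[$i];
        }
        # Ranges are contiguous, so equal neighbouring deltas can be merged.
        next if @out_maps
             && !ref $new
             && !ref $out_maps[-1]
             && $out_maps[-1] == $new;
        push @out_ranges, $ranges->[$i];
        push @out_maps,   $new;
    }
    return (\@out_ranges, \@out_maps);
}

# Toy input built from the rows shown in the first table.
my @ranges = (0, 97 .. 122, 123, 181, 182, 0x0149, 0x014A, 0x014B, 0x014C);
my @maps   = ('<code point>', 65 .. 90, '<code point>', 924, '<code point>',
              [ 0x02BC, 0x004E ], '<code point>', 0x014A, '<code point>');

my ($r, $m) = to_delta_form(\@ranges, \@maps);
printf "%d entries became %d\n", scalar(@ranges), scalar(@$r);
# prints "34 entries became 9"
```

The list at 0x0149 also acts as a barrier to merging here, so the ranges
on either side of it stay separate entries.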
ucd.htm

Karl Williamson

Nov 21, 2011, 4:19:27 PM
to perl-u...@perl.org, BobH
Repeat of last message, but now the attachment should be correct.
diff.htm