
RFC: API to access Unicode db files


Karl Williamson

Jul 21, 2011, 11:03:47 AM
to Perl5 Porters, perl-u...@perl.org, bhal...@cpan.org, BobH
Some applications are finding it necessary to read in the Unicode files
that mktables generates. For example, grepping through CPAN indicates
that Text::Unicode::Equivalents reads Decomposition.pl. This, and most
of the other generated files are marked for internal use only, because
we wish to reserve the right to change them around, etc. But
applications currently have no feasible alternative. Prior to 5.14, we
delivered the full Unicode db files that the Unicode consortium
publishes, and whose format is guaranteed not to change. But we dropped
those files in 5.14 to save disk space.

I'm proposing a new function Unicode::UCD::prop_invmap() to return the
contents of those files in a Unicode-centric way, so that applications
can use it and we can deprecate non-core use of our generated files.

The function returns an inversion map, which is a data structure more
used in the Unicode world than the Perl world. It consists of two
parallel arrays. I suppose a more Perl-centric data structure would be
an array of hashes, but the inversion map seems simpler to me to manipulate.

(This function would be in addition to the previously rfc'd function
Unicode::UCD::prop_invlist() which would return a list of all code
points that match a property-value.)

=pod

=head2 prop_invmap

C<prop_invmap> is used to get the complete mapping definition for the input
property, in the form of an inversion map. An inversion map consists of two
parallel arrays. One is an ordered list of code points that mark range
beginnings, and the other gives the value that all code points in the
corresponding range have. C<prop_invmap> is called with the name of the
desired property, and references to the two arrays, which it fills. For
example,

prop_invmap("Numeric_Value", \@numerics_ranges, \@numerics_maps);

will populate the arrays as shown below:

@numerics_ranges  @numerics_maps  Note
0x00              "NaN"           NaN stands for "Not a Number"
0x30              0               DIGIT 0
0x31              1
0x32              2
...
0x37              7
0x38              8
0x39              9               DIGIT 9
0x3A              "NaN"
0xB2              2               SUPERSCRIPT 2
0xB3              3               SUPERSCRIPT 3
0xB4              "NaN"
0xB9              1               SUPERSCRIPT 1
0xBA              "NaN"
0xBC              0.25            VULGAR FRACTION 1/4
0xBD              0.5             VULGAR FRACTION 1/2
0xBE              0.75            VULGAR FRACTION 3/4
0xBF              "NaN"
0x660             0               ARABIC-INDIC DIGIT ZERO
...               ...
0x110000          undef

The second line means that the value for the code point 0x30 (which is
"DIGIT 0") is 0. The first line means that all code points in the range
from 0x00 to 0x2F (which is 0x30 (from the second line) - 1) have the
value "NaN". The final line means that all code points above the legal
Unicode maximum code point have the value C<undef> (not the string
"u-n-d-e-f").

The arrays completely specify the mappings for all possible code points.

The special string S<C<"E<lt>code pointE<gt>">> is used to specify that
the value of a code point is itself. For example, the beginnings of the
arrays for

prop_invmap("Uppercase_Mapping", \@uppers_ranges, \@uppers_maps);

look like this:

@uppers_ranges  @uppers_maps    Note
0               "<code point>"
97              65              'a' maps to 'A'
98              66              'b' => 'B'
99              67              'c' => 'C'
...
120             88              'x' => 'X'
121             89              'y' => 'Y'
122             90              'z' => 'Z'
123             "<code point>"
181             924             MICRO SIGN => Greek Cap MU
182             "<code point>"
223             [ 83 83 ]       SHARP S => 'SS'
224             192

The first line means that the uppercase of code point 0 is 0, of 1 is 1, ...
of 96 is 96. Without the C<"E<lt>code pointE<gt>"> notation, every code
point would have to have an entry. This would mean that the arrays would
each have more than a million entries to list just the legal Unicode code
points!

In some properties some code points map to a sequence of multiple code
points. For those, the corresponding entries in the map array are not
scalars, but references to anonymous arrays containing the ordered list of
code points mapped to, as shown in the example above for 223.

The "Name" property map includes entries such as

CJK UNIFIED IDEOGRAPH-<code point>

This means that the name for the code point is "CJK UNIFIED IDEOGRAPH-"
with the code point (expressed in hexadecimal) appended to it. Also, the
notation "E<lt>hangul syllableE<gt>" occurs in this property, meaning
that the name is algorithmically calculated. These names can be generated
via the function C<charnames::viacode()>.

The "Decomposition_Mapping" property also uses "E<lt>hangul syllableE<gt>"
for those code points whose decomposition is algorithmically calculated.
These can be generated via the function C<Unicode::Normalize::NFD()>. This
property contains many occurrences of code points whose mappings are
ordered lists of other code points.

The return value is
C<undef> if the property is unknown;
C<s> if all the elements of the map array are simple scalars;
C<n> for the Name property, which has the complications described above;
C<d> for the Decomposition_Mapping property (complications already described);
otherwise C<c> if some of the map array elements are S<C<"E<lt>code pointE<gt>">>;
and C<l> if additionally some are lists of code points.

A binary search can be used to quickly find a code point in the inversion
list, and hence its corresponding mapping.

=cut


Karl Williamson

Aug 17, 2011, 4:41:48 PM
to Zefram, Perl5 Porters, perl-u...@perl.org
Here's a new version of the API for comment, with the addition of 2
extra functions:

prop_invlist()
"prop_invlist" returns an inversion list (described below)
that defines all the code points for the Unicode property
given by the input parameter string:

use Unicode::UCD 'prop_invlist';
say join ", ", prop_invlist("Any");

0, 1114112

An empty list is returned if the given property is unknown;
the number of elements in the list is returned if called in
scalar context.

perluniprops gives the list of properties that this function
accepts, as well as all the possible forms for them (loose
matching rules are used on the parameter). Note that many
properties can be specified in a compound form, such as

say join ", ", prop_invlist("Script=Shavian");
66640, 66688

say join ", ", prop_invlist("ASCII_Hex_Digit=No");
0, 48, 58, 65, 71, 97, 103

say join ", ", prop_invlist("ASCII_Hex_Digit=Yes");
48, 58, 65, 71, 97, 103

Inversion lists are a compact way of specifying Unicode
properties. The 0th item in the list is the lowest code
point that has the property-value. The next item is the
lowest code point after that one that does NOT have the
property-value. And the next item after that is the lowest
code point after that one that has the property-value, and so
on. Put another way, each element in the list gives the
beginning of a range that has the property-value (for even
numbered elements), or doesn't have the property-value (for
odd numbered elements).

In the final example above, the first ASCII Hex digit is code
point 48, the character "0", and all code points from it
through 57 (a "9") are ASCII hex digits. Code points 58
through 64 aren't, but 65 (an "A") through 70 (an "F") are,
as are 97 ("a") through 102 ("f"). 103 starts a range of
code points that aren't ASCII hex digits. That range extends
to infinity, which on your computer can be found in the
variable $Unicode::UCD::MAX_CP. (This variable is as close
to infinity as Perl can get on your platform, and may be too
high for some operations to work; you may wish to use a
smaller number for your purposes.)

The name for this data structure stems from the fact that
each element in the list toggles (or inverts) whether the
corresponding range is or isn't on the list.
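
As an illustration of the even/odd rule described above, here is a minimal sketch of a membership test on an inversion list. The helper name in_invlist is made up for this example and is not part of the proposed API; the list is hardcoded from the ASCII_Hex_Digit=Yes output so that the sketch runs without the new functions.

```perl
use strict;
use warnings;

# Inversion list copied from the ASCII_Hex_Digit=Yes example above.
my @hex_digits = (48, 58, 65, 71, 97, 103);

# Hypothetical helper (not part of Unicode::UCD): returns true if $cp is
# in the set described by the inversion list referenced by $invlist.
sub in_invlist {
    my ($invlist, $cp) = @_;

    # Binary search for the number of list elements that are <= $cp.
    my ($lo, $hi) = (0, scalar @$invlist);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($invlist->[$mid] <= $cp) { $lo = $mid + 1 }
        else                         { $hi = $mid }
    }

    # $lo - 1 is the index of the range that contains $cp.  Ranges that
    # start at an even index have the property-value, so $cp matches
    # exactly when the count $lo is odd.
    return $lo % 2 == 1;
}

for my $char (qw(0 9 : A F G a f g)) {
    printf "%s: %s\n", $char,
        in_invlist(\@hex_digits, ord($char)) ? "yes" : "no";
}
```

Because only the parity of the index matters, no expansion of the list is needed, which keeps the test cheap even for the final, unbounded range.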

It is a simple matter to expand out an inversion list to a
full list of all code points that have the property-value:

my @invlist = prop_invlist("My Property");
die "empty" unless @invlist;
my @full_list;
for (my $i = 0; $i < @invlist; $i += 2) {
    my $upper = ($i + 1) < @invlist
                ? $invlist[$i+1] - 1      # In range
                : $Unicode::UCD::MAX_CP;  # To infinity.  You may want
                                          # to stop much much earlier;
                                          # going this high may expose
                                          # perl bugs with very large
                                          # numbers.
    for my $j ($invlist[$i] .. $upper) {
        push @full_list, $j;
    }
}
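
Where a full expansion would be too large, one alternative is to keep the data as start/end pairs. The following is only a sketch under that assumption: invlist_to_ranges is a hypothetical helper, the input is hardcoded from the ASCII_Hex_Digit=Yes example, and an unbounded final range is represented by an undef end rather than by $Unicode::UCD::MAX_CP.

```perl
use strict;
use warnings;

# Hypothetical helper: turn an inversion list into [start, end] pairs.
# An end of undef means the range is unbounded ("to infinity").
sub invlist_to_ranges {
    my @invlist = @_;
    my @ranges;
    for (my $i = 0; $i < @invlist; $i += 2) {
        my $end = $i + 1 < @invlist ? $invlist[$i + 1] - 1 : undef;
        push @ranges, [ $invlist[$i], $end ];
    }
    return @ranges;
}

# Data copied from the ASCII_Hex_Digit=Yes example above.
for my $range (invlist_to_ranges(48, 58, 65, 71, 97, 103)) {
    my ($start, $end) = @$range;
    printf "%d..%s\n", $start, defined $end ? $end : "infinity";
}
# prints 48..57, 65..70 and 97..102, one range per line
```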

prop_aliases()
use Unicode::UCD 'prop_aliases';

my $full_name = prop_aliases("White Space");
my @all_names = prop_aliases("White Space");
my $short_name = $all_names[0];
print join(", ", @all_names), "\n";

XXX

Most Unicode properties have several synonymous names.
Typically, there is at least a short name, convenient to
type, and a long name that more fully describes the property,
and hence is more easily understood.

If you know one name for a property, you can use
"prop_aliases" to find either the long name (when called in
scalar context), or a list of all of the names, somewhat
ordered so that the short name is in the 0th element, the
long name in the next element, and any other synonyms in the
remaining elements, in no particular order.

The long name is returned in a form nicely capitalized,
suitable for printing.

White space, hyphens, and underscores are ignored in the
input parameter name.

If the name is unknown, "undef" is returned.

prop_value_aliases()
use Unicode::UCD 'prop_value_aliases';

my $full_name = prop_value_aliases("Gc", "Punct");
my @all_names = prop_value_aliases("Gc", "Punct");
my $short_name = $all_names[0];
print "The aliases are: ", join(", ", @all_names), "\n";
print "The fullname is $full_name\n";

The aliases are: P, Punctuation, Punct
The fullname is Punctuation

Some Unicode properties have a restricted set of legal
values. For example, all binary properties are restricted to
just "true" or "false"; and there are only a few dozen
possible General Categories.

For such properties, there are usually several synonyms for
each possible value. For example, in binary properties,
truth can be represented by any of the strings, "Y", "Yes",
"T", or "True"; and the General Category "Punctuation" by
that string, or "Punct", or simply "P".

Like property names, there is typically at least a short name
for each such property-value, and a long name. If you know
any name of the property-value, you can use
"prop_value_aliases"() to get the long name (when called in
scalar context), or a list of all the names, with the short
name in the 0th element, the long name in the next element,
and any other synonyms in the remaining elements, in no
particular order, except that any all-numeric synonyms will
be last.

The long name is returned in a form nicely capitalized,
suitable for printing.

White space, hyphens, and underscores are ignored in the
input parameters.

If either name is unknown, "undef" is returned.

If called with a property that doesn't have synonyms for its
values, it returns the input value, possibly normalized with
capitalization and underscores.

For the block property, new-style block names are returned
(see "Old-style versus new-style block names").

prop_invmap()
"prop_invmap" is used to get the complete mapping definition
for a property, in the form of an inversion map. An
inversion map consists of two parallel arrays. One is an
ordered list of code points that mark range beginnings, and
the other gives the value (or mapping) that all code points
in the corresponding range have.

"prop_invmap" is called with the name of the desired
property. The name is loosely matched, meaning that
differences in case, white-space, hyphens, and underscores
are not meaningful. Many Unicode properties have more than
one name (or alias). "prop_invmap" understands all of these.
"undef" is returned if the property name is unknown.
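
To make the loose matching described above concrete, here is a rough sketch that reduces a name to a comparison key. The helper loose_key is hypothetical, and it implements only the rule as stated in this text (case, white space, hyphens, and underscores are ignored); the matching actually done by the function may involve further details.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a name to a key for loose comparison by
# lowercasing it and deleting white space, hyphens, and underscores.
sub loose_key {
    my ($name) = @_;
    (my $key = lc $name) =~ s/[\s_-]+//g;
    return $key;
}

# All four spellings reduce to the same key, "whitespace".
for my $name ("White Space", "white_space", "WHITE-SPACE", "whitespace") {
    printf "%-12s => %s\n", $name, loose_key($name);
}
```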

It is a fatal error to call this function except in list
context.

In addition to the two arrays that form the inversion
map, "prop_invmap" returns two other values: one is a scalar
that gives some details as to the format of the entries of
the map array; the other is used for specialized purposes,
described at the end of this section.

This means that "prop_invmap" returns a 4-element list. For
example,

my ($blocks_ranges_ref, $blocks_maps_ref, $format, $default)
= prop_invmap("Block");

In this call, the two arrays will be populated as shown below
(for Unicode 6.0):

Index @blocks_ranges @blocks_maps
0 0x0000 Basic Latin
1 0x0080 Latin-1 Supplement
2 0x0100 Latin Extended-A
3 0x0180 Latin Extended-B
4 0x0250 IPA Extensions
5 0x02B0 Spacing Modifier Letters
6 0x0300 Combining Diacritical Marks
7 0x0370 Greek and Coptic
8 0x0400 Cyrillic
...
233 0x2B820 No_Block
234 0x2F800 CJK Compatibility Ideographs Supplement
235 0x2FA20 No_Block
236 0xE0000 Tags
237 0xE0080 No_Block
238 0xE0100 Variation Selectors Supplement
239 0xE01F0 No_Block
240 0xF0000 Supplementary Private Use Area-A
241 0x100000 Supplementary Private Use Area-B
242 0x110000 No_Block

The first line (with Index 0) means that the value for code
point 0 is "Basic Latin". The entry "0x0080" in the
@blocks_ranges column in the second line means that the value
from the first line, "Basic Latin", extends to all code
points in the range up to but not including 0x0080, that is,
to 127. In other words, the code points from 0 to 127 are
all in the "Basic Latin" block. Similarly, all code points
in the range from 0x0080 up to (but not including) 0x0100 are
in the block named "Latin-1 Supplement", etc. (Notice that
the return is the old-style block names; see "Old-style
versus new-style block names").

The final line (with Index 242) means that all code points
above the legal Unicode maximum code point have the value
"No_Block", which is the term Unicode uses for a
non-existing block.

The arrays completely specify the mappings for all possible
code points. The final element in an inversion map returned
by this function will always be for the range that consists
of all the code points that aren't legal Unicode, but that
are expressible on the platform. (That is, it starts with
code point 0x110000, the first code point above the legal
Unicode maximum, and extends to infinity.) The value for that
range will be the same that any normal unassigned code point
has for the specified property. (Certain unassigned code
points are not "normal"; for example the non-character code
points, or those in blocks that are to be written right-to-
left. The range value will not necessarily be the same as
those code points have.) It could be argued that, instead of
treating these as unassigned Unicode code points, the value
for this range should be "undef". You can make that decision
and change the returned array accordingly.

The maps are almost always simple scalars that should be
interpreted as-is. These values are those given in the
Unicode data files, which may be inconsistent as to
capitalization and which synonym for a property-value is
given. The results may be normalized by using the
"prop_value_aliases()" function.

There are exceptions to the simple scalar maps. Some
properties have some elements in their map list that are
themselves lists of scalars; and some special strings are
returned that are not to be interpreted as-is. Element [2]
(placed into $format in the example above) of the returned 4
element list tells you if the map has any of these special
elements, as follows:

"s" means all the elements of the map array are simple
scalars. Almost all properties are like this, like the
"block" example above.

"sl"
means that some of the map array elements have the form
given by "s", and the rest are lists of scalars. For
example, here is a portion of the output of calling
"prop_invmap"() with the "Script Extensions" property:

@scripts_ranges @scripts_maps
...
0x0953 Deva
0x0964 [ Beng Deva Guru Orya ]
0x0966 Deva
0x0970 Common

Here, the code points 0x964 and 0x965 are used in the
Bengali, Devanagari, Gurmukhi, and Oriya scripts.

"r" means that all the elements of the map array are either
rational numbers or the string "NaN", meaning "Not a
Number". A rational number is either an integer, or two
integers separated by a solidus ("/"). The second
integer represents the denominator of the division
implied by the solidus, and is guaranteed not to be 0.
If you want to convert them to scalar numbers, you can
use something like this:

my ($invlist_ref, $invmap_ref, $format)
    = prop_invmap($property);
if ($format && $format eq "r") {
    # "NaN" entries are left as they are; only rationals are evaluated
    for (@$invmap_ref) { $_ = eval $_ unless $_ eq "NaN" }
}

Here are some entries from the output of the property "Nv",
which has format "r".

@numerics_ranges  @numerics_maps  Note
0x00              "NaN"
0x30              0               DIGIT 0
0x31              1
0x32              2
...
0x37              7
0x38              8
0x39              9               DIGIT 9
0x3A              "NaN"
0xB2              2               SUPERSCRIPT 2
0xB3              3               SUPERSCRIPT 3
0xB4              "NaN"
0xB9              1               SUPERSCRIPT 1
0xBA              "NaN"
0xBC              1/4             VULGAR FRACTION 1/4
0xBD              1/2             VULGAR FRACTION 1/2
0xBE              3/4             VULGAR FRACTION 3/4
0xBF              "NaN"
0x660             0               ARABIC-INDIC DIGIT ZERO
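
For callers who would rather not use string eval on the rational entries, a possible alternative is to split on the solidus. This is a sketch only: rational_to_number is a hypothetical helper, the sample entries are copied from the "Nv" table, and mapping "NaN" to undef is a choice made for this example rather than anything the function does.

```perl
use strict;
use warnings;

# Hypothetical helper: convert one "r"-format map entry to a number.
# Returns undef for "NaN" (a choice made for this sketch).
sub rational_to_number {
    my ($entry) = @_;
    return undef if $entry eq "NaN";
    my ($numerator, $denominator) = split m{/}, $entry;
    return defined $denominator
        ? $numerator / $denominator   # denominator is guaranteed non-zero
        : $numerator + 0;
}

# Sample entries copied from the "Nv" table above.
my @maps    = ("NaN", 0, 9, "1/4", "1/2", "3/4");
my @numbers = map { rational_to_number($_) } @maps;
print join(", ", map { defined $_ ? $_ : "undef" } @numbers), "\n";
# prints: undef, 0, 9, 0.25, 0.5, 0.75
```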

"c" is like "s" in that all the map array elements are
scalars, but some of them are the special string
"<code point>", meaning that the map of each code point
in the corresponding range in the inversion list is the
code point itself. For example, in:

my ($uppers_ranges_ref, $uppers_maps_ref, $format)
    = prop_invmap("Simple_Uppercase_Mapping");

the returned arrays look like this:

@$uppers_ranges_ref  @$uppers_maps_ref  Note
0                    "<code point>"
97                   65                 'a' maps to 'A'
98                   66                 'b' => 'B'
99                   67                 'c' => 'C'
...
120                  88                 'x' => 'X'
121                  89                 'y' => 'Y'
122                  90                 'z' => 'Z'
123                  "<code point>"
181                  924                MICRO SIGN => Greek Cap MU
182                  "<code point>"
...

The first line means that the uppercase of code point 0
is 0; the uppercase of code point 1 is 1; ... of code
point 96 is 96. Without the "<code point>" notation,
every code point would have to have an entry. This would
mean that the arrays would each have more than a million
entries to list just the legal Unicode code points!

"cl"
means that some of the map array elements have the form
given by "c", and the rest are ordered lists of code
points. For example, in:

my ($uppers_ranges_ref, $uppers_maps_ref, $format)
    = prop_invmap("Uppercase_Mapping");

the returned arrays look like this:

@$uppers_ranges_ref  @$uppers_maps_ref  Note
0                    "<code point>"
97                   65
...
122                  90
123                  "<code point>"
181                  924
182                  "<code point>"
...
0x0149               [ 0x02BC 0x004E ]

This is the full Uppercase_Mapping property (as opposed
to the Simple_Uppercase_Mapping given in the example for
"c"). The only difference between the two in the ranges
shown is that the code point at 0x0149 (LATIN SMALL
LETTER N PRECEDED BY APOSTROPHE) maps to a string of two
characters, 0x02BC (MODIFIER LETTER APOSTROPHE) followed
by 0x004E (LATIN CAPITAL LETTER N).

"n" means the Name property. All the elements of the map
array are simple scalars, but some of them contain
special strings that require more work to get the actual
name.

Entries such as:

CJK UNIFIED IDEOGRAPH-<code point>

mean that the name for the code point is "CJK UNIFIED
IDEOGRAPH-" with the code point (expressed in
hexadecimal) appended to it (similarly for "CJK
COMPATIBILITY IDEOGRAPH-<code point>").

Also, entries like

<hangul syllable>

mean that the name is algorithmically calculated. This
is easily done by the function charnames::viacode().

Note that for control characters ("Gc=cc"), Unicode's
data files have the string "<control>", but the real name
of each of these characters is the empty string. This
function returns the real name.
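
The "<code point>" substitution for "n"-format entries might be done along these lines. This is a sketch: expand_name is a hypothetical helper, it handles only the "<code point>" placeholder, and "<hangul syllable>" entries are left to charnames::viacode() as described above.

```perl
use strict;
use warnings;

# Hypothetical helper: fill in the "<code point>" placeholder of an
# "n"-format map entry for a particular code point, using uppercase
# hexadecimal with at least four digits.
sub expand_name {
    my ($entry, $cp) = @_;
    my $hex = sprintf "%04X", $cp;
    (my $name = $entry) =~ s/<code point>/$hex/;
    return $name;
}

print expand_name("CJK UNIFIED IDEOGRAPH-<code point>", 0x4E00), "\n";
# prints: CJK UNIFIED IDEOGRAPH-4E00
```

Entries without the placeholder pass through unchanged, so the helper can be applied to every element of the map array.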

"d" means the Decomposition_Mapping property. Like "n", this
property uses

<hangul syllable>

for those code points whose decomposition is
algorithmically calculated. These can be generated via
the function Unicode::Normalize::NFD().

Otherwise, this property is like "cl" properties.

Note that the mapping is the one that is specified in the
Unicode data files, and to get the final decomposition,
it may need to be applied recursively.
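
The recursive application mentioned above can be sketched with a toy table. The hash %decomposition below holds just two single-step mappings, hardcoded for illustration, and full_decomposition is a hypothetical helper; the sketch covers only the recursion and nothing else that a complete normalization would need.

```perl
use strict;
use warnings;

# Toy single-step decomposition table for this sketch (two entries only):
#   U+1E69 => U+1E63 U+0307,  U+1E63 => U+0073 U+0323
my %decomposition = (
    0x1E69 => [ 0x1E63, 0x0307 ],
    0x1E63 => [ 0x0073, 0x0323 ],
);

# Hypothetical helper: apply the table repeatedly until no code point in
# the result has a further mapping.
sub full_decomposition {
    my ($cp) = @_;
    my $mapping = $decomposition{$cp};
    return $cp unless $mapping;
    return map { full_decomposition($_) } @$mapping;
}

print join(" ", map { sprintf "%04X", $_ } full_decomposition(0x1E69)), "\n";
# prints: 0073 0323 0307
```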

A binary search can be used to quickly find a code point in
the inversion list, and hence its corresponding mapping.
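
A minimal sketch of such a binary search follows. The helper invmap_lookup is hypothetical, and the two arrays are a hardcoded excerpt of the "Block" example (the first nine entries only), so any code point at or above 0x0400 would land in the last entry of this truncated data. The sketch assumes the first range starts at 0, as it does in the examples above.

```perl
use strict;
use warnings;

# First nine entries of the "Block" example above, hardcoded so that the
# sketch runs without the proposed function.
my @ranges = (0x0000, 0x0080, 0x0100, 0x0180, 0x0250,
              0x02B0, 0x0300, 0x0370, 0x0400);
my @maps   = ("Basic Latin", "Latin-1 Supplement", "Latin Extended-A",
              "Latin Extended-B", "IPA Extensions",
              "Spacing Modifier Letters", "Combining Diacritical Marks",
              "Greek and Coptic", "Cyrillic");

# Hypothetical helper: return the map entry for the range containing $cp.
sub invmap_lookup {
    my ($ranges, $maps, $cp) = @_;
    my ($lo, $hi) = (0, $#$ranges);

    # Find the last index whose range start is <= $cp.
    while ($lo < $hi) {
        my $mid = int(($lo + $hi + 1) / 2);
        if ($ranges->[$mid] <= $cp) { $lo = $mid }
        else                        { $hi = $mid - 1 }
    }
    return $maps->[$lo];
}

printf "U+%04X is in block %s\n", $_, invmap_lookup(\@ranges, \@maps, $_)
    for 0x41, 0x7F, 0x80, 0x3A9;
```

For "c" and "cl" maps the caller would still need to interpret a returned "<code point>" string or array reference; this sketch returns the entry untouched.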

The final element ([3], assigned to $default in the "block"
example) in the list returned by this function may be useful
for applications that wish to convert the returned inversion
map data structure into some other, such as a hash. It gives
the mapping that most code points map to under the property.
If you establish the convention that any code point not
explicitly listed in your data structure maps to this value,
you can potentially make your data structure much smaller.
As you construct your data structure from the one returned by
this function, simply ignore those ranges that map to this
value, generally called the "default" value.
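
As a sketch of that convention, the following builds a hash that stores only the code points whose value differs from the default. The arrays are a short hardcoded excerpt in the style of the "Nv" example, $default is set by hand rather than taken from a real call, and numeric_value is a hypothetical accessor. The comparison with eq assumes simple scalar map entries.

```perl
use strict;
use warnings;

# Short excerpt in the style of the "Nv" example; the last range is
# treated as unbounded and maps to the default.
my @ranges  = (0x00, 0x30 .. 0x39, 0x3A, 0xB2, 0xB3, 0xB4);
my @maps    = ("NaN", 0 .. 9, "NaN", 2, 3, "NaN");
my $default = "NaN";

my %value_of;
for my $i (0 .. $#ranges) {
    next if $maps[$i] eq $default;    # implied by the convention

    # The final range has no upper bound, so it cannot be expanded.
    die "unbounded non-default range" if $i == $#ranges;

    $value_of{$_} = $maps[$i] for $ranges[$i] .. $ranges[$i + 1] - 1;
}

# Hypothetical accessor: anything not stored maps to the default.
sub numeric_value {
    my ($cp) = @_;
    return exists $value_of{$cp} ? $value_of{$cp} : $default;
}

printf "%d entries stored; U+0035 => %s, U+0041 => %s\n",
    scalar keys %value_of, numeric_value(0x35), numeric_value(0x41);
```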

One internal Perl property is accessible by this function.
"Perl_Decimal_Digit" returns an inversion map in which all
the Unicode decimal digits map to their numeric values, and
everything else to the empty string, like so:

@digits @values
0x0000 ""
0x0030 0
0x0031 1
0x0032 2
0x0033 3
0x0034 4
0x0035 5
0x0036 6
0x0037 7
0x0038 8
0x0039 9
0x003A ""
0x0660 0
0x0661 1
...

Old-style versus new-style block names
Unicode publishes the names of blocks in two different
styles, though the two are equivalent under Unicode's loose
matching rules.

The original style uses blanks and hyphens in the block names
(except for "No_Block"), like so:

Miscellaneous Mathematical Symbols-B

The newer style replaces these with underscores, like this:

Miscellaneous_Mathematical_Symbols_B

This newer style is consistent with the values of other
Unicode properties. To preserve backward compatibility, all
the functions in Unicode::UCD that return block names (except
one) return the old-style ones. That one function,
"prop_value_aliases"() can be used to convert from old-style
to new-style:

my $new_style = prop_value_aliases("block", $old_style);
