Need: list of Unicode characters that have canonical decompositions.

BobH

Jun 27, 2011, 10:26:58 AM

to perl-u...@perl.org

A project I'm working on needs to build a list of all Unicode characters
that have canonical decompositions. The most efficient ways I can think
of to get such a list are from unicore/Decomposition.pl or by scanning
unicore/UnicodeData.txt. However:

Re unicore/Decomposition.pl, the header of this says:

> # !!!!!!! INTERNAL PERL USE ONLY !!!!!!!
> # This file is for internal use by the Perl program only. The format and even
> # the name or existence of this file are subject to change without notice.
> # Don't use it directly.

Re unicore/UnicodeData.txt, I've recently posted a version of my module
that uses unicore/UnicodeData.txt to CPAN, and from Perl 5.14 testers
I've received only failure notices which indicate that the file cannot
be found :-(

Unicode::UCD can tell me if a specific character has a decomposition,
but can't give me a list of characters that have decompositions.

Any suggestions would be appreciated.

Bob

BobH

Jun 27, 2011, 11:10:40 AM

to perl-u...@perl.org

BobH wrote:

> Re unicore/UnicodeData.txt, I've recently posted a version of my module
> that uses unicore/UnicodeData.txt to CPAN, and from Perl 5.14 testers
> I've received only failure notices which indicate that the file cannot
> be found :-(
>

Just installed ActivePerl 5.14 and, indeed, this file no longer exists
-- guess that forces me to use unicore/Decomposition.pl in spite of its
included warning.

Bob

Karl Williamson

Jun 27, 2011, 4:01:30 PM

to BobH, perl-u...@perl.org

I'm presuming you need this not for a one-time only thing, but to be
able to run this program over and over. You can always download
UnicodeData.txt from the Unicode web site. In a regular expression,
\p{Dt= can} (Decomposition_Type=Canonical) will match all characters
that you want. I'm thinking that 5.16 will have the stringification of
that regex include the list you want, but not in 5.14, and
stringification is not necessarily fixed either.

I could easily write a new function for UCD that returns a list of all
code points that have a given property.

BobH

Jun 27, 2011, 10:04:09 PM

to perl-u...@perl.org, Karl Williamson

Karl Williamson wrote:

> I'm presuming you need this not for a one-time only thing, but to be
> able to run this program over and over.

Yes -- this is for a module that will be usable in a number of
situations. See
http://search.cpan.org/~bhallissy/Text-Unicode-Equivalents-0.05/.

The current implementation cheats by accessing unicore/Decomposition.pl
exactly the same way Unicode::UCD does.

> You can always download UnicodeData.txt from the Unicode web site.

Yes I can -- and certainly have done for my personal use. But including
that file (or some derivative) in a general purpose module would mean
that it wouldn't necessarily have the same Unicode version as the Perl
installation into which my module might be installed. And besides, the
information I need is already in the Perl core -- though supposedly not
usable.

> In a regular expression,
> \p{Dt= can} (Decomposition_Type=Canonical) will match all characters
> that you want.

Yes, I understand that I can test a character to see if it has a
particular decomposition, but I'm not sure I understand how to use a
regex to generate a complete list of characters with decompositions.
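
One way to turn the per-character regex test into a complete list is brute force: try every code point in turn. The following is a minimal sketch of that idea, not a recommendation from this thread; the helper name code_points_matching is made up for illustration, and the scan is slow compared with reading a precomputed table.

```perl
use strict;
use warnings;
no warnings 'utf8';    # the scan touches noncharacter code points

# Hypothetical helper: return every code point whose character matches
# the given regex. Brute force over the whole Unicode range.
sub code_points_matching {
    my ($re) = @_;
    my @matches;
    for my $cp (0 .. 0x10FFFF) {
        next if $cp >= 0xD800 && $cp <= 0xDFFF;    # skip surrogates
        push @matches, $cp if chr($cp) =~ $re;
    }
    return @matches;
}

my @canonical = code_points_matching(qr/\p{Decomposition_Type=Canonical}/);
printf "%d code points have a canonical decomposition\n", scalar @canonical;
```

The result reflects the Unicode version of the running Perl, which addresses the version-mismatch concern above, at the cost of a full scan on every run.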

> I'm thinking that 5.16 will have the stringification
> of that regex include the list you want, but not in 5.14, and
> stringification is not necessarily fixed either.
>
> I could easily write a new function for UCD that returns a list of
> all code points that have a given property.

That is an interesting offer, and I think this should be given serious
consideration. I'm sure my little module isn't the only one that, as we
go into the future, would benefit from such a function.

Thanks for your reply, Karl.

Bob

Karl Williamson

Jun 28, 2011, 1:31:12 PM

to BobH, perl-u...@perl.org

If I did this, I would be tempted to have it return an inversion list,
instead of an array of every code point that matches the property. Such
an array could be potentially length 1,114,112. The largest possible
inversion list is potentially half that, but the largest one that
matches a Unicode property is around length 700, and yours would be
somewhat over 200 entries. Inversion lists are often used for Unicode
because they represent its properties compactly.

An inversion list is an array. An example is:
5, 101, 116, 120, ...

This represents 5..100, 116..119 ...

The 0th element gives the first code point that is in the property; the
next element gives the first code point after that one that's not in the
property, and so forth. Each succeeding element marks the beginning of
a range that is/isn't in the property, inverting the is/isn't each time.
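
The reading rule above can be sketched in a few lines of Perl. This is only an illustration using the example list from this message; the function name invlist_to_ranges and its $max parameter (the upper bound used when the last range is open-ended) are assumptions, not part of any proposed API.

```perl
use strict;
use warnings;

# Hypothetical helper: convert an inversion list into [first, last] pairs.
# $max closes the final range when the list has an odd number of elements.
sub invlist_to_ranges {
    my ($max, @invlist) = @_;
    my @ranges;
    for (my $i = 0; $i < @invlist; $i += 2) {
        my $last = $i + 1 < @invlist ? $invlist[$i + 1] - 1 : $max;
        push @ranges, [ $invlist[$i], $last ];
    }
    return @ranges;
}

my @ranges = invlist_to_ranges(0x10FFFF, 5, 101, 116, 120);
print join(", ", map { "$_->[0]..$_->[1]" } @ranges), "\n";
# prints "5..100, 116..119"
```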

It is a simple matter to convert an inversion list into a true array or
hash of every code point that matches.

My question to you is would that be acceptable to you, do you think? I
hate to return an enormous array by default when the application doesn't
really need it.

BobH

Jun 29, 2011, 11:06:15 AM

to perl-u...@perl.org

Karl Williamson wrote:

> If I did this, I would be tempted to have it return an inversion
> list, instead of an array of every code point that matches the
> property. ...
>
> My question to you is would that be acceptable to you, do you think?
> I hate to return an enormous array by default when the application
> doesn't really need it.

Yes, that kind of representation would be sufficient and reasonably compact.

Thanks.

Bob

Karl Williamson

Jul 1, 2011, 11:37:48 AM

to BobH, perl-u...@perl.org

I'm trying to think of a good name. Best so far is UCD::get_prop_invlist()

Any ideas?

BobH

Jul 1, 2011, 12:40:23 PM

to perl-u...@perl.org

Karl Williamson wrote:
> I'm trying to think of a good name. Best so far is
> UCD::get_prop_invlist()

Hm, "get" normally isn't needed.

How about something simpler such as UCD::charlist()

Bob

Karl Williamson

Jul 1, 2011, 1:49:50 PM

to BobH, perl-u...@perl.org

I think not having prop in the name is potentially misleading, and it
actually isn't a list of the chars. It's an inversion list that is
readily convertible into such a list.

Karl Williamson

Jul 6, 2011, 4:42:58 PM

to BobH, perl-u...@perl.org

I've mostly written and tested it. Here is my proposed API, to see
how people like it (or not). I'm still open to a better name, but I do
think that the name needs to meet the requirements I mentioned above:

=pod

=head2 prop_invlist

C<prop_invlist> returns an inversion list (see below) that defines all the
code points for the Unicode property given by the input parameter string:

    say join ", ", prop_invlist("Any");
    0, 1114112

An empty list is returned if the given property is unknown.

L<perluniprops|perluniprops/Properties accessible through \p{} and \P{}>
gives the list of properties that this function accepts, as well as all the
possible forms for them. Note that many properties can be specified in a
compound form, such as

    say join ", ", prop_invlist("Script=Shavian");
    66640, 66688

    say join ", ", prop_invlist("ASCII_Hex_Digit=No");
    0, 48, 58, 65, 71, 97, 103

    say join ", ", prop_invlist("ASCII_Hex_Digit=Yes");
    48, 58, 65, 71, 97, 103

Inversion lists are a compact way of specifying Unicode properties. The 0th
item in the list is the lowest code point that has the property-value. The
next item is the lowest code point after that one that does NOT have the
property-value. And the next item after that is the lowest code point after
that one that has the property-value, and so on. Put another way, each
element in the list gives the beginning of a range that has the
property-value (for even numbered elements), or doesn't have the
property-value (for odd numbered elements).

In the final example above, the first ASCII Hex digit is code point 48, the
character "0", and all code points from it through 57 (a "9") are ASCII hex
digits. Code points 58 through 64 aren't, but 65 (an "A") through 70
(an "F") are, as are 97 ("a") through 102 ("f"). 103 starts a range of code
points that aren't ASCII hex digits. That range extends to infinity, which
on your computer can be found in the variable C<$Unicode::UCD::MAX_CP>.

It is a simple matter to expand out an inversion list to a full list of all
code points that have the property-value:

    my @invlist = prop_invlist("My Property");
    die "empty" unless @invlist;
    my @full_list;
    for (my $i = 0; $i < @invlist; $i += 2) {
        my $upper = ($i + 1) < @invlist
                    ? $invlist[$i+1] - 1      # In range
                    : $Unicode::UCD::MAX_CP;  # To infinity. You may want
                                              # to stop earlier
        for my $j ($invlist[$i] .. $upper) {
            push @full_list, $j;
        }
    }

=cut
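
Expanding the list is not always necessary. Because the elements of an inversion list are sorted, a single code point can be tested with a binary search on the list itself. Below is a minimal sketch of that idea; the function name in_invlist is hypothetical and not part of the proposed API, and the list is hard-coded from the ASCII_Hex_Digit=Yes example above so the snippet does not depend on prop_invlist being available.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $cp falls in a matching range of the
# inversion list referenced by $invlist.
sub in_invlist {
    my ($cp, $invlist) = @_;
    my ($lo, $hi) = (0, scalar @$invlist);

    # Binary search for the number of elements that are <= $cp.
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($invlist->[$mid] <= $cp) { $lo = $mid + 1 }
        else                         { $hi = $mid }
    }

    # An odd count means the last boundary crossed opened a matching range.
    return $lo % 2 == 1 ? 1 : 0;
}

my @hex_digits = (48, 58, 65, 71, 97, 103);    # ASCII_Hex_Digit=Yes
for my $char ("7", "F", "g") {
    printf "%s: %s\n", $char,
        in_invlist(ord($char), \@hex_digits) ? "hex digit" : "not a hex digit";
}
```

For a list of around 200 entries this needs about eight comparisons per lookup, compared with building a hash of every matching code point up front.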