Unicode and composition mappings

Hibou57 (Yannick Duchêne)

Feb 28, 2008, 9:08:12 PM

Hello to all world-wide alphabet lovers :)

Exploring the Unicode database, I found in the file
UnicodeData.txt a field named Decomposition_Mapping (which is
explained in the Unicode reference), but I cannot see any
Composition_Mapping field.

Does it mean that it has to be derived from the Decomposition_Mapping
field, by building a reverse associative array?

Or perhaps there is something I have not understood...

Anyway, I am surprised, because there are plenty of derived-property
files in the Unicode database, so it is strange that there is no
Composition_Mapping derivation anywhere.

What do you know about it?

Wish a nice time to all of you boys'n girls :)

Jukka K. Korpela

Feb 29, 2008, 8:20:11 AM

Scripsit Hibou57 (Yannick Duchêne):

> Exploring the Unicode database, I found in the file
> UnicodeData.txt a field named Decomposition_Mapping (which is
> explained in the Unicode reference), but I cannot see any
> Composition_Mapping field.

There is no such mapping defined in the Unicode standard. (Note that
UnicodeData.txt contains just a small part of properties of characters,
so you need to check such things from other sources in the database. But
I guess you already found this out.)

> Does it mean that it has to be derived from the Decomposition_Mapping
> field, by building a reverse associative array?

If you want such a mapping, yes.
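
A minimal sketch of such a reverse associative array, in Python. It assumes the interpreter's standard `unicodedata` module as the data source instead of a parsed UnicodeData.txt, so the result follows whatever Unicode version that interpreter ships with. The names `canonical_mapping` and `reverse_map` are invented for this illustration.

```python
import sys
import unicodedata

def canonical_mapping(cp):
    """Return the canonical Decomposition_Mapping of code point cp as a
    tuple of code points, or None if cp has no canonical mapping."""
    field = unicodedata.decomposition(chr(cp))
    # An empty field means "no mapping"; a leading "<tag>" marks a
    # compatibility mapping, which is skipped here.
    if not field or field.startswith("<"):
        return None
    return tuple(int(h, 16) for h in field.split())

# Naive reversal: decomposition sequence -> characters that decompose to it.
reverse_map = {}
for cp in range(sys.maxunicode + 1):
    mapping = canonical_mapping(cp)
    if mapping is not None:
        reverse_map.setdefault(mapping, []).append(cp)

print(len(reverse_map), "distinct canonical decomposition sequences")
print([hex(cp) for cp in reverse_map[(0x41, 0x300)]])  # A + combining grave
```

The values are lists on purpose: a raw reversal like this can map one sequence to several characters, and it still contains one-element mappings such as the one for the angstrom sign.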

> Anyway, I am surprised, because there are plenty of derived-property
> files in the Unicode database, so it is strange that there is no
> Composition_Mapping derivation anywhere.

Not really. A general composition mapping would hardly make much sense.
A large number of characters with decomposition mappings are
compatibility characters that should normally not be used in new data.
They have been included perhaps only because some older standard,
possibly quite obsolete by now, has defined a distinction (say, between
the Latin letter A with ring above, Å, and the angstrom symbol, which
looks exactly the same and is for all relevant purposes just that letter
_used_ for a particular meaning), and Unicode lets you retain such a
distinction in Unicode-encoded data _if desired_.

So general composition would make little sense, and it is generally
impossible in the sense that many characters share the same
decomposition, so how could a program decide what to do with a character
or string that might be some character's decomposition?
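
The ambiguity can be made visible with a small Python sketch. It groups characters by the target of their decomposition mapping, ignoring the `<tag>`, and relies on the standard `unicodedata` module rather than on UnicodeData.txt itself (an assumption made for brevity). `targets` is a name invented here.

```python
import sys
import unicodedata
from collections import defaultdict

targets = defaultdict(list)
for cp in range(sys.maxunicode + 1):
    field = unicodedata.decomposition(chr(cp))
    if not field:
        continue  # character has no decomposition mapping
    # Drop the "<compat>", "<noBreak>", ... tag if there is one.
    parts = [p for p in field.split() if not p.startswith("<")]
    targets[tuple(int(p, 16) for p in parts)].append(cp)

# Every character printed here decomposes to a plain SPACE (U+0020),
# so U+0020 alone cannot tell a program which one to "compose" back to.
for cp in targets[(0x20,)]:
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}")

shared = sum(1 for sources in targets.values() if len(sources) > 1)
print(shared, "decomposition targets are shared by several characters")
```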

In a restricted sense, general-purpose composition is possible, namely
_canonical_ composition. It's routinely used when performing
normalization; see "Normalization Forms" in the Unicode standard. In
particular, certain normalization forms involve things like mapping a
character followed by a combining diacritic mark into a precomposed
character.
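
As a small illustration in Python (using the standard `unicodedata` module, which is my choice of tool and not something the standard prescribes), Normalization Form C maps a base letter plus combining mark to the precomposed character, and Form D maps it back:

```python
import unicodedata

# LATIN CAPITAL LETTER A followed by COMBINING GRAVE ACCENT
decomposed = "A\u0300"

# NFC: canonical decomposition followed by canonical composition.
composed = unicodedata.normalize("NFC", decomposed)
print([f"U+{ord(c):04X}" for c in composed])

# NFD: canonical decomposition only, which undoes the step above.
back = unicodedata.normalize("NFD", composed)
print([f"U+{ord(c):04X}" for c in back])
```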

> Wish a nice time to all of you boys'n girls :)

And what about all the rest of us? "On the Internet, nobody knows you're
a dog."

--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/

Helmut Richter

Feb 29, 2008, 12:43:33 PM

On Fri, 29 Feb 2008, Jukka K. Korpela wrote:

> In a restricted sense, general-purpose composition is possible, namely
> _canonical_ composition. It's routinely used when performing normalization;
> see "Normalization Forms" in the Unicode standard. In particular, certain
> normalization forms involve things like mapping a character followed by a
> combining diacritic mark into a precomposed character.

Is there simple-to-use software available that does normalizations?
The statement by IBM reads as if they provide such software.

--
Helmut Richter

Andreas Prilop

Feb 29, 2008, 12:47:37 PM

On Fri, 29 Feb 2008, Helmut Richter wrote:

>> In particular, certain normalization forms involve things like
>> mapping a character followed by a combining diacritic mark
>> into a precomposed character.
>
> Is there simple-to-use software available that does normalizations?

Internet Explorer 7
scnr

Look at
http://www.unics.uni-hannover.de/nhtcapri/combining-marks.html
with Internet Explorer 7 and Firefox 2.

--
In memoriam Alan J. Flavell
http://groups.google.com/groups/search?q=author:Alan.J.Flavell

Hibou57 (Yannick Duchêne)

Feb 29, 2008, 1:13:48 PM

On 29 Feb, 14:20, "Jukka K. Korpela" <jkorp...@cs.tut.fi> wrote:
> Scripsit Hibou57 (Yannick Duchêne):
>
> Not really. A general composition mapping would hardly make much sense.
>
> [...]
>
> So it would make little sense to do general composition, and generally
> impossible in the sense that many characters share the same
> decomposition, so how could a program decide what to do with a character or
> string that might be some character's decomposition?
>
> In a restricted sense, general-purpose composition is possible, namely
> _canonical_ composition. [...]

I was sloppy, and forgot to say that I was asking about canonical
composition, which is anyway the only kind defined in UAX #15.

If I take all decompositions that have no decomposition type, and thus
stand for canonical decompositions, and reverse them, I get the canonical
composition. Is that the way it works?

I guess the answer is probably "yes", but I just want to be sure.

> And what about all the rest of us? "On the Internet, nobody knows you're
> a dog."

We are the pets, we are the children :D

On the internet, nobody knows you're from outer space :D (my pet cat
is just a kind of another life form, that's what I love about her)

Yannick

Jukka K. Korpela

Feb 29, 2008, 4:02:26 PM

Scripsit Hibou57 (Yannick Duchêne):

> If I take all decompositions that have no decomposition type, and thus
> stand for canonical decompositions, and reverse them, I get the canonical
> composition. Is that the way it works?

Roughly so, but the exact definition of canonical composition, as in
normalization forms C and KC, is more complicated and involves
reordering of combining marks, among other things.
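
A short Python sketch of why the reordering matters, using the standard `unicodedata` module (my choice of tool). The input is the well-known "d with dot above, then dot below" case: a plain reverse lookup on the original pair would find nothing, yet NFC still composes, because the marks are first put into canonical order.

```python
import unicodedata

def show(s):
    """Format a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

# LATIN SMALL LETTER D WITH DOT ABOVE followed by COMBINING DOT BELOW
s = "\u1E0B\u0323"

nfd = unicodedata.normalize("NFD", s)
nfc = unicodedata.normalize("NFC", s)

# Canonical combining classes after decomposition: the dot below (220)
# has been moved in front of the dot above (230).
print("ccc:", [unicodedata.combining(c) for c in nfd])
print("NFD:", show(nfd))
# Composition then pairs "d" with the dot below, not with the dot above.
print("NFC:", show(nfc))
```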

Jukka K. Korpela

Feb 29, 2008, 4:15:11 PM

Scripsit Andreas Prilop:

>> Is there simple-to-use software available that does normalizations?
>
> Internet Explorer 7

I don't think it does.

Software for normalization can be found e.g. via
http://www.unicode.org/onlinedat/products.html

> Look at
> http://www.unics.uni-hannover.de/nhtcapri/combining-marks.html
> with Internet Explorer 7 and Firefox 2.

It may look like normalization, but it's an illusion.

If you e.g. cut the part that looks like "À = À" and paste it into
WordPad, click on the location after the first "À", and press Alt+X (on
a sufficiently new version of WordPad), then it magically transforms to
"A300", because the program converts the combining grave accent U+0300 to
its hex code value. Nothing like that happens for the second occurrence
of "À": using Alt+X, you turn it into C0.

This illustrates that the two occurrences of "À" are really different
beasts: the first one is "A" followed by U+0300, whereas the second one
is the single letter "À", U+00C0. The browser has _not_ normalized
anything.
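
The same check can be done without WordPad. A minimal Python sketch, assuming the two strings are typed in as escapes so that no editor or browser can alter them along the way:

```python
import unicodedata

first = "A\u0300"   # "A" followed by COMBINING GRAVE ACCENT
second = "\u00C0"   # the single precomposed letter

# At the character level the two are simply different strings.
print(first == second)
print([hex(ord(c)) for c in first], [hex(ord(c)) for c in second])

# Only an explicit normalization step makes them compare equal.
same_after_nfc = (
    unicodedata.normalize("NFC", first) == unicodedata.normalize("NFC", second)
)
print(same_after_nfc)
```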

Displaying the two things in identical ways is correct and appropriate,
but it takes place at the formatting level, not at the character level.
And normalization is a character-level operation. Combining a letter and
a diacritic in visual presentation might even take place at the _glyph_
level (i.e., the rendering engine might render such a combination using
a single glyph from a font), but even that wouldn't be a character-level
issue.

Your page is a nice utility for testing _rendering_ level issues. The
results naturally depend on the browser and on the fonts available. For
example, though I see no difference in rendering (of the decomposed form
and the precomposed form) for many characters, I see a big difference
for Z with circumflex. Many things can happen; a simplistic
implementation just takes a base character and a glyph for a diacritic
and does an "overprint", and it might even use glyphs from different
fonts, since many fonts don't have glyphs for many combining diacritics.
That's bad, but it's a quality of implementation issue.

Jim Kingdon

Feb 29, 2008, 11:23:44 PM

> Software for normalization can be found e.g. via
> http://www.unicode.org/onlinedat/products.html

The open source library I keep hearing about is at:
http://www.ibm.com/software/globalization/icu/

I don't know how it compares with any of the other libraries at the
above page, nor do I have firsthand experience using any of these.

Hibou57 (Yannick Duchêne)

Mar 2, 2008, 11:13:05 PM

Just to check:

Using the latest (beta) Unicode 5.1 database, I've found

5405 compatibility decompositions
2043 canonical decompositions
1637 canonical compositions

Does anyone get the same result? (This is to check that there are no
errors in my implementation.)

Thanks :)

Hibou57 (Yannick Duchêne)

Mar 3, 2008, 8:45:25 AM

Erratum:

I said

> Using the latest (beta) Unicode 5.1 database, I've found
>
> 5405 compatibility decompositions
> 2043 canonical decompositions
> 1637 canonical compositions

But I've just corrected a bug, and now find 31 canonical
compositions.

31 seems a very small number... I'm surprised.

I've removed the composition exclusions that cannot be derived from
UnicodeData.txt (81), removed all decompositions starting with a
combining character (combining class /= 0), and removed all singleton
decompositions.

In the end, that leaves as few as 31 canonical compositions.

So the bulk lies with decompositions rather than with compositions.

... provided there are no other errors somewhere (this one was a big
one).
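
For cross-checking such counts, here is a hedged Python sketch of the same filtering. It does not parse CompositionExclusions.txt; instead it assumes that any remaining two-element canonical decomposition which NFC refuses to keep composed must be a listed exclusion. It uses the interpreter's `unicodedata` tables, so the numbers reflect that Unicode version and need not match the 5.1 beta. `reasons` and `composites` are invented names.

```python
import sys
import unicodedata
from collections import Counter

reasons = Counter()   # why a canonical decomposition is not reversed
composites = []       # characters that canonical composition can produce

for cp in range(sys.maxunicode + 1):
    ch = chr(cp)
    field = unicodedata.decomposition(ch)
    if not field or field.startswith("<"):
        continue  # no mapping, or a compatibility mapping
    mapping = [chr(int(h, 16)) for h in field.split()]
    if len(mapping) == 1:
        reasons["singleton"] += 1
    elif unicodedata.combining(ch) != 0 or unicodedata.combining(mapping[0]) != 0:
        reasons["non-starter decomposition"] += 1
    elif unicodedata.normalize("NFC", ch) != ch:
        # Assumption: what NFC still decomposes here is an exclusion
        # listed explicitly in CompositionExclusions.txt.
        reasons["listed exclusion"] += 1
    else:
        composites.append(cp)

print(dict(reasons))
print(len(composites), "canonical compositions (Hangul syllables not counted)")
```

Hangul syllables do not appear because their decompositions are algorithmic and not stored as Decomposition_Mapping values.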

Helmut Richter

Mar 3, 2008, 8:53:29 AM

On Mon, 3 Mar 2008, Hibou57 (Yannick Duchêne) wrote:

> But I've just corrected a bug, and now find 31 canonical
> compositions.
>
> 31 seems a very small number... I'm surprised.
>
> I've removed the composition exclusions that cannot be derived from
> UnicodeData.txt (81), removed all decompositions starting with a
> combining character (combining class /= 0), and removed all singleton
> decompositions.
>
> In the end, that leaves as few as 31 canonical compositions.

Could you please give examples of canonical compositions that are not
reversals of decompositions, and vice versa?

--
Helmut Richter

Hibou57 (Yannick Duchêne)

Mar 3, 2008, 9:21:27 AM

> Could you please give examples of canonical compositions that are not
> reversals of decompositions, and vice versa?

Compositions are reversals of decompositions: only the explicitly
excluded ones, the ones that start with a combining character, and the
singletons are left out. It is normal for a composition to be the
reverse of a canonical decomposition (see UAX #15 for more details).

An example: the first one in UnicodeData.txt order comes from the
decomposition of U+00C0, which is decomposed into U+0041, U+0300 and
is recomposed into U+00C0. But if the canonical reordering brings some
other combining mark next to the base, you may get a different
recomposition (that's why the canonical reordering is so important).

I'm sorry for the previous post; that result was indeed too strange, and
I've corrected a second bug (not an algorithm bug, but I had forgotten
an index increment somewhere).

I now get 928 canonical compositions. To be sure, I've checked a listing
and it seems OK now. An interesting point: all canonical
compositions are of length 2.

Here is an extract of the listing I could produce (the first ten):

-- 16#0000C0# <- 16#000041# 16#000300#
-- 16#0000C1# <- 16#000041# 16#000301#
-- 16#0000C2# <- 16#000041# 16#000302#
-- 16#0000C3# <- 16#000041# 16#000303#
-- 16#0000C4# <- 16#000041# 16#000308#
-- 16#0000C5# <- 16#000041# 16#00030A#
-- 16#0000C7# <- 16#000043# 16#000327#
-- 16#0000C8# <- 16#000045# 16#000300#
-- 16#0000C9# <- 16#000045# 16#000301#
-- 16#0000CA# <- 16#000045# 16#000302#
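
For comparison, a Python sketch that prints a listing in the same layout and checks the length-2 observation. It is not the program that produced the extract above: it relies on the standard `unicodedata` module and on the shortcut that a character with a canonical decomposition takes part in canonical composition exactly when NFC leaves it unchanged. `pairs`, `table` and `lines` are invented names.

```python
import sys
import unicodedata

def pairs():
    """Yield (composite, decomposition) for every character that has a
    canonical decomposition and that NFC keeps in composed form."""
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        field = unicodedata.decomposition(ch)
        if (field and not field.startswith("<")
                and unicodedata.normalize("NFC", ch) == ch):
            yield cp, [int(h, 16) for h in field.split()]

table = list(pairs())

# Same layout as the extract: based literals with six hex digits.
lines = [
    f"-- 16#{cp:06X}# <- " + " ".join(f"16#{d:06X}#" for d in seq)
    for cp, seq in table
]
print("\n".join(lines[:10]))

# True if every composition pairs exactly two code points.
all_pairs = all(len(seq) == 2 for _, seq in table)
print(all_pairs)
```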

I will have to test my application with some test bench. There are some
test files in the UCD, but I would like to know if there are others
elsewhere.

Do you know of any, dear friends?
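
For the test file in the UCD, NormalizationTest.txt, a checker can be quite small. Below is a Python sketch that verifies one data line against the invariants described in that file's header (five semicolon-separated columns c1 to c5 of hex code points). Here `unicodedata.normalize` merely stands in for the implementation under test, and the sample lines are written by hand in the file's format, not copied from it.

```python
import unicodedata

def parse_field(field):
    """Turn '0041 030A' into the corresponding string."""
    return "".join(chr(int(h, 16)) for h in field.split())

def check_line(line, n=unicodedata.normalize):
    """Check one line of NormalizationTest.txt; n is the normalizer under test."""
    data = line.split("#", 1)[0].strip()
    if not data or data.startswith("@"):
        return True  # comment, blank line, or "@Part" header
    c1, c2, c3, c4, c5 = (parse_field(f) for f in data.split(";")[:5])
    return (
        all(n("NFC", c) == c2 for c in (c1, c2, c3))
        and all(n("NFC", c) == c4 for c in (c4, c5))
        and all(n("NFD", c) == c3 for c in (c1, c2, c3))
        and all(n("NFD", c) == c5 for c in (c4, c5))
        and all(n("NFKC", c) == c4 for c in (c1, c2, c3, c4, c5))
        and all(n("NFKD", c) == c5 for c in (c1, c2, c3, c4, c5))
    )

samples = [
    "212B;00C5;0041 030A;00C5;0041 030A; # angstrom sign",
    "1E0B 0323;1E0D 0307;0064 0323 0307;1E0D 0307;0064 0323 0307;",
    "FB01;FB01;FB01;0066 0069;0066 0069; # fi ligature",
]
results = [check_line(s) for s in samples]
print(results)
```

To test your own code, pass your normalization function as `n` and feed the function every line of the real file.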

Andreas Prilop

Mar 3, 2008, 10:34:58 AM

On Fri, 29 Feb 2008, Jukka K. Korpela wrote:

>> http://www.unics.uni-hannover.de/nhtcapri/combining-marks.html

>
> For example, though I see no difference in rendering (of the decomposed
> form and the precomposed form) for many characters, I see a big difference
> for Z with circumflex.

In many fonts, yes. But try Arial Unicode MS and Tahoma.

--
Solipsists of the world - unite!

Markus Kuhn

Mar 7, 2008, 3:42:09 PM

Helmut Richter <hh...@web.de> writes:
|> Is there simple-to-use software available that does normalizations?

Yes: Perl 5.8.1 or newer comes with a useful module that does
exactly what it says on the tin:

use Unicode::Normalize;

$s = NFD($s); # convert string $s to Normalization Form D

Markus

--
Markus Kuhn, Computer Laboratory, University of Cambridge
http://www.cl.cam.ac.uk/~mgk25/ || CB3 0FD, Great Britain