But what matters today is:
1. U+06C0 should not be used for Persian. This is evident both from
the character notes in the Unicode charts, where Persian is not listed
(while it is listed for every other character that Persian uses), and
also in ISIRI 6219, page 19. For Persian, one should use ARABIC LETTER
HEH, followed by ARABIC HAMZA ABOVE.
2. Names or decompositions of Unicode characters cannot be changed
anymore. So if there is a language that actually needs to use U+06C0,
it should live with the current name and decomposition.
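
A minimal sketch of point 1, using only Python's standard unicodedata
module. The variable and function names are illustrative, and the stem
is the first three letters of the example word خانهٔ. It checks that
U+06C0 is canonically equivalent to AE plus HAMZA ABOVE, and not to the
HEH plus HAMZA ABOVE sequence recommended for Persian.

```python
import unicodedata

HEH = "\u0647"                 # ARABIC LETTER HEH
AE = "\u06D5"                  # ARABIC LETTER AE
HAMZA_ABOVE = "\u0654"         # ARABIC HAMZA ABOVE
HEH_WITH_YEH_ABOVE = "\u06C0"  # ARABIC LETTER HEH WITH YEH ABOVE

stem = "\u062E\u0627\u0646"    # khah, alef, noon

recommended = stem + HEH + HAMZA_ABOVE    # spelling advised for Persian
discouraged = stem + HEH_WITH_YEH_ABOVE   # spelling that uses U+06C0


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# U+06C0 does not match the HEH-based sequence...
same_as_heh = canonically_equivalent(recommended, discouraged)
# ...but it does match the AE-based sequence.
same_as_ae = canonically_equivalent(stem + AE + HAMZA_ABOVE, discouraged)
# The recommended sequence is left untouched by NFC.
nfc_stable = unicodedata.normalize("NFC", recommended) == recommended

print(same_as_heh, same_as_ae, nfc_stable)  # False True True
```

A search or spell-check that normalizes its input will therefore treat
the two spellings as different words unless it maps one to the other
explicitly.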
Roozbeh
Hi
Having distinct characters for consonant Heh and vowel Heh sounds interesting. This is very useful for Persian text processing, especially spell-checking. We've already faced the problem of distinguishing these two 'Heh's in the dictionary of Virastyar. It also helps in encoding Persian corpora and removes the need for an extra tag.
Long history, and it is based on the history of Unicode normalization
mostly. I can tell the story, but it is probably boring.
I'm not sure, but I guess for most combinations of Persian characters and Hamza, there's a pre-composed character or ligature which gives us equal normalized form. But it's not true for this case (ARABIC LETTER HEH + ARABIC HAMZA ABOVE).
The Persian language has AE YE, and this combination occurs more often than the other possible combinations with Hamza.
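
The point about pre-composed forms can be checked with a short sketch
like the one below. It assumes Python's standard unicodedata module, and
the choice of base letters is only illustrative. Each base is followed by
ARABIC HAMZA ABOVE and normalized to NFC.

```python
import unicodedata

HAMZA_ABOVE = "\u0654"  # ARABIC HAMZA ABOVE

# Illustrative selection of base letters (not an exhaustive list).
bases = {
    "ALEF": "\u0627",
    "WAW": "\u0648",
    "YEH": "\u064A",
    "HEH": "\u0647",
    "AE": "\u06D5",
}


def compose(base: str) -> list:
    """Return the code points of NFC(base + HAMZA ABOVE) as U+XXXX labels."""
    composed = unicodedata.normalize("NFC", base + HAMZA_ABOVE)
    return [f"U+{ord(ch):04X}" for ch in composed]


results = {name: compose(ch) for name, ch in bases.items()}
for name, code_points in results.items():
    print(f"{name:5} + HAMZA ABOVE -> {' '.join(code_points)}")
# ALEF  + HAMZA ABOVE -> U+0623
# WAW   + HAMZA ABOVE -> U+0624
# YEH   + HAMZA ABOVE -> U+0626
# HEH   + HAMZA ABOVE -> U+0647 U+0654
# AE    + HAMZA ABOVE -> U+06C0
```

Only the HEH sequence stays as two code points; U+06C0 is produced from
AE, not from HEH.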
Best,
Mehrdad
The story does not involve me or Microsoft at all. Let me give you a
hint: it starts with when Unicode encoded Syriac. Now let's see if you
can recreate the story based on that starting point!
Roozbeh
"Does anybody know why the Unicode character 06C0 (ARABIC LETTER HEH
WITH YEH ABOVE) is a combination of
<06D5 0654> and not <0647 0654>, as its name implies (ARABIC LETTER
HEH WITH YEH ABOVE) ?!
Maybe it should be (ARABIC LETTER AE WITH YEH ABOVE) or something like that."
Here is the real story:
Unicode originally didn't have either a HAMZA ABOVE character, or a
decomposition for characters with a hamza (like ALEF WITH HAMZA
ABOVE). But it did have a HEH WITH YEH ABOVE at U+06C0, which was
named so because it was originally intended to be used for Persian,
Urdu, and other such languages for words like خانهٔ. The original
proposal documents are lost in time, but very probably it comes from
the originators of Unicode (IBM, Xerox, Apple, etc.), or from some of
the various ISO national bodies working on ISO/IEC 10646. I have a
guess that the better name came from ISO people, since in Unicode 1.0
times, it was named ARABIC LETTER HAMZAH ON HA, but its name was
changed to ARABIC LETTER HEH WITH YEH ABOVE when Unicode merged with
ISO/IEC 10646.
When Syriac was being encoded for Unicode 3.0, it was found that
Syriac and Arabic scripts share a lot of harakat, so it was decided
that Syriac users should use the harakat already encoded in the Arabic
block (if you look at the script properties for the Arabic harakat, it
says "Common" instead of "Arabic", meaning it's used for more than one
script).
Now, Syriac also used a hamza above form and a madda above form, both
of which were also needed for Arabic and were also used with other
letters in texts like the Koran. So Unicode 3.0 also encoded a HAMZA
ABOVE, a HAMZA BELOW, and a MADDAH [sic] ABOVE.
At the same time, they figured out that they could now add canonical
decompositions for some Arabic letters. But they did not choose the
best decompositions for some letters:
* For U+0626 YEH WITH HAMZA ABOVE, they decomposed to <YEH, HAMZA
ABOVE> instead of <ALEF MAKSURA, HAMZA ABOVE> because they thought
ALEF MAKSURA is only right-joining (while it is dual-joining, which
they fixed in Unicode 3.0.1). So this meant that now if one uses a
dotted YEH letter and puts a HAMZA ABOVE over it, it loses its dots.
This also meant that later, when a letter was found in the Fulfulde
language, which was a yeh form with two dots below it and a hamza
above it, it needed to be encoded separately at U+08A8 ARABIC LETTER
YEH WITH TWO DOTS BELOW AND HAMZA ABOVE (to be encoded in Unicode
6.1): http://std.dkuug.dk/jtc1/sc2/wg2/docs/n3882.pdf
* For U+0681 ARABIC LETTER HAH WITH HAMZA ABOVE, used in Pashto, they
forgot to decompose the letter. This led to a future misunderstanding
that if a letter is atomic, it should be encoded separately, so later
another precomposed letter with a hamza was encoded for the Ormuri
language at U+076C ARABIC LETTER REH WITH HAMZA ABOVE. Finally,
another such letter was proposed to the Unicode Technical Committee
for a beh with hamza above for the Fulfulde language. I found out about
this and started a discussion that spilled over three meetings of the
committee and led to a tightening of rules for future encoding of
precomposed characters in the standard. The committee originally
postponed the acceptance of the beh with hamza letter, but at the end
we all agreed that we need to treat hamza above, when used as a
consonant modifier (but not an additional sound, or a hamza or
hamza-like sound) just like another dot pattern diacritic and encode
such letters in the future. So U+08A1 ARABIC LETTER BEH WITH HAMZA
ABOVE was finally accepted for "a future version" of the Unicode
Standard, which should be the version that will come after 6.1 (either
6.2 or 7.0, I still don't know).
* For U+06C0 ARABIC LETTER HEH WITH YEH ABOVE, they also went for the same
joining behavior instead of better semantics or ease of display. They
had HEH WITH YEH ABOVE as right-joining, and they had quite a few
heh-forms to decompose it to. They went for AE, since it was also
right-joining, instead of the dual-joining HEH form. Now AE was
intended for the Turkic languages Uighur, Kazakh, and Kirghiz, so the
semantics of HEH WITH YEH ABOVE and its original intended use was lost
with this decomposition. Unicode later found out about this and removed
"Persian" from the character's listed languages in Unicode 4.0. The
reason it did not remove Urdu was that no one requested it to: the
Pakistani government was involved with the Unicode Consortium at the
time, and everyone thought that
they would know better which characters they may want to use or not.
(Whenever I find some time, I will write a proposal and ask for "Urdu"
to be removed from the comment too, and add some more clarification to
the standard about avoiding the use of the character for any language
other than languages that also use AE.)
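
The decompositions discussed in the list above can be inspected with a
short sketch like this one. It assumes Python's standard unicodedata
module, and the describe helper is a made-up name for illustration. The
result depends on the Unicode version bundled with the interpreter
(unicodedata.unidata_version), so newer characters fall back to a
placeholder name on older builds.

```python
import unicodedata

# Characters mentioned in the list above.
code_points = [0x0626, 0x0681, 0x076C, 0x08A1, 0x08A8, 0x06C0]


def describe(cp: int) -> tuple:
    """Return (name, canonical decomposition) for one code point."""
    ch = chr(cp)
    name = unicodedata.name(ch, "<not in this Unicode version>")
    # decomposition() returns an empty string for atomic characters.
    decomposition = unicodedata.decomposition(ch) or "(none)"
    return name, decomposition


table = {cp: describe(cp) for cp in code_points}

print("Unicode data version:", unicodedata.unidata_version)
for cp, (name, decomposition) in table.items():
    print(f"U+{cp:04X}  {decomposition:10}  {name}")
```

U+0626 reports 064A 0654 and U+06C0 reports 06D5 0654, while U+0681 and
U+076C report no decomposition, which matches the account given above.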
Now why can't we just change the name of the character? Because
Unicode promised to never change a character name once it gets
encoded. This has been in effect since Unicode 2.0.
Now why can't we just update those decompositions? Because Unicode
promised to never change a decomposition once a character gets
encoded. This was done in Unicode 3.1, very shortly after those hamza
decompositions were put in the standard. So before there was a chance
to update them, they were frozen forever.
For more on Unicode stability policies, see
http://www.unicode.org/policies/stability_policy.html
In all of that, the only part that is related to me is that after I
figured out U+06C0 is now made unusable for Persian (because of
decomposition issues), I requested the Unicode Consortium to remove
"Persian" from its listed languages (which they did) and also wrote
the section about its being forbidden in ISIRI 6219 and got it
confirmed by the various committees.
But back to your story, I would say some things happened very
differently from how you think they did. But I agree that yours is much
more interesting :)
Roozbeh
Urdu has the ezafeh too, with very similar sounds and shapes.
Roozbeh
[...] Unicode merged with ISO/IEC 10646 which led to various character
name changes. My best guess is HEH WITH YEH ABOVE already existed in
one of the various ISO standards for transcription or an extended
Arabic bibliography character set, and since ISO had a policy of
mapping characters in its various character sets by using the same
name for them, ISO/IEC 10646 and then Unicode inherited the HEH WITH
YEH ABOVE name.
Roozbeh