"ARABIC LETTER HEH WITH YEH ABOVE"


Mehrdad Senobari

Jun 13, 2011, 3:30:43 PM
to Persian Computing
Hi,

Does anybody know why the Unicode character 06C0 (ARABIC LETTER HEH WITH YEH ABOVE) is a combination of
<06D5 0654> and not <0647 0654>, as its name implies (ARABIC LETTER HEH WITH YEH ABOVE) ?!

Maybe it should be (ARABIC LETTER AE WITH YEH ABOVE) or something like that.

We all blame Microsoft for the presence of this character in Persian texts; however, that does not change the comparison result of normalized همهٔ and همۀ ;-)
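
A minimal sketch of the comparison mentioned above, assuming Python's standard `unicodedata` module (whose data depends on the Unicode version bundled with the interpreter). The helper name `same_after_normalization` and the variable names are made up for illustration; the two strings spell the same word once with HEH + HAMZA ABOVE and once with U+06C0.

```python
import unicodedata

HEH_WITH_YEH_ABOVE = "\u06C0"

# Name and canonical decomposition as recorded in the Unicode Character Database.
name = unicodedata.name(HEH_WITH_YEH_ABOVE)
decomposition = unicodedata.decomposition(HEH_WITH_YEH_ABOVE)

# The same word typed two ways.
recommended = "\u0647\u0645\u0647\u0654"  # HEH, MEEM, HEH, HAMZA ABOVE
precomposed = "\u0647\u0645\u06C0"        # HEH, MEEM, U+06C0


def same_after_normalization(a, b, form="NFC"):
    """Hypothetical helper: compare two strings after normalizing both."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(name)
print(decomposition)
print(same_after_normalization(recommended, precomposed, "NFC"))
print(same_after_normalization(recommended, precomposed, "NFD"))
```

Because U+06C0 decomposes to AE + HAMZA ABOVE rather than HEH + HAMZA ABOVE, neither NFC nor NFD should make the two spellings compare equal.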


Bests,
Mehrdad Senobari

Behnam

Jun 13, 2011, 5:57:58 PM
to Mehrdad Senobari, Persian Computing
Maybe because it's an AE. We do have AE in Persian but we don't encode it. A more interesting question is why it's called 'Yeh above' but is encoded as Hamza above!
IMO we cannot get any clarification on this without establishing an orthographic rule for defining and encoding Ézaafé. We have consonant Heh and vowel Heh (AE), and Ézaafé is defined based on whether the character is a vowel or a consonant.
-b

Roozbeh Pournader

Jun 13, 2011, 7:14:08 PM
to Mehrdad Senobari, Persian Computing
Long history, and it is based on the history of Unicode normalization
mostly. I can tell the story, but it is probably boring.

But what matters today is:

1. U+06C0 should not be used for Persian. This is evident both from
the character notes in the Unicode charts, where Persian is not listed
(while it is listed for every other character that Persian uses), and
also in ISIRI 6219, page 19. For Persian, one should use ARABIC LETTER
HEH, followed by ARABIC HAMZA ABOVE.

2. Names or decompositions of Unicode characters cannot be changed
anymore. So if there is a language that actually needs to use U+06C0,
it should live with the current name and decomposition.
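
As a rough illustration of point 1, here is a sketch of a cleanup pass for Persian text. The function name `fix_persian_heh_hamza` is hypothetical, and the sketch assumes the input is Persian only and contains no intentional AE (U+06D5) followed by HAMZA ABOVE; it is not taken from ISIRI 6219 or any existing tool.

```python
HEH = "\u0647"
HAMZA_ABOVE = "\u0654"
HEH_WITH_YEH_ABOVE = "\u06C0"
AE = "\u06D5"


def fix_persian_heh_hamza(text):
    """Hypothetical helper: rewrite U+06C0 as HEH + HAMZA ABOVE.

    Assumption: the text is Persian, so neither U+06C0 nor its canonical
    decomposition <U+06D5, U+0654> is wanted.
    """
    replacement = HEH + HAMZA_ABOVE
    # Handle the precomposed character first, then its decomposed form,
    # which NFD-normalized input would contain.
    text = text.replace(HEH_WITH_YEH_ABOVE, replacement)
    return text.replace(AE + HAMZA_ABOVE, replacement)


print(fix_persian_heh_hamza("\u0647\u0645\u06C0") == "\u0647\u0645\u0647\u0654")
```

The replacement is done with plain string substitution rather than a normalization call, so the rest of the text is left exactly as it was typed.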

Roozbeh

> --
> http://groups.google.com/group/persian-computing

Connie Bobroff

Jun 13, 2011, 8:01:47 PM
to Roozbeh Pournader, Mehrdad Senobari, Persian Computing
On Mon, Jun 13, 2011 at 4:14 PM, Roozbeh Pournader <roo...@gmail.com> wrote:
>> I can tell the story, but it is probably boring.
So where does this NOT-SO-BORING story begin? It was a dark and stormy night. You were in high school, frustrated about your handwriting and desperately seeking a digital solution. (Hooman was where??) Suddenly, Microsoft Windows (95??) makes its mysterious debut in Tehran even before the official launch in Redmond, and one key can type what visually appears to be a heh with a little hamza above it. However, that has been cleverly mapped (by whom??) to the ة key and you need to have a special hack font. Then.....?

Mehrdad Senobari

Jun 14, 2011, 4:02:53 AM
to Roozbeh Pournader, beh...@me.com, Persian Computing

Hi


Having distinct characters for consonant Heh and vowel Heh sounds interesting. This is very useful for Persian text processing, especially spell-checking. We've already faced the problem of distinguishing these two 'Heh's in the dictionary of Virastyar. It also helps in encoding Persian corpora and removes the need for an extra tag.

 

On Tue, Jun 14, 2011 at 03:44, Roozbeh Pournader <roo...@gmail.com> wrote:
> Long history, and it is based on the history of Unicode normalization
> mostly. I can tell the story, but it is probably boring.
>
> But what matters today is:
>
> 1. U+06C0 should not be used for Persian. This is evident both from
> the character notes in the Unicode charts, where Persian is not listed
> (while it is listed for every other character that Persian uses), and
> also in ISIRI 6219, page 19. For Persian, one should use ARABIC LETTER
> HEH, followed by ARABIC HAMZA ABOVE.
>
> 2. Names or decompositions of Unicode characters cannot be changed
> anymore. So if there is a language that actually needs to use U+06C0,
> it should live with the current name and decomposition.
>
> Roozbeh


I'm not sure, but I guess that for most combinations of Persian characters and Hamza, there's a pre-composed character or ligature that yields an equal normalized form. But that's not true for this case (ARABIC LETTER HEH + ARABIC HAMZA ABOVE).

The Persian language has AE YE, and this combination occurs more often than the other possible combinations of Hamza.
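
The guess above can be probed with a small sketch, again assuming Python's standard `unicodedata` module. The helper `composes_with_hamza` and the chosen set of base letters are illustrative only; the check simply asks whether NFC merges a base letter plus HAMZA ABOVE into a single code point.

```python
import unicodedata

HAMZA_ABOVE = "\u0654"


def composes_with_hamza(base):
    """Hypothetical helper: True if NFC merges base + HAMZA ABOVE into one code point."""
    return len(unicodedata.normalize("NFC", base + HAMZA_ABOVE)) == 1


bases = {
    "ALEF": "\u0627",
    "WAW": "\u0648",
    "YEH": "\u064A",
    "HEH": "\u0647",
    "AE": "\u06D5",
}

results = {label: composes_with_hamza(ch) for label, ch in bases.items()}
for label, composed in results.items():
    print(f"{label:4} + HAMZA ABOVE composes under NFC: {composed}")
```

With the standard decompositions, ALEF, WAW, YEH and AE each have a precomposed counterpart, while HEH + HAMZA ABOVE stays a two-code-point sequence.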

 

Bests,

Mehrdad



 

Connie Bobroff

Jun 14, 2011, 3:06:53 PM
to Mehrdad Senobari, Roozbeh Pournader, beh...@me.com, Persian Computing
Was the Moin dictionary the only print dictionary to make the distinction?
 
In any case, this will become even more interesting for tagging purposes as we now hear things like:
ta-ye kelaas (standard: tah-e kelaas)
sizda-ye farvardin (standard: sizdah-e farvardin)
 
Unfortunately, I can't think of any other examples.

 

Roozbeh Pournader

Jun 14, 2011, 5:38:23 PM
to Connie Bobroff, Mehrdad Senobari, Persian Computing
Far from that.

The story does not involve me or Microsoft at all. Let me give you a
hint: it starts with when Unicode encoded Syriac. Now let's see if you
can recreate the story based on that starting point!

Roozbeh

Connie Bobroff

Jun 14, 2011, 6:13:04 PM
to Roozbeh Pournader, Mehrdad Senobari, Persian Computing
On Tue, Jun 14, 2011 at 2:38 PM, Roozbeh Pournader <roo...@gmail.com> wrote:
> The story does not involve me or Microsoft at all. Let me give you a
> hint: it starts with when Unicode encoded Syriac. Now let's see if you
> can recreate the story based on that starting point!
 
Come, come now, your role was on center stage in this. But let us start at the VERY beginning, shall we?
First, we need to take you back to grade school where you got exposed to the new textbooks which--evidently without the approval of the Farhangestan--taught you to write
ه‌ی
instead of
هٔ
I think Behdad is the only one to have commented about the textbooks
but we still don't know who exactly was behind this and during which school year it started.
The publishing industry (the people typesetting books) continued to prefer
هٔ
When that first Windows computer arrived, somehow a need was felt for
هٔ
which caused them to go to the trouble of coming up with the rather ingenious tah marbuta
ة
solution which I personally have been appreciating more and more of late, despite the headaches it has brought since Unicode burst on the scene.
Were the geeks who hacked that first Windows from the older generation who had not grown up on your new textbooks or were people from the publishing industry telling them they could not live without
هٔ
and demanding it? Maybe Sinasoft was providing the fonts?
 
So that is a start and you can now please fill in or correct whatever I got wrong. (I felt a need to put Persian characters on a separate line as there is still a PC / Mac problem with this in email.)
 

Behnam

Jun 14, 2011, 7:47:32 PM
to Connie Bobroff, Mehrdad Senobari, Roozbeh Pournader, Persian Computing
I don't know how important it can be to have different encodings for vowel and consonant Heh. To me, what is far more important is what comes after them in relating one word to the next, depending on the first word's last character. If the first word ends with a consonant, it takes Kasreh. If it ends with a vowel (Heh or otherwise), it takes Yeh.
In other words, neither that Kasreh is Kasreh nor this Yeh is Yeh (or hamza above). They are both different forms of a single entity called Ezafe (I call it Payvand) particular and essential and perhaps unique to Persian language.
In your example, focus on what you are putting in between tah and kelaas. They are neither Yeh nor Kasre. They are both 'Payvand'.
-b

Roozbeh Pournader

Jun 14, 2011, 7:47:38 PM
to Connie Bobroff, Mehrdad Senobari, Persian Computing
Now you're forcing my hand. The question raised was something else.
Here is the original question:

"Does anybody know why the Unicode character 06C0 (ARABIC LETTER HEH
WITH YEH ABOVE) is a combination of
<06D5 0654> and not <0647 0654>, as its name implies (ARABIC LETTER
HEH WITH YEH ABOVE) ?!

Maybe it should be (ARABIC LETTER AE WITH YEH ABOVE) or something like that."

Here is the real story:

Unicode originally didn't have either a HAMZA ABOVE character, or a
decomposition for characters with a hamza (like ALEF WITH HAMZA
ABOVE). But it did have a HEH WITH YEH ABOVE at U+06C0, which was
named so because it was originally intended to be used for Persian,
Urdu, and other such languages for words like خانهٔ. The original
proposal documents are lost in time, but very probably it comes from
the originators of Unicode (IBM, Xerox, Apple, etc.), or from some of
the various ISO national bodies working on ISO/IEC 10646. I have a
guess that the better name came from ISO people, since in Unicode 1.0
times, it was named ARABIC LETTER HAMZAH ON HA, but its name was
changed to ARABIC LETTER HEH WITH YEH ABOVE when

When Syriac was being encoded for Unicode 3.0, it was found out that
Syriac and Arabic scripts share a lot of harakat, so it was decided
that Syriac users should use the harakat already encoded in the Arabic
block (if you look at the script properties for the Arabic harakat, it
says "Common" instead of "Arabic", meaning it's used for more than one
script).

Now, Syriac also used a hamza above form and a madda above form, both
of which were also needed for Arabic and were also used with other
letters in texts like the Koran. So Unicode 3.0 also encoded a HAMZA
ABOVE, a HAMZA BELOW, and a MADDAH [sic] ABOVE.

At the same time, they figured out that they could now add canonical
decompositions for some Arabic letters. But they did not choose the
best decompositions for some letters:

* For U+0626 YEH WITH HAMZA ABOVE, they decomposed to <YEH, HAMZA
ABOVE> instead of <ALEF MAKSURA, HAMZA ABOVE> because they thought
ALEF MAKSURA is only right-joining (while it is dual-joining, which
they fixed in Unicode 3.0.1). So this meant that now if one uses a
dotted YEH letter and puts a HAMZA ABOVE over it, it loses its dots.
This also meant that later, when a letter was found in the Fulfulde
language, which was a yeh form with two dots below it and a hamza
above it, it needed to be encoded separately at U+08A8 ARABIC LETTER
YEH WITH TWO DOTS BELOW AND HAMZA ABOVE (to be encoded in Unicode
6.1): http://std.dkuug.dk/jtc1/sc2/wg2/docs/n3882.pdf

* For U+0681 ARABIC LETTER HAH WITH HAMZA ABOVE, used in Pashto, they
forgot to decompose the letter. This led to a future misunderstanding
that if a letter is atomic, it should be encoded separately, so later
another precomposed letter with a hamza was encoded for the Ormuri
language at U+076C ARABIC LETTER REH WITH HAMZA ABOVE. Finally,
another such letter was proposed to the Unicode Technical Committee
for a beh with hamza above for the Fulfulde language. I found out about
this and started a discussion that spilled over three meetings of the
committee and led to a tightening of rules for future encoding of
precomposed characters in the standard. The committee originally
postponed the acceptance of the beh with hamza letter, but at the end
we all agreed that we need to treat hamza above, when used as a
consonant modifier (but not an additional sound, or a hamza or
hamza-like sound) just like another dot pattern diacritic and encode
such letters in the future. So U+08A1 ARABIC LETTER BEH WITH HAMZA
ABOVE was finally accepted for "a future version" of the Unicode
Standard, which should be the version that will come after 6.1 (either
6.2 or 7.0, I still don't know).

* For U+06C0 ARABIC LETTER HEH WITH YEH ABOVE, they also went for the same
joining behavior instead of better semantics or ease of display. They
had HEH WITH YEH ABOVE as right-joining, and they had quite a few
heh-forms to decompose it to. They went for AE, since it was also
right-joining, instead of the dual-joining HEH form. Now AE was
intended for the Turkic languages Uighur, Kazakh, and Kirghiz, so the
semantics of HEH WITH YEH ABOVE and its original intended use was lost
with this decomposition. Unicode later found out about this and removed
"Persian" from Unicode 4.0. The reason it did not remove Urdu was
that no one requested it to: the Pakistani government was involved
at the moment with the Unicode Consortium and everyone thought that
they would know better which characters they may want to use or not.
(Whenever I find some time, I will write a proposal and ask for "Urdu"
to be removed from the comment too, and add some more clarification to
the standard about avoiding the use of the character for any language
other than languages that also use AE.)
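
The decompositions discussed in the three bullets above can be inspected with a short sketch, assuming Python's standard `unicodedata` module. The function name `decomposition_report` is made up for illustration. Joining types are not exposed by `unicodedata`, so only names and canonical decompositions are shown.

```python
import unicodedata


def decomposition_report(codepoints):
    """Hypothetical helper: map 'U+XXXX' to (character name, decomposition or None)."""
    report = {}
    for cp in codepoints:
        ch = chr(cp)
        # unicodedata.decomposition returns "" when a character has no decomposition.
        report[f"U+{cp:04X}"] = (unicodedata.name(ch), unicodedata.decomposition(ch) or None)
    return report


report = decomposition_report([0x0626, 0x0681, 0x076C, 0x06C0])
for code, (char_name, decomp) in report.items():
    print(code, char_name, "->", decomp or "(no decomposition)")
```

U+0626 should map to YEH + HAMZA ABOVE, U+0681 and U+076C should report no decomposition, and U+06C0 should map to AE + HAMZA ABOVE.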

Now why can't we just change the name of the character? Because
Unicode promised to never change a character name once it gets
encoded. This has been in effect since Unicode 2.0.

Now why can't we just update those decompositions? Because Unicode
promised to never change a decomposition once a character gets
encoded. This was done in Unicode 3.1, very shortly after those hamza
decompositions were put in the standard. So before there was a chance
to update them, they were frozen forever.

For more on Unicode stability policies, see
http://www.unicode.org/policies/stability_policy.html

In all of that, the only part that is related to me, is that after I
figured out U+06C0 is now made unusable for Persian (because of
decomposition issues), I requested the Unicode Consortium to remove
"Persian" from its listed languages (which they did) and also wrote
the section about its being forbidden in ISIRI 6219 and got it
confirmed by the various committees.

But back to your story, I would say some things happened very
differently from how you think they did. But I agree that yours is much
more interesting :)

Roozbeh

Roozbeh Pournader

Jun 14, 2011, 7:50:55 PM
to Behnam, Connie Bobroff, Mehrdad Senobari, Persian Computing
On Tue, Jun 14, 2011 at 4:47 PM, Behnam <beh...@me.com> wrote:
> ... perhaps unique to Persian language.

Urdu has the ezafeh too, with very similar sounds and shapes.

Roozbeh

Roozbeh Pournader

Jun 14, 2011, 7:55:48 PM
to Connie Bobroff, Mehrdad Senobari, Persian Computing
On Tue, Jun 14, 2011 at 4:47 PM, Roozbeh Pournader <roo...@gmail.com> wrote:
> Unicode originally didn't have either a HAMZA ABOVE character, or a
> decomposition for characters with a hamza (like ALEF WITH HAMZA
> ABOVE). But it did have a HEH WITH YEH ABOVE at U+06C0, which was
> named so because it was originally intended to be used for Persian,
> Urdu, and other such languages for words like خانهٔ. The original
> proposal documents are lost in time, but very probably it comes from
> the originators of Unicode (IBM, Xerox, Apple, etc.), or from some of
> the various ISO national bodies working on ISO/IEC 10646. I have a
> guess that the better name came from ISO people, since in Unicode 1.0
> times, it was named ARABIC LETTER HAMZAH ON HA, but its name was
> changed to ARABIC LETTER HEH WITH YEH ABOVE when

[...] Unicode merged with ISO/IEC 10646 which led to various character
name changes. My best guess is HEH WITH YEH ABOVE already existed in
one of the various ISO standards for transcription or an extended
Arabic bibliography character set, and since ISO had a policy of
mapping characters in its various character sets by using the same
name for them, ISO/IEC 10646 and then Unicode inherited the HEH WITH
YEH ABOVE name.

Roozbeh

Behnam

Jun 14, 2011, 8:16:18 PM
to Roozbeh Pournader, Connie Bobroff, Mehrdad Senobari, Persian Computing
Yes. Makes sense.
-b


Mehrdad Senobari

Jun 15, 2011, 2:42:26 AM
to Roozbeh Pournader, Connie Bobroff, Persian Computing
Thanks for sharing the story with us. It's not boring at all.