Usage of ZWNJ/ZWJ in Tamil Script.

Srikanth Lakshmanan

Jun 1, 2012, 11:32:32 AM
to freetamil...@googlegroups.com
Hello all,

I would like to know the guidelines on the usage of ZWNJ/ZWJ in Tamil. According to one of the Unicode reports [2], only one specific case is allowed:
"Tamil only uses a Join_Control character in one specific case, most of the sequences these rules allow in Tamil are, in fact, visually confusable"

I am guessing that க்‌ஷ versus க்ஷ (and compound forms of it) is the case Unicode mentions, but I would like to know for sure from someone. I ran a script to detect titles with ZWNJ on Tamil wiki projects and got the result [2]. I have been trying to clean up and delete the ones with ZWNJ in unwanted places. Apart from ksh, I am in doubt regarding அல்ஃ‌போன்சா, where ZWNJ was used in ஃ‌போ to make it a single character.
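
A minimal sketch of the kind of check described above, not the script that was actually run. The function name classify_zwnj is hypothetical, and the only "allowed" context it knows is the ksh case (ZWNJ between க் and ஷ); anything else is simply labelled "other" for manual review.

```python
ZWNJ = "\u200c"                                  # ZERO WIDTH NON-JOINER
KA, PULLI, SSA = "\u0b95", "\u0bcd", "\u0bb7"    # க, ் (pulli), ஷ


def classify_zwnj(title):
    """Return (index, kind) for every ZWNJ found in title.

    kind is "ksh" when the ZWNJ sits between க் and ஷ, otherwise "other".
    """
    found = []
    for i, ch in enumerate(title):
        if ch != ZWNJ:
            continue
        before = title[max(0, i - 2):i]   # the two code points before the ZWNJ
        after = title[i + 1:i + 2]        # the one code point after it
        kind = "ksh" if before == KA + PULLI and after == SSA else "other"
        found.append((i, kind))
    return found
```

Run over a list of page titles, this would separate the split-ksh titles from cases such as அல்ஃ‌போன்சா, which come back as "other".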

After cleaning up titles, it is planned to remove ZWNJ across Wikipedia, since it impacts search (Lucene splits the text into multiple words). Should we also file or fix bugs against existing input method tools to restrict the use of ZWNJ to the one case, ksh?
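
As an illustration of the proposed cleanup, here is a small sketch that removes every ZWNJ except one standing between க் and ஷ. It assumes that the ksh case really is the only one worth keeping, which is exactly the open question in this thread; the name strip_unwanted_zwnj is hypothetical.

```python
import re

# Match a ZWNJ (U+200C) that is NOT preceded by க் (U+0B95 U+0BCD),
# or a ZWNJ that is NOT followed by ஷ (U+0BB7). Only a ZWNJ that has
# both neighbours escapes the pattern and is kept.
_UNWANTED_ZWNJ = re.compile("(?<!\u0b95\u0bcd)\u200c|\u200c(?!\u0bb7)")


def strip_unwanted_zwnj(text):
    """Remove ZWNJ everywhere except in the split form of ksh."""
    return _UNWANTED_ZWNJ.sub("", text)
```

If further contexts turn out to be legitimate, they would have to be added to the pattern before running anything like this in bulk.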

Please share your thoughts. Thank you.


--
Regards
Srikanth.L

கா. சேது | කා. සේතු | K. Sethu

Jun 5, 2012, 6:31:10 AM
to freetamil...@googlegroups.com
Between the split form க்‌ஷ and the conjunct form க்ஷ, it is the
split form that the majority of Tamils use, for example: லக்‌ஷ்மி, ரிக்‌ஷா,
ருக்‌ஷான், அக்‌ஷயன்

In fact, the keyboard standards of the governments of Tamil Nadu and
Sri Lanka require that the key sequence for க் followed by the key for ஷ
should yield the non-conjunct க்‌ஷ, using ZWNJ in the mapping. For the conjunct
க்ஷ a separate key is assigned.

For example, in Tamil99 (as per the TN extended standards of 2010):
key T --> க்ஷ
key sequence hfW --> க்‌ஷ
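
The rule can be sketched as a toy key composer. This is not the ekalappai or Tamil99 implementation: the four-key table is only the fragment implied by the example above (the individual assignments h, f and W are inferred from the sequence hfW), and type_keys is a hypothetical name.

```python
ZWNJ = "\u200c"
KA, PULLI, SSA = "\u0b95", "\u0bcd", "\u0bb7"    # க, ் (pulli), ஷ

# Hypothetical fragment of a key map, inferred from the example above.
KEYS = {
    "h": KA,
    "f": PULLI,
    "W": SSA,
    "T": KA + PULLI + SSA,   # dedicated key for the conjunct form
}


def type_keys(keys):
    """Turn a string of key names into output text."""
    out = ""
    for key in keys:
        chars = KEYS[key]
        # Split form by default: insert ZWNJ when ஷ directly follows க்.
        if chars == SSA and out.endswith(KA + PULLI):
            out += ZWNJ
        out += chars
    return out
```

With this table, type_keys("hfW") produces the sequence with ZWNJ (split form) and type_keys("T") produces the bare sequence (conjunct form).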

In the ekalappai 3 series keyboards to which I contributed my share of
efforts - tamil99, phonetic, bamini and typewriter (I didn't touch
Inscript) - we follow the same principle: by default, the sequence for க்
followed by that for ஷ yields the split form (with ZWNJ inserted by the
keymap), and a separate key gives the conjunct form!

In almost all applications ZWNJ does the job it is meant for in
splitting, but one exception I have found is OpenOffice or LibreOffice
on the Linux platform, which uses ICU for text layout. I guess that
the layout engine does not include whatever would make ZWNJ work in
the Tamil range, and thus text with ZWNJ appears in the intended split
form only with fonts that have support for ZWNJ, for example Lohit
Tamil and Sri Lanka's standard fonts (Sri Tamil and the Chemmozhi
fonts). With most other fonts only the conjunct form appears.

Another place where ZWNJ may be found is in preventing ஸ் followed by
ரீ from forming the Shrii ligature. Although from Unicode 4.1 or so
the definition of this ligature was changed from ஸ் followed by ரீ to
ஶ் followed by ரீ, a significant number of fonts have not yet been
modified to stop joining ஸ் followed by ரீ into this ligature. So for
those who need ரீ following ஸ், the use of ZWNJ is an option (though a
poor one, because with fonts that use only the current definition the
ZWNJ could be seen as an unwanted space!).

Examples of the need for ரீ following ஸ் to remain split:

1. Names like Nasreen ( தஸ்லீமா நஸ்ரீன்) - (common among Muslim
communities). ஸ் followed by ரீ in நஸ்ரீன் could well appear as the
ligature Shrii for you if for rendering this you are using a legacy
font which still retains the old definition.

2. A significant section of Sri Lankan Tamils, particularly in the northern
parts, transliterate "ta" in English and other European languages as
"ர". So for transliterating Steve they start with ஸ் and follow with
ரீவ் --> ஸ்ரீவ் - this also could well appear as the ligature Shrii
followed by வ் to you if you use a font that still retains old
definition.
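
A short sketch of how such words could be found and, as the workaround described above, split with ZWNJ. The helper names are hypothetical, and whether inserting ZWNJ is desirable at all depends on the fonts in use, as noted above.

```python
ZWNJ = "\u200c"
SA, SHA = "\u0bb8", "\u0bb6"                     # ஸ, ஶ
PULLI, RA, II = "\u0bcd", "\u0bb0", "\u0bc0"     # ், ர, ீ

OLD_SHRII = SA + PULLI + RA + II    # ஸ் + ரீ, joined by fonts with the old definition
NEW_SHRII = SHA + PULLI + RA + II   # ஶ் + ரீ, the current definition


def has_sa_rii(word):
    """True if the word contains ஸ் directly followed by ரீ."""
    return OLD_SHRII in word


def split_sa_rii(word):
    """Insert ZWNJ after ஸ் so that ஸ் and ரீ are asked not to join."""
    return word.replace(OLD_SHRII, SA + PULLI + ZWNJ + RA + II)
```

Applied to நஸ்ரீன், has_sa_rii returns True and split_sa_rii returns the same word with a ZWNJ after ஸ்.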

~சேது

Srikanth Lakshmanan

Jun 5, 2012, 8:07:42 AM
to freetamil...@googlegroups.com
Thanks Sethu for your detailed mail.

On Tue, Jun 5, 2012 at 4:01 PM, கா. சேது | කා. සේතු | K. Sethu <skh...@gmail.com> wrote:
> Between the split form க்‌ஷ and the conjunct form க்ஷ , it is the
> split form which is what majority of Tamils use. லக்‌ஷ்மி, ரிக்‌ஷா,
> ருக்‌ஷான், அக்‌ஷயன்
>
> In fact the keyboards standards of the governments of Tamil Nadu and
> Sri Lanka require that the key-sequence for க் followed by key for ஷ
> should yield non-conjunct க்‌ஷ using ZWNJ in the mapping. For conjunct
> க்ஷ a separate key is assigned.
>
> For example in Tamil99 (as per TN extended standards in 2010) :
> key T --> க்ஷ
> key sequence hfW -> க்‌ஷ

Thank you for clarifying and also for citing the Tamil99 2010 standard (which I couldn't find online). Can you please share it if you have it? It will be useful in verifying whether the existing map on Wikipedia follows the standard.
 
> In almost all applications ZWNJ does the job it is meant for in
> splitting but I have found an exception still is Open Office or Libre
> Office in Linux platform which uses icu for rendering text layout. I
> guess that layout does not include something which would make ZWNJ
> work in Tamil range and thus the split form with ZWNJ appear as the
> intended split form only with the fonts that have support for ZWNJ,
> for example: Lohit Tamil and Sri Lanka's standard fonts (Sri Tamil and
> the Chemmozhi fonts). But with most other fonts conjunct form only
> appears.

I saw the same behavior in some more places as well (across browsers and operating systems):
1. The Gmail inbox view of this mail, where the first line of your reply has both the conjunct and split forms, displayed only the conjunct form. [1]
2. Navpops (the AJAX-based preview popups on Wikipedia) also showed the same behavior. [2]

But in the full-page view of the mail or wiki page, both forms are shown. Maybe it has something to do with JavaScript?

> Another place where the use of ZWNJ may be found is for not allowing
> ஸ் followed by ரீ to form the Shrii ligature.

So ideally this should be fixed at the font level to reflect the Unicode 4.1+ definition.
 
> Example for the need of ரீ following ஸ் remaining split :
>
> 1. Names like Nasreen ( தஸ்லீமா நஸ்ரீன்) - (common among Muslim
> communities). ஸ் followed by ரீ in நஸ்ரீன் could well appear as the
> ligature Shrii for you if for rendering this you are using a legacy
> font which still retains the old definition.
>
> 2. Significant section of Sri Lankan Tamils, particularly in northern
> parts, transliterate "ta" in English and other European languages as
> "ர". So for transliterating Steve they start with ஸ் and follow with
> ரீவ் --> ஸ்ரீவ் - this also could well appear as the ligature Shrii
> followed by வ் to you if you use a font that still retains old
> definition.

நஸ்ரீன் is a compelling case if you ask me. Have you implemented any rules for this in any of the layouts, since we are still stuck with old fonts (at least "free" fonts)?

Thanks again.


--
Regards
Srikanth.L

கா. சேது | කා. සේතු | K. Sethu

Jun 6, 2012, 12:33:22 AM
to freetamil...@googlegroups.com
On Tue, Jun 5, 2012 at 5:37 PM, Srikanth Lakshmanan <srik...@gmail.com> wrote:
> Thanks Sethu for your detailed mail.
>
> On Tue, Jun 5, 2012 at 4:01 PM, கா. சேது | කා. සේතු | K. Sethu
> <skh...@gmail.com> wrote:
>>
[..]
>
> Thank you for clarifying and also citing Tamil99 2010 standard (which I
> couldn't find online). Can you please share if you have it. It will be
> useful in verifying if the existing map on Wikipedia is as per standard.
>

i) G.O.M(s) 29 (Information Technology, dated 23-June-2010) :
http://tamilvu.org/coresite/download/Tamil_Unicode_G.O.zip
(or http://www.tn.gov.in/gosdb/gorders/it/it_e_29_2010.pdf )

The above G.O. of the TN government specifies the character standards
Unicode and TACE for Tamil, with dos and don'ts guidelines for valid
characters, fonts and keymaps. It provides the layout and key sequences
for Tamil99, but for the Typewriter keymap only the layout.

However, the fuller sequence table for Typewriter, in addition to
everything else in G.O.M(s) 29, is included in the following tender-call
document.

ii) Tamil Virtual Academy 'TENDER DOCUMENT for Development of Tamil
Fonts and Tamil Keyboard driver for 16-bit encodings (Unicode and
TACE16)' [Tender Ref. TVA/SW/2010-11] :
http://tamilvu.org/coresite/download/Teder_Document_for_Tamil_fonts_and_kbd_driver.pdf

Will write more later.

~Sethu

Srikanth Lakshmanan

Jun 6, 2012, 1:43:21 AM
to freetamil...@googlegroups.com
On Wed, Jun 6, 2012 at 10:03 AM, கா. சேது | කා. සේතු | K. Sethu <skh...@gmail.com> wrote:
> i) G.O.M(s) 29 (Information Technology, dated 23-June-2010) :
> http://tamilvu.org/coresite/download/Tamil_Unicode_G.O.zip
> (or http://www.tn.gov.in/gosdb/gorders/it/it_e_29_2010.pdf )
>
> The above GO of TN gov specifies character standards Unicode and TACE
> for Tamil with dos and donts guidelines for valid characters fonts and
> keymaps. It provided the layout and key sequences for Tamil99 but for
> the Typewriter keymap only the layout.

Thank you. A quick skim led me to file this bug [1] against Lohit-Tamil. Looking forward to hearing more from you.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=829143

--
Regards
Srikanth.L