Tamil99, Tamil-Typewriter keyboard extensions for Unicode and TACE as per G.O.Ms 29 of TN Gov - some bugs and issues

கா. சேது | කා. සේතු | K. Sethu

Nov 7, 2010, 1:18:38 AM

to freetamil...@googlegroups.com

Friends

About 10 days back (28-Oct), in a posting in Tamil, I wrote about my
intention to write soon about the extensions to Tamil99 in
TN Gov. G.O.M(s) 29, and about what I consider to be mistakes in the
documents attached to G.O.Ms 29 and to TVA Tender TVA/SW/2010-11.
That post in Tamil is at:
http://groups.google.com/group/freetamilcomputing/browse_frm/thread/b1d9c2c9fa910fec

I will write all that I intended in a series of email postings
(likely to be 4 parts) to this forum in English. My aim behind this
exercise is to bring to the attention of makers of keymaps and
encoding converters, other technical persons and also end users the
changes in the extended keyboards (both Tamil99 and Tamil-Typewriter)
and, more importantly, the bugs I have found so far in the standard
that need to be rectified by the technical team (if there is any) that
maintains the standards for the government of TN.

Here in this first part I list the URLs for downloading the two
relevant documents, with a brief listing of their major contents. In
the second part, to follow this, I point out the mistakes in the
prescriptions as well as the documentation for the extension of
Tamil99 keyboard sequences for Unicode and TACE16. I will write
similarly about Tamil-Typewriter in the 3rd part (still to be
composed). In the 4th part I will write about the other changes, which
are without mistakes.

The download links for the two documents we are referring to are as follows:

i) G.O.M(s) 29 (Information Technology, dated 23-June-2010) :
http://tamilvu.org/coresite/download/Tamil_Unicode_G.O.zip
(or http://www.tn.gov.in/gosdb/gorders/it/it_e_29_2010.pdf )

ii) Tamil Virtual Academy 'TENDER DOCUMENT for Development of Tamil
Fonts and Tamil Keyboard driver for 16-bit encodings (Unicode and
TACE16)' [Tender Ref. TVA/SW/2010-11] :
http://tamilvu.org/coresite/download/Teder_Document_for_Tamil_fonts_and_kbd_driver.pdf
(or http://tenders.tn.gov.in/pubnowtend/uploaded/TD_tvu53245_teder_document_for_tamil_fonts_and_kbd_driver.pdf)

The first document, G.O.M(s) 29 (Tamil_Unicode_G.O.zip), consists of
the following:

a) Orders (on the cessation of the usage by all the TN Gov.
departments of 8 bit encoding schemes including TAB/TAM and the
adoption of Unicode for applications having Tamil support and TACE for
applications with partial support or without any support, directives
on the relevant appendices, prescribed prefixes for names of Unicode
and TACE fonts supplied to the government and on the requirement for
such fonts to be provisioned with re-distribution rights that includes
"Installable Embedding Allowed" clause )

b ) Appendix A : Valid Unicode Tamil character sequences (and
including copy of Tamil Vowels, Consonants and Syllables table from
Unicode 5.2)

c) Appendix B1 : TACE16 Code Chart -Tamil Letters
Appendix B2 : TACE16 Code Chart -Tamil Symbols
Appendix B3 : Tamil Named Sequence for TACE16 - Letters
Appendix B3 : Tamil Named Sequence for TACE16 - Symbols

d) Appendix C : Tamil99 Extended Keyboard Sequence for Tamil Unicode
and TACE16 (and including copy of the Tamil 99 keyboard rules as it is
in G.O.Ms 17 dated 13.6.99)
e) Appendix D: Tamil Collation Sequence

The tender document (the second link above), TVA/SW/2010-11, is the
call by Tamil Virtual Academy to bidders for the development and
supply of Tamil Fonts and Tamil Keyboard Drivers for the Unicode and
TACE16 encoding schemes (for distribution to Government departments
and the public). In addition to instructions and the other usual
tender related terms and conditions, the PDF document includes the
following annexes:

a) Annexure 1 General Symbols Code Chart & Glyph Name Chart for
Unicode and TACE16
b) Annexure 2 Tamil99 Extended Keyboard Layout
c) Annexure 3 Tamil-Typewriter Keyboard Layout
d) Annexure 4 Typewriter Keyboard Sequence for Unicode and TACE16

This Tender was e-published on 27th Oct 2010 and the closing date is 24th Nov 2010.

To be continued in next mail with issues in Tamil99 extensions for
Unicode and TACE.

K. Sethu
(கா. சேது)
Colombo

கா. சேது | කා. සේතු | K. Sethu

Nov 7, 2010, 1:31:48 AM

to freetamil...@googlegroups.com

This is part two continuing from my first posting on this subject
which is at : http://groups.google.com/group/freetamilcomputing/browse_frm/thread/c95cb1994ad1606c

Here in part-2, I point out mistakes, ambiguities and contradictions
related to Tamil99 extensions. The last of the following 5 issues is
common to Tamil99 and Tamil-Typewriter

1. Tamil99 - Key assignments for ":" (Colon - U+003A and TACE : 003A)
======================================================
For both Unicode and TACE16, the assigned qwerty key is I (Shift i),
as indicated in the tabulation in Appendix C of G.O.Ms 29 and in the
layout diagram in Annexure-2 of the tender document.

However, for TACE16 there is a conflict. In the tabular list in
Appendix C of G.O.Ms 29, the same key is also assigned to a (TACE16
only) symbol called "Raja" (ராஜ). On the other hand, this symbol
(somewhat similar to Tamil 100) is not indicated in the layout diagram
in Annexure-2 of the tender document, whereas the other TACE16-only
symbols (Full Moon, New Moon and Karthigai) are indicated in their
respective places in the layout.

These conflicting assignments for Colon and the new symbol Raja
perhaps arise from a misreading of the corresponding key assignment in
the old Tamil99 standard of 1999.

Consider the context of similar assignments in old Tamil99 scheme of
the following 4 symbols: quotation mark ( " U+0022), Colon ( :
U+003A), Semicolon ( ; U+003B) and Apostrophe ( ' U+0027) . The
assignments for these 4 in older scheme in terms of qwerty keys were :

i) key K (shift k) --> "
ii) key L (shift l) --> :
iii) key : (shift ;) --> ;
iv) key " (shift ') --> '

Of the above 4 similar sequences, only Colon is reassigned in the new
standard, to key I (shift i) --> : , leaving the old key L (shift l)
not assigned to anything. I believe there could have been no
compelling reason for this reassignment from L to I. Most probably it
results from misreading the old key assignment L (shift l) as I
(shift i), since lower-case L (l) and upper-case i (I) look alike in
most fonts.

So the resolution to this conflict should be to restore the old
Tamil99 assignment of key L (shift l) for Colon, for both Unicode and
TACE16, and to assign key I (shift i) to the newly introduced symbol
Raja in TACE16 only.
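
As a rough illustration (not part of the G.O. or the tender document),
the conflict described above can be found mechanically by listing the
(key, output) pairs of a table and reporting any key that is given
more than one output. The function name and the two small lists below
are assumptions made for this sketch; they cover only the keys
discussed in this issue, not the whole of Appendix C.

```python
from collections import defaultdict


def find_conflicts(assignments):
    """Return {key: [outputs]} for every key that has more than one output."""
    outputs_by_key = defaultdict(list)
    for key, output in assignments:
        if output not in outputs_by_key[key]:
            outputs_by_key[key].append(output)
    return {k: v for k, v in outputs_by_key.items() if len(v) > 1}


# TACE16 assignments as tabulated in Appendix C (only the keys discussed here).
tace16_as_prescribed = [
    ("I", ":"),     # Colon, TACE 003A
    ("I", "Raja"),  # TACE16-only symbol on the same key
    ("K", '"'),
    (":", ";"),
    ('"', "'"),
]

# Resolution proposed in this post: Colon back on L, Raja stays on I.
tace16_as_proposed = [
    ("L", ":"),
    ("I", "Raja"),
    ("K", '"'),
    (":", ";"),
    ('"', "'"),
]

print(find_conflicts(tace16_as_prescribed))  # {'I': [':', 'Raja']}
print(find_conflicts(tace16_as_proposed))    # {}
```

The same check could be run over any other table of sequences, for
example one table per encoding.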

2. Tamil99 - Sequences with ^ and # Combinations
=======================================
The keyboard sequences assigned to the Tamil digits 0 to 9 and the
Tamil numbers 10, 100 and 1000 are the same for Unicode and TACE.
Further, 16 Tamil fractions (1/8, 1/4, 1/2, 3/4, 1/32.... etc)
in the TACE16 encoding are given similar assignments using
^ (that is, the key shift 6) and # (key shift 3).

Examples:
For common elements to Unicode and TACE16
----------------------------------------------------------------
^#0# --> ௦ Tamil Number 0 (சுழி) [U+0BE6] [TACE: E180]
^#1# --> ௧ Tamil Number ௧ (ஒன்று) [U+0BE7] [TACE: E181]
...
...
^#10# --> ௰ Tamil Number ௰ (பத்து) [U+0BF0] [TACE: E18A]
^#100# --> ௱ Tamil Number ௱ (நூறு) [U+0BF1] [TACE: E18B]
^#1000# --> ௲ Tamil Number ௲ (ஆயிரம்) [U+0BF2] [TACE: E18C]

For TACE Only:
--------------------
^#18# --> Tamil Fraction 1/8 (அரைக்கால்) [TACE : E1A0]
....
...
^#1320# --> Tamil Fraction 1/320 (முந்திரி) [TACE: E1AC]
....
...

I have a couple of issues with the scheme in place: i) the manner of
presentation and ii) the number of keys:

i. The sequences assigned to the 10 digits, 3 numbers and 16
fractions are tabulated in sub section C of Appendix C of
G.O.Ms 29. For all other keyboard sequences in the other sub sections
of the same document, as well as in all the tabulations for
Tamil-Typewriter (in Annexure 4 of the Tender document), an
unwritten convention is followed: for key sequences having more than
one key, the + symbol is used as a notation between successive qwerty
keys. For example, the sequence hq (கஆ) that produces கா is shown,
for illustrative purposes, as h+q (க+ஆ).

But for these 29 sequences involving the ^ and # combination, the +
notation is not used between keys, and that departure could raise
doubts as to whether ^ has been mistakenly included before #.

ii. An even more important issue is whether there is any need for two
keys to precede an English digit or number sequence and one more to
trail it, instead of the more convenient scheme of having only one
preceding symbol. In fact, in the assignments for the same in the
tabulations for Tamil-Typewriter (Annexure 4 of the Tender document),
only one # is used to precede the digit or number sequences in
English. For example, for Tamil number 1000 the keystroke sequence is
#1000 (#+1+0+0+0) for Tamil-Typewriter.

The same type of scheme could be used for Tamil99 also and I don't see
the need for the more inconvenient assignments now prescribed.
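
To make the comparison concrete, here is a minimal sketch of the two
styles of sequence for the digits and numbers common to Unicode and
TACE16. The function names are made up for this illustration, and the
table holds only the Unicode code points (U+0BE6 to U+0BF2); the
TACE-only fractions are left out, so a sequence such as ^#18# is
simply not found in it.

```python
# Tamil digits 0-9 (U+0BE6..U+0BEF) and the numbers 10, 100, 1000.
TAMIL_NUMBERS = {str(d): chr(0x0BE6 + d) for d in range(10)}
TAMIL_NUMBERS.update({"10": "\u0BF0", "100": "\u0BF1", "1000": "\u0BF2"})


def decode_tamil99(seq):
    """Decode the prescribed Tamil99 form, e.g. '^#1000#'."""
    if len(seq) < 4 or not seq.startswith("^#") or not seq.endswith("#"):
        raise ValueError(f"not a ^#...# sequence: {seq!r}")
    return TAMIL_NUMBERS[seq[2:-1]]


def decode_typewriter_style(seq):
    """Decode the single-prefix form used for Tamil-Typewriter, e.g. '#1000'."""
    if len(seq) < 2 or not seq.startswith("#"):
        raise ValueError(f"not a #... sequence: {seq!r}")
    return TAMIL_NUMBERS[seq[1:]]


# Both styles reach the same character; the Tamil99 form needs two more keys.
print(decode_tamil99("^#1000#") == decode_typewriter_style("#1000"))  # True
print(len("^#1000#"), len("#1000"))  # 7 5
```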

3. Tamil99 - some discrepancies with Tamil 99 Keyboard rules as in
G.O.Ms 17 dated 13.6.99
======================================================
Appendix C, which provides in tables the Tamil keyboard sequences
extended for Unicode and TACE16, also includes a verbatim copy of the
Tamil 99 keyboard rules as in G.O.Ms 17 dated 13.6.99, and the rules
are mandatory. However, some rules that are also covered by the newer
sequence tables have discrepancies that need to be noted and
rectified.

i) In rule 1 the following part needs correction:

"......the five grandha consonants combined with the vowel (sa, sha,
ja, ha, ksha), and the letter shri...."

Now with the addition of ஶ (U+0BB6 - Unicode name: SHA) from Unicode
4.1, which is also included in the G.O.Ms 29 sequence tables, the
number has to be corrected to 6.

Further, the above traditional way of naming the grandha consonants ஷ
and க்ஷ in English is at variance with the Unicode English names, which
are used in the tables. The correlations between the traditional names
and the Unicode names are as follows (Unicode name on the right
side):

ஸ : sa = SA
ஷ : sha = SSA
க்ஷ: ksha = KSSA
ஶ --- = SHA
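
The Unicode names in the right-hand column can be checked against the
Unicode character database, for instance with Python's standard
unicodedata module. This is only a quick check added for illustration;
க்ஷ is not a single code point, so it is shown as the names of the three
code points in its sequence.

```python
import unicodedata

# The three grandha letters in the table above: ஸ, ஷ, ஶ.
for ch in ["\u0BB8", "\u0BB7", "\u0BB6"]:
    print(f"U+{ord(ch):04X} {ch} {unicodedata.name(ch)}")

# க்ஷ is a sequence of KA + VIRAMA (pulli) + SSA, not one code point.
ksha = "\u0B95\u0BCD\u0BB7"
print([unicodedata.name(c) for c in ksha])
```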

For normal writing in lower or mixed case (rather than the all-capital
notation of Unicode), a notation consistent with correct
pronunciation is needed to avoid confusion with regard to ஷ, க்ஷ and
ஶ (U+0BB6), and also shrii, which is now defined with ஶ (U+0BB6)
instead of the legacy ஸ.

ii) The sequence for the Copyright sign ©, specified as ^c in rule 10
of the Tamil99 keyboard rules, should now be corrected to ^C. This is
because the sequence ^c is now (rightly) reassigned in the tables to
the vowel modifier ொ, to implement Rule 9 in the Unicode and TACE16
extensions, and the sequence for the © symbol is defined as ^C in the
tables.

4. Tamil99 - Indication of most Tamil-99 Keyboard sequences with ^
(shift 6) combinations in the layout image not proper
====================================================
Section B of Appendix C of G.O.Ms 29 has the sequences that start
with ^ (shift 6) followed by qwerty keys, mapping to symbols and vowel
modifiers. Of those, only for Non Breaking Space (sequence ^S) and the
copyright symbol (sequence ^C) is the key following ^ a shift state
(upper case) qwerty key. All the others are lower case.

But in the layout diagram for the ^ (shift+6) combination in
Annexure 2 of the Tender document, they are all shown at the top right
edge of the respective keys. In fact, on the key marked C, the yield ©
of the sequence ^C is shown at the bottom right edge and ொ, the yield
of ^c, is shown at the top right edge. Placing each of these labels at
the position corresponding to the shift level of the key following ^
would be more appropriate and would avoid contradictory
interpretations.

5. Tamil99 and Tamil-Typewriter - Code Point for "Non Breaking Space"
=========================================================
In the tabulations for Tamil99 as well as Tamil-Typewriter, the code
point to map is indicated as 00AD for both Unicode and TACE. However
00AD is actually the Unicode code point for "Soft-Hyphen". For "Non
Breaking Space" (NBSP) the correct code point should be 00A0.

On the other hand, Annexure 1 to the Tender document, titled "General
Symbols Code Chart & Glyph name chart for Unicode and TACE", lists the
code points of "No-Break Space" and "Soft Hyphen" correctly as 00A0
and 00AD respectively. The wrong code point shown for Non Breaking
Space in the key sequence tabulations introduces an ambiguity as to
which of the two, Soft Hyphen or NBSP, the keyboard sequence is meant
for, since only one keymap sequence is indicated (in each of Tamil99
and Tamil-Typewriter). This needs to be rectified by specifying which
of the two (with the correct code point) is intended.
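
The two code points can be looked up in the Unicode character database
to confirm which is which. The short check below uses Python's
standard unicodedata module and is added only as an illustration.

```python
import unicodedata

for cp in (0x00A0, 0x00AD):
    ch = chr(cp)
    # Print the code point, its Unicode name and its general category.
    print(f"U+{cp:04X} {unicodedata.name(ch)} ({unicodedata.category(ch)})")
```

U+00A0 is a space separator (category Zs) while U+00AD is a format
character (category Cf), so a keymap that emits 00AD under the label
"Non Breaking Space" would not insert a space at all.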

I will continue in the next part with issues in the Tamil-Typewriter
extensions for Unicode and TACE. Since I have not drafted it yet, it
will take some more time.

K. Sethu

கா. சேது | කා. සේතු | K. Sethu

Nov 16, 2010, 2:14:05 AM

to freetamil...@googlegroups.com

This is part 3 of the series I am writing under this thread, which is
at the following URL:
http://groups.google.com/group/freetamilcomputing/browse_frm/thread/c95cb1994ad1606c?pli=1

In this part the issues with the TN Government's keyboard sequences
for Tamil-Typewriter extended for Unicode and TACE, are presented.
Additionally at the end of this part I also raise 2 other general
issues not specific to any keyboards

In the TVA tender document, for Tamil-Typewriter, the layout diagrams
are in Annexure-3 and the keyboard sequences are in Annexure-4.
Issues 6 to 9 below concern the Tamil-Typewriter keyboard sequences.

6. Tamil-Typewriter keyboard sequence for “Non Breaking Space” and
“Tamil Number Sign”
==============================================
In the subsection titled “Typewriter Extended Sequence With 'A' (shift
+ a) Combination” (pages 7 and 8) of Annexure 4 the following two
assignments (key sequences common for both Unicode and TACE) are in
conflict with each other:

A (shift a) + S (shift s) → ௺ (Tamil Number Sign)
A (shift a) + S (shift s) → Non Breaking Space

As I pointed out in issue 5 in the last part, whether a keyboard
sequence is intended for “Non Breaking Space” or for “Soft Hyphen” is
another issue to be resolved. Regardless of which of the two is
intended, another key sequence is needed so as not to conflict with
that for the Tamil Number Sign.

The sequence A+S is appropriate for Tamil Number Sign if we compare
the 'A' combination with neighboring shift level keys as displayed in
the layout diagrams in Annexure-3.

So the resolution could be to use the so far unused sequence A
(shift a) + s → Non Breaking Space or Soft Hyphen (whichever is
intended once issue 5 is resolved).

7 . Tamil-Typewriter keyboard sequence for ^ (caret) sign shown in
table has error
=====================================
In the same subsection for the 'A' (shift + a) combination in
Annexure 4, the sequence for the ^ (caret) sign is shown as ^
(shift 6), which conflicts with the assignment of the same key to the
vowel modifier sign UU (uukaara). This error is clearly typographical:
“A (shift) + ” was not typed before “^ (shift 6)” as it was for the
others in the same sub section. The sequence should be corrected to
read “A (shift) + ^ (shift 6)”.

8. Tamil-Typewriter keyboard sequence for ழூ (zhU) could include a
traditionally used sequence
===================================
The keyboard sequences provided in Annexure 4 to Tender document for
ழ, ழு and ழூ are the following:

H (shift h) → ழ
G (shift g) → ழு
{ (shift [) + H (shift h) → ழூ ( ூ + ழ → ழூ )

Users of typewriter keymaps, including Remington based ones of this
type, are used to a key sequence for ழூ that is based on the key
sequence for ழு with another key to make it ழூ (a modifier key
preceding ழு in the case of Remington based keyboards), just like the
sequences specified in Annexure 4 of the Tender document for டூ, மூ, ரூ
and ளூ. For example, for ளூ the sequence derived from S → ளு is as
follows:

: (shift ;) + S → ளூ

Similarly for ழூ the following sequence should be added:

: (shift ;) + G → ழூ

The already prescribed sequence for ழூ need not be removed as it can
be left as an additional sequence.
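
A small sketch of the idea, using only the two U-form keys quoted
above (S and G): the UU sequences are derived from the U forms by
putting the ":" key in front, and the already prescribed sequence for
ழூ is kept alongside. The table and function names are assumptions
made for this illustration, not names from Annexure 4.

```python
U_SIGN = "\u0BC1"   # dependent vowel sign U
UU_SIGN = "\u0BC2"  # dependent vowel sign UU
UU_PREFIX = ":"     # shift ;

# U-form keys quoted in this post: S -> ளு, G -> ழு.
U_FORMS = {"S": "\u0BB3" + U_SIGN, "G": "\u0BB4" + U_SIGN}


def uu_sequences(u_forms):
    """Derive 'prefix + key' sequences for the UU forms from the U forms."""
    table = {}
    for key, syllable in u_forms.items():
        if not syllable.endswith(U_SIGN):
            raise ValueError(f"{syllable!r} is not a U form")
        table[UU_PREFIX + key] = syllable[:-1] + UU_SIGN
    return table


sequences = uu_sequences(U_FORMS)
# Keep the sequence already prescribed in Annexure 4: { + H -> ழூ.
sequences["{H"] = "\u0BB4" + UU_SIGN

for seq, out in sequences.items():
    print(seq, "->", out)
```

Two different sequences yielding the same syllable is not a conflict;
a conflict arises only when one sequence is given two outputs.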

9. Tamil-Typewriter Layout Diagrams (Annexure 3) – errors
=================================
In the layout diagrams for Tamil-Typewriter in Annexure-3 to tender
document following errors are found:

9.-i) Sequence assigned to $ (shift 4) is shown wrongly as ஐ and
should be corrected to ஜ which is the prescription for $ key in the
tables in Annexure 4

9.-ii) The key “&” (shift 7) is shown with the right end piece of ஷ,
which is wrong. The & (shift 7) key is no longer assigned to ஷ, for
which the changed prescription in this standard is Z. The key & (shift
7) has not been reassigned, and so the key should be shown in the
layout free of any Tamil character or part of it.

9.-iii) The key “*” (shift 8) is shown with bullet symbol assigned
to it which is wrong. The sequence for bullet symbol is no longer *
(shift 8) but “ A (shift a) + . ” . The key “*” (shift 8) is not
assigned to any other Tamil characters and so the bullet should not be
shown on it.

Issues 6 to 9 above and issues 1 to 5 in my previous part are the
ones I have found to be bugs in the keyboard sequences and / or
layout diagrams for Tamil-99 and Tamil-Typewriter in the G.O.Ms 29
and TVA tender documents, which need to be addressed and resolved by
the TN government / TVA.

The following issues 10 – 11 are a couple of general issues that are
worthy of the attention of all concerned, including the TN gov. / TVA.

10) Fall-back glyphs used for dependent vowel signs for U (ukara) and
UU (uukara)
====================================
Tamil uyirmei consonants having U (ukara) and UU (uukaara) vowels are
mostly ligatures (rendering of which would be with GSUB ) and so the
role of the glyphs used in fonts for the two code points of the
dependent vowel signs of U and UU can be considered as follows:

i. Illustrative role – in discussion with diagrams, to mean the two
vowel modifiers in generic sense regardless of the ligated shapes
taken by the uyirmeis having the two vowels

ii. Shaping role – for some subset of uyirmeis with U, and similarly
of those with UU, to be the actual vowel modifiers that are used in
shaping with GPOS.

The diacritic glyphs prescribed in TN government's implementation
schemes for the dependent vowel signs for U (ukara) and UU (uukaara),
are the grantha glyphs for the same vowel modifiers, viz ு and ூ
respectively.

They are at variance with the glyphs used for illustrations in Unicode
charts and documents for Tamil range, which are, for U the modifying
stroke used in cases of {பு, யு, வு, ஙு} and for UU that used in cases
of {பூ, யூ, வூ, ஙூ}. Those glyphs used in Unicode charts are also
used prescriptively in Sri Lanka government's implementation of Tamil
Unicode standardized in SLS-1326:2008
[http://www.icta.lk/index.php/en/programmes/ict-policy-leadership-and-institutional-development-programme/104-local-languages-initiative-/651-sls-1326-2008-tamil-ict-standard]

I think more uniformity between these standards needs to be
established. The TN Govt could consider prescribing the generic glyphs
for U and UU as they are shown illustratively in Unicode charts and
documents.

11) Non Conjunct form (க்‌ஷ) of க்ஷ (KSSA) and the need for fonts to
support ZWNJ
========================================
It is a welcome step that the G.O.Ms 29 / Tender documents include
prescriptions for key sequences for conjunct form of ligature க்ஷ
(U+0B95 U+0BCD U+0BB7) and non conjunct form க்‌ஷ ( U+0B95 U+0BCD
U+200C U+0BB7), the latter being more used in handwriting by a
majority of Tamils. The standard of Sri Lanka, SLS-1326:2008 also
similarly allows the use of U+200C (Zero Width Non Joiner – ZWNJ) for
the same purpose.
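
For reference, the two code point sequences differ only by the ZWNJ
placed between the virama (pulli) and ஷ. The snippet below, added as an
illustration, builds both and prints their code points; the helper
name is made up for this sketch.

```python
import unicodedata

KA, VIRAMA, SSA, ZWNJ = "\u0B95", "\u0BCD", "\u0BB7", "\u200C"

conjunct = KA + VIRAMA + SSA            # renders as the ligature
split_form = KA + VIRAMA + ZWNJ + SSA   # renders as two separate letters


def code_points(text):
    """Return the code points of text in U+XXXX notation."""
    return " ".join(f"U+{ord(c):04X}" for c in text)


print(code_points(conjunct))    # U+0B95 U+0BCD U+0BB7
print(code_points(split_form))  # U+0B95 U+0BCD U+200C U+0BB7
print(unicodedata.name(ZWNJ))   # ZERO WIDTH NON-JOINER

# Removing the ZWNJ gives back the conjunct sequence.
print(split_form.replace(ZWNJ, "") == conjunct)  # True
```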

However, font makers (as the Tender call of TVA also includes
development and supply of fonts) could also be instructed to provide
support for ZWNJ in the fonts they supply, as is done in recent
Microsoft fonts and in the open source GPL licensed Lohit-Tamil of
Red Hat / Fedora (version 2.4.5 of Lohit-Tamil.ttf can be extracted
from this tar ball archive :
https://fedorahosted.org/releases/l/o/lohit/lohit-tamil-ttf-2.4.5.tar.gz
). By support in the font, I mean the font having a non-visual glyph
for ZWNJ at the code point U+200C.

On the Linux platform I have observed that in most applications,
regardless of whether font level support is provided for ZWNJ or not,
the split form க்‌ஷ using ZWNJ gets rendered correctly. However, in
Open Office applications (Writer, Calc etc), the split form renders
correctly only with a font that has support for ZWNJ. It is thus very
much preferable that Tamil fonts include ZWNJ support, to be useful
in all applications on any OS platform.

In my next post I will write about some other issues arising from the
non-distinctiveness of the glyphs used for ள (Lakaram) and for the
part representing the au-length-mark in the vowel ஔ (Au) and in the
corresponding dependent vowel sign ௌ.