standard "escape" characters and spacers for transliterators.


विश्वासो वासुकिजः (Vishvas Vasuki)

Nov 9, 2017, 3:29:45 PM
to sanskrit-programmers
Now that we're beginning to converge on standard transliteration tests, it would be good to agree on how to implement escape characters. (This should take care of a problem shrIvatsa mentioned, for example.)

Should we just follow aruN's lead from Sanscript? Its documentation reads:

Disabling Transliteration

If you are translating from a Roman rendition, you can tell Sanscript to transliterate only certain parts of your input text. When you want to disable transliteration, type ##. You can enable transliteration by typing ## again. Below is a sample transliteration from ITRANS to Devanagari.

  • bhagavad ##(divine one)## gItA ##(song)  भगवद् (divine one) गीता (song)

You can also disable transliteration on just one character by using the backslash character \. The backslash is used in many programs to take special letters and make them normal. If you've never used the backslash like this before, I recommend just using ##.

  • bhagavad . gItA\.  भगवद् । गीता.
  • dharmakShetre \## kurukShetre  धर्मक्षेत्रे ## कुरुक्षेत्रे
  • a \a \\ A \A  अ a \ आ A
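
A minimal sketch, in Python, of how the two escape conventions above might be tokenized before anything is transliterated. The name `tokenize` is hypothetical, and the rule that a backslash is only special outside a ## ... ## span is an assumption made for illustration; this is not Sanscript's actual implementation.

```python
def tokenize(text):
    """Split text into (segment, is_literal) pairs.

    Assumed rules: '##' toggles literal mode, and outside literal mode a
    backslash makes the single character after it literal.
    """
    tokens = []      # finished (segment, is_literal) pairs
    buf = []         # characters of the segment currently being built
    literal = False  # True while inside a '## ... ##' span

    def flush():
        # Emit the pending segment, tagged with the current mode.
        if buf:
            tokens.append(("".join(buf), literal))
            buf.clear()

    i = 0
    while i < len(text):
        if text.startswith("##", i):
            flush()
            literal = not literal
            i += 2
        elif text[i] == "\\" and i + 1 < len(text) and not literal:
            flush()
            tokens.append((text[i + 1], True))  # one escaped character
            i += 2
        else:
            buf.append(text[i])
            i += 1
    flush()
    return tokens
```

Only the segments flagged `False` would then be handed to a transliterator; the segments flagged `True` are copied to the output unchanged.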

Separating Letters

Some scripts, like Devanagari, can produce complex symbols that are combinations of many other letters. These symbols can be quite confusing, and some of them are not used in modern Devanagari. Fortunately, there's an easy way to prevent them: we use a special "invisible character." This character tells your computer to make text look a certain way. When transliterating from ITRANS, you can insert this character by typing {}. You can also insert it with _. Below are some examples from ITRANS-to-Devanagari transliteration.

  • kShetra k{}Shetra  क्षेत्र क्‍षेत्र
  • barau bara_u  बरौ बर‍उ

--
Vishvas /विश्वासः

Shreevatsa R

Nov 9, 2017, 3:59:34 PM
to sanskrit-programmers
I think as you wisely pointed out in the discussion at
https://github.com/sanskrit-coders/indic_transliteration/issues/5 it
is best to treat the problem in separate "layers".

In the lowest layer, purely concerned with transliteration abstractly,
such issues do not arise: the function/library may expect to be given
well-formed and unambiguous text and transliterate all of it,
perfectly.

The need for escaping appears slightly higher at the user-interface
layer, where a user may want to manually type input to be
transliterated. Only a human user needs such conveniences
(special/escape characters) for disabling transliteration and for
separating letters. It may not be needed if the input is not via text:
e.g. if there's a rich UI (say a desktop program like Word, or a
webpage with buttons etc), then it may be preferable to indicate
"don't transliterate this" via say, highlighting the text and clicking
on some button that marks it up somehow. Or, if the communication is
between programs, they may use some structured format like XML or JSON
(in which such specialties are marked up explicitly), instead of
having plain text in which certain characters are interpreted
specially.

Of course as the text interface is still likely to be the one heavily
used, this issue needs to be handled somehow anyway, and I see no
problem with the above convention.
To be explicit, my only point is that ideally we'll have two separate layers:
- the top layer will take text like "bhagavad ##(divine one)## gItA
##(song)" and "parse" it into chunks "bhagavad ", "(divine one)", " gItA
", and "(song)", aware of which ones (alternate ones) require
transliteration.
- the bottom (transliteration) layer will therefore only get the
strings "bhagavad " and " gItA ", and it does not have to know (and
ideally should not know) anything about escape characters.
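
The split between the two layers could be sketched roughly as follows, in Python. The function name `transliterate_marked` is hypothetical, and the bottom layer is passed in as a plain callable so that it never sees "##"; backslash escapes are deliberately left out to keep the sketch short.

```python
def transliterate_marked(text, transliterate):
    """Top layer: split on '##' and transliterate alternate chunks only.

    `transliterate` stands in for the bottom layer; it is any function
    from str to str and knows nothing about escape markers.
    """
    chunks = text.split("##")
    out = []
    for index, chunk in enumerate(chunks):
        # Even-numbered chunks lie outside the markers and get transliterated;
        # odd-numbered chunks lie between markers and are copied as-is.
        out.append(transliterate(chunk) if index % 2 == 0 else chunk)
    return "".join(out)
```

With `str.upper` as a stand-in for a real transliterator, `transliterate_marked("bhagavad ##(divine one)## gItA ##(song)", str.upper)` passes only "bhagavad " and " gItA " to the bottom layer.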

For the other case,
- the top layer will look at "k{}Shetra" and pass "k" and "Shetra"
separately to the transliteration layer
- the transliteration layer will transliterate "k" into "क्" and
"Shetra" into "षेत्र"
- the top layer will now concatenate "क्" and "षेत्र", but aware
(because it is the one that did the splitting) that two Devanagari
consonants should not get merged, and that it must therefore insert
U+200D ZERO WIDTH JOINER (ZWJ) between the two strings.
Again, the core transliteration layer does not need to be aware of
such stuff, but now maybe it is even worth splitting the top layer
into the topmost layer and a middle one that knows enough about
Devanagari and has rules for when to insert ZWJ / ZWNJ.
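
A rough Python sketch of the splitting-and-joining steps described above. The names `join_separated`, `TOY_TABLE`, and `toy_transliterate` are hypothetical, the lookup table is a two-entry stand-in for a real transliterator, and the sketch always inserts ZWJ, whereas a real middle layer would need rules for choosing between ZWJ and ZWNJ.

```python
ZWJ = "\u200d"  # U+200D ZERO WIDTH JOINER


def join_separated(text, transliterate, separator="{}"):
    """Split on the separator, transliterate each piece, rejoin with ZWJ.

    The transliteration function never sees the separator itself.
    """
    pieces = text.split(separator)
    return ZWJ.join(transliterate(piece) for piece in pieces)


# Toy stand-in for the bottom layer: handles just the pieces of the example.
TOY_TABLE = {"k": "क्", "Shetra": "षेत्र"}


def toy_transliterate(piece):
    # Unknown pieces are returned unchanged in this toy version.
    return TOY_TABLE.get(piece, piece)


result = join_separated("k{}Shetra", toy_transliterate)
```

Here `result` is the two transliterated pieces with a single ZWJ between them. Passing `separator="_"` would handle the alternative spelling of the separator in the same way.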

विश्वासो वासुकिजः (Vishvas Vasuki)

Nov 9, 2017, 4:06:36 PM
to sanskrit-programmers, Shree Devi Kumar
Good good, exactly what was running through my mind before I saw this.

Even if actually implemented by a pre-transliteration (interface) layer, agreeing on a convention for these two issues would still be valuable. (+shreeshree in case she had anything to add given her sanskritdocuments experience)


--
You received this message because you are subscribed to the Google Groups "sanskrit-programmers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sanskrit-programmers+unsub...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

ShreeDevi Kumar

Nov 11, 2017, 11:37:54 PM
to VishvAs VAsuki, sanskrit-programmers
Namaste,

My aim in transliteration is very limited: to display the Sanskrit texts encoded in the ITRANS scheme on the Sanskrit Documents website in Devanagari, IAST, and other Indian-language scripts correctly (as far as possible).

A large number of the source itx files at sanskritdocuments.org date from before Unicode fonts became available. The conversion uses the standard ITRANS delimiters supported by Sanscript, such as ##.

Technology has changed a lot since 1997, when I first started with ITRANS, JTRANS with xdvng, Itranslator, etc.

Going forward, the aim should be to leverage the new technologies available and not be bound by legacy limitations.






ShreeDevi Kumar

Nov 14, 2017, 10:59:21 PM
to VishvAs VAsuki, sanskrit-programmers
FYI


Avinash Chopde has modified the online interface to ITRANS to allow for table-driven transliteration.

Users can define their own schemes.

विश्वासो वासुकिजः (Vishvas Vasuki)

Nov 15, 2017, 10:41:05 AM
to ShreeDevi Kumar, sanskrit-programmers
Well pointed out!