detect.py -- automatically detect a transliteration scheme


Arun

Dec 22, 2013, 2:05:12 PM
to sanskrit-p...@googlegroups.com
Summary
Goal: given some input string, detect the scheme used, e.g. "devanagari", "slp1", "harvard-kyoto"
Source: https://github.com/sanskrit/detect.py

I had been thinking about doing this but was pleasantly surprised to find Shreevatsa had predated me by a few weeks. But I thought it would be useful to refactor this into its own module and add support for other scripts and encodings.

Currently supports:
- Bengali
- Devanagari
- Gujarati
- Gurmukhi
- Kannada
- Malayalam
- Oriya
- Tamil
- Telugu
- Harvard-Kyoto
- IAST
- ITRANS
- Kolkata romanization
- SLP1
- Velthuis (no upper-case support)
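
For the Brahmic scripts in this list, detection can in principle rest on Unicode block ranges alone. The following is a minimal sketch of that idea, not the code in detect.py: the block boundaries come from the Unicode Standard, while `BRAHMIC_BLOCKS` and `detect_brahmic` are hypothetical names chosen for illustration.

```python
# Unicode block ranges (inclusive) for the Brahmic scripts listed above.
# The ranges come from the Unicode Standard; the names and the function
# are illustrative and are not taken from detect.py.
BRAHMIC_BLOCKS = [
    ("Devanagari", 0x0900, 0x097F),
    ("Bengali", 0x0980, 0x09FF),
    ("Gurmukhi", 0x0A00, 0x0A7F),
    ("Gujarati", 0x0A80, 0x0AFF),
    ("Oriya", 0x0B00, 0x0B7F),
    ("Tamil", 0x0B80, 0x0BFF),
    ("Telugu", 0x0C00, 0x0C7F),
    ("Kannada", 0x0C80, 0x0CFF),
    ("Malayalam", 0x0D00, 0x0D7F),
]


def detect_brahmic(text):
    """Return the script of the first Brahmic character in `text`, else None."""
    for char in text:
        code = ord(char)
        for name, start, end in BRAHMIC_BLOCKS:
            if start <= code <= end:
                return name
    return None
```

Returning on the first match keeps the sketch short, but it would misreport text that begins with a danda (U+0964), which sits in the Devanagari block and is used by several scripts. The romanizations cannot be told apart this way, since they mostly share the same Latin letters.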

Behavior
- If some input is coded ambiguously, the program consults an internal list and chooses the most "favorable." The precedence order is: HK > IAST > ITRANS > Kolkata > SLP1 > Velthuis.
- If there are multiple contradictory encodings present, the program returns None.
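
The two rules above could be sketched roughly as follows. This is an illustration under stated assumptions, not the algorithm detect.py actually uses: `resolve` is a hypothetical helper, the scheme strings are placeholders, and the assumption is that each feature observed in the input has already been mapped to the set of schemes it is compatible with.

```python
# Precedence order quoted in the post, most favorable first.
# The strings are placeholders, not the names detect.py returns.
PRECEDENCE = ["HK", "IAST", "ITRANS", "Kolkata", "SLP1", "Velthuis"]


def resolve(evidence):
    """Pick one scheme from per-feature candidate sets.

    `evidence` is a list of sets; each set holds the schemes compatible
    with one feature observed in the input (hypothetical representation).
    """
    candidates = set(PRECEDENCE)
    for compatible in evidence:
        candidates &= compatible
    if not candidates:
        # The features contradict each other: no single scheme fits.
        return None
    # Ambiguous input: choose the most favorable remaining scheme.
    return min(candidates, key=PRECEDENCE.index)
```

For example, `resolve([{"SLP1", "Velthuis"}])` picks SLP1 because it ranks higher, while `resolve([{"IAST"}, {"SLP1"}])` returns None because no scheme satisfies both features.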

The algorithm is a little goofy, mainly because I thought it was more amenable to adding more features/flags later on.

Next steps:

- I have tested this on a variety of simple cases, but I am sure there are some gaps in coverage. If you find any, please post them here.
- If the current behavior is disagreeable to you, please let me know how you'd like to see it changed.
- I was advised that WX support wasn't worth the time. If you disagree, let me know so that I can add support for it.

When I'm more confident that it has comprehensive coverage, I'll also create a JavaScript port.




dhaval patel

Dec 23, 2013, 12:29:23 AM
to sanskrit-p...@googlegroups.com
Good work.
Tested with one input in SLP1 and noted the issue in the GitHub issues.


--
You received this message because you are subscribed to the Google Groups "sanskrit-programmers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sanskrit-program...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.



--
Dr. Dhaval Patel, I.A.S
District Development Officer, Rajkot

Mārcis Gasūns

Dec 23, 2013, 4:49:38 PM
to sanskrit-p...@googlegroups.com
Namaste,


On Sunday, 22 December 2013 23:05:12 UTC+4, Arun wrote:
Please add the source link from the demo URL as well.
 
I had been thinking about doing this but was pleasantly surprised to find Shreevatsa had predated me by a few weeks. But I thought it would be useful to refactor this into its own module and add support for other scripts and encodings.
Right, the work must go on. It's very convenient. It's how it must always have been.
 

Currently supports:
- Bengali
- Devanagari
- Gujarati
- Gurmukhi
- Kannada
- Malayalam
- Oriya
- Tamil
- Telugu
- Harvard-Kyoto
- IAST
- ITRANS
- Kolkata romanization
- SLP1
- Velthuis (no upper-case support)
Oh, and then there are those non-Unicode Devanagari encodings. Surely there must be a way to recognize them. I'll add it to the issues, if you do not mind.
 
- If some input is coded ambiguously, the program consults an internal list and chooses the most "favorable." The precedence order is: HK > IAST > ITRANS > Kolkata > SLP1 > Velthuis.
Kolkata in the middle does not make much sense. ITRANS is easy to spot, so it should not be as high in the list.
Very much needed work. I'll be an evangelist for it for sure. It solves so many issues at once, even before they arise.

M.G.

Arun

Dec 24, 2013, 9:21:19 PM
to sanskrit-p...@googlegroups.com
Updated with stronger support, more tests, and a different algorithm. Suggested additions so far:

- support for non-Unicode Devanagari encodings
- support for multiple encodings in a single string (how? can't think of an approach that improves on the one used now)

Please continue to test and suggest additional features.

ken p

Dec 25, 2013, 11:34:09 AM
to sanskrit-p...@googlegroups.com
Arun,

Please post the links for these types of script converters.
Is there any web page or website converter to Roman script?

विश्वासो वासुकिजः (Vishvas Vasuki)

Dec 25, 2013, 12:26:20 PM
to sanskrit-p...@googlegroups.com

On Wed, Dec 25, 2013 at 8:34 AM, ken p <drk...@gmail.com> wrote:
Please post the links for these type of script converters.
Is there any web page /web site converter to Roman Script?




--
Vishvas /विश्वासः

Arun

Dec 25, 2013, 1:32:01 PM
to sanskrit-p...@googlegroups.com
Mārcis and I disagree on how these schemes should be represented internally. I prefer lowercase strings
e.g. 'devanagari' and 'iast', but Mārcis favors actual-case strings, e.g. 'Devanagari' and 'IAST'. I think lowercase
is a nice convention and easy to type, but Mārcis thinks it is confusing and un-academic.

Do any group members have strong feelings on this? If so, I would be amenable to making a change.
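
One possible middle ground, sketched here only as an illustration: keep the actual-case names as the canonical form but accept any casing on input. `CANONICAL_NAMES` and `normalize_scheme_name` are hypothetical and not part of detect.py; the project readme describes the names it really uses.

```python
# Canonical, actual-case names (illustrative only; see the project readme
# for the names detect.py really uses).
CANONICAL_NAMES = [
    "Devanagari", "HK", "IAST", "ITRANS", "Kolkata", "SLP1", "Velthuis",
]

# Lookup table keyed by lowercase so callers may type either convention.
_BY_LOWERCASE = {name.lower(): name for name in CANONICAL_NAMES}


def normalize_scheme_name(name):
    """Map 'iast', 'IAST', or 'Iast' to the canonical 'IAST'."""
    try:
        return _BY_LOWERCASE[name.lower()]
    except KeyError:
        raise ValueError("unknown scheme: %r" % name)
```

With this, `normalize_scheme_name('devanagari')` and `normalize_scheme_name('Devanagari')` both give `'Devanagari'`, so lowercase stays easy to type while the stored names keep their conventional spelling.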

विश्वासो वासुकिजः (Vishvas Vasuki)

Dec 25, 2013, 2:22:05 PM
to sanskrit-p...@googlegroups.com

On Wed, Dec 25, 2013 at 10:32 AM, Arun <aru...@gmail.com> wrote:
Do any group members have strong feelings on this? If so, I would be amenable to making a change.

No strong feelings, but I mildly favor Mārcis's suggestion.

dhaval patel

Dec 25, 2013, 2:38:04 PM
to sanskrit-p...@googlegroups.com
I too am inclined towards Mārcis's suggestion.




Mārcis Gasūns

Dec 25, 2013, 2:52:25 PM
to sanskrit-p...@googlegroups.com


On Wednesday, 25 December 2013 23:38:04 UTC+4, dhaval patel wrote:
I too have an inclination towards marcis' suggestion.

Me too. I know that guy. I would go for his suggestion.


By the way, something is wrong with the footer at http://learnsanskrit.org/tools/sanscript

A PHP Error was encountered

Severity: Warning

Message: date() [function.date]: It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier. We selected 'America/New_York' for 'EST/-5.0/no DST' instead

Filename: includes/footer.php

Line Number: 3 

Arun

Dec 25, 2013, 2:52:59 PM
to sanskrit-p...@googlegroups.com
I've updated the source code. New names are described in the project readme (https://github.com/sanskrit/detect.py).

Mārcis Gasūns

Dec 25, 2013, 3:19:02 PM
to sanskrit-p...@googlegroups.com


On Wednesday, 25 December 2013 23:52:59 UTC+4, Arun wrote:
I've updated the source code.
Great, very good to see the code updated quickly.
 
New names are described in the project readme (https://github.com/sanskrit/detect.py).
Two possible corrections:
  • Harvard-Kyoto ('HK') -> Harvard-Kyōto ('HK')
  • Devanagari ('Devanagari') -> Devanāgarī ('Devanagari') 

Arun

Dec 27, 2013, 12:18:45 AM
to sanskrit-p...@googlegroups.com
Ported to JavaScript:


It's basically the same as the Python version, so whatever fails there will fail in the JS version too.

Mārcis Gasūns

Dec 27, 2013, 11:29:07 AM
to sanskrit-p...@googlegroups.com


On Friday, 27 December 2013 09:18:45 UTC+4, Arun wrote:
You're doing fantastic work. And it's Vishvas's dream come true as well, I hope.
 
It's basically the same as the Python version, so whatever fails there will fail in the JS version too.
Let me ask you a non-JS question. There is an abandoned Sanskrit dictionary project, http://samskrtam.ru/sandic-sanskrit-dictionary/, which uses Qt. The code is open (http://sourceforge.net/projects/sandic/), but there is no way of adding input methods. Could detect be ported to C as well? Do you have any idea? Thanks.

M.

Arun Prasad

Dec 27, 2013, 8:38:50 PM
to sanskrit-p...@googlegroups.com
It could be ported to C -- the algorithm is pretty simple, and most of the hard part has already been done. But I have neither the interest nor the experience to port it myself. (Porting to JavaScript is trivial because all strings are Unicode already, and even that took more time than I was expecting.) For now, I'd rather move on to other projects.

For those interested in porting it to their language of choice, the two implementations are:
