Using ML for text image to unicode


Harsha Matadhikari

Dec 5, 2017, 9:45:01 PM
to sanskrit-programmers
Dear Experts,

Has anybody tried using ML to convert vernacular text in a scanned image to Unicode text, especially Sanskrit/Kannada? On searching, I found that the CCA algorithm can be used to extract letters from text in image form, but it may not work with Devanagari, as the letters are joined together to form a word. Conversely, in Kannada, a single conjunct letter (ottakshara) might be split into disconnected parts.
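To make the concern concrete, here is a minimal sketch, assuming "CCA" refers to connected-component analysis of ink pixels. The two tiny bitmaps are made-up toy shapes, not real glyphs: one mimics a word whose letters share a headline, the other a single akshara with a detached part. The function name `connected_components` and the helper `parse` are illustrative only.

```python
from collections import deque

def connected_components(grid):
    """Find 8-connected groups of ink pixels (1s); return their bounding boxes."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes

def parse(art):
    """Turn ASCII art ('#' = ink, '.' = background) into a 0/1 grid."""
    return [[1 if ch == "#" else 0 for ch in line] for line in art]

# Toy word (assumption): two letter stems joined by a shared headline.
JOINED = parse([
    "#######",
    ".#...#.",
    ".#...#.",
])

# Toy akshara (assumption): a base shape plus a detached lower part.
DISJOINT = parse([
    ".###.",
    ".###.",
    ".....",
    "...##",
])

joined_boxes = connected_components(JOINED)
disjoint_boxes = connected_components(DISJOINT)
print(len(joined_boxes), len(disjoint_boxes))  # prints: 1 2
```

The joined word comes back as a single component (two letters merged), while the single akshara comes back as two (one letter split), so component boundaries alone do not match letter boundaries in either case.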

I want to do it as a hobby project. Any inputs are welcome.

Regards,
Harsha

विश्वासो वासुकिजः (Vishvas Vasuki)

Dec 5, 2017, 9:59:52 PM
to sanskrit-programmers, Shree Devi Kumar
+shreeshree (Adding prior thread for context - https://groups.google.com/forum/#!topic/bvparishat/p4Q94hIXBec )

--
You received this message because you are subscribed to the Google Groups "sanskrit-programmers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sanskrit-programmers+unsub...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Vishvas /विश्वासः

Anunad Singh

Dec 5, 2017, 11:04:07 PM
to sanskrit-p...@googlegroups.com
Harsh ji,
Are you talking about using ML (the programming language) to recognize
text (in Indic scripts) present in image form? I know neither ML nor
the algorithms used in OCR. But as I understand it, various
aspects of Indic text recognition (such as shirorekhA, samyuktakshar,
mAtrAs (above, below, before, after), etc.) have been discussed widely.
At present, Tesseract and Google OCR are giving quite good results.

Could you elaborate on your planned project and say whether you want to
use ML to achieve still better recognition or want something else?
Why ML for this? Isn't recognition more about a good text-recognition
algorithm than about a programming language?

-- anunAda

विश्वासो वासुकिजः (Vishvas Vasuki)

Dec 5, 2017, 11:16:28 PM
to sanskrit-programmers
 ML = machine learning, I presume 


Avinash L Varna

Dec 6, 2017, 10:57:43 AM
to sanskrit-programmers
Shreeshree should be able to provide more details, but Tesseract OCR v4.0 (currently in alpha) uses an LSTM. You can get the alpha release from GitHub.
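For anyone who wants to try this, a rough sketch of calling the Tesseract command-line tool from Python follows. It assumes a Tesseract 4 binary is installed and on the PATH, along with the `san` (Sanskrit) or `kan` (Kannada) traineddata files; `page.png` is a hypothetical file name, and the helper names are made up for illustration.

```python
import subprocess

def build_tesseract_command(image_path, lang="san"):
    """Build a Tesseract 4 command line that writes recognized text to stdout."""
    return [
        "tesseract",
        image_path,
        "stdout",        # write text to standard output instead of a file
        "-l", lang,      # traineddata language code, e.g. "san" or "kan"
        "--oem", "1",    # OCR engine mode 1: LSTM neural-net engine only
    ]

def ocr_image(image_path, lang="san"):
    """Run Tesseract on one image and return the recognized Unicode text."""
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,  # requires Python 3.7+
        check=True,           # raise if Tesseract exits with an error
    )
    return result.stdout.decode("utf-8")

# Only builds and prints the command; ocr_image() needs Tesseract installed.
command = build_tesseract_command("page.png", lang="kan")
print(" ".join(command))
```

Calling `ocr_image("page.png", lang="kan")` would then return the recognized text, provided the assumptions above hold on your machine.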
