Cube models for Marathi and Sanskrit

rkvsraman

Sep 20, 2016, 9:35:12 AM

to tesseract-ocr

Hello,

In tessdata, I see cube models only for Hindi and not for Marathi and Sanskrit, though they use the same script.

Any particular reason for this?

ShreeDevi Kumar

Sep 20, 2016, 11:07:55 PM

to tesser...@googlegroups.com

Hindi traineddata with a cube model was included with version 3.02 (or 3.01). Marathi and Sanskrit tessdata, without cube models, were released as part of version 3.04.

There has been talk of the cube model being experimental (scant information is available about it) and of plans to discontinue it. Version 3.04 did not include new traineddata for Hindi because it regressed (as per the commit notes).

An alpha release (4.00 with LSTM) by Ray (chief developer for Tesseract at Google) is expected at the end of September, as per some comments by Zdenko on GitHub. It is supposed to have better accuracy for complex scripts.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/83d5c408-c869-4c16-8847-78ba2d250763%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

ShreeDevi Kumar

Sep 20, 2016, 11:52:48 PM

to tesser...@googlegroups.com

For Sanskrit, please see https://github.com/Shreeshrii/imagessan

where I have added the training sources as well as the traineddata for two versions of training. In the testing I did on a small sample of images, they seemed to perform better than the 3.04 san.traineddata.

You are welcome to try using them and provide feedback.

ShreeDevi

RKVS Raman

Sep 21, 2016, 3:24:57 AM

to tesser...@googlegroups.com

Hello ShreeDevi,

Thanks for clarifying the current status of cube. I worked with Tesseract long ago, so I will leave the cube module aside for the time being and focus on using the old adaptive classifier.

BTW, I tried training with https://github.com/Shreeshrii/imagessan/blob/master/san95-langdata/san.training_text on Noto Sans Devanagari and got only 1165 entries in the unicharset.

How did you manage to get 1645 entries with it?

Best Regards
-Raman


ShreeDevi Kumar

Sep 21, 2016, 4:59:22 AM

to tesser...@googlegroups.com

I used the new text2image program with the tesstrain.sh utility to generate the box/tiff pairs, and trained on a large set of fonts;

see

https://github.com/Shreeshrii/imagessan/blob/master/san95-langdata/language-specific.sh

and

https://github.com/Shreeshrii/imagessan/blob/master/san95-langdata/san.font_properties

The additional unicharset entries are probably there because different Devanagari fonts have different numbers of glyphs for conjuncts (especially Siddhanta, Sanskrit2003, Chandas and Uttara).

I do not currently have that setup (cygwin/msys2 with tesseract) to test.
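One way to see exactly which entries account for a difference such as 1165 versus 1645 is to compare the two unicharset files directly. The following is a minimal sketch, not part of the original training setup. It assumes the usual unicharset layout, in which the first line holds the entry count and each following line starts with the unichar followed by a space; the function names are illustrative only.

```python
import sys
from pathlib import Path


def parse_unicharset(text):
    """Return (declared_count, unichars) from the text of a unicharset file.

    Assumes line 1 is the entry count and every later non-empty line
    begins with the unichar, followed by a space and its properties.
    """
    lines = text.splitlines()
    declared = int(lines[0].strip())
    unichars = [line.split(" ", 1)[0] for line in lines[1:] if line.strip()]
    return declared, unichars


def diff_unicharsets(text_a, text_b):
    """Return (only_in_a, only_in_b) as sorted lists of unichars."""
    _, a = parse_unicharset(text_a)
    _, b = parse_unicharset(text_b)
    return sorted(set(a) - set(b)), sorted(set(b) - set(a))


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python diff_unicharset.py first.unicharset second.unicharset
    text_a = Path(sys.argv[1]).read_text(encoding="utf-8")
    text_b = Path(sys.argv[2]).read_text(encoding="utf-8")
    only_a, only_b = diff_unicharsets(text_a, text_b)
    print(f"only in {sys.argv[1]}: {len(only_a)} entries")
    print(" ".join(only_a))
    print(f"only in {sys.argv[2]}: {len(only_b)} entries")
    print(" ".join(only_b))
```

If the entries found only in the larger file turn out to be mostly conjuncts, that would be consistent with the font explanation above.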

ShreeDevi

ShreeDevi Kumar

Sep 21, 2016, 5:04:04 AM

to tesser...@googlegroups.com

Also see the san.config file in the langdata directory.

ShreeDevi

rkvsraman

Sep 21, 2016, 5:39:12 AM

to tesseract-ocr

Thank you for that info.

That helps. BTW, which GUI do you use for running Tesseract, or do you use the command line?

ShreeDevi Kumar

Sep 21, 2016, 8:59:04 AM

to tesser...@googlegroups.com

For the two trainings uploaded to imagessan, I used the command line, with the tesstrain.sh shell script.
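A command-line run of this kind could be assembled roughly as follows. This is only a sketch under stated assumptions, not the exact invocation used for imagessan: the directory paths are placeholders, and the flags shown (--lang, --langdata_dir, --tessdata_dir, --output_dir) are the ones documented for the 3.04-era tesstrain.sh, so check them against the installed script. No font list is passed here, on the assumption that the fonts come from language-specific.sh in the langdata setup.

```python
import shlex
import subprocess
import sys


def build_tesstrain_command(lang, langdata_dir, tessdata_dir, output_dir):
    """Assemble the argument list for one tesstrain.sh run."""
    return [
        "tesstrain.sh",
        "--lang", lang,
        "--langdata_dir", langdata_dir,
        "--tessdata_dir", tessdata_dir,
        "--output_dir", output_dir,
    ]


if __name__ == "__main__":
    # Placeholder paths (assumptions); adjust to the local checkout.
    cmd = build_tesstrain_command("san", "./langdata", "./tessdata", "./san-output")
    print(shlex.join(cmd))
    # Only launch the training when explicitly asked to.
    if "--run" in sys.argv[1:]:
        subprocess.run(cmd, check=True)
```

Printing the command before running it makes it easy to paste into a shell and to keep a record of the exact arguments behind each traineddata file.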

For a GUI, I use VietOCR: http://vietocr.sourceforge.net/

ShreeDevi