recognising roman with sanskrit diacritics

yajva

unread,

Jun 19, 2018, 3:47:49 PM6/19/18

to tesseract-ocr

I have tried Google OCR for recognizing Sanskrit text in Roman with diacritics (IAST). It recognizes above macron but not dots below also joining grave and accent. Is there any traineddata available for tesseract that can do this with good accuracy ? Attached a sample page that I am interested in.

img-0108.png

Shree Devi Kumar

unread,

Jun 20, 2018, 5:45:54 AM6/20/18

to tesser...@googlegroups.com

I had done a training for sanskrit for both devanagari and IAST but it does not include cedilla for Sh

I will add it and let you know.

On Wed 20 Jun, 2018, 1:17 AM yajva, <nsvnar...@gmail.com> wrote:

I have tried Google OCR for recognizing Sanskrit text in Roman with diacritics (IAST). It recognizes above macron but not dots below also joining grave and accent. Is there any traineddata available for tesseract that can do this with good accuracy ? Attached a sample page that I am interested in.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/aef0797b-8df3-4db7-9a3b-02f62d2e5a28%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Shree Devi Kumar

unread,

Jun 20, 2018, 11:35:01 AM6/20/18

to tesser...@googlegroups.com

I am attaching the OCRed text. Please correct it so that I can use as groundtruth for further training and testing.

--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

iast-Sanskrit-IAST.txt

yajva

unread,

Jun 21, 2018, 2:04:00 PM6/21/18

to tesseract-ocr

done

iast-Sanskrit-IAST.txt

yajva

unread,

Jun 21, 2018, 2:08:00 PM6/21/18

to tesseract-ocr

one more correction.

iast-Sanskrit-IAST.txt

Shree Devi Kumar

unread,

Jun 22, 2018, 10:27:20 AM6/22/18

to tesser...@googlegroups.com

Please try with iast.traineddata model for tesseract.4.0.0-beta posted at https://github.com/Shreeshrii/tessdata_sanskrit

To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/a7bdf637-7f17-4eb3-8fa8-297018633bfa%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Shree Devi Kumar

unread,

Jun 22, 2018, 10:43:20 AM6/22/18

to tesser...@googlegroups.com

Sorry, there seems to be some regression in the file posted on github. I will upload again later.

Shree Devi Kumar

unread,

Jun 23, 2018, 12:16:08 PM6/23/18

to tesser...@googlegroups.com

Please test with traineddata file from https://github.com/Shreeshrii/tessdata_sanskrit/tree/master/iast-plus1

Need to check that is it not overfitted.

Please share a couple more images which I can use for testing.

On Thu, Jun 21, 2018 at 11:38 PM yajva <nsvnar...@gmail.com> wrote:

To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/a7bdf637-7f17-4eb3-8fa8-297018633bfa%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

yajva

unread,

Jun 26, 2018, 7:28:41 AM6/26/18

to tesseract-ocr

Sorry for the delay, my system was down.

I am getting "Page not Found" for the link given. Can you pl re-check?

Here's the doc I am trying to OCR

bub_gb_69soAAAAYAAJ.pdf

Shree Devi Kumar

unread,

Jun 26, 2018, 10:18:28 AM6/26/18

to tesser...@googlegroups.com

Traineddata file is attached for use with tesseract4.0.0-beta.

How did you create the test png from the pdf? I am not getting as good quality, tried various settings with irfanview.

To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/81b2b741-471c-45a5-adef-48330d960d62%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

bub-2-iast-plus-3600.txt

iast-plus-3600.traineddata

iast-iast-plus-3600.txt

bub-1.png

bub-1-iast-plus-3600.txt

bub-2.png

iast.png

yajva

unread,

Jun 26, 2018, 1:29:07 PM6/26/18

to tesseract-ocr

The doc is diff ver of the same text. Here's the doc used for the first. png. This is slightly darker, but the one sent earlier is cleaner. Let me know which is more amenable for OCRing. I use PDF Shaper to extract images and convert to png using xnview.

Karmapradīpa_I_Prapāṭhaka.pdf

Shree Devi Kumar

unread,

Jun 26, 2018, 1:36:06 PM6/26/18

to tesser...@googlegroups.com

I had used ghostview to convert PDF to tif or png.

You can ocr PDF directly with gimagereader using the traineddata file I sent.

See links for new windows binaries in msg below.

At last, here are some fresh builds:

https://smani.fedorapeople.org/tmp/gImageReader_3.2.99_qt5_i686_tesseract4.git87635c1.exe
https://smani.fedorapeople.org/tmp/gImageReader_3.2.99_qt5_x86_64_tesseract4.git87635c1.exe

I'd be also interested in testing of the tessdata manager, which should now also properly handle script tessdatas

To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/ed565236-146d-4902-b3e2-13445939a2f4%40googlegroups.com.

yajva

unread,

Jun 27, 2018, 7:34:48 AM6/27/18

to tesseract-ocr

Checked with both light & dark pdfs. The results are very good. Thanks.

A few concerns. E is consistently missed in both. J is missed consistently in darker image but recognized as T in dark image. ṝ is recognized as ṛ consistently. Can these be addressed ?
I am using tesseract 4 alpha windows build from command line.

Are the dev files in repos ?

Shree Devi Kumar

unread,

Jun 27, 2018, 9:17:16 AM6/27/18

to tesser...@googlegroups.com

ok. I will take a look.

To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/f942f9b9-a767-4d9e-9de7-0855179db9b5%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Shree Devi Kumar

unread,

Jun 30, 2018, 2:51:19 PM6/30/18

to tesser...@googlegroups.com

I have uploaded a new version of traineddata file at

https://github.com/Shreeshrii/tessdata_shreetest/blob/master/iast-layer-18003.traineddata

Attached is the OCRed output for pages 13-24 of dark pdf with it.

I am still training a different variation.

Karmapradīpa_I_Prapāṭhaka.txt

yajva

unread,

Jul 2, 2018, 4:35:47 AM7/2/18

to tesseract-ocr

Many thanks. Downloaded and using.
Will wait for next ver.

yajva

unread,

Jul 11, 2018, 10:14:46 AM7/11/18

to tesseract-ocr

shree
namaste

I am trying to OCR the attached image. Getting not so good results. Even for text which is apparently clear. Eg. in the first line, B is recognized as H, under dot for 't' in 'most' 4th line etc. The image has warping but
still best/Latin and Google OCR produce better results. Is it possible to add diacritics to Latin? Can you help in any way?

regards
Venkatesh

03.png

Shree Devi Kumar

unread,

Jul 11, 2018, 3:42:25 PM7/11/18

to tesser...@googlegroups.com

What about ocr with

eng+iast

To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/1692f4a3-f536-4e57-b666-5f0c6155514e%40googlegroups.com.

yajva

unread,

Jul 12, 2018, 4:48:55 AM7/12/18

to tesseract-ocr

eng+iast-plus-3600 => no diacritics at all
Latin+iast-plus-3600 => only macrons none other

Shree Devi Kumar

unread,

Jul 12, 2018, 12:14:33 PM7/12/18

to tesser...@googlegroups.com

Thank you for your feedback of eng+

I will try training for this and get back.

To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/d2439fb8-2fa7-4988-8b5f-ea23f0fbf4f4%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Reply all

Reply to author

Forward