recognising roman with sanskrit diacritics

yajva

Jun 19, 2018, 3:47:49 PM
to tesseract-ocr
I have tried Google OCR for recognizing Sanskrit text in Roman script with diacritics (IAST). It recognizes the macron above but not the dots below, nor the joining grave and accent marks. Is there any traineddata available for Tesseract that can do this with good accuracy? Attached is a sample page that I am interested in.
img-0108.png

Shree Devi Kumar

Jun 20, 2018, 5:45:54 AM
to tesser...@googlegroups.com
I have trained a Sanskrit model for both Devanagari and IAST, but it does not include the cedilla for Sh.

I will add it and let you know.


--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/aef0797b-8df3-4db7-9a3b-02f62d2e5a28%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Shree Devi Kumar

Jun 20, 2018, 11:35:01 AM
to tesser...@googlegroups.com
I am attaching the OCRed text. Please correct it so that I can use it as ground truth for further training and testing.
iast-Sanskrit-IAST.txt

yajva

Jun 21, 2018, 2:04:00 PM
to tesseract-ocr
Done.
iast-Sanskrit-IAST.txt

yajva

Jun 21, 2018, 2:08:00 PM
to tesseract-ocr
One more correction.
iast-Sanskrit-IAST.txt

Shree Devi Kumar

Jun 22, 2018, 10:27:20 AM
to tesser...@googlegroups.com
Please try the iast.traineddata model for Tesseract 4.0.0-beta, posted at https://github.com/Shreeshrii/tessdata_sanskrit
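For anyone wanting to script this, here is a minimal sketch of calling the tesseract command line from Python with a custom model. The image name, output base, and `./tessdata` folder are placeholders (assumptions, not from the thread), and it presumes iast.traineddata has already been copied into that folder.

```python
import os
import shutil
import subprocess


def build_tesseract_cmd(image, out_base, langs, tessdata_dir=None):
    """Return the argument list for: tesseract IMAGE OUTBASE -l LANG[+LANG...]."""
    cmd = ["tesseract", image, out_base, "-l", "+".join(langs)]
    if tessdata_dir:
        # --tessdata-dir tells tesseract where the *.traineddata files live.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


if __name__ == "__main__":
    # Hypothetical paths; replace with your own image and tessdata folder.
    image = "img-0108.png"
    cmd = build_tesseract_cmd(image, "img-0108", ["iast"], "./tessdata")
    if shutil.which("tesseract") and os.path.exists(image):
        subprocess.run(cmd, check=True)  # writes img-0108.txt
    else:
        print("Would run:", " ".join(cmd))
```

Passing several names in `langs` produces a combined value such as `eng+iast`.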

Shree Devi Kumar

Jun 22, 2018, 10:43:20 AM
to tesser...@googlegroups.com
Sorry, there seems to be some regression in the file posted on GitHub. I will upload it again later.

Shree Devi Kumar

Jun 23, 2018, 12:16:08 PM
to tesser...@googlegroups.com
I need to check that it is not overfitted.

Please share a couple more images that I can use for testing.

yajva

Jun 26, 2018, 7:28:41 AM
to tesseract-ocr
Sorry for the delay, my system was down.

I am getting "Page not Found" for the link given. Can you please re-check?

Here's the doc I am trying to OCR:
bub_gb_69soAAAAYAAJ.pdf

Shree Devi Kumar

Jun 26, 2018, 10:18:28 AM
to tesser...@googlegroups.com
The traineddata file is attached for use with Tesseract 4.0.0-beta.

How did you create the test PNG from the PDF? I am not getting quality as good, even after trying various settings in IrfanView.

bub-2-iast-plus-3600.txt
iast-plus-3600.traineddata
iast-iast-plus-3600.txt
bub-1.png
bub-1-iast-plus-3600.txt
bub-2.png
iast.png

yajva

Jun 26, 2018, 1:29:07 PM
to tesseract-ocr
That doc is a different version of the same text. Here's the doc used for the first PNG. This one is slightly darker, but the one sent earlier is cleaner. Let me know which is more amenable to OCR. I use PDF Shaper to extract the images and convert them to PNG using XnView.
Karmapradīpa_I_Prapāṭhaka.pdf

Shree Devi Kumar

Jun 26, 2018, 1:36:06 PM
to tesser...@googlegroups.com
I used ghostview to convert the PDF to TIFF or PNG.
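One scriptable way to do this kind of conversion is sketched below, assuming the Ghostscript command-line program (`gs`; typically `gswin64c` on Windows) is installed. This may not be the exact tool or settings used above, and the 300 dpi resolution and output file pattern are illustrative assumptions.

```python
import os
import shutil
import subprocess


def build_gs_cmd(pdf_path, out_pattern, dpi=300, gs_exe="gs"):
    """Return a Ghostscript command that renders each PDF page to a PNG file."""
    return [
        gs_exe,
        "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=png16m",  # 24-bit colour PNG; pnggray is a grayscale alternative
        f"-r{dpi}",         # rendering resolution in dots per inch
        f"-sOutputFile={out_pattern}",  # %03d is replaced by the page number
        pdf_path,
    ]


if __name__ == "__main__":
    # Hypothetical file names; replace with your own PDF.
    pdf = "input.pdf"
    cmd = build_gs_cmd(pdf, "page-%03d.png")
    if shutil.which(cmd[0]) and os.path.exists(pdf):
        subprocess.run(cmd, check=True)
    else:
        print("Would run:", " ".join(cmd))
```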

You can OCR the PDF directly with gImageReader using the traineddata file I sent.

See the links to new Windows binaries in the message below.


At last, here are some fresh builds:

https://smani.fedorapeople.org/tmp/gImageReader_3.2.99_qt5_i686_tesseract4.git87635c1.exe
https://smani.fedorapeople.org/tmp/gImageReader_3.2.99_qt5_x86_64_tesseract4.git87635c1.exe

I'd be also interested in testing of the tessdata manager, which should now also properly handle script tessdatas


yajva

Jun 27, 2018, 7:34:48 AM
to tesseract-ocr
Checked with both the light and dark PDFs. The results are very good. Thanks.

A few concerns: E is consistently missed in both. J is consistently missed in the darker image but recognized as T in the dark image. ṝ is consistently recognized as ṛ. Can these be addressed?
I am using the Tesseract 4 alpha Windows build from the command line.

Are the dev files in the repos?
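Reports like the one above can be made reproducible by diffing the OCR output against corrected ground truth. The following is a minimal sketch using only the Python standard library; the function name is hypothetical and the character-level alignment is a simple approximation, not a full OCR evaluation tool.

```python
import unicodedata
from collections import Counter
from difflib import SequenceMatcher


def confusion_counts(truth, ocr):
    """Count (expected, recognized) substring pairs where OCR differs from truth."""
    # NFC keeps letters such as "ṝ" as single code points so they align cleanly.
    truth = unicodedata.normalize("NFC", truth)
    ocr = unicodedata.normalize("NFC", ocr)
    counts = Counter()
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # An empty second element means the text was dropped by the OCR.
            counts[(truth[i1:i2], ocr[j1:j2])] += 1
    return counts


if __name__ == "__main__":
    # Toy strings for illustration only.
    for pair, n in confusion_counts("pitṝn", "pitṛn").most_common():
        print(pair, n)
```

Sorting the result with `most_common()` puts systematic confusions, such as ṝ read as ṛ, at the top of the list.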

Shree Devi Kumar

Jun 27, 2018, 9:17:16 AM
to tesser...@googlegroups.com
OK. I will take a look.


Shree Devi Kumar

Jun 30, 2018, 2:51:19 PM
to tesser...@googlegroups.com
I have uploaded a new version of the traineddata file at 

Attached is the OCRed output it gives for pages 13-24 of the dark PDF.

I am still training a different variation.


Karmapradīpa_I_Prapāṭhaka.txt

yajva

Jul 2, 2018, 4:35:47 AM
to tesseract-ocr
Many thanks. I have downloaded it and am using it.
Will wait for the next version.

yajva

Jul 11, 2018, 10:14:46 AM
to tesseract-ocr
Shree,
Namaste.

I am trying to OCR the attached image and am getting not-so-good results, even for text that is apparently clear. E.g., in the first line, B is recognized as H; underdot for 't' in 'most' in the 4th line; etc. The image has warping, but best/Latin and Google OCR still produce better results. Is it possible to add diacritics to Latin? Can you help in any way?

Regards,
Venkatesh
03.png

Shree Devi Kumar

Jul 11, 2018, 3:42:25 PM
to tesser...@googlegroups.com
What about OCR with eng+iast?

yajva

Jul 12, 2018, 4:48:55 AM
to tesseract-ocr
eng+iast-plus-3600 => no diacritics at all
Latin+iast-plus-3600 => only macrons, nothing else
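To put numbers on observations like these, the OCR text from each language combination can be checked for which diacritics it actually contains. This is a minimal sketch with a hypothetical helper name; it only tallies Unicode combining marks and does not judge whether they are in the right place.

```python
import unicodedata
from collections import Counter


def combining_mark_counts(text):
    """Count combining marks (macron, dot below, ...) found in OCR output."""
    counts = Counter()
    # NFD splits precomposed letters such as "ā" into base letter + mark.
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            counts[unicodedata.name(ch, "UNKNOWN")] += 1
    return counts


if __name__ == "__main__":
    # Toy sample for illustration; in practice read each OCR output file.
    print(combining_mark_counts("Karmapradīpa Prapāṭhaka"))
```

An output with macrons but no dots below would show only `COMBINING MACRON` in the tally, which matches the Latin+iast-plus-3600 description above.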

Shree Devi Kumar

Jul 12, 2018, 12:14:33 PM
to tesser...@googlegroups.com
Thank you for your feedback on eng+

I will try training for this and get back.