Need help regarding pre-processing of images before OCR


Shree Devi Kumar

Aug 23, 2013, 9:08:44 AM
to tesser...@googlegroups.com
I want to OCR a Sanskrit book that is available as a PDF.

I used GSview to save all the pages as PNG files, then used Scan Tailor to deskew the images, which saved them as TIFFs.
Then I used IrfanView to apply blur and median filters, since the text is very grainy in the original, and also resized the pages to a smaller size.

The image pre-processed as above gives a better OCR result than the original.
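
For readers unfamiliar with the median filter mentioned above, here is a minimal pure-Python sketch of the idea. It is not what IrfanView does internally. It assumes the page is already loaded as a grayscale grid (a list of rows of 0-255 values); the name `median_filter` and the `radius` parameter are illustrative only. Each pixel is replaced by the median of its neighbourhood, which removes isolated specks while blurring edges less than a plain average would.

```python
from statistics import median


def median_filter(pixels, radius=1):
    """Return a filtered copy of a grayscale image.

    `pixels` is a list of rows of ints in 0-255 (an assumed in-memory format).
    Each output pixel is the median of the surrounding square window of side
    2 * radius + 1; windows are clipped at the image border.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                pixels[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            # median() may return a .5 value for even-sized border windows
            row.append(int(median(window)))
        result.append(row)
    return result


# A black (0) patch with one white (255) speck in the middle, similar to the
# grain inside a printed character.
speck = [[0] * 5 for _ in range(5)]
speck[2][2] = 255
cleaned = median_filter(speck)
```

For a 500-page job, an image library or a tool such as ImageMagick would be much faster; the sketch only shows what the filter does to a single speck.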

I would like to know if there is a simpler or better method to pre-process the images. The PDF is 500+ pages.

I am attaching a single page from the pdf and the processed image file.

Thanks,
Shree
mnt-031.png
mnt-test.pdf

mns_rao

Aug 23, 2013, 12:51:49 PM
to tesser...@googlegroups.com
Hi,
The OCR result also depends on the traineddata file for the language of the input image. If you have a good traineddata file for Sanskrit, you can use FreeOCR 4.2 (http://www.paperfile.net/): go to settings --> open language folder and paste the file there. FreeOCR 4.2 can OCR an entire PDF book in one click: load it with 'open PDF', then choose OCR --> ocr all pages. Try the original book first and, if the result is not satisfactory, convert the cleaned images back into a PDF book and try again.
I also need a Sanskrit traineddata file, if you can spare it.
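
A command-line alternative to the one-click approach can be sketched in Python. This assumes the `tesseract` executable is on the PATH and that a Sanskrit `san.traineddata` file is installed; the `*.tif` pattern and the helper names `tesseract_command` and `ocr_folder` are hypothetical. It relies only on the basic `tesseract image outputbase -l lang` invocation.

```python
import subprocess
from pathlib import Path


def tesseract_command(image_path, lang="san"):
    """Build the argument list for OCR of one page image.

    Tesseract appends ".txt" to the output base itself, so the base is
    simply the image path without its extension.
    """
    image_path = Path(image_path)
    output_base = image_path.with_suffix("")
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_folder(folder, pattern="*.tif", lang="san"):
    """Run Tesseract on every matching image in `folder`, in name order."""
    for image_path in sorted(Path(folder).glob(pattern)):
        # check=True stops the batch at the first page that fails
        subprocess.run(tesseract_command(image_path, lang), check=True)
```

Calling `ocr_folder("pages")` on a folder of cleaned TIFFs (the folder name is only an example) would leave one text file next to each image.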
Wishing success,
MNS Rao

Sven Pedersen

Aug 23, 2013, 2:38:53 PM
to tesser...@googlegroups.com
The white areas within the characters in the PNG version are likely to confuse tesseract about the character shapes. Perhaps you can do something to improve that? I think someone has posted methods for dealing with that recently.
--Sven


--
--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to tesser...@googlegroups.com
To unsubscribe from this group, send email to
tesseract-oc...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
 

--
``All that is gold does not glitter,
  not all those who wander are lost;
the old that is strong does not wither,
  deep roots are not reached by the frost.
From the ashes a fire shall be woken,
  a light from the shadows shall spring;
renewed shall be blade that was broken,
  the crownless again shall be king.”



Shree Devi Kumar

Aug 24, 2013, 2:15:33 AM
to tesser...@googlegroups.com
Thanks, Sven.

Yes, that's the kind of improvement I am looking for. I have read that ImageMagick is helpful for fixing images. I'll give it a try.

I was hoping that someone in the group would mention the settings they used to fix similar grainy images.

Shree

Shree Devi Kumar
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Shree Devi Kumar

Aug 26, 2013, 1:11:54 AM
to Sriranga(79yrs), tesser...@googlegroups.com
Thanks for the suggestions. The original PDF is 75 MB, so I attached only a single page.

I have been able to pre-process the images so that the grainy areas turn black. My son used a Gaussian blur and then changed the black level to 150% in Photoshop.
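
A rough sketch of the blur-then-darken idea, in pure Python, for anyone without Photoshop. Two assumptions are made loudly here: the Gaussian blur is approximated by a 3x3 binomial kernel, and the Photoshop black-level change is approximated by a simple cutoff (everything darker than `cutoff` becomes black). The names `blur3x3`, `darken` and the value 200 are illustrative, not the settings actually used.

```python
# 3x3 binomial kernel, a common small approximation of a Gaussian
KERNEL = ((1, 2, 1), (2, 4, 2), (1, 2, 1))


def blur3x3(pixels):
    """Blur a grayscale image given as a list of rows of 0-255 ints.

    Border pixels use only the kernel cells that fall inside the image,
    renormalised by the sum of the weights actually used.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            total = 0
            weight = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width:
                        k = KERNEL[dy + 1][dx + 1]
                        total += k * pixels[ny][nx]
                        weight += k
            row.append(total // weight)
        out.append(row)
    return out


def darken(pixels, cutoff=200):
    """Assumed stand-in for the black-level step: values below cutoff become 0."""
    return [[0 if value < cutoff else 255 for value in row] for row in pixels]


# A black stroke (0) with one white grain (255) inside it
stroke = [[0] * 5 for _ in range(5)]
stroke[2][2] = 255
filled = darken(blur3x3(stroke))
```

The blur turns the isolated white grain into a dark gray, and the cutoff then pushes it to black. A high cutoff also thickens strokes slightly, so the value would need tuning on real pages.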

I plan to add a few of those images to my Sanskrit training data so that the character shapes match the typeface of the book, and I will share that traineddata.

Shree

Shree Devi Kumar
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


On Sat, Aug 24, 2013 at 9:09 AM, Sriranga(79yrs) <withblessing....@gmail.com> wrote:
Shree,
It would be better to upload the original PDF without pre-processing, along with the traineddata file, since OCR failed in FreeOCR as well as gImageReader when tested using san.traineddata. This is the first time I have experienced this.
With blessings,
sriranga(79yrs)



Shree Devi Kumar

Sep 19, 2014, 4:07:46 AM
to tesser...@googlegroups.com, mns_rao
Do you still need a copy of the Sanskrit traineddata?

Shree Devi Kumar
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
