Improve the tesseract output from old newspapers

Claudi Ruiz

May 22, 2015, 5:15:31 AM

to tesser...@googlegroups.com

Goal: Improve the tesseract output as much as possible.

Difficulties: varying character sizes and poor image quality.

Already done: binarize, dilate and erode.

Do you have any idea how to improve my output? Thank you in advance.

P.S. I have already checked: https://code.google.com/p/tesseract-ocr/wiki/ImproveQuality

result.jpg

Claudi Ruiz

May 26, 2015, 4:24:51 AM

to tesser...@googlegroups.com

How can I improve detection for the title specifically?

Art Rhyno.

May 27, 2015, 12:21:36 PM

to tesser...@googlegroups.com

You could try leveraging the coordinates for the words (available in the hOCR output) or the letters themselves (via the API) and doing different processing for the title based on the size of the letters. Difference of Gaussians or another type of filter could thin the letters out, and you could also try tesseract in single character mode (page segmentation mode 10) if you can isolate each letter. The bane of OCR for old newspapers tends to be multi-columned printing, in which case a separate segmentation tool, like Olena, can be invaluable, but your sample does not suggest that columns are a factor.
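As a rough illustration of the size-based idea, here is a minimal Python sketch that reads word boxes from hOCR text and separates unusually tall words (likely the title) from the rest. It uses only the standard library. The names `HocrWords` and `split_by_height`, and the `ratio=1.5` cutoff, are made up for this example and are not part of tesseract. It assumes the usual hOCR layout, where each word is a `span` with class `ocrx_word` and a `title` attribute beginning with `bbox x0 y0 x1 y1`.

```python
import re
from html.parser import HTMLParser
from statistics import median


class HocrWords(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span in hOCR."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word span currently open, if any
        self._text = []
        self._depth = 0     # spans nested inside the open word span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._bbox is not None:
            self._depth += 1          # nested span inside a word
            return
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or ""):
            m = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attrs.get("title") or "")
            if m:
                self._bbox = tuple(int(v) for v in m.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or self._bbox is None:
            return
        if self._depth:
            self._depth -= 1
            return
        text = "".join(self._text).strip()
        if text:
            self.words.append((text, self._bbox))
        self._bbox = None


def split_by_height(hocr, ratio=1.5):
    """Return (large, normal) word lists, split at ratio * median box height.

    The median-based cutoff is an assumption for illustration; it only works
    when body text makes up most of the words on the page.
    """
    parser = HocrWords()
    parser.feed(hocr)
    if not parser.words:
        return [], []
    heights = [y1 - y0 for _, (_, y0, _, y1) in parser.words]
    cutoff = ratio * median(heights)
    large = [w for w, h in zip(parser.words, heights) if h > cutoff]
    normal = [w for w, h in zip(parser.words, heights) if h <= cutoff]
    return large, normal
```

The boxes in `large` could then be cropped from the page image, given their own preprocessing (for example thinning), and passed back to tesseract separately from the body text.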

art

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/fc27a199-6df6-4533-9693-641ed5c460be%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Claudi Ruiz

May 29, 2015, 5:48:22 AM

to tesser...@googlegroups.com

Thank you very much, Art Rhyno. That sounds good. I will try it and see if it works better.
