How to best OCR a page with mixed text and images?

Chris Shearer Cooper

Sep 30, 2013, 10:24:24 PM

to tesser...@googlegroups.com

Is there some way to analyze the image (maybe something in Leptonica) before sending it to Tesseract so that I can prevent Tesseract from trying to extract text from pictures on the page?

Or is there a Tesseract setting or extra function call I can make to do this?

Thanks,

Chris

Art W Rhyno

Oct 1, 2013, 1:27:27 PM

to tesser...@googlegroups.com

The Olena project [1] has some great tools to identify text and images on historical pages. Look for the "content_in_hdoc" program for example. If the identification looks close enough, you could extract and pass to tesseract those regions that have been classed as text.

art
---
1. http://www.lrde.epita.fr/cgi-bin/twiki/view/Olena/
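
The "extract the text regions and pass only those to Tesseract" idea can be sketched as below. This is a minimal illustration, not Olena's or Tesseract's API: the region tuples, the "text" label, and the `ocr` callable are all hypothetical stand-ins for whatever the layout-analysis step and OCR engine actually provide.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical region format: (label, left, top, right, bottom).
# Right and bottom are exclusive, as in Python slicing.
Region = Tuple[str, int, int, int, int]
Image = List[List[int]]  # row-major grid of grayscale pixel values


def crop(image: Sequence[Sequence[int]], left: int, top: int,
         right: int, bottom: int) -> Image:
    """Return the sub-image covering rows top..bottom-1, columns left..right-1."""
    return [list(row[left:right]) for row in image[top:bottom]]


def ocr_text_regions(image: Sequence[Sequence[int]],
                     regions: Sequence[Region],
                     ocr: Callable[[Image], str]) -> List[str]:
    """Run `ocr` only on regions labelled "text"; pictures are skipped."""
    results = []
    for label, left, top, right, bottom in regions:
        if label != "text":
            continue  # never hand picture regions to the OCR engine
        results.append(ocr(crop(image, left, top, right, bottom)))
    return results
```

In real use, `ocr` would wrap a call into the OCR engine and `regions` would come from the page-segmentation tool; both are left abstract here so the sketch does not depend on any particular library.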

ch...@sc3.net

Oct 2, 2013, 5:06:38 PM

to tesser...@googlegroups.com

Alas, I'm building in Visual Studio on Windows, and it looks like Olena/Milena doesn't support that platform.

Buyi Wen

Nov 11, 2015, 2:15:08 AM

to tesseract-ocr

According to the Tesseract documentation, the minimum font size that can be recognized is 9 px, and the best resolution is 300 DPI. So before running OCR, I always scale the image to 1.5 times its original size to make sure the font is big enough to read. I also made a free online tool to extract text from images: http://www.online-code.net/ocr.html. You may find it gives more accurate results than plain Tesseract.
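
The scaling step described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are made up for this example, the 300 DPI target and 1.5 factor are the figures quoted in the post, and nearest-neighbour resampling is used purely to keep the code dependency-free (an image library with smoother interpolation would normally be a better choice).

```python
from typing import List, Sequence, Tuple


def scale_factor_for_dpi(current_dpi: float, target_dpi: float = 300.0) -> float:
    """Factor that brings a scan up to target_dpi; never shrinks the image."""
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    return max(1.0, target_dpi / current_dpi)


def scaled_size(width: int, height: int, factor: float = 1.5) -> Tuple[int, int]:
    """Pixel dimensions after scaling by `factor`."""
    return (round(width * factor), round(height * factor))


def upscale_nearest(image: Sequence[Sequence[int]],
                    factor: float = 1.5) -> List[List[int]]:
    """Nearest-neighbour upscale of a row-major pixel grid."""
    src_h, src_w = len(image), len(image[0])
    dst_w, dst_h = scaled_size(src_w, src_h, factor)
    return [
        [
            # Map each destination pixel back to its source pixel,
            # clamping to the last row/column.
            image[min(src_h - 1, int(y / factor))][min(src_w - 1, int(x / factor))]
            for x in range(dst_w)
        ]
        for y in range(dst_h)
    ]
```

For example, a 200 DPI scan would get a factor of 1.5 from `scale_factor_for_dpi(200)`, which matches the fixed factor mentioned in the post.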

Sriranga(83yrsold)

Nov 11, 2015, 2:37:15 AM

to tesser...@googlegroups.com

I find that http://www.online-code.net/ocr.html does not support Kannada (kan), one of the Indic languages. Help solicited.
