How to best OCR a page with mixed text and images?

Chris Shearer Cooper

Sep 30, 2013, 10:24:24 PM
to tesser...@googlegroups.com
Is there some way to analyze the image (maybe something in Leptonica) before sending it to Tesseract so that I can prevent Tesseract from trying to extract text from pictures on the page?

Or is there a Tesseract setting or extra function call I can make to do this?

Thanks,
Chris

Art W Rhyno

Oct 1, 2013, 1:27:27 PM
to tesser...@googlegroups.com
The Olena project [1] has some great tools for identifying text and images on historical pages. Look for the "content_in_hdoc" program, for example. If the identification looks close enough, you could extract the regions that have been classed as text and pass only those to Tesseract.

art
---
1. http://www.lrde.epita.fr/cgi-bin/twiki/view/Olena/
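
The "pass only the text regions" step could look roughly like the sketch below. This is not Olena's or Tesseract's API: the `Region` tuple, the "text"/"image" labels, and `crop_text_regions` are hypothetical names, and the page is modeled as a plain list of pixel rows so the example runs without any imaging library.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    # Hypothetical layout-analysis result: a labelled bounding box.
    label: str   # "text" or "image" (assumed labels)
    left: int
    top: int
    right: int   # exclusive
    bottom: int  # exclusive


Page = List[List[int]]  # grayscale pixels, one inner list per row


def crop(page: Page, region: Region) -> Page:
    """Return the sub-image covered by the region's bounding box."""
    return [row[region.left:region.right]
            for row in page[region.top:region.bottom]]


def crop_text_regions(page: Page, regions: List[Region]) -> List[Page]:
    """Keep only regions labelled as text; picture regions are skipped."""
    return [crop(page, r) for r in regions if r.label == "text"]


# Toy 4x6 page: left half stands in for text (1), right half for a picture (9).
page = [[1, 1, 1, 9, 9, 9] for _ in range(4)]
regions = [
    Region("text", 0, 0, 3, 4),
    Region("image", 3, 0, 6, 4),
]
text_crops = crop_text_regions(page, regions)
```

Each crop in `text_crops` would then go to the OCR engine on its own, so the picture area never reaches it.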

ch...@sc3.net

Oct 2, 2013, 5:06:38 PM
to tesser...@googlegroups.com
Alas, I'm building in Visual Studio on Windows, and it looks like Olena/Milena doesn't support that platform.

Buyi Wen

Nov 11, 2015, 2:15:08 AM
to tesseract-ocr
As the Tesseract documentation says, the minimum font size that can be recognized is 9 px, and the best resolution is 300 dpi. So before running OCR, I always scale the image to 1.5 times its original size to make sure the font is big enough to read. I also made a free online tool to extract text from images: http://www.online-code.net/ocr.html. You may find it more accurate than normal Tesseract OCR results.
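
As a rough illustration of the 1.5x upscaling step, here is a minimal sketch using nearest-neighbour sampling on an image held as a list of pixel rows. The function name `upscale_nearest` is made up for this example, and nearest-neighbour is only an assumption chosen to keep the code dependency-free; in practice an image library's resize with smoother interpolation would probably be used instead.

```python
from typing import List

Page = List[List[int]]  # grayscale pixels, one inner list per row


def upscale_nearest(page: Page, factor: float) -> Page:
    """Resize the image by `factor` using nearest-neighbour sampling."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    src_h = len(page)
    src_w = len(page[0]) if page else 0
    dst_h = int(src_h * factor)
    dst_w = int(src_w * factor)
    return [
        [
            # Map each destination pixel back to its source pixel,
            # clamping so the index never runs past the last row/column.
            page[min(int(y / factor), src_h - 1)][min(int(x / factor), src_w - 1)]
            for x in range(dst_w)
        ]
        for y in range(dst_h)
    ]


small = [[1, 2],
         [3, 4]]
scaled = upscale_nearest(small, 1.5)  # 2x2 image becomes 3x3
```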

Sriranga(83yrsold)

Nov 11, 2015, 2:37:15 AM
to tesser...@googlegroups.com
I find that http://www.online-code.net/ocr.html does not support Kannada (kan), one of the Indic languages. Any help would be appreciated.
