How to extract bounding boxes only, if I do not need the word/character classifier?

jinh...@google.com

Oct 28, 2015, 4:46:14 AM

to tesseract-ocr

Hi,

First, I have very little knowledge of OCR or Tesseract.

We use Tesseract OCR to detect the text area of a given image, which we then use to calculate image quality (the smaller the text-area ratio, the better). We don't use the recognized text, only the bounding boxes of words.

The problem is that some images contain a lot of Chinese or Russian characters. In those cases OCR often takes more than 20 seconds, which is unacceptable. As an online interactive service, we cannot make our customers wait that long.

Are there parameters I can tweak to speed up OCR, given that we only need the text box areas? Or can I just call a method that performs only page layout analysis?

Assume the text in the images is rarely rotated. The images come from customers' websites, and their readability is not bad.

Please help.

Tom Morris

Oct 28, 2015, 1:18:56 PM

to tesseract-ocr

On Wednesday, October 28, 2015 at 4:46:14 AM UTC-4, jinh...@google.com wrote:

First, I have very little knowledge of OCR or Tesseract.

...

Please help.

If only you worked for Google, you could probably get help directly from the Google software engineers.

Oh, wait. You DO work for Google.

umesh pandey

Oct 28, 2015, 2:15:54 PM

to tesseract-ocr

You need text detection, not recognition, to get bounding boxes. One well-known algorithm for this is MSER (Maximally Stable Extremal Regions), and another is ERStat. Both are available in the text detection module of opencv_contrib.
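A rough sketch of the MSER route in Python, in case it helps. It assumes the OpenCV Python bindings (`cv2.MSER_create` and `detectRegions`); the helper names and the size/aspect thresholds are made up for illustration and would need tuning on real images. MSER returns many non-text regions too, so some filtering is usually needed afterwards.

```python
def keep_plausible(boxes, min_area=30, max_aspect=10.0):
    """Drop boxes that are tiny or extremely elongated.

    Hypothetical heuristic: the thresholds are illustrative guesses,
    not values recommended by OpenCV.
    """
    kept = []
    for x, y, w, h in boxes:
        if w <= 0 or h <= 0:
            continue
        aspect = max(w / h, h / w)
        if w * h >= min_area and aspect <= max_aspect:
            kept.append((int(x), int(y), int(w), int(h)))
    return kept


def mser_boxes(image_path):
    """Return candidate (x, y, w, h) boxes found by MSER."""
    import cv2  # imported lazily so keep_plausible works without OpenCV

    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(image_path)
    mser = cv2.MSER_create()
    # detectRegions returns the region point sets and their bounding boxes
    _regions, bboxes = mser.detectRegions(gray)
    return keep_plausible(bboxes)
```

Calling `mser_boxes("page.png")` (a placeholder path) should give a list of candidate boxes. These are character-level blobs rather than words, so they would still need grouping if word boxes are required.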

Quan Nguyen

Oct 30, 2015, 6:23:18 AM

to tesseract-ocr

Have you tried the GetComponentImages example?

https://code.google.com/p/tesseract-ocr/wiki/APIExample
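For reference, a minimal Python sketch of that idea, assuming the third-party `tesserocr` bindings (which expose `GetComponentImages`) and Pillow are installed. As far as I understand, this call runs layout analysis and returns component boxes without requesting the recognized text, but I have not measured whether it is fast enough for the 20-second case. The function names `word_boxes` and `text_area_ratio` are my own, not part of Tesseract.

```python
def text_area_ratio(boxes, image_width, image_height):
    """Fraction of the image covered by (x, y, w, h) boxes.

    Overlapping boxes are counted once. Uses a simple coordinate grid,
    which is fine for a modest number of boxes but not tuned for speed.
    Assumes the boxes lie inside the image.
    """
    xs = sorted({v for x, _, w, _ in boxes for v in (x, x + w)})
    ys = sorted({v for _, y, _, h in boxes for v in (y, y + h)})
    covered = 0
    for x0, x1 in zip(xs, xs[1:]):
        for y0, y1 in zip(ys, ys[1:]):
            # A grid cell counts if any box fully contains it
            if any(x <= x0 and x1 <= x + w and y <= y0 and y1 <= y + h
                   for x, y, w, h in boxes):
                covered += (x1 - x0) * (y1 - y0)
    return covered / float(image_width * image_height)


def word_boxes(image_path):
    """Return ([(x, y, w, h), ...], (width, height)) for one image."""
    # Imported lazily so text_area_ratio can be used without these packages
    from PIL import Image
    from tesserocr import PyTessBaseAPI, RIL

    image = Image.open(image_path)
    with PyTessBaseAPI() as api:
        api.SetImage(image)
        # Each component is (image, box dict, block id, paragraph id)
        components = api.GetComponentImages(RIL.WORD, True)
    boxes = [(b["x"], b["y"], b["w"], b["h"]) for _im, b, _blk, _par in components]
    return boxes, image.size
```

Usage would be something like `boxes, (w, h) = word_boxes("page.png")` followed by `text_area_ratio(boxes, w, h)`, where `page.png` is a placeholder path. Switching `RIL.WORD` to `RIL.TEXTLINE` gives coarser boxes, which may be enough for an area ratio.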
