OCR very large images - smart method to split into regions first?

walter23

Nov 17, 2011, 3:59:48 PM

to tesseract-ocr

I'm trying to come up with a method to OCR very large (poster-sized)
images with lots of regular-sized text, for example 40" wide with a
12-point font. One big limitation is that memory is easily exhausted
by images that take up half a gigabyte or more of RAM (40x30" @ 300 DPI
is pretty big).
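
As a rough check on that figure, the uncompressed size can be worked out from the dimensions. This is only a sketch: the helper name `raster_bytes` is made up for illustration, and the bytes-per-pixel values are assumptions about how the image is held in memory, not something measured.

```python
def raster_bytes(width_in, height_in, dpi, bytes_per_pixel):
    """Uncompressed size in bytes of a raster image of the given physical size."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


# Assumed pixel formats; the real in-memory format depends on the loader.
for label, bpp in [("8-bit gray", 1), ("24-bit RGB", 3), ("32-bit RGBA", 4)]:
    size = raster_bytes(40, 30, 300, bpp)
    print(f"{label}: {size:,} bytes ({size / 2**20:.0f} MiB)")
```

At 300 DPI the poster is 12,000 x 9,000 pixels, so a 32-bit copy alone is a little over 400 MiB, and any working copies made during processing come on top of that.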

I am trying to find a smart method of automatically reducing the
image to contiguous regions of text so that I do not chop text lines
in half (either horizontally or vertically).

One idea was to run page segmentation on a lower-resolution image and
use the resulting page layout to split the full image up, but looking
at the layout results I see some problems with this.

Has anybody tackled this kind of problem before? Suggestions for
approaches to take?

Many thanks

Dmitri Silaev

Nov 18, 2011, 1:04:53 AM

to tesser...@googlegroups.com

There's no other way to achieve this than helping Tesseract with
segmentation and feeding it chopped image pieces. Many segmentation
approaches exist, but which one you should choose depends on the
specifics of your images: how long the text lines are, whether the
layout is multicolumn or not, possible skew and lack of flatness of
the whole image, and many more.
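
To give an idea of the simplest kind of approach, here is a minimal sketch of a horizontal projection profile computed on a downscaled, binarized copy of the image. It is not what Tesseract does internally. It assumes a deskewed, single-column image; the names `blank_runs` and `horizontal_cuts`, the `min_gap` threshold and the `scale` factor are all made up for illustration.

```python
def blank_runs(profile, min_len):
    """Return (start, end) index pairs of runs of zeros at least min_len long."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value == 0:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    if start is not None and len(profile) - start >= min_len:
        runs.append((start, len(profile)))
    return runs


def horizontal_cuts(ink, min_gap=2, scale=1):
    """Find rows where the image can be cut without slicing through text.

    ink: list of rows from the low-resolution image, each a list of 0/1
         values (1 = dark pixel).
    Returns cut positions in full-resolution rows, placed in the middle of
    each blank band; blank margins at the top and bottom are skipped.
    """
    profile = [sum(row) for row in ink]  # ink count per row
    cuts = []
    for start, end in blank_runs(profile, min_gap):
        if start == 0 or end == len(profile):
            continue  # page margin, not a gap between text blocks
        cuts.append(((start + end) // 2) * scale)
    return cuts


# Tiny example: two "text lines" separated by a two-row blank band.
ink = [
    [0, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
    [0, 0, 0],
    [0, 1, 1],
    [0, 0, 0],
]
print(horizontal_cuts(ink, min_gap=2, scale=8))
```

The same function applied to the transposed array would look for vertical gutters, which is where multicolumn layouts, skew and warping start to break this simple scheme.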

Send some sample images to get more practical advice.

Warm regards,
Dmitri Silaev
www.CustomOCR.com


walter23

Nov 18, 2011, 12:26:30 PM

to tesseract-ocr

Hi Dmitri,

Thanks for your response. I figured some kind of custom segmentation
was going to be required. Any suggestions you can make to help would
be appreciated - I was thinking perhaps I would use some tools from
OpenCV or something but I'm not really sure where to read up on
segmentation approaches.

Here's a sample image:

http://i.imgur.com/6he8V.jpg

This is not actually an image I have worked with. It's just a
representative sample pulled at random from a web image search, since
my sample image contains proprietary information that I can't share.

Actual resolution is in the 14,000 x 10,000 range.

-Walter

Dmitri Silaev

Nov 21, 2011, 12:34:33 PM

to tesser...@googlegroups.com

Hi Walter,

I think it's worth taking a look at the OCRopus project
(http://code.google.com/p/ocropus/). As far as I know, it offers good
segmentation for this type of image. At the time of my investigation,
it was based on Thomas Breuel's work on the whitespace cover approach,
particularly his "Two Geometric Algorithms for Layout Analysis" (2002),
and maybe also his "Layout Analysis based on Text Line Segment
Hypotheses" (2003). So you could even implement these approaches
yourself from these articles.
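
To give a flavour of the whitespace idea, below is a simplified sketch of a branch-and-bound search for the largest empty rectangle among a set of obstacle boxes (for example, bounding boxes of words or connected components). It is written from my understanding of the general idea, is not OCRopus code, scores rectangles by area only, and finds a single rectangle rather than a full cover. The function names and the (x0, y0, x1, y1) box convention are assumptions made for illustration.

```python
import heapq
from itertools import count


def area(r):
    """Area of a box (x0, y0, x1, y1); degenerate boxes have area 0."""
    return max(0, r[2] - r[0]) * max(0, r[3] - r[1])


def overlaps(a, b):
    """True if boxes a and b share interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def largest_empty_rect(bound, obstacles):
    """Largest-area axis-aligned box inside bound that overlaps no obstacle."""
    tie = count()  # tie-breaker so the heap never compares boxes directly
    inside = [o for o in obstacles if overlaps(o, bound)]
    heap = [(-area(bound), next(tie), bound, inside)]
    while heap:
        _, _, r, obs = heapq.heappop(heap)
        if not obs:
            return r  # best remaining upper bound and already empty
        # Pick the obstacle nearest the centre of r as the pivot.
        cx, cy = (r[0] + r[2]) / 2, (r[1] + r[3]) / 2
        pivot = min(
            obs,
            key=lambda o: abs((o[0] + o[2]) / 2 - cx) + abs((o[1] + o[3]) / 2 - cy),
        )
        x0, y0, x1, y1 = r
        # Any empty box in r lies fully left, right, above or below the pivot.
        for sub in (
            (x0, y0, pivot[0], y1),
            (pivot[2], y0, x1, y1),
            (x0, y0, x1, pivot[1]),
            (x0, pivot[3], x1, y1),
        ):
            if area(sub) > 0:
                sub_obs = [o for o in obs if overlaps(o, sub)]
                heapq.heappush(heap, (-area(sub), next(tie), sub, sub_obs))
    return None


# Two text blocks with a one-unit gutter between them.
page = (0, 0, 10, 10)
blocks = [(0, 0, 4, 10), (5, 0, 10, 10)]
print(largest_empty_rect(page, blocks))
```

Adding each rectangle found to the obstacle list and searching again would give a sequence of large whitespace rectangles; tall, narrow ones are candidate column gutters along which a poster-sized image could be cut.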