Image Pre-Processing in Tesseract

paulfeakins

Apr 3, 2009, 10:00:14 AM
to tesseract-ocr
I'm working on a project where my source TIFF image may have
background colours or images behind the text.

I've been able to train tesseract successfully with some other fonts,
which works very well, but the background does seem to confuse
tesseract a little.

My question is, does tesseract perform any image pre-processing? If
not, is it worth me trying to apply a threshold or some other type of
optimization to the image first?

I've had a brief look through the source code, but I'm not really a
C++ developer so it was a bit hard to follow. What I'm trying to achieve
is something like reading text from a magazine where it's all printed
on top of a background image.

I'm trying to find out what sort of image tesseract is actually
working on, as perhaps I could then train it with a more accurate
representation of the text it needs to recognise.

dythmall

Apr 3, 2009, 10:07:19 AM
to tesseract-ocr
As far as I know, Tesseract converts images into 1-bit black and
white images, and it uses an adaptive threshold method to do so.
I think the best approach is to do your own image pre-processing,
which is what I did for my project: before feeding the image into
Tesseract I erased the background from it.
You should try doing that :)
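
For anyone who wants to experiment with this kind of pre-processing, here is a minimal sketch of a local-mean adaptive threshold in plain Python. It is a generic technique, not Tesseract's internal method, and the function name, the `radius` and `offset` parameters, and the list-of-rows image format are all assumptions made for illustration.

```python
def adaptive_threshold(gray, radius=1, offset=10):
    """Binarize a grayscale image given as a list of rows (values 0-255).

    A pixel becomes black (0) when it is darker than the mean of its
    local window by more than `offset`; otherwise it becomes white (255).
    """
    height, width = len(gray), len(gray[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the neighbourhood, clipped at the image borders.
            window = [
                gray[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            row.append(0 if gray[y][x] < local_mean - offset else 255)
        out.append(row)
    return out


# Hypothetical sample: one dark text pixel (40) on a background
# that gets brighter from left to right.
page = [
    [100, 120, 140, 160, 180],
    [100, 120, 40, 160, 180],
    [100, 120, 140, 160, 180],
]
binary = adaptive_threshold(page)
```

Because each pixel is compared with its own neighbourhood, the uneven background stays white and only the text pixel turns black. A single global cut-off of, say, 110 would also turn the darker left edge black.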

Albert Law

Apr 3, 2009, 10:20:08 AM
to tesser...@googlegroups.com
Hi,

I agree. A lot of image processing is about massaging your data.

-
Albert

paulfeakins

Apr 6, 2009, 4:46:55 AM
to tesseract-ocr
lol thanks Albert, now I know :)

Thanks dythmall, I'd thought that might be the case. I did some tests
and found that by selecting a specific area that I know will contain a
certain number of characters, I can apply my own adaptive threshold
based on the density of black pixels I'd expect. So far it's increased
the accuracy quite a bit! Next I'm planning on training tesseract
based on the black and white images my threshold creates rather than
the actual font being used. Hopefully if I train it on more realistic
data it will be even more accurate.
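
A rough sketch of the density-based idea, in case it helps someone else: pick the cut-off so that roughly the expected share of pixels in the region comes out black. This is only one way to read the approach described above, and the function names, the list-of-rows image format, and the sample values are assumptions.

```python
def threshold_for_density(gray, expected_black_fraction):
    """Pick a cut-off that makes roughly the expected share of pixels black."""
    values = sorted(v for row in gray for v in row)
    # Number of pixels we expect to belong to text strokes.
    n_black = round(len(values) * expected_black_fraction)
    if n_black <= 0:
        return 0      # nothing should be black
    if n_black >= len(values):
        return 256    # everything should be black
    # Pixels strictly below the returned value become black.
    return values[n_black - 1] + 1


def binarize(gray, cutoff):
    """Map pixels below `cutoff` to black (0) and the rest to white (255)."""
    return [[0 if v < cutoff else 255 for v in row] for row in gray]


# Hypothetical region where about 30% of the pixels should be text.
region = [
    [200, 30, 210, 190, 50],
    [220, 205, 45, 215, 198],
]
cutoff = threshold_for_density(region, 0.3)
result = binarize(region, cutoff)
```

If several pixels share the value at the cut-off, slightly more than the expected fraction will turn black, so the match is approximate.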

I've been trying to think of ways to remove the background, but it
needs to be automated. If I had a copy of the background image without
the text on it, I could combine the two using a difference filter and
hey presto, the text would pop out on its own. Thanks again for the reply!
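
The difference-filter idea could look something like the sketch below. It assumes the clean background copy exists, has the same dimensions as the page, and is perfectly aligned with it. The function name and the `tolerance` parameter are made up for illustration.

```python
def difference_mask(page, background, tolerance=30):
    """Return a binary image that is black where page and background differ."""
    same_shape = len(page) == len(background) and all(
        len(p) == len(b) for p, b in zip(page, background)
    )
    if not same_shape:
        raise ValueError("images must have the same dimensions")
    return [
        # Small differences are treated as noise and left white.
        [0 if abs(p - b) > tolerance else 255 for p, b in zip(page_row, bg_row)]
        for page_row, bg_row in zip(page, background)
    ]


# Hypothetical grayscale samples: the page is the background plus
# two dark text pixels and a little scanning noise.
background = [
    [90, 150, 210],
    [60, 120, 180],
]
page = [
    [92, 20, 208],
    [60, 118, 25],
]
mask = difference_mask(page, background)
```

The `tolerance` absorbs small brightness differences between the two scans. Text that happens to be close in brightness to the background behind it would be missed by this simple version.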