General guidelines for preparing a bitmap for Tesseract OCR


MJRF

Jul 6, 2009, 9:19:27 AM
to tesseract-ocr
Hi,

I would like to know if there are any general guidelines on preparing
a Bitmap before OCRing it.

Currently I am using some C# code to resize the bitmap to a larger
size and then invert its colors since I found that helps.
I am looking for additional methods to help the OCR engine become more
accurate.

Thank You,

MJRF

Alcareru

Jul 7, 2009, 2:01:01 AM
to tesseract-ocr
I'm a noob with Tesseract myself, but I can tell you what I've
experienced. The picture has to be big enough, but not too big: too
big can be just as bad as too small. So try different scaling factors
(and different scaling algorithms, if necessary). Also, instead of
just inverting the colors, aim for black text on a white background;
Tesseract doesn't like colors. Tesseract also doesn't like noise and
garbage, so try to get rid of everything that is not black text. Text
that is too close to other text might also give you problems, but that
is very hard to solve. Training Tesseract on the exact font used in
the picture should also help. Also note that special characters might
not be recognized correctly unless you train Tesseract on them (the
degree sign, °, for instance).
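
The clean-up steps above (binarize, force black text on a white
background, drop stray noise, then scale up) could be sketched roughly
as follows. This is only an illustration in plain Python rather than
the C# mentioned in the question; it works on a grayscale image held
as a list of rows, and every function name, the threshold of 128 and
the scale factor of 2 are assumptions chosen for the example, not
values recommended by Tesseract.

```python
from typing import List

Gray = List[List[int]]  # rows of grayscale values, 0 (black) .. 255 (white)


def binarize(img: Gray, threshold: int = 128) -> Gray:
    """Map every pixel to pure black (0) or pure white (255)."""
    return [[0 if px < threshold else 255 for px in row] for row in img]


def black_on_white(img: Gray) -> Gray:
    """Invert a binary image only when black pixels are the majority,
    which suggests light text on a dark background."""
    total = sum(len(row) for row in img)
    dark = sum(px == 0 for row in img for px in row)
    if dark * 2 > total:
        return [[255 - px for px in row] for row in img]
    return img


def despeckle(img: Gray) -> Gray:
    """Turn black pixels white when none of their 8 neighbours is black."""
    h = len(img)
    w = len(img[0]) if h else 0
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            if img[y][x] != 0:
                continue
            has_black_neighbour = any(
                img[ny][nx] == 0
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_black_neighbour:
                out[y][x] = 255
    return out


def upscale(img: Gray, factor: int) -> Gray:
    """Nearest-neighbour upscaling by an integer factor."""
    return [
        [px for px in row for _ in range(factor)]
        for row in img
        for _ in range(factor)
    ]


def prepare(img: Gray, factor: int = 2, threshold: int = 128) -> Gray:
    """Binarize, normalize polarity, remove specks, then enlarge."""
    clean = despeckle(black_on_white(binarize(img, threshold)))
    return upscale(clean, factor)
```

Noise is removed before scaling so that a one-pixel speck is still one
pixel wide when it is tested. The threshold and scale factor are the
knobs to experiment with, since too large an image can hurt as much as
too small a one.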

Alcareru

Jul 7, 2009, 2:16:22 AM
to tesseract-ocr
Oh, yeah, I almost forgot to mention: my experience is with 2.03. I
hear 2.04 has some improvements, so my info might be partially
outdated.