OCR only one font size


Alatius

Jan 19, 2009, 3:11:25 PM
to tesseract-ocr
Is there a way to tell Tesseract that all letters in the image to
be OCR'd have the same size (pixelwise) as in the training pictures?
Or at least to tell it that the image only contains text in one font
size? Tesseract is great, but I am getting a bit tired of specks
turning up as "ffi" ligatures, or a faint "n" as "11", simply because
it is so unnecessary: no, it can't be "ffi", because none of the "ffi"
ligatures in the training sources are that small, nor can that "n" be
"11", because all "1"s in the training pictures are far taller than an
"n". (I know this because I trained it myself.)

I'm running the latest svn version (from Jan 14) on Ubuntu.

SteveP

Jan 20, 2009, 11:35:44 AM
to tesseract-ocr
From browsing through the postings on this site, I have gathered that
there is definitely help for the difficulties you are having. One of
my previous posts talks about using the Search tool and checking the
FAQ. That might lead to answers from others who are more
knowledgeable than I am.

See if you can find other postings that are similar/related to what
you are doing. If you need more help, let us know more details or
post a sample of your image, much as others have done.

SteveP

Jan 20, 2009, 11:38:30 AM
to tesseract-ocr
Also, I am not clear on exactly how to get the latest svn
version. I have not found exact details on how to do that. I would
find it helpful if you could describe the steps and which URL to
go to.


Alatius

Jan 20, 2009, 2:27:24 PM
to tesseract-ocr
On Jan 20, 5:35 pm, SteveP <SPohor...@sjm.com> wrote:
> From browsing through the postings on this site, I have gathered that
> there is definitely help for the difficulties you are having.  One of
> my previous posts talks about using the Search tool and checking the
> FAQ.

Um, I am thankful for your taking the time to reply to my question,
but that is hardly helpful, I'm afraid. Obviously I have been unable
to locate an answer to my question, or I wouldn't ask it here. By all
means, if you know there is an answer out there, please point it out
directly, and prove my incompetence.

>Also, I have not gotten clear exactly how to get the latest svn
>version. I have not found exact details on how to do that. I would
>find it helpful if you could describe the steps and the which URL to
>go to.

I _could_ advise you to Google for it, _or_ I could point you to this
page:
http://code.google.com/p/tesseract-ocr/source/checkout

What would you prefer?

Cheers,
Johan Winge



SteveP

Jan 20, 2009, 7:25:51 PM
to tesseract-ocr
Thanks for the input on svn.
As for getting decent results, some of the postings talk about
using libtiff to get better results. Someone recently talked about
poor results using JPEG files; others have talked about the importance
of getting the input files nice and clean before giving them to
tesseract. Post a sample of your input and/or give details on your
input format. Others can probably help you more, since my input is
screen data, which is a different case from data obtained by
scanning a page. So on this one, if you post your data, I hope
someone with a similar situation will answer.

(One last thing: I have found the -geometry option in ImageMagick's
convert command helpful for increasing the character size, which
yields better OCR results.)
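
For anyone wanting to try the enlarging step mentioned above, here is a minimal sketch in Python that only builds the convert command line. The helper name, the file names, and the 300% scale are made up for illustration, and it assumes ImageMagick's convert is on the PATH and accepts a percentage for -geometry; check this against your installed version.

```python
import shlex


def build_upscale_command(src, dst, scale_percent=200):
    """Return an argv list that asks ImageMagick's convert to enlarge src.

    Hypothetical helper: the file names and default scale are illustrative,
    not values taken from this thread.
    """
    if scale_percent <= 0:
        raise ValueError("scale_percent must be positive")
    # A percentage geometry scales both width and height by that factor.
    return ["convert", src, "-geometry", f"{scale_percent}%", dst]


if __name__ == "__main__":
    cmd = build_upscale_command("screen.png", "screen_big.tif", 300)
    # Print the command for inspection; run it with subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Which scale works best probably depends on the source image, so it is worth trying a few values and comparing the OCR output.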

Ray Smith

Jan 21, 2009, 2:17:48 PM
to tesser...@googlegroups.com
Scaling is a fundamental part of tesseract, so it is difficult to tweak it to do what you ask.
Your best bet is probably to increase the value of IntegerMatcherMultiplier. This increases the penalty for characters that do not sit in the correct position relative to the baseline, but it is not going to do much to get rid of noise.
You can change the value by any of the following means:
- Change the source code in classify/intmatcher.cpp.
- Change it via the TessBaseAPI call SetVariable("IntegerMatcherMultiplier", "20");  Note that the value is a string.
- Put the new value in a config file and specify it on the command line.
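
The config-file route might be sketched as follows in Python. This assumes the usual "name value" one-per-line config format and a command line of the form tesseract image outputbase configname. The helper names, file names, and the config name "bigpenalty" are hypothetical, and where Tesseract looks for the config file depends on the version and installation.

```python
def render_config(variables):
    """Render a Tesseract-style config: one 'name value' pair per line."""
    return "".join(f"{name} {value}\n" for name, value in variables.items())


def build_tesseract_command(image, output_base, config_name):
    """Return an argv list that passes a config name after the output base.

    Assumption: config names are given as trailing command-line arguments.
    """
    return ["tesseract", image, output_base, config_name]


if __name__ == "__main__":
    # The value 20 is the example from the post above, not a tuned setting.
    text = render_config({"IntegerMatcherMultiplier": 20})
    with open("bigpenalty", "w") as fh:  # hypothetical config file name
        fh.write(text)
    print(build_tesseract_command("page.tif", "page", "bigpenalty"))
```

The variable name is the one given in this thread; other Tesseract releases may use a different name, so check the source of the version you are running.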

Ray.

Alatius

Jan 22, 2009, 2:47:42 PM
to tesseract-ocr
Thank you both for your input! As it turns out, the results are much
better in regard to the "n"="11" problem, when running tesseract
normally, compared to when creating box files. I will post details on
this in a separate thread.

On Jan 21, 8:17 pm, Ray Smith <theraysm...@gmail.com> wrote:
> Scaling is a fundamental part of tesseract, so it is difficult to tweak it
> to do what you ask. Your best bet is probably to increase the value