Need some recommendations

zhi

Jun 30, 2009, 5:17:36 PM

to tesseract-ocr

Hi guys,

I am new to Tesseract and I hope someone can help me or point me in
the right direction.
My current project involves scanning medical insurance cards. I
tested about 10 different insurance cards that my clients uploaded,
using Tesseract with the default English language data. However, the
accuracy is very low: I would say it recognizes only around 10% of
the text on the cards. Are there any ways to increase the accuracy?
Maybe by training it?

These insurance cards are all in English.

Thanks,

Zhi

ANurag

Jun 30, 2009, 5:21:30 PM

to tesser...@googlegroups.com

Can you upload a sample card so we can get a better idea of what it looks like?

zhi

Jul 1, 2009, 9:51:50 AM

to tesseract-ocr

Hi ANurag,

Thanks for the reply. I am not allowed to upload the images. Could it
be due to the resolution of these cards? Most of them are about
300*200 in size. The images with higher resolution seem to work much
better, but some letters still go unrecognized, and unknown Unicode
characters appear when it tries to recognize the logo on the card.

So my question is: instead of training Tesseract to recognize a
different language, is it possible to train it to recognize these
insurance cards, for example to handle the logo and fix the
unrecognized letters?

Thanks so much for your help~

ANurag

Jul 1, 2009, 11:28:25 AM

to tesser...@googlegroups.com

Zhi,
Before training, I would suggest you try scaling the image by
different factors (2x, 4x, etc.) and running the OCR on each result;
that might help you detect some text. Furthermore, if the text always
sits at specific locations within an image, use .uzn files to read
text from those areas/boxes; Tesseract is really good at "detecting"
text rather than "finding and detecting" text.
http://markmail.org/message/uvtlpo33rjgouqlc#query:tesseract%20uzn+page:1+mid:eevkagzv4s6of7s3+state:results
contains more information on the .uzn file format/usage.
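
As a rough illustration of the .uzn idea, here is a minimal Python sketch that writes a zone file next to an image. It assumes the UNLV-style layout of one zone per line, `left top width height label`, and that Tesseract looks for a file with the same base name as the image and a `.uzn` extension; please check the linked message for the details that apply to your version. The zone coordinates and the file name `card.tif` are made up for the example.

```python
from pathlib import Path

# Hypothetical zones for a 300*200 card: (left, top, width, height, label).
# Real coordinates would come from measuring where the text sits on the card.
ZONES = [
    (20, 15, 260, 30, "Text"),
    (20, 60, 180, 25, "Text"),
]

def uzn_lines(zones):
    """Format zones as .uzn lines (assumed layout: left top width height label).

    The label should not contain spaces, since the fields are space-separated.
    """
    return [f"{left} {top} {width} {height} {label}"
            for (left, top, width, height, label) in zones]

def write_uzn(image_path, zones):
    """Write the zones to a file with the image's base name and a .uzn suffix."""
    uzn_path = Path(image_path).with_suffix(".uzn")
    uzn_path.write_text("\n".join(uzn_lines(zones)) + "\n")
    return uzn_path
```

Calling `write_uzn("card.tif", ZONES)` would create `card.uzn` alongside the image, with one line per zone.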

I would suggest you try the above approaches before training tesseract.
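
The scaling suggestion could be scripted along these lines. This is only a sketch: it assumes ImageMagick's `convert` and the `tesseract` command are both on the PATH, and `card.tif` and the derived output names are hypothetical. It upscales the card by each factor, runs the OCR on the result, and leaves one text file per factor so the outputs can be compared.

```python
import subprocess
from pathlib import Path

def scaled_commands(image, factor):
    """Build the resize and OCR commands for one scale factor.

    Returns two argument lists: one for ImageMagick's `convert`
    (assumed installed) and one for `tesseract`.
    """
    src = Path(image)
    # e.g. card.tif -> card_2x.tif
    scaled = src.with_name(f"{src.stem}_{factor}x{src.suffix}")
    # Tesseract takes an output base name and appends .txt itself.
    out_base = scaled.with_suffix("")
    resize = ["convert", str(src), "-resize", f"{factor * 100}%", str(scaled)]
    ocr = ["tesseract", str(scaled), str(out_base)]
    return resize, ocr

def run_all(image, factors=(2, 4)):
    """Upscale the image by each factor and run the OCR on every result."""
    for factor in factors:
        for cmd in scaled_commands(image, factor):
            subprocess.run(cmd, check=True)
```

Calling `run_all("card.tif")` would produce `card_2x.txt` and `card_4x.txt`, which you can then compare by eye to see which factor reads best.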

zhi

Jul 1, 2009, 1:18:43 PM

to tesseract-ocr

ANurag,

Thanks so much for the suggestion. I will try them out today.

zhi

Jul 1, 2009, 1:33:23 PM

to tesseract-ocr

And if I want to do the training, how would I do it? The training doc
I found seems to cover only training Tesseract for a different language.

Joe K

Jul 2, 2009, 12:39:03 PM

to tesseract-ocr

Hi zhi,

It's hard to tell without seeing the insurance card, but if the card
uses a certain font or only contains a limited set of characters, you
can train Tesseract on that font and those characters to try to
increase the accuracy, and that would be your "language". And if you
have distinct sections, like Anurag said above, and the sections each
have different fonts and characters, then you can create a separate
"language" for each section to try to increase the accuracy.
The training is pretty well documented.
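
To make the multiple-"languages" idea a bit more concrete, here is a small sketch of how per-section OCR could be driven once such language data exists. The language names `cardid` and `cardname` and the cropped image files are hypothetical placeholders for whatever you end up training; `-l` is Tesseract's option for selecting the language data.

```python
import subprocess

# Hypothetical mapping: cropped section image -> custom trained "language".
SECTION_LANGS = {
    "member_id.tif": "cardid",
    "member_name.tif": "cardname",
}

def ocr_command(image, lang):
    """Build a tesseract call that reads one section with one language."""
    # Output base name: the image name without its extension.
    out_base = image.rsplit(".", 1)[0]
    return ["tesseract", image, out_base, "-l", lang]

def run_sections(section_langs=SECTION_LANGS):
    """Run the OCR on every section image with its own language data."""
    for image, lang in section_langs.items():
        subprocess.run(ocr_command(image, lang), check=True)
```

Each section then gets its own output text file, read with the language data trained for that section's font and character set.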
