Is Tesseract capable of extreme accuracy on cards of different formats?


S Kirkwood

May 30, 2015, 10:17:56 AM
to tesser...@googlegroups.com
Hi, I am working on a project that requires OCR. I have not used Tesseract much before, aside from running it on some basic examples with the command line tool. My goal is to use OCR on insurance cards to get all of the characters and then find certain information, such as the ID of the cardholder, in the output. Accuracy is critical here, as a single misread character invalidates the entire ID.

My concern stems from this need for extreme accuracy, which, according to this discussion thread, appears to be possible only by running the character recognition on each individual character on the card. The following quote is where most of my worries come from:

But if accuracy is critical in your app, in the long run I would absolutely avoid using any parts of Tesseract except the char classifier. I.e. crop every single char out of your source image and run Tess in the single char PSM. I think it should be easy as long as the location of every character is quite stable among your source images. ImageMagick/shell scripts would suffice.
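
The per-character approach in the quote could be sketched roughly as follows. This is only an illustration under assumptions: the character boxes are taken as already known, the helper names (`crop_cmd`, `ocr_char_cmd`, `read_chars`) are made up for this example, and `--psm 10` is the single-character page segmentation mode as spelled in Tesseract 4 and later (3.x releases spell the flag `-psm`).

```python
import subprocess

def crop_cmd(src, box, dst):
    """ImageMagick command that crops one character box (x, y, w, h) out of src."""
    x, y, w, h = box
    return ["convert", src, "-crop", f"{w}x{h}+{x}+{y}", "+repage", dst]

def ocr_char_cmd(image, whitelist=None):
    """Tesseract command that treats the image as a single character (PSM 10)."""
    cmd = ["tesseract", image, "stdout", "--psm", "10"]
    if whitelist:
        # Optionally restrict the classifier to the characters a field may contain.
        cmd += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return cmd

def read_chars(src, boxes, whitelist=None):
    """Crop every box, OCR each crop on its own, and join the results.

    Requires the `convert` and `tesseract` executables to be installed.
    """
    chars = []
    for i, box in enumerate(boxes):
        dst = f"char_{i:03d}.png"  # hypothetical scratch file name
        subprocess.run(crop_cmd(src, box, dst), check=True)
        result = subprocess.run(ocr_char_cmd(dst, whitelist), check=True,
                                capture_output=True, text=True)
        chars.append(result.stdout.strip())
    return "".join(chars)
```

A call such as `read_chars("card.png", [(120, 40, 18, 26), (140, 40, 18, 26)], whitelist="0123456789")` would then read two digit boxes; the coordinates here are placeholders, not values from a real card.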

However, the images I will be processing differ vastly in layout - not stable like the example I linked to.   Some examples of how the format may differ follow:
 
 

I have run Tesseract on samples, and while it works for most of the characters, there are cases where it misreads a single character (such as registering an "H" when the character is a "W") or, even worse, an entire phrase (such as registering "No New Rum" when the phrase is actually "No Referral Required"). Because of errors like this, I would not be able to use the output that Tesseract currently gives me.

Is there a realistic way to use Tesseract for this kind of endeavor?

Thanks for taking the time to read,
Scott
 

Dmitri Silaev

May 30, 2015, 12:11:15 PM
to tesser...@googlegroups.com
Hi Scott,

Can be done. Involves much R&D. Use layout templates for each card type. For individual fields use patterns, either Tess's or based on your own logic. If some card type's layout is too flexible, localize fields by the relative positioning of layout elements, fg/bg color, connected-component (CC) analysis, frame/table borders, font size, dense text regions, etc. At some stage you can do a bulk OCR, then search for a pattern, then search for a field in a narrower subregion. Probably cross-check against other fields in the card or a DB. In other words, increase probability in any way you can. Be inventive. Decent accuracy can be achieved. You should accept, though, an accuracy rate below 100%.
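
To make the "bulk OCR, then search for a pattern, then cross-check" idea concrete, here is a minimal sketch in plain Python. Everything specific in it is an assumption for illustration: the ID format (3 letters followed by 9 digits), the "ID" label, the table of commonly confused characters, and the function names. Real cards will need their own patterns per card type.

```python
import re

# Hypothetical ID format: 3 uppercase letters followed by 9 digits.
ID_PATTERN = re.compile(r"[A-Z]{3}\d{9}")
# Hypothetical label: "ID", an optional ":" or "#", then a 12-character token.
LABEL = re.compile(r"ID\s*[:#]?\s*(\S{12})(?!\S)")

# Characters that OCR tends to confuse, mapped to the class the position expects.
TO_DIGIT = str.maketrans("OQDIlZSB", "00011258")
TO_LETTER = str.maketrans("0158", "OISB")

def normalize_id(raw):
    """Force letters into the first 3 positions and digits into the last 9."""
    head = raw[:3].translate(TO_LETTER).upper()
    tail = raw[3:].translate(TO_DIGIT)
    return head + tail

def extract_id(ocr_text, known_ids=None):
    """Return a validated ID found in bulk OCR text, or None if nothing fits.

    known_ids is an optional set used as a cross-check against a database.
    """
    for match in LABEL.finditer(ocr_text):
        candidate = normalize_id(match.group(1))
        if not ID_PATTERN.fullmatch(candidate):
            continue
        if known_ids is not None and candidate not in known_ids:
            continue
        return candidate
    return None
```

For example, `extract_id("Member ID: XJW12345678O")` repairs the trailing letter "O" to a zero because that position must hold a digit. Returning `None` instead of a best guess lets the caller route uncertain cards to manual review.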

Best regards,
Dmitri Silaev
www.CustomOCR.com





--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/d24aebd0-8e45-4ec4-8afa-6a583a5b9298%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

S Kirkwood

Jun 1, 2015, 12:43:20 PM
to tesser...@googlegroups.com
Thank you for the response Dmitri.

It is reassuring to know that this can be done.  From your description it seems as though the first step would be to use some blob detection method to find the different regions within a picture.   Then, run Tess on the regions that I have found, which should give me a better result than running it over the entire image.  However, I am uncertain of where to proceed from here, as I am not well versed in this subject area.  Do you know of any good resources I could use in order to learn more about the methods that I would need to use?
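
The region-finding step described above might be sketched like this, assuming the blob detector has already produced one `(x, y, w, h)` bounding box per character-sized blob. The greedy merging rule, the function names, and the default thresholds are illustrative choices, not something prescribed in this thread.

```python
def vertical_overlap(a, b):
    """Fraction of the shorter box's height shared by two (x0, y0, x1, y1) boxes."""
    shared = min(a[3], b[3]) - max(a[1], b[1])
    shorter = min(a[3] - a[1], b[3] - b[1])
    return max(shared, 0) / shorter if shorter > 0 else 0.0

def group_boxes(boxes, max_gap=15, min_overlap=0.5):
    """Merge (x, y, w, h) blob boxes into text regions (x0, y0, x1, y1).

    A box joins a region when it starts within max_gap pixels of the region's
    right edge and overlaps it vertically by at least min_overlap.
    """
    regions = []
    for x, y, w, h in sorted(boxes):  # left to right
        box = (x, y, x + w, y + h)
        for region in regions:
            close = box[0] - region[2] <= max_gap
            if close and vertical_overlap(region, box) >= min_overlap:
                # Grow the region so that it also covers the new box.
                region[0] = min(region[0], box[0])
                region[1] = min(region[1], box[1])
                region[2] = max(region[2], box[2])
                region[3] = max(region[3], box[3])
                break
        else:
            regions.append(list(box))
    return [tuple(region) for region in regions]
```

Each resulting region can then be cropped and passed to Tesseract on its own, which is the "run Tess on the regions" step.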

Thanks,
Scott

Dmitri Silaev

Jun 1, 2015, 5:45:45 PM
to tesser...@googlegroups.com
It was an answer with general thoughts. You've shown images that can simply be found on the internet. To suggest a more detailed processing pipeline I would need real sample images and would probably have to ask more questions. Depending on that, you can start with binarization and CC labeling, or you can jump right to region cropping.

Tons of good resources are out there. Also dependent on what you really need.

For binarization and CC labeling I'd suggest (at the risk of being criticized by others):
- First, you need to read some classics.
"Digital Image Processing" - Gonzalez, Woods. Sections "Thresholding" and "Extraction of Connected Components" and adjacent sections.
- Second, a tool to quickly get down to trying recipes. OpenCV. http://docs.opencv.org/modules/imgproc/doc/miscellaneous_transformations.html#threshold and http://docs.opencv.org/3.0-beta/modules/imgproc/doc/structural_analysis_and_shape_descriptors.html
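
As a feel for what those two steps do, here is a small dependency-free sketch of Otsu thresholding followed by 8-connected component labeling on a grayscale image stored as a list of rows. It is a teaching sketch only, and it assumes dark text on a light background; in practice OpenCV's `cv2.threshold` with the `cv2.THRESH_OTSU` flag and `cv2.connectedComponentsWithStats` do the same job far faster.

```python
from collections import deque

def otsu_threshold(pixels):
    """Otsu's threshold for a flat list of gray values in 0..255."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * n for i, n in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_dark = sum_dark = 0
    for t in range(256):
        w_dark += hist[t]
        if w_dark == 0:
            continue
        w_light = total - w_dark
        if w_light == 0:
            break
        sum_dark += t * hist[t]
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        # Between-class variance (up to a constant factor).
        var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def connected_components(binary):
    """Bounding boxes (x, y, w, h) of 8-connected foreground blobs."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not binary[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(x, y)])
            x0 = x1 = x
            y0 = y1 = y
            while queue:  # breadth-first flood fill of one blob
                cx, cy = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((x0, y0, x1 - x0 + 1, y1 - y0 + 1))
    return boxes

gray = [
    [220, 220, 220, 220, 220, 220],
    [220,  20,  20, 220, 220,  20],
    [220,  20, 220, 220, 220,  20],
    [220, 220, 220, 220, 220, 220],
]
t = otsu_threshold([p for row in gray for p in row])
binary = [[p <= t for p in row] for row in gray]  # dark pixels are foreground
blobs = connected_components(binary)
```

On this toy image the two dark shapes come out as two separate boxes, which is the raw material for any later layout analysis.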

Most methods related to layout analysis (and document image analysis in general) are published in the form of scientific papers. These might be outdated but are sufficient to begin your journey through the papers:
- "Geometric Layout Analysis Techniques for Document Image Understanding: a Review" - 1998 - Cattoni, Coianiz
- "Document Structure Analysis Algorithms" - 2003 - Mao, Rosenfeld, Kanungo

Best regards,
Dmitri Silaev
www.CustomOCR.com






S Kirkwood

Jun 2, 2015, 10:51:36 AM
to tesser...@googlegroups.com
Thank you again for your response.  I will look at the papers you mentioned and see where that takes me. 

Scott

Rick Leir

Jun 9, 2015, 1:32:47 PM
to tesser...@googlegroups.com
Dmitri gave the detailed answer. 

A short-cut perhaps: try higher resolution images. 

Another short-cut: pre-process the images with GraphicsMagick to get a photocopy-like effect, so that Tesseract can choose a correct threshold value. My previous posts might help.
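
One possible form of such a pre-processing step is sketched below. The exact recipe from the earlier posts is not reproduced in this thread, so the choice of options here (grayscale conversion followed by contrast normalization) is an assumption, as are the function and file names.

```python
import subprocess

def preprocess_cmd(src, dst):
    """GraphicsMagick command: convert to grayscale, then stretch the contrast."""
    return ["gm", "convert", src, "-colorspace", "Gray", "-normalize", dst]

def preprocess(src, dst):
    """Run the command; requires the `gm` executable to be installed."""
    subprocess.run(preprocess_cmd(src, dst), check=True)
    return dst
```

The cleaned file, for instance the result of `preprocess("card.jpg", "card_clean.png")`, is then what gets passed to Tesseract instead of the original photo.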