Improve OCR accuracy


Gunasekaran Velu

Jun 22, 2015, 7:56:51 AM
to tesser...@googlegroups.com


Hi

I have attached an image along with a screenshot of the Tesseract OCR result for it. Some words are missing from the OCR output. How can I improve the image quality so that the missing words are detected?

The resolution of the attached image is:

Horizontal resolution - 204 DPI
Vertical resolution    -    98 DPI

Please help me to improve the OCR accuracy.

Looking forward to your reply.

Regards
Guna


mbx15335198-514-18491831-1N.Tif

supriya Das

Jun 22, 2015, 9:11:06 AM
to tesser...@googlegroups.com
Which version of Tesseract are you using?

Gunasekaran Velu

Jun 22, 2015, 11:05:52 AM
to tesser...@googlegroups.com

Hi

Thanks for the reply.

I am using Tesseract .NET Wrapper version 2.0.4.0.

Looking forward to your reply.

Regards
Guna

Tom Morris

Jun 22, 2015, 1:18:40 PM
to tesser...@googlegroups.com


On Monday, June 22, 2015 at 7:56:51 AM UTC-4, Gunasekaran Velu wrote:

> I have attached the image as well as Tesseract OCR result for attached image screen shot. the below OCR some words are missing from OCR how can i improve the image quality to detect the missing words.
>
> The attached image DPI are
>
> Horizontal resolution - 204 DPI
> Vertical resolution    -    98 DPI
>
> Please help me to improve the OCR accuracy.

The easiest improvement to make would be to use "Fine" mode at a minimum to bring the vertical resolution up to 200 DPI.  If a higher resolution is available (e.g. "super fine") that would be even better.
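
If rescanning is not an option, the non-square pixels (204 x 98 DPI) can at least be resampled to a uniform resolution before OCR. This does not recover detail that was lost at scan time, so it is a fallback rather than a substitute for "Fine" mode. A minimal sketch of the arithmetic, assuming a hypothetical 1728 x 1078 pixel page and a 300 DPI target (both values are illustrative and not taken from the attached file):

```python
def square_pixel_size(width_px, height_px, h_dpi, v_dpi, target_dpi=300):
    """Return the (width, height) that gives the image target_dpi on both axes."""
    scale_x = target_dpi / h_dpi   # 300 / 204, about 1.47
    scale_y = target_dpi / v_dpi   # 300 / 98, about 3.06
    return round(width_px * scale_x), round(height_px * scale_y)


# Hypothetical page size; substitute the real pixel dimensions of the TIFF.
new_size = square_pixel_size(1728, 1078, h_dpi=204, v_dpi=98)
print(new_size)
```

The resulting size can be handed to the resize call of whatever image library is in use (for example Pillow's `Image.resize`).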

The corner marks on the form are clearly designed to help with form processing, so I'd use them in your image processing pipeline to deskew, remove background printing, etc.
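
As a rough illustration of the deskew step, the skew angle can be derived from the positions of two corner marks once they have been located. The sketch below assumes the marks have already been found by some other means; the coordinates and function names are hypothetical.

```python
import math


def skew_angle_degrees(top_left, top_right):
    """Angle of the line through the two top corner marks, in degrees.

    Points are (x, y) pixel coordinates with y growing downward, so a
    positive result means the right-hand mark sits lower than the left one.
    """
    dx = top_right[0] - top_left[0]
    dy = top_right[1] - top_left[1]
    return math.degrees(math.atan2(dy, dx))


def rotate_point(point, angle_degrees, center):
    """Rotate a point about center by angle_degrees (in image coordinates)."""
    a = math.radians(angle_degrees)
    x, y = point[0] - center[0], point[1] - center[1]
    return (center[0] + x * math.cos(a) - y * math.sin(a),
            center[1] + x * math.sin(a) + y * math.cos(a))


# Hypothetical mark positions: the right mark is 30 px lower over 1000 px.
tl, tr = (100, 100), (1100, 130)
angle = skew_angle_degrees(tl, tr)
# Rotating by the negative angle brings the two marks back onto the same row.
fixed_tr = rotate_point(tr, -angle, center=tl)
print(round(angle, 2), fixed_tr)
```

The same angle would then be applied to the whole image with the rotation routine of the image library in use.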

The form can be broken into zones to be recognized individually, using knowledge of the type of information expected to help tune things.
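
One possible way to describe such zones is as fractions of the page, paired with the characters expected in each. The zone names, coordinates, and character sets below are invented for illustration and would have to be measured on the real form.

```python
# Hypothetical zones: name -> ((left, top, right, bottom) as fractions of the
# page, characters expected there). None of these values come from the real
# form; measure them on a blank copy.
ZONES = {
    "account_number": ((0.10, 0.20, 0.60, 0.25), "0123456789"),
    "date":           ((0.65, 0.20, 0.90, 0.25), "0123456789/"),
    "name":           ((0.10, 0.30, 0.90, 0.35), None),  # no restriction
}


def zone_boxes(page_width, page_height, zones=ZONES):
    """Convert fractional zones to pixel crop boxes (left, top, right, bottom)."""
    boxes = {}
    for name, ((l, t, r, b), whitelist) in zones.items():
        box = (round(l * page_width), round(t * page_height),
               round(r * page_width), round(b * page_height))
        boxes[name] = (box, whitelist)
    return boxes


for name, (box, whitelist) in zone_boxes(2000, 3000).items():
    print(name, box, whitelist)
```

Each box can be cropped and recognized on its own, with the character set applied through Tesseract's `tessedit_char_whitelist` variable; how that variable is set depends on the wrapper in use.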

Tom 

Art Rhyno.

Jun 22, 2015, 2:33:40 PM
to tesser...@googlegroups.com

Hi Guna,

 

I usually find that tesseract has trouble with text on lines in a form. There is a horizontal line removal example included with leptonica that might help you [1]. I tried it on the sample you provided, and doubled the size of the image to start zeroing in on the results. You might also consider font training for characters that would be affected by removing the line (since it can take the bottom part of a letter away if the text is typed right on the line).
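
To make the trade-off concrete, here is a deliberately simple run-length filter on a tiny 1-bit image. It is not the leptonica example referenced above (which works with morphological operations); it only illustrates the idea, and the `min_run` parameter and sample data are assumptions.

```python
def remove_horizontal_lines(image, min_run=5):
    """Blank out horizontal runs of black pixels at least min_run long.

    image is a list of rows; 1 = black (ink), 0 = white. Returns a new image.
    """
    result = [row[:] for row in image]
    for row in result:
        x = 0
        while x < len(row):
            if row[x] == 1:
                start = x
                while x < len(row) and row[x] == 1:
                    x += 1
                if x - start >= min_run:
                    row[start:x] = [0] * (x - start)
            else:
                x += 1
    return result


page = [
    [0, 1, 0, 0, 1, 1, 0, 0],   # parts of characters (short runs)
    [0, 1, 0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],   # form line with text sitting on it
    [0, 0, 0, 0, 0, 0, 0, 0],
]
cleaned = remove_horizontal_lines(page, min_run=5)
for row in cleaned:
    print(row)
```

The third row is erased completely, including any character pixels that sat on the line, which is the loss of letter bottoms described above.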

 

art

---

1. http://www.leptonica.com/line-removal.html


--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/e725e8e6-dd6f-4c4c-9bb9-61f86c49053c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

form2_result.png

Gunasekaran Velu

Jun 23, 2015, 3:56:31 AM
to tesser...@googlegroups.com
Hi all

Thanks for the information.

I have also increased the DPI, but some words are still missing; see the attached output image.

I have attached the image properties. The file compression type is CCITT and the bit depth is 1.

Does the OCR process depend on the compression type and bit depth?

Looking forward to your reply.



Regards
Guna

output.png
Properties.png

Tom Morris

Jun 23, 2015, 11:58:39 AM
to tesser...@googlegroups.com


On Tuesday, June 23, 2015 at 3:56:31 AM UTC-4, Gunasekaran Velu wrote:

> I have increased the DPI also but some word are missing attached output image.
>
> I have attached the image properties. the file compression type CCITT and bit depth is 1.
>
> Does compression type and bit depth is depended on OCR process?

CCITT T.4 (i.e. G3 fax) compression algorithms are lossless, so they have no impact.  The low spatial resolution will have a negative impact.  Although the OCR algorithm operates on bitonal images, the fact that the image is already binarized removes potential flexibility to adjust the binarization process (although fax machines tend to be pretty good at this because a) it's the mode they're designed to operate in and b) they have a very controlled scanning environment).
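
The point about lost flexibility can be shown with a toy example. With grayscale input the threshold decides whether faint strokes survive; with a 1-bit image there is nothing left to tune. The pixel values and thresholds below are made up for illustration.

```python
def binarize(gray_pixels, threshold):
    """Map grayscale values (0 = black, 255 = white) to 1 (ink) or 0 (paper)."""
    return [1 if value < threshold else 0 for value in gray_pixels]


# Hypothetical scan line: dark ink (40), faint ink (150), paper (230).
gray = [230, 40, 150, 230, 150, 40]
print(binarize(gray, 128))   # faint strokes are lost
print(binarize(gray, 180))   # faint strokes are kept

# A 1-bit fax image only holds 0 or 255, so the threshold no longer matters.
fax = [255, 0, 255, 255, 255, 0]
print(binarize(fax, 128) == binarize(fax, 180))
```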

Art's suggestion to remove lines is a good one, but if you have only a single form to deal with, you could just scan an empty form and then subtract that template from your submitted form (after deskewing & registering using the corner marks).  Dealing with dropouts where the characters intersect preprinted form elements is going to be problematic with either approach, doubly so because of the low resolution.
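
A minimal sketch of the template subtraction, assuming both images are already deskewed, registered, and the same size. The tiny arrays are invented sample data.

```python
def subtract_template(filled, template):
    """Keep only ink that is present in the filled form but not in the blank one.

    Both images are lists of rows of equal size; 1 = black (ink), 0 = white.
    They are assumed to be deskewed and registered already.
    """
    return [
        [1 if f == 1 and t == 0 else 0 for f, t in zip(filled_row, template_row)]
        for filled_row, template_row in zip(filled, template)
    ]


template = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],   # preprinted line
]
filled = [
    [0, 1, 1, 0, 1, 0],   # typed characters
    [1, 1, 1, 1, 1, 1],   # characters overlapping the preprinted line
]
entries = subtract_template(filled, template)
for row in entries:
    print(row)
```

The second row comes back empty: wherever typed characters overlap the preprinted line, their pixels are removed along with it, which is the dropout problem mentioned above.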

Tom

Greg Dunkel

Jun 23, 2015, 12:48:08 PM
to tesser...@googlegroups.com
Scan at a higher resolution.  When I went from 200 dpi to 600 dpi my accuracy went from 85% to 98%.

--
/greg

Juergen Harms

Jun 28, 2015, 2:10:35 PM
to tesser...@googlegroups.com

Hi,

I use tesseract to do OCR conversion on bank transfer forms scanned on my flatbed scanner. Although I restrict conversion to digits plus a very few special characters, and although I pre-process with ImageMagick (select the area to be converted, cut off noise), I still observed a number of residual errors that are hard to explain - and to tolerate.

I have now obtained substantial improvements by taking particular care when aligning the transfer forms with the border of my scanner. Tesseract appears to be very sensitive to rotational mis-alignment.

A second, lesser improvement can be made by adjusting the character size. My ImageMagick filter lets me vary the size of the characters submitted to OCR conversion. Normally I use a scaling factor of 200%, but when a transfer form presents problems, the result can often be improved by changing the scaling factor to something between 100% and 400%.
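
The manual search over scaling factors could be automated along these lines. The post does not say how a result is judged to be better, so using an OCR confidence value is an assumption here, and `recognize` is a hypothetical callable standing in for the rescale-and-OCR step.

```python
def best_scale(recognize, scales=(100, 200, 300, 400)):
    """Run OCR at several scaling factors and keep the most confident result.

    recognize(scale_percent) is a hypothetical callable that rescales the
    image, runs OCR, and returns (text, confidence).
    """
    results = [(scale,) + tuple(recognize(scale)) for scale in scales]
    return max(results, key=lambda r: r[2])


# Stand-in for a real OCR call, returning made-up values for demonstration only.
def fake_recognize(scale_percent):
    canned = {100: ("12B4", 61.0), 200: ("1234", 88.0),
              300: ("1234", 91.5), 400: ("I234", 70.0)}
    return canned[scale_percent]


scale, text, confidence = best_scale(fake_recognize)
print(scale, text, confidence)
```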

On the other hand, I normally scan with 300 dpi resolution - going beyond that did not have any significant impact on the error rate of the result of OCR conversion.

Are
  • the rotational sensitivity,
  • dependency on the size of the scanned characters
known issues with tesseract (and does tesseract offer a way to deal with these two problems)? - I have already wondered whether I should extend my pre-processing with ImageMagick to detect and correct rotational mis-alignment.
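
One common way to detect rotation in pre-processing is a projection-profile search: try a range of candidate angles and keep the one at which the ink collapses into the fewest rows. The sketch below works on a list of black-pixel coordinates; the angle range, step size, and synthetic test data are assumptions, and it is not a description of what tesseract does internally.

```python
import math


def estimate_skew(points, max_angle=5.0, step=0.5):
    """Estimate page skew in degrees from black-pixel coordinates.

    For each candidate angle the points are rotated back by that angle and
    counted per row; text lines that are level pile up in few rows, so the
    sum of squared row counts peaks at the true skew.
    """
    best_angle, best_score = 0.0, -1
    steps = int(round(2 * max_angle / step))
    for i in range(steps + 1):
        angle = -max_angle + i * step
        a = math.radians(angle)
        rows = {}
        for x, y in points:
            row = round(-x * math.sin(a) + y * math.cos(a))
            rows[row] = rows.get(row, 0) + 1
        score = sum(count * count for count in rows.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


def synthetic_lines(skew_degrees):
    """Three level 'text lines' of points, rotated by skew_degrees."""
    t = math.radians(skew_degrees)
    return [(x * math.cos(t) - y * math.sin(t), x * math.sin(t) + y * math.cos(t))
            for y in (0, 20, 40) for x in range(200)]


print(estimate_skew(synthetic_lines(2.0)))
```

Rotating the image by the negative of the estimated angle levels the text lines again. ImageMagick also has a `-deskew` option that may be worth trying before writing custom code.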

Juergen

