Need help on improving text accuracy


Marie

Dec 3, 2016, 7:24:51 AM
to tesseract-ocr
Hi,

We are trying to recognize receipts using Tesseract (v3.02 on Windows). We have tried preprocessing the images, but the word accuracy (compared with OneNote's result) is still not good.

Any suggestions on how to improve the accuracy?


P.S. I'm attaching the raw image and the processed image, along with the text Tesseract recognized.

raw image: [image not preserved]

text recognized from raw image: [image not preserved]

processed image (scaled 200%, increased contrast, cropped border): [image not preserved]

Thanks,
Marie


Ashish Goel

Dec 7, 2016, 4:00:59 AM
to tesseract-ocr
Crop the image into sub-images and then run OCR on each one. Try cropping it into different segments.

Marie

Dec 8, 2016, 2:43:14 AM
to tesseract-ocr
Thank you Ashish for the suggestion.

The challenge is how to automate this process. Any thoughts?

Ashish Goel

Dec 8, 2016, 3:50:23 AM
to tesseract-ocr
It depends. The extent of automation will depend on your images.
If all of your images are the same size, you can find the coordinates of your sections (like header, store address, billing info) and use a tool (e.g. ImageMagick) to crop all of your images. Run OCR on the crops and then see which results need a closer look.

If your images are of different sizes, then you may have to do some image processing first, such as rescaling and resizing, and the exact steps will again depend on the variations.
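
As a rough illustration of the cropping idea above, here is a minimal Python sketch that builds ImageMagick crop commands for a few receipt sections. The section names and their fractional positions are made up for the example, and it assumes the sections sit at roughly the same relative position on every receipt, which lets the same layout be reused across image sizes. The `convert` command name is also an assumption: newer ImageMagick releases use `magick`, and on Windows `convert` can clash with a system tool of the same name.

```python
from pathlib import Path

# Hypothetical layout: each section as (left, top, right, bottom) fractions
# of the image. Measure these on your own receipts; the values are invented.
SECTIONS = {
    "header": (0.0, 0.0, 1.0, 0.2),
    "items": (0.0, 0.2, 1.0, 0.8),
    "totals": (0.0, 0.8, 1.0, 1.0),
}


def crop_geometry(width, height, frac_box):
    """Turn a fractional box into an ImageMagick WxH+X+Y geometry string."""
    left, top, right, bottom = frac_box
    x = round(left * width)
    y = round(top * height)
    w = round(right * width) - x
    h = round(bottom * height) - y
    return f"{w}x{h}+{x}+{y}"


def crop_commands(image_path, width, height, sections=SECTIONS):
    """Build one crop command (as an argument list) per section."""
    stem = Path(image_path).stem
    commands = []
    for name, box in sections.items():
        geometry = crop_geometry(width, height, box)
        out_path = f"{stem}_{name}.png"
        # +repage resets the virtual canvas so the crop starts at (0, 0).
        commands.append(
            ["convert", image_path, "-crop", geometry, "+repage", out_path]
        )
    return commands


if __name__ == "__main__":
    # Image size is passed in explicitly here; in practice you would read it
    # from the file with whatever imaging tool you already use.
    for cmd in crop_commands("receipt.png", 600, 1000):
        print(" ".join(cmd))
```

Each argument list can be handed to `subprocess.run(cmd, check=True)`, and the resulting section images can then be passed to Tesseract one at a time.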

Marie

Dec 8, 2016, 1:20:06 PM
to tesseract-ocr
Got it. Thanks a lot!

Jimi Hollow

Dec 15, 2016, 2:25:48 AM
to tesseract-ocr
Please share some information if you get this working. I am working on the same issue on Android.

Art Rhyno.

Dec 19, 2016, 1:45:38 PM
to tesser...@googlegroups.com

For receipt processing, you might consider leveraging Tesseract's word coordinates. For example, you could use the coordinates to pick out the words on the right side of the image, where the prices are, extract those individual word images, and redo the OCR on just that portion. If Tesseract isn't detecting the text at all, there are some multiplatform image tools that might help. I think the original poster was working on Windows, and I think OpenCV is an option on that platform and on Android.
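
One way to get at the word coordinates is Tesseract's hOCR output, where each word carries a `bbox x0 y0 x1 y1` entry in its `title` attribute. The sketch below, using only the Python standard library, collects those boxes and keeps the words that start in the right-hand part of the image. The sample markup, the 0.7 cutoff, and the function names are assumptions for illustration; the exact hOCR markup varies between Tesseract versions, so check the class names against your own output.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")
WORD_CLASSES = {"ocr_word", "ocrx_word"}  # accept either word class name


class WordBoxParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for each word span in hOCR."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        match = BBOX_RE.search(attrs.get("title") or "")
        if tag == "span" and classes & WORD_CLASSES and match:
            self._bbox = tuple(int(v) for v in match.groups())
            self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def words_in_right_column(hocr, image_width, min_x_fraction=0.7):
    """Return words whose left edge lies in the right part of the image."""
    parser = WordBoxParser()
    parser.feed(hocr)
    threshold = image_width * min_x_fraction
    return [(text, box) for text, box in parser.words if box[0] >= threshold]


# Made-up hOCR fragment standing in for real Tesseract output.
SAMPLE_HOCR = """
<span class='ocr_line' title='bbox 20 100 580 130'>
  <span class='ocrx_word' title='bbox 20 100 120 130'>MILK</span>
  <span class='ocrx_word' title='bbox 480 100 580 130'>2.49</span>
</span>
"""

if __name__ == "__main__":
    for text, box in words_in_right_column(SAMPLE_HOCR, image_width=600):
        print(text, box)
```

The returned boxes can then be used to crop each price out of the original image and run OCR on it again, possibly with a digits-only configuration.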

 

art
