Cannot read text from shopping receipt

Harit Himanshu

Mar 2, 2015, 2:59:58 AM
to tesser...@googlegroups.com
Consider the attached receipt.  

I am trying to get text from this image.  

I tried every page segmentation mode (-psm) that I could:

➜  receipts  tesseract costco.jpg costco -psm 0

Tesseract Open Source OCR Engine v3.02.02 with Leptonica

Error during processing.

➜  receipts  tesseract costco.jpg costco -psm 1

Tesseract Open Source OCR Engine v3.02.02 with Leptonica

OSD: Weak margin (4.85) for 209 blob text block, but using orientation anyway: 0

➜  receipts  tesseract costco.jpg costco -psm 2

Tesseract Open Source OCR Engine v3.02.02 with Leptonica

➜  receipts  tesseract costco.jpg costco -psm 4

Tesseract Open Source OCR Engine v3.02.02 with Leptonica

set_count == gridheight():Error:Assert failed:in file colfind.cpp, line 648

[1]    46598 abort      tesseract costco.jpg costco -psm 4

➜  receipts  tesseract costco.jpg costco -psm 5

Tesseract Open Source OCR Engine v3.02.02 with Leptonica

➜  receipts  tesseract costco.jpg costco -psm 6

Tesseract Open Source OCR Engine v3.02.02 with Leptonica

➜  receipts  tesseract costco.jpg costco -psm 7

Tesseract Open Source OCR Engine v3.02.02 with Leptonica

➜  receipts  tesseract costco.jpg costco -psm 8

Tesseract Open Source OCR Engine v3.02.02 with Leptonica

➜  receipts  tesseract costco.jpg costco -psm 9

Tesseract Open Source OCR Engine v3.02.02 with Leptonica

➜  receipts  tesseract costco.jpg costco -psm 10

Tesseract Open Source OCR Engine v3.02.02 with Leptonica

➜  receipts  tesseract costco.jpg costco -psm 6 -l eng

Tesseract Open Source OCR Engine v3.02.02 with Leptonica

➜  receipts  tesseract -v                             


The option that gives me the most data is -psm 6, but the output is unreadable (see the attached file).


How can I read this image?

Thanks


costco.jpg
costco.txt

Dmitri Silaev

Mar 2, 2015, 7:02:08 AM
to tesser...@googlegroups.com
There are a whole lot of problems with this image. To start with, it is a lower-quality version of an image found on the internet, e.g. at http://www.ilovecostco.com/shopping-coupon-book

Other problems include:
- Low resolution. Tesseract cannot extract enough contour detail from such small characters, so it outputs gibberish.
- I guess the image was preprocessed to improve readability for humans. Although a blur followed by an unsharp mask makes characters look sharp and high-contrast, it in fact significantly distorts their shapes, so the OCR engine fails to match them against the shapes it knows.
- A long pen stroke is slashed across a large area of the receipt. This causes segmentation failures.

There are ways to make OCR pull some data from an image like that. However, the best way is to get a better source image. I've just applied a few conversions to show how the result can be improved:
- Upscale 6x. This makes using Tesseract at least meaningful.
- Manual threshold. With Tesseract's default thresholding algorithm, such images suffer from merged characters and overblown strokes. Here I selected the threshold manually. That caused many characters to break and lose some strokes, but in return we got some sane output.
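
The two conversions above can be sketched in a few lines of plain Python. This is only an illustration, not the procedure actually used (that was done by hand in an image viewer): the bilinear interpolation, the cutoff of 128, and the function names upscale_bilinear and threshold are all assumptions, and the image is taken to be a grayscale grid given as a list of rows of 0..255 values.

```python
def upscale_bilinear(pixels, factor):
    """Enlarge a grayscale image (list of rows, values 0..255) by an integer factor.

    Bilinear interpolation is an assumption; the thread does not say which
    resampling method the image viewer applied.
    """
    h, w = len(pixels), len(pixels[0])
    out = []
    for y in range(h * factor):
        # Map the centre of the output pixel back into source coordinates.
        sy = min(max((y + 0.5) / factor - 0.5, 0.0), h - 1)
        y0 = int(sy)
        y1 = min(y0 + 1, h - 1)
        fy = sy - y0
        row = []
        for x in range(w * factor):
            sx = min(max((x + 0.5) / factor - 0.5, 0.0), w - 1)
            x0 = int(sx)
            x1 = min(x0 + 1, w - 1)
            fx = sx - x0
            # Blend the four neighbouring source pixels.
            top = pixels[y0][x0] * (1 - fx) + pixels[y0][x1] * fx
            bottom = pixels[y1][x0] * (1 - fx) + pixels[y1][x1] * fx
            row.append(round(top * (1 - fy) + bottom * fy))
        out.append(row)
    return out


def threshold(pixels, cutoff):
    """Binarize: values below the hand-picked cutoff become black, the rest white."""
    return [[0 if p < cutoff else 255 for p in row] for row in pixels]


# Tiny 2x2 stand-in for a receipt crop (hypothetical data).
tiny = [[0, 255],
        [255, 0]]
big = upscale_bilinear(tiny, 6)   # 6x, as suggested above
binary = threshold(big, 128)      # 128 is an assumed cutoff; tune it per image
```

A lower cutoff thins the strokes and breaks characters apart, while a higher one merges them, so the value generally has to be chosen by inspecting the result for each image.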

There are a number of other improvements that could be made, and for the best result you need to write code, but the ones above were easy to do in just an image viewing program (I use FastStone).

I've attached the source image (inet004.jpg), intermediate results (inet004_Rs.jpg, inet004_RsTh.jpg) and the output (inet004.txt). The command line was just "tesseract inet004_RsTh.jpg inet004 -psm 6". I used Tesseract compiled from repository source as of February 3.
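
For anyone scripting this step, here is a minimal sketch that wraps the same command line in Python. The helper names build_command and run_ocr are hypothetical, and the sketch assumes a tesseract 3.x binary on the PATH that accepts the single-dash -psm flag used in this thread (4.x and later releases spell it --psm).

```python
import shutil
import subprocess


def build_command(image_path, output_base, psm):
    """Return the argv list for: tesseract <image> <output base> -psm <mode>."""
    return ["tesseract", image_path, output_base, "-psm", str(psm)]


def run_ocr(image_path, output_base, psm=6):
    """Run tesseract and return its exit code, or None if it is not installed.

    Tesseract writes the recognized text to <output_base>.txt.
    """
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(
        build_command(image_path, output_base, psm),
        capture_output=True,
        text=True,
    )
    return result.returncode


# The command quoted above, as an argument list:
command = build_command("inet004_RsTh.jpg", "inet004", 6)
```

Checking the exit code matters here because, as the earlier -psm 0 and -psm 4 runs show, some modes fail or abort on this kind of image.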

Best regards,
Dmitri Silaev
www.CustomOCR.com




inet004.txt
inet004.jpg
inet004_Rs.jpg
inet004_RsTh.jpg

Harit Himanshu

Mar 2, 2015, 11:51:40 AM
to tesser...@googlegroups.com
Thanks a lot, Dmitri. I will try again on my end and let you know.

Thanks a lot
+ Harit