Curious why Tesseract does not extract the top two lines of text in the attached receipt image

krishna

Nov 14, 2018, 3:04:53 PM

to tesseract-ocr

The first two lines of the receipt are not recognized at all, and I don't see any reason why.

/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/bin/java "...
Info in fopenReadFromMemory: work-around: writing to a temp file
Warning: Invalid resolution 0 dpi. Using 70 instead.
OCR output:
8 S I el
R
CAPE CRNHVE;gL? FL 32920-3912
(321) 328-0479
FEBREZE Fap 4.00 S
REF FRES
037000909026-120
REGULAR PRICE 4.00
MFG_CcauPOy 4.00-
FEBREZE FAB REF FRES 4.00 8
037000909026-120
STORE_DISCOUNT 0.93-8
AJAX RUBY RED 2807 2.00 §
035000446749-120
STORE DISCOUNT 0.46-S
MFG COUPON 0.50-
KABOOM BATHROOM CLEA 3.96 § ?
757037350157-120
STORE DISCOUNT 0.91-S ,
MFG COUPON 0.50-
..

dg-cropped-1.jpg

Zdenko Podobny

Nov 14, 2018, 3:19:42 PM

to tesser...@googlegroups.com

We are also curious :-) :

What version of tesseract did you use?

What version of traineddata?

Did you read and try the suggestions at https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality ?

Zdenko

On Wed, Nov 14, 2018 at 21:04, krishna <krish...@gmail.com> wrote:

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/943f76c3-959e-49ad-aa0e-d0cb0fd83a1c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

krishna

Nov 19, 2018, 3:31:29 PM

to tesseract-ocr

Thanks @zdenop. I added a border on the top and bottom that matches the background color on the right, but I still get the same result. Is there a cache that I can clear, or something I should retrain?

Also, I can see that the blacklist and whitelist variables are set, but blacklisted characters still appear in the output. Why?

cd ~/work/FUN/tesseract/
cat VERSION
4.0.0

tessdata
4.0.0 - 20180322

Here is the Java code:

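// Assumes the Bytedeco JavaCPP presets for Tesseract (1.4.x package names).
import org.bytedeco.javacpp.BytePointer;
import org.bytedeco.javacpp.tesseract.TessBaseAPI;
import static org.bytedeco.javacpp.lept.pixReadMem;

TessBaseAPI api = new TessBaseAPI();
api.Init(null, "eng"); // returns non-zero on failure
// Set both character filters before running recognition.
api.SetVariable("tessedit_char_blacklist", "™©°!@#%^&*()_+=-[]}{;:'\\\"\\\\|~`,/<>?");
api.SetVariable("tessedit_char_whitelist", ".$0123456789,/ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz");
api.SetImage(pixReadMem(bytes, bytes.length)); // bytes holds the encoded image file
BytePointer outText = api.GetUTF8Text();

// Read the variable back to confirm that it was stored.
String tessedit_char_blacklist_value = api.GetStringVariable("tessedit_char_blacklist");
System.out.println("tessedit_char_blacklist_value  value is " + tessedit_char_blacklist_value);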

Zdenko Podobny

Nov 24, 2018, 4:16:58 AM

to tesser...@googlegroups.com

You should read more ;-)

First of all, see https://github.com/tesseract-ocr/tesseract/issues/751
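
As far as I can tell, that issue is about the LSTM engine in 4.0 not honoring tessedit_char_whitelist / tessedit_char_blacklist, which would explain why setting them changes nothing in the output. One possible workaround, sketched here under the assumption that simply dropping unwanted characters after recognition is acceptable, is to filter the returned text yourself. The names OcrTextFilter and stripBlacklisted are hypothetical and not part of Tesseract:

```java
class OcrTextFilter {
    // Returns `text` with every character that occurs in `blacklist` removed.
    // Line breaks are kept as long as they are not listed in the blacklist.
    static String stripBlacklisted(String text, String blacklist) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints()
            .filter(cp -> blacklist.indexOf(cp) < 0)
            .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        // Two lines taken from the OCR output earlier in this thread.
        String ocr = "STORE DISCOUNT 0.91-S ,\nKABOOM BATHROOM CLEA 3.96 ?";
        System.out.println(stripBlacklisted(ocr, ",?"));
    }
}
```

This only hides the symptom: a character that was misread as a blacklisted one is dropped, not corrected, so better preprocessing is still needed.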

I am not sure what you mean by "I added a border on top and bottom that matches background color on right" (or maybe you attached the wrong image), but I got a better result after cropping the image and removing the background color. Using tessdata_best,

tesseract.exe dg-cropped-top-down-border_bv.png - --psm 4

produces:

DOLLAR GENERAL STORE #15100

6395 NORTH ATLANTIC AVE 5

CAPE CANAVERAL, FL 32920-391

(321) 328-0479

S

FEBREZE FAB REF FRES 4.00

037000909026-120 4.00

WrREGULAR PRICE 350-

FG COUPON 8s

a ocas er Res +

STORE DISCOUNT 9.5 $

RR 2

STORE DISCOUNT 9.3675

MFG COUPON 0. g-

KABOOM BATHROOM CLEA 3.9

757037350157-120

STORE DISCOUNT 0.91-

MFG COUPON 0.50 s

SS LHS AQUARIUM 7.50 1.00

074182268008-102

STORE DISCOUNT 0.23-S

MFG COUPON 0.50-

DGH TOILET BLCH BLUE 1.66 §

813606020286-120

STORE DISCOUNT 0.36-S

DOWNY INFUSIONS AMBE 5.00 S

This is still not 100%, but it demonstrates that preprocessing the image is key to getting a good OCR result.

Zdenko

On Mon, Nov 19, 2018 at 21:31, krishna <krish...@gmail.com> wrote:

dg-cropped-top-down-border_bv.png

krishna

Nov 24, 2018, 6:54:53 PM

to tesseract-ocr

Thanks @zdenko. I am really new to Tesseract 4.0 and impressed by it. https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality suggests adding a border, which was missing in the first image, so I copied the yellowish-brown border from the right/left and added it to the top/bottom to get the same border all around.

Thanks for issue 751, you saved my head :). The FAQ mentions automatic preprocessing, including background trimming, but it looks like I have to do my own preprocessing? How do I see the intermediate preprocessing steps/images, and which library/command did you use to preprocess/trim the image?

Zdenko Podobny

Nov 25, 2018, 9:13:47 AM

to tesser...@googlegroups.com

ImproveQuality suggests much more than adding a border ;-)

Actually, it suggests scanning border removal (cropping; I did it manually, just to demonstrate that it helps ;-) ), noise removal, binarization... It also suggests several tools. So the answers to all your questions are in the wiki page you refer to.

I did the background removal with the leptonica cleanDarkBackground function.
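
This is not the leptonica call itself. As a rough stand-in for the same idea in plain Java, the sketch below converts each pixel to a gray level and turns everything at or above a threshold white and the rest black, so a tinted background disappears before OCR. The class name, the method name and the threshold of 128 are assumptions for illustration, and the threshold would need tuning per image:

```java
import java.awt.image.BufferedImage;

class BackgroundCleaner {
    // Returns a black-and-white copy of `src`: pixels whose luminance is at
    // or above `threshold` (0-255) become white, all others become black.
    static BufferedImage binarize(BufferedImage src, int threshold) {
        int w = src.getWidth(), h = src.getHeight();
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF, g = (rgb >> 8) & 0xFF, b = rgb & 0xFF;
                // Integer approximation of the usual luma weights.
                int lum = (299 * r + 587 * g + 114 * b) / 1000;
                out.setRGB(x, y, lum >= threshold ? 0xFFFFFF : 0x000000);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Tiny in-memory example: one yellowish-brown pixel, one dark pixel.
        BufferedImage img = new BufferedImage(2, 1, BufferedImage.TYPE_INT_RGB);
        img.setRGB(0, 0, 0xC8A050);
        img.setRGB(1, 0, 0x202020);
        BufferedImage clean = binarize(img, 128);
        System.out.printf("%06X %06X%n",
                clean.getRGB(0, 0) & 0xFFFFFF, clean.getRGB(1, 0) & 0xFFFFFF);
    }
}
```

For a real receipt you would load the file with javax.imageio.ImageIO.read, call binarize, and write the result back as PNG before handing it to tesseract. A single global threshold is much cruder than adaptive background cleaning, so uneven lighting may still need one of the tools from the wiki page.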

Zdenko

On Sun, Nov 25, 2018 at 0:54, krishna <krish...@gmail.com> wrote: