curious why tesseract does not extract the top two lines of text in this attached receipt image


krishna

Nov 14, 2018, 3:04:53 PM
to tesseract-ocr
The first two lines are not recognized at all, and I don't see any reason why.

/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/bin/java "...
Info in fopenReadFromMemory: work-around: writing to a temp file
Warning: Invalid resolution 0 dpi. Using 70 instead.
OCR output:
8 S I el
R
CAPE CRNHVE;gL? FL 32920-3912
(321) 328-0479
FEBREZE Fap 4.00 S
REF FRES
037000909026-120
REGULAR PRICE 4.00
MFG_CcauPOy 4.00-
FEBREZE FAB REF FRES 4.00 8
037000909026-120
STORE_DISCOUNT 0.93-8
AJAX RUBY RED 2807 2.00 §
035000446749-120
STORE DISCOUNT 0.46-S
MFG COUPON 0.50-
KABOOM BATHROOM CLEA 3.96 § ?
757037350157-120
STORE DISCOUNT 0.91-S ,
MFG COUPON 0.50-
..
dg-cropped-1.jpg

Zdenko Podobny

Nov 14, 2018, 3:19:42 PM
to tesser...@googlegroups.com
We are also curious :-) :
What version of tesseract did you use?
What version of traineddata?

Zdenko


On Wed, Nov 14, 2018 at 21:04, krishna <krish...@gmail.com> wrote:

krishna

Nov 19, 2018, 3:31:29 PM
to tesseract-ocr
Thanks @zdenop, I added a border on the top and bottom that matches the background color on the right, but I still get the same result. Is there a cache that I can clear, or do I need to retrain?
Also, I can see that the blacklist and whitelist variables are set, but I still see blacklisted characters in the output. Why?

cd ~/work/FUN/tesseract/
cat VERSION
4.0.0

tessdata
4.0.0 - 20180322

Here is the Java code:
TessBaseAPI api = new TessBaseAPI();
api.Init(null, "eng");
api.SetVariable("tessedit_char_blacklist", "™©°!@#%^&*()_+=-[]}{;:'\\\"\\\\|~`,/<>?"); // "!?@#%&*()<>_-+=/:;'\"");
api.SetVariable("tessedit_char_whitelist", ".$0123456789,/ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz");
api.SetImage(pixReadMem(bytes, bytes.length));
BytePointer outText = api.GetUTF8Text();

String tessedit_char_blacklist_value = api.GetStringVariable("tessedit_char_blacklist");
System.out.println("tessedit_char_blacklist_value value is " + tessedit_char_blacklist_value);
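
If the engine in use does not honor these variables (the 4.0 LSTM engine reportedly ignored the whitelist and blacklist at the time), one possible workaround is to filter the recognized text after OCR. This is a minimal sketch in plain Java; `OcrTextFilter` and `filterAllowed` are hypothetical names for illustration, not part of Tesseract or its Java bindings. Note that filtering only drops unwanted characters, it does not make the engine pick a better candidate.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not a Tesseract API.
final class OcrTextFilter {
    private OcrTextFilter() {}

    /**
     * Returns text with every character removed that is not in allowed.
     * Whitespace is always kept so the line layout of the receipt survives.
     */
    static String filterAllowed(String text, String allowed) {
        Set<Integer> ok = new HashSet<>();
        allowed.codePoints().forEach(cp -> ok.add(cp));

        StringBuilder out = new StringBuilder(text.length());
        text.codePoints()
            .filter(cp -> Character.isWhitespace(cp) || ok.contains(cp))
            .forEach(cp -> out.appendCodePoint(cp));
        return out.toString();
    }
}
```

With the code above, this would be called on the string obtained from `outText`, passing the same whitelist string that was given to `SetVariable`.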



dg-cropped-top-down-border.jpg


Zdenko Podobny

Nov 24, 2018, 4:16:58 AM
to tesser...@googlegroups.com
You should read more ;-)


I am not sure what you mean by "I added a border on top and bottom that matches background color on right" (or maybe you attached the wrong image), but I got a better result by cropping the image and removing the background color. Using tessdata_best,
tesseract.exe dg-cropped-top-down-border_bv.png - --psm 4

produces:

DOLLAR GENERAL STORE #15100

6395 NORTH ATLANTIC AVE 5

CAPE CANAVERAL, FL 32920-391

S
FEBREZE FAB REF FRES 4.00
037000909026-120 4.00
WrREGULAR PRICE 350-
FG COUPON 8s
a ocas er Res +
STORE DISCOUNT 9.5 $
RR 2
STORE DISCOUNT 9.3675
MFG COUPON 0. g-
KABOOM BATHROOM CLEA 3.9
757037350157-120
STORE DISCOUNT 0.91-
MFG COUPON 0.50 s
SS LHS AQUARIUM 7.50 1.00
074182268008-102
STORE DISCOUNT 0.23-S
MFG COUPON 0.50-
DGH TOILET BLCH BLUE 1.66 §
813606020286-120
STORE DISCOUNT 0.36-S

DOWNY INFUSIONS AMBE 5.00 S

This is still not 100%, but it demonstrates that preprocessing the image is key to getting a good OCR result.
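
The cropping here was done by hand. For anyone who wants to automate that step, below is a minimal sketch in plain Java (no Tesseract or Leptonica calls). It assumes the receipt paper is clearly brighter than the surrounding backdrop; `ReceiptCrop`, `brightBounds`, `cropToBright`, and the `minLuma` threshold are hypothetical names and parameters chosen for illustration. A single stray bright pixel in the backdrop will enlarge the box, so treat it as a starting point only.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

// Hypothetical helper for illustration; not part of Tesseract or Leptonica.
final class ReceiptCrop {
    private ReceiptCrop() {}

    /** Approximate brightness (0..255) of a packed RGB pixel. */
    static int luma(int rgb) {
        int r = (rgb >> 16) & 0xFF;
        int g = (rgb >> 8) & 0xFF;
        int b = rgb & 0xFF;
        return (299 * r + 587 * g + 114 * b) / 1000;
    }

    /** Bounding box of all pixels with luma >= minLuma, or null if there are none. */
    static Rectangle brightBounds(BufferedImage img, int minLuma) {
        int minX = img.getWidth(), minY = img.getHeight(), maxX = -1, maxY = -1;
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                if (luma(img.getRGB(x, y)) >= minLuma) {
                    minX = Math.min(minX, x);
                    minY = Math.min(minY, y);
                    maxX = Math.max(maxX, x);
                    maxY = Math.max(maxY, y);
                }
            }
        }
        if (maxX < 0) {
            return null; // nothing bright enough was found
        }
        return new Rectangle(minX, minY, maxX - minX + 1, maxY - minY + 1);
    }

    /** Crops to the bright region; returns the input unchanged if none is found. */
    static BufferedImage cropToBright(BufferedImage img, int minLuma) {
        Rectangle r = brightBounds(img, minLuma);
        return r == null ? img : img.getSubimage(r.x, r.y, r.width, r.height);
    }
}
```

The image can be loaded and saved with `javax.imageio.ImageIO.read` and `ImageIO.write` before passing the result to Tesseract.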


Zdenko


On Mon, Nov 19, 2018 at 21:31, krishna <krish...@gmail.com> wrote:
dg-cropped-top-down-border_bv.png

krishna

Nov 24, 2018, 6:54:53 PM
to tesseract-ocr
Thanks @zdenko. I am really new to Tesseract 4.0 and impressed by it. https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality suggests adding a border, which was missing in the first image. So I copied the yellowish-brown border from the right/left and added it to the top/bottom to get the same border all around.

Thanks for issue 751, you saved my head :). The FAQ mentions automatic preprocessing, including background trimming, but it looks like I have to do my own preprocessing? How can I see the intermediate preprocessing steps/images, and which library/command did you use to preprocess/trim the image?

Zdenko Podobny

Nov 25, 2018, 9:13:47 AM
to tesser...@googlegroups.com
ImproveQuality suggests much more than adding a border ;-)
Actually, it suggests scanning border removal (cropping; I did it manually, just to demonstrate that it helps ;-) ), noise removal, binarization... It also suggests several tools. So the answers to all your questions are in the wiki page you refer to.

I did the background removal with the leptonica cleanDarkBackground function.
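
For readers who want to try the binarization step mentioned above without Leptonica, here is a minimal sketch of global Otsu thresholding in plain Java. This is a different and simpler technique than the Leptonica function named above, and it is swapped in only for illustration; `OtsuBinarizer` and its method names are hypothetical. A single global threshold works best when lighting is fairly even, so a receipt with strong shadows may need an adaptive method instead.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper for illustration; this is NOT the Leptonica routine.
final class OtsuBinarizer {
    private OtsuBinarizer() {}

    /** Approximate brightness (0..255) of a packed RGB pixel. */
    static int luma(int rgb) {
        int r = (rgb >> 16) & 0xFF;
        int g = (rgb >> 8) & 0xFF;
        int b = rgb & 0xFF;
        return (299 * r + 587 * g + 114 * b) / 1000;
    }

    /** Otsu's method: the threshold that maximizes between-class variance. */
    static int otsuThreshold(int[] hist, int total) {
        long sumAll = 0;
        for (int i = 0; i < 256; i++) sumAll += (long) i * hist[i];

        long sumDark = 0;
        int nDark = 0;
        double best = -1;
        int threshold = 0;
        for (int t = 0; t < 256; t++) {
            nDark += hist[t];
            if (nDark == 0) continue;      // no dark pixels yet
            int nLight = total - nDark;
            if (nLight == 0) break;        // all pixels are already in the dark class
            sumDark += (long) t * hist[t];
            double meanDark = (double) sumDark / nDark;
            double meanLight = (double) (sumAll - sumDark) / nLight;
            double diff = meanDark - meanLight;
            double between = (double) nDark * nLight * diff * diff;
            if (between > best) {
                best = between;
                threshold = t;
            }
        }
        return threshold;
    }

    /** Returns a 1-bit image: pixels brighter than the Otsu threshold become white. */
    static BufferedImage binarize(BufferedImage src) {
        int w = src.getWidth(), h = src.getHeight();
        int[] hist = new int[256];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                hist[luma(src.getRGB(x, y))]++;

        int t = otsuThreshold(hist, w * h);
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                out.setRGB(x, y, luma(src.getRGB(x, y)) > t ? 0xFFFFFFFF : 0xFF000000);
        return out;
    }
}
```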


Zdenko


On Sun, Nov 25, 2018 at 0:54, krishna <krish...@gmail.com> wrote: