Looking to hire a pytesseract consultant via Skype

Bill Upham

Mar 14, 2020, 1:06:02 PM

to tesser...@googlegroups.com

Are there any experts out there offering their services?

I want to improve on this simple script.

__________________________________

from pytesseract import image_to_string
import pytesseract
import cv2
import re

pytesseract.pytesseract.tesseract_cmd = r'C:\Users\Bill-pc.Admin-PC\AppData\Local\Tesseract-OCR\tesseract.exe'

img = cv2.imread(r'D:\Bill-pc\Documents\Excel Documents\PNG\2017-03-26_SecondPie.png', cv2.IMREAD_GRAYSCALE)
height, width = img.shape
roi = img[height - 41: height, 2: width]
roi = cv2.resize(roi, None, fx=1.006, fy=1.006)
_, th = cv2.threshold(roi, 253, 255, cv2.THRESH_BINARY)

text_detected = image_to_string(roi, config="--psm 10 --oem 3 tessedit_char_whitelist=0123456789", )
text_detected = re.sub('I', '1', text_detected)
text_detected = re.sub('i', '1', text_detected)
text_detected = re.sub('l', '1', text_detected)
text_detected = re.sub('L', '1', text_detected)
text_detected = re.sub('Z', '2', text_detected)
text_detected = re.sub('S', '5', text_detected)
text_detected = re.sub('s', '5', text_detected)
text_detected = re.sub('G', '6', text_detected)
numbers = re.findall("[0-9]+", text_detected)

print(text_detected)
cv2.imshow("th", roi)
print(numbers)
#print(text[5] + text[6] + text[7])
cv2.waitKey(0)

2017-03-26_SecondPie.png

Aaron Stewart

Mar 14, 2020, 5:20:36 PM

to tesseract-ocr

You're not using the result of cv2.threshold. That might make a difference.

Aaron Stewart

Mar 14, 2020, 6:03:22 PM

to tesseract-ocr

roi = cv2.resize(roi, None, fx=2, fy=2)
_, roi = cv2.threshold(roi, 128+64, 255, cv2.THRESH_BINARY)
roi = cv2.GaussianBlur(roi, (3,3), 0)

Bill Upham

Mar 26, 2020, 1:24:34 AM

to tesser...@googlegroups.com

Thank you, Aaron, for the information; it was an improvement. I'm attaching one of the PNG files that I read (I have 200).

Interestingly, it still does not read every file 100% correctly. My script counts the digits, and sometimes it misses one of them or reads a 1 as 15.

Maybe I'm expecting perfection from computer vision and that's just not the case!

Thanks again

Bill Upham

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/2ca084e4-aae6-423e-b359-a472e00579e6%40googlegroups.com.

2007-04-12_SecondPie.png

Gabriel de Oliveira

Mar 31, 2020, 3:27:32 PM

to tesseract-ocr

You are actually quite lucky on this one: since your image seems to be pure RGB, you can split the three color channels directly (ignoring the PNG's fourth, alpha channel) and process each of them independently as a grayscale image.

Also, you might not really need Tesseract here. Simple template matching might do a very good job in your specific case. Have a look at this: https://www.pyimagesearch.com/2017/07/17/credit-card-ocr-with-opencv-and-python/
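
For readers who want the idea behind the template-matching suggestion without opening the link, here is a toy sketch in pure Python. The 3x5 glyph shapes, the fixed character pitch, and all function names are invented for illustration; none of them come from the attached PNG or from the linked article. On real images, OpenCV's cv2.matchTemplate would normally do the comparison.

```python
# Toy template matching on text grids: '#' is ink, '.' is background.
# The glyph shapes are made up for this example.
TEMPLATES = {
    "0": ["###", "#.#", "#.#", "#.#", "###"],
    "1": [".#.", "##.", ".#.", ".#.", "###"],
    "7": ["###", "..#", ".#.", ".#.", ".#."],
}


def mismatch(glyph, template):
    """Count the cells where the glyph and the template disagree."""
    return sum(
        g != t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def classify(glyph, templates=TEMPLATES):
    """Return the label of the template with the fewest mismatching cells."""
    return min(templates, key=lambda label: mismatch(glyph, templates[label]))


def split_glyphs(rows, width=3, gap=1):
    """Cut a fixed-pitch line into glyph-sized grids (assumes a constant pitch)."""
    step = width + gap
    return [[row[x:x + width] for row in rows] for x in range(0, len(rows[0]), step)]


def read_line(rows):
    """Classify every glyph in the line and join the labels."""
    return "".join(classify(glyph) for glyph in split_glyphs(rows))
```

Because the nearest template wins, a glyph with a few damaged pixels is still assigned to the closest known digit, which is why this approach can be robust when the font never changes.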

Finally, you might also want to try the legacy engine on this one, since LSTMs wouldn't make much sense here. This way you could also use the character whitelist feature, which is not supported by the LSTM engine.
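
A small sketch of how the config string for that experiment could be assembled. `build_config` is a hypothetical helper, not part of pytesseract. It assumes the installed traineddata file includes the legacy model, which `--oem 0` needs, and it passes the whitelist with `-c`, the flag Tesseract uses for setting variables. The script at the top of the thread omits `-c`, so its whitelist was probably never applied. As far as I know, whitelist support for the LSTM engine came back in Tesseract 4.1, so the restriction matters mostly on 4.0.

```python
def build_config(psm, oem, whitelist=""):
    """Assemble a Tesseract option string for pytesseract's `config=` argument.

    Hypothetical helper for illustration only.
    """
    parts = [f"--psm {psm}", f"--oem {oem}"]
    if whitelist:
        # Tesseract variables are set with `-c NAME=VALUE`.
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


# Possible use (needs pytesseract and an image, so it is left as a comment):
#   image_to_string(roi, config=build_config(psm=7, oem=0, whitelist="0123456789"))
legacy_digits = build_config(psm=7, oem=0, whitelist="0123456789")
```

The choice of `--psm 7` (treat the image as a single text line) is an assumption about the cropped strip; `--psm 10`, used in the original script, treats the image as a single character.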

Aaron Stewart

Apr 1, 2020, 1:20:37 PM

to tesseract-ocr

I agree with the suggestion to try template matching. I already did some experiments with Tesseract, so I will share those here.

Previous threads have brought up issues with part numbers with mixed letters and digits, when using the default English training data. The same thing is happening here. In one of your examples, R5 -> RS and T9 -> TS.

I tried a few experiments to alter the spacing in the original image.

(1) First, I tried increasing the horizontal spacing between characters. A small increase does seem to help; however, if I added too much space, there was a "ringing" effect in which Tesseract read characters that aren't there. You can see that in some cases "V" got doubled into "Vv".

(2) Next, I tried putting each character on a separate line. In this case, too, there was a "ringing" effect with the letter V.

(3) Third, I tried putting each character into its own image. (This is slower because I believe pytesseract launches a new instance each time you call it.)

(4) Finally, I tried running all three approaches together and showing the results together.
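
To make experiment (1) concrete, here is a rough pure-Python sketch of widening the gaps between characters. It is not the attached char_spacing.py, which is not reproduced in this thread. It works on a toy text grid ('#' for ink, '.' for background) rather than a real image, and the name `widen_gaps` and the `extra` parameter are made up for the example.

```python
def widen_gaps(rows, extra=2, background="."):
    """Add `extra` background columns to every blank gap between inked columns.

    `rows` is a list of equal-length strings; `background` is a single character.
    Leading and trailing margins are left unchanged.
    """
    width = len(rows[0])
    # A column is blank when every row has background at that position.
    blank = [all(row[x] == background for row in rows) for x in range(width)]
    out = [[] for _ in rows]
    for x in range(width):
        for y, row in enumerate(rows):
            out[y].append(row[x])
        # The gap ends here if this column is blank and the next one has ink.
        ends_gap = blank[x] and x + 1 < width and not blank[x + 1]
        # Only widen interior gaps, i.e. when ink was already seen to the left.
        if ends_gap and not all(blank[: x + 1]):
            for cells in out:
                cells.extend(background * extra)
    return ["".join(cells) for cells in out]
```

On a real image the same idea would apply to pixel columns, and `extra` is the parameter that needs tuning.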

For each method, I had to tune the parameters a little bit, and so it's likely that it will still fail on some cases in your data set.

For me, it was interesting to play with the different spacing parameters and see how Tesseract reacts.

I did not experiment much with the Page Segmentation Mode (psm) parameter. I haven't tried the legacy engine either, which was suggested.

img.py

char_spacing.py

results.txt

Aaron Stewart

Apr 1, 2020, 1:27:01 PM

to tesseract-ocr

Correction: Adding more horizontal spacing doesn't seem to make the ringing effect worse with the current code. I was using an incorrect page segmentation mode before this version of the code.

Lorenzo Bolzani

Apr 2, 2020, 5:44:24 AM

to tesser...@googlegroups.com

Hi,

You could try looking at the distances between the symbol boxes; see the attached script.

It's not very reliable, as it depends very much on how you preprocess the text, and you have to find the magic threshold. I'm using version 4.0; symbol boxes were improved in version 4.1, so it could work better there.
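
A minimal sketch of the box-distance idea, under some assumptions: it is not the attached ocr_boxes.py, it expects text in the format returned by pytesseract.image_to_boxes (one "char left bottom right top page" line per symbol), and `max_gap` stands in for the magic threshold, which has to be tuned per image. The helper names and the sample coordinates are invented.

```python
def parse_boxes(box_text):
    """Parse 'char left bottom right top page' lines into (char, left, right)."""
    boxes = []
    for line in box_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        boxes.append((fields[0], int(fields[1]), int(fields[3])))
    return boxes


def group_by_gap(boxes, max_gap):
    """Start a new group whenever the horizontal gap exceeds `max_gap` pixels."""
    groups = []
    prev_right = None
    for char, left, right in sorted(boxes, key=lambda box: box[1]):
        if prev_right is None or left - prev_right > max_gap:
            groups.append(char)
        else:
            groups[-1] += char
        prev_right = right
    return groups


# Invented sample: two tight pairs separated by a wide gap.
sample = "R 10 5 20 25 0\n1 22 5 28 25 0\nT 40 5 50 25 0\n6 52 5 58 25 0\n"
tokens = group_by_gap(parse_boxes(sample), max_gap=5)
```

With these sample coordinates the gaps are 2, 12, and 2 pixels, so a threshold of 5 keeps each letter with its digit.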

Or, maybe simpler, you could rely on the fact that letters are alone or followed by numbers, never as pairs.

If this is always the case this should work:

import re

s = "N1VN2V2N2V2TRV3T3R3RIN8S8R2R1T6R2T2T1"
# split() keeps the captured tokens; drop the empty strings between them
parts = [p for p in re.split(r"(\w\d?)", s) if p]

There is a mistake on R1 being seen as RI. If I is not a valid letter you can do a simple replace of I with 1.
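
As a sketch of that replacement step, the snippet below fixes look-alike characters before splitting. It assumes, as stated above, that I is never a valid letter in these labels; the function names are made up. The translation table covers the digits-only case and performs the same eight substitutions as the re.sub chain in the original script.

```python
import re

# Digits-only fields: map look-alike letters to digits in one pass.
DIGIT_LOOKALIKES = str.maketrans("IilLZSsG", "11112556")


def fix_digits(text):
    """Apply the look-alike table; only sensible when no letters are expected."""
    return text.translate(DIGIT_LOOKALIKES)


def tokenize(text, invalid_letters="I", replacement="1"):
    """Replace letters assumed to be invalid, then split into letter[+digit] tokens."""
    for letter in invalid_letters:
        text = text.replace(letter, replacement)
    return re.findall(r"[A-Z]\d?", text)


labels = tokenize("N1VN2V2N2V2TRV3T3R3RIN8S8R2R1T6R2T2T1")
```

For mixed labels such as S8, the full digit table would be wrong because S is a valid letter there, which is why `tokenize` only replaces the letters passed in `invalid_letters`.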

Bye

Lorenzo

2017-03-26_SecondPie.png

ocr boxes_screenshot_02.04.2020.png

ocr_boxes.py
