--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/1a2fa0e4-b998-4931-ad7d-ae069a46568bn%40googlegroups.com.
It might be worth trying a black-and-white rendering with a PSM of 11 (sparse text), since your input images are of such good quality. This mode is less likely to miss words or letters, though other artifacts may slip through. Something like this seems to get decent results:
import cv2
import pytesseract

# PSM 11 = sparse text: find as much text as possible, in no particular order
TESSERACT_CONFIG = r'--psm 11'

def showResults(region):
    results = pytesseract.image_to_data(region,
                                        config=TESSERACT_CONFIG,
                                        output_type=pytesseract.Output.DICT)
    tlen = len(results['text'])
    for i in range(tlen):
        # use conf to weed out some of the cruft (non-word rows report conf -1)
        if float(results['conf'][i]) > 0:
            print("WORD:", results['text'][i])
            print("left:", results['left'][i])
            print("top:", results['top'][i])
            print("width:", results['width'][i])
            print("height:", results['height'][i])
            print("conf:", results['conf'][i])

# read as grayscale to mute colors
gray = cv2.imread("mina.png", cv2.IMREAD_GRAYSCALE)
if gray is None:
    # cv2.imread returns None instead of raising when the file cannot be read
    raise FileNotFoundError("could not read mina.png")
# convert to 2 color black & white
im = cv2.threshold(gray, 128, 255, cv2.THRESH_BINARY)[1]
_, w = im.shape

# crop and ocr top region (as per coords in email)
region1 = im[55:110, 0:w]
cv2.imwrite('region1.png', region1)
showResults(region1)

# crop and ocr bottom region
region2 = im[312:360, 0:w]
cv2.imwrite('region2.png', region2)
showResults(region2)
I think you may be cropping at a more granular level than in this example, but the basic approach would be the same.
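If you want to work with the recognized words rather than just print them, the confidence filter can be pulled out into a small helper. This is only a rough sketch: the name confident_words, the min_conf parameter, and the sample values are made up for illustration. It assumes the same dictionary shape that pytesseract.Output.DICT gives back (parallel lists under 'text', 'conf', 'left', 'top', 'width', 'height').

```python
def confident_words(results, min_conf=0.0):
    """Collect word boxes whose confidence is above min_conf.

    `results` is assumed to have the shape of pytesseract's Output.DICT:
    parallel lists keyed by 'text', 'conf', 'left', 'top', 'width', 'height'.
    """
    words = []
    for i, text in enumerate(results["text"]):
        # conf may arrive as int, float, or string depending on the version
        conf = float(results["conf"][i])
        # skip low-confidence rows and rows with empty or whitespace-only text
        if conf > min_conf and text.strip():
            words.append({
                "text": text,
                "left": results["left"][i],
                "top": results["top"][i],
                "width": results["width"][i],
                "height": results["height"][i],
                "conf": conf,
            })
    return words


# Hand-made sample in the Output.DICT shape (illustrative values, not real OCR output)
sample = {
    "text":   ["", "Mina", "x7", " "],
    "conf":   ["-1", "96.5", "12.0", "95"],
    "left":   [0, 10, 80, 120],
    "top":    [0, 5, 6, 7],
    "width":  [200, 60, 20, 4],
    "height": [55, 30, 28, 30],
}

for word in confident_words(sample, min_conf=60):
    print(word["text"], word["left"], word["top"], word["conf"])
```

Raising min_conf above 0 trades recall for precision, so it is worth trying a few values on the images that currently fail before settling on one.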
art
Thanks, I will follow up directly so that we don’t overload the thread.
art
From: tesser...@googlegroups.com <tesser...@googlegroups.com> On Behalf Of Cyrus Yip
Sent: Wednesday, January 5, 2022 4:09 PM
To: tesseract-ocr <tesser...@googlegroups.com>
Subject: Re: [tesseract-ocr] bad quality!?
Art, I'm using your method + my cropping but there are some images which it fails on: