image_to_string OSD hell


dev 313153

Feb 7, 2024, 12:39:37 AM
to tesseract-ocr
Hello,
I am very new to Tesseract, and to image processing in general.
I have screenshots from which I want to extract text for further processing. After reading the Improve Quality page, I played around with Tesseract and was able to extract what I need (most of the time).
For example, in the attached screenshots, I want to extract the names of the stats together with the letter that follows each one, but it doesn't always work.
Sometimes the letter isn't extracted at all. Other times it is extracted, but the OSD considers that it belongs to a different level or row, so image_to_string outputs it ahead of the stat names instead of next to them.
I also tried different oem and psm settings, without much improvement.

I attached some examples of image_to_string output for different pictures, as well as the images and the Python code I'm using as a test bench.

I am getting a bit desperate, so I am considering the following approaches:
- training on my own dataset for this need. Having sufficient data shouldn't be an issue over time, but I have zero experience with this kind of thing.
- looking up the coordinates of the stat names, then cropping the picture around them so that Tesseract focuses on that region and extracts it properly (sounds like a chore code-wise, but doable I think).
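
The second approach could look roughly like the sketch below. It assumes the word boxes come from pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT), which returns parallel lists under keys such as 'text', 'left', 'top', 'width' and 'height'. The helper names, the padding values and the stat name "Strength" are hypothetical, and the match is on a single word only, so multi-word stat names would need extra handling.

```python
def find_label_box(data, label, pad=4, extra_right=40):
    """Return (left, top, right, bottom) around the first word equal to `label`.

    `data` is a dict of parallel lists shaped like the output of
    pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT).
    `extra_right` widens the box to the right so that the letter following
    the stat name ends up inside the crop. Returns None if nothing matches.
    """
    for i, word in enumerate(data["text"]):
        if word.strip().lower() == label.lower():
            left = max(data["left"][i] - pad, 0)
            top = max(data["top"][i] - pad, 0)
            right = data["left"][i] + data["width"][i] + extra_right
            bottom = data["top"][i] + data["height"][i] + pad
            return left, top, right, bottom
    return None


def crop(img, box):
    """Crop an OpenCV/NumPy image (rows are y, columns are x) to `box`."""
    left, top, right, bottom = box
    return img[top:bottom, left:right]


# Possible usage with the real libraries (not executed here):
#   data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
#   box = find_label_box(data, "Strength")   # hypothetical stat name
#   if box is not None:
#       print(pytesseract.image_to_string(crop(img, box), config="--psm 7"))
```

Cropping to one stat name plus its letter gives Tesseract a single short line to read, which is why the usage comment switches to a single-line page segmentation mode; the padding values would have to be tuned to the actual screenshots.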

Let me know what you think, or if you have any improvements to suggest.
Best regards,

000000000-input-d.png
000000000-input-b.png
output.txt
000000000-input-c.png
testimg.py
000000000-input-a.png

dev 313153

Feb 13, 2024, 8:14:22 PM
to tesseract-ocr
Hello,
I managed to implement dynamic parsing to get rid of the OSD issues I had.
However, I'm stuck on recognizing a single uppercase letter. I tried many different preprocessing configurations but can't find the right one, even with PSM set to 10, and I don't really know what else to try. Any help is appreciated.

Here is a code snippet for testing with the attached pictures:
import cv2
import os
import pytesseract
import numpy as np

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

for pic in ["e.png","d-.png","d.png"]:
    img=cv2.imread(pic)
   
    #Preprocessing
    img = cv2.resize(img, (70, 90), interpolation=cv2.INTER_NEAREST)
    norm_img = np.zeros((img.shape[0], img.shape[1]))
    img = cv2.normalize(img, norm_img, 0, 255, cv2.NORM_MINMAX)
    img = cv2.fastNlMeansDenoisingColored(img, None, 10, 10, 7, 15)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    img = cv2.bitwise_not(img)
    img = cv2.threshold(img,127,255,cv2.THRESH_BINARY) [1]
    cv2.imwrite("processed-"+pic, img)

    # Tesseract OCR
    text = pytesseract.image_to_string(img, lang='eng', config='-c tessedit_char_whitelist=\\ ABCDEF+- tessedit_char_blacklist=\\=!,*%^$°:. --psm 10 -oem 3')
    print(str(text).replace("\n", " "))


e.png
d.png
d-.png

Zdenko Podobny

Feb 14, 2024, 1:02:36 AM
to tesser...@googlegroups.com
Works like a charm: just read and follow documentation carefully:

>tesseract e_I_read_documetation_carefully.png - --psm 10
D
>tesseract d_I_read_documetation_carefully.png - --psm 10
E
>tesseract d-I_read_documetation_carefully.png - --psm 10
D-



Zdenko


On Wed, Feb 14, 2024 at 2:14, dev 313153 <dev3...@gmail.com> wrote:

dev 313153

Feb 14, 2024, 5:50:45 AM
to tesseract-ocr
Thanks a lot for your answer!

After playing around, it appears that the whitelist and blacklist are both unsupported in this scenario and make Tesseract return nothing. I don't really understand why, because they work fine in another scenario (recognition on the whole picture, before slicing it into smaller parts).
Regarding documentation, I have a hard time finding information on tesseract-ocr.github.io or in the GitHub docs about these two options and how they behave when used together.
Maybe it's in a corner or a detail I missed. Anyway, in case anyone stumbles on this topic in the future, it might be helpful to reference it better in the docs.
Besides the character type definitions, I can't find much about it:

Sorry if this sounds a bit dumb, but again, I'm a newbie at OCR and image recognition, and I like newbie-friendly tools ;)

Tom Morris

Feb 14, 2024, 1:07:43 PM
to tesseract-ocr
On Wednesday, February 14, 2024 at 5:50:45 AM UTC-5 dev 313153 wrote:
the issue is that apparently both whitelist and blacklist aren't supported in this scenario and make tesseract return nothing,

Either that or you're specifying them wrong. You appear to have two different config settings, but only one -c flag. Is that documented to work? I would expect that you'd need a separate -c flag for each setting.

More generally, it will simplify things if you do your debugging and report your issues using Tesseract without any wrappers or additional software to confuse/obscure the problem.

Tom
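
A minimal sketch of the two suggestions above, assuming each variable gets its own -c flag (the Tesseract command-line help states that multiple -c arguments are allowed) and that the engine mode is spelled --oem with two dashes, unlike the -oem in the snippet earlier in the thread. build_config and tesseract_argv are hypothetical helper names. Only a whitelist is shown, since how a whitelist and a blacklist behave together is not settled in this thread, and the whitelist contains no space because the config string is split on whitespace.

```python
def build_config(psm=10, oem=3, variables=None):
    """Build a config string for pytesseract with one -c flag per variable.

    Values must not contain whitespace, because the string is later split
    on whitespace into separate command-line arguments.
    """
    parts = ["--psm", str(psm), "--oem", str(oem)]
    for name, value in (variables or {}).items():
        parts += ["-c", f"{name}={value}"]
    return " ".join(parts)


def tesseract_argv(image_path, psm=10, variables=None):
    """Build the equivalent bare tesseract command line ("-" means stdout).

    Running this list directly (e.g. with subprocess.run) takes the Python
    wrapper out of the picture when debugging.
    """
    argv = ["tesseract", image_path, "-", "--psm", str(psm)]
    for name, value in (variables or {}).items():
        argv += ["-c", f"{name}={value}"]
    return argv


whitelist = {"tessedit_char_whitelist": "ABCDEF+-"}
print(build_config(variables=whitelist))
print(" ".join(tesseract_argv("e.png", variables=whitelist)))
```

The first printed string can be passed as the config argument of pytesseract.image_to_string; the second is a command that can be pasted into a terminal to check whether the wrapper or Tesseract itself is responsible for an empty result.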